Apple dropped a bombshell on Hugging Face, releasing a demo of the 4M model from its paper published last year. The model can process and generate content across multiple modalities, including text, images, and 3D scenes, and a single model can extract a wide range of information from an image, such as depth maps and line drawings. AIbase tested it with previously generated ancient-style imagery, and the results are indeed impressive. After an image was uploaded, the demo quickly produced the following breakdown:


Simply by uploading a photo, you can obtain a range of information about it, such as its main contours, the dominant colors in the scene, and the image dimensions.

This marks a bold departure from Apple's traditionally secretive approach to research and development. The company has not only showcased its AI capabilities on Hugging Face's open-source stage but also extended an olive branch to developers, hoping to build an ecosystem around 4M. The multi-modal architecture of 4M hints at more coherent and versatile AI applications across the Apple ecosystem, such as a Siri that handles complex queries more intelligently, or a Final Cut Pro that edits video automatically based on spoken instructions.

However, the introduction of 4M also raises challenges around data practices and AI ethics. Apple has long positioned itself as a guardian of user privacy; will that stance be tested by such a data-intensive AI model? Apple will have to strike a careful balance, pushing technology forward without compromising users' trust.

Let's take a brief look at the technical principles behind 4M. Its biggest highlight is its "large-scale multi-modal masked modeling" training method, which handles multiple visual modalities simultaneously by converting images, semantics, and geometric information into a unified token representation, enabling seamless integration across modalities.
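To make the idea concrete, here is a minimal sketch, not 4M's actual implementation, of how several modalities could be mapped into one shared token format. The tokenizer functions and token ids below are placeholder assumptions standing in for learned tokenizers.

```python
# Minimal sketch: map each modality to discrete tokens in a shared format.
# The tokenizer functions and token ids are placeholders, not 4M's actual
# tokenizers.

def tokenize_rgb(image):
    return [101, 57, 88, 12]      # would come from a learned image tokenizer

def tokenize_depth(depth_map):
    return [301, 44, 9]           # would come from a geometry tokenizer

def tokenize_caption(text):
    return [801, 802, 803]        # would come from an ordinary text tokenizer

def to_unified_tokens(sample):
    """Flatten every modality into (modality, token id) pairs so that a
    single sequence model can consume them interchangeably."""
    tokens = []
    tokens += [("rgb", t) for t in tokenize_rgb(sample["rgb"])]
    tokens += [("depth", t) for t in tokenize_depth(sample["depth"])]
    tokens += [("caption", t) for t in tokenize_caption(sample["caption"])]
    return tokens

sample = {"rgb": "raw pixels", "depth": "depth map", "caption": "a temple at dusk"}
print(to_unified_tokens(sample))
```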

During training, 4M employs a clever approach: it randomly selects a portion of the tokens across modalities as input and another portion as the prediction target, which keeps the training objective scalable. To 4M, both images and text are just sequences of discrete tokens, which greatly improves the model's generality.
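As a rough illustration of that input/target sampling, the sketch below randomly partitions a toy unified token sequence into a visible input set and a disjoint prediction set. The budgets and token layout are illustrative assumptions, not 4M's exact sampling scheme.

```python
import random

def masked_modeling_split(tokens, input_budget, target_budget, rng):
    """Randomly partition a unified token sequence into a visible input set
    and a disjoint prediction-target set."""
    shuffled = tokens[:]
    rng.shuffle(shuffled)
    inputs = shuffled[:input_budget]
    targets = shuffled[input_budget:input_budget + target_budget]
    return inputs, targets

# Toy unified sequence: (modality, token id) pairs from several modalities.
tokens = ([("rgb", i) for i in range(8)]
          + [("depth", i) for i in range(4)]
          + [("caption", i) for i in range(4)])

inputs, targets = masked_modeling_split(tokens, input_budget=6,
                                        target_budget=6, rng=random.Random(0))
print("input tokens :", inputs)
print("target tokens:", targets)
```

Because the partition is redrawn at every step, any modality can serve as conditioning or as a prediction target, which is what lets one model cover so many tasks.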

The training data and methodology of 4M are also noteworthy. It uses CC12M, one of the largest open-source image datasets, which is rich in images but sparse in annotations. To address this, the researchers adopted weakly supervised pseudo-labeling: models such as CLIP and Mask R-CNN were used to predict annotations across the dataset, and these predictions were then converted into tokens, laying the foundation for 4M's multi-modal compatibility.
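The pseudo-labeling step can be pictured roughly as follows. Every function here is a hypothetical stand-in: the CLIP and Mask R-CNN calls are stubbed, and the tokenizers are toy placeholders rather than the actual pipeline.

```python
# Schematic pseudo-labeling sketch: off-the-shelf predictors annotate an
# unlabeled CC12M image, and each prediction is converted into tokens.
# Every function here is a hypothetical stand-in, not a real library call.

def clip_pseudo_caption(image):
    """Stand-in for a CLIP-based caption/label prediction."""
    return "a stone bridge over a river"

def maskrcnn_pseudo_instances(image):
    """Stand-in for Mask R-CNN instance predictions."""
    return [{"label": "bridge", "box": (12, 40, 200, 180)}]

def tokenize_caption(text):
    # Toy tokenizer: hash each word into a small shared vocabulary.
    return [hash(w) % 1000 for w in text.split()]

def tokenize_instances(instances):
    # Toy tokenizer: one token per label plus coarsely quantized box coords.
    tokens = []
    for inst in instances:
        tokens.append(hash(inst["label"]) % 1000)
        tokens.extend(c // 16 for c in inst["box"])
    return tokens

def pseudo_label(image):
    """Turn one unlabeled image into multi-modal token targets."""
    return {
        "caption_tokens": tokenize_caption(clip_pseudo_caption(image)),
        "instance_tokens": tokenize_instances(maskrcnn_pseudo_instances(image)),
    }

print(pseudo_label("raw CC12M image"))
```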

Extensive experiments and testing show that 4M can perform multi-modal tasks directly, without large amounts of task-specific pre-training or fine-tuning. It is like handing AI a multi-modal Swiss Army knife, letting it respond flexibly to a wide variety of challenges.

Demo link: https://huggingface.co/spaces/EPFL-VILAB/4M