Researchers from Apple and the Swiss Federal Institute of Technology in Lausanne (EPFL) have jointly developed a single any-to-any model that can be trained on dozens of highly diverse modalities and co-trained on large-scale multimodal datasets and text corpora. Named 4M-21, the model is trained on 21 different modalities and solves at least three times as many tasks as existing models without a loss in performance.

The study employs the 4M pre-training scheme, improving model performance and adaptability by scaling up the model and dataset sizes, increasing the types and number of modalities involved in training, and training jointly across multiple datasets. The researchers used modality-specific tokenization methods to discretize modalities with very different characteristics, such as global image embeddings, human poses, and semantic instances. Architecturally, the study adopts a Transformer-based 4M encoder-decoder with additional modality embeddings to accommodate the new modalities.
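
A minimal PyTorch sketch of this design, assuming made-up vocabulary sizes and layer counts rather than the paper's actual configuration: each modality is first discretized into token ids by its own tokenizer (not shown), and a shared encoder-decoder adds a learned per-modality embedding so that new modalities can be plugged in. The class and parameter names are illustrative, not from the 4M-21 codebase.

```python
import torch
import torch.nn as nn


class MultimodalEncoderDecoder(nn.Module):
    """Illustrative shared encoder-decoder over per-modality token vocabularies."""

    def __init__(self, vocab_sizes: dict, d_model: int = 256):
        super().__init__()
        # One token-embedding table per modality (each has its own discrete vocabulary).
        self.token_emb = nn.ModuleDict(
            {m: nn.Embedding(v, d_model) for m, v in vocab_sizes.items()}
        )
        # One learned modality embedding, added to every token of that modality.
        self.modality_emb = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(d_model)) for m in vocab_sizes}
        )
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        # Separate output head per target modality.
        self.heads = nn.ModuleDict(
            {m: nn.Linear(d_model, v) for m, v in vocab_sizes.items()}
        )

    def embed(self, modality: str, tokens: torch.Tensor) -> torch.Tensor:
        return self.token_emb[modality](tokens) + self.modality_emb[modality]

    def forward(self, src_modality, src_tokens, tgt_modality, tgt_tokens):
        src = self.embed(src_modality, src_tokens)
        tgt = self.embed(tgt_modality, tgt_tokens)
        hidden = self.transformer(src, tgt)
        return self.heads[tgt_modality](hidden)  # logits over the target vocabulary


# Example: predict depth tokens from RGB tokens (vocabulary sizes are made up).
model = MultimodalEncoderDecoder({"rgb": 8192, "depth": 8192, "pose": 1024})
rgb = torch.randint(0, 8192, (2, 196))    # e.g. a 14x14 grid of image tokens
depth = torch.randint(0, 8192, (2, 196))
logits = model("rgb", rgb, "depth", depth)
print(logits.shape)  # torch.Size([2, 196, 8192])
```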

The model not only performs a range of common vision tasks out of the box, such as DIODE surface normal and depth estimation, COCO semantic and instance segmentation, and 3DPW 3D human pose estimation, but also supports several forms of fine-grained and multimodal generation, and can retrieve RGB images or other modalities using any other modality as the query. In addition, the researchers conducted multimodal transfer experiments on NYUv2, Hypersim semantic segmentation, and ARKitScenes.
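
As a rough illustration of cross-modal retrieval via a shared global-embedding space (a simplification, not the paper's exact procedure): once every modality can be mapped to a global embedding, retrieval reduces to nearest-neighbor search by cosine similarity. The random vectors below stand in for real embeddings.

```python
import numpy as np


def retrieve(query_emb: np.ndarray, gallery_embs: np.ndarray, top_k: int = 5):
    """Return indices and scores of the top_k gallery items most similar to the query."""
    # L2-normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    scores = g @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


# Toy example with random vectors standing in for global embeddings
# (which could come from a caption, a depth map, or an RGB image).
rng = np.random.default_rng(0)
gallery = rng.standard_normal((1000, 512))            # 1000 gallery items, 512-d embeddings
query = gallery[42] + 0.1 * rng.standard_normal(512)  # a slightly perturbed copy of item 42
indices, scores = retrieve(query, gallery)
print(indices)  # index 42 should rank first
```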

Key features of the model include:

Any-to-any modality: Increased from the 7 modalities supported by the best existing any-to-any models to 21 different modalities, enabling cross-modal retrieval, controllable generation, and robust out-of-the-box performance.

Diversity support: Added support for more structured data, such as human poses, SAM instances, and metadata.

Tokenization: Investigated modality-specific methods for discretizing different modalities, such as global image embeddings, human poses, and semantic instances.

Expansion: Scaled the model size to 3 billion parameters and the dataset to 0.5 billion samples.

Synergistic training: Co-trained on vision and language modalities simultaneously (a minimal sketch of the idea appears below).
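
The following is a hedged sketch of what such joint vision-and-language masked training can look like: for each sample, a random subset of the tokens pooled across all modalities serves as the input and another subset as the prediction target, so one model learns mappings between every pair of modalities. The budgets and sampling rule are illustrative assumptions, not the paper's exact recipe.

```python
import random


def sample_input_target(tokens_per_modality: dict, input_budget: int, target_budget: int):
    """Split all (modality, position) token slots into an input set and a target set."""
    slots = [(m, i) for m, toks in tokens_per_modality.items()
             for i in range(len(toks))]
    random.shuffle(slots)
    inputs = slots[:input_budget]
    targets = slots[input_budget:input_budget + target_budget]
    return inputs, targets


# Toy example: three tokenized modalities for one training sample.
sample = {
    "rgb":     list(range(16)),   # 16 image tokens
    "caption": list(range(8)),    # 8 text tokens
    "depth":   list(range(16)),   # 16 depth tokens
}
inp, tgt = sample_input_target(sample, input_budget=12, target_budget=12)
print(len(inp), len(tgt))  # 12 12
```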

Key points:

- Researchers from Apple and EPFL have jointly developed a single any-to-any model trained on 21 different modalities.

- The model can perform a range of common vision tasks out of the box and supports several forms of fine-grained and multimodal generation.

- Researchers conducted multimodal transfer experiments on NYUv2, Hypersim semantic segmentation, and ARKitScenes.