Google has officially launched its new Gemini Embedding2 model. As Google's first native multimodal embedding model, it moves beyond the single-data-type limitation of traditional embedding models: it maps text, images, video, audio, and documents into one shared vector space, enabling understanding across media types.

Unlike generative models such as Gemini3, which focus on content creation, the core function of an embedding model is understanding. It transforms complex data into machine-readable vectors so that systems can identify semantic relationships, significantly outperforming traditional keyword search in retrieval accuracy and contextual relevance.
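The geometry of that vector space is what carries meaning: semantically related inputs land close together, unrelated ones far apart, and closeness is typically measured with cosine similarity. A minimal sketch of the comparison step (the 4-dimensional vectors below are toy values for illustration, not real model output, which typically has hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: "cat" and "kitten" point in similar directions, "car" does not.
cat    = [0.90, 0.80, 0.10, 0.00]
kitten = [0.85, 0.75, 0.20, 0.05]
car    = [0.10, 0.00, 0.90, 0.80]

# Related terms score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))
```

This is why embedding-based search can match a query to a document that shares no keywords with it: the comparison happens in meaning space, not word space.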


Technical Features and Breakthroughs of Gemini Embedding2:

  • Comprehensive Multimodal Support: Beyond text, the model directly processes PNG/JPEG images, MP4/MOV videos up to 120 seconds long, native audio data, and PDF documents of up to 6 pages.

  • Global Language Understanding: It supports accurate identification of users' semantic intent in over 100 languages worldwide.

  • Multi-Dimensional Joint Analysis: The model can receive combined inputs such as "image + text" in a single request, thereby deeply analyzing the internal relationships between different media types.

  • Extensive Application Scenarios: The new model will significantly improve the performance of retrieval-augmented generation (RAG), semantic search, sentiment analysis, and large-scale data clustering.
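In a RAG pipeline, the retrieval step reduces to ranking pre-computed document embeddings by similarity to the query embedding and passing the top matches to a generative model. A minimal sketch of that ranking, with toy vectors standing in for real embedding-API output (the file names and vector values are illustrative assumptions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical pre-computed embeddings for a small cross-media store.
# In practice each vector would come from an embedding-model call per item.
store = {
    "invoice_2023.pdf":   [0.90, 0.10, 0.20],
    "contract_draft.pdf": [0.20, 0.90, 0.10],
    "meeting_audio.mp3":  [0.10, 0.20, 0.90],
}

def retrieve(query_vec, store, top_k=2):
    """Return the top_k item names ranked by similarity to the query vector."""
    ranked = sorted(store.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]

# A query embedding close to the contract document's vector.
query = [0.25, 0.85, 0.15]
print(retrieve(query, store))  # the contract document should rank first
```

Production systems replace the linear scan with an approximate nearest-neighbor index so the same ranking scales to millions of records.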

Google noted in its official blog that in complex scenarios such as legal evidence collection, Gemini Embedding2 can quickly locate key evidence among millions of cross-media records, greatly improving retrieval precision and recall. The model is currently in public preview through the Gemini API and Vertex AI.