VLM2Vec-V2: A New Multimodal Embedding Framework Unifying Image, Video, and Visual Document Retrieval
VLM2Vec-V2, a multimodal embedding framework developed by Salesforce Research together with other institutions, moves past the single-modality limits of earlier embedding models. Built on the Qwen2-VL backbone, the framework unifies image, video, and visual-document retrieval in a single model, and it extends the MMEB benchmark with five new evaluation task types. Leveraging backbone capabilities such as dynamic resolution and M-RoPE, it reaches an average score of 58.0 across 78 datasets, leading overall and performing especially well on video tasks. Although its document retrieval slightly trails ColPali, it
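The core idea behind a unified embedding model can be sketched independently of the architecture: every input, whether a text query, an image, video frames, or a document page, is mapped into one shared vector space, and retrieval reduces to ranking candidates by cosine similarity of L2-normalized embeddings. The sketch below uses placeholder vectors in place of real model outputs; it does not assume any actual VLM2Vec-V2 API.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    # Normalize rows to unit length so a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def rank_candidates(query_emb: np.ndarray, candidate_embs: np.ndarray) -> np.ndarray:
    # Return candidate indices sorted by descending cosine similarity.
    sims = l2_normalize(candidate_embs) @ l2_normalize(query_emb)
    return np.argsort(-sims)

# Placeholder embeddings standing in for the outputs of a shared encoder.
# In practice the query and the image/video/document candidates would all
# be encoded by the same model into this common space.
query = np.array([1.0, 0.0, 0.0, 0.0])
candidates = np.array([
    [0.9, 0.1, 0.0, 0.0],   # close to the query
    [0.0, 1.0, 0.0, 0.0],   # orthogonal
    [-1.0, 0.0, 0.0, 0.0],  # opposite direction
])

order = rank_candidates(query, candidates)
print(order)  # most similar candidate first
```

Because all modalities share one space, the same ranking routine serves image search, video retrieval, and document retrieval alike, which is what lets a single benchmark like MMEB score them uniformly.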