Recently, the Allen Institute for Artificial Intelligence (AI2) released Molmo2, a new family of open-source video-language models. The release of the models along with their training data underscores the non-profit's strong commitment to open source, a significant benefit in business environments where companies want control over how models are used.

Molmo2 comes in several versions, including Molmo2-4B and Molmo2-8B, built on Alibaba's Qwen3 language models, as well as a fully open version, Molmo2-O-7B, built on AI2's own Olmo language model. Alongside the models, AI2 also introduced nine new datasets, including long-form question-answering datasets with multi-image and video inputs, as well as an open video pointing-and-tracking dataset.
A significant feature of Molmo2 is its transparency. According to AI2, Molmo2-O-7B is a transparent model that supports end-to-end research and customization: users get full access to both the vision-language model and its underlying large language model (LLM), making it easier to adapt the model to specific needs.
Molmo2 lets users ask questions about images or videos and can reason over patterns identified in the footage. Ranjay Krishna, head of perception, reasoning, and interaction research at AI2, said that the models not only provide answers but also point to when and where in a video events occur. In addition, Molmo2 can generate descriptive captions, count and track objects, and detect rare events in long video sequences.
Users can try Molmo2 on Hugging Face and in the Ai2 Playground, AI2's platform for experimenting with its tools and models; a minimal loading sketch follows below. The release highlights AI2's open-source commitment. Analyst Bradley Shimmin noted that releasing a model's weights along with its training data is crucial for enterprises, particularly in the context of data sovereignty.
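For readers who want to experiment with the checkpoints, here is a minimal sketch of the generic Hugging Face transformers pattern for image-text models. The repo id `allenai/Molmo2-8B`, the image URL, and the prompt are placeholders, not confirmed details; Molmo2's actual loading interface may differ (earlier Molmo checkpoints, for instance, used custom processing code behind `trust_remote_code=True`), so consult the official model cards on Hugging Face before use.

```python
# Minimal sketch: asking a Molmo2 checkpoint a question about an image.
# ASSUMPTIONS: the repo id "allenai/Molmo2-8B" is hypothetical, and the
# checkpoint is assumed to expose the standard transformers
# image-text-to-text interface; check the real model card before relying on this.
import requests
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

repo_id = "allenai/Molmo2-8B"  # hypothetical model id
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(repo_id, trust_remote_code=True)

# Load a single frame; video input would pass a list of frames instead.
image = Image.open(requests.get("https://example.com/frame.jpg", stream=True).raw)
inputs = processor(images=image, text="What is happening in this frame?", return_tensors="pt")

# Generate an answer and decode it back to text.
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```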
The models in the Molmo2 series have relatively small parameter counts (4 billion to 8 billion), making them more cost-effective for many enterprises. Shimmin stressed that enterprises are coming to realize that model size is not the only factor that matters; the transparency and accountability of the training data are equally important.
Project: https://allenai.org/blog/molmo2
Key Points:
1. 🚀 AI2 released the Molmo2 series of open-source video-language models, enhancing enterprise control over model usage.
2. 🎥 The new models support multi-image and video input and can reason about events and generate descriptive captions.
3. 📊 AI2 maintains its open-source commitment, emphasizing the importance of data transparency and model customization.



