Liquid AI has recently released the LFM2-VL series of vision-language foundation models, a further step in the push to make multimodal AI lightweight, fast, and deployable directly on devices.

The series comprises two models: LFM2-VL-450M and LFM2-VL-1.6B. The former, with fewer than 500 million parameters, is designed for resource-constrained hardware; the latter has more parameters but remains lightweight enough to run on a single GPU or directly on a device.


LFM2-VL builds on Liquid AI's earlier LFM2 architecture, integrating visual and language processing, supporting multi-resolution image input, and handling both text and images with a high degree of flexibility and compatibility (liquid.ai, VentureBeat). The models deliver up to 2× faster GPU inference while performing well on common benchmarks (VentureBeat, liquid.ai).

For image processing, LFM2-VL accepts images at their native resolution (up to 512×512), avoiding the distortion caused by forced rescaling. Larger images are split into non-overlapping tiles, and a thumbnail is added to provide global context (VentureBeat, liquid.ai). The architecture consists of a language model backbone, a SigLIP2 NaFlex vision encoder, and a multimodal projector. The projector is a two-layer MLP with pixel unshuffle, which reduces the number of image tokens and thereby speeds up processing (VentureBeat, liquid.ai).
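To make the token-reduction idea concrete, here is a minimal sketch of a pixel-unshuffle connector: neighboring image tokens are folded into the channel dimension before a two-layer MLP maps them into the language model's hidden size. The dimensions, patch grid, and class name below are illustrative assumptions, not the released LFM2-VL configuration.

```python
import torch
import torch.nn as nn

class PixelUnshuffleProjector(nn.Module):
    """Illustrative multimodal connector: pixel unshuffle + 2-layer MLP.
    All sizes are hypothetical, chosen only to show the mechanism."""

    def __init__(self, vision_dim=1024, lm_dim=2048, factor=2):
        super().__init__()
        self.factor = factor
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * factor * factor, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, tokens, grid_h, grid_w):
        # tokens: (batch, grid_h * grid_w, vision_dim) from the vision encoder
        b, n, c = tokens.shape
        f = self.factor
        x = tokens.view(b, grid_h, grid_w, c)
        # Fold each f x f neighborhood of tokens into one token with f*f*c channels,
        # shrinking the token count by a factor of f*f.
        x = x.view(b, grid_h // f, f, grid_w // f, f, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (grid_h // f) * (grid_w // f), f * f * c)
        return self.mlp(x)

proj = PixelUnshuffleProjector()
vision_tokens = torch.randn(1, 32 * 32, 1024)   # e.g. a 512x512 image -> 32x32 patch grid
lm_tokens = proj(vision_tokens, 32, 32)
print(lm_tokens.shape)  # torch.Size([1, 256, 2048]) -- 4x fewer tokens
```

With a factor of 2, the 1,024 vision tokens collapse to 256 before reaching the language model, which is where most of the inference-speed benefit of this kind of projector comes from.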

As for training data, LFM2-VL was trained on roughly 10 billion multimodal tokens drawn from open-source datasets and the company's own synthetic image data (VentureBeat, liquid.ai). In evaluations, LFM2-VL-1.6B scores strongly on benchmarks such as RealWorldQA (65.23), InfoVQA (58.68), and OCRBench (742), and leads comparable models in inference efficiency (VentureBeat, liquid.ai).

Both models are now available on Hugging Face, along with example fine-tuning code on Colab, and are compatible with the Hugging Face Transformers and TRL libraries. They are released under a new "LFM1.0 license" built on Apache 2.0 principles: academic use is permitted, companies with annual revenue below $10 million may use the models commercially, and enterprises above that threshold must contact Liquid AI for a license (VentureBeat, liquid.ai).
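Since the weights are distributed through Hugging Face, a short inference sketch shows what loading them might look like. This assumes the model exposes the standard Transformers image-text-to-text chat interface; the image URL is a placeholder, and the exact arguments should be checked against the model card.

```python
# Minimal inference sketch, assuming the standard Transformers
# image-text-to-text interface; verify details on the model card.
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

model_id = "LiquidAI/LFM2-VL-1.6B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype="bfloat16", trust_remote_code=True
)

image = load_image("https://example.com/sample.jpg")  # placeholder image path/URL
conversation = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]},
]

inputs = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```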

Liquid AI's LFM2-VL lineup offers a new path for deploying combined vision-and-text AI on devices such as phones, laptops, and wearables, reducing reliance on the cloud while improving privacy and response speed.

Project: https://huggingface.co/LiquidAI/LFM2-VL-1.6B

Key Points:

  • 🆕 Two Model Sizes: LFM2-VL-450M (for minimal-resource environments) and LFM2-VL-1.6B (more capable yet still lightweight), both suited to on-device deployment.

  • Speed and Efficiency: Up to 2× faster GPU inference while maintaining strong performance on multimodal tasks.

  • Multi-platform Friendly: Released on Hugging Face with tiered licensing, compatible with mainstream development tools, and suitable for academic use and commercial use by small and medium-sized businesses.