The Alibaba Qwen team recently released two lightweight models in the Qwen3-VL series: 4B and 8B parameter versions. Qwen3-VL, first introduced last month, is the most capable vision-language model family Qwen has released to date. The new small-parameter versions aim to lower the deployment barrier while maintaining strong performance.

The newly released models come in two parameter sizes, 4B and 8B, each offered in an Instruct (instruction-following) and a Thinking (chain-of-thought reasoning) variant, giving developers more flexible options. Functionally, the two models shrink substantially in size while retaining the full capability set of the series.

In terms of technical implementation, the new models achieve two core goals. First, they significantly reduce hardware requirements: the smaller parameter counts translate into noticeably lower memory usage, allowing developers to deploy and run the models on a wider range of consumer-grade and edge devices. Second, despite the much smaller size, the models fully inherit the core capabilities of the Qwen3-VL series, including multimodal understanding, long-context processing, and complex reasoning.


From a performance standpoint, these lightweight models outperform similarly sized competitors across multiple authoritative benchmarks. In scenarios such as STEM question answering, visual question answering (VQA), optical character recognition (OCR), video understanding, and agent tasks, the 4B and 8B models not only surpass lightweight models like Google Gemini 2.5 Flash-Lite and OpenAI GPT-5 Nano, but on some tasks even approach the performance of Qwen2.5-VL-72B, the 72B-parameter flagship released half a year ago.

This release marks another advancement in the trend of "miniaturization" of large models. Through model compression and optimization technologies, the development team has significantly reduced the number of parameters and computational costs while maintaining the integrity of capabilities, paving the way for the application of visual language models in resource-constrained scenarios such as mobile devices and IoT devices. For enterprise users who need local deployment or are sensitive to inference costs, these two new models provide a more cost-effective solution.
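For readers weighing the local-deployment option, a minimal sketch of one common serving path. This is an assumption-laden example, not an official recipe: it presumes the model ID matches the naming in the linked Hugging Face collection, that vLLM supports the Qwen3-VL architecture, and that the server exposes the usual OpenAI-compatible chat endpoint; the image URL is a placeholder.

```shell
# Hypothetical sketch: serve the 4B Instruct model locally with vLLM
# (assumed model ID and architecture support -- check the model card first).
pip install -U vllm

# Start an OpenAI-compatible server on localhost:8000.
vllm serve Qwen/Qwen3-VL-4B-Instruct --max-model-len 32768

# Query it with a standard OpenAI-style multimodal chat request.
# The image URL below is a placeholder, not a real asset.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-VL-4B-Instruct",
        "messages": [{"role": "user", "content": [
          {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
          {"type": "text", "text": "Describe this image."}
        ]}]
      }'
```

Serving through an OpenAI-compatible endpoint keeps client code portable, so switching between the 4B and 8B variants (or a hosted API) requires only changing the model name.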

Model link: https://huggingface.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe