Efficient MoE Architecture Reshapes Edge AI. Liquid AI's LFM2-8B-A1B is the first Mixture-of-Experts (MoE) model in its LFM2 series, with 8.3B total parameters but only about 1.5B activated per token. This sparse activation significantly reduces compute while preserving representational capacity, making the model suitable for resource-constrained, on-device scenarios. Unlike traditional cloud-oriented MoE models, the design is optimized for real-time interaction, challenging the industry perception that small-scale MoE is inefficient.


The model is built on the LFM2 hybrid backbone, comprising 18 gated short-convolution blocks and 6 grouped-query attention (GQA) blocks. Except for the first two layers, which remain dense for stability, every layer integrates a sparse MoE feed-forward network with 32 experts, of which only the top 4 are activated per token; a normalized sigmoid router with an adaptive bias handles load balancing. The model supports a 32K context length and covers English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.
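As a rough illustration of this routing scheme, the sketch below shows a sparse MoE feed-forward block with sigmoid gating, a learned bias used only for expert selection (a common load-balancing device), top-4 selection out of 32 experts, and renormalization of the selected gate values. All names, dimensions, and the expert design are illustrative assumptions, not Liquid AI's implementation.

```python
# Illustrative sketch of a sparse MoE feed-forward block with a normalized
# sigmoid router and top-4 expert selection (assumed shapes; not LFM2's code).
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d_model=2048, d_ff=4096, n_experts=32, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Adaptive bias used only for expert *selection* (load balancing),
        # not for the gate values themselves; updated outside backprop.
        self.expert_bias = nn.Parameter(torch.zeros(n_experts), requires_grad=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                       # x: (tokens, d_model)
        scores = torch.sigmoid(self.router(x))  # sigmoid gating, (tokens, n_experts)
        _, idx = torch.topk(scores + self.expert_bias, self.top_k, dim=-1)
        gates = torch.gather(scores, -1, idx)
        gates = gates / gates.sum(dim=-1, keepdim=True)  # normalize selected gates
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # only the top-4 experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot, None] * self.experts[e](x[mask])
        return out
```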

Training and Performance: 12T Tokens Forge 3-4B-Level Capabilities. LFM2-8B-A1B reaches 3-4B-level capability through pre-training on approximately 12T tokens, split roughly 55% English, 25% multilingual, and 20% code. Post-training uses Liquid Preference Alignment (a length-normalized DPO/APO-Zero fusion), and mixed BF16/FP8 precision improves training throughput by more than 3x.
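Liquid has not published the exact objective, but a length-normalized DPO term can be sketched as below: per-sequence log-probabilities are divided by response length before forming the usual preference margin against the reference model. Function names, arguments, and the beta value are illustrative assumptions, not the released recipe.

```python
# Hedged sketch of a length-normalized DPO loss (assumed form, not Liquid's exact recipe).
import torch.nn.functional as F

def length_normalized_dpo_loss(logp_chosen, logp_rejected,
                               ref_logp_chosen, ref_logp_rejected,
                               len_chosen, len_rejected, beta=0.1):
    """logp_* are summed token log-probs per sequence; len_* are token counts."""
    # Normalize by response length so longer answers are not implicitly favored.
    pi_margin = logp_chosen / len_chosen - logp_rejected / len_rejected
    ref_margin = ref_logp_chosen / len_chosen - ref_logp_rejected / len_rejected
    # Standard DPO form: maximize the policy's margin relative to the reference model.
    return -F.logsigmoid(beta * (pi_margin - ref_margin)).mean()
```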


In benchmark tests, the model outperforms competitors of similar scale:

  • Knowledge and Instruction Following: MMLU-Pro score of 37.4 (an increase of 11.5 from LFM2-2.6B), IFEval 77.6, Multi-IF 58.2.
  • Mathematical Ability: GSM8K 84.4, GSMPlus 64.8, MATH500 74.2.
  • Multilingual Processing: MGSM 72.4, MMMLU 55.3.
  • Coding and Writing: HumanEval+ 69.5, LiveCodeBench v6 21.0, EQ-Bench 44.2.

Overall, its output quality rivals 3-4B dense models, and it performs well in multi-turn dialogue, creative writing, retrieval-augmented generation (RAG), and tool calling.

Deployment and Integration: 5x Speedup, Compatible with Mainstream Frameworks. LFM2-8B-A1B shows a significant improvement in inference speed on both CPUs and GPUs.

On devices such as the AMD Ryzen AI 9 HX 370 and the Samsung Galaxy S24 Ultra, using custom XNNPACK MoE kernels with int4 weight quantization and int8 dynamic activation quantization, its decode throughput is up to 5x higher than Qwen3-1.7B and IBM Granite 4.0. On the GPU side, integration with vLLM supports FlashInfer and CUDA-graph compilation, enabling efficient operation for both single requests and online batching.
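For GPU serving, the standard vLLM offline API should apply once a build that recognizes the lfm2moe architecture is installed. The snippet below is a minimal sketch; the prompt and sampling settings are arbitrary choices for illustration.

```python
# Minimal vLLM sketch (assumes a vLLM build that already supports lfm2moe).
from vllm import LLM, SamplingParams

llm = LLM(model="LiquidAI/LFM2-8B-A1B")            # pulls weights from Hugging Face
params = SamplingParams(temperature=0.3, max_tokens=256)

outputs = llm.generate(["Summarize the benefits of on-device MoE models."], params)
print(outputs[0].outputs[0].text)
```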

Quantized variants are sized for high-end phones, tablets, and laptops: Q4_0 is approximately 4.7GB and F16 approximately 16.7GB. Supported frameworks include llama.cpp (build b6709 or later for lfm2moe support), ExecuTorch (mobile/embedded CPU), and vLLM (GPU). GGUF quantized files are available on Hugging Face, along with a Colab fine-tuning notebook, for quick development. The model can also be tested on Liquid Playground.
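On CPU, the GGUF files can also be driven from Python through the llama-cpp-python bindings (a separate project that tracks llama.cpp). This is a sketch only: it assumes a bindings build recent enough to include the lfm2moe support added in b6709+, and the local file name is illustrative, so adjust it to the actual Q4_0 download.

```python
# Sketch using the llama-cpp-python bindings (assumes lfm2moe support, i.e. a build
# tracking llama.cpp b6709 or later; the GGUF file name below is illustrative).
from llama_cpp import Llama

llm = Llama(model_path="LFM2-8B-A1B-Q4_0.gguf", n_ctx=4096, n_threads=8)
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "List three on-device use cases for a sparse MoE model."}],
    max_tokens=200,
)
print(resp["choices"][0]["message"]["content"])
```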

Open Source and Impact: Promoting AI Accessibility at the Edge. LFM2-8B-A1B is open-sourced under the LFM Open License v1.0 (based on Apache 2.0), with weights and technical details published on Hugging Face (LiquidAI/LFM2-8B-A1B). The release lowers the barrier to AI deployment and brings new momentum to edge computing, from real-time private chat to embedded intelligent systems. AIbase Perspective: as cloud AI costs soar, efficient models like LFM2-8B-A1B are accelerating the trend toward "AI decentralization."

Project: https://huggingface.co/LiquidAI/LFM2-8B-A1B