The Alibaba International Digital Commerce team recently introduced a new member of its Marco-MoE model series, Marco-Mini-Instruct, once again demonstrating its "small scale, great results" efficiency philosophy. The model has 17.3B total parameters but activates only 0.86B of them per token (about 5%), giving it very high inference efficiency, enough to run smoothly even on an ordinary CPU.

Extreme Lightweight: Runs Smoothly on CPU
According to official estimates, with 8-bit quantization and four DDR4-2400 memory modules, inference speed can reach roughly 30 tokens/s. This brings the MoE architecture a step closer to being "accessible to all" and greatly lowers the barrier to local deployment.
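The ~30 tokens/s figure is consistent with a simple bandwidth argument. A minimal sketch, assuming CPU decoding is memory-bandwidth-bound (each token requires streaming all activated weights from RAM once) and that the four modules run in independent channels:

```python
# Back-of-envelope check of the ~30 tokens/s estimate.
active_params = 0.86e9               # activated parameters per token
bytes_per_param = 1                  # 8-bit quantization
channels = 4                         # four DDR4-2400 modules (assumed one per channel)
bw_per_channel = 2400e6 * 8          # DDR4-2400: 2400 MT/s x 8 bytes = 19.2 GB/s
total_bw = channels * bw_per_channel # 76.8 GB/s aggregate bandwidth

bytes_per_token = active_params * bytes_per_param  # ~0.86 GB read per token
ceiling = total_bw / bytes_per_token               # theoretical upper bound
print(f"theoretical ceiling: {ceiling:.0f} tokens/s")  # ~89 tokens/s
# The quoted ~30 tokens/s is about a third of this ceiling, a plausible
# real-world efficiency once KV-cache reads, routing overhead, and imperfect
# bandwidth utilization are accounted for.
```

The same arithmetic explains why the 17.3B total parameter count barely matters for speed: only the 0.86B activated parameters cross the memory bus per token.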
Core Innovation: Upcycling Technology "Turns Stones into Gold"
The biggest highlight of Marco-Mini-Instruct is not its parameter count or speed, but how it was built. Rather than being trained from scratch, the model was converted from the dense Qwen3-0.6B-Base model using upcycling technology.

Concretely, parts of the small dense model are split or copied into multiple experts, and a routing mechanism is introduced. This is combined with fine-grained sub-matrix partitioning and a Drop-Upcycling strategy (randomly re-initializing a portion of each copied expert's parameters at conversion time, so that the initially identical experts diverge and specialize during training), achieving a smooth upgrade from a pure dense model to an MoE architecture. The method offers the industry a new low-cost, high-efficiency route to MoE training.
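The conversion step can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the dense FFN weights stand in for Qwen3-0.6B-Base, all dimensions and the re-initialization ratio are illustrative, and Drop-Upcycling is modeled as partial re-initialization of the copied experts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 64, 256, 8
reinit_ratio = 0.5  # fraction of each expert re-initialized (illustrative)

# Dense FFN weights from the pretrained model (random stand-ins here).
w_in = rng.standard_normal((d_model, d_ff)) * 0.02
w_out = rng.standard_normal((d_ff, d_model)) * 0.02

experts = []
for _ in range(n_experts):
    e_in, e_out = w_in.copy(), w_out.copy()
    # Drop-Upcycling: re-initialize a random subset of hidden units so the
    # experts do not remain identical copies during training.
    drop = rng.random(d_ff) < reinit_ratio
    e_in[:, drop] = rng.standard_normal((d_model, drop.sum())) * 0.02
    e_out[drop, :] = rng.standard_normal((drop.sum(), d_model)) * 0.02
    experts.append((e_in, e_out))

# Freshly initialized router: maps a token vector to scores over the experts.
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_ffn(x, top_k=2):
    """Route token x to its top-k experts and mix their outputs."""
    scores = x @ router
    top = np.argsort(scores)[-top_k:]
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over top-k
    out = np.zeros(d_model)
    for w, idx in zip(weights, top):
        e_in, e_out = experts[idx]
        out += w * (np.maximum(x @ e_in, 0.0) @ e_out)  # ReLU FFN stand-in
    return out

y = moe_ffn(rng.standard_normal(d_model))
print(y.shape)  # (64,)
```

With top_k=2 of 8 experts, only a quarter of the expert parameters are touched per token, which is the mechanism behind the 0.86B-of-17.3B activation ratio described above.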
Context and Training Configuration Details
The model's config raises max_position_embeddings to 32K, but the SFT phase actually used an 8192-token context. For most practical applications, 8192 is therefore the more reliable default context length.
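The distinction matters for deployment. A hypothetical sketch (the config values mirror the article; the field name follows common Hugging Face config conventions):

```python
# Illustrative config excerpt: what the architecture supports vs. what the
# model was actually fine-tuned on.
config = {
    "max_position_embeddings": 32768,  # positional encoding capacity
}
sft_context_length = 8192              # context actually seen during SFT

# A conservative deployment caps generation at the SFT context length,
# since quality beyond the trained range is not guaranteed.
effective_context = min(config["max_position_embeddings"], sft_context_length)
print(effective_context)  # 8192
```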
Post-training Highlights: Cascaded On-Policy Distillation
The post-training pipeline is also notable: an SFT warm-up is followed by cascaded on-policy distillation, first using Qwen3-30B-A3B-Instruct as the teacher model, then switching to the stronger Qwen3-Next-80B-A3B-Instruct. The distillation data spans instruction following, complex reasoning, safety alignment, and mathematical ability, so the model gains substantially in overall capability while staying efficient.
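The core of one on-policy distillation step can be sketched as follows. This is a simplified NumPy illustration, not the team's training code: the logits are random stand-ins, and the objective shown is the reverse KL commonly used when the expectation is taken under the student's own samples.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 16

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Stand-in logits at one token position. In on-policy distillation the
# *student* generates the sequence, then both models score that sequence.
student_logits = rng.standard_normal(vocab)
teacher_logits = rng.standard_normal(vocab)

p_s, p_t = softmax(student_logits), softmax(teacher_logits)

# Reverse KL(student || teacher): penalizes the student wherever it puts
# probability mass the teacher would not.
reverse_kl = float(np.sum(p_s * (np.log(p_s) - np.log(p_t))))
print(f"per-token reverse KL: {reverse_kl:.4f}")

# "Cascaded" means running this loop in two stages: first against the
# mid-sized teacher (Qwen3-30B-A3B-Instruct), then against the stronger
# one (Qwen3-Next-80B-A3B-Instruct), so the student tracks progressively
# better targets.
```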
Performance Testing: 0.86B Activated Parameters Outperform 4B Dense Models
With only 0.86B activated parameters, the released Marco-Mini-Instruct outperformed dense models such as Qwen3-4B on most mainstream benchmarks, strong validation of the MoE architecture's potential on the "small yet powerful" path.
Industry Significance: A New Open-Source MoE Training Paradigm
AIbase believes the greatest value of this work is that it opens a new door for developers: there is no need to train a large MoE model from scratch. Instead, pick a suitable small dense model and reproduce the upcycling + Drop-Upcycling recipe described in the paper. The total training cost is modest: the SFT phase takes 64 GPUs × 24 hours and the distillation phase 64 GPUs × 110 hours, greatly lowering the barrier for small and medium-sized teams to experiment with MoE.
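In GPU-hours, the quoted budget works out as follows (assuming the two 64-GPU figures refer to the same accelerator type):

```python
# Total training budget implied by the article's figures.
sft_gpu_hours = 64 * 24        # SFT phase: 1,536 GPU-hours
distill_gpu_hours = 64 * 110   # distillation phase: 7,040 GPU-hours
total = sft_gpu_hours + distill_gpu_hours
print(total)  # 8576 GPU-hours in total
```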
Alibaba's latest "retrofit" once again shows that breakthroughs in model efficiency need not come from stacking parameters; innovative training paradigms can deliver qualitative leaps as well. The release of Marco-Mini-Instruct should accelerate the adoption of MoE technology on edge devices and among individual developers, and it deserves the industry's continued attention.
