The Alibaba International Digital Commerce team recently introduced a new member of its Marco-MoE model series, Marco-Mini-Instruct, once again demonstrating its "small scale, great results" efficiency philosophy. The model has 17.3B total parameters but activates only 0.86B of them per token (about 5%), giving it very high inference efficiency, enough to run smoothly even on an ordinary CPU.

Extreme Lightweight: Runs Smoothly on CPU
According to official estimates, with 8-bit quantization and four DDR4-2400 memory modules, inference speed can reach roughly 30 tokens/s. This brings the MoE architecture a step closer to being "accessible to all" and greatly lowers the barrier to local deployment.
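The ~30 tokens/s figure is consistent with a simple bandwidth argument. A minimal sketch, assuming CPU decoding is memory-bandwidth-bound (each token requires streaming all activated weights from RAM once) and that the four modules run in independent channels:

```python
# Back-of-envelope check of the ~30 tokens/s estimate.
active_params = 0.86e9               # activated parameters per token
bytes_per_param = 1                  # 8-bit quantization
channels = 4                         # four DDR4-2400 modules (assumed one per channel)
bw_per_channel = 2400e6 * 8          # DDR4-2400: 2400 MT/s x 8 bytes = 19.2 GB/s
total_bw = channels * bw_per_channel # 76.8 GB/s aggregate bandwidth

bytes_per_token = active_params * bytes_per_param  # ~0.86 GB read per token
ceiling = total_bw / bytes_per_token               # theoretical upper bound
print(f"theoretical ceiling: {ceiling:.0f} tokens/s")  # ~89 tokens/s
# The quoted ~30 tokens/s is about a third of this ceiling, a plausible
# real-world efficiency once KV-cache reads, routing overhead, and imperfect
# bandwidth utilization are accounted for.
```

The same arithmetic explains why the 17.3B total parameter count barely matters for speed: only the 0.86B activated parameters cross the memory bus per token.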
Core Innovation: Upcycling Technology "Turns Stones into Gold"
The biggest highlight of Marco-Mini-Instruct is not its parameter count or speed, but how it was built. Rather than being trained from scratch, the model was converted from the dense Qwen3-0.6B-Base model using upcycling technology.

Concretely, parts of the small dense model are split or copied into multiple experts, and a routing mechanism is introduced. This is combined with fine-grained sub-matrix partitioning and a Drop-Upcycling strategy (randomly re-initializing a portion of each copied expert's parameters at conversion time, so that the initially identical experts diverge and specialize during training), achieving a smooth upgrade from a pure dense model to an MoE architecture. The method offers the industry a new low-cost, high-efficiency route to MoE training.
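The conversion step can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the dense FFN weights stand in for Qwen3-0.6B-Base, all dimensions and the re-initialization ratio are illustrative, and Drop-Upcycling is modeled as partial re-initialization of the copied experts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 64, 256, 8
reinit_ratio = 0.5  # fraction of each expert re-initialized (illustrative)

# Dense FFN weights from the pretrained model (random stand-ins here).
w_in = rng.standard_normal((d_model, d_ff)) * 0.02
w_out = rng.standard_normal((d_ff, d_model)) * 0.02

experts = []
for _ in range(n_experts):
    e_in, e_out = w_in.copy(), w_out.copy()
    # Drop-Upcycling: re-initialize a random subset of hidden units so the
    # experts do not remain identical copies during training.
    drop = rng.random(d_ff) < reinit_ratio
    e_in[:, drop] = rng.standard_normal((d_model, drop.sum())) * 0.02
    e_out[drop, :] = rng.standard_normal((drop.sum(), d_model)) * 0.02
    experts.append((e_in, e_out))

# Freshly initialized router: maps a token vector to scores over the experts.
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_ffn(x, top_k=2):
    """Route token x to its top-k experts and mix their outputs."""
    scores = x @ router
    top = np.argsort(scores)[-top_k:]
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over top-k
    out = np.zeros(d_model)
    for w, idx in zip(weights, top):
        e_in, e_out = experts[idx]
        out += w * (np.maximum(x @ e_in, 0.0) @ e_out)  # ReLU FFN stand-in
    return out

y = moe_ffn(rng.standard_normal(d_model))
print(y.shape)  # (64,)
```

With top_k=2 of 8 experts, only a quarter of the expert parameters are touched per token, which is the mechanism behind the 0.86B-of-17.3B activation ratio described above.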
Context and Training Configuration Details
The model's config raises max_position_embeddings to 32K, but the SFT phase actually used an 8192-token context. For most practical applications, 8192 is therefore the more reliable default context length.
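The distinction matters for deployment. A hypothetical sketch (the config values mirror the article; the field name follows common Hugging Face config conventions):

```python
# Illustrative config excerpt: what the architecture supports vs. what the
# model was actually fine-tuned on.
config = {
    "max_position_embeddings": 32768,  # positional encoding capacity
}
sft_context_length = 8192              # context actually seen during SFT

# A conservative deployment caps generation at the SFT context length,
# since quality beyond the trained range is not guaranteed.
effective_context = min(config["max_position_embeddings"], sft_context_length)
print(effective_context)  # 8192
```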
Post-training Highlights: Cascaded On-Policy Distillation
The post-training pipeline is also notable: an SFT warm-up is followed by cascaded on-policy distillation, first using Qwen3-30B-A3B-Instruct as the teacher model, then switching to the stronger Qwen3-Next-80B-A3B-Instruct. The distillation data spans instruction following, complex reasoning, safety alignment, and mathematical ability, so the model gains substantially in overall capability while staying efficient.
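The core of one on-policy distillation step can be sketched as follows. This is a simplified NumPy illustration, not the team's training code: the logits are random stand-ins, and the objective shown is the reverse KL commonly used when the expectation is taken under the student's own samples.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 16

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Stand-in logits at one token position. In on-policy distillation the
# *student* generates the sequence, then both models score that sequence.
student_logits = rng.standard_normal(vocab)
teacher_logits = rng.standard_normal(vocab)

p_s, p_t = softmax(student_logits), softmax(teacher_logits)

# Reverse KL(student || teacher): penalizes the student wherever it puts
# probability mass the teacher would not.
reverse_kl = float(np.sum(p_s * (np.log(p_s) - np.log(p_t))))
print(f"per-token reverse KL: {reverse_kl:.4f}")

# "Cascaded" means running this loop in two stages: first against the
# mid-sized teacher (Qwen3-30B-A3B-Instruct), then against the stronger
# one (Qwen3-Next-80B-A3B-Instruct), so the student tracks progressively
# better targets.
```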
Performance Testing: 0.86B Activated Parameters Outperform 4B Dense Models
With only 0.86B activated parameters, the released Marco-Mini-Instruct outperformed dense models such as Qwen3-4B on most mainstream benchmarks, strong validation of the MoE architecture's potential on the "small yet powerful" path.
Industry Significance: A New Open-Source MoE Training Paradigm
AIbase believes the greatest value of this work is that it opens a new door for developers: there is no need to train a large MoE model from scratch. Instead, pick a suitable small dense model and reproduce the upcycling + Drop-Upcycling recipe described in the paper. The total training cost is modest: the SFT phase takes 64 GPUs × 24 hours and the distillation phase 64 GPUs × 110 hours, greatly lowering the barrier for small and medium-sized teams to experiment with MoE.
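In GPU-hours, the quoted budget works out as follows (assuming the two 64-GPU figures refer to the same accelerator type):

```python
# Total training budget implied by the article's figures.
sft_gpu_hours = 64 * 24        # SFT phase: 1,536 GPU-hours
distill_gpu_hours = 64 * 110   # distillation phase: 7,040 GPU-hours
total = sft_gpu_hours + distill_gpu_hours
print(total)  # 8576 GPU-hours in total
```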
Alibaba's latest "retrofit" once again shows that breakthroughs in model efficiency need not come from stacking parameters; innovative training paradigms can deliver qualitative leaps as well. The release of Marco-Mini-Instruct should accelerate the adoption of MoE technology on edge devices and among individual developers, and it deserves the industry's continued attention.
