Early today, Microsoft officially released Mu, its latest small-parameter model. With only 330 million parameters, Mu matches the performance of Microsoft's previously released Phi-3.5-mini while being just one-tenth its size. More impressively, Mu sustains a response speed of over 100 tokens per second running offline on laptop NPUs, a rare breakthrough among small-parameter models.
A major highlight of the Mu model is its support for an agent in Windows Settings. Users simply issue natural-language instructions, which the agent converts into system operations in real time. For example, a request like "Make the mouse pointer larger and adjust the screen brightness" lets the agent accurately locate the relevant settings and apply them in one step, greatly improving the usability of Windows.
Mu Architecture: Excellent Optimization for Small-Scale Local Deployment
The Mu model draws inspiration from Microsoft's previously released Phi Silica model and is optimized for small-scale local deployment, making it especially suitable for Copilot+ PCs equipped with NPUs. Its core architecture is a decoder-only Transformer, on top of which it introduces three key innovations:
- Dual Layer Normalization: Applying LayerNorm both before and after each sub-layer in the Transformer keeps the distribution of activations statistically well behaved, stabilizing training. This avoids the training-instability issues common in deep networks, improving training efficiency and reducing resource consumption.
- Rotary Position Embedding (RoPE): Unlike traditional absolute position embeddings, RoPE applies a rotation in the complex domain, turning position encoding into a function that directly reflects the relative distance between tokens. This avoids the performance degradation traditional methods suffer on long sequences and gives the model strong long-sequence extrapolation capabilities.
- Grouped-Query Attention: This optimization addresses the high parameter count and memory consumption of traditional multi-head attention. By sharing keys and values within groups of query heads, it significantly reduces attention parameters and memory usage, lowering latency and power consumption on NPUs. At the same time, because query-head diversity is preserved, performance remains comparable to standard multi-head attention.
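The three mechanisms above can be sketched in a minimal NumPy decoder sub-layer. This is an illustrative reconstruction, not Mu's actual implementation: the dimensions, the random weights, the exact placement of the two LayerNorms relative to the residual connection, and the omission of learned scale/shift parameters are all simplifying assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position over the feature dimension
    # (learned scale/shift omitted for brevity).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rope(x, base=10000.0):
    # x: (seq, heads, head_dim). Rotate channel pairs by position-dependent
    # angles so that dot products depend only on relative token distance.
    seq, heads, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)          # (half,)
    angles = np.arange(seq)[:, None] * freqs[None, :]  # (seq, half)
    cos = np.cos(angles)[:, None, :]                   # broadcast over heads
    sin = np.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    # x: (seq, d_model). Queries get n_q_heads heads; keys/values get only
    # n_kv_heads, each shared by a group of n_q_heads // n_kv_heads query heads.
    seq, d_model = x.shape
    head_dim = d_model // n_q_heads
    group = n_q_heads // n_kv_heads
    q = (x @ wq).reshape(seq, n_q_heads, head_dim)
    k = (x @ wk).reshape(seq, n_kv_heads, head_dim)
    v = (x @ wv).reshape(seq, n_kv_heads, head_dim)
    q, k = rope(q), rope(k)            # relative positions via rotation
    k = np.repeat(k, group, axis=1)    # share each KV head across its group
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim)
    mask = np.triu(np.full((seq, seq), -np.inf), 1)  # causal mask
    weights = np.exp(scores + mask)
    weights /= weights.sum(-1, keepdims=True)
    out = np.einsum("hqk,khd->qhd", weights, v)
    return out.reshape(seq, d_model)

def dual_norm_sublayer(x, f):
    # "Dual LayerNorm": normalize before AND after the sub-layer,
    # then add the residual (placement is an assumption).
    return x + layer_norm(f(layer_norm(x)))
```

With 4 query heads sharing 2 KV heads, the key/value projections are half the size of a standard multi-head layer, which is where the parameter and memory savings come from.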
In addition, the Mu model employs training techniques such as a warmup-stable-decay learning-rate schedule and the Muon optimizer to further improve performance. Microsoft trained Mu on A100 GPUs, following the approach pioneered with the Phi models: it was first pre-trained on hundreds of billions of high-quality educational tokens to learn grammar, semantics, and world knowledge. To further improve accuracy, Mu was then distilled from the Phi models, achieving significant parameter efficiency: with only one-tenth the parameters of Phi-3.5-mini, it reaches similar performance.
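A warmup-stable-decay schedule can be sketched as a simple piecewise function. The warmup/decay fractions and the linear decay shape below are illustrative defaults, not the hyperparameters Microsoft used for Mu.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.05, decay_frac=0.2):
    """Warmup-Stable-Decay learning rate: linear warmup to peak_lr,
    a long constant plateau, then decay to zero at the end of training.
    Fractions and the linear decay shape are illustrative assumptions."""
    warmup = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:
        # linear warmup from ~0 to peak_lr
        return peak_lr * (step + 1) / warmup
    if step < decay_start:
        # stable phase: hold the peak learning rate
        return peak_lr
    # decay phase: linearly anneal to zero over the final steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - decay_start))
```

The long stable phase is the distinguishing feature versus cosine schedules: training can be extended or checkpointed mid-plateau without committing to a total step count up front.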
Empowering Windows Agents: The Perfect Combination of Low Latency and High Accuracy
To enhance the usability of Windows, Microsoft has been working to build an AI agent that understands natural language and can seamlessly modify system settings. Microsoft plans to integrate the Mu-driven agent into the existing settings search bar for a smooth user experience, which demands ultra-low-latency responses across the many possible settings.
After testing various models, Microsoft selected Mu for its fit to these constraints. Although the baseline Mu model saw roughly a 50% drop in accuracy without fine-tuning, Microsoft closed this gap by scaling the training data to 3.6 million samples (a 1,300-fold increase) and expanding coverage from about 50 settings to hundreds. Using techniques such as automated-annotation-based synthetic data, prompt tuning with metadata, diverse phrasing, noise injection, and smart sampling, the fine-tuned Mu settings agent met its quality goals. Testing shows the Mu-based agent understands and executes Windows settings operations well, with response times kept under 500 milliseconds.
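Two of the techniques named above, diverse phrasing and noise injection, can be illustrated with a toy synthetic-data generator. The templates, misspelling table, and action label below are hypothetical placeholders; the article names the techniques but not Microsoft's actual pipeline or label format.

```python
import random

# Hypothetical phrasing templates and typo table for settings utterances.
TEMPLATES = [
    "make the {target} {change}",
    "can you make the {target} {change}",
    "please make the {target} {change} for me",
]
TYPOS = {"brightness": ["britness", "brightnes"], "pointer": ["ponter"]}

def inject_noise(text, rng, p=0.3):
    # With probability p per known word, swap in a misspelling so the
    # fine-tuned model stays robust to real-world typos.
    for word, variants in TYPOS.items():
        if word in text and rng.random() < p:
            text = text.replace(word, rng.choice(variants))
    return text

def synthesize(target, change, action, rng, n=5):
    # Expand one annotated (utterance -> setting action) pair into n
    # diversely phrased, optionally noised training samples.
    samples = []
    for _ in range(n):
        text = rng.choice(TEMPLATES).format(target=target, change=change)
        samples.append({"text": inject_noise(text, rng), "action": action})
    return samples

# Example: one seed annotation fans out into several phrasings,
# all mapped to the same (hypothetical) settings action label.
data = synthesize("mouse pointer", "larger", "Pointer.Size:increase",
                  random.Random(0))
```

Scaling this fan-out across hundreds of settings is how a small seed of annotations can grow into millions of training samples.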