If you're a tech enthusiast doing local large-model development on a Mac, the "performance package" Ollama just released is one you won't want to miss.

On March 31, Ollama, the local large-model runtime, officially released an update announcing integration of MLX, Apple's in-house machine learning framework. This change to the underlying architecture delivers a significant performance leap on Macs with Apple silicon, raising local AI response speed to a new level.

Core Improvements: Response Speed Doubled, M5 Performance Stands Out

According to official figures, integrating the MLX framework gives Ollama a two-part performance boost:

  • Prefill phase 1.6x faster: processing of the user's input prompt is noticeably more responsive.

  • Decode phase speed doubled: token generation during reply output is roughly twice as fast, so text appears at nearly double the rate.

  • Biggest gains on the newest hardware: machines with M5-series chips benefit most, thanks to the brand-new GPU Neural Accelerators added in the hardware, with an inference experience approaching "instant response."
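
To see what the claimed speedups mean end to end, here is a quick back-of-the-envelope calculation. The baseline throughput numbers and token counts below are hypothetical, chosen only to illustrate how a 1.6x prefill and 2x decode speedup combine into total latency; they are not from the announcement.

```python
# Hypothetical baseline figures (tokens/sec), NOT from the Ollama announcement.
prefill_tps_before, decode_tps_before = 500.0, 25.0
prefill_tps_after = prefill_tps_before * 1.6   # claimed 1.6x prefill speedup
decode_tps_after = decode_tps_before * 2.0     # claimed 2x decode speedup

# An example request: a 1000-token prompt producing a 200-token reply.
prompt_tokens, output_tokens = 1000, 200
latency_before = prompt_tokens / prefill_tps_before + output_tokens / decode_tps_before
latency_after = prompt_tokens / prefill_tps_after + output_tokens / decode_tps_after
print(f"{latency_before:.1f}s -> {latency_after:.1f}s")
```

Under these assumed numbers, total latency drops from 10 seconds to just over 5, which is why the decode-phase doubling dominates the perceived improvement for long replies.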

Memory Management Optimization: Long Conversations No Longer "Stuck"

Beyond raw speed, this update also deeply optimizes memory-management strategy:

  • Efficient scheduling: the new version makes more flexible use of the Mac's unified memory, keeping interaction smooth even in long, large-context sessions.

  • Official recommendation: Ollama recommends a Mac with 32GB of memory or more for the best inference performance.
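
If you want to check your machine against that 32GB recommendation, a quick sketch using only the standard library (works on macOS and Linux; `SC_PAGE_SIZE`/`SC_PHYS_PAGES` are POSIX `sysconf` names):

```python
import os

# Total physical RAM = page size * number of physical pages (POSIX sysconf).
page_size = os.sysconf("SC_PAGE_SIZE")
num_pages = os.sysconf("SC_PHYS_PAGES")
total_gb = page_size * num_pages / 1024**3

print(f"Total memory: {total_gb:.1f} GB")
print("Meets the 32GB recommendation" if total_gb >= 32 else "Below the 32GB recommendation")
```

On Apple silicon this reports the unified memory pool shared by CPU and GPU, which is exactly the budget MLX draws from.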

First Up: Alibaba's Qwen 3.5 Gets Initial Support

During the preview phase, this MLX-accelerated version (Ollama 0.19 Preview) primarily provides dedicated support for Alibaba's Qwen 3.5 model. Ollama has made clear, however, that support will gradually extend to more mainstream AI models.

Industry Insight: The "Millisecond-Level" Era of Local AI Assistants

For developers who rely on Ollama to power local AI coding tools (such as OpenClaw) or code assistants (such as Claude Code or Codex), this update closes a significant gap in the workflow. With latency down to sub-second levels, locally run large models are no longer "lab toys" but real-time productivity tools capable of competing with cloud services.
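
Those tools typically talk to Ollama through its local REST API (by default on port 11434). A minimal sketch of building such a request with only the standard library follows; the model tag `qwen3.5` is an assumption here, and the exact tag on your machine can be checked with `ollama list`.

```python
import json
from urllib import request

# Ollama's local REST API endpoint (default port 11434).
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> request.Request:
    """Build a non-streaming /api/generate request for a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

# "qwen3.5" is an assumed tag; verify with `ollama list` before running.
req = build_generate_request("qwen3.5", "Summarize unified memory in one sentence.")

# To actually send it (requires `ollama serve` running and the model pulled):
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
print(req.full_url, req.get_method())
```

Because the API is plain HTTP on localhost, any editor plugin or assistant can swap a cloud endpoint for this one, which is what makes the latency improvements matter in practice.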

Conclusion: Apple Ecosystem's Computing Closed Loop

From in-house chips to an in-house framework, Apple is steadily consolidating its hold on AI development. Ollama's embrace of MLX not only cements the Mac's position as a top choice for local AI development, but also shows developers the payoff of tight hardware-software integration.