With the popularity of Apple's M4 chip, running large language models (LLMs) smoothly on local hardware without relying on the cloud has become a focus for developers. Recently, developer jola shared a detailed write-up on deploying a local AI workflow on a 24GB M4 MacBook Pro. Test results show that the optimized Qwen 3.5-9B model generates 40 tokens per second, offering an efficient alternative for offline work and privacy-sensitive development.

Model Selection: Why Is the 9B Model the "Best Choice"?

In the initial stage of deployment, jola ran a comparative evaluation of several popular options. The test list spanned models from the lightweight Gemma 4B up to larger ones such as GPT-OSS 20B, across platforms including Ollama, llama.cpp, and LM Studio.
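Throughput is straightforward to compare across these platforms because all three expose an OpenAI-compatible HTTP API. The script below is a minimal sketch of such a benchmark; the port, URL, and model identifier are assumptions (LM Studio defaults to port 1234, Ollama to 11434), not details from jola's report.

```python
import time
import requests

# Hypothetical benchmark: measures tokens per second against any local
# OpenAI-compatible server. Ollama (:11434), llama.cpp's server (:8080),
# and LM Studio (:1234) all expose this API shape; the URL and model
# name below are assumptions, not jola's exact configuration.
BASE_URL = "http://localhost:1234/v1"
MODEL = "qwen3.5-9b"  # placeholder model identifier


def tokens_per_second(prompt: str, max_tokens: int = 256) -> float:
    start = time.time()
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
        timeout=300,
    )
    resp.raise_for_status()
    # completion_tokens is part of the standard usage block in the response
    generated = resp.json()["usage"]["completion_tokens"]
    return generated / (time.time() - start)


if __name__ == "__main__":
    print(f"{tokens_per_second('Explain mutexes in one paragraph.'):.1f} tok/s")
```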

Testing showed that although models above 20B could theoretically fit into 24GB of memory, in practice they were essentially unusable because of their heavy resource consumption. Smaller 4B models, by contrast, responded quickly but handled complex tool-use tasks poorly. Ultimately, Qwen 3.5-9B (in its Q4_K_S quantized version) stood out: the quantization significantly reduces the memory footprint while preserving reasoning ability, and even leaves headroom for other development tools. More importantly, the model supports a context window of up to 128K tokens, a significant advantage when reading long documents or analyzing large codebases.
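Some back-of-the-envelope arithmetic makes the memory trade-off concrete. The snippet below estimates weight storage alone, assuming roughly 4.5 effective bits per weight for Q4_K_S (an approximation; the KV cache, which grows with context length, is ignored):

```python
# Back-of-the-envelope memory math for why a 9B model at Q4_K_S fits
# comfortably in 24 GB while a 20B model leaves little to spare.
BITS_PER_WEIGHT_Q4_K_S = 4.5  # approximate effective rate, an assumption


def weights_gib(params_billion: float, bpw: float = BITS_PER_WEIGHT_Q4_K_S) -> float:
    """Estimated weight storage in GiB, ignoring KV cache and runtime overhead."""
    return params_billion * 1e9 * bpw / 8 / 1024**3


for size in (4, 9, 20):
    print(f"{size:>2}B model ≈ {weights_gib(size):.1f} GiB of weights")
# Prints roughly 2.1, 4.7, and 10.5 GiB: the 9B model leaves most of
# the 24 GB free for the KV cache, the OS, and other development tools.
```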

Tuning Details: Unlocking the Potential of Chain-of-Thought

To make the local model "smarter" in programming and logical-reasoning scenarios, jola fine-tuned the inference parameters in LM Studio. Setting Temperature to 0.6 and Top_p to 0.95 strikes a balance between creativity and accuracy in the responses.
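A request carrying these settings might look like the sketch below, assuming LM Studio's local server is running on its default port; the model identifier is a placeholder, and the parameter names follow the OpenAI-compatible API that LM Studio exposes:

```python
import requests

# A minimal sketch of the sampling setup described above, assuming
# LM Studio's local server on its default port. The model identifier is
# a placeholder; temperature and top_p are the values from the report.
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "qwen3.5-9b",  # placeholder identifier
        "temperature": 0.6,     # jola's reported setting
        "top_p": 0.95,          # jola's reported setting
        "messages": [
            {"role": "user", "content": "Rewrite this recursive function iteratively: ..."},
        ],
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```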

Additionally, this setup explicitly enables thinking mode: by manually injecting specific parameters into the prompt template, the model runs a "self-thinking" reasoning pass before producing its final answer. On the integration side, calling the local API from tools such as Pi and OpenCode lets developers flexibly configure context length and output limits, building out a complete local AI assistant stack.
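The report does not spell out the exact template injection, so the following is only one plausible pattern: a system instruction that asks the model to emit its reasoning inside <think> tags before the final answer, combined with an explicit output limit. Everything here except the 0.6/0.95 sampling values is an illustrative assumption:

```python
import requests

# One plausible thinking-mode pattern, not jola's exact template: a system
# instruction asks the model to reason inside <think> tags before answering,
# and max_tokens caps the output length. Context length itself is usually
# configured on the server side (e.g. in LM Studio's model settings).
THINKING_SYSTEM_PROMPT = (
    "Before answering, reason step by step inside <think>...</think> tags, "
    "then give the final answer outside the tags."
)

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "qwen3.5-9b",  # placeholder identifier
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": 2048,     # output limit
        "messages": [
            {"role": "system", "content": THINKING_SYSTEM_PROMPT},
            {"role": "user", "content": "Why does this recursive directory walk overflow the stack?"},
        ],
    },
    timeout=300,
)
resp.raise_for_status()
reply = resp.json()["choices"][0]["message"]["content"]
# Keep only the text after the reasoning block for the final answer
answer = reply.split("</think>")[-1].strip()
print(answer)
```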

Shift in Perspective: From "Outsourced Assistant" to "Research Partner"

In the report, jola is candid about the gap between local models and top-tier cloud models such as Claude or GPT-4: even at the 9B scale, a local model still loses focus, falls into logical loops, or misreads intent when performing multi-step, complex tasks.

However, this limitation actually fosters a more engaged way of working. Unlike cloud models, which tempt users to outsource their thinking, local models demand clearer instructions and more rigorous guidance. In this interaction, the AI plays the role of a "rubber duck" research assistant with instant recall, rather than a full-stack outsourcing tool.

For users who prioritize data privacy, zero subscription fees, and a controllable development environment, deploying this offline setup on an M4 MacBook is more than a technical experiment: it is a successful return to personal computing autonomy in the face of the trend toward black-box large models.