During the recent WWDC (Worldwide Developers Conference), the AI software platform LM Studio partnered with Apple to showcase a highly technologically impactful achievement: successfully running Moonshot AI's flagship model, Kimi K2.6, on a cluster of four Mac Studios. This demonstration highlighted the significant potential of Apple Silicon architecture in handling ultra-large-scale AI models.

The Kimi K2.6 model employs an advanced MoE (Mixture of Experts) architecture, with a total parameter count of up to one trillion. Although the dynamic expert scheduling mechanism reduces the computational load during inference by activating only about 32 billion parameters, loading the full weight of the entire model still presents a stringent memory challenge—requiring at least approximately 2TB of memory capacity when calculated at FP16 precision. In traditional data center environments, this typically requires server clusters consisting of 8 to 16 high-end GPUs, with costs often reaching millions of dollars.

However, this demonstration bypassed this barrier through an innovative technical approach. Four Mac Studios equipped with M3 Ultra chips were interconnected via Thunderbolt5 interfaces, utilizing the RDMA-over-Thunderbolt technology available in the latest version of macOS. This broke the boundaries of physical devices, allowing direct sharing of memory between multiple devices. The total of approximately 2TB of unified memory was integrated into a single logical "large memory pool," easily accommodating the weights of the trillion-parameter model. During the live demonstration, the cluster showcased excellent performance, generating about 28 tokens per second, and consuming significantly less power than traditional GPU computing centers.

In addition, LM Studio also released a key component called LM Link during this collaboration. This tool is based on the Tailscale Mesh VPN architecture and allows users to securely access this local Mac Studio cluster remotely through end-to-end encrypted channels. This means users do not need to be present at the host; whether using a MacBook or iPhone, they can remotely call the cluster's computing power for inference from any network environment. All sensitive data is processed within a closed loop locally, without passing through third-party cloud servers.

This demonstration was not only a technological showcase but also sent a clear industry signal: Apple Silicon, with its unified memory architecture and efficient multi-device connectivity capabilities, is becoming a new choice for large model deployment. For enterprises that require frequent and long-term operation of large model inference, this solution replaces expensive cloud rental costs with hardware purchase, offering significant cost advantages over long-term operations.

As the performance of "consumer-grade" hardware clusters continues to improve, the organizational barriers for AI technology applications are being further lowered. This achievement indicates that the source of cutting-edge artificial intelligence innovation will no longer be limited to a few tech giants with large supercomputing centers. A decentralized computing network may soon experience new development opportunities.