Zhipu, a prominent Chinese artificial intelligence company, officially announced today the launch of the new GLM-5.1 Highspeed API for selected enterprise customers. The model, codenamed "GLM-5.1-highspeed," has drawn industry attention since its release by reaching an output speed of 400 tokens/s.
The figure reportedly exceeds the fastest output speeds currently offered by large-model API providers worldwide. Until now, the AI industry has generally treated model speed and model size as mutually exclusive, with higher speed usually coming at the cost of model capability.
Breaking Industry Conventions with Flagship Performance
The GLM-5.1 Highspeed version, however, breaks the industry convention that "fast means small." It is described as the first large model from a Chinese developer to bring flagship-level capabilities and extremely low latency together in real production environments.
The model was reportedly developed jointly by Zhipu's GLM team and the TileRT team. The two teams carried out system-level optimizations at three levels: the inference engine, the scheduling system, and the underlying infrastructure, abandoning traditional dynamic scheduling.
Optimization at Three Levels Ensures Stable Output
On the technical side, the development team rewrote the core inference path for the model architecture to improve single-card throughput, and reduced latency in high-concurrency scenarios through techniques such as dynamic batching. Coordinated optimization of the infrastructure, meanwhile, ensured that 400 tokens/s is a stable, usable production-level capability.
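For readers unfamiliar with dynamic batching, the following is a minimal, generic sketch of the idea in Python. It is not Zhipu's or TileRT's implementation, whose details have not been published. The names Request, dynamic_batches, max_batch_size, and max_wait_ms are hypothetical and chosen only for illustration. The sketch closes a batch as soon as it is full or as soon as a newly arriving request would exceed a waiting budget measured from the oldest request in the batch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: int     # hypothetical identifier for an incoming API call
    arrival_ms: float   # arrival time in milliseconds


def dynamic_batches(
    requests: List[Request], max_batch_size: int, max_wait_ms: float
) -> List[List[Request]]:
    """Group requests into batches by arrival time.

    A batch is closed when it already holds max_batch_size requests, or when
    the next request arrives more than max_wait_ms after the oldest request
    in the batch. This is an illustrative offline simulation, not a server.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        if current:
            is_full = len(current) >= max_batch_size
            waited_too_long = req.arrival_ms - current[0].arrival_ms > max_wait_ms
            if is_full or waited_too_long:
                batches.append(current)
                current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    arrivals = [0.0, 1.0, 2.0, 10.0, 11.0]
    reqs = [Request(i, t) for i, t in enumerate(arrivals)]
    for batch in dynamic_batches(reqs, max_batch_size=2, max_wait_ms=5.0):
        print([r.request_id for r in batch])
```

The two limits express the usual trade-off: a larger batch size raises throughput per card, while a shorter waiting budget keeps individual requests from queuing too long under light load.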
The high-speed model has broad application prospects and is especially suited to scenarios with strict response-latency requirements, such as AI programming, real-time voice interaction, and high-frequency business decision-making. The model is already available to selected enterprises on Zhipu's MaaS platform.
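To put the 400 tokens/s figure in concrete terms, the short Python sketch below estimates how long a response of a given length takes to stream at a given output speed. The function name generation_seconds and the 50 tokens/s comparison speed are assumptions made for illustration, not figures from the announcement. The estimate covers output generation only and ignores time to first token, network delay, and queuing.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Return the time needed to generate output_tokens at a steady output speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    response_length = 2000  # hypothetical response size in tokens
    # 400 tokens/s is the announced speed; 50 tokens/s is an assumed baseline.
    for speed in (400.0, 50.0):
        seconds = generation_seconds(response_length, speed)
        print(f"{speed:.0f} tokens/s: {seconds:.1f} s for {response_length} tokens")
```

Under these assumptions, a 2,000-token response takes 5 seconds at 400 tokens/s, compared with 40 seconds at the assumed 50 tokens/s baseline.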



