Google Unveils TurboQuant: LLM Key-Value Cache Memory Compressed 6x, Attention up to 8x Faster, Near-Zero Accuracy Loss, No Training Required!
Google introduces the TurboQuant algorithm, which builds on the PolarQuant and QJL techniques to cut the key-value (KV) cache memory footprint of large language model inference by at least 6x and speed up attention computation by up to 8x on H100 GPUs, with near-zero accuracy loss and no retraining required. The breakthrough could lower AI deployment costs and accelerate the development of long-context applications.
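The announcement does not spell out TurboQuant's exact procedure, but the memory claim is easy to sanity-check with a generic low-bit KV-cache quantization scheme. The sketch below is a minimal NumPy illustration of symmetric per-group quantization, not Google's implementation: the function names, the 2-bit width, and the group size of 128 are all illustrative assumptions. At roughly 2 bits per value plus amortized scale overhead, it gives about 7.5x compression versus an fp16 cache, in the same ballpark as the 6x figure quoted above.

```python
import numpy as np

def quantize_kv(cache: np.ndarray, bits: int = 2, group: int = 128):
    """Symmetric per-group quantization of an fp16 KV-cache tensor.

    Stores low-bit integer codes plus one fp16 scale per group of
    `group` values. Generic scheme for illustration only; TurboQuant's
    actual algorithm (built on QJL/PolarQuant) is not reproduced here.
    """
    flat = cache.astype(np.float32).reshape(-1, group)
    qmax = 2 ** (bits - 1) - 1                        # e.g. 1 for 2-bit, 7 for 4-bit
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard against all-zero groups
    codes = np.clip(np.round(flat / scale), -qmax - 1, qmax).astype(np.int8)
    # NOTE: int8 storage is for clarity; a real kernel would bit-pack
    # the 2-bit codes to actually realize the compression ratio.
    return codes, scale.astype(np.float16)

def dequantize_kv(codes: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    """Reconstruct an approximate fp16 cache from codes and scales."""
    return (codes.astype(np.float32) * scale).reshape(shape).astype(np.float16)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy cache: (tokens, heads, head_dim) in fp16.
    kv = rng.standard_normal((4096, 8, 128)).astype(np.float16)
    codes, scale = quantize_kv(kv, bits=2)
    recon = dequantize_kv(codes, scale, kv.shape)

    bits_per_value = 2 + 16 / 128                     # codes + amortized fp16 scale
    print(f"compression vs fp16: {16 / bits_per_value:.1f}x")   # ~7.5x
    diff = recon.astype(np.float32) - kv.astype(np.float32)
    err = np.linalg.norm(diff) / np.linalg.norm(kv.astype(np.float32))
    print(f"relative reconstruction error: {err:.3f}")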