TurboQuant: Google aims to curb the memory hunger of LLMs
Google's TurboQuant compresses the key-value cache of large language models down to 3 bits per value. Accuracy is said to be preserved while speed multiplies.
Google Research has published new technical details about its compression algorithm TurboQuant. It is designed to compress the key-value cache of large language models down to 3 bits per value – without any measurable loss in model accuracy, the researchers say. According to Google, the method computes attention logits up to eight times faster on Nvidia H100 GPUs than with unquantized 32-bit keys – though unquantized keys are rarely used in modern deployments anyway. Many approaches aim to push below 4 bits per value; Google combines the PolarQuant and QJL methods for its approach.
Background
The key-value cache, in which transformer models temporarily store already computed context information for quick access, requires large amounts of RAM. With long input sequences, this cache grows significantly and becomes a bottleneck. Existing vector-quantization approaches alleviate this, but they create their own memory overhead: for every small data block, quantization constants must be stored in full precision, which reduces the compression gain by 1 to 2 bits per value. TurboQuant aims to eliminate this problem by combining PolarQuant and QJL.
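That overhead can be made concrete with a back-of-envelope calculation (block size and constant width below are generic illustrative assumptions, not figures from the paper):

```python
def effective_bits(bits_per_value: float, block_size: int,
                   constant_bits: int = 32) -> float:
    """Bits actually spent per value when each block of `block_size`
    values stores one full-precision quantization constant."""
    return bits_per_value + constant_bits / block_size

# Nominal 3-bit quantization with one fp32 constant per 32-value block:
print(effective_bits(3, 32))  # 4.0 -> one extra bit per value
print(effective_bits(3, 16))  # 5.0 -> two extra bits per value
```

The smaller the block, the more the stored constants eat into the compression gain – exactly the 1-to-2-bit penalty described above.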
PolarQuant: Compression via Polar Coordinates
PolarQuant deviates from the usual approach of processing vectors in Cartesian coordinates. Instead, the method randomly rotates the data vectors and then converts them into polar coordinates. The data is no longer stored as distances along individual axes but as a combination of a radius, which describes the signal strength, and angles, which encode the direction – and thus the semantic content. Since the resulting angle distributions are highly concentrated and predictable, the otherwise necessary normalization step with its memory overhead is eliminated. PolarQuant handles the majority of the compression work in TurboQuant.
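A minimal sketch of that idea (not Google's implementation – the dimension, the pairing of coordinates into 2-D points, and the crude 3-bit angle grid are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d: int) -> np.ndarray:
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix for a uniformly random rotation

d = 8
v = rng.standard_normal(d)
rotated = random_rotation(d) @ v  # rotation preserves the vector's norm

# Re-express consecutive coordinate pairs as 2-D points in polar form:
pairs = rotated.reshape(-1, 2)
radii = np.linalg.norm(pairs, axis=1)          # "signal strength"
angles = np.arctan2(pairs[:, 1], pairs[:, 0])  # direction, in [-pi, pi]

# Because the angle distribution is concentrated after the random rotation,
# a coarse grid can suffice; here a naive uniform 3-bit quantizer:
codes = np.round((angles + np.pi) / (2 * np.pi) * (2**3 - 1)).astype(int)
```

The key point is that the rotation is norm-preserving, so no per-block scaling constants need to be stored alongside the quantized angles.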
QJL: Error Correction with One Bit
The second stage addresses the small residual error left by PolarQuant. QJL (Quantized Johnson-Lindenstrauss) uses the Johnson-Lindenstrauss transformation, known in theoretical computer science, to reduce the remaining high-dimensional error data to a single sign bit per value. The essential distances and relationships between the data points are preserved. QJL thus functions as a mathematical error correction: it eliminates systematic distortions in the attention scores without causing additional memory overhead.
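Roughly, the sign-bit sketch can be illustrated as follows (a toy version with made-up dimensions; the estimator uses the standard expectation of a Gaussian sign sketch, not necessarily the paper's exact construction):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 256                    # data dim, sketch dim (illustrative values)
S = rng.standard_normal((m, d))   # Johnson-Lindenstrauss projection matrix

def qjl_encode(x):
    """Keep only one sign bit per projected coordinate, plus x's norm."""
    return np.sign(S @ x), np.linalg.norm(x)

def qjl_inner(sign_bits, norm_x, y):
    """Estimate <x, y> from the sign sketch, using
    E[sign(s.x) * (s.y)] = sqrt(2/pi) * <x, y> / ||x|| for Gaussian s."""
    return norm_x * np.sqrt(np.pi / 2) * np.mean(sign_bits * (S @ y))

x = rng.standard_normal(d)
bits, norm_x = qjl_encode(x)
est = qjl_inner(bits, norm_x, x)  # estimate of <x, x> = ||x||^2
```

The sketch stores just one bit per projected value, yet inner products – and hence attention scores – can still be estimated without systematic bias; the sampling noise shrinks as the sketch dimension m grows.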
Promising Benchmarks
Google tested all three algorithms with the open-source models Llama-3.1-8B-Instruct and Mistral-7B-Instruct on common long-context benchmarks, including LongBench, Needle in a Haystack, ZeroSCROLLS, RULER, and L-Eval. The results: In the needle-in-a-haystack tests, where a model must find a single piece of information within large amounts of text, TurboQuant reduced KV memory by at least a factor of 6, according to Google, and matched the accuracy of the full-precision baseline across all benchmarks (score: 0.997). In the LongBench results shown in the ICLR paper, the compression rates lie above or below this, depending on the bit width. The models reportedly lost no quality in the tested tasks – question answering, code generation, and summarization.
Using TurboQuant requires no training or fine-tuning of the models. For vector search, Google compares TurboQuant with Product Quantization (PQ) and RaBitQ. In the paper, the authors criticize PQ primarily for its dataset-dependent training effort and the need for large codebooks, and RaBitQ for its lack of vectorization, missing GPU support, and additional overhead.
Use in Gemini and Google Search
Google sees the main application of TurboQuant in eliminating KV cache bottlenecks in models like Gemini. Furthermore, the method is intended to accelerate semantic vector search, which searches for content similarity in billions of vectors rather than keywords. Due to the low memory requirement and the almost negligible preprocessing effort, large vector indices can be built and queried much more efficiently.
TurboQuant will be presented at ICLR 2026, PolarQuant and QJL at AISTATS 2026. Further information can be found in the Google Research Blog.
(fo)