kv-cache-compression

OnlyTerp / turboquant

First open-source implementation of Google TurboQuant (ICLR 2026) -- near-optimal KV cache compression for LLM inference. 5x compression with near-zero quality loss.

machine-learning compression deep-learning pytorch transformer attention quantization iclr vector-quantization memory-optimization kv-cache google-research llm vllm llm-inference kv-cache-compression

Updated May 25, 2026
Python

snu-mllab / Context-Memory

PyTorch implementation for "Compressed Context Memory For Online Language Model Interaction" (ICLR'24)

efficient-llm-inference context-compression kv-cache-compression

Updated Apr 18, 2024
Python

JIA-Lab-research / Q-LLM

This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"

fast-inference inference-acceleration large-language-models long-context kv-cache-compression

Updated Jul 16, 2024
Python

abdelfattah-lab / xKV

xKV: Cross-Layer SVD for KV-Cache Compression

mla low-rank long-context llm-inference deepseek kv-cache-compression inter-layer

Updated Jun 21, 2026
Python

Linking-ai / SCOPE

(ACL2025 oral) SCOPE: Optimizing KV Cache Compression in Long-context Generation

long-context kv-cache-compression kvcache

Updated May 28, 2025
Jupyter Notebook

Janghyun1230 / FastKVzip

Accurate and fast KV cache compression with a gating mechanism

large-language-models kv-cache-compression

Updated Apr 5, 2026
Python

aivrar / vllm-windows-build

Native Windows build of vLLM 0.23.0 - no WSL, no Docker. Python 3.13 + CUDA 12.8 + PyTorch 2.11 cu128 for RTX 30/40/50-series, pre-built wheel, Windows patchset, 10 KV-cache compression dtypes, OpenAI API server fixes, and Rust frontend support.

Updated Jun 30, 2026
Python

OnlyTerp / kvtc

First open-source KVTC implementation (NVIDIA, ICLR 2026) -- 8-32x KV cache compression via PCA + adaptive quantization + entropy coding

compression pytorch nvidia transformer pca attention dynamic-programming quantization deflate entropy-coding memory-optimization kv-cache llm llm-inference kv-cache-compression iclr-2026

Updated Apr 17, 2026
Python

MAC-AutoML / Awesome-Efficient-Large-Models

A list of awesome papers on compression and acceleration of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).

acceleration compression survey pruning quantization knowledge-distillation awesome-papers large-language-models multimodal-large-language-models speculative-decoding kv-cache-compression

Updated May 12, 2026

AMD-AGI / AMD-Hybrid-Models

Official repo for AMD hybrid models training and inference workflow

amd attention-mechanism mamba mla hybrid-models kv-cache large-language-models llm kv-cache-compression zebra-llama

Updated May 14, 2026
Python

FluffyAIcode / LLM-KV--Cache-compress

Discrete Kakeya cover for LLM KV cache: D4/E8 nested-lattice quantisation realising a Kakeya-style tube-cover over the direction sphere. 2.4x-2.8x compression at <1% perplexity loss on Qwen3, Llama-3, DeepSeek, GLM-4, Gemma. Drop-in transformers.DynamicCache. pip install kakeyalattice.

transformers quantization discrete-geometry kv-cache long-context vllm llm-inference kv-cache-compression qwen3 lattice-quantization e8-lattice d4-lattice kakeya kakeya-set

Updated Jun 15, 2026
Python

MGDDestiny / Lava

LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation

llm kv-cache-compression

Updated Sep 17, 2025
Python

Ryuketsukami / turboquant-skill

AI agent skill implementing Google's TurboQuant compression algorithm (ICLR 2026) — 6x KV cache memory reduction, 8x speedup, zero accuracy loss. Compatible with Claude Code, Codex CLI, and all Agent Skills-compatible tools.

Updated Mar 28, 2026
Python

Here are 34 public repositories matching this topic...

Zefan-Cai / KVCache-Factory

NVIDIA / kvpress

Zefan-Cai / Awesome-LLM-KV-Cache

AtomicBot-ai / atomic-llama-cpp-turboquant

snu-mllab / KVzip

itsnamgyu / block-transformer

shadowpa0327 / Palu

OnlyTerp / turboquant

snu-mllab / Context-Memory

JIA-Lab-research / Q-LLM

abdelfattah-lab / xKV

Linking-ai / SCOPE

Janghyun1230 / FastKVzip

aivrar / vllm-windows-build

OnlyTerp / kvtc

MAC-AutoML / Awesome-Efficient-Large-Models

AMD-AGI / AMD-Hybrid-Models

FluffyAIcode / LLM-KV--Cache-compress

MGDDestiny / Lava

Ryuketsukami / turboquant-skill
