1 Beihang University, 2 Hong Kong University of Science and Technology
In this work, we propose SlimInfer, a framework that accelerates long-context inference for Large Language Models (LLMs) by dynamically pruning less critical prompt tokens during the forward pass. Existing methods often focus on sparse attention or decoding optimization, but they still process the full set of hidden states at every layer, which limits overall efficiency.
Our method builds on a key insight called the Information Diffusion Phenomenon: as information from critical tokens propagates through the layers, it becomes distributed across the entire sequence. As a result, LLMs can maintain semantic integrity even when a large share of tokens (including the critical ones themselves) is pruned from the intermediate hidden states.
Motivated by this, SlimInfer introduces:
- Dynamic Block-wise Pruning: A mechanism that accurately removes redundant tokens from the hidden states at intermediate layers.
- Asynchronous KV Cache Manager: A predictor-free strategy that prefetches required token blocks from CPU to GPU, reducing memory usage and I/O costs.
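To make the block-wise idea concrete, here is a minimal pure-Python sketch of selecting token blocks from per-token importance scores. It is an illustration only, not SlimInfer's implementation: the function name `select_token_blocks`, the mean-score block heuristic, and the `keep_ratio` parameter are all assumptions introduced for this example.

```python
import math
from typing import List


def select_token_blocks(scores: List[float], block_size: int, keep_ratio: float) -> List[int]:
    """Return the indices of tokens kept after block-wise pruning (illustrative only)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Split token positions into contiguous blocks; the last block may be shorter.
    blocks = [
        list(range(start, min(start + block_size, len(scores))))
        for start in range(0, len(scores), block_size)
    ]
    if not blocks:
        return []

    # Score each block by the mean importance of its tokens (an assumed heuristic).
    block_scores = [sum(scores[i] for i in block) / len(block) for block in blocks]

    # Keep the highest-scoring blocks, and always at least one.
    num_keep = max(1, math.ceil(keep_ratio * len(blocks)))
    ranked = sorted(range(len(blocks)), key=lambda b: block_scores[b], reverse=True)
    kept_blocks = sorted(ranked[:num_keep])  # restore the original token order

    return [i for b in kept_blocks for i in blocks[b]]


# Hypothetical scores for 8 tokens, pruned in blocks of 2 with half the blocks kept.
scores = [0.1, 0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.6]
print(select_token_blocks(scores, block_size=2, keep_ratio=0.5))  # [2, 3, 6, 7]
```

Pruning whole blocks rather than individual tokens keeps the surviving positions contiguous in runs, which is generally friendlier to batched memory transfers than scattered single-token gathers.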
Extensive experiments show that SlimInfer can achieve up to 2.53× Time-To-First-Token (TTFT) speedup and 1.88× end-to-end latency reduction on LLaMA-3.1-8B-Instruct on a single RTX 4090, without sacrificing performance on LongBench.
Our framework is implemented in PyTorch and integrates with the Transformers library.
git clone https://github.com/Longxmas/SlimInfer.git
cd SlimInfer

The key dependencies are:

torch==2.5.1
transformers==4.53.0
flash-attn==2.6.3
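Before running anything, it can help to confirm that the installed packages match the pinned versions listed above. The following is a small sketch using only the standard library; the helper name `find_mismatches` and the choice to ignore local build tags such as `+cu121` are assumptions made for this example, not part of the repository.

```python
from importlib import metadata

# Versions pinned in this README.
PINNED = {"torch": "2.5.1", "transformers": "4.53.0", "flash-attn": "2.6.3"}


def find_mismatches(pinned, get_version=metadata.version):
    """Map each mismatching package to a (wanted, installed) pair."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # A local build tag such as "2.5.1+cu121" still counts as a match.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {installed or 'not installed'}")
```

An empty result means all three packages are installed at the expected versions.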
We provide a demo.py script to quickly verify the inference speedup of SlimInfer. This script benchmarks the generation time across different input lengths and output lengths.
Run the demo with the following command:
python demo.py --model <MODEL_PATH> --mode sliminfer

- --model: Path to your local Llama-3.1-8B-Instruct or Qwen2.5-7B-Instruct model.
- --mode: Choose between sliminfer (our method) or origin (original Hugging Face implementation).
- --pruning_config: (Optional) Path to the pruning configuration file. Defaults to prune_configs/b64_t09_w4_prune_fx_9_8_19_4_29_2.yaml.
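For readers who want to compare the two modes themselves, the sketch below shows one way to separate Time-To-First-Token from end-to-end latency for any streaming generator. It is not the logic of demo.py: `measure_latency`, `speedup`, and the sleep-based `fake_stream` stand-in are hypothetical names introduced here, and a real GPU measurement would also need to synchronize the device before reading the clock, since CUDA kernels are launched asynchronously.

```python
import time
from typing import Callable, Iterable


def measure_latency(stream_tokens: Callable[[], Iterable[object]]) -> dict:
    """Time one streaming generation call; returns TTFT and total seconds."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream_tokens():
        if ttft is None:
            # The first yielded token marks the end of prefill.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return {"ttft_s": total if ttft is None else ttft, "total_s": total, "tokens": count}


def speedup(baseline_s: float, optimized_s: float) -> float:
    """Speedup factor of the optimized run relative to the baseline."""
    return baseline_s / optimized_s


def fake_stream(prefill_s: float = 0.05, step_s: float = 0.01, new_tokens: int = 4):
    """Stand-in for a model: sleeps for 'prefill', then yields tokens one by one."""
    time.sleep(prefill_s)
    for token_id in range(new_tokens):
        yield token_id
        time.sleep(step_s)


if __name__ == "__main__":
    baseline = measure_latency(lambda: fake_stream(prefill_s=0.10))
    pruned = measure_latency(lambda: fake_stream(prefill_s=0.04))
    print(f"TTFT speedup: {speedup(baseline['ttft_s'], pruned['ttft_s']):.2f}x")
```

Reporting TTFT and end-to-end latency separately matters here because prompt pruning mainly shortens the prefill phase, so the two ratios can differ noticeably.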
We provide scripts to evaluate the performance of SlimInfer on standard benchmarks.
Before running the scripts, please update the MODEL_PATH variable in the scripts to point to your local model directory.
To evaluate the model accuracy on the LongBench dataset (including Single-Doc QA, Multi-Doc QA, Summarization, etc.), run:
bash scripts/longbench_eval.sh

To run the "Needle In A Haystack" pressure test for context retrieval capabilities:

bash scripts/niah.sh

Despite aggressive pruning, SlimInfer maintains near-lossless accuracy across various tasks in LongBench, establishing a superior Pareto frontier between efficiency and performance.
SlimInfer significantly reduces both TTFT and End-to-End latency compared to FlashAttention2 (Full KV) and other state-of-the-art pruning methods (LazyLLM, FlexPrefill, MInference).
If you find this work useful, please cite our paper:
@misc{long2025sliminferacceleratinglongcontextllm,
title={SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning},
author={Lingkun Long and Rubing Yang and Yushi Huang and Desheng Hui and Ao Zhou and Jianlei Yang},
year={2025},
eprint={2508.06447},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.06447},
}


