SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning

Lingkun Long¹, Rubing Yang¹, Yushi Huang², Desheng Hui¹, Ao Zhou¹,*, Jianlei Yang¹,*

¹ Beihang University, ² Hong Kong University of Science and Technology

Introduction

In this work, we propose SlimInfer, a framework that accelerates long-context inference for Large Language Models (LLMs) by dynamically pruning less critical prompt tokens during the forward pass. Existing methods often focus on sparse attention or decoding optimization, but they still process the full set of hidden states at every layer, which limits overall efficiency.

Our method builds on a key insight that we call the Information Diffusion Phenomenon: as information from critical tokens propagates through the layers, it becomes distributed across the entire sequence. As a result, LLMs can maintain semantic integrity even when a large share of tokens, including critical ones, is pruned from the intermediate hidden states.

Motivated by this, SlimInfer introduces:

Dynamic Block-wise Pruning: A mechanism that accurately removes redundant tokens from the hidden states at intermediate layers.
Asynchronous KV Cache Manager: A predictor-free strategy that prefetches required token blocks from CPU to GPU, reducing memory usage and I/O costs.
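
To make the block-wise idea concrete, here is a minimal sketch of selecting which token blocks to keep. It is an illustration only, not SlimInfer's implementation: the function name `select_blocks`, the per-token `scores` input, the mean-based block score, and the rule of always keeping the first and last blocks are all assumptions made for this example.

```python
from typing import List


def select_blocks(scores: List[float], block_size: int, keep_ratio: float,
                  protect_last: int = 1) -> List[int]:
    """Return sorted indices of tokens to keep after block-wise pruning.

    scores       -- one importance score per prompt token (hypothetical input;
                    how SlimInfer scores tokens is not described in this README)
    block_size   -- number of consecutive tokens per block
    keep_ratio   -- fraction of blocks to retain, in (0, 1]
    protect_last -- number of trailing blocks that are always kept (assumption)
    """
    n = len(scores)
    # Group token positions into contiguous blocks.
    blocks = [list(range(s, min(s + block_size, n))) for s in range(0, n, block_size)]
    # Score each block by the mean importance of its tokens.
    block_scores = [sum(scores[i] for i in b) / len(b) for b in blocks]

    n_keep = max(1, round(keep_ratio * len(blocks)))
    # Always keep the first block and the most recent blocks (assumption).
    protected = {0} | set(range(max(0, len(blocks) - protect_last), len(blocks)))
    # Spend the remaining budget on the highest-scoring unprotected blocks.
    candidates = sorted((b for b in range(len(blocks)) if b not in protected),
                        key=lambda b: block_scores[b], reverse=True)
    budget = max(0, n_keep - len(protected))
    kept = protected | set(candidates[:budget])
    return [i for b in sorted(kept) for i in blocks[b]]


scores = [0.1, 0.1, 0.9, 0.8, 0.2, 0.1, 0.3, 0.3]
print(select_blocks(scores, block_size=2, keep_ratio=0.75))  # [0, 1, 2, 3, 6, 7]
```

Subsequent layers would then process only the hidden states at the returned positions. Pruning whole blocks rather than individual tokens keeps the surviving positions contiguous, which tends to suit block-granular memory transfers.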

Extensive experiments show that SlimInfer achieves up to a 2.53× Time-To-First-Token (TTFT) speedup and a 1.88× end-to-end latency reduction for LLaMA-3.1-8B-Instruct on a single RTX 4090, without sacrificing performance on LongBench.
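
As a quick sanity check on what these factors mean, the snippet below converts a speedup factor into the fraction of baseline time saved. It assumes both numbers are ratios of baseline time to SlimInfer time, which is an interpretation rather than something stated here; `fraction_saved` is a helper introduced only for this example.

```python
def fraction_saved(speedup: float) -> float:
    """Fraction of baseline time eliminated by a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


for name, factor in [("TTFT", 2.53), ("end-to-end", 1.88)]:
    print(f"{name}: {factor}x -> {fraction_saved(factor):.1%} less time")
# TTFT: 2.53x -> 60.5% less time
# end-to-end: 1.88x -> 46.8% less time
```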

Installation

Requirements

Our framework is implemented in PyTorch and integrates with the Transformers library. Clone the repository:

git clone https://github.com/Longxmas/SlimInfer.git
cd SlimInfer

The key dependencies are:

torch==2.5.1
transformers==4.53.0
flash-attn==2.6.3
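
Because the versions are pinned exactly, it can help to confirm the active environment before running anything. The following sketch uses only the standard library; `check_versions` is a helper written for this example and is not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list above.
REQUIRED = {"torch": "2.5.1", "transformers": "4.53.0", "flash-attn": "2.6.3"}


def check_versions(required=REQUIRED):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    problems = {}
    for name, expected in required.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Compare the public version only, ignoring local build tags such as "+cu124".
        if installed is None or installed.split("+")[0] != expected:
            problems[name] = (expected, installed)
    return problems


if __name__ == "__main__":
    for name, (expected, installed) in check_versions().items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means all three pins match the installed packages.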

Quick Start

We provide a demo.py script to quickly verify the inference speedup of SlimInfer. The script benchmarks generation time across different input and output lengths.

Run the demo with the following command:

python demo.py --model <MODEL_PATH> --mode sliminfer

--model: Path to your local Llama-3.1-8B-Instruct or Qwen2.5-7B-Instruct model.
--mode: Choose between sliminfer (our method) or origin (original Hugging Face implementation).
--pruning_config: (Optional) Path to the pruning configuration file. Defaults to prune_configs/b64_t09_w4_prune_fx_9_8_19_4_29_2.yaml.
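
To compare the two modes back to back, the invocation can be scripted. This is a small sketch that uses only the three flags documented above; `build_demo_command` and `run_both` are hypothetical helpers, and `run_both` assumes it is called from the repository root on a machine that can load the model.

```python
import subprocess
import sys
from typing import List, Optional


def build_demo_command(model_path: str, mode: str,
                       pruning_config: Optional[str] = None) -> List[str]:
    """Assemble the demo.py invocation from the documented flags."""
    if mode not in ("sliminfer", "origin"):
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = [sys.executable, "demo.py", "--model", model_path, "--mode", mode]
    if pruning_config is not None:
        # Omitting the flag lets demo.py fall back to its default config.
        cmd += ["--pruning_config", pruning_config]
    return cmd


def run_both(model_path: str) -> None:
    """Run the Hugging Face baseline first, then SlimInfer."""
    for mode in ("origin", "sliminfer"):
        subprocess.run(build_demo_command(model_path, mode), check=True)
```

Calling `run_both("<MODEL_PATH>")` with a real model directory would print both sets of timings in sequence.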

Evaluation

We provide scripts to evaluate the performance of SlimInfer on standard benchmarks. Before running the scripts, please update the MODEL_PATH variable in the scripts to point to your local model directory.

LongBench Evaluation

To evaluate the model accuracy on the LongBench dataset (including Single-Doc QA, Multi-Doc QA, Summarization, etc.), run:

bash scripts/longbench_eval.sh
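
For readers unfamiliar with how QA-style answers are typically scored, the sketch below computes a token-level F1 between a prediction and a reference. It is an assumption that the evaluation script uses a metric of this kind for its QA tasks; the actual scoring, including any text normalization, lives in longbench_eval.py and may differ. `token_f1` is a name chosen for this example.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two whitespace-tokenized, lowercased strings."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Two empty strings match perfectly; otherwise there is no overlap.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(round(token_f1("the cat sat", "the cat"), 3))  # 0.8
```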

Needle In A Haystack (NIAH)

To run the "Needle In A Haystack" pressure test for context retrieval capabilities:

bash scripts/niah.sh
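
The idea behind the test can be sketched in a few lines: a short fact (the needle) is placed at a chosen depth inside long filler text, and the model is asked to retrieve it. The helpers below are illustrative assumptions, not the logic of needle_in_haystack.py; they count words rather than tokens for simplicity.

```python
def build_haystack(filler: str, needle: str, depth: float, target_words: int) -> str:
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end) of a filler context."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    words = filler.split()
    if not words:
        raise ValueError("filler must contain at least one word")
    # Repeat the filler until it reaches the requested length, then trim.
    repeats = -(-target_words // len(words))  # ceiling division
    haystack = (words * repeats)[:target_words]
    position = round(depth * len(haystack))
    return " ".join(haystack[:position] + [needle] + haystack[position:])


def retrieved(answer: str, expected: str) -> bool:
    """Crude pass/fail check: does the model's answer contain the expected string?"""
    return expected.lower() in answer.lower()


print(build_haystack("a b c", "NEEDLE", depth=0.5, target_words=4))  # a b NEEDLE c a
```

Sweeping `depth` and `target_words` over a grid and recording `retrieved` for each cell yields the kind of heatmap that such tests usually report.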

Results

Accuracy

Despite aggressive pruning, SlimInfer maintains near-lossless accuracy across various tasks in LongBench, establishing a superior Pareto frontier between efficiency and performance.

Efficiency

SlimInfer significantly reduces both TTFT and end-to-end latency compared with FlashAttention2 (Full KV) and other state-of-the-art pruning methods (LazyLLM, FlexPrefill, MInference).
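
For reference, the two metrics can be measured around any token stream as sketched below. This is a generic illustration with a stand-in generator, not the timing code used in demo.py; `measure_latency` and `fake_stream` are names introduced for this example. When timing real GPU work, a call such as `torch.cuda.synchronize()` is usually needed before reading the clock, because kernel launches are asynchronous.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_latency(stream: Callable[[], Iterable[object]]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, n_tokens) for one streamed generation."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in stream():
        if ttft is None:
            ttft = time.perf_counter() - start  # time until the first token arrived
        n_tokens += 1
    total = time.perf_counter() - start  # end-to-end latency
    if ttft is None:
        raise RuntimeError("the stream produced no tokens")
    return ttft, total, n_tokens


def fake_stream():
    """Stand-in generator: a slow 'prefill' followed by faster 'decode' steps."""
    time.sleep(0.05)  # pretend prefill
    for token in range(5):
        yield token
        time.sleep(0.01)  # pretend per-token decode


ttft, total, n = measure_latency(fake_stream)
print(f"TTFT {ttft:.3f}s, end-to-end {total:.3f}s, {n} tokens")
```

TTFT is dominated by prefill over the prompt, which is why pruning prompt tokens affects it more strongly than it affects end-to-end latency.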

Citation

If you find this work useful, please cite our paper:

@misc{long2025sliminferacceleratinglongcontextllm,
      title={SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning}, 
      author={Lingkun Long and Rubing Yang and Yushi Huang and Desheng Hui and Ao Zhou and Jianlei Yang},
      year={2025},
      eprint={2508.06447},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.06447}, 
}
