SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning

Lingkun Long1, Rubing Yang1, Yushi Huang2, Desheng Hui1, Ao Zhou1,*, Jianlei Yang1,*

1 Beihang University, 2 Hong Kong University of Science and Technology

[📝 Paper] | [🚀 Code]

Introduction

We propose SlimInfer, a framework that accelerates long-context inference for Large Language Models (LLMs) by dynamically pruning less critical prompt tokens during the forward pass. Existing methods often focus on sparse attention or decoding optimization, but they still process the full set of hidden states at each layer, which limits overall efficiency.

Our method builds on a key insight that we call the Information Diffusion Phenomenon: as information from critical tokens propagates through the layers, it becomes distributed across the entire sequence. As a result, LLMs can maintain semantic integrity even when a large number of tokens (including critical ones) are pruned from the intermediate hidden states.

Motivated by this, SlimInfer introduces:

  1. Dynamic Block-wise Pruning: A mechanism that accurately removes redundant tokens from the hidden states at intermediate layers.
  2. Asynchronous KV Cache Manager: A predictor-free strategy that prefetches required token blocks from CPU to GPU, reducing memory usage and I/O costs.
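The idea of block-wise pruning can be illustrated with a minimal sketch. It is not SlimInfer's actual algorithm: the per-token importance scores, the mean-based block score, and the function name select_blocks are all assumptions made for illustration. The sketch groups token positions into fixed-size blocks, ranks the blocks, and returns the token indices that survive.

```python
from typing import List


def select_blocks(scores: List[float], block_size: int, keep_ratio: float) -> List[int]:
    """Return the sorted token indices kept after block-wise pruning.

    scores: one hypothetical importance score per prompt token.
    block_size: number of consecutive tokens per block.
    keep_ratio: fraction of blocks to keep, in (0, 1].
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Split token positions into contiguous blocks (the last one may be shorter).
    blocks = [range(start, min(start + block_size, len(scores)))
              for start in range(0, len(scores), block_size)]
    if not blocks:
        return []

    # Assumption: a block's importance is the mean score of its tokens.
    block_scores = [sum(scores[i] for i in block) / len(block) for block in blocks]

    # Keep the highest-scoring blocks, and always keep at least one.
    n_keep = max(1, round(keep_ratio * len(blocks)))
    ranked = sorted(range(len(blocks)), key=lambda j: block_scores[j], reverse=True)
    kept_blocks = sorted(ranked[:n_keep])

    # Flatten back to token indices, preserving the original token order.
    return [i for j in kept_blocks for i in blocks[j]]


if __name__ == "__main__":
    toy_scores = [0.1, 0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.6]
    print(select_blocks(toy_scores, block_size=2, keep_ratio=0.5))
```

In a PyTorch model, the returned indices could be used to gather the surviving rows of a hidden-state tensor along the sequence dimension, so that later layers operate on a shorter sequence.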

Extensive experiments show that SlimInfer can achieve up to 2.53× Time-To-First-Token (TTFT) speedup and 1.88× end-to-end latency reduction on LLaMA-3.1-8B-Instruct on a single RTX 4090, without sacrificing performance on LongBench.
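To put these factors in perspective, the short sketch below converts a speedup into the share of baseline latency that remains. It assumes that both figures are ratios of baseline time to SlimInfer time; the helper name relative_latency is hypothetical, and the reported numbers are upper bounds ("up to"), not typical values.

```python
def relative_latency(speedup: float) -> float:
    """Fraction of the baseline latency that remains after a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


if __name__ == "__main__":
    # Factors quoted above; assumed to mean baseline_time / sliminfer_time.
    for name, speedup in [("TTFT", 2.53), ("end-to-end", 1.88)]:
        remaining = relative_latency(speedup)
        print(f"{name}: {remaining:.1%} of baseline latency "
              f"({1.0 - remaining:.1%} lower)")
```

Under that reading, a 2.53× TTFT speedup leaves roughly 40% of the baseline time to first token, and a 1.88× end-to-end factor leaves roughly 53% of the baseline latency.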

Figure 1: Accuracy vs. Inference Efficiency

Figure 4: Overview of SlimInfer

Installation

Requirements

Our framework is implemented in PyTorch and integrates with the Transformers library. To get the code, clone the repository:

git clone https://github.com/Longxmas/SlimInfer.git
cd SlimInfer

The key dependencies are:

  • torch==2.5.1
  • transformers==4.53.0
  • flash-attn==2.6.3
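Before running anything, it can help to confirm that the installed versions match the pins above. The following sketch uses only the standard library's importlib.metadata; the helper name check_pins and the choice to ignore local build tags such as "+cu121" are assumptions, not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this README.
PINNED = {"torch": "2.5.1", "transformers": "4.53.0", "flash-attn": "2.6.3"}


def check_pins(pins):
    """Map each package name to (expected, installed or None, matches)."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Assumption: a local build tag like "+cu121" should not count as a mismatch.
        matches = installed is not None and installed.split("+")[0] == expected
        report[name] = (expected, installed, matches)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in check_pins(PINNED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: expected {expected}, found {installed} [{status}]")
```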

Quick Start

We provide a demo.py script to quickly verify the inference speedup of SlimInfer. The script benchmarks generation time across different input and output lengths.

Run the demo with the following command:

python demo.py --model <MODEL_PATH> --mode sliminfer

Arguments:

  • --model: Path to your local Llama-3.1-8B-Instruct or Qwen2.5-7B-Instruct model.
  • --mode: Either sliminfer (our method) or origin (the original Hugging Face implementation).
  • --pruning_config: (Optional) Path to the pruning configuration file. Defaults to prune_configs/b64_t09_w4_prune_fx_9_8_19_4_29_2.yaml.
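To compare the two modes side by side, the command line can be assembled programmatically. This is a small sketch that relies only on the flags documented above; the helper name demo_command and the model path are placeholders, and the script only prints the commands instead of running them.

```python
import shlex
import sys
from typing import List, Optional


def demo_command(model_path: str, mode: str,
                 pruning_config: Optional[str] = None) -> List[str]:
    """Build the argv list for one demo.py run using the documented flags."""
    if mode not in ("sliminfer", "origin"):
        raise ValueError("mode must be 'sliminfer' or 'origin'")
    cmd = [sys.executable, "demo.py", "--model", model_path, "--mode", mode]
    if pruning_config is not None:
        cmd += ["--pruning_config", pruning_config]
    return cmd


if __name__ == "__main__":
    model = "/path/to/Llama-3.1-8B-Instruct"  # placeholder, replace with a real path
    for mode in ("origin", "sliminfer"):
        # Print the command; pass the list to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(demo_command(model, mode)))
```

Running the origin mode first gives a baseline against which the sliminfer timings can be read.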

Evaluation

We provide scripts to evaluate the performance of SlimInfer on standard benchmarks. Before running the scripts, please update the MODEL_PATH variable in the scripts to point to your local model directory.

LongBench Evaluation

To evaluate the model accuracy on the LongBench dataset (including Single-Doc QA, Multi-Doc QA, Summarization, etc.), run:

bash scripts/longbench_eval.sh
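For readers unfamiliar with how QA-style answers are typically scored, here is a minimal sketch of token-level F1, a common metric for question-answering tasks. It is an illustration only: whether scripts/longbench_eval.sh computes exactly this, and with what text normalization, is an assumption, and the function name token_f1 is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("the cat sat", "the cat"))
```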

Needle In A Haystack (NIAH)

To run the "Needle In A Haystack" stress test of long-context retrieval:

bash scripts/niah.sh
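The general shape of a needle-in-a-haystack test can be sketched as follows: a short fact (the needle) is inserted at a chosen relative depth in filler text, and the model's answer is checked for that fact. The function names, the filler text, and the substring-based check are assumptions for illustration and do not describe what scripts/niah.sh does internally.

```python
def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Place `needle` at relative `depth` (0.0 = start, 1.0 = end) in repeated filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    sentences = [filler] * n_sentences
    position = round(depth * n_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences)


def retrieved(model_output: str, answer: str) -> bool:
    """Simplistic check: does the output contain the expected answer?"""
    return answer.lower() in model_output.lower()


if __name__ == "__main__":
    context = build_haystack(
        filler="The sky was grey that day.",
        needle="The secret code is 7421.",
        n_sentences=4,
        depth=0.5,
    )
    print(context)
    print(retrieved("I believe the code is 7421.", "7421"))
```

Sweeping depth and context length over a grid produces the familiar heatmap of retrieval accuracy by needle position.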

Results

Accuracy

Despite aggressive pruning, SlimInfer maintains near-lossless accuracy across various tasks in LongBench, establishing a superior Pareto frontier between efficiency and performance.

Table 1: Performance comparison on LongBench (Bai et al. 2024).

Efficiency

SlimInfer significantly reduces both TTFT and end-to-end latency compared to FlashAttention2 (Full KV) and other state-of-the-art pruning methods (LazyLLM, FlexPrefill, MInference).

Figure 5: Inference efficiency comparison
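The two latency metrics can be made concrete with a small timing harness. This is a sketch under stated assumptions, not the benchmark used for the figure: measure_latency and fake_stream are hypothetical names, and the fake stream only simulates prefill and decode with sleeps.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_latency(stream: Callable[[], Iterable[str]]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, n_tokens) for one streamed generation."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in stream():
        if ttft is None:
            # Time-To-First-Token: elapsed time until the first token arrives.
            ttft = time.perf_counter() - start
        n_tokens += 1
    total = time.perf_counter() - start  # end-to-end latency
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total, n_tokens


def fake_stream():
    """Stand-in for a model: a slow 'prefill', then three quick 'decode' steps."""
    time.sleep(0.05)
    yield "first"
    for _ in range(3):
        time.sleep(0.01)
        yield "tok"


if __name__ == "__main__":
    ttft, total, n = measure_latency(fake_stream)
    print(f"TTFT {ttft * 1000:.1f} ms, end-to-end {total * 1000:.1f} ms, {n} tokens")
```

When timing a real GPU model, the clock should be read only after pending kernels have finished (for example, after calling torch.cuda.synchronize()), otherwise asynchronous execution makes the numbers look better than they are.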

Citation

If you find this work useful, please cite our paper:

@misc{long2025sliminferacceleratinglongcontextllm,
      title={SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning}, 
      author={Lingkun Long and Rubing Yang and Yushi Huang and Desheng Hui and Ao Zhou and Jianlei Yang},
      year={2025},
      eprint={2508.06447},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.06447}, 
}

About

[AAAI 2026] SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning
