Real long-context prompts sampled from LongBench-v2
(/remote/vast0/share-mv/zai-org/LongBench-v2/data.json) for serving benchmarks.
Location: /remote/vast0/share-mv/longbenchv2-custom/
Generated by: sample_longbench_v2.py (in this dir).
| File | Target ISL (tokens) | Prompts | Recommended OSL |
|---|---|---|---|
| `longbenchv2-8k.jsonl` | 8,192 | 256 | 1024 |
| `longbenchv2-10k.jsonl` | 10,000 | 256 | 500 |
| `longbenchv2-100k.jsonl` | 100,000 | 100 | 500 |
| `longbenchv2-1M.jsonl` | 1,000,000 | 22 | 500 |
longbenchv2-manifest.json records the achieved token range per file.
The 1M file has only 22 prompts: these are all the LongBench-v2 entries whose real context reaches 1,000,000 GLM-5 tokens (no synthetic padding or repetition). Keep
`--num-prompts <= 22` to use only unique prompts; a larger value makes the benchmark oversample (repeat) them.
JSONL, one request per line. Only `prompt` is read by the benchmark; the rest is provenance:

{"prompt": "...", "input_tokens": 8192, "target_isl": 8192,
"_id": "...", "domain": "...", "sub_domain": "...", "difficulty": "...",
"source_length": "long", "source_words": 232975}

- Tokenizer: GLM-5 (`/remote/vast0/share-mv/zai-org/GLM-5-FP8/tokenizer.json`).
- ISL is controlled by the dataset. Each `prompt` tokenizes to its target ISL (exact for 8k/10k, off by at most 1 for 100k/1M at unavoidable BPE boundaries) under the GLM-5 tokenizer with no special tokens, i.e. exactly what the benchmark measures when `--skip-chat-template` is set.
- OSL is NOT in the dataset. Set it at serve time with `--custom-output-len`.
- Prompt body uses the official LongBench-v2 0-shot template (instruction + context + question + 4 choices); the context is head-truncated to hit the target ISL.
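To make the record layout above concrete, here is a minimal sketch that reads one of the JSONL files and summarizes its provenance fields. It assumes that `input_tokens` holds the achieved token count recorded at generation time (it does not re-tokenize anything), and the function name `summarize_jsonl` is hypothetical, not part of the scripts in this directory.

```python
import json
from pathlib import Path


def summarize_jsonl(path):
    """Summarize the recorded token counts of one dataset file.

    Assumption: `input_tokens` is the achieved length stored by the generator;
    this function only reads metadata and never tokenizes the prompt.
    """
    token_counts = []
    targets = set()
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if "prompt" not in record:
                # `prompt` is the only field the benchmark reads
                raise ValueError(f"{path}:{line_no}: missing 'prompt'")
            token_counts.append(record["input_tokens"])
            targets.add(record["target_isl"])
    if not token_counts:
        raise ValueError(f"{path}: no records found")
    return {
        "prompts": len(token_counts),
        "min_input_tokens": min(token_counts),
        "max_input_tokens": max(token_counts),
        "target_isl": sorted(targets),
    }
```

Calling `summarize_jsonl` on each file gives a quick view of the prompt count and the recorded token range per file, which can be compared by eye against the table of target ISLs.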
Both tools (`vllm bench serve` and `vllm-moreh bench serve`) take the same dataset flags (vllm-moreh vendors vLLM's dataset
loader and adds a few Moreh-only options). For every config you only change two
things: `--dataset-path` (which ISL file) and `--custom-output-len` (the OSL).
| Dataset | `--custom-output-len` | `--num-prompts` |
|---|---|---|
| `longbenchv2-8k.jsonl` | 1024 | ≤ 256 |
| `longbenchv2-10k.jsonl` | 500 | ≤ 256 |
| `longbenchv2-100k.jsonl` | 500 | ≤ 100 |
| `longbenchv2-1M.jsonl` | 500 | ≤ 22 |
Three flags are required to get the intended behavior:
--skip-chat-template (so prompt_len == ISL), --custom-output-len <OSL>
(sets OSL), and --ignore-eos (forces the model to generate the full OSL).
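As an illustration of how the per-dataset settings and the three required flags fit together, the sketch below assembles the dataset-related part of the command line. The helper name `build_bench_args` and the `DATASETS` mapping are hypothetical; the values are copied from the table above. Server-specific flags (`--model`, `--tokenizer`, `--base-url`, `--endpoint`, `--max-concurrency`) are deliberately left out, and rejecting a `--num-prompts` value above the unique count is a choice made here, not something the benchmark enforces.

```python
# (recommended OSL, number of unique prompts) per file, copied from the table above.
DATASETS = {
    "longbenchv2-8k.jsonl": (1024, 256),
    "longbenchv2-10k.jsonl": (500, 256),
    "longbenchv2-100k.jsonl": (500, 100),
    "longbenchv2-1M.jsonl": (500, 22),
}


def build_bench_args(tool, data_dir, dataset, num_prompts=None):
    """Return the dataset-related argv for `vllm` or `vllm-moreh` `bench serve`.

    `tool` is "vllm" or "vllm-moreh". Flags that depend on the running server
    are not included and must be appended by the caller.
    """
    osl, unique = DATASETS[dataset]
    if num_prompts is None:
        num_prompts = unique
    if num_prompts > unique:
        # The benchmark would oversample (repeat) prompts; this sketch refuses instead.
        raise ValueError(f"{dataset} has only {unique} unique prompts")
    return [
        tool, "bench", "serve",
        "--dataset-name", "custom",
        "--dataset-path", f"{data_dir}/{dataset}",
        "--skip-chat-template",              # so prompt_len == ISL
        "--custom-output-len", str(osl),     # sets OSL
        "--ignore-eos",                      # generate the full OSL
        "--num-prompts", str(num_prompts),
    ]
```

The returned list can be extended with the server flags and passed to `subprocess.run`, which avoids shell quoting issues with paths.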
DATA=/remote/vast0/share-mv/longbenchv2-custom
TOK=/remote/vast0/share-mv/zai-org/GLM-5-FP8
vllm bench serve \
--backend vllm \
--dataset-name custom \
--dataset-path $DATA/longbenchv2-8k.jsonl \
--skip-chat-template \
--custom-output-len 1024 \
--ignore-eos \
--tokenizer $TOK \
--model <served-model> \
--base-url http://<host>:<port> --endpoint /v1/completions \
--num-prompts 256 --max-concurrency <C>

`vllm-moreh bench serve` takes the same flags; just swap the command. vllm-moreh adds Moreh-only options
(`--num-warmups`, multi-value `--base-url`/`--host`/`--port` for PD-disaggregated
setups, `--profile`).
DATA=/remote/vast0/share-mv/longbenchv2-custom
TOK=/remote/vast0/share-mv/zai-org/GLM-5-FP8
vllm-moreh bench serve \
--backend vllm \
--dataset-name custom \
--dataset-path $DATA/longbenchv2-100k.jsonl \
--skip-chat-template \
--custom-output-len 500 \
--ignore-eos \
--tokenizer $TOK \
--model <served-model> \
--base-url http://<host>:<port> --endpoint /v1/completions \
--num-prompts 100 --max-concurrency <C> \
--num-warmups 8   # Moreh-only: warmup requests before measuring

- `--skip-chat-template` matters. Without it, the prompt is wrapped in the served model's chat template, adding a handful of tokens, so `prompt_len` becomes `ISL + template_overhead`. Use `/v1/completions` + `--skip-chat-template` for exact ISL.
- Tokenizer must match for exact ISL. ISL was measured with the GLM-5 tokenizer; benchmarking a model with a different tokenizer shifts the real `prompt_len`. Pass a matching `--tokenizer`, or regenerate with that model's tokenizer (below).
- 1M needs context room on the server: launch with `--max-model-len` of at least 1,000,500 (ISL + OSL) plus some margin, and keep `--num-prompts ≤ 22`.
- Why `custom`, not `sharegpt`: `ShareGPTDataset` hard-filters prompts to ≤1024 tokens, silently dropping every long-context sample. `CustomDataset` has no length filter.
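The context-room note above is simple arithmetic, sketched here for convenience. The function name `min_max_model_len` is hypothetical, and the `isl_slack` parameter is an assumption that covers the one-token deviation the 100k/1M prompts can have from their target; pick your own `margin`.

```python
def min_max_model_len(isl: int, osl: int, isl_slack: int = 1, margin: int = 0) -> int:
    """Smallest --max-model-len that fits one request: prompt plus generated tokens.

    isl_slack covers prompts that miss the target ISL by a token (assumed 1 here);
    margin is any extra headroom you want on top.
    """
    if min(isl, osl, isl_slack, margin) < 0:
        raise ValueError("all arguments must be non-negative")
    return isl + isl_slack + osl + margin


if __name__ == "__main__":
    # 1M file with the recommended OSL of 500
    print(min_max_model_len(1_000_000, 500))               # 1000501
    # same, with 1024 tokens of extra headroom
    print(min_max_model_len(1_000_000, 500, margin=1024))  # 1001525
```

Because `--ignore-eos` forces the full OSL to be generated, the budget should be computed with the full `--custom-output-len` value rather than an expected average output length.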
`verify_dataset.py` loads each file through the real `CustomDataset` and checks `prompt_len == target ISL`:

python3 /remote/vast0/share-mv/longbenchv2-custom/verify_dataset.py

To regenerate the files with a different tokenizer:

python3 /remote/vast0/share-mv/longbenchv2-custom/sample_longbench_v2.py \
--tokenizer /path/to/model_or_tokenizer.json \
--output-dir /remote/vast0/share-mv/longbenchv2-custom
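After regenerating with another tokenizer, it can be useful to spot-check the result independently of the benchmark loader. The sketch below is not `verify_dataset.py` (which goes through `CustomDataset`); it is a standalone approximation that re-counts tokens with a caller-supplied counter and reports prompts outside a tolerance. The names `check_isl` and `hf_token_counter` are hypothetical, and `hf_token_counter` assumes the Hugging Face `tokenizers` package is installed.

```python
import json


def check_isl(path, count_tokens, tolerance=0):
    """Return (line_no, target_isl, actual) for every prompt outside the tolerance.

    count_tokens is any callable mapping a prompt string to its token count.
    """
    mismatches = []
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            actual = count_tokens(record["prompt"])
            if abs(actual - record["target_isl"]) > tolerance:
                mismatches.append((line_no, record["target_isl"], actual))
    return mismatches


def hf_token_counter(tokenizer_json):
    """Build a counter from a tokenizer.json file, without special tokens."""
    from tokenizers import Tokenizer  # third-party; imported lazily

    tok = Tokenizer.from_file(tokenizer_json)
    return lambda text: len(tok.encode(text, add_special_tokens=False).ids)
```

For the 8k and 10k files a tolerance of 0 matches the exact-ISL claim above; for the 100k and 1M files, `tolerance=1` allows for the one-token deviation at BPE boundaries. An empty list means every prompt is within tolerance.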