Skip to content

muliturn benchmark#1313

Merged
shihaobai merged 1 commit into
mainfrom
benchmark
May 19, 2026
Merged

muliturn benchmark#1313
shihaobai merged 1 commit into
mainfrom
benchmark

Conversation

@shihaobai
Copy link
Copy Markdown
Collaborator

No description provided.

@shihaobai shihaobai merged commit f850264 into main May 19, 2026
1 check failed
@shihaobai shihaobai deleted the benchmark branch May 19, 2026 08:46
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a multi-turn dialogue benchmark tool, a quantization configuration for Qwen 3.5 122B, and a server startup script. The benchmark script simulates concurrent multi-turn sessions to measure metrics such as TTFT, TPOT, and cache hit ratios. Review feedback highlights a potential issue with SSE message parsing in the benchmark script, suggesting the use of readline() for robustness. Other recommendations include optimizing prompt length calculations to prevent performance bottlenecks in long sessions, improving the handling of existing output files, and using script-relative paths in the shell script to increase portability.

Comment on lines +149 to +155
async for raw in response.content:
line = raw.strip()
if not line or not line.startswith(b"data:"):
continue
data_str = line[len(b"data:"):].strip()
if data_str == b"[DONE]":
break
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation iterates over response.content using async for, which in aiohttp yields arbitrary chunks of data as they arrive from the network. This is not guaranteed to align with SSE message boundaries (lines starting with data:). If a chunk contains multiple lines or a partial line, json.loads will fail or data will be skipped. Using readline() ensures that each iteration processes a complete line.

Suggested change
async for raw in response.content:
line = raw.strip()
if not line or not line.startswith(b"data:"):
continue
data_str = line[len(b"data:"):].strip()
if data_str == b"[DONE]":
break
while True:
line = await response.content.readline()
if not line:
break
line = line.strip()
if not line or not line.startswith(b"data:"):
continue
data_str = line[len(b"data:"):].strip()
if data_str == b"[DONE]":
break

else:
new_text = ""
new_prompt = prompt + generated_text + new_text
new_len = len(tokenizer.encode(new_prompt, add_special_tokens=False))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Re-encoding the entire prompt on every turn leads to $O(N^2)$ complexity relative to the number of turns. For long contexts (e.g., 160k tokens), this becomes a significant performance bottleneck on the client side, potentially deflating the reported QPS and TPM metrics as the session progresses. Consider calculating the new length by adding the lengths of the model's output and the new user tokens to the server-reported prompt_tokens from the previous turn to avoid redundant computation.

Comment on lines +468 to +471
if args.dump_file and os.path.exists(args.dump_file) and os.path.getsize(args.dump_file) > 0:
with open(args.dump_file, "r") as f:
print(json.dumps(json.load(f), indent=4))
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The script exits immediately if the dump_file already exists and is non-empty. This behavior is counter-intuitive for a benchmark tool; users typically expect the script to run a new benchmark and either overwrite or append to the output file. It would be better to provide a warning or a flag to control this behavior (e.g., --force to overwrite or a separate command to view results).

)

if [[ "${ENABLE_DEEPEP:-0}" == "1" || "${ENABLE_DEEPEP:-}" == "true" ]]; then
ARGS+=(--quant_cfg ../../advanced_config/mixed_quantization/qwen3_5-122b-moe-only-fp8.yaml --enable_ep_moe --dp 8 --batch_max_tokens 4096 --graph_max_batch_size 64 --chunked_prefill_size 2048 --mem_fraction 0.8 --linear_att_cache_size 300)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using a relative path for --quant_cfg makes the script fragile as it depends on the current working directory from which the script is executed. It is safer to resolve the path relative to the script's own location using $(dirname "$0").

Suggested change
ARGS+=(--quant_cfg ../../advanced_config/mixed_quantization/qwen3_5-122b-moe-only-fp8.yaml --enable_ep_moe --dp 8 --batch_max_tokens 4096 --graph_max_batch_size 64 --chunked_prefill_size 2048 --mem_fraction 0.8 --linear_att_cache_size 300)
ARGS+=(--quant_cfg "$(dirname "$0")/../../advanced_config/mixed_quantization/qwen3_5-122b-moe-only-fp8.yaml" --enable_ep_moe --dp 8 --batch_max_tokens 4096 --graph_max_batch_size 64 --chunked_prefill_size 2048 --mem_fraction 0.8 --linear_att_cache_size 300)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant