[Benchmarking]: Add multi-retriever comparison benchmarking with per-query metrics by oussamahansal · Pull Request #290 · awslabs/graphrag-toolkit

oussamahansal · 2026-05-28T16:06:29Z

Description of changes:
Add retriever parameterization, per-query latency/token tracking, aggregate metrics summaries, and cross-retriever comparison reporting to the benchmarking framework.

What's included:

- Retriever parameterization: the BENCHMARK_RETRIEVER env var selects which retriever to benchmark (traversal, topic_based, entity_based, chunk_based, entity_network, chunk_based_semantic, semantic_guided, topic-beam-chunk_only, semantic-path_weighted, agentic, byokg_agentic)
- Per-query metrics: each query records retrieval_ms, response_ms, total_ms, input_tokens, output_tokens, and hop_classification in responses.jsonl
- Aggregate summary: metrics_summary.json with avg/p50/p95 latency, total tokens, and estimated cost (USD)
- Retriever-specific output directories: results are written to benchmark-results/{dataset}/{retriever}/ so that runs do not overwrite each other
- Comparison report: comparison_report.json with cost-efficiency rankings, latency-efficiency rankings, and a multi-hop breakdown
- PGA dataset support: added as the fourth benchmark dataset (507 docs, 400 QA pairs)
- OpenSearch date_detection fix: prevents mapper_parsing_exception on date-like strings (e.g., "2016-17" seasons in PGA data)
- Vector store connectivity check: fails fast if the graph isn't populated when reusing existing stacks
Property-based tests: 57 tests covering all 15 correctness properties (hypothesis, 100+ iterations each)
Backward compatible: When BENCHMARK_RETRIEVER is unset, behavior is identical to the previous implementation.
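
For readers who want a feel for how the per-query records could roll up into the aggregate summary, here is a minimal sketch. It is not the PR's implementation. The field names (retrieval_ms, response_ms, total_ms, input_tokens, output_tokens) come from the description above; the function names, the layout of the summary dictionary, and the nearest-rank percentile method are assumptions made for illustration.

```python
import json
import math
from statistics import mean

# Latency fields named in the PR description.
LATENCY_FIELDS = ("retrieval_ms", "response_ms", "total_ms")


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100] (assumed method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def load_records(path):
    """Read one JSON object per non-empty line, e.g. from responses.jsonl."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def summarize(records):
    """Build a hypothetical summary dict: avg/p50/p95 per latency field plus token totals."""
    summary = {"query_count": len(records)}
    for field in LATENCY_FIELDS:
        values = [record[field] for record in records]
        summary[field] = {
            "avg": mean(values),
            "p50": percentile(values, 50),
            "p95": percentile(values, 95),
        }
    summary["total_input_tokens"] = sum(r["input_tokens"] for r in records)
    summary["total_output_tokens"] = sum(r["output_tokens"] for r in records)
    return summary
```

Under these assumptions, `summarize(load_records(path))` on one retriever's responses.jsonl would yield a dictionary that could be serialized with `json.dump`. Nearest-rank is only one of several percentile definitions, so values may differ slightly from those produced by an interpolating method.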

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


mykola-pereyma

Please remove the meta information about requirements or indexed properties from the tests, or replace it with explicit descriptions.
Please add the date on which the cost was estimated, so that clients are not confused later when prices change.
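
One way to act on the pricing-date request is to keep the date next to the prices, so that every cost estimate reports when its prices were captured. The sketch below is illustrative only: `PriceSnapshot`, `estimate_cost`, the output keys, and all numeric values are hypothetical placeholders, not the PR's code and not real model prices.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PriceSnapshot:
    """Token prices together with the date they were looked up (hypothetical type)."""
    input_usd_per_1k: float
    output_usd_per_1k: float
    as_of: date


def estimate_cost(input_tokens, output_tokens, prices):
    """Return the estimated cost in USD along with the pricing date it is based on."""
    cost = (
        input_tokens / 1000 * prices.input_usd_per_1k
        + output_tokens / 1000 * prices.output_usd_per_1k
    )
    return {
        "estimated_cost_usd": round(cost, 6),
        "pricing_as_of": prices.as_of.isoformat(),
    }


# Placeholder prices and date, chosen only to make the arithmetic easy to follow.
EXAMPLE_PRICES = PriceSnapshot(
    input_usd_per_1k=1.0,
    output_usd_per_1k=2.0,
    as_of=date(2000, 1, 1),
)
```

With these placeholder values, 2000 input tokens and 500 output tokens come to 2 × 1.0 + 0.5 × 2.0 = 3.0 USD, and the result carries the `pricing_as_of` field so a later reader can tell whether the prices are stale.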

acarbonetto · 2026-05-28T20:47:47Z

+
+## Running Benchmarks
+
+### CUAD (Build → Query → Evaluate)


Not sure if the prototype CUAD dataset is useful for sanity checks.

If not, maybe we should remove the dataset.

I think we can keep it if we want to validate the pipeline in 2 minutes before multi-hour runs.

Establish a reproducible benchmarking baseline for the graphrag-toolk…

82a1763

…it's retrieval strategies.

oussamahansal requested review from acarbonetto and mykola-pereyma May 28, 2026 16:06

mykola-pereyma reviewed May 28, 2026


Comment thread integration-tests/test-scripts/graphrag_toolkit_tests/benchmark_utils/test_agentic_retriever.py Outdated

Comment thread integration-tests/test-scripts/graphrag_toolkit_tests/benchmark_utils/test_agentic_retriever.py Outdated

Oussama Hansal added 2 commits May 28, 2026 13:46

clean up files

cc61b0f

Added pricing date

c9013d7

oussamahansal requested a review from mykola-pereyma May 28, 2026 20:50

acarbonetto approved these changes May 28, 2026


fix token issue and adress readme comments

59c2d75

mykola-pereyma approved these changes May 29, 2026

oussamahansal wants to merge 4 commits into poc-benchmark-concurrentqa from benchmark-latency-tokens-count
