Skip to content

[Benchmarking]: Add multi-retriever comparison benchmarking with per-query metrics#290

Open
oussamahansal wants to merge 4 commits into
poc-benchmark-concurrentqafrom
benchmark-latency-tokens-count
Open

[Benchmarking]: Add multi-retriever comparison benchmarking with per-query metrics#290
oussamahansal wants to merge 4 commits into
poc-benchmark-concurrentqafrom
benchmark-latency-tokens-count

Conversation

@oussamahansal
Copy link
Copy Markdown
Collaborator

Description of changes:
Add retriever parameterization, per-query latency/token tracking, aggregate metrics summaries, and cross-retriever comparison reporting to the benchmarking framework.

What's included:

  • Retriever parameterization — BENCHMARK_RETRIEVER env var selects which retriever to benchmark (traversal, topic_based, entity_based, chunk_based, entity_network, chunk_based_semantic, semantic_guided, topic-beam-chunk_only, semantic-path_weighted, agentic, byokg_agentic)
  • Per-query metrics — Each query records retrieval_ms, response_ms, total_ms, input_tokens, output_tokens, hop_classification in responses.jsonl
  • Aggregate summary — metrics_summary.json with avg/p50/p95 latency, total tokens, estimated cost (USD)
  • Retriever-specific output directories — Results written to benchmark-results/{dataset}/{retriever}/ to avoid overwriting between runs
  • Comparison report — comparison_report.json with cost-efficiency rankings, latency-efficiency rankings, and multi-hop breakdown
  • PGA dataset support — Added as fourth benchmark dataset (507 docs, 400 QA pairs)
  • OpenSearch date_detection fix — Prevents mapper_parsing_exception on date-like strings (e.g., "2016-17" seasons in PGA data)
  • Vector store connectivity check — Fails fast if graph isn't populated when reusing existing stacks
  • Property-based tests: 57 tests covering all 15 correctness properties (hypothesis, 100+ iterations each)
  • Backward compatible: When BENCHMARK_RETRIEVER is unset, behavior is identical to the previous implementation.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Copy link
Copy Markdown
Collaborator

@mykola-pereyma mykola-pereyma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove meta information from tests about requirements or indexed property or replace with explicit description.
Please add date when cost was estimated to ensure clients will not be confused later when prices will change.


## Running Benchmarks

### CUAD (Build → Query → Evaluate)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if the prototype cuad dataset is useful for sanity checks.

If not, maybe we should remove the dataset.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I Think we can keep it if we want to validate pipeline in 2 min before multi-hour runs

Comment thread integration-tests/BENCHMARKING.md Outdated
Comment thread integration-tests/BENCHMARKING.md Outdated
Comment thread integration-tests/BENCHMARKING.md Outdated
Comment thread integration-tests/BENCHMARKING.md Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants