Context
Our current LoCoMo benchmark measures retrieval quality (R@5, R@10, MRR). Competitors like Backboard report end-to-end answer accuracy using an LLM-as-Judge protocol (GPT-4.1), scoring 90.1% overall. We need the same metric for a direct comparison.
What to Build
Adapt the evaluation pipeline to add an LLM-as-Judge step after retrieval:
- Retrieve context using BM search (existing)
- Pass retrieved context + question to an LLM (GPT-4.1 or Claude)
- LLM generates an answer
- A judge LLM compares the generated answer against the ground truth and issues a verdict: CORRECT or WRONG
- Report accuracy by category (single_hop, multi_hop, open_domain, temporal)
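The steps above can be sketched as a single loop. This is a minimal sketch, not the real pipeline: `retrieve`, `answer_llm`, and `judge_llm` are hypothetical callables standing in for our existing search and whatever LLM client we wire up, and the judge prompt is an assumed shape, not Backboard's exact prompt.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    question: str
    category: str
    label: str  # "CORRECT" or "WRONG"

# Assumed prompt shape for the judge; the real prompt should be fixed
# and published alongside results, as Backboard does.
JUDGE_PROMPT = (
    "Question: {q}\n"
    "Ground truth: {gt}\n"
    "Candidate answer: {ans}\n"
    "Reply with exactly one word: CORRECT or WRONG."
)

def evaluate(questions, retrieve, answer_llm, judge_llm, k=5):
    """Run the retrieve -> answer -> judge loop over LoCoMo questions."""
    verdicts = []
    for item in questions:
        context = retrieve(item["question"], k=k)        # 1. retrieval (existing)
        answer = answer_llm(item["question"], context)   # 2-3. generate an answer
        reply = judge_llm(JUDGE_PROMPT.format(
            q=item["question"], gt=item["answer"], ans=answer))
        # 4. normalize the judge's free-text reply to a binary verdict
        label = "CORRECT" if reply.strip().upper().startswith("CORRECT") else "WRONG"
        verdicts.append(Verdict(item["question"], item["category"], label))
    return verdicts
```

Keeping the judge reply normalization strict (one-word verdicts only) makes the logs easy to audit and diff across runs.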
Reference Implementation
Backboard's open benchmark: https://github.com/Backboard-io/Backboard-Locomo-Benchmark
- Uses GPT-4.1 as judge with fixed prompts and seed
- Publishes logs, prompts, and verdicts for every question
- Skips category 5 (adversarial); we should include it
Expected Outcome
Direct comparison table:
| Method | Single-Hop | Multi-Hop | Open Domain | Temporal | Overall |
|---|---|---|---|---|---|
| Backboard | 89.4% | 75.0% | 91.2% | 91.9% | 90.0% |
| Basic Memory | ? | ? | ? | ? | ? |
| Mem0 | 67.1% | 51.2% | 72.9% | 55.5% | 66.9% |
Our retrieval is already strong (86% R@5 vs Mem0's 66%). With a good LLM on top of our retrieved context, we should be competitive with Backboard.
Milestone
v0.19.0