Benchmark: Add LLM-as-Judge evaluation (GPT-4.1) for LoCoMo #9

@bm-clawd

Description

Context

Our current LoCoMo benchmark measures retrieval quality (R@5, R@10, MRR). Competitors like Backboard report end-to-end answer accuracy using LLM-as-Judge (GPT-4.1), scoring 90.1% overall. We need the same metric to compare directly.

What to Build

Adapt the evaluation pipeline to add an LLM-as-Judge step after retrieval:

  1. Retrieve context using BM search (existing)
  2. Pass retrieved context + question to an LLM (GPT-4.1 or Claude)
  3. LLM generates an answer
  4. A judge LLM evaluates: CORRECT or WRONG against ground truth
  5. Report accuracy by category (single_hop, multi_hop, open_domain, temporal)
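The steps above could be sketched roughly as follows. All names here (`search`, `answer_llm`, `judge_llm`, `evaluate_question`) are hypothetical stand-ins for BM search and the LLM API calls, not the actual pipeline code:

```python
from typing import Callable

def evaluate_question(question: str, ground_truth: str,
                      search: Callable[[str], list[str]],
                      answer_llm: Callable[[str, list[str]], str],
                      judge_llm: Callable[[str, str, str], str]) -> bool:
    """One pass of the retrieve -> answer -> judge loop (hypothetical sketch)."""
    context = search(question)                           # 1. retrieve with BM search
    answer = answer_llm(question, context)               # 2-3. generate an answer from context
    verdict = judge_llm(question, answer, ground_truth)  # 4. judge against ground truth
    return verdict.strip().upper() == "CORRECT"          # 5. feed into per-category accuracy
```

Injecting the three callables keeps the retrieval backend and the answer/judge models swappable (GPT-4.1 vs. Claude) and makes the loop unit-testable without network calls.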

Reference Implementation

Backboard's open benchmark: https://github.com/Backboard-io/Backboard-Locomo-Benchmark

  • Uses GPT-4.1 as judge with fixed prompts and seed
  • Publishes logs, prompts, and verdicts for every question
  • Skips category 5 (adversarial) — we should include it
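To mirror Backboard's fixed-prompt, fixed-seed setup, the judge prompt should be a frozen template and the verdict parsing should be strict. The prompt wording and `parse_verdict` helper below are assumptions for illustration, not Backboard's actual prompt:

```python
# Frozen judge prompt template (hypothetical wording; keep it fixed across runs,
# together with a fixed seed/temperature on the judge call, for reproducibility).
JUDGE_PROMPT = """You are grading an answer against a ground truth.
Question: {question}
Ground truth: {ground_truth}
Candidate answer: {answer}
Reply with exactly one word: CORRECT or WRONG."""

def parse_verdict(raw: str) -> str:
    """Normalize the judge's reply; anything other than CORRECT counts as WRONG,
    so a rambling or malformed judge response never inflates accuracy."""
    token = raw.strip().split()[0].upper().strip(".")
    return "CORRECT" if token == "CORRECT" else "WRONG"
```

Defaulting malformed replies to WRONG biases the metric conservatively, which is the safer failure mode when publishing a comparison table.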

Expected Outcome

Direct comparison table:

| Method       | Single-Hop | Multi-Hop | Open Domain | Temporal | Overall |
|--------------|------------|-----------|-------------|----------|---------|
| Backboard    | 89.4%      | 75.0%     | 91.2%       | 91.9%    | 90.0%   |
| Basic Memory | ?          | ?         | ?           | ?        | ?       |
| Mem0         | 67.1%      | 51.2%     | 72.9%       | 55.5%    | 66.9%   |
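The Basic Memory row could be filled in from per-question judge verdicts with a small aggregation step. This is a hypothetical helper (the `(category, is_correct)` record shape is an assumption):

```python
from collections import defaultdict

def accuracy_by_category(verdicts):
    """verdicts: iterable of (category, is_correct) pairs, e.g.
    ("single_hop", True). Returns {category: accuracy} plus an
    "overall" entry computed across all questions."""
    totals, correct = defaultdict(int), defaultdict(int)
    for category, ok in verdicts:
        for key in (category, "overall"):
            totals[key] += 1
            correct[key] += ok  # bool counts as 0/1
    return {cat: correct[cat] / totals[cat] for cat in totals}
```

Note that "overall" here is question-weighted (categories with more questions count more), which matches how a single accuracy-over-all-questions number is usually reported; a plain mean of the category accuracies would give a different figure.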

Our retrieval is already strong (86% R@5 vs Mem0's 66%). With a good LLM on top of our retrieved context, we should be competitive with Backboard.

Milestone

v0.19.0
