
Benchmark: Test with multiple eval LLMs to isolate memory quality from model capability #7


Description

@bm-clawd

Context

MemMachine's benchmark blog shows that the choice of eval LLM alone swings LoCoMo scores by nearly 4 points with no retrieval changes:

  • gpt-4o-mini: 87.5% overall
  • gpt-4.1-mini: 91.2% overall (same retrieval, same memory)

Backboard uses Gemini 2.5 Pro with a GPT-4.1 judge; Mem0 tested with older models. Nobody is comparing apples to apples.

Proposal

When we add LLM-as-Judge (#615), test with multiple configurations:

Eval LLMs (generate answers from retrieved context)

  • gpt-4o-mini (baseline, what Mem0 uses)
  • gpt-4.1-mini (what MemMachine uses)
  • claude-sonnet-4-20250514 (our ecosystem)

Judge LLMs

  • gpt-4o-mini (Mem0/MemMachine standard)
  • gpt-4.1 (Backboard standard)

Report all combinations
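
A minimal sketch of what the matrix runner could look like, assuming a hypothetical `run_locomo_eval()` helper wired into the #615 LLM-as-Judge harness (the function name and layout are illustrative, not existing code):

```python
import itertools

# Proposed configurations (from the lists above).
EVAL_LLMS = ["gpt-4o-mini", "gpt-4.1-mini", "claude-sonnet-4-20250514"]
JUDGE_LLMS = ["gpt-4o-mini", "gpt-4.1"]

def run_locomo_eval(eval_llm: str, judge_llm: str) -> float:
    """Placeholder: generate answers from retrieved context with eval_llm,
    then grade them against gold answers with judge_llm. Returns overall %."""
    raise NotImplementedError("hook up to the #615 LLM-as-Judge harness")

def run_matrix() -> dict[tuple[str, str], float]:
    # Score and report every eval-LLM x judge-LLM combination,
    # not just the single best-looking pair.
    return {
        (eval_llm, judge_llm): run_locomo_eval(eval_llm, judge_llm)
        for eval_llm, judge_llm in itertools.product(EVAL_LLMS, JUDGE_LLMS)
    }
```
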

This lets readers see:

  1. How much of the score comes from memory quality vs. LLM reasoning
  2. A direct comparison against any competitor's methodology
  3. The honest picture, not cherry-picked best numbers

Why this matters

If our retrieval feeds good context to the LLM, upgrading the eval LLM should boost our score too. We might find that BM + gpt-4.1-mini scores comparably to MemMachine, which would suggest that retrieval quality, not a proprietary memory layer, is what matters.

Related

Milestone

v0.19.0
