Context
MemMachine's benchmark blog shows that the choice of eval LLM alone swings LoCoMo scores by nearly 4 points, with zero changes to retrieval:
- gpt-4o-mini: 87.5% overall
- gpt-4.1-mini: 91.2% overall (same retrieval, same memory)
Backboard uses Gemini 2.5 Pro as the eval LLM with a GPT-4.1 judge; Mem0 tested with older models. Nobody is comparing apples to apples.
Proposal
When we add LLM-as-Judge (#615), test with multiple configurations:
Eval LLMs (generate answers from retrieved context)
- gpt-4o-mini (baseline, what Mem0 uses)
- gpt-4.1-mini (what MemMachine uses)
- claude-sonnet-4-20250514 (our ecosystem)
Judge LLMs
- gpt-4o-mini (Mem0/MemMachine standard)
- gpt-4.1 (Backboard standard)
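The matrix above is small enough to enumerate exhaustively. A minimal sketch of how the harness could do that (the `run_matrix`/`evaluate` names are illustrative, not an existing API):

```python
# Sketch of the proposed evaluation matrix: pair every eval LLM with
# every judge LLM so our numbers can be compared against each
# competitor's exact setup.
from itertools import product

EVAL_LLMS = [
    "gpt-4o-mini",               # baseline, what Mem0 uses
    "gpt-4.1-mini",              # what MemMachine uses
    "claude-sonnet-4-20250514",  # our ecosystem
]

JUDGE_LLMS = [
    "gpt-4o-mini",  # Mem0/MemMachine standard
    "gpt-4.1",      # Backboard standard
]

def run_matrix(evaluate):
    """Run the benchmark once per (eval LLM, judge LLM) pair.

    `evaluate` is a placeholder callable taking the two model names and
    returning an overall score; all 3 x 2 = 6 combinations are reported.
    """
    return {
        (eval_llm, judge_llm): evaluate(eval_llm, judge_llm)
        for eval_llm, judge_llm in product(EVAL_LLMS, JUDGE_LLMS)
    }
```

Reporting the full dict (rather than the best cell) is what makes the results comparable to any competitor's published configuration.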
Report all combinations
This lets readers see:
- How much of each score comes from memory quality vs. eval-LLM reasoning
- Direct comparison with any competitor's methodology
- The honest picture — not cherry-picked best numbers
Why this matters
If our retrieval feeds good context to the eval LLM, upgrading that LLM should boost our score too. We might find that BM + gpt-4.1-mini scores comparably to MemMachine, which would suggest the retrieval is what matters, not the proprietary memory layer.
Related
Milestone
v0.19.0