Context
MemMachine tests two modes:
- Memory mode: a single retrieval, context injected, LLM answers (87.5%)
- Agent mode: the LLM uses memory as a tool and can run multiple retrieval rounds (88.1%)
Agent mode scores higher because the LLM can refine its queries — ask a broad question, look at results, ask a more specific follow-up.
Why this matters for BM
BM already supports this naturally via MCP. An LLM using BM tools can:
1. `search_notes('Sarah restaurant')`
2. Look at the results and realize it needs temporal context
3. `search_notes('Sarah lunch May 2023')`
4. Combine both result sets to answer
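The refinement loop above can be sketched as follows. This is illustrative only: `NOTES` and the keyword matcher are stand-ins for BM's actual `search_notes` MCP tool, whose real implementation and ranking are not shown here.

```python
# Stand-in note store; the real agent would query BM over MCP.
NOTES = {
    "sarah-dinner.md": "Sarah mentioned a new restaurant downtown.",
    "sarah-lunch-2023-05.md": "Lunch with Sarah in May 2023 at the cafe.",
    "unrelated.md": "Grocery list.",
}

def search_notes(query: str) -> list[str]:
    """Naive keyword match standing in for BM's search_notes tool."""
    terms = query.lower().split()
    return [
        name
        for name, text in NOTES.items()
        if any(t in text.lower() or t in name.lower() for t in terms)
    ]

# Round 1: broad query.
broad = search_notes("Sarah restaurant")
# Round 2: the agent notices it needs temporal context and refines.
refined = search_notes("Sarah lunch May 2023")
# Combine both result sets before answering.
context = sorted(set(broad) | set(refined))
```

The point is structural: agent mode is a loop over tool calls, while memory mode is a single call followed by injection.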
We should benchmark both modes:
- Single-shot: one search call, inject context, LLM answers (comparable to memory mode)
- Agent (MCP): LLM has access to search_notes + read_note + build_context tools, can do multiple rounds
The agent mode result shows what BM can do when paired with a capable LLM — which is the real-world usage pattern.
Implementation
- Single-shot: existing benchmark + LLM-as-Judge (#615)
- Agent mode: give the eval LLM MCP tool access to BM, let it search freely, then judge the answer
- Report both scores separately
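A minimal sketch of reporting the two scores separately. The `judge` function here is a stubbed containment check, not the actual LLM-as-Judge from #615; the sample `(answer, gold)` pairs are invented for illustration.

```python
def judge(answer: str, gold: str) -> bool:
    """Stand-in for the LLM judge: simple containment check."""
    return gold.lower() in answer.lower()

def score(results: list[tuple[str, str]]) -> float:
    """Fraction of (answer, gold) pairs the judge accepts."""
    hits = sum(judge(answer, gold) for answer, gold in results)
    return hits / len(results) if results else 0.0

# Hypothetical outputs from the two benchmark runs.
single_shot = [("Sarah likes the cafe", "cafe"), ("no idea", "May 2023")]
agent_mode = [("Sarah likes the cafe", "cafe"), ("Lunch in May 2023", "May 2023")]

# Report both modes side by side, never as one blended number.
report = {
    "single_shot": score(single_shot),
    "agent_mcp": score(agent_mode),
}
```

Keeping the two numbers separate makes the comparison with MemMachine's memory mode (single-shot) and agent mode direct.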
Related
- Benchmark: Add LLM-as-Judge evaluation (GPT-4.1) for LoCoMo #9 (LLM-as-Judge)
- Benchmark: Adopt Backboard's LoCoMo methodology for reproducible comparison #8 (methodology)
Milestone
v0.19.0