
Benchmark: Add agent mode evaluation (multi-round retrieval via MCP) #5

@bm-clawd


Context

MemMachine tests two modes:

  • Memory mode: single retrieval, context injected, LLM answers (87.5%)
  • Agent mode: LLM uses memory as a tool, can do multiple retrieval rounds (88.1%)

Agent mode scores higher because the LLM can refine its queries — ask a broad question, look at results, ask a more specific follow-up.

Why this matters for BM

BM already supports this naturally via MCP. An LLM using BM tools can:

  1. search_notes('Sarah restaurant')
  2. Look at results, realize it needs temporal context
  3. search_notes('Sarah lunch May 2023')
  4. Combine both result sets to answer (see the sketch after this list)
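
A minimal sketch of that loop in Python, assuming hypothetical chat and call_tool wrappers (a tool-calling chat-completion API, and an MCP client that forwards calls to BM); only the tool names themselves come from BM:

```python
import json

MAX_ROUNDS = 4  # cap retrieval rounds so the agent cannot search forever

def call_tool(name: str, args: dict) -> str:
    """Hypothetical wrapper: forward one tool call to BM over MCP."""
    raise NotImplementedError

def chat(messages: list, tools: list) -> dict:
    """Hypothetical wrapper around a tool-calling chat-completion API."""
    raise NotImplementedError

def agent_answer(question: str, tools: list) -> str:
    """Let the LLM query BM over multiple rounds, then answer."""
    messages = [{"role": "user", "content": question}]
    for _ in range(MAX_ROUNDS):
        reply = chat(messages, tools)
        if not reply.get("tool_calls"):
            return reply["content"]  # no more searches: this is the answer
        messages.append(reply)
        for call in reply["tool_calls"]:
            result = call_tool(call["name"], json.loads(call["arguments"]))
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": result})
    # round budget exhausted: force an answer from what was retrieved
    messages.append({"role": "user",
                     "content": "Answer now using the results so far."})
    return chat(messages, tools=[])["content"]
```

The round cap is an assumption; it bounds cost while still allowing the broad-then-specific refinement described above.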

We should benchmark both modes:

  • Single-shot: one search call, inject context, LLM answers (comparable to memory mode; see the sketch after this list)
  • Agent (MCP): LLM has access to search_notes + read_note + build_context tools and can do multiple rounds, as in the loop sketched above
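
For the single-shot mode, a sketch reusing the same hypothetical wrappers from above: one search_notes call with the raw question, results injected into the prompt, no further rounds:

```python
def single_shot_answer(question: str) -> str:
    """One retrieval, context injected, LLM answers (no tool access)."""
    context = call_tool("search_notes", {"query": question})
    prompt = (f"Context retrieved from memory:\n{context}\n\n"
              f"Question: {question}\n"
              "Answer using only the context above.")
    return chat([{"role": "user", "content": prompt}], tools=[])["content"]
```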

The agent mode result shows what BM can do when paired with a capable LLM — which is the real-world usage pattern.

Implementation

  • Single-shot: existing benchmark + LLM-as-Judge (#615)
  • Agent mode: give the eval LLM MCP tool access to BM, let it search freely, then judge the answer
  • Report both scores separately (a scoring sketch follows)
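
A sketch of the scoring step, reusing the two harnesses above; the judge prompt and grading helper here are assumptions, not the #615 implementation:

```python
def judge(question: str, gold: str, answer: str) -> bool:
    """Hypothetical LLM-as-Judge: grade an answer against the gold label."""
    verdict = chat([{"role": "user", "content":
                     f"Question: {question}\nExpected: {gold}\nGot: {answer}\n"
                     "Reply with exactly CORRECT or INCORRECT."}],
                   tools=[])["content"]
    return verdict.strip().upper().startswith("CORRECT")

def run_benchmark(cases: list, bm_tools: list) -> dict:
    """Score both modes on the same cases and report them separately."""
    hits = {"single_shot": 0, "agent": 0}
    for case in cases:
        q, gold = case["question"], case["answer"]
        hits["single_shot"] += judge(q, gold, single_shot_answer(q))
        hits["agent"] += judge(q, gold, agent_answer(q, bm_tools))
    return {mode: n / len(cases) for mode, n in hits.items()}
```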

Milestone

v0.19.0
