Standalone benchmark suite for retrieval quality evaluation #10

@bm-clawd

Description

Summary

Create a standalone benchmark suite (basic-memory-bench) for evaluating retrieval quality across BM deployments using academic datasets. Designed to be publicly shareable, runnable by anyone, and integrated into CI.

Why

  1. Internal quality tracking — benchmark before/after every release
  2. Cloud vs Local comparison — validate that Cloud's better embeddings produce better retrieval
  3. Public credibility — reproducible numbers on academic benchmarks
  4. Marketing — "we benchmark in the open" content
  5. Competitive positioning — compare against Mem0/Supermemory on the same datasets

Current Results (from prototype in openclaw-basic-memory plugin)

Full LoCoMo benchmark — 1,982 queries across 10 conversations:

| Metric           | BM Local (v0.18.5) |
|------------------|--------------------|
| Recall@5         | 76.4%              |
| Recall@10        | 85.5%              |
| MRR              | 0.658              |
| Content Hit Rate | 25.4%              |
| Mean Latency     | 1,063 ms           |
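For reference, Recall@K and MRR are computed per query and then averaged over the query set. A minimal sketch (function names and data shapes are mine, not necessarily the suite's):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    # A query counts as a hit if any gold document appears in the top-k results
    return float(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))

def reciprocal_rank(ranked_ids, relevant_ids):
    # 1 / rank of the first relevant result; 0.0 if nothing relevant was retrieved
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def aggregate(per_query, k=5):
    # per_query: list of (ranked_ids, relevant_ids) pairs, one per benchmark query
    n = len(per_query)
    return (
        sum(recall_at_k(r, g, k) for r, g in per_query) / n,   # Recall@k
        sum(reciprocal_rank(r, g) for r, g in per_query) / n,  # MRR
    )
```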

By category:

| Category    | N   | R@5   |
|-------------|-----|-------|
| open_domain | 841 | 86.6% |
| multi_hop   | 321 | 84.1% |
| adversarial | 446 | 67.0% |
| temporal    | 92  | 59.1% |
| single_hop  | 282 | 57.7% |

Architecture

  • Python (not TS) — same ecosystem as BM, uses BM's importer framework
  • Provider abstraction — BM Local (MCP stdio), BM Cloud (API), Mem0 (optional)
  • Two eval modes — retrieval metrics (R@K, MRR) + LLM-as-Judge (for Mem0 comparison)
  • Deterministic conversion — LoCoMo/LongMemEval → BM markdown via EntityMarkdown
  • CI-ready — `just full-locomo` runs everything; CI fails if recall drops by more than 2%
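The provider abstraction could be as small as one structural interface per backend. A sketch using `typing.Protocol` (the interface and the toy baseline below are illustrative, not the actual provider API):

```python
from typing import Protocol

class RetrievalProvider(Protocol):
    """Common surface each backend (BM Local, BM Cloud, Mem0) would implement."""
    name: str
    def index(self, docs: dict[str, str]) -> None: ...
    def search(self, query: str, top_k: int = 10) -> list[str]: ...

class KeywordProvider:
    """Trivial term-overlap provider, useful as a smoke-test baseline."""
    name = "keyword"

    def __init__(self):
        self._docs: dict[str, str] = {}

    def index(self, docs: dict[str, str]) -> None:
        self._docs.update(docs)

    def search(self, query: str, top_k: int = 10) -> list[str]:
        terms = query.lower().split()
        scored = [(sum(t in text.lower() for t in terms), doc_id)
                  for doc_id, text in self._docs.items()]
        # Highest term overlap first; drop documents with no overlap at all
        return [doc_id for score, doc_id in
                sorted(scored, reverse=True)[:top_k] if score > 0]
```

The eval loop then only sees `index()`/`search()`, so adding a competitor is one new class.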

Datasets

  • LoCoMo (ACL 2024, Snap Research) — 10 conversations, 1,986 QA pairs. Mem0 publishes numbers on this.
  • LongMemEval (ICLR 2025) — Supermemory uses this. More challenging.
  • Synthetic (our hand-crafted 38-query suite for CI smoke tests)
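To make the dataset conversion deterministic, each conversation turn can map to a stable filename derived from its content. A sketch under assumed inputs — the frontmatter fields, filename scheme, and record shape here are illustrative, not BM's actual EntityMarkdown format:

```python
import hashlib

def turn_to_markdown(conv_id: str, session: int, speaker: str, text: str) -> tuple[str, str]:
    """Render one conversation turn as a markdown note with a content-derived name."""
    # Hash of the full identity tuple -> same input always yields the same file
    slug = hashlib.sha1(f"{conv_id}:{session}:{speaker}:{text}".encode()).hexdigest()[:12]
    filename = f"{conv_id}-s{session}-{slug}.md"
    body = (
        "---\n"
        f"conversation: {conv_id}\n"
        f"session: {session}\n"
        f"speaker: {speaker}\n"
        "---\n\n"
        f"{text}\n"
    )
    return filename, body
```

Determinism matters so that repeated runs index byte-identical corpora and metric diffs reflect retrieval changes only.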

Known Improvement Opportunities

From analysis of the 375 failures:

  1. RRF scoring flattens results (#577) — hybrid search assigns nearly every result a score of ~0.016, destroying the ranking; FTS alone finds observations that hybrid misses.
  2. Single-hop recall at 57.7% — specific fact lookups need better chunk matching
  3. Temporal at 59.1% — date-aware scoring needed
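On point 1: the ~0.016 scores are consistent with standard Reciprocal Rank Fusion using the conventional constant k=60 (1/61 ≈ 0.0164), assuming that is the constant BM's hybrid search uses. A quick illustration of why the score band is so narrow:

```python
def rrf_score(ranks, k=60):
    # Reciprocal Rank Fusion: sum of 1/(k + rank) over each result list
    # in which the document appears. k=60 is the conventional constant.
    return sum(1.0 / (k + r) for r in ranks)

# With k=60, even a 10-place rank difference barely moves the score:
top = rrf_score([1])     # 1/61, about 0.0164
tenth = rrf_score([10])  # 1/70, about 0.0143
```

A document at rank 1 and one at rank 10 differ by ~0.002, so any downstream tie-breaking or thresholding sees an almost-flat distribution.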

Implementation Phases

  1. Repo setup + LoCoMo (Python port from current TS prototype)
  2. BM Cloud provider + LongMemEval dataset
  3. LLM-as-Judge + competitor comparison
  4. CI integration + public results dashboard
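The CI gate in phase 4 could be a small baseline comparison step. A sketch (I'm reading "recall drops >2%" as 2 percentage points; metric names and baseline wiring are assumptions):

```python
BASELINE = {"recall_at_5": 0.764, "recall_at_10": 0.855}  # v0.18.5 numbers above
MAX_DROP = 0.02  # fail if recall falls more than 2 percentage points

def check_regression(current, baseline=BASELINE, max_drop=MAX_DROP):
    # Return human-readable failures; an empty list means the gate passes.
    failures = []
    for metric, base in baseline.items():
        cur = current.get(metric, 0.0)
        if base - cur > max_drop:
            failures.append(f"{metric} regressed: {cur:.3f} vs baseline {base:.3f}")
    return failures
```

A `just full-locomo` CI job would write current metrics to JSON, run this check, and exit nonzero on any failure.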

Reference

  • Full spec: drafts/spec-benchmark-suite.md in openclaw workspace
  • Current TS prototype: benchmark/ in openclaw-basic-memory repo
  • LoCoMo dataset: github.com/snap-research/locomo
  • LongMemEval: github.com/xiaowu0162/LongMemEval
  • Supermemory's harness: github.com/supermemoryai/memorybench
