Summary
Create a standalone benchmark suite (`basic-memory-bench`) for evaluating retrieval quality across BM deployments using academic datasets. It is designed to be publicly shareable, runnable by anyone, and integrated into CI.
Why
- Internal quality tracking — benchmark before/after every release
- Cloud vs Local comparison — validate that Cloud's better embeddings produce better retrieval
- Public credibility — reproducible numbers on academic benchmarks
- Marketing — "we benchmark in the open" content
- Competitive positioning — compare against Mem0/Supermemory on the same datasets
Current Results (from prototype in openclaw-basic-memory plugin)
Full LoCoMo benchmark — 1,982 queries across 10 conversations:
| Metric | BM Local (v0.18.5) |
|---|---|
| Recall@5 | 76.4% |
| Recall@10 | 85.5% |
| MRR | 0.658 |
| Content Hit Rate | 25.4% |
| Mean Latency | 1,063ms |
By category:
| Category | N | R@5 |
|---|---|---|
| open_domain | 841 | 86.6% |
| multi_hop | 321 | 84.1% |
| adversarial | 446 | 67.0% |
| temporal | 92 | 59.1% |
| single_hop | 282 | 57.7% |
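For reference, the metrics above use standard definitions: Recall@K counts a query as a hit if any gold item appears in the top K retrieved results, and MRR averages the reciprocal rank of the first relevant result across queries. A minimal scoring sketch (function names and input shapes are illustrative, not taken from the prototype):

```python
def recall_at_k(results: list[list[str]], gold: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one gold doc in the top-k results."""
    hits = sum(1 for ranked, relevant in zip(results, gold) if relevant & set(ranked[:k]))
    return hits / len(results)


def mean_reciprocal_rank(results: list[list[str]], gold: list[set[str]]) -> float:
    """Mean of 1/rank of the first relevant result per query (0 when none is found)."""
    total = 0.0
    for ranked, relevant in zip(results, gold):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)
```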
Architecture
- Python (not TS) — same ecosystem as BM, uses BM's importer framework
- Provider abstraction — BM Local (MCP stdio), BM Cloud (API), Mem0 (optional); see the interface sketch after this list
- Two eval modes — retrieval metrics (R@K, MRR) + LLM-as-Judge (for Mem0 comparison)
- Deterministic conversion — LoCoMo/LongMemEval → BM markdown via `EntityMarkdown` (see the conversion sketch after the Datasets list)
- CI-ready — `just full-locomo` runs everything; fails if recall drops >2%
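A minimal sketch of what the provider abstraction could look like; the `Protocol` shape, method names, and `SearchHit` fields are assumptions for illustration, not the interface from the spec:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class SearchHit:
    doc_id: str
    score: float
    content: str


class RetrievalProvider(Protocol):
    """One implementation per backend: BM Local (MCP stdio), BM Cloud (API), Mem0."""

    name: str

    def index(self, documents: dict[str, str]) -> None:
        """Ingest {doc_id: markdown} before the query phase."""
        ...

    def search(self, query: str, k: int = 10) -> list[SearchHit]:
        """Return the top-k hits for a query, best first."""
        ...
```

The benchmark driver would then loop over providers and queries, collecting ranked doc ids to feed the metric functions sketched above.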
Datasets
- LoCoMo (ACL 2024, Snap Research) — 10 conversations, 1,986 QA pairs. Mem0 publishes numbers on this.
- LongMemEval (ICLR 2025) — Supermemory uses this. More challenging.
- Synthetic (our hand-crafted 38-query suite for CI smoke tests)
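To make the deterministic conversion step concrete, a sketch of rendering one conversation session as a BM markdown note. The record shape is heavily simplified and the frontmatter keys are invented for illustration; the real converter goes through BM's importer framework and `EntityMarkdown`:

```python
def session_to_markdown(conv_id: str, session_id: str, turns: list[dict]) -> str:
    """Render one conversation session as a BM-style markdown note.

    Deterministic: the same input always yields byte-identical output,
    so converted corpora are reproducible across runs.
    """
    lines = [
        "---",
        f"title: {conv_id}-{session_id}",  # frontmatter keys are assumptions
        "type: conversation",
        "---",
        "",
    ]
    lines += [f"- {turn['speaker']}: {turn['text']}" for turn in turns]
    return "\n".join(lines) + "\n"


# Illustrative turn shape (simplified; the real LoCoMo JSON carries more fields):
turns = [
    {"speaker": "Alice", "text": "I adopted a puppy last week!"},
    {"speaker": "Bob", "text": "Congrats! What breed?"},
]
print(session_to_markdown("conv-01", "session-1", turns))
```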
Known Improvement Opportunities
From analysis of the 375 failures:
- RRF scoring flattens results (#577) — hybrid search assigns nearly every result a score of ~0.016, destroying the ranking; FTS alone finds observations that hybrid misses (see the note after this list)
- Single-hop recall at 57.7% — specific fact lookups need better chunk matching
- Temporal at 59.1% — date-aware scoring needed
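The flattening in the first item is consistent with standard Reciprocal Rank Fusion, where score(d) = sum over rankers of 1/(k + rank(d)). With the common default k = 60, a document ranked first by a single ranker scores 1/61 ≈ 0.0164, which matches the observed ~0.016, and adjacent ranks differ by less than 0.0003, so the fused scores look nearly flat. A sketch, assuming BM's hybrid search uses this standard formulation (not confirmed in this issue):

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Standard Reciprocal Rank Fusion: sum 1/(k + rank) over each ranker's list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


# With k=60, rank 1 scores 1/61 ~= 0.0164 and rank 5 scores 1/65 ~= 0.0154:
# the absolute spread is tiny, so ranking by these fused scores is fragile.
fused = rrf_scores([["a", "b", "c"], ["b", "a", "d"]])
print(sorted(fused.items(), key=lambda kv: -kv[1]))
```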
Implementation Phases
1. Repo setup + LoCoMo (Python port from current TS prototype)
2. BM Cloud provider + LongMemEval dataset
3. LLM-as-Judge + competitor comparison
4. CI integration + public results dashboard
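For the CI gate, a minimal sketch of the >2% regression check against a committed baseline; the file names, JSON shape, and the reading of "2%" as absolute percentage points are all assumptions:

```python
import json
import sys

THRESHOLD = 0.02  # assumed: 2 absolute percentage points of Recall@5


def check_regression(baseline_path: str, current_path: str) -> None:
    """Exit nonzero if Recall@5 regressed past the threshold (fails the CI job)."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)
    drop = baseline["recall_at_5"] - current["recall_at_5"]
    if drop > THRESHOLD:
        sys.exit(f"Recall@5 regressed by {drop:.1%} (limit {THRESHOLD:.0%}), failing CI")
    print(f"Recall@5 OK: {current['recall_at_5']:.1%} vs baseline {baseline['recall_at_5']:.1%}")


if __name__ == "__main__":
    check_regression("baseline.json", "results.json")
```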
Reference
- Full spec: `drafts/spec-benchmark-suite.md` in openclaw workspace
- Current TS prototype: `benchmark/` in openclaw-basic-memory repo
- LoCoMo dataset: github.com/snap-research/locomo
- LongMemEval: github.com/xiaowu0162/LongMemEval
- Supermemory's harness: github.com/supermemoryai/memorybench