Standalone benchmark suite for retrieval quality evaluation #10

@bm-clawd

Description

Summary

Create a standalone benchmark suite (basic-memory-bench) for evaluating retrieval quality across BM deployments using academic datasets. Designed to be publicly shareable, runnable by anyone, and integrated into CI.

Why

  1. Internal quality tracking — benchmark before/after every release
  2. Cloud vs Local comparison — validate that Cloud's better embeddings produce better retrieval
  3. Public credibility — reproducible numbers on academic benchmarks
  4. Marketing — "we benchmark in the open" content
  5. Competitive positioning — compare against Mem0/Supermemory on the same datasets

Current Results (from prototype in openclaw-basic-memory plugin)

Full LoCoMo benchmark — 1,982 queries across 10 conversations:

| Metric           | BM Local (v0.18.5) |
|------------------|--------------------|
| Recall@5         | 76.4%              |
| Recall@10        | 85.5%              |
| MRR              | 0.658              |
| Content Hit Rate | 25.4%              |
| Mean Latency     | 1,063 ms           |
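For reference, Recall@K and MRR are computed per query and then averaged over the query set. A minimal sketch (function names and data shapes are mine, not necessarily the suite's):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    # A query counts as a hit if any gold document appears in the top-k results
    return float(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))

def reciprocal_rank(ranked_ids, relevant_ids):
    # 1 / rank of the first relevant result; 0.0 if nothing relevant was retrieved
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def aggregate(per_query, k=5):
    # per_query: list of (ranked_ids, relevant_ids) pairs, one per benchmark query
    n = len(per_query)
    return (
        sum(recall_at_k(r, g, k) for r, g in per_query) / n,   # Recall@k
        sum(reciprocal_rank(r, g) for r, g in per_query) / n,  # MRR
    )
```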

By category:

| Category    | N   | R@5   |
|-------------|-----|-------|
| open_domain | 841 | 86.6% |
| multi_hop   | 321 | 84.1% |
| adversarial | 446 | 67.0% |
| temporal    | 92  | 59.1% |
| single_hop  | 282 | 57.7% |

Architecture

  • Python (not TS) — same ecosystem as BM, uses BM's importer framework
  • Provider abstraction — BM Local (MCP stdio), BM Cloud (API), Mem0 (optional)
  • Two eval modes — retrieval metrics (R@K, MRR) + LLM-as-Judge (for Mem0 comparison)
  • Deterministic conversion — LoCoMo/LongMemEval → BM markdown via EntityMarkdown
  • CI-ready — `just full-locomo` runs everything; CI fails if recall drops by more than 2%
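The provider abstraction could be as small as one structural interface per backend. A sketch using `typing.Protocol` (the interface and the toy baseline below are illustrative, not the actual provider API):

```python
from typing import Protocol

class RetrievalProvider(Protocol):
    """Common surface each backend (BM Local, BM Cloud, Mem0) would implement."""
    name: str
    def index(self, docs: dict[str, str]) -> None: ...
    def search(self, query: str, top_k: int = 10) -> list[str]: ...

class KeywordProvider:
    """Trivial term-overlap provider, useful as a smoke-test baseline."""
    name = "keyword"

    def __init__(self):
        self._docs: dict[str, str] = {}

    def index(self, docs: dict[str, str]) -> None:
        self._docs.update(docs)

    def search(self, query: str, top_k: int = 10) -> list[str]:
        terms = query.lower().split()
        scored = [(sum(t in text.lower() for t in terms), doc_id)
                  for doc_id, text in self._docs.items()]
        # Highest term overlap first; drop documents with no overlap at all
        return [doc_id for score, doc_id in
                sorted(scored, reverse=True)[:top_k] if score > 0]
```

The eval loop then only sees `index()`/`search()`, so adding a competitor is one new class.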

Datasets

  • LoCoMo (ACL 2024, Snap Research) — 10 conversations, 1,986 QA pairs. Mem0 publishes numbers on this.
  • LongMemEval (ICLR 2025) — Supermemory uses this. More challenging.
  • Synthetic (our hand-crafted 38-query suite for CI smoke tests)
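To make the dataset conversion deterministic, each conversation turn can map to a stable filename derived from its content. A sketch under assumed inputs — the frontmatter fields, filename scheme, and record shape here are illustrative, not BM's actual EntityMarkdown format:

```python
import hashlib

def turn_to_markdown(conv_id: str, session: int, speaker: str, text: str) -> tuple[str, str]:
    """Render one conversation turn as a markdown note with a content-derived name."""
    # Hash of the full identity tuple -> same input always yields the same file
    slug = hashlib.sha1(f"{conv_id}:{session}:{speaker}:{text}".encode()).hexdigest()[:12]
    filename = f"{conv_id}-s{session}-{slug}.md"
    body = (
        "---\n"
        f"conversation: {conv_id}\n"
        f"session: {session}\n"
        f"speaker: {speaker}\n"
        "---\n\n"
        f"{text}\n"
    )
    return filename, body
```

Determinism matters so that repeated runs index byte-identical corpora and metric diffs reflect retrieval changes only.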

Known Improvement Opportunities

From analysis of the 375 failures:

  1. RRF scoring flattens results (#577) — hybrid search assigns nearly every result a score of ~0.016, destroying the ranking; FTS alone finds observations that hybrid misses.
  2. Single-hop recall at 57.7% — specific fact lookups need better chunk matching
  3. Temporal at 59.1% — date-aware scoring needed
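On point 1: the ~0.016 scores are consistent with standard Reciprocal Rank Fusion using the conventional constant k=60 (1/61 ≈ 0.0164), assuming that is the constant BM's hybrid search uses. A quick illustration of why the score band is so narrow:

```python
def rrf_score(ranks, k=60):
    # Reciprocal Rank Fusion: sum of 1/(k + rank) over each result list
    # in which the document appears. k=60 is the conventional constant.
    return sum(1.0 / (k + r) for r in ranks)

# With k=60, even a 10-place rank difference barely moves the score:
top = rrf_score([1])     # 1/61, about 0.0164
tenth = rrf_score([10])  # 1/70, about 0.0143
```

A document at rank 1 and one at rank 10 differ by ~0.002, so any downstream tie-breaking or thresholding sees an almost-flat distribution.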

Implementation Phases

  1. Repo setup + LoCoMo (Python port from current TS prototype)
  2. BM Cloud provider + LongMemEval dataset
  3. LLM-as-Judge + competitor comparison
  4. CI integration + public results dashboard
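The CI gate in phase 4 could be a small baseline comparison step. A sketch (I'm reading "recall drops >2%" as 2 percentage points; metric names and baseline wiring are assumptions):

```python
BASELINE = {"recall_at_5": 0.764, "recall_at_10": 0.855}  # v0.18.5 numbers above
MAX_DROP = 0.02  # fail if recall falls more than 2 percentage points

def check_regression(current, baseline=BASELINE, max_drop=MAX_DROP):
    # Return human-readable failures; an empty list means the gate passes.
    failures = []
    for metric, base in baseline.items():
        cur = current.get(metric, 0.0)
        if base - cur > max_drop:
            failures.append(f"{metric} regressed: {cur:.3f} vs baseline {base:.3f}")
    return failures
```

A `just full-locomo` CI job would write current metrics to JSON, run this check, and exit nonzero on any failure.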

Reference

  • Full spec: drafts/spec-benchmark-suite.md in openclaw workspace
  • Current TS prototype: benchmark/ in openclaw-basic-memory repo
  • LoCoMo dataset: github.com/snap-research/locomo
  • LongMemEval: github.com/xiaowu0162/LongMemEval
  • Supermemory's harness: github.com/supermemoryai/memorybench
