Benchmark: Adopt Backboard's LoCoMo methodology for reproducible comparison #8

@bm-clawd

Description

Context

Backboard published a fully reproducible LoCoMo benchmark with:

  • Per-conversation isolated evaluation
  • Multi-session conversation ingestion with timestamps
  • GPT-4.1 judge with fixed prompts and seed
  • Published logs, prompts, and verdicts for every question
  • One-click replication script

We should adopt the same methodology so our results are directly comparable.

What to Adapt

From their approach:

  1. Per-conversation isolation — create separate BM projects per conversation (they create separate assistants)
  2. Turn-by-turn ingestion — ingest conversation turns sequentially, preserving session boundaries and timestamps
  3. Separate question thread — ask questions only after all sessions are ingested, using only BM search for context
  4. Fixed judge config — same GPT-4.1 judge prompts, same seed, deterministic evaluation
  5. Full transparency — publish all prompts, retrieved context, generated answers, and judge verdicts
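The five steps above could be sketched as a small harness. This is a minimal sketch, not the real BM API: `ingest`, `search`, and `answer` are hypothetical callables standing in for BM ingestion, BM search, and the answering LLM.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    speaker: str
    text: str
    timestamp: str  # ISO-8601, preserved from the source conversation

@dataclass
class Session:
    session_id: str
    turns: list[Turn] = field(default_factory=list)

def evaluate_conversation(conversation_id, sessions, questions,
                          ingest, search, answer):
    """Evaluate one conversation in isolation.

    `ingest`, `search`, and `answer` are placeholders injected by the
    caller -- they are NOT the real BM API, just stand-ins for it.
    """
    # 1. Per-conversation isolation: one fresh project per conversation
    #    (Backboard creates separate assistants; we create separate projects).
    project = f"locomo-{conversation_id}"
    # 2. Turn-by-turn ingestion, preserving session boundaries and timestamps.
    for session in sessions:
        for turn in session.turns:
            ingest(project, session.session_id, turn)
    # 3. Questions are asked only after all sessions are ingested,
    #    using only retrieved context (never the raw transcript).
    results = []
    for question in questions:
        context = search(project, question)
        results.append({
            "question": question,
            "context": context,          # 5. logged for full transparency
            "answer": answer(question, context),
        })
    return results
```

The fixed-judge step (4) then scores `results` separately, so every prompt, retrieved context, answer, and verdict can be published per question.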

What we do differently (advantages):

  1. Include adversarial category — they skip it, we test it
  2. Report retrieval metrics alongside accuracy — shows *where* improvements come from (retrieval vs. LLM reasoning)
  3. Local-first execution — no cloud API dependency, fully reproducible offline (except judge step)
  4. Multiple retrieval strategies — test FTS, vector, hybrid, with/without time-decay
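For item 2, the retrieval metrics only need standard definitions. A minimal sketch, assuming retrieved results and gold annotations are document IDs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0.0 if none found)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

Computing these per strategy (FTS, vector, hybrid, with/without time-decay) makes it visible whether an accuracy gain comes from better retrieval or from the LLM compensating for weak context.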

Repo

Results should go in the benchmark repo (`openclaw-basic-memory`, or a standalone repo per #10).

Milestone

v0.19.0
