Context
Backboard published a fully reproducible LoCoMo benchmark with:
- Per-conversation isolated evaluation
- Multi-session conversation ingestion with timestamps
- GPT-4.1 judge with fixed prompts and seed
- Published logs, prompts, and verdicts for every question
- One-click replication script
We should adopt the same methodology so our results are directly comparable.
What to Adapt
From their approach:
- Per-conversation isolation — create separate BM projects per conversation (they create separate assistants)
- Turn-by-turn ingestion — ingest conversation turns sequentially, preserving session boundaries and timestamps
- Separate question thread — ask questions only after all sessions are ingested, using BM search alone for context
- Fixed judge config — same GPT-4.1 judge prompts, same seed, deterministic evaluation
- Full transparency — publish all prompts, retrieved context, generated answers, and judge verdicts
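The ingestion and evaluation flow above can be sketched as follows. This is a minimal illustration, not the real harness: `Project`, `ingest_turn`, and `search` are hypothetical stand-ins for the actual BM API, and the keyword match is a placeholder for real FTS/vector/hybrid search.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    session: int      # session boundary is preserved per turn
    timestamp: str
    speaker: str
    text: str

@dataclass
class Project:
    """One isolated store per LoCoMo conversation (hypothetical BM project)."""
    name: str
    turns: list = field(default_factory=list)

    def ingest_turn(self, turn: Turn) -> None:
        # Turns are ingested sequentially; session id and timestamp travel with them.
        self.turns.append(turn)

    def search(self, query: str, k: int = 5) -> list:
        # Placeholder keyword match standing in for BM search.
        hits = [t for t in self.turns if query.lower() in t.text.lower()]
        return hits[:k]

def run_conversation(conv_id: str, sessions: dict, questions: list) -> dict:
    """Ingest all sessions first, then answer questions from search results only."""
    project = Project(name=f"locomo-{conv_id}")  # per-conversation isolation
    for sid, turns in sorted(sessions.items()):
        for ts, speaker, text in turns:
            project.ingest_turn(Turn(sid, ts, speaker, text))
    # Question phase: no access to the raw transcript, only retrieved context.
    return {q: project.search(q) for q in questions}
```

Because each conversation gets its own `Project`, nothing retrieved for one conversation can leak into another — the same isolation Backboard gets from separate assistants.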
What we do differently (advantages):
- Include adversarial category — they skip it, we test it
- Report retrieval metrics alongside accuracy — shows WHERE improvements come from (retrieval vs LLM reasoning)
- Local-first execution — no cloud API dependency, fully reproducible offline (except judge step)
- Multiple retrieval strategies — test FTS, vector, hybrid, with/without time-decay
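For the retrieval-metrics and time-decay points, a sketch of what we would compute per question — recall@k over gold evidence turns, and an exponential time-decay re-weighting of retrieval scores. The `half_life_days` parameter is an assumed tunable, not something specified anywhere in this issue:

```python
from datetime import datetime

def recall_at_k(retrieved_ids: list, gold_ids: list, k: int = 5) -> float:
    """Fraction of gold evidence turns present in the top-k retrieved turns."""
    if not gold_ids:
        return 1.0
    return len(set(retrieved_ids[:k]) & set(gold_ids)) / len(gold_ids)

def time_decay_score(base_score: float, turn_time: datetime,
                     query_time: datetime, half_life_days: float = 30.0) -> float:
    """Hypothetical time-decay re-ranking: halve a turn's retrieval score
    every `half_life_days` of age relative to the question time."""
    age_days = (query_time - turn_time).total_seconds() / 86400.0
    return base_score * 0.5 ** (max(age_days, 0.0) / half_life_days)
```

Reporting recall@k next to answer accuracy is what lets us attribute a gain to better retrieval rather than better LLM reasoning: if accuracy moves while recall@k is flat, the change came from the reasoning side.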
Repo
Results should go in the benchmark repo (openclaw-basic-memory or standalone per #10)
Milestone
v0.19.0