Benchmark: Add LLM-as-Judge evaluation (GPT-4.1) for LoCoMo #9

@bm-clawd

Description

Context

Our current LoCoMo benchmark measures retrieval quality (R@5, R@10, MRR). Competitors like Backboard report end-to-end answer accuracy using LLM-as-Judge (GPT-4.1), scoring 90.1% overall. We need the same metric to compare directly.

What to Build

Adapt the evaluation pipeline to add an LLM-as-Judge step after retrieval:

  1. Retrieve context using BM search (existing)
  2. Pass retrieved context + question to an LLM (GPT-4.1 or Claude)
  3. LLM generates an answer
  4. A judge LLM evaluates: CORRECT or WRONG against ground truth
  5. Report accuracy by category (single_hop, multi_hop, open_domain, temporal)
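The steps above could be sketched roughly as follows. All names here (`search`, `answer_llm`, `judge_llm`, `evaluate_question`) are hypothetical stand-ins for BM search and the LLM API calls, not the actual pipeline code:

```python
from typing import Callable

def evaluate_question(question: str, ground_truth: str,
                      search: Callable[[str], list[str]],
                      answer_llm: Callable[[str, list[str]], str],
                      judge_llm: Callable[[str, str, str], str]) -> bool:
    """One pass of the retrieve -> answer -> judge loop (hypothetical sketch)."""
    context = search(question)                           # 1. retrieve with BM search
    answer = answer_llm(question, context)               # 2-3. generate an answer from context
    verdict = judge_llm(question, answer, ground_truth)  # 4. judge against ground truth
    return verdict.strip().upper() == "CORRECT"          # 5. feed into per-category accuracy
```

Injecting the three callables keeps the retrieval backend and the answer/judge models swappable (GPT-4.1 vs. Claude) and makes the loop unit-testable without network calls.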

Reference Implementation

Backboard's open benchmark: https://github.com/Backboard-io/Backboard-Locomo-Benchmark

  • Uses GPT-4.1 as judge with fixed prompts and seed
  • Publishes logs, prompts, and verdicts for every question
  • Skips category 5 (adversarial) — we should include it
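To mirror Backboard's fixed-prompt, fixed-seed setup, the judge prompt should be a frozen template and the verdict parsing should be strict. The prompt wording and `parse_verdict` helper below are assumptions for illustration, not Backboard's actual prompt:

```python
# Frozen judge prompt template (hypothetical wording; keep it fixed across runs,
# together with a fixed seed/temperature on the judge call, for reproducibility).
JUDGE_PROMPT = """You are grading an answer against a ground truth.
Question: {question}
Ground truth: {ground_truth}
Candidate answer: {answer}
Reply with exactly one word: CORRECT or WRONG."""

def parse_verdict(raw: str) -> str:
    """Normalize the judge's reply; anything other than CORRECT counts as WRONG,
    so a rambling or malformed judge response never inflates accuracy."""
    token = raw.strip().split()[0].upper().strip(".")
    return "CORRECT" if token == "CORRECT" else "WRONG"
```

Defaulting malformed replies to WRONG biases the metric conservatively, which is the safer failure mode when publishing a comparison table.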

Expected Outcome

Direct comparison table:

| Method       | Single-Hop | Multi-Hop | Open Domain | Temporal | Overall |
|--------------|------------|-----------|-------------|----------|---------|
| Backboard    | 89.4%      | 75.0%     | 91.2%       | 91.9%    | 90.0%   |
| Basic Memory | ?          | ?         | ?           | ?        | ?       |
| Mem0         | 67.1%      | 51.2%     | 72.9%       | 55.5%    | 66.9%   |
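The Basic Memory row could be filled in from per-question judge verdicts with a small aggregation step. This is a hypothetical helper (the `(category, is_correct)` record shape is an assumption):

```python
from collections import defaultdict

def accuracy_by_category(verdicts):
    """verdicts: iterable of (category, is_correct) pairs, e.g.
    ("single_hop", True). Returns {category: accuracy} plus an
    "overall" entry computed across all questions."""
    totals, correct = defaultdict(int), defaultdict(int)
    for category, ok in verdicts:
        for key in (category, "overall"):
            totals[key] += 1
            correct[key] += ok  # bool counts as 0/1
    return {cat: correct[cat] / totals[cat] for cat in totals}
```

Note that "overall" here is question-weighted (categories with more questions count more), which matches how a single accuracy-over-all-questions number is usually reported; a plain mean of the category accuracies would give a different figure.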

Our retrieval is already strong (86% R@5 vs Mem0's 66%). With a good LLM on top of our retrieved context, we should be competitive with Backboard.

Milestone

v0.19.0
