Benchmark: Track token usage per query for cost comparison #6

@bm-clawd

Description

Context

MemMachine reports 80% fewer tokens than Mem0 for the same benchmark — a major cost story. We should track this too.

Proposal

During benchmark evaluation, measure and report:

  • Input tokens per query — how much context is injected from memory
  • Output tokens per query — how much the LLM generates
  • Total tokens across benchmark — overall cost comparison
  • Tokens per correct answer — efficiency metric (quality per token)

Why it matters

BM returns raw markdown chunks. We don't extract/compress memories like Mem0 does. This means:

  • We might use MORE tokens per query (full context vs extracted facts)
  • But our context might be richer and lead to better answers
  • The tokens-per-correct-answer metric tells the real story

If we can show competitive accuracy with fewer tokens, that's a cost argument. If we use more tokens but get better answers, that's a quality argument. Either way, the data tells a story.
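The tokens-per-correct-answer comparison above can be sketched in a few lines. The per-query records here are hypothetical numbers purely for illustration, in the shape proposed under Implementation:

```python
# Hypothetical per-query records (tokens_in/tokens_out values are made up
# for illustration; real values come from the benchmark run).
records = [
    {"tokens_in": 1200, "tokens_out": 80, "correct": True},
    {"tokens_in": 950,  "tokens_out": 60, "correct": False},
    {"tokens_in": 1100, "tokens_out": 70, "correct": True},
]

# Total tokens across the benchmark (cost comparison).
total_tokens = sum(r["tokens_in"] + r["tokens_out"] for r in records)

# Tokens per correct answer (efficiency: quality per token).
correct = sum(r["correct"] for r in records)
tokens_per_correct = total_tokens / correct if correct else float("inf")

print(total_tokens, tokens_per_correct)  # → 3460 1730.0
```

A system that spends more tokens per query but answers more questions correctly can still win on this metric, which is exactly the "quality argument" case.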

Implementation

  • Count tokens using tiktoken (cl100k_base) on retrieved context + LLM prompt
  • Log per-query: {query, category, tokens_in, tokens_out, correct, latency}
  • Aggregate: total tokens, tokens/query by category, tokens/correct answer

Milestone

v0.19.0
