Context
MemMachine reports roughly 80% fewer input tokens than Mem0 on the same benchmark, which is a major cost story. We should track token usage too.
Proposal
During benchmark evaluation, measure and report:
- Input tokens per query — how much context is injected from memory
- Output tokens per query — how much the LLM generates
- Total tokens across benchmark — overall cost comparison
- Tokens per correct answer — efficiency metric (quality per token)
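The four metrics above can be computed from per-query records once token counts are logged. A minimal sketch (the record field names tokens_in, tokens_out, and correct are assumptions for illustration, not an existing schema):

```python
def aggregate(records):
    """Roll per-query token records up into the four proposed benchmark metrics.

    Each record is assumed to carry integer tokens_in / tokens_out counts
    and a boolean correct flag (hypothetical schema, see note above).
    """
    n = len(records)
    tokens_in = sum(r["tokens_in"] for r in records)
    tokens_out = sum(r["tokens_out"] for r in records)
    n_correct = sum(1 for r in records if r["correct"])
    total = tokens_in + tokens_out
    return {
        "input_tokens_per_query": tokens_in / n,
        "output_tokens_per_query": tokens_out / n,
        "total_tokens": total,
        # quality-per-token efficiency; None when nothing was answered correctly
        "tokens_per_correct_answer": total / n_correct if n_correct else None,
    }
```

Tokens-per-correct-answer deliberately uses total tokens (input + output) so a system that injects huge contexts but answers well is compared fairly against one that injects little but answers poorly.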
Why it matters
BM returns raw markdown chunks. We don't extract/compress memories like Mem0 does. This means:
- We might use MORE tokens per query (full context vs extracted facts)
- But our context might be richer and lead to better answers
- The tokens-per-correct-answer metric tells the real story
If we can show competitive accuracy with fewer tokens, that's a cost argument. If we use more tokens but get better answers, that's a quality argument. Either way, the data tells a story.
Implementation
- Count tokens using tiktoken (cl100k_base) on retrieved context + LLM prompt
- Log per query: {query, category, tokens_in, tokens_out, correct, latency}
- Aggregate: total tokens, tokens/query by category, tokens/correct answer
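The counting and logging steps above could look roughly like this. This is a sketch, not existing project code: count_tokens and log_query are hypothetical names, and the fallback heuristic is only there so the snippet degrades gracefully when tiktoken is not installed.

```python
import json

try:
    import tiktoken
    _ENC = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text: str) -> int:
        """Exact token count under the cl100k_base encoding."""
        return len(_ENC.encode(text))
except ImportError:
    def count_tokens(text: str) -> int:
        """Rough fallback (~4 chars/token) when tiktoken is unavailable."""
        return max(1, len(text) // 4)


def log_query(query, category, context, answer, correct, latency,
              log_path="token_log.jsonl"):
    """Append one per-query record (the schema proposed above) as JSON Lines."""
    record = {
        "query": query,
        "category": category,
        # input = retrieved memory context + the query itself (the LLM prompt)
        "tokens_in": count_tokens(context) + count_tokens(query),
        "tokens_out": count_tokens(answer),
        "correct": correct,
        "latency": latency,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Logging as JSON Lines keeps per-query records appendable during a benchmark run and trivially re-loadable for the aggregation step.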
Related
- Benchmark: Add LLM-as-Judge evaluation (GPT-4.1) for LoCoMo #9
- MemMachine blog: 419K input tokens (MemMachine) vs 1.92M (Mem0) for same benchmark
Milestone
v0.19.0