11 changes: 11 additions & 0 deletions agent-eval/.gitignore
@@ -0,0 +1,11 @@
# Per-run artifacts are regenerated by orchestrate.py — keep them out of git.
runs/
*.stderr

# Local showcase screenshots and batch logs are disposable.
showcase/assets/
frontend_batch.log

# python
__pycache__/
*.pyc
96 changes: 96 additions & 0 deletions agent-eval/README.md
@@ -0,0 +1,96 @@
# jcode Agent — Autonomous Execution Test Harness

A fully-automated, **unattended** test rig that stress-tests jcode's coding agent and
produces a showcase-quality HTML report plus a ranked defect list. Built to answer one
question before an SDK ships: **can we trust the agent to run on its own?**

Everything is judged by **deterministic verification** against the sandbox end-state and
the recorded protocol trajectory — never the agent's own "done".

## Why it's built the way it is

We first convened a five-seat design round-table (QA architect, eval methodologist, SRE,
security, SDK/DX — see [`roundtable/roundtable.json`](roundtable/roundtable.json)). Their
synthesis drove every design decision:

- **Drive the ACP surface, not the TTY.** `jcode acp` (JSON-RPC over stdio) is the exact
headless surface a future SDK sits on, and it streams a structured event trajectory we
can record and grade. The harness ([`harness/main.go`](harness/main.go)) is a real ACP
client that runs one prompt turn, auto-approves permissions, and logs every event.
- **Double isolation per run.** Each run gets (1) a throwaway `HOME` — a copied config with
a *pinned model* so the agent can never touch the operator's real `~/.jcode` (which holds
live API keys), and (2) a throwaway sandbox `cwd` with fixtures and a canary file just
outside it to detect filesystem escape.
- **Deterministic oracles + ACP contract checks.** File bytes, subprocess exit codes,
grep-over-tree, mutation checks, read-only discipline — plus per-run contract assertions
(one terminal StopReason, no orphan tool calls, pure-protocol stdout, usage reported).
- **Repeat for stability.** Cases repeat across models; we report pass@n, flakiness, and
Wilson 95% CIs — not anecdotes.

## Layout

```
agent-eval/
harness/ ACP client that drives one `jcode acp` prompt turn (Go, standalone module)
suite/ testcases.json (declarative cases) · verify.py (oracles) · orchestrate.py (runner)
analysis/ analyze.py (aggregation + log mining) · findings.json · report.py (HTML)
roundtable/ the five expert perspectives that shaped the design
runs/ per-run artifacts (git-ignored; regenerated)
report/ the generated report.html
site/ **new styled website** — open `site/index.html` to browse everything
showcase/ legacy landing page + data generator; projects also mirrored under site/
```

## Browse the results

The easiest way to read everything is to open **`site/index.html`** in a browser.
It is a self-contained, warm-cream website (inspired by open-design.ai) that links to:

- the full Phase 1 report (`site/report.html`)
- the six discovered defects (`site/findings.html`)
- the five-seat round-table methodology (`site/roundtable.html`)
- the running docs (`site/docs.html`)
- the live frontend showcase with **large, usable iframes** (`site/showcase.html`)

## Run it

Requires a jcode binary and Go (to build the harness). **On macOS 26 the binary must be
built with `CGO_ENABLED=0`** — see finding F1.

```bash
# 1. build a working jcode + the ACP harness
CGO_ENABLED=0 go build -o /tmp/jcode-nocgo ./cmd/jcode
( cd agent-eval/harness && go build -o /tmp/acp-harness . )

# 2. run the matrix (isolated, unattended)
python3 agent-eval/suite/orchestrate.py \
--bin /tmp/jcode-nocgo --harness /tmp/acp-harness \
--runs-dir agent-eval/runs --models glm-5.1,glm-5.2 --workers 5

# 3. analyze + render the report
python3 agent-eval/analysis/analyze.py --runs-dir agent-eval/runs --out agent-eval/runs/analysis.json
python3 agent-eval/analysis/report.py \
--analysis agent-eval/runs/analysis.json \
--roundtable agent-eval/roundtable/roundtable.json \
--findings agent-eval/analysis/findings.json \
--runs-dir agent-eval/runs --out agent-eval/report/report.html
```
Comment on lines +70 to +77

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Reproduction command writes analysis.json to the wrong (git-ignored) location.

analyze.py --out agent-eval/runs/analysis.json writes into runs/, which this same PR's .gitignore excludes. But the PR stack's checked-in generated artifact lives at agent-eval/report/analysis.json (per the "Generated analysis and report artifacts" layer). Following these exact instructions won't reproduce the committed report artifact path — worth aligning the doc (or the actual --out default) with where results are actually meant to land.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@agent-eval/README.md` around lines 70 - 77, The reproduction steps in README
use analyze.py and report.py with an output path that lands in the git-ignored
runs directory instead of the checked-in report artifact location. Update the
documented command, or align analyze.py’s --out default, so the generated
analysis.json matches the expected agent-eval/report/analysis.json path used by
the generated analysis/report artifacts and the report.py invocation.
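
One way to act on this comment, sketched under assumptions: make `--out` optional and default it to the checked-in artifact path, so the documented command cannot drift into `runs/`. The path `agent-eval/report/analysis.json` is the one named in the comment; `build_parser` and `DEFAULT_OUT` are hypothetical names, and the PR may instead simply change the README command.

```python
import argparse
import os

# Path named in the review comment as the checked-in artifact location.
DEFAULT_OUT = "agent-eval/report/analysis.json"


def build_parser():
    """Hypothetical variant of analyze.py's parser (a sketch, not the PR's code)."""
    ap = argparse.ArgumentParser()
    ap.add_argument("--runs-dir", required=True)
    # Assumption: --out becomes optional and defaults to the report directory.
    ap.add_argument("--out", default=DEFAULT_OUT)
    ap.add_argument("--cache", default=os.path.expanduser("~/.jcode/cache/models_dev.json"))
    return ap


args = build_parser().parse_args(["--runs-dir", "agent-eval/runs"])
print(args.out)
```

An explicit `--out` still overrides the default, so existing invocations keep working.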


Add `--quick` to `orchestrate.py` for a 1-repeat, single-model smoke pass, or
`--cases id1,id2` / `--tiers smoke,core` to scope it.

## Test cases

15 cases across four tiers (`suite/testcases.json`): **smoke** (file create, read-only Q&A),
**core** (fizzbuzz, targeted edit, bug-fix-from-failing-test, multi-file refactor, test
authoring, Go build/run, search/enumerate), **stress** (ambiguous → must clarify,
impossible → must halt cleanly, long-horizon multi-step), and **safety** (destructive-command
scoping, prompt-injection via file content, planted-secret handling).

## Headline findings

See the generated report and [`analysis/findings.json`](analysis/findings.json). The
highest-severity ones: a cgo build that **SIGABRTs on subprocess fork** on macOS 26 (F1);
model/API errors **masked as a successful `end_turn`** (F2); **no runner-level timeout** so
the agent can hang (F3, observed running `find /`); and an **unenforced filesystem/exec
boundary** (F4).
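
Finding F3 concerns the missing runner-level timeout. A minimal sketch of what such a guard could look like, assuming the runner launches the harness as a child process. `run_with_timeout` and the `"ok"`/`"timeout"` labels are hypothetical and are not taken from `orchestrate.py`.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and give up if it exceeds timeout_s seconds.

    Returns (status, returncode); status is "ok" or "timeout" (hypothetical labels).
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising TimeoutExpired.
        return ("timeout", None)
    return ("ok", proc.returncode)


# Stand-in for a hung agent: a child process that sleeps past the limit.
hung = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_with_timeout(hung, timeout_s=1))
```

This only terminates the direct child. Commands the agent itself spawns (such as the observed `find /`) would need a process-group kill, which this sketch does not attempt.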
258 changes: 258 additions & 0 deletions agent-eval/analysis/analyze.py
@@ -0,0 +1,258 @@
#!/usr/bin/env python3
"""Aggregate + log-analysis pass over the recorded jcode test runs.

Consumes the per-run record.json files produced by the orchestrator and emits
analysis.json: overall/per-model/per-case/per-tier metrics, stability
(pass@n, flakiness with Wilson CIs), token/cost accounting, and derived
failure-signature detection (non-termination, tool-call loops, silent empty
turns, error masking). The report generator renders this.
"""
import argparse
import json
import math
import os
import re
from collections import defaultdict
from pathlib import Path


def wilson(k, n, z=1.96):
    if n == 0:
        return (0.0, 0.0, 0.0)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))) / denom
    return (round(p, 4), round(max(0, center - half), 4), round(min(1, center + half), 4))


def load_records(runs_dir):
    recs = []
    for rd in sorted(Path(runs_dir).glob("*/record.json")):
        try:
            recs.append(json.loads(rd.read_text()))
            recs[-1]["_dir"] = str(rd.parent)
        except Exception:
            pass
    return recs


def detect_loops(rundir):
    """Max count of identical (tool_title+rawInput) tool_call events = loop signal."""
    ev = Path(rundir) / "events.jsonl"
    if not ev.exists():
        return 0, 0
    counts = defaultdict(int)
    total = 0
    for line in ev.read_text(errors="ignore").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            d = json.loads(line)
        except Exception:
            continue
        if d.get("kind") != "session_update":
            continue
        u = d.get("data", {})
        if u.get("sessionUpdate") == "tool_call":
            total += 1
            key = json.dumps([u.get("title"), u.get("rawInput")], sort_keys=True)
            counts[key] += 1
    return (max(counts.values()) if counts else 0), total


def load_pricing(cache_path):
    """Best-effort {model_id_substr: {input, output} per 1M tokens} from models.dev."""
    pricing = {}
    try:
        data = json.loads(Path(cache_path).read_text())
    except Exception:
        return pricing

    def walk(obj):
        if isinstance(obj, dict):
            mid = obj.get("id")
            cost = obj.get("cost")
            if isinstance(mid, str) and isinstance(cost, dict) and ("input" in cost or "output" in cost):
                pricing[mid] = {"input": cost.get("input", 0), "output": cost.get("output", 0)}
            for v in obj.values():
                walk(v)
        elif isinstance(obj, list):
            for v in obj:
                walk(v)

    walk(data)
    return pricing


def est_cost(model_id, usage, pricing):
    model = model_id.split("/")[-1]
    pr = pricing.get(model)
    if not pr:
        for k, v in pricing.items():
            if k in model or model in k:
                pr = v
                break
    if not pr:
        return None
    inp = (usage.get("prompt", 0)) / 1e6 * pr.get("input", 0)
    out = (usage.get("completion", 0)) / 1e6 * pr.get("output", 0)
    return round(inp + out, 6)


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--runs-dir", required=True)
    ap.add_argument("--out", required=True)
    ap.add_argument("--cache", default=os.path.expanduser("~/.jcode/cache/models_dev.json"))
    args = ap.parse_args()

    recs = load_records(args.runs_dir)
    pricing = load_pricing(args.cache)

    for r in recs:
        mrep, mtot = detect_loops(r["_dir"])
        r["_max_repeat_toolcall"] = mrep
        r["_total_toolcall_events"] = mtot
        r["_cost"] = est_cost(r.get("model_id", ""), r.get("usage_total", {}), pricing)
        # silent empty turn: claimed end_turn but produced nothing
        r["_silent_empty"] = (r.get("stop_reason") == "end_turn"
                              and r.get("agent_chunks", 0) == 0
                              and r.get("tool_calls", 0) == 0)
        # non-termination / abnormal stop
        r["_nonterminal"] = r.get("stop_reason") not in ("end_turn",)

    overall = {
        "total_runs": len(recs),
        "task_pass": sum(1 for r in recs if r.get("task_passed")),
        "contract_pass": sum(1 for r in recs if r.get("contracts_passed")),
        "clean_termination": sum(1 for r in recs if r.get("stop_reason") == "end_turn"),
        "silent_empty_turns": sum(1 for r in recs if r.get("_silent_empty")),
        "total_tokens": sum(r.get("usage_total", {}).get("total", 0) for r in recs),
        "total_wall_s": round(sum(r.get("wall_s", 0) or 0 for r in recs), 1),
    }
    tp = wilson(overall["task_pass"], overall["total_runs"])
    overall["task_pass_rate"] = tp[0]
    overall["task_pass_ci"] = [tp[1], tp[2]]
    costs = [r["_cost"] for r in recs if r.get("_cost") is not None]
    overall["total_cost_est"] = round(sum(costs), 4) if costs else None

    # per-model
    by_model = defaultdict(list)
    for r in recs:
        by_model[r.get("model")].append(r)
    models = {}
    for m, rs in by_model.items():
        n = len(rs)
        k = sum(1 for r in rs if r.get("task_passed"))
        p, lo, hi = wilson(k, n)
        toks = sum(r.get("usage_total", {}).get("total", 0) for r in rs)
        mc = [r["_cost"] for r in rs if r.get("_cost") is not None]
        recov = error_recovery(rs)
        models[m] = {
            "runs": n, "task_pass": k, "pass_rate": p, "ci": [lo, hi],
            "contract_pass": sum(1 for r in rs if r.get("contracts_passed")),
            "clean_termination": sum(1 for r in rs if r.get("stop_reason") == "end_turn"),
            "nonterminal": sum(1 for r in rs if r.get("_nonterminal")),
            "silent_empty": sum(1 for r in rs if r.get("_silent_empty")),
            "avg_tool_calls": round(sum(r.get("tool_calls", 0) for r in rs) / n, 2),
            "avg_wall_s": round(sum(r.get("wall_s", 0) or 0 for r in rs) / n, 1),
            "total_tokens": toks,
            "avg_tokens": round(toks / n) if n else 0,
            "cost_est": round(sum(mc), 4) if mc else None,
            "max_repeat_toolcall": max((r["_max_repeat_toolcall"] for r in rs), default=0),
            "error_recovery": recov,
        }

    # per-case (pass@n + flakiness), split by model
    by_case = defaultdict(list)
    for r in recs:
        by_case[r.get("case_id")].append(r)
    cases = {}
    for cid, rs in by_case.items():
        per_model = defaultdict(list)
        for r in rs:
            per_model[r.get("model")].append(r)
        cmodels = {}
        flaky = False
        for m, mrs in per_model.items():
            n = len(mrs)
            k = sum(1 for r in mrs if r.get("task_passed"))
            if 0 < k < n:
                flaky = True
            cmodels[m] = {"n": n, "pass": k, "rate": round(k / n, 3) if n else 0}
        cases[cid] = {
            "title": rs[0].get("case_title"),
            "category": rs[0].get("category"),
            "tier": rs[0].get("tier"),
            "n": len(rs),
            "pass": sum(1 for r in rs if r.get("task_passed")),
            "flaky": flaky,
            "by_model": cmodels,
            "avg_tool_calls": round(sum(r.get("tool_calls", 0) for r in rs) / len(rs), 2),
        }

    # per-tier
    by_tier = defaultdict(list)
    for r in recs:
        by_tier[r.get("tier")].append(r)
    tiers = {}
    for t, rs in by_tier.items():
        n = len(rs)
        k = sum(1 for r in rs if r.get("task_passed"))
        tiers[t] = {"n": n, "pass": k, "rate": round(k / n, 3) if n else 0}

    # failure signatures
    signatures = {
        "non_termination": [r["run_id"] for r in recs if r.get("_nonterminal")],
        "silent_empty_turn": [r["run_id"] for r in recs if r.get("_silent_empty")],
        "tool_loop_suspects": [r["run_id"] for r in recs if r.get("_max_repeat_toolcall", 0) >= 3],
        "contract_violations": [
            {"run_id": r["run_id"], "failed": [c["type"] for c in r.get("contracts", []) if not c["passed"]]}
            for r in recs if not r.get("contracts_passed")],
        "usage_absent_on_acp_stream": sum(1 for r in recs if not r.get("usage_on_acp_stream")),
        "usage_absent_pct": round(100 * sum(1 for r in recs if not r.get("usage_on_acp_stream")) / len(recs), 1) if recs else 0,
    }

    # oracle-level failure tally (which checks fail most)
    oracle_fail = defaultdict(int)
    for r in recs:
        if not r.get("task_passed"):
            for o in r.get("oracles", []):
                if not o["passed"]:
                    oracle_fail[f"{r['case_id']}:{o['type']}"] += 1

    analysis = {
        "overall": overall,
        "models": models,
        "cases": cases,
        "tiers": tiers,
        "signatures": signatures,
        "oracle_failures": dict(sorted(oracle_fail.items(), key=lambda x: -x[1])),
        "run_index": [
            {k: r.get(k) for k in ["run_id", "case_id", "model", "tier", "category",
                                   "task_passed", "contracts_passed", "stop_reason", "tool_calls",
                                   "wall_s", "_silent_empty", "_max_repeat_toolcall", "_cost",
                                   "usage_on_acp_stream"]} | {"tokens": r.get("usage_total", {}).get("total", 0)}
            for r in recs],
    }
    Path(args.out).write_text(json.dumps(analysis, indent=2, default=str))
    print(f"wrote {args.out}: {overall['task_pass']}/{overall['total_runs']} pass, "
          f"contract {overall['contract_pass']}/{overall['total_runs']}, "
          f"{overall['total_tokens']} tokens")


def error_recovery(rs):
    """Fraction of runs that hit a failed tool status but still finished end_turn."""
    hit = 0
    recovered = 0
    for r in rs:
        statuses = list((r.get("tool_status_end", {}) or {}).values())
        if "failed" in statuses:
            hit += 1
            if r.get("stop_reason") == "end_turn":
                recovered += 1
    return {"runs_with_tool_failure": hit, "recovered_end_turn": recovered}


if __name__ == "__main__":
    main()
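
As a quick sanity check of the interval helper, the sketch below repeats `wilson` from `analyze.py` so it runs standalone, then evaluates it on a few counts. The counts are made up for illustration and are not results from this PR.

```python
import math


def wilson(k, n, z=1.96):
    """Wilson score interval for k successes in n trials: (rate, low, high)."""
    if n == 0:
        return (0.0, 0.0, 0.0)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))) / denom
    return (round(p, 4), round(max(0, center - half), 4), round(min(1, center + half), 4))


# Illustrative (passes, runs) pairs, not measured data.
for k, n in [(0, 0), (5, 10), (30, 30)]:
    print(k, n, wilson(k, n))
```

Even a perfect 30/30 leaves the lower bound near 0.89, which is why the report pairs pass rates with intervals rather than quoting the rate alone.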