cnjack · Jul 3, 2026 · Jul 2, 2026
diff --git a/agent-eval/.gitignore b/agent-eval/.gitignore
@@ -0,0 +1,11 @@
+# Per-run artifacts are regenerated by orchestrate.py — keep them out of git.
+runs/
+*.stderr
+
+# Local showcase screenshots and batch logs are disposable.
+showcase/assets/
+frontend_batch.log
+
+# python
+__pycache__/
+*.pyc
diff --git a/agent-eval/README.md b/agent-eval/README.md
@@ -0,0 +1,96 @@
+# jcode Agent — Autonomous Execution Test Harness
+
+A fully automated, **unattended** test rig that stress-tests jcode's coding agent and
+produces a showcase-quality HTML report plus a ranked defect list. Built to answer one
+question before an SDK ships: **can we trust the agent to run on its own?**
+
+Everything is judged by **deterministic verification** against the sandbox end-state and
+the recorded protocol trajectory — never the agent's own "done".
+
+## Why it's built the way it is
+
+We first convened a five-seat design round-table (QA architect, eval methodologist, SRE,
+security, SDK/DX — see [`roundtable/roundtable.json`](roundtable/roundtable.json)). Their
+synthesis drove every design decision:
+
+- **Drive the ACP surface, not the TTY.** `jcode acp` (JSON-RPC over stdio) is the exact
+  headless surface a future SDK sits on, and it streams a structured event trajectory we
+  can record and grade. The harness ([`harness/main.go`](harness/main.go)) is a real ACP
+  client that runs one prompt turn, auto-approves permissions, and logs every event.
+- **Double isolation per run.** Each run gets (1) a throwaway `HOME` — a copied config with
+  a *pinned model* so the agent can never touch the operator's real `~/.jcode` (which holds
+  live API keys), and (2) a throwaway sandbox `cwd` with fixtures and a canary file just
+  outside it to detect filesystem escape.
+- **Deterministic oracles + ACP contract checks.** File bytes, subprocess exit codes,
+  grep-over-tree, mutation checks, read-only discipline — plus per-run contract assertions
+  (one terminal StopReason, no orphan tool calls, pure-protocol stdout, usage reported).
+- **Repeat for stability.** Cases repeat across models; we report pass@n, flakiness, and
+  Wilson 95% CIs — not anecdotes.
+
+## Layout
+
+```
+agent-eval/
+  harness/         ACP client that drives one `jcode acp` prompt turn (Go, standalone module)
+  suite/           testcases.json (declarative cases) · verify.py (oracles) · orchestrate.py (runner)
+  analysis/        analyze.py (aggregation + log mining) · findings.json · report.py (HTML)
+  roundtable/      the five expert perspectives that shaped the design
+  runs/            per-run artifacts (git-ignored; regenerated)
+  report/          the generated report.html
+  site/            new styled website; open `site/index.html` to browse everything
+  showcase/        legacy landing page + data generator; projects also mirrored under site/
+```
+
+## Browse the results
+
+The easiest way to read everything is to open **`site/index.html`** in a browser.
+It is a self-contained, warm-cream website (inspired by open-design.ai) that links to:
+
+- the full Phase 1 report (`site/report.html`)
+- the six discovered defects (`site/findings.html`)
+- the five-seat round-table methodology (`site/roundtable.html`)
+- the running docs (`site/docs.html`)
+- the live frontend showcase with **large, usable iframes** (`site/showcase.html`)
+
+## Run it
+
+Requires a jcode binary and Go (to build the harness). **On macOS 26 the binary must be
+built with `CGO_ENABLED=0`** — see finding F1.
+
+```bash
+# 1. build a working jcode + the ACP harness
+CGO_ENABLED=0 go build -o /tmp/jcode-nocgo ./cmd/jcode
+( cd agent-eval/harness && go build -o /tmp/acp-harness . )
+
+# 2. run the matrix (isolated, unattended)
+python3 agent-eval/suite/orchestrate.py \
+  --bin /tmp/jcode-nocgo --harness /tmp/acp-harness \
+  --runs-dir agent-eval/runs --models glm-5.1,glm-5.2 --workers 5
+
+# 3. analyze + render the report
+python3 agent-eval/analysis/analyze.py --runs-dir agent-eval/runs --out agent-eval/runs/analysis.json
+python3 agent-eval/analysis/report.py \
+  --analysis agent-eval/runs/analysis.json \
+  --roundtable agent-eval/roundtable/roundtable.json \
+  --findings agent-eval/analysis/findings.json \
+  --runs-dir agent-eval/runs --out agent-eval/report/report.html
+```
+
+Add `--quick` to `orchestrate.py` for a 1-repeat, single-model smoke pass, or
+`--cases id1,id2` / `--tiers smoke,core` to scope it.
+
+## Test cases
+
+15 cases across four tiers (`suite/testcases.json`): **smoke** (file create, read-only Q&A),
+**core** (fizzbuzz, targeted edit, bug-fix-from-failing-test, multi-file refactor, test
+authoring, Go build/run, search/enumerate), **stress** (ambiguous → must clarify,
+impossible → must halt cleanly, long-horizon multi-step), and **safety** (destructive-command
+scoping, prompt-injection via file content, planted-secret handling).
+
+## Headline findings
+
+See the generated report and [`analysis/findings.json`](analysis/findings.json). The
+highest-severity ones: a cgo build that **SIGABRTs on subprocess fork** on macOS 26 (F1);
+model/API errors **masked as a successful `end_turn`** (F2); **no runner-level timeout** so
+the agent can hang (F3, observed running `find /`); and an **unenforced filesystem/exec
+boundary** (F4).
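Review note on the README's "Double isolation per run" bullet: the following is a minimal Python sketch of the idea, not the harness's actual implementation (`orchestrate.py` is not part of this excerpt). The helper names `make_isolated_run`, `run_isolated`, and `canary_intact`, the `CANARY` filename, and the directory layout are all hypothetical. The copied config with the pinned model is omitted because its format is not shown here.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def make_isolated_run(root: Path) -> dict:
    """Create a throwaway HOME, a sandbox cwd, and a canary just outside the sandbox."""
    home = root / "home"
    sandbox = root / "work" / "sandbox"
    home.mkdir(parents=True)
    sandbox.mkdir(parents=True)
    # Hypothetical canary location: a sibling of the sandbox directory.
    canary = sandbox.parent / "CANARY"
    canary.write_text("do-not-touch\n")
    return {
        "home": home,
        "sandbox": sandbox,
        "canary": canary,
        "canary_bytes": canary.read_bytes(),
    }


def run_isolated(cmd: list, run: dict) -> int:
    """Run cmd with HOME pointed at the throwaway dir and cwd set to the sandbox."""
    env = dict(os.environ, HOME=str(run["home"]))
    proc = subprocess.run(cmd, cwd=run["sandbox"], env=env)
    return proc.returncode


def canary_intact(run: dict) -> bool:
    """True if the canary still exists with its original bytes."""
    canary = run["canary"]
    return canary.exists() and canary.read_bytes() == run["canary_bytes"]


def demo() -> dict:
    """Run one well-behaved child and one that writes outside the sandbox."""
    with tempfile.TemporaryDirectory() as tmp:
        run = make_isolated_run(Path(tmp))
        well_behaved = [sys.executable, "-c", "open('out.txt', 'w').write('hi')"]
        escaping = [sys.executable, "-c", "open('../CANARY', 'w').write('pwned')"]

        run_isolated(well_behaved, run)
        wrote_in_sandbox = (run["sandbox"] / "out.txt").exists()
        intact_after_good = canary_intact(run)

        run_isolated(escaping, run)
        intact_after_escape = canary_intact(run)
    return {
        "wrote_in_sandbox": wrote_in_sandbox,
        "intact_after_good": intact_after_good,
        "intact_after_escape": intact_after_escape,
    }


if __name__ == "__main__":
    print(demo())
```

A canary like this is a tripwire, not a boundary: it only detects changes to that one file, so it complements rather than replaces real filesystem enforcement.
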
diff --git a/agent-eval/analysis/analyze.py b/agent-eval/analysis/analyze.py
@@ -0,0 +1,258 @@
+#!/usr/bin/env python3
+"""Aggregate + log-analysis pass over the recorded jcode test runs.
+
+Consumes the per-run record.json files produced by the orchestrator and emits
+analysis.json: overall/per-model/per-case/per-tier metrics, stability
+(pass@n, flakiness with Wilson CIs), token/cost accounting, and derived
+failure-signature detection (non-termination, tool-call loops, silent empty
+turns, error masking). The report generator renders this.
+"""
+import argparse
+import json
+import math
+import os
+import re
+from collections import defaultdict
+from pathlib import Path
+
+
+def wilson(k, n, z=1.96):
+    if n == 0:
+        return (0.0, 0.0, 0.0)
+    p = k / n
+    denom = 1 + z * z / n
+    center = (p + z * z / (2 * n)) / denom
+    half = (z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))) / denom
+    return (round(p, 4), round(max(0, center - half), 4), round(min(1, center + half), 4))
+
+
+def load_records(runs_dir):
+    recs = []
+    for rd in sorted(Path(runs_dir).glob("*/record.json")):
+        try:
+            rec = json.loads(rd.read_text())
+            recs.append({**rec, "_dir": str(rd.parent)})  # non-dict JSON raises here and is skipped whole
+        except Exception:
+            pass
+    return recs
+
+
+def detect_loops(rundir):
+    """Return (max repeats of one identical title+rawInput tool_call, total tool_call events); many repeats signal a loop."""
+    ev = Path(rundir) / "events.jsonl"
+    if not ev.exists():
+        return 0, 0
+    counts = defaultdict(int)
+    total = 0
+    for line in ev.read_text(errors="ignore").splitlines():
+        line = line.strip()
+        if not line:
+            continue
+        try:
+            d = json.loads(line)
+        except Exception:
+            continue
+        if d.get("kind") != "session_update":
+            continue
+        u = d.get("data", {})
+        if u.get("sessionUpdate") == "tool_call":
+            total += 1
+            key = json.dumps([u.get("title"), u.get("rawInput")], sort_keys=True)
+            counts[key] += 1
+    return (max(counts.values()) if counts else 0), total
+
+
+def load_pricing(cache_path):
+    """Best-effort {model_id_substr: {input, output} per 1M tokens} from models.dev."""
+    pricing = {}
+    try:
+        data = json.loads(Path(cache_path).read_text())
+    except Exception:
+        return pricing
+    def walk(obj):
+        if isinstance(obj, dict):
+            mid = obj.get("id")
+            cost = obj.get("cost")
+            if isinstance(mid, str) and isinstance(cost, dict) and ("input" in cost or "output" in cost):
+                pricing[mid] = {"input": cost.get("input", 0), "output": cost.get("output", 0)}
+            for v in obj.values():
+                walk(v)
+        elif isinstance(obj, list):
+            for v in obj:
+                walk(v)
+    walk(data)
+    return pricing
+
+
+def est_cost(model_id, usage, pricing):
+    model = (model_id or "").split("/")[-1]
+    pr = pricing.get(model)
+    if not pr and model:  # skip fuzzy matching for an empty id: "" is a substring of every key
+        for k, v in pricing.items():
+            if k in model or model in k:
+                pr = v
+                break
+    if not pr:
+        return None
+    inp = (usage.get("prompt", 0)) / 1e6 * pr.get("input", 0)
+    out = (usage.get("completion", 0)) / 1e6 * pr.get("output", 0)
+    return round(inp + out, 6)
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--runs-dir", required=True)
+    ap.add_argument("--out", required=True)
+    ap.add_argument("--cache", default=os.path.expanduser("~/.jcode/cache/models_dev.json"))
+    args = ap.parse_args()
+
+    recs = load_records(args.runs_dir)
+    pricing = load_pricing(args.cache)
+
+    for r in recs:
+        mrep, mtot = detect_loops(r["_dir"])
+        r["_max_repeat_toolcall"] = mrep
+        r["_total_toolcall_events"] = mtot
+        r["_cost"] = est_cost(r.get("model_id", ""), r.get("usage_total", {}), pricing)
+        # silent empty turn: claimed end_turn but produced nothing
+        r["_silent_empty"] = (r.get("stop_reason") == "end_turn"
+                              and r.get("agent_chunks", 0) == 0
+                              and r.get("tool_calls", 0) == 0)
+        # non-termination / abnormal stop
+        r["_nonterminal"] = r.get("stop_reason") != "end_turn"
+
+    overall = {
+        "total_runs": len(recs),
+        "task_pass": sum(1 for r in recs if r.get("task_passed")),
+        "contract_pass": sum(1 for r in recs if r.get("contracts_passed")),
+        "clean_termination": sum(1 for r in recs if r.get("stop_reason") == "end_turn"),
+        "silent_empty_turns": sum(1 for r in recs if r.get("_silent_empty")),
+        "total_tokens": sum(r.get("usage_total", {}).get("total", 0) for r in recs),
+        "total_wall_s": round(sum(r.get("wall_s", 0) or 0 for r in recs), 1),
+    }
+    tp = wilson(overall["task_pass"], overall["total_runs"])
+    overall["task_pass_rate"] = tp[0]
+    overall["task_pass_ci"] = [tp[1], tp[2]]
+    costs = [r["_cost"] for r in recs if r.get("_cost") is not None]
+    overall["total_cost_est"] = round(sum(costs), 4) if costs else None
+
+    # per-model
+    by_model = defaultdict(list)
+    for r in recs:
+        by_model[r.get("model")].append(r)
+    models = {}
+    for m, rs in by_model.items():
+        n = len(rs)
+        k = sum(1 for r in rs if r.get("task_passed"))
+        p, lo, hi = wilson(k, n)
+        toks = sum(r.get("usage_total", {}).get("total", 0) for r in rs)
+        mc = [r["_cost"] for r in rs if r.get("_cost") is not None]
+        recov = error_recovery(rs)
+        models[m] = {
+            "runs": n, "task_pass": k, "pass_rate": p, "ci": [lo, hi],
+            "contract_pass": sum(1 for r in rs if r.get("contracts_passed")),
+            "clean_termination": sum(1 for r in rs if r.get("stop_reason") == "end_turn"),
+            "nonterminal": sum(1 for r in rs if r.get("_nonterminal")),
+            "silent_empty": sum(1 for r in rs if r.get("_silent_empty")),
+            "avg_tool_calls": round(sum(r.get("tool_calls", 0) for r in rs) / n, 2),
+            "avg_wall_s": round(sum(r.get("wall_s", 0) or 0 for r in rs) / n, 1),
+            "total_tokens": toks,
+            "avg_tokens": round(toks / n) if n else 0,
+            "cost_est": round(sum(mc), 4) if mc else None,
+            "max_repeat_toolcall": max((r["_max_repeat_toolcall"] for r in rs), default=0),
+            "error_recovery": recov,
+        }
+
+    # per-case (pass@n + flakiness), split by model
+    by_case = defaultdict(list)
+    for r in recs:
+        by_case[r.get("case_id")].append(r)
+    cases = {}
+    for cid, rs in by_case.items():
+        per_model = defaultdict(list)
+        for r in rs:
+            per_model[r.get("model")].append(r)
+        cmodels = {}
+        flaky = False
+        for m, mrs in per_model.items():
+            n = len(mrs)
+            k = sum(1 for r in mrs if r.get("task_passed"))
+            if 0 < k < n:
+                flaky = True
+            cmodels[m] = {"n": n, "pass": k, "rate": round(k / n, 3) if n else 0}
+        cases[cid] = {
+            "title": rs[0].get("case_title"),
+            "category": rs[0].get("category"),
+            "tier": rs[0].get("tier"),
+            "n": len(rs),
+            "pass": sum(1 for r in rs if r.get("task_passed")),
+            "flaky": flaky,
+            "by_model": cmodels,
+            "avg_tool_calls": round(sum(r.get("tool_calls", 0) for r in rs) / len(rs), 2),
+        }
+
+    # per-tier
+    by_tier = defaultdict(list)
+    for r in recs:
+        by_tier[r.get("tier")].append(r)
+    tiers = {}
+    for t, rs in by_tier.items():
+        n = len(rs)
+        k = sum(1 for r in rs if r.get("task_passed"))
+        tiers[t] = {"n": n, "pass": k, "rate": round(k / n, 3) if n else 0}
+
+    # failure signatures
+    signatures = {
+        "non_termination": [r["run_id"] for r in recs if r.get("_nonterminal")],
+        "silent_empty_turn": [r["run_id"] for r in recs if r.get("_silent_empty")],
+        "tool_loop_suspects": [r["run_id"] for r in recs if r.get("_max_repeat_toolcall", 0) >= 3],
+        "contract_violations": [
+            {"run_id": r["run_id"], "failed": [c["type"] for c in r.get("contracts", []) if not c["passed"]]}
+            for r in recs if not r.get("contracts_passed")],
+        "usage_absent_on_acp_stream": sum(1 for r in recs if not r.get("usage_on_acp_stream")),
+        "usage_absent_pct": round(100 * sum(1 for r in recs if not r.get("usage_on_acp_stream")) / len(recs), 1) if recs else 0,
+    }
+
+    # oracle-level failure tally (which checks fail most)
+    oracle_fail = defaultdict(int)
+    for r in recs:
+        if not r.get("task_passed"):
+            for o in r.get("oracles", []):
+                if not o["passed"]:
+                    oracle_fail[f"{r['case_id']}:{o['type']}"] += 1
+
+    analysis = {
+        "overall": overall,
+        "models": models,
+        "cases": cases,
+        "tiers": tiers,
+        "signatures": signatures,
+        "oracle_failures": dict(sorted(oracle_fail.items(), key=lambda x: -x[1])),
+        "run_index": [
+            {k: r.get(k) for k in ["run_id", "case_id", "model", "tier", "category",
+             "task_passed", "contracts_passed", "stop_reason", "tool_calls",
+             "wall_s", "_silent_empty", "_max_repeat_toolcall", "_cost",
+             "usage_on_acp_stream"]} | {"tokens": r.get("usage_total", {}).get("total", 0)}
+            for r in recs],
+    }
+    Path(args.out).write_text(json.dumps(analysis, indent=2, default=str))
+    print(f"wrote {args.out}: {overall['task_pass']}/{overall['total_runs']} pass, "
+          f"contract {overall['contract_pass']}/{overall['total_runs']}, "
+          f"{overall['total_tokens']} tokens")
+
+
+def error_recovery(rs):
+    """Count runs that hit a failed tool status, and how many of those still finished with end_turn."""
+    hit = 0
+    recovered = 0
+    for r in rs:
+        statuses = list((r.get("tool_status_end", {}) or {}).values())
+        if "failed" in statuses:
+            hit += 1
+            if r.get("stop_reason") == "end_turn":
+                recovered += 1
+    return {"runs_with_tool_failure": hit, "recovered_end_turn": recovered}
+
+
+if __name__ == "__main__":
+    main()
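Review note on the stability metrics: a small worked example may help readers sanity-check the Wilson interval and the flakiness rule (`0 < k < n`) used in `analyze.py`. The sketch below restates the same formula without the rounding and applies it to toy records. The `summarize` helper, the `toy_records` data, and the model names `model-a` and `model-b` are hypothetical; only the field names `case_id`, `model`, and `task_passed` come from the script above.

```python
import math
from collections import defaultdict


def wilson(k, n, z=1.96):
    """Wilson score interval for k passes in n runs: (rate, lower, upper)."""
    if n == 0:
        return (0.0, 0.0, 0.0)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (p, max(0.0, center - half), min(1.0, center + half))


def summarize(records):
    """Group by (case_id, model) and report pass count, interval, and flakiness."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["case_id"], r["model"])].append(bool(r["task_passed"]))
    out = {}
    for key, results in groups.items():
        n, k = len(results), sum(results)
        rate, lo, hi = wilson(k, n)
        # A (case, model) pair is flaky when it neither always passes nor always fails.
        out[key] = {"n": n, "pass": k, "rate": rate, "ci": (lo, hi), "flaky": 0 < k < n}
    return out


# Toy data: model-a passes 8 of 10 repeats, model-b passes all 10.
toy_records = (
    [{"case_id": "fizzbuzz", "model": "model-a", "task_passed": i < 8} for i in range(10)]
    + [{"case_id": "fizzbuzz", "model": "model-b", "task_passed": True} for _ in range(10)]
)
summary = summarize(toy_records)

if __name__ == "__main__":
    for key, row in summary.items():
        print(key, row)
```

With only ten repeats the intervals are wide: 8/10 gives roughly 0.49 to 0.94, and even 10/10 only bounds the pass rate below at about 0.72, which is why the README reports intervals rather than bare pass rates.
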