Feat/agent eval showcase #106
Merged
68eda9e feat(eval): frontend showcase + unattended agent evaluation artifacts (cnjack)
96ce5b9 feat(eval/site): redesign showcase as warm-cream website with large i… (cnjack)
36a86cc feat(eval/site): add Chinese ICP footer and fix roundtable typo (cnjack)
fbd54d0 feat(site): product landing page, maximize showcase, real docs (cnjack)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| # Per-run artifacts are regenerated by orchestrate.py — keep them out of git. | ||
| runs/ | ||
| *.stderr | ||
|
|
||
| # Local showcase screenshots and batch logs are disposable. | ||
| showcase/assets/ | ||
| frontend_batch.log | ||
|
|
||
| # python | ||
| __pycache__/ | ||
| *.pyc |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,96 @@ | ||
| # jcode Agent — Autonomous Execution Test Harness | ||
|
|
||
| A fully-automated, **unattended** test rig that stress-tests jcode's coding agent and | ||
| produces a showcase-quality HTML report plus a ranked defect list. Built to answer one | ||
| question before an SDK ships: **can we trust the agent to run on its own?** | ||
|
|
||
| Everything is judged by **deterministic verification** against the sandbox end-state and | ||
| the recorded protocol trajectory — never the agent's own "done". | ||
|
|
||
| ## Why it's built the way it is | ||
|
|
||
| We first convened a five-seat design round-table (QA architect, eval methodologist, SRE, | ||
| security, SDK/DX — see [`roundtable/roundtable.json`](roundtable/roundtable.json)). Their | ||
| synthesis drove every design decision: | ||
|
|
||
| - **Drive the ACP surface, not the TTY.** `jcode acp` (JSON-RPC over stdio) is the exact | ||
| headless surface a future SDK sits on, and it streams a structured event trajectory we | ||
| can record and grade. The harness ([`harness/main.go`](harness/main.go)) is a real ACP | ||
| client that runs one prompt turn, auto-approves permissions, and logs every event. | ||
| - **Double isolation per run.** Each run gets (1) a throwaway `HOME` — a copied config with | ||
| a *pinned model* so the agent can never touch the operator's real `~/.jcode` (which holds | ||
| live API keys), and (2) a throwaway sandbox `cwd` with fixtures and a canary file just | ||
| outside it to detect filesystem escape. | ||
| - **Deterministic oracles + ACP contract checks.** File bytes, subprocess exit codes, | ||
| grep-over-tree, mutation checks, read-only discipline — plus per-run contract assertions | ||
| (one terminal StopReason, no orphan tool calls, pure-protocol stdout, usage reported). | ||
| - **Repeat for stability.** Cases repeat across models; we report pass@n, flakiness, and | ||
| Wilson 95% CIs — not anecdotes. | ||
|
|
||
| ## Layout | ||
|
|
||
| ``` | ||
| agent-eval/ | ||
| harness/ ACP client that drives one `jcode acp` prompt turn (Go, standalone module) | ||
| suite/ testcases.json (declarative cases) · verify.py (oracles) · orchestrate.py (runner) | ||
| analysis/ analyze.py (aggregation + log mining) · findings.json · report.py (HTML) | ||
| roundtable/ the five expert perspectives that shaped the design | ||
| runs/ per-run artifacts (git-ignored; regenerated) | ||
| report/ the generated report.html | ||
| site/ **new styled website** — open `site/index.html` to browse everything | ||
| showcase/ legacy landing page + data generator; projects also mirrored under site/ | ||
| ``` | ||
|
|
||
| ## Browse the results | ||
|
|
||
| The easiest way to read everything is to open **`site/index.html`** in a browser. | ||
| It is a self-contained, warm-cream website (inspired by open-design.ai) that links to: | ||
|
|
||
| - the full Phase 1 report (`site/report.html`) | ||
| - the six discovered defects (`site/findings.html`) | ||
| - the five-seat round-table methodology (`site/roundtable.html`) | ||
| - the running docs (`site/docs.html`) | ||
| - the live frontend showcase with **large, usable iframes** (`site/showcase.html`) | ||
|
|
||
| ## Run it | ||
|
|
||
| Requires a jcode binary and Go (to build the harness). **On macOS 26 the binary must be | ||
| built with `CGO_ENABLED=0`** — see finding F1. | ||
|
|
||
| ```bash | ||
| # 1. build a working jcode + the ACP harness | ||
| CGO_ENABLED=0 go build -o /tmp/jcode-nocgo ./cmd/jcode | ||
| ( cd agent-eval/harness && go build -o /tmp/acp-harness . ) | ||
|
|
||
| # 2. run the matrix (isolated, unattended) | ||
| python3 agent-eval/suite/orchestrate.py \ | ||
| --bin /tmp/jcode-nocgo --harness /tmp/acp-harness \ | ||
| --runs-dir agent-eval/runs --models glm-5.1,glm-5.2 --workers 5 | ||
|
|
||
| # 3. analyze + render the report | ||
| python3 agent-eval/analysis/analyze.py --runs-dir agent-eval/runs --out agent-eval/runs/analysis.json | ||
| python3 agent-eval/analysis/report.py \ | ||
| --analysis agent-eval/runs/analysis.json \ | ||
| --roundtable agent-eval/roundtable/roundtable.json \ | ||
| --findings agent-eval/analysis/findings.json \ | ||
| --runs-dir agent-eval/runs --out agent-eval/report/report.html | ||
| ``` | ||
|
|
||
| Add `--quick` to `orchestrate.py` for a 1-repeat, single-model smoke pass, or | ||
| `--cases id1,id2` / `--tiers smoke,core` to scope it. | ||
|
|
||
| ## Test cases | ||
|
|
||
| 15 cases across four tiers (`suite/testcases.json`): **smoke** (file create, read-only Q&A), | ||
| **core** (fizzbuzz, targeted edit, bug-fix-from-failing-test, multi-file refactor, test | ||
| authoring, Go build/run, search/enumerate), **stress** (ambiguous → must clarify, | ||
| impossible → must halt cleanly, long-horizon multi-step), and **safety** (destructive-command | ||
| scoping, prompt-injection via file content, planted-secret handling). | ||
|
|
||
| ## Headline findings | ||
|
|
||
| See the generated report and [`analysis/findings.json`](analysis/findings.json). The | ||
| highest-severity ones: a cgo build that **SIGABRTs on subprocess fork** on macOS 26 (F1); | ||
| model/API errors **masked as a successful `end_turn`** (F2); **no runner-level timeout** so | ||
| the agent can hang (F3, observed running `find /`); and an **unenforced filesystem/exec | ||
| boundary** (F4). | ||
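
The "double isolation per run" idea described in the README above (a throwaway `HOME`, a throwaway sandbox `cwd`, and a canary file just outside the sandbox) could be sketched roughly as follows. This is a minimal illustration, not the code in `orchestrate.py`: the names `make_isolated_run`, `canary_intact`, `CANARY_TEXT`, and the directory layout are all hypothetical.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical marker content; the real harness may use something else.
CANARY_TEXT = "canary-do-not-touch"


def make_isolated_run(root):
    """Create a throwaway HOME, a sandbox cwd, and a canary file beside the sandbox."""
    root = Path(root)
    home = root / "home"
    sandbox = root / "sandbox"
    home.mkdir(parents=True)
    sandbox.mkdir(parents=True)
    # The canary sits just outside the sandbox directory, so any change to it
    # means the agent wrote outside its working directory.
    canary = root / "CANARY"
    canary.write_text(CANARY_TEXT)
    # Point HOME at the throwaway directory so the real ~/.jcode is never read.
    env = dict(os.environ, HOME=str(home))
    return env, sandbox, canary


def canary_intact(canary):
    """True if the canary still exists with its original content."""
    return canary.exists() and canary.read_text() == CANARY_TEXT


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env, sandbox, canary = make_isolated_run(tmp)
        print("before run:", canary_intact(canary))
        canary.write_text("tampered")  # simulate a filesystem escape
        print("after escape:", canary_intact(canary))
```

A runner built this way would launch the agent with `env` and `cwd=sandbox`, then call `canary_intact` after the turn ends and record the result as one of the deterministic oracles.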
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,258 @@ | ||
| #!/usr/bin/env python3 | ||
| """Aggregate + log-analysis pass over the recorded jcode test runs. | ||
|
|
||
| Consumes the per-run record.json files produced by the orchestrator and emits | ||
| analysis.json: overall/per-model/per-case/per-tier metrics, stability | ||
| (pass@n, flakiness with Wilson CIs), token/cost accounting, and derived | ||
| failure-signature detection (non-termination, tool-call loops, silent empty | ||
| turns, error masking). The report generator renders this. | ||
| """ | ||
| import argparse | ||
| import json | ||
| import math | ||
| import os | ||
| import re | ||
| from collections import defaultdict | ||
| from pathlib import Path | ||
|
|
||
|
|
||
| def wilson(k, n, z=1.96): | ||
| if n == 0: | ||
| return (0.0, 0.0, 0.0) | ||
| p = k / n | ||
| denom = 1 + z * z / n | ||
| center = (p + z * z / (2 * n)) / denom | ||
| half = (z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))) / denom | ||
| return (round(p, 4), round(max(0, center - half), 4), round(min(1, center + half), 4)) | ||
|
|
||
|
|
||
| def load_records(runs_dir): | ||
| recs = [] | ||
| for rd in sorted(Path(runs_dir).glob("*/record.json")): | ||
| try: | ||
| recs.append(json.loads(rd.read_text())) | ||
| recs[-1]["_dir"] = str(rd.parent) | ||
| except Exception: | ||
| pass | ||
| return recs | ||
|
|
||
|
|
||
| def detect_loops(rundir): | ||
| """Max count of identical (tool_title+rawInput) tool_call events = loop signal.""" | ||
| ev = Path(rundir) / "events.jsonl" | ||
| if not ev.exists(): | ||
| return 0, 0 | ||
| counts = defaultdict(int) | ||
| total = 0 | ||
| for line in ev.read_text(errors="ignore").splitlines(): | ||
| line = line.strip() | ||
| if not line: | ||
| continue | ||
| try: | ||
| d = json.loads(line) | ||
| except Exception: | ||
| continue | ||
| if d.get("kind") != "session_update": | ||
| continue | ||
| u = d.get("data", {}) | ||
| if u.get("sessionUpdate") == "tool_call": | ||
| total += 1 | ||
| key = json.dumps([u.get("title"), u.get("rawInput")], sort_keys=True) | ||
| counts[key] += 1 | ||
| return (max(counts.values()) if counts else 0), total | ||
|
|
||
|
|
||
| def load_pricing(cache_path): | ||
| """Best-effort {model_id_substr: {input, output} per 1M tokens} from models.dev.""" | ||
| pricing = {} | ||
| try: | ||
| data = json.loads(Path(cache_path).read_text()) | ||
| except Exception: | ||
| return pricing | ||
| def walk(obj): | ||
| if isinstance(obj, dict): | ||
| mid = obj.get("id") | ||
| cost = obj.get("cost") | ||
| if isinstance(mid, str) and isinstance(cost, dict) and ("input" in cost or "output" in cost): | ||
| pricing[mid] = {"input": cost.get("input", 0), "output": cost.get("output", 0)} | ||
| for v in obj.values(): | ||
| walk(v) | ||
| elif isinstance(obj, list): | ||
| for v in obj: | ||
| walk(v) | ||
| walk(data) | ||
| return pricing | ||
|
|
||
|
|
||
| def est_cost(model_id, usage, pricing): | ||
| model = model_id.split("/")[-1] | ||
| pr = pricing.get(model) | ||
| if not pr: | ||
| for k, v in pricing.items(): | ||
| if model and (k in model or model in k): | ||
| pr = v | ||
| break | ||
| if not pr: | ||
| return None | ||
| inp = (usage.get("prompt", 0)) / 1e6 * pr.get("input", 0) | ||
| out = (usage.get("completion", 0)) / 1e6 * pr.get("output", 0) | ||
| return round(inp + out, 6) | ||
|
|
||
|
|
||
| def main(): | ||
| ap = argparse.ArgumentParser() | ||
| ap.add_argument("--runs-dir", required=True) | ||
| ap.add_argument("--out", required=True) | ||
| ap.add_argument("--cache", default=os.path.expanduser("~/.jcode/cache/models_dev.json")) | ||
| args = ap.parse_args() | ||
|
|
||
| recs = load_records(args.runs_dir) | ||
| pricing = load_pricing(args.cache) | ||
|
|
||
| for r in recs: | ||
| mrep, mtot = detect_loops(r["_dir"]) | ||
| r["_max_repeat_toolcall"] = mrep | ||
| r["_total_toolcall_events"] = mtot | ||
| r["_cost"] = est_cost(r.get("model_id", ""), r.get("usage_total", {}), pricing) | ||
| # silent empty turn: claimed end_turn but produced nothing | ||
| r["_silent_empty"] = (r.get("stop_reason") == "end_turn" | ||
| and r.get("agent_chunks", 0) == 0 | ||
| and r.get("tool_calls", 0) == 0) | ||
| # non-termination / abnormal stop | ||
| r["_nonterminal"] = r.get("stop_reason") not in ("end_turn",) | ||
|
|
||
| overall = { | ||
| "total_runs": len(recs), | ||
| "task_pass": sum(1 for r in recs if r.get("task_passed")), | ||
| "contract_pass": sum(1 for r in recs if r.get("contracts_passed")), | ||
| "clean_termination": sum(1 for r in recs if r.get("stop_reason") == "end_turn"), | ||
| "silent_empty_turns": sum(1 for r in recs if r.get("_silent_empty")), | ||
| "total_tokens": sum(r.get("usage_total", {}).get("total", 0) for r in recs), | ||
| "total_wall_s": round(sum(r.get("wall_s", 0) or 0 for r in recs), 1), | ||
| } | ||
| tp = wilson(overall["task_pass"], overall["total_runs"]) | ||
| overall["task_pass_rate"] = tp[0] | ||
| overall["task_pass_ci"] = [tp[1], tp[2]] | ||
| costs = [r["_cost"] for r in recs if r.get("_cost") is not None] | ||
| overall["total_cost_est"] = round(sum(costs), 4) if costs else None | ||
|
|
||
| # per-model | ||
| by_model = defaultdict(list) | ||
| for r in recs: | ||
| by_model[r.get("model")].append(r) | ||
| models = {} | ||
| for m, rs in by_model.items(): | ||
| n = len(rs) | ||
| k = sum(1 for r in rs if r.get("task_passed")) | ||
| p, lo, hi = wilson(k, n) | ||
| toks = sum(r.get("usage_total", {}).get("total", 0) for r in rs) | ||
| mc = [r["_cost"] for r in rs if r.get("_cost") is not None] | ||
| recov = error_recovery(rs) | ||
| models[m] = { | ||
| "runs": n, "task_pass": k, "pass_rate": p, "ci": [lo, hi], | ||
| "contract_pass": sum(1 for r in rs if r.get("contracts_passed")), | ||
| "clean_termination": sum(1 for r in rs if r.get("stop_reason") == "end_turn"), | ||
| "nonterminal": sum(1 for r in rs if r.get("_nonterminal")), | ||
| "silent_empty": sum(1 for r in rs if r.get("_silent_empty")), | ||
| "avg_tool_calls": round(sum(r.get("tool_calls", 0) for r in rs) / n, 2), | ||
| "avg_wall_s": round(sum(r.get("wall_s", 0) or 0 for r in rs) / n, 1), | ||
| "total_tokens": toks, | ||
| "avg_tokens": round(toks / n) if n else 0, | ||
| "cost_est": round(sum(mc), 4) if mc else None, | ||
| "max_repeat_toolcall": max((r["_max_repeat_toolcall"] for r in rs), default=0), | ||
| "error_recovery": recov, | ||
| } | ||
|
|
||
| # per-case (pass@n + flakiness), split by model | ||
| by_case = defaultdict(list) | ||
| for r in recs: | ||
| by_case[r.get("case_id")].append(r) | ||
| cases = {} | ||
| for cid, rs in by_case.items(): | ||
| per_model = defaultdict(list) | ||
| for r in rs: | ||
| per_model[r.get("model")].append(r) | ||
| cmodels = {} | ||
| flaky = False | ||
| for m, mrs in per_model.items(): | ||
| n = len(mrs) | ||
| k = sum(1 for r in mrs if r.get("task_passed")) | ||
| if 0 < k < n: | ||
| flaky = True | ||
| cmodels[m] = {"n": n, "pass": k, "rate": round(k / n, 3) if n else 0} | ||
| cases[cid] = { | ||
| "title": rs[0].get("case_title"), | ||
| "category": rs[0].get("category"), | ||
| "tier": rs[0].get("tier"), | ||
| "n": len(rs), | ||
| "pass": sum(1 for r in rs if r.get("task_passed")), | ||
| "flaky": flaky, | ||
| "by_model": cmodels, | ||
| "avg_tool_calls": round(sum(r.get("tool_calls", 0) for r in rs) / len(rs), 2), | ||
| } | ||
|
|
||
| # per-tier | ||
| by_tier = defaultdict(list) | ||
| for r in recs: | ||
| by_tier[r.get("tier")].append(r) | ||
| tiers = {} | ||
| for t, rs in by_tier.items(): | ||
| n = len(rs) | ||
| k = sum(1 for r in rs if r.get("task_passed")) | ||
| tiers[t] = {"n": n, "pass": k, "rate": round(k / n, 3) if n else 0} | ||
|
|
||
| # failure signatures | ||
| signatures = { | ||
| "non_termination": [r["run_id"] for r in recs if r.get("_nonterminal")], | ||
| "silent_empty_turn": [r["run_id"] for r in recs if r.get("_silent_empty")], | ||
| "tool_loop_suspects": [r["run_id"] for r in recs if r.get("_max_repeat_toolcall", 0) >= 3], | ||
| "contract_violations": [ | ||
| {"run_id": r["run_id"], "failed": [c["type"] for c in r.get("contracts", []) if not c["passed"]]} | ||
| for r in recs if not r.get("contracts_passed")], | ||
| "usage_absent_on_acp_stream": sum(1 for r in recs if not r.get("usage_on_acp_stream")), | ||
| "usage_absent_pct": round(100 * sum(1 for r in recs if not r.get("usage_on_acp_stream")) / len(recs), 1) if recs else 0, | ||
| } | ||
|
|
||
| # oracle-level failure tally (which checks fail most) | ||
| oracle_fail = defaultdict(int) | ||
| for r in recs: | ||
| if not r.get("task_passed"): | ||
| for o in r.get("oracles", []): | ||
| if not o["passed"]: | ||
| oracle_fail[f"{r['case_id']}:{o['type']}"] += 1 | ||
|
|
||
| analysis = { | ||
| "overall": overall, | ||
| "models": models, | ||
| "cases": cases, | ||
| "tiers": tiers, | ||
| "signatures": signatures, | ||
| "oracle_failures": dict(sorted(oracle_fail.items(), key=lambda x: -x[1])), | ||
| "run_index": [ | ||
| {k: r.get(k) for k in ["run_id", "case_id", "model", "tier", "category", | ||
| "task_passed", "contracts_passed", "stop_reason", "tool_calls", | ||
| "wall_s", "_silent_empty", "_max_repeat_toolcall", "_cost", | ||
| "usage_on_acp_stream"]} | {"tokens": r.get("usage_total", {}).get("total", 0)} | ||
| for r in recs], | ||
| } | ||
| Path(args.out).write_text(json.dumps(analysis, indent=2, default=str)) | ||
| print(f"wrote {args.out}: {overall['task_pass']}/{overall['total_runs']} pass, " | ||
| f"contract {overall['contract_pass']}/{overall['total_runs']}, " | ||
| f"{overall['total_tokens']} tokens") | ||
|
|
||
|
|
||
| def error_recovery(rs): | ||
| """Fraction of runs that hit a failed tool status but still finished end_turn.""" | ||
| hit = 0 | ||
| recovered = 0 | ||
| for r in rs: | ||
| statuses = list((r.get("tool_status_end", {}) or {}).values()) | ||
| if "failed" in statuses: | ||
| hit += 1 | ||
| if r.get("stop_reason") == "end_turn": | ||
| recovered += 1 | ||
| return {"runs_with_tool_failure": hit, "recovered_end_turn": recovered} | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| main() |
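
As a sanity check on the `wilson` helper in the diff above, the sketch below restates the same formula with indentation restored and runs it on two small inputs. The inputs (8 of 10, 10 of 10) are illustrative only and are not results from this PR.

```python
import math


def wilson(k, n, z=1.96):
    """Wilson score interval for k successes in n trials.

    Returns (point estimate, lower bound, upper bound), each rounded to 4 places.
    """
    if n == 0:
        return (0.0, 0.0, 0.0)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))) / denom
    return (round(p, 4), round(max(0, center - half), 4), round(min(1, center + half), 4))


if __name__ == "__main__":
    # 8 passes out of 10 runs: the interval is wide and asymmetric around 0.8.
    print(wilson(8, 10))
    # 10 of 10: the upper bound is 1, but the lower bound stays well below 1.
    print(wilson(10, 10))
```

With only a handful of repeats per case, the interval stays wide even for a perfect pass record, which is why the report pairs pass rates with CIs rather than quoting the rate alone.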
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Reproduction command writes analysis.json to the wrong (git-ignored) location.
`analyze.py --out agent-eval/runs/analysis.json` writes into `runs/`, which this same PR's `.gitignore` excludes. But the PR stack's checked-in generated artifact lives at `agent-eval/report/analysis.json` (per the "Generated analysis and report artifacts" layer). Following these exact instructions won't reproduce the committed report artifact path, so it is worth aligning the doc (or the actual `--out` default) with where results are actually meant to land.