feat(e2e-harness): drive and snapshot the real wizard TUI #702
gewenyu99 wants to merge 40 commits into
…ord/replay

A control plane over the TUI store that drives the wizard end-to-end with no terminal and no browser, for CI/e2e and agent-driven testing. The render is a pure function of the nanostore, so driving committed state == driving the UI.

Core files (src/lib/ci-driver/):
- wizard-ci-driver.ts: read_state / list_actions / perform_action over a live WizardStore. read_state is a truthful, secret-free projection of committed state (+ derived currentScreen); perform_action commits via the exact store setter the Ink screen's key handler calls.
- action-registry.ts: declarative screen -> commit-action map (exhaustive over ScreenId/Overlay). The actuation surface: name an action, not a keystroke.
- wizard-ci-tools.ts: in-process MCP server exposing the three tools, so an external harness or LLM can drive a real run.
- e2e-profile.ts: WizardE2eProfile, a program's declarative e2e test definition (the UI choices). decideE2eAction(state, profile) maps screen -> commit, so the harness is generic and the choices live on the program.
- recorder.ts: captures a frame at each key moment (route/task/status/runPhase/overlay change) off the store's version counter; redacts the access token.
- replay.ts: reconstructs a throwaway store per frame and renders the REAL Ink screen back to ANSI, so a run replays in the terminal.
- DRIVING-E2E-FROM-AN-AGENT.md: how a future agent drives these.
- __tests__/: control-plane walk, flow snapshot (TUI-snapshot analog), recorder.

Programs declare their flow's UI choices:
- programs/program-step.ts: ProgramConfig.e2e?: WizardE2eProfile.
- programs/posthog-integration/index.ts: the integration program's e2e profile.

Harness/entry scripts:
- scripts/e2e-full-run.no-jest.ts: headless full run: real WizardStore + InkUI (never rendered) + concurrent driver + real runAgent; emits a structured result + a recording.
- scripts/replay-e2e.no-jest.ts: replay a recording in the terminal.
- scripts/ci-driver-demo.ts: offline control-plane demo (no agent).

Additive; no core wizard behavior changed. The workbench `wizard-ci --e2e` (PostHog/wizard-workbench) orchestrates these against real test apps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
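As a rough illustration of the "decideE2eAction(state, profile) maps screen -> commit" idea from this commit, here is a minimal sketch. The screen ids, the profile fields, and every action id other than `confirm_setup` and `dismiss_outage` are hypothetical; the real `WizardE2eProfile` shape is not shown in this PR page and may differ.

```typescript
// Hypothetical screen ids; the real ScreenId union lives in the wizard.
type ScreenId = 'intro' | 'health-check' | 'setup' | 'mcp' | 'auth' | 'run' | 'outro';

interface DriverState {
  currentScreen: ScreenId;
}

// Hypothetical profile shape: the UI choices a headless run should take.
interface WizardE2eProfile {
  setupOptionIndex: number; // which option to pick on the setup screen
  skipMcp: boolean; // whether to skip the optional MCP step
}

interface E2eDecision {
  action: string;
  args: Record<string, unknown>;
}

// Pure function: same state + same profile always yields the same commit.
function decideE2eAction(state: DriverState, profile: WizardE2eProfile): E2eDecision | null {
  switch (state.currentScreen) {
    case 'intro':
      return { action: 'confirm_setup', args: {} };
    case 'health-check':
      return { action: 'dismiss_outage', args: {} };
    case 'setup':
      // 'choose_setup_option' is an assumed action id.
      return { action: 'choose_setup_option', args: { index: profile.setupOptionIndex } };
    case 'mcp':
      // 'skip_mcp' / 'install_mcp' are assumed action ids.
      return { action: profile.skipMcp ? 'skip_mcp' : 'install_mcp', args: {} };
    case 'auth':
    case 'run':
    case 'outro':
      return null; // nothing for the profile to commit on these screens
  }
}
```

Because the switch is exhaustive over `ScreenId`, adding a screen to the union without handling it becomes a compile error under strict settings, which keeps the profile logic in step with the screen list.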
🧙 Wizard CI

Run the Wizard CI and test your changes against wizard-workbench example apps by replying with a GitHub comment using one of the following commands:

Test all apps:
Test all apps in a directory:
Test an individual app:

Results will be posted here when complete.
The e2e UI-choices object moves out of index.ts into a co-located e2e.ts (POSTHOG_INTEGRATION_E2E_PROFILE), keeping the program config lean and the flow's test definition in its own file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scripts/record-demo.no-jest.ts — produces a recording offline (no agent, no network) by driving the integration flow with the e2e profile + a WizardRecorder, so `replay-e2e.no-jest.ts` can be tried without a full run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scripts/README.md documents the manual control-plane + record/replay tools (what each does, what it needs, how to run). Also commits ci-driver-live-agent.ts (real gateway LLM drives the wizard-ci-tools MCP server) so the index is complete. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
main added two confirm-and-continue intro screens (WarehouseIntro, SelfDrivingIntro, both call store.completeSetup()). The action-registry exhaustiveness test flagged them as uncovered. Register both as confirm_setup in ACTION_REGISTRY and in the e2e walk policy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
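A minimal sketch of how an exhaustiveness check like the one mentioned above could flag a newly added screen. The screen id strings and the `uncoveredScreens` helper are hypothetical; only `confirm_setup`, `ACTION_REGISTRY`, and the two intro screens are named in this commit.

```typescript
// Hypothetical id strings for the screens; the real ids may be spelled differently.
const SCREEN_IDS = ['intro', 'warehouse-intro', 'self-driving-intro', 'auth', 'run'] as const;
type ScreenId = (typeof SCREEN_IDS)[number];

// Screens with nothing to commit are registered explicitly so they still count as covered.
const NO_ACTION: readonly string[] = [];

const ACTION_REGISTRY: Partial<Record<ScreenId, readonly string[]>> = {
  intro: ['confirm_setup'],
  'warehouse-intro': ['confirm_setup'],
  'self-driving-intro': ['confirm_setup'],
  auth: NO_ACTION,
  run: NO_ACTION,
};

// Returns every screen that has no registry entry at all.
function uncoveredScreens(registry: Partial<Record<ScreenId, readonly string[]>>): ScreenId[] {
  return SCREEN_IDS.filter((id) => registry[id] === undefined);
}
```

A test can then assert that `uncoveredScreens(ACTION_REGISTRY)` is empty, so a screen added on main without a registry entry fails the suite instead of silently stalling a headless run.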
…l refs Move DRIVING-E2E-FROM-AN-AGENT.md → ARCHITECTURE.md to match the co-located subsystem-doc convention (cf. programs/self-driving/ARCHITECTURE.md). Remove content that shouldn't ship in the public repo: the internal test project id + team name, the workbench test-api-key.txt secret file, and pointers to workbench-only scratch files. Keep the architecture, profiles, record/replay, and MCP-loop guidance; generalize the run instructions. Update the scripts/README link. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scripts/render-snapshots.no-jest.ts renders every key-moment frame of a recording to a real-Ink ANSI snapshot (one <seq>-<screen>.ans per frame), via replay's renderFrame under tsx. These feed the workbench visual-regression flow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
None of the control-plane / recording / e2e machinery belongs in the wizard's production source. Relocate src/lib/ci-driver/ → e2e-harness/ at the repo root (next to e2e-tests/), and sever every prod coupling:
- Remove the ProgramConfig.e2e field (program-step.ts) and the on-program profile (delete posthog-integration/e2e.ts, unwire index.ts). Per-program profiles now live in the harness: e2e-harness/profiles.ts, profileFor(programId).
- Add an @e2e-harness/* path alias (tsconfig.build.json + jest moduleNameMapper); repoint scripts/tests off @lib/ci-driver.

Result: src/ has ZERO references to the harness, and the published tsdown bundle contains none of it (previously the ~90-byte profile object shipped). Full suite (1045 tests, 3 snapshots) passes; real-recording render verified under tsx.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ARCHITECTURE.md now documents the wizard-ci-snapshots visual-regression flow (real run → render → diff → side-by-side report) and the env it needs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…gram A test/ README documents this program's e2e test definition — the path the headless run walks and the option it auto-takes at each screen (confirm intro, dismiss outage, first setup option, skip mcp/slack, delete skills). It's the human description; the runnable profile stays in e2e-harness/profiles.ts. No e2e machinery returns to prod src — this is documentation only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…oads Each program declares its e2e test path as src/lib/programs/<program>/test/e2e.json — a `profile` (the options the headless run auto-takes) plus a documented `path` of every screen. The harness imports the `profile` in e2e-harness/profiles.ts (single source of truth, no prose duplication). Matches the repo's existing JSON-data pattern (mcp-role-prompts.copy.json); resolveJsonModule already on. It's data, imported only by the harness — zero prod imports, absent from the tsdown bundle. Full harness suite + runtime load verified. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the end-to-end trace (agent → perform_action → driver → action-registry → store.completeSetup → emitChange → router re-resolve → readState) as a comment at the perform_action tool, with cross-referenced breadcrumbs at the driver hop (one committed mutation per call) and the action-registry hop (the store setter + flag-flip the screen sequence reacts to). Harness-only; prod store.ts untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…dule Add a header note to wizard-ci-tools / wizard-ci-driver / action-registry / recorder / replay: each lives in e2e-harness/, is imported only by scripts/tests, and is absent from the tsdown bundle (bin.ts is the only entry). Addresses the "this looks shippable" worry right where a reader meets the code (esp. the MCP server + SDK import). Verified: no e2e symbols in dist/. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This is a test of snapshotting not a snapshot
Instruments the interactivity. We can basically build branching CI on every path we care about.
Moving the trace / never-ships / credentials notes to PR review comments anchored to the lines instead — keep the source uncluttered. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
      createSdkMcpServer: (opts: unknown) => unknown;
    }> {
      if (!_sdkModule) {
        _sdkModule = await import('@anthropic-ai/claude-agent-sdk');
Doesn't pollute prod. This imports the agent SDK and the module builds an MCP server — but the whole harness lives in e2e-harness/, out of src/. No production code imports it, and bin.ts is the only tsdown entry, so it's absent from the published bundle. Verified by grepping every dist/*.js for wizard-ci-tools / WizardCiDriver / read_state → zero hits. (The SDK is dynamically imported so the module also loads where the SDK is jest-mocked.)
      }),
    );

    const performAction = tool(
End to end, one perform_action is a single committed store mutation that re-derives the screen:
agent → mcp__wizard-ci-tools__perform_action {action:"confirm_setup"}
→ driver.performAction("confirm_setup", {})
→ actionsForScreen("intro") finds confirm_setup
→ apply → store.completeSetup()
→ $session.setKey("setupConfirmed", true); emitChange()
→ $version 0→1 → router.resolve(session) skips intro
(isComplete) → returns "health-check"
→ driver.readState() → { currentScreen:"health-check", actions:[dismiss_outage], … }
The caller then calls read_state and picks the next action. The screen is re-derived from session state, never navigated to.
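The trace above can be condensed into a small, self-contained sketch. `MiniStore`, `MiniDriver`, and `resolveScreen` are hypothetical stand-ins for the real WizardStore, driver, and router, reduced to the two screens the trace mentions.

```typescript
type ScreenId = 'intro' | 'health-check';

interface Session {
  setupConfirmed: boolean;
}

// Stand-in for the store: one setter, one version counter.
class MiniStore {
  session: Session = { setupConfirmed: false };
  version = 0;

  completeSetup(): void {
    this.session = { ...this.session, setupConfirmed: true };
    this.version += 1; // stands in for emitChange()
  }
}

// Stand-in for router.resolve: the screen is derived from session state on every read.
function resolveScreen(session: Session): ScreenId {
  return session.setupConfirmed ? 'health-check' : 'intro';
}

interface DriverAction {
  id: string;
  apply: (store: MiniStore) => void;
}

const ACTIONS: Record<ScreenId, DriverAction[]> = {
  intro: [{ id: 'confirm_setup', apply: (store) => store.completeSetup() }],
  // Placeholder: the effect of dismiss_outage is not shown in the trace.
  'health-check': [{ id: 'dismiss_outage', apply: () => {} }],
};

class MiniDriver {
  private readonly store: MiniStore;

  constructor(store: MiniStore) {
    this.store = store;
  }

  readState() {
    const currentScreen = resolveScreen(this.store.session);
    return {
      currentScreen,
      version: this.store.version,
      actions: ACTIONS[currentScreen].map((a) => a.id),
    };
  }

  // One committed mutation per call, then the screen is re-derived.
  performAction(id: string) {
    const { currentScreen } = this.readState();
    const action = ACTIONS[currentScreen].find((a) => a.id === id);
    if (!action) throw new Error(`"${id}" is not legal on screen "${currentScreen}"`);
    action.apply(this.store);
    return this.readState();
  }
}
```

Note that the driver never sets `currentScreen` directly; it only commits state, and the next `readState()` reports whatever screen that state resolves to.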
    detectedFrameworkLabel: s.detectedFrameworkLabel,
    detectionComplete: s.detectionComplete,
    setupConfirmed: s.setupConfirmed,
    hasCredentials: s.credentials !== null,
Secrets never reach a driver LLM. Credentials are reduced to hasCredentials + projectId right here — the accessToken is never serialized into read_state. So the whole state snapshot is safe to hand an external model.
Important for safety in CI. No leaked keys
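A minimal sketch of the secret-free projection this thread describes. The `Session` and `Credentials` shapes and the `projectSession` name are assumptions built around the four fields quoted in the diff, plus the `projectId` the comment mentions.

```typescript
// Assumed shapes; only the field names quoted in the diff come from the PR.
interface Credentials {
  accessToken: string;
  projectId: number;
}

interface Session {
  detectedFrameworkLabel: string | null;
  detectionComplete: boolean;
  setupConfirmed: boolean;
  credentials: Credentials | null;
}

interface SessionView {
  detectedFrameworkLabel: string | null;
  detectionComplete: boolean;
  setupConfirmed: boolean;
  hasCredentials: boolean;
  projectId: number | null;
}

// Copies fields one by one, so a new secret on Session never leaks by default.
function projectSession(s: Session): SessionView {
  return {
    detectedFrameworkLabel: s.detectedFrameworkLabel,
    detectionComplete: s.detectionComplete,
    setupConfirmed: s.setupConfirmed,
    hasCredentials: s.credentials !== null,
    projectId: s.credentials?.projectId ?? null,
  };
}
```

Building the view as an explicit allowlist, instead of spreading the session and deleting the token, means the safe behavior is the default when the session type grows.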
    const confirmSetupAction: DriverAction = {
      id: 'confirm_setup',
      description: 'Confirm the intro and continue (sets setupConfirmed).',
      apply: (store) => store.completeSetup(),
Actuation, not keystrokes. apply calls the exact store setter the Ink key handler would: completeSetup() does setKey('setupConfirmed', true) + emitChange(). One commit per action; router.resolve then treats the intro as complete and renders the next screen. The driver names an action — it never injects a keystroke or sees in-progress React-local input.
    if (!session.credentials) return session;
    return {
      ...session,
      credentials: { ...session.credentials, accessToken: 'phx_***redacted***' },
Recordings redact the token too. Every captured frame runs through redactSession, so accessToken becomes phx_***redacted***. Combined with read_state never serializing it, recordings are safe to share as artifacts.
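For reference, the quoted fragment can be read as the body of a function like the following. This is a sketch: the `Session` and `Credentials` types are assumptions, and only the function name and the body lines come from the diff and the comment.

```typescript
// Assumed shapes for illustration.
interface Credentials {
  accessToken: string;
  projectId: number;
}

interface Session {
  setupConfirmed: boolean;
  credentials: Credentials | null;
}

// Returns a copy with the token masked; the input session is left untouched.
function redactSession(session: Session): Session {
  if (!session.credentials) return session;
  return {
    ...session,
    credentials: { ...session.credentials, accessToken: 'phx_***redacted***' },
  };
}
```

Returning a new object matters here: the live store keeps its real token for the run, while only the copy written into a recording is masked.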
I will remove this before merging, same with the other demo files in this dir
Drop the three scripts that were scaffolding while building, not part of the shipped feature:
- ci-driver-demo.ts: offline no-agent control-loop demo (covered by tests)
- ci-driver-live-agent.ts: manual LLM-drives-MCP proof (needs a key)
- record-demo.no-jest.ts: offline sample-recording generator (real --e2e records)

Keep the three the workbench actually orchestrates: e2e-full-run, render-snapshots, replay-e2e. Update scripts/README.md + ARCHITECTURE.md accordingly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
EXPLORING-AS-AN-AGENT.md — a runbook for an agent that wants to run/drive/explore the wizard headlessly: ask the user for a key file path + set env, then either a full `wizard-ci --e2e` run or a hand-driven read_state→perform_action loop, with renderFrame to snapshot the TUI for itself to view. Gives wizard-ci-tools its documented use (agentic exploration). Recipe smoke-tested (intro → health-check, renders the real screen). ARCHITECTURE.md points at it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nt behavior
- README: add "Explore with an agent" under Running locally → Testing (was wrongly placed in the workbench README).
- scripts/README: drop the cross-PR pointer to the #703 repro scripts.
- Trim header/inline comments across the harness + scripts to concise descriptions of what the code does now: no history, no change-rationale.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Move e2e-harness/EXPLORING-AS-AN-AGENT.md into .claude/skills/exploring-the-wizard/ so an agent auto-discovers it. Repoint the README + ARCHITECTURE links and list it in AGENTS.md. ARCHITECTURE.md stays co-located as the how-it-works reference. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-by-turn scripts/wizard-ci-mcp.no-jest.ts is a stdio MCP server over one live WizardStore: read_state / list_actions / perform_action / render_screen / run_agent. An agent registers it and makes every decision live, instead of the static scripted run. Rewrite the exploring-the-wizard skill to lead with this. Bump zod ^3.24→^3.25 (the MCP SDK needs the zod/v3 subpath; non-breaking) and add the SDK as a dep. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
read_state already returns the legal actions, so the separate tool is noise. Keeps the server's surface minimal: read_state, perform_action, render_screen, run_agent. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…hange Running prettier on these (not in lint-staged) reflowed the whole files — pure diff noise. Restore them to main and re-apply just the intended edits: the "Explore with an agent" section + the exploring-the-wizard skill row.
…d runbook EXPLORING-AS-AN-AGENT.md was promoted to .claude/skills/exploring-the-wizard/; this pointer fix was left uncommitted, so HEAD still linked the deleted file.
…ion start The skill told agents to `claude mcp add` then immediately call the tools, which is impossible (MCP servers load at session start), so agents fell back to a script. Lead with the in-session way that actually works — a WizardCiDriver script (read_state → perform_action → renderFrame), tested — and document the MCP server as the interactive option that needs registering before a fresh session.
…with it Connect the stdio transport first and build the store lazily on the first tool call — detection + the networked health probe used to run before connect(), which could stall the MCP handshake so Claude Code saw the server as broken. Verified end-to-end: `claude mcp add` → `claude mcp list` shows ✔ Connected → a headless session drove read_state → perform_action(confirm_setup) → auth → render_screen. Skill now leads with the two-phase MCP flow (register, then drive in a fresh session, since MCP tools bind at session start); the driver script is the fallback.
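The "connect first, build the store lazily on the first tool call" fix can be sketched with a memoized async initializer. The `lazy` helper, `buildStore`, and the `FakeStore` shape are hypothetical; the real server's wiring is not shown in this commit.

```typescript
// Wraps an expensive async builder so it runs at most once, on first use.
function lazy<T>(build: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    // Concurrent first calls share the same in-flight promise.
    // Note: a rejected build stays cached; a real server may want to retry.
    if (!pending) pending = build();
    return pending;
  };
}

// Hypothetical stand-in for the slow part (detection + networked health probe).
interface FakeStore {
  ready: true;
}

let buildCount = 0;

async function buildStore(): Promise<FakeStore> {
  buildCount += 1;
  return { ready: true };
}

const getStore = lazy(buildStore);

// Each tool handler would start with `const store = await getStore();`,
// so the transport handshake never waits on the build.
```

With this shape, nothing slow runs at startup, and the first tool call pays the cost once.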
…drives in one session Register wizard-ci in .mcp.json so its tools are bound in every session in this repo. An agent following the exploring-the-wizard skill now drives the wizard over MCP (open_app -> read_state -> perform_action -> render_screen -> run_agent) without registering anything or starting a fresh session. The server boots app-agnostic; open_app picks the app + key at call time, so the committed config holds no secrets. Skill + README rewritten to the one-session MCP flow. Verified: a fresh headless agent given only the skill drove the wizard with four MCP calls and wrote zero scripts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Just say to point appDir at the directory that has the package.json. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
appDir is just the throwaway copy of the app; let the agent find the path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
auth (and run) are NO_ACTION screens: session.credentials is set only inside bootstrapProgram, which runs via run_agent. So nothing advances past auth without run_agent — but the tool description said "call when currentScreen=run" and the skill walk skipped auth, so an agent landed on auth and polled instead of calling run_agent. Fix the run_agent description and the skill walk/key-facts to say run_agent bootstraps creds and advances auth+run; don't poll those screens. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ves the run
A real run_agent call blocked the stdio MCP server for ~3 minutes; the client
treated the server as unhealthy, reconnected, and the restarted process lost its
in-memory store ("No app open", runPhase reset to idle). run_agent now starts the
integration in the background and returns immediately; read_state stays responsive
and reports runPhase running -> completed plus an integration status, so the agent
polls instead of blocking. Skill + tool descriptions updated to the poll model;
noted that run_agent creates real PostHog resources each run.
Proven: run_agent returns in 0.0s; read_state during the run answers in 1-2ms with
runPhase=running.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
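A minimal sketch of the poll model described in this commit: the run starts in the background, the call returns immediately, and state reads stay cheap. `RunController` and the `'error'` phase are assumptions; the commit itself only names idle, running, and completed.

```typescript
// 'error' is an assumed extra phase; the commit names idle, running, completed.
type RunPhase = 'idle' | 'running' | 'completed' | 'error';

class RunController {
  private phase: RunPhase = 'idle';
  private error: string | null = null;

  // Kicks off the work without awaiting it, so the tool call returns at once.
  runAgent(work: () => Promise<void>): { started: boolean } {
    if (this.phase === 'running') return { started: false };
    this.phase = 'running';
    this.error = null;
    void Promise.resolve()
      .then(work)
      .then(
        () => {
          this.phase = 'completed';
        },
        (err: unknown) => {
          this.phase = 'error';
          this.error = String(err);
        },
      );
    return { started: true };
  }

  // Cheap, synchronous read that the caller polls while the run is in flight.
  readState(): { runPhase: RunPhase; error: string | null } {
    return { runPhase: this.phase, error: this.error };
  }
}
```

The caller then polls `readState()` until `runPhase` leaves `running`, instead of holding one long request open against the stdio server.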
…or both routes

Both e2e routes run the real wizard TUI (startTUI) driven by store state manipulation, with no keystrokes, and capture the real rendered screen from a PTY. Auth is satisfied by setCredentials with the phx key (same bearer as an OAuth token), so the TUI advances with no browser.
- e2e-harness/tui-capture.ts: run a command in a PTY (node-pty), read its screen via @xterm/headless.
- scripts/tui-host.no-jest.ts: the real-TUI host. MODE=fixed self-drives the fixed e2e profile, signals each screen, writes a structured result JSON; MODE=serve takes drive commands over a unix socket.
- scripts/tui-snapshots.no-jest.ts: CI route, a real-TUI text snapshot per screen.
- scripts/wizard-ci-mcp.no-jest.ts: agent route, an MCP server proxying the host.
- scripts/wizard-ci-explore.no-jest.ts: drive the MCP route, print the real TUI.
- scripts/tui-replay.no-jest.ts: replay captured snapshots in the terminal.

Deletes the record-then-reconstruct machinery (recorder, replay, e2e-full-run, render-snapshots, replay-e2e) and the in-process wizard-ci-tools server. Adds node-pty + @xterm/headless.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sition Snapshot on key moments — a screen change, a task-list update, or a runPhase change — via a store subscription, and snap each screen before the driver acts on it. The run screen (the agent working) is captured as it progresses, and fast transitions (intro/auth/outro/mcp/slack) are no longer skipped by throttling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
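One way to express "snapshot on key moments" is to fingerprint the three things the commit lists and record only when the fingerprint changes. The `Frame` shape and the `KeyMomentRecorder` class are hypothetical illustrations, not the harness code.

```typescript
// Assumed minimal view of what a store subscription would hand us.
interface Frame {
  currentScreen: string;
  tasks: string[];
  runPhase: string;
}

// A key moment is any change to screen, task list, or run phase.
function keyOf(frame: Frame): string {
  return JSON.stringify([frame.currentScreen, frame.tasks, frame.runPhase]);
}

class KeyMomentRecorder {
  private lastKey: string | null = null;
  readonly snapshots: Frame[] = [];

  // Intended to be called from a store subscription on every change.
  // Returns true when the change was a key moment and a snapshot was taken.
  onChange(frame: Frame): boolean {
    const key = keyOf(frame);
    if (key === this.lastKey) return false;
    this.lastKey = key;
    // Copy the frame so later mutations cannot alter the recorded snapshot.
    this.snapshots.push({ ...frame, tasks: [...frame.tasks] });
    return true;
  }
}
```

Because the decision depends on content and not on elapsed time, a fast transition still produces a snapshot, while repeated notifications with identical state do not.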
…ed loop Snapshot on every key-moment change (no throttle spacing, just a settle). And don't await the driver loop at exit — on the cheap (no-agent) path it's parked in waitForChange, so awaiting it hung the process and exited non-zero, which would fail CI. The process now exits 0 cleanly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The fixed CI route always drives the full real agent run — a no-agent path was pointless (and is what hung at exit). Removes the RUN_AGENT branch and the auth-by-state shortcut it needed in fixed mode; auth is bootstrapped by the run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
node-pty ships no linux-x64 prebuilt, so CI must compile it; pnpm 10 blocks build scripts unless allowlisted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ink renders non-interactively when it detects CI (CI / GITHUB_ACTIONS), leaving the captured xterm buffer blank. Strip them from the spawned host's env. Verified locally: with CI=true, render_screen now returns the real TUI instead of blank. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
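A small sketch of the env stripping described above. The `hostEnv` helper is hypothetical; the two variable names come from the commit message.

```typescript
// Variables the commit says make Ink render non-interactively.
const CI_ENV_KEYS: readonly string[] = ['CI', 'GITHUB_ACTIONS'];

// Builds the env for the spawned host: a copy of the parent env minus the CI markers.
function hostEnv(parent: Record<string, string | undefined>): Record<string, string> {
  const env: Record<string, string> = {};
  for (const [key, value] of Object.entries(parent)) {
    if (value === undefined) continue; // skip unset entries
    if (CI_ENV_KEYS.includes(key)) continue; // drop CI markers
    env[key] = value;
  }
  return env;
}
```

In a Node host this would typically be called as `hostEnv(process.env)` and the result passed as the `env` option when spawning the PTY, leaving the parent process environment unchanged.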
How to test

Agent route: drive the wizard yourself. In a fresh session in this repo, run the `exploring-the-wizard` skill. `wizard-ci` is registered in `.mcp.json`, so the tools are already bound: `open_app` boots the real TUI on an app, then `read_state` / `perform_action` / `render_screen` (which returns the real rendered screen).

CI snapshots: real-TUI visual regression. From a `wizard-workbench` checkout next to this repo (PostHog creds in its `.env`):

Runs the full real agent flow against express-todo through the real TUI, captures each key moment, diffs the committed baseline, and writes `report.html`. Or comment `/wizard-ci` on a PR: same run, posted back as a comment. (Pairs with PostHog/wizard-workbench#2012.)

What this is

A headless e2e control plane that drives the real wizard TUI and captures what it renders. Both routes share one primitive:

- Host (`scripts/tui-host.no-jest.ts`) runs the real `startTUI` and drives its store by state manipulation, with no keystrokes. Auth uses the phx key (same bearer as an OAuth token), so the TUI advances with no browser.
- Capture (`e2e-harness/tui-capture.ts`) runs the host in a PTY (node-pty) and reads the real rendered screen via `@xterm/headless`.

Routes:

- CI route (`tui-snapshots`): the fixed e2e profile self-drives the host through the real agent run → one real-TUI text snapshot per key moment (including the run screen's progression), diffed against a committed baseline.
- Agent route (`wizard-ci-mcp`): an MCP server proxies the host so an agent decides each screen; `render_screen` returns the real frame. The `exploring-the-wizard` skill is the how-to.

None of it ships: it lives in `e2e-harness/` + `scripts/`, out of `src/`.