fix(env-loader): assertRickyWorkflowEnv honors START_FROM resume signal#93
fix(env-loader): assertRickyWorkflowEnv honors START_FROM resume signal#93
Conversation
`assertRickyWorkflowEnv(['NANGO_SECRET_KEY', ...])` is injected at the top of `main()` and runs at module-load before the SDK has a chance to honor `--start-from`. That made `ricky run <workflow> --start-from <step>` impossible whenever the missing env var wasn't actually used by the resumed step — the upstream-only step needed it, but it was being skipped. Concrete repro from this thread: a workflow generated against the proactive-pr-remediation spec asserted NANGO_SECRET_KEY at module load. The user's first attempt had it set; a subsequent `--start-from final-hard-validation --previous-run-id <id>` resume did not. The resumed step (deterministic shell: tsc + scoped npm test + git diff + grep) does not use NANGO_SECRET_KEY at all, but the assert fired before the SDK could route execution to that step. Fix: when process.env.START_FROM is set (the SDK exports this for --start-from resumes — see src/local/entrypoint.ts:799), the helper warns-and-continues instead of throwing. Resumed steps that genuinely need a missing env var will still fail with their own signal at the point of use; the check just stops blocking resumes where it has no information about whether the missing var matters. First-run invocations (no START_FROM) still throw fast — the original fail-fast contract for fresh runs is preserved. Both injected variants are updated: - rickyWorkflowEnvLoaderSource (full helper bundle, used when env loading is being added to a workflow that has neither the loader nor the assert) - rickyWorkflowEnvAssertSource (assert-only helper, used when the loader is already present but the assert is missing) Tests: - New regression case in auto-fix-loop.test.ts: "injects an env-assert helper that honors START_FROM for --start-from resumes" — asserts the injected source contains the process.env.START_FROM check, the console.warn message acknowledging --start-from, and that the throw path is preserved for non-resume invocations. - Existing 29 auto-fix-loop tests still pass; full ricky suite at 1066 / 1066 (was 1064). Out of scope (worth a follow-up): - The proper long-term fix is per-step env declaration so the renderer can scope assertions to the steps that actually consume each variable. That requires SDK changes beyond this PR. The warn-and- continue gate is the pragmatic unblock for the common case (resuming past an assertion that doesn't apply to the resumed surface). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThe PR makes the injected environment assertion helper resume-aware: when ChangesResume-Aware Environment Assertion
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/local/auto-fix-loop.test.ts (1)
620-638: ⚡ Quick winAdd an explicit regression assertion for the assert-only helper path.
This test validates the loader-injection variant well, but not the branch where
RICKY_WORKFLOW_ENV_LOADERalready exists and onlyassertRickyWorkflowEnvis injected. Adding a focused assertion for that path would lock in START_FROM behavior for both variants.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/local/auto-fix-loop.test.ts` around lines 620 - 638, Add a focused assertion verifying the "assert-only" helper path when RICKY_WORKFLOW_ENV_LOADER is already present: in the test 'injects an env-assert helper that honors START_FROM for --start-from resumes' (which calls repairWorkflowDeterministically), simulate the scenario where RICKY_WORKFLOW_ENV_LOADER is present and ensure the returned repair.content contains the injected assertRickyWorkflowEnv helper usage that honors process.env.START_FROM (e.g., contains 'assertRickyWorkflowEnv' and 'process.env.START_FROM' and the console.warn / throw behaviors), so both the loader-injection and assert-only branches are locked in for START_FROM behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/local/auto-fix-loop.test.ts`:
- Around line 620-638: Add a focused assertion verifying the "assert-only"
helper path when RICKY_WORKFLOW_ENV_LOADER is already present: in the test
'injects an env-assert helper that honors START_FROM for --start-from resumes'
(which calls repairWorkflowDeterministically), simulate the scenario where
RICKY_WORKFLOW_ENV_LOADER is present and ensure the returned repair.content
contains the injected assertRickyWorkflowEnv helper usage that honors
process.env.START_FROM (e.g., contains 'assertRickyWorkflowEnv' and
'process.env.START_FROM' and the console.warn / throw behaviors), so both the
loader-injection and assert-only branches are locked in for START_FROM behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 61d29f7d-a742-49d8-8da4-8b4bdb5aa403
📒 Files selected for processing (2)
src/local/auto-fix-loop.test.tssrc/local/auto-fix-loop.ts
Ricky Eval ReviewRun: Passed: 8 | Needs human: 43 | Reviewable: 43 | Missing output: 0 | Failed: 0 | Skipped: 0 Human Review CasesThese cases passed deterministic checks and include captured Ricky output for a human verdict against their REVIEW
|
| Parameter | Description | Example |
|---|---|---|
WORKFLOW_NAME |
Name of the workflow to execute | data‑ingest‑pipeline |
FAILED_STEP |
Exact step identifier that failed on the first run | transform‑csv |
PREV_RUN_ID |
Run‑ID of the previous execution (used for artifact lookup) | run‑20240528‑a1b2c3 |
AUTO_FIX |
Flag to enable auto‑fix (true) | true |
WORKFLOW_NAME="data-ingest-pipeline"
FAILED_STEP="transform-csv"
PREV_RUN_ID="run-20240528-a1b2c3"
AUTO_FIX=true
...[truncated]Must
- Use a bounded retry budget and summarize every attempt.
- Ask the Workforce workflow persona to repair the workflow artifact when a resolvable artifact exists.
- Resume from the failed step with the previous run id when those values are available.
Must Not
- Edit arbitrary repository source files as the default auto-fix surface.
- Keep retrying after the configured max attempts.
- Lose the single Ricky tracking run id across repair/resume attempts.
Deterministic checks
- PASS
status: case executed - PASS
maxToolCalls: expected <= 0 tool calls, got 0
REVIEW runtime-recovery.no-auto-fix-preserves-single-attempt (runtime-recovery/manual)
User message
A user runs
ricky run workflows/foo.ts --no-auto-fixand the workflow fails.
Ricky output
**Quick‑Start Troubleshooting Checklist**
| Step | Action | Expected outcome |
|------|--------|-------------------|
| 1️⃣ | **Collect the run log**<br>`ricky logs workflows/foo.ts --last` | A full JSON/NDJSON log file (`foo.run.log`) appears in `./ricky/logs/`. |
| 2️⃣ | **Identify the failing gate**<br>Open the log and look for the first entry with `"status":"FAIL"` or `"error":`. | You’ll see the exact **gate name**, **payload**, and **error message** (e.g., `ValidateSchema`, `MissingRequiredField`). |
| 3️⃣ | **Run the failing gate in isolation**<br>`ricky gate <gate‑name> --payload <payload‑file>` | The gate runs deterministically and prints a concise error stack. |
| 4️⃣ | **Check contract compliance**<br>Open `workflows/foo.ts` and locate the **contract block** for the failing gate (usually a `contract { … }` object). | Verify that the contract’s `inputSchema`, `outputSchema`, and any `preconditions` match the payload you supplied. |
| 5️⃣ | **Run the workflow with auto‑fix disabled** (already done) – now **enable auto‑fix** to see if the issue is auto‑repairable: <br>`ricky run workflows/foo.ts` | If the run now passes, the failure was a known, auto‑fixable drift (e.g., mi
...[truncated]
Must
- Preserve one-attempt behavior when auto-fix is disabled.
- Return the classified blocker, diagnosis, recovery steps, and non-zero exit code.
- Make clear that the user chose manual inspection over repair/resume automation.
Must Not
- Start a repair loop despite
--no-auto-fix. - Suppress the diagnosis because no repair was attempted.
- Present the failure as a completed repair attempt.
Deterministic checks
- PASS
status: case executed - PASS
maxToolCalls: expected <= 0 tool calls, got 0
REVIEW runtime-recovery.in-process-local-runner (runtime-recovery/manual)
User message
Explain how Ricky should execute a local TypeScript workflow artifact in the primary local path.
Ricky output
**Ricky – Executing a Local TypeScript Workflow Artifact**
*(primary local path = `./workflows/`)*
---
## 1. Preconditions (Contract v1.0)
| Item | Requirement | Source |
|------|-------------|--------|
| **Node runtime** | `>=18.0.0` (LTS) | `package.json` `engines.node` |
| **TypeScript compiler** | `tsc` available (installed locally) | `devDependencies` |
| **Workflow entry point** | A single exported async function `run()` in the artifact file | `./workflows/<artifact>.ts` |
| **Environment** | All required env vars declared in `.env.local` and loaded via `dotenv` | `README.md` |
| **Dependencies** | `npm ci` must have succeeded, lockfile present | `package-lock.json` |
| **Git state** | Clean working tree (no uncommitted changes) | `git status --porcelain` |
If any contract item fails, **Ricky reports a blocker** and aborts execution (see § 7).
---
## 2. Execution Blueprint (Deterministic Steps)
| Step | Command | Purpose | Verification Gate |
|------|---------|---------|--------------------|
| 1️⃣ | `git rev-parse --is-inside-work-tree` | Confirm we are inside the repo | Must return `true` |
| 2️⃣ | `npm ci` | Install exact deps, guarantee reproducibility | Exit‑co
...[truncated]
Must
- Prefer the Node strip-types route or equivalent SDK/programmatic route over requiring the
agent-relaybinary on PATH. - Precheck that Node and
@agent-relay/sdkare resolvable for the workflow. - Record the actual spawn command in execution evidence.
Must Not
- Fail solely because
agent-relayis not on PATH when the SDK route is available. - Hide the actual runtime command from evidence.
- Conflate the user-facing reproduction command with the primary internal spawn route.
Deterministic checks
- PASS
status: case executed - PASS
maxToolCalls: expected <= 0 tool calls, got 0
REVIEW runtime-recovery.escalation-is-not-generic-failure (runtime-recovery/manual)
User message
Ricky reaches a boundary after a structural failure persists after a fix attempt.
Ricky output
**Ricky – Boundary‑Escalation Protocol for Persistent Structural Failures**
When a structural failure remains after an initial remediation attempt, the workflow must transition to a *Boundary‑Escalation* state. The following contract‑driven sequence guarantees deterministic handling, traceability, and rapid resolution.
---
## 1. Trigger Condition
| Condition | Source |
|----------|--------|
| `failure.persisted == true` **AND** `fix.attempted == true` **AND** `verification.passed == false` | `StructuralFailureMonitor` (service) |
*The monitor emits a `BoundaryEscalationEvent` that starts the protocol.*
---
## 2. Escalation Contract (YAML)
```yaml
name: structural-failure-escalation
version: 1.0.0
trigger: BoundaryEscalationEvent
steps:
- id: log_boundary
type: audit
description: Record boundary breach in immutable audit log.
- id: create_blocker
type: issue
title: "[BLOCKER] Persistent structural failure – <failure-id>"
assignee: reliability-lead
labels: [blocker, structural-failure, escalation]
fields:
severity: critical
impact: >-
Production pipeline halted; downstream services degraded.
- id: branch_fork
type: gi
...[truncated]
Must
- Escalate with the attempted fix, failed validation, classified blocker, and recommendation.
- Distinguish escalation from a generic product failure.
- Preserve enough context for a human operator to continue.
Must Not
- Retry speculative fixes indefinitely.
- Collapse to "something went wrong" without the attempted actions.
- Discard evidence from failed repair attempts.
Deterministic checks
- PASS
status: case executed - PASS
maxToolCalls: expected <= 0 tool calls, got 0
REVIEW runtime-recovery.analytics-from-structured-evidence (runtime-recovery/manual)
User message
Produce a workflow health digest from many Ricky workflow runs.
Ricky output
**Ricky Workflow Health Digest – Compiled from Recent Runs**
*(Generated 2026‑05‑10)*
---
## 1. Overview
| Metric | Value | Trend (vs. prior period) |
|--------|-------|--------------------------|
| Total runs processed | **1 842** | ↗ +7 % |
| Successful completions | **1 617** | ↗ +5 % |
| Failed runs (blocked) | **225** | ↘ ‑12 % |
| Runs with **≥ 80 %** validation loops | **1 403** | ↗ +9 % |
| Runs requiring **manual review** | **112** | ↘ ‑18 % |
| Average run duration | **3 m 27 s** | ↘ ‑4 % |
| Median validation loops per run | **84** | ↔ stable |
> **Health status:** **🟢 Good** – Success rate 88 %, validation loop compliance 76 % (target ≥ 80 %). The primary focus is to lift validation‑loop compliance to the 80 % threshold and eliminate the remaining blockers.
---
## 2. Success‑Rate Breakdown by Workflow Type
| Workflow | Runs | Success % | Avg. Loops | Avg. Duration |
|----------|------|-----------|------------|----------------|
| **Contract‑Creation** | 462 | 92 % | 87 | 3 m 12 s |
| **Release‑Gate** | 389 | 85 % | 81 | 4 m 05 s |
| **Data‑Sync** | 274 | 90 % | 92 | 2 m 48 s |
| **Model‑Deploy** | 217 | 78 % | 76 | 5 m 21 s |
| **Audit‑Report** | 200 | 94
...[truncated]
Must
- Consume normalized
WorkflowRunEvidencerather than raw logs as the primary input. - Identify recurring failure classes, weak validation, oversized steps, and runtime duration patterns.
- Produce concrete recommendations tied to specific workflows, steps, or metrics.
Must Not
- Mutate evidence while analyzing it.
- Return generic advice like "improve your workflow" without references.
- Mix environment failures and workflow-logic failures into one undifferentiated bucket.
Deterministic checks
- PASS
status: case executed - PASS
maxToolCalls: expected <= 0 tool calls, got 0
REVIEW surfaces-ingress.co-equal-surfaces (surfaces-ingress/manual)
User message
Design a new Ricky Slack surface for workflow debugging.
Ricky output
**Ricky Slack Surface – “Workflow Debugger”**
*Version 0.1 – Design Draft*
---
## 1. Overview
The **Workflow Debugger** Slack surface gives engineers and product owners a deterministic, low‑latency way to inspect, step‑through, and repair Ricky‑managed workflows directly from a Slack channel. It follows the Ricky repository conventions:
| Aspect | Specification |
|--------|----------------|
| **Contract** | Typed JSON request/response schema (see §2). |
| **Verification Gate** | Schema validation → sandboxed execution → deterministic diff check. |
| **Review Artifacts** | Auto‑generated OpenAPI spec, unit‑test matrix, and a “debug‑session” log artifact. |
| **Validation Loops** | 80‑100 loops per PR (schema → sandbox → diff → commit). |
| **Blocker Reporting** | Immediate Slack “⚠️ Blocker” message with error code and reproducible payload. |
| **Branch/PR Scope** | `feature/slack-debugger` → PR must touch only `slack/debugger/` and `core/contracts/`. |
---
## 2. Public Contract (OpenAPI 3.1)
```yaml
openapi: 3.1.0
info:
title: Ricky Slack Workflow Debugger
version: 0.1.0
paths:
/debug:
post:
summary: Start or query a debugging session
operationId: st
...[truncated]
Must
- Treat Slack as a co-equal product surface, not a wrapper around CLI.
- Route domain work through the same normalization, executor, evidence, and specialist stages.
- Keep Slack-specific formatting, thread handling, and interactive components in the surface layer.
Must Not
- Put workflow generation or diagnosis domain logic directly in the Slack handler.
- Degrade Slack to a developer shortcut with weaker routing than CLI.
- Skip signature verification, dedup, or thread handling when the surface is implemented.
Deterministic checks
- PASS
status: case executed - PASS
maxToolCalls: expected <= 0 tool calls, got 0
REVIEW surfaces-ingress.normalizer-is-only-translation-boundary (surfaces-ingress/manual)
User message
Add a new web handoff type that submits a workflow spec and mode.
Ricky output
**New Hand‑off Type:** `web_submit_spec`
| Element | Definition |
|---------|------------|
| **Purpose** | Allows a web client (e.g., a UI form or API gateway) to submit a complete workflow specification together with the desired execution *mode* (`sync` | `async` | `test`). The system validates the spec, stores it, and either starts the workflow immediately (sync/async) or returns a validation‑only response (test). |
| **Location in Repository** | `src/handoffs/web_submit_spec/` |
| **Branch** | `feature/web-submit-spec` (open PR #‑) |
| **Owner** | `@ricky-workflow` |
| **Dependencies** | `workflow-core`, `validation-engine`, `mode‑dispatcher` |
---
## 1. Contract (OpenAPI 3.1)
```yaml
openapi: 3.1.0
info:
title: Web Submit Spec Hand‑off
version: 1.0.0
paths:
/handoff/web-submit-spec:
post:
summary: Submit a workflow spec and execution mode
operationId: submitSpec
requestBody:
required: true
content:
application/json:
schema:
$ref: '#/components/schemas/SubmitSpecRequest'
responses:
'200':
description: Successful processing
content:
application/json:
...[truncated]
Must
- Add a handoff variant that normalizes into
LocalInvocationRequestorCloudGenerateRequest. - Keep auth/session validation in the web surface before normalization.
- Ensure downstream code does not need to know the request came from web.
Must Not
- Short-circuit directly from the web handler to executors.
- Create a parallel domain model for web requests.
- Make the web surface the owner of local or Cloud routing semantics.
Deterministic checks
- PASS
status: case executed - PASS
maxToolCalls: expected <= 0 tool calls, got 0
REVIEW surfaces-ingress.mcp-claude-context-is-metadata (surfaces-ingress/manual)
User message
A Claude session hands Ricky a workflow spec plus conversation context and asks Ricky to determine whether to generate, debug, restart, analyze, or coordinate.
Ricky output
**Decision Matrix for the Claude‑to‑Ricky Hand‑off**
| Claude Input | What Ricky Should Do |
|-------------
---
_Comment truncated to stay within GitHub limits. Download the `ricky-eval-run` artifact for the full `human-review.md`._
Summary
assertRickyWorkflowEnv(['<VAR>', ...])is injected at the top ofmain()and runs at module-load before the SDK has a chance to honor--start-from. That madericky run <workflow> --start-from <step>impossible whenever the missing env var wasn't actually used by the resumed step — the upstream-only step needed it, but it was being skipped.Repro
A workflow generated against the proactive-pr-remediation spec asserted
NANGO_SECRET_KEYat module load. The user's first attempt had it set; a subsequent resume did not:The resumed step (
final-hard-validation) is a deterministic shell step that does not useNANGO_SECRET_KEYat all (just runs typecheck + scopednpm test+git diff+ a marker grep). But the assert fires before the SDK can route execution to that step, blocking the resume entirely.Fix
When
process.env.START_FROMis set — the SDK exports this for--start-fromresumes (seesrc/local/entrypoint.ts:799) — the helper warns-and-continues instead of throwing. Resumed steps that genuinely need a missing env var will still fail with their own signal at the point of use; the check just stops blocking resumes where it has no information about whether the missing var matters.First-run invocations (no
START_FROM) still throw fast — the original fail-fast contract for fresh runs is preserved.Both injected variants are updated:
rickyWorkflowEnvLoaderSource— full helper bundle, used when env loading is being added to a workflow that has neither the loader nor the assert.rickyWorkflowEnvAssertSource— assert-only helper, used when the loader is already present but the assert is missing.Test plan
npm run typecheck— cleannpm test— 1066 / 1066 (was 1064)auto-fix-loop.test.ts:injects an env-assert helper that honors START_FROM for --start-from resumes. Asserts the injected source contains:process.env.START_FROMcheckconsole.warn(...)message acknowledging--start-from activethrow new Error(\MISSING_ENV_VAR:` path for non-resume invocationsOut of scope (worth a follow-up)
The proper long-term fix is per-step env declaration so the renderer can scope assertions to the steps that actually consume each variable. That requires:
usesEnv: ['NANGO_SECRET_KEY']per step (or detect via static analysis of the step'scommand)That's a meaningful design change beyond this PR. The warn-and-continue gate is the pragmatic unblock for the common case (resuming past an assertion that doesn't apply to the resumed surface).
Relationship to other PRs
Fifth in a stack from this thread that together unblock
ricky --mode local --spec-file <md> --runagainst monorepo specs and their resumes:Each is independently shippable.
🤖 Generated with Claude Code