docs(evals): add initial integrations e2e spec #127

Merged
khaliqgant merged 1 commit into main from codex/initial-integrations-e2e-eval
May 9, 2026
Conversation

@khaliqgant
Member

Summary

  • Adds a live E2E eval suite for the initial Relayfile integrations: Linear, Slack, Notion, and GitHub.
  • Specifies setup/OAuth/mount prerequisites, discovery-contract checks, provider-specific file-native writeback flows, cleanup rules, and evidence artifacts.
  • Adds a rubric for PASS/BLOCKED/FAIL review so another agent can execute the run consistently.

Testing

  • Not run live; this is a live-provider eval spec requiring OAuth and disposable provider resources.
  • Sanity checked the Markdown suite locally for ASCII/structure before committing.

@coderabbitai

coderabbitai Bot commented May 9, 2026

📝 Walkthrough

This PR adds a comprehensive E2E specification for initial Relayfile integrations (Linear, Slack, Notion, GitHub): seven live cases validating mounted discovery and file-native writeback, shared polling/evidence helpers, an acceptance rubric, and trajectory records documenting the completed spec.

Changes

Initial Integrations E2E Evaluation Suite

| Layer / File(s) | Summary |
| --- | --- |
| Foundation & Preconditions (evals/suites/initial-integrations-e2e/cases.md) | E2E suite scope spans four initial providers; global preconditions define required tools, OAuth setup, environment variables, run metadata, and safety blocking rules. |
| Evidence Contract & Helpers (evals/suites/initial-integrations-e2e/cases.md) | Evidence bundle contract enumerates required output artifacts and includes shared helpers (wait_for_file_contains, wait_for_writeback_drain, wait_for_provider_roots) for bounded polling and error capture. |
| Setup, Mount & Discovery (evals/suites/initial-integrations-e2e/cases.md) | initial-e2e.setup-connect-mount creates workspace, connects integrations, pulls and mounts state; initial-e2e.discovery-contract validates mounted .adapter.md, .schema.json, .create.example.json, forbids new.json, and runs Node schema validation. |
| Provider Writeback Cases (evals/suites/initial-integrations-e2e/cases.md) | Cases test file-native writeback for Linear (issue create/patch/delete), Slack (message/reply/reaction), Notion (page create/patch), and GitHub (PR review submit); each uses non-canonical filenames, checks read-only rejection, and records provider evidence. |
| Final Health & Validation (evals/suites/initial-integrations-e2e/cases.md) | initial-e2e.final-health-and-regression-sweep captures final writeback/Relayfile status, ensures no pending/dead-lettered operations, and enforces absence of new.json. |
| Acceptance Rubric (evals/suites/initial-integrations-e2e/rubric.md) | Defines PASS/BLOCKED/FAIL criteria, required connections and discovery behaviors, terminal writeback state (pending: 0, empty deadLettered), and an evidence-review checklist. |
| Trajectory Metadata (.trajectories/completed/2026-05/traj_brjdrgcnnwhs.json, .trajectories/completed/2026-05/traj_brjdrgcnnwhs.md, .trajectories/index.json) | Adds a completed trajectory JSON/markdown documenting the spec completion and updates the trajectories index lastUpdated and entries. |
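
The summary above describes the shared helpers (wait_for_file_contains, wait_for_writeback_drain, wait_for_provider_roots) as bounded polling with error capture, but their bodies are not shown in this conversation. The following is only a minimal sketch of that general pattern, assuming a hypothetical generic function named `wait_until`; it is not the suite's actual helper code.

```bash
#!/usr/bin/env bash
# Hypothetical helper for illustration; the suite's real helpers may differ.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command...>
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( SECONDS + timeout ))
  while :; do
    # Run the probe quietly; the first success ends the wait.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Give up once the deadline has passed, and say what we were waiting for.
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Demo: a background job creates a marker file after one second.
tmpdir="$(mktemp -d)"
marker="$tmpdir/ready"
( sleep 1; : > "$marker" ) &
wait_until 5 1 test -e "$marker" && echo "ready"
wait
rm -rf "$tmpdir"
```

The point of the bound is that a provider that never becomes ready yields a non-zero return code within a known time, which the caller can then map to a BLOCKED or FAIL outcome instead of hanging.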

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through mounts and schemas bright,
Connected Slack and Linear by moonlight,
Notion and GitHub joined the play,
Files wrote back and then ran away—
E2E done! I nibble a carrot of delight.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding documentation for an initial integrations E2E evaluation spec. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the E2E eval suite for initial Relayfile integrations and testing approach. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.trajectories/completed/2026-05/traj_brjdrgcnnwhs.json:
- Line 19: The projectId field contains a machine-specific absolute path;
replace this value with a portable repo-relative identifier or remove the
projectId entry altogether if unused. Locate the "projectId" key in the JSON
blob (symbol: projectId) and change
"/Users/khaliqgant/Projects/AgentWorkforce/relayfile" to a neutral value such as
"relayfile" or "./relayfile" (or delete the projectId property) so the metadata
no longer leaks local environment details.

In @.trajectories/index.json:
- Line 244: Replace the absolute user-specific path value in the "path" field
inside .trajectories/index.json with a repository-relative path (e.g., change
"/Users/khaliqgant/Projects/AgentWorkforce/relayfile/.trajectories/completed/2026-05/traj_brjdrgcnnwhs.json"
to ".trajectories/completed/2026-05/traj_brjdrgcnnwhs.json"); update the JSON
entry so the "path" key holds the repo-relative string to avoid leaking local
environment details and ensure portability.

In `@evals/suites/initial-integrations-e2e/cases.md`:
- Around line 341-343: The jq invocations that build JSON from literals/args
(the lines using jq --arg run "$EVAL_RUN_ID" '{description: ("Patched by " +
$run)}' > "$EVAL_LOCAL_DIR/<canonical-linear-issue-path>") must run with no
input, so add the -n flag to jq (e.g., jq -n --arg run "$EVAL_RUN_ID" ...) to
prevent jq from waiting for stdin or failing in automated runs; apply the same
-n addition to the other similar jq invocation around the EVAL_LOCAL_DIR path at
the second occurrence.
- Around line 176-183: After mounting in background with relayfile mount, add a
bounded readiness poll before running the deterministic assertions (relayfile
status, relayfile tree, relayfile writeback status): call the existing polling
helper to wait until the mount is reported ready (e.g., relayfile status shows
the workspace is mounted and provider roots/pending==0, or relayfile tree
returns the expected root listing) with a sensible timeout and interval, then
proceed to tee the outputs; apply the same readiness-wait change to the similar
block referenced at lines 187-194 to avoid race flakes.
- Around line 102-103: The grep in wait_for_file_contains currently treats
$needle as a regex which can mis-match when needle contains metacharacters;
change the check that uses grep -q "$needle" "$target" to use fixed-string mode
grep -Fq "$needle" "$target" so $needle is matched literally (locate the shell
function wait_for_file_contains and the lines referencing target and needle to
update the grep invocation).
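
The difference between regex and fixed-string matching in the last finding can be seen with a small standalone sketch. The needle and file contents here are hypothetical values chosen for illustration; they do not come from cases.md.

```bash
#!/usr/bin/env bash
tmpdir="$(mktemp -d)"
target="$tmpdir/issue.json"

# Hypothetical needle containing regex metacharacters ('.' and '[...]').
needle='run.[1]'
# The file does NOT contain the literal text "run.[1]".
printf '{"title":"runX1"}\n' > "$target"

# Regex mode: '.' matches any character and '[1]' matches '1',
# so "runX1" is reported as a match even though the literal needle is absent.
if grep -q "$needle" "$target"; then
  echo "regex: matched (false positive)"
else
  echo "regex: no match"
fi

# Fixed-string mode: the needle is compared literally, so there is no match.
if grep -Fq "$needle" "$target"; then
  echo "fixed: matched"
else
  echo "fixed: no match"
fi

rm -rf "$tmpdir"
```

This prints "regex: matched (false positive)" followed by "fixed: no match". If a needle could begin with a dash, passing `--` before it (for example `grep -Fq -- "$needle" "$target"`) keeps grep from parsing it as an option.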
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: e0a4326f-4ab3-4725-8385-8f41917129a4

📥 Commits

Reviewing files that changed from the base of the PR and between 6ad8074 and 78c4cb8.

📒 Files selected for processing (5)
  • .trajectories/completed/2026-05/traj_brjdrgcnnwhs.json
  • .trajectories/completed/2026-05/traj_brjdrgcnnwhs.md
  • .trajectories/index.json
  • evals/suites/initial-integrations-e2e/cases.md
  • evals/suites/initial-integrations-e2e/rubric.md

Comment thread .trajectories/completed/2026-05/traj_brjdrgcnnwhs.json Outdated
Comment thread .trajectories/index.json Outdated
Comment thread evals/suites/initial-integrations-e2e/cases.md Outdated
Comment thread evals/suites/initial-integrations-e2e/cases.md
Comment thread evals/suites/initial-integrations-e2e/cases.md Outdated
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 2 potential issues.

View 3 additional findings in Devin Review.

Comment on lines +341 to +342
jq --arg run "$EVAL_RUN_ID" '{description: ("Patched by " + $run)}' \
> "$EVAL_LOCAL_DIR/<canonical-linear-issue-path>"
🟡 Missing jq -n flag causes eval script to hang on stdin

The jq command on line 341 constructs a new JSON object ({description: ...}) but is missing the -n flag, unlike the create commands on lines 327, 408, and 527 which all correctly use jq -n. Without -n, jq reads from stdin and will block indefinitely in an interactive terminal. An agent following this template would copy the jq invocation as-is (only substituting the <canonical-linear-issue-path> placeholder) and produce a hanging command. The fix is to add -n to match the pattern used everywhere else in the file.

Suggested change
-jq --arg run "$EVAL_RUN_ID" '{description: ("Patched by " + $run)}' \
-> "$EVAL_LOCAL_DIR/<canonical-linear-issue-path>"
+jq -n --arg run "$EVAL_RUN_ID" '{description: ("Patched by " + $run)}' \
+> "$EVAL_LOCAL_DIR/<canonical-linear-issue-path>"
5. Attempt a read-only mutation against the canonical issue:

```bash
jq '{id: "not-the-real-id"}' > "$EVAL_LOCAL_DIR/<canonical-linear-issue-path>"
```
🟡 Missing jq -n flag causes eval script to hang on stdin

The jq command on line 348 constructs a new JSON object ({id: "not-the-real-id"}) but is missing the -n flag. Same root cause as the patch command above — without -n, jq reads from stdin and will block. Every other jq invocation in this file that creates a new object uses -n (lines 327, 408, 527).

Suggested change
-jq '{id: "not-the-real-id"}' > "$EVAL_LOCAL_DIR/<canonical-linear-issue-path>"
+jq -n '{id: "not-the-real-id"}' > "$EVAL_LOCAL_DIR/<canonical-linear-issue-path>"
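
The effect of -n in both jq findings can be checked in isolation. This is a small sketch that assumes jq is installed and uses a made-up EVAL_RUN_ID value; stdin is redirected from /dev/null in the second call so the demo cannot block.

```bash
#!/usr/bin/env bash
EVAL_RUN_ID="demo-run"   # hypothetical value for illustration only

# With -n, jq uses null as its single input and never reads stdin,
# so the object is built purely from the --arg value.
jq -n -c --arg run "$EVAL_RUN_ID" '{description: ("Patched by " + $run)}'

# Without -n, jq runs the filter once per value read from stdin.
# On a terminal it would wait for input; with an empty stdin there are
# no input values, so the filter never runs and nothing is written.
jq -c --arg run "$EVAL_RUN_ID" '{description: ("Patched by " + $run)}' < /dev/null
```

The first command prints `{"description":"Patched by demo-run"}`. The second prints nothing, which in the eval would leave an empty file at the redirect target rather than the intended patch body.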
@khaliqgant khaliqgant force-pushed the codex/initial-integrations-e2e-eval branch from 78c4cb8 to 98cd3f4 May 9, 2026 19:39
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@evals/suites/initial-integrations-e2e/cases.md`:
- Around line 198-203: The readiness/drain helper calls wait_for_provider_roots
and wait_for_writeback_drain can return non-zero but the script keeps running;
modify the block that calls wait_for_provider_roots and wait_for_writeback_drain
so failures immediately abort the run: check each command's exit status and on
non-zero print a clear error message and exit non-zero (or enable strict mode
like set -e at the top of the script), and also treat any "dead letters"
detection from wait_for_writeback_drain as a fatal condition—use the functions'
return codes (wait_for_provider_roots, wait_for_writeback_drain) to gate
continuing to the evidence collection commands (relayfile status/tree/writeback
status).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 9b41f23e-1e4a-48eb-b6af-f1b55c5a6bc1

📥 Commits

Reviewing files that changed from the base of the PR and between 78c4cb8 and 98cd3f4.

📒 Files selected for processing (5)
  • .trajectories/completed/2026-05/traj_brjdrgcnnwhs.json
  • .trajectories/completed/2026-05/traj_brjdrgcnnwhs.md
  • .trajectories/index.json
  • evals/suites/initial-integrations-e2e/cases.md
  • evals/suites/initial-integrations-e2e/rubric.md
✅ Files skipped from review due to trivial changes (3)
  • .trajectories/completed/2026-05/traj_brjdrgcnnwhs.md
  • .trajectories/index.json
  • .trajectories/completed/2026-05/traj_brjdrgcnnwhs.json

Comment on lines +198 to +203
wait_for_provider_roots 180
wait_for_writeback_drain 180
relayfile status "$EVAL_WORKSPACE" | tee "$EVAL_EVIDENCE_DIR/status-after-mount.txt"
relayfile tree "$EVAL_WORKSPACE" / --depth 3 | tee "$EVAL_EVIDENCE_DIR/02-tree-before.txt"
relayfile writeback status "$EVAL_WORKSPACE" --json \
| tee "$EVAL_EVIDENCE_DIR/04-writeback-status-before.json"
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when readiness/drain checks time out or detect dead letters.

On Line 198 and Line 199, the wait helpers can return non-zero, but the script continues because the block doesn’t enforce abort semantics. That can produce misleading evidence and false PASS interpretation.
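
The continue-after-failure behavior described here is easy to reproduce outside the suite. The sketch below uses a hypothetical stub in place of the real wait helpers and assumes the script runs without errexit, which is the shell default.

```bash
#!/usr/bin/env bash
set +e   # make the default (no errexit) explicit for this demo

# Hypothetical stand-in for a wait helper that times out.
wait_helper_stub() { return 1; }

run_unguarded() {
  wait_helper_stub
  # Still runs: the non-zero status above was silently discarded.
  echo "evidence collected"
}

run_guarded() {
  # Gate on the helper's return code and stop before collecting evidence.
  wait_helper_stub || { echo "BLOCKED: helper failed (rc=$?)"; return 22; }
  echo "evidence collected"
}

run_unguarded
run_guarded || echo "run aborted with rc=$?"
```

This prints "evidence collected", then "BLOCKED: helper failed (rc=1)", then "run aborted with rc=22". The unguarded variant is the case that can produce evidence files which look like a clean run even though readiness was never reached.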

Suggested doc patch
-wait_for_provider_roots 180
-wait_for_writeback_drain 180
+wait_for_provider_roots 180 || {
+  echo "BLOCKED_PROVIDER_ROOTS_TIMEOUT" | tee -a "$EVAL_EVIDENCE_DIR/SUMMARY.md"
+  exit 22
+}
+wait_for_writeback_drain 180 || {
+  rc=$?
+  if [ "$rc" -eq 2 ]; then
+    echo "FAIL_DEAD_LETTERED_WRITEBACK" | tee -a "$EVAL_EVIDENCE_DIR/SUMMARY.md"
+  else
+    echo "BLOCKED_WRITEBACK_DRAIN_TIMEOUT" | tee -a "$EVAL_EVIDENCE_DIR/SUMMARY.md"
+  fi
+  exit 23
+}
 relayfile status "$EVAL_WORKSPACE" | tee "$EVAL_EVIDENCE_DIR/status-after-mount.txt"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-wait_for_provider_roots 180
-wait_for_writeback_drain 180
-relayfile status "$EVAL_WORKSPACE" | tee "$EVAL_EVIDENCE_DIR/status-after-mount.txt"
-relayfile tree "$EVAL_WORKSPACE" / --depth 3 | tee "$EVAL_EVIDENCE_DIR/02-tree-before.txt"
-relayfile writeback status "$EVAL_WORKSPACE" --json \
-| tee "$EVAL_EVIDENCE_DIR/04-writeback-status-before.json"
+wait_for_provider_roots 180 || {
+  echo "BLOCKED_PROVIDER_ROOTS_TIMEOUT" | tee -a "$EVAL_EVIDENCE_DIR/SUMMARY.md"
+  exit 22
+}
+wait_for_writeback_drain 180 || {
+  rc=$?
+  if [ "$rc" -eq 2 ]; then
+    echo "FAIL_DEAD_LETTERED_WRITEBACK" | tee -a "$EVAL_EVIDENCE_DIR/SUMMARY.md"
+  else
+    echo "BLOCKED_WRITEBACK_DRAIN_TIMEOUT" | tee -a "$EVAL_EVIDENCE_DIR/SUMMARY.md"
+  fi
+  exit 23
+}
+relayfile status "$EVAL_WORKSPACE" | tee "$EVAL_EVIDENCE_DIR/status-after-mount.txt"
+relayfile tree "$EVAL_WORKSPACE" / --depth 3 | tee "$EVAL_EVIDENCE_DIR/02-tree-before.txt"
+relayfile writeback status "$EVAL_WORKSPACE" --json \
+| tee "$EVAL_EVIDENCE_DIR/04-writeback-status-before.json"
@khaliqgant khaliqgant merged commit 5fe347c into main May 9, 2026
7 checks passed
@khaliqgant khaliqgant deleted the codex/initial-integrations-e2e-eval branch May 9, 2026 19:55