Add eval harness for testing AGENTS.md changes by RoyLee1224 · Pull Request #69308 · apache/airflow

RoyLee1224 · 2026-07-03T08:02:27Z

What

Adds an eval harness (dev/skill-evals/) for testing AGENTS.md guidance against real scenarios. It answers: "does my AGENTS.md change actually affect agent behavior?"

uv run dev/skill-evals/eval.py --repeat 3

Compares the main branch AGENTS.md against your working tree. Each arm is a git worktree with the full repo, so the agent sees real source files. If AGENTS.md is unchanged, the working arm is skipped automatically.
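For illustration, the "skip the working arm when nothing changed" rule could be sketched as below. This is not the code in eval.py: the function names and the arm names "main" and "working" are assumptions, and only the `git show main:AGENTS.md` call is a standard git command.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def read_main_guidance(repo: Path, path: str = "AGENTS.md") -> str:
    """Return the file as committed on main (runs `git show main:<path>`)."""
    result = subprocess.run(
        ["git", "-C", str(repo), "show", f"main:{path}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def select_arms(main_text: str, working_text: str) -> list[str]:
    """Pick the arms to run; the names "main" and "working" are hypothetical."""
    arms = ["main"]
    # Only spend tokens on the working arm if the guidance actually differs.
    if working_text != main_text:
        arms.append("working")
    return arms
```

In practice `working_text` would be read from the AGENTS.md in the working tree and `main_text` from `read_main_guidance`.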

No API key needed — authenticates via Claude Code OAuth (claude /login).

Demo: testing the newsfragment golden rule

Following the discussion in #release-management, I tested the golden rule (#67982) against real PRs where reviewers asked to remove newsfragments:

| Case | with AGENTS.md | without |
| --- | --- | --- |
| Provider pod leak #67333 | 3/3 ✓ | 0/3 |
| API optimization #66696 | 3/3 ✓ | 1/3 |
| Scheduler fix #64322 | 2/3 | 0/3 |
| i18n cache fix #65720 | 1/3 | 0/3 |

Golden rule works for clear cases but struggles with ambiguous fixes where the model reasons "this bug affects users."

Open question: I'm not sure the current case selection is the right design. Would appreciate maintainers' thoughts on what cases are worth covering.

CLI demo

UI demo (run `npx promptfoo@0.121.17 view`):

How it works

Creates git worktrees — one with main's AGENTS.md, one with your working tree version. Both are full repo checkouts.
Generates a promptfoo config with anthropic:claude-agent-sdk provider and structured JSON output.
Runs each case against all arms in parallel, reports diff.
Worktrees cleaned up on exit.
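A minimal sketch of the worktree lifecycle described in the steps above, assuming one detached worktree per arm under a temporary directory. The helper names and directory prefix are hypothetical and this is not the PR's implementation; `git worktree add --detach` and `git worktree remove --force` are the only git commands relied on.

```python
from __future__ import annotations

import shutil
import subprocess
import tempfile
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path


def worktree_add_cmd(repo: Path, dest: Path, ref: str = "HEAD") -> list[str]:
    """Build the command that checks out `ref` into a new detached worktree."""
    return ["git", "-C", str(repo), "worktree", "add", "--detach", str(dest), ref]


def worktree_remove_cmd(repo: Path, dest: Path) -> list[str]:
    """Build the command that removes a worktree even if it has local edits."""
    return ["git", "-C", str(repo), "worktree", "remove", "--force", str(dest)]


@contextmanager
def arm_worktrees(repo: Path, arms: list[str]) -> Iterator[dict[str, Path]]:
    """Create one worktree per arm and always clean up, even on partial failure."""
    root = Path(tempfile.mkdtemp(prefix="skill-eval-"))  # hypothetical prefix
    created: dict[str, Path] = {}
    try:
        for arm in arms:
            dest = root / arm
            # check=True fails fast if git cannot create the worktree.
            subprocess.run(worktree_add_cmd(repo, dest), check=True)
            created[arm] = dest
        yield created
    finally:
        for dest in created.values():
            subprocess.run(worktree_remove_cmd(repo, dest), check=False)
        shutil.rmtree(root, ignore_errors=True)
```

Swapping the AGENTS.md inside a given arm (for example writing main's version into it) would happen inside the `with arm_worktrees(...)` body and is left out here.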

Files

dev/skill-evals/
  eval.py                      # entry point (Python, per dev/ guidelines)
  cases/newsfragment.yaml      # cases from real PRs (#64322, #65720, #66696, #67333)
  README.md                    # setup and usage
.pre-commit-config.yaml        # prek hook: remind to run eval on AGENTS.md changes
scripts/ci/prek/check_eval_reminder.py
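As a rough idea of what a reminder hook like check_eval_reminder.py might do, here is a sketch under stated assumptions: the hook receives the staged file paths as arguments, matches guidance files by exact basename, prints a warning, and never blocks the commit. The function names and message wording are hypothetical, not taken from the PR.

```python
from __future__ import annotations

import sys
from pathlib import PurePosixPath

# Exact basenames only, so "MY_AGENTS.md" or "SKILL.md.bak" do not match.
GUIDANCE_BASENAMES = frozenset({"AGENTS.md", "SKILL.md"})


def guidance_files(staged: list[str]) -> list[str]:
    """Return the staged paths whose basename is a guidance file."""
    return [p for p in staged if PurePosixPath(p).name in GUIDANCE_BASENAMES]


def main(argv: list[str]) -> int:
    """Warn on stderr when guidance files are staged; always return 0."""
    hits = guidance_files(argv)
    if hits:
        print(f"Guidance files changed: {', '.join(hits)}", file=sys.stderr)
        print(
            "Consider running: uv run dev/skill-evals/eval.py --repeat 3",
            file=sys.stderr,
        )
    return 0
```

A script entry point would call `sys.exit(main(sys.argv[1:]))`; returning 0 unconditionally is what keeps the hook advisory.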

Future use

Auto-generate a summary table (like the one in this PR) from promptfoo results. Might be useful for pasting into PRs that change AGENTS.md; open to discussion
As models improve, some guidance may become unnecessary — the eval helps determine which rules the model still depends on and which can be safely removed
Add cases for routing rules that matter — the prek hook reminds contributors to run the eval when AGENTS.md changes
Use the eval to validate moving guidance from AGENTS.md into skills — run before and after to confirm no regression
Currently tests Claude only; the architecture (promptfoo + structured output) could extend to other agent runtimes
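The auto-generated summary table mentioned in the first item could be sketched as follows. The input shape (case name mapped to per-arm `(passed, total)` counts) is an assumption for illustration and is not promptfoo's actual result schema; a real version would first have to extract these counts from the promptfoo output.

```python
from __future__ import annotations


def summary_table(
    results: dict[str, dict[str, tuple[int, int]]], arms: list[str]
) -> str:
    """Render pass counts per case and arm as a markdown table."""
    lines = [
        "| Case | " + " | ".join(arms) + " |",
        "|" + " --- |" * (len(arms) + 1),
    ]
    for case, per_arm in results.items():
        cells = []
        for arm in arms:
            passed, total = per_arm[arm]
            # Mark only arms where every repeat passed.
            mark = " ✓" if passed == total else ""
            cells.append(f"{passed}/{total}{mark}")
        lines.append(f"| {case} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Counts copied from the results table in this PR description.
    demo = {
        "Provider pod leak #67333": {"with AGENTS.md": (3, 3), "without": (0, 3)},
        "Scheduler fix #64322": {"with AGENTS.md": (2, 3), "without": (0, 3)},
    }
    print(summary_table(demo, ["with AGENTS.md", "without"]))
```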

Was generative AI tooling used to co-author this PR?

Yes, Claude Code (Opus 4.8)



- Verify CLAUDE.md → AGENTS.md symlink before running; baseline arm removes both
- Pin promptfoo to 0.121.17
- Fail fast on git worktree errors and clean up worktrees on partial failure
- Reject SKILL_NAME + --full (baseline arm would always fail skill-used assert)
- Add tests for check_eval_reminder; match guidance file basenames exactly

RoyLee1224 · 2026-07-03T08:14:33Z

This PR focuses on the harness design plus one base case (the newsfragment golden rule); expanding coverage can come in follow-up PRs. I'm also not sure the current case design is right; maintainers' thoughts are welcome!

potiuk · 2026-07-03T10:17:39Z

Nice. Just a static check failure :)

potiuk · 2026-07-03T11:02:12Z

+"""Remind contributors to run skill-eval when guidance files change.
+
+This hook prints a warning when AGENTS.md or SKILL.md is staged for
+commit. It always exits 0 — it never blocks the commit.


I am not sure that is going to be enough of a nudge. I think what we could do instead is keep a proof somewhere that the evals have been run after the change. It would likely be a good idea to add a pre-push prek hook that verifies this (and fails in CI if the proof is not there).

This could be done rather simply:

calculate a consistent hash of all the input AGENTS.md and related files

when you run the eval and it succeeds, store the hash as the "last successful eval run hash"

the prek hook verifies that the hashes match

Of course, we also have to take into account that someone might not be able to run the eval, and have an escape hatch for that. In this case we likely need some way or process where a maintainer who has agentic access takes over, runs the evals for such a contributor, and updates the PR.
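To make the hash idea concrete, here is a minimal sketch assuming SHA-256 over the sorted relative paths and file contents, with the hash of the last successful run kept in a stamp file. The function names and the stamp-file mechanism are hypothetical; the comment above only specifies "store the hash" and "verify the hashes match".

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def guidance_hash(root: Path, relative_paths: list[str]) -> str:
    """Hash paths and contents in sorted order so the result is reproducible."""
    digest = hashlib.sha256()
    for rel in sorted(relative_paths):
        name = rel.encode("utf-8")
        data = (root / rel).read_bytes()
        # Length prefixes keep (path, content) boundaries unambiguous.
        digest.update(len(name).to_bytes(8, "big") + name)
        digest.update(len(data).to_bytes(8, "big") + data)
    return digest.hexdigest()


def record_successful_eval(root: Path, relative_paths: list[str], stamp: Path) -> None:
    """Call after a passing eval run to store the current hash."""
    stamp.write_text(guidance_hash(root, relative_paths) + "\n")


def eval_is_current(root: Path, relative_paths: list[str], stamp: Path) -> bool:
    """What a pre-push hook or CI check would verify."""
    if not stamp.is_file():
        return False
    return stamp.read_text().strip() == guidance_hash(root, relative_paths)
```

The escape hatch for contributors who cannot run the eval is a process question and is not covered by this sketch.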

potiuk · 2026-07-03T11:08:18Z

+    config_lines.append(OUTPUT_SCHEMA)
+    config_lines.append("")
+
+


I think we should have several options for how to run the evals:

we should not limit ourselves to Claude: all kinds of LLMs, including open-weight / local LLMs, should be OK

we should allow the evals to be used both inside the agents and outside. The token economics are currently strongly in favour of running agentic workflows inside agentic CLIs: those tokens are billed within a subscription and are often 10x - 60x cheaper than usage-based API tokens. So we should basically have a SKILL that runs the evals inside the agent and makes use of those tokens, as one option for running evals (this should also be vendor-neutral).

Actually... that made me think that we should possibly, eventually, add such an "eval" framework to Magpie, or rather turn the eval framework we have there into a reusable SKILL. I think that should be the right approach here: make sure that what we do here follows the same approach as Magpie (or that there is an easy way to convert Magpie's eval framework to use what we do here), and eventually upstream it to Magpie, so that Magpie provides a "shared eval framework" and "dogfoods" it at the same time.

jason810496

Nice! It's really simple and lightweight (from the user setup perspective; I didn't check whether promptfoo itself is lightweight).

I will shortly add the cases I just thought of that are also suitable for eval.

jason810496 · 2026-07-03T10:55:55Z

+      Should I create a newsfragment?
+  assert:
+    - type: javascript
+      value: 'output.should_create === false'


Do we need to add a positive test case? Here are some cases I thought of: Airflow security boundary changes, the recent coordinator interface, and the recent TestConnection change (executing the TestConnection workload on a worker instead of directly on the API server).

Corresponding PRs:

Coordinator interface: Add Coordinator Layer and Java Coordinator #65958

TestConnection executed on worker (the security boundary change): Add async connection testing via workers for security isolation #62343

jason810496 · 2026-07-03T10:58:13Z

+    sdk_dir = Path.home() / ".promptfoo-sdk" / "node_modules" / "@anthropic-ai" / "claude-agent-sdk"
+    if not sdk_dir.is_dir():
+        print("Error: Claude Agent SDK not found. Run:", file=sys.stderr)
+        print(
+            "  mkdir -p ~/.promptfoo-sdk && cd ~/.promptfoo-sdk"


Small nit: currently, all user-generated content should be located under the files/ artifact directory.

This convention was also recently introduced in AGENTS.md by TP (might this also be a classic case for eval?)

Corresponding PR:

files/ output convention in AGENTS.md: Instruct agents to put generated files in files/ #69097

jason810496 · 2026-07-03T11:00:05Z

+    if not shutil.which("node"):
+        print("Error: Node.js not found. Install Node.js >=22.22.0", file=sys.stderr)
+        sys.exit(1)
+    if not shutil.which("npx"):
+        print("Error: npx not found.", file=sys.stderr)
+        sys.exit(1)


I wonder whether we could ensure these system-level deps are handled at the prek hook level. IIRC, we're able to set system deps in the pre-commit config. Then users don't need to install the Node runtime themselves.

jason810496 · 2026-07-03T11:02:21Z

+        # Default test config
+        config_lines.append("defaultTest:")
+        config_lines.append("  options:")
+        config_lines.append("    disableVarExpansion: true")
+        if skill_name:
+            config_lines.append("  assert:")
+            config_lines.append("    - type: skill-used")
+            config_lines.append(f"      value: {skill_name}")
+        config_lines.append("")


Not sure, but could it be better to construct this as a dict and then serialize it to YAML at the end of the script?
(Additionally, we can use TypedDict to represent the hierarchy structurally.)
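A sketch of the dict-plus-TypedDict approach suggested here, limited to the `defaultTest` fragment from the quoted hunk. The type and function names are hypothetical. Serialization uses `json.dumps` only to keep the example dependency-free (JSON is a subset of YAML 1.2); with PyYAML installed, `yaml.safe_dump(config, sort_keys=False)` would be the natural replacement.

```python
from __future__ import annotations

import json
from typing import TypedDict


class Assertion(TypedDict):
    type: str
    value: str


class TestOptions(TypedDict):
    disableVarExpansion: bool


# Functional syntax is needed because "assert" is a Python keyword and
# cannot be declared as a field in the class-based form.
DefaultTest = TypedDict(
    "DefaultTest",
    {"options": TestOptions, "assert": list[Assertion]},
    total=False,
)


def build_default_test(skill_name: str | None) -> DefaultTest:
    """Mirror the config_lines logic from the hunk, but as structured data."""
    default_test: DefaultTest = {"options": {"disableVarExpansion": True}}
    if skill_name:
        default_test["assert"] = [{"type": "skill-used", "value": skill_name}]
    return default_test


def render_config(skill_name: str | None) -> str:
    """Serialize once at the end instead of appending text line by line."""
    return json.dumps({"defaultTest": build_default_test(skill_name)}, indent=2)
```

Building the structure first means indentation and quoting are handled by the serializer rather than by hand-written string fragments.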

jason810496 · 2026-07-03T11:06:01Z

+# View results in browser (use the pinned version printed by eval.py):
+npx promptfoo@0.121.17 view


Then perhaps we can use another stage, a manual prek hook, to show the result (so that the node deps are managed by prek).

jason810496 · 2026-07-03T11:07:53Z

+"""Remind contributors to run skill-eval when guidance files change.
+
+This hook prints a warning when AGENTS.md or SKILL.md is staged for
+commit. It always exits 0 — it never blocks the commit.
+"""


Nice, perhaps as a follow-up we can add a Slack bot that runs this in CI monthly and posts the message to the #internal-ci-cd-channel, then someone will run the eval for sure!

RoyLee1224 requested review from amoghrajesh, ashb, bugraoz93, choo121600, ephraimbuddy, gopidesupavan, jason810496, jedcunningham, jscheffl, potiuk and vatsrahul1001 as code owners July 3, 2026 08:02

boring-cyborg Bot added area:dev-tools backport-to-v3-3-test Backport to v3-3-test labels Jul 3, 2026

RoyLee1224 added 10 commits July 3, 2026 16:02

Add skill-eval harness scaffold with promptfoo

f5b3f5e

ci: Add prek hook to remind eval on AGENTS.md and SKILL.md changes

e5c6ec3

ci: add OAuth auth and runtime config generation to skill-eval harness

d9de5c9

feat: add skill-eval harness for AGENTS.md regression testing

11480b4

docs: update skill-eval README and use Helm routing as starter case

733d3e0

fix: point skill-eval reminder hook at eval.py, not nonexistent eval.sh

fb91e0e

refactor: skip working arm when unchanged, show model in output

6634304

refactor: skip working arm when unchanged, show model, use promptfoo@…

86c3886

…latest

refactor: replace command-routing cases with newsfragment cases from …

abf9122

…real PRs

RoyLee1224 force-pushed the feat/skill-eval-harness branch from b73f705 to e95339b Compare July 3, 2026 08:02

RoyLee1224 changed the title ~~Feat/skill eval harness~~ Add eval harness for testing AGENTS.md changes Jul 3, 2026

potiuk reviewed Jul 3, 2026


jason810496 reviewed Jul 3, 2026


jason810496 removed the backport-to-v3-3-test Backport to v3-3-test label Jul 3, 2026
