Add eval harness for testing AGENTS.md changes#69308

Open
RoyLee1224 wants to merge 10 commits into
apache:mainfrom
RoyLee1224:feat/skill-eval-harness

@RoyLee1224

Contributor

What

Adds an eval harness (dev/skill-evals/) for testing AGENTS.md guidance against real scenarios. It answers: "does my AGENTS.md change actually affect agent behavior?"

uv run dev/skill-evals/eval.py --repeat 3

Compares the main branch AGENTS.md against your working tree. Each arm is a git worktree with the full repo, so the agent sees real source files. If AGENTS.md is unchanged, the working arm is skipped automatically.

No API key needed — authenticates via Claude Code OAuth (claude /login).

Demo: testing the newsfragment golden rule

Following the discussion in #release-management, I tested the golden rule (#67982) against real PRs where reviewers asked to remove newsfragments:

| Case | with AGENTS.md | without |
| --- | --- | --- |
| Provider pod leak #67333 | 3/3 ✓ | 0/3 |
| API optimization #66696 | 3/3 ✓ | 1/3 |
| Scheduler fix #64322 | 2/3 | 0/3 |
| i18n cache fix #65720 | 1/3 | 0/3 |

The golden rule works for clear cases but struggles with ambiguous fixes where the model reasons "this bug affects users."

Open question: I'm not sure the current case selection is the right design. Would appreciate maintainers' thoughts on what cases are worth covering.

cli demo

[Screenshot: CLI demo]

UI demo (run npx promptfoo@0.121.17 view):

[Screenshot: UI demo]

How it works

  1. Creates git worktrees — one with main's AGENTS.md, one with your working tree version. Both are full repo checkouts.
  2. Generates a promptfoo config with anthropic:claude-agent-sdk provider and structured JSON output.
  3. Runs each case against all arms in parallel and reports the difference.
  4. Cleans up the worktrees on exit.
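
The worktree setup and cleanup in steps 1 and 4 can be sketched roughly as below. This is a minimal illustration, not the code in eval.py: the function names (`worktree_add_cmd`, `worktree_remove_cmd`, `eval_arm`) and the temp-directory layout are hypothetical, and only the standard `git worktree add` / `git worktree remove` subcommands are assumed.

```python
import shutil
import subprocess
import tempfile
from contextlib import contextmanager
from pathlib import Path


def worktree_add_cmd(path: Path, ref: str) -> list[str]:
    """Build the command for a detached full checkout of `ref` at `path`."""
    return ["git", "worktree", "add", "--detach", str(path), ref]


def worktree_remove_cmd(path: Path) -> list[str]:
    """Build the matching cleanup command."""
    return ["git", "worktree", "remove", "--force", str(path)]


@contextmanager
def eval_arm(repo: Path, ref: str):
    """Yield a worktree of `ref` (hypothetical helper); remove it on exit."""
    tmp = Path(tempfile.mkdtemp(prefix="skill-eval-"))
    path = tmp / "arm"
    subprocess.run(worktree_add_cmd(path, ref), cwd=repo, check=True)
    try:
        yield path
    finally:
        # Cleanup runs even if the eval raises, mirroring "cleaned up on exit".
        subprocess.run(worktree_remove_cmd(path), cwd=repo, check=False)
        shutil.rmtree(tmp, ignore_errors=True)
```

Under these assumptions, the "main" arm would be `with eval_arm(repo, "main") as arm: ...`, and the working arm could be the same checkout with the working tree's AGENTS.md copied over it.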

Files

dev/skill-evals/
  eval.py                      # entry point (Python, per dev/ guidelines)
  cases/newsfragment.yaml      # cases from real PRs (#64322, #65720, #66696, #67333)
  README.md                    # setup and usage
.pre-commit-config.yaml        # prek hook: remind to run eval on AGENTS.md changes
scripts/ci/prek/check_eval_reminder.py

Future use

  • Auto-generate a summary table (like the one in this PR) from promptfoo results. Might be useful for pasting into PRs that change AGENTS.md; open to discussion
  • As models improve, some guidance may become unnecessary — the eval helps determine which rules the model still depends on and which can be safely removed
  • Add cases for routing rules that matter — the prek hook reminds contributors to run the eval when AGENTS.md changes
  • Use the eval to validate moving guidance from AGENTS.md into skills — run before and after to confirm no regression
  • Currently tests Claude only; the architecture (promptfoo + structured output) could extend to other agent runtimes
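
As a rough illustration of the first bullet, a pass-rate table like the one in this PR could be produced from per-run results. The `Result` record below is a hypothetical shape chosen for the sketch, not promptfoo's actual output format; a real implementation would first map promptfoo's results into something like it.

```python
from collections import defaultdict
from typing import NamedTuple


class Result(NamedTuple):
    """One eval run (hypothetical shape, not promptfoo's schema)."""
    case: str
    arm: str
    passed: bool


def summarize(results: list[Result], arms: tuple[str, ...]) -> str:
    """Render a Markdown table with passed/total per case and arm."""
    # tally[case][arm] == [passed_count, total_count]
    tally = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in results:
        cell = tally[r.case][r.arm]
        cell[0] += int(r.passed)
        cell[1] += 1
    lines = [
        "| Case | " + " | ".join(arms) + " |",
        "|" + " --- |" * (len(arms) + 1),
    ]
    for case, by_arm in tally.items():
        cells = [f"{by_arm[a][0]}/{by_arm[a][1]}" for a in arms]
        lines.append(f"| {case} | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Cases appear in the order they are first seen, so the table follows the order of the case file if results are collected in that order.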

Was generative AI tooling used to co-author this PR?
  • Yes, Claude Code (Opus 4.8)

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

@RoyLee1224 RoyLee1224 force-pushed the feat/skill-eval-harness branch from b73f705 to e95339b July 3, 2026 08:02
@RoyLee1224 RoyLee1224 changed the title from "Feat/skill eval harness" to "Add eval harness for testing AGENTS.md changes" Jul 3, 2026
@RoyLee1224

Contributor Author

This PR focuses on the harness design plus one base case (the newsfragment golden rule); expanding coverage can come in follow-up PRs. I'm also not sure the current case design is right; maintainers' thoughts are welcome!

@potiuk

potiuk commented Jul 3, 2026

Member

Nice. Just a static check failure :)

"""Remind contributors to run skill-eval when guidance files change.

This hook prints a warning when AGENTS.md or SKILL.md is staged for
commit. It always exits 0 — it never blocks the commit.

Member

I am not sure that is going to be enough of a nudge. I think what we could do instead is keep a proof somewhere that the evals have been run after the change. It would likely be a good idea to add a pre-push prek hook that verifies that proof (and fails in CI if the proof is not there).

This could be done rather simply:

  • calculate a consistent hash of all the input AGENTS.md and related files
  • when you run the eval and it succeeds, store the hash as the "last successful eval run hash"
  • the prek hook should verify that the hashes match
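
A minimal sketch of those three steps, under stated assumptions: the file patterns, the stamp-file idea, and all function names here are hypothetical and not part of the PR. The hash covers relative paths and contents in sorted order so that it is stable across machines.

```python
import hashlib
from pathlib import Path

# Assumed set of guidance files; the real list would need to be agreed on.
PATTERNS = ("AGENTS.md", "**/SKILL.md")


def guidance_hash(root: Path, patterns: tuple[str, ...] = PATTERNS) -> str:
    """Hash relative paths and contents of all guidance files, in sorted order."""
    files = sorted({p for pat in patterns for p in root.glob(pat) if p.is_file()})
    digest = hashlib.sha256()
    for path in files:
        data = path.read_bytes()
        digest.update(path.relative_to(root).as_posix().encode("utf-8") + b"\0")
        # Length prefix keeps file boundaries unambiguous.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def record_successful_eval(root: Path, stamp: Path) -> None:
    """Called after a passing eval run: store the current hash."""
    stamp.write_text(guidance_hash(root) + "\n", encoding="utf-8")


def eval_is_current(root: Path, stamp: Path) -> bool:
    """What the hook would check: stored hash matches the files as they are now."""
    if not stamp.is_file():
        return False
    return stamp.read_text(encoding="utf-8").strip() == guidance_hash(root)
```

A hook built on this would exit non-zero when `eval_is_current` returns False. Note that a hash only shows the eval ran against these inputs, not that the stored value was produced honestly.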

Member

Of course, we also have to take into account that someone might not be able to run the eval, and have an escape hatch for that. In this case we likely need some way or process where a maintainer who has agentic access takes over, runs the evals for such a contributor, and updates the PR.

Comment thread dev/skill-evals/eval.py
config_lines.append(OUTPUT_SCHEMA)
config_lines.append("")


@potiuk potiuk Jul 3, 2026

Member

I think we should have several options for how to run the evals:

  • we should not limit ourselves to Claude: all kinds of LLMs, including open-weight / local LLMs, should be OK
  • we should allow the evals to be run both inside and outside the agents. Token economics currently strongly favour running agentic workflows inside agentic CLIs: those tokens are billed within the subscription and are often 10x - 60x cheaper than usage-billed API tokens. So we should have a SKILL that runs the evals inside the agent and makes use of those tokens, as one option for running evals (this should also be vendor-neutral).

Member

Actually... that made me think that we should eventually add such an "eval" framework to Magpie, or rather turn the eval framework we have there into a reusable SKILL. I think that is the right approach here: make sure that what we do here follows the same approach as Magpie (or that there is an easy way to convert Magpie's eval framework to use what we do here), and eventually upstream it to Magpie, so that Magpie provides a "shared eval framework" and "dogfoods" it at the same time.

@jason810496 jason810496 left a comment

Member

Nice! It's really simple and lightweight (from the user setup perspective; I didn't check whether promptfoo itself is lightweight or not).

I will shortly follow up with the cases I just thought of that are also suitable for eval.

Should I create a newsfragment?
assert:
  - type: javascript
    value: 'output.should_create === false'

@jason810496 jason810496 Jul 3, 2026

Member

Do we need to add a positive test case? Here are some cases I thought of: Airflow security boundary changes, the recent coordinator interface, and the recent TestConnection change (executing the TestConnection workload on a worker instead of directly on the API server).

Corresponding PRs:

Comment thread dev/skill-evals/eval.py
Comment on lines +79 to +83
sdk_dir = Path.home() / ".promptfoo-sdk" / "node_modules" / "@anthropic-ai" / "claude-agent-sdk"
if not sdk_dir.is_dir():
    print("Error: Claude Agent SDK not found. Run:", file=sys.stderr)
    print(
        " mkdir -p ~/.promptfoo-sdk && cd ~/.promptfoo-sdk"

@jason810496 jason810496 Jul 3, 2026

Member

Small nit: currently all user-generated content should be located under the /files artifact directory.

This convention was also recently introduced in AGENTS.md by TP (might it also be a classic case for eval?)

Corresponding PR:

Comment thread dev/skill-evals/eval.py
Comment on lines +72 to +77
if not shutil.which("node"):
    print("Error: Node.js not found. Install Node.js >=22.22.0", file=sys.stderr)
    sys.exit(1)
if not shutil.which("npx"):
    print("Error: npx not found.", file=sys.stderr)
    sys.exit(1)

Member

I wonder whether we could have these system-level deps handled at the prek hook level. IIRC, we're able to set system deps in the pre-commit config. Then users don't need to install the Node runtime themselves.

Comment thread dev/skill-evals/eval.py
Comment on lines +304 to +312
# Default test config
config_lines.append("defaultTest:")
config_lines.append(" options:")
config_lines.append(" disableVarExpansion: true")
if skill_name:
    config_lines.append(" assert:")
    config_lines.append(" - type: skill-used")
    config_lines.append(f" value: {skill_name}")
config_lines.append("")

Member

Not sure, but could it be better to construct the config as a dict and then serialize it as YAML at the end of the script?
(Additionally, we could use TypedDict to represent the hierarchy structurally.)

Comment thread dev/skill-evals/README.md
Comment on lines +86 to +87
# View results in browser (use the pinned version printed by eval.py):
npx promptfoo@0.121.17 view

Member

Then perhaps we can use another stage, a manual prek hook, to show the result (so that the node deps are managed by prek).

Comment on lines +18 to +22
"""Remind contributors to run skill-eval when guidance files change.

This hook prints a warning when AGENTS.md or SKILL.md is staged for
commit. It always exits 0 — it never blocks the commit.
"""

Member

Nice, perhaps as a follow-up we can add a Slack bot that runs this in CI monthly and posts the message to the #internal-ci-cd-channel; then someone will run the eval for sure!

@jason810496 jason810496 removed the backport-to-v3-3-test label Jul 3, 2026