Add eval harness for testing AGENTS.md changes#69308
Conversation
- Verify CLAUDE.md → AGENTS.md symlink before running; baseline arm removes both - Pin promptfoo to 0.121.17 - Fail fast on git worktree errors and clean up worktrees on partial failure - Reject SKILL_NAME + --full (baseline arm would always fail skill-used assert) - Add tests for check_eval_reminder; match guidance file basenames exactly
b73f705 to
e95339b
Compare
|
This PR focuses on the harness design plus one base case (the newsfragment golden rule); expanding coverage can come in follow-up PRs. I'm also not sure the current case design is right, maintainers' thoughts welcome! |
|
Nice. Just a static check failure :) |
| """Remind contributors to run skill-eval when guidance files change. | ||
|
|
||
| This hook prints a warning when AGENTS.md or SKILL.md is staged for | ||
| commit. It always exits 0 — it never blocks the commit. |
There was a problem hiding this comment.
I am not sure if that is going to be enough of a nudge. I think what we could do instead is to have somewhere a proof that the evals have been run after the change. That woud likely be a good idea to make a pre-push prek hook that verifies that (and failing in CI in case the proof is not there).
This could be done rather simply:
- calculate consistent hash of all the input AGENTS.md and related files
- when you run eval and it succeeds -> store the hash in "Last succesufl eval run hash".
- the prek hook should verify if the hashes match
There was a problem hiding this comment.
Of course - we have to also take into account that somoene might not be able to run the eval and have escape hatch for that - in this case likely we need some way or process where maintainer who has agentic access takes over and runs the evals for such contributor and updates the PR.
| config_lines.append(OUTPUT_SCHEMA) | ||
| config_lines.append("") | ||
|
|
||
|
|
There was a problem hiding this comment.
I think we should have several options how to run the evals:
- we should not limit ourselves to Claude - all kind of LLMs - including open-weight / local LLMs should be OK
- we should allow the evals to be used both inside the agents and outside. The token economics currently is strongly in favour of running agentic workflows inside agentic CLIs - those tokens are billed within subscription and they are often 10x - 60x cheaper than API tokens billed as usage based ones. So we should basically have a SKILL to run the evals inside the agent that will make use of those tokens as an option to run evals (this also should be vendor-neutral).
There was a problem hiding this comment.
Actually... That made me think that possibly we should, eventually add such "eval" framework to Magpie - or rather turn the eval framework we have there into reusable SKILL - I think that should be the right approach here - make sure that what we do here follows the same approach as Magpie (or easy way to convert the eval framework of Magpie to use what we do here) - and eventually upstream it to Magpie, so that Magpie provides a "shared eval framework" and "dogfoods" it at the same time.
jason810496
left a comment
There was a problem hiding this comment.
Nice! It's really simple and lightweight (for the user setup perspective, I didn't check whether the promptfoo itself is lightweight or not).
I will update the casees I that just thought which also suitable for eval shortly.
| Should I create a newsfragment? | ||
| assert: | ||
| - type: javascript | ||
| value: 'output.should_create === false' |
There was a problem hiding this comment.
Do we need to add the positive test case? Here're some cases that I thought: Airflow security boundary changes, the recent coordinator interface, the recent TestConnection change (execute the TestConnection workload on worker instead of directly being executed on API server).
Corresponding PRs:
- Coordinator interface: Add Coordinator Layer and Java Coordinator #65958
- TestConnection executed on worker (the security boundary change): Add async connection testing via workers for security isolation #62343
| sdk_dir = Path.home() / ".promptfoo-sdk" / "node_modules" / "@anthropic-ai" / "claude-agent-sdk" | ||
| if not sdk_dir.is_dir(): | ||
| print("Error: Claude Agent SDK not found. Run:", file=sys.stderr) | ||
| print( | ||
| " mkdir -p ~/.promptfoo-sdk && cd ~/.promptfoo-sdk" |
There was a problem hiding this comment.
Small nits: that currently all the user generated content should be located under the /files artifact directory.
This convention is also introduced in AGENTS.md recently by TP (might also be the classic case for eval?)
Corresponding PR:
files/output convention in AGENTS.md: Instruct agents to put generated files in files/ #69097
| if not shutil.which("node"): | ||
| print("Error: Node.js not found. Install Node.js >=22.22.0", file=sys.stderr) | ||
| sys.exit(1) | ||
| if not shutil.which("npx"): | ||
| print("Error: npx not found.", file=sys.stderr) | ||
| sys.exit(1) |
There was a problem hiding this comment.
I wonder could we ensure these system level deps be handled by prek hook level. IIRC, we're able to set system deps on the pre-commit config. Then user don't need to install the node runtime themself.
| # Default test config | ||
| config_lines.append("defaultTest:") | ||
| config_lines.append(" options:") | ||
| config_lines.append(" disableVarExpansion: true") | ||
| if skill_name: | ||
| config_lines.append(" assert:") | ||
| config_lines.append(" - type: skill-used") | ||
| config_lines.append(f" value: {skill_name}") | ||
| config_lines.append("") |
There was a problem hiding this comment.
Not sure could it be better to construct using dict then serialize at the end of the script as yaml?
(Additionally, we can use TypeDict to represent those hierarchy structurally.
| # View results in browser (use the pinned version printed by eval.py): | ||
| npx promptfoo@0.121.17 view |
There was a problem hiding this comment.
Then perhaps we can use the another stage: manual prek hook to show the result (to manage the node deps by prek).
| """Remind contributors to run skill-eval when guidance files change. | ||
|
|
||
| This hook prints a warning when AGENTS.md or SKILL.md is staged for | ||
| commit. It always exits 0 — it never blocks the commit. | ||
| """ |
There was a problem hiding this comment.
Nice, perhaps we can add the slack bot as follow-up to run this as CI monthly then post the message to the #internal-ci-cd-channel then someone will eval for sure!
What
Adds an eval harness (
dev/skill-evals/) for testing AGENTS.md guidance against real scenarios. It answers: "does my AGENTS.md change actually affect agent behavior?"Compares the
mainbranch AGENTS.md against your working tree. Each arm is a git worktree with the full repo, so the agent sees real source files. If AGENTS.md is unchanged, the working arm is skipped automatically.No API key needed — authenticates via Claude Code OAuth (
claude /login).Demo: testing the newsfragment golden rule
Following the discussion in #release-management
Tested the golden rule (#67982) against real PRs where reviewers asked to remove newsfragments:
Golden rule works for clear cases but struggles with ambiguous fixes where the model reasons "this bug affects users."
Open question: I'm not sure the current case selection is the right design. Would appreciate maintainers' thoughts on what cases are worth covering.
cli demo
UI demo(run
npx promptfoo@0.121.17 view):How it works
main's AGENTS.md, one with your working tree version. Both are full repo checkouts.anthropic:claude-agent-sdkprovider and structured JSON output.Files
Future use
Was generative AI tooling used to co-author this PR?
{pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.