Close Harbor integration gaps: timer daemon, verifier isolation, artifact collection#37
Open
surelyMersad wants to merge 2 commits intoaisa-group:add_harbor_supportfrom
Open
Conversation
Two upstream-shifted dependencies were preventing the harbor_adapter Dockerfile from building on Modal: - vllm==0.11.0 requires xformers==0.0.32.post1, which is no longer on PyPI for manylinux_x86_64. Repinned to 0.19.1, which builds and runs end-to-end on Modal. - inspect_evals was cloned --depth=1 from main, but main HEAD now requires Python>=3.11 while the image installs python3.10. Switched to `uv pip install "inspect_evals @ git+...@<sha>"` pinned to commit 03cb4bc2 (2026-03-15), the last commit on main still declaring requires-python = ">=3.10". Also removes the manual git clone step. Also adds local-only artifacts to .gitignore so they don't sneak in.
…fact collection Closes aisa-group#8. Replaces aisa-group#33 with hrdkbhatnagar's review feedback addressed end-to-end on Modal. - Pre-agent timer is a real daemon, not a sentinel-file. Started at container boot via three redundant mechanisms (ENTRYPOINT, /etc/bash.bashrc, BASH_ENV profile script), all idempotent via a PID file. start_epoch is captured by the daemon's first iteration before any agent code runs. - tests/test.sh restructured around a single fail() helper that always writes reward.txt and metrics.json and snapshots the timer daemon's state into /logs/verifier/timer/, so reviewer-visible evidence lands on every exit path — not just the success path. - Artifact collection moved out of test.sh and onto Harbor's top-level artifacts: config in the generated job.yaml. Fixes a bug from aisa-group#33 where the in-script find/cp loop was unreachable on early exits (e.g. missing final_model) — exactly when artifacts are most useful. - Anti-tamper shebang check: test.sh verifies timer.sh and entrypoint.sh still start with their expected shebangs. - evaluate.py SHA256 integrity check: hash injected into test.sh at task-generation time; mismatch -> reward 0. - Requires Harbor >= 0.2.0 (context_dir fix and 4 MB chunked parallel file transfers both landed in 0.2.0).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #8. Replaces #33 with @hrdkbhatnagar's review feedback addressed
end-to-end on Modal.
What changed since #33
Pre-agent timer is a real daemon, not a sentinel-file. The
sentinel-file approach was correctly flagged as broken — the agent
could (and would) influence its start time. The new design starts a
background timer at container boot via three redundant mechanisms
(
ENTRYPOINT,/etc/bash.bashrc,BASH_ENVprofile script), allidempotent via a PID file.
start_epochis captured by the daemon'sfirst iteration before any agent code runs.
tests/test.shrestructured around a singlefail()helper thatalways writes
reward.txtandmetrics.jsonand snapshots the timerdaemon's state into
/logs/verifier/timer/— so reviewer-visibleevidence lands on every exit path, not just the success path.
Artifact collection moved out of
test.shonto Harbor's top-levelartifacts:config in the generatedjob.yaml. Fixes a bug from Close Harbor integration gaps: verifier isolation, artifact collection #33where the in-script
find … -exec cpblock was unreachable on earlyexits (e.g. missing
final_model) — exactly when artifacts are mostuseful.
Anti-tamper shebang check added:
test.shverifiestimer.shandentrypoint.shstill start with their expected shebangs.Harbor version bumped to ≥ 0.2.0 in install instructions. Harbor
0.2.0 fixed two upstream issues that were blocking the integration:
Image.from_dockerfile(path)now passescontext_dir=environment_dir,so
COPYdirectives resolve against the Dockerfile's directoryregardless of where
harbor runwas invoked.via an asyncio semaphore.
End-to-end verification on Modal
Real Modal run, gsm8k+qwen3-1.7b,
--agent nop,TASK_BUDGET_SECS=60:verifier/test-stdout.txt:
=== Verifying evaluate.py integrity ===
OK: evaluate.py integrity verified (02b97287d5cd2179...)
OK: timer.sh and entrypoint.sh shebangs intact
OK: timer state snapshotted
=== GPU Check ===
NVIDIA H100 80GB HBM3 ...
=== Checking final_model ===
FAIL: final_model directory not found
verifier/reward.txt: 0
verifier/metrics.json: {"error":"final_model directory not found","accuracy":0}
verifier/timer/start_epoch: 1777071723
verifier/timer/budget_secs: 60
verifier/timer/timer.pid: 5
verifier/timer/alert_5min ← all three sub-budget alerts fired,
verifier/timer/alert_10min ← proving the daemon loop ticked at least
verifier/timer/alert_30min ← once before the verifier ran
pid=5and pre-fired alerts together prove the daemon was started bythe entrypoint before the agent or verifier ran — which is the
property "we want a pre-agent hook" was asking for.
Test plan
--agent nop: build → verifier → reward.txt + timer state captured (outputabove),
result.json.exception_info: null--agent claude-codeand short timeout — skipped for smoke; can attach ifrequested