Close Harbor integration gaps: timer daemon, verifier isolation, artifact collection by surelyMersad · Pull Request #37 · aisa-group/PostTrainBench

surelyMersad · 2026-04-25T00:26:23Z

Closes #8. Replaces #33, with @hrdkbhatnagar's review feedback addressed
and verified end-to-end on Modal.

Stacked on #36 ("Repin vllm and inspect_evals so the Dockerfile
builds"). The end-to-end Modal verification below was run with #36's
Dockerfile fix in place. If #36 merges first, this PR's diff against
add_harbor_support will become net-clean. Until then, the diff here
includes #36's two-line vllm/inspect_evals repin; please review those
two lines as part of #36, not here.

What changed since #33

- Pre-agent timer is a real daemon, not a sentinel file. The
  sentinel-file approach was correctly flagged as broken: the agent
  could (and would) influence its start time. The new design starts a
  background timer at container boot via three redundant mechanisms
  (ENTRYPOINT, /etc/bash.bashrc, BASH_ENV profile script), all
  idempotent via a PID file. start_epoch is captured by the daemon's
  first iteration, before any agent code runs.
- tests/test.sh restructured around a single fail() helper that
  always writes reward.txt and metrics.json and snapshots the timer
  daemon's state into /logs/verifier/timer/, so reviewer-visible
  evidence lands on every exit path, not just the success path.
- Artifact collection moved out of test.sh and onto Harbor's top-level
  artifacts: config in the generated job.yaml. Fixes a bug from #33
  where the in-script find … -exec cp block was unreachable on early
  exits (e.g. missing final_model), exactly when artifacts are most
  useful.
- Anti-tamper shebang check added: test.sh verifies timer.sh and
  entrypoint.sh still start with their expected shebangs.
- Harbor version bumped to ≥ 0.2.0 in install instructions. Harbor
  0.2.0 fixed two upstream issues that were blocking the integration:
  - Image.from_dockerfile(path) now passes context_dir=environment_dir,
    so COPY directives resolve against the Dockerfile's directory
    regardless of where harbor run was invoked.
  - File transfers use 4 MB chunks (was 8 KB) and parallelise downloads
    via an asyncio semaphore.
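
For reviewers who want the PID-file idempotency spelled out, here is a minimal Python sketch of the guard, not the actual timer.sh. It assumes a POSIX system; the function names (pid_alive, start_timer_once) and the state_dir parameter are hypothetical, and only the file names timer.pid, start_epoch and budget_secs come from the snapshot shown in this PR. A real daemon would launch its background loop at the point where the sketch returns True.

```python
import os
import tempfile
import time
from pathlib import Path


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 probes for existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True


def start_timer_once(state_dir: Path, budget_secs: int, now=time.time) -> bool:
    """Claim the PID file and record the start time; no-op if already claimed.

    Returns True only for the caller that actually claimed the timer.
    """
    state_dir.mkdir(parents=True, exist_ok=True)
    pid_file = state_dir / "timer.pid"

    if pid_file.exists():
        recorded = pid_file.read_text().strip()
        if recorded.isdigit() and pid_alive(int(recorded)):
            return False  # a live timer already owns the PID file
        pid_file.unlink(missing_ok=True)  # stale or unreadable: reclaim it

    try:
        # O_EXCL makes creation atomic, so concurrent starters cannot both win.
        fd = os.open(pid_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))

    # Captured once, by whichever start attempt won the PID file.
    (state_dir / "start_epoch").write_text(str(int(now())))
    (state_dir / "budget_secs").write_text(str(budget_secs))
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp) / "timer"
        # Three start attempts, standing in for the three boot-time hooks.
        print([start_timer_once(state, 60) for _ in range(3)])
```

Only the first attempt claims the PID file, so later hooks leave start_epoch untouched. That is the property the redundancy relies on: whichever mechanism fires first wins, and the others are harmless.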

End-to-end verification on Modal

Real Modal run, gsm8k+qwen3-1.7b, --agent nop, TASK_BUDGET_SECS=60:

verifier/test-stdout.txt:
=== Verifying evaluate.py integrity ===
OK: evaluate.py integrity verified (02b97287d5cd2179...)
OK: timer.sh and entrypoint.sh shebangs intact
OK: timer state snapshotted
=== GPU Check ===
NVIDIA H100 80GB HBM3 ...
=== Checking final_model ===
FAIL: final_model directory not found

verifier/reward.txt: 0
verifier/metrics.json: {"error":"final_model directory not found","accuracy":0}
verifier/timer/start_epoch: 1777071723
verifier/timer/budget_secs: 60
verifier/timer/timer.pid: 5
verifier/timer/alert_5min ← all three sub-budget alerts fired,
verifier/timer/alert_10min ← proving the daemon loop ticked at least
verifier/timer/alert_30min ← once before the verifier ran

Together, pid=5 and the pre-fired alerts show that the entrypoint
started the daemon before the agent or verifier ran, which is the
property the "we want a pre-agent hook" request was asking for.
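
The reward.txt, metrics.json and timer/ files in the output above are what a single failure helper is meant to leave behind on every exit path. The following Python sketch illustrates that shape under stated assumptions; it is not the fail() from tests/test.sh. The names snapshot_timer, verifier_dir and timer_state are hypothetical, and the metrics layout simply mirrors the metrics.json line shown above.

```python
import json
import shutil
import tempfile
from pathlib import Path


def snapshot_timer(timer_state: Path, dest: Path) -> None:
    """Copy whatever timer state exists; do nothing if there is none."""
    if timer_state.is_dir():
        shutil.copytree(timer_state, dest, dirs_exist_ok=True)


def fail(message: str, verifier_dir: Path, timer_state: Path) -> int:
    """Record a zero reward, the error, and a timer snapshot; return exit code 1."""
    verifier_dir.mkdir(parents=True, exist_ok=True)
    (verifier_dir / "reward.txt").write_text("0\n")
    metrics = {"error": message, "accuracy": 0}
    (verifier_dir / "metrics.json").write_text(
        json.dumps(metrics, separators=(",", ":"))
    )
    # Snapshot last so a missing timer directory cannot block the reward write.
    snapshot_timer(timer_state, verifier_dir / "timer")
    return 1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        code = fail("final_model directory not found", root / "verifier", root / "timer")
        print(code, (root / "verifier" / "metrics.json").read_text())
```

Routing every failure through one function is what keeps the evidence uniform: a check only has to supply a message, and the reward, metrics and timer snapshot are written the same way each time.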

Test plan

- Task generation: all 28 combinations generate without error
- Modal e2e with --agent nop: build → verifier → reward.txt + timer state captured (output
  above), result.json.exception_info: null
- Modal e2e with --agent claude-code and short timeout: skipped for smoke; can attach if
  requested
- Modal SHA256 tamper variant: verified locally; can attach Modal log if requested

Two upstream-shifted dependencies were preventing the harbor_adapter Dockerfile from building on Modal:

- vllm==0.11.0 requires xformers==0.0.32.post1, which is no longer on PyPI for manylinux_x86_64. Repinned to 0.19.1, which builds and runs end-to-end on Modal.
- inspect_evals was cloned --depth=1 from main, but main HEAD now requires Python>=3.11 while the image installs python3.10. Switched to `uv pip install "inspect_evals @ git+...@<sha>"` pinned to commit 03cb4bc2 (2026-03-15), the last commit on main still declaring requires-python = ">=3.10". Also removes the manual git clone step.

Also adds local-only artifacts to .gitignore so they don't sneak in.

…fact collection

Closes aisa-group#8. Replaces aisa-group#33 with hrdkbhatnagar's review feedback addressed end-to-end on Modal.

- Pre-agent timer is a real daemon, not a sentinel-file. Started at container boot via three redundant mechanisms (ENTRYPOINT, /etc/bash.bashrc, BASH_ENV profile script), all idempotent via a PID file. start_epoch is captured by the daemon's first iteration before any agent code runs.
- tests/test.sh restructured around a single fail() helper that always writes reward.txt and metrics.json and snapshots the timer daemon's state into /logs/verifier/timer/, so reviewer-visible evidence lands on every exit path, not just the success path.
- Artifact collection moved out of test.sh and onto Harbor's top-level artifacts: config in the generated job.yaml. Fixes a bug from aisa-group#33 where the in-script find/cp loop was unreachable on early exits (e.g. missing final_model), exactly when artifacts are most useful.
- Anti-tamper shebang check: test.sh verifies timer.sh and entrypoint.sh still start with their expected shebangs.
- evaluate.py SHA256 integrity check: hash injected into test.sh at task-generation time; mismatch -> reward 0.
- Requires Harbor >= 0.2.0 (context_dir fix and 4 MB chunked parallel file transfers both landed in 0.2.0).
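
The two anti-tamper checks named in the commit message above (the evaluate.py SHA256 comparison and the shebang check) can be sketched in Python as follows. This is an illustration under assumptions, not the logic in test.sh: the function names are hypothetical, and the expected shebang string is a placeholder because the PR does not state which shebangs the scripts use.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_shebang(path: Path, expected: str) -> bool:
    """Return True if the file exists and its first line equals `expected`."""
    if not path.is_file():
        return False
    with path.open("rb") as handle:
        first_line = handle.readline()
    return first_line.rstrip(b"\r\n") == expected.encode()


def integrity_errors(evaluate_py: Path, expected_sha256: str, shebangs: dict) -> list:
    """Collect every failed check; an empty list means nothing looks tampered."""
    errors = []
    if not evaluate_py.is_file() or sha256_of(evaluate_py) != expected_sha256:
        errors.append(f"{evaluate_py.name} hash mismatch")
    for script, expected in shebangs.items():
        if not has_shebang(script, expected):
            errors.append(f"{script.name} shebang changed")
    return errors


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        evaluate = root / "evaluate.py"
        evaluate.write_text("print('score')\n")
        timer = root / "timer.sh"
        timer.write_text("#!/bin/bash\nsleep 1\n")  # placeholder shebang
        pinned = sha256_of(evaluate)  # stands in for the hash injected at generation time
        print(integrity_errors(evaluate, pinned, {timer: "#!/bin/bash"}))
```

In the flow described above, any non-empty result would map to reward 0. Note that a shebang check only detects a changed interpreter line, not edits to the rest of the script.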

surelyMersad added 2 commits April 24, 2026 17:13

surelyMersad wants to merge 2 commits into aisa-group:add_harbor_support from surelyMersad:close-harbor-integration-gaps

surelyMersad commented Apr 25, 2026 • edited by hrdkbhatnagar
