Close Harbor integration gaps: timer daemon, verifier isolation, artifact collection #37

Open
surelyMersad wants to merge 2 commits into aisa-group:add_harbor_support from
surelyMersad:close-harbor-integration-gaps

surelyMersad commented Apr 25, 2026

Closes #8. Replaces #33; @hrdkbhatnagar's review feedback is addressed
and verified end-to-end on Modal.

Stacked on #36 ("Repin vllm and inspect_evals so the Dockerfile
builds"). The end-to-end Modal verification below was run with #36's
Dockerfile fix in place. If #36 merges first, this PR's diff against
add_harbor_support will become net-clean. If #36 hasn't merged yet,
the diff here will include #36's two-line vllm/inspect_evals repin —
please review them as part of #36, not here.

What changed since #33

  • Pre-agent timer is a real daemon, not a sentinel file. The
    sentinel-file approach was correctly flagged as broken: the agent
    could (and would) influence its start time. The new design starts a
    background timer at container boot via three redundant mechanisms
    (ENTRYPOINT, /etc/bash.bashrc, BASH_ENV profile script), all
    idempotent via a PID file. start_epoch is captured by the daemon's
    first iteration before any agent code runs.

  • tests/test.sh restructured around a single fail() helper that
    always writes reward.txt and metrics.json and snapshots the timer
    daemon's state into /logs/verifier/timer/ — so reviewer-visible
    evidence lands on every exit path, not just the success path.

  • Artifact collection moved out of test.sh onto Harbor's top-level
    artifacts: config in the generated job.yaml. Fixes a bug from #33
    where the in-script find … -exec cp block was unreachable on early
    exits (e.g. missing final_model), which is exactly when artifacts
    are most useful.

  • Anti-tamper shebang check added: test.sh verifies timer.sh and
    entrypoint.sh still start with their expected shebangs.

  • Harbor version bumped to ≥ 0.2.0 in install instructions. Harbor
    0.2.0 fixed two upstream issues that were blocking the integration:

    • Image.from_dockerfile(path) now passes context_dir=environment_dir,
      so COPY directives resolve against the Dockerfile's directory
      regardless of where harbor run was invoked.
    • File transfers use 4 MB chunks (was 8 KB) and parallelise downloads
      via an asyncio semaphore.
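
For reviewers who want the shape of the idempotent start without opening timer.sh, here is a minimal sketch. It is not the PR's actual script: `start_timer_once`, `timer_loop`, `TIMER_DIR`, and the temp-directory paths are hypothetical names chosen for illustration, and the real daemon presumably does more work per tick. The three calls at the bottom stand in for the ENTRYPOINT, /etc/bash.bashrc, and BASH_ENV triggers.

```shell
#!/bin/sh
# Sketch only: names and paths are hypothetical, not taken from timer.sh.
TIMER_DIR="${TIMER_DIR:-$(mktemp -d)}"
PID_FILE="$TIMER_DIR/timer.pid"

timer_loop() {
    # Record the start time on the first iteration, and never overwrite it.
    [ -f "$TIMER_DIR/start_epoch" ] || date +%s > "$TIMER_DIR/start_epoch"
    while true; do
        sleep 1
    done
}

start_timer_once() {
    # If the PID file names a live process, a daemon is already running.
    if [ -f "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null; then
        return 0
    fi
    timer_loop >/dev/null 2>&1 &
    echo "$!" > "$PID_FILE"
}

# Simulate the three redundant triggers firing one after another.
start_timer_once                      # e.g. ENTRYPOINT
first_pid="$(cat "$PID_FILE")"
start_timer_once                      # e.g. /etc/bash.bashrc
start_timer_once                      # e.g. BASH_ENV script
last_pid="$(cat "$PID_FILE")"
echo "first start pid=$first_pid, pid after three triggers=$last_pid"

# Clean up the demo daemon.
kill "$first_pid" 2>/dev/null
wait "$first_pid" 2>/dev/null || true
```

Only the first call starts a loop; the later calls see a live PID and return. The check-then-write sequence in this sketch is not atomic, so two triggers racing at the same instant could still start two loops. Whether the real script guards against that is not something this sketch settles.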

End-to-end verification on Modal

Real Modal run, gsm8k+qwen3-1.7b, --agent nop, TASK_BUDGET_SECS=60:

verifier/test-stdout.txt:
=== Verifying evaluate.py integrity ===
OK: evaluate.py integrity verified (02b97287d5cd2179...)
OK: timer.sh and entrypoint.sh shebangs intact
OK: timer state snapshotted
=== GPU Check ===
NVIDIA H100 80GB HBM3 ...
=== Checking final_model ===
FAIL: final_model directory not found

verifier/reward.txt: 0
verifier/metrics.json: {"error":"final_model directory not found","accuracy":0}
verifier/timer/start_epoch: 1777071723
verifier/timer/budget_secs: 60
verifier/timer/timer.pid: 5
verifier/timer/alert_5min ← all three sub-budget alerts fired,
verifier/timer/alert_10min ← proving the daemon loop ticked at least
verifier/timer/alert_30min ← once before the verifier ran

pid=5 and pre-fired alerts together prove the daemon was started by
the entrypoint before the agent or verifier ran — which is the
property "we want a pre-agent hook" was asking for.
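
One way to see why all three alerts are already present with a 60-second budget: if each alert fires once the remaining time drops to or below its threshold, then a budget shorter than 5 minutes trips every threshold on the first tick. The sketch below assumes exactly that rule. `timer_tick` and its arguments are hypothetical, and the actual firing condition in timer.sh may differ.

```shell
#!/bin/sh
# Sketch only: assumes an alert fires when remaining time <= its threshold.
timer_tick() {
    dir="$1"   # directory holding start_epoch and budget_secs
    now="$2"   # current time in epoch seconds
    start="$(cat "$dir/start_epoch")"
    budget="$(cat "$dir/budget_secs")"
    remaining=$(( budget - (now - start) ))
    for minutes in 5 10 30; do
        if [ "$remaining" -le $(( minutes * 60 )) ]; then
            : > "$dir/alert_${minutes}min"   # create the marker file
        fi
    done
}

# Reproduce the run above: budget 60 s, one tick a second after start.
demo_dir="$(mktemp -d)"
echo 1777071723 > "$demo_dir/start_epoch"
echo 60 > "$demo_dir/budget_secs"
timer_tick "$demo_dir" 1777071724
ls "$demo_dir"

# Contrast: a one-hour budget leaves every alert unfired on the first tick.
long_dir="$(mktemp -d)"
echo 1777071723 > "$long_dir/start_epoch"
echo 3600 > "$long_dir/budget_secs"
timer_tick "$long_dir" 1777071724
```

Under this assumed rule, the alert files are evidence of at least one completed tick, which matches the reading given above.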

Test plan

  • Task generation: all 28 combinations generate without error
  • Modal e2e with --agent nop: build → verifier → reward.txt + timer state captured (output
    above), result.json.exception_info: null
  • Modal e2e with --agent claude-code and short timeout — skipped for smoke; can attach if
    requested
  • Modal SHA256 tamper variant — verified locally; can attach Modal log if requested
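
For context on what the tamper variant exercises, here is a rough sketch of the two integrity checks. The helper names `check_sha256` and `check_shebang`, the sample file contents, and the `#!/bin/bash` shebang are assumptions for illustration. In the PR the expected hash is injected into test.sh at task-generation time, which the sketch imitates by hashing the file before modifying it. It assumes GNU `sha256sum` is available.

```shell
#!/bin/sh
# Sketch only: helper names, file contents, and shebang are hypothetical.
check_sha256() {
    # Succeeds when the file's SHA256 matches the expected hex digest.
    actual="$(sha256sum "$1" | cut -d' ' -f1)"
    [ "$actual" = "$2" ]
}

check_shebang() {
    # Succeeds when the file's first line is exactly the expected shebang.
    [ "$(head -n 1 "$1")" = "$2" ]
}

work="$(mktemp -d)"
printf 'print("eval")\n' > "$work/evaluate.py"
printf '#!/bin/bash\necho tick\n' > "$work/timer.sh"

# Stand-in for the hash injected at task-generation time.
expected_hash="$(sha256sum "$work/evaluate.py" | cut -d' ' -f1)"

check_sha256 "$work/evaluate.py" "$expected_hash" && echo "OK: evaluate.py integrity verified"
check_shebang "$work/timer.sh" '#!/bin/bash' && echo "OK: shebang intact"

# Tamper variant: any edit to evaluate.py changes the digest.
echo '# tampered' >> "$work/evaluate.py"
check_sha256 "$work/evaluate.py" "$expected_hash" || echo "FAIL: evaluate.py hash mismatch"
```

A shebang comparison only catches a rewritten first line, so it is a much weaker guard than the hash check. The sketch is meant to show the mechanics, not to argue that either check is sufficient on its own.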

Two upstream-shifted dependencies were preventing the harbor_adapter
Dockerfile from building on Modal:

- vllm==0.11.0 requires xformers==0.0.32.post1, which is no longer on
  PyPI for manylinux_x86_64. Repinned vllm to 0.19.1, which builds and
  runs end-to-end on Modal.

- inspect_evals was cloned --depth=1 from main, but main HEAD now
  requires Python>=3.11 while the image installs python3.10. Switched
  to `uv pip install "inspect_evals @ git+...@<sha>"` pinned to commit
  03cb4bc2 (2026-03-15), the last commit on main still declaring
  requires-python = ">=3.10". Also removes the manual git clone step.

Also adds local-only artifacts to .gitignore so they don't sneak in.
…fact collection

Closes aisa-group#8.

Replaces aisa-group#33 with hrdkbhatnagar's review feedback addressed end-to-end
on Modal.

- Pre-agent timer is a real daemon, not a sentinel-file. Started at
  container boot via three redundant mechanisms (ENTRYPOINT,
  /etc/bash.bashrc, BASH_ENV profile script), all idempotent via a PID
  file. start_epoch is captured by the daemon's first iteration before
  any agent code runs.

- tests/test.sh restructured around a single fail() helper that always
  writes reward.txt and metrics.json and snapshots the timer daemon's
  state into /logs/verifier/timer/, so reviewer-visible evidence lands
  on every exit path — not just the success path.

- Artifact collection moved out of test.sh and onto Harbor's top-level
  artifacts: config in the generated job.yaml. Fixes a bug from aisa-group#33
  where the in-script find/cp loop was unreachable on early exits
  (e.g. missing final_model) — exactly when artifacts are most useful.

- Anti-tamper shebang check: test.sh verifies timer.sh and
  entrypoint.sh still start with their expected shebangs.

- evaluate.py SHA256 integrity check: hash injected into test.sh at
  task-generation time; mismatch -> reward 0.

- Requires Harbor >= 0.2.0 (context_dir fix and 4 MB chunked parallel
  file transfers both landed in 0.2.0).
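
The single-exit-helper pattern described for tests/test.sh can be sketched roughly as below. This is an illustration under assumptions, not the script in this commit: `OUT_DIR`, `TIMER_DIR`, `MODEL_DIR`, and `snapshot_timer` are hypothetical names, the exit status is a guess, and the sketch does not escape quotes in the error message before writing JSON.

```shell
#!/bin/sh
# Sketch only: directory variables and helper names are hypothetical.
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"      # stand-in for /logs/verifier
TIMER_DIR="${TIMER_DIR:-$(mktemp -d)}"  # stand-in for the daemon's state dir
MODEL_DIR="$OUT_DIR/no-such-final_model"

snapshot_timer() {
    # Copy whatever timer state exists; never let this step abort the script.
    mkdir -p "$OUT_DIR/timer"
    cp -r "$TIMER_DIR"/. "$OUT_DIR/timer/" 2>/dev/null || true
}

fail() {
    # Every failing path funnels through here, so outputs are always written.
    echo "FAIL: $1"
    snapshot_timer
    echo 0 > "$OUT_DIR/reward.txt"
    printf '{"error":"%s","accuracy":0}\n' "$1" > "$OUT_DIR/metrics.json"
    exit 1   # assumed status; the real helper may exit differently
}

# Demo: seed some timer state, then hit an early-exit path in a subshell
# so the rest of this script keeps running.
echo 60 > "$TIMER_DIR/budget_secs"
( [ -d "$MODEL_DIR" ] || fail "final_model directory not found" ) || true

cat "$OUT_DIR/reward.txt" "$OUT_DIR/metrics.json"
```

Because the helper owns all output writing, adding a new check to the verifier only needs a `fail "..."` call to get reward.txt, metrics.json, and the timer snapshot on that path too.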