Add HF sandbox provider #841
HF sandbox benchmark
Run id:
[Figures: Throughput by workload; Stage timing]
All benchmark jobs completed successfully.
sergiopaniego
left a comment
Thanks for the first iteration!! Tested locally and it works nicely. Looking forward to this integration :) Some AI-assisted review below, with some ideas about what happens when this is scaled.
I left inline comments (with suggested changes) on the specific spots. The framing I'd suggest: the correctness items and repo-convention items should land in this PR, while the scale-hardening items below are fine as documented limitations plus follow-up issues, as long as the design doesn't close the door on them.
Why the correctness items matter here specifically: this provider will be used to run many environments in parallel for RL training, where a single env that fails silently is worse than one that fails loudly. It either wastes wall-clock (the whole batch waits on a straggler) or feeds garbage observations and rewards into the policy.
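One way to read "fail loudly" for the startup path is a readiness poll that raises once its budget is spent and keeps the last underlying error attached. The sketch below is illustrative only: `wait_until_ready`, `probe`, and the injected `clock`/`sleep` parameters are hypothetical names, not the provider's actual API; the 120s default simply mirrors the budget mentioned in this review.

```python
import time
from typing import Callable, Optional


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `probe` until it returns True and return the elapsed seconds.

    Raises TimeoutError when the budget is spent, so a stuck environment
    surfaces as an error instead of a silent straggler.
    """
    start = clock()
    last_error: Optional[Exception] = None
    while True:
        try:
            if probe():
                return clock() - start
        except Exception as exc:
            # Remember the cause so the eventual failure is debuggable.
            last_error = exc
        if clock() - start >= timeout_s:
            raise TimeoutError(
                f"environment not ready after {timeout_s:.0f}s"
            ) from last_error
        sleep(interval_s)
```

Injecting `clock` and `sleep` keeps the helper testable without real waiting or network access.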
Scale hardening (follow-up OK, but worth flagging)
- Configurable startup timeouts. The 120s budgets are hardcoded. Heavy images and GPU flavors can take much longer to schedule, so these should be `__init__` params.
- Retry/backoff on `run_job`. At high concurrency the Jobs API will rate-limit or return transient init errors. A bounded retry would avoid losing envs to a single blip.
- Reuse a shared `httpx.AsyncClient` in the proxy. A new client (and connection pool) is created per request. With many tool-call steps across many envs this is measurable hot-path overhead.
- Per-env proxy footprint. Each provider instance starts its own uvicorn server, thread, and event loop. At high env concurrency that's a lot of local servers on the trainer host. Worth documenting as a known limit, or considering a single shared proxy that multiplexes upstreams.
- Observability. The provider logs nothing. With many concurrent envs, structured logging of job id, stage, and failure cause is what makes a failing env debuggable.
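As a rough illustration of the bounded retry suggested above, here is a minimal sketch using exponential backoff with jitter. Everything in it is an assumption for illustration: `call_with_retry` is a hypothetical helper, and which exception types the Jobs API raises for rate limits or transient init errors is not stated in this thread, so `retry_on` is left as a parameter.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying a bounded number of times on transient errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # budget spent: surface the last error loudly
            # Exponential backoff (1x, 2x, 4x, ...) capped at max_delay_s.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Jitter spreads out retries from many concurrent envs.
            sleep(delay * random.uniform(0.5, 1.0))
    raise ValueError("max_attempts must be at least 1")
```

The job launch would then be wrapped as `call_with_retry(lambda: ...)` with `retry_on` set to whatever the client actually raises for transient failures; errors outside that tuple still propagate immediately.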
Tests & docs
- No tests yet. The pure helpers (`_job_port_url`, `_to_ws_url`, `_find_available_port`, hop-by-hop filtering) plus `wait_for_ready` are all testable with no network, matching the existing provider test suite under `tests/test_core/`.
- Docstrings. Public methods need the HF doc-builder docstring format used by the other providers.
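To show what network-free tests for such helpers could look like, here is a sketch. The helper bodies are stand-ins written for this example, not the PR's implementations (which are not shown in this thread), so `to_ws_url`, `strip_hop_by_hop`, and `find_available_port` are hypothetical names and behaviors.

```python
import socket
from urllib.parse import urlsplit, urlunsplit

# Hop-by-hop headers as listed in RFC 2616 section 13.5.1
# (using the real header name "Trailer").
HOP_BY_HOP = frozenset({
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade",
})


def to_ws_url(url: str) -> str:
    """Stand-in: map http -> ws and https -> wss, leaving the rest untouched."""
    parts = urlsplit(url)
    scheme = {"http": "ws", "https": "wss"}.get(parts.scheme, parts.scheme)
    return urlunsplit(parts._replace(scheme=scheme))


def strip_hop_by_hop(headers: dict) -> dict:
    """Stand-in: drop headers a proxy must not forward (case-insensitive)."""
    return {k: v for k, v in headers.items() if k.lower() not in HOP_BY_HOP}


def find_available_port() -> int:
    """Stand-in: ask the OS for a free local port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def test_to_ws_url():
    assert to_ws_url("https://example.test/env?x=1") == "wss://example.test/env?x=1"
    assert to_ws_url("http://localhost:8000") == "ws://localhost:8000"


def test_strip_hop_by_hop():
    headers = {"Connection": "keep-alive", "Upgrade": "websocket", "X-Trace": "abc"}
    assert strip_hop_by_hop(headers) == {"X-Trace": "abc"}


def test_find_available_port():
    assert 0 < find_available_port() < 65536
```

The only system resource touched is a local socket bind, so tests of this shape would run without any remote service.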
sergiopaniego
left a comment
I've created a PR to make the container docs consistent (#868). It'd be cool to align the documentation in this PR with that one 😄
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Cursor Bugbot has reviewed your changes using default effort and found 5 potential issues.
Reviewed by Cursor Bugbot for commit 1be2ce8.
Did a full fat implementation here: book_flight_lfm25_hf_sandbox.py
BookFlight experiment results:
Training loss:
Update: restored the in-repo BrowserGym example to the TRL GRPO training example from commit
HF sandbox provider benchmark + example runs
Ran against this OpenEnv branch at
Provider benchmark (
| case | result | start_s | ready_s | reset_s | step_s | total_s | note |
|---|---|---|---|---|---|---|---|
| cold_print | pass | 13.00 | 4.70 | 0.10 | 0.10 | 18.95 | cold pool host |
| warm_sleep_2s | pass | 0.27 | 4.69 | 0.10 | 2.11 | 8.30 | reused pool host, 2s workload |
| concurrent_print_0 | pass | 0.41 | 7.08 | 0.14 | 0.13 | 10.42 | 2 concurrent service starts |
| concurrent_print_1 | pass | 0.89 | 6.86 | 0.16 | 0.15 | 10.47 | 2 concurrent service starts |
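A quick way to read the table is to compare the per-stage columns against `total_s`. The sketch below copies the figures from the table above; the helper names are made up for this example, and the table does not say what the leftover time covers, so it is only labeled as unattributed here.

```python
# Values copied from the benchmark table above (seconds).
ROWS = {
    "cold_print": dict(start=13.00, ready=4.70, reset=0.10, step=0.10, total=18.95),
    "warm_sleep_2s": dict(start=0.27, ready=4.69, reset=0.10, step=2.11, total=8.30),
    "concurrent_print_0": dict(start=0.41, ready=7.08, reset=0.14, step=0.13, total=10.42),
    "concurrent_print_1": dict(start=0.89, ready=6.86, reset=0.16, step=0.15, total=10.47),
}


def staged_seconds(row: dict) -> float:
    """Sum of the four stage columns for one benchmark case."""
    return round(row["start"] + row["ready"] + row["reset"] + row["step"], 2)


def unattributed_seconds(row: dict) -> float:
    """Part of total_s that none of the four stage columns accounts for."""
    return round(row["total"] - staged_seconds(row), 2)


def cold_start_penalty() -> float:
    """Extra start_s paid on a cold pool host versus a reused one."""
    return round(ROWS["cold_print"]["start"] - ROWS["warm_sleep_2s"]["start"], 2)


for name, row in ROWS.items():
    print(f"{name}: staged={staged_seconds(row):.2f}s "
          f"unattributed={unattributed_seconds(row):.2f}s")
print(f"cold start penalty: {cold_start_penalty():.2f}s")
```

On these numbers the cold pool host adds 12.73s to `start_s`, while `ready_s` is almost identical between the cold and warm single-service cases (4.70s vs 4.69s).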
BookFlight LFM2.5 training run
Ran the updated gist on this PR branch at https://gist.github.com/burtenshaw/3056a37728bc5d9a094db24cbacca96e
Result: reward increased on MiniWoB
Final successful action sequence:
Evidence is in the gist
PR-side fixes pushed in
HF-side note from the run: Chromium requires a much higher sandbox

This PR adds a minimal Hugging Face sandbox-backed provider for OpenEnv environment servers.
Note
Medium Risk
Touches connection lifecycle and adds a local proxy that forwards credentials to remote HF endpoints; mistakes could leak tokens or leave jobs running, but scope is additive with explicit token requirements.
Overview
Adds `HFSandboxProvider`, which launches an OpenEnv server image as a Hugging Face job (`run_job`), waits for the exposed URL, and returns a local HTTP/WebSocket proxy that injects the HF bearer token so existing clients can talk to the remote server without custom auth.
`EnvClient` now accepts an optional `base_url` when a `provider` is supplied: on `connect()`, it can start the provider, wait for readiness, and set the WebSocket URL. Failures during startup trigger cleanup via `close()`, and provider-started sessions reset `_ws_url` on close so a reconnect can start a fresh job.
Dependency `huggingface_hub` is bumped to `>=1.20.1` for the jobs API. New examples smoke-test `coding_env` through the provider and run TRL GRPO against BrowserGym hosted on HF Sandbox.
Reviewed by Cursor Bugbot for commit 1be2ce8. Bugbot is set up for automated code reviews on this repo.
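The connect/close lifecycle described in the overview can be sketched roughly as follows. This is a toy model under stated assumptions: `FakeProvider` and `SketchClient` are hypothetical stand-ins, not the PR's `HFSandboxProvider` or `EnvClient`, and the method names on the provider are invented for the example.

```python
from typing import Optional


class FakeProvider:
    """Stand-in provider (hypothetical); pretends to launch a remote job."""

    def __init__(self, fail_ready: bool = False) -> None:
        self.fail_ready = fail_ready
        self.started = 0
        self.stopped = 0

    def start(self) -> str:
        self.started += 1
        return f"http://127.0.0.1:9000/job-{self.started}"

    def wait_for_ready(self, base_url: str) -> None:
        if self.fail_ready:
            raise TimeoutError("server never became ready")

    def stop(self) -> None:
        self.stopped += 1


class SketchClient:
    """Toy client: base_url is optional when a provider is supplied."""

    def __init__(self, base_url: Optional[str] = None,
                 provider: Optional[FakeProvider] = None) -> None:
        if base_url is None and provider is None:
            raise ValueError("need base_url or provider")
        self._provider = provider
        self._provider_started = False
        # "http" -> "ws" also turns "https" into "wss".
        self._ws_url = base_url.replace("http", "ws", 1) if base_url else None

    def connect(self) -> str:
        if self._ws_url is None:
            try:
                # Mark first so a half-started job is still cleaned up.
                self._provider_started = True
                base_url = self._provider.start()
                self._provider.wait_for_ready(base_url)
                self._ws_url = base_url.replace("http", "ws", 1)
            except Exception:
                self.close()  # startup failure triggers cleanup
                raise
        return self._ws_url

    def close(self) -> None:
        if self._provider_started:
            self._provider.stop()
            self._provider_started = False
            self._ws_url = None  # reset so a reconnect starts a fresh job
```

In this model a client built with an explicit `base_url` never touches the provider, and only provider-started sessions clear their URL on close.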