Add HF sandbox provider #841

Open
burtenshaw wants to merge 23 commits into huggingface:main from burtenshaw:ben/hf-sandbox-provider

Conversation

@burtenshaw

@burtenshaw burtenshaw commented Jun 22, 2026

Collaborator

This PR adds a minimal Hugging Face sandbox-backed provider for OpenEnv environment servers.


Note

Medium Risk
Touches connection lifecycle and adds a local proxy that forwards credentials to remote HF endpoints; mistakes could leak tokens or leave jobs running, but scope is additive with explicit token requirements.

Overview
Adds HFSandboxProvider, which launches an OpenEnv server image as a Hugging Face job (run_job), waits for the exposed URL, and returns a local HTTP/WebSocket proxy that injects the HF bearer token so existing clients can talk to the remote server without custom auth.
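
The token-injection step of such a proxy can be sketched as follows. This is a minimal illustration, not the PR's code: `build_upstream_headers` and `HOP_BY_HOP` are hypothetical names, and dropping the client's own `Host` and `Authorization` headers is an assumption about what a careful proxy would do.

```python
# Hypothetical helper: the PR's actual proxy code is not shown in this thread.
# Headers that apply to a single connection and should not be forwarded.
HOP_BY_HOP = frozenset({
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade",
})


def build_upstream_headers(incoming: dict[str, str], token: str) -> dict[str, str]:
    """Copy client headers, drop per-connection ones, and attach the bearer token."""
    if not token:
        # Fail loudly rather than sending an unauthenticated request upstream.
        raise ValueError("an HF token is required to reach the job endpoint")
    skipped = HOP_BY_HOP | {"host", "authorization"}
    headers = {
        name: value
        for name, value in incoming.items()
        if name.lower() not in skipped
    }
    headers["Authorization"] = f"Bearer {token}"
    return headers
```

Because the token is attached only on the proxy side, the existing client never has to see or store it.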

EnvClient now accepts an optional base_url when a provider is supplied: on connect(), it can start the provider, wait for readiness, and set the WebSocket URL. Failures during startup trigger cleanup via close(), and provider-started sessions reset _ws_url on close so a reconnect can start a fresh job.
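
The connect/close lifecycle described above might look roughly like this. It is a sketch under stated assumptions: `SketchClient` and `FakeProvider` are stand-ins, and the provider method names (`start`, `wait_for_ready`, `stop`) and the `/ws` path are hypothetical rather than the real EnvClient or provider API.

```python
class FakeProvider:
    """Stand-in provider; method names are assumptions for illustration."""

    def __init__(self, fail_ready: bool = False):
        self.fail_ready = fail_ready
        self.started = 0
        self.stopped = 0

    def start(self) -> str:
        self.started += 1
        return "http://127.0.0.1:8000"

    def wait_for_ready(self) -> None:
        if self.fail_ready:
            raise TimeoutError("server never became ready")

    def stop(self) -> None:
        self.stopped += 1


def _to_ws(base_url: str) -> str:
    # http -> ws, https -> wss; the "/ws" suffix is an assumed route.
    return base_url.replace("http", "ws", 1) + "/ws"


class SketchClient:
    def __init__(self, base_url=None, provider=None):
        if base_url is None and provider is None:
            raise ValueError("pass base_url or provider")
        self._provider = provider
        self._provider_started = False
        self._ws_url = _to_ws(base_url) if base_url else None

    def connect(self) -> str:
        if self._ws_url is None:
            try:
                # Mark as started first so a partial launch is still cleaned up.
                self._provider_started = True
                base_url = self._provider.start()
                self._provider.wait_for_ready()
                self._ws_url = _to_ws(base_url)
            except Exception:
                self.close()
                raise
        return self._ws_url

    def close(self) -> None:
        if self._provider_started:
            self._provider.stop()
            self._provider_started = False
            # Reset so the next connect() launches a fresh job.
            self._ws_url = None
```

Resetting the URL only for provider-started sessions keeps a caller-supplied `base_url` stable across reconnects.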

Dependency huggingface_hub is bumped to >=1.20.1 for the jobs API. New examples smoke-test coding_env through the provider and run TRL GRPO against BrowserGym hosted on HF Sandbox.

Reviewed by Cursor Bugbot for commit 1be2ce8.

@burtenshaw burtenshaw requested a review from adithya-s-k June 24, 2026 10:15
@burtenshaw burtenshaw marked this pull request as ready for review June 24, 2026 10:15
@burtenshaw

Collaborator Author

HF sandbox benchmark

Run id: 20260624T115124Z
Namespace: burtenshaw
Image/flavor: python:3.12 / cpu-basic
Workload: each job allocates and touches the requested memory, then holds it while looping/sleeping for the requested duration.
Wall/throughput use the conservative group wall, falling back to HF backend total_secs when it is larger than the local timer.
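
A workload of the kind described could be approximated as below. This is a guess at the shape of the benchmark job, not the script that was actually run; `hold_memory` and the 4096-byte page size are assumptions.

```python
import time

PAGE_SIZE = 4096  # assumed page size; real systems may differ


def hold_memory(size_mb: int, duration_s: float) -> int:
    """Allocate size_mb MiB, touch every page, then hold it for duration_s."""
    buf = bytearray(size_mb * 1024 * 1024)
    # Write one byte per page so the memory is actually resident,
    # not just reserved.
    for offset in range(0, len(buf), PAGE_SIZE):
        buf[offset] = 1
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        remaining = deadline - time.monotonic()
        time.sleep(min(0.5, max(0.0, remaining)))
    return len(buf)
```

Touching each page matters because a zero-filled allocation may not count toward RSS until it is written.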

Throughput by workload

| workload | concurrency | success | wall | scheduling p50/p95 | running p50/p95 | total p50/p95 | throughput | avg peak RSS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| short (10s, 128MB) | 1 | 1/1 | 17.1s | 4.0s/4.0s | 11.0s/11.0s | 15.0s/15.0s | 3.50 jobs/min | 142.1 MB |
| short (10s, 128MB) | 2 | 2/2 | 17.0s | 4.0s/4.0s | 11.0s/11.0s | 16.0s/16.0s | 7.05 jobs/min | 142.2 MB |
| short (10s, 128MB) | 4 | 4/4 | 17.0s | 4.0s/5.0s | 11.0s/11.0s | 15.5s/16.0s | 14.12 jobs/min | 142.1 MB |
| medium (45s, 512MB) | 1 | 1/1 | 53.1s | 5.0s/5.0s | 46.0s/46.0s | 51.0s/51.0s | 1.13 jobs/min | 527.7 MB |
| medium (45s, 512MB) | 2 | 2/2 | 53.1s | 4.5s/5.0s | 46.0s/46.0s | 50.5s/51.0s | 2.26 jobs/min | 527.5 MB |
| medium (45s, 512MB) | 4 | 4/4 | 53.3s | 4.0s/5.0s | 46.0s/46.0s | 51.0s/51.0s | 4.51 jobs/min | 527.7 MB |
| long (120s, 1024MB) | 1 | 1/1 | 126.0s | 4.0s/4.0s | 121.0s/121.0s | 126.0s/126.0s | 0.48 jobs/min | 1041.5 MB |
| long (120s, 1024MB) | 2 | 2/2 | 128.8s | 4.5s/5.0s | 121.0s/121.0s | 126.5s/127.0s | 0.93 jobs/min | 1041.6 MB |
| long (120s, 1024MB) | 4 | 4/4 | 136.0s | 8.0s/11.0s | 121.0s/121.0s | 129.5s/133.0s | 1.77 jobs/min | 1041.6 MB |

Stage timing

| workload | concurrency | first running p50 | all done | alloc mean |
| --- | --- | --- | --- | --- |
| short (10s, 128MB) | 1 | 5.3s | 17.1s | 0.1s |
| short (10s, 128MB) | 2 | 6.3s | 17.0s | 0.1s |
| short (10s, 128MB) | 4 | 7.4s | 17.0s | 0.1s |
| medium (45s, 512MB) | 1 | 7.3s | 53.1s | 0.2s |
| medium (45s, 512MB) | 2 | 6.2s | 53.1s | 0.2s |
| medium (45s, 512MB) | 4 | 7.5s | 53.3s | 0.2s |
| long (120s, 1024MB) | 1 | 7.3s | 126.0s | 0.5s |
| long (120s, 1024MB) | 2 | 6.3s | 128.8s | 0.4s |
| long (120s, 1024MB) | 4 | 9.9s | 136.0s | 0.4s |

All benchmark jobs completed successfully.
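
The throughput figures are consistent with jobs per minute over the group wall time. A small sketch of that arithmetic, assuming the "conservative" wall is simply the larger of the local timer and the backend `total_secs` (the function names here are illustrative, not taken from the benchmark script):

```python
def conservative_wall(local_wall_s: float, backend_total_s: float) -> float:
    """Use whichever clock reports the longer duration."""
    return max(local_wall_s, backend_total_s)


def throughput_jobs_per_min(succeeded: int, wall_s: float) -> float:
    """Completed jobs per minute over the group wall time."""
    if wall_s <= 0:
        raise ValueError("wall time must be positive")
    return succeeded * 60.0 / wall_s


# short workload at concurrency 4: 4 jobs in a 17.0s group wall
print(round(throughput_jobs_per_min(4, 17.0), 2))  # 14.12
```

Other rows can differ from this calculation in the last digit, presumably because the reported wall times are themselves rounded.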

@sergiopaniego sergiopaniego left a comment

Member

Thanks for the first iteration!! Tested locally and it works nicely. Looking forward to this integration :) Some AI-assisted review below, with some ideas for when this is scaled.


I left inline comments (with suggested changes) on the specific spots. The framing I'd suggest: the correctness items and repo-convention items should land in this PR, while the scale-hardening items below are fine as documented limitations plus follow-up issues, as long as the design doesn't close the door on them.

Why the correctness items matter here specifically: this provider will be used to run many environments in parallel for RL training, where a single env that fails silently is worse than one that fails loudly. It either wastes wall-clock (the whole batch waits on a straggler) or feeds garbage observations and rewards into the policy.

Scale hardening (follow-up OK, but worth flagging)

  • Configurable startup timeouts. The 120s budgets are hardcoded. Heavy images and GPU flavors can take much longer to schedule, so these should be __init__ params.
  • Retry/backoff on run_job. At high concurrency the Jobs API will rate-limit or return transient init errors. A bounded retry would avoid losing envs to a single blip.
  • Reuse a shared httpx.AsyncClient in the proxy. A new client (and connection pool) is created per request. With many tool-call steps across many envs this is measurable hot-path overhead.
  • Per-env proxy footprint. Each provider instance starts its own uvicorn server, thread, and event loop. At high env concurrency that's a lot of local servers on the trainer host. Worth documenting as a known limit, or considering a single shared proxy that multiplexes upstreams.
  • Observability. The provider logs nothing. With many concurrent envs, structured logging of job id, stage, and failure cause is what makes a failing env debuggable.
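
As one possible shape for the bounded retry suggested above, here is a generic sketch. It is not tied to the Jobs API: `call_with_retry` is a hypothetical helper, and which exceptions count as transient is left to the caller as an assumption.

```python
import random
import time


def call_with_retry(
    fn,
    *,
    attempts: int = 4,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    retry_on: tuple = (Exception,),
    sleep=time.sleep,
):
    """Call fn(), retrying on the given exceptions with jittered exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # budget exhausted: fail loudly
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Jitter spreads out retries when many envs start at once.
            sleep(delay * random.uniform(0.5, 1.0))
```

The job launch could then be wrapped in a zero-argument callable and passed as `fn`, with the attempt count and delays exposed as `__init__` parameters alongside the startup timeouts.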

Tests & docs

  • No tests yet. The pure helpers (_job_port_url, _to_ws_url, _find_available_port, hop-by-hop filtering) plus wait_for_ready are all testable with no network, matching the existing provider test suite under tests/test_core/.
  • Docstrings. Public methods need the HF doc-builder docstring format used by the other providers.
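
To illustrate how network-free tests for such helpers could look, here is a sketch. The two helpers below are stand-in implementations written for this example (the PR's private `_to_ws_url` and `_find_available_port` may behave differently), and the `/ws` path is an assumption.

```python
import socket
from urllib.parse import urlsplit, urlunsplit


def to_ws_url(base_url: str, path: str = "/ws") -> str:
    """Map an http(s) base URL to the matching ws(s) URL (stand-in helper)."""
    parts = urlsplit(base_url)
    scheme = {"http": "ws", "https": "wss"}.get(parts.scheme)
    if scheme is None:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return urlunsplit((scheme, parts.netloc, parts.path.rstrip("/") + path, "", ""))


def find_available_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a free TCP port by binding to port 0 (stand-in helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def test_to_ws_url():
    assert to_ws_url("http://localhost:8000") == "ws://localhost:8000/ws"
    assert to_ws_url("https://example.hf.space/") == "wss://example.hf.space/ws"


def test_find_available_port():
    port = find_available_port()
    assert 0 < port < 65536
```

Both tests run without any HF credentials or network access beyond the loopback interface.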

@sergiopaniego sergiopaniego left a comment

Member

I've created a PR to make the container docs consistent (#868). It'd be cool to align the documentation in this PR with that one 😄

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 5 potential issues.

@burtenshaw

burtenshaw commented Jun 29, 2026

Collaborator Author

Did a full-fat implementation here: book_flight_lfm25_hf_sandbox.py

examples/browsergym_hf_sandbox.py is the minimal version: it runs hf.co/spaces/openenv/browsergym_env through HFSandboxProvider, resets click-test, clicks the observed button bid, and asserts reward 1.0.

BookFlight experiment results:

| Item | Value |
| --- | --- |
| Task | MiniWoB book-flight |
| Model | LiquidAI/LFM2.5-230M |
| Sandbox image | hf.co/spaces/openenv/browsergym_env |
| Sandbox flavor | cpu-basic |
| Train seed | 0 |
| Goal | Book the shortest one-way flight from: Corpus Christi, TX to: SHG on 12/05/2016. |
| Oracle trajectory length | 8 BrowserGym actions |
| Baseline reward | 0.0 |
| Final reward | 1.0 |
| Improvement point | Step 20 |
| LoRA | r=16, alpha=32, dropout=0.0, target_modules="all-linear" |
| Optimizer | AdamW, LR 2e-4 |
| Batch | 8 examples, sequence length 1312, supervised tokens 103 |
| Trainable params | 3,883,008 / 233,576,192 total |
| Generation | max_new_tokens=64, greedy decoding |

Training loss:

| Step | Loss |
| --- | --- |
| 1 | 1.5559734106063843 |
| 10 | 0.01412913203239441 |
| 20 | 0.0005893930210731924 |

@burtenshaw

Collaborator Author

Update: restored the in-repo BrowserGym example to the TRL GRPO training example from commit 1be2ce85 at examples/browsergym_trl_hf_sandbox.py. The BookFlight gist and experiment-results comment remain linked above as the longer out-of-repo experiment artifact.

@burtenshaw

Collaborator Author

Moved the reusable EnvClient.new_session() implementation to #880 (commit ac5465a4). This PR now just consumes it in the BrowserGym TRL example, so #880 should land first.

@burtenshaw

burtenshaw commented Jun 29, 2026

Collaborator Author

HF sandbox provider benchmark + example runs

Ran against this OpenEnv branch at c0c8b383, the draft Hub client branch ben/sandbox-pool-serve (525f6c5), and a temporary sandbox-server@port-proxy-poc binary staged under my namespace for the run. The temporary binary is not part of this PR; it is only needed because the current central sandbox-server binary does not expose the port proxy route yet.

Provider benchmark (hf.co/spaces/openenv/coding_env, cpu-basic)

| case | result | start_s | ready_s | reset_s | step_s | total_s | note |
| --- | --- | --- | --- | --- | --- | --- | --- |
| cold_print | pass | 13.00 | 4.70 | 0.10 | 0.10 | 18.95 | cold pool host |
| warm_sleep_2s | pass | 0.27 | 4.69 | 0.10 | 2.11 | 8.30 | reused pool host, 2s workload |
| concurrent_print_0 | pass | 0.41 | 7.08 | 0.14 | 0.13 | 10.42 | 2 concurrent service starts |
| concurrent_print_1 | pass | 0.89 | 6.86 | 0.16 | 0.15 | 10.47 | 2 concurrent service starts |

@burtenshaw

Collaborator Author

BookFlight LFM2.5 training run

Ran the updated gist on this PR branch at 6e689f8f:

https://gist.github.com/burtenshaw/3056a37728bc5d9a094db24cbacca96e

Result: reward increased on MiniWoB book-flight seed 0.

| phase | reward | done | steps |
| --- | --- | --- | --- |
| baseline | 0.0 | false | 1 |
| final eval, step 20 | 1.0 | true | 8 |

Final successful action sequence:

fill('19', 'Corpus Christi, TX')
click('34')
fill('21', 'SHG')
click('38')
click('25')
click('83')
click('27')
click('189')

Evidence is in the gist run_summary.json and local run output at outputs/book-flight-lfm25-230m-2026-06-29_18-40-18/run_summary.json.

PR-side fixes pushed in 6e689f8f:

  • moved the MiniWoB bundle in the BrowserGym image from /app to /usr/local/share/miniwob-plusplus, because sandboxed processes can stat /app but cannot read file contents there;
  • updated the BrowserGym TRL example MINIWOB_URL to the new readable path.
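
The "can stat but cannot read" symptom is easy to probe from inside a sandboxed process. A small diagnostic sketch (the helper name `probe_file` is made up for this example):

```python
import os


def probe_file(path: str) -> tuple[bool, bool]:
    """Return (can_stat, can_read) for a file path."""
    try:
        os.stat(path)
        can_stat = True
    except OSError:
        can_stat = False
    try:
        # Reading one byte is enough to tell metadata access from content access.
        with open(path, "rb") as fh:
            fh.read(1)
        can_read = True
    except OSError:
        can_read = False
    return can_stat, can_read
```

A result of `(True, False)` for a file under the bundle directory would match the behavior described for /app.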

HF-side note from the run: Chromium requires a much higher sandbox RLIMIT_AS than the current 2 GiB default. The gist currently patches the draft Hub client create request with max_mem_mb=1048576 and uses the staged sandbox-server@port-proxy-poc binary. Without that, Chromium crashes with partition_address_space.cc(77) before BrowserGym can reset.
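
For anyone reproducing this, the address-space limit a process actually sees can be read with the standard-library `resource` module (Unix only). The sketch below also converts the `max_mem_mb` value, on the assumption that the field counts MiB; the helper names are illustrative.

```python
import resource


def address_space_limit_gib():
    """Return (soft, hard) RLIMIT_AS in GiB, or None where unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)

    def to_gib(value):
        return None if value == resource.RLIM_INFINITY else value / 2**30

    return to_gib(soft), to_gib(hard)


def mb_to_gib(mb: int) -> float:
    # Assumes "MB" here means MiB (1024 * 1024 bytes).
    return mb / 1024


print(address_space_limit_gib())
print(mb_to_gib(1048576))  # 1024.0
```

Under that assumption the patched value corresponds to 1024 GiB, far above the 2 GiB default mentioned above.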


2 participants