Add HF sandbox provider by burtenshaw · Pull Request #841 · huggingface/OpenEnv

burtenshaw · 2026-06-22T09:54:29Z

This PR adds a minimal Hugging Face sandbox-backed provider for OpenEnv environment servers.

Note

Medium Risk
Touches connection lifecycle and adds a local proxy that forwards credentials to remote HF endpoints; mistakes could leak tokens or leave jobs running, but scope is additive with explicit token requirements.

Overview
Adds HFSandboxProvider, which launches an OpenEnv server image as a Hugging Face job (run_job), waits for the exposed URL, and returns a local HTTP/WebSocket proxy that injects the HF bearer token so existing clients can talk to the remote server without custom auth.
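
To make the token-injection step concrete, here is a minimal sketch of the header handling such a proxy could perform before forwarding a request. It is not the code in this PR: `build_upstream_headers` and `HOP_BY_HOP` are hypothetical names, and dropping the client's `Host` and `Authorization` headers is an assumption about how the forwarding should behave.

```python
# Hypothetical sketch, not the PR's implementation.
# Classic hop-by-hop headers that a proxy should not forward upstream.
HOP_BY_HOP = frozenset({
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "trailers", "transfer-encoding", "upgrade",
})


def build_upstream_headers(incoming: dict[str, str], token: str) -> dict[str, str]:
    """Copy end-to-end headers and attach the HF bearer token."""
    if not token:
        # Fail loudly instead of forwarding an unauthenticated request.
        raise ValueError("an HF token is required")
    # Assumption: the client's Host and Authorization are replaced, not forwarded.
    skipped = HOP_BY_HOP | {"host", "authorization"}
    headers = {k: v for k, v in incoming.items() if k.lower() not in skipped}
    headers["Authorization"] = f"Bearer {token}"
    return headers
```

A fuller version would also drop any header named in the incoming `Connection` value, which this sketch ignores.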

EnvClient now accepts an optional base_url when a provider is supplied: on connect(), it can start the provider, wait for readiness, and set the WebSocket URL. Failures during startup trigger cleanup via close(), and provider-started sessions reset _ws_url on close so a reconnect can start a fresh job.
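
The lifecycle described above can be sketched with stand-in classes. This is an illustration under stated assumptions, not the actual `EnvClient`: `FakeProvider`, `SketchClient`, the provider method names, and the `/ws` path are all hypothetical.

```python
class FakeProvider:
    """Stand-in provider that only records lifecycle calls."""

    def __init__(self, fail_ready: bool = False) -> None:
        self.fail_ready = fail_ready
        self.starts = 0
        self.stops = 0

    def start(self) -> str:
        self.starts += 1
        return "http://127.0.0.1:8000"

    def wait_for_ready(self, base_url: str) -> None:
        if self.fail_ready:
            raise TimeoutError("server never became ready")

    def stop(self) -> None:
        self.stops += 1


class SketchClient:
    """Hypothetical client: starts the provider lazily on connect()."""

    def __init__(self, provider: FakeProvider) -> None:
        self._provider = provider
        self._ws_url: str | None = None
        self._provider_started = False

    def connect(self) -> str:
        if self._ws_url is None:
            try:
                # Mark as started first so a partial start is still cleaned up.
                self._provider_started = True
                base_url = self._provider.start()
                self._provider.wait_for_ready(base_url)
                self._ws_url = base_url.replace("http", "ws", 1) + "/ws"
            except Exception:
                self.close()  # do not leave a remote job running
                raise
        return self._ws_url

    def close(self) -> None:
        if self._provider_started:
            self._provider.stop()
            self._provider_started = False
            self._ws_url = None  # a later connect() starts a fresh job
```

Resetting `_ws_url` only for provider-started sessions keeps a caller-supplied URL intact across reconnects.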

Dependency huggingface_hub is bumped to >=1.20.1 for the jobs API. New examples smoke-test coding_env through the provider and run TRL GRPO against BrowserGym hosted on HF Sandbox.

Reviewed by Cursor Bugbot for commit 1be2ce8.

burtenshaw · 2026-06-24T12:03:58Z

HF sandbox benchmark

Run id: 20260624T115124Z
Namespace: burtenshaw
Image/flavor: python:3.12 / cpu-basic
Workload: each job allocates and touches the requested memory, then holds it while looping/sleeping for the requested duration.
Wall/throughput use the conservative group wall, falling back to HF backend total_secs when it is larger than the local timer.

Throughput by workload

| workload | concurrency | success | wall | scheduling p50/p95 | running p50/p95 | total p50/p95 | throughput | avg peak RSS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| short (10s, 128MB) | 1 | 1/1 | 17.1s | 4.0s/4.0s | 11.0s/11.0s | 15.0s/15.0s | 3.50 jobs/min | 142.1 MB |
| short (10s, 128MB) | 2 | 2/2 | 17.0s | 4.0s/4.0s | 11.0s/11.0s | 16.0s/16.0s | 7.05 jobs/min | 142.2 MB |
| short (10s, 128MB) | 4 | 4/4 | 17.0s | 4.0s/5.0s | 11.0s/11.0s | 15.5s/16.0s | 14.12 jobs/min | 142.1 MB |
| medium (45s, 512MB) | 1 | 1/1 | 53.1s | 5.0s/5.0s | 46.0s/46.0s | 51.0s/51.0s | 1.13 jobs/min | 527.7 MB |
| medium (45s, 512MB) | 2 | 2/2 | 53.1s | 4.5s/5.0s | 46.0s/46.0s | 50.5s/51.0s | 2.26 jobs/min | 527.5 MB |
| medium (45s, 512MB) | 4 | 4/4 | 53.3s | 4.0s/5.0s | 46.0s/46.0s | 51.0s/51.0s | 4.51 jobs/min | 527.7 MB |
| long (120s, 1024MB) | 1 | 1/1 | 126.0s | 4.0s/4.0s | 121.0s/121.0s | 126.0s/126.0s | 0.48 jobs/min | 1041.5 MB |
| long (120s, 1024MB) | 2 | 2/2 | 128.8s | 4.5s/5.0s | 121.0s/121.0s | 126.5s/127.0s | 0.93 jobs/min | 1041.6 MB |
| long (120s, 1024MB) | 4 | 4/4 | 136.0s | 8.0s/11.0s | 121.0s/121.0s | 129.5s/133.0s | 1.77 jobs/min | 1041.6 MB |
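
For readers who want to reproduce the throughput column, the sketch below shows one way the "conservative wall" rule could be computed. The function names are hypothetical, and taking the maximum over per-job backend totals is an assumption about how the group value is formed.

```python
def conservative_wall(local_wall_s: float, backend_total_s: list[float]) -> float:
    """Use the larger of the local timer and the slowest backend-reported total."""
    return max(local_wall_s, max(backend_total_s, default=0.0))


def jobs_per_minute(succeeded: int, wall_s: float) -> float:
    """Throughput of a group of jobs that shared one wall-clock window."""
    if wall_s <= 0:
        raise ValueError("wall_s must be positive")
    return succeeded * 60.0 / wall_s
```

As a check against the table, four short jobs over a 17.0s wall give `jobs_per_minute(4, 17.0)`, which rounds to 14.12 jobs/min.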

Stage timing

| workload | concurrency | first running p50 | all done | alloc mean |
| --- | --- | --- | --- | --- |
| short (10s, 128MB) | 1 | 5.3s | 17.1s | 0.1s |
| short (10s, 128MB) | 2 | 6.3s | 17.0s | 0.1s |
| short (10s, 128MB) | 4 | 7.4s | 17.0s | 0.1s |
| medium (45s, 512MB) | 1 | 7.3s | 53.1s | 0.2s |
| medium (45s, 512MB) | 2 | 6.2s | 53.1s | 0.2s |
| medium (45s, 512MB) | 4 | 7.5s | 53.3s | 0.2s |
| long (120s, 1024MB) | 1 | 7.3s | 126.0s | 0.5s |
| long (120s, 1024MB) | 2 | 6.3s | 128.8s | 0.4s |
| long (120s, 1024MB) | 4 | 9.9s | 136.0s | 0.4s |

All benchmark jobs completed successfully.

sergiopaniego

Thanks for the first iteration!! Tested locally and it works nicely. Looking forward to this integration :) Some AI-assisted review below, with some ideas for when this is scaled.

I left inline comments (with suggested changes) on the specific spots. The framing I'd suggest: the correctness items and repo-convention items should land in this PR, while the scale-hardening items below are fine as documented limitations plus follow-up issues, as long as the design doesn't close the door on them.

Why the correctness items matter here specifically: this provider will be used to run many environments in parallel for RL training, where a single env that fails silently is worse than one that fails loudly. It either wastes wall-clock (the whole batch waits on a straggler) or feeds garbage observations and rewards into the policy.

Scale hardening (follow-up OK, but worth flagging)

Configurable startup timeouts. The 120s budgets are hardcoded. Heavy images and GPU flavors can take much longer to schedule, so these should be __init__ params.
Retry/backoff on run_job. At high concurrency the Jobs API will rate-limit or return transient init errors. A bounded retry would avoid losing envs to a single blip.
Reuse a shared httpx.AsyncClient in the proxy. A new client (and connection pool) is created per request. With many tool-call steps across many envs this is measurable hot-path overhead.
Per-env proxy footprint. Each provider instance starts its own uvicorn server, thread, and event loop. At high env concurrency that's a lot of local servers on the trainer host. Worth documenting as a known limit, or considering a single shared proxy that multiplexes upstreams.
Observability. The provider logs nothing. With many concurrent envs, structured logging of job id, stage, and failure cause is what makes a failing env debuggable.
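
As an illustration of the retry/backoff item, a bounded retry with exponential backoff and jitter could look like the sketch below. It is a generic helper, not tied to the Jobs API: `retry_with_backoff` is a hypothetical name, and the exception types in `retry_on` are placeholders, since the errors actually raised on rate limits are not stated in this thread.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    *,
    attempts: int = 4,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or the attempt budget is spent."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # budget exhausted: surface the failure
            delay = min(max_delay_s, base_delay_s * 2**attempt)
            sleep(random.uniform(0, delay))  # jitter spreads out concurrent retries
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting, and the same pattern would let the 120s startup budgets become constructor arguments.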

Tests & docs

No tests yet. The pure helpers (_job_port_url, _to_ws_url, _find_available_port, hop-by-hop filtering) plus wait_for_ready are all testable with no network, matching the existing provider test suite under tests/test_core/.
Docstrings. Public methods need the HF doc-builder docstring format used by the other providers.
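
To show what network-free tests for helpers of this kind might look like, here is a small sketch. The functions below are stand-ins written for illustration; the real `_to_ws_url` and `_find_available_port` in the PR may have different signatures and behavior.

```python
import socket
from urllib.parse import urlsplit, urlunsplit


def to_ws_url(http_url: str) -> str:
    """Stand-in helper: map http(s) URLs to ws(s) URLs."""
    parts = urlsplit(http_url)
    scheme = {"http": "ws", "https": "wss"}.get(parts.scheme)
    if scheme is None:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return urlunsplit((scheme, parts.netloc, parts.path, parts.query, parts.fragment))


def find_available_port(host: str = "127.0.0.1") -> int:
    """Stand-in helper: ask the OS for a currently free TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose
        return sock.getsockname()[1]


def test_to_ws_url_maps_schemes() -> None:
    assert to_ws_url("https://example.invalid/env?x=1") == "wss://example.invalid/env?x=1"
    assert to_ws_url("http://127.0.0.1:8000") == "ws://127.0.0.1:8000"


def test_find_available_port_returns_valid_port() -> None:
    assert 0 < find_available_port() < 65536
```

Both tests run under pytest with no network access beyond binding a loopback socket.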

sergiopaniego

I've created a PR to make the container docs consistent (#868). It'd be cool to align the documentation in this PR with that one 😄

bot-ci-comment · 2026-06-26T13:52:11Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 5 potential issues.

burtenshaw · 2026-06-29T11:30:52Z

Did a full-fat implementation here: book_flight_lfm25_hf_sandbox.py

examples/browsergym_hf_sandbox.py is a minimal example: it runs hf.co/spaces/openenv/browsergym_env through HFSandboxProvider, resets click-test, clicks the observed button bid, and asserts a reward of 1.0.

BookFlight experiment results:

| Item | Value |
| --- | --- |
| Task | MiniWoB `book-flight` |
| Model | `LiquidAI/LFM2.5-230M` |
| Sandbox image | `hf.co/spaces/openenv/browsergym_env` |
| Sandbox flavor | `cpu-basic` |
| Train seed | `0` |
| Goal | `Book the shortest one-way flight from: Corpus Christi, TX to: SHG on 12/05/2016.` |
| Oracle trajectory length | `8` BrowserGym actions |
| Baseline reward | `0.0` |
| Final reward | `1.0` |
| Improvement point | Step `20` |
| LoRA | `r=16`, `alpha=32`, `dropout=0.0`, `target_modules="all-linear"` |
| Optimizer | AdamW, LR `2e-4` |
| Batch | `8` examples, sequence length `1312`, supervised tokens `103` |
| Trainable params | `3,883,008` / `233,576,192` total |
| Generation | `max_new_tokens=64`, greedy decoding |
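
For scale, the trainable-parameter row works out to well under 2% of the model. The snippet below is just that arithmetic, with a hypothetical helper name.

```python
def trainable_fraction(trainable: int, total: int) -> float:
    """Share of parameters that receive gradients."""
    if total <= 0:
        raise ValueError("total must be positive")
    return trainable / total


# Values taken from the results table above.
fraction = trainable_fraction(3_883_008, 233_576_192)
print(f"{fraction:.2%}")  # prints 1.66%
```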

Training loss:

| Step | Loss |
| --- | --- |
| 1 | `1.5559734106063843` |
| 10 | `0.01412913203239441` |
| 20 | `0.0005893930210731924` |

burtenshaw · 2026-06-29T11:36:55Z

Update: restored the in-repo BrowserGym example to the TRL GRPO training example from commit 1be2ce85 at examples/browsergym_trl_hf_sandbox.py. The BookFlight gist and experiment-results comment remain linked above as the longer out-of-repo experiment artifact.

burtenshaw · 2026-06-29T13:06:20Z

Moved the reusable EnvClient.new_session() implementation to #880 (commit ac5465a4). This PR now just consumes it in the BrowserGym TRL example, so #880 should land first.

burtenshaw · 2026-06-29T15:46:00Z

HF sandbox provider benchmark + example runs

Ran against this OpenEnv branch at c0c8b383, the draft Hub client branch ben/sandbox-pool-serve (525f6c5), and a temporary sandbox-server@port-proxy-poc binary staged under my namespace for the run. The temporary binary is not part of this PR; it is only needed because the current central sandbox-server binary does not expose the port proxy route yet.

Provider benchmark (`hf.co/spaces/openenv/coding_env`, `cpu-basic`)

| case | result | start_s | ready_s | reset_s | step_s | total_s | note |
| --- | --- | --- | --- | --- | --- | --- | --- |
| cold_print | pass | 13.00 | 4.70 | 0.10 | 0.10 | 18.95 | cold pool host |
| warm_sleep_2s | pass | 0.27 | 4.69 | 0.10 | 2.11 | 8.30 | reused pool host, 2s workload |
| concurrent_print_0 | pass | 0.41 | 7.08 | 0.14 | 0.13 | 10.42 | 2 concurrent service starts |
| concurrent_print_1 | pass | 0.89 | 6.86 | 0.16 | 0.15 | 10.47 | 2 concurrent service starts |

burtenshaw · 2026-06-29T16:46:47Z

BookFlight LFM2.5 training run

Ran the updated gist on this PR branch at 6e689f8f:

https://gist.github.com/burtenshaw/3056a37728bc5d9a094db24cbacca96e

Result: reward increased on MiniWoB book-flight seed 0.

| phase | reward | done | steps |
| --- | --- | --- | --- |
| baseline | 0.0 | false | 1 |
| final eval, step 20 | 1.0 | true | 8 |

Final successful action sequence:

fill('19', 'Corpus Christi, TX')
click('34')
fill('21', 'SHG')
click('38')
click('25')
click('83')
click('27')
click('189')

Evidence is in the gist run_summary.json and local run output at outputs/book-flight-lfm25-230m-2026-06-29_18-40-18/run_summary.json.

PR-side fixes pushed in 6e689f8f:

moved the MiniWoB bundle in the BrowserGym image from /app to /usr/local/share/miniwob-plusplus, because sandboxed processes can stat /app but cannot read file contents there;
updated the BrowserGym TRL example MINIWOB_URL to the new readable path.
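
The stat-versus-read distinction above is easy to miss because an existence check passes while the actual read fails. A minimal probe that separates the two cases might look like this; `probe_path` is a hypothetical helper, and it treats any `OSError` on open as "not readable".

```python
import os


def probe_path(path: str) -> tuple[bool, bool]:
    """Return (can_stat, can_read) for a file path."""
    try:
        os.stat(path)
    except OSError:
        return (False, False)
    try:
        # Opening and reading one byte exercises the read permission itself.
        with open(path, "rb") as fh:
            fh.read(1)
    except OSError:
        return (True, False)
    return (True, True)
```

A result of `(True, False)` for a bundled asset matches the symptom described for files under /app.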

HF-side note from the run: Chromium requires a much higher sandbox RLIMIT_AS than the current 2 GiB default. The gist currently patches the draft Hub client create request with max_mem_mb=1048576 and uses the staged sandbox-server@port-proxy-poc binary. Without that, Chromium crashes with partition_address_space.cc(77) before BrowserGym can reset.
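
To inspect the address-space limit from inside a sandboxed process, the standard-library `resource` module (Unix only) can be used as sketched below. The helper names are hypothetical, and reading `max_mem_mb` as mebibytes is an assumption; under that reading, 1048576 corresponds to 1 TiB.

```python
import resource

MIB = 1024 * 1024


def mem_mb_to_bytes(mem_mb: int) -> int:
    """Convert a size in MiB to bytes (assumes MB here means MiB)."""
    return mem_mb * MIB


def describe_address_space_limit() -> str:
    """Report the soft RLIMIT_AS of the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    if soft == resource.RLIM_INFINITY:
        return "RLIMIT_AS soft limit: unlimited"
    return f"RLIMIT_AS soft limit: {soft / 1024**3:.1f} GiB"
```

Printing `describe_address_space_limit()` before launching Chromium would confirm whether the patched limit actually reached the process.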

Test User and others added 5 commits June 22, 2026 11:54

feat: add hf sandbox provider

9bc3583

feat: add hf sandbox example

f90f6f9

fix: quiet hf sandbox example

a50fad9

refactor: simplify hf sandbox provider

dfb6db4

refactor: use coding env hf sandbox smoke

507682c

burtenshaw commented Jun 24, 2026

Comment thread src/openenv/core/containers/runtime/hf_sandbox_provider.py

burtenshaw requested a review from adithya-s-k June 24, 2026 10:15

burtenshaw marked this pull request as ready for review June 24, 2026 10:15

sergiopaniego reviewed Jun 24, 2026

sergiopaniego mentioned this pull request Jun 26, 2026

Docs: provider documentation is incomplete and inconsistent #867

Closed

9 tasks

Merge branch 'main' into ben/hf-sandbox-provider

80b50be

sergiopaniego reviewed Jun 26, 2026

burtenshaw added 5 commits June 29, 2026 10:27

feat: add browsergym hf sandbox example

88807b6

refactor: use hf sandbox context managers

a7c1828

refactor: let browsergym client close sandbox

9641a21

refactor: defer provider startup to client

95b60a5

refactor: keep sandbox config on provider

1be2ce8

cursor Bot reviewed Jun 29, 2026

refactor: update browsergym hf sandbox example

85f9f18

burtenshaw mentioned this pull request Jun 29, 2026

Provider-owned startup #880

Open

refactor: simplify browsergym hf sandbox example

1c34d3a

refactor: restore browsergym trl example

2d58727

burtenshaw added 4 commits June 29, 2026 13:43

refactor: simplify browsergym trl example

d525ba8

fix: harden hf sandbox provider

33d94e6

fix: add reusable env sessions

9f1a9e1

refactor: rely on reusable env sessions

7c4a467

sergiopaniego reviewed Jun 29, 2026

Comment thread examples/browsergym_trl_hf_sandbox.py

burtenshaw added 2 commits June 29, 2026 16:31

feat: use hf sandbox pools

a433b6a

Use HF sandbox pool port proxy

c0c8b38

Fix BrowserGym sandbox asset paths

6e689f8

burtenshaw added 2 commits June 30, 2026 10:59

Merge main into hf sandbox provider

5f6ba3f

Format HF sandbox provider

43e258f
