fix(uv-provider): clone git+ project paths before uv run --project by acharyaanusha · Pull Request #854 · huggingface/OpenEnv

acharyaanusha · 2026-06-24T01:08:07Z

Summary

Two related framework bugs found while building a real training example against a deployed Space (#853).

1. `UVProvider`/`from_env(use_docker=False)` never actually clones the git URL

EnvClient.from_env(repo_id, use_docker=False) is documented to "clone the space and run it locally with uv," and builds a project_path of git+https://huggingface.co/spaces/{repo_id} for that purpose. But UVProvider:

Ran that string through os.path.abspath(), which mangles it (e.g. collapses https:// to https:/ and prefixes the cwd).
Passed the result straight to uv run --project <path>, which only discovers a project in a local directory — it has no support for remote git sources.

So the "clones the space" behavior never happened: reset()/step() would hang until the 60s readiness timeout.

Fix: UVProvider now detects a git+-prefixed project_path, leaves it untouched in __init__ (no abspath mangling), and clones it to a temp directory in start() via git clone --depth 1. The temp dir is removed in stop(). Also closes three follow-up leaks found in review: a temp-dir/process leak if start()/wait_for_ready() fails before the EnvClient exists to clean up, no timeout on the git clone subprocess (a hung remote blocked start() forever), and a leak if the spawned process died on its own and start() was called again.

2. `connect()` no-ops onto a websocket bound to a dead event loop

EnvClient.connect() has a guard — if self._ws is not None: return self — that ignores which event loop established that websocket. This breaks the officially documented pattern (docs/source/guides/async-sync.md):

client = await SomeClient.from_env(...)   # connects inside this loop
with client.sync() as sync_client:        # drives calls on a NEW, separate
    sync_client.reset()                   # background loop

from_env() ends with await client.connect(), binding _ws to whichever loop ran that await (e.g. an asyncio.run() call, closed by the time from_env() returns). .sync() then drives every later call through SyncEnvClient's own dedicated background-thread loop. The no-op guard means the websocket never gets rebound — every reset()/step() schedules work on a live loop while operating on a connection tied to a dead one, causing a hang or an asyncio "Future attached to a different loop" error. Not specific to any one example — any from_env() + .sync() caller hits this.

Fix: EnvClient now tracks which loop created _ws (self._ws_loop). connect() only treats an existing _ws as reusable if the current running loop matches; otherwise it drops the stale reference and connects fresh.

Known limitation, found while testing this against #853: the fix reconnects correctly, but it can't cleanly close the old connection first — its event loop is already gone, so there's no way to send a proper close frame on it. For environments that allow concurrent sessions this is harmless (just one extra idle connection on the server until it times out). But an environment with SUPPORTS_CONCURRENT_SESSIONS = False (single session, e.g. sophistry_bench_sprint_env) will reject the new (real) connection with a capacity error, because the abandoned old one still occupies the session slot from the server's point of view. #853's training script therefore still avoids from_env() + .sync() for that specific env and connects via UVProvider directly instead — this fix narrows the bug's blast radius (no more hang/crash for the common case) but doesn't make the pattern universally safe for single-session environments. Flagging this explicitly rather than overclaiming; a full fix would need a way to force-close a foreign-loop-bound websocket, which isn't straightforward with the websockets library's public API.

Test plan

tests/test_core/test_uv_provider.py: TestGitProjectPath (8 cases) covering the clone path, the timeout, and both leak scenarios.
tests/test_core/test_generic_client.py: TestForeignLoopReconnect (3 cases) covering same-loop no-op, foreign-loop reconnect, and disconnect() clearing loop state.
pytest tests/test_core/ -q — 198 passed, 11 pre-existing skips.
ruff format --check / ruff check clean.
Reproduced both original bugs and verified the fixes end-to-end against a real deployed Space (openenv-community/sophistry_bench_sprint_env) — before the fix, from_env(..., use_docker=False) times out after 60s; after, it clones, starts the server, and a real reset()/step() round-trip succeeds.
Also reproduced the single-session limitation above end-to-end (got a real CAPACITY_REACHED error), confirming the caveat is accurate rather than theoretical.

🤖 Generated with Claude Code

Note

Medium Risk
Touches core client connection lifecycle and subprocess/git startup for from_env UV mode; behavior changes are well-tested but single-session envs may still see capacity errors when an old loop-bound connection cannot be closed cleanly.

Overview
Fixes from_env(..., use_docker=False) so Hugging Face Spaces actually run locally: UVProvider now treats git+<url> project_path as a clone spec (no abspath mangling), runs git clone --depth 1 into a temp dir on start(), passes that path to uv run --project, and removes clones on stop() and on clone/Popen/restart failure paths, with a bounded clone timeout. EnvClient.from_env calls provider.stop() if start/connect fails before a client exists.

Fixes from_env() + .sync(): EnvClient records which asyncio loop owns the WebSocket (_ws_loop). connect() only skips reconnect when the current loop matches; otherwise it drops the stale socket and reconnects. _ensure_connected() always goes through connect() so lazy calls like sync().reset() still rebind after a closed loop.

New unit tests cover git clone/cleanup and foreign-loop reconnect behavior.

^{Reviewed by Cursor Bugbot for commit ca4fddd. Bugbot is set up for automated code reviews on this repo. Configure here.}

EnvClient.from_env(repo_id, use_docker=False) builds a project_path of git+https://huggingface.co/spaces/{repo_id} and hands it straight to UVProvider, which passed it through os.path.abspath() (mangling the URL, e.g. collapsing "https://" to "https:/") and then to `uv run --project`, which only discovers a project in a local directory -- it has no support for remote git sources at all. The documented "clones the space and runs locally" behavior never actually happened; reset()/step() would hang until the 60s readiness timeout. Verified by reproducing against a real deployed Space (openenv-community/sophistry_bench_sprint_env) before and after. Fix: UVProvider now detects a git+ project_path, leaves it untouched until start(), clones it to a temp dir there, and passes the local clone path to uv run. The temp dir is removed in stop().

acharyaanusha · 2026-06-24T14:39:11Z

@burtenshaw Updated this PR — it now covers a second, related framework bug found in the same training example: EnvClient.connect() ignores which event loop a websocket was created on, breaking from_env() + .sync() generally (not specific to one example). Folded that fix in here too (previously a separate PR, #860, now closed) since both are framework-level fixes found while building #853. Both are tested and verified end-to-end against a real deployed Space.

Three follow-up fixes from review of the git-clone fix: 1. provider.start()/wait_for_ready() in EnvClient.from_env()'s UV branch run before the EnvClient exists, so a failure there left no object to call stop() on -- the cloned temp dir (and process) leaked permanently on every such failure. Now wrapped so any exception during start/wait_for_ready/ connect calls provider.stop() before re-raising. 2. _clone_git_project had no timeout, so a hung/slow git remote blocked start() (and the caller's readiness wait) indefinitely. Now bounded by context_timeout_s, cleaning up the partial clone dir on timeout. 3. start()'s "already running" guard only checks process liveness, not whether a previous run's clone dir was ever cleaned up. If the spawned process died on its own and start() was called again, the new clone silently overwrote self._clone_dir, leaking the old directory. start() now clears any stale clone dir before cloning again, and also cleans up on a Popen failure right after a successful clone. Also fixes a docstring convention nit (double backticks where the file/ CLAUDE.md convention is single backticks) in the lines this PR touches.

…aanusha/OpenEnv into fix/uv-provider-git-clone

…loop EnvClient.connect() had a guard `if self._ws is not None: return self` that no-ops regardless of which event loop established that websocket. This breaks the officially documented pattern: client = await SomeClient.from_env(...) # connects inside this loop with client.sync() as sync_client: # drives calls on a NEW, sync_client.reset() # separate background loop `from_env()` ends with `await client.connect()`, binding `_ws` to whichever loop ran that await (e.g. an `asyncio.run()` call, which is closed by the time `from_env()` returns). `.sync()` then drives every later call through `SyncEnvClient`'s own dedicated background-thread loop via `run_coroutine_threadsafe`. `SyncEnvClient.connect()` does call `self._async.connect()`, but the no-op guard returns immediately since `_ws` is already set -- so the websocket never gets rebound to the loop that's actually being used, and every `reset()`/`step()` call schedules work on a live loop while operating on a connection object tied to a dead one. Found while building a training example that does exactly this (huggingface#853) -- not specific to that example; any `from_env()` + `.sync()` caller hits it. Fix: track which loop created `_ws` (`_ws_loop`). `connect()` only no-ops if the *current* running loop matches; otherwise it drops the stale reference (unusable anyway, since its loop is typically already closed) and connects fresh on the current loop. `disconnect()` clears `_ws_loop` alongside `_ws`. Added `TestForeignLoopReconnect` (3 cases): same-loop reconnect stays a no-op, a foreign-loop reconnect actually re-establishes the connection, and disconnect() clears the loop-tracking state.

From a self-review pass before requesting maintainer review on huggingface#853: - Validate --per-device-batch-size % --num-generations == 0 up front, before the ~180s env clone/start and dataset build -- previously this only surfaced as an opaque ValueError deep inside GRPOTrainer construction. - Extract completion-text parsing into _completion_text(), which now raises a clear error on an empty/malformed completion list instead of a bare IndexError/TypeError. - Assert completions and seed are the same length in the reward function, instead of letting zip() silently truncate and misalign reward<->task. - Write the components CSV under output_dir (which save_model() already guarantees exists) instead of a sibling path derived from --out's basename, which could fail if --out's parent directory doesn't exist. - Extract the CSV-writing block into write_metrics_csv(). Also tried switching make_sync_client() to the simpler from_env() + .sync() pattern, now that huggingface#854 fixes the event-loop mismatch that motivated building it manually in the first place -- and reverted. The fixed connect() does correctly reconnect on the new loop instead of hanging, but it can't cleanly close the *old* connection first (its event loop is already gone), so the old one is simply abandoned. That's harmless for envs that allow concurrent sessions, but this one doesn't (SUPPORTS_CONCURRENT_SESSIONS = False): the abandoned connection occupies the only session slot, and the real one fails with CAPACITY_REACHED. Confirmed by reproducing it locally. make_sync_client() avoids the problem by never creating that doomed first connection at all. Updated its docstring to explain both reasons.

@burtenshaw

@burtenshaw caught a gap in the previous fix: _ensure_connected() (called by _send()/reset()/step()) had its own guard -- `if self._ws is None: await self.connect()` -- that's independent of connect()'s loop-aware reconnect logic. Since _ws is NOT None in the foreign-loop case (it's set, just bound to a dead loop), this guard skipped connect() entirely, so the reconnect fix never ran for callers that never explicitly call .connect() themselves -- e.g. `client = await Client.from_env(...); client.sync().reset()`, which goes straight to _send() -> _ensure_connected(). Fix: _ensure_connected() now always delegates to connect(), which already knows how to no-op cheaply on the same loop or reconnect on a different one. Added test_lazy_send_reconnects_without_explicit_connect_call, exercising _ensure_connected() directly in the exact lazy-call pattern that was broken.

@burtenshaw

Addressing @burtenshaw's review on huggingface#853: - Drop training/hf_jobs_metrics.csv, training/metrics.csv, training/sophistry_bench_sprint.toml -- envs and examples should be decoupled; these were training artifacts, not env source. - Drop the custom metrics_log/components-CSV tracking in the example script entirely (reward_func just returns rewards now, like every other reward_funcs example in this repo) rather than wiring up trackio for a one-off script. - Inline make_sync_client() into main() -- it was only used once. - Cut the module docstring and inline comments down to the essentials; the full event-loop/single-session explanation lives in huggingface#854 and the CAPACITY_REACHED finding, not duplicated here in prose. - Condense the env README's "Training" section from two tables + ~40 lines of analysis to one paragraph; the full numbers are in the PR description. Re-verified end-to-end after the rewrite (4 episodes, 1 step): trains, saves a real checkpoint, no regressions.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

Want higher recall? High effort reviews run extra passes and find more bugs. A team admin can switch effort levels in the Cursor dashboard.

^{Reviewed by Cursor Bugbot for commit ca4fddd. Configure here.}

bot-ci-comment · 2026-06-29T10:35:42Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

burtenshaw

Nice work on this. Thanks.

@burtenshaw

* docs+examples(sophistry_bench_sprint_env): add training example and results Adds the prime-rl GRPO config and per-step metrics from a 100-step run against the deployed env, plus a README section showing the reward-hacking signature (aggregate_reward up, correctness_reward flat). Also adds a from-scratch TRL GRPOTrainer example for training against the Space directly, for anyone without Prime Intellect access. * examples(sophistry_bench_sprint_grpo): add --push-to-hub to publish checkpoint Closes the gap between local training output and an actual Hub artifact, matching the maintainer's "deployed to Hugging Face" ask for the training side, not just the Space. * fix(sophistry_bench_sprint_env): correct deployed Space repo id The env is actually hosted at openenv-community/sophistry_bench_sprint_env, not anushaacharya/sophistry_bench_sprint_env as originally documented in #787 -- verified via `hf spaces info`. * docs(sophistry_bench_sprint_env): lead with TRL example, demote Prime Intellect to a note Reframes the Training section so the TRL GRPOTrainer script (verified end-to-end against the deployed Space) is the primary documented path, matching this repo's own guidance that TRL is the recommended framework. The Prime Intellect run becomes supplementary evidence, not the headline. Also switches the script to GenericEnvClient + a directly-constructed UVProvider (avoiding a sync/async event-loop mismatch from mixing asyncio.run(from_env(...)) with .sync()), and bumps the default model to Qwen2.5-0.5B-Instruct for a cheaper, faster default run. * docs+examples(sophistry_bench_sprint_grpo): add real HF Jobs run results Runs the TRL GRPO example for real on Hugging Face Jobs (a10g-small, 100 steps, Qwen2.5-0.5B-Instruct) and documents the results honestly: the proxy reward (aggregate_reward) climbs and plateaus, confirming the example trains correctly end-to-end on HF infrastructure, but at this much smaller scale (~800 total rollouts vs. the Prime Intellect run's ~12,800) the policy collapses to near-empty completions rather than converging on the claim_count_cliff target -- a different reward-hacking shortcut, not a replication of the Prime Intellect run's specific curve. correctness_reward stays noisy/decoupled either way, which is the core finding both runs share. Also extends the reward_func to log per-step reward components (not just the scalar reward), since correctness_reward/n_claims live in observation["components"], which the trainer never needed but the README table does. Opts into SPRINT_EXPOSE_CORRECTNESS=1 for the locally-run clone (not the shared Space) since this is exactly the "trusted measurement code" use case the env's own README carves out -- never fed back into the prompt. Tuning notes from getting this to actually run without OOM on a10g-small: - per_device_train_batch_size is the *total* rollout count per step (must be divisible by num_generations), not unique-prompts * num_generations. - bf16 matters more than usual here: entropy/logprob computation materializes a [batch, completion_len, vocab_size] logits tensor, and a ~150K-token vocab (Qwen2.5) dominates memory at fp32. - gradient_checkpointing=True had no measurable effect in this setup (same OOM numbers with and without); reducing batch size was what actually fixed it. Left in since it's harmless, but don't rely on it alone. * fix(sophistry_bench_sprint_grpo): harden against review findings From a self-review pass before requesting maintainer review on #853: - Validate --per-device-batch-size % --num-generations == 0 up front, before the ~180s env clone/start and dataset build -- previously this only surfaced as an opaque ValueError deep inside GRPOTrainer construction. - Extract completion-text parsing into _completion_text(), which now raises a clear error on an empty/malformed completion list instead of a bare IndexError/TypeError. - Assert completions and seed are the same length in the reward function, instead of letting zip() silently truncate and misalign reward<->task. - Write the components CSV under output_dir (which save_model() already guarantees exists) instead of a sibling path derived from --out's basename, which could fail if --out's parent directory doesn't exist. - Extract the CSV-writing block into write_metrics_csv(). Also tried switching make_sync_client() to the simpler from_env() + .sync() pattern, now that #854 fixes the event-loop mismatch that motivated building it manually in the first place -- and reverted. The fixed connect() does correctly reconnect on the new loop instead of hanging, but it can't cleanly close the *old* connection first (its event loop is already gone), so the old one is simply abandoned. That's harmless for envs that allow concurrent sessions, but this one doesn't (SUPPORTS_CONCURRENT_SESSIONS = False): the abandoned connection occupies the only session slot, and the real one fails with CAPACITY_REACHED. Confirmed by reproducing it locally. make_sync_client() avoids the problem by never creating that doomed first connection at all. Updated its docstring to explain both reasons. * simplify(sophistry_bench_sprint): cut footprint per review Addressing @burtenshaw's review on #853: - Drop training/hf_jobs_metrics.csv, training/metrics.csv, training/sophistry_bench_sprint.toml -- envs and examples should be decoupled; these were training artifacts, not env source. - Drop the custom metrics_log/components-CSV tracking in the example script entirely (reward_func just returns rewards now, like every other reward_funcs example in this repo) rather than wiring up trackio for a one-off script. - Inline make_sync_client() into main() -- it was only used once. - Cut the module docstring and inline comments down to the essentials; the full event-loop/single-session explanation lives in #854 and the CAPACITY_REACHED finding, not duplicated here in prose. - Condense the env README's "Training" section from two tables + ~40 lines of analysis to one paragraph; the full numbers are in the PR description. Re-verified end-to-end after the rewrite (4 episodes, 1 step): trains, saves a real checkpoint, no regressions. --------- Co-authored-by: burtenshaw <ben.burtenshaw@gmail.com>

acharyaanusha added 2 commits June 23, 2026 18:07

Merge branch 'main' into fix/uv-provider-git-clone

490d258

acharyaanusha added 2 commits June 24, 2026 07:58

Merge branch 'fix/uv-provider-git-clone' of https://github.com/achary…

44cfc74

…aanusha/OpenEnv into fix/uv-provider-git-clone

acharyaanusha mentioned this pull request Jun 24, 2026

sophistry_bench_sprint_env: add training example and results #853

Merged

5 tasks

burtenshaw reviewed Jun 25, 2026

View reviewed changes

Comment thread src/openenv/core/env_client.py

Merge branch 'main' into fix/uv-provider-git-clone

ca4fddd

cursor Bot reviewed Jun 29, 2026

View reviewed changes

Comment thread src/openenv/core/env_client.py

Comment thread src/openenv/core/containers/runtime/uv_provider.py

fix: address uv client review comments

63fc059

burtenshaw approved these changes Jun 29, 2026

View reviewed changes

burtenshaw merged commit 7fe8a86 into huggingface:main Jun 29, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(uv-provider): clone git+ project paths before uv run --project#854

fix(uv-provider): clone git+ project paths before uv run --project#854
burtenshaw merged 8 commits into
huggingface:mainfrom
acharyaanusha:fix/uv-provider-git-clone

acharyaanusha commented Jun 24, 2026 •

edited by cursor Bot

Loading

Uh oh!

acharyaanusha commented Jun 24, 2026 •

edited

Loading

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

bot-ci-comment Bot commented Jun 29, 2026

Uh oh!

burtenshaw left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

acharyaanusha commented Jun 24, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. UVProvider/from_env(use_docker=False) never actually clones the git URL

2. connect() no-ops onto a websocket bound to a dead event loop

Test plan

Uh oh!

acharyaanusha commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

bot-ci-comment Bot commented Jun 29, 2026

Uh oh!

burtenshaw left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

acharyaanusha commented Jun 24, 2026 •

edited by cursor Bot

Loading

1. `UVProvider`/`from_env(use_docker=False)` never actually clones the git URL

2. `connect()` no-ops onto a websocket bound to a dead event loop

acharyaanusha commented Jun 24, 2026 •

edited

Loading