feat: LLM-synthesized hints for failed trajectories #1456
dzorlu wants to merge 91 commits into NovaSky-AI:main
Conversation
Add fleet_task environment that integrates Fleet-hosted tasks with SkyRL via OpenEnv's FleetTaskEnv abstraction layer. Supports multi-turn tool-use and computer-use (multimodal) modalities.
- FleetTaskEnv(BaseTextEnv): provisions Fleet env, multi-turn episodes, reward via verifier, partial reward support, hint augmentation
- Tool call parser: handles <tool_call>/<function_call> tag formats with JSON repair for missing closing braces
- Multimodal observations: returns image_url content blocks for CUA, compatible with upstream's extract_images_from_conversation()
- Per-env metrics aggregation with environment breakdown
- Context management integration for long trajectories
- Trace upload support for eval telemetry

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
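The "JSON repair for missing closing braces" step can be sketched roughly as below. The tag names come from the commit; the function names and the repair heuristic (append whatever closers are still open) are illustrative assumptions, not the actual implementation.

```python
import json
import re
from typing import Any, Dict, List

# Match <tool_call>/<function_call> bodies; tolerate a missing closing tag at EOF.
TAG_RE = re.compile(
    r"<(?:tool_call|function_call)>(.*?)(?:</(?:tool_call|function_call)>|$)",
    re.DOTALL,
)

def repair_json(text: str) -> str:
    """Append the closing braces/brackets the model forgot to emit."""
    stack = []
    in_str = False
    escaped = False
    for ch in text:
        if escaped:
            escaped = False
            continue
        if ch == "\\":
            escaped = True
        elif ch == '"':
            in_str = not in_str
        elif not in_str:
            if ch in "{[":
                stack.append("}" if ch == "{" else "]")
            elif ch in "}]" and stack:
                stack.pop()
    return text + "".join(reversed(stack))

def parse_tool_calls(completion: str) -> List[Dict[str, Any]]:
    """Extract every tagged tool call, repairing truncated JSON bodies."""
    calls = []
    for raw in TAG_RE.findall(completion):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            try:
                calls.append(json.loads(repair_json(raw.strip())))
            except json.JSONDecodeError:
                continue  # drop unrecoverable calls
    return calls
```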
…onfigs

Port Fleet-specific training infrastructure from fork to fresh SkyRL-v2:

Entrypoints:
- main_fleet.py: GRPO training on Fleet-hosted envs with S3 checkpoints
- main_task_gen.py: Task generation training entrypoint
- main_fleet_tinker.py: Tinker-based training with Fleet envs (LoRA, async)

Dataset & Checkpoints:
- prepare_dataset.py: Convert Fleet task JSON to SkyRL parquet format (stratified split, dedup, env capping, difficulty filtering)
- s3_checkpoints.py: Async S3 upload, cross-VM resume, local cleanup
- export_tasks.py: CLI to export tasks from Fleet API

Training Scripts:
- fleet-common-setup.sh: Shared setup (deps, OpenEnv, dataset download)
- fleet-common-run.sh: Multi-node Ray cluster + training launch
- fleet-35b-run.sh: Qwen3.5-35B config (TP=2, multi-node)
- fleet-qwen35-extra-setup.sh: Qwen3.5 deps (transformers 5.3, flash-attn)
- fleet-task-gen-run.sh: Task generation config

SkyPilot YAML Configs:
- openenv-fleet-grpo-qwen3_5-35b.yaml: 2-node H200 training
- task-gen-grpo-qwen3_5-9b.yaml: Single-node task gen

Also adds fleet_task and task_gen config to skyrl_gym_config/default.yaml.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port the task generation environment from fleet-ai/SkyRL that enables RL-based training of task-generating models. The environment supports multi-turn task generation where the model generates (prompt, verifier) pairs that are evaluated via Fleet harness rollouts.

Key components:
- TaskGenEnv(BaseTextEnv): Multi-turn env with tool-based DB exploration, task generation, and reward computation via variance + hint gap
- VerifierSandbox: AST-based static analysis for generated verifier code safety (blocked imports/builtins, complexity bounds, signature checks)
- Tool call parser: Handles <tool_call>/<function_call> tag formats

Reward formula: R = gate * (base_quality + alpha * var(raw_scores) + hint_gap)

Depends on PR #2 (fleet/training) for integrations.fleet.task_gen_reward.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
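A minimal sketch of the stated reward formula, assuming the gate is a binary sandbox/judge pass signal and that hint_gap is the mean hinted score minus the mean raw score (the exact hint_gap definition is an assumption, not spelled out in the commit):

```python
from statistics import pvariance
from typing import Sequence

def task_gen_reward(
    raw_scores: Sequence[float],     # verifier scores from k harness rollouts
    hinted_scores: Sequence[float],  # scores for rollouts with hints injected
    base_quality: float,             # static quality score for the (prompt, verifier) pair
    gate: float,                     # 0.0 if sandbox/judge rejects the verifier, else 1.0
    alpha: float = 1.0,
) -> float:
    """R = gate * (base_quality + alpha * var(raw_scores) + hint_gap).

    Variance rewards tasks that are neither trivially easy nor impossible;
    hint_gap rewards tasks whose hints actually help the solver.
    """
    var = pvariance(raw_scores) if len(raw_scores) > 1 else 0.0
    hint_gap = sum(hinted_scores) / len(hinted_scores) - sum(raw_scores) / len(raw_scores)
    return gate * (base_quality + alpha * var + hint_gap)
```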
When all raw rollout samples for a prompt score 0, hint augmentation generates additional rollouts with verifier feedback injected into the prompt. This rescues GRPO signal for otherwise dead prompts.

Key components:
- _run_hint_augmentation() in SkyRLGymGenerator: groups outputs by instance_id, identifies failing prompts, builds hint text from verifier ERROR/SUCCESS_ACCUMULATOR, launches hinted rollouts
- RLTF-SD: replaces hinted prompt_ids with original unhinted prompt_ids so the model learns to produce hint-quality outputs from the original prompt alone (grad log pi(y_hint | x_0), not grad log pi(y_hint | x_0 + hint))
- First-turn baseline in compute_grpo_outcome_advantage: when is_hinted is present, computes group mean/std from raw samples only, preventing hinted samples from contaminating the GRPO baseline
- Metrics: hint/total_hinted_rollouts, hint/hint_success_rate, hint/prompts_hinted, hint/signal_rescued

Config: enable_hints, hint_reward_threshold, n_hint_samples in fleet_task section of skyrl_gym_config. Only runs during training (not eval), only for non-step-wise trajectories, and only when fleet_task.enable_hints=true.

Depends on PR #1 (fleet/task-env) for FleetTaskEnv.build_hint_text().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
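The first-turn baseline can be illustrated with a small sketch. The real compute_grpo_outcome_advantage carries more bookkeeping; the function and variable names here are assumptions. The key property is that hinted samples are normalized against the group baseline but never contribute to it:

```python
import torch

def grpo_advantages(rewards, index, is_hinted, eps=1e-6):
    """Group-normalize rewards, computing each group's mean/std from raw
    (unhinted) samples only, so hinted rollouts cannot shift the baseline."""
    # Collect raw samples per prompt group.
    id2raw = {}
    for r, idx, hinted in zip(rewards, index, is_hinted):
        if not hinted:
            id2raw.setdefault(idx, []).append(r)
    advantages = []
    for r, idx in zip(rewards, index):
        raw = torch.stack(id2raw[idx])  # stack, not torch.tensor: works on CUDA
        advantages.append((r - raw.mean()) / (raw.std() + eps))
    return torch.stack(advantages)
```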
Port vision-language model support from SkyRL v1 (feat/vl-support-clean) to SkyRL-v2's architecture:
- Generator: VL-aware chat template, image accumulation across turns, multi_modal_data construction for vLLM
- Engine pipeline: thread multi_modal_data through preprocess/generate in both sync and async vLLM engines
- Fleet env: Qwen coordinate adaptation ([0,1000] <-> pixel), initial screenshot capture, computer_use browser hints, done signal detection
- Utilities: image extraction, base64 decode, processor loading, VL chat template with proper vision token expansion
- New VL run script and SkyPilot YAML for CUA training
- Update existing YAMLs to use fleet/all branch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
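The coordinate adaptation is a simple linear rescale between Qwen's normalized [0, 1000] space and actual screen pixels; a sketch (function names are illustrative):

```python
def model_to_pixel(x: int, y: int, width: int, height: int) -> tuple:
    """Map Qwen's normalized [0, 1000] coordinates to screen pixels."""
    return round(x * width / 1000), round(y * height / 1000)

def pixel_to_model(px: int, py: int, width: int, height: int) -> tuple:
    """Inverse mapping: screen pixels back to the [0, 1000] space."""
    return round(px * 1000 / width), round(py * 1000 / height)
```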
RunPod/Lambda/Nebius/Vast were all out of H200 capacity. Add GCP spot with proper NVIDIA 570 driver image. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SkyRL-v2 pyproject.toml defines 'fsdp' extra (includes vllm, flash-attn, torch, flashinfer) but not a standalone 'vllm' extra. The old SkyRL had 'vllm' as a separate extra. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uv pip install silently fails to build causal-conv1d CUDA extension (reports "Checked 1 package" but module is not importable). Use pip with --no-build-isolation to ensure it finds torch from the venv. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In SkyRL-v2, scripts/ is directly under repo root (not nested under skyrl-train/). Changed cd from "../.." to ".." so the run scripts correctly resolve the repo root directory. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TrainerConfig: loss_chunk_size, use_hybrid_env_sampling, min_samples_per_env
GeneratorConfig: inject_context_status, context_warning_threshold, trajectory_timeout_seconds

SkyRL-v2's strict Hydra config rejects unknown keys (no + prefix), so these must be defined in the dataclass and YAML defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The fleet entrypoints use @hydra.main which loads the legacy YAML directly, but validate_cfg expects generator.inference_engine.* (the new structured format). Apply translate_legacy_config to convert flat generator.* keys before validation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The legacy YAML has flat generator.* keys (e.g. generator.backend) but validate_cfg expects generator.inference_engine.* with all fields including distributed_executor_backend. Add the full inference_engine section with defaults so all fields are present after Hydra loads the config and translate_legacy_config moves CLI overrides into it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove explicit fleet_task register() call from main_fleet.py since skyrl_gym.envs.__init__ already auto-registers it
- Remove --data-dir-name task_gen from task-gen run script so it uses the default MODALITY-based path (matching setup's download path)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace OmegaConf.create() approach (loses dataclass type info) with in-place sync of flat generator.* CLI overrides into the structured generator.inference_engine section. This preserves the Hydra DictConfig and avoids TypeError on dataclasses.asdict().
- Remove --skip-prepare from task-gen YAML so parquet files are generated
- Remove duplicate fleet_task registration (auto-registered by __init__)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hydra entrypoints pass DictConfig (not dataclass instances), so dataclasses.asdict() fails. Fall back to OmegaConf.to_yaml() for DictConfig objects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hydra's @hydra.main produces DictConfig objects, but the codebase expects typed dataclass instances (asdict(), attribute access, etc.). Switch Fleet entrypoints to use SkyRLTrainConfig.from_cli_overrides() which produces proper typed dataclasses via the legacy config translation path.
- Add fleet_task/task_gen as Optional[Dict] fields on SkyRLGymConfig
- Strip ++/+ Hydra prefixes from CLI args before from_cli_overrides
- Remove _sync_legacy_generator_to_inference_engine (legacy path handles it)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
accelerate 1.12.0 passes param.__dict__ (which includes transformers 5.3.0's _is_hf_initialized flag) to Parameter.__new__() during init_empty_weights. PyTorch 2.10.0 rejects this unknown kwarg. Newer accelerate filters it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uv pip install -U accelerate pulls newer torch with CUDA 13.0, breaking torchvision (CUDA 12.8). Use pip install --no-deps instead to upgrade only accelerate without re-resolving transitive dependencies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
accelerate's init_empty_weights passes param.__dict__ to Parameter() which includes _is_hf_initialized (set by transformers 5.x). torch 2.10 rejects this unknown kwarg. Patch Parameter.__new__ in fsdp_utils.py to filter it out. Revert accelerate upgrade attempt (latest is 1.13.0, still has the same issue). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
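A minimal sketch of the kind of Parameter.__new__ filter described; the real patch lives in fsdp_utils.py and may differ in detail. The idea is to swallow any kwarg (such as _is_hf_initialized) that torch's constructor does not accept:

```python
import torch

# Keep a handle on the original constructor so we can delegate to it.
_orig_new = torch.nn.Parameter.__new__

def _patched_new(cls, data=None, requires_grad=True, **kwargs):
    # transformers 5.x stores _is_hf_initialized in param.__dict__; accelerate's
    # init_empty_weights forwards that dict as kwargs and torch 2.10 rejects the
    # unknown key. Drop everything except the two supported arguments.
    return _orig_new(cls, data, requires_grad)

torch.nn.Parameter.__new__ = _patched_new
```

With the patch in place, a call that previously raised a TypeError on the unknown kwarg constructs the parameter normally.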
- config.py: Always use legacy config path in from_cli_overrides to ensure flat keys (generator.backend etc.) are properly translated via translate_legacy_config. Fixes VL/35B ValueError on GeneratorConfig.
- prepare_dataset.py: Add --env-class CLI arg (fleet_task|task_gen) to set per-record env_class in parquet data. Previously hardcoded to fleet_task, causing task_gen training to create FleetTaskEnv (requires tasks_file).
- fleet-common-setup.sh: Accept --env-class and pass to prepare_dataset.
- task-gen YAML: Pass --env-class task_gen in setup block.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- task_gen_env.py: Default ROLLOUT_DIR to ~/rollouts instead of /workspace/rollouts. /workspace doesn't exist on GCP (only RunPod), causing PermissionError.
- config.py: Disable OmegaConf struct flag on base config before merging CLI overrides. Empty dicts in YAML (like chat_template_kwargs: {}) are loaded as closed structs, rejecting new keys during merge.
- config.py: Add try/except around asdict() in get_config_as_yaml_str to handle edge cases where asdict fails on Ray-serialized dataclasses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… logging

FLEET_API_KEY was not being propagated to Ray workers via runtime_env, causing task_gen's import_single_task to fail with empty API key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dataset prepare step stores the environment name as 'data_source' column, but TaskGenEnv.__init__ only looked for 'env_key'. This caused all import_single_task calls to use env_id='unknown', which fails with "Environment 'unknown' not found" from Fleet API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
expandable_segments:True in PYTORCH_CUDA_ALLOC_CONF is incompatible with vLLM's CuMemAllocator, causing AssertionError during model load. The 9B script already had this flag; the 35B was missing it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fleet env returns list-format content (from OpenEnv multimodal observations) that text-only templates like Qwen3.5-35B-A3B can't handle. This converts list content (strings or image_url dicts) to plain text before applying the chat template, preventing jinja2 TemplateError on non-VL models. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
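The list-to-text conversion can be sketched as follows; the exact handling (and the "[image omitted]" placeholder) is an assumption about the implementation, not lifted from it:

```python
def flatten_content(content):
    """Collapse OpenEnv list-format message content into plain text so
    text-only chat templates (e.g. Qwen3.5-35B-A3B) don't raise a jinja2
    TemplateError on image_url blocks."""
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif block.get("type") == "text":
            parts.append(block.get("text", ""))
        elif block.get("type") == "image_url":
            parts.append("[image omitted]")  # placeholder for non-VL models
    return "\n".join(parts)
```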
Hint augmentation extends trajectory_ids in generator_input in-place, but the separate uids variable in the trainer was never updated. This caused IndexError in postprocess_generator_output when uids had fewer entries than rewards (128 raw + N hinted rewards vs 128 uids). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a Fleet environment fails to provision (e.g., list_tools timeout), return a zero-reward trajectory instead of propagating the exception through tqdm.gather and crashing the entire training step. This makes training resilient to transient Fleet API / MCP failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
vLLM 0.18.0's CuMemAllocator conflicts with expandable_segments. Without expandable_segments, memory fragmentation causes OOM on 35B. Pin vLLM 0.17.0 (cudaMalloc, no conflict). Consolidated CHANGELOG: 5 fixes (merged old #4/#5 into a single vLLM pin fix). Updated CLAUDE.md to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The upstream vLLM 0.18 bump (d00b17e) removed backward-compat shims and added 0.18-only APIs (OpenAIModelRegistry, OpenAIServingRender). Since fleet-35b-run.sh pins vLLM 0.17.0 (for expandable_segments compatibility), revert vllm_engine.py to the version with 0.17.0 support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverts vLLM 0.17.0 pin and vllm_engine.py pre-0.18 revert. Instead:
- MAX_INPUT_LENGTH 96000 → 72000 to reduce memory pressure
- --no-pytorch-alloc-conf (disables expandable_segments for 0.18.0 compat)
- flash_attn=true + chunked lm_head + empty_cache at 72K

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. parse_tool_calls() returns ALL <tool_call> tags instead of just the first one. The model often batches multiple tool calls in one generation (73% of trajectories). Previously only the first was executed, the rest silently dropped.
2. Remove the "must explore before generating task" gate. With TCP errors on describe_db, this gate rejected 62-78% of generated tasks. The model should be free to generate tasks at any point.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…0.18.0

flash_attn=true + vLLM 0.18.0 triggers Xid 31 FAULT_PDE in GatedDeltaNet during ref forward at both 97K and 72K — not a memory issue but CUDA memory mapping corruption from vLLM's CuMemAllocator. Trying SDPA at 72K.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backward pass verified working on sky-4da1-deniz. Re-enable step-0 eval for production training. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ep timing

Corrected fix #4 to reflect the final working config:
- flash_attn=false (SDPA), not flash_attn=true
- flash_attn=true causes Xid 31 at both 97K AND 72K (CuMemAllocator issue)
- Added "Verified working" note: ref forward 8.4 min, backward 45.6 min

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
12 hours on GCP spot 2×H200:8 with zero GPU errors. Avg step time ~70 min, checkpoint saved to S3 at step 10. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add enable_hints flag (default False) gated by env_config. Previously hints were always ON during training (controlled by is_eval). Now hints only run when enable_hints=True AND not in eval mode. Hints were net negative in iter#11 — verifier code dump confused evaluator. Reward now uses raw variance only: R = base_quality + judge_gate * alpha * var(raw). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
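The two changes in this commit (the enable_hints gate and the simplified reward) can be sketched as below; function names and the config-dict shape are illustrative assumptions:

```python
from statistics import pvariance
from typing import Mapping, Sequence

def hints_enabled(env_config: Mapping, is_eval: bool) -> bool:
    """Hints run only when explicitly enabled for this env AND during
    training; eval rollouts must reflect unassisted performance."""
    return bool(env_config.get("enable_hints", False)) and not is_eval

def reward_raw_variance(base_quality: float, judge_gate: float,
                        raw_scores: Sequence[float], alpha: float = 1.0) -> float:
    """Post-iter#11 reward: hint gap removed, raw-score variance only.
    R = base_quality + judge_gate * alpha * var(raw)."""
    return base_quality + judge_gate * alpha * pvariance(raw_scores)
```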
- Add tool_call_reward_per_call config (default 0.0, set to 0.02 in run script). Rewards each successful meta-tool call (describe_db, query_db) to incentivize multi-turn DB exploration instead of single-turn guessing from system prompt.
- MAX_INPUT_LENGTH 30720 → 65536: baseline runs showed 30K forced single-turn convergence by step 5 (describe_db schemas overflow context budget).
- MAX_GENERATE_LENGTH 2048 → 4096: more room for task+verifier output.
- eval_interval 20 → 10: get eval signal earlier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrites _judge_task as a pre-filter optimized for very low false positive rate. Only rejects tasks with clear structural defects:
1. Phantom tables (not in env schema)
2. Undefined function/constant references
3. Vacuous checks (only user-exists or len>0)
4. Read-write mismatch (prompt asks reads, verifier checks writes)

Passes env_schema to the classifier so it can verify table references. Uses Haiku via OpenRouter for low cost/latency (~$0.001 per call). Defaults to ACCEPT on any error (conservative).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Read-write mismatch is too subjective and risks false positives. Classifier now checks only: phantom tables, undefined refs, vacuous checks. Switch judge model from Haiku to Sonnet 4.5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fleet integration: tool-use training, VL, task-gen, multi-node 35B
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Four changes to address v3 reward hacking (89% keyword-only verifiers):
1. Enhanced LLM judge prompt: Replaces lenient pre-filter (93.8% false positive rate) with verifier rigor classification. ACCEPT only for DB-grounded verifiers (mutation diff, DB-queried answer validation, specific record lookup). REJECT keyword-only, prompt-echo, dead-code DB, cargo-cult, phantom tables, undefined refs. Backtested on 6,409 v3 trajectories: PASS 4.7%, FAIL 95.3%.
2. AST node limit 500 → 700: Unblocks outlook verifiers (avg 414 nodes, 336 rejected at old limit) without accepting degenerate verifiers.
3. Exploration enforcement: Gates <task> submission on called_describe_db when max_turns > 1. Forces minimum 2 turns (describe_db → submit) so model sees actual schema before generating verifier.
4. Auto-populate env_schema: Calls describe_db("seed") during init_async() when env_schema is empty (all current datasets). Ensures judge prompt and system prompt always have the real schema for phantom table detection.

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- Use TASK_GEN_ENV_CLASSES (not ENV_KEYS) to match run script
- Add OPENROUTER_API_KEY (required for LLM judge)
- Default to ticketmaster/zillow/outlook (not all 8)
- Accept envs as CLI args

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
declare -A breaks with set -u on some shells. Use case/esac instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 16 OOM'd during forward_backward with 128K context + VL screenshots. GPU 0 had 133.4 GiB used, only 436 MiB free, needed 2.32 GiB. 96K matches the 35B run's approach for avoiding OOM without expandable_segments. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add _format_compact_schema(): "table: col (type), ..." format instead of raw describe_db dump (152K → ~10K for zillow)
- Remove describe_db from _META_TOOLS — schema is in the prompt
- Remove describe_db exploration gate — was causing context starvation for large-schema envs (zillow 82 tables, outlook 62 tables)
- Update system prompt: workflow starts with query_db

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
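A sketch of the compact schema formatter. The input shape (a mapping of table name to (column, type) pairs) is an assumed simplification of the raw describe_db output:

```python
def format_compact_schema(schema: dict) -> str:
    """Render a schema as one line per table: 'table: col (type), ...'.
    Orders of magnitude smaller than a raw describe_db dump."""
    return "\n".join(
        f"{table}: " + ", ".join(f"{col} ({ctype})" for col, ctype in cols)
        for table, cols in schema.items()
    )
```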
…eness, unfiltered .all() rejection (#14)

Adapt battle-tested patterns from orchestrator verifier into task-gen prompt and sandbox to fix v4.1 single-turn collapse and degenerate verifiers.
- Exploration gate: bounce <task> submission if model hasn't called query_db yet (multi-turn mode only), preventing single-turn collapse
- Strengthened verifier template: find_new_entries docstring, set-based comparison example (order-independent validation)
- Three new Rules: unfiltered .all() prohibition, set-based comparison, anti-permissiveness (must return 0 on unmodified DB)
- AST hard fail: sandbox rejects verifiers with .table("X").all() without preceding .eq()/.neq()/.select() filter (prevents warm-pool saturation)

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
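The AST hard fail can be sketched with a simplified checker. The filter-method names come from the commit; the chain-walking logic is an illustrative assumption, not the sandbox's actual code:

```python
import ast

# Method names that count as filters before a terminal .all() call.
FILTER_METHODS = {"eq", "neq", "select"}

def has_unfiltered_all(source: str) -> bool:
    """Return True if the code contains a .all() call whose receiver chain
    has no .eq()/.neq()/.select() filter (e.g. db.table("X").all())."""
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "all"):
            # Walk down the method chain looking for a filter call.
            cur = node.func.value
            filtered = False
            while isinstance(cur, ast.Call) and isinstance(cur.func, ast.Attribute):
                if cur.func.attr in FILTER_METHODS:
                    filtered = True
                cur = cur.func.value
            if not filtered:
                return True
    return False
```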
Add Qwen3.5-35B-A3B task generation config:
- 2-node (16 GPUs), TP=2, 8 inference engines
- flash_attn=false (SDPA), 72K input, chunked lm_head
- Task-gen entrypoint with judge, evaluator, and k_rollouts config
- Lower LR (5e-7) matching 35B tool-use training

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Replace static verifier-feedback hints with Claude Sonnet-powered hint synthesis that analyzes the full failed trajectory + verifier errors to produce actionable guidance. Falls back to static hints on any failure.

Key changes:
- New hint_synthesizer.py module (batch async synthesis with semaphore)
- Expose chat_history in env_metrics for trajectory analysis
- Track hint_category (llm_synthesized vs static_fallback) in metrics
- Add use_llm_hints, hint_model, hint_llm_timeout config options
- Add ANTHROPIC_API_KEY to 35B run script and SkyPilot YAML

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review
This pull request introduces SkyRL-v2, a fork optimized for multi-node FSDP2 training with features like synchronous model offloading, chunked logit computation, and hint-augmented advantage computation. It also adds Vision-Language support, S3 checkpoint management, and a task generation environment. Feedback identifies critical bugs in advantage computation where torch.tensor() is used on CUDA tensors instead of torch.stack(), and an invalid model identifier in the hint synthesizer. Additionally, the reviewer suggests moving imports to the top of the file for PEP 8 compliance and documenting environment variables required for the Supabase fallback mechanism.
```python
id2mean[idx] = torch.mean(torch.tensor(raw))
id2std[idx] = torch.std(torch.tensor([raw]))
```
Using torch.tensor([raw]) where raw is a list of scalar tensors located on a GPU will trigger a TypeError because torch.tensor() attempts to convert the input to a NumPy array first, which is not allowed for CUDA tensors. Additionally, creating a tensor from a list of tensors is inefficient as it involves multiple host-device synchronizations. Use torch.stack(raw) instead, which is faster and correctly handles tensors on any device.
```python
else:
    raw_tensor = torch.stack(raw)
    id2mean[idx] = raw_tensor.mean()
    id2std[idx] = raw_tensor.std()
```

```python
id2mean[idx] = torch.mean(torch.tensor(id2score[idx]))
id2std[idx] = torch.std(torch.tensor([id2score[idx]]))
```
Similar to the issue in the is_hinted path, torch.tensor([id2score[idx]]) will crash if the tensors in the list are on a GPU. Use torch.stack() for better performance and compatibility with CUDA tensors.
Suggested change:

```diff
-    id2mean[idx] = torch.mean(torch.tensor(id2score[idx]))
-    id2std[idx] = torch.std(torch.tensor([id2score[idx]]))
+elif len(id2score[idx]) > 1:
+    group_scores = torch.stack(id2score[idx])
+    id2mean[idx] = group_scores.mean()
+    id2std[idx] = group_scores.std()
```
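To see why the review recommends torch.stack, here is a minimal runnable illustration (the score values are made up). With CUDA tensors, torch.tensor() on a list of tensors raises because it round-trips through NumPy; torch.stack concatenates on-device with no host synchronization:

```python
import torch

# Rollout scores as 0-dim tensors, as they would appear in an id2score group.
scores = [torch.tensor(0.0), torch.tensor(1.0), torch.tensor(1.0)]

# torch.stack builds the group tensor directly from the member tensors,
# staying on whatever device they already live on.
group = torch.stack(scores)
mean, std = group.mean(), group.std()
```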
```python
verifier_stdout: Optional[str],
verifier_error: Optional[str],
tool_error_messages: Optional[List[str]],
model: str = "claude-sonnet-4-20250514",
```
The model name claude-sonnet-4-20250514 appears to be invalid or a future-dated placeholder. Anthropic's current Claude 3.5 Sonnet model identifier is typically claude-3-5-sonnet-20241022 (or claude-3-5-sonnet-latest). Using a non-existent model name will cause the Anthropic API to return a 404 error, breaking the hint synthesis feature.
Suggested change:

```diff
-    model: str = "claude-sonnet-4-20250514",
+    model: str = "claude-3-5-sonnet-20241022",
```
```python
import ast
import re
```
```python
"""Query Supabase for session verifier scores as fallback.
```
Summary
- hint_synthesizer.py module with batch async synthesis (semaphore-controlled concurrency, 30s timeout, automatic static fallback)
- hint_category (llm_synthesized vs static_fallback) with per-category success rate metrics in WandB

Test plan
- Hint [llm_synthesized] entries appear in logs with actionable text
- hint/category_llm_synthesized_success_rate > 0% within first 10 steps

🤖 Generated with Claude Code