
feat: LLM-synthesized hints for failed trajectories #1456

Open
dzorlu wants to merge 91 commits into NovaSky-AI:main from fleet-ai:feat/llm-hints

Conversation


@dzorlu dzorlu commented Apr 4, 2026

Summary

  • Replace static verifier-feedback hints (0% recovery in 35B parity run) with Claude Sonnet-powered hint synthesis that analyzes the full failed trajectory + verifier errors
  • New hint_synthesizer.py module with batch async synthesis (semaphore-controlled concurrency, 30s timeout, automatic static fallback)
  • Track hint_category (llm_synthesized vs static_fallback) with per-category success rate metrics in WandB
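The batch-synthesis flow summarized above (semaphore-bounded concurrency, 30s timeout, automatic static fallback) can be sketched roughly as follows; function names and signatures here are illustrative, not the actual hint_synthesizer.py API:

```python
import asyncio

async def synthesize_hints(failed_trajectories, call_llm, static_hint,
                           max_concurrency=8, timeout=30.0):
    # Bound concurrent LLM calls with a semaphore; on timeout or any
    # API error, fall back to the static verifier-feedback hint.
    sem = asyncio.Semaphore(max_concurrency)

    async def one(traj):
        try:
            async with sem:
                hint = await asyncio.wait_for(call_llm(traj), timeout=timeout)
            return hint, "llm_synthesized"
        except Exception:
            return static_hint(traj), "static_fallback"

    # gather preserves input order: one (hint, category) pair per trajectory
    return await asyncio.gather(*(one(t) for t in failed_trajectories))
```

The returned category string is what the per-category success-rate metrics in WandB key on.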

Test plan

  • Verify Hint [llm_synthesized] entries appear in logs with actionable text
  • Check WandB for hint/category_llm_synthesized_success_rate > 0% within first 10 steps
  • Confirm static fallback works when ANTHROPIC_API_KEY is missing
  • Compare eval pass@3 at step 20 vs parity run baseline (62.1%)

🤖 Generated with Claude Code



Deniz and others added 30 commits March 28, 2026 14:38
Add fleet_task environment that integrates Fleet-hosted tasks with SkyRL
via OpenEnv's FleetTaskEnv abstraction layer. Supports multi-turn
tool-use and computer-use (multimodal) modalities.

- FleetTaskEnv(BaseTextEnv): provisions Fleet env, multi-turn episodes,
  reward via verifier, partial reward support, hint augmentation
- Tool call parser: handles <tool_call>/<function_call> tag formats with
  JSON repair for missing closing braces
- Multimodal observations: returns image_url content blocks for CUA,
  compatible with upstream's extract_images_from_conversation()
- Per-env metrics aggregation with environment breakdown
- Context management integration for long trajectories
- Trace upload support for eval telemetry

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
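A minimal sketch of the tag parsing plus closing-brace repair described above (the actual parser in this PR may differ in details):

```python
import json
import re

TAGS = ("tool_call", "function_call")

def parse_tool_calls(text):
    """Extract all <tool_call>/<function_call> bodies and parse them as
    JSON, repairing a missing closing brace when needed."""
    calls = []
    for tag in TAGS:
        for body in re.findall(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL):
            body = body.strip()
            try:
                calls.append(json.loads(body))
            except json.JSONDecodeError:
                # naive repair: balance any unclosed braces and retry
                repaired = body + "}" * (body.count("{") - body.count("}"))
                try:
                    calls.append(json.loads(repaired))
                except json.JSONDecodeError:
                    continue
    return calls
```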
…onfigs

Port Fleet-specific training infrastructure from fork to fresh SkyRL-v2:

Entrypoints:
- main_fleet.py: GRPO training on Fleet-hosted envs with S3 checkpoints
- main_task_gen.py: Task generation training entrypoint
- main_fleet_tinker.py: Tinker-based training with Fleet envs (LoRA, async)

Dataset & Checkpoints:
- prepare_dataset.py: Convert Fleet task JSON to SkyRL parquet format
  (stratified split, dedup, env capping, difficulty filtering)
- s3_checkpoints.py: Async S3 upload, cross-VM resume, local cleanup
- export_tasks.py: CLI to export tasks from Fleet API

Training Scripts:
- fleet-common-setup.sh: Shared setup (deps, OpenEnv, dataset download)
- fleet-common-run.sh: Multi-node Ray cluster + training launch
- fleet-35b-run.sh: Qwen3.5-35B config (TP=2, multi-node)
- fleet-qwen35-extra-setup.sh: Qwen3.5 deps (transformers 5.3, flash-attn)
- fleet-task-gen-run.sh: Task generation config

SkyPilot YAML Configs:
- openenv-fleet-grpo-qwen3_5-35b.yaml: 2-node H200 training
- task-gen-grpo-qwen3_5-9b.yaml: Single-node task gen

Also adds fleet_task and task_gen config to skyrl_gym_config/default.yaml.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port the task generation environment from fleet-ai/SkyRL that enables
RL-based training of task-generating models. The environment supports
multi-turn task generation where the model generates (prompt, verifier)
pairs that are evaluated via Fleet harness rollouts.

Key components:
- TaskGenEnv(BaseTextEnv): Multi-turn env with tool-based DB exploration,
  task generation, and reward computation via variance + hint gap
- VerifierSandbox: AST-based static analysis for generated verifier code
  safety (blocked imports/builtins, complexity bounds, signature checks)
- Tool call parser: Handles <tool_call>/<function_call> tag formats

Reward formula: R = gate * (base_quality + alpha * var(raw_scores) + hint_gap)

Depends on PR #2 (fleet/training) for integrations.fleet.task_gen_reward.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
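As a worked illustration of the reward formula above, assuming hint_gap = mean(hint scores) - mean(raw scores), gate in {0, 1}, and a default alpha of 0.5 (all three are assumptions; the commit does not spell them out):

```python
from statistics import mean, pvariance

def task_gen_reward(base_quality, raw_scores, hint_scores, gate, alpha=0.5):
    # R = gate * (base_quality + alpha * var(raw_scores) + hint_gap)
    hint_gap = mean(hint_scores) - mean(raw_scores)
    return gate * (base_quality + alpha * pvariance(raw_scores) + hint_gap)
```

For example, with base_quality=0.1, raw scores [0, 1] (variance 0.25), and hint scores [1, 1], a passing gate yields 0.1 + 0.125 + 0.5 = 0.725.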
When all raw rollout samples for a prompt score 0, hint augmentation
generates additional rollouts with verifier feedback injected into the
prompt. This rescues GRPO signal for otherwise dead prompts.

Key components:
- _run_hint_augmentation() in SkyRLGymGenerator: groups outputs by
  instance_id, identifies failing prompts, builds hint text from
  verifier ERROR/SUCCESS_ACCUMULATOR, launches hinted rollouts
- RLTF-SD: replaces hinted prompt_ids with original unhinted prompt_ids
  so the model learns to produce hint-quality outputs from the original
  prompt alone (grad log pi(y_hint | x_0) not grad log pi(y_hint | x_0 + hint))
- First-turn baseline in compute_grpo_outcome_advantage: when is_hinted
  is present, computes group mean/std from raw samples only, preventing
  hinted samples from contaminating the GRPO baseline
- Metrics: hint/total_hinted_rollouts, hint/hint_success_rate,
  hint/prompts_hinted, hint/signal_rescued

Config: enable_hints, hint_reward_threshold, n_hint_samples in
fleet_task section of skyrl_gym_config.

Only runs during training (not eval), only for non-step-wise
trajectories, and only when fleet_task.enable_hints=true.

Depends on PR #1 (fleet/task-env) for FleetTaskEnv.build_hint_text().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
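A scalar sketch of the first-turn baseline: the group mean/std come from raw (unhinted) samples only, then normalize every sample, hinted or not. The real compute_grpo_outcome_advantage operates on torch tensors; this is just the arithmetic, with the epsilon an assumption:

```python
from statistics import mean, pstdev

def grpo_advantages(scores, is_hinted, eps=1e-6):
    # Baseline from raw samples only, so hinted rollouts cannot
    # contaminate the group mean/std.
    raw = [s for s, h in zip(scores, is_hinted) if not h]
    base = raw if len(raw) > 1 else scores
    mu, sigma = mean(base), pstdev(base)
    return [(s - mu) / (sigma + eps) for s in scores]
```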
Port vision-language model support from SkyRL v1 (feat/vl-support-clean)
to SkyRL-v2's architecture:

- Generator: VL-aware chat template, image accumulation across turns,
  multi_modal_data construction for vLLM
- Engine pipeline: thread multi_modal_data through preprocess/generate
  in both sync and async vLLM engines
- Fleet env: Qwen coordinate adaptation ([0,1000] <-> pixel), initial
  screenshot capture, computer_use browser hints, done signal detection
- Utilities: image extraction, base64 decode, processor loading,
  VL chat template with proper vision token expansion
- New VL run script and SkyPilot YAML for CUA training
- Update existing YAMLs to use fleet/all branch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
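The [0,1000] <-> pixel adaptation mentioned above amounts to a linear rescale in each direction; a sketch (the rounding behavior is an assumption):

```python
def model_to_pixel(x, y, width, height):
    # Map Qwen's normalized [0, 1000] coordinate space to pixels.
    return round(x * width / 1000), round(y * height / 1000)

def pixel_to_model(px, py, width, height):
    # Inverse mapping, applied on observations going back to the model.
    return round(px * 1000 / width), round(py * 1000 / height)
```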
RunPod/Lambda/Nebius/Vast were all out of H200 capacity.
Add GCP spot with proper NVIDIA 570 driver image.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SkyRL-v2 pyproject.toml defines 'fsdp' extra (includes vllm, flash-attn,
torch, flashinfer) but not a standalone 'vllm' extra. The old SkyRL had
'vllm' as a separate extra.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uv pip install silently fails to build causal-conv1d CUDA extension
(reports "Checked 1 package" but module is not importable). Use pip
with --no-build-isolation to ensure it finds torch from the venv.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In SkyRL-v2, scripts/ is directly under repo root (not nested under
skyrl-train/). Changed cd from "../.." to ".." so the run scripts
correctly resolve the repo root directory.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TrainerConfig: loss_chunk_size, use_hybrid_env_sampling, min_samples_per_env
GeneratorConfig: inject_context_status, context_warning_threshold, trajectory_timeout_seconds

SkyRL-v2's strict Hydra config rejects unknown keys (no + prefix),
so these must be defined in the dataclass and YAML defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The fleet entrypoints use @hydra.main which loads the legacy YAML
directly, but validate_cfg expects generator.inference_engine.*
(the new structured format). Apply translate_legacy_config to
convert flat generator.* keys before validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The legacy YAML has flat generator.* keys (e.g. generator.backend)
but validate_cfg expects generator.inference_engine.* with all fields
including distributed_executor_backend. Add the full inference_engine
section with defaults so all fields are present after Hydra loads the
config and translate_legacy_config moves CLI overrides into it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove explicit fleet_task register() call from main_fleet.py since
  skyrl_gym.envs.__init__ already auto-registers it
- Remove --data-dir-name task_gen from task-gen run script so it uses
  the default MODALITY-based path (matching setup's download path)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace OmegaConf.create() approach (loses dataclass type info) with
  in-place sync of flat generator.* CLI overrides into the structured
  generator.inference_engine section. This preserves the Hydra DictConfig
  and avoids TypeError on dataclasses.asdict().
- Remove --skip-prepare from task-gen YAML so parquet files are generated
- Remove duplicate fleet_task registration (auto-registered by __init__)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hydra entrypoints pass DictConfig (not dataclass instances), so
dataclasses.asdict() fails. Fall back to OmegaConf.to_yaml() for
DictConfig objects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hydra's @hydra.main produces DictConfig objects, but the codebase expects
typed dataclass instances (asdict(), attribute access, etc.). Switch Fleet
entrypoints to use SkyRLTrainConfig.from_cli_overrides() which produces
proper typed dataclasses via the legacy config translation path.

- Add fleet_task/task_gen as Optional[Dict] fields on SkyRLGymConfig
- Strip ++/+ Hydra prefixes from CLI args before from_cli_overrides
- Remove _sync_legacy_generator_to_inference_engine (legacy path handles it)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
accelerate 1.12.0 passes param.__dict__ (which includes transformers 5.3.0's
_is_hf_initialized flag) to Parameter.__new__() during init_empty_weights.
PyTorch 2.10.0 rejects this unknown kwarg. Newer accelerate filters it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uv pip install -U accelerate pulls newer torch with CUDA 13.0, breaking
torchvision (CUDA 12.8). Use pip install --no-deps instead to upgrade
only accelerate without re-resolving transitive dependencies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
accelerate's init_empty_weights passes param.__dict__ to Parameter()
which includes _is_hf_initialized (set by transformers 5.x). torch 2.10
rejects this unknown kwarg. Patch Parameter.__new__ in fsdp_utils.py
to filter it out. Revert accelerate upgrade attempt (latest is 1.13.0,
still has the same issue).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
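The shape of the workaround, demonstrated on a stand-in class since the real patch in fsdp_utils.py targets torch.nn.Parameter.__new__:

```python
class FakeParam:
    """Stand-in for torch.nn.Parameter, which rejects unknown kwargs."""
    def __new__(cls, data=None, requires_grad=True):
        obj = super().__new__(cls)
        obj.data, obj.requires_grad = data, requires_grad
        return obj

_orig_new = FakeParam.__new__

def _filtered_new(cls, *args, **kwargs):
    # transformers 5.x sets _is_hf_initialized in param.__dict__, which
    # accelerate forwards to __new__; drop it before the original sees it.
    kwargs.pop("_is_hf_initialized", None)
    return _orig_new(cls, *args, **kwargs)

FakeParam.__new__ = staticmethod(_filtered_new)
```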
- config.py: Always use legacy config path in from_cli_overrides to ensure
  flat keys (generator.backend etc.) are properly translated via
  translate_legacy_config. Fixes VL/35B ValueError on GeneratorConfig.

- prepare_dataset.py: Add --env-class CLI arg (fleet_task|task_gen) to set
  per-record env_class in parquet data. Previously hardcoded to fleet_task,
  causing task_gen training to create FleetTaskEnv (requires tasks_file).

- fleet-common-setup.sh: Accept --env-class and pass to prepare_dataset.

- task-gen YAML: Pass --env-class task_gen in setup block.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- task_gen_env.py: Default ROLLOUT_DIR to ~/rollouts instead of /workspace/rollouts.
  /workspace doesn't exist on GCP (only RunPod), causing PermissionError.

- config.py: Disable OmegaConf struct flag on base config before merging
  CLI overrides. Empty dicts in YAML (like chat_template_kwargs: {}) are
  loaded as closed structs, rejecting new keys during merge.

- config.py: Add try/except around asdict() in get_config_as_yaml_str
  to handle edge cases where asdict fails on Ray-serialized dataclasses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… logging

FLEET_API_KEY was not being propagated to Ray workers via runtime_env,
causing task_gen's import_single_task to fail with empty API key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dataset prepare step stores the environment name as 'data_source'
column, but TaskGenEnv.__init__ only looked for 'env_key'. This caused
all import_single_task calls to use env_id='unknown', which fails with
"Environment 'unknown' not found" from Fleet API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
expandable_segments:True in PYTORCH_CUDA_ALLOC_CONF is incompatible
with vLLM's CuMemAllocator, causing AssertionError during model load.
The 9B script already had this flag; the 35B was missing it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fleet env returns list-format content (from OpenEnv multimodal observations)
that text-only templates like Qwen3.5-35B-A3B can't handle. This converts
list content (strings or image_url dicts) to plain text before applying the
chat template, preventing jinja2 TemplateError on non-VL models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
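A sketch of the list-content flattening (the placeholder text for images is an assumption):

```python
def flatten_content(content):
    """Convert OpenEnv list-format message content (strings or typed
    dicts) to plain text so text-only chat templates don't raise a
    jinja2 TemplateError."""
    if isinstance(content, str):
        return content
    parts = []
    for item in content:
        if isinstance(item, str):
            parts.append(item)
        elif isinstance(item, dict):
            if item.get("type") == "text":
                parts.append(item.get("text", ""))
            elif item.get("type") == "image_url":
                parts.append("[image omitted]")
    return "\n".join(parts)
```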
Hint augmentation extends trajectory_ids in generator_input in-place,
but the separate uids variable in the trainer was never updated. This
caused IndexError in postprocess_generator_output when uids had fewer
entries than rewards (128 raw + N hinted rewards vs 128 uids).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a Fleet environment fails to provision (e.g., list_tools timeout),
return a zero-reward trajectory instead of propagating the exception
through tqdm.gather and crashing the entire training step. This makes
training resilient to transient Fleet API / MCP failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
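The resilience pattern is a per-rollout try/except that converts a provisioning failure into a zero-reward sample instead of letting the exception escape the batch gather; the field names in the placeholder are assumptions:

```python
import asyncio

async def safe_rollout(run_episode, prompt):
    # One failed env provision should cost one zero-reward sample,
    # not crash the whole training step.
    try:
        return await run_episode(prompt)
    except Exception as exc:
        # hypothetical zero-reward trajectory placeholder
        return {"response": "", "reward": 0.0, "error": str(exc)}
```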
Deniz and others added 28 commits March 29, 2026 21:13
vLLM 0.18.0 CuMemAllocator conflicts with expandable_segments. Without it,
memory fragmentation causes OOM on 35B. Pin 0.17.0 (cudaMalloc, no conflict).

Consolidated CHANGELOG: 5 fixes (merged old #4/#5 into single vLLM pin fix).
Updated CLAUDE.md to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The upstream vLLM 0.18 bump (d00b17e) removed backward-compat shims
and added 0.18-only APIs (OpenAIModelRegistry, OpenAIServingRender).
Since fleet-35b-run.sh pins vLLM 0.17.0 (for expandable_segments
compatibility), revert vllm_engine.py to the version with 0.17.0 support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverts vLLM 0.17.0 pin and vllm_engine.py pre-0.18 revert. Instead:
- MAX_INPUT_LENGTH 96000→72000 to reduce memory pressure
- --no-pytorch-alloc-conf (disables expandable_segments for 0.18.0 compat)
- flash_attn=true + chunked lm_head + empty_cache at 72K

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. parse_tool_calls() returns ALL <tool_call> tags instead of just the
   first one. The model often batches multiple tool calls in one generation
   (73% of trajectories). Previously only the first was executed, the rest
   silently dropped.

2. Remove the "must explore before generating task" gate. With TCP errors
   on describe_db, this gate rejected 62-78% of generated tasks. The model
   should be free to generate tasks at any point.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…0.18.0

flash_attn=true + vLLM 0.18.0 triggers Xid 31 FAULT_PDE in GatedDeltaNet
during ref forward at both 97K and 72K — not a memory issue but a CUDA
memory mapping corruption from vLLM's CuMemAllocator. Trying SDPA at 72K.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backward pass verified working on sky-4da1-deniz. Re-enable step-0
eval for production training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ep timing

Corrected fix #4 to reflect the final working config:
- flash_attn=false (SDPA), not flash_attn=true
- flash_attn=true causes Xid 31 at both 97K AND 72K (CuMemAllocator issue)
- Added "Verified working" note: ref forward 8.4 min, backward 45.6 min

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
12 hours on GCP spot 2×H200:8 with zero GPU errors.
Avg step time ~70 min, checkpoint saved to S3 at step 10.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add enable_hints flag (default False) gated by env_config.
Previously hints were always ON during training (controlled by is_eval).
Now hints only run when enable_hints=True AND not in eval mode.

Hints were net negative in iter#11 — verifier code dump confused evaluator.
Reward now uses raw variance only: R = base_quality + judge_gate * alpha * var(raw).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add tool_call_reward_per_call config (default 0.0, set to 0.02 in run script)
  Rewards each successful meta-tool call (describe_db, query_db) to incentivize
  multi-turn DB exploration instead of single-turn guessing from system prompt.

- MAX_INPUT_LENGTH 30720 → 65536: baseline runs showed 30K forced single-turn
  convergence by step 5 (describe_db schemas overflow context budget).

- MAX_GENERATE_LENGTH 2048 → 4096: more room for task+verifier output.

- eval_interval 20 → 10: get eval signal earlier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrites _judge_task as a pre-filter optimized for very low false positive
rate. Only rejects tasks with clear structural defects:
  1. Phantom tables (not in env schema)
  2. Undefined function/constant references
  3. Vacuous checks (only user-exists or len>0)
  4. Read-write mismatch (prompt asks reads, verifier checks writes)

Passes env_schema to the classifier so it can verify table references.
Uses Haiku via OpenRouter for low cost/latency (~$0.001 per call).
Defaults to ACCEPT on any error (conservative).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Read-write mismatch is too subjective and risks false positives.
Classifier now checks only: phantom tables, undefined refs, vacuous checks.
Switch judge model from Haiku to Sonnet 4.5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fleet integration: tool-use training, VL, task-gen, multi-node 35B
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Four changes to address v3 reward hacking (89% keyword-only verifiers):

1. Enhanced LLM judge prompt: Replaces lenient pre-filter (93.8% false
   positive rate) with verifier rigor classification. ACCEPT only for
   DB-grounded verifiers (mutation diff, DB-queried answer validation,
   specific record lookup). REJECT keyword-only, prompt-echo, dead-code
   DB, cargo-cult, phantom tables, undefined refs. Backtested on 6,409
   v3 trajectories: PASS 4.7%, FAIL 95.3%.

2. AST node limit 500 → 700: Unblocks outlook verifiers (avg 414 nodes,
   336 rejected at old limit) without accepting degenerate verifiers.

3. Exploration enforcement: Gates <task> submission on called_describe_db
   when max_turns > 1. Forces minimum 2 turns (describe_db → submit) so
   model sees actual schema before generating verifier.

4. Auto-populate env_schema: Calls describe_db("seed") during init_async()
   when env_schema is empty (all current datasets). Ensures judge prompt
   and system prompt always have the real schema for phantom table detection.

Co-authored-by: Deniz <deniz@Denizs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- Use TASK_GEN_ENV_CLASSES (not ENV_KEYS) to match run script
- Add OPENROUTER_API_KEY (required for LLM judge)
- Default to ticketmaster/zillow/outlook (not all 8)
- Accept envs as CLI args

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
declare -A breaks with set -u on some shells. Use case/esac instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 16 OOM'd during forward_backward with 128K context + VL screenshots.
GPU 0 had 133.4 GiB used, only 436 MiB free, needed 2.32 GiB.
96K matches the 35B run's approach for avoiding OOM without expandable_segments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add _format_compact_schema(): "table: col (type), ..." format
  instead of raw describe_db dump (152K → ~10K for zillow)
- Remove describe_db from _META_TOOLS — schema is in the prompt
- Remove describe_db exploration gate — was causing context starvation
  for large-schema envs (zillow 82 tables, outlook 62 tables)
- Update system prompt: workflow starts with query_db

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
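A sketch of the compact format, assuming the schema arrives as {table: [(column, type), ...]} (the real input shape from describe_db may differ):

```python
def format_compact_schema(schema):
    """Render a schema as one 'table: col (type), ...' line per table —
    far smaller than a raw describe_db dump (~152K -> ~10K for zillow,
    per the commit message)."""
    lines = []
    for table, cols in schema.items():
        cols_txt = ", ".join(f"{name} ({dtype})" for name, dtype in cols)
        lines.append(f"{table}: {cols_txt}")
    return "\n".join(lines)
```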
…eness, unfiltered .all() rejection (#14)

Adapt battle-tested patterns from orchestrator verifier into task-gen prompt
and sandbox to fix v4.1 single-turn collapse and degenerate verifiers.

- Exploration gate: bounce <task> submission if model hasn't called query_db
  yet (multi-turn mode only), preventing single-turn collapse
- Strengthened verifier template: find_new_entries docstring, set-based
  comparison example (order-independent validation)
- Three new Rules: unfiltered .all() prohibition, set-based comparison,
  anti-permissiveness (must return 0 on unmodified DB)
- AST hard fail: sandbox rejects verifiers with .table("X").all() without
  preceding .eq()/.neq()/.select() filter (prevents warm-pool saturation)

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
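The AST hard fail can be sketched with the ast module: for each .all() call, walk back down its attribute-call chain looking for an allowed filter. This is a simplification of whatever the sandbox actually implements:

```python
import ast

FILTERS = {"eq", "neq", "select"}

def has_unfiltered_all(code):
    """Flag verifier code that calls .all() with no preceding
    .eq()/.neq()/.select() filter anywhere in the call chain."""
    for node in ast.walk(ast.parse(code)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "all"):
            inner = node.func.value
            filtered = False
            # walk the chain: db.table("X").eq(...).all() etc.
            while isinstance(inner, ast.Call) and isinstance(inner.func, ast.Attribute):
                if inner.func.attr in FILTERS:
                    filtered = True
                inner = inner.func.value
            if not filtered:
                return True
    return False
```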
Add Qwen3.5-35B-A3B task generation config:
- 2-node (16 GPUs), TP=2, 8 inference engines
- flash_attn=false (SDPA), 72K input, chunked lm_head
- Task-gen entrypoint with judge, evaluator, and k_rollouts config
- Lower LR (5e-7) matching 35B tool-use training

Co-authored-by: Deniz <deniz@Mac.localdomain>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Replace static verifier-feedback hints with Claude Sonnet-powered hint
synthesis that analyzes the full failed trajectory + verifier errors to
produce actionable guidance. Falls back to static hints on any failure.

Key changes:
- New hint_synthesizer.py module (batch async synthesis with semaphore)
- Expose chat_history in env_metrics for trajectory analysis
- Track hint_category (llm_synthesized vs static_fallback) in metrics
- Add use_llm_hints, hint_model, hint_llm_timeout config options
- Add ANTHROPIC_API_KEY to 35B run script and SkyPilot YAML

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces SkyRL-v2, a fork optimized for multi-node FSDP2 training with features like synchronous model offloading, chunked logit computation, and hint-augmented advantage computation. It also adds Vision-Language support, S3 checkpoint management, and a task generation environment. Feedback identifies critical bugs in advantage computation where torch.tensor() is used on CUDA tensors instead of torch.stack(), and an invalid model identifier in the hint synthesizer. Additionally, the reviewer suggests moving imports to the top of the file for PEP 8 compliance and documenting environment variables required for the Supabase fallback mechanism.

Comment on lines +1242 to +1243
id2mean[idx] = torch.mean(torch.tensor(raw))
id2std[idx] = torch.std(torch.tensor([raw]))

critical

Using torch.tensor([raw]) where raw is a list of scalar tensors located on a GPU will trigger a TypeError because torch.tensor() attempts to convert the input to a NumPy array first, which is not allowed for CUDA tensors. Additionally, creating a tensor from a list of tensors is inefficient as it involves multiple host-device synchronizations. Use torch.stack(raw) instead, which is faster and correctly handles tensors on any device.

                else:
                    raw_tensor = torch.stack(raw)
                    id2mean[idx] = raw_tensor.mean()
                    id2std[idx] = raw_tensor.std()

Comment on lines +1260 to +1261
id2mean[idx] = torch.mean(torch.tensor(id2score[idx]))
id2std[idx] = torch.std(torch.tensor([id2score[idx]]))

critical

Similar to the issue in the is_hinted path, torch.tensor([id2score[idx]]) will crash if the tensors in the list are on a GPU. Use torch.stack() for better performance and compatibility with CUDA tensors.

Suggested change:
- id2mean[idx] = torch.mean(torch.tensor(id2score[idx]))
- id2std[idx] = torch.std(torch.tensor([id2score[idx]]))
+ elif len(id2score[idx]) > 1:
+     group_scores = torch.stack(id2score[idx])
+     id2mean[idx] = group_scores.mean()
+     id2std[idx] = group_scores.std()

verifier_stdout: Optional[str],
verifier_error: Optional[str],
tool_error_messages: Optional[List[str]],
model: str = "claude-sonnet-4-20250514",

high

The model name claude-sonnet-4-20250514 appears to be invalid or a future-dated placeholder. Anthropic's current Claude 3.5 Sonnet model identifier is typically claude-3-5-sonnet-20241022 (or claude-3-5-sonnet-latest). Using a non-existent model name will cause the Anthropic API to return a 404 error, breaking the hint synthesis feature.

Suggested change:
- model: str = "claude-sonnet-4-20250514",
+ model: str = "claude-3-5-sonnet-20241022",

Comment on lines +97 to +98
import ast
import re

medium

Standard Python practice (PEP 8) recommends placing all imports at the top of the file. Moving ast and re to the top-level imports improves readability and ensures they are only imported once.

Comment on lines +757 to +758
"""Query Supabase for session verifier scores as fallback.


medium

The _query_supabase_scores method relies on SUPABASE_URL and SUPABASE_KEY environment variables. These should be documented in the module docstring or the script headers (e.g., in scripts/fleet-task-gen-run.sh) to ensure users are aware of the requirements for this fallback mechanism.

Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 9 additional findings.

