
Make VIE faster #16

Open
jishnujayakumar wants to merge 36 commits into main from jishnu/fasten-vie

Conversation

@jishnujayakumar
Collaborator

No description provided.

jishnujayakumar and others added 30 commits May 7, 2026 13:20
…(default off)

The matplotlib overlay path in propagate_masks_and_save was creating a fresh
figure per frame, opening the source image, redrawing every prior centroid
in an O(N^2) loop, and calling savefig — dwarfing the actual mask cost. The
binary mask PNGs (the output downstream BundleSDF actually consumes) are now
the only thing produced by default; the overlay is opt-in via --save_traj_overlay.
When opted-in the figure is reused across frames and the centroid trail is
appended incrementally rather than replotted from scratch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
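
The difference between replotting the whole centroid trail on every frame and appending to it can be illustrated with a toy model. This is a sketch, not the code in propagate_masks_and_save: it only counts marker draws, uses no matplotlib, and the function names are hypothetical.

```python
def redraw_from_scratch(centroids):
    """Old pattern: every frame replots all centroids seen so far."""
    draw_calls = 0
    for frame_idx in range(len(centroids)):
        trail = centroids[: frame_idx + 1]  # rebuilt for each frame
        draw_calls += len(trail)            # one draw per point in the trail
    return draw_calls


def append_incrementally(centroids):
    """New pattern: keep one trail and draw only the newest centroid."""
    trail = []
    draw_calls = 0
    for point in centroids:
        trail.append(point)
        draw_calls += 1
    return draw_calls, trail


centroids = [(float(i), 0.5 * i) for i in range(70)]
print(redraw_from_scratch(centroids))      # 70 * 71 / 2 = 2485 draws
print(append_incrementally(centroids)[0])  # 70 draws
```

For a 70-frame clip the quadratic path issues 2485 draws against 70 for the incremental one, before counting the per-frame figure creation that the commit also removes.
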
Two no-quality-loss changes:

1. The Nelder-Mead translation refinement was running with xatol=1e-8 (very
   tight), no maxiter (unbounded), and disp=True (per-call console I/O). It
   typically converges in well under 50 iterations to within 1e-5 in metric
   units; the tighter tolerance was buying nothing visible in the depth-aligned
   mesh and dominated runtime. Defaults are now xatol=1e-5, maxiter=50,
   disp=False, all overrideable via --opt_xatol / --opt_maxiter / --opt_disp.

2. The per-batch-item regression_img + side_img pyrender passes and the
   {img_fn}_all.jpg overlay write are pure debug visualizations. They are now
   gated behind --save_debug_renders (default off). cam_view itself still
   renders since the depth_pc target_mask is derived from it.

Mesh outputs (model/, 3dhand/, scene/) and the optimized translation are
unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
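
A minimal sketch of how the three flags could feed the optimizer options. The flag names and defaults come from the commit message above; the use of argparse and the helper names are assumptions, since the script's actual flag library is not shown here.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing the three Nelder-Mead knobs."""
    parser = argparse.ArgumentParser(description="refinement knobs (sketch)")
    parser.add_argument("--opt_xatol", type=float, default=1e-5,
                        help="absolute tolerance on the translation")
    parser.add_argument("--opt_maxiter", type=int, default=50,
                        help="iteration cap for the refinement")
    parser.add_argument("--opt_disp", action="store_true",
                        help="print optimizer convergence messages")
    return parser


def nelder_mead_options(args):
    """Collect the flags into the options dict SciPy's Nelder-Mead accepts."""
    return {"xatol": args.opt_xatol, "maxiter": args.opt_maxiter,
            "disp": args.opt_disp}


defaults = nelder_mead_options(build_parser().parse_args([]))
custom = nelder_mead_options(build_parser().parse_args(
    ["--opt_xatol", "1e-8", "--opt_maxiter", "200", "--opt_disp"]))
print(defaults)
print(custom)

# With SciPy installed, the dict would be passed along the lines of:
#   scipy.optimize.minimize(cost, x0, method="Nelder-Mead", options=defaults)
```
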
Picks up Phase 1 grasp-transfer wins:
  - Hoist AdamGraspTransfer out of per-frame loop
  - Skip redundant target_handmodel reload in reset()
  - Expose --max_iter (default 100, was 300)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two Phase 1 wins for the BundleSDF leg:

1. Bumps the BundleSDF submodule to jishnu/fasten-vie@298918c, which adds
   -k (keep) and skip-rebuild handling to docker/start_docker.sh. Repeat
   launches reuse the running container instead of doing down + up --build
   each time.

2. Adds --n_step to run_bundlesdf.py, plumbed through BundleSDFProcessor
   into cfg_nerf['n_step']. Default unchanged (config.yml's value, currently
   10) so this is a no-op at default; lower values trade reconstruction
   quality for NeRF training speed and are intended for Phase 3 tuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tation

The per-bbox loop in run_gdino_samv2 was calling propagate_masks_and_save
once per detected bbox. Each call ran SAM2's init_state(video_path=video_dir),
which scans + caches every video frame — N times the I/O for N bboxes, even
though SAM2 supports tracking multiple objects simultaneously.

New propagate_masks_and_save_multi(video_dir, bboxes, ...) calls init_state
once, registers every bbox as its own object id on frame 0, and runs a single
propagate_in_video loop that yields all object masks per frame.

Filename rule: when len(bboxes) == 1 the saved mask path is unchanged; when
N > 1 each file is prefixed obj{i}_<frame>.png so masks no longer overwrite
each other (the prior code silently dropped all but the last bbox's masks).

A timing log line (init / propagate+save / total) prints at the end so users
can see the speedup directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
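
The filename rule can be written as a small helper. This is a sketch: the function name, the zero-based object index, and the zero-padded frame stem are assumptions for illustration only.

```python
def mask_filename(frame_stem, obj_index, num_objects):
    """Single bbox: keep the legacy name. Several: prefix with the object id."""
    if num_objects == 1:
        return f"{frame_stem}.png"
    return f"obj{obj_index}_{frame_stem}.png"


# One object: same path as before, so downstream consumers are unaffected.
print(mask_filename("000000", 0, 1))

# Three objects on the same frame: three distinct files, nothing overwritten.
names = [mask_filename("000000", i, 3) for i in range(3)]
print(names)
```
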
…efaults

Three combined changes for real-time hand extraction:

1. Warm-start: cache the optimized translation_new per hand side (left=0,
   right=1) on HandInfoExtractor and feed it as x0 to the next frame's
   minimize() instead of mean(depth_pc.points). Hand poses change smoothly
   between frames so this seeds the optimizer near the answer.

2. Aggressive defaults: xatol 1e-5 -> 1e-4 (≈0.1mm), maxiter 50 -> 30. With
   warm-starting these are typically enough for sub-mm convergence; revert
   via --opt_xatol / --opt_maxiter if quality regresses.

3. Per-frame timing summary printed at the end of the run (avg ms/frame, total)
   so the speedup is observable without external profiling.

Disable warm-starting with --no_warm_start to A/B against the cold-start path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
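
A sketch of the warm-start idea under stated assumptions. WarmStartCache and refine are hypothetical names, and refine is a deliberately simple stand-in (it halves the error each step) rather than Nelder-Mead, so only the relative iteration counts are meaningful.

```python
def mean_point(points):
    """Cold-start seed: the centroid of the depth points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


class WarmStartCache:
    """Remembers the last optimised translation per hand side (0=left, 1=right)."""

    def __init__(self, enabled=True):
        self.enabled = enabled  # False mimics --no_warm_start
        self._last = {}

    def initial_guess(self, side, depth_points):
        if self.enabled and side in self._last:
            return self._last[side]
        return mean_point(depth_points)

    def update(self, side, translation):
        self._last[side] = tuple(translation)


def refine(x0, target, tol=1e-4, max_iter=1000):
    """Stand-in optimiser: halve the error each step and count the steps."""
    x = list(x0)
    for iteration in range(max_iter):
        if max(abs(a - b) for a, b in zip(x, target)) < tol:
            return tuple(x), iteration
        x = [a + 0.5 * (b - a) for a, b in zip(x, target)]
    return tuple(x), max_iter


depth_points = [(0.0, 0.0, 0.4), (0.2, 0.1, 0.6)]
targets = [(0.300, 0.200, 0.550), (0.301, 0.201, 0.551)]  # smooth motion

cache = WarmStartCache()
iterations = []
for target in targets:
    x0 = cache.initial_guess(1, depth_points)  # right hand
    solution, used = refine(x0, target)
    cache.update(1, solution)
    iterations.append(used)

cold_iters, warm_iters = iterations
print(cold_iters, warm_iters)
```

The first frame starts from the depth centroid; the second starts from the previous answer and needs fewer steps, which is what makes the tighter iteration cap affordable.
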
…sive defaults)

Picks up jishnu/fasten-vie@bdca654 with:
  - AdamGraspTransfer warm-start from prior frame's q_current
  - num_particles 32 -> 16, max_iter 100 -> 50 defaults
  - Per-frame timing summary

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iming

Phase 2/3 BundleSDF wins outside Docker:

1. Pre-cache: when not using a live segmenter, all masks are loaded into RAM
   in one pass before the frame loop instead of being read from disk inside
   the hot loop. Mask files are ~tens of KB each, so this adds at most a few
   MB of memory for hundreds of frames and removes per-frame disk seeks.

2. Aggressive default: --n_step now defaults to 5 (was None = config.yml's
   10). NeRF training was running 10 iters every keyframe trigger; with
   continual=true this fires repeatedly. 5 is usually enough for tracking-
   accurate poses; raise back to 10 if reconstruction quality regresses.

3. Per-frame timing summary printed at the end of process(), so the speedup
   is visible without external profiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
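
The pre-cache step amounts to one directory pass before the frame loop. A minimal sketch, assuming masks are PNG files keyed by file stem; it stores raw bytes instead of decoded arrays to stay dependency-free, and precache_masks is a hypothetical name.

```python
import tempfile
from pathlib import Path


def precache_masks(mask_dir, pattern="*.png"):
    """Read every mask once so the frame loop never touches the disk."""
    return {path.stem: path.read_bytes()
            for path in sorted(Path(mask_dir).glob(pattern))}


with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        (Path(tmp) / f"{i:06d}.png").write_bytes(b"fake-mask-%d" % i)
    cache = precache_masks(tmp)

# The directory is gone by now, yet every mask is still served from RAM.
for stem in sorted(cache):
    print(stem, len(cache[stem]))
```
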
Runs gdino+samv2, hamer, rfp-grasp-transfer, and bundlesdf on a given task
dir, activating the appropriate conda env per module and logging to a
timestamped file under the task dir. Each module's own timing instrumentation
([samv2] / [hamer] / [grasp-transfer] / [bundlesdf]) lands in the log
together with a wall-clock '[bench] N/4 ... OK in Xs' summary per step.

Skips any module whose inputs are missing (no /rgb, no /depth, no MANO
models, no docker) so partial runs are useful.

Usage:
  ./bench_vie.sh /path/to/task_data_root [text_prompt]
  git checkout main && ./bench_vie.sh ...
  git checkout jishnu/fasten-vie && ./bench_vie.sh ...
  diff /path/to/task/bench_*.log

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Python 3.10 / PyTorch / CUDA / MIT license / IRVL UTD lab badges plus
upstream attribution for GroundingDINO, SAM 2, HaMeR, and BundleSDF at the top
of vie/README.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er module

Adds a per-module benchmark section and TOC entries:

- GDINO+SAMv2: measured 7.86x speedup on robokit/perception.py::propagate_masks_and_save
  (139.70s -> 17.77s, 1995.6 -> 253.8 ms/frame) on task_39_seasoning_on_omlette_v1
  with single bbox on RTX 5070 Laptop. Measured with a SAM2-only mini-bench that
  bypasses GDINO; the win comes from gating per-frame matplotlib overlay
  generation behind --save_traj_overlay (off by default).
- HaMeR / rfp-grasp-transfer / BundleSDF: described with the underlying
  mechanism (warm-starts, hoisting, gated debug renders, persistent docker,
  smaller particle batches), the per-module timing log to look for, and an
  honest note that they were not measured on the dev machine due to MANO
  models, Blackwell-incompatible torch in robokit-py3.10, and missing Docker.

Points users to scripts/bench_vie.sh for end-to-end A/B on a working rig.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…th measured numbers

Replaces the "not measured here" sections with real A/B numbers from isolated
micro-benches that bypass the env walls (no MANO, no Blackwell torch needed):

  HaMeR: 209.57s -> 32.27s (70 frames, 138 calls), 6.49x speedup. Bench drives
  off existing pred_vertices/pred_cam_t in out/hamer/model/*.npz so MANO + the
  HaMeR forward pass aren't required; only the scipy minimize stage differs
  between branches.

  rfp-grasp-transfer: ~1240 ms/frame -> ~870 ms/frame, ~1.5x. Synthetic
  smooth-walking q on CPU (robokit env's torch 2.3.1+cu118 lacks sm_120).
  Smaller than the survey's "expected 4-8x" — discusses why (thermal noise,
  fixed reset() cost on CPU) and notes GPU speedup should be larger.

Also documents the Phase 1 reset() reload-skip that benchmarking caught as a
regression and got reverted in 2071aab — a real "the obvious optimization is
the wrong one" finding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two stacking changes for Phase 4.

1. HaMeR fp16 autocast on the transformer forward pass.
   torch.amp.autocast(dtype=fp16) wrapped around self.model(batch). Active
   on CUDA only (no-op on CPU); enabled by default. Expected ~1.5-2x speedup
   on the model fwd with no observable mesh-quality regression. Disable via
   --no_fp16 to fall back to fp32.

   Note: this change is code-only on the dev rig (no MANO models + the
   robokit-py3.10 conda env's torch lacks Blackwell sm_120 kernels), so the
   speedup is not measured here. It will land on a working rig where
   extract_hand_bboxes_and_meshes.py actually runs.

2. Bump rfp-grasp-transfer submodule to jishnu/fasten-vie@f8badfb (Phase 4
   correspondence cache + jittered particle init).

Investigated and rejected: cKDTree drop-in for sklearn.neighbors.KDTree in
hamer/mesh_to_sdf/rgbd2pc.py. Bench ran 17+ minutes vs sklearn's 4 minutes
before being killed — cKDTree is slower for this query pattern (777 verts
against ~300k depth points, 138 queries per scipy minimize call). Sticking
with sklearn KDTree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ee rejected

Adds Phase 4 sections to the existing HaMeR and rfp-grasp-transfer
benchmark blocks:

- HaMeR: documents fp16 autocast on the transformer fwd (default on,
  --no_fp16 disables) and explicitly notes that cKDTree-as-drop-in for
  sklearn.neighbors.KDTree was investigated and rejected (slower for the
  777 verts × 300k depth points × 138 queries-per-minimize pattern).

- rfp-grasp-transfer: documents the correspondence cache and jittered-
  particle-init wins, plus the CPU re-bench (within Phase 3 noise; gains
  are convergence-quality, not raw wall-time on CPU).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standing this up on a fresh Blackwell laptop revealed five real walls that
the original setup_vie.sh + requirements.txt didn't anticipate. This commit
captures every workaround so a clean re-install just works.

Specifically:

- setup_vie.sh
  - pin transformers==4.47.1 (>=5 dropped BertModel.get_head_mask which the
    pinned old GroundingDINO needs)
  - pin setuptools<70 before installing mmcv (legacy mmcv setup.py imports
    pkg_resources which newer setuptools dropped)
  - install mmcv==1.5.0 explicitly with --no-build-isolation (HaMeR's pinned
    mmcv==1.3.9 fails to build on Python 3.10 toolchains, and mmpose 0.24
    only accepts mmcv in [1.3.8, 1.5.0])
  - install hamer with --no-deps so its strict mmcv pin doesn't undo the above
  - apply an in-place patch to groundingdino's ms_deform_attn.py so it falls
    back to the pure-PyTorch implementation when the _C CUDA extension isn't
    built (which is the common case — the pip wheel ships no _C and source
    builds need a matching CUDA toolchain)
  - re-pin numpy<2 after HaMeR's editable install (HaMeR drags in numpy>=2
    which breaks matplotlib + many c-extensions)
  - print a clear MANO + Blackwell-torch reminder at the end

- requirements.txt
  - pin transformers==4.47.1
  - pin setuptools<70 (build-time)
  - add the deps that hamer needs but its setup.py doesn't list cleanly
    (yacs, smplx, einops, jaxtyping, iopath, fvcore, omegaconf, hydra-core,
    pytorch_lightning, torchmetrics, timm, huggingface_hub, tokenizers,
    safetensors)

- hamer/setup.py
  - relax mmcv==1.3.9 to mmcv>=1.3.8,<=1.5.0 with a comment explaining why

- robokit/perception.py
  - emit a clear actionable warning at import time if groundingdino._C is
    missing AND ms_deform_attn.py hasn't been patched (so users know to
    re-run setup_vie.sh)

- README.md
  - new "Install Gotchas" section documenting all five workarounds so users
    debugging a fresh install can map a symptom to a fix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chumpy 0.70 (used by smplx to unpickle MANO .pkl files) does
  from numpy import bool, int, float, complex, object, unicode, str, nan, inf
which fails on numpy 1.24+, which removed the builtin-type aliases that had
been deprecated since numpy 1.20. Patch chumpy's __init__.py in-place to set
the aliases on numpy before the legacy import line, so MANO loading succeeds
without needing a stale numpy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
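
The alias restoration can be sketched as a helper that fills in only the names that are missing. This is an illustration of the idea, not the in-place patch itself: restore_legacy_aliases is a hypothetical name, and a stand-in module replaces numpy so the snippet runs without it.

```python
import types

# The builtin-type names that the legacy import line expects to find.
LEGACY_ALIASES = {
    "bool": bool, "int": int, "float": float, "complex": complex,
    "object": object, "unicode": str, "str": str,
}


def restore_legacy_aliases(module):
    """Add any missing builtin-type alias to `module`; return the names added."""
    added = []
    for name, builtin in LEGACY_ALIASES.items():
        # vars() inspects the namespace directly, so a module-level
        # __getattr__ hook is not triggered by the check.
        if name not in vars(module):
            setattr(module, name, builtin)
            added.append(name)
    return added


fake_numpy = types.ModuleType("fake_numpy")  # stand-in for the real package
fake_numpy.float = float                     # pretend one alias still exists
added = restore_legacy_aliases(fake_numpy)
print(added)
```

Against the real package the equivalent call would have to run before chumpy is imported; existing names are left alone, so newer numpy releases that define some of them again are not overwritten.
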
…ight GPUs

The full HaMeR pipeline loads ViTDet-Huge (~2.5GB) + ViTPose (~1.2GB) + the
HaMeR transformer (~4GB) + BERT simultaneously, which OOMs on 8GB cards
(e.g. RTX 5070 Laptop). The detector module already supports a 'regnety'
alternative that's ~10x smaller; this change wires it through to the CLI as
--body_detector. Default stays 'vitdet' (no behavior change for users with
plenty of VRAM).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…amGraspTransfer)

Adds two new sub-sections under the rfp-grasp-transfer benchmark block:

Phase 5: deepcopy snapshot in reset()
  Profiling found URDF reload in reset() was ~98% of remaining per-frame
  cost with high variance (150-500 ms). Replaced with copy.deepcopy of a
  snapshot taken at __init__. After: 267-277 ms rock-solid, 1.67 it/s.

Phase 6: BatchedAdamGraspTransfer (frame batching)
  Process N frames in a single Adam call. Measured on RTX 5070 Laptop:
    F=1   1486 ms/frame  119s wall  1.00x
    F=4    301 ms/frame   43s wall  2.78x  <- sweet spot
    F=8    425 ms/frame   47s wall  2.51x
    F=16   331 ms/frame   47s wall  2.52x
  Verified: both paths produced PLYs for all 70 frames (138 PLYs rather than
  140, because two frames have only a left hand in the original HaMeR output).

Combined Phases 1-6 = ~5x wall-clock vs main on this rig.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
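
The Phase 5 change follows a common pattern: pay for the expensive load once, then restore from a pristine in-memory copy. A minimal sketch with hypothetical class names; the counter stands in for parsing the URDF from disk.

```python
import copy


class HandModelStub:
    """Stand-in for a model whose construction is expensive."""
    load_count = 0

    def __init__(self):
        HandModelStub.load_count += 1  # pretend this parses a URDF
        self.joint_angles = [0.0] * 4


class GraspTransferSketch:
    def __init__(self):
        self._pristine = HandModelStub()             # the only real load
        self.model = copy.deepcopy(self._pristine)   # working copy

    def reset(self):
        # deepcopy rebuilds the object without calling __init__,
        # so no reload happens here.
        self.model = copy.deepcopy(self._pristine)


transfer = GraspTransferSketch()
for _ in range(5):
    transfer.model.joint_angles[0] = 1.23  # the optimiser mutates state
    transfer.reset()

print(HandModelStub.load_count, transfer.model.joint_angles)
```

The snapshot is never handed out directly, so mutations made during optimisation cannot leak into later frames.
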
Profiling showed 96% of per-frame time in the GSAM pipeline is the SAM2
forward pass itself; the only meaningful lever left is the model size.
Adding --sam2_size {large|base_plus|small|tiny} as a CLI flag (default
'large' = no behavior change for existing users) with auto-download of
the checkpoint on first use.

Measured on RTX 5070 Laptop, task_39 (70 frames):
  large      propagate+save 13.69s  total 26.34s  wall 85.90s  baseline
  base_plus  propagate+save  7.17s  total 17.33s  wall 67.13s  1.91x propagate

Per-frame steady-state: 196ms (large) -> 102ms (base_plus). Quality drop on
clean foreground objects has been minimal in our spot checks; small/textured
objects may need 'large'. Smaller variants ('small', 'tiny') wired but
unbenched here.

The init_state cost is roughly model-independent (~10s, dominated by JPEG
decode of all frames) so wall-clock ratio is smaller than propagate ratio.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SAM2's load_video_frames_from_jpg_images does sequential PIL JPEG decode +
resize per frame. On a 70-frame clip this was ~10s of upfront cost, dominated
by single-thread JPEG decode. NVIDIA DALI with nvJPEG GPU decoding processes
all frames through a single pipeline run, returning the same (N, 3, H, H)
ImageNet-normalized tensor SAM2 expects.

Implementation: at import time, perception.py tries to install DALI as a
monkey-patch for sam2.utils.misc.load_video_frames_from_jpg_images. If DALI
isn't installed the original PIL loader stays in place (zero behavior change).

Measured on RTX 5070 Laptop, task_39 (70 frames, --sam2_size base_plus):
  no DALI:   init=10.15s  propagate=7.17s   total=17.33s  wall=67.13s
  with DALI: init=0.93s   propagate=6.58s   total=7.51s   wall=17.55s

init_state: 10.9x faster.
Full wall-clock: 3.8x over Phase 7 alone (4.9x over Phase 1 baseline).

Caveats:
- async_loading_frames=True still uses the original loader (DALI's eager
  pipeline doesn't fit the lazy-frame use case).
- batch_size = num_frames; very long videos (1000+ frames) may need a
  chunked DALI pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
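
The "patch only if the accelerator is importable" behaviour can be sketched generically. Everything here is hypothetical: the helper name, the stub module, and the string-returning loaders stand in for the real frame loaders, and "json" is used only as a dependency that is certain to import.

```python
import importlib
import types


def install_if_available(target_module, attr_name, dependency, make_replacement):
    """Swap target_module.attr_name only when `dependency` can be imported."""
    try:
        importlib.import_module(dependency)
    except ImportError:
        return False  # keep the original loader untouched
    original = getattr(target_module, attr_name)
    setattr(target_module, attr_name, make_replacement(original))
    return True


# Hypothetical stand-in for the module that owns the frame loader.
loader_module = types.ModuleType("frame_loader_stub")
loader_module.load_frames = lambda paths: [f"pil:{p}" for p in paths]


def make_fast_loader(original):
    """Build the replacement; `original` is available as a fallback."""
    return lambda paths: [f"fast:{p}" for p in paths]


missing = install_if_available(
    loader_module, "load_frames", "no_such_accelerator_pkg", make_fast_loader)
after_missing = loader_module.load_frames(["0.jpg"])

present = install_if_available(
    loader_module, "load_frames", "json", make_fast_loader)
after_present = loader_module.load_frames(["0.jpg"])

print(missing, after_missing)
print(present, after_present)
```
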
…noise

Replaces the pile of ad-hoc logging.info / print calls + the ~12 lines of
upstream library deprecation/registry warnings that fired on every run with:

- A new robokit/log.py module: rich-based Console, RichHandler, plus helper
  functions section() / step() / note() / warn() / error() / success() /
  progress() / summary() / fmt_duration() / fmt_rate(). Reusable across all
  vie entry points.

- run_gdino_samv2.py:
  - top-of-file: warnings.filterwarnings("ignore"), TRANSFORMERS_VERBOSITY
    and PYTHONWARNINGS env vars, absl.set_verbosity(WARNING), and root-logger
    level WARNING. Kills upstream chatter.
  - main() restructured into Configuration / Loading / Detection / Tracking
    sections with timed step lines and a final colored summary panel.
  - Wraps GDINO + SAM2 init in a stdout-redirect context manager to swallow
    BERT's `final text_encoder_type` print and similar one-shot stdouts.

- perception.py:
  - Demoted noisy print() calls in load_model_hf + _load_predictor to
    logger.debug.
  - Removed the redundant defensive _C warning (was firing even when the
    GDINO patch was already applied due to a logic bug; the patch's own
    one-line warning suffices).
  - Demoted the [samv2] timing log line to debug since callers now render
    their own rich summary with these numbers.

Visual outcome:
  • Cyan section rules, green ✓ for steps, dim grey notes for config echoes.
  • Bordered cyan summary panel at end with objects/frames/detection/
    propagation/total wall/fps/output path.
  • Bold green "✓ Done." banner on success.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… waits

Previously each long-running stage (GroundingDINO load, SAM2 load + ckpt
download, GDINO inference, propagate init_state) ran silently and only
emitted a "✓ X done in Yms" line *after* completing. On a fresh run that's
~15s of staring at nothing.

Adds vlog.working(msg) — a context manager that:
  - shows a live spinner with the message during the op
  - replaces the spinner line in-place with "✓ msg (Xs)" on success
  - replaces with "✗ msg (failed after Xs)" + re-raises on exception

Wires it into run_gdino_samv2.py for each long stage so users see exactly
what's happening at all times. The SAM2 propagate stage already has its own
tqdm bar, so we just print a "propagating ..." note before it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
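
A stripped-down sketch of the success/failure contract described for vlog.working. It omits the live spinner and the in-place line replacement, writes to a caller-supplied stream, and is an assumption about the shape of the helper, not the code in robokit/log.py.

```python
import contextlib
import io
import sys
import time


@contextlib.contextmanager
def working(msg, stream=None):
    """Report a tick with the elapsed time on success, a cross on failure."""
    out = stream if stream is not None else sys.stdout
    start = time.perf_counter()
    try:
        yield
    except Exception:
        out.write(f"✗ {msg} (failed after {time.perf_counter() - start:.1f}s)\n")
        raise  # the caller still sees the original exception
    else:
        out.write(f"✓ {msg} ({time.perf_counter() - start:.1f}s)\n")


log = io.StringIO()

with working("load model", stream=log):
    time.sleep(0.01)

reraised = False
try:
    with working("download checkpoint", stream=log):
        raise RuntimeError("network unreachable")
except RuntimeError:
    reraised = True

lines = log.getvalue().splitlines()  # one status line per stage
```
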
The heavy ML imports (torch / GroundingDINO / SAM2 / robokit.perception) take
~5-8s on a fresh process and happen *before* main() runs, so previously the
user saw a blank terminal during that whole window.

Restructured the script: only the lightweight imports (os, sys, time, vlog)
happen at the very top. Once vlog is available we immediately print the
section header + start a spinner labeled "Importing ML stack ...", then do
the heavy imports inside that spinner's context. Spinner replaces in-place
with the usual ✓ confirmation when imports finish.

Also moved the post-import flag-definition + logger-quieting steps inside
the spinner block (they're trivial after imports anyway), and renamed the
in-main "GDINO + SAMv2" header to "Configuration" since the top-level banner
already announces what script we're in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Profiled `python -X importtime` and found four heavy modules being imported
at top of perception.py that the GDINO+SAM2-video hot path doesn't touch:

  pyrender   -> drags in pyglet (with GL context init), tkinter, freetype,
                imageio plugin registry. ~5s. Used only by DepthPC
                visualization helpers (vis=True paths).
  mobile_sam -> pulls in MobileSAM encoders + SamPredictor. ~2s. Used only
                by SegmentAnythingPredictor (mobile-Sam, not SAM2-video).
  matplotlib + cm  -> ~1s. Used only by the opt-in trajectory-overlay path.
  sklearn.neighbors.KDTree -> ~1s. Used only inside DepthPC.

Moved each import to its actual use site (inside class __init__ / method /
gated branch). The imports still happen lazily on first use — no behavior
change for existing callers — but the GDINO+SAM2-video script no longer
pays the cost.

Measured on RTX 5070 Laptop, --sam2_size base_plus:
  before:  Import ML stack 13.00s | GDINO load 12.37s | SAM2 load 13.23s | total 43s
  after:   Import ML stack  2.63s | GDINO load  4.33s | SAM2 load  0.72s | total 8s
  ~5x less wall-clock wait before propagation actually starts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
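
Moving an import to its use site looks like the sketch below. The class is hypothetical and the standard-library statistics module stands in for a heavy dependency such as pyrender or sklearn; only callers of the opt-in method pay for the import.

```python
class DepthCloudSketch:
    """Hypothetical class with one cheap path and one import-heavy path."""

    def __init__(self, depths):
        self.depths = list(depths)

    def mean_depth(self):
        # Hot path: needs nothing beyond builtins.
        return sum(self.depths) / len(self.depths)

    def median_depth(self):
        # Opt-in path: the import runs on first call, then hits the
        # sys.modules cache on later calls.
        import statistics
        return statistics.median(self.depths)


cloud = DepthCloudSketch([0.4, 0.5, 0.9])
print(cloud.mean_depth())
print(cloud.median_depth())
```
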
jishnujayakumar and others added 6 commits May 8, 2026 11:39
…ransfer

Adds a single CLI flag name (--save_viz) to all three entry-point scripts
that switches on the per-script viz/debug output. Existing per-script flags
(--save_traj_overlay, --save_debug_renders, --debug_plots) keep working as
before for fine-grained control; --save_viz is the convenience alias users
asked for so they don't have to remember three different names.

  run_gdino_samv2.py     : --save_viz  -> save_traj_overlay
  hamer/extract_*.py     : --save_viz  -> save_debug_renders
  rfp/transfer_from_*.py : --save_viz  -> debug_plots

Bumps the rfp-grasp-transfer submodule pointer to pick up the same flag +
the batched-path Plotly HTML output that pairs with it.

Also fixes a latent bug exposed when actually exercising the overlay path
after the lazy-import refactor: SAM2VideoPredictor.show_mask referenced a
module-level `plt` that no longer exists. Now imports plt locally.

Verified end-to-end:
  GSAM   --save_viz: 70 trajectory-overlay PNGs written.
  rfp    --save_viz (per-frame and batched): 138 Plotly HTMLs written.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
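
The alias behaviour for one script can be sketched as an OR of the two flags. The flag names come from the commit message above; argparse and the function name are assumptions.

```python
import argparse


def parse_viz_flags(argv):
    """--save_viz switches the overlay on; the specific flag still works alone."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--save_viz", action="store_true")
    parser.add_argument("--save_traj_overlay", action="store_true")
    args = parser.parse_args(argv)
    args.save_traj_overlay = args.save_traj_overlay or args.save_viz
    return args


print(parse_viz_flags([]).save_traj_overlay)
print(parse_viz_flags(["--save_viz"]).save_traj_overlay)
print(parse_viz_flags(["--save_traj_overlay"]).save_traj_overlay)
```
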
…ch UX

Adds a new "✨ What's new on jishnu/fasten-vie" section near the top
summarizing the speed gains, the speed-vs-quality flag table, and the
unified --save_viz flag that's now consistent across run_gdino_samv2,
extract_hand_bboxes_and_meshes, and transfer_from_hamer.

Updates the per-step examples (Steps 3, 4, 5) to:
  - drop --debug_plots from the rfp example (no longer required for normal
    operation; only when you want plotly viz)
  - mention --save_viz, --sam2_size, --frame_batch_size, --no_warm_start,
    --body_detector inline as opt-in knobs
  - clarify that the extra_plots / transfer_extra_plots dirs are populated
    only when --save_viz is on; downstream pipeline data (binary masks,
    MANO npzs, scene PLYs, gripper PLYs, BundleSDF poses) is always written

Also notes the new colored-banner / spinner / summary-panel UX backed by
vie/robokit/log.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Renames every "jishnu/fasten-vie" reference to just "fasten-vie", and the
"What's new on jishnu/fasten-vie" section header to "Performance & UX
improvements". The branch is local convention; the README shouldn't read
like one person's effort.

The two remaining `jishnujayakumar/{robokit,BundleSDF}` URLs in the
Acknowledgments are upstream-fork repo links that the project actually
depends on at the package level, not author attribution — left intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three top-level imports in the HaMeR module that the hot path doesn't
exercise on every frame:
  open3d  -> only used in save_point_cloud + save_point_cloud_as_ply
  pyrender -> only used in compute_sdf_cost(vis=True), opt-in viewer
  sklearn KMeans -> only used in RGBD2PC.__init__(use_kmeans=True), opt-in

Moved each to its use site. Behavior unchanged; startup is faster on the
extract script especially when --help / sanity-check paths run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ache

extract_hand_bboxes_and_meshes.py end-to-end optimization on task_1_21s:
4 490 → 412 ms/frame (12m 25s → 1m 8s wall), no quality regression vs
the original.

Pipeline changes:
- torch Adam minimize on GPU replaces scipy Nelder-Mead (batched across both
  hands within a frame, patience early-stop, optimizer state pre-allocated)
- cv2 convex-hull hand mask (~5 ms) replaces pyrender (~370 ms); pyrender
  Renderer is lazy-loaded, only constructed under --save_viz
- Processing/viz mode split: scene + 3dhand PLYs gated viz-only via
  --save_scene_pcd / --save_3dhand_pcd / --save_viz; rfp-ready default
  writes only model/*.npz (everything rfp consumes)
- --detector_stride N caches ViTDet across N frames with auto-redetect on
  hand-keypoint loss
- Binary PLY writes everywhere (transparent to trimesh.load_mesh)
- cam_K.txt cached on the extractor (was np.loadtxt'd every frame)
- Async NPZ writes via ThreadPool
- --frame_batch_size N: cross-frame batched torch Adam over all hands x K
  frames (K=2: +24% speed, median 11 mm drift; K>2 not recommended)
- Lazy ML-stack import via _ensure_heavy_imports - module load 19s -> 1.5s
- ViTDet + ViTPose pickled to ~/.cache/hamer/ on first run; cache hit saves
  ~18s on subsequent runs (HaMeR LightningModule has a ctypes pointer in
  smplx's MANO wrapper so it stays on the fresh-load path)
- Per-step rich UI: 4 import spinners (HaMeR, detectron2, pyrender, mmpose)
  + 3 model load spinners; vlog.working survives stdout redirects
- Latent _save_meshes return-arity bug fixed - was returning 4 values, caller
  unpacked 3; every frame ValueError'd silently while writes still completed

mesh_to_sdf/rgbd2pc.py:
- Otsu-on-z 1D threshold replaces sklearn KMeans for depth-pc cluster filter
  (~124 -> 5 ms/frame; partitions ~98.5% identical on tabletop scenes where
  the foreground/background split is along the depth axis)
- Vectorized RGB-to-point projection (was a Python loop over ~300k points)
- Binary PLY writes

robokit/log.py:
- Snapshot sys.stdout at Console construction so vlog.working / vlog.progress
  spinners survive contextlib.redirect_stdout used for third-party noise
  suppression (otherwise "Loading models" went blank for ~45s while the
  spinner output was being captured into the silenced buffer)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
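
Otsu's method on a single axis picks the depth cut that maximises the between-class variance of a histogram. The pure-Python sketch below illustrates that idea on synthetic depths; the bin count, the function name, and treating the nearer side as foreground are assumptions, and the rgbd2pc.py code may differ.

```python
def otsu_threshold(values, bins=64):
    """Return the cut that maximises between-class variance of a 1D histogram."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    total = len(values)
    total_sum = sum(c * h for c, h in zip(centers, hist))

    best_score, best_cut = -1.0, lo
    w0, sum0 = 0, 0.0
    for i in range(bins - 1):
        w0 += hist[i]                 # points at or below bin i
        sum0 += centers[i] * hist[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (total_sum - sum0) / w1
        score = w0 * w1 * (mu0 - mu1) ** 2  # proportional to between-class variance
        if score > best_score:
            best_score, best_cut = score, lo + (i + 1) * width
    return best_cut


# Synthetic tabletop scene: object near 0.5 m, background near 1.2 m.
foreground = [0.45 + 0.002 * i for i in range(50)]
background = [1.15 + 0.002 * i for i in range(50)]
cut = otsu_threshold(foreground + background)
kept = [z for z in foreground + background if z < cut]
print(len(kept))
```

A single pass over a histogram is what makes this cheaper than iterative clustering, at the cost of assuming that the split runs along the depth axis.
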
…ng/viz split

Brings the README in sync with d39693b. Adds:
- Default (processing-mode) + viz-mode run commands for HaMeR
- New flags: --frame_batch_size, --detector_stride, --torch_steps/min_steps/tol/lr,
  --mask_backend, --minimize_backend, --save_scene_pcd, --save_3dhand_pcd,
  --parallel_load, --no_model_cache
- Output gating: model/ always written; 3dhand/, scene/, extra_plots/ are
  opt-in via --save_viz or the per-stage flags
- Model cache section (~/.cache/hamer/, ~5 GB, ~18 s saved on warm runs)
- Phase 5+ benchmark table on RTX A5000 (4 490 -> 412 ms/frame on task_1_21s)
- Updated top-of-README speed-gains summary to reflect end-to-end 10.9x
- Updated performance-flags table to point at the new defaults + correct
  fallbacks (--minimize_backend scipy, --mask_backend pyrender)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 participant