WIP: GPU utilization fixes #122 (Draft)

carlosgjs wants to merge 7 commits into RolnickLab:main from carlosgjs:carlos/gpuperf3

Conversation

carlosgjs (Collaborator) commented Mar 4, 2026

Fixes

  • Keeping the data on the GPU between detection and classification. A step that converted back to a PIL image was inducing a GPU-CPU-GPU round trip.
  • Using pin_memory=True in the data loader (enables DMA transfers).
  • Doing batch collation in the workers rather than in the main process, since image stacking can be slow for large images.
  • Using CUDA streams to pre-fetch batch N+1 onto the GPU while batch N is processing (see the sketch after this list).

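For reference, here is a minimal sketch of that stream-based prefetch pattern. The class name CUDAPrefetcher and the batch dict's "images" key follow this PR, but the body below is an illustrative reconstruction, not the actual diff:

import torch

class CUDAPrefetcher:
    """Copy batch N+1 to the GPU on a side stream while batch N runs inference.

    Call preload() once before the first next(), as in the worker loop below.
    """

    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()  # side stream for host-to-device copies
        self.next_batch = None

    def __iter__(self):
        return self

    def preload(self):
        try:
            batch = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        with torch.cuda.stream(self.stream):
            # non_blocking=True only overlaps with compute when the source
            # tensor is in pinned memory, hence pin_memory=True in the loader
            batch["images"] = batch["images"].cuda(non_blocking=True)
        self.next_batch = batch

    def __next__(self):
        if self.next_batch is None:
            raise StopIteration
        # Make the compute stream wait for the async copy to finish
        torch.cuda.current_stream().wait_stream(self.stream)
        batch = self.next_batch
        # Keep the copy's memory alive until the compute stream consumes it
        batch["images"].record_stream(torch.cuda.current_stream())
        self.preload()  # immediately start staging the following batch
        return batch

Pinned host memory is what makes the non_blocking copy genuinely asynchronous; from pageable memory the transfer is staged through an intermediate buffer and effectively serializes.
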
Test logs:

2026-03-03 17:20:17 [info     ] Total: 0.07s/image, Classification time: 0.51s, Detection time: 1.01s, Load time: 0.04s, to GPU time: 0.00s, 
2026-03-03 17:20:22 [info     ] Total: 0.22s/image, Classification time: 4.26s, Detection time: 0.59s, Load time: 0.43s, to GPU time: 0.00s, 
2026-03-03 17:20:25 [info     ] Total: 0.12s/image, Classification time: 1.68s, Detection time: 1.00s, Load time: 0.10s, to GPU time: 0.00s, 

GPU utilization was monitored with (sampled every 500 ms):

nvidia-smi --query-gpu=timestamp,name,utilization.gpu --format=csv -lms 500

See TODO comments in the PR


coderabbitai bot commented Mar 4, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


result_poster: ResultPoster | None = None
prefetcher = CUDAPrefetcher(loader)  # if torch.cuda.is_available() else None
try:
    for i, batch in enumerate(loader):
carlosgjs (Collaborator, Author) commented:

TODO: The pre-fetcher currently only works with a GPU. Something like this is needed:

batch_source = loader
if torch.cuda.is_available():
    prefetcher = CUDAPrefetcher(loader) 
    prefetcher.preload()
    batch_source = prefetcher

And replace both next(prefetcher) calls with next(batch_source)
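
One wrinkle worth noting for that fallback: next() on a raw DataLoader raises TypeError, because a DataLoader is iterable but not itself an iterator. A hedged variant of the TODO that also covers the CPU-only path (names taken from the snippet above):

if torch.cuda.is_available():
    batch_source = CUDAPrefetcher(loader)
    batch_source.preload()
else:
    batch_source = iter(loader)  # a DataLoader itself does not support next()

# ...both next(prefetcher) calls then become next(batch_source)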

[
    torchvision.transforms.Resize((self.input_size, self.input_size)),
    torchvision.transforms.ToTensor(),
    # torchvision.transforms.ToTensor(),
carlosgjs (Collaborator, Author) commented:

This needs to be put back for the AMI API use case. But I think a wrapper that conditionally converts could be used, e.g.

def maybe_totensor(x):
    if isinstance(x, torch.Tensor):
        return x
    return torchvision.transforms.ToTensor()(x)

antenna_api_auth_token: str = ""
antenna_service_name: str = "AMI Data Companion"
antenna_api_batch_size: int = 16
antenna_api_batch_size: int = 24
carlosgjs (Collaborator, Author) commented:

The batching/collation now happens in the RESTDataset, so effectively the Antenna API batch size is used as the localization batch size. One of the two parameters can be removed.

Copilot AI (Contributor) left a comment

Pull request overview

This PR aims to improve GPU utilization in the Antenna worker pipeline by reducing CPU↔GPU transfers and overlapping input transfer with inference.

Changes:

  • Adjusts REST data loading to collate batches in DataLoader workers, enable pinned memory, and introduce a CUDA prefetcher.
  • Modifies worker inference to keep tensors in GPU-friendly form (avoiding PIL conversions) and adds timing metrics.
  • Tunes default batch sizes and adds a benchmark option to skip sending acknowledgments.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

Summary per file:

  • trapdata/settings.py: Updates default batch sizes for localization and Antenna API task fetching.
  • trapdata/ml/models/classification.py: Alters classification transforms by removing ToTensor() steps to support tensor-based inputs.
  • trapdata/antenna/worker.py: Switches the worker loop to use CUDA prefetching, avoids PIL conversions, and changes logging/timing.
  • trapdata/antenna/datasets.py: Changes REST dataset iteration/collation behavior, enables pinned-memory DataLoader settings, and adds CUDAPrefetcher.
  • trapdata/antenna/benchmark.py: Adds a CLI flag to skip sending acknowledgments during benchmarking.
Comments suppressed due to low confidence (1)

trapdata/antenna/datasets.py:371

  • rest_collate_fn now uses torch.stack(...), which will throw if images have different spatial sizes (common for real-world inputs). The detector stack already supports receiving a list[Tensor] for variable-size images, so stacking here can introduce hard failures. Consider keeping images as a list (as before) or explicitly resizing/padding to a common size before stacking.
    # Collate successful items
    if successful:
        result = {
            "images": torch.stack([item["image"] for item in successful]),
            "reply_subjects": [item["reply_subject"] for item in successful],
            "image_ids": [item["image_id"] for item in successful],
            "image_urls": [item.get("image_url") for item in successful],
        }
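
For illustration, a minimal list-based variant of that collation, as the comment suggests. The dict keys mirror the snippet above; collate_as_list is a hypothetical name and this is a sketch, not the PR's code:

def collate_as_list(successful):
    # Keep variable-size images as a list[Tensor] rather than torch.stack,
    # since the detector stack accepts differently sized images this way.
    return {
        "images": [item["image"] for item in successful],
        "reply_subjects": [item["reply_subject"] for item in successful],
        "image_ids": [item["image_id"] for item in successful],
        "image_urls": [item.get("image_url") for item in successful],
    }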


    )
    return did_work
except StopIteration:
    pass
Copilot AI commented Mar 6, 2026:

The current loop relies on next(prefetcher) inside the while body; when the prefetcher is exhausted it raises StopIteration, which is caught by the outer except StopIteration: pass. That path skips the return did_work at the end of the try, so _process_job() will return None instead of bool in the normal end-of-iteration case. Restructure iteration to break cleanly on StopIteration and still hit the final return did_work (or return did_work from the except).

Suggested change:

-    pass
+    # Iterator exhausted: return whether any work was done
+    return did_work

