WIP: GPU utilization fixes #122 (Draft)

carlosgjs wants to merge 7 commits into RolnickLab:main from carlosgjs:carlos/gpuperf3

Conversation

carlosgjs (Collaborator) commented Mar 4, 2026

Fixes

  • Keeping the data on the GPU between detection and classification. A step that converted back to a PIL image was inducing a GPU-CPU-GPU round trip.
  • Using pin_memory=True in the data loader (enables DMA transfers).
  • Doing batch collation in the workers rather than in the main process, since image stacking can be slow for large images.
  • Using CUDA streams to pre-fetch batch N+1 onto the GPU while batch N is processing (see the sketch after this list).

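For reference, here is a minimal sketch of that stream-based prefetch pattern. The class name CUDAPrefetcher and the batch dict's "images" key follow this PR, but the body below is an illustrative reconstruction, not the actual diff:

import torch

class CUDAPrefetcher:
    """Copy batch N+1 to the GPU on a side stream while batch N runs inference.

    Call preload() once before the first next(), as in the worker loop below.
    """

    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()  # side stream for host-to-device copies
        self.next_batch = None

    def __iter__(self):
        return self

    def preload(self):
        try:
            batch = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        with torch.cuda.stream(self.stream):
            # non_blocking=True only overlaps with compute when the source
            # tensor is in pinned memory, hence pin_memory=True in the loader
            batch["images"] = batch["images"].cuda(non_blocking=True)
        self.next_batch = batch

    def __next__(self):
        if self.next_batch is None:
            raise StopIteration
        # Make the compute stream wait for the async copy to finish
        torch.cuda.current_stream().wait_stream(self.stream)
        batch = self.next_batch
        # Keep the copy's memory alive until the compute stream consumes it
        batch["images"].record_stream(torch.cuda.current_stream())
        self.preload()  # immediately start staging the following batch
        return batch

Pinned host memory is what makes the non_blocking copy genuinely asynchronous; from pageable memory the transfer is staged through an intermediate buffer and effectively serializes.
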
Test logs:

2026-03-03 17:20:17 [info     ] Total: 0.07s/image, Classification time: 0.51s, Detection time: 1.01s, Load time: 0.04s, to GPU time: 0.00s, 
2026-03-03 17:20:22 [info     ] Total: 0.22s/image, Classification time: 4.26s, Detection time: 0.59s, Load time: 0.43s, to GPU time: 0.00s, 
2026-03-03 17:20:25 [info     ] Total: 0.12s/image, Classification time: 1.68s, Detection time: 1.00s, Load time: 0.10s, to GPU time: 0.00s, 

GPU utilization was monitored with (sampled every 500 ms):

nvidia-smi --query-gpu=timestamp,name,utilization.gpu --format=csv -lms 500

See TODO comments in the PR


coderabbitai bot commented Mar 4, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


result_poster: ResultPoster | None = None
prefetcher = CUDAPrefetcher(loader)  # if torch.cuda.is_available() else None
try:
    for i, batch in enumerate(loader):
carlosgjs (Collaborator, Author) commented:

TODO: The pre-fetcher currently only works with a GPU. Something like this is needed:

batch_source = loader
if torch.cuda.is_available():
    prefetcher = CUDAPrefetcher(loader) 
    prefetcher.preload()
    batch_source = prefetcher

And replace both next(prefetcher) calls with next(batch_source)
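
One wrinkle worth noting for that fallback: next() on a raw DataLoader raises TypeError, because a DataLoader is iterable but not itself an iterator. A hedged variant of the TODO that also covers the CPU-only path (names taken from the snippet above):

if torch.cuda.is_available():
    batch_source = CUDAPrefetcher(loader)
    batch_source.preload()
else:
    batch_source = iter(loader)  # a DataLoader itself does not support next()

# ...both next(prefetcher) calls then become next(batch_source)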

[
    torchvision.transforms.Resize((self.input_size, self.input_size)),
    torchvision.transforms.ToTensor(),
    # torchvision.transforms.ToTensor(),
carlosgjs (Collaborator, Author) commented:

This needs to be put back for the AMI API use case. But I think a wrapper that conditionally converts could be used, e.g.

def maybe_totensor(x):
    if isinstance(x, torch.Tensor):
        return x
    return torchvision.transforms.ToTensor()(x)

antenna_api_auth_token: str = ""
antenna_service_name: str = "AMI Data Companion"
antenna_api_batch_size: int = 16
antenna_api_batch_size: int = 24
carlosgjs (Collaborator, Author) commented:

The batching/collation now happens in the RESTDataset, so effectively the Antenna API batch size is used as the localization batch size. One of the two parameters can be removed.

Copilot AI (Contributor) left a comment

Pull request overview

This PR aims to improve GPU utilization in the Antenna worker pipeline by reducing CPU↔GPU transfers and overlapping input transfer with inference.

Changes:

  • Adjusts REST data loading to collate batches in DataLoader workers, enable pinned memory, and introduce a CUDA prefetcher.
  • Modifies worker inference to keep tensors in GPU-friendly form (avoiding PIL conversions) and adds timing metrics.
  • Tunes default batch sizes and adds a benchmark option to skip sending acknowledgments.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

Summary per file:

  • trapdata/settings.py: Updates default batch sizes for localization and Antenna API task fetching.
  • trapdata/ml/models/classification.py: Alters classification transforms by removing ToTensor() steps to support tensor-based inputs.
  • trapdata/antenna/worker.py: Switches the worker loop to use CUDA prefetching, avoids PIL conversions, and changes logging/timing.
  • trapdata/antenna/datasets.py: Changes REST dataset iteration/collation behavior, enables pinned-memory DataLoader settings, and adds CUDAPrefetcher.
  • trapdata/antenna/benchmark.py: Adds a CLI flag to skip sending acknowledgments during benchmarking.
Comments suppressed due to low confidence (1)

trapdata/antenna/datasets.py:371

  • rest_collate_fn now uses torch.stack(...), which will throw if images have different spatial sizes (common for real-world inputs). The detector stack already supports receiving a list[Tensor] for variable-size images, so stacking here can introduce hard failures. Consider keeping images as a list (as before) or explicitly resizing/padding to a common size before stacking.
    # Collate successful items
    if successful:
        result = {
            "images": torch.stack([item["image"] for item in successful]),
            "reply_subjects": [item["reply_subject"] for item in successful],
            "image_ids": [item["image_id"] for item in successful],
            "image_urls": [item.get("image_url") for item in successful],
        }
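
For illustration, a minimal list-based variant of that collation, as the comment suggests. The dict keys mirror the snippet above; collate_as_list is a hypothetical name and this is a sketch, not the PR's code:

def collate_as_list(successful):
    # Keep variable-size images as a list[Tensor] rather than torch.stack,
    # since the detector stack accepts differently sized images this way.
    return {
        "images": [item["image"] for item in successful],
        "reply_subjects": [item["reply_subject"] for item in successful],
        "image_ids": [item["image_id"] for item in successful],
        "image_urls": [item.get("image_url") for item in successful],
    }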


    )
    return did_work
except StopIteration:
    pass
Copilot AI commented Mar 6, 2026:

The current loop relies on next(prefetcher) inside the while body; when the prefetcher is exhausted it raises StopIteration, which is caught by the outer except StopIteration: pass. That path skips the return did_work at the end of the try, so _process_job() will return None instead of bool in the normal end-of-iteration case. Restructure iteration to break cleanly on StopIteration and still hit the final return did_work (or return did_work from the except).

Suggested change:

-    pass
+    # Iterator exhausted: return whether any work was done
+    return did_work

