CUDA kernel for loading compressed .h5 from disk to VRAM #135
bobleesj wants to merge 4 commits into electronmicroscopy:dev from

Conversation
.h5 load/bin (~10x faster)
.h5 from disk to VRAM
no __init__ needed for test file.
This looks great! Adding this comment to more widely raise a couple of questions we discussed in group meeting:
It makes a lot of sense to me to figure out how we want to implement custom CUDA kernels, as I suspect we will have more use cases in the future, and we should probably do so sooner rather than later.
@arthurmccray Thanks so much for the feedback.
Indeed, integrating a CUDA kernel with PyTorch requires a bit of logistics (mainly passing gradient info, as you've mentioned). I experimented with this before and can do a follow-up on it.
Yes, I think we should leverage the available hardware. Given that the API for the end user doesn't change (just logging to mallard, done, gain 2-10x speed), we should push for this. I will report back on the other topics you mentioned during the dev meeting as well.
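On the gradient-passing point above: the usual pattern is to wrap the custom kernel in a `torch.autograd.Function` with a `forward` and a `backward`. A minimal sketch of that pattern, using a plain Python class as a stand-in (so it runs without PyTorch or a GPU); the op name and the scale-by-k kernel are purely illustrative, not from this PR:

```python
class ScaleOp:
    """Stand-in for a torch.autograd.Function wrapping a custom CUDA kernel:
    forward launches the kernel, backward returns the gradient so autograd
    can flow through the op. All names here are hypothetical."""

    @staticmethod
    def forward(ctx, x, k):
        # The real op would launch the CUDA kernel here; this stand-in
        # computes y = k * x and stashes k for the backward pass.
        ctx["k"] = k
        return k * x

    @staticmethod
    def backward(ctx, grad_out):
        # dL/dx = k * dL/dy -- this is the gradient info that has to be
        # passed back through the custom op.
        return ctx["k"] * grad_out


ctx = {}
y = ScaleOp.forward(ctx, 3.0, 2.0)   # 6.0
gx = ScaleOp.backward(ctx, 1.0)      # 2.0
```

The same two-method shape carries over directly to `torch.autograd.Function`, with `ctx.save_for_backward` replacing the dict.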
Pretty brittle CUDA-accelerated loading of .h5 files.
On a single L40S GPU (~40 GB), it takes about 0.5 s to load and decompress ~1 GB on disk into ~10 GB in VRAM.
The current hardware limit is disk-to-GPU memory transfer speed (~80-90% of total wall time).
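The data flow being described (read compressed chunks from disk, decompress into a large array) can be sketched as below. This is a CPU stand-in using zlib, a common HDF5 chunk compressor; in the PR the per-chunk decompression runs in a custom CUDA kernel so the output lands in VRAM. Function names, dtype, and shapes are illustrative, not the PR's API:

```python
import zlib
import numpy as np

def load_compressed_chunks(chunks, shape, dtype=np.uint16):
    """Decompress a sequence of zlib-compressed chunks and assemble the
    full array. A GPU version would decompress each chunk in a CUDA
    kernel directly into preallocated device memory instead."""
    parts = [np.frombuffer(zlib.decompress(c), dtype=dtype) for c in chunks]
    return np.concatenate(parts).reshape(shape)

# Round-trip demo with synthetic data standing in for a 4D-STEM stack.
rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=(4, 8), dtype=np.uint16)
chunks = [zlib.compress(row.tobytes()) for row in data]
restored = load_compressed_chunks(chunks, data.shape)
assert np.array_equal(restored, data)
```

Since the ~1 GB of compressed bytes is all that crosses the disk-to-GPU path, decompressing on the GPU avoids moving the ~10 GB decompressed array over that bottleneck link.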
Tested with Arina 4D-STEM data collected at NCEM with @smribet:
Performance:
API:
I will keep this as a draft PR for now while I test with other recently collected datasets and improve the API.
example_cuda_gpu_load.ipynb