CUDA kernel for loading compressed .h5 from disk to VRAM #135
bobleesj wants to merge 4 commits into electronmicroscopy:dev from

Conversation
.h5 load/bin (~10x faster)
.h5 from disk to VRAM
no __init__ needed for test file.
This looks great! Adding this comment to more widely raise a couple of questions we discussed in group meeting:
It makes a lot of sense to me to figure out how we want to implement custom CUDA kernels, as I suspect we will have more use cases in the future, and we should probably do so sooner rather than later.
@arthurmccray Thanks so much for the feedback.
Indeed, integrating a CUDA kernel with PyTorch requires a bit of logistics (mainly passing gradient info, as you've mentioned). I experimented with this before and can do a follow-up on it.
Yes, I think we should leverage the available hardware. Given that the API for the end user doesn't change (just logging to mallard, done, gain 2-10x speed), we should push for this. I will report back on the other topics you mentioned during the dev meeting as well.
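On the gradient-passing point above: the usual pattern is to wrap the custom kernel in a `torch.autograd.Function` with a `forward` and a `backward`. A minimal sketch of that pattern, using a plain Python class as a stand-in (so it runs without PyTorch or a GPU); the op name and the scale-by-k kernel are purely illustrative, not from this PR:

```python
class ScaleOp:
    """Stand-in for a torch.autograd.Function wrapping a custom CUDA kernel:
    forward launches the kernel, backward returns the gradient so autograd
    can flow through the op. All names here are hypothetical."""

    @staticmethod
    def forward(ctx, x, k):
        # The real op would launch the CUDA kernel here; this stand-in
        # computes y = k * x and stashes k for the backward pass.
        ctx["k"] = k
        return k * x

    @staticmethod
    def backward(ctx, grad_out):
        # dL/dx = k * dL/dy -- this is the gradient info that has to be
        # passed back through the custom op.
        return ctx["k"] * grad_out


ctx = {}
y = ScaleOp.forward(ctx, 3.0, 2.0)   # 6.0
gx = ScaleOp.backward(ctx, 1.0)      # 2.0
```

The same two-method shape carries over directly to `torch.autograd.Function`, with `ctx.save_for_backward` replacing the dict.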
Pretty brittle CUDA-accelerated loading of .h5 files.
On a single L40S GPU (~40 GB), it takes about 0.5 s to load and decompress ~1 GB on disk into ~10 GB in VRAM.
The current hardware limit is disk-to-GPU memory transfer speed (~80-90% of total wall time).
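The data flow being described (read compressed chunks from disk, decompress into a large array) can be sketched as below. This is a CPU stand-in using zlib, a common HDF5 chunk compressor; in the PR the per-chunk decompression runs in a custom CUDA kernel so the output lands in VRAM. Function names, dtype, and shapes are illustrative, not the PR's API:

```python
import zlib
import numpy as np

def load_compressed_chunks(chunks, shape, dtype=np.uint16):
    """Decompress a sequence of zlib-compressed chunks and assemble the
    full array. A GPU version would decompress each chunk in a CUDA
    kernel directly into preallocated device memory instead."""
    parts = [np.frombuffer(zlib.decompress(c), dtype=dtype) for c in chunks]
    return np.concatenate(parts).reshape(shape)

# Round-trip demo with synthetic data standing in for a 4D-STEM stack.
rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=(4, 8), dtype=np.uint16)
chunks = [zlib.compress(row.tobytes()) for row in data]
restored = load_compressed_chunks(chunks, data.shape)
assert np.array_equal(restored, data)
```

Since the ~1 GB of compressed bytes is all that crosses the disk-to-GPU path, decompressing on the GPU avoids moving the ~10 GB decompressed array over that bottleneck link.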
Tested with Arina 4D-STEM data collected at NCEM with @smribet:
Performance:
API:
I will keep this as a draft PR for now while I test with other recently collected datasets and improve the API.
example_cuda_gpu_load.ipynb