A pre-compiler pass that splits models based on latency measurements and a runtime to execute the split models across multiple machines.
Please use conda; this project has only been tested with conda.
```bash
conda create --prefix $PSCRATCH/ts310 python=3.10
conda activate $PSCRATCH/ts310
conda install poetry
conda install graphviz
poetry install --no-root --all-groups
export PYTHONPATH=$(pwd):$PYTHONPATH
```
To export a model, run the following command:
```bash
python3 ./torch_split/cli.py export examples.clip.interface:ClipInterface -b 1
```
This traces the model with a batch size of 1 and exports it to /dev/shm/switchboard.tspartd. You can look at /dev/shm/switchboard.tspartd/structure.json to see the dataflow graph.
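For a quick look at the graph without reading raw JSON, something like the following works; the `nodes`/`edges` keys are assumptions for illustration, so check the file for the actual schema.

```python
import json
from pathlib import Path

# Load the exported dataflow graph. NOTE: the "nodes"/"edges" keys below are
# assumptions for illustration -- inspect structure.json for the real schema.
structure = json.loads(Path("/dev/shm/switchboard.tspartd/structure.json").read_text())

print("Top-level keys:", list(structure.keys()))
for key in ("nodes", "edges"):
    if key in structure:
        print(f"{key}: {len(structure[key])} entries")
```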
To test the exported model, run:
```bash
python3 ./examples/clip/main.py
```
This performs a topological sort of the exported switchboard and runs the components in order; the outputs are identical to those of the original model.
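For intuition, the execution order amounts to a standard topological traversal, roughly like the sketch below; `components` and `deps` are hypothetical names for illustration, not the repo's actual API.

```python
from graphlib import TopologicalSorter  # stdlib topological sort (Python 3.9+)

def run_switchboard(components, deps, model_input):
    """Illustrative sketch only -- not the repo's actual runner.

    components: name -> callable for each exported split
    deps: name -> set of upstream component names
    model_input: input fed to components with no dependencies
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = [results[d] for d in sorted(deps.get(name, ()))]
        # Source components consume the model input; the rest consume
        # the outputs of their upstream components.
        results[name] = components[name](*upstream) if upstream else components[name](model_input)
    return results
```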
To optimize a model, run the command below. This is currently a work in progress:
- optimize stops when (A) the soft DRAM memory limit is reached or (B) the maximum batch size is reached (see the sketch below).
- Currently, only a batch size of 1 is optimized. Future updates will combine various batch sizes to find a better solution.
```bash
python3 ./torch_split/cli.py optimize examples.clip.interface:ClipInterface --max-batch 8
```
You can pass the "-o" flag to specify the output directory for the optimization artifacts. By default, they are saved to ./ts_bin.
You can pass the "-s" flag to generate the intermediate dataflow graphs used during partitioning. It will be saved under the "visualizations" folder in the output directory.
The allocation result may differ from run to run due to the non-deterministic nature of pynvml's GPU utilization readings. For one run:
```
[00:09:49] INFO Pipeline Throughput: 3050.05
           INFO Leximin Model Throughputs:
           INFO   A: 3050.05
           INFO   B: 3172.48
           INFO   C: 11520.80
           INFO Assignments:
           INFO   - Model B assigned 10x to GPU1 with memory slice 4GB
           INFO   - Model B assigned 9x to GPU3 with memory slice 4GB
           INFO   - Model A assigned 10x to GPU0 with memory slice 4GB
           INFO   - Model A assigned 10x to GPU2 with memory slice 4GB
           INFO   - Model C assigned 1x to GPU3 with memory slice 4GB
```
This is because Model B has ~50% GPU utilization, Model A has ~30% GPU utilization, and Model C has ~10% GPU utilization.
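These utilization figures come from NVML. For reference, sampling them with pynvml looks like this; successive reads can differ, which is the source of the run-to-run variance.

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
# Utilization over the most recent sample period; repeated reads fluctuate,
# so allocation decisions based on them are non-deterministic.
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"GPU: {util.gpu}%  Memory: {util.memory}%")
pynvml.nvmlShutdown()
```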
Prerequisite: export the model with batch size 1 (see the export step above).
To serve the model using Ray Serve, run the following command:
```bash
serve run examples.clip.optimized_server:app
```
Then query it with the client:
```bash
python3 ./examples/clip/client.py
```
Alternatively, run `serve run examples.clip.monolithic_server:app` to serve the monolithic model.
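If you want to query the endpoint directly, Ray Serve's HTTP proxy listens on http://localhost:8000 by default. The route and payload below are assumptions; see examples/clip/client.py for the real request format.

```python
import requests

# Ray Serve's HTTP proxy defaults to http://localhost:8000. The route ("/")
# and JSON payload here are placeholders -- check examples/clip/client.py for
# the actual request format this app expects.
resp = requests.post("http://localhost:8000/", json={"text": "a photo of a cat"})
print(resp.status_code, resp.text)
```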
Profiled data in examples/clip_data can be used to generate latency and GPU utilization plots. Performance Improvement Summary:
| QPS  | Latency Improvement (%) | Throughput Improvement (%) |
| ---: | ---: | ---: |
| 1    | 48.18 | 1.68  |
| 2    | 49.26 | 1.68  |
| 4    | 50.94 | 1.66  |
| 8    | 50.62 | 1.79  |
| 16   | 52.07 | 0.52  |
| 32   | 52.23 | 1.13  |
| 64   | 54.91 | 1.38  |
| 128  | 67.15 | 3.46  |
| 256  | 90.02 | 43.13 |
| 512  | 63.77 | 76.19 |
| 1024 | 38.23 | 58.60 |
| 2048 | 37.67 | 66.21 |
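A minimal matplotlib sketch that plots the summary above; the repo's own plotting scripts for examples/clip_data may differ.

```python
import matplotlib.pyplot as plt

# Data taken from the summary table above.
qps = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
latency_impr = [48.18, 49.26, 50.94, 50.62, 52.07, 52.23, 54.91, 67.15, 90.02, 63.77, 38.23, 37.67]
throughput_impr = [1.68, 1.68, 1.66, 1.79, 0.52, 1.13, 1.38, 3.46, 43.13, 76.19, 58.60, 66.21]

plt.plot(qps, latency_impr, marker="o", label="Latency improvement (%)")
plt.plot(qps, throughput_impr, marker="s", label="Throughput improvement (%)")
plt.xscale("log", base=2)  # QPS doubles each row, so a log-2 axis reads cleanly
plt.xlabel("QPS")
plt.ylabel("Improvement (%)")
plt.legend()
plt.tight_layout()
plt.savefig("improvement.png")
```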
Optionally, this repo provides a tracing hook to export traces to an OpenTelemetry collector. You can install SigNoz here:
```bash
cd signoz/deploy/docker
docker compose up -d
```
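For reference, a standard OpenTelemetry OTLP span exporter setup looks like the sketch below. The endpoint assumes the local SigNoz deployment's collector is listening on the default OTLP/gRPC port (4317); the repo's built-in tracing hook may handle this configuration for you.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Standard OTLP setup, independent of this repo's tracing hook. localhost:4317
# is the usual OTLP/gRPC port for a local collector (assumption: your SigNoz
# docker deployment exposes it there).
provider = TracerProvider(resource=Resource.create({"service.name": "torch-split"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

with trace.get_tracer(__name__).start_as_current_span("demo-span"):
    pass  # spans created here are batched and shipped to the collector
```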