A pre-compiler pass that splits models based on latency measurements and a runtime to execute the split models across multiple machines.
Please use conda; this project has only been tested with conda.
```bash
conda create --prefix $PSCRATCH/ts310 python=3.10
conda activate $PSCRATCH/ts310
conda install poetry
conda install graphviz
poetry install --no-root --all-groups
export PYTHONPATH=$(pwd):$PYTHONPATH
```
To export a model, run the following command:
```bash
python3 ./torch_split/cli.py export examples.clip.interface:ClipInterface -b 1
```
This traces the model with a batch size of 1 and exports it to /dev/shm/switchboard.tspartd. You can look at /dev/shm/switchboard.tspartd/structure.json to see the dataflow graph.
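For a quick look at the graph without reading raw JSON, something like the following works; the `nodes`/`edges` keys are assumptions for illustration, so check the file for the actual schema.

```python
import json
from pathlib import Path

# Load the exported dataflow graph. NOTE: the "nodes"/"edges" keys below are
# assumptions for illustration -- inspect structure.json for the real schema.
structure = json.loads(Path("/dev/shm/switchboard.tspartd/structure.json").read_text())

print("Top-level keys:", list(structure.keys()))
for key in ("nodes", "edges"):
    if key in structure:
        print(f"{key}: {len(structure[key])} entries")
```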
To test the exported model, run:
```bash
python3 ./examples/clip/main.py
```
This performs a topological sort of the exported switchboard and runs the components in order; the outputs are identical to those of the original model.
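For intuition, the execution order amounts to a standard topological traversal, roughly like the sketch below; `components` and `deps` are hypothetical names for illustration, not the repo's actual API.

```python
from graphlib import TopologicalSorter  # stdlib topological sort (Python 3.9+)

def run_switchboard(components, deps, model_input):
    """Illustrative sketch only -- not the repo's actual runner.

    components: name -> callable for each exported split
    deps: name -> set of upstream component names
    model_input: input fed to components with no dependencies
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = [results[d] for d in sorted(deps.get(name, ()))]
        # Source components consume the model input; the rest consume
        # the outputs of their upstream components.
        results[name] = components[name](*upstream) if upstream else components[name](model_input)
    return results
```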
To optimize a model, run the command below. This is currently a work in progress:
- optimize stops when (A) the soft DRAM memory limit is reached or (B) the maximum batch size is reached (see the sketch below).
- Currently, only a batch size of 1 is optimized. Future updates will combine various batch sizes to find a better solution.
```bash
python3 ./torch_split/cli.py optimize examples.clip.interface:ClipInterface --max-batch 8
```
You can pass the "-o" flag to specify the output directory for the optimization artifacts. By default, they are saved to ./ts_bin.
You can pass the "-s" flag to generate the intermediate dataflow graphs used during partitioning. It will be saved under the "visualizations" folder in the output directory.
The allocation result may differ from run to run due to the non-deterministic nature of pynvml's GPU utilization readings. For one run:
```
[00:09:49] INFO Pipeline Throughput: 3050.05
           INFO Leximin Model Throughputs:
           INFO   A: 3050.05
           INFO   B: 3172.48
           INFO   C: 11520.80
           INFO Assignments:
           INFO   - Model B assigned 10x to GPU1 with memory slice 4GB
           INFO   - Model B assigned 9x to GPU3 with memory slice 4GB
           INFO   - Model A assigned 10x to GPU0 with memory slice 4GB
           INFO   - Model A assigned 10x to GPU2 with memory slice 4GB
           INFO   - Model C assigned 1x to GPU3 with memory slice 4GB
```
This is because Model B has ~50% GPU utilization, Model A has ~30% GPU utilization, and Model C has ~10% GPU utilization.
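These utilization figures come from NVML. For reference, sampling them with pynvml looks like this; successive reads can differ, which is the source of the run-to-run variance.

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
# Utilization over the most recent sample period; repeated reads fluctuate,
# so allocation decisions based on them are non-deterministic.
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"GPU: {util.gpu}%  Memory: {util.memory}%")
pynvml.nvmlShutdown()
```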
Prerequisite: export the model with batch size 1 (see the export step above).
To serve the model using Ray Serve, run the following command:
```bash
serve run examples.clip.optimized_server:app
```
Then query it with the client:
```bash
python3 ./examples/clip/client.py
```
Alternatively, run `serve run examples.clip.monolithic_server:app` to serve the monolithic model.
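If you want to query the endpoint directly, Ray Serve's HTTP proxy listens on http://localhost:8000 by default. The route and payload below are assumptions; see examples/clip/client.py for the real request format.

```python
import requests

# Ray Serve's HTTP proxy defaults to http://localhost:8000. The route ("/")
# and JSON payload here are placeholders -- check examples/clip/client.py for
# the actual request format this app expects.
resp = requests.post("http://localhost:8000/", json={"text": "a photo of a cat"})
print(resp.status_code, resp.text)
```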
Profiled data in examples/clip_data can be used to generate latency and GPU utilization plots. Performance Improvement Summary:
| QPS  | Latency Improvement (%) | Throughput Improvement (%) |
| ---: | ---: | ---: |
| 1    | 48.18 | 1.68  |
| 2    | 49.26 | 1.68  |
| 4    | 50.94 | 1.66  |
| 8    | 50.62 | 1.79  |
| 16   | 52.07 | 0.52  |
| 32   | 52.23 | 1.13  |
| 64   | 54.91 | 1.38  |
| 128  | 67.15 | 3.46  |
| 256  | 90.02 | 43.13 |
| 512  | 63.77 | 76.19 |
| 1024 | 38.23 | 58.60 |
| 2048 | 37.67 | 66.21 |
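A minimal matplotlib sketch that plots the summary above; the repo's own plotting scripts for examples/clip_data may differ.

```python
import matplotlib.pyplot as plt

# Data taken from the summary table above.
qps = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
latency_impr = [48.18, 49.26, 50.94, 50.62, 52.07, 52.23, 54.91, 67.15, 90.02, 63.77, 38.23, 37.67]
throughput_impr = [1.68, 1.68, 1.66, 1.79, 0.52, 1.13, 1.38, 3.46, 43.13, 76.19, 58.60, 66.21]

plt.plot(qps, latency_impr, marker="o", label="Latency improvement (%)")
plt.plot(qps, throughput_impr, marker="s", label="Throughput improvement (%)")
plt.xscale("log", base=2)  # QPS doubles each row, so a log-2 axis reads cleanly
plt.xlabel("QPS")
plt.ylabel("Improvement (%)")
plt.legend()
plt.tight_layout()
plt.savefig("improvement.png")
```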
Optionally, this repo provides a tracing hook to export traces to an OpenTelemetry collector. You can install SigNoz here:
```bash
cd signoz/deploy/docker
docker compose up -d
```
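For reference, a standard OpenTelemetry OTLP span exporter setup looks like the sketch below. The endpoint assumes the local SigNoz deployment's collector is listening on the default OTLP/gRPC port (4317); the repo's built-in tracing hook may handle this configuration for you.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Standard OTLP setup, independent of this repo's tracing hook. localhost:4317
# is the usual OTLP/gRPC port for a local collector (assumption: your SigNoz
# docker deployment exposes it there).
provider = TracerProvider(resource=Resource.create({"service.name": "torch-split"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

with trace.get_tracer(__name__).start_as_current_span("demo-span"):
    pass  # spans created here are batched and shipped to the collector
```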