diff --git a/docs/examples.md b/docs/examples.md
index 575c747ca2..e57a41cf52 100644
--- a/docs/examples.md
+++ b/docs/examples.md
@@ -122,6 +122,16 @@ hide:
     Set up Crusoe clusters with optimized networking
+
+    Nebius
+
+    Set up Nebius clusters with optimized networking
+

diff --git a/docs/examples/clusters/nebius/index.md b/docs/examples/clusters/nebius/index.md
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/examples/clusters/crusoe/README.md b/examples/clusters/crusoe/README.md
index b34c4aef34..50ec88e461 100644
--- a/examples/clusters/crusoe/README.md
+++ b/examples/clusters/crusoe/README.md
@@ -1,24 +1,25 @@
 ---
 title: Crusoe
-description: Setting up Crusoe clusters using Managed Kubernetes or VMs with InfiniBand support
+description: Using Crusoe clusters with InfiniBand support via Kubernetes or VMs
 ---
 
 # Crusoe
 
-Crusoe offers two ways to use clusters with fast interconnect:
+`dstack` allows using Crusoe clusters with fast interconnect in two ways:
 
-* [Crusoe Managed Kubernetes](#kubernetes) – Lets you interact with clusters through the Kubernetes API and includes support for NVIDIA and AMD GPU operators and related tools.
-* [Virtual Machines (VMs)](#vms) – Gives you direct access to clusters in the form of virtual machines with NVIDIA and AMD GPUs.
+* [Kubernetes](#kubernetes) – Create a Kubernetes cluster on Crusoe, configure a `kubernetes` backend, and create a backend fleet; `dstack` then lets you fully use the cluster.
+* [VMs](#vms) – Create a VM cluster on Crusoe and create an SSH fleet; `dstack` then lets you fully use the cluster.
+
+## Kubernetes
 
-Both options use the same underlying networking infrastructure. This example walks you through how to set up Crusoe clusters to use with `dstack`.
+### Create a cluster
 
-## Crusoe Managed Kubernetes { #kubernetes }
+1. Go to `Networking` → `Firewall Rules`, click `Create Firewall Rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
+2. Go to `Orchestration` and click `Create Cluster`. Make sure to enable the `NVIDIA GPU Operator` add-on.
+3. Go to the cluster and click `Create Node Pool`. Select the right instance type and the `Desired Number of Nodes`.
+4. Wait until the nodes are provisioned.
 
-!!! info "Prerequsisites"
-    1. Go `Networking` → `Firewall Rules`, click `Create Firewall Rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
-    2. Go to `Orchestration` and click `Create Cluster`. Make sure to enable the `NVIDIA GPU Operator` add-on.
-    3. Go the the cluster, and click `Create Node Pool`. Select the right type of the instance. If you intend to auto-scale the cluster, make sure to set `Desired Number of Nodes` at least to `1`, since `dstack` doesn't currently support clusters that scale down to `0` nodes.
-    4. Wait until at least one node is running.
+> Even if you enable autoscaling, `dstack` can use only the nodes that are already provisioned.
 
 ### Configure the backend
 
@@ -56,7 +57,7 @@ backends: [kubernetes]
 
 resources:
   # Specify requirements to filter nodes
-  gpu: 1..8
+  gpu: 8
 ```
 
@@ -75,12 +76,13 @@ Once the fleet is created, you can run [dev environments](https://dstack.ai/docs
 
 ## VMs
 
-Another way to work with Crusoe clusters is through VMs. While `dstack` typically supports VM-based compute providers via [dedicated backends](https://dstack.ai/docs/concepts/backends#vm-based) that automate provisioning, Crusoe does not yet have [such a backend](https://github.com/dstackai/dstack/issues/3378). As a result, to use a VM-based Crusoe cluster with `dstack`, you should use [SSH fleets](https://dstack.ai/docs/concepts/fleets).
+Another way to work with Crusoe clusters is through VMs. While `dstack` typically supports VM-based compute providers via [dedicated backends](https://dstack.ai/docs/concepts/backends#vm-based) that automate provisioning, Crusoe does not yet have [such a backend](https://github.com/dstackai/dstack/issues/3378). As a result, to use a VM-based Crusoe cluster with `dstack`, you should use [SSH fleets](https://dstack.ai/docs/concepts/fleets#ssh-fleets).
 
-!!! info "Prerequsisites"
-    1. Go to `Compute`, then `Instances`, and click `Create Instance`. Make sure to select the right instance type and VM image (that [support interconnect](https://docs.crusoecloud.com/networking/infiniband/managing-infiniband-networks/index.html)). Make sure to create as many instances as needed.
+### Create instances
 
-### Create a fleet
+1. Go to `Compute`, then `Instances`, and click `Create Instance`. Make sure to select an instance type and VM image that [support interconnect](https://docs.crusoecloud.com/networking/infiniband/managing-infiniband-networks/index.html), and create as many instances as needed.
+
+### Create a `dstack` fleet
 
 Follow the standard instructions for setting up an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#ssh-fleets):
 
@@ -115,9 +117,9 @@ $ dstack apply -f crusoe-fleet.dstack.yml
 
 Once the fleet is created, you can run [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services).
 
-## Run NCCL tests
+## NCCL tests
 
-Use a [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-task) that runs NCCL tests to validate cluster network bandwidth.
+Use a [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-tasks) that runs NCCL tests to validate cluster network bandwidth.
 
 === "Crusoe Managed Kubernetes"
 
@@ -253,9 +255,9 @@ Provisioning...
 
 nccl-tests provisioning completed (running)
 
-# out-of-place in-place
-# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
-# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
+out-of-place in-place
+ size count type redop root time algbw busbw #wrong time algbw busbw #wrong
+ (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
          8 2 float sum -1 27.70 0.00 0.00 0 29.82 0.00 0.00 0
         16 4 float sum -1 28.78 0.00 0.00 0 28.99 0.00 0.00 0
         32 8 float sum -1 28.49 0.00 0.00 0 28.16 0.00 0.00 0
@@ -285,8 +287,8 @@
  536870912 134217728 float sum -1 5300.49 101.29 189.91 0 5314.91 101.01 189.40 0
 1073741824 268435456 float sum -1 10472.2 102.53 192.25 0 10485.6 102.40 192.00 0
 2147483648 536870912 float sum -1 20749.1 103.50 194.06 0 20745.7 103.51 194.09 0
-# Out of bounds values : 0 OK
-# Avg bus bandwidth : 53.7387
+ Out of bounds values : 0 OK
+ Avg bus bandwidth : 53.7387
 ```
 
diff --git a/examples/clusters/lambda/README.md b/examples/clusters/lambda/README.md
index 50a98bf6ed..07fb0ce926 100644
--- a/examples/clusters/lambda/README.md
+++ b/examples/clusters/lambda/README.md
@@ -5,18 +5,17 @@ description: Setting up Lambda clusters using Kubernetes or 1-Click Clusters wit
 
 # Lambda
 
-[Lambda](https://lambda.ai/) offers two ways to use clusters with a fast interconnect:
+`dstack` allows using Lambda clusters with fast interconnect in two ways:
 
-* [Kubernetes](#kubernetes) – Lets you interact with clusters through the Kubernetes API and includes support for NVIDIA GPU operators and related tools.
-* [1-Click Clusters (1CC)](#1-click-clusters) – Gives you direct access to clusters in the form of bare-metal nodes.
-
-Both options use the same underlying networking infrastructure. This example walks you through how to set up Lambda clusters to use with `dstack`.
+* [Kubernetes](#kubernetes) – Create a Kubernetes cluster on Lambda, configure a `kubernetes` backend, and create a backend fleet; `dstack` then lets you fully use the cluster.
+* [1CC](#1-click-clusters) – Create a 1-Click Cluster (1CC) on Lambda and create an SSH fleet; `dstack` then lets you fully use the cluster.
 
 ## Kubernetes
 
-!!! info "Prerequsisites"
-    1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/managed-kubernetes/#accessing-mk8s) on accessing MK8s.
-    2. Go to `Firewall` → `Edit rules`, click `Add rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
+### Prerequisites
+
+1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/managed-kubernetes/#accessing-mk8s) on accessing MK8s.
+2. Go to `Firewall` → `Edit rules`, click `Add rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
 
 ### Configure the backend
 
@@ -75,8 +74,9 @@ Once the fleet is created, you can run [dev environments](https://dstack.ai/docs
 
 Another way to work with Lambda clusters is through [1CC](https://lambda.ai/1-click-clusters). While `dstack` supports automated cluster provisioning via [VM-based backends](https://dstack.ai/docs/concepts/backends#vm-based), there is currently no programmatic way to provision Lambda 1CCs. As a result, to use a 1CC cluster with `dstack`, you must use [SSH fleets](https://dstack.ai/docs/concepts/fleets).
 
-!!! info "Prerequsisites"
-    1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/) on working with 1-Click Clusters
+### Prerequisites
+
+1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/) on working with 1-Click Clusters.
 
 ### Create a fleet
 
@@ -171,11 +171,11 @@ $ dstack apply -f lambda-nccl-tests.dstack.yml
 
 Provisioning...
 ---> 100%
 
-# nccl-tests version 2.17.6 nccl-headers=22602 nccl-library=22602
-# Collective test starting: all_reduce_perf
-#
-# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
-# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
+ nccl-tests version 2.17.6 nccl-headers=22602 nccl-library=22602
+ Collective test starting: all_reduce_perf
+
+ size count type redop root time algbw busbw #wrong time algbw busbw #wrong
+ (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
          8 2 float sum -1 36.50 0.00 0.00 0 36.16 0.00 0.00 0
         16 4 float sum -1 35.55 0.00 0.00 0 35.49 0.00 0.00 0
         32 8 float sum -1 35.49 0.00 0.00 0 36.28 0.00 0.00 0
@@ -205,8 +205,8 @@
  536870912 134217728 float sum -1 1625.63 330.25 619.23 0 1687.31 318.18 596.59 0
 1073741824 268435456 float sum -1 2972.25 361.26 677.35 0 2971.33 361.37 677.56 0
 2147483648 536870912 float sum -1 5784.75 371.23 696.06 0 5728.40 374.88 702.91 0
-# Out of bounds values : 0 OK
-# Avg bus bandwidth : 137.179
+ Out of bounds values : 0 OK
+ Avg bus bandwidth : 137.179
 ```
 
diff --git a/examples/clusters/nebius/README.md b/examples/clusters/nebius/README.md
new file mode 100644
index 0000000000..70b41f8a87
--- /dev/null
+++ b/examples/clusters/nebius/README.md
@@ -0,0 +1,257 @@
+---
+title: Nebius
+description: Using Nebius clusters with InfiniBand support via VMs or Kubernetes
+---
+
+# Nebius
+
+`dstack` allows you to use Nebius clusters with fast interconnect in two ways:
+
+* [VMs](#vms) – Configure a `nebius` backend with your Nebius credentials; `dstack` then fully provisions and manages clusters for you.
+* [Kubernetes](#kubernetes) – Create a Kubernetes cluster on Nebius, configure a `kubernetes` backend, and create a backend fleet; `dstack` then lets you fully use the cluster.
+
+## VMs
+
+Since `dstack` offers a VM-based backend that natively integrates with Nebius, you only need to provide your Nebius credentials; `dstack` can then provision and use clusters on Nebius for you.
+
+### Configure a backend
+
+You can configure the `nebius` backend using a credentials file [generated](https://docs.nebius.com/iam/service-accounts/authorized-keys#create) by the `nebius` CLI:
+
+```shell
+$ nebius iam auth-public-key generate \
+    --service-account-id <service account ID> \
+    --output ~/.nebius/sa-credentials.json
+```
+
+Then reference the credentials file in your `~/.dstack/server/config.yml`:
+
+```yaml
+projects:
+- name: main
+  backends:
+  - type: nebius
+    creds:
+      type: service_account
+      filename: ~/.nebius/sa-credentials.json
+```
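+
+If the `dstack` server is already running, restart it so that the new backend is picked up. A minimal check, assuming you run the server locally via the CLI:
+
+```shell
+$ dstack server
+```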
+
+### Create a fleet
+
+Once the backend is configured, you can create a fleet:
+
+```yaml
+type: fleet
+name: nebius-fleet
+
+nodes: 2
+placement: cluster
+
+backends: [nebius]
+
+resources:
+  gpu: H100:8
+```
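+
+If you'd rather have `dstack` create instances only when you run workloads, a minimal variant of the same fleet uses a range for `nodes`:
+
+```yaml
+type: fleet
+name: nebius-fleet
+
+# A range provisions instances on demand, up to the maximum
+nodes: 0..2
+placement: cluster
+
+backends: [nebius]
+
+resources:
+  gpu: H100:8
+```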
+
+Pass the fleet configuration to `dstack apply`:
+
+```shell
+$ dstack apply -f nebius-fleet.dstack.yml
+```
+
+This will automatically create a Nebius cluster and provision instances.
+
+Once the fleet is created, you can run [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services).
+
+> If you want instances to be provisioned on demand, set `nodes` to a range such as `0..2`, as shown in the sketch above. In this case, `dstack` creates instances only when you run workloads.
+
+## Kubernetes
+
+If you'd rather use `dstack` with Nebius's managed Kubernetes service, you can point `dstack` to the cluster's kubeconfig file and fully use the cluster through `dstack`.
+
+### Create a cluster
+
+1. Go to `Compute` → `Kubernetes` and click `Create cluster`. Make sure to enable `Public endpoint`.
+2. Go to `Node groups` and click `Create node group`. Make sure to enable `Assign public IPv4 addresses` and `Install NVIDIA GPU drivers and other components`. Select the appropriate instance type, specify the `Number of nodes`, and set `Node storage` to at least `120 GiB`. Make sure to click `Create` under `GPU cluster` if you plan to use a fast interconnect.
+3. Go to `Applications`, find `NVIDIA Device Plugin`, and click `Deploy`.
+4. Wait until the nodes are provisioned.
+
+> Even if you enable autoscaling, `dstack` can use only the nodes that are already provisioned. To provision instances on demand, use [VMs](#vms) (see above).
+
+#### Configure the kubeconfig file
+
+1. Click `How to connect` and copy the `nebius` CLI command that configures the `kubeconfig` file.
+2. Install the `nebius` CLI and run the command:
+
+```shell
+$ nebius mk8s cluster get-credentials --id <cluster id> --external
+```
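+
+To verify that the kubeconfig file works, you can list the cluster nodes (assuming `kubectl` is installed):
+
+```shell
+$ kubectl get nodes
+```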
+
+### Configure a backend
+
+Follow the standard instructions for setting up a [`kubernetes`](https://dstack.ai/docs/concepts/backends/#kubernetes) backend:
+
+```yaml
+projects:
+- name: main
+  backends:
+  - type: kubernetes
+    kubeconfig:
+      # Path to the kubeconfig file configured above
+      filename: <path to the kubeconfig file>
+```
+
+### Create a fleet
+
+Once the cluster and the `dstack` server are running, you can create a fleet:
+
+```yaml
+type: fleet
+name: nebius-fleet
+
+placement: cluster
+nodes: 0..
+
+backends: [kubernetes]
+
+resources:
+  # Specify requirements to filter nodes
+  gpu: 8
+```
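+
+If the cluster mixes several node types, you can narrow the filter using the same `resources` syntax as for VM fleets; a sketch assuming `H100` nodes:
+
+```yaml
+resources:
+  # Only match nodes with eight H100 GPUs
+  gpu: H100:8
+```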
+
+Pass the fleet configuration to `dstack apply`:
+
+```shell
+$ dstack apply -f nebius-fleet.dstack.yml
+```
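+
+To inspect the resulting fleet and its instances (assuming your `dstack` version provides the `fleet` subcommand):
+
+```shell
+$ dstack fleet
+```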
+
+Once the fleet is created, you can run [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services).
+
+## NCCL tests
+
+Use a [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-tasks) to run NCCL tests and validate the cluster's network bandwidth.
+
+```yaml
+type: task
+name: nccl-tests
+
+nodes: 2
+startup_order: workers-first
+stop_criteria: master-done
+
+env:
+  - NCCL_DEBUG=INFO
+commands:
+  - |
+    if [ $DSTACK_NODE_RANK -eq 0 ]; then
+      mpirun \
+        --allow-run-as-root \
+        --hostfile $DSTACK_MPI_HOSTFILE \
+        -n $DSTACK_GPUS_NUM \
+        -N $DSTACK_GPUS_PER_NODE \
+        --bind-to none \
+        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
+    else
+      sleep infinity
+    fi
+
+# Required for `/dev/infiniband` access
+privileged: true
+
+resources:
+  gpu: 8
+  shm_size: 16GB
+```
+
+Pass the configuration to `dstack apply`:
+
+```shell
+$ dstack apply -f nebius-nccl-tests.dstack.yml
+
+Provisioning...
+---> 100%
+
+nccl-tests provisioning completed (running)
+
+ out-of-place in-place
+ size count type redop root time algbw busbw #wrong time algbw busbw #wrong
+ (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
+          8 2 float sum -1 45.72 0.00 0.00 0 29.78 0.00 0.00 0
+         16 4 float sum -1 29.92 0.00 0.00 0 29.42 0.00 0.00 0
+         32 8 float sum -1 30.10 0.00 0.00 0 29.75 0.00 0.00 0
+         64 16 float sum -1 34.48 0.00 0.00 0 29.36 0.00 0.00 0
+        128 32 float sum -1 30.38 0.00 0.01 0 29.67 0.00 0.01 0
+        256 64 float sum -1 30.48 0.01 0.02 0 29.97 0.01 0.02 0
+        512 128 float sum -1 30.45 0.02 0.03 0 30.85 0.02 0.03 0
+       1024 256 float sum -1 31.36 0.03 0.06 0 31.29 0.03 0.06 0
+       2048 512 float sum -1 32.27 0.06 0.12 0 32.26 0.06 0.12 0
+       4096 1024 float sum -1 36.04 0.11 0.21 0 43.17 0.09 0.18 0
+       8192 2048 float sum -1 37.24 0.22 0.41 0 35.54 0.23 0.43 0
+      16384 4096 float sum -1 37.22 0.44 0.83 0 34.55 0.47 0.89 0
+      32768 8192 float sum -1 43.82 0.75 1.40 0 35.64 0.92 1.72 0
+      65536 16384 float sum -1 37.85 1.73 3.25 0 37.55 1.75 3.27 0
+     131072 32768 float sum -1 43.10 3.04 5.70 0 53.08 2.47 4.63 0
+     262144 65536 float sum -1 58.59 4.47 8.39 0 63.33 4.14 7.76 0
+     524288 131072 float sum -1 97.88 5.36 10.04 0 83.91 6.25 11.72 0
+    1048576 262144 float sum -1 87.08 12.04 22.58 0 77.82 13.47 25.26 0
+    2097152 524288 float sum -1 99.06 21.17 39.69 0 97.67 21.47 40.26 0
+    4194304 1048576 float sum -1 110.14 38.08 71.40 0 114.66 36.58 68.59 0
+    8388608 2097152 float sum -1 154.48 54.30 101.82 0 156.03 53.76 100.80 0
+   16777216 4194304 float sum -1 210.33 79.77 149.56 0 200.98 83.48 156.52 0
+   33554432 8388608 float sum -1 274.23 122.36 229.43 0 276.45 121.38 227.58 0
+   67108864 16777216 float sum -1 472.43 142.05 266.35 0 480.00 139.81 262.14 0
+  134217728 33554432 float sum -1 759.58 176.70 331.31 0 756.21 177.49 332.79 0
+  268435456 67108864 float sum -1 1305.66 205.59 385.49 0 1303.37 205.95 386.16 0
+  536870912 134217728 float sum -1 2379.38 225.63 423.06 0 2373.42 226.20 424.13 0
+ 1073741824 268435456 float sum -1 4511.97 237.98 446.21 0 4513.82 237.88 446.02 0
+ 2147483648 536870912 float sum -1 8776.26 244.69 458.80 0 8760.42 245.13 459.63 0
+ 4294967296 1073741824 float sum -1 17407.8 246.73 462.61 0 17302.2 248.23 465.44 0
+ 8589934592 2147483648 float sum -1 34448.4 249.36 467.54 0 34381.0 249.85 468.46 0
+ Out of bounds values : 0 OK
+ Avg bus bandwidth : 125.499
+
+ Collective test concluded: all_reduce_perf
+```
+
+## What's next
+
+1. Learn about [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services)
+2. Check out [backends](https://dstack.ai/docs/concepts/backends) and [fleets](https://dstack.ai/docs/concepts/fleets)
+3. Read Nebius' docs on [networking for VMs](https://docs.nebius.com/compute/clusters/gpu) and the [managed Kubernetes service](https://docs.nebius.com/kubernetes).
diff --git a/mkdocs.yml b/mkdocs.yml
index c98af0a2dd..ef745e6548 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -296,6 +296,7 @@ nav:
       - GCP: examples/clusters/gcp/index.md
       - Lambda: examples/clusters/lambda/index.md
       - Crusoe: examples/clusters/crusoe/index.md
+      - Nebius: examples/clusters/nebius/index.md
       - NCCL/RCCL tests: examples/clusters/nccl-rccl-tests/index.md
     - Inference:
       - SGLang: examples/inference/sglang/index.md