diff --git a/docs/examples/clusters/nebius/index.md b/docs/examples/clusters/nebius/index.md
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/examples/clusters/crusoe/README.md b/examples/clusters/crusoe/README.md
index b34c4aef34..50ec88e461 100644
--- a/examples/clusters/crusoe/README.md
+++ b/examples/clusters/crusoe/README.md
@@ -1,24 +1,25 @@
---
title: Crusoe
-description: Setting up Crusoe clusters using Managed Kubernetes or VMs with InfiniBand support
+description: Using Crusoe clusters with InfiniBand support via Kubernetes or VMs
---
# Crusoe
-Crusoe offers two ways to use clusters with fast interconnect:
+`dstack` allows you to use Crusoe clusters with fast interconnect in two ways:
-* [Crusoe Managed Kubernetes](#kubernetes) – Lets you interact with clusters through the Kubernetes API and includes support for NVIDIA and AMD GPU operators and related tools.
-* [Virtual Machines (VMs)](#vms) – Gives you direct access to clusters in the form of virtual machines with NVIDIA and AMD GPUs.
+* [Kubernetes](#kubernetes) – Create a Kubernetes cluster on Crusoe, then configure a `kubernetes` backend and a backend fleet in `dstack` to fully use the cluster.
+* [VMs](#vms) – Create a VM cluster on Crusoe, then set up an SSH fleet in `dstack` to fully use the cluster.
+
+## Kubernetes
-Both options use the same underlying networking infrastructure. This example walks you through how to set up Crusoe clusters to use with `dstack`.
+### Create a cluster
-## Crusoe Managed Kubernetes { #kubernetes }
+1. Go to `Networking` → `Firewall Rules`, click `Create Firewall Rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
+2. Go to `Orchestration` and click `Create Cluster`. Make sure to enable the `NVIDIA GPU Operator` add-on.
+3. Go to the cluster and click `Create Node Pool`. Select the right instance type and set `Desired Number of Nodes`.
+4. Wait until the nodes are provisioned.
-!!! info "Prerequsisites"
- 1. Go `Networking` → `Firewall Rules`, click `Create Firewall Rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
- 2. Go to `Orchestration` and click `Create Cluster`. Make sure to enable the `NVIDIA GPU Operator` add-on.
- 3. Go the the cluster, and click `Create Node Pool`. Select the right type of the instance. If you intend to auto-scale the cluster, make sure to set `Desired Number of Nodes` at least to `1`, since `dstack` doesn't currently support clusters that scale down to `0` nodes.
- 4. Wait until at least one node is running.
+> Even if you enable autoscaling, `dstack` can only use nodes that are already provisioned.
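+
+Once the nodes are up, you can verify that the cluster sees them and their GPUs (this assumes your kubeconfig already points at the Crusoe cluster; `nvidia.com/gpu` is the resource name the NVIDIA GPU Operator registers):
+
+```shell
+$ kubectl get nodes
+
+$ kubectl describe nodes | grep -i "nvidia.com/gpu"
+```
+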
### Configure the backend
@@ -56,7 +57,7 @@ backends: [kubernetes]
resources:
# Specify requirements to filter nodes
- gpu: 1..8
+ gpu: 8
```
@@ -75,12 +76,13 @@ Once the fleet is created, you can run [dev environments](https://dstack.ai/docs
## VMs
-Another way to work with Crusoe clusters is through VMs. While `dstack` typically supports VM-based compute providers via [dedicated backends](https://dstack.ai/docs/concepts/backends#vm-based) that automate provisioning, Crusoe does not yet have [such a backend](https://github.com/dstackai/dstack/issues/3378). As a result, to use a VM-based Crusoe cluster with `dstack`, you should use [SSH fleets](https://dstack.ai/docs/concepts/fleets).
+Another way to work with Crusoe clusters is through VMs. While `dstack` typically supports VM-based compute providers via [dedicated backends](https://dstack.ai/docs/concepts/backends#vm-based) that automate provisioning, Crusoe does not yet have [such a backend](https://github.com/dstackai/dstack/issues/3378). As a result, to use a VM-based Crusoe cluster with `dstack`, you should use [SSH fleets](https://dstack.ai/docs/concepts/fleets#ssh-fleets).
-!!! info "Prerequsisites"
- 1. Go to `Compute`, then `Instances`, and click `Create Instance`. Make sure to select the right instance type and VM image (that [support interconnect](https://docs.crusoecloud.com/networking/infiniband/managing-infiniband-networks/index.html)). Make sure to create as many instances as needed.
+### Create instances
-### Create a fleet
+1. Go to `Compute`, then `Instances`, and click `Create Instance`. Select an instance type and VM image that [support interconnect](https://docs.crusoecloud.com/networking/infiniband/managing-infiniband-networks/index.html). Create as many instances as needed.
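+
+Once the instances are running, you can check that InfiniBand is visible from inside each VM, for example with `ibv_devinfo` (this assumes the selected image ships the InfiniBand userspace tools):
+
+```shell
+$ ibv_devinfo | grep -E "hca_id|state"
+```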
+
+### Create a `dstack` fleet
Follow the standard instructions for setting up an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#ssh-fleets):
@@ -115,9 +117,9 @@ $ dstack apply -f crusoe-fleet.dstack.yml
Once the fleet is created, you can run [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services).
-## Run NCCL tests
+## NCCL tests
-Use a [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-task) that runs NCCL tests to validate cluster network bandwidth.
+Use a [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-tasks) that runs NCCL tests to validate cluster network bandwidth.
=== "Crusoe Managed Kubernetes"
@@ -253,9 +255,9 @@ Provisioning...
nccl-tests provisioning completed (running)
-# out-of-place in-place
-# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
-# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
+                                                             out-of-place                       in-place
+ size count type redop root time algbw busbw #wrong time algbw busbw #wrong
+ (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 27.70 0.00 0.00 0 29.82 0.00 0.00 0
16 4 float sum -1 28.78 0.00 0.00 0 28.99 0.00 0.00 0
32 8 float sum -1 28.49 0.00 0.00 0 28.16 0.00 0.00 0
@@ -285,8 +287,8 @@ nccl-tests provisioning completed (running)
536870912 134217728 float sum -1 5300.49 101.29 189.91 0 5314.91 101.01 189.40 0
1073741824 268435456 float sum -1 10472.2 102.53 192.25 0 10485.6 102.40 192.00 0
2147483648 536870912 float sum -1 20749.1 103.50 194.06 0 20745.7 103.51 194.09 0
-# Out of bounds values : 0 OK
-# Avg bus bandwidth : 53.7387
+ Out of bounds values : 0 OK
+ Avg bus bandwidth : 53.7387
```
diff --git a/examples/clusters/lambda/README.md b/examples/clusters/lambda/README.md
index 50a98bf6ed..07fb0ce926 100644
--- a/examples/clusters/lambda/README.md
+++ b/examples/clusters/lambda/README.md
@@ -5,18 +5,17 @@ description: Setting up Lambda clusters using Kubernetes or 1-Click Clusters wit
# Lambda
-[Lambda](https://lambda.ai/) offers two ways to use clusters with a fast interconnect:
+`dstack` allows you to use Lambda clusters with fast interconnect in two ways:
-* [Kubernetes](#kubernetes) – Lets you interact with clusters through the Kubernetes API and includes support for NVIDIA GPU operators and related tools.
-* [1-Click Clusters (1CC)](#1-click-clusters) – Gives you direct access to clusters in the form of bare-metal nodes.
-
-Both options use the same underlying networking infrastructure. This example walks you through how to set up Lambda clusters to use with `dstack`.
+* [Kubernetes](#kubernetes) – Create a Kubernetes cluster on Lambda, then configure a `kubernetes` backend and a backend fleet in `dstack` to fully use the cluster.
+* [1-Click Clusters (1CC)](#1-click-clusters) – Create a 1CC cluster on Lambda, then set up an SSH fleet in `dstack` to fully use the cluster.
## Kubernetes
-!!! info "Prerequsisites"
- 1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/managed-kubernetes/#accessing-mk8s) on accessing MK8s.
- 2. Go to `Firewall` → `Edit rules`, click `Add rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
+### Prerequisites
+
+1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/managed-kubernetes/#accessing-mk8s) on accessing MK8s.
+2. Go to `Firewall` → `Edit rules`, click `Add rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
### Configure the backend
@@ -75,8 +74,9 @@ Once the fleet is created, you can run [dev environments](https://dstack.ai/docs
Another way to work with Lambda clusters is through [1CC](https://lambda.ai/1-click-clusters). While `dstack` supports automated cluster provisioning via [VM-based backends](https://dstack.ai/docs/concepts/backends#vm-based), there is currently no programmatic way to provision Lambda 1CCs. As a result, to use a 1CC cluster with `dstack`, you must use [SSH fleets](https://dstack.ai/docs/concepts/fleets).
-!!! info "Prerequsisites"
- 1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/) on working with 1-Click Clusters
+### Prerequisites
+
+1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/) on working with 1-Click Clusters.
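+
+Before creating the fleet, it's worth confirming you can reach each node over SSH and that the GPUs are visible (the `ubuntu` user is an assumption; use the login user your 1CC nodes are set up with):
+
+```shell
+$ ssh ubuntu@<node IP> nvidia-smi
+```
+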
### Create a fleet
@@ -171,11 +171,11 @@ $ dstack apply -f lambda-nccl-tests.dstack.yml
Provisioning...
---> 100%
-# nccl-tests version 2.17.6 nccl-headers=22602 nccl-library=22602
-# Collective test starting: all_reduce_perf
-#
-# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
-# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
+ nccl-tests version 2.17.6 nccl-headers=22602 nccl-library=22602
+ Collective test starting: all_reduce_perf
+
+ size count type redop root time algbw busbw #wrong time algbw busbw #wrong
+ (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 36.50 0.00 0.00 0 36.16 0.00 0.00 0
16 4 float sum -1 35.55 0.00 0.00 0 35.49 0.00 0.00 0
32 8 float sum -1 35.49 0.00 0.00 0 36.28 0.00 0.00 0
@@ -205,8 +205,8 @@ Provisioning...
536870912 134217728 float sum -1 1625.63 330.25 619.23 0 1687.31 318.18 596.59 0
1073741824 268435456 float sum -1 2972.25 361.26 677.35 0 2971.33 361.37 677.56 0
2147483648 536870912 float sum -1 5784.75 371.23 696.06 0 5728.40 374.88 702.91 0
-# Out of bounds values : 0 OK
-# Avg bus bandwidth : 137.179
+ Out of bounds values : 0 OK
+ Avg bus bandwidth : 137.179
```
diff --git a/examples/clusters/nebius/README.md b/examples/clusters/nebius/README.md
new file mode 100644
index 0000000000..70b41f8a87
--- /dev/null
+++ b/examples/clusters/nebius/README.md
@@ -0,0 +1,257 @@
+---
+title: Nebius
+description: Using Nebius clusters with InfiniBand support via VMs or Kubernetes
+---
+
+# Nebius
+
+`dstack` allows you to use Nebius clusters with fast interconnects in two ways:
+
+* [VMs](#vms) – Configure a `nebius` backend with your Nebius credentials, and `dstack` provisions and uses clusters for you end to end.
+* [Kubernetes](#kubernetes) – Create a Kubernetes cluster on Nebius, then configure a `kubernetes` backend and a backend fleet in `dstack` to fully use the cluster.
+
+## VMs
+
+Since `dstack` offers a VM-based backend that natively integrates with Nebius, you only need to provide your Nebius credentials, and `dstack` can fully provision and use clusters on Nebius.
+
+### Configure a backend
+
+You can configure the `nebius` backend using a credentials file [generated](https://docs.nebius.com/iam/service-accounts/authorized-keys#create) by the `nebius` CLI:
+
+
+
+```shell
+$ nebius iam auth-public-key generate \
+ --service-account-id <service account ID> \
+ --output ~/.nebius/sa-credentials.json
+```
+
+
+
+
+
+```yaml
+projects:
+- name: main
+ backends:
+ - type: nebius
+ creds:
+ type: service_account
+ filename: ~/.nebius/sa-credentials.json
+```
+
+
+
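+This snippet goes into the `dstack` server configuration file (`~/.dstack/server/config.yml` by default). Restart the server so it picks up the new backend:
+
+```shell
+$ dstack server
+```
+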
+### Create a fleet
+
+Once the backend is configured, you can create a fleet:
+
+
+
+```yaml
+type: fleet
+name: nebius-fleet
+
+nodes: 2
+placement: cluster
+
+backends: [nebius]
+
+resources:
+ gpu: H100:8
+```
+
+
+
+Pass the fleet configuration to `dstack apply`:
+
+
+
+```shell
+$ dstack apply -f nebius-fleet.dstack.yml
+```
+
+
+
+This will automatically create a Nebius cluster and provision instances.
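+
+You can track provisioning with the `dstack fleet` command, which lists fleets and the status of their instances:
+
+```shell
+$ dstack fleet
+```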
+
+Once the fleet is created, you can run [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services).
+
+> If you want instances to be provisioned on demand, you can set `nodes` to `0..2`. In this case, `dstack` will create instances only when you run workloads.
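+
+For example, a minimal variant of the same fleet configuration that provisions on demand (only `nodes` changes):
+
+```yaml
+type: fleet
+name: nebius-fleet
+
+nodes: 0..2
+placement: cluster
+
+backends: [nebius]
+
+resources:
+  gpu: H100:8
+```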
+
+## Kubernetes
+
+If, for some reason, you’d like to use `dstack` with Nebius’s managed Kubernetes service, you can point `dstack` to the cluster’s kubeconfig file and fully use the cluster through `dstack`.
+
+### Create a cluster
+
+1. Go to `Compute` → `Kubernetes` and click `Create cluster`. Make sure to enable `Public endpoint`.
+2. Go to `Node groups` and click `Create node group`. Make sure to enable `Assign public IPv4 addresses` and `Install NVIDIA GPU drivers and other components`. Select the appropriate instance type, specify the `Number of nodes`, and set `Node storage` to at least `120 GiB`. Make sure to click `Create` under `GPU cluster` if you plan to use a fast interconnect.
+3. Go to `Applications`, find `NVIDIA Device Plugin`, and click `Deploy`.
+4. Wait until the nodes are provisioned.
+
+> Even if you enable autoscaling, `dstack` can only use nodes that are already provisioned. To provision instances on demand, use [VMs](#vms) instead.
+
+### Configure the kubeconfig file
+
+1. Click `How to connect` and copy the `nebius` CLI command that configures the `kubeconfig` file.
+2. Install the `nebius` CLI and run the command:
+
+
+
+```shell
+$ nebius mk8s cluster get-credentials --id <cluster id> --external
+```
+
+
+
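+As a quick sanity check (assuming `kubectl` now points at the Nebius cluster), confirm the nodes are ready and the NVIDIA Device Plugin is running:
+
+```shell
+$ kubectl get nodes
+
+$ kubectl get pods -A | grep -i nvidia
+```
+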
+### Configure a backend
+
+Follow the standard instructions for setting up a [`kubernetes`](https://dstack.ai/docs/concepts/backends/#kubernetes) backend:
+
+
+
+```yaml
+projects:
+ - name: main
+ backends:
+ - type: kubernetes
+ kubeconfig:
+          filename: ~/.kube/config
+```
+
+
+
+### Create a fleet
+
+Once the cluster and the `dstack` server are running, you can create a fleet:
+
+
+
+```yaml
+type: fleet
+name: nebius-fleet
+
+placement: cluster
+nodes: 0..
+
+backends: [kubernetes]
+
+resources:
+ # Specify requirements to filter nodes
+ gpu: 8
+```
+
+
+
+Pass the fleet configuration to `dstack apply`:
+
+
+
+```shell
+$ dstack apply -f nebius-fleet.dstack.yml
+```
+
+
+
+Once the fleet is created, you can run [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services).
+
+## NCCL tests
+
+Use a [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-tasks) to run NCCL tests and validate the cluster’s network bandwidth.
+
+
+
+```yaml
+type: task
+name: nccl-tests
+
+nodes: 2
+startup_order: workers-first
+stop_criteria: master-done
+
+env:
+ - NCCL_DEBUG=INFO
+commands:
+ - |
+ if [ $DSTACK_NODE_RANK -eq 0 ]; then
+ mpirun \
+ --allow-run-as-root \
+ --hostfile $DSTACK_MPI_HOSTFILE \
+ -n $DSTACK_GPUS_NUM \
+ -N $DSTACK_GPUS_PER_NODE \
+ --bind-to none \
+ /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
+ else
+ sleep infinity
+ fi
+
+# Required for `/dev/infiniband` access
+privileged: true
+
+resources:
+ gpu: 8
+ shm_size: 16GB
+```
+
+
+
+Pass the configuration to `dstack apply`:
+
+
+
+```shell
+$ dstack apply -f nebius-nccl-tests.dstack.yml
+
+Provisioning...
+---> 100%
+
+nccl-tests provisioning completed (running)
+
+ out-of-place in-place
+ size count type redop root time algbw busbw #wrong time algbw busbw #wrong
+ (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
+ 8 2 float sum -1 45.72 0.00 0.00 0 29.78 0.00 0.00 0
+ 16 4 float sum -1 29.92 0.00 0.00 0 29.42 0.00 0.00 0
+ 32 8 float sum -1 30.10 0.00 0.00 0 29.75 0.00 0.00 0
+ 64 16 float sum -1 34.48 0.00 0.00 0 29.36 0.00 0.00 0
+ 128 32 float sum -1 30.38 0.00 0.01 0 29.67 0.00 0.01 0
+ 256 64 float sum -1 30.48 0.01 0.02 0 29.97 0.01 0.02 0
+ 512 128 float sum -1 30.45 0.02 0.03 0 30.85 0.02 0.03 0
+ 1024 256 float sum -1 31.36 0.03 0.06 0 31.29 0.03 0.06 0
+ 2048 512 float sum -1 32.27 0.06 0.12 0 32.26 0.06 0.12 0
+ 4096 1024 float sum -1 36.04 0.11 0.21 0 43.17 0.09 0.18 0
+ 8192 2048 float sum -1 37.24 0.22 0.41 0 35.54 0.23 0.43 0
+ 16384 4096 float sum -1 37.22 0.44 0.83 0 34.55 0.47 0.89 0
+ 32768 8192 float sum -1 43.82 0.75 1.40 0 35.64 0.92 1.72 0
+ 65536 16384 float sum -1 37.85 1.73 3.25 0 37.55 1.75 3.27 0
+ 131072 32768 float sum -1 43.10 3.04 5.70 0 53.08 2.47 4.63 0
+ 262144 65536 float sum -1 58.59 4.47 8.39 0 63.33 4.14 7.76 0
+ 524288 131072 float sum -1 97.88 5.36 10.04 0 83.91 6.25 11.72 0
+ 1048576 262144 float sum -1 87.08 12.04 22.58 0 77.82 13.47 25.26 0
+ 2097152 524288 float sum -1 99.06 21.17 39.69 0 97.67 21.47 40.26 0
+ 4194304 1048576 float sum -1 110.14 38.08 71.40 0 114.66 36.58 68.59 0
+ 8388608 2097152 float sum -1 154.48 54.30 101.82 0 156.03 53.76 100.80 0
+ 16777216 4194304 float sum -1 210.33 79.77 149.56 0 200.98 83.48 156.52 0
+ 33554432 8388608 float sum -1 274.23 122.36 229.43 0 276.45 121.38 227.58 0
+ 67108864 16777216 float sum -1 472.43 142.05 266.35 0 480.00 139.81 262.14 0
+ 134217728 33554432 float sum -1 759.58 176.70 331.31 0 756.21 177.49 332.79 0
+ 268435456 67108864 float sum -1 1305.66 205.59 385.49 0 1303.37 205.95 386.16 0
+ 536870912 134217728 float sum -1 2379.38 225.63 423.06 0 2373.42 226.20 424.13 0
+ 1073741824 268435456 float sum -1 4511.97 237.98 446.21 0 4513.82 237.88 446.02 0
+ 2147483648 536870912 float sum -1 8776.26 244.69 458.80 0 8760.42 245.13 459.63 0
+ 4294967296 1073741824 float sum -1 17407.8 246.73 462.61 0 17302.2 248.23 465.44 0
+ 8589934592 2147483648 float sum -1 34448.4 249.36 467.54 0 34381.0 249.85 468.46 0
+ Out of bounds values : 0 OK
+ Avg bus bandwidth : 125.499
+
+ Collective test concluded: all_reduce_perf
+```
+
+
+
+## What's next
+
+1. Learn about [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services).
+2. Check out [backends](https://dstack.ai/docs/concepts/backends) and [fleets](https://dstack.ai/docs/concepts/fleets).
+3. Read Nebius' docs on [networking for VMs](https://docs.nebius.com/compute/clusters/gpu) and the [managed Kubernetes service](https://docs.nebius.com/kubernetes).
diff --git a/mkdocs.yml b/mkdocs.yml
index c98af0a2dd..ef745e6548 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -296,6 +296,7 @@ nav:
- GCP: examples/clusters/gcp/index.md
- Lambda: examples/clusters/lambda/index.md
- Crusoe: examples/clusters/crusoe/index.md
+ - Nebius: examples/clusters/nebius/index.md
- NCCL/RCCL tests: examples/clusters/nccl-rccl-tests/index.md
- Inference:
- SGLang: examples/inference/sglang/index.md