10 changes: 10 additions & 0 deletions docs/examples.md
@@ -122,6 +122,16 @@ hide:
Set up Crusoe clusters with optimized networking
</p>
</a>
<a href="/examples/clusters/nebius"
class="feature-cell sky">
<h3>
Nebius
</h3>

<p>
Set up Nebius clusters with optimized networking
</p>
</a>
<a href="/examples/clusters/nccl-rccl-tests"
class="feature-cell sky">
<h3>
Empty file.
48 changes: 25 additions & 23 deletions examples/clusters/crusoe/README.md
@@ -1,24 +1,25 @@
---
title: Crusoe
description: Setting up Crusoe clusters using Managed Kubernetes or VMs with InfiniBand support
description: Using Crusoe clusters with InfiniBand support via Kubernetes or VMs
---

# Crusoe

Crusoe offers two ways to use clusters with fast interconnect:
`dstack` allows using Crusoe clusters with fast interconnect in two ways:

* [Crusoe Managed Kubernetes](#kubernetes) – Lets you interact with clusters through the Kubernetes API and includes support for NVIDIA and AMD GPU operators and related tools.
* [Virtual Machines (VMs)](#vms) – Gives you direct access to clusters in the form of virtual machines with NVIDIA and AMD GPUs.
* [Kubernetes](#kubernetes) – If you create a Kubernetes cluster on Crusoe, configure a `kubernetes` backend, and create a backend fleet, `dstack` lets you fully use this cluster.
* [VMs](#vms) – If you create a VM cluster on Crusoe and create an SSH fleet, `dstack` lets you fully use this cluster.

## Kubernetes

Both options use the same underlying networking infrastructure. This example walks you through how to set up Crusoe clusters to use with `dstack`.
### Create a cluster

## Crusoe Managed Kubernetes { #kubernetes }
1. Go to `Networking` → `Firewall Rules`, click `Create Firewall Rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
2. Go to `Orchestration` and click `Create Cluster`. Make sure to enable the `NVIDIA GPU Operator` add-on.
3. Go to the cluster and click `Create Node Pool`. Select the right instance type and the `Desired Number of Nodes`.
4. Wait until nodes are provisioned.

!!! info "Prerequsisites"
1. Go `Networking` → `Firewall Rules`, click `Create Firewall Rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
2. Go to `Orchestration` and click `Create Cluster`. Make sure to enable the `NVIDIA GPU Operator` add-on.
3. Go the the cluster, and click `Create Node Pool`. Select the right type of the instance. If you intend to auto-scale the cluster, make sure to set `Desired Number of Nodes` at least to `1`, since `dstack` doesn't currently support clusters that scale down to `0` nodes.
4. Wait until at least one node is running.
> Even if you enable `autoscaling`, `dstack` can use only the nodes that are already provisioned.
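
Once at least one node is running, a quick way to confirm the cluster is ready is to check it with `kubectl` — a sanity check, assuming your kubeconfig already points at the new Crusoe cluster and that the GPU Operator uses its default `gpu-operator` namespace:

```shell
# Confirm at least one node reports STATUS=Ready
$ kubectl get nodes

# Confirm the NVIDIA GPU Operator pods are up (the namespace may differ)
$ kubectl get pods -n gpu-operator
```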

### Configure the backend
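
As a rough reference, the `kubernetes` backend is declared in the `dstack` server's `~/.dstack/server/config.yml`. The sketch below is an approximation; the exact schema, including the jump-host settings that use port `30022`, is documented in the `dstack` backend reference:

```yaml
# ~/.dstack/server/config.yml — minimal sketch (assumed layout; see the dstack docs)
projects:
- name: main
  backends:
  - type: kubernetes
    kubeconfig:
      # Path to the kubeconfig downloaded from Crusoe (placeholder path)
      filename: ~/.kube/crusoe-config.yaml
    # The jump-host networking (the port 30022 opened above) is also configured
    # on this backend; refer to the dstack docs for the exact option names.
```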

@@ -56,7 +57,7 @@ backends: [kubernetes]

resources:
# Specify requirements to filter nodes
gpu: 1..8
gpu: 8
```

</div>
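
Put together, a complete backend fleet configuration might look roughly like the sketch below (the fleet name and node count are placeholders):

```yaml
type: fleet
# Placeholder name
name: crusoe-k8s-fleet

# Number of cluster nodes to use
nodes: 2

# Provision through the kubernetes backend configured above
backends: [kubernetes]

resources:
  # Specify requirements to filter nodes
  gpu: 8
```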
@@ -75,12 +76,13 @@ Once the fleet is created, you can run [dev environments](https://dstack.ai/docs

## VMs

Another way to work with Crusoe clusters is through VMs. While `dstack` typically supports VM-based compute providers via [dedicated backends](https://dstack.ai/docs/concepts/backends#vm-based) that automate provisioning, Crusoe does not yet have [such a backend](https://github.com/dstackai/dstack/issues/3378). As a result, to use a VM-based Crusoe cluster with `dstack`, you should use [SSH fleets](https://dstack.ai/docs/concepts/fleets).
Another way to work with Crusoe clusters is through VMs. While `dstack` typically supports VM-based compute providers via [dedicated backends](https://dstack.ai/docs/concepts/backends#vm-based) that automate provisioning, Crusoe does not yet have [such a backend](https://github.com/dstackai/dstack/issues/3378). As a result, to use a VM-based Crusoe cluster with `dstack`, you should use [SSH fleets](https://dstack.ai/docs/concepts/fleets#ssh-fleets).

!!! info "Prerequsisites"
1. Go to `Compute`, then `Instances`, and click `Create Instance`. Make sure to select the right instance type and VM image (that [support interconnect](https://docs.crusoecloud.com/networking/infiniband/managing-infiniband-networks/index.html)). Make sure to create as many instances as needed.
### Create instances

### Create a fleet
1. Go to `Compute`, then `Instances`, and click `Create Instance`. Make sure to select an instance type and VM image that [support interconnect](https://docs.crusoecloud.com/networking/infiniband/managing-infiniband-networks/index.html), and create as many instances as needed.

### Create a `dstack` fleet

Follow the standard instructions for setting up an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#ssh-fleets):
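
A minimal sketch of such a fleet configuration (the user, key path, and IP addresses are placeholders for your own Crusoe instances):

```yaml
type: fleet
name: crusoe-fleet

# Enable cluster networking between the instances
placement: cluster

ssh_config:
  # Placeholder user and key; use the ones configured on your VMs
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    # Placeholder public IPs of the Crusoe VMs
    - 192.0.2.10
    - 192.0.2.11
```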

@@ -115,9 +117,9 @@ $ dstack apply -f crusoe-fleet.dstack.yml

Once the fleet is created, you can run [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services).
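
For example, a minimal dev environment configuration could look like this (a sketch; the name is a placeholder):

```yaml
type: dev-environment
# Placeholder name
name: cluster-ide

ide: vscode

resources:
  gpu: 8
```

Apply it with `dstack apply` to get an IDE attached to a node of the fleet.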

## Run NCCL tests
## NCCL tests

Use a [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-task) that runs NCCL tests to validate cluster network bandwidth.
Use a [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-tasks) that runs NCCL tests to validate cluster network bandwidth.

=== "Crusoe Managed Kubernetes"

@@ -253,9 +255,9 @@ Provisioning...

nccl-tests provisioning completed (running)

# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
out-of-place in-place
size count type redop root time algbw busbw #wrong time algbw busbw #wrong
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 27.70 0.00 0.00 0 29.82 0.00 0.00 0
16 4 float sum -1 28.78 0.00 0.00 0 28.99 0.00 0.00 0
32 8 float sum -1 28.49 0.00 0.00 0 28.16 0.00 0.00 0
@@ -285,8 +287,8 @@ nccl-tests provisioning completed (running)
536870912 134217728 float sum -1 5300.49 101.29 189.91 0 5314.91 101.01 189.40 0
1073741824 268435456 float sum -1 10472.2 102.53 192.25 0 10485.6 102.40 192.00 0
2147483648 536870912 float sum -1 20749.1 103.50 194.06 0 20745.7 103.51 194.09 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 53.7387
Out of bounds values : 0 OK
Avg bus bandwidth : 53.7387
```

</div>
34 changes: 17 additions & 17 deletions examples/clusters/lambda/README.md
@@ -5,18 +5,17 @@ description: Setting up Lambda clusters using Kubernetes or 1-Click Clusters wit

# Lambda

[Lambda](https://lambda.ai/) offers two ways to use clusters with a fast interconnect:
`dstack` allows using Lambda clusters with fast interconnect in two ways:

* [Kubernetes](#kubernetes) – Lets you interact with clusters through the Kubernetes API and includes support for NVIDIA GPU operators and related tools.
* [1-Click Clusters (1CC)](#1-click-clusters) – Gives you direct access to clusters in the form of bare-metal nodes.

Both options use the same underlying networking infrastructure. This example walks you through how to set up Lambda clusters to use with `dstack`.
* [Kubernetes](#kubernetes) – If you create a Kubernetes cluster on Lambda, configure a `kubernetes` backend, and create a backend fleet, `dstack` lets you fully use this cluster.
* [VMs](#vms) – If you create a 1CC cluster on Lambda and create an SSH fleet, `dstack` lets you fully use this cluster.

## Kubernetes

!!! info "Prerequsisites"
1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/managed-kubernetes/#accessing-mk8s) on accessing MK8s.
2. Go to `Firewall` → `Edit rules`, click `Add rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.
### Prerequisites

1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/managed-kubernetes/#accessing-mk8s) on accessing MK8s.
2. Go to `Firewall` → `Edit rules`, click `Add rule`, and allow ingress traffic on port `30022`. This port will be used by the `dstack` server to access the jump host.

### Configure the backend

@@ -75,8 +74,9 @@ Once the fleet is created, you can run [dev environments](https://dstack.ai/docs

Another way to work with Lambda clusters is through [1CC](https://lambda.ai/1-click-clusters). While `dstack` supports automated cluster provisioning via [VM-based backends](https://dstack.ai/docs/concepts/backends#vm-based), there is currently no programmatic way to provision Lambda 1CCs. As a result, to use a 1CC cluster with `dstack`, you must use [SSH fleets](https://dstack.ai/docs/concepts/fleets).

!!! info "Prerequsisites"
1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/) on working with 1-Click Clusters
### Prerequisites

1. Follow the instructions in [Lambda's guide](https://docs.lambda.ai/public-cloud/1-click-clusters/) on working with 1-Click Clusters.

### Create a fleet

@@ -171,11 +171,11 @@ $ dstack apply -f lambda-nccl-tests.dstack.yml
Provisioning...
---> 100%

# nccl-tests version 2.17.6 nccl-headers=22602 nccl-library=22602
# Collective test starting: all_reduce_perf
#
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
nccl-tests version 2.17.6 nccl-headers=22602 nccl-library=22602
Collective test starting: all_reduce_perf

size count type redop root time algbw busbw #wrong time algbw busbw #wrong
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 36.50 0.00 0.00 0 36.16 0.00 0.00 0
16 4 float sum -1 35.55 0.00 0.00 0 35.49 0.00 0.00 0
32 8 float sum -1 35.49 0.00 0.00 0 36.28 0.00 0.00 0
@@ -205,8 +205,8 @@ Provisioning...
536870912 134217728 float sum -1 1625.63 330.25 619.23 0 1687.31 318.18 596.59 0
1073741824 268435456 float sum -1 2972.25 361.26 677.35 0 2971.33 361.37 677.56 0
2147483648 536870912 float sum -1 5784.75 371.23 696.06 0 5728.40 374.88 702.91 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 137.179
Out of bounds values : 0 OK
Avg bus bandwidth : 137.179
```

</div>