docs: add GPU node setup guide #171

Merged: bcho merged 13 commits into main from chokevin/gpu-node-setup-guide on May 22, 2026

@chokevin
Collaborator

What

Adds a concise PM-facing GPU Flex Node setup guide at docs/usages/gpu-node-setup.md, linked from README.md and docs/usage.md.

Key points the guide drives home:

  • AKS Flex Node does not install the NVIDIA kernel driver. Pick an image with the driver baked in (microsoft-dsvm/ubuntu-hpc/2204/latest is the current Flex H100/H200 validation image; other Ubuntu HPC SKUs and custom prebaked images are listed as alternatives to validate).
  • AKS managed GPU node pools are not a host-image option — they install the driver at boot through AKS managed GPU bootstrap, so there is no baked AKS GPU image to consume.
  • You must manually install the cluster GPU stack after the node joins: GPU Operator (with driver.enabled=false), nvidia-container-toolkit, and GPU Feature Discovery. The guide shows the Helm install and confirms driver.enabled=false. Without this step, the node is Ready but pods will not get GPUs.
  • DRA support is an optional add-on, called out as a sidebar (gpu.nvidia.com, mig.nvidia.com DeviceClasses; legacy nvidia.com/gpu capacity may be 0 in DRA-only clusters).
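The "Ready but no GPUs" state described above can be checked from a Node object, for example the JSON printed by `kubectl get node <name> -o json`. The following is a minimal sketch, not part of the guide: the helper name `gpu_status` and both sample nodes are hypothetical, and it only inspects the legacy `nvidia.com/gpu` resource.

```python
from typing import Any


def gpu_status(node: dict[str, Any]) -> dict[str, Any]:
    """Summarize readiness and legacy GPU capacity for one Node object."""
    status = node.get("status", {})
    conditions = status.get("conditions", [])
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )
    # Extended resources are reported as strings such as "8".
    gpus = int(status.get("allocatable", {}).get("nvidia.com/gpu", "0"))
    return {"ready": ready, "gpus": gpus, "schedulable_for_gpu": ready and gpus > 0}


# Hypothetical node right after join, before the cluster GPU stack is installed.
joined_only = {
    "status": {
        "conditions": [{"type": "Ready", "status": "True"}],
        "allocatable": {"cpu": "96", "memory": "1Ti"},
    }
}

# Hypothetical node after the device plugin has advertised its GPUs.
with_stack = {
    "status": {
        "conditions": [{"type": "Ready", "status": "True"}],
        "allocatable": {"cpu": "96", "nvidia.com/gpu": "8"},
    }
}

print(gpu_status(joined_only))
print(gpu_status(with_stack))
```

In a DRA-only cluster the legacy `nvidia.com/gpu` count may legitimately be 0, so a zero from this check is only conclusive when the device plugin path is in use.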

Also includes a Karpenter-style Flex node config example, validation commands, troubleshooting table, and caveats.

This PR replaces #170, which was opened from a fork. Same commits, now hosted as a branch directly on Azure/AKSFlexNode.

Why

GPU Flex Node setup has two contracts that are easy to get wrong: the host NVIDIA driver (must come from the image) and the cluster GPU stack (must be installed manually after join). PMs and technical readers need a short doc that names both.

Non-goals

  • No runtime/provisioning code changes.
  • Does not claim GPU Operator installs the host driver.
  • Does not document hostRouting or other unvalidated CRD paths.
  • Does not hardcode validation cluster names.
  • Does not publish a supported image matrix.
  • Does not promote AKS managed GPU images as a Flex Node host image.

Testing

  • git diff --check.
  • Relative markdown link check.
  • Specificity guard (no validation cluster names, no hostRouting, no aksFlexNode, no over-strong GPU Operator driver claims, no AKS managed image option).
  • Code-fence parity.
  • make test was attempted but fails in the existing pkg/config tests because this workstation's hostname (Kevins-MacBook-Pro.local) is rejected as a Kubernetes DNS subdomain. The failure is unrelated to the docs-only changes.
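For context on the hostname failure: Kubernetes DNS subdomain names follow RFC 1123 (lowercase alphanumerics, "-" and ".", at most 253 characters, each label starting and ending with an alphanumeric). The sketch below mirrors that rule in Python. It is an illustration only, and assumes the pkg/config check applies the standard rule, which the PR does not confirm; the function name is hypothetical.

```python
import re

# One DNS label: lowercase alphanumerics and "-", alphanumeric at both ends.
_LABEL = r"[a-z0-9]([-a-z0-9]*[a-z0-9])?"
# A subdomain is one or more labels separated by dots.
_SUBDOMAIN_RE = re.compile(rf"{_LABEL}(\.{_LABEL})*")
_MAX_LEN = 253


def is_dns1123_subdomain(name: str) -> bool:
    """Return True if name is a valid RFC 1123 DNS subdomain."""
    return len(name) <= _MAX_LEN and _SUBDOMAIN_RE.fullmatch(name) is not None


# The uppercase letters make the workstation hostname invalid under this rule.
print(is_dns1123_subdomain("Kevins-MacBook-Pro.local"))
# The lowercased form passes.
print(is_dns1123_subdomain("kevins-macbook-pro.local"))
```

Under this rule the uppercase letters alone are enough to reject the name, which is consistent with the failure being environmental.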

Risk

Documentation-only. The main risk is inaccurate guidance; the guide explicitly names the host-driver caveat, the mandatory manual cluster GPU stack install, and the optional nature of DRA.

Copilot AI review requested due to automatic review settings May 22, 2026 00:24
Contributor

Copilot AI left a comment

Pull request overview

Adds end-user documentation to guide GPU-capable AKS Flex Node setup, emphasizing the host-image NVIDIA driver requirement and the need to manually install the in-cluster GPU software stack after node join.

Changes:

  • Adds a new GPU Flex Node setup guide with image/driver contract, cluster GPU stack install steps, validation, and troubleshooting.
  • Links the new GPU guide from the top-level README and the main usage guide.
  • Adds reusable bootstrap-token RBAC and config templates under docs/examples/ for the guide to reference.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

| File | Description |
| --- | --- |
| README.md | Adds a documentation link to the new GPU setup guide. |
| docs/usages/gpu-node-setup.md | New GPU setup guide covering image/driver contract, GPU operator install, provisioning paths, validation, and troubleshooting. |
| docs/usage.md | Adds a cross-link from the general usage guide to the GPU-specific guide. |
| docs/examples/bootstrap-token-rbac.yaml | Adds a templated bootstrap-token Secret + RBAC bindings for the guide’s bootstrap-token flow. |
| docs/examples/bootstrap-token-config.json | Adds a templated minimal config.json for bootstrap-token-based node join. |
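
As background for the bootstrap-token templates listed above, here is a minimal sketch of the general shape of a Kubernetes bootstrap-token Secret. It is not the content of docs/examples/bootstrap-token-rbac.yaml, which this page does not show. The function name and the group `system:bootstrappers:example-flex-nodes` are hypothetical; the Secret type, namespace, name prefix, and key names follow the upstream Kubernetes bootstrap-token convention.

```python
import json
import secrets
import string

# Bootstrap token IDs and secrets use lowercase letters and digits only.
_ALPHABET = string.ascii_lowercase + string.digits


def _rand(n: int) -> str:
    """Return n cryptographically random characters from the token alphabet."""
    return "".join(secrets.choice(_ALPHABET) for _ in range(n))


def bootstrap_token_secret(
    group: str = "system:bootstrappers:example-flex-nodes",  # hypothetical group
) -> dict:
    """Build a bootstrap-token Secret manifest as a plain dict."""
    token_id, token_secret = _rand(6), _rand(16)
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            # The Secret name must be "bootstrap-token-<token-id>".
            "name": f"bootstrap-token-{token_id}",
            "namespace": "kube-system",
        },
        "type": "bootstrap.kubernetes.io/token",
        "stringData": {
            "token-id": token_id,
            "token-secret": token_secret,
            "usage-bootstrap-authentication": "true",
            "auth-extra-groups": group,
        },
    }


secret = bootstrap_token_secret()
print(json.dumps(secret, indent=2))
```

The joining node would present the token as `<token-id>.<token-secret>`. RBAC bindings for the extra group are what grant it permission to request a certificate.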

Comment thread docs/usages/gpu-node-setup.md
Comment thread docs/usages/gpu-node-setup.md Outdated
Comment thread docs/usages/gpu-node-setup.md
Member

@bcho bcho left a comment

@copilot please rebase with the latest main branch and follow the latest structure / steps to refresh the doc content

Contributor

Copilot AI commented May 22, 2026

@copilot please rebase with the latest main branch and follow the latest structure / steps to refresh the doc content

Updated in 958a85b: synced this branch with the latest main and refreshed the GPU docs to follow the current docs structure/steps (including updated links from README.md and docs/usage.md, plus prerequisite/command updates in docs/usages/gpu-node-setup.md).

Copilot finished work on behalf of bcho May 22, 2026 17:26
Copilot AI requested a review from bcho May 22, 2026 17:26
Member

@bcho bcho left a comment

@copilot please update based on the comments, focus on removing mentioning of nvidia ctk and karpenter

Comment thread docs/usages/gpu-node-setup.md Outdated
Comment thread docs/usages/gpu-node-setup.md Outdated
Comment thread docs/usages/gpu-node-setup.md Outdated
Comment thread docs/usages/gpu-node-setup.md Outdated
Comment thread docs/usages/gpu-node-setup.md Outdated
Comment thread docs/usages/gpu-node-setup.md Outdated
Comment thread docs/usages/gpu-node-setup.md Outdated
Copilot AI review requested due to automatic review settings May 22, 2026 17:45
@chokevin chokevin review requested due to automatic review settings May 22, 2026 17:45
Copilot AI review requested due to automatic review settings May 22, 2026 17:46
@chokevin chokevin review requested due to automatic review settings May 22, 2026 17:46
Copilot AI review requested due to automatic review settings May 22, 2026 17:47
@chokevin chokevin review requested due to automatic review settings May 22, 2026 17:47
Copilot AI review requested due to automatic review settings May 22, 2026 17:48
@chokevin chokevin review requested due to automatic review settings May 22, 2026 17:48
Contributor

Copilot AI commented May 22, 2026

@copilot please update based on the comments, focus on removing mentioning of nvidia ctk and karpenter

Updated in ead203e: removed NVIDIA CTK and Karpenter mentions from docs/usages/gpu-node-setup.md, refined the post-join GPU stack wording to Operator/Device Plugin/GFD/DRA, and updated the template-rendering steps to remove envsubst usage.

Copilot finished work on behalf of bcho May 22, 2026 17:49
Copilot AI requested a review from bcho May 22, 2026 17:49
Member

@bcho bcho left a comment

Comment thread docs/usages/gpu-node-setup.md Outdated
Comment thread docs/usages/gpu-node-setup.md Outdated
Contributor

Copilot AI commented May 22, 2026

@copilot

Please share the specific updates you want in this PR, and I’ll apply them right away.

Copilot finished work on behalf of bcho May 22, 2026 19:10
Copilot AI requested a review from bcho May 22, 2026 19:10
Member

@bcho bcho left a comment

@copilot read the latest comments and update

Copilot finished work on behalf of bcho May 22, 2026 19:31
Copilot AI requested a review from bcho May 22, 2026 19:31
Copilot AI review requested due to automatic review settings May 22, 2026 19:39
@chokevin chokevin review requested due to automatic review settings May 22, 2026 19:39
Contributor

Copilot AI commented May 22, 2026

@copilot read the latest comments and update

Addressed in b7c41bf. I applied the latest feedback by rephrasing the in-cluster GPU component guidance to avoid implying all components are always required, and updated the bootstrap/config steps to the aks-flex-config flow.

@chokevin
Collaborator Author

I'm OK with the changes as is; the comments, as addressed, are reasonable.

Copilot finished work on behalf of bcho May 22, 2026 19:51
Copilot AI requested a review from bcho May 22, 2026 19:51
@bcho bcho merged commit 0027010 into main May 22, 2026
9 checks passed
@bcho bcho deleted the chokevin/gpu-node-setup-guide branch May 22, 2026 20:02
5 participants