docs: add GPU node setup guide #171
Pull request overview
Adds end-user documentation to guide GPU-capable AKS Flex Node setup, emphasizing the host-image NVIDIA driver requirement and the need to manually install the in-cluster GPU software stack after node join.
Changes:
- Adds a new GPU Flex Node setup guide with image/driver contract, cluster GPU stack install steps, validation, and troubleshooting.
- Links the new GPU guide from the top-level README and the main usage guide.
- Adds reusable bootstrap-token RBAC and config templates under `docs/examples/` for the guide to reference.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| README.md | Adds a documentation link to the new GPU setup guide. |
| docs/usages/gpu-node-setup.md | New GPU setup guide covering image/driver contract, GPU operator install, provisioning paths, validation, and troubleshooting. |
| docs/usage.md | Adds a cross-link from the general usage guide to the GPU-specific guide. |
| docs/examples/bootstrap-token-rbac.yaml | Adds a templated bootstrap-token Secret + RBAC bindings for the guide’s bootstrap-token flow. |
| docs/examples/bootstrap-token-config.json | Adds a templated minimal config.json for bootstrap-token-based node join. |
Updated in
Updated in
Please share the specific updates you want in this PR, and I’ll apply them right away.
Addressed in
I'm OK with the changes as is; the comments, as addressed, are reasonable.
What
Adds a concise PM-facing GPU Flex Node setup guide at `docs/usages/gpu-node-setup.md`, linked from `README.md` and `docs/usage.md`.
Key points the guide drives home:
- […] `microsoft-dsvm/ubuntu-hpc/2204/latest` is the current Flex H100/H200 validation image; other Ubuntu HPC SKUs and custom prebaked images are listed as alternatives to validate).
- […] `driver.enabled=false`), nvidia-container-toolkit, and GPU Feature Discovery. The guide shows the Helm install and confirms `driver.enabled=false`. Without this step, the node is `Ready` but pods will not get GPUs.
- […] `gpu.nvidia.com`, `mig.nvidia.com` DeviceClasses; legacy `nvidia.com/gpu` capacity may be `0` in DRA-only clusters).
Also includes a Karpenter-style Flex node config example, validation commands, troubleshooting table, and caveats.
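The manual cluster GPU stack install described above can be sketched roughly as follows. This is an illustration, not the guide's exact commands: the Helm repo URL, chart name, release name, and `gpu-operator` namespace follow NVIDIA's public GPU Operator instructions and are assumptions here, and the `run`, `install_gpu_stack`, and `show_gpu_allocatable` helpers are hypothetical names. By default the sketch only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch only: repo URL, chart, release name, and namespace are assumptions
# taken from NVIDIA's public GPU Operator docs, not from the guide in this PR.
set -euo pipefail

# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_gpu_stack() {
  run helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
  run helm repo update
  # driver.enabled=false: the host image is expected to ship the NVIDIA driver,
  # so the operator must not try to install its own.
  run helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    --set driver.enabled=false --wait
}

show_gpu_allocatable() {
  # Lists each node with its allocatable nvidia.com/gpu count.
  # In DRA-only clusters this legacy resource may be absent or 0.
  run kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

Calling `install_gpu_stack` and then `show_gpu_allocatable` with `DRY_RUN=0` against a joined cluster would run the install and print per-node GPU capacity; leaving `DRY_RUN` unset is a safe way to review the commands first.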
This PR replaces #170, which was opened from a fork. Same commits, now hosted as a branch directly on `Azure/AKSFlexNode`.
Why
GPU Flex Node setup has two contracts that are easy to get wrong: the host NVIDIA driver (must come from the image) and the cluster GPU stack (must be installed manually after join). PMs and technical readers need a short doc that names both.
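The first contract, the host driver coming from the image, can be spot-checked on the node itself before any cluster-side work. A minimal sketch, assuming `nvidia-smi` is on the `PATH` whenever the image ships the driver; `check_host_driver` is a hypothetical helper name, not something defined in this repository:

```shell
# Sketch only: reports the GPU model and driver version found on the host,
# or fails if the image does not appear to include the NVIDIA driver.
check_host_driver() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: the image does not appear to ship the NVIDIA driver" >&2
    return 1
  fi
  # One CSV line per GPU: name, driver version.
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}
```

A non-zero exit from `check_host_driver` on a freshly provisioned node suggests the image contract is not met, in which case the cluster GPU stack install will not help.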
Non-goals
- […] `hostRouting` or other unvalidated CRD paths.
Testing
- `git diff --check`.
- […] `hostRouting`, no `aksFlexNode`, no over-strong GPU Operator driver claims, no AKS managed image option).
- `make test` was attempted and fails in existing `pkg/config` tests because this workstation hostname (`Kevins-MacBook-Pro.local`) is rejected as a Kubernetes DNS subdomain. Unrelated to the docs-only changes.
Risk
Documentation-only. The main risk is inaccurate guidance; the guide explicitly names the host-driver caveat, the mandatory manual cluster GPU stack install, and the optional nature of DRA.
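For context on the `make test` failure noted under Testing: Kubernetes DNS subdomain names (RFC 1123) allow only lowercase alphanumerics, `-`, and `.`, with each label starting and ending in an alphanumeric and a total length of at most 253 characters. The uppercase letters in `Kevins-MacBook-Pro.local` are the most likely reason for the rejection. The sketch below reproduces that rule in shell; `is_dns1123_subdomain` is a hypothetical helper, not the check used in `pkg/config`.

```shell
# Sketch only: mirrors the RFC 1123 subdomain rule that Kubernetes applies
# to such names. Returns 0 if the name is valid, non-zero otherwise.
is_dns1123_subdomain() {
  name="$1"
  label='[a-z0-9]([-a-z0-9]*[a-z0-9])?'
  pattern="^${label}(\\.${label})*\$"
  # Reject empty names and names longer than 253 characters.
  [ -n "$name" ] || return 1
  [ "${#name}" -le 253 ] || return 1
  # LC_ALL=C keeps [a-z] from matching uppercase letters in some locales.
  printf '%s\n' "$name" | LC_ALL=C grep -Eq "$pattern"
}
```

With this rule, `is_dns1123_subdomain "Kevins-MacBook-Pro.local"` fails while the lowercased `kevins-macbook-pro.local` passes.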