Add support for requirements checks to CDI #1795
Conversation
Signed-off-by: Arjun <agadiyar@nvidia.com>
elezar left a comment:
One note here. It is not sufficient to run the check in the modifier; we would have to ensure that we generate a hook that implements the check in some form. The driver version etc. are known at the point of spec generation, but the envvars would have to be inspected in the container itself.
As an additional note, this would be an ideal candidate to move to a createRuntime hook.
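To make that suggestion concrete, here is a minimal sketch, assuming the CDI specs-go types, of emitting such a check as a createRuntime hook at spec-generation time. The addRequirementsCheckHook helper, the cdispec package name, and the check-requirements subcommand are hypothetical and are not existing nvidia-ctk functionality:

package cdispec

import (
	specs "tags.cncf.io/container-device-interface/specs-go"
)

// addRequirementsCheckHook appends a createRuntime hook to the generated CDI
// container edits so that the NVIDIA_REQUIRE_* check runs against the
// container's actual environment rather than in the modifier. The
// "check-requirements" subcommand is hypothetical.
func addRequirementsCheckHook(edits *specs.ContainerEdits, hookPath string) {
	edits.Hooks = append(edits.Hooks, &specs.Hook{
		HookName: "createRuntime",
		Path:     hookPath,
		Args:     []string{hookPath, "check-requirements"},
	})
}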
// checkRequirements evaluates NVIDIA_REQUIRE_* constraints using the host
// CUDA driver API version from libcuda, the NVIDIA display driver version from
// the driver root (libcuda / libnvidia-ml soname), the compute capability of
// CUDA device 0, and (when requirements reference brand) the GPU product brand
// from NVML. It is used for CSV and CDI / JIT-CDI modes.
Note that there are cases where libcuda.so is not applicable (for example, if we're not injecting actual GPU devices).
// brandTypeToRequirementString maps NVML brand enums to lowercase tokens
// consistent with typical NVIDIA_REQUIRE_* image constraints.
func brandTypeToRequirementString(b nvml.BrandType) (string, bool) {
Question: is this something that we already have access to in go-nvlib?
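For reference, a minimal sketch of the kind of mapping being discussed, using go-nvml brand constants; the brandToToken name and the exact set of brands handled are illustrative and may differ from the PR:

package brandcheck

import "github.com/NVIDIA/go-nvml/pkg/nvml"

// brandToToken maps a subset of NVML brand enums to the lowercase tokens used
// in NVIDIA_REQUIRE_* constraints. Illustrative only.
func brandToToken(b nvml.BrandType) (string, bool) {
	switch b {
	case nvml.BRAND_TESLA:
		return "tesla", true
	case nvml.BRAND_QUADRO:
		return "quadro", true
	case nvml.BRAND_GEFORCE:
		return "geforce", true
	case nvml.BRAND_TITAN:
		return "titan", true
	default:
		return "", false
	}
}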
	r.AddVersionProperty(requirements.CUDA, cudaVersion)
}

compteCapability, err := cuda.ComputeCapability(0)
Here we're always using the first device (which was fine for older Tegra-based systems), but this does not map well to multi-device systems, especially if they're heterogeneous.
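For heterogeneous multi-device systems, one illustrative direction (not part of this PR) would be to take the minimum compute capability across all devices via NVML; minComputeCapability and the capability package name below are hypothetical:

package capability

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// minComputeCapability returns the lowest "major.minor" compute capability
// across all devices visible to NVML. Illustrative only; the PR queries
// CUDA device 0 instead.
func minComputeCapability() (string, error) {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		return "", fmt.Errorf("failed to initialize NVML: %v", ret)
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		return "", fmt.Errorf("failed to get device count: %v", ret)
	}

	minMajor, minMinor := -1, -1
	for i := 0; i < count; i++ {
		device, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			return "", fmt.Errorf("failed to get device %d: %v", i, ret)
		}
		major, minor, ret := device.GetCudaComputeCapability()
		if ret != nvml.SUCCESS {
			return "", fmt.Errorf("failed to get compute capability for device %d: %v", i, ret)
		}
		if minMajor < 0 || major < minMajor || (major == minMajor && minor < minMinor) {
			minMajor, minMinor = major, minor
		}
	}
	if minMajor < 0 {
		return "", fmt.Errorf("no NVIDIA devices found")
	}
	return fmt.Sprintf("%d.%d", minMajor, minMinor), nil
}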
This PR addresses the fact that, at present, only the legacy mode checks the NVIDIA_REQUIRE_* envvars. It creates a common checkRequirements function (with helpers to convert versions to semver format and to derive brand requirements via NVML) that both CSV mode and CDI mode use to evaluate the NVIDIA_REQUIRE_* envvars.
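A condensed sketch of that shared flow, assuming the requirements and cuda helpers visible in the quoted diff above; GetRequirements, AddArchProperty, Assert, and the import paths are assumed names, and the driver-version and brand lookups are omitted for brevity:

package modifier

import (
	"fmt"

	"github.com/NVIDIA/nvidia-container-toolkit/internal/config/image"
	"github.com/NVIDIA/nvidia-container-toolkit/internal/cuda"
	"github.com/NVIDIA/nvidia-container-toolkit/internal/logger"
	"github.com/NVIDIA/nvidia-container-toolkit/internal/requirements"
)

// checkRequirements sketches the shared check: collect host properties
// best-effort and assert them against the image's NVIDIA_REQUIRE_* constraints.
func checkRequirements(l logger.Interface, img image.CUDA) error {
	reqs, err := img.GetRequirements()
	if err != nil {
		return fmt.Errorf("failed to get requirements from image: %w", err)
	}
	r := requirements.New(l, reqs)

	// If a property source (e.g. libcuda) is unavailable, skip that property
	// rather than failing the whole check.
	if cudaVersion, err := cuda.Version(); err != nil {
		l.Warningf("Failed to get CUDA version: %v", err)
	} else {
		r.AddVersionProperty(requirements.CUDA, cudaVersion)
	}
	if cc, err := cuda.ComputeCapability(0); err != nil {
		l.Warningf("Failed to get compute capability: %v", err)
	} else {
		r.AddArchProperty(requirements.ARCH, cc)
	}

	return r.Assert()
}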
This was tested by deploying a pod with an invalid envvar value and verifying that the checks would stop deployment:
# CDI-only negative test: container create should fail when NVIDIA_REQUIRE_*
# cannot be satisfied (e.g. cuda>=99.0 on any real host). Uses RuntimeClass
# nvidia (CDI / toolkit mode), not nvidia-legacy.
apiVersion: v1
kind: Pod
metadata:
  name: require-cuda-fail
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
    - name: c
      image: ubuntu:22.04
      command: ["sleep", "3600"]
      resources:
        limits:
          nvidia.com/gpu: "1"
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "all"
        - name: NVIDIA_REQUIRE_CUDA
          value: "cuda>=99.0"
        - name: NVIDIA_REQUIRE_DRIVER
          value: "driver>=9999.0.0"
..................................
Events:
  Type     Reason     Age  From               Message
  Normal   Scheduled  12m  default-scheduler  Successfully assigned default/require-cuda-fail to ipp1-3167
  Normal   Pulled     12m  kubelet            Container image "ubuntu:22.04" already present on machine
  Normal   Created    12m  kubelet            Created container c
  Warning  Failed     12m  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: requirements not met: unsatisfied condition: driver>=9999.0.0 (driver=595.58.3)