
Add support for requirements checks to CDI #1795

Draft

JunAr7112 wants to merge 1 commit into NVIDIA:main from JunAr7112:requirement_checks

Add support for requirements checks to CDI#1795
JunAr7112 wants to merge 1 commit into
NVIDIA:mainfrom
JunAr7112:requirement_checks

Conversation

@JunAr7112
Contributor

@JunAr7112 JunAr7112 commented Apr 29, 2026

This PR addresses this issue. At present, only the legacy mode checks the NVIDIA_REQUIRE_* environment variables. This PR introduces a common checkRequirements function (with helpers to convert versions to semver format and to look up brand requirements via NVML) that both the CSV mode and the CDI mode use to evaluate the NVIDIA_REQUIRE_* environment variables.
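
For illustration only, the following is a minimal sketch of how NVIDIA_REQUIRE_* variables can be collected from a container's environment and split into individual constraints. The collectRequirements helper and the sample env slice are assumptions made for this example; they are not the names or structure used in the PR.

package main

import (
	"fmt"
	"strings"
)

// collectRequirements is a hypothetical helper: it scans a container's
// environment (KEY=VALUE strings) and returns every constraint expressed
// through NVIDIA_REQUIRE_* variables, e.g. "cuda>=11.0" or "driver>=450".
func collectRequirements(env []string) []string {
	var constraints []string
	for _, kv := range env {
		key, value, ok := strings.Cut(kv, "=")
		if !ok || !strings.HasPrefix(key, "NVIDIA_REQUIRE_") {
			continue
		}
		// A single variable may carry several space-separated constraints.
		constraints = append(constraints, strings.Fields(value)...)
	}
	return constraints
}

func main() {
	env := []string{
		"NVIDIA_VISIBLE_DEVICES=all",
		"NVIDIA_REQUIRE_CUDA=cuda>=99.0",
		"NVIDIA_REQUIRE_DRIVER=driver>=9999.0.0",
	}
	fmt.Println(collectRequirements(env)) // [cuda>=99.0 driver>=9999.0.0]
}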

This was tested by deploying a pod with an unsatisfiable NVIDIA_REQUIRE_* value and verifying that the check blocks container creation:

# CDI-only negative test: container create should fail when NVIDIA_REQUIRE_*
# cannot be satisfied (e.g. cuda>=99.0 on any real host). Uses RuntimeClass
# nvidia (CDI / toolkit mode), not nvidia-legacy.

apiVersion: v1
kind: Pod
metadata:
  name: require-cuda-fail
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: c
    image: ubuntu:22.04
    command: ["sleep", "3600"]
    resources:
      limits:
        nvidia.com/gpu: "1"
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "all"
    - name: NVIDIA_REQUIRE_CUDA
      value: "cuda>=99.0"
    - name: NVIDIA_REQUIRE_DRIVER
      value: "driver>=9999.0.0"
..................................
Events:
Type Reason Age From Message

Normal Scheduled 12m default-scheduler Successfully assigned default/require-cuda-fail to ipp1-3167
Normal Pulled 12m kubelet Container image "ubuntu:22.04" already present on machine
Normal Created 12m kubelet Created container c
Warning Failed 12m kubelet Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: requirements not met: unsatisfied condition: driver>=9999.0.0 (driver=595.58.3)

@copy-pr-bot

copy-pr-bot Bot commented Apr 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@JunAr7112 JunAr7112 force-pushed the requirement_checks branch from 75213b5 to 56fd4f4 on April 29, 2026 15:55
@JunAr7112 JunAr7112 marked this pull request as draft April 29, 2026 15:55
@JunAr7112 JunAr7112 force-pushed the requirement_checks branch from 56fd4f4 to 405317e on April 30, 2026 22:52
Signed-off-by: Arjun <agadiyar@nvidia.com>
@JunAr7112 JunAr7112 force-pushed the requirement_checks branch from 405317e to 68fa4bb on May 1, 2026 21:01
@JunAr7112 JunAr7112 marked this pull request as ready for review May 1, 2026 21:02
@cdesiniotis cdesiniotis marked this pull request as draft May 5, 2026 17:42
Member

@elezar elezar left a comment

One note here. It is not sufficient to run the check in the modifier; we would have to ensure that we generate a hook that implements the check in some form. The driver version, etc., is known at the point of spec generation, whereas one would have to inspect the envvars in the container.

As an additional note, this would be an ideal candidate to move to a createRuntime hook.
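
To make that suggestion concrete, here is a rough sketch of what injecting the check as a createRuntime hook could look like, using the opencontainers runtime-spec types. The hook binary path and its arguments are purely hypothetical placeholders, not something this PR or nvidia-ctk provides.

package main

import (
	"fmt"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// addRequirementsHook appends a createRuntime hook to an OCI runtime spec.
// The hook runs during container creation, so it can inspect the container's
// NVIDIA_REQUIRE_* variables and fail creation if they cannot be satisfied.
func addRequirementsHook(spec *specs.Spec) {
	if spec.Hooks == nil {
		spec.Hooks = &specs.Hooks{}
	}
	hook := specs.Hook{
		Path: "/usr/bin/nvidia-check-requirements",                          // hypothetical helper binary
		Args: []string{"nvidia-check-requirements", "--from-container-env"}, // hypothetical flag
	}
	if spec.Process != nil {
		// Illustrative only: forward the container environment so the hook can
		// read NVIDIA_REQUIRE_* values; a real hook could instead parse the
		// container state passed on its stdin.
		hook.Env = spec.Process.Env
	}
	spec.Hooks.CreateRuntime = append(spec.Hooks.CreateRuntime, hook)
}

func main() {
	spec := &specs.Spec{Process: &specs.Process{Env: []string{"NVIDIA_REQUIRE_CUDA=cuda>=12.0"}}}
	addRequirementsHook(spec)
	fmt.Println(spec.Hooks.CreateRuntime[0].Path)
}

A createRuntime hook executes after the container configuration, including its environment, has been finalized, which is why it is a natural fit for a check driven by the container's NVIDIA_REQUIRE_* values.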

Comment on lines +34 to +38
// checkRequirements evaluates NVIDIA_REQUIRE_* constraints using the host
// CUDA driver API version from libcuda, the NVIDIA display driver version from
// the driver root (libcuda / libnvidia-ml soname), the compute capability of
// CUDA device 0, and (when requirements reference brand) the GPU product brand
// from NVML. It is used for CSV and CDI / JIT-CDI modes.
Member

Note that there are cases where libcuda.so is not applicable (if we're not injecting actual GPU devices, for example).
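
One way to handle that case, sketched below with a hypothetical hasLibcuda helper (not part of this PR), is to probe for libcuda first and skip registering the CUDA-derived properties when it cannot be found, rather than failing outright:

package main

import (
	"fmt"
	"os"
)

// hasLibcuda is a hypothetical stand-in for however the driver root is
// actually resolved; here it simply checks a couple of well-known paths.
func hasLibcuda() bool {
	for _, p := range []string{
		"/usr/lib/x86_64-linux-gnu/libcuda.so.1",
		"/usr/lib64/libcuda.so.1",
	} {
		if _, err := os.Stat(p); err == nil {
			return true
		}
	}
	return false
}

func main() {
	if !hasLibcuda() {
		// No libcuda on this root (e.g. no GPU devices are being injected):
		// skip registering CUDA-derived properties instead of erroring out.
		fmt.Println("libcuda not found; skipping CUDA version property")
		return
	}
	fmt.Println("libcuda found; CUDA version property can be registered")
}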


// brandTypeToRequirementString maps NVML brand enums to lowercase tokens
// consistent with typical NVIDIA_REQUIRE_* image constraints.
func brandTypeToRequirementString(b nvml.BrandType) (string, bool) {
Member

Question: is this something that we already have access to in go-nvlib?
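
Independent of whether go-nvlib already provides this, the mapping itself is roughly a small lookup table over the BRAND_* constants from github.com/NVIDIA/go-nvml, sketched below. The token spellings are illustrative assumptions, not a confirmed list.

package main

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// brandTokens maps a subset of NVML brand enums to lowercase tokens of the
// kind that appear in brand= constraints. The spellings here are assumptions
// for illustration; the authoritative mapping lives in the PR itself.
var brandTokens = map[nvml.BrandType]string{
	nvml.BRAND_TESLA:   "tesla",
	nvml.BRAND_QUADRO:  "quadro",
	nvml.BRAND_GEFORCE: "geforce",
	nvml.BRAND_TITAN:   "titan",
	nvml.BRAND_NVIDIA:  "nvidia",
}

func brandTypeToRequirementString(b nvml.BrandType) (string, bool) {
	s, ok := brandTokens[b]
	return s, ok
}

func main() {
	if token, ok := brandTypeToRequirementString(nvml.BRAND_TESLA); ok {
		fmt.Println(token) // tesla
	}
}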

r.AddVersionProperty(requirements.CUDA, cudaVersion)
}

computeCapability, err := cuda.ComputeCapability(0)
Member

Here we're always using the first device (which was fine for older Tegra-based systems), but this does not map to multi-device systems, especially if they are heterogeneous.
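
As a sketch of how that concern could be addressed, the following walks every NVML device and takes the minimum compute capability as a conservative value for heterogeneous nodes. The go-nvml calls are real, but the error handling and the "minimum" policy are illustrative choices, not what the PR currently does.

package main

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// minComputeCapability walks every NVML device and returns the smallest
// compute capability found, a conservative value to check arch-style
// requirements against on heterogeneous multi-GPU nodes.
func minComputeCapability() (major, minor int, err error) {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		return 0, 0, fmt.Errorf("failed to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer func() { _ = nvml.Shutdown() }()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		return 0, 0, fmt.Errorf("failed to get device count: %v", nvml.ErrorString(ret))
	}
	major, minor = -1, -1
	for i := 0; i < count; i++ {
		device, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			return 0, 0, fmt.Errorf("failed to get device %d: %v", i, nvml.ErrorString(ret))
		}
		maj, mnr, ret := device.GetCudaComputeCapability()
		if ret != nvml.SUCCESS {
			return 0, 0, fmt.Errorf("failed to get compute capability for device %d: %v", i, nvml.ErrorString(ret))
		}
		// Keep the lowest capability seen so far.
		if major < 0 || maj < major || (maj == major && mnr < minor) {
			major, minor = maj, mnr
		}
	}
	return major, minor, nil
}

func main() {
	major, minor, err := minComputeCapability()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("minimum compute capability: %d.%d\n", major, minor)
}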
