refactor: resolve workload container by name with Containers[0] fallback; wire GMS by pointer #8319
Closed
galletas1712 wants to merge 4 commits into main from …
Conversation
fix(snapshot): resolve main container by name with Containers[0] fallback

Pull main-container resolution into a single snapshotprotocol helper (ResolveMainContainer / ResolveMainContainerName) that prefers the container named 'main' and falls back to Containers[0]. Replace five Containers[0] assumptions that were hit by reviewer comments on #8194:

- deploy/operator/internal/dynamo/graph.go wired GMS onto Containers[0] even though ExtraPodSpec containers can be appended before the main container — now looks up by name.
- deploy/operator/internal/checkpoint/podspec.go had a local resolveMainContainer without the fallback.
- deploy/operator/internal/controller/checkpoint_job.go hardcoded both the initial and the post-sidecar re-acquire to index 0.
- deploy/snapshot/protocol/checkpoint.go (WrapLaunchJob) and restore.go (NewRestorePod, PrepareRestorePodSpec) both assumed index 0, with a comment that was specific to operator-emitted pods.
- deploy/snapshot/internal/controller/util.go and deploy/snapshot/cmd/snapshotctl/kube.go ditto.

PrepareRestorePodSpecForCheckpoint now owns worker resolution internally so callers cannot pass a sidecar pointer by mistake. The public signature drops the container argument.

MainContainerName moves to snapshotprotocol as the single source of truth; commonconsts.MainContainerName aliases it so existing call sites keep working.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
refactor(operator): wire GMS onto main container by pointer, not by search

The previous fix used snapshotprotocol.ResolveMainContainer to look up the main container in graph.go, which is a layering violation: graph.go builds general DGD pod specs and most DGDs never participate in checkpoint/restore. Searching podSpec.Containers for a container named 'main' was also unnecessary work given the local function still holds the main container as a value.

Reorder GenerateBasePodSpec so GMS (and the DRA claim that pairs with it) run against the local container variable before it is appended to the slice. The caller passes the workload container by pointer, so dra.ApplyClaim and gms.EnsureServerSidecar do not need any container lookup at all; removing the hardcoded podSpec.Containers[0] write in dra.ApplyClaim in favor of a caller-supplied pointer makes the contract symmetric with gms.EnsureServerSidecar.

This also sidesteps a failover-case footgun: buildFailoverPod runs after GMS, deletes the single main container, and replaces it with engine-0/engine-1 (via DeepCopy of the GMS-wired main). At the GMS wire point there really is exactly one workload container, and threading it through by pointer makes that invariant explicit in code rather than implicit in slice-order reasoning.

checkpoint_job.go (the other dra.ApplyClaim call site) already has a resolved mainContainer pointer available, so the new signature is a one-token change there.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
refactor(operator): keep MainContainerName on operator side; document the duplicate

The operator's commonconsts.MainContainerName predates the snapshot code (PR #3390, 2025-10-03 — added for LWS multinode enforcement in checkMainContainer). Snapshot tooling adopted the same convention later. My earlier follow-up commit inverted this relationship by making commonconsts.MainContainerName an alias for snapshotprotocol.MainContainerName, which put the producer of the convention behind the consumer in the dependency graph.

Restore the operator constant as the authoritative declaration and keep a local copy in snapshotprotocol with a cross-reference comment noting that deploy/snapshot is a separate Go module and cannot import deploy/operator/internal directly. The two copies must be kept in sync; there is no runtime check, but the comment and the short, stable string value keep the drift risk low.

As a drive-by, also replace a hardcoded "main" literal in the snapshot executor's RestoreRequest default with snapshotprotocol.MainContainerName so there is exactly one duplicate across the two modules rather than three.

Also add an integration regression test for GMS + failover in GenerateBasePodSpec: it asserts that after buildFailoverPod replaces the "main" container with engine-0/engine-1, both engines inherit the GMS shared-socket volume mount, GMS_SOCKET_DIR env, and DRA claim from the original main container (via DeepCopy in buildEngineContainer), and that the pod-level gms-server init sidecar and shared volume both land exactly once. This closes the coverage gap for the GMS-wires-then-failover-clones sequencing introduced in the previous refactor commit.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
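A minimal sketch of the duplication-with-docs pattern this commit describes; the file paths and comment wording are assumptions, and only the constant name, its value, and the cross-reference intent come from the commit message:

```go
// deploy/operator/internal/consts (authoritative copy; exact path assumed)
package commonconsts

// MainContainerName is the conventional name of the workload container.
// Introduced for LWS multinode enforcement (#3390). deploy/snapshot keeps a
// synchronized copy because it is a separate Go module and cannot import
// deploy/operator/internal.
const MainContainerName = "main"
```

```go
// deploy/snapshot/protocol (consumer-side copy)
package snapshotprotocol

// MainContainerName mirrors the operator's commonconsts.MainContainerName.
// Keep the two in sync by hand; there is no runtime check.
const MainContainerName = "main"
```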
refactor(operator): move main-container resolver to operator common package

Operator code should not import snapshotprotocol helpers that deal with pod-spec shape — the snapshot module is a consumer of the operator's main-container convention, not the owner of the lookup logic.

Add common.ResolveMainContainer in deploy/operator/internal/common/ with the same prefer-by-name-then-fallback-to-Containers[0] behavior. Update the two operator call sites (checkpoint_job.go and checkpoint/podspec.go) to use it. The snapshot side keeps its own copy (snapshotprotocol.ResolveMainContainer and ResolveMainContainerName) for use by the snapshot agent, snapshotctl, and snapshot protocol internals.

Both copies must be kept in sync; cross-reference comments on both files now point to the other copy. This is the same duplication-with-docs pattern we're using for MainContainerName itself, because deploy/operator and deploy/snapshot are separate Go modules and cannot cross-import internal packages.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
galletas1712 (Author) commented:

Superseded by #8631. The underlying problem — snapshot helpers keying off an implicit workload container (Containers[0] in some call sites, "main" in others) — is addressed there with a mandatory target-containers annotation.
galletas1712 pushed a commit that referenced this pull request on Apr 28, 2026:
… contract

The snapshot-agent keyed every checkpoint and restore operation off of a single implicit workload container (Containers[0] in some call sites, 'main' in others). That made intra-pod failover restore (#8319), where the operator clones 'main' into engine-0 / engine-1, impossible to model cleanly.

Replace the implicit contract with one mandatory pod annotation:

    nvidia.com/snapshot-target-containers = "<name>[,<name>...]"

- Checkpoint Jobs must list exactly one target container.
- Restore pods may list one or more target containers; the same checkpoint is replayed into each.

The annotation is required. snapshotprotocol.NewCheckpointJob, NewRestorePod, PrepareRestorePodSpec, and ValidateRestorePodSpec all refuse to run when it is missing — no silent Containers[0] / 'main' fallback.

Per-container agent-side status annotations move from singletons to per-container-suffixed keys, with a pod-level aggregate rollup so observers do not need to know the full target-container list:

    nvidia.com/snapshot-checkpoint-status                (Job, singleton)
    nvidia.com/snapshot-restore-status.<container>       (Pod, per target)
    nvidia.com/snapshot-restore-container-id.<container> (Pod, per target)
    nvidia.com/snapshot-restore-status                   (Pod, aggregate; written only by the informer event loop, so concurrent per-container workers never race on it)

The snapshot-control emptyDir is mounted into each target container with subPath=<containerName>, so engine-0 and engine-1 lifecycle sentinels live on disjoint disk paths while each container still sees its sentinels at /snapshot-control via $DYN_SNAPSHOT_CONTROL_DIR.

Operator wiring (no CRD changes):
- buildCheckpointJob stamps 'main' on the Job pod template and locates the main container by name (ExtraPodSpec can inject containers before main, so Containers[0] is not trusted).
- ApplyRestorePodMetadata / InjectCheckpointIntoPodSpec honor a new CheckpointInfo.RestoreTargetContainers override. The DGD and DCD controllers populate that field with FailoverEngineContainerNames() ("engine-0", "engine-1") when failover is enabled.
- GMS sidecars keep their single-main-container wiring.

snapshotctl:
- new --container (checkpoint) / --containers (restore) CLI flags
- reconcileTargetContainers stamps or validates the annotation before submitting, and rejects mismatches between manifest and flag
- waitForRestore watches the aggregate annotation

Docs updated in docs/kubernetes/snapshot.md and deploy/snapshot/cmd/snapshotctl/README.md.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
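A hedged sketch of the new annotation contract in Go. The annotation key and the refuse-to-fall-back rule come from the commit message; the helper name, error wording, and placement are illustrative, not the PR's actual code:

```go
package snapshotprotocol

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// TargetContainersAnnotation is the mandatory pod annotation described above.
const TargetContainersAnnotation = "nvidia.com/snapshot-target-containers"

// targetContainers parses the comma-separated target list. It deliberately
// refuses to fall back to Containers[0] or "main" when the annotation is
// missing, matching the commit's "no silent fallback" rule.
func targetContainers(pod *corev1.Pod) ([]string, error) {
	raw, ok := pod.Annotations[TargetContainersAnnotation]
	if !ok || strings.TrimSpace(raw) == "" {
		return nil, fmt.Errorf("pod %s/%s: missing required annotation %s",
			pod.Namespace, pod.Name, TargetContainersAnnotation)
	}
	var names []string
	for _, n := range strings.Split(raw, ",") {
		if n = strings.TrimSpace(n); n != "" {
			names = append(names, n)
		}
	}
	if len(names) == 0 {
		return nil, fmt.Errorf("annotation %s lists no containers", TargetContainersAnnotation)
	}
	return names, nil
}
```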
Overview:
One of three follow-up PRs closing reviewer feedback from #8194. This is the main-container resolution refactor: every site that hardcoded `podSpec.Containers[0]` as "the workload" now resolves it via a shared helper that prefers the container by name, with a `Containers[0]` fallback.

Companion PRs:
- fix(snapshot): require exec-form command for checkpoint containers (#8318, already open) — CodeRabbit B, C
- fix(operator): validate pod-info volume; operator review nits (next) — CodeRabbit A, G, H, J
- GMS Python changes (F, M, N) deferred to a separate GMS Python cleanup PR already in flight.
Closed predecessor: #8299.
Architectural decisions
1. Prefer-by-name with `Containers[0]` fallback. Seven sites used to hardcode `podSpec.Containers[0]` as "the workload". That assumption breaks whenever `ExtraPodSpec.PodSpec.Containers` or a webhook-injected sidecar (istio-proxy, etc.) lands ahead of the workload in the slice. The new contract is sketched below.

The fallback is deliberate — dropping it would break any pod rendered before the naming convention or produced by non-operator tooling (`snapshotctl` manifests authored by hand, old fixtures, manual pods). `snapshotctl` tightens the rule further: in multi-container manifests it requires the worker to be explicitly named `main`, because in CLI-authored input the ambiguity is user-visible.
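A minimal sketch of that contract, assuming the exported names from this PR (`snapshotprotocol.ResolveMainContainer` / `ResolveMainContainerName`); the shipped helper may differ in detail:

```go
package snapshotprotocol

import corev1 "k8s.io/api/core/v1"

// MainContainerName is this module's copy of the operator convention
// (see decision 2 below).
const MainContainerName = "main"

// ResolveMainContainer returns the workload container: the container named
// MainContainerName when present, else Containers[0] as a fallback, else nil
// for an empty spec.
func ResolveMainContainer(podSpec *corev1.PodSpec) *corev1.Container {
	if podSpec == nil || len(podSpec.Containers) == 0 {
		return nil
	}
	for i := range podSpec.Containers {
		if podSpec.Containers[i].Name == MainContainerName {
			return &podSpec.Containers[i]
		}
	}
	return &podSpec.Containers[0]
}

// ResolveMainContainerName is the name-only convenience wrapper.
func ResolveMainContainerName(podSpec *corev1.PodSpec) string {
	if c := ResolveMainContainer(podSpec); c != nil {
		return c.Name
	}
	return ""
}
```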
"main"container name convention predates the snapshot subsystem (LWS multinode enforcement, PR #3390, Oct 2025) and is owned by the operator. Operator code must not importsnapshotprotocolhelpers that operate on pod-spec shape. After this PR:commonconsts.MainContainerName+common.ResolveMainContainerare the authoritative operator-side copies. Operator callers (graph.go,checkpoint_job.go,checkpoint/podspec.go) use these.snapshotprotocol.MainContainerName+snapshotprotocol.ResolveMainContainer/ResolveMainContainerNameare the snapshot-side copies. Snapshot agent, snapshotctl, and snapshot protocol internals use these.Why duplicate instead of share:
deploy/operatoranddeploy/snapshotare separatego.mods and cannot cross-importinternal/packages. Creating a third Go module purely for one constant and one 10-line function would cost more than the drift risk of duplication, especially given the cross-reference comments.3. GMS wires onto the main container by pointer, before it lands in the slice.
3. GMS wires onto the main container by pointer, before it lands in the slice. `GenerateBasePodSpec` now holds the main container as a local variable at the GMS wire point and passes `&container` directly to `dra.ApplyClaim` and `gms.EnsureServerSidecar`. Rationale:

- Removes `graph.go`'s import of `snapshotprotocol`. Non-snapshot DGDs (most of them) no longer touch that package.
- Fixes a latent bug: `checkpoint_job.go`'s main-container lookup (from commit 1) could put the workload at a non-zero index, but the old `dra.ApplyClaim(podSpec, name)` still stripped `nvidia.com/gpu` from `Containers[0]` and wrote the claim there. So GMS and the DRA claim could land on different containers. The new `dra.ApplyClaim(podSpec, target *corev1.Container, name)` signature matches `gms.EnsureServerSidecar(podSpec, mainContainer)`, and both receive the caller-resolved pointer.
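A sketch of the reordered wiring, with hypothetical stand-ins for the internal `dra` and `gms` packages; only the pointer-based signatures and the wire-before-append ordering come from the PR text:

```go
package dynamo

import corev1 "k8s.io/api/core/v1"

// applyClaim stands in for dra.ApplyClaim's new shape: the caller supplies
// the target container by pointer instead of the function writing to
// podSpec.Containers[0].
func applyClaim(podSpec *corev1.PodSpec, target *corev1.Container, claimName string) {
	delete(target.Resources.Limits, "nvidia.com/gpu") // the claim replaces the device-plugin request
	target.Resources.Claims = append(target.Resources.Claims, corev1.ResourceClaim{Name: claimName})
	podSpec.ResourceClaims = append(podSpec.ResourceClaims, corev1.PodResourceClaim{Name: claimName})
}

// ensureServerSidecar stands in for gms.EnsureServerSidecar; it receives the
// caller-resolved main container, symmetric with applyClaim.
func ensureServerSidecar(podSpec *corev1.PodSpec, main *corev1.Container) {
	// ... mount the shared socket dir into main, add the gms-server sidecar ...
}

// wireMainContainer shows the GenerateBasePodSpec ordering: wire against the
// local value first, append to the slice last, so containers already in the
// slice can never receive the claim or GMS wiring by index accident.
func wireMainContainer(podSpec *corev1.PodSpec, claimName string) {
	container := corev1.Container{Name: "main"}
	applyClaim(podSpec, &container, claimName)
	ensureServerSidecar(podSpec, &container)
	podSpec.Containers = append(podSpec.Containers, container)
}
```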
4. GMS ↔ failover ordering invariant. `buildFailoverPod` deletes the single `main` container and replaces it with `engine-0`/`engine-1` (DeepCopy + engine-specific overrides). Ordering must be: GMS (and the DRA claim) wire onto `main` first, then `buildFailoverPod` clones the wired `main` into the engine containers. Pinned by the new `TestGenerateBasePodSpec_GMSWiredToMainBeforeFailover` integration test.
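A self-contained demonstration of the mechanic that test pins, assuming nothing beyond `corev1.Container`'s generated `DeepCopy`: an engine built by cloning the already-wired `main` inherits its claim, env, and mounts, which is exactly why GMS must run first:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// The GMS-wired main container, as it exists just before buildFailoverPod.
	wired := corev1.Container{
		Name: "main",
		Env:  []corev1.EnvVar{{Name: "GMS_SOCKET_DIR", Value: "/gms"}},
		Resources: corev1.ResourceRequirements{
			Claims: []corev1.ResourceClaim{{Name: "gpu-claim"}},
		},
	}
	// buildEngineContainer-style cloning: DeepCopy, then rename.
	for _, name := range []string{"engine-0", "engine-1"} {
		engine := *wired.DeepCopy()
		engine.Name = name
		fmt.Printf("%s: env=%v claims=%v\n", name, engine.Env, engine.Resources.Claims)
	}
	// Had GMS run after the clone step, both engines would miss this wiring.
}
```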
5. `PrepareRestorePodSpecForCheckpoint` owns worker resolution. The helper no longer accepts an arbitrary `container *corev1.Container` pointer; it resolves the worker internally via `ResolveMainContainer`. Callers cannot pass a sidecar pointer by mistake. Closes the part of CodeRabbit I that was about API footguns.
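A sketch of the tightened entry point; `CheckpointInfo`'s shape and the body are placeholders, and only the dropped container argument plus internal resolution come from the PR:

```go
package snapshotprotocol

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// CheckpointInfo is stubbed for the sketch; the real type carries checkpoint
// metadata, and the CheckpointID field here is hypothetical.
type CheckpointInfo struct{ CheckpointID string }

// PrepareRestorePodSpecForCheckpoint no longer takes a *corev1.Container:
// the worker is resolved internally, so a caller can never hand in a sidecar.
func PrepareRestorePodSpecForCheckpoint(podSpec *corev1.PodSpec, info CheckpointInfo) error {
	worker := ResolveMainContainer(podSpec)
	if worker == nil {
		return fmt.Errorf("pod spec has no containers to restore into")
	}
	// ... wire restore env, volumes, and the checkpoint reference onto worker ...
	return nil
}
```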
Details:

1. fix(snapshot): resolve main container by name with Containers[0] fallback — adds `snapshotprotocol.ResolveMainContainer` / `ResolveMainContainerName`. Replaces hardcoded index-0 sites in the operator and snapshot agent/snapshotctl. Makes `PrepareRestorePodSpecForCheckpoint` own its own worker resolution.
2. refactor(operator): wire GMS onto main container by pointer, not by search — reorders `graph.go` so GMS runs against the local `container` variable before it is appended to the slice; updates the `dra.ApplyClaim` signature to take a target container pointer; this closes a latent "GMS and DRA claim land on different containers" bug in `checkpoint_job.go`.
3. refactor(operator): keep MainContainerName on operator side; document the duplicate.
4. refactor(operator): move main-container resolver to operator common package — operator call sites stop importing `snapshotprotocol.ResolveMainContainer`. New `common.ResolveMainContainer` lives in `deploy/operator/internal/common/` with its own tests.
Where should the reviewer start?

1. `deploy/snapshot/protocol/maincontainer.go` and `deploy/operator/internal/common/maincontainer.go` — the two identical resolvers, each with tests. The cross-reference comments on both files spell out the "keep in sync" invariant.
2. `deploy/operator/internal/dynamo/graph.go` GMS block — verify the in-scope `container` local is passed directly, no `snapshotprotocol` import. The regression test `TestGenerateBasePodSpec_GMSWiredToMainBeforeFailover` pins the failover interaction.
3. `deploy/operator/internal/dra/dra.go` — signature change from `(podSpec, claimName)` to `(podSpec, target *corev1.Container, claimName)`. Both callers updated. Tests updated.
4. `deploy/snapshot/protocol/restore.go` `PrepareRestorePodSpecForCheckpoint` — now resolves the worker internally; the sole operator caller in `checkpoint/podspec.go` is updated to not pass the pointer in.
5. `deploy/snapshot/cmd/snapshotctl/kube.go` — the tightened multi-container rule (requires an explicit `name: main` when the manifest has multiple containers; see the sketch after this list).
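A sketch of that tightened snapshotctl rule, assuming a hypothetical helper name and error text; only the behavior (single-container manifests stay permissive, multi-container manifests must name the worker `main`) comes from the PR:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// resolveWorkerForManifest applies the CLI-side rule: ambiguity in
// user-authored manifests is surfaced as an error instead of guessing.
func resolveWorkerForManifest(podSpec *corev1.PodSpec) (*corev1.Container, error) {
	switch {
	case podSpec == nil || len(podSpec.Containers) == 0:
		return nil, fmt.Errorf("manifest has no containers")
	case len(podSpec.Containers) == 1:
		return &podSpec.Containers[0], nil // unambiguous; the fallback is fine here
	}
	for i := range podSpec.Containers {
		if podSpec.Containers[i].Name == "main" {
			return &podSpec.Containers[i], nil
		}
	}
	// Multi-container manifest without an explicit "main": refuse rather
	// than silently pick index 0.
	return nil, fmt.Errorf("multi-container manifest must name the worker container %q", "main")
}
```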
Tests

- `snapshotprotocol.ResolveMainContainer` + `ResolveMainContainerName` — nil spec, empty containers, single-container-is-main, main-at-non-zero-index, fallback.
- `common.ResolveMainContainer` — same cases on the operator side.
- `TestGenerateBasePodSpec_GMSWiredToMainBeforeFailover` — asserts that after `buildFailoverPod` replaces the "main" container with `engine-0`/`engine-1`, both engines inherit the GMS shared-socket volume mount, `GMS_SOCKET_DIR` env, and DRA claim (via `DeepCopy` in `buildEngineContainer`), and that the pod-level `gms-server` init sidecar and shared volume land exactly once.
- `TestApplyClaim_TargetsCallerSuppliedContainer` (replaces the old `_AlwaysTargetsFirstContainer`) — verifies the sidecar-at-index-0 case.
- `TestInjectCheckpointIntoPodSpec` — added a "resolves main even when not at index zero" case.
- `deploy/operator` unit tests pass (`go test` against envtest assets). `deploy/snapshot` tests pass. `go vet ./...` clean in both modules. `pre-commit run --from-ref origin/main --to-ref HEAD` passes.
Risk notes for deploy-codeowners

- For pods the operator emits today, by-name resolution picks the same container the old `Containers[0]` code did, so this refactor is safe to cherry-pick without auditing every pod-emitting path. The regression surface is "pods that have a sidecar at index 0 and a container named `main` at a non-zero index" — the fallback would have been wrong here pre-refactor anyway.
- `dra.ApplyClaim` is an internal-package function (`deploy/operator/internal/dra/`), not part of any published API surface. The signature change has zero external impact.
- `PrepareRestorePodSpecForCheckpoint` is exported from `deploy/snapshot/protocol`, but I am not aware of any out-of-tree callers. Its one in-repo caller (`checkpoint/podspec.go:InjectCheckpointIntoPodSpec`) is updated.
- Overlap with the companion PRs: one of them still keys off `Containers[0]` for its validator; PR 2 replaces it. Whichever lands second picks up the other's work via rebase.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)