
refactor: resolve workload container by name with Containers[0] fallback; wire GMS by pointer #8319

Closed

galletas1712 wants to merge 4 commits into main from schwinns/resolve-main-container-by-name

Conversation

@galletas1712 (Contributor)

Overview:

One of three follow-up PRs closing reviewer feedback from #8194. This is the main-container resolution refactor: every site that hardcoded podSpec.Containers[0] as "the workload" now goes through a shared helper that prefers lookup by name and falls back to Containers[0].

Companion PRs:

GMS Python changes (F, M, N) deferred to a separate GMS Python cleanup PR already in flight.

Closed predecessor: #8299.

Architectural decisions

1. Prefer-by-name with Containers[0] fallback. Seven sites used to hardcode podSpec.Containers[0] as "the workload". That assumption breaks whenever a container from ExtraPodSpec.PodSpec.Containers or a webhook-injected sidecar (istio-proxy, etc.) lands ahead of the workload in the slice. New contract:

ResolveMainContainer(podSpec):
  1. if no containers -> nil
  2. scan for container named "main" -> return it
  3. fall back to Containers[0]    (preserves pre-convention pods and tests)

The fallback is deliberate — dropping it would break any pod rendered before the naming convention or produced by non-operator tooling (snapshotctl manifests authored by hand, old fixtures, manual pods). snapshotctl tightens the rule further: in multi-container manifests it requires the worker to be explicitly named main, because in CLI-authored input the ambiguity is user-visible.
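
For concreteness, a minimal Go sketch of the resolver contract above. The constant value and the prefer-by-name-then-fallback shape come from this description; the exact in-tree code may differ.

package common

import corev1 "k8s.io/api/core/v1"

// MainContainerName is the operator-owned naming convention for the workload container.
const MainContainerName = "main"

// ResolveMainContainer prefers the container named "main" and falls back to
// Containers[0], so pre-convention pods and hand-authored manifests keep resolving.
// It returns a pointer into podSpec.Containers so callers can mutate the result in place.
func ResolveMainContainer(podSpec *corev1.PodSpec) *corev1.Container {
    if podSpec == nil || len(podSpec.Containers) == 0 {
        return nil
    }
    for i := range podSpec.Containers {
        if podSpec.Containers[i].Name == MainContainerName {
            return &podSpec.Containers[i]
        }
    }
    return &podSpec.Containers[0]
}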

2. Operator and snapshot each own their own copy. The "main" container name convention predates the snapshot subsystem (LWS multinode enforcement, PR #3390, Oct 2025) and is owned by the operator. Operator code must not import snapshotprotocol helpers that operate on pod-spec shape. After this PR:

  • commonconsts.MainContainerName + common.ResolveMainContainer are the authoritative operator-side copies. Operator callers (graph.go, checkpoint_job.go, checkpoint/podspec.go) use these.
  • snapshotprotocol.MainContainerName + snapshotprotocol.ResolveMainContainer / ResolveMainContainerName are the snapshot-side copies. Snapshot agent, snapshotctl, and snapshot protocol internals use these.
  • Both sides carry cross-reference comments pointing at the other copy.

Why duplicate instead of share: deploy/operator and deploy/snapshot are separate go.mods and cannot cross-import internal/ packages. Creating a third Go module purely for one constant and one 10-line function would cost more than the drift risk of duplication, especially given the cross-reference comments.

3. GMS wires onto the main container by pointer, before it lands in the slice. GenerateBasePodSpec now holds the main container as a local variable at the GMS wire point and passes &container directly to dra.ApplyClaim and gms.EnsureServerSidecar. Rationale:

  • The main container exists as a local; searching the slice is unnecessary work.
  • Removes graph.go's import of snapshotprotocol. Non-snapshot DGDs (most of them) no longer touch that package.
  • Closes a latent bug: checkpoint_job.go's post-Commit 1 main-container lookup could put the workload at a non-zero index, but the old dra.ApplyClaim(podSpec, name) still stripped nvidia.com/gpu from Containers[0] and wrote the claim there. So GMS and the DRA claim could land on different containers. The new dra.ApplyClaim(podSpec, target *corev1.Container, name) signature matches gms.EnsureServerSidecar(podSpec, mainContainer), and both receive the caller-resolved pointer.
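
A hedged sketch of the new call shape: the ApplyClaim signature comes from this description, while the body and the GMS call are illustrative assumptions rather than the in-tree implementation.

package dra

import corev1 "k8s.io/api/core/v1"

// ApplyClaim now mutates the caller-supplied container instead of Containers[0],
// so the DRA claim and the GMS wiring always land on the same container.
func ApplyClaim(podSpec *corev1.PodSpec, target *corev1.Container, claimName string) {
    // Illustrative body: attach the named claim to the target container and declare
    // it at pod level; GPU-limit stripping and claim templating are omitted.
    target.Resources.Claims = append(target.Resources.Claims, corev1.ResourceClaim{Name: claimName})
    podSpec.ResourceClaims = append(podSpec.ResourceClaims, corev1.PodResourceClaim{Name: claimName})
}

// At the GMS wire point in GenerateBasePodSpec the workload container is still a local,
// so it is passed by pointer before ever joining the slice (assumed call order):
//
//    dra.ApplyClaim(&podSpec, &container, claimName)
//    gms.EnsureServerSidecar(&podSpec, &container)               // same pointer, same target
//    podSpec.Containers = append(podSpec.Containers, container)  // appended only afterwards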

4. GMS ↔ failover ordering invariant. buildFailoverPod deletes the single main container and replaces it with engine-0 / engine-1 (DeepCopy + engine-specific overrides). Ordering must be:

build local `container` (name="main")
     |
     v
apply GMS wiring (DRA claim + shared volume + env + mount)
     |
     v
podSpec.Containers = append(..., container)
     |
     v
backend.UpdatePodSpec  (multinode init containers, etc.)
     |
     v
append frontend sidecar (if any)
     |
     v
buildFailoverPod: DeepCopy "main" -> engine-0, engine-1
                  engines inherit GMS volume/mount/env/claim transitively

Pinned by the new TestGenerateBasePodSpec_GMSWiredToMainBeforeFailover integration test.
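
The engine inheritance in the last step falls out of DeepCopy. A minimal sketch, assuming a buildEngineContainer shape like the one named above (the body is an assumption, not the in-tree code):

package dynamo

import corev1 "k8s.io/api/core/v1"

// buildEngineContainer clones the already-GMS-wired "main" container, so the GMS
// volume mount, GMS_SOCKET_DIR env, and DRA claim carry over without re-wiring.
func buildEngineContainer(main *corev1.Container, name string) corev1.Container {
    engine := *main.DeepCopy() // inherits volume mounts, env, and resource claims
    engine.Name = name         // "engine-0" or "engine-1"
    // engine-specific overrides (command, probes, etc.) would be applied here
    return engine
}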

5. PrepareRestorePodSpecForCheckpoint owns worker resolution. The helper no longer accepts an arbitrary container *corev1.Container pointer; it resolves the worker internally via ResolveMainContainer. Callers cannot pass a sidecar pointer by mistake. Closes the part of CodeRabbit I that was about API footguns.
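
A sketch of the tightened helper shape, assuming only what is stated above; the remaining parameters and the mutation details are placeholders:

package snapshotprotocol

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// PrepareRestorePodSpecForCheckpoint resolves the worker container itself, so a
// caller can no longer hand it a sidecar pointer by mistake.
func PrepareRestorePodSpecForCheckpoint(podSpec *corev1.PodSpec /*, checkpoint details elided */) error {
    worker := ResolveMainContainer(podSpec) // the snapshot-side resolver described above
    if worker == nil {
        return fmt.Errorf("pod spec has no containers to restore into")
    }
    // ... wire restore mounts, env, and lifecycle hooks onto worker ...
    return nil
}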

Details:

Commit Summary
1. fix(snapshot): resolve main container by name with Containers[0] fallback
   Introduce snapshotprotocol.ResolveMainContainer / ResolveMainContainerName. Replace hardcoded index-0 sites in the operator and snapshot agent/snapshotctl. Make PrepareRestorePodSpecForCheckpoint own its own worker resolution.
2. refactor(operator): wire GMS onto main container by pointer, not by search
   Reorder graph.go so GMS runs against the local container variable before it is appended to the slice; update the dra.ApplyClaim signature to take a target container pointer; this closes a latent "GMS and DRA claim land on different containers" bug in checkpoint_job.go.
3. refactor(operator): keep MainContainerName on operator side; document the duplicate
   Correct the provenance of the constant: the operator is the owner, snapshot keeps a local copy with cross-reference docs. Also adds the GMS+failover integration test.
4. refactor(operator): move main-container resolver to operator common package
   Finish decoupling: operator code no longer imports snapshotprotocol.ResolveMainContainer. The new common.ResolveMainContainer lives in deploy/operator/internal/common/ with its own tests.

Where should the reviewer start?

  1. deploy/snapshot/protocol/maincontainer.go and deploy/operator/internal/common/maincontainer.go — the two identical resolvers, each with tests. The cross-reference comments on both files spell out the "keep in sync" invariant.
  2. deploy/operator/internal/dynamo/graph.go GMS block — verify the in-scope container local is passed directly, no snapshotprotocol import. The regression test TestGenerateBasePodSpec_GMSWiredToMainBeforeFailover pins the failover interaction.
  3. deploy/operator/internal/dra/dra.go — signature change from (podSpec, claimName) to (podSpec, target *corev1.Container, claimName). Both callers updated. Tests updated.
  4. deploy/snapshot/protocol/restore.go PrepareRestorePodSpecForCheckpoint — now resolves the worker internally; the sole operator caller in checkpoint/podspec.go is updated to not pass the pointer in.
  5. deploy/snapshot/cmd/snapshotctl/kube.go — the tightened multi-container rule (requires explicit name: main when the manifest has multiple containers).
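
For item 5, a sketch of the tightened rule for CLI-authored manifests. The function name and error wording are assumptions; MainContainerName is assumed in scope from the snapshot-side package, and the single-container and named-main paths follow the description above.

func resolveWorkerContainer(podSpec *corev1.PodSpec) (*corev1.Container, error) {
    if podSpec == nil || len(podSpec.Containers) == 0 {
        return nil, fmt.Errorf("manifest has no containers")
    }
    if len(podSpec.Containers) == 1 {
        return &podSpec.Containers[0], nil // unambiguous, so the fallback applies
    }
    for i := range podSpec.Containers {
        if podSpec.Containers[i].Name == MainContainerName {
            return &podSpec.Containers[i], nil
        }
    }
    // In CLI-authored multi-container manifests the ambiguity is user-visible,
    // so require an explicitly named worker instead of guessing.
    return nil, fmt.Errorf("multiple containers: name the worker container %q", MainContainerName)
}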

Tests

  • New unit tests:
    • snapshotprotocol.ResolveMainContainer + ResolveMainContainerName — nil spec, empty containers, single-container-is-main, main-at-non-zero-index, fallback.
    • common.ResolveMainContainer — same cases on the operator side.
    • TestGenerateBasePodSpec_GMSWiredToMainBeforeFailover — asserts that after buildFailoverPod replaces the "main" container with engine-0/engine-1, both engines inherit the GMS shared-socket volume mount, GMS_SOCKET_DIR env, and DRA claim (via DeepCopy in buildEngineContainer), and that the pod-level gms-server init sidecar and shared volume land exactly once.
    • TestApplyClaim_TargetsCallerSuppliedContainer (replaces the old _AlwaysTargetsFirstContainer) — verifies the sidecar-at-index-0 case.
    • TestInjectCheckpointIntoPodSpec — added "resolves main even when not at index zero" case.
  • All deploy/operator unit tests pass (go test against envtest assets).
  • All deploy/snapshot tests pass.
  • go vet ./... clean in both modules.
  • pre-commit run --from-ref origin/main --to-ref HEAD passes.

Risk notes for deploy-codeowners

  • Behavior preservation via fallback: any pod rendered before the naming convention (or produced by non-operator tooling) still resolves via Containers[0], so this refactor is safe to cherry-pick without auditing every pod-emitting path. The regression surface is "pods that have a sidecar at index 0 and a container named main at a non-zero index" — the fallback would have been wrong here pre-refactor anyway.
  • dra.ApplyClaim is an internal-package function (deploy/operator/internal/dra/), not part of any published API surface. Signature change has zero external impact.
  • PrepareRestorePodSpecForCheckpoint is exported from deploy/snapshot/protocol, but I am not aware of any out-of-tree callers. Its one in-repo caller (checkpoint/podspec.go:InjectCheckpointIntoPodSpec) is updated.
  • No API changes to CRDs, no Go module or Python dependency changes.
  • This PR is safe to merge before or after PR 1 (fix(snapshot): require exec-form command for checkpoint containers #8318). The two are independent — PR 1 keeps Containers[0] for its validator, PR 2 replaces it. Whichever lands second picks up the other's work via rebase.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

fix(snapshot): resolve main container by name with Containers[0] fallback

Pull main-container resolution into a single snapshotprotocol helper
(ResolveMainContainer / ResolveMainContainerName) that prefers the
container named 'main' and falls back to Containers[0]. Replace five
Containers[0] assumptions that were hit by reviewer comments on #8194:

- deploy/operator/internal/dynamo/graph.go wired GMS onto Containers[0]
  even though ExtraPodSpec containers can be appended before the main
  container — now looks up by name.
- deploy/operator/internal/checkpoint/podspec.go had a local
  resolveMainContainer without the fallback.
- deploy/operator/internal/controller/checkpoint_job.go hardcoded both
  the initial and the post-sidecar re-acquire to index 0.
- deploy/snapshot/protocol/checkpoint.go (WrapLaunchJob) and restore.go
  (NewRestorePod, PrepareRestorePodSpec) both assumed index 0 in a
  comment that was specific to operator-emitted pods.
- deploy/snapshot/internal/controller/util.go and
  deploy/snapshot/cmd/snapshotctl/kube.go ditto.

PrepareRestorePodSpecForCheckpoint now owns worker resolution internally
so callers cannot pass a sidecar pointer by mistake. The public
signature drops the container argument.

MainContainerName moves to snapshotprotocol as the single source of
truth; commonconsts.MainContainerName aliases it so existing call sites
keep working.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
refactor(operator): wire GMS onto main container by pointer, not by search

The previous fix used snapshotprotocol.ResolveMainContainer to look up
the main container in graph.go, which is a layering violation: graph.go
builds general DGD pod specs and most DGDs never participate in
checkpoint/restore. Searching podSpec.Containers for a container named
'main' was also unnecessary work, given that the enclosing function still
holds the main container as a local value.

Reorder GenerateBasePodSpec so GMS (and the DRA claim that pairs with
it) run against the local container variable before it is appended to
the slice. The caller passes the workload container by pointer, so
dra.ApplyClaim and gms.EnsureServerSidecar do not need any container
lookup at all; removing the hardcoded podSpec.Containers[0] write in
dra.ApplyClaim in favor of a caller-supplied pointer makes the
contract symmetric with gms.EnsureServerSidecar.

This also sidesteps a failover-case footgun: buildFailoverPod runs
after GMS, deletes the single main container, and replaces it with
engine-0/engine-1 (via DeepCopy of the GMS-wired main). At the GMS
wire point there really is exactly one workload container, and
threading it through by pointer makes that invariant explicit in code
rather than implicit in slice-order reasoning.

checkpoint_job.go (the other dra.ApplyClaim call site) already has a
resolved mainContainer pointer available, so the new signature is a
one-token change there.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
refactor(operator): keep MainContainerName on operator side; document the duplicate

The operator's commonconsts.MainContainerName predates the snapshot
code (PR #3390, 2025-10-03; added for LWS multinode enforcement in
checkMainContainer). Snapshot tooling adopted the same convention
later. My earlier follow-up commit inverted this relationship by
making commonconsts.MainContainerName an alias for
snapshotprotocol.MainContainerName, which put the producer of the
convention behind the consumer in the dependency graph.

Restore the operator constant as the authoritative declaration and
keep a local copy in snapshotprotocol with a cross-reference comment
noting that deploy/snapshot is a separate Go module and cannot import
deploy/operator/internal directly. The two copies must be kept in
sync; there is no runtime check, but the comment and the short, stable
string value keep the drift risk low.

As a drive-by, also replace a hardcoded "main" literal in the snapshot
executor's RestoreRequest default with snapshotprotocol.MainContainerName
so there is exactly one duplicate across the two modules rather than
three.

Also add an integration regression test for GMS + failover in
GenerateBasePodSpec: asserts that after buildFailoverPod replaces the
"main" container with engine-0/engine-1, both engines inherit the GMS
shared-socket volume mount, GMS_SOCKET_DIR env, and DRA claim from the
original main container (via DeepCopy in buildEngineContainer), and
that the pod-level gms-server init sidecar and shared volume both land
exactly once. This closes the coverage gap for the GMS-wires-then-failover-clones
sequencing introduced in the previous refactor commit.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
refactor(operator): move main-container resolver to operator common package

Operator code should not import snapshotprotocol helpers that deal with
pod-spec shape: the snapshot module is a consumer of the operator's
main-container convention, not the owner of the lookup logic.

Add common.ResolveMainContainer in deploy/operator/internal/common/
with the same prefer-by-name-then-fallback-to-Containers[0] behavior.
Update the two operator call sites (checkpoint_job.go and
checkpoint/podspec.go) to use it.

The snapshot side keeps its own copy (snapshotprotocol.ResolveMainContainer
and ResolveMainContainerName) for use by the snapshot agent,
snapshotctl, and snapshot protocol internals. Both copies must be kept
in sync; cross-reference comments on both files now point to the other
copy. This is the same duplication-with-docs pattern we're using for
MainContainerName itself, because deploy/operator and deploy/snapshot
are separate Go modules and cannot cross-import internal packages.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
@galletas1712 (Contributor, Author)

Superseded by #8631. The underlying problem — snapshot helpers keying off Containers[0] / "main" — is resolved structurally in #8631 by making target-container selection an explicit mandatory annotation read by the protocol helpers themselves, so there is no fallback to harden and no implicit main-container convention to thread through the operator.

galletas1712 pushed a commit that referenced this pull request Apr 28, 2026
… contract

The snapshot-agent keyed every checkpoint and restore operation off of a
single implicit workload container (Containers[0] in some call sites,
'main' in others). That made intra-pod failover restore (#8319), where
the operator clones 'main' into engine-0 / engine-1, impossible to model
cleanly.

Replace the implicit contract with one mandatory pod annotation:

  nvidia.com/snapshot-target-containers = "<name>[,<name>...]"

- Checkpoint Jobs must list exactly one target container.
- Restore pods may list one or more target containers; the same
  checkpoint is replayed into each.

The annotation is required. snapshotprotocol.NewCheckpointJob,
NewRestorePod, PrepareRestorePodSpec, and ValidateRestorePodSpec all
refuse to run when it is missing — no silent Containers[0] / 'main'
fallback.

Per-container agent-side status annotations move from singletons to
per-container-suffixed keys, with a pod-level aggregate rollup so
observers do not need to know the full target-container list:

  nvidia.com/snapshot-checkpoint-status              (Job, singleton)
  nvidia.com/snapshot-restore-status.<container>     (Pod, per target)
  nvidia.com/snapshot-restore-container-id.<container> (Pod, per target)
  nvidia.com/snapshot-restore-status                 (Pod, aggregate;
     written only by the informer event loop, so concurrent
     per-container workers never race on it)

The snapshot-control emptyDir is mounted into each target container
with subPath=<containerName>, so engine-0 and engine-1 lifecycle
sentinels live on disjoint disk paths while each container still sees
its sentinels at /snapshot-control via $DYN_SNAPSHOT_CONTROL_DIR.

Operator wiring (no CRD changes):
- buildCheckpointJob stamps 'main' on the Job pod template and locates
  the main container by name (ExtraPodSpec can inject containers
  before main, so Containers[0] is not trusted).
- ApplyRestorePodMetadata / InjectCheckpointIntoPodSpec honor a new
  CheckpointInfo.RestoreTargetContainers override. The DGD and DCD
  controllers populate that field with FailoverEngineContainerNames()
  ("engine-0", "engine-1") when failover is enabled.
- GMS sidecars keep their single-main-container wiring.

snapshotctl:
- new --container (checkpoint) / --containers (restore) CLI flags
- reconcileTargetContainers stamps or validates the annotation before
  submitting, and rejects mismatches between manifest and flag
- waitForRestore watches the aggregate annotation

Docs updated in docs/kubernetes/snapshot.md and
deploy/snapshot/cmd/snapshotctl/README.md.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

Labels

deployment::k8s (Relates to dynamo deployment in kubernetes), refactor, size/XL
