
refactor: resolve workload container by name with Containers[0] fallback; wire GMS by pointer #8319

Closed

galletas1712 wants to merge 4 commits into main from schwinns/resolve-main-container-by-name

Conversation

@galletas1712 (Contributor)

Overview:

One of three follow-up PRs closing reviewer feedback from #8194. This is the main-container resolution refactor: every site that hardcoded podSpec.Containers[0] as "the workload" now goes through a shared helper that prefers lookup by name and falls back to Containers[0].

Companion PRs:

GMS Python changes (F, M, N) deferred to a separate GMS Python cleanup PR already in flight.

Closed predecessor: #8299.

Architectural decisions

1. Prefer-by-name with Containers[0] fallback. Seven sites used to hardcode podSpec.Containers[0] as "the workload". That assumption breaks whenever a container from ExtraPodSpec.PodSpec.Containers or a webhook-injected sidecar (istio-proxy, etc.) lands ahead of the workload in the slice. New contract:

ResolveMainContainer(podSpec):
  1. if no containers -> nil
  2. scan for container named "main" -> return it
  3. fall back to Containers[0]    (preserves pre-convention pods and tests)

The fallback is deliberate — dropping it would break any pod rendered before the naming convention or produced by non-operator tooling (snapshotctl manifests authored by hand, old fixtures, manual pods). snapshotctl tightens the rule further: in multi-container manifests it requires the worker to be explicitly named main, because in CLI-authored input the ambiguity is user-visible.
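
For concreteness, a minimal Go sketch of the resolver contract above. The constant value and the prefer-by-name-then-fallback shape come from this description; the exact in-tree code may differ.

package common

import corev1 "k8s.io/api/core/v1"

// MainContainerName is the operator-owned naming convention for the workload container.
const MainContainerName = "main"

// ResolveMainContainer prefers the container named "main" and falls back to
// Containers[0], so pre-convention pods and hand-authored manifests keep resolving.
// It returns a pointer into podSpec.Containers so callers can mutate the result in place.
func ResolveMainContainer(podSpec *corev1.PodSpec) *corev1.Container {
    if podSpec == nil || len(podSpec.Containers) == 0 {
        return nil
    }
    for i := range podSpec.Containers {
        if podSpec.Containers[i].Name == MainContainerName {
            return &podSpec.Containers[i]
        }
    }
    return &podSpec.Containers[0]
}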

2. Operator and snapshot each own their own copy. The "main" container name convention predates the snapshot subsystem (LWS multinode enforcement, PR #3390, Oct 2025) and is owned by the operator. Operator code must not import snapshotprotocol helpers that operate on pod-spec shape. After this PR:

  • commonconsts.MainContainerName + common.ResolveMainContainer are the authoritative operator-side copies. Operator callers (graph.go, checkpoint_job.go, checkpoint/podspec.go) use these.
  • snapshotprotocol.MainContainerName + snapshotprotocol.ResolveMainContainer / ResolveMainContainerName are the snapshot-side copies. Snapshot agent, snapshotctl, and snapshot protocol internals use these.
  • Both sides carry cross-reference comments pointing at the other copy.

Why duplicate instead of share: deploy/operator and deploy/snapshot are separate go.mods and cannot cross-import internal/ packages. Creating a third Go module purely for one constant and one 10-line function would cost more than the drift risk of duplication, especially given the cross-reference comments.

3. GMS wires onto the main container by pointer, before it lands in the slice. GenerateBasePodSpec now holds the main container as a local variable at the GMS wire point and passes &container directly to dra.ApplyClaim and gms.EnsureServerSidecar. Rationale:

  • The main container exists as a local; searching the slice is unnecessary work.
  • Removes graph.go's import of snapshotprotocol. Non-snapshot DGDs (most of them) no longer touch that package.
  • Closes a latent bug: checkpoint_job.go's post-Commit 1 main-container lookup could put the workload at a non-zero index, but the old dra.ApplyClaim(podSpec, name) still stripped nvidia.com/gpu from Containers[0] and wrote the claim there. So GMS and the DRA claim could land on different containers. The new dra.ApplyClaim(podSpec, target *corev1.Container, name) signature matches gms.EnsureServerSidecar(podSpec, mainContainer), and both receive the caller-resolved pointer.
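
A hedged sketch of the new call shape: the ApplyClaim signature comes from this description, while the body and the GMS call are illustrative assumptions rather than the in-tree implementation.

package dra

import corev1 "k8s.io/api/core/v1"

// ApplyClaim now mutates the caller-supplied container instead of Containers[0],
// so the DRA claim and the GMS wiring always land on the same container.
func ApplyClaim(podSpec *corev1.PodSpec, target *corev1.Container, claimName string) {
    // Illustrative body: attach the named claim to the target container and declare
    // it at pod level; GPU-limit stripping and claim templating are omitted.
    target.Resources.Claims = append(target.Resources.Claims, corev1.ResourceClaim{Name: claimName})
    podSpec.ResourceClaims = append(podSpec.ResourceClaims, corev1.PodResourceClaim{Name: claimName})
}

// At the GMS wire point in GenerateBasePodSpec the workload container is still a local,
// so it is passed by pointer before ever joining the slice (assumed call order):
//
//    dra.ApplyClaim(&podSpec, &container, claimName)
//    gms.EnsureServerSidecar(&podSpec, &container)               // same pointer, same target
//    podSpec.Containers = append(podSpec.Containers, container)  // appended only afterwards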

4. GMS ↔ failover ordering invariant. buildFailoverPod deletes the single main container and replaces it with engine-0 / engine-1 (DeepCopy + engine-specific overrides). Ordering must be:

build local `container` (name="main")
     |
     v
apply GMS wiring (DRA claim + shared volume + env + mount)
     |
     v
podSpec.Containers = append(..., container)
     |
     v
backend.UpdatePodSpec  (multinode init containers, etc.)
     |
     v
append frontend sidecar (if any)
     |
     v
buildFailoverPod: DeepCopy "main" -> engine-0, engine-1
                  engines inherit GMS volume/mount/env/claim transitively

Pinned by the new TestGenerateBasePodSpec_GMSWiredToMainBeforeFailover integration test.
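
The engine inheritance in the last step falls out of DeepCopy. A minimal sketch, assuming a buildEngineContainer shape like the one named above (the body is an assumption, not the in-tree code):

package dynamo

import corev1 "k8s.io/api/core/v1"

// buildEngineContainer clones the already-GMS-wired "main" container, so the GMS
// volume mount, GMS_SOCKET_DIR env, and DRA claim carry over without re-wiring.
func buildEngineContainer(main *corev1.Container, name string) corev1.Container {
    engine := *main.DeepCopy() // inherits volume mounts, env, and resource claims
    engine.Name = name         // "engine-0" or "engine-1"
    // engine-specific overrides (command, probes, etc.) would be applied here
    return engine
}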

5. PrepareRestorePodSpecForCheckpoint owns worker resolution. The helper no longer accepts an arbitrary container *corev1.Container pointer; it resolves the worker internally via ResolveMainContainer. Callers cannot pass a sidecar pointer by mistake. Closes the part of CodeRabbit I that was about API footguns.
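
A sketch of the tightened helper shape, assuming only what is stated above; the remaining parameters and the mutation details are placeholders:

package snapshotprotocol

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// PrepareRestorePodSpecForCheckpoint resolves the worker container itself, so a
// caller can no longer hand it a sidecar pointer by mistake.
func PrepareRestorePodSpecForCheckpoint(podSpec *corev1.PodSpec /*, checkpoint details elided */) error {
    worker := ResolveMainContainer(podSpec) // the snapshot-side resolver described above
    if worker == nil {
        return fmt.Errorf("pod spec has no containers to restore into")
    }
    // ... wire restore mounts, env, and lifecycle hooks onto worker ...
    return nil
}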

Details:

Commit Summary
1. fix(snapshot): resolve main container by name with Containers[0] fallback
   Introduce snapshotprotocol.ResolveMainContainer / ResolveMainContainerName. Replace hardcoded index-0 sites in the operator and snapshot agent/snapshotctl. Make PrepareRestorePodSpecForCheckpoint own its own worker resolution.
2. refactor(operator): wire GMS onto main container by pointer, not by search
   Reorder graph.go so GMS runs against the local container variable before it is appended to the slice; update the dra.ApplyClaim signature to take a target container pointer; this closes a latent "GMS and DRA claim land on different containers" bug in checkpoint_job.go.
3. refactor(operator): keep MainContainerName on operator side; document the duplicate
   Correct the provenance of the constant: the operator is the owner, snapshot keeps a local copy with cross-reference docs. Also adds the GMS+failover integration test.
4. refactor(operator): move main-container resolver to operator common package
   Finish decoupling: operator code no longer imports snapshotprotocol.ResolveMainContainer. The new common.ResolveMainContainer lives in deploy/operator/internal/common/ with its own tests.

Where should the reviewer start?

  1. deploy/snapshot/protocol/maincontainer.go and deploy/operator/internal/common/maincontainer.go — the two identical resolvers, each with tests. The cross-reference comments on both files spell out the "keep in sync" invariant.
  2. deploy/operator/internal/dynamo/graph.go GMS block — verify the in-scope container local is passed directly, no snapshotprotocol import. The regression test TestGenerateBasePodSpec_GMSWiredToMainBeforeFailover pins the failover interaction.
  3. deploy/operator/internal/dra/dra.go — signature change from (podSpec, claimName) to (podSpec, target *corev1.Container, claimName). Both callers updated. Tests updated.
  4. deploy/snapshot/protocol/restore.go PrepareRestorePodSpecForCheckpoint — now resolves the worker internally; the sole operator caller in checkpoint/podspec.go is updated to not pass the pointer in.
  5. deploy/snapshot/cmd/snapshotctl/kube.go — the tightened multi-container rule (requires explicit name: main when the manifest has multiple containers).
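
For item 5, a sketch of the tightened rule for CLI-authored manifests. The function name and error wording are assumptions; MainContainerName is assumed in scope from the snapshot-side package, and the single-container and named-main paths follow the description above.

func resolveWorkerContainer(podSpec *corev1.PodSpec) (*corev1.Container, error) {
    if podSpec == nil || len(podSpec.Containers) == 0 {
        return nil, fmt.Errorf("manifest has no containers")
    }
    if len(podSpec.Containers) == 1 {
        return &podSpec.Containers[0], nil // unambiguous, so the fallback applies
    }
    for i := range podSpec.Containers {
        if podSpec.Containers[i].Name == MainContainerName {
            return &podSpec.Containers[i], nil
        }
    }
    // In CLI-authored multi-container manifests the ambiguity is user-visible,
    // so require an explicitly named worker instead of guessing.
    return nil, fmt.Errorf("multiple containers: name the worker container %q", MainContainerName)
}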

Tests

  • New unit tests:
    • snapshotprotocol.ResolveMainContainer + ResolveMainContainerName — nil spec, empty containers, single-container-is-main, main-at-non-zero-index, fallback.
    • common.ResolveMainContainer — same cases on the operator side.
    • TestGenerateBasePodSpec_GMSWiredToMainBeforeFailover — asserts that after buildFailoverPod replaces the "main" container with engine-0/engine-1, both engines inherit the GMS shared-socket volume mount, GMS_SOCKET_DIR env, and DRA claim (via DeepCopy in buildEngineContainer), and that the pod-level gms-server init sidecar and shared volume land exactly once.
    • TestApplyClaim_TargetsCallerSuppliedContainer (replaces the old _AlwaysTargetsFirstContainer) — verifies the sidecar-at-index-0 case.
    • TestInjectCheckpointIntoPodSpec — added "resolves main even when not at index zero" case.
  • All deploy/operator unit tests pass (go test against envtest assets).
  • All deploy/snapshot tests pass.
  • go vet ./... clean in both modules.
  • pre-commit run --from-ref origin/main --to-ref HEAD passes.

Risk notes for deploy-codeowners

  • Behavior preservation via fallback: any pod rendered before the naming convention (or produced by non-operator tooling) still resolves via Containers[0], so this refactor is safe to cherry-pick without auditing every pod-emitting path. The regression surface is "pods that have a sidecar at index 0 and a container named main at a non-zero index" — the fallback would have been wrong here pre-refactor anyway.
  • dra.ApplyClaim is an internal-package function (deploy/operator/internal/dra/), not part of any published API surface. Signature change has zero external impact.
  • PrepareRestorePodSpecForCheckpoint is exported from deploy/snapshot/protocol, but I am not aware of any out-of-tree callers. Its one in-repo caller (checkpoint/podspec.go:InjectCheckpointIntoPodSpec) is updated.
  • No API changes to CRDs, no Go module or Python dependency changes.
  • This PR is safe to merge before or after PR 1 (fix(snapshot): require exec-form command for checkpoint containers #8318). The two are independent — PR 1 keeps Containers[0] for its validator, PR 2 replaces it. Whichever lands second picks up the other's work via rebase.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

fix(snapshot): resolve main container by name with Containers[0] fallback

Pull main-container resolution into a single snapshotprotocol helper
(ResolveMainContainer / ResolveMainContainerName) that prefers the
container named 'main' and falls back to Containers[0]. Replace five
Containers[0] assumptions that were hit by reviewer comments on #8194:

- deploy/operator/internal/dynamo/graph.go wired GMS onto Containers[0]
  even though ExtraPodSpec containers can be appended before the main
  container — now looks up by name.
- deploy/operator/internal/checkpoint/podspec.go had a local
  resolveMainContainer without the fallback.
- deploy/operator/internal/controller/checkpoint_job.go hardcoded both
  the initial and the post-sidecar re-acquire to index 0.
- deploy/snapshot/protocol/checkpoint.go (WrapLaunchJob) and restore.go
  (NewRestorePod, PrepareRestorePodSpec) both assumed index 0 in a
  comment that was specific to operator-emitted pods.
- deploy/snapshot/internal/controller/util.go and
  deploy/snapshot/cmd/snapshotctl/kube.go ditto.

PrepareRestorePodSpecForCheckpoint now owns worker resolution internally
so callers cannot pass a sidecar pointer by mistake. The public
signature drops the container argument.

MainContainerName moves to snapshotprotocol as the single source of
truth; commonconsts.MainContainerName aliases it so existing call sites
keep working.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
refactor(operator): wire GMS onto main container by pointer, not by search

The previous fix used snapshotprotocol.ResolveMainContainer to look up
the main container in graph.go, which is a layering violation: graph.go
builds general DGD pod specs and most DGDs never participate in
checkpoint/restore. Searching podSpec.Containers for a container named
'main' was also unnecessary work, given that the enclosing function still
holds the main container as a local value.

Reorder GenerateBasePodSpec so GMS (and the DRA claim that pairs with
it) run against the local container variable before it is appended to
the slice. The caller passes the workload container by pointer, so
dra.ApplyClaim and gms.EnsureServerSidecar do not need any container
lookup at all; removing the hardcoded podSpec.Containers[0] write in
dra.ApplyClaim in favor of a caller-supplied pointer makes the
contract symmetric with gms.EnsureServerSidecar.

This also sidesteps a failover-case footgun: buildFailoverPod runs
after GMS, deletes the single main container, and replaces it with
engine-0/engine-1 (via DeepCopy of the GMS-wired main). At the GMS
wire point there really is exactly one workload container, and
threading it through by pointer makes that invariant explicit in code
rather than implicit in slice-order reasoning.

checkpoint_job.go (the other dra.ApplyClaim call site) already has a
resolved mainContainer pointer available, so the new signature is a
one-token change there.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
refactor(operator): keep MainContainerName on operator side; document the duplicate

The operator's commonconsts.MainContainerName predates the snapshot
code (PR #3390, 2025-10-03; added for LWS multinode enforcement in
checkMainContainer). Snapshot tooling adopted the same convention
later. My earlier follow-up commit inverted this relationship by
making commonconsts.MainContainerName an alias for
snapshotprotocol.MainContainerName, which put the producer of the
convention behind the consumer in the dependency graph.

Restore the operator constant as the authoritative declaration and
keep a local copy in snapshotprotocol with a cross-reference comment
noting that deploy/snapshot is a separate Go module and cannot import
deploy/operator/internal directly. The two copies must be kept in
sync; there is no runtime check, but the comment and the short, stable
string value keep the drift risk low.

As a drive-by, also replace a hardcoded "main" literal in the snapshot
executor's RestoreRequest default with snapshotprotocol.MainContainerName
so there is exactly one duplicate across the two modules rather than
three.

Also add an integration regression test for GMS + failover in
GenerateBasePodSpec: asserts that after buildFailoverPod replaces the
"main" container with engine-0/engine-1, both engines inherit the GMS
shared-socket volume mount, GMS_SOCKET_DIR env, and DRA claim from the
original main container (via DeepCopy in buildEngineContainer), and
that the pod-level gms-server init sidecar and shared volume both land
exactly once. This closes the coverage gap for the GMS-wires-then-failover-clones
sequencing introduced in the previous refactor commit.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
refactor(operator): move main-container resolver to operator common package

Operator code should not import snapshotprotocol helpers that deal with
pod-spec shape: the snapshot module is a consumer of the operator's
main-container convention, not the owner of the lookup logic.

Add common.ResolveMainContainer in deploy/operator/internal/common/
with the same prefer-by-name-then-fallback-to-Containers[0] behavior.
Update the two operator call sites (checkpoint_job.go and
checkpoint/podspec.go) to use it.

The snapshot side keeps its own copy (snapshotprotocol.ResolveMainContainer
and ResolveMainContainerName) for use by the snapshot agent,
snapshotctl, and snapshot protocol internals. Both copies must be kept
in sync; cross-reference comments on both files now point to the other
copy. This is the same duplication-with-docs pattern we're using for
MainContainerName itself, because deploy/operator and deploy/snapshot
are separate Go modules and cannot cross-import internal packages.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>
@galletas1712 (Contributor, Author)

Superseded by #8631. The underlying problem — snapshot helpers keying off Containers[0] / "main" — is resolved structurally in #8631 by making target-container selection an explicit mandatory annotation read by the protocol helpers themselves, so there is no fallback to harden and no implicit main-container convention to thread through the operator.

galletas1712 pushed a commit that referenced this pull request Apr 28, 2026
… contract

The snapshot-agent keyed every checkpoint and restore operation off of a
single implicit workload container (Containers[0] in some call sites,
'main' in others). That made intra-pod failover restore (#8319), where
the operator clones 'main' into engine-0 / engine-1, impossible to model
cleanly.

Replace the implicit contract with one mandatory pod annotation:

  nvidia.com/snapshot-target-containers = "<name>[,<name>...]"

- Checkpoint Jobs must list exactly one target container.
- Restore pods may list one or more target containers; the same
  checkpoint is replayed into each.

The annotation is required. snapshotprotocol.NewCheckpointJob,
NewRestorePod, PrepareRestorePodSpec, and ValidateRestorePodSpec all
refuse to run when it is missing — no silent Containers[0] / 'main'
fallback.

Per-container agent-side status annotations move from singletons to
per-container-suffixed keys, with a pod-level aggregate rollup so
observers do not need to know the full target-container list:

  nvidia.com/snapshot-checkpoint-status              (Job, singleton)
  nvidia.com/snapshot-restore-status.<container>     (Pod, per target)
  nvidia.com/snapshot-restore-container-id.<container> (Pod, per target)
  nvidia.com/snapshot-restore-status                 (Pod, aggregate;
     written only by the informer event loop, so concurrent
     per-container workers never race on it)

The snapshot-control emptyDir is mounted into each target container
with subPath=<containerName>, so engine-0 and engine-1 lifecycle
sentinels live on disjoint disk paths while each container still sees
its sentinels at /snapshot-control via $DYN_SNAPSHOT_CONTROL_DIR.

Operator wiring (no CRD changes):
- buildCheckpointJob stamps 'main' on the Job pod template and locates
  the main container by name (ExtraPodSpec can inject containers
  before main, so Containers[0] is not trusted).
- ApplyRestorePodMetadata / InjectCheckpointIntoPodSpec honor a new
  CheckpointInfo.RestoreTargetContainers override. The DGD and DCD
  controllers populate that field with FailoverEngineContainerNames()
  ("engine-0", "engine-1") when failover is enabled.
- GMS sidecars keep their single-main-container wiring.

snapshotctl:
- new --container (checkpoint) / --containers (restore) CLI flags
- reconcileTargetContainers stamps or validates the annotation before
  submitting, and rejects mismatches between manifest and flag
- waitForRestore watches the aggregate annotation

Docs updated in docs/kubernetes/snapshot.md and
deploy/snapshot/cmd/snapshotctl/README.md.

Signed-off-by: Schwinn Saereesitthipitak <schwinns@nvidia.com>

Labels

deployment::k8s (Relates to dynamo deployment in kubernetes), refactor, size/XL
