feat: use content fetch instead of image pull (#7573)
Conversation
Force-pushed 7ed67cf to a66f521
If this works, it will be a game changer. Why would anyone use `ctr content fetch`?
Pre-fetch blobs.
- `ctr content fetch <ref|descriptor>` downloads raw OCI blobs (manifest, config, layer tarballs) into the content store.
- `ctr images pull` resolves the reference, downloads the same blobs into the content store, and also creates the image record.
1. Fetch (blobs land in the content store): `ctr content fetch docker.io/library/nginx:latest`
2. Identify the manifest digest you fetched: `ctr content ls | grep manifest`, e.g. `sha256:MANIFEST_DIGEST`
3. Create the image record (register name → manifest): `ctr images create`
4. (Optional) Unpack now to avoid the first-run penalty: `ctr images unpack docker.io/library/nginx:latest`
5. Run: `ctr run --rm docker.io/library/nginx:latest nginx`

If you skip step 3, the image won't show up in `ctr images ls`, and `ctr run` won't find it.
Having a first-run penalty is expected, but the tradeoff is that we can cache many more images: compressed blobs fit in the 15 GB+ of disk space we have, versus the smaller set we can hold today when everything is unpacked. I did a lot of testing; there is an overall penalty of 5–7 s of additional startup time to unpack and register the image with the snapshotter (overlayfs), but this penalty is amortized across all containers started in parallel.

With fetch-only:

```
ctr content fetch mcr.microsoft.com/aks/aks-gpu-cuda:580.95.05-20251021155213
time ctr containers create mcr.microsoft.com/aks/aks-gpu-cuda:580.95.05-20251021155213 debug-box
real 0m7.905s
```

With a full pull:

```
time ctr images pull mcr.microsoft.com/aks/aks-gpu-cuda:580.95.05-20251021155213
real 0m16.372s
time ctr containers create mcr.microsoft.com/aks/aks-gpu-cuda:580.95.05-20251021155213 debug-box
real 0m0.054s
```
Obviously we can use this option strategically. I feel that 5–7 s of additional startup time is not much compared to the savings we get elsewhere, like disk space and additional caching. But I'm open to discussing the pros and cons as a team.
Force-pushed a66f521 to f110737
Force-pushed f110737 to eb01029
Force-pushed b96f7aa to a835e6c
```go
package main

import (
	"context"
	"fmt"
	"os"
	"runtime"

	containerd "github.com/containerd/containerd/v2/client"
	"github.com/containerd/containerd/v2/pkg/namespaces"
	"github.com/containerd/platforms"
)

const (
	defaultSocket = "/run/containerd/containerd.sock"
	defaultNS     = "k8s.io"
	// images with compressed content size below this threshold are
	// unpacked after fetch, effectively turning the operation into a
	// full pull (~150 MiB compressed ≈ ~300 MiB unpacked).
	pullSizeThreshold = 150 * 1024 * 1024 // 150 MiB
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintf(os.Stderr, "Usage: %s <image-ref> [image-ref...]\n", os.Args[0])
		fmt.Fprintf(os.Stderr, "Example: %s mcr.microsoft.com/oss/kubernetes/pause:3.9\n", os.Args[0])
		os.Exit(1)
	}

	socket := os.Getenv("CONTAINERD_SOCKET")
	if socket == "" {
		socket = defaultSocket
	}
	ns := os.Getenv("CONTAINERD_NAMESPACE")
	if ns == "" {
		ns = defaultNS
	}

	client, err := containerd.New(socket)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to connect to containerd at %s: %v\n", socket, err)
		os.Exit(1)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), ns)

	failed := 0
	for _, ref := range os.Args[1:] {
		if err := fetchImage(ctx, client, ref); err != nil {
			fmt.Fprintf(os.Stderr, "FAIL %s: %v\n", ref, err)
			failed++
		}
	}

	if failed > 0 {
		os.Exit(1)
	}
}

// fetchImage uses client.Fetch() which:
//   - Downloads all blobs (manifest, config, layers) into the content store
//   - Creates an image record in the metadata database
//   - Does NOT unpack layers into the snapshotter
//
// If the total image content size is below pullSizeThreshold (150 MiB),
// client.Pull() is called to additionally unpack the layers. Pull reuses
// already-fetched content from the store and handles snapshotter resolution
// internally (namespace label → platform default).
func fetchImage(ctx context.Context, client *containerd.Client, ref string) error {
	fmt.Printf("Fetching %s ...\n", ref)

	platform := fmt.Sprintf("linux/%s", runtime.GOARCH)
	p, err := platforms.Parse(platform)
	if err != nil {
		return fmt.Errorf("parse platform %s: %w", platform, err)
	}
	platformMatcher := platforms.OnlyStrict(p)

	imageMeta, err := client.Fetch(ctx, ref,
		containerd.WithPlatformMatcher(platformMatcher),
	)
	if err != nil {
		return fmt.Errorf("fetch failed: %w", err)
	}

	image := containerd.NewImage(client, imageMeta)

	size, err := image.Size(ctx)
	if err != nil {
		fmt.Fprintf(os.Stderr, "WARN %s: could not determine image size, skipping unpack: %v\n", ref, err)
		fmt.Printf("OK %s -> %s (fetched)\n", imageMeta.Name, imageMeta.Target.Digest)
		return nil
	}

	if size < pullSizeThreshold {
		// We use pull here instead of unpack because some runtimes (e.g. containerd-shim-runsc-v1)
		// require pull to trigger unpacking into the correct snapshotter based on the image's platform.
		if _, err := client.Pull(ctx, ref,
			containerd.WithPlatformMatcher(platformMatcher),
			containerd.WithPullUnpack,
		); err != nil {
			return fmt.Errorf("pull failed: %w", err)
		}
		fmt.Printf("OK %s -> %s (pulled, %s)\n", imageMeta.Name, imageMeta.Target.Digest, formatSize(size))
	} else {
		fmt.Printf("OK %s -> %s (fetched, %s)\n", imageMeta.Name, imageMeta.Target.Digest, formatSize(size))
	}

	return nil
}

func formatSize(bytes int64) string {
	const (
		mib = 1024 * 1024
		gib = 1024 * 1024 * 1024
	)
	switch {
	case bytes >= gib:
		return fmt.Sprintf("%.2f GiB", float64(bytes)/float64(gib))
	case bytes >= mib:
		return fmt.Sprintf("%.2f MiB", float64(bytes)/float64(mib))
	default:
		return fmt.Sprintf("%d bytes", bytes)
	}
}
```
The image-fetcher binary is a new Go module but has no unit tests. Other similar standalone Go binaries in this repo (e.g., vhdbuilder/lister) also lack tests, so this is consistent with the codebase convention. However, the image-fetcher has non-trivial logic (size-based fetch vs. pull threshold) that would benefit from test coverage, especially since incorrect behavior could silently produce VHDs with images in the wrong state (fetched-only vs. fully pulled).
```go
size, err := image.Size(ctx)
if err != nil {
	fmt.Fprintf(os.Stderr, "WARN %s: could not determine image size, skipping unpack: %v\n", ref, err)
	fmt.Printf("OK %s -> %s (fetched)\n", imageMeta.Name, imageMeta.Target.Digest)
	return nil
}

if size < pullSizeThreshold {
```
The image.Size() method returns the total size of content in the content store (compressed layer blobs + config + manifests). The comparison size < pullSizeThreshold where pullSizeThreshold is 150 MiB is used to decide fetch-only vs. full pull. However, the comment on line 17-20 says "images with compressed content size below this threshold are unpacked after fetch, effectively turning the operation into a full pull (~150 MiB compressed ≈ ~300 MiB unpacked)."
The image.Size() documentation states it returns the total size of the image's content (sum of all blob sizes in the content store). For a multi-platform image that was fetched with a platform matcher, this should be the platform-specific size. Please verify this returns the compressed content size as intended, not the unpacked/decompressed size, as the threshold logic depends on this distinction.
```diff
@@ -609,7 +609,7 @@ if [ $OS = $UBUNTU_OS_NAME ] && [ "$(isARM64)" -ne 1 ]; then # No ARM64 SKU wit

 mkdir -p /opt/{actions,gpu}

-ctr -n k8s.io image pull "$NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG"
+/opt/azure/containers/image-fetcher "$NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG"
```
The NVIDIA driver image fetch on line 612 was previously using ctr -n k8s.io image pull, which both downloads blobs AND creates an image record AND unpacks layers. The new image-fetcher binary uses client.Fetch() for images above 150 MiB (which the NVIDIA driver image likely is), which only downloads blobs and creates an image record but does NOT unpack layers.
However, looking at the GPU install flow: after the image is fetched during VHD build, CTR_GPU_INSTALL_CMD (defined in cse_config.sh) later runs ctr run to extract drivers. ctr run requires layers to be unpacked into a snapshotter — a fetch-only image (no unpack) will fail at ctr run time with an error like "no snapshot available".
Please verify that the NVIDIA GPU driver extraction workflow (which uses ctr run) works correctly when the image is only fetched (not pulled/unpacked). If the NVIDIA driver image is above 150 MiB compressed (which it likely is at ~1+ GiB), image-fetcher will only fetch it without unpacking, and ctr run will fail.
Force-pushed a8a1700 to 1e9d51f
```yaml
bootType: efi

disks:
  - partitionTableType: gpt
```
should we get sign-off from @hbeberman or @miz060 on these changes?
Yes, will check with them.
I'm alright with the change but I have one functional concern.
OSGuard has some systemd-repart config files baked in that balance the size of root-a and root-b when applied to an Azure disk.
After making this change do you observe the root-a partition growing to fill the provisioned Azure disk?
You are right, the change here https://github.com/microsoft/azurelinux/blob/25bde1f99877f485a18f9edd996101c0fd393db6/toolkit/imageconfigs/files/osguard/repart.d/14-root-a.conf conflicts with the change I made. The issue is the storage: block: we are re-customizing a base image that was already built with the upstream template's partition layout (root-a: 12G), while my config re-specifies storage with root-a: 20G, which conflicts with the base image's existing layout.

Root cause: reinitialize-verity (which says "I'm re-customizing an existing verity image") and a storage: block (which says "re-partition the disk") contradict each other. The Image Customizer likely tries to repartition the already-partitioned base image, corrupting the boot chain → OSProvisioningTimedOut.

I reverted my change so it does not break the OSGuard image; instead I fall back to fetch-only for OSGuard. I'll make the change upstream first and then remove fetch-only.
Force-pushed 1e9d51f to e0f5893
Use platforms.OnlyStrict via WithPlatformMatcher instead of WithPlatform to bypass Pull's internal platforms.Only(), which expands to match sub-platforms (e.g. 386 for amd64). Also fix the stale 300 MiB comment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When IMAGE_FETCH_ONLY=true is set, image-fetcher skips unpacking for all images regardless of size. This saves disk space on OSGuard's constrained root partition.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed e0f5893 to a013bdc
What this PR does / why we need it:
We should just fetch the contents instead of performing a full pull, so the compressed images remain on disk; when a run is performed, all containerd has to do is unpack and run. The runtime difference is negligible, and in exchange the initial image footprint on disk is smaller.
here is the difference
Before
After
Which issue(s) this PR fixes:
Fixes #
Requirements:
Special notes for your reviewer:
Release note: