[SPARK-56768][PYTHON][INFRA] Share SBT compile artifact across pyspark CI jobs #55726
Draft
zhengruifeng wants to merge 11 commits into apache:master from
Conversation
The pyspark matrix (8 jobs) and sparkr each rebuild the same Spark JARs from scratch, costing ~13m27s of SBT compile per job. With sparkr included, this is roughly 127m of redundant SBT compile per CI run. This change adds a `precompile-pyspark` job that runs the same SBT build (`Test/package + streaming-kinesis-asl-assembly/assembly + connect/assembly + assembly/package` with the 11 standard profiles) once, tars all `target/` directories with `zstd -3 -T0`, and uploads them as a 1-day-retention artifact. The pyspark matrix and sparkr jobs now `needs:` this job, download and extract the artifact, and set `SKIP_BUILD=1` so `dev/run-tests` skips the redundant compile. Net CI compute saved: roughly 95-105m per run, ~13-14% of total. Wall clock is roughly unchanged - the build is now serialized before the matrix instead of parallel-hidden inside it. Generated-by: Claude Code (Opus 4.7)
Drop the pyspark-specific name from the shared SBT compile job. The artifact it produces is reusable from any job that needs the same profile/goal set, e.g. R or doc builds in follow-ups. Generated-by: Claude Code (Opus 4.7)
The pyspark/sparkr Docker images do not ship zstd, so the "Extract precompiled artifact" step failed with `zstd: Cannot exec: No such file or directory`. Switch to xz, which is in xz-utils and present in every standard Ubuntu base image. Use `XZ_OPT='-T0 -9'` so compression is multi-threaded and at the highest level, which is also a slightly better ratio than zstd at -3. Generated-by: Claude Code (Opus 4.7)
Use tar's `-j` codec (bzip2). bzip2 is in every standard Ubuntu base image (same availability as xz), and its default level is 9, so no extra options are needed. Generated-by: Claude Code (Opus 4.7)
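For reference, a minimal sketch of what the package and extract steps could look like with bzip2; the step names, the `find` expression, and the artifact file name here are assumptions for illustration, not the exact implementation:

```yaml
# Producer side (precompile job). Sketch only: step names, the find expression,
# and the artifact file name are assumptions.
- name: Package compile output
  run: |
    # Collect every target/ directory, skipping ./build/ (bundled sbt launcher) and ./.git/,
    # then compress with bzip2 (tar's -j codec, default level 9).
    find . -path ./build -prune -o -path ./.git -prune -o -type d -name target -print0 \
      | tar -cjf spark-compile.tar.bz2 --null -T -
    ls -lh spark-compile.tar.bz2
- name: Upload compile output
  uses: actions/upload-artifact@v4
  with:
    name: spark-compile-${{ github.run_id }}
    path: spark-compile.tar.bz2
    retention-days: 1

# Consumer side (pyspark matrix job); bzip2 is present in the standard Ubuntu images.
- name: Download precompiled artifact
  uses: actions/download-artifact@v4
  with:
    name: spark-compile-${{ github.run_id }}
- name: Extract precompiled artifact
  run: tar -xjf spark-compile.tar.bz2
```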
REVERT BEFORE MERGE. Force the precondition step to emit only pyspark + pyspark-pandas as true so the PR's CI iterations skip Maven/lint/docs/sparkr/etc. and only exercise the path this PR touches. Generated-by: Claude Code (Opus 4.7)
Revert the sparkr job to its original shape: no `precompile` in `needs`, no `SKIP_BUILD` env, no artifact download/extract. Also drop sparkr from the `precompile` job's `if:` gate since it is no longer a consumer. Generated-by: Claude Code (Opus 4.7)
The TEMP override that forced pyspark/pyspark-pandas only was added for iteration on this PR. With the implementation validated, restore the normal precondition logic so the full set of jobs runs. Generated-by: Claude Code (Opus 4.7)
The pyspark-install matrix entry has its own gate independent of the umbrella `pyspark` flag. Through the normal precondition path the two are correlated (pyspark-install belongs to the pyspark module list), but via `inputs.jobs` they can be set independently. Add pyspark-install to the precompile job's `if:` so the artifact is always available when the matrix entry runs. Generated-by: Claude Code (Opus 4.7)
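Illustratively, the gate could look like the following; the precondition job's output name (`required`) and the exact key names are assumptions based on the description, not a verbatim excerpt from the workflow:

```yaml
# Sketch only: the precondition output name and JSON keys are assumed.
precompile:
  needs: precondition
  if: >-
    fromJson(needs.precondition.outputs.required).pyspark == 'true' ||
    fromJson(needs.precondition.outputs.required)['pyspark-pandas'] == 'true' ||
    fromJson(needs.precondition.outputs.required)['pyspark-install'] == 'true'
```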
…inishes Adds a final job that runs after the pyspark matrix completes (whether the matrix succeeded, failed, or was cancelled — gated only on the precompile job succeeding) and deletes the spark-compile-<run_id> artifact via the GitHub Actions REST API. The artifact's `retention-days: 1` already auto-expires it within 24h, so this is a "reclaim immediately" optimization rather than a leak fix. Best-effort: a failed delete does not fail the workflow. Generated-by: Claude Code (Opus 4.7)
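A hedged sketch of what such a cleanup job could look like, using the `gh` CLI against the Actions REST API; the job name, gating expression, and `gh` invocation are assumptions for illustration only:

```yaml
# Sketch only: runs after the pyspark matrix regardless of its outcome,
# but only if the shared artifact was actually produced.
cleanup-compile-artifact:
  needs: [precompile, pyspark]
  if: always() && needs.precompile.result == 'success'
  runs-on: ubuntu-latest
  steps:
    - name: Delete shared compile artifact
      continue-on-error: true   # best-effort: a failed delete must not fail the workflow
      env:
        GH_TOKEN: ${{ github.token }}
      run: |
        # Look up this run's artifact id, then delete it via the Actions REST API.
        id=$(gh api "repos/${{ github.repository }}/actions/runs/${{ github.run_id }}/artifacts" \
               --jq '.artifacts[] | select(.name == "spark-compile-${{ github.run_id }}") | .id')
        if [ -n "$id" ]; then
          gh api -X DELETE "repos/${{ github.repository }}/actions/artifacts/$id"
        fi
```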
…yspark finishes" This reverts commit 9323883.
The precompile job runs on a fresh ubuntu-latest runner with ~14 GB free out of the box. The full SBT build plus the resulting bzip2 artifact fits comfortably; the disk-cleanup step (which removes Android SDK, .NET, etc. from the runner image) added ~10s for no benefit. Generated-by: Claude Code (Opus 4.7)
What changes were proposed in this pull request?
This PR adds a single shared `precompile` CI job that runs Spark's SBT build once and uploads the resulting `target/` trees as a GitHub Actions artifact. The 8 pyspark matrix entries plus the optional `pyspark-install` entry now consume that artifact instead of re-running the same SBT build themselves. The job is named generically because the same artifact can be reused by sparkr, R, or documentation jobs in follow-ups.

Concretely (a wiring sketch follows this list):
- A new `precompile` job in `.github/workflows/build_and_test.yml` runs the SBT build (the four goals and 11 profiles listed in the next section), tars every `target/` directory (excluding `./build/` and `./.git/`) with `tar -cjf` (bzip2), and uploads the result as `spark-compile-${{ github.run_id }}` with `retention-days: 1` so storage is reclaimed within 24h.
- The job's `if:` gate fires when any of `pyspark`, `pyspark-pandas`, or `pyspark-install` is true in the precondition output, so the artifact is always available for any matrix entry that needs it (including via `inputs.jobs` overrides used by scheduled / dispatched workflows).
- The `pyspark` matrix job adds `precompile` to `needs:`, downloads and extracts the artifact before running tests, and sets `SKIP_BUILD: true` in env.
- `dev/run-tests.py` now skips `build_apache_spark` and `build_spark_assembly_sbt` when `SKIP_BUILD` is set, matching the existing `SKIP_UNIDOC` / `SKIP_MIMA` pattern.
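A condensed and hedged sketch of that wiring; the job and step names mirror the description above, while the runner/container setup and exact step bodies are abbreviated or assumed:

```yaml
# Sketch only: abbreviated from the description above, not the full workflow file.
precompile:
  needs: precondition
  if: fromJson(needs.precondition.outputs.required).pyspark == 'true'   # plus pyspark-pandas / pyspark-install
  runs-on: ubuntu-latest
  env:
    SBT_PROFILES: -Phadoop-3 -Pyarn -Phive -Phive-thriftserver -Pkubernetes -Phadoop-cloud -Pjvm-profiler -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pvolcano
  steps:
    - uses: actions/checkout@v4
    - name: Build Spark with SBT
      # one invocation, all four goals (full command sketched in the next section)
      run: ./build/sbt $SBT_PROFILES Test/package streaming-kinesis-asl-assembly/assembly connect/assembly assembly/package
    - name: Package compile output
      # tar -cjf every target/ directory, excluding ./build/ and ./.git/
      run: tar -cjf spark-compile.tar.bz2 $(find . -type d -name target -not -path './build/*' -not -path './.git/*')
    - uses: actions/upload-artifact@v4
      with:
        name: spark-compile-${{ github.run_id }}
        path: spark-compile.tar.bz2
        retention-days: 1

pyspark:
  needs: [precondition, precompile]
  runs-on: ubuntu-latest   # container setup and strategy.matrix (the existing 8 module groups) unchanged and elided
  env:
    SKIP_BUILD: "true"   # dev/run-tests.py skips build_apache_spark / build_spark_assembly_sbt
  steps:
    - uses: actions/checkout@v4
    - uses: actions/download-artifact@v4
      with:
        name: spark-compile-${{ github.run_id }}
    - name: Extract precompiled artifact
      run: tar -xjf spark-compile.tar.bz2
    - name: Run tests
      run: ./dev/run-tests   # unchanged; the build phase is skipped via SKIP_BUILD
```

The `SBT_PROFILES` grouping is purely illustrative; the real workflow may inline the profiles on the SBT command line.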
SBT invocations: before vs. after

Every pyspark matrix entry today drives `dev/run-tests.py`, which makes two SBT calls back-to-back (`build_spark_sbt` at `dev/run-tests.py:647`, then `build_spark_assembly_sbt` at `dev/run-tests.py:656`). The 11 profiles, identical across all 8 entries:
`-Phadoop-3 -Pyarn -Phive -Phive-thriftserver -Pkubernetes -Phadoop-cloud -Pjvm-profiler -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pvolcano`

After this PR, with `SKIP_BUILD: true` set on the matrix job, both calls are gated off; no SBT compile runs in the matrix entry at all. The new `precompile` job runs one SBT invocation that combines all four goals (safe because `SKIP_MIMA=true` in the pyspark job, so the original split for `dev/mima` is moot here). The combined invocation (`precompile` job, all 4 goals) is sketched below.
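For illustration, the combined command as a workflow step; the step name is an assumption, while the goals and profiles are the ones listed above:

```yaml
# Sketch of the single combined SBT invocation the precompile job runs once.
- name: Build Spark with SBT (precompile job, all 4 goals combined)
  run: |
    ./build/sbt -Phadoop-3 -Pyarn -Phive -Phive-thriftserver -Pkubernetes \
      -Phadoop-cloud -Pjvm-profiler -Pspark-ganglia-lgpl -Pkinesis-asl \
      -Pdocker-integration-tests -Pvolcano \
      Test/package streaming-kinesis-asl-assembly/assembly connect/assembly assembly/package
```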
The produced `target/` is byte-equivalent: same goals, same profiles, same Scala/Java/Hadoop versions.

Why are the changes needed?
Each of the 8 pyspark matrix jobs runs the same ~13m27s SBT compile independently. Across a single CI run that's roughly 108m of redundant compile time, against a per-run total of ~700m. This change deduplicates that work.
Estimated savings, based on a recent run of the `Build and test` workflow: roughly 95-105m of CI compute per run, about 13-14% of the ~700m per-run total.
Wall clock of the workflow is roughly unchanged. The build was previously parallel-hidden inside each matrix runner; sharing it serializes one ~13m build before the matrix, but the slowest matrix runner shrinks by the same amount, so the critical path is similar (within a few minutes).
Does this PR introduce any user-facing change?
No. CI infrastructure change only.
How was this patch tested?
The change is exercised by the CI run of this PR itself:
- If `precompile` succeeds and produces an artifact of reasonable size, the build phase works.
- If the pyspark matrix entries pass, `SKIP_BUILD` is correctly skipping the local compile.

A few things to watch in the first run:

- `target/` is roughly 1-3 GB raw; expect ~600 MB-1 GB after bzip2. The "Package compile output" step prints the size with `ls -lh`. If it ever gets close to GHA's 10 GB per-artifact cap we should slim the find pattern (e.g., exclude `target/streams` and intermediate scaladoc).
- `bzip2` in the test images. The pyspark Docker images need `bzip2` for the extract step. It is in the `bzip2` package and present in every standard Ubuntu base image.

The doctests in `dev/sparktestsupport/utils.py` continue to pass; no logic in `is-changed.py` or the module graph was changed.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7)