Cosmos CI build optimizations #48260

Open
kushagraThapar wants to merge 7 commits into Azure:main from kushagraThapar:kushagra/cosmos-ci-build-optimizations

@kushagraThapar
Member

Cosmos CI Build Optimizations

Summary

Comprehensive CI pipeline infrastructure optimization for Cosmos DB emulator tests and Build stage unit tests. Targets redundant compilation, unnecessary uber JAR creation, and excessive job count.

Estimated savings:

| Metric | Before | After | Improvement |
|---|---|---|---|
| PR emulator jobs | 16 | 11 | -31% |
| Total agent time | 23.4 hrs | ~12 hrs | ~49% |
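The percentages in the table follow directly from the before/after figures. A quick Python check, using only the numbers above (the ~12 hrs value is itself an estimate, so the second result is approximate):

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the drop from `before` to `after` as a percentage of `before`."""
    return (before - after) / before * 100.0


# Figures copied from the table above.
jobs_reduction = percent_reduction(16, 11)          # PR emulator jobs
agent_time_reduction = percent_reduction(23.4, 12)  # agent hours; 12 is approximate

print(f"PR emulator jobs: {jobs_reduction:.1f}% fewer")
print(f"Total agent time: {agent_time_reduction:.1f}% less")
```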

Changes

1. PR-conditional emulator matrix (16 → 11 jobs)

Created cosmos-emulator-matrix-pr.json with reduced JDK variants for PR builds. Full matrix runs on main merges only.

Dropped for PRs (5 jobs):

| Dropped Job | Kept Variant |
|---|---|
| Spark 3.3 Java 11 | Java 8 |
| Spark 3.4 Java 8 | Java 11 |
| Spark 3.5/Scala 2.12 Java 8 | Java 17 |
| Spark 4.0/Scala 2.13 Java 17 | Java 21 |
| Kafka Java 11 | Java 17 |

Files: eng/pipelines/templates/stages/cosmos-emulator-matrix-pr.json (new), cosmos-sdk-client.yml (conditional matrix selection)
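The actual selection is a YAML condition in cosmos-sdk-client.yml and is not reproduced here. The Python sketch below only models the decision. It assumes the pipeline keys off Azure Pipelines' `Build.Reason` value (`PullRequest` for PR builds); the function names are illustrative, and the job count of 16 comes from the summary above.

```python
FULL_MATRIX = "cosmos-emulator-matrix.json"
PR_MATRIX = "cosmos-emulator-matrix-pr.json"

# (job, JDK dropped for PRs, JDK kept for PRs), taken from the table above.
DROPPED_FOR_PR = [
    ("Spark 3.3", "Java 11", "Java 8"),
    ("Spark 3.4", "Java 8", "Java 11"),
    ("Spark 3.5/Scala 2.12", "Java 8", "Java 17"),
    ("Spark 4.0/Scala 2.13", "Java 17", "Java 21"),
    ("Kafka", "Java 11", "Java 17"),
]

FULL_MATRIX_JOBS = 16


def select_matrix(build_reason: str) -> str:
    """Pick the reduced matrix for PR builds and the full matrix otherwise."""
    return PR_MATRIX if build_reason == "PullRequest" else FULL_MATRIX


def pr_job_count() -> int:
    """Jobs left in the PR matrix after dropping the redundant JDK variants."""
    return FULL_MATRIX_JOBS - len(DROPPED_FOR_PR)


print(select_matrix("PullRequest"), pr_job_count())
print(select_matrix("IndividualCI"), FULL_MATRIX_JOBS)
```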

2. Skip maven-shade-plugin for non-Spark/non-Kafka jobs

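Core emulator, long emulator, and encryption jobs don't need Spark/Kafka uber JARs. Added -Dshade.skip=true -Dmaven.antrun.skip=true via AdditionalArgs so the shade plugin is skipped in both the build and test steps. The antrun skip is required alongside it because the antrun repack phase expects shade output and fails with 'Could not find file' when that output is missing.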

Savings: ~14 min per non-Spark job. The build step previously spent about 82% of its time (14 of 17 min) creating Spark uber JARs.

Files: cosmos-emulator-matrix.json, cosmos-emulator-matrix-pr.json
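As a sketch of which jobs receive the skip flags: the two `-D` properties come from the description above, while the job names and the substring test used to classify a job are assumptions made for illustration (the real matrices set AdditionalArgs per entry).

```python
SKIP_UBER_JAR_ARGS = ["-Dshade.skip=true", "-Dmaven.antrun.skip=true"]


def additional_args(job_name: str) -> list[str]:
    """Return extra Maven arguments for an emulator job.

    Spark and Kafka jobs need their uber JARs, so they get no skip flags.
    Classifying by job name is an assumption of this sketch.
    """
    needs_uber_jar = any(tag in job_name.lower() for tag in ("spark", "kafka"))
    return [] if needs_uber_jar else list(SKIP_UBER_JAR_ARGS)


for job in ("Emulator", "Long Emulator", "Encryption", "Spark 3.5", "Kafka"):
    print(f"{job}: {' '.join(additional_args(job)) or '(no extra args)'}")
```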

3. Per-job ProjectListOverride for Spark/Kafka jobs

Each Spark emulator job previously compiled ALL 14 modules including other Spark versions it doesn't test (~11 min wasted per job). Added ProjectListOverride support to generate-project-list.ps1 — if set via matrix variable, the script uses it directly instead of computing from the full artifacts list.

Each Spark job now only builds: azure-cosmos + azure-cosmos-test + azure-cosmos-tests + its specific Spark module.

Savings: ~11 min × 9 Spark jobs = ~99 min agent time

Files: eng/pipelines/scripts/generate-project-list.ps1, cosmos-emulator-matrix.json, cosmos-emulator-matrix-pr.json
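The override check can be modeled in a few lines. This Python sketch mirrors the condition used in generate-project-list.ps1 (a non-empty PROJECTLISTOVERRIDE that does not contain its own name). The comma-separated format and the Spark module placeholder are assumptions; the name test is presumably there to ignore an unexpanded `$(ProjectListOverride)` macro.

```python
def resolve_project_list(env: dict[str, str], computed: list[str]) -> list[str]:
    """Return the override list when one is set, else the computed list."""
    override = env.get("PROJECTLISTOVERRIDE", "")
    # PowerShell's -notlike is case-insensitive, so compare in lower case.
    is_placeholder = "projectlistoverride" in override.lower()
    if override and not is_placeholder:
        # Assumption: the override is a comma-separated list of modules.
        return [name.strip() for name in override.split(",") if name.strip()]
    return computed


all_modules = [f"module-{i}" for i in range(1, 15)]  # stand-in for the 14 modules
spark_job_env = {
    "PROJECTLISTOVERRIDE":
        "azure-cosmos,azure-cosmos-test,azure-cosmos-tests,spark-module-for-this-job"
}

print(resolve_project_list(spark_job_env, all_modules))
print(len(resolve_project_list({}, all_modules)))
print(len(resolve_project_list({"PROJECTLISTOVERRIDE": "$(ProjectListOverride)"}, all_modules)))
```

With no override, or with the unexpanded macro text, the function falls through to the computed list, which matches the "no behavior change for other SDKs" claim.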

4. BuildOptions plumbing for Build stage unit tests

Added BuildOptions parameter through the ci.yml → ci.tests.yml → build-and-test.yml pipeline chain. Defaults to empty (no behavior change for other SDKs). Cosmos Build stage sets it to skip shade since unit tests don't need uber JARs.

Savings: ~14 min per unit test job

Files: eng/pipelines/templates/jobs/ci.yml, eng/pipelines/templates/jobs/ci.tests.yml, cosmos-sdk-client.yml
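A small sketch of why an empty default is a no-op: the Maven invocation is modeled as a list of arguments, and an empty BuildOptions adds nothing to it. The base goals shown are placeholders, not the pipeline's actual command line.

```python
import shlex


def maven_command(base_args: list[str], build_options: str = "") -> list[str]:
    """Append BuildOptions to a Maven command; an empty value changes nothing."""
    return ["mvn", *base_args, *shlex.split(build_options)]


base = ["install", "-DskipTests"]  # placeholder goals for illustration

other_sdk = maven_command(base)  # default: BuildOptions is empty
cosmos = maven_command(base, "-Dshade.skip=true -Dmaven.antrun.skip=true")

print(other_sdk)
print(cosmos)
```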

5. Increase Maven build parallelization (1 → 2)

All stages (Build, TestEmulator, TestVNextEmulator) now use BuildParallelization: 2.

File: cosmos-sdk-client.yml

Testing

Pipeline changes validated by CI itself. The generate-project-list.ps1 change is backward compatible — ProjectListOverride defaults to empty (no-op for non-Cosmos pipelines).

kushagraThapar and others added 6 commits March 4, 2026 17:14
1. PR-conditional emulator matrix (16 → 11 jobs):
   Drops redundant JDK variants for Spark/Kafka in PR builds.
   Full matrix on main merges.

   Dropped for PRs (5 jobs, ~5 agent hours saved):
   - Spark 3.3 Java 11 (keeping Java 8)
   - Spark 3.4 Java 8 (keeping Java 11)
   - Spark 3.5/Scala 2.12 Java 8 (keeping Java 17)
   - Spark 4.0/Scala 2.13 Java 17 (keeping Java 21)
   - Kafka Java 11 (keeping Java 17)

2. Increase BuildParallelization from 1 to 2 in all stages
   (Build, TestEmulator, TestVNextEmulator).

3. Skip maven-shade-plugin for non-Spark/non-Kafka emulator jobs:
   Core emulator, long emulator, and encryption jobs don't need
   Spark/Kafka uber JARs. Adding -Dshade.skip=true saves ~90s of
   shade plugin execution per Spark module × 5 modules = ~7-8 min
   per non-Spark job (5 jobs × 7 min = ~35 min agent time saved).

4. Remove outdated comment about emulator download time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The antrun 03-repack phase expects shade output (native .jnilib/.so
files in target/tmp/). When -Dshade.skip=true, the shade output doesn't
exist and antrun fails with 'Could not find file'. Add
-Dmaven.antrun.skip=true alongside -Dshade.skip=true.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The test step runs 'clean verify' which recompiles everything from
scratch, including Spark shade. Our BuildOptions only affected the
build step. Add -Dshade.skip=true -Dmaven.antrun.skip=true to
AdditionalArgs for non-Spark jobs so it flows into TestOptions too.

Keep BuildOptions for the build step as well (both steps need it).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add BuildOptions parameter through ci.yml → ci.tests.yml → build-and-test.yml
pipeline chain. Defaults to empty string (no behavior change for other SDKs).

Cosmos Build stage sets BuildOptions to '-Dshade.skip=true -Dmaven.antrun.skip=true'
to skip Spark/Kafka uber JAR creation during unit test matrix jobs, saving ~14 min
per job. The release artifact deploy step is unaffected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each Spark emulator job previously compiled ALL 14 modules including
other Spark versions it doesn't test, wasting ~11 min per job on
unnecessary shade+compile.

Changes:
- generate-project-list.ps1: Check for ProjectListOverride env var
  at the top. If set, use it directly and skip normal computation.
  Defaults to empty (no behavior change for other SDKs).
- Emulator matrix JSONs: Add ProjectListOverride for each Spark and
  Kafka job with only the modules they need (core + their specific
  Spark/Kafka module).

Example: Spark 3.5/2.13 job previously built 14 modules (41 min test
step). Now builds only 6 modules, saving ~11 min per Spark job.

Estimated savings: ~11 min × 9 Spark jobs + ~5 min × 2 Kafka jobs
= ~109 min agent time per full CI run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@kushagraThapar kushagraThapar marked this pull request as ready for review March 5, 2026 18:28
Copilot AI review requested due to automatic review settings March 5, 2026 18:28
@kushagraThapar kushagraThapar requested review from a team, benbp, raych1 and weshaggard as code owners March 5, 2026 18:28
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

…mpilation

Emulator, Long Emulator, and Encryption jobs were compiling all 14 cosmos
modules including 7 Spark modules (Scala compilation ~10-16 min) despite
only running core emulator tests. Add ProjectListOverride to limit these
jobs to only the modules they actually test:
- Emulator/Long Emulator: azure-cosmos, azure-cosmos-test, azure-cosmos-tests
- Encryption: adds azure-cosmos-encryption

Also reverts the no-op TestSuiteBase trigger commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@kushagraThapar kushagraThapar force-pushed the kushagra/cosmos-ci-build-optimizations branch from 8b4024c to ddec16d Compare March 5, 2026 22:26

# If ProjectListOverride is set (e.g., from matrix variables), use it directly
# to avoid building unnecessary modules in jobs that only test a subset.
if ($env:PROJECTLISTOVERRIDE -and $env:PROJECTLISTOVERRIDE -notlike '*ProjectListOverride*') {
Member

You can override the existing ArtifactsJson variable today, but it's messier:

"ArtifactsJson": "{
\"name\": \"azure-core-version-tests\",
\"groupId\": \"com.azure\",
\"safeName\": \"azurecoreversiontests\"
}",

Given your scenario though, it's probably simpler to just allow this type of override. @alzimmermsft can you think of any gotchas here?

Member

@alzimmermsft alzimmermsft Mar 6, 2026

The short answer is yes, there could be issues caused by this if the manual project list override doesn't fully enclose the build space, but the bigger problem could be in From Source runs where it calculates the build space. But overall, I'm good with this as this should be used in very niche scenarios, but two thoughts:

  1. @kushagraThapar, mind removing this for one build run to see how much this affects CI time? Based on what I know about the emulator runs, I think removing Shade and Ant are the majority of the CI time improvement. If this doesn't affect build time much we may just want to remove it.
  2. If we can, guard this on From Source runs, if that is something we can check for. It should just be a check on $env:TESTFROMSOURCE being false / missing.
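The From Source guard suggested above could look like the following, modeled in Python rather than PowerShell. It assumes TESTFROMSOURCE carries the string `true` on From Source runs and is missing or `false` otherwise; that encoding is an assumption, not something confirmed in this thread.

```python
def should_use_override(env: dict[str, str]) -> bool:
    """Honor ProjectListOverride only outside From Source runs."""
    override = env.get("PROJECTLISTOVERRIDE", "")
    has_override = bool(override) and "projectlistoverride" not in override.lower()
    # Assumption: From Source runs set TESTFROMSOURCE to "true".
    from_source = env.get("TESTFROMSOURCE", "").strip().lower() == "true"
    return has_override and not from_source


print(should_use_override({"PROJECTLISTOVERRIDE": "azure-cosmos"}))
print(should_use_override({"PROJECTLISTOVERRIDE": "azure-cosmos", "TESTFROMSOURCE": "true"}))
```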
