
ci: harden PyPI publish against partial-upload failures#700

Merged
qing-ant merged 4 commits into main from qing/harden-publish-workflow
Mar 20, 2026

Conversation

@qing-ant
Contributor

@qing-ant qing-ant commented Mar 18, 2026

Context

v0.1.49 partially published — only the macosx_11_0_arm64 wheel landed on PyPI before the project hit its 10 GiB quota. The failure surfaced as a generic 400 Bad Request with no quota-specific text. Re-runs failed immediately because the already-published wheel caused twine to error on "file exists."

Update: PyPI granted a quota increase to 50 GiB, so the immediate pressure is off — but the hardening here is still valuable for the next time we approach the limit.

Fixes the failure modes that made this hard to diagnose and impossible to retry.

Refs #687.

Changes

  • --skip-existing on twine upload — re-runs are idempotent; already-published files are skipped instead of erroring
  • Upload sdist before wheels — the 87 KB sdist is a universal fallback. If it lands first, pip install works for everyone via source build even if wheel upload dies mid-sequence. Previously .tar.gz sorted last alphabetically in dist/* and never got uploaded when things went wrong
  • Pre-flight quota check — queries the PyPI JSON API, sums all release sizes + local dist sizes, fails with a clear ::error:: annotation if projected > 47.5 GiB (95% of the 50 GiB quota). Prints current usage % on every run so quota creep is visible
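
The pre-flight check described above can be sketched roughly as follows. This is a minimal sketch, not the exact workflow code: the constant and function names are assumptions, and it assumes the PyPI JSON API's `releases` mapping (version → list of file dicts with a `size` field).

```python
import json
from urllib.request import urlopen

QUOTA_BYTES = 50 * 1024**3                 # 50 GiB project quota
THRESHOLD_BYTES = int(QUOTA_BYTES * 0.95)  # fail above 95% projected usage

def projected_usage(pypi_data: dict, pending_sizes: list[int]) -> int:
    """Sum every published file size plus the local artifacts about to upload."""
    published = sum(
        f["size"]
        for files in pypi_data.get("releases", {}).values()
        for f in files
    )
    return published + sum(pending_sizes)

def preflight(project: str, pending_sizes: list[int]) -> None:
    with urlopen(f"https://pypi.org/pypi/{project}/json", timeout=30) as resp:
        data = json.load(resp)
    projected = projected_usage(data, pending_sizes)
    # Print usage on every run so quota creep stays visible.
    print(f"projected usage: {100 * projected / QUOTA_BYTES:.1f}% of quota")
    if projected > THRESHOLD_BYTES:
        print(f"::error::projected {projected} B exceeds threshold {THRESHOLD_BYTES} B")
        raise SystemExit(1)
```

Keeping the arithmetic in a pure helper (`projected_usage`) keeps the check testable without touching the network.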

Not in this PR

  • Pruning old releases (manual PyPI web UI action — less urgent now)
  • Re-publishing v0.1.49's missing wheels (resolved — all wheels now live on PyPI)

v0.1.49 partially published (1 of 5 wheels) when the 10 GiB PyPI project
quota was exhausted mid-upload. Re-runs then failed because twine errors
on already-published filenames.

Three mitigations:

- Add --skip-existing to twine upload so re-running the workflow after
  a partial upload skips files PyPI already has instead of aborting.

- Upload the sdist before the wheels. The sdist is ~87 KB and is a
  universal install fallback; landing it first means pip install works
  on every platform even if large wheel uploads fail partway. Build it
  to a separate sdist/ dir so we can control upload order explicitly
  (previously dist/* uploaded alphabetically, .tar.gz last).

- Add a pre-flight quota check that fetches current PyPI usage from the
  JSON API, sums pending artifact sizes, and fails the job with a clear
  error if projected usage exceeds 9.5 GiB (0.5 GiB buffer under the
  10 GiB limit). Always prints current usage percentage for visibility.
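
The alphabetical-ordering pitfall behind the second mitigation is easy to demonstrate: shell globs sort lexicographically, and `-` sorts before `.`, so wheel filenames precede the `.tar.gz`. The filenames below are illustrative, not the project's real artifacts.

```python
# A dist/* glob sorts like this, pushing the sdist to the very end:
files = [
    "pkg-0.1.49.tar.gz",
    "pkg-0.1.49-cp311-cp311-macosx_11_0_arm64.whl",
    "pkg-0.1.49-cp311-cp311-manylinux_2_17_x86_64.whl",
]
# '-' (0x2D) < '.' (0x2E), so both wheels sort before the .tar.gz.
print(sorted(files))
```

Building the sdist into its own `sdist/` directory sidesteps this entirely: upload order becomes explicit rather than an accident of filename sorting.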
@qing-ant
Contributor Author

Bumped the pre-flight check constants to reflect the 50 GiB quota increase PyPI granted. Threshold is now 47.5 GiB (95%) — same proportional buffer as before. Print statements now interpolate QUOTA_BYTES/THRESHOLD_BYTES instead of hardcoding the number, so future bumps only touch the two constants at the top.

@qing-ant qing-ant marked this pull request as ready for review March 20, 2026 18:53
Contributor

@claude claude bot left a comment


LGTM — straightforward CI hardening with clear motivation; the inline nits on the advisory quota check are worth addressing but non-blocking.

Extended reasoning...

Overview

This PR modifies a single file (.github/workflows/build-and-publish.yml) to harden the PyPI publish workflow against partial-upload failures. The three changes are: (1) --skip-existing on twine upload for idempotent re-runs, (2) uploading the sdist before wheels so pip install has a universal fallback even if wheel uploads fail partway, and (3) a pre-flight quota check that queries the PyPI JSON API and fails with a clear error if projected usage exceeds 95% of the 50 GiB quota.

Security risks

No new security risks. The PYPI_API_TOKEN secret usage was already present and is unchanged. The new urlopen call fetches public PyPI metadata over HTTPS — no credentials are involved. The workflow still runs in the production environment with appropriate permissions.

Level of scrutiny

This is a CI-only change with no impact on production code, library behavior, or user-facing APIs. The changes are well-motivated by a real incident (v0.1.49 partial publish) and follow straightforward patterns. Low scrutiny is appropriate.

Other factors

The two bugs found are both labeled "Nit" and I agree with that classification. The missing timeout/error-handling on the urlopen call is a valid observation — an advisory check should degrade gracefully — but the practical risk is very low (PyPI is highly available, and GitHub Actions has a job-level timeout as a backstop). The double-counting issue on re-runs is technically correct but the 2.5 GiB buffer makes a false positive essentially impossible in practice. Neither nit undermines the core value of the PR, which is the --skip-existing flag and sdist-first ordering.

- Add timeout=30 to urlopen and wrap in try/except so the step
  logs a warning and continues if PyPI is slow or returns an error.
- Filter local filenames against files already uploaded to PyPI for
  the current version to avoid double-counting on re-runs after a
  partial upload.
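
Those two suggestions can be sketched as below. Names and structure are assumptions (the workflow presumably inlines this in a script step), but the shape follows the review: a guarded, time-limited fetch that degrades to a warning, and a filter that drops files PyPI already has for the version being published.

```python
import json
from urllib.request import urlopen

def existing_filenames(project: str, version: str) -> set[str]:
    """Filenames PyPI already has for this version; empty set if the lookup fails."""
    try:
        with urlopen(f"https://pypi.org/pypi/{project}/json", timeout=30) as resp:
            data = json.load(resp)
        return {f["filename"] for f in data.get("releases", {}).get(version, [])}
    except Exception as exc:
        # Advisory check: warn and continue rather than fail the publish.
        print(f"::warning::quota pre-check skipped ({exc})")
        return set()

def not_yet_uploaded(local: list[str], already: set[str]) -> list[str]:
    """Drop files PyPI already has so re-runs do not double-count their sizes."""
    return [name for name in local if name not in already]
```

On a re-run after a partial upload, only the files in `not_yet_uploaded(...)` contribute to the projected-usage sum, which removes the double-counting.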
Contributor

@claude claude bot left a comment


Both previously flagged issues (missing timeout/try-except and double-counting on re-runs) have been addressed. LGTM.

Extended reasoning...

Overview

This PR modifies a single file (.github/workflows/build-and-publish.yml) to harden the PyPI publish workflow against partial-upload failures. Changes include: splitting build/publish steps, adding --skip-existing to twine uploads for idempotent re-runs, uploading sdist before wheels for a universal fallback, and adding a pre-flight quota check.

Security risks

No security concerns. The workflow already handles PyPI credentials via GitHub secrets (PYPI_API_TOKEN), and this PR does not change how secrets are accessed or used. The only external API call is a read-only fetch of public PyPI metadata.

Level of scrutiny

This is a CI workflow change with no application code impact. The logic is straightforward and well-commented. Both issues I raised in my previous review (missing timeout/error handling on the urlopen call, and double-counting of already-uploaded files on re-runs) have been incorporated into the current code exactly as suggested.

Other factors

The PR addresses a real operational incident (v0.1.49 partial publish). The changes are self-contained, follow established CI patterns, and the author has been responsive to feedback. No outstanding reviewer comments remain unaddressed.

Contributor

@claude claude bot left a comment


LGTM — both prior suggestions (timeout + try/except, dedup of already-uploaded files) are addressed in the current revision.

Extended reasoning...

Overview

This PR modifies a single CI workflow file (.github/workflows/build-and-publish.yml) to harden the PyPI publish pipeline against partial-upload failures. The changes add --skip-existing to twine uploads, reorder uploads so the sdist lands first as a universal fallback, and add a pre-flight quota check against the PyPI JSON API.

Security risks

No security concerns. Secrets (PYPI_API_TOKEN, DEPLOY_KEY) are handled via existing GitHub Actions secrets references — no new secret exposure. The only external API call is a read-only GET to the public PyPI JSON API, wrapped in try/except so failures degrade gracefully.

Level of scrutiny

This is a CI-only change to a publish workflow — no production application code is modified. The logic is straightforward (quota arithmetic, file listing, twine flags). Both issues I raised in my previous review have been addressed: the urlopen call now has timeout=30 and a try/except that emits a warning and exits cleanly, and the dedup logic filters out already-uploaded files to prevent double-counting on re-runs.

Other factors

Both inline comments from my prior review are marked as resolved, and the diff confirms the fixes are present. No other reviewers have outstanding comments. The PR is well-scoped, defensive in nature, and follows established CI patterns. The bug hunting system found no issues.

@qing-ant qing-ant merged commit 117cffa into main Mar 20, 2026
16 checks passed
@qing-ant qing-ant deleted the qing/harden-publish-workflow branch March 20, 2026 21:41