ci: harden PyPI publish against partial-upload failures #700
Conversation
v0.1.49 partially published (1 of 5 wheels) when the 10 GiB PyPI project quota was exhausted mid-upload. Re-runs then failed because twine errors on already-published filenames. Three mitigations:

- Add `--skip-existing` to `twine upload` so re-running the workflow after a partial upload skips files PyPI already has instead of aborting.
- Upload the sdist before the wheels. The sdist is ~87 KB and is a universal install fallback; landing it first means `pip install` works on every platform even if large wheel uploads fail partway. Build it to a separate `sdist/` dir so we can control upload order explicitly (previously `dist/*` uploaded alphabetically, `.tar.gz` last).
- Add a pre-flight quota check that fetches current PyPI usage from the JSON API, sums pending artifact sizes, and fails the job with a clear error if projected usage exceeds 9.5 GiB (a 0.5 GiB buffer under the 10 GiB limit). It always prints the current usage percentage for visibility.
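For concreteness, the reordered build and publish steps could be sketched roughly like this (step names, paths, and the surrounding workflow layout are illustrative assumptions, not the actual contents of `build-and-publish.yml`):

```yaml
# Hypothetical sketch of the publish job steps — not the real workflow file.

- name: Build sdist into its own directory
  # Separate output dir so upload order is explicit, not alphabetical.
  run: python -m build --sdist --outdir sdist/

- name: Build wheels
  run: python -m build --wheel --outdir dist/

- name: Publish (sdist first, then wheels; idempotent on re-run)
  run: |
    twine upload --skip-existing sdist/*
    twine upload --skip-existing dist/*.whl
  env:
    TWINE_USERNAME: __token__
    TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
```

Uploading `sdist/*` in its own `twine upload` invocation is what guarantees the universal fallback lands before any wheel, and `--skip-existing` makes a re-run after a partial upload a no-op for files PyPI already has.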
Bumped the pre-flight check constants to reflect the 50 GiB quota increase PyPI granted. The threshold is now 47.5 GiB (95%), the same proportional buffer as before. Print statements now interpolate …
LGTM — straightforward CI hardening with clear motivation; the inline nits on the advisory quota check are worth addressing but non-blocking.
Extended reasoning...
Overview
This PR modifies a single file (.github/workflows/build-and-publish.yml) to harden the PyPI publish workflow against partial-upload failures. The three changes are: (1) --skip-existing on twine upload for idempotent re-runs, (2) uploading the sdist before wheels so pip install has a universal fallback even if wheel uploads fail partway, and (3) a pre-flight quota check that queries the PyPI JSON API and fails with a clear error if projected usage exceeds 95% of the 50 GiB quota.
Security risks
No new security risks. The PYPI_API_TOKEN secret usage was already present and is unchanged. The new urlopen call fetches public PyPI metadata over HTTPS — no credentials are involved. The workflow still runs in the production environment with appropriate permissions.
Level of scrutiny
This is a CI-only change with no impact on production code, library behavior, or user-facing APIs. The changes are well-motivated by a real incident (v0.1.49 partial publish) and follow straightforward patterns. Low scrutiny is appropriate.
Other factors
The two bugs found are both labeled "Nit" and I agree with that classification. The missing timeout/error-handling on the urlopen call is a valid observation — an advisory check should degrade gracefully — but the practical risk is very low (PyPI is highly available, and GitHub Actions has a job-level timeout as a backstop). The double-counting issue on re-runs is technically correct but the 2.5 GiB buffer makes a false positive essentially impossible in practice. Neither nit undermines the core value of the PR, which is the --skip-existing flag and sdist-first ordering.
- Add `timeout=30` to `urlopen` and wrap it in try/except so the step logs a warning and continues if PyPI is slow or returns an error.
- Filter local filenames against files already uploaded to PyPI for the current version, to avoid double-counting on re-runs after a partial upload.
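A minimal sketch of the advisory quota check with both suggestions applied. The project name, local artifact directories, threshold constants, and helper names here are assumptions for illustration, not the workflow's actual code; the PyPI JSON API also has no direct "quota usage" field, so summing published file sizes stands in for it below.

```python
import json
import os
import sys
from urllib.error import URLError
from urllib.request import urlopen

PROJECT = "example-project"             # assumed project name
QUOTA_BYTES = 50 * 1024**3              # 50 GiB quota
THRESHOLD_BYTES = int(47.5 * 1024**3)   # 95% threshold (2.5 GiB buffer)


def projected_usage(current_bytes, pending_files, already_uploaded):
    """Current usage plus sizes of pending artifacts not yet on PyPI.

    Filtering against `already_uploaded` prevents double-counting on a
    re-run after a partial upload.
    """
    pending = sum(
        size for name, size in pending_files if name not in already_uploaded
    )
    return current_bytes + pending


def fetch_pypi_metadata(project):
    """Read-only GET against the public PyPI JSON API; advisory only."""
    try:
        with urlopen(f"https://pypi.org/pypi/{project}/json", timeout=30) as resp:
            return json.load(resp)
    except (URLError, TimeoutError, ValueError) as exc:
        # Advisory check: warn and continue rather than fail the publish.
        print(f"::warning::quota pre-flight skipped: {exc}")
        return None


if __name__ == "__main__":
    meta = fetch_pypi_metadata(PROJECT)
    if meta is None:
        sys.exit(0)  # degrade gracefully when PyPI is unreachable
    version = meta["info"]["version"]
    # Filenames already published for the version being released.
    already = {f["filename"] for f in meta["releases"].get(version, [])}
    # Stand-in for current project usage: sum of all published file sizes.
    current = sum(f["size"] for files in meta["releases"].values() for f in files)
    pending = []
    for d in ("sdist", "dist"):  # assumed local artifact directories
        if os.path.isdir(d):
            pending += [(n, os.path.getsize(os.path.join(d, n)))
                        for n in os.listdir(d)]
    total = projected_usage(current, pending, already)
    print(f"projected usage: {100 * total / QUOTA_BYTES:.1f}% of the 50 GiB quota")
    if total > THRESHOLD_BYTES:
        print("::error::projected PyPI usage exceeds the 47.5 GiB threshold")
        sys.exit(1)
```

The `::warning::` / `::error::` lines use GitHub Actions workflow-command syntax so the messages surface as annotations on the job.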
Both previously flagged issues (missing timeout/try-except and double-counting on re-runs) have been addressed. LGTM.
Extended reasoning...
Overview
This PR modifies a single file (.github/workflows/build-and-publish.yml) to harden the PyPI publish workflow against partial-upload failures. Changes include: splitting build/publish steps, adding --skip-existing to twine uploads for idempotent re-runs, uploading sdist before wheels for a universal fallback, and adding a pre-flight quota check.
Security risks
No security concerns. The workflow already handles PyPI credentials via GitHub secrets (PYPI_API_TOKEN), and this PR does not change how secrets are accessed or used. The only external API call is a read-only fetch of public PyPI metadata.
Level of scrutiny
This is a CI workflow change with no application code impact. The logic is straightforward and well-commented. Both issues I raised in my previous review (missing timeout/error handling on the urlopen call, and double-counting of already-uploaded files on re-runs) have been incorporated into the current code exactly as suggested.
Other factors
The PR addresses a real operational incident (v0.1.49 partial publish). The changes are self-contained, follow established CI patterns, and the author has been responsive to feedback. No outstanding reviewer comments remain unaddressed.
LGTM — both prior suggestions (timeout + try/except, dedup of already-uploaded files) are addressed in the current revision.
Extended reasoning...
Overview
This PR modifies a single CI workflow file (.github/workflows/build-and-publish.yml) to harden the PyPI publish pipeline against partial-upload failures. The changes add --skip-existing to twine uploads, reorder uploads so the sdist lands first as a universal fallback, and add a pre-flight quota check against the PyPI JSON API.
Security risks
No security concerns. Secrets (PYPI_API_TOKEN, DEPLOY_KEY) are handled via existing GitHub Actions secrets references — no new secret exposure. The only external API call is a read-only GET to the public PyPI JSON API, wrapped in try/except so failures degrade gracefully.
Level of scrutiny
This is a CI-only change to a publish workflow — no production application code is modified. The logic is straightforward (quota arithmetic, file listing, twine flags). Both issues I raised in my previous review have been addressed: the urlopen call now has timeout=30 and a try/except that emits a warning and exits cleanly, and the dedup logic filters out already-uploaded files to prevent double-counting on re-runs.
Other factors
Both inline comments from my prior review are marked as resolved, and the diff confirms the fixes are present. No other reviewers have outstanding comments. The PR is well-scoped, defensive in nature, and follows established CI patterns. The bug hunting system found no issues.
Context
v0.1.49 partially published — only the `macosx_11_0_arm64` wheel landed on PyPI before the project hit its 10 GiB quota. The failure surfaced as a generic `400 Bad Request` with no quota-specific text. Re-runs failed immediately because the already-published wheel caused twine to error on "file exists."

Update: PyPI granted a quota increase to 50 GiB, so the immediate pressure is off — but the hardening here is still valuable for the next time we approach the limit.
Fixes the failure modes that made this hard to diagnose and impossible to retry.
Refs #687.
Changes
- `--skip-existing` on twine upload — re-runs are idempotent; already-published files are skipped instead of erroring
- Upload the sdist before the wheels — `pip install` works for everyone via source build even if a wheel upload dies mid-sequence. Previously `.tar.gz` sorted last alphabetically in `dist/*` and never got uploaded when things went wrong
- Pre-flight quota check — emits an `::error::` annotation if projected usage > 47.5 GiB (95% of the 50 GiB quota). Prints the current usage % on every run so quota creep is visible

Not in this PR