Publish a canonical release_manifest.json for each policyengine-uk-data release #322

@MaxGhenis

Problem

policyengine-uk-data should expose the same release-contract guarantees as the US data repo, but the UK path has an additional constraint: the runtime artifacts live behind authenticated/private storage because of licensing.

Today the repo has package-version tags and upload machinery, but there is still no single canonical release manifest that downstream consumers can use to answer:

  • which datasets and auxiliary artifacts belong to a given policyengine-uk-data version
  • which dataset is the default for that release
  • which exact authenticated Hugging Face (HF) locator should be used
  • what checksum to verify after download
  • what provenance is required to rebuild the artifact deterministically from pinned inputs

That forces downstream repos to infer too much from filename conventions or private operational knowledge. It also makes the UK release path diverge from the US one at exactly the point where we want a common cross-country contract.

Relevant current pieces:

  • policyengine_uk_data/utils/data_upload.py already uploads files and tags the corresponding HF revision with the package version
  • policyengine_uk_data/utils/huggingface.py already supports version-aware downloads
  • the repo has explicit guidance that UK data remains private/authenticated, so the release contract needs to work without assuming public artifacts

Desired contract

Each policyengine-uk-data release should publish one canonical release_manifest.json at the authenticated HF revision/tag matching the package version.

For example, if the package version is X.Y.Z, downstream code should be able to resolve:

  • repo: the canonical UK data repo for that release
  • revision/tag: X.Y.Z
  • manifest: release_manifest.json@X.Y.Z
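
A minimal sketch of that lookup, assuming the manifest is fetched with `huggingface_hub.hf_hub_download` and an access token. The repo id `example-org/uk-data-private`, the `ManifestLocator` type, and the helper names are hypothetical and only illustrate the shape of the contract; the real repo id and repo type are whatever the release process settles on.

```python
from dataclasses import dataclass
from typing import Optional

MANIFEST_FILENAME = "release_manifest.json"


@dataclass(frozen=True)
class ManifestLocator:
    """Where to find the manifest for one release (hypothetical helper type)."""

    repo_id: str
    revision: str
    path: str = MANIFEST_FILENAME
    repo_type: str = "model"  # assumption: adjust to the real HF repo type


def manifest_locator(repo_id: str, package_version: str) -> ManifestLocator:
    """Pin the manifest lookup to the tag that equals the package version."""
    # Mutable refs are rejected so the lookup never depends on a default branch.
    if not package_version or package_version in {"main", "HEAD"}:
        raise ValueError("revision must be a package-version tag, not a branch")
    return ManifestLocator(repo_id=repo_id, revision=package_version)


def fetch_manifest(locator: ManifestLocator, token: Optional[str] = None) -> dict:
    """Download and parse the manifest from authenticated HF storage."""
    import json

    # Third-party dependency, imported lazily so the locator logic stays stdlib-only.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(
        repo_id=locator.repo_id,
        filename=locator.path,
        revision=locator.revision,
        repo_type=locator.repo_type,
        token=token,
    )
    with open(local_path, encoding="utf-8") as f:
        return json.load(f)
```

A consumer would call `fetch_manifest(manifest_locator(repo_id, installed_version), token=...)`, where `installed_version` is the version of the installed policyengine-uk-data package.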

The manifest can remain private if the underlying artifact paths must remain private. The key is that it is stable, machine-readable, and version-aligned.

That manifest should define:

  • policyengine-uk-data version
  • compatible policyengine-uk version or compatibility range
  • default datasets by logical name
  • all published runtime artifacts for that release, including auxiliary regional/weight artifacts
  • exact authenticated locators (repo_id, path, revision)
  • SHA256 and size for each artifact
  • build provenance needed for deterministic rebuilds
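
As an illustration only, the fields above could be laid out as follows. Every key name, the repo id, the file names, and all values are hypothetical placeholders (the version is `0.0.0` and the checksums are all zeros on purpose); the actual schema is what this issue asks to be defined.

```python
import json

# Hypothetical manifest layout; placeholder values, not real release data.
EXAMPLE_MANIFEST = {
    "schema_version": 1,
    "data_package": {"name": "policyengine-uk-data", "version": "0.0.0"},
    "compatible_model_packages": [
        {"name": "policyengine-uk", "specifier": ">=0.0.0,<1.0.0"}
    ],
    # Logical dataset name -> key in "artifacts".
    "default_datasets": {"national": "frs_national"},
    "artifacts": {
        "frs_national": {
            "kind": "dataset",
            "repo_id": "example-org/uk-data-private",
            "path": "frs_national.h5",
            "revision": "0.0.0",
            "sha256": "0" * 64,
            "size_bytes": 0,
        },
        "regional_weights": {
            "kind": "auxiliary",
            "repo_id": "example-org/uk-data-private",
            "path": "regional_weights.h5",
            "revision": "0.0.0",
            "sha256": "0" * 64,
            "size_bytes": 0,
        },
    },
    # Provenance for deterministic rebuilds from pinned inputs.
    "build": {
        "git_commit": "<commit of policyengine-uk-data used for the build>",
        "inputs": [{"name": "<pinned input file>", "sha256": "0" * 64}],
    },
}


def default_artifact(manifest: dict, logical_name: str) -> dict:
    """Resolve a logical dataset name to its artifact entry."""
    artifact_key = manifest["default_datasets"][logical_name]
    return manifest["artifacts"][artifact_key]


# The manifest must survive a JSON round trip unchanged.
serialized = json.dumps(EXAMPLE_MANIFEST, indent=2, sort_keys=True)
```

Keeping defaults as logical names that point into `artifacts` means a consumer asks for `"national"` and never needs to know the underlying filename.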

What should change

  1. Publish one canonical release manifest per UK data package version.
  2. Treat the HF tag matching the package version as the official lookup boundary.
  3. Keep private/authenticated storage as needed, but make the release contract machine-readable and stable.
  4. Include regional and auxiliary artifacts in the same manifest, not just the top-level FRS H5.
  5. Encode compatibility with policyengine-uk explicitly.
  6. Include deterministic rebuild provenance, not just download metadata.
  7. Make checksum verification a first-class supported path for downstream consumers.
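
Both sides of the checksum path can be sketched with the standard library alone: the publisher records SHA256 and size when building an artifact entry, and the consumer checks the same two values after download. The function names and the entry keys (`repo_id`, `path`, `revision`, `sha256`, `size_bytes`) are assumptions for illustration, not an existing API in this repo.

```python
import hashlib
import os


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large H5 files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_artifact_entry(
    local_path: str, repo_id: str, path_in_repo: str, revision: str
) -> dict:
    """Publisher side: describe one uploaded file for the manifest."""
    return {
        "repo_id": repo_id,
        "path": path_in_repo,
        "revision": revision,
        "sha256": sha256_of_file(local_path),
        "size_bytes": os.path.getsize(local_path),
    }


def verify_artifact(local_path: str, entry: dict) -> None:
    """Consumer side: raise ValueError if the download does not match."""
    # Size is checked first because it is cheap and catches truncated downloads.
    actual_size = os.path.getsize(local_path)
    if actual_size != entry["size_bytes"]:
        raise ValueError(
            f"size mismatch for {entry['path']}: "
            f"expected {entry['size_bytes']}, got {actual_size}"
        )
    actual_sha256 = sha256_of_file(local_path)
    if actual_sha256 != entry["sha256"].lower():
        raise ValueError(f"sha256 mismatch for {entry['path']}")
```

Raising on mismatch, rather than returning a boolean, makes it harder for a consumer to skip the check by accident.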

Acceptance criteria

  • Every policyengine-uk-data release publishes a machine-readable release_manifest.json at the HF revision/tag matching the package version.
  • The release manifest works for private/authenticated artifacts and does not depend on downstream consumers guessing filenames.
  • The manifest lists all runtime-relevant artifacts for that release, including national, regional, and auxiliary artifacts required for reproducibility.
  • The manifest defines the default dataset by logical name.
  • Each artifact entry contains an exact locator, checksum, and size.
  • The manifest includes enough provenance to support deterministic rebuild verification from pinned inputs.
  • The manifest includes explicit compatibility metadata for policyengine-uk.
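
Several of these criteria are mechanically checkable, for example in CI before a release is tagged. The sketch below assumes a hypothetical manifest layout with top-level keys `data_package`, `compatible_model_packages`, `default_datasets`, `artifacts`, and `build`, and artifact entries carrying `repo_id`, `path`, `revision`, `sha256`, and `size_bytes`. It cannot check that the artifact list is complete or that the provenance is sufficient for a rebuild; those still need a real rebuild test.

```python
import re

# Assumed per-artifact keys; the real schema may name these differently.
REQUIRED_ARTIFACT_FIELDS = ("repo_id", "path", "revision", "sha256", "size_bytes")
SHA256_PATTERN = re.compile(r"[0-9a-f]{64}")


def manifest_problems(manifest: dict, package_version: str) -> list:
    """Return human-readable violations; an empty list means the checks pass."""
    problems = []
    if manifest.get("data_package", {}).get("version") != package_version:
        problems.append("data_package.version does not match the package version")
    if not manifest.get("compatible_model_packages"):
        problems.append("missing policyengine-uk compatibility metadata")
    if not manifest.get("build"):
        problems.append("missing build provenance")

    artifacts = manifest.get("artifacts", {})
    defaults = manifest.get("default_datasets", {})
    if not defaults:
        problems.append("no default dataset defined")
    for logical_name, artifact_key in defaults.items():
        if artifact_key not in artifacts:
            problems.append(
                f"default '{logical_name}' points at unknown artifact '{artifact_key}'"
            )

    for key, entry in artifacts.items():
        for field in REQUIRED_ARTIFACT_FIELDS:
            if field not in entry:
                problems.append(f"{key}: missing {field}")
        # Every locator must be pinned to the release tag, never a branch.
        if "revision" in entry and entry["revision"] != package_version:
            problems.append(f"{key}: revision is not pinned to {package_version}")
        if "sha256" in entry and not SHA256_PATTERN.fullmatch(str(entry["sha256"])):
            problems.append(f"{key}: sha256 is not a 64-character lowercase hex digest")
    return problems
```

Returning the full list of problems, rather than failing on the first one, gives a release author everything to fix in a single pass.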

Downstream consumers

This is the UK counterpart to the data-release contract needed by:

  • policyengine.py
  • policyengine-api
  • policyengine-api-v2
  • policyengine-api-v2-alpha
  • policyengine-household-api
  • policyengine-app-v2

Non-goals

  • making UK licensed data public
  • requiring the wheel to bundle large datasets
  • relying on mutable root filenames or repo-default branches as the release contract
