WIP: Content-hash-based change tracking for data imports #3199
Draft
jonathangreen wants to merge 4 commits into main from
Conversation
Fixes all broken tests, mypy errors, and incomplete source changes from the initial WIP commit (bde0829). This commit contains all Claude-authored work.

Problems fixed:
- LicensePool model was missing the `updated_at` and `created_at` columns referenced by the new circulation code, causing 49 test failures
- 31 mypy errors across json.py, bibliographic.py, circulation.py, and the integration importers
- The incomplete rename of `has_changed` → `needs_apply` left stale calls in bibliographic.py, circulation.py, and three integration importers
- `data_source_last_updated` was still referenced in bibliographic.py, two OPDS extractors, and the Boundless parser/conftest
- The alembic migration for the new DB columns was missing
- `LinkData.content` (a `bytes | str` field) caused a UnicodeDecodeError when hashing bibliographic data containing embedded binary images
- `_canonicalize` / `_canonicalize_sort_key` lacked type annotations
- ODL reimport of expired licenses was incorrectly skipped, because license expiry is time-dependent and not detectable by content hash

Changes by file:
- src/palace/manager/sqlalchemy/model/licensing.py: add `created_at` and `updated_at` columns to LicensePool
- src/palace/manager/data_layer/base/mutable.py: fix the `should_apply_to` condition (`<=` → `<`) so equal timestamps still trigger a hash check rather than an unconditional skip
- src/palace/manager/data_layer/link.py: add `@field_serializer("content", when_used="json")` to base64-encode binary bytes in the `bytes | str | None` union field
- src/palace/manager/data_layer/bibliographic.py: replace `data_source_last_updated` with `updated_at` throughout; replace `has_changed` calls with `should_apply_to` in apply() / apply_edition_only(); `_update_edition_timestamp` now also stores `updated_at_data_hash` on the edition
- src/palace/manager/data_layer/circulation.py: replace the remaining `has_changed` / `last_checked` references; set `pool.updated_at` alongside `pool.updated_at_data_hash` after apply; bypass the early-return skip when `self.licenses is not None` (ODL-style pools) so time-expired licenses are always reprocessed, and give the inner availability block the same treatment
- src/palace/manager/util/json.py: add `int` type annotations to all `float_precision` parameters
- src/palace/manager/integration/license/{opds,boundless,overdrive}/importer.py: `has_changed` → `needs_apply`
- src/palace/manager/integration/license/{opds1,odl}/extractor.py and src/palace/manager/integration/license/boundless/parser.py: `data_source_last_updated=` → `updated_at=`
- alembic/versions/20260402_57d824b34167_add_change_tracking_hash_columns.py: new migration adding `updated_at_data_hash` to editions and licensepools, and `created_at` / `updated_at` to licensepools
- tests/manager/data_layer/test_bibliographic.py: replace `data_source_last_updated` with `updated_at`; rewrite test_apply_no_changes_needed for hash-based semantics; rename test_data_source_last_updated_updates_timestamp
- tests/manager/data_layer/test_measurement.py: update test_taken_at, since taken_at now defaults to None
- tests/manager/integration/license/{opds,overdrive}/test_importer.py and tests/manager/integration/license/boundless/conftest.py: update mock/fixture references from has_changed / last_checked to needs_apply / updated_at
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

@@ Coverage Diff @@
##             main    #3199   +/- ##
=======================================
  Coverage   93.28%   93.28%
=======================================
  Files         496      496
  Lines       46005    46050     +45
  Branches     6300     6302      +2
=======================================
+ Hits        42916    42958     +42
- Misses       2002     2004      +2
- Partials     1087     1088      +1
Description
Introduces content-hash-based change tracking for `BibliographicData` and `CirculationData`. Instead of relying solely on timestamps to decide whether incoming data needs to be applied, we now compute a SHA-256 hash of the canonicalized JSON representation and store it on the DB model. An update is skipped only if the data is both not newer and has the same hash.

Changes so far
- `json_canonical` / `json_hash` utilities for deterministic JSON hashing
- `BaseMutableData` gains `updated_at`, `created_at`, `calculate_hash()`, and `should_apply_to()`
- `BibliographicData` and `CirculationData` replace `has_changed()` with `needs_apply()` / `should_apply_to()`
- The `data_source_last_updated` and `last_checked` fields are dropped in favor of the unified `updated_at` / `created_at` on the base class
- `Edition` and `LicensePool` models get `updated_at_data_hash` columns

Motivation and Context
Timestamp-only change detection is unreliable when data sources re-publish unchanged data with new timestamps, or don't give us any timestamp to work with. A content hash allows us to skip truly unchanged data even when timestamps advance.
How Has This Been Tested?
Checklist