WIP: Content-hash-based change tracking for data imports#3199

Draft
jonathangreen wants to merge 4 commits into main from
feature/change-tracking

@jonathangreen jonathangreen commented Apr 2, 2026

This is a work in progress and is currently only half-baked. Do not merge.

Description

Introduces content-hash-based change tracking for BibliographicData and CirculationData. Instead of relying solely on timestamps to decide whether incoming data needs to be applied, we now compute a SHA-256 hash of the data's canonicalized JSON representation and store it on the DB model. An update is skipped only when the incoming data is not newer than what is stored and its hash matches the stored hash.
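
For readers unfamiliar with the technique, here is a minimal sketch of deterministic JSON hashing using only the standard library. The names `json_canonical_sketch` and `json_hash_sketch` are hypothetical and are not the PR's `json_canonical` / `json_hash`; the real utilities may differ (for example in float handling or list ordering, which this sketch does not touch).

```python
import hashlib
import json
from typing import Any


def json_canonical_sketch(data: Any) -> str:
    """Serialize data to a deterministic JSON string (hypothetical helper).

    Sorting keys and fixing the separators means two dicts with the same
    content but different key order produce the same string. List order is
    preserved as-is in this sketch.
    """
    return json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=True)


def json_hash_sketch(data: Any) -> str:
    """Return the SHA-256 hex digest of the canonical JSON string."""
    canonical = json_canonical_sketch(data)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content, different key order: the hashes match.
a = {"title": "Example", "identifiers": {"isbn": "123", "oclc": "456"}}
b = {"identifiers": {"oclc": "456", "isbn": "123"}, "title": "Example"}
same = json_hash_sketch(a) == json_hash_sketch(b)

# Any change in content produces a different hash.
c = {**a, "title": "Example (2nd ed.)"}
changed = json_hash_sketch(a) != json_hash_sketch(c)
```

Storing only the hex digest keeps the DB column small while still letting an importer detect whether re-published data differs from what was last applied.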

Changes so far

  • New json_canonical / json_hash utilities for deterministic JSON hashing
  • BaseMutableData gains updated_at, created_at, calculate_hash(), and should_apply_to()
  • BibliographicData and CirculationData replace has_changed() with needs_apply() / should_apply_to()
  • Removed data_source_last_updated and last_checked fields in favor of the unified updated_at/created_at on the base class
  • Edition and LicensePool models get updated_at_data_hash columns
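
The skip rule worded in the description ("skipped only if the data is both not newer and has the same hash") could be sketched as below. This is one literal reading of that sentence, not the PR's `should_apply_to()` implementation: the function name `should_apply`, its parameters, and the handling of missing timestamps are all assumptions made for illustration.

```python
from __future__ import annotations

from datetime import datetime, timezone


def should_apply(
    incoming_updated_at: datetime | None,
    incoming_hash: str,
    stored_updated_at: datetime | None,
    stored_hash: str | None,
) -> bool:
    """Hypothetical decision helper: return True if incoming data should be applied."""
    # Assumption: data counts as "newer" only when it carries a timestamp and
    # either nothing is stored yet or the incoming timestamp is strictly later.
    is_newer = incoming_updated_at is not None and (
        stored_updated_at is None or incoming_updated_at > stored_updated_at
    )
    # A missing stored hash never matches, so first-time data is always applied.
    same_hash = stored_hash is not None and incoming_hash == stored_hash
    # Skip only when the data is both not newer and has the same hash.
    return is_newer or not same_hash


t1 = datetime(2026, 4, 1, tzinfo=timezone.utc)
t2 = datetime(2026, 4, 2, tzinfo=timezone.utc)

skip_unchanged = should_apply(t1, "abc", t1, "abc")      # equal timestamp, same hash
apply_changed = should_apply(t1, "def", t1, "abc")       # equal timestamp, new hash
apply_newer = should_apply(t2, "abc", t1, "abc")         # newer timestamp
skip_no_timestamp = should_apply(None, "abc", t1, "abc") # no timestamp, same hash
```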

Motivation and Context

Timestamp-only change detection is unreliable when data sources re-publish unchanged data with new timestamps, or don't give us any timestamp to work with. A content hash allows us to skip truly unchanged data even when timestamps advance.

How Has This Been Tested?

Checklist

  • I have updated the documentation accordingly.
  • All new and existing tests passed.

jonathangreen added the "feature" (New feature) label on Apr 2, 2026
Fixes all broken tests, mypy errors, and incomplete source changes from
the initial WIP commit (bde0829).

This commit contains all Claude authored work.

- LicensePool model was missing `updated_at` and `created_at` columns
  referenced by new circulation code, causing 49 test failures
- 31 mypy errors across json.py, bibliographic.py, circulation.py,
  and integration importers
- Incomplete rename of `has_changed` → `needs_apply` left stale calls
  in bibliographic.py, circulation.py, and three integration importers
- `data_source_last_updated` still referenced in bibliographic.py,
  two OPDS extractors, and the Boundless parser/conftest
- Missing alembic migration for all new DB columns
- `LinkData.content` (bytes | str field) caused UnicodeDecodeError when
  hashing bibliographic data containing embedded binary images
- `_canonicalize` / `_canonicalize_sort_key` lacked type annotations
- ODL reimport of expired licenses was incorrectly skipped because
  license expiry is time-dependent, not detectable by content hash
src/palace/manager/sqlalchemy/model/licensing.py
- Add `created_at` and `updated_at` columns to LicensePool
src/palace/manager/data_layer/base/mutable.py
- Fix `should_apply_to` condition: `<=` → `<` so equal timestamps
  still trigger a hash check rather than an unconditional skip
src/palace/manager/data_layer/link.py
- Add `@field_serializer("content", when_used="json")` to base64-encode
  binary bytes in the `bytes | str | None` union field
src/palace/manager/data_layer/bibliographic.py
- Replace `data_source_last_updated` with `updated_at` throughout
- Replace `has_changed` calls with `should_apply_to` in apply() /
  apply_edition_only(); `_update_edition_timestamp` now also stores
  `updated_at_data_hash` on the edition
src/palace/manager/data_layer/circulation.py
- Replace remaining `has_changed` / `last_checked` references
- Set `pool.updated_at` alongside `pool.updated_at_data_hash` after apply
- Early-return skip is bypassed when `self.licenses is not None`
  (ODL-style pools) so time-expired licenses are always reprocessed;
  inner availability block gets the same treatment
src/palace/manager/util/json.py
- Add `int` type annotations to all `float_precision` parameters
src/palace/manager/integration/license/{opds,boundless,overdrive}/importer.py
- `has_changed` → `needs_apply`
src/palace/manager/integration/license/{opds1,odl}/extractor.py
src/palace/manager/integration/license/boundless/parser.py
- `data_source_last_updated=` → `updated_at=`
alembic/versions/20260402_57d824b34167_add_change_tracking_hash_columns.py
- New migration: `updated_at_data_hash` on editions and licensepools,
  `created_at` / `updated_at` on licensepools
tests/manager/data_layer/test_bibliographic.py
- Replace `data_source_last_updated` with `updated_at`; rewrite
  test_apply_no_changes_needed for hash-based semantics; rename
  test_data_source_last_updated_updates_timestamp
tests/manager/data_layer/test_measurement.py
- Update test_taken_at: taken_at now defaults to None
tests/manager/integration/license/{opds,overdrive}/test_importer.py
tests/manager/integration/license/boundless/conftest.py
- Update mock/fixture references from has_changed / last_checked
  to needs_apply / updated_at
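
To illustrate the `LinkData.content` issue described in the commit message above: raw image bytes are generally not valid UTF-8, so they cannot be decoded to text for JSON, while base64 yields an ASCII string that can be serialized and hashed safely. The sketch below uses only the standard library; `serialize_content` is a hypothetical stand-in and is not the PR's pydantic `@field_serializer`.

```python
from __future__ import annotations

import base64


def serialize_content(content: bytes | str | None) -> str | None:
    """Hypothetical helper: make a bytes | str | None value JSON-safe."""
    if isinstance(content, bytes):
        # base64 output is pure ASCII, so decoding it cannot fail.
        return base64.b64encode(content).decode("ascii")
    # Strings and None pass through unchanged.
    return content


# The first bytes of a PNG file are not valid UTF-8.
png_header = b"\x89PNG\r\n\x1a\n"

try:
    png_header.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

encoded = serialize_content(png_header)
round_trip = base64.b64decode(encoded) if encoded is not None else None
```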

codecov bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 97.70115% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.28%. Comparing base (0c48691) to head (882ba2f).
⚠️ Report is 9 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/palace/manager/util/json.py | 92.59% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3199   +/-   ##
=======================================
  Coverage   93.28%   93.28%           
=======================================
  Files         496      496           
  Lines       46005    46050   +45     
  Branches     6300     6302    +2     
=======================================
+ Hits        42916    42958   +42     
- Misses       2002     2004    +2     
- Partials     1087     1088    +1     

