Fix BibliographicData.has_changed to throttle updates when data source provides no timestamp (PP-3983) #3198

Conversation

When data_source_last_updated is None but edition.updated_at is set, avoid triggering an update unless the minimum time between updates (default 2 hours) has elapsed. Previously, a missing source timestamp always forced a re-import on every run.
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main    #3198      +/-   ##
==========================================
- Coverage   93.28%   93.28%   -0.01%
==========================================
  Files         496      496
  Lines       46005    46012       +7
  Branches     6300     6302       +2
==========================================
+ Hits        42916    42921       +5
- Misses       2002     2004       +2
  Partials     1087     1087
tdilauro
left a comment
I left a few suggestions below.
src/palace/manager/integration/license/overdrive/representation.py (outdated; resolved)
    self,
    session: Session,
    edition: Edition | None = None,
) -> bool:
We should probably update the docstring here temporarily, since it doesn't accurately reflect the current situation.
        return True
    return False
elif edition.updated_at is None:
    # if we don't have an edition updated_at but we do have a source last updated time, assume change
The condition in the elif here doesn't match the comment.
# If we don't have any information about the last update time, assume we need to update.
if edition.updated_at is None or self.data_source_last_updated is None:
if self.data_source_last_updated is None and edition.updated_at is not None:
I found it challenging to reason over this if/elif block and the following if block. Using the conditions described in the PR description, it seems like what we're trying to do is something like:
if edition.updated_at is None:
    # No edition timestamp: always re-import.
    # Covers both "edition.updated_at is None" and "both timestamps None".
    return True
elif self.data_source_last_updated is None:
    # No source timestamp: re-import only if the edition is older than the minimum interval.
    return utc_now() - edition.updated_at > self.minimum_time_between_updates
elif self.data_source_last_updated > edition.updated_at:
    # Source has a newer timestamp than the edition: re-import.
    return True
else:
    # Source timestamp is not newer than the edition: skip.
    self.log.info("Publication ... is unchanged. ...")
    return False
That is cleaner: I'll change it.
def test_has_changed_no_edition(self, db: DatabaseTransactionFixture):
    """has_changed returns True when no matching edition exists."""
    bibliographic = BibliographicData(
        data_source_name=DataSource.OVERDRIVE,
Only this first test explicitly uses an OverDrive (OD) data source. That test is not sensitive to which integration is active, as we would expect any source to have changed in this case. It seems like some of the tests below that are sensitive to data source/distributor and min time between updates should be adjusted -- perhaps via parameterization (on OD/non-OD, time since edition update) -- to effectively test the behavior in this change.
This test is actually not realistic, since Overdrive will never supply data_source_last_updated (it is because of this fact that I'm creating this PR in the first place). The presence or absence of the min time between updates is determined when the object is constructed. It so happens that Overdrive is the only data source that specifies the min time, but BibliographicData doesn't know anything about that. So as far as I can see, the new tests are not sensitive to the data source/distributor.
But you're right, we need a test verifying the three-day minimum time between updates for Overdrive. I'll add that to overdrive/test_representation.py.
Co-authored-by: Tim DiLauro <tdilauro@users.noreply.github.com>
Co-authored-by: Tim DiLauro <tdilauro@users.noreply.github.com>
…n.py Co-authored-by: Tim DiLauro <tdilauro@users.noreply.github.com>
- Add type annotation to minimum_time_between_updates field (timedelta instead of int)
- Restore missing None guard after edition lookup
- Refactor if/elif/else logic for clarity
- Expand docstring to document all return cases
@tdilauro : Thanks again for your thorough code review. I've updated the PR with all your suggestions except where noted.

FYI @tdilauro : I also updated the PR description to reflect the latest PR state.
tdilauro
left a comment
This looks pretty good to me 🚀
I've added a few more suggestions below. Most are minor things. I'm approving this now, but I'd still like to see a more integrated test that demonstrates that the longer minimum time between changes affects only OverDrive books and discriminates between books from OD and other data sources. I put some example code in a comment to illustrate what I'm trying to get at.
I disabled auto-merge, so that I could go ahead and approve the change while still giving you a chance to react without the code already being merged.
    any elapsed time triggers an update).

    Returns False (skip update) when ``data_source_last_updated`` is ``None`` and
    ``edition.updated_at`` is within the ``minimum_time_between_updates_in_seconds``
Minor - minimum_time_between_updates_in_seconds was renamed to minimum_time_between_updates.
self.log.info(
    f"Publication {self.primary_identifier_data} has no source last updated timestamp so change is "
    f"indeterminate. Since it has been updated in the last {self.minimum_time_between_updates}"
    f"ie on {edition.updated_at}, we will refrain from attempting to update it."
Minor - Not sure what's meant to happen here: the previous line doesn't end in a space and then this one starts with "ie".
    # mechanisms.
    assert internal_formats("no-such-format") == []


def test_book_info_to_bibliographic_sets_minimum_time_between_updates(
This test does get at verifying that this change is affecting OverDrive books, but it would be nice to see the behavior difference in a more integrated way in tests/manager/data_layer/test_bibliographic.py.
I was thinking something there with a shape like:
@pytest.mark.parametrize(
    "is_overdrive, edition_age, expected",
    [
        pytest.param(
            False,
            datetime.timedelta(seconds=1),
            True,
            id="default-minimum-always-reimports",
        ),
        pytest.param(
            True,
            datetime.timedelta(days=1),
            False,
            id="overdrive-within-cooldown-skips",
        ),
        pytest.param(
            True,
            datetime.timedelta(days=4),
            True,
            id="overdrive-outside-cooldown-reimports",
        ),
    ],
)
def test_has_changed_no_source_timestamp(
    self,
    db: DatabaseTransactionFixture,
    is_overdrive: bool,
    edition_age: datetime.timedelta,
    expected: bool,
):
    now = utc_now()
    edition = db.edition()
    edition.updated_at = now - edition_age
    data_source_name = (
        DataSource.OVERDRIVE if is_overdrive else edition.data_source.name
    )
    bibliographic = BibliographicData(
        data_source_name=data_source_name,
        data_source_last_updated=None,
    )
    with freeze_time(now):
        assert bibliographic.has_changed(db.session, edition=edition) is expected
I think I'm missing your point, @tdilauro: I'm probably being dense. I'll ping you on Slack and maybe we can jump on a short call.
Whether the data source is Overdrive or not has no bearing on the expected functionality of the has_changed method. The difference is driven by the presence of a minimum time between updates.
Fix BibliographicData.has_changed to throttle updates when data source provides no timestamp (PP-3983) (#3198)

## Description

Refactor `BibliographicData.has_changed` to throttle re-imports when a data source does not supply a `data_source_last_updated` timestamp.

Previously, a missing source timestamp caused every record to be treated as changed on every import run. The method now supports a configurable `minimum_time_between_updates` field (a `datetime.timedelta`, defaulting to zero) that controls how long to wait before re-importing a record when no source timestamp is available.

The Overdrive representation extractor sets this to 3 days, since Overdrive does not provide per-record update timestamps.

Other changes:
- Add type annotation to `minimum_time_between_updates`
- Restore a missing `None` guard in `has_changed` after the edition lookup
- Refactor the if/elif/else logic for clarity and add a log message when skipping due to the minimum interval
- Expand the `has_changed` docstring to document all return cases
- Add `import datetime` to `representation.py` (was missing)

## Motivation and Context

Fixes PP-3983. Without a per-record "last updated" timestamp from Overdrive, every import run was unnecessarily re-importing every Overdrive record, generating excess metadata update tasks. The 3-day cooldown on `minimum_time_between_updates` prevents this churn while still allowing records to be refreshed periodically.

## How Has This Been Tested?

Eight new unit tests were added to `TestBibliographicData` covering all `has_changed` code paths:
- No matching edition in the database → always re-import
- `edition.updated_at` is `None` (with or without a source timestamp) → always re-import
- `data_source_last_updated` is `None`, edition older than `minimum_time_between_updates` → re-import
- `data_source_last_updated` is `None`, edition within `minimum_time_between_updates` → skip
- Custom `minimum_time_between_updates` values respected
- Source timestamp newer than edition → re-import
- Source timestamp older than edition → skip

A new test in `TestOverdriveRepresentationExtractor` confirms that `book_info_to_bibliographic` sets `minimum_time_between_updates = timedelta(days=3)`.

Existing tests in `test_bibliographic.py` were updated to reflect the new change-detection semantics where `data_source_last_updated=None` no longer unconditionally forces a re-import.

All 43 tests in `test_bibliographic.py` and all 30 tests in `test_representation.py` pass.

## Checklist

- [x] I have updated the documentation accordingly.
- [x] All new and existing tests passed.

---------

Co-authored-by: Tim DiLauro <tdilauro@users.noreply.github.com> and Claude
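As a rough sketch of the configuration side described above: the field defaults to zero and the Overdrive extractor overrides it. The `BibliographicDataSketch` dataclass and constant below are hypothetical stand-ins, not the repo's real classes.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical constant mirroring the 3-day value described in the PR.
OVERDRIVE_MINIMUM_TIME_BETWEEN_UPDATES = timedelta(days=3)


@dataclass
class BibliographicDataSketch:
    # Toy model: the real field got a timedelta type annotation in this PR
    # (instead of int), defaulting to zero so non-Overdrive sources
    # always re-import when no source timestamp is available.
    data_source_name: str
    minimum_time_between_updates: timedelta = timedelta(0)


default_record = BibliographicDataSketch(data_source_name="SomeOtherSource")
overdrive_record = BibliographicDataSketch(
    data_source_name="Overdrive",
    minimum_time_between_updates=OVERDRIVE_MINIMUM_TIME_BETWEEN_UPDATES,
)

assert default_record.minimum_time_between_updates == timedelta(0)
assert overdrive_record.minimum_time_between_updates == timedelta(days=3)
```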