
Fix BibliographicData.has_changed to throttle updates when data source provides no timestamp #3198

Merged
dbernstein merged 8 commits into main from
bugfix/PP-3983-esnure-no-change-recorded-if-no-data-source-update-time
Apr 3, 2026

Conversation

@dbernstein
Contributor

@dbernstein dbernstein commented Apr 2, 2026

Description

Refactor BibliographicData.has_changed to throttle re-imports when a data source
does not supply a data_source_last_updated timestamp.
Previously, a missing source timestamp caused every record to be treated as changed
on every import run. The method now supports a configurable minimum_time_between_updates
field (a datetime.timedelta, defaulting to zero) that controls how long to wait before
re-importing a record when no source timestamp is available.
The Overdrive representation extractor sets this to 3 days, since Overdrive does not
provide per-record update timestamps.
Other changes:

  • Add type annotation to minimum_time_between_updates
  • Restore a missing None guard in has_changed after the edition lookup
  • Refactor the if/elif/else logic for clarity and add a log message when skipping due to
    the minimum interval
  • Expand the has_changed docstring to document all return cases
  • Add import datetime to representation.py (was missing)

Motivation and Context

Fixes PP-3983. Without a per-record "last updated" timestamp from Overdrive, every import
run was unnecessarily re-importing every Overdrive record, generating excess metadata
update tasks. The 3-day cooldown on minimum_time_between_updates prevents this churn
while still allowing records to be refreshed periodically.

How Has This Been Tested?

Eight new unit tests were added to TestBibliographicData covering all has_changed
code paths:

  • No matching edition in the database → always re-import
  • edition.updated_at is None (with or without a source timestamp) → always re-import
  • data_source_last_updated is None, edition older than minimum_time_between_updates → re-import
  • data_source_last_updated is None, edition within minimum_time_between_updates → skip
  • Custom minimum_time_between_updates values respected
  • Source timestamp newer than edition → re-import
  • Source timestamp older than edition → skip
A new test in TestOverdriveRepresentationExtractor confirms that book_info_to_bibliographic
sets minimum_time_between_updates = timedelta(days=3). Existing tests in test_bibliographic.py
were updated to reflect the new change-detection semantics, where data_source_last_updated=None
no longer unconditionally forces a re-import. All 43 tests in test_bibliographic.py and all
30 tests in test_representation.py pass.

Checklist

  • I have updated the documentation accordingly.
  • All new and existing tests passed.

Fix BibliographicData.has_changed to throttle updates when data source provides no timestamp (PP-3983)

When data_source_last_updated is None but edition.updated_at is set,
avoid triggering an update unless the minimum time between updates
(default 2 hours) has elapsed. Previously, a missing source timestamp
always forced a re-import on every run.
@dbernstein dbernstein marked this pull request as ready for review April 2, 2026 19:54
@dbernstein dbernstein requested a review from jonathangreen April 2, 2026 20:00
@codecov

codecov bot commented Apr 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.28%. Comparing base (0c48691) to head (cc7adea).
⚠️ Report is 11 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3198      +/-   ##
==========================================
- Coverage   93.28%   93.28%   -0.01%     
==========================================
  Files         496      496              
  Lines       46005    46012       +7     
  Branches     6300     6302       +2     
==========================================
+ Hits        42916    42921       +5     
- Misses       2002     2004       +2     
  Partials     1087     1087              


@dbernstein dbernstein requested a review from a team April 2, 2026 20:57
Contributor

@tdilauro tdilauro left a comment


I left a few suggestions below.

self,
session: Session,
edition: Edition | None = None,
) -> bool:
Contributor


We should probably update the docstring here temporarily, since it doesn't accurately reflect the current situation.

return True
return False
elif edition.updated_at is None:
# if we don't have an edition updated_at but we do have a source last updated time, assume change
Contributor


The condition in the elif here doesn't match the comment.


# If we don't have any information about the last update time, assume we need to update.
if edition.updated_at is None or self.data_source_last_updated is None:
if self.data_source_last_updated is None and edition.updated_at is not None:
Contributor


I found it challenging to reason over this if/elif block and the following if block. Using the conditions described on the PR description, it seems like what we're trying to do is something like:

if edition.updated_at is None:
    # No edition timestamp: always re-import.
    # Covers both "edition.updated_at is None" and "both timestamps None".
    return True

elif self.data_source_last_updated is None:
    # No source timestamp: re-import only if the edition is older than the minimum interval.
    return utc_now() - edition.updated_at > self.minimum_time_between_updates

elif self.data_source_last_updated > edition.updated_at:
    # Source has a newer timestamp than the edition: re-import.
    return True

else:
    # Source timestamp is not newer than the edition: skip.
    self.log.info("Publication ... is unchanged. ...")
    return False

Contributor Author


that is cleaner: I'll change.

def test_has_changed_no_edition(self, db: DatabaseTransactionFixture):
"""has_changed returns True when no matching edition exists."""
bibliographic = BibliographicData(
data_source_name=DataSource.OVERDRIVE,
Contributor


Only this first test explicitly uses an Overdrive (OD) data source. That test is not sensitive to which integration is active, since we would expect any source to have changed in this case. It seems like some of the tests below that are sensitive to data source / distributor and min time between updates should be adjusted -- perhaps via parameterization (on OD/non-OD, time since edition update) -- to effectively test the behavior in this change.

Contributor Author

@dbernstein dbernstein Apr 3, 2026


This test is actually not realistic, since Overdrive will never supply data_source_last_updated (it is because of this fact that I'm creating this PR in the first place). The presence or absence of the min time between updates is determined when the object is constructed. It so happens that Overdrive is the only data source that specifies the min time, but BibliographicData doesn't know anything about that. So as far as I can see, the new tests are not sensitive to the data source/distributor.

Contributor Author

@dbernstein dbernstein Apr 3, 2026


But you're right we need a test verifying the three day minimum time between updates for Overdrive. I'll add that to overdrive/test_representation.py

dbernstein and others added 4 commits April 2, 2026 15:21
Co-authored-by: Tim DiLauro <tdilauro@users.noreply.github.com>
Co-authored-by: Tim DiLauro <tdilauro@users.noreply.github.com>
…n.py

Co-authored-by: Tim DiLauro <tdilauro@users.noreply.github.com>
- Add type annotation to minimum_time_between_updates field (timedelta instead of int)
- Restore missing None guard after edition lookup
- Refactor if/elif/else logic for clarity
- Expand docstring to document all return cases
@dbernstein
Contributor Author

@tdilauro : Thanks again for your thorough code review. I've updated the PR with all your suggestions except where noted.

@dbernstein
Contributor Author

FYI @tdilauro : I also updated the PR description to reflect the latest PR state.

@dbernstein dbernstein enabled auto-merge (squash) April 3, 2026 04:36
@tdilauro tdilauro disabled auto-merge April 3, 2026 13:50
Contributor

@tdilauro tdilauro left a comment


This looks pretty good to me 🚀

I've added a few more suggestions below. Most are minor things. I'm approving this now, but I'd still like to see a more integrated test that demonstrates that the longer minimum time between changes affects only OverDrive books and discriminates between books from OD and other data sources. I put some example code in a comment to illustrate what I'm trying to get at.

I disabled auto-merge, so that I could go ahead and approve the change while still giving you a chance to react without the code already being merged.

any elapsed time triggers an update).

Returns False (skip update) when ``data_source_last_updated`` is ``None`` and
``edition.updated_at`` is within the ``minimum_time_between_updates_in_seconds``
Contributor


Minor - minimum_time_between_updates_in_seconds was renamed to minimum_time_between_updates.

self.log.info(
f"Publication {self.primary_identifier_data} has no source last updated timestamp so change is "
f"indeterminate. Since it has been updated in the last {self.minimum_time_between_updates}"
f"ie on {edition.updated_at}, we will refrain from attempting to update it."
Contributor


Minor - Not sure what's meant to happen here: the previous line doesn't end in a space and then this one starts with "ie".
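One possible cleanup of the message quoted above, adding the missing space and spelling out "i.e."; the wording is a suggestion (extracted into a hypothetical standalone helper), not the merged code:

```python
import datetime


def skip_message(
    identifier: str,
    minimum_time_between_updates: datetime.timedelta,
    updated_at: datetime.datetime,
) -> str:
    # Suggested rewording with the space restored between the f-string fragments.
    return (
        f"Publication {identifier} has no source last updated timestamp, so change is "
        f"indeterminate. Since it was updated within the last {minimum_time_between_updates} "
        f"(i.e. on {updated_at}), we will refrain from attempting to update it."
    )
```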

# mechanisms.
assert internal_formats("no-such-format") == []

def test_book_info_to_bibliographic_sets_minimum_time_between_updates(
Contributor


This test does get at verifying that this change is affecting OverDrive books, but it would be nice to see the behavior difference in a more integrated way in tests/manager/data_layer/test_bibliographic.py.

I was thinking something there with a shape like:

@pytest.mark.parametrize(
    "is_overdrive, edition_age, expected",
    [
        pytest.param(
            False,
            datetime.timedelta(seconds=1),
            True,
            id="default-minimum-always-reimports",
        ),
        pytest.param(
            True,
            datetime.timedelta(days=1),
            False,
            id="overdrive-within-cooldown-skips",
        ),
        pytest.param(
            True,
            datetime.timedelta(days=4),
            True,
            id="overdrive-outside-cooldown-reimports",
        ),
    ],
)
def test_has_changed_no_source_timestamp(
    self,
    db: DatabaseTransactionFixture,
    is_overdrive: bool,
    edition_age: datetime.timedelta,
    expected: bool,
):
    now = utc_now()
    edition = db.edition()
    edition.updated_at = now - edition_age

    data_source_name = DataSource.OVERDRIVE if is_overdrive else edition.data_source.name

    bibliographic = BibliographicData(
        data_source_name=data_source_name,
        data_source_last_updated=None,
    )
    with freeze_time(now):
        assert bibliographic.has_changed(db.session, edition=edition) is expected

Contributor Author


I think I'm missing your point, @tdilauro; I'm probably being dense. I'll ping you on Slack and maybe we can jump on a short call.

Contributor Author


Whether the DS is overdrive or not has no bearing on the expected functionality of the has_changed method. The difference is driven by the presence of a min time btw updates.

@dbernstein dbernstein merged commit 7dd9ec5 into main Apr 3, 2026
22 of 23 checks passed
@dbernstein dbernstein deleted the bugfix/PP-3983-esnure-no-change-recorded-if-no-data-source-update-time branch April 3, 2026 16:43
dbernstein added a commit that referenced this pull request Apr 3, 2026
#3198)

Co-authored-by: Tim DiLauro <tdilauro@users.noreply.github.com> and Claude
