feat: add isle of wight council scraper #2076

Open
InertiaUK wants to merge 2 commits into robbrad:master from InertiaUK:feat/isle-of-wight-scraper

@InertiaUK InertiaUK commented May 13, 2026

Summary

  • New scraper for Isle of Wight Council's Blazor Server waste lookup at digitalservices.iow.gov.uk/wasteday
  • The site uses blazorpack (a binary SignalR protocol), which crashes both standard Chrome and headless Chrome when driven through Selenium
  • Uses Playwright to navigate the Blazor form, as it was the only browser automation tool that handled this protocol reliably in testing
  • Scraper enters postcode, selects address from dropdown, reads collection day, then downloads the PDF calendar via expect_download
  • Parses the PDF with pdfplumber — colour-coded cell backgrounds distinguish recycling (green) from non-recyclable (grey) collection weeks
  • Returns ~48 dated entries covering a full year with correct bin types
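
As a rough illustration of the colour-based classification described above, the following sketch maps a cell background colour to a bin type using a per-channel tolerance. The reference colour values, the tolerance, and the names `close_to` and `classify_cell` are assumptions made for this example; the actual constants in the scraper are not shown in this description.

```python
from typing import Optional, Sequence, Tuple

RGB = Tuple[float, float, float]

# Hypothetical reference colours (0..1 per channel). The real calendar's
# values are not given in the PR description.
RECYCLING_GREEN: RGB = (0.60, 0.80, 0.40)
NON_RECYCLABLE_GREY: RGB = (0.75, 0.75, 0.75)
TOLERANCE = 0.08  # assumed per-channel tolerance


def close_to(colour: Sequence[float], reference: RGB, tol: float = TOLERANCE) -> bool:
    """Return True when every channel is within `tol` of the reference."""
    return len(colour) == len(reference) and all(
        abs(c - r) <= tol for c, r in zip(colour, reference)
    )


def classify_cell(colour: Sequence[float]) -> Optional[str]:
    """Map a cell background colour to a bin type, or None if unrecognised."""
    if close_to(colour, RECYCLING_GREEN):
        return "Recycling"
    if close_to(colour, NON_RECYCLABLE_GREY):
        return "Non-recyclable waste"
    return None
```

Returning `None` for an unrecognised colour leaves the caller free to decide whether to skip the cell or raise an error.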

Dual-mode design

  • Checks UKBCD_USE_PLAYWRIGHT env var to choose between Playwright and Selenium branches
  • Selenium branch included for backward compatibility but will crash on Blazor (as documented)
  • The Playwright path requires `pip install playwright && playwright install chromium`, plus `xvfb-run` to provide a display
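
A minimal sketch of how the environment-variable switch might be read. Only the variable name `UKBCD_USE_PLAYWRIGHT` comes from this PR; the set of accepted truthy values and the helper names `use_playwright` and `choose_backend` are assumptions for illustration.

```python
import os
from typing import Mapping, Optional

# Assumed set of values treated as "enabled"; the PR does not specify this.
TRUTHY = {"1", "true", "yes", "on"}


def use_playwright(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when UKBCD_USE_PLAYWRIGHT is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get("UKBCD_USE_PLAYWRIGHT", "").strip().lower() in TRUTHY


def choose_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the automation branch name based on the environment."""
    return "playwright" if use_playwright(env) else "selenium"
```

Passing the environment in as a mapping keeps the decision easy to test without mutating `os.environ`.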

Testing

  • Test address: postcode PO31 7LE, house number (PAON) 101
  • Tested end-to-end via production API — 47 collection entries returned (alternating Recycling / Non-recyclable waste)
  • PDF colour parsing tested against both 2024-25 (CMYK) and 2025-26 (RGB) calendar formats
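
Handling both CMYK and RGB calendars suggests normalising colours to one space before comparing them. The sketch below assumes fill colours arrive as a single number or a tuple of 1 (greyscale), 3 (RGB), or 4 (CMYK) components in the range 0..1, and uses the naive CMYK formula without an ICC profile. The function name `to_rgb` is hypothetical.

```python
from typing import Optional, Sequence, Tuple, Union

RGB = Tuple[float, float, float]


def to_rgb(colour: Union[None, float, Sequence[float]]) -> Optional[RGB]:
    """Normalise a greyscale, RGB, or CMYK colour to an RGB tuple (0..1)."""
    if colour is None:
        return None
    if isinstance(colour, (int, float)):
        colour = (colour,)
    comps = tuple(float(c) for c in colour)
    if len(comps) == 1:  # greyscale: repeat the single channel
        return (comps[0], comps[0], comps[0])
    if len(comps) == 3:  # already RGB
        return comps
    if len(comps) == 4:  # CMYK, naive conversion (no colour profile)
        c, m, y, k = comps
        return ((1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k))
    return None  # unknown colour space
```

With colours normalised this way, a single set of RGB reference values and one tolerance can serve both calendar formats, although converted CMYK values may still need a wider tolerance.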

Summary by CodeRabbit

New Features

  • Added support for Isle of Wight Council bin collection schedules
  • Implemented PDF calendar parsing to automatically extract future collection dates
  • Automatic address matching and selection from postcode search results
  • Local caching of collection calendars to reduce repeated downloads

New scraper for Isle of Wight Council's Blazor Server waste lookup.
Uses Playwright to navigate the form (Selenium and headless Chrome
crash on the blazorpack SignalR protocol). Downloads the PDF collection
calendar and parses colour-coded cell backgrounds with pdfplumber to
distinguish recycling from non-recyclable waste weeks.

Dual-mode: checks UKBCD_USE_PLAYWRIGHT env var for Playwright branch,
falls back to Selenium branch for environments without Playwright.

coderabbitai Bot commented May 13, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3549d3a2-d252-4e61-bd8a-6dbf9402c6af

📥 Commits

Reviewing files that changed from the base of the PR and between bd36334 and d860873.

📒 Files selected for processing (1)
  • uk_bin_collection/uk_bin_collection/councils/IsleOfWightCouncil.py
📝 Walkthrough

A new Isle of Wight Council bin-collection scraper is added with PDF calendar parsing, color-based bin type classification, dual browser automation paths (Playwright and Selenium), cached PDF downloads, and test configuration. Both paths navigate to the wasteday site, fill postcode, select address, extract collection day, download and parse the PDF calendar, and return formatted bin collection dates.
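
The "extract collection day" step implies turning a day name into a weekday index before matching calendar cells. A minimal sketch, assuming the page yields an English day name; the names `DAY_MAP` and `weekday_index` are hypothetical, and raising `ValueError` on an unknown name is a design choice rather than the confirmed behaviour of the scraper.

```python
# Weekday indices follow datetime.date.weekday(): Monday is 0, Sunday is 6.
DAY_MAP = {
    "MONDAY": 0,
    "TUESDAY": 1,
    "WEDNESDAY": 2,
    "THURSDAY": 3,
    "FRIDAY": 4,
    "SATURDAY": 5,
    "SUNDAY": 6,
}


def weekday_index(collection_day: str) -> int:
    """Convert a day name to a weekday index, failing loudly if unknown."""
    key = collection_day.strip().upper()
    try:
        return DAY_MAP[key]
    except KeyError:
        raise ValueError(f"Unrecognized collection_day: {collection_day!r}") from None
```

Failing loudly here makes a change in the site's wording show up as an error instead of an empty schedule.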

Changes

Isle of Wight Council scraper

| Layer / File(s) | Summary |
|---|---|
| PDF calendar parser and color-based bin classification: uk_bin_collection/uk_bin_collection/councils/IsleOfWightCouncil.py (lines 1–188) | Defines color constants and tolerance thresholds, detects background colors within PDF cell rectangles, classifies cells by color into bin types, locates month/day table alignment, extracts datetimes from matched cells, and returns sorted (datetime, bin_type) results. |
| Cached PDF download and address selection helpers: uk_bin_collection/uk_bin_collection/councils/IsleOfWightCouncil.py (lines 191–267) | Implements cached PDF download with filesystem TTL to avoid re-fetching, validates PDF content, selects best address label from dropdown results preferring PAON matches, and extracts "Collection Day" and calendar PDF link by parsing page HTML/DOM. Formats parsed bin dates into UKBCD output structure filtering out past dates. |
| Playwright-based scraping implementation: uk_bin_collection/uk_bin_collection/councils/IsleOfWightCouncil.py (lines 276–377) | Launches Chromium with explicit non-headless settings, fills postcode field, selects PAON-matched address, extracts collection day from DOM, uses page.expect_download() to capture PDF, parses calendar via PDF parser, formats UKBCD response, and ensures browser shutdown in finally block. |
| Selenium-based scraping implementation: uk_bin_collection/uk_bin_collection/councils/IsleOfWightCouncil.py (lines 378–454) | Creates webdriver, waits for postcode/address UI, selects address by visible text, extracts collection day and PDF URL from page source, downloads PDF via cached helper using dynamic user-agent and cookies, parses calendar, formats response, and unconditionally quits webdriver in finally block. |
| Test configuration: uk_bin_collection/tests/input.json (lines 1295–1303) | Adds IsleOfWightCouncil test entry with house number, postcode, skip_get_url flag, council URL, wiki metadata, and LAD24CD code. |
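
The address-selection helper summarised above prefers a dropdown label that matches the PAON (house number or name). One way this could look is sketched below; the function name `pick_address`, the whole-token matching rule, and the fallback to the first label are assumptions, and the street names in the usage comment are placeholders.

```python
import re
from typing import Optional, Sequence


def pick_address(labels: Sequence[str], paon: str) -> Optional[str]:
    """Return the first label containing `paon` as a whole token.

    Falls back to the first label when nothing matches (an assumed policy),
    and returns None when the dropdown is empty.
    """
    wanted = paon.strip().lower()
    if wanted:
        # Lookarounds stop "10" from matching inside "101" or "210".
        pattern = re.compile(rf"(?<![0-9a-z]){re.escape(wanted)}(?![0-9a-z])")
        for label in labels:
            if pattern.search(label.lower()):
                return label
    return labels[0] if labels else None


# Example with placeholder addresses:
# pick_address(["1 Example Road", "101 Example Road"], "101")
```

Matching on whole tokens avoids selecting number 101 when the user asked for number 10 in the same postcode.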

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Browser as Chromium/Webdriver
  participant WastePage as Wasteday Page
  participant Cache as PDF Cache
  participant Parser as PDF Parser
  User->>Browser: Initialize browser (Playwright or Selenium)
  Browser->>WastePage: Navigate to wasteday URL
  Browser->>WastePage: Fill postcode and wait for address dropdown
  Browser->>WastePage: Select best matching PAON address
  WastePage->>Browser: Return collection day and PDF link
  Browser->>Cache: Check PDF cache by URL
  alt Cache miss or expired TTL
    Browser->>WastePage: Download PDF calendar
    WastePage->>Cache: Store with TTL
  end
  Cache->>Parser: Read cached PDF file
  Parser->>Parser: Detect colors and classify bin types by cell position
  Parser->>Parser: Extract future collection dates
  Parser-->>Browser: Return list of (date, bin_type) tuples
  Browser->>Browser: Format into UKBCD bins response object
  Browser-->>User: Return JSON with collection schedule
  Browser->>Browser: Clean up (close browser or quit webdriver)
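
The final formatting step in the diagram can be sketched as follows. This assumes the output is a dict with a `bins` list whose items carry `type` and `collectionDate` keys, with dates formatted as `dd/mm/YYYY`; the helper name `format_bins` is hypothetical, and `today` is passed in explicitly so the filter is easy to test.

```python
from datetime import date, datetime
from typing import Dict, Iterable, List, Tuple

DATE_FORMAT = "%d/%m/%Y"  # assumed output date format


def format_bins(
    entries: Iterable[Tuple[datetime, str]], today: date
) -> Dict[str, List[Dict[str, str]]]:
    """Drop past dates, sort chronologically, and build the response dict."""
    upcoming = sorted(
        (when, bin_type) for when, bin_type in entries if when.date() >= today
    )
    return {
        "bins": [
            {"type": bin_type, "collectionDate": when.strftime(DATE_FORMAT)}
            for when, bin_type in upcoming
        ]
    }
```

A collection falling on `today` is kept, since a bin due today is still relevant to the caller.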

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • dp247

Poem

🐰 A new council joins the warren, Isle of Wight so bright,
With PDF calendars parsed in color and in light,
Two paths to automate the scraping task with care,
Chromium and Selenium—a duet beyond compare!
Bins shall be collected, dates tracked and clean,
The finest waste-day data ever seen. 🎨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding a new Isle of Wight Council scraper module. It is concise, clear, and directly related to the primary purpose of the pull request. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

codecov Bot commented May 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.67%. Comparing base (8ecf878) to head (d860873).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2076   +/-   ##
=======================================
  Coverage   86.67%   86.67%           
=======================================
  Files           9        9           
  Lines        1141     1141           
=======================================
  Hits          989      989           
  Misses        152      152           


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/IsleOfWightCouncil.py`:
- Around line 352-356: The cache key currently only uses PDF_CACHE_DIR,
user_postcode and collection_day so different addresses in the same postcode can
collide; update the cache_key generation (the variable named cache_key used to
build pdf_path) to also incorporate the selected address identifier (e.g.,
selected_address, selected_address_label or the final download URL if you have a
variable like download_url) so the MD5 input uniquely represents
postcode+collection_day+address (or download URL), then rebuild pdf_path using
the new cache_key.
- Around line 196-214: The current logic in the download block (checking
os.path.exists(pdf_path) and writing directly to pdf_path after requests.get)
can yield a race where another process reads a half-written PDF; fix by writing
the downloaded resp.content to a temporary file and atomically replacing the
cache entry with os.replace (or tempfile.NamedTemporaryFile + os.replace) into
pdf_path, ensuring you clean up the temp on error; update the code around
pdf_path, requests.get, resp and PDF_CACHE_MAX_AGE to perform the atomic write
and handle exceptions so concurrent callers never see a partial file.
- Around line 176-183: The current branch that falls back to bin_type =
"Collection" for any color not in RECYCLING_COLORS, NON_RECYCLABLE_COLORS or
HEADER_BG_COLORS should be replaced with an explicit failure: when
_color_in_list(bg, ...) matches none, raise a clear parsing exception (e.g.,
ValueError or a custom ParseError) including the unexpected bg value and context
(page/cell info from _get_bg_color or the enclosing method) so callers can
detect PDF format changes; update the code paths in the same function/method in
IsleOfWightCouncil (the block using _color_in_list, RECYCLING_COLORS,
NON_RECYCLABLE_COLORS, HEADER_BG_COLORS) to remove the silent "Collection"
default and raise that exception instead.
- Around line 86-99: The parser currently returns [] on two error conditions
(empty month_headers and unknown collection_day); instead raise a ValueError
with a clear message in both places so failures are loud and debuggable: when
month_headers is falsy (check month_headers) raise ValueError("No month headers
parsed from PDF") and when target_weekday is None after mapping collection_day
via day_map (use collection_day.upper() and day_map.get(...)) raise
ValueError(f"Unrecognized collection_day: {collection_day}"); update the code
around month_headers, first_month, day_map, target_weekday to perform these
raises instead of returning [].
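
The two cache-related findings above (key collisions between addresses, and readers seeing a half-written PDF) could be addressed along these lines. This is a sketch under assumptions: the helper names `cache_path`, `is_fresh`, and `write_atomically` are hypothetical, and the download itself is left out so the example stays self-contained.

```python
import hashlib
import os
import tempfile
import time


def cache_path(cache_dir: str, postcode: str, collection_day: str, address: str) -> str:
    """Build a cache file path keyed on postcode, collection day and address."""
    raw = "|".join(p.strip().upper() for p in (postcode, collection_day, address))
    digest = hashlib.md5(raw.encode("utf-8")).hexdigest()  # cache key, not security
    return os.path.join(cache_dir, f"{digest}.pdf")


def is_fresh(path: str, max_age_seconds: float) -> bool:
    """Return True when the cached file exists and is younger than the TTL."""
    try:
        return time.time() - os.path.getmtime(path) < max_age_seconds
    except OSError:  # missing or unreadable file counts as stale
        return False


def write_atomically(path: str, content: bytes) -> None:
    """Write to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(content)
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up the partial temp file
        raise
```

Creating the temporary file in the cache directory itself keeps `os.replace` on a single filesystem, which is what makes the swap atomic for concurrent readers.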

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b2bef7ab-3221-4282-b8d5-4386735edf99

📥 Commits

Reviewing files that changed from the base of the PR and between 8ecf878 and bd36334.

📒 Files selected for processing (2)
  • uk_bin_collection/tests/input.json
  • uk_bin_collection/uk_bin_collection/councils/IsleOfWightCouncil.py
