feat: add isle of wight council scraper #2076
New scraper for Isle of Wight Council's Blazor Server waste lookup. Uses Playwright to navigate the form (Selenium and headless Chrome crash on the blazorpack SignalR protocol). Downloads the PDF collection calendar and parses colour-coded cell backgrounds with pdfplumber to distinguish recycling from non-recyclable waste weeks. Dual-mode: checks UKBCD_USE_PLAYWRIGHT env var for Playwright branch, falls back to Selenium branch for environments without Playwright.
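The dual-mode switch described above could look roughly like the sketch below. The helper names and the set of accepted truthy values are assumptions for illustration only; the description says only that the `UKBCD_USE_PLAYWRIGHT` env var is checked and that Selenium is the fallback.

```python
import os

# Assumption: which strings count as "enabled" is not stated in the PR.
_TRUTHY = {"1", "true", "yes", "on"}


def use_playwright(env=None):
    """Return True when the Playwright branch should be taken."""
    env = os.environ if env is None else env
    value = env.get("UKBCD_USE_PLAYWRIGHT", "")
    return value.strip().lower() in _TRUTHY


def choose_backend(env=None):
    """Hypothetical helper: name the branch the scraper would run."""
    return "playwright" if use_playwright(env) else "selenium"
```

Passing a plain dict as `env` keeps the check testable without mutating the real process environment.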
📝 Walkthrough
A new Isle of Wight Council bin-collection scraper is added with PDF calendar parsing, color-based bin type classification, dual browser automation paths (Playwright and Selenium), cached PDF downloads, and test configuration. Both paths navigate to the wasteday site, fill postcode, select address, extract collection day, download and parse the PDF calendar, and return formatted bin collection dates.
Changes
Isle of Wight Council scraper
Sequence Diagram(s)
sequenceDiagram
participant User
participant Browser as Chromium/Webdriver
participant WastePage as Wasteday Page
participant Cache as PDF Cache
participant Parser as PDF Parser
User->>Browser: Initialize browser (Playwright or Selenium)
Browser->>WastePage: Navigate to wasteday URL
Browser->>WastePage: Fill postcode and wait for address dropdown
Browser->>WastePage: Select best matching PAON address
WastePage->>Browser: Return collection day and PDF link
Browser->>Cache: Check PDF cache by URL
alt Cache miss or expired TTL
Browser->>WastePage: Download PDF calendar
WastePage->>Cache: Store with TTL
end
Cache->>Parser: Read cached PDF file
Parser->>Parser: Detect colors and classify bin types by cell position
Parser->>Parser: Extract future collection dates
Parser-->>Browser: Return list of (date, bin_type) tuples
Browser->>Browser: Format into UKBCD bins response object
Browser-->>User: Return JSON with collection schedule
Browser->>Browser: Clean up (close browser or quit webdriver)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2076   +/-   ##
=======================================
  Coverage   86.67%   86.67%
=======================================
  Files           9        9
  Lines        1141     1141
=======================================
  Hits          989      989
  Misses        152      152
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/IsleOfWightCouncil.py`:
- Around line 352-356: The cache key currently only uses PDF_CACHE_DIR,
user_postcode and collection_day so different addresses in the same postcode can
collide; update the cache_key generation (the variable named cache_key used to
build pdf_path) to also incorporate the selected address identifier (e.g.,
selected_address, selected_address_label or the final download URL if you have a
variable like download_url) so the MD5 input uniquely represents
postcode+collection_day+address (or download URL), then rebuild pdf_path using
the new cache_key.
- Around line 196-214: The current logic in the download block (checking
os.path.exists(pdf_path) and writing directly to pdf_path after requests.get)
can yield a race where another process reads a half-written PDF; fix by writing
the downloaded resp.content to a temporary file and atomically replacing the
cache entry with os.replace (or tempfile.NamedTemporaryFile + os.replace) into
pdf_path, ensuring you clean up the temp on error; update the code around
pdf_path, requests.get, resp and PDF_CACHE_MAX_AGE to perform the atomic write
and handle exceptions so concurrent callers never see a partial file.
- Around line 176-183: The current branch that falls back to bin_type =
"Collection" for any color not in RECYCLING_COLORS, NON_RECYCLABLE_COLORS or
HEADER_BG_COLORS should be replaced with an explicit failure: when
_color_in_list(bg, ...) matches none, raise a clear parsing exception (e.g.,
ValueError or a custom ParseError) including the unexpected bg value and context
(page/cell info from _get_bg_color or the enclosing method) so callers can
detect PDF format changes; update the code paths in the same function/method in
IsleOfWightCouncil (the block using _color_in_list, RECYCLING_COLORS,
NON_RECYCLABLE_COLORS, HEADER_BG_COLORS) to remove the silent "Collection"
default and raise that exception instead.
- Around line 86-99: The parser currently returns [] on two error conditions
(empty month_headers and unknown collection_day); instead raise a ValueError
with a clear message in both places so failures are loud and debuggable: when
month_headers is falsy (check month_headers) raise ValueError("No month headers
parsed from PDF") and when target_weekday is None after mapping collection_day
via day_map (use collection_day.upper() and day_map.get(...)) raise
ValueError(f"Unrecognized collection_day: {collection_day}"); update the code
around month_headers, first_month, day_map, target_weekday to perform these
raises instead of returning [].
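The two cache findings (key collisions and partial reads) can be illustrated with a minimal sketch. The function names, the `PDF_CACHE_MAX_AGE` value, and the choice of the download URL as the address-specific input are assumptions; only the MD5 key, `pdf_path`, and the use of `os.replace` come from the review comments.

```python
import hashlib
import os
import tempfile
import time

PDF_CACHE_MAX_AGE = 24 * 60 * 60  # seconds; assumed value, not from the PR


def build_cache_key(user_postcode, collection_day, download_url):
    """Hash postcode + day + download URL so addresses cannot collide."""
    raw = "|".join([
        user_postcode.replace(" ", "").upper(),
        collection_day.upper(),
        download_url,
    ])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def is_fresh(pdf_path, max_age=PDF_CACHE_MAX_AGE):
    """True if the cached file exists and is younger than max_age seconds."""
    try:
        return time.time() - os.path.getmtime(pdf_path) < max_age
    except OSError:
        return False


def write_atomically(pdf_path, content):
    """Write bytes to a temp file in the cache dir, then swap it into place."""
    cache_dir = os.path.dirname(pdf_path) or "."
    os.makedirs(cache_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(content)
        # os.replace is atomic when source and target share a filesystem,
        # so readers see either the old file or the complete new one.
        os.replace(tmp_path, pdf_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Creating the temp file inside the cache directory, rather than the system temp directory, is what keeps the final `os.replace` on a single filesystem.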
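The two "fail loudly" findings can be sketched in the same spirit. The RGB values, the tolerance, the bin type labels, and the rule that header cells are skipped are all hypothetical placeholders; the actual `RECYCLING_COLORS`, `NON_RECYCLABLE_COLORS`, `HEADER_BG_COLORS` and `_color_in_list` in `IsleOfWightCouncil.py` may differ.

```python
# Hypothetical RGB values; the real constants live in IsleOfWightCouncil.py.
RECYCLING_COLORS = [(0.0, 0.6, 0.3)]
NON_RECYCLABLE_COLORS = [(0.2, 0.2, 0.2)]
HEADER_BG_COLORS = [(1.0, 1.0, 1.0)]

DAY_MAP = {
    "MONDAY": 0, "TUESDAY": 1, "WEDNESDAY": 2, "THURSDAY": 3,
    "FRIDAY": 4, "SATURDAY": 5, "SUNDAY": 6,
}


def color_in_list(color, candidates, tol=0.05):
    """True if color matches any candidate within tol per channel."""
    return any(
        len(color) == len(c) and all(abs(a - b) <= tol for a, b in zip(color, c))
        for c in candidates
    )


def classify_cell(bg, page_number, cell_index):
    """Map a cell background to a bin type, or raise on an unknown colour."""
    if color_in_list(bg, RECYCLING_COLORS):
        return "Recycling"
    if color_in_list(bg, NON_RECYCLABLE_COLORS):
        return "Non-recyclable waste"
    if color_in_list(bg, HEADER_BG_COLORS):
        return None  # assumption: header cells are skipped, not collections
    raise ValueError(
        f"Unexpected background colour {bg!r} on page {page_number}, "
        f"cell {cell_index}; the PDF format may have changed"
    )


def resolve_weekday(collection_day):
    """Translate a day name to a weekday index, raising if unrecognised."""
    target = DAY_MAP.get(collection_day.strip().upper())
    if target is None:
        raise ValueError(f"Unrecognized collection_day: {collection_day}")
    return target
```

Raising instead of returning a generic "Collection" label or an empty list means a changed PDF layout surfaces as an error rather than as silently wrong or missing dates.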
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b2bef7ab-3221-4282-b8d5-4386735edf99
📒 Files selected for processing (2)
uk_bin_collection/tests/input.json
uk_bin_collection/uk_bin_collection/councils/IsleOfWightCouncil.py
Summary
- `digitalservices.iow.gov.uk/wasteday`
- `blazorpack` (a binary SignalR protocol) which crashes standard Chrome and headless Chrome
- `expect_download`
Dual-mode design
- `UKBCD_USE_PLAYWRIGHT` env var to choose between Playwright and Selenium branches
- `pip install playwright && playwright install chromium` and `xvfb-run` for the Playwright path
Testing
- `PO31 7LE` + paon `101`
Summary by CodeRabbit
New Features