feat: add scottish borders council scraper #2087
Uses the Bartec Municipal Portal. Pure HTTP with requests + BeautifulSoup. 3-step form flow with CSRF tokens: postcode search, address selection by UPRN, then parsing the Syncfusion Schedule JSON for collection events. Also supports house number matching as a fallback for address selection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough
This PR introduces a new web scraper for Scottish Borders Council that authenticates via CSRF tokens, searches for collection addresses by postcode, and extracts bin collection schedules from portal calendar events. Test configuration is updated to include the new council and fix a unicode character encoding in an existing entry.
Changes
Scottish Borders Council Scraper Implementation and Testing
Sequence Diagram
sequenceDiagram
participant Client as parse_data
participant Portal as Portal Server
participant Session as HTTP Session
participant Parser as HTML/JSON Parser
Client->>Session: Create session with headers
Session->>Portal: GET calendar page
Portal-->>Session: HTML with CSRF token
Client->>Parser: Extract CSRF via _get_csrf_token
Parser-->>Client: __RequestVerificationToken value
Client->>Session: POST postcode to address handler
Session->>Portal: postcode request
Portal-->>Session: JSON with address dropdown options
Client->>Parser: Match UPRN (provided, paon match, or first)
Parser-->>Client: Selected UPRN and address
Client->>Session: GET search results page
Session->>Portal: UPRN selection confirmation
Portal-->>Session: HTML with refreshed CSRF
Client->>Parser: Re-extract CSRF from results
Parser-->>Client: Updated CSRF token
Client->>Session: POST premises selection
Session->>Portal: Selected UPRN confirmation
Portal-->>Session: Rendered page with calendar events
Client->>Parser: Extract isJson blocks and find Subject/StartTime
Parser-->>Client: Calendar event objects
Client->>Parser: Build bins list with collectionDate mapping
Parser-->>Client: Sorted bins by chronological date
Client->>Client: Return bindata dict with bins
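The last steps of the diagram (mapping calendar events to a sorted bins list) can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the helper name `build_bins`, the `DATE_FORMAT` value, the `"type"` key, and the assumption that `StartTime` begins with an ISO `YYYY-MM-DD` date are all hypothetical. Only `Subject`, `StartTime`, `collectionDate`, and `bins` come from the walkthrough above.

```python
from datetime import datetime

# Assumed output format; the real project may define its own date_format.
DATE_FORMAT = "%d/%m/%Y"


def build_bins(events):
    """Turn calendar event dicts into a chronologically sorted bins list."""
    bins = []
    for event in events:
        subject = str(event.get("Subject") or "").strip()
        start_time = str(event.get("StartTime") or "")
        if not subject or not start_time:
            continue  # skip events missing a subject or a start time
        try:
            # Assumption: StartTime starts with an ISO date, e.g. 2026-05-12T00:00:00
            collection_date = datetime.strptime(start_time[:10], "%Y-%m-%d")
        except ValueError:
            continue  # skip events whose date cannot be parsed
        bins.append(
            {
                "type": subject,
                "collectionDate": collection_date.strftime(DATE_FORMAT),
            }
        )

    if not bins:
        # Fail loudly instead of returning {"bins": []}
        raise ValueError(
            f"No valid collection events found ({len(events)} events processed); "
            "subjects or dates may be invalid"
        )

    bins.sort(key=lambda b: datetime.strptime(b["collectionDate"], DATE_FORMAT))
    return {"bins": bins}
```

Sorting on the parsed date rather than on the formatted string matters here, because day-first strings do not sort chronologically.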
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@           Coverage Diff           @@
##           master    #2087   +/-   ##
=======================================
  Coverage   86.67%   86.67%
=======================================
  Files           9        9
  Lines        1141     1141
=======================================
  Hits          989      989
  Misses        152      152
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/ScottishBordersCouncil.py`:
- Around line 20-25: The _get_csrf_token function currently assumes the found
input has a value; update it to validate that the input element returned by
soup.find("input", {"name": "__RequestVerificationToken"}) actually contains a
non-empty "value" attribute and raise a clear ValueError if missing.
Specifically, in _get_csrf_token check token is not None and that
token.get("value") (or token.has_attr("value") and token["value"].strip()) is
truthy; if not, raise an error like "CSRF token input found but missing value"
so callers don't receive None in subsequent requests.
- Around line 50-66: The HTTP requests in ScottishBordersCouncil (the
session.get(self.BASE_URL) and the session.post(... handler=SearchPostcode ...))
lack timeouts; update both calls to pass timeout=REQUEST_TIMEOUT so they won't
hang indefinitely (keep response.raise_for_status() as-is); search for the
GET/POST occurrences in the method that calls _get_csrf_token and add the
timeout argument to each request.
- Around line 150-177: The parser currently silently returns bindata with empty
"bins" when no valid events are found; modify the logic in the event-processing
routine (the loop over events that builds bindata["bins"], using variables
events, subject, start_time, collection_date and date_format) to detect after
the loop if bindata["bins"] is empty and, in that case, raise a descriptive
exception (e.g., ValueError or a custom ParseError) that includes context (e.g.,
number of events processed and a hint that dates/subjects were invalid) instead
of returning {"bins": []}; keep the existing parsing/continue behavior for
individual invalid events but ensure the top-level failure is raised from the
same function that currently returns bindata.
- Around line 137-142: The loop that parses matched JSON fragments can crash on
malformed input; wrap the json.loads(raw_json) call inside a try/except that
catches json.JSONDecodeError (and optionally ValueError) so a single bad
fragment is skipped and the loop continues, optionally logging a warning; keep
the existing logic that checks parsed, isinstance(parsed[0], dict) and "Subject"
in parsed[0] and only set events = parsed and break when a valid fragment is
found (referencing variables all_matches, raw_json, parsed, events in
ScottishBordersCouncil.py).
- Around line 86-102: The code converts UPRN values with int(addr.get("UPRN",
0)) which can raise ValueError/TypeError for None, empty or non-numeric UPRNs;
add a small helper (e.g., safe_uprn_str or parse_uprn) and use it wherever UPRNs
are read (the blocks referencing selected_uprn, addresses, user_uprn, user_paon
and the calls to addr.get("UPRN", 0)) to validate and convert to a numeric
string safely: attempt to coerce to str, strip, check numeric (or catch
ValueError/TypeError around int()), return None for invalid values, and skip
those addresses instead of letting an exception propagate.
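Three of the findings above (CSRF value validation, tolerant JSON fragment parsing, and safe UPRN conversion) might be addressed with small helpers along these lines. This is only a sketch under stated assumptions: the helper names `require_csrf_value`, `find_events`, and `parse_uprn` are hypothetical and do not appear in the PR, and `require_csrf_value` takes the element already returned by `soup.find(...)` (anything with a `.get` method) so the sketch itself does not depend on BeautifulSoup.

```python
import json


def require_csrf_value(token_input):
    """Return the token value, or raise if the input or its value is missing."""
    if token_input is None:
        raise ValueError("CSRF token input not found")
    value = (token_input.get("value") or "").strip()
    if not value:
        raise ValueError("CSRF token input found but missing value")
    return value


def find_events(fragments):
    """Return the first fragment that parses to a list of event dicts."""
    for raw_json in fragments:
        try:
            parsed = json.loads(raw_json)
        except (json.JSONDecodeError, TypeError):
            continue  # one malformed fragment should not abort the search
        if (
            isinstance(parsed, list)
            and parsed
            and isinstance(parsed[0], dict)
            and "Subject" in parsed[0]
        ):
            return parsed
    return []


def parse_uprn(raw):
    """Convert a raw UPRN to a numeric string, or None if it is invalid."""
    try:
        if isinstance(raw, str):
            raw = raw.strip()
        return str(int(raw))
    except (TypeError, ValueError, OverflowError):
        return None
```

Callers would then skip any address for which `parse_uprn` returns `None`, rather than letting a `ValueError` or `TypeError` propagate out of the address-matching loop.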
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6028689f-1519-41f9-afe9-5af2f6da8777
📒 Files selected for processing (2)
uk_bin_collection/tests/input.json
uk_bin_collection/uk_bin_collection/councils/ScottishBordersCouncil.py
Summary
scotborders-live-portal.bartecmunicipal.com
Test plan