fix: rewrite fenland scraper to use selenium for cloudflare bypass by InertiaUK · Pull Request #2085 · robbrad/UKBinCollectionData

InertiaUK · 2026-05-22T12:00:00Z

Summary

Fenland's GIS layer endpoint (/article/13114/?type=loadlayer) is now behind a Cloudflare JS challenge
Direct HTTP requests via the requests library get a 403 with the Cloudflare challenge page
Rewrites the scraper to use Selenium: it loads the main page to pass the Cloudflare challenge, then fetches the JSON API from within the browser context using execute_async_script
Adds web_driver to input.json, since this is now a Selenium-based scraper
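
The flow described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names, the `"Just a moment"` title marker used to detect the interstitial, and the timeout values are all assumptions. The `driver` argument is any Selenium-style WebDriver object.

```python
import time

# JavaScript run inside the page. Selenium appends a completion callback as the
# last element of `arguments`; the script must call it exactly once.
FETCH_SCRIPT = """
const url = arguments[0];
const done = arguments[arguments.length - 1];
fetch(url, {credentials: "include"})
  .then(r => r.text())
  .then(text => done(text))
  .catch(err => done("ERROR:" + err));
"""


def wait_for_challenge(driver, timeout=30.0, poll=0.5, marker="Just a moment"):
    """Poll the page title until the (assumed) interstitial marker disappears."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if marker not in driver.title:
            return
        time.sleep(poll)
    raise TimeoutError("Cloudflare challenge did not clear in time")


def fetch_in_browser(driver, page_url, api_url):
    """Load the page, wait out the challenge, then fetch api_url from the page context."""
    driver.get(page_url)
    wait_for_challenge(driver)
    # The fetch runs in the browser, so it reuses the cookies set by the challenge.
    result = driver.execute_async_script(FETCH_SCRIPT, api_url)
    if result.startswith("ERROR:"):
        raise RuntimeError(f"In-browser fetch failed: {result}")
    return result
```

Because the request is issued by the page itself, no cookies need to be copied out of the browser into a separate HTTP client.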

Changes

FenlandDistrictCouncil.py - rewritten from requests.get to Selenium + in-browser fetch
input.json - added web_driver field and updated wiki_note

Test plan

Tested on VPS with Chromium 134 + Xvfb - returns 12 bins for UPRN 200002981143 (PE13 3SL)
Verified Cloudflare challenge resolves before API call

Summary by CodeRabbit

Bug Fixes
- Restored bin collection data retrieval for Fenland District Council
Chores
- Updated test configuration data for multiple councils

Fenland's GIS endpoint is now behind a Cloudflare JS challenge, blocking direct HTTP requests. Rewrites the scraper to use Selenium: loads the page to pass the challenge, then fetches the JSON API from the browser context.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-22T12:00:14Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f9e69e4e-3913-4ebc-9669-5e725e6ade9d

📥 Commits

Reviewing files that changed from the base of the PR and between e29d744 and c0b1a58.

📒 Files selected for processing (2)

uk_bin_collection/tests/input.json
uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py

📝 Walkthrough

Fenland District Council scraper replaces requests-based JSON fetch with Selenium-driven flow to bypass Cloudflare protection. Core logic sets Chrome user agent, creates WebDriver, injects CDP script to hide navigator.webdriver, waits for challenge interstitial to clear, executes async in-page fetch, and parses returned JSON. Test fixtures are updated for Fenland configuration and other councils.

Changes

Fenland Scraper Selenium Integration and Council Test Fixtures

Layer / File(s)	Summary
Fenland Scraper — Selenium and Cloudflare Bypass Implementation `uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py`	Module imports shift from `requests` to Selenium WebDriver utilities. `parse_data` is fully reimplemented: derives parameters from kwargs, configures Chrome user agent, creates WebDriver with CDP script injection to mask `navigator.webdriver`, navigates to Fenland page, waits for challenge interstitial to clear, executes async browser-context `fetch` to the layer endpoint, validates and parses JSON response extracting `upcoming` collections, transforms data into `{"bins": [...]}` structure with reformatted dates, and ensures driver cleanup via `finally` block.
Council Test Fixtures — Metadata and Configuration Updates `uk_bin_collection/tests/input.json`	Environment First `wiki_note` text is adjusted for Lewes/Eastbourne guidance. Fenland entry expands `wiki_note` to document Selenium requirement, adds `web_driver` configuration, and ensures `LAD24CD` field. North Hertfordshire council configuration block from `house_number` through `LAD24CD` is rewritten while preserving visible field values.
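
As an illustration of the final transformation step in the table above, here is a small sketch that turns a list of `upcoming` entries into a `{"bins": [...]}` structure with reformatted dates. The entry field names (`name`, `date`), the input date format, the output keys (`type`, `collectionDate`) and the `DD/MM/YYYY` output format are assumptions made for the example; the walkthrough does not spell them out.

```python
from datetime import datetime


def build_bins(upcoming, in_fmt="%Y-%m-%dT%H:%M:%S", out_fmt="%d/%m/%Y"):
    """Convert upcoming-collection entries into a {"bins": [...]} dictionary."""
    parsed = []
    for item in upcoming:
        # Hypothetical field names: "date" for the timestamp, "name" for the bin type.
        when = datetime.strptime(item["date"], in_fmt)
        parsed.append((when, item["name"]))

    # Sort chronologically before formatting, so ordering does not depend on
    # the string representation of the date.
    parsed.sort(key=lambda pair: pair[0])

    return {
        "bins": [
            {"type": name, "collectionDate": when.strftime(out_fmt)}
            for when, name in parsed
        ]
    }
```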

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

dp247

Poem

🐰 A scraper once bound by requests chains,
Now rides Selenium through Cloudflare's refrains,
With CDP tricks to hide the bot's trace,
The browser fetches at a rapid pace!
Fixtures aligned, the tests shall pass—
A refactor quite bold, first-class! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'fix: rewrite fenland scraper to use selenium for cloudflare bypass' accurately and specifically describes the main change: rewriting the Fenland scraper to use Selenium to bypass Cloudflare protection.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

codecov · 2026-05-22T12:02:27Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.67%. Comparing base (8ecf878) to head (c0b1a58).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #2085   +/-   ##
=======================================
  Coverage   86.67%   86.67%           
=======================================
  Files           9        9           
  Lines        1141     1141           
=======================================
  Hits          989      989           
  Misses        152      152

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@uk_bin_collection/tests/input.json`:
- Line 882: The wiki_note value contains a mojibake replacement character in the
string ("property�you"); update the "wiki_note" string to replace that
replacement character with a proper separator (e.g., a dash or a period + space)
so it reads correctly (for example: "property - you can use..." or "property.
You can use...") ensuring the rest of the text, including the UPRN placeholder
and FindMyAddress link, remains unchanged.
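
One way to catch this class of problem before review is to scan the fixture for the Unicode replacement character (U+FFFD). The sketch below is a hypothetical helper, not part of the repository; it walks any parsed JSON value and reports the paths of affected strings.

```python
import json

REPLACEMENT = "\ufffd"  # U+FFFD, what decoders emit for undecodable bytes


def find_replacement_chars(node, path=""):
    """Return the JSON paths of all string values containing U+FFFD."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_replacement_chars(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits += find_replacement_chars(value, f"{path}/{index}")
    elif isinstance(node, str) and REPLACEMENT in node:
        hits.append(path)
    return hits


def scan_file(path):
    """Scan a JSON file; invalid UTF-8 bytes are decoded to U+FFFD and flagged too."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return find_replacement_chars(json.load(fh))
```

Calling `scan_file("uk_bin_collection/tests/input.json")` would list every council entry whose text needs repair.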

In `@uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py`:
- Around line 50-69: The fetch result from driver.execute_async_script(api_url)
is assumed to be valid JSON with features[0] but can be an HTML error page,
non-JSON, or a JSON without features; update the parsing around the
execute_async_script call in FenlandDistrictCouncil (the block that currently
checks result.startswith("ERROR:") and then does
json.loads(result)["features"][0]["properties"]["upcoming"]) to: first verify
the result is not an HTML/error page and that it is valid JSON (catch
JSONDecodeError), then load the JSON and explicitly check that the top-level
"features" key exists, is a list, and is non-empty before accessing features[0];
if any check fails, raise a descriptive ValueError (e.g., "Unexpected API
response: missing or empty features" or "Non-JSON or HTML response from API") so
failures are explicit and scraper-specific.
- Around line 28-35: FenlandDistrictCouncil.parse_data currently calls
driver.execute_cdp_cmd (and relies on create_webdriver returning a Chrome
webdriver) without guarding against non-Chromium browsers and assumes the
fetched in-page payload is a JSON with features[0].properties.upcoming; update
the class so either its wiki_note clearly states Chrome/Chromium-only OR add an
early runtime guard in FenlandDistrictCouncil.parse_data that checks
driver.capabilities["browserName"] (or equivalent) and raises a clear error if
not Chrome/Chromium, and modify the fetch result handling to explicitly check
the HTTP status and content-type, parse JSON safely, validate that the top-level
"features" is a non-empty list and that features[0]["properties"]["upcoming"]
exists (raising descriptive errors if any check fails) before accessing the
value.
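
The payload checks requested above could look something like this. It is a sketch under the assumption that the fetch result arrives as a string; the function name `extract_upcoming` is hypothetical, and the error messages follow the wording suggested in the comment.

```python
import json


def extract_upcoming(result):
    """Validate the raw fetch result and return features[0].properties.upcoming."""
    if not isinstance(result, str) or not result.strip():
        raise ValueError("Empty response from API")
    # A leading "<" almost always means an HTML error or challenge page.
    if result.lstrip().startswith("<"):
        raise ValueError("Non-JSON or HTML response from API")
    try:
        payload = json.loads(result)
    except json.JSONDecodeError as exc:
        raise ValueError("Non-JSON or HTML response from API") from exc

    features = payload.get("features") if isinstance(payload, dict) else None
    if not isinstance(features, list) or not features:
        raise ValueError("Unexpected API response: missing or empty features")

    try:
        return features[0]["properties"]["upcoming"]
    except (KeyError, TypeError) as exc:
        raise ValueError(
            "Unexpected API response: missing properties.upcoming"
        ) from exc
```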

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cce6ca62-7e3f-48b1-bd4c-eaf30aecd31c

📥 Commits

Reviewing files that changed from the base of the PR and between 8ecf878 and e29d744.

📒 Files selected for processing (2)

uk_bin_collection/tests/input.json
uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py

coderabbitai · 2026-05-22T12:04:33Z

+        driver = create_webdriver(web_driver, headless, user_agent, __name__)
+
+        try:
+            driver.execute_cdp_cmd(
+                "Page.addScriptToEvaluateOnNewDocument",
+                {
+                    "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
+                },


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
from pathlib import Path

for path in Path("uk_bin_collection").rglob("common.py"):
    text = path.read_text()
    marker = "def create_webdriver("
    start = text.find(marker)
    if start == -1:
        continue
    rest = text[start:]
    next_def = rest.find("\ndef ", 1)
    snippet = rest if next_def == -1 else rest[:next_def]
    print(f"== {path} ==")
    print(snippet)
    print()
PY

rg -n -C2 'execute_cdp_cmd|Page.addScriptToEvaluateOnNewDocument|web_driver|Chrom(e|ium)' \
  uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py \
  uk_bin_collection/tests/input.json

Repository: robbrad/UKBinCollectionData

Length of output: 50383

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the FenlandDistrictCouncil.py content around the cited areas
python - <<'PY'
import itertools
from pathlib import Path

path = Path("uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py")
lines = path.read_text().splitlines()
for start,end in [(1,120),(120,220)]:
    print(f"\n===== {path} lines {start}-{end} =====")
    for i in range(start-1, min(end, len(lines))):
        print(f"{i+1:4d}: {lines[i]}")
PY

# Extract the Fenland entry from tests/input.json
python - <<'PY'
import json
from pathlib import Path

p = Path("uk_bin_collection/tests/input.json")
data = json.loads(p.read_text())
# Find key that matches Fenland (case-sensitive key lookup)
for k in list(data.keys()):
    if "Fenland" in k:
        print(f"\n=== input.json entry: {k} ===")
        print(json.dumps(data[k], indent=2)[:4000])
        break
else:
    print("No key containing 'Fenland' found in input.json top-level.")
PY

# Search within FenlandDistrictCouncil.py for how JSON is retrieved/parsed
rg -n "execute_async_script|features|upcoming|json|response|status|webdriver|CDP|addScriptToEvaluateOnNewDocument" \
  uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py

Repository: robbrad/UKBinCollectionData

Length of output: 4087

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find Fenland entry textually (avoid JSON decoding issues)
rg -n -C3 "Fenland" uk_bin_collection/tests/input.json | head -n 200

# Inspect create_webdriver implementation in common.py (to see if any enforcement/guards exist)
python - <<'PY'
from pathlib import Path

p = Path("uk_bin_collection/uk_bin_collection/common.py")
text = p.read_bytes().decode("utf-8", errors="replace").splitlines()
for i,line in enumerate(text, start=1):
    if "def create_webdriver" in line:
        start=i
        break
else:
    raise SystemExit("create_webdriver not found")
end = min(len(text), start+120)
for j in range(start, end+1):
    print(f"{j:4d}: {text[j-1]}")
PY

# Search for any other guard around execute_cdp_cmd / Chromium in the codebase
rg -n "execute_cdp_cmd\\(|Page\\.addScriptToEvaluateOnNewDocument|CDP|Chrom(e|ium)" uk_bin_collection/uk_bin_collection | head -n 200

Repository: robbrad/UKBinCollectionData

Length of output: 37389

Make the Chromium/CDP requirement explicit and validate the API response shape/status

FenlandDistrictCouncil.parse_data calls driver.execute_cdp_cmd(...) and create_webdriver() is Chrome/Chromium-specific (webdriver.ChromeOptions, returns webdriver.Chrome), but the Fenland web_driver is documented/treated as generic Selenium—if the remote node is Firefox-backed, this will fail before the first successful parse. Update FenlandDistrictCouncil’s wiki_note (or fixture/docs) to state Chrome/Chromium-only, or add an early runtime guard (e.g., check driver.capabilities["browserName"]).

The in-browser fetch() returns r.text() without checking HTTP status, then the code immediately does json.loads(result)["features"][0]["properties"]["upcoming"] with no payload-shape validation, producing opaque failures for non-200/HTML/empty features. Add explicit status handling and schema validation with clear errors.
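
A possible shape for both suggestions, offered as a sketch rather than the fix that landed in the PR: an early guard on `driver.capabilities["browserName"]`, and an in-page script that reports the HTTP status and content type alongside the body. The names `require_chromium`, `fetch_checked` and `STATUS_FETCH_SCRIPT` are hypothetical, as is the choice to pass the result back as a JSON-encoded envelope.

```python
import json

CHROMIUM_NAMES = {"chrome", "chromium"}

# Runs in the page; wraps status, content type and body into one JSON string
# so Python can check them before trusting the payload.
STATUS_FETCH_SCRIPT = """
const url = arguments[0];
const done = arguments[arguments.length - 1];
fetch(url)
  .then(r => r.text().then(body => done(JSON.stringify({
      status: r.status,
      contentType: r.headers.get("content-type") || "",
      body: body}))))
  .catch(err => done(JSON.stringify(
      {status: 0, contentType: "", body: String(err)})));
"""


def require_chromium(driver):
    """Fail early if the session is not Chrome/Chromium (CDP commands need it)."""
    name = str(driver.capabilities.get("browserName", "")).lower()
    if name not in CHROMIUM_NAMES:
        raise RuntimeError(
            f"Fenland scraper requires Chrome/Chromium for CDP, got {name!r}"
        )


def fetch_checked(driver, api_url):
    """Fetch api_url in the page and return the body only for a 200 JSON response."""
    require_chromium(driver)
    envelope = json.loads(driver.execute_async_script(STATUS_FETCH_SCRIPT, api_url))
    if envelope["status"] != 200:
        raise ValueError(f"API returned HTTP {envelope['status']}")
    if "json" not in envelope["contentType"].lower():
        raise ValueError(f"Unexpected content type: {envelope['contentType']!r}")
    return envelope["body"]
```

The guard would sit before the `execute_cdp_cmd` call, so a Firefox-backed remote node produces a clear message instead of an attribute or command error.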

fix: address CodeRabbit review feedback

c0b1a58

fix: rewrite fenland scraper to use selenium for cloudflare bypass #2085
InertiaUK wants to merge 2 commits into robbrad:master from InertiaUK:fix/fenland-playwright-cloudflare
