fix: rewrite fenland scraper to use selenium for cloudflare bypass #2085

Open
InertiaUK wants to merge 2 commits into robbrad:master from InertiaUK:fix/fenland-playwright-cloudflare

Contributor

@InertiaUK InertiaUK commented May 22, 2026

Summary

  • Fenland's GIS layer endpoint (/article/13114/?type=loadlayer) is now behind a Cloudflare JS challenge
  • Direct HTTP requests via the requests library get a 403 with the Cloudflare challenge page
  • Rewrites the scraper to use Selenium: loads the main page to pass the CF challenge, then fetches the JSON API from within the browser context using execute_async_script
  • Adds web_driver to input.json since this is now a Selenium-based scraper
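
The flow in the summary above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `fetch_in_browser` and `FETCH_SCRIPT` are hypothetical names, and the wait for the challenge to clear is left out. It relies only on Selenium's `driver.get` and `driver.execute_async_script`, which appends a completion callback as the last script argument.

```python
# JavaScript executed inside the loaded page. Selenium passes a completion
# callback as the final argument of an async script.
FETCH_SCRIPT = """
const url = arguments[0];
const done = arguments[arguments.length - 1];
fetch(url)
  .then(r => r.text())
  .then(body => done(body))
  .catch(err => done("ERROR:" + err));
"""


def fetch_in_browser(driver, page_url, api_url):
    """Load page_url in the browser, then fetch api_url from inside that page.

    Returns the raw response text. `driver` is any object exposing Selenium's
    `get` and `execute_async_script` methods.
    """
    driver.get(page_url)
    # NOTE: a real scraper would wait here until the challenge page has cleared.
    result = driver.execute_async_script(FETCH_SCRIPT, api_url)
    if not isinstance(result, str) or result.startswith("ERROR:"):
        raise ValueError(f"In-browser fetch failed: {result!r}")
    return result
```

Because the request is issued by the page itself, it carries whatever cookies the browser obtained while passing the challenge, which a separate `requests` session would not have.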

Changes

  • FenlandDistrictCouncil.py - rewritten from requests.get to Selenium + in-browser fetch
  • input.json - added web_driver field and updated wiki_note

Test plan

  • Tested on VPS with Chromium 134 + Xvfb - returns 12 bins for UPRN 200002981143 (PE13 3SL)
  • Verified Cloudflare challenge resolves before API call

Summary by CodeRabbit

  • Bug Fixes

    • Restored bin collection data retrieval for Fenland District Council
  • Chores

    • Updated test configuration data for multiple councils

Fenland's GIS endpoint is now behind Cloudflare JS challenge, blocking
direct HTTP requests. Rewrites scraper to use Selenium: loads page to
pass challenge, then fetches JSON API from browser context.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

coderabbitai Bot commented May 22, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f9e69e4e-3913-4ebc-9669-5e725e6ade9d

📥 Commits

Reviewing files that changed from the base of the PR and between e29d744 and c0b1a58.

📒 Files selected for processing (2)
  • uk_bin_collection/tests/input.json
  • uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py
📝 Walkthrough

Fenland District Council scraper replaces requests-based JSON fetch with Selenium-driven flow to bypass Cloudflare protection. Core logic sets Chrome user agent, creates WebDriver, injects CDP script to hide navigator.webdriver, waits for challenge interstitial to clear, executes async in-page fetch, and parses returned JSON. Test fixtures are updated for Fenland configuration and other councils.
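
The "waits for challenge interstitial to clear" step in the walkthrough could look something like the polling helper below. It is only a sketch under stated assumptions: `wait_for_challenge` is a hypothetical name, and detecting the interstitial by a "Just a moment" page title is an assumption about the challenge page, not something confirmed by this PR.

```python
import time


def wait_for_challenge(driver, timeout=30.0, poll=0.5, marker="Just a moment"):
    """Poll driver.title until it no longer contains `marker`.

    Returns True once the marker has gone, or False if `timeout` seconds
    pass first. `marker` is an assumed title fragment for the interstitial.
    """
    deadline = time.monotonic() + timeout
    while True:
        if marker not in (driver.title or ""):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Returning a boolean rather than raising leaves the caller free to decide whether a timeout is fatal or worth one retry.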

Changes

Fenland Scraper Selenium Integration and Council Test Fixtures

| Layer / File(s) | Summary |
| --- | --- |
| Fenland Scraper — Selenium and Cloudflare Bypass Implementation<br>uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py | Module imports shift from requests to Selenium WebDriver utilities. parse_data is fully reimplemented: derives parameters from kwargs, configures Chrome user agent, creates WebDriver with CDP script injection to mask navigator.webdriver, navigates to Fenland page, waits for challenge interstitial to clear, executes async browser-context fetch to the layer endpoint, validates and parses JSON response extracting upcoming collections, transforms data into {"bins": [...]} structure with reformatted dates, and ensures driver cleanup via finally block. |
| Council Test Fixtures — Metadata and Configuration Updates<br>uk_bin_collection/tests/input.json | Environment First wiki_note text is adjusted for Lewes/Eastbourne guidance. Fenland entry expands wiki_note to document Selenium requirement, adds web_driver configuration, and ensures LAD24CD field. North Hertfordshire council configuration block from house_number through LAD24CD is rewritten while preserving visible field values. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • dp247

Poem

🐰 A scraper once bound by requests chains,
Now rides Selenium through Cloudflare's refrains,
With CDP tricks to hide the bot's trace,
The browser fetches at a rapid pace!
Fixtures aligned, the tests shall pass—
A refactor quite bold, first-class!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'fix: rewrite fenland scraper to use selenium for cloudflare bypass' accurately and specifically describes the main change: rewriting the Fenland scraper to use Selenium to bypass Cloudflare protection. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


codecov Bot commented May 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.67%. Comparing base (8ecf878) to head (c0b1a58).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2085   +/-   ##
=======================================
  Coverage   86.67%   86.67%           
=======================================
  Files           9        9           
  Lines        1141     1141           
=======================================
  Hits          989      989           
  Misses        152      152           

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@uk_bin_collection/tests/input.json`:
- Line 882: The wiki_note value contains a mojibake replacement character in the
string ("property�you"); update the "wiki_note" string to replace that
replacement character with a proper separator (e.g., a dash or a period + space)
so it reads correctly (for example: "property - you can use..." or "property.
You can use...") ensuring the rest of the text, including the UPRN placeholder
and FindMyAddress link, remains unchanged.
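
A check along these lines could catch this kind of mojibake before it reaches review. It is a rough sketch: `find_replacement_chars` is a hypothetical helper, and the sample entry below is illustrative rather than copied from input.json.

```python
import json


def find_replacement_chars(value, path=""):
    """Yield the path of every string in decoded JSON that contains U+FFFD."""
    if isinstance(value, str):
        if "\ufffd" in value:
            yield path or "<root>"
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from find_replacement_chars(item, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from find_replacement_chars(item, f"{path}[{index}]")


# Illustrative data only; the JSON escape \ufffd decodes to the replacement character.
sample = json.loads('{"FenlandDistrictCouncil": {"wiki_note": "property\\ufffdyou can use"}}')
print(list(find_replacement_chars(sample)))
```

Run against the decoded fixture file, an empty result would mean no string value carries the replacement character.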

In `@uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py`:
- Around line 50-69: The fetch result from driver.execute_async_script(api_url)
is assumed to be valid JSON with features[0] but can be an HTML error page,
non-JSON, or a JSON without features; update the parsing around the
execute_async_script call in FenlandDistrictCouncil (the block that currently
checks result.startswith("ERROR:") and then does
json.loads(result)["features"][0]["properties"]["upcoming"]) to: first verify
the result is not an HTML/error page and that it is valid JSON (catch
JSONDecodeError), then load the JSON and explicitly check that the top-level
"features" key exists, is a list, and is non-empty before accessing features[0];
if any check fails, raise a descriptive ValueError (e.g., "Unexpected API
response: missing or empty features" or "Non-JSON or HTML response from API") so
failures are explicit and scraper-specific.
- Around line 28-35: FenlandDistrictCouncil.parse_data currently calls
driver.execute_cdp_cmd (and relies on create_webdriver returning a Chrome
webdriver) without guarding against non-Chromium browsers and assumes the
fetched in-page payload is a JSON with features[0].properties.upcoming; update
the class so either its wiki_note clearly states Chrome/Chromium-only OR add an
early runtime guard in FenlandDistrictCouncil.parse_data that checks
driver.capabilities["browserName"] (or equivalent) and raises a clear error if
not Chrome/Chromium, and modify the fetch result handling to explicitly check
the HTTP status and content-type, parse JSON safely, validate that the top-level
"features" is a non-empty list and that features[0]["properties"]["upcoming"]
exists (raising descriptive errors if any check fails) before accessing the
value.
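
The early runtime guard suggested above might be as small as the following. This is a sketch, not the project's code: `require_chromium` and the allowlist are hypothetical, and the exact `browserName` strings a given remote node reports are an assumption. Only the `driver.capabilities` mapping is taken from Selenium itself.

```python
# Assumed allowlist of browserName values that support CDP commands.
CHROMIUM_BROWSER_NAMES = ("chrome", "chromium")


def require_chromium(driver):
    """Raise a clear error unless the session reports a Chromium-family browser."""
    name = str(driver.capabilities.get("browserName", "")).lower()
    if name not in CHROMIUM_BROWSER_NAMES:
        raise RuntimeError(
            "FenlandDistrictCouncil needs a Chrome/Chromium session for CDP, "
            f"got {name or 'unknown'!r}"
        )
    return name
```

Calling it immediately after the driver is created, and before `execute_cdp_cmd`, would turn a confusing CDP failure on a Firefox-backed node into an explicit message.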

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cce6ca62-7e3f-48b1-bd4c-eaf30aecd31c

📥 Commits

Reviewing files that changed from the base of the PR and between 8ecf878 and e29d744.

📒 Files selected for processing (2)
  • uk_bin_collection/tests/input.json
  • uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py

Comment thread uk_bin_collection/tests/input.json Outdated
Comment on lines +28 to +35
        driver = create_webdriver(web_driver, headless, user_agent, __name__)

        try:
            driver.execute_cdp_cmd(
                "Page.addScriptToEvaluateOnNewDocument",
                {
                    "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
                },

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
from pathlib import Path

for path in Path("uk_bin_collection").rglob("common.py"):
    text = path.read_text()
    marker = "def create_webdriver("
    start = text.find(marker)
    if start == -1:
        continue
    rest = text[start:]
    next_def = rest.find("\ndef ", 1)
    snippet = rest if next_def == -1 else rest[:next_def]
    print(f"== {path} ==")
    print(snippet)
    print()
PY

rg -n -C2 'execute_cdp_cmd|Page.addScriptToEvaluateOnNewDocument|web_driver|Chrom(e|ium)' \
  uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py \
  uk_bin_collection/tests/input.json

Repository: robbrad/UKBinCollectionData

Length of output: 50383


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the FenlandDistrictCouncil.py content around the cited areas
python - <<'PY'
import itertools
from pathlib import Path

path = Path("uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py")
lines = path.read_text().splitlines()
for start,end in [(1,120),(120,220)]:
    print(f"\n===== {path} lines {start}-{end} =====")
    for i in range(start-1, min(end, len(lines))):
        print(f"{i+1:4d}: {lines[i]}")
PY

# Extract the Fenland entry from tests/input.json
python - <<'PY'
import json
from pathlib import Path

p = Path("uk_bin_collection/tests/input.json")
data = json.loads(p.read_text())
# Find key that matches Fenland (case-sensitive key lookup)
for k in list(data.keys()):
    if "Fenland" in k:
        print(f"\n=== input.json entry: {k} ===")
        print(json.dumps(data[k], indent=2)[:4000])
        break
else:
    print("No key containing 'Fenland' found in input.json top-level.")
PY

# Search within FenlandDistrictCouncil.py for how JSON is retrieved/parsed
rg -n "execute_async_script|features|upcoming|json|response|status|webdriver|CDP|addScriptToEvaluateOnNewDocument" \
  uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py

Repository: robbrad/UKBinCollectionData

Length of output: 4087


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find Fenland entry textually (avoid JSON decoding issues)
rg -n -C3 "Fenland" uk_bin_collection/tests/input.json | head -n 200

# Inspect create_webdriver implementation in common.py (to see if any enforcement/guards exist)
python - <<'PY'
from pathlib import Path
p = Path("uk_bin_collection/uk_bin_collection/common.py")
text = p.read_bytes().decode("utf-8", errors="replace").splitlines()
for i,line in enumerate(text, start=1):
    if "def create_webdriver" in line:
        start=i
        break
else:
    raise SystemExit("create_webdriver not found")
end = min(len(text), start+120)
for j in range(start, end+1):
    print(f"{j:4d}: {text[j-1]}")
PY

# Search for any other guard around execute_cdp_cmd / Chromium in the codebase
rg -n "execute_cdp_cmd\\(|Page\\.addScriptToEvaluateOnNewDocument|CDP|Chrom(e|ium)" uk_bin_collection/uk_bin_collection | head -n 200

Repository: robbrad/UKBinCollectionData

Length of output: 37389


Make the Chromium/CDP requirement explicit and validate the API response shape/status

  • FenlandDistrictCouncil.parse_data calls driver.execute_cdp_cmd(...), and create_webdriver() is Chrome/Chromium-specific (it uses webdriver.ChromeOptions and returns webdriver.Chrome). The Fenland web_driver, however, is documented and treated as generic Selenium, so a Firefox-backed remote node will fail before the first successful parse. Update FenlandDistrictCouncil's wiki_note (or the fixture/docs) to state Chrome/Chromium-only, or add an early runtime guard (e.g., check driver.capabilities["browserName"]).
  • The in-browser fetch() returns r.text() without checking the HTTP status, and the code then immediately evaluates json.loads(result)["features"][0]["properties"]["upcoming"] with no payload-shape validation. Non-200 responses, HTML pages, and empty features therefore produce opaque failures. Add explicit status handling and schema validation with clear errors.
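
The payload-shape validation asked for here could be sketched as below. `extract_upcoming` is a hypothetical helper name. The `features[0]["properties"]["upcoming"]` path and the `"ERROR:"` prefix come from the review comments, while the exact error messages are illustrative. HTTP status handling would still need to happen in the in-page script and is not shown.

```python
import json


def extract_upcoming(raw):
    """Validate the text returned by the in-page fetch and return `upcoming`."""
    if not isinstance(raw, str) or not raw.strip():
        raise ValueError("Empty response from API")
    if raw.startswith("ERROR:"):
        raise ValueError(f"In-browser fetch failed: {raw[len('ERROR:'):].strip()}")
    if raw.lstrip().startswith("<"):
        # Most likely an HTML challenge or error page rather than JSON.
        raise ValueError("Non-JSON or HTML response from API")
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("Non-JSON or HTML response from API") from exc

    features = payload.get("features") if isinstance(payload, dict) else None
    if not isinstance(features, list) or not features:
        raise ValueError("Unexpected API response: missing or empty features")

    first = features[0]
    properties = first.get("properties") if isinstance(first, dict) else None
    if not isinstance(properties, dict) or "upcoming" not in properties:
        raise ValueError("Unexpected API response: missing properties.upcoming")
    return properties["upcoming"]
```

Each failure mode raises a `ValueError` with its own message, so a blocked request, an HTML page, and an address with no features are distinguishable in the scraper's logs.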

Comment thread uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py Outdated