fix: rewrite fenland scraper to use selenium for cloudflare bypass #2085
InertiaUK wants to merge 2 commits into
Fenland's GIS endpoint is now behind a Cloudflare JS challenge, which blocks direct HTTP requests. This PR rewrites the scraper to use Selenium: it loads the page to pass the challenge, then fetches the JSON API from the browser context.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough
The Fenland District Council scraper replaces its requests-based JSON fetch with a Selenium-driven flow to get past Cloudflare protection. The core logic sets a Chrome user agent, creates a WebDriver, injects a CDP script to hide navigator.webdriver, waits for the challenge interstitial to clear, executes an async in-page fetch, and parses the returned JSON. Test fixtures are updated for the Fenland configuration and other councils.
Changes
Fenland Scraper Selenium Integration and Council Test Fixtures
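The first half of the flow described in the walkthrough (hide `navigator.webdriver`, load the page, wait for the interstitial to clear) could look roughly like the sketch below. It is not the PR's code: the function names, the title markers in `CHALLENGE_TITLES`, and the timeout values are assumptions for illustration, and `driver` is any Selenium Chrome-style driver exposing `execute_cdp_cmd`, `get`, and `title`.

```python
import time

# CDP snippet quoted in the review excerpt; hides navigator.webdriver from page scripts.
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)

# Assumption: page titles that suggest the Cloudflare interstitial is still showing.
CHALLENGE_TITLES = ("just a moment", "attention required")


def hide_webdriver_flag(driver):
    """Register the script so it runs before any page script (Chromium/CDP only)."""
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": HIDE_WEBDRIVER_JS}
    )


def wait_for_challenge(driver, timeout=30.0, poll=0.5):
    """Poll the page title until the interstitial clears or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        title = (driver.title or "").lower()
        if not any(marker in title for marker in CHALLENGE_TITLES):
            return
        time.sleep(poll)
    raise TimeoutError("Cloudflare challenge did not clear in time")


def load_past_challenge(driver, page_url, timeout=30.0, poll=0.5):
    """Hide the automation flag, open the page, then wait out the challenge."""
    hide_webdriver_flag(driver)
    driver.get(page_url)
    wait_for_challenge(driver, timeout, poll)
```

Polling the title is only one possible heuristic; waiting for a known element of the real page would be a stricter signal if one is available.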
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2085   +/-   ##
=======================================
  Coverage   86.67%   86.67%
=======================================
  Files           9        9
  Lines        1141     1141
=======================================
  Hits          989      989
  Misses        152      152
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@uk_bin_collection/tests/input.json`:
- Line 882: The wiki_note value contains a mojibake replacement character in the
string ("property�you"); update the "wiki_note" string to replace that
replacement character with a proper separator (e.g., a dash or a period + space)
so it reads correctly (for example: "property - you can use..." or "property.
You can use...") ensuring the rest of the text, including the UPRN placeholder
and FindMyAddress link, remains unchanged.
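A check like the following could catch this class of problem before review. It is a minimal sketch, assuming the fixture is a nested JSON object of strings; `find_replacement_chars` is a hypothetical helper, not something in the repository.

```python
import json

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, what a failed decode leaves behind


def find_replacement_chars(node, path=""):
    """Yield the path of every string value that contains U+FFFD."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_replacement_chars(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_replacement_chars(value, f"{path}/{index}")
    elif isinstance(node, str) and REPLACEMENT_CHAR in node:
        yield path


# Illustrative data only; the real fixture has many more fields.
sample = json.loads(
    '{"FenlandDistrictCouncil": {"wiki_note": "property\\ufffdyou", "url": "x"}}'
)
bad_paths = list(find_replacement_chars(sample))
```

Against the real file, the text would need to be decoded first, for example with `Path(...).read_bytes().decode("utf-8", errors="replace")` as one of the review scripts does, so that invalid bytes also surface as U+FFFD.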
In `@uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py`:
- Around line 50-69: The fetch result from driver.execute_async_script(api_url)
is assumed to be valid JSON with features[0] but can be an HTML error page,
non-JSON, or a JSON without features; update the parsing around the
execute_async_script call in FenlandDistrictCouncil (the block that currently
checks result.startswith("ERROR:") and then does
json.loads(result)["features"][0]["properties"]["upcoming"]) to: first verify
the result is not an HTML/error page and that it is valid JSON (catch
JSONDecodeError), then load the JSON and explicitly check that the top-level
"features" key exists, is a list, and is non-empty before accessing features[0];
if any check fails, raise a descriptive ValueError (e.g., "Unexpected API
response: missing or empty features" or "Non-JSON or HTML response from API") so
failures are explicit and scraper-specific.
- Around line 28-35: FenlandDistrictCouncil.parse_data currently calls
driver.execute_cdp_cmd (and relies on create_webdriver returning a Chrome
webdriver) without guarding against non-Chromium browsers and assumes the
fetched in-page payload is a JSON with features[0].properties.upcoming; update
the class so either its wiki_note clearly states Chrome/Chromium-only OR add an
early runtime guard in FenlandDistrictCouncil.parse_data that checks
driver.capabilities["browserName"] (or equivalent) and raises a clear error if
not Chrome/Chromium, and modify the fetch result handling to explicitly check
the HTTP status and content-type, parse JSON safely, validate that the top-level
"features" is a non-empty list and that features[0]["properties"]["upcoming"]
exists (raising descriptive errors if any check fails) before accessing the
value.
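The payload checks requested in the two comments above could be sketched as a single helper. This is an illustration under assumptions, not the PR's implementation: `extract_upcoming` is a hypothetical name, the `ERROR:` prefix follows the convention the review attributes to the current code, and the error messages reuse the wording suggested by the reviewer.

```python
import json


def extract_upcoming(result):
    """Return features[0]["properties"]["upcoming"] or raise a descriptive ValueError."""
    if not isinstance(result, str) or not result.strip():
        raise ValueError("Empty response from API")
    if result.startswith("ERROR:"):
        raise ValueError(f"In-page fetch failed: {result}")
    if result.lstrip().startswith("<"):
        # Most likely a Cloudflare or server error page rather than JSON.
        raise ValueError("Non-JSON or HTML response from API")
    try:
        payload = json.loads(result)
    except json.JSONDecodeError as exc:
        raise ValueError("Non-JSON or HTML response from API") from exc

    features = payload.get("features") if isinstance(payload, dict) else None
    if not isinstance(features, list) or not features:
        raise ValueError("Unexpected API response: missing or empty features")

    first = features[0]
    properties = first.get("properties") if isinstance(first, dict) else None
    if not isinstance(properties, dict) or "upcoming" not in properties:
        raise ValueError("Unexpected API response: missing properties.upcoming")
    return properties["upcoming"]
```

Raising `ValueError` at each step keeps failures scraper-specific, so a blocked request does not surface as a bare `KeyError` or `IndexError`.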
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cce6ca62-7e3f-48b1-bd4c-eaf30aecd31c
📒 Files selected for processing (2)
uk_bin_collection/tests/input.json
uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py
    driver = create_webdriver(web_driver, headless, user_agent, __name__)

    try:
        driver.execute_cdp_cmd(
            "Page.addScriptToEvaluateOnNewDocument",
            {
                "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
            },
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python - <<'PY'
from pathlib import Path
for path in Path("uk_bin_collection").rglob("common.py"):
    text = path.read_text()
    marker = "def create_webdriver("
    start = text.find(marker)
    if start == -1:
        continue
    rest = text[start:]
    next_def = rest.find("\ndef ", 1)
    snippet = rest if next_def == -1 else rest[:next_def]
    print(f"== {path} ==")
    print(snippet)
    print()
PY
rg -n -C2 'execute_cdp_cmd|Page.addScriptToEvaluateOnNewDocument|web_driver|Chrom(e|ium)' \
  uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py \
  uk_bin_collection/tests/input.json
Repository: robbrad/UKBinCollectionData
Length of output: 50383
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show the FenlandDistrictCouncil.py content around the cited areas
python - <<'PY'
import itertools
from pathlib import Path
path = Path("uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py")
lines = path.read_text().splitlines()
for start,end in [(1,120),(120,220)]:
    print(f"\n===== {path} lines {start}-{end} =====")
    for i in range(start-1, min(end, len(lines))):
        print(f"{i+1:4d}: {lines[i]}")
PY
# Extract the Fenland entry from tests/input.json
python - <<'PY'
import json
from pathlib import Path
p = Path("uk_bin_collection/tests/input.json")
data = json.loads(p.read_text())
# Find key that matches Fenland (case-sensitive key lookup)
for k in list(data.keys()):
    if "Fenland" in k:
        print(f"\n=== input.json entry: {k} ===")
        print(json.dumps(data[k], indent=2)[:4000])
        break
else:
    print("No key containing 'Fenland' found in input.json top-level.")
PY
# Search within FenlandDistrictCouncil.py for how JSON is retrieved/parsed
rg -n "execute_async_script|features|upcoming|json|response|status|webdriver|CDP|addScriptToEvaluateOnNewDocument" \
  uk_bin_collection/uk_bin_collection/councils/FenlandDistrictCouncil.py
Repository: robbrad/UKBinCollectionData
Length of output: 4087
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find Fenland entry textually (avoid JSON decoding issues)
rg -n -C3 "Fenland" uk_bin_collection/tests/input.json | head -n 200
# Inspect create_webdriver implementation in common.py (to see if any enforcement/guards exist)
python - <<'PY'
from pathlib import Path
p = Path("uk_bin_collection/uk_bin_collection/common.py")
text = p.read_bytes().decode("utf-8", errors="replace").splitlines()
for i,line in enumerate(text, start=1):
    if "def create_webdriver" in line:
        start=i
        break
else:
    raise SystemExit("create_webdriver not found")
end = min(len(text), start+120)
for j in range(start, end+1):
    print(f"{j:4d}: {text[j-1]}")
PY
# Search for any other guard around execute_cdp_cmd / Chromium in the codebase
rg -n "execute_cdp_cmd\\(|Page\\.addScriptToEvaluateOnNewDocument|CDP|Chrom(e|ium)" uk_bin_collection/uk_bin_collection | head -n 200
Repository: robbrad/UKBinCollectionData
Length of output: 37389
Make the Chromium/CDP requirement explicit and validate the API response shape/status
- `FenlandDistrictCouncil.parse_data` calls `driver.execute_cdp_cmd(...)` and `create_webdriver()` is Chrome/Chromium-specific (`webdriver.ChromeOptions`, returns `webdriver.Chrome`), but the Fenland `web_driver` is documented/treated as generic Selenium. If the remote node is Firefox-backed, this will fail before the first successful parse. Update `FenlandDistrictCouncil`'s `wiki_note` (or fixture/docs) to state Chrome/Chromium-only, or add an early runtime guard (e.g., check `driver.capabilities["browserName"]`).
- The in-browser `fetch()` returns `r.text()` without checking HTTP status, then the code immediately does `json.loads(result)["features"][0]["properties"]["upcoming"]` with no payload-shape validation, producing opaque failures for non-200/HTML/empty `features`. Add explicit status handling and schema validation with clear errors.
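Both points can be illustrated with a short sketch. Everything named here (`require_chromium`, `fetch_json_in_page`, `FETCH_JS`, the `CHROMIUM_NAMES` set, and the JSON envelope format) is an assumption for illustration rather than the PR's code; only `driver.capabilities` and `driver.execute_async_script` are standard Selenium members. The in-page script reports the HTTP status and content type alongside the body, so the Python side can reject challenge pages before parsing.

```python
import json

# Runs inside the page; Selenium passes its async callback as the last argument.
FETCH_JS = """
const done = arguments[arguments.length - 1];
fetch(arguments[0])
  .then(r => r.text().then(body => done(JSON.stringify({
    status: r.status,
    contentType: r.headers.get('content-type') || '',
    body: body
  }))))
  .catch(e => done(JSON.stringify({error: String(e)})));
"""

# Assumption: browserName values treated as CDP-capable.
CHROMIUM_NAMES = {"chrome", "chromium"}


def require_chromium(driver):
    """Fail early if the session cannot be expected to support execute_cdp_cmd."""
    name = str(driver.capabilities.get("browserName", "")).lower()
    if name not in CHROMIUM_NAMES:
        raise RuntimeError(f"Fenland scraper needs Chrome/Chromium, got {name!r}")


def fetch_json_in_page(driver, api_url):
    """Fetch api_url from the browser context and validate status and type."""
    envelope = json.loads(driver.execute_async_script(FETCH_JS, api_url))
    if "error" in envelope:
        raise ValueError(f"In-page fetch failed: {envelope['error']}")
    if envelope["status"] != 200:
        raise ValueError(f"API returned HTTP {envelope['status']}")
    if "json" not in envelope["contentType"].lower():
        raise ValueError(f"Unexpected content type: {envelope['contentType']!r}")
    return json.loads(envelope["body"])
```

The content-type test is deliberately loose (any type containing `json`); if the endpoint serves JSON under a different type, that check would need relaxing.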
Summary
- […] (`/article/13114/?type=loadlayer`) is now behind Cloudflare JS challenge
- […] `requests` library get 403 with Cloudflare challenge page
- […] `execute_async_script`
- […] `web_driver` to input.json since this is now a Selenium-based scraper
Changes
- `FenlandDistrictCouncil.py` - rewritten from `requests.get` to Selenium + in-browser fetch
- `input.json` - added `web_driver` field and updated `wiki_note`
Test plan