Skip to content

fix: make South Lanarkshire schedule parsing defensive#2077

Closed
stusemple wants to merge 2 commits into
robbrad:masterfrom
stusemple:fix-south-lanarkshire-index-error
Closed

fix: make South Lanarkshire schedule parsing defensive#2077
stusemple wants to merge 2 commits into
robbrad:masterfrom
stusemple:fix-south-lanarkshire-index-error

Conversation

@stusemple
Copy link
Copy Markdown
Contributor

@stusemple stusemple commented May 13, 2026

Closes #2018.

Summary

This makes the South Lanarkshire Council parser more defensive when reading the collection schedule table.

The current parser assumes that every schedule has at least two space-separated parts:

row.find("td").get_text().strip().split(" ")[1]

For some South Lanarkshire rows this can raise:

IndexError: list index out of range

This PR stores the and cells, skips malformed rows, and safely defaults schedule_cadence to an empty string when there is no second token.

Testing

  • Ran SouthLanarkshireCouncil manually against a South Lanarkshire URL
  • Ran:

poetry run pytest uk_bin_collection/tests/step_defs/ -k "SouthLanarkshireCouncil"

Summary by CodeRabbit

  • Bug Fixes
    • Improved parsing robustness for South Lanarkshire bin collection schedules to better handle unexpected or malformed rows and skipped area-only entries.
    • More reliable extraction of collection cadence by normalizing schedule text before analysis.
    • Added diagnostic warnings to reduce silent failures during schedule retrieval.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 34750520-8388-4514-9dd7-0ddaf59884e1

📥 Commits

Reviewing files that changed from the base of the PR and between 82c4aca and c092a4e.

📒 Files selected for processing (1)
  • uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py

📝 Walkthrough

Walkthrough

Module-level logging and defensive parsing were added to SouthLanarkshireCouncil.parse_data: header/detail cells are extracted and validated, rows with missing cells or "Area" are skipped, td text is normalized once, cadence is derived with a guarded split, and the date regex runs against the full normalized td text.

Changes

South Lanarkshire schedule parsing

Layer / File(s) Summary
Module logging
uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py
Added import logging and logger = logging.getLogger(__name__) to emit runtime diagnostics during parsing.
Cell extraction and missing-cell handling
uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py
Refactored loop to extract th_cell and td_cell variables; skip rows when either cell is missing and log a warning.
Skip Area rows
uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py
Detect and skip rows where th_cell text equals "Area", with a debug log.
Normalized td text, guarded cadence split, regex on full text
uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py
Compute a single td_text = td_cell.get_text().strip(); derive schedule_cadence using a guarded split (substring after first space or empty string if none) and apply existing date regex to td_text rather than indexing split parts—prevents IndexError for unexpected formats.

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested Reviewers

  • dp247

Poem

🐰 I hop through tables, th and td in view,
Warning when a cell is missing, skipping through,
I normalize the text and split with care,
Regex hunts the date — no errors there! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'fix: make South Lanarkshire schedule parsing defensive' directly and clearly describes the main change—adding defensive parsing to prevent IndexError when accessing schedule data.
Linked Issues check ✅ Passed The PR addresses all core coding objectives from issue #2018: preventing IndexError by handling missing schedule tokens, safely defaulting schedule_cadence, and making the parser robust against malformed rows.
Out of Scope Changes check ✅ Passed All changes in SouthLanarkshireCouncil.py are directly scoped to fixing the identified schedule parsing issue; no unrelated modifications were introduced.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py`:
- Line 87: The current defensive split that sets schedule_cadence =
td_text.split(" ", 1)[1] if " " in td_text else "" should be changed to validate
and fail loudly: in the parsing code that assigns schedule_cadence from td_text
(look for the schedule_cadence variable and the method that processes td_text in
SouthLanarkshireCouncil), check for the expected space separator and either
raise a ValueError with a clear message including td_text when the format is
unexpected, or at minimum emit a logger.warning indicating malformed schedule
text; then perform the split (td_text.split(" ", 1)[1]) only after validation so
downstream checks like "Fortnightly" in schedule_cadence are not silently
skipped.
- Around line 73-77: Add a logger and emit warnings/debugs when rows are
skipped: import logging and create logger = logging.getLogger(__name__), then in
the parsing loop where th_cell and td_cell are checked (the block around th_cell
= row.find("th") / td_cell = row.find("td")), replace the silent continue with
logger.warning("Skipping row: missing th or td cell") before continue; likewise,
where schedule_type == "Area" is silently skipped, add logger.debug("Skipping
area header row") before that continue. Reference these changes in the
SouthLanarkshireCouncil parser method where th_cell, td_cell and schedule_type
are used.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dddbb33e-3700-4b77-9da3-0cd5b6d38d37

📥 Commits

Reviewing files that changed from the base of the PR and between 8ecf878 and 82c4aca.

📒 Files selected for processing (1)
  • uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py

Comment thread uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py Outdated
@robbrad
Copy link
Copy Markdown
Owner

robbrad commented Jun 2, 2026

Merged into the June 2026 release PR #2110. Thanks!

@robbrad robbrad closed this Jun 2, 2026
ChrisJHawley pushed a commit to ChrisJHawley/UKBinCollectionData that referenced this pull request Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Index out of Range - South Lanarkshire

2 participants