Add checks for broken docs urls #6448
Conversation
Merging this PR will not alter performance
Greptile Summary

This PR adds a new GitHub Actions workflow and Python script that validate internal /docs links in the documentation against the site's generated sitemap.
Confidence Score: 4/5

Safe to merge after addressing the single-quoted title regex gap; otherwise the tool works correctly. One P1 logic issue: single-quoted Markdown link titles are not stripped from the captured URL, causing false-positive "not found in sitemap" errors. All other logic (fragment/query stripping for the underscore check, sitemap prefix normalization, skip-dirs) is correct and well tested.

Affected file: docs/app/scripts/check_doc_links.py — specifically the LINK_RE constant on line 25.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[GitHub Actions Trigger\npull_request / push to main\nwith docs path filter] --> B[Checkout & Setup Build Env\npython 3.14 + uv sync]
B --> C[uv run reflex export\n--frontend-only --no-zip\nGenerates .web/public/sitemap.xml]
C --> D[uv run python\nscripts/check_doc_links.py]
D --> E[load_sitemap_paths\nParse sitemap.xml → set of normalized paths]
D --> F[iter_md_files\nrglob *.md, skip SKIP_DIRS]
F --> G[iter_md_links\nMatch LINK_RE on each line]
G --> H{For each raw URL}
H --> I{Underscore in path_only?}
I -- Yes --> J[Append underscore error]
I -- No --> K{sitemap_key in valid_paths?}
J --> K
K -- No --> L[Append not-found error]
K -- Yes --> M[OK]
L --> N{Any errors?}
J --> N
M --> N
N -- Yes --> O[Print errors to stderr\nExit 1 → CI fails]
N -- No --> P[Print success\nExit 0]
```
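The per-link decision logic in the flowchart can be sketched in a few lines of Python. This is a minimal sketch, not the actual script: `load_sitemap_paths` here takes XML text rather than a file path, `check_line` stands in for the iteration helpers, and the prefix normalization is a simplified guess at what the real script does.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# LINK_RE as shown in the PR diff (pre-fix version).
LINK_RE = re.compile(r"\]\(\s*(/docs(?=[/)#?\s])[^)]*?)(?:\s+\"[^\"]*\")?\s*\)")


def load_sitemap_paths(xml_text: str) -> set:
    """Parse sitemap.xml text into a set of normalized URL paths."""
    root = ET.fromstring(xml_text)
    paths = set()
    for elem in root.iter():
        if elem.tag.endswith("loc"):
            paths.add(urlparse(elem.text.strip()).path.rstrip("/") or "/")
    return paths


def check_line(line: str, valid_paths: set) -> list:
    """Return error strings for each /docs link found on a Markdown line."""
    errors = []
    for m in LINK_RE.finditer(line):
        raw = m.group(1)
        # urlparse().path drops any #fragment and ?query before the checks.
        path_only = urlparse(raw).path
        if "_" in path_only:
            errors.append(f"underscore in {raw}")
        # Simplified stand-in for the script's sitemap-key normalization.
        key = path_only.removeprefix("/docs").rstrip("/") or "/"
        if key not in valid_paths:
            errors.append(f"{raw} not found in sitemap")
    return errors
```

A file-level driver would run `check_line` over every line of every non-skipped `*.md` file, collect the errors, and exit nonzero if any were found.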
Reviews (2): last reviewed commit "updates"

@greptile-apps re-review
```python
from pathlib import Path
from urllib.parse import urlparse


LINK_RE = re.compile(r"\]\(\s*(/docs(?=[/)#?\s])[^)]*?)(?:\s+\"[^\"]*\")?\s*\)")
```
The regex captures single-quoted Markdown link titles into `raw`. For a link like `[text](/docs/foo 'My Title')`, the optional title group `(?:\s+"[^"]*")?` requires double quotes, so it won't strip the `'...'` text. Instead, `[^)]*?` absorbs the trailing space plus title, making `raw = "/docs/foo 'My Title'"`. The subsequent sitemap lookup then tries `_strip_docs_prefix(_normalize("/docs/foo 'My Title'"))` → `/foo 'My Title'`, which is never in the sitemap, producing a spurious "not found" error for every single-quoted-title link in the docs.
Suggested change:

```diff
-LINK_RE = re.compile(r"\]\(\s*(/docs(?=[/)#?\s])[^)]*?)(?:\s+\"[^\"]*\")?\s*\)")
+LINK_RE = re.compile(r"\]\(\s*(/docs(?=[/)#?\s])[^)]*?)(?:\s+(?:\"[^\"]*\"|'[^']*'|\([^)]*\)))?\s*\)")
```
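The failure mode and the proposed fix can be checked directly in a Python REPL; both patterns below are copied from the diff (the suggested pattern also accepts CommonMark's `'...'` and `(...)` title forms):

```python
import re

# Original pattern from the PR: the title group only handles double quotes.
LINK_RE = re.compile(r"\]\(\s*(/docs(?=[/)#?\s])[^)]*?)(?:\s+\"[^\"]*\")?\s*\)")

# Suggested pattern: also strips '...' and (...) titles from the captured URL.
FIXED_RE = re.compile(
    r"\]\(\s*(/docs(?=[/)#?\s])[^)]*?)(?:\s+(?:\"[^\"]*\"|'[^']*'|\([^)]*\)))?\s*\)"
)

line = "[text](/docs/foo 'My Title')"
print(LINK_RE.search(line).group(1))   # title leaks into the URL: /docs/foo 'My Title'
print(FIXED_RE.search(line).group(1))  # clean path: /docs/foo
```

Double-quoted titles are unaffected either way, since both patterns strip `"..."` before the closing parenthesis.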
No description provided.