Excavate: extract IPv6 URLs (#1815) #3071

Open
ChrisJr404 wants to merge 1 commit into blacklanternsecurity:dev from ChrisJr404:feat/excavate-ipv6-1815

@ChrisJr404

Summary

Closes #1815 ("Excavate IPv6 URLs", filed by @TheTechromancer).

The url_full YARA rule and the two Python post-filters (full_url_regex / full_url_regex_strict) accepted only word-character / dotted hostnames in the host slot, so URLs with bracketed IPv6 hosts were dropped at extraction time:

  • http://[2001:db8::1]/api
  • http://[::1]:8080/path
  • https://[fe80::dead:beef]/foo/bar.html

This PR adds a \\[[0-9a-fA-F:]+\\] alternative to the host part of all three patterns. The bracketed form is preserved in the captured host so downstream parsers (urllib, etc.) still recognise the URL as IPv6.
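A minimal sketch of the idea, using a simplified pattern rather than the actual bbot regexes: the names HOST, URL_RE, and extract_hosts are hypothetical and exist only for illustration. It shows a bracketed IPv6 branch placed ahead of a DNS-name / IPv4 branch, with the brackets kept in the captured host.

```python
import re

# Hypothetical, simplified host pattern (NOT the real full_url_regex).
# Branch 1: bracketed IPv6-shaped token. Branch 2: dotted DNS name or IPv4.
HOST = r"(?:\[[0-9a-fA-F:]+\]|[\w-]+(?:\.[\w-]+)*)"
URL_RE = re.compile(r"(https?)://(" + HOST + r")(?::(\d{1,5}))?(/[^\s\"'<>]*)?")


def extract_hosts(text):
    """Return the captured host of every URL found in text."""
    return [m.group(2) for m in URL_RE.finditer(text)]


sample = (
    "a http://[2001:db8::1]/api "
    "b http://[::1]:8080/path "
    "c https://example.com/x "
    "d http://192.0.2.7/"
)
hosts = extract_hosts(sample)
# The IPv6 hosts keep their brackets; DNS names and IPv4 hosts are unchanged.
```

Because the bracketed branch only checks the shape of the token, something like http://[::::]/ would also match here; rejecting it is left to a later validation step.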

Tests

bbot/test/test_step_1/test_excavate_url_regexes.py — 6 cases:

  • test_full_url_regex_matches_ipv6 — accepts 6 representative IPv6 URLs and verifies the captured host keeps the leading [.
  • test_full_url_regex_still_matches_existing_patterns — regression guard for plain DNS-name + IPv4 URLs.
  • A corresponding pair of tests for full_url_regex_strict.
  • A corresponding pair of tests for the YARA url_full rule (compiled directly from excavate.URLExtractor.yara_rules['url_full']).

$ pytest bbot/test/test_step_1/test_excavate_url_regexes.py -v
test_full_url_regex_matches_ipv6                              PASSED
test_full_url_regex_still_matches_existing_patterns           PASSED
test_full_url_regex_strict_matches_ipv6                       PASSED
test_full_url_regex_strict_still_matches_existing_patterns    PASSED
test_yara_url_rule_matches_ipv6                               PASSED
test_yara_url_rule_still_matches_existing_patterns            PASSED

Notes

  • No behavioural change for existing DNS-name / IPv4 URLs: the original host pattern is preserved as the second branch of the alternation.
  • The new patterns accept IPv6-shaped tokens regardless of whether they are valid addresses; downstream URL validation still runs (validators.validate_url_parsed) before the URL becomes a URL_UNVERIFIED event, so malformed inputs still get rejected one stage later.
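To illustrate the second note, here is a sketch of how a later stage could reject IPv6-shaped tokens that are not valid addresses. It uses only the standard library and is a stand-in, not bbot's validators.validate_url_parsed; the function name is_valid_ipv6_url is hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def is_valid_ipv6_url(url):
    """Hypothetical downstream check: True only if the URL host is a valid IPv6 address."""
    try:
        # urlsplit strips the brackets from the host; newer Python versions
        # may raise ValueError here for malformed bracketed hosts.
        host = urlsplit(url).hostname
    except ValueError:
        return False
    if host is None:
        return False
    try:
        ipaddress.IPv6Address(host)
    except ValueError:
        return False
    return True


ok = is_valid_ipv6_url("http://[2001:db8::1]/api")
bad = is_valid_ipv6_url("http://[::::]/")
```

The extraction regex stays permissive and cheap, while the address check happens once per candidate URL in the validator.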

The url_full YARA rule and the full_url_regex / full_url_regex_strict
post-filters all required hosts to be word-character labels, so URLs
with bracketed IPv6 hosts (http://[2001:db8::1]/, http://[::1]:8080/...)
were dropped at extraction time. Add a bracketed [0-9a-fA-F:]+ alternative to the
host part of all three patterns so IPv6 URLs are emitted as
URL_UNVERIFIED events alongside DNS-name URLs.

Adds bbot/test/test_step_1/test_excavate_url_regexes.py — 6 cases that
pin both the new IPv6 acceptance and a regression guard for the
existing DNS-name / IPv4 URLs.

Closes blacklanternsecurity#1815
@github-actions
Contributor

github-actions Bot commented May 3, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@ChrisJr404
Author

I have read the CLA Document and I hereby sign the CLA

bls-cla-bot Bot added a commit to blacklanternsecurity/CLA that referenced this pull request May 3, 2026
@ChrisJr404
Author

recheck

