Excavate: extract IPv6 URLs (#1815) #3071
Open
ChrisJr404 wants to merge 1 commit into blacklanternsecurity:dev
The url_full YARA rule and the full_url_regex / full_url_regex_strict post-filters all required hosts to be word-character labels, so URLs with bracketed IPv6 hosts (http://[2001:db8::1]/, http://[::1]:8080/...) were dropped at extraction time.

Add a [0-9a-fA-F:]+ alternative to the host part of all three patterns so IPv6 URLs are emitted as URL_UNVERIFIED events alongside DNS-name URLs.

Adds bbot/test/test_step_1/test_excavate_url_regexes.py: 6 cases that pin both the new IPv6 acceptance and a regression guard for the existing DNS-name / IPv4 URLs.

Closes blacklanternsecurity#1815
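For reviewers who want to see the shape of the change in isolation, here is a minimal sketch of a host slot with a bracketed-IPv6 alternative next to a DNS-name / IPv4 alternative. The pattern names (`HOST_DNS`, `HOST_IPV6`, `URL_SKETCH`, `URL_DNS_ONLY`) and the helper `extract_hosts` are hypothetical and heavily simplified; this is not the actual `full_url_regex` or the `url_full` YARA rule from bbot.

```python
import re

# Hypothetical, simplified patterns for illustration only (not bbot's regexes).
HOST_DNS = r"\w[\w-]*(?:\.\w[\w-]*)*"   # word-character labels, also covers dotted IPv4
HOST_IPV6 = r"\[[0-9a-fA-F:]+\]"        # bracketed IPv6 literal, brackets kept in the match

TAIL = r"(?::\d{1,5})?(?:/[^\s\"'<>]*)?"  # optional port, optional path

# Host slot without the IPv6 alternative (the "before" behaviour in this sketch).
URL_DNS_ONLY = re.compile(r"https?://(?P<host>" + HOST_DNS + r")" + TAIL)

# Host slot with the IPv6 alternative added (the "after" behaviour in this sketch).
URL_SKETCH = re.compile(r"https?://(?P<host>" + HOST_IPV6 + r"|" + HOST_DNS + r")" + TAIL)


def extract_hosts(pattern, text):
    """Return the captured host of every URL that `pattern` finds in `text`."""
    return [m.group("host") for m in pattern.finditer(text)]


sample = (
    "see http://[2001:db8::1]/api and http://[::1]:8080/path "
    "plus https://example.com/x and http://192.168.0.1/"
)

before = extract_hosts(URL_DNS_ONLY, sample)
after = extract_hosts(URL_SKETCH, sample)
```

In this sketch `before` contains only the DNS-name and IPv4 hosts, while `after` also contains the two bracketed IPv6 hosts with their leading `[` intact.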
Contributor: All contributors have signed the CLA ✍️ ✅
Author: I have read the CLA Document and I hereby sign the CLA
bls-cla-bot (Bot) added a commit to blacklanternsecurity/CLA that referenced this pull request, May 3, 2026
Author: recheck
Summary
Closes #1815 ("Excavate IPv6 URLs", filed by @TheTechromancer).
The url_full YARA rule and the two Python post-filters (full_url_regex / full_url_regex_strict) all only accepted word-character / dotted hostnames in the host slot, so URLs with bracketed IPv6 hosts were dropped at extraction time:
- http://[2001:db8::1]/api
- http://[::1]:8080/path
- https://[fe80::dead:beef]/foo/bar.html

This PR adds a \\[[0-9a-fA-F:]+\\] alternative to the host part of all three patterns. The bracketed form is preserved in the captured host so downstream parsers (urllib, etc.) still recognise the URL as IPv6.

Tests
bbot/test/test_step_1/test_excavate_url_regexes.py (6 cases):
- test_full_url_regex_matches_ipv6: accepts 6 representative IPv6 URLs and verifies the captured host keeps the leading [.
- test_full_url_regex_still_matches_existing_patterns: regression guard for plain DNS-name + IPv4 URLs.
- full_url_regex_strict.
- url_full rule (compiled directly from excavate.URLExtractor.yara_rules['url_full']).

Notes
validators.validate_url_parsed) before the URL becomes a URL_UNVERIFIED event, so malformed inputs still get rejected one stage later.
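To illustrate why a permissive extraction pattern is acceptable when a stricter check runs afterwards, here is a rough stand-in for that later stage built only on the standard library. The helper `ipv6_host_or_none` is hypothetical and is not bbot's `validators.validate_url_parsed`; it only shows that `urllib` recognises the bracketed form and that a string such as `[dead:beef]`, which a hex-and-colon character class accepts, is still rejected once it is parsed as an address.

```python
import ipaddress
from urllib.parse import urlsplit


def ipv6_host_or_none(url):
    """Return an IPv6Address if `url` has a valid bracketed IPv6 host, else None.

    Hypothetical helper for illustration; not part of bbot.
    """
    try:
        # Newer Python versions validate bracketed hosts here and raise
        # ValueError; older ones defer, so the address is checked again below.
        host = urlsplit(url).hostname
    except ValueError:
        return None
    if host is None:
        return None
    try:
        # urlsplit strips the brackets, so `host` is the bare address text.
        return ipaddress.IPv6Address(host)
    except ValueError:
        return None


valid = ipv6_host_or_none("http://[2001:db8::1]/api")
with_port = ipv6_host_or_none("http://[::1]:8080/path")
malformed = ipv6_host_or_none("http://[dead:beef]/")   # passes a hex/colon class, not a real address
dns_name = ipv6_host_or_none("http://example.com/")
```

Here `valid` and `with_port` are `IPv6Address` objects, while `malformed` and `dns_name` are `None`.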