Excavate: extract IPv6 URLs (#1815) by ChrisJr404 · Pull Request #3071 · blacklanternsecurity/bbot

ChrisJr404 · 2026-05-03T23:07:25Z

Summary

Closes #1815 ("Excavate IPv6 URLs", filed by @TheTechromancer).

The url_full YARA rule and the two Python post-filters (full_url_regex / full_url_regex_strict) accepted only word-character / dotted hostnames in the host slot, so URLs with bracketed IPv6 hosts were dropped at extraction time:

http://[2001:db8::1]/api
http://[::1]:8080/path
https://[fe80::dead:beef]/foo/bar.html

This PR adds a \\[[0-9a-fA-F:]+\\] alternative to the host part of all three patterns. The bracketed form is preserved in the captured host so downstream parsers (urllib, etc.) still recognise the URL as IPv6.
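
As a rough illustration of the idea, here is a minimal sketch of a URL pattern whose host slot tries a bracketed IPv6-shaped branch first and falls back to a dotted word-character branch. The names HOST, URL_RE and extract_urls are hypothetical, and the pattern is a simplified stand-in, not the regex actually used in bbot.

```python
import re
from urllib.parse import urlsplit

# Hypothetical, simplified stand-in for the real bbot pattern (not copied from the repo).
# First branch: bracketed IPv6-shaped host. Second branch: dotted word-character host.
HOST = r"(?:\[[0-9a-fA-F:]+\]|\w[\w.-]*)"
URL_RE = re.compile(r"https?://(" + HOST + r")(?::\d{1,5})?(?:/[^\s\"'<>]*)?")


def extract_urls(text):
    """Return (full_url, captured_host) pairs found in text."""
    return [(m.group(0), m.group(1)) for m in URL_RE.finditer(text)]


sample = "see http://[2001:db8::1]/api and http://[::1]:8080/path or https://example.com/x"
results = extract_urls(sample)

for url, host in results:
    # The captured host keeps its brackets, so urlsplit still treats it as IPv6.
    print(url, host, urlsplit(url).hostname)
```

Because the brackets stay in the captured host, urllib.parse.urlsplit can separate the address from an optional port without ambiguity over the colons.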

Tests

bbot/test/test_step_1/test_excavate_url_regexes.py — 6 cases:

test_full_url_regex_matches_ipv6 — accepts 6 representative IPv6 URLs and verifies the captured host keeps the leading [.
test_full_url_regex_still_matches_existing_patterns — regression guard for plain DNS-name + IPv4 URLs.
A corresponding pair of tests for full_url_regex_strict.
A corresponding pair of tests for the YARA url_full rule (compiled directly from excavate.URLExtractor.yara_rules['url_full']).

$ pytest bbot/test/test_step_1/test_excavate_url_regexes.py -v
test_full_url_regex_matches_ipv6                              PASSED
test_full_url_regex_still_matches_existing_patterns           PASSED
test_full_url_regex_strict_matches_ipv6                       PASSED
test_full_url_regex_strict_still_matches_existing_patterns    PASSED
test_yara_url_rule_matches_ipv6                               PASSED
test_yara_url_rule_still_matches_existing_patterns            PASSED
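
The shape of an acceptance test paired with a regression guard can be sketched as follows. This is an assumption-laden illustration: the full_url_regex defined here is a hypothetical stand-in rather than the pattern shipped in bbot, and the URL lists are examples chosen for the sketch.

```python
import re

# Hypothetical stand-in for full_url_regex; the real pattern lives in bbot
# and is not reproduced here.
full_url_regex = re.compile(
    r"(\w{2,15})://((?:\[[0-9a-fA-F:]+\]|\w[\w.-]*)(?::\d{1,5})?)(/[^\s]*)?"
)

IPV6_URLS = [
    "http://[2001:db8::1]/api",
    "http://[::1]:8080/path",
    "https://[fe80::dead:beef]/foo/bar.html",
]
LEGACY_URLS = [
    "http://example.com/index.html",
    "https://sub.example.com:8443/",
    "http://192.0.2.10/admin",
]


def test_matches_ipv6():
    for url in IPV6_URLS:
        m = full_url_regex.fullmatch(url)
        assert m is not None, url
        # The captured host must keep its leading bracket.
        assert m.group(2).startswith("["), url


def test_still_matches_existing_patterns():
    # Regression guard: DNS-name and IPv4 URLs go through the second branch.
    for url in LEGACY_URLS:
        m = full_url_regex.fullmatch(url)
        assert m is not None, url
        assert not m.group(2).startswith("["), url
```

Both functions follow pytest's test_* naming convention, so pytest would collect them, and they can also be called directly.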

Notes

No behavioural change for existing DNS-name / IPv4 URLs: the original host pattern is preserved as the second branch of the new alternation.
The new patterns accept IPv6-shaped tokens regardless of whether they are valid addresses; downstream URL validation still runs (validators.validate_url_parsed) before the URL becomes a URL_UNVERIFIED event, so malformed inputs still get rejected one stage later.
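
To illustrate why a permissive extraction pattern is tolerable, here is a minimal sketch of a later validation step that rejects IPv6-shaped tokens which are not real addresses. It uses only the standard library; has_valid_ipv6_host is a hypothetical helper and does not reproduce what validators.validate_url_parsed does.

```python
import ipaddress
from urllib.parse import urlsplit


def has_valid_ipv6_host(url):
    """Return True if the URL's host parses as a real IPv6 address."""
    try:
        # urlsplit strips the brackets; newer Python versions may already
        # raise ValueError here for a malformed bracketed host.
        host = urlsplit(url).hostname
        if host is None:
            return False
        ipaddress.IPv6Address(host)
    except ValueError:
        return False
    return True


for candidate in (
    "http://[2001:db8::1]/api",  # well-formed
    "http://[::1]:8080/path",    # well-formed, with port
    "http://[:::::]/x",          # matches [0-9a-fA-F:]+ but is not an address
    "http://[12345::1]/",        # group too long
):
    print(candidate, has_valid_ipv6_host(candidate))
```

The extraction regex only has to be loose enough not to miss candidates; strictness is applied once, at the validation stage.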

The url_full YARA rule and the full_url_regex / full_url_regex_strict post-filters all required hosts to be word-character labels, so URLs with bracketed IPv6 hosts (http://[2001:db8::1]/, http://[::1]:8080/...) were dropped at extraction time.

Add a bracketed [0-9a-fA-F:]+ alternative to the host part of all three patterns so IPv6 URLs are emitted as URL_UNVERIFIED events alongside DNS-name URLs.

Adds bbot/test/test_step_1/test_excavate_url_regexes.py with 6 cases that pin both the new IPv6 acceptance and a regression guard for the existing DNS-name / IPv4 URLs.

Closes blacklanternsecurity#1815

github-actions · 2026-05-03T23:07:39Z

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

ChrisJr404 · 2026-05-03T23:08:48Z

I have read the CLA Document and I hereby sign the CLA

ChrisJr404 · 2026-05-04T01:16:40Z

recheck

ChrisJr404 · 2026-05-04T16:30:13Z

recheck

bls-cla-bot Bot added a commit to blacklanternsecurity/CLA that referenced this pull request May 3, 2026

@ChrisJr404 has signed the CLA in blacklanternsecurity/bbot#3071

b2313c4

ChrisJr404 wants to merge 1 commit into blacklanternsecurity:dev from ChrisJr404:feat/excavate-ipv6-1815