Fix: html2text ignores base tag when resolving relative links (#1680) #1681

Joorrit · 2025-12-27T11:23:39Z

Summary

The html2text module ignored the HTML <base> tag. As a result, when a <base> tag was present in the document head, relative links were resolved against the page URL instead of the base URL. This PR adds logic to detect and respect the <base> tag during parsing.

List of files changed and why

crawl4ai/html2text/__init__.py - Updated the handle_tag method to check for the base tag and update self.baseurl with the provided href. This ensures all subsequent relative links in the document are resolved against this new base URL.
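For readers who want to see the idea in isolation, here is a minimal sketch of base-tag handling built on the standard library's html.parser, not the actual diff in this PR. The class name BaseAwareLinkCollector and the helper collect_links are hypothetical. Only the baseurl attribute name mirrors the description above. The sketch also assumes that only the first <base> with an href should count, which is what the HTML standard specifies; whether the PR does the same is not stated here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseAwareLinkCollector(HTMLParser):
    """Hypothetical stand-in for a converter; not crawl4ai's real class."""

    def __init__(self, baseurl=""):
        super().__init__()
        self.baseurl = baseurl    # starts out as the page URL
        self.base_seen = False    # assumption: honor only the first <base href>
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "base" and href and not self.base_seen:
            # A <base> href may itself be relative, so resolve it
            # against the current base (the page URL) first.
            self.baseurl = urljoin(self.baseurl, href)
            self.base_seen = True
        elif tag == "a" and href:
            # Every later relative link resolves against the updated base.
            self.links.append(urljoin(self.baseurl, href))


def collect_links(html, page_url):
    """Return the absolute URL of every <a href> found in html."""
    parser = BaseAwareLinkCollector(baseurl=page_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Because the base URL is updated as soon as the <base> tag is encountered, the order of tags matters: links that appear before the <base> tag in the stream would still resolve against the page URL in this sketch.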

How Has This Been Tested?

I verified this change using a local reproduction script with the following test case:

Test Scenario:

Page URL: https://example.com/subfolder/page.html

Input HTML:

<html>
<head>
    <base href="https://example.com/">
</head>
<body>
    <a href="files/document.pdf">Link</a>
</body>
</html>

Behavior Before Fix: The link resolved relative to the page URL: https://example.com/subfolder/files/document.pdf
Behavior After Fix: The link correctly resolves relative to the base tag: https://example.com/files/document.pdf
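
The two resolved URLs above can be checked independently of html2text with urllib.parse.urljoin from the standard library. This is only a sanity check of the URL arithmetic under the scenario's inputs, not the reproduction script used for the PR; the constant names are chosen for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = "https://example.com/subfolder/page.html"
BASE_HREF = "https://example.com/"   # value of the <base href> in the test HTML
LINK_HREF = "files/document.pdf"     # value of the <a href> in the test HTML

# Before the fix: the link is resolved against the page URL.
before_fix = urljoin(PAGE_URL, LINK_HREF)

# After the fix: the base href is resolved first, then the link against it.
effective_base = urljoin(PAGE_URL, BASE_HREF)
after_fix = urljoin(effective_base, LINK_HREF)

print(before_fix)
print(after_fix)
```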

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added/updated unit tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Fix: html2text ignores base tag when resolving relative links (Issue u…

fdb828b

…nclecode#1680)

Joorrit closed this Dec 27, 2025
