@Joorrit Joorrit commented Dec 27, 2025

Summary

Fixes #1680

The html2text module previously ignored the HTML <base> tag. When a <base> tag was present in the document head, relative links were therefore resolved against the page URL instead of the base URL. This PR adds logic to detect and respect the <base> tag during parsing.

List of files changed and why

  • crawl4ai/html2text/__init__.py - Updated the handle_tag method to check for the base tag and update self.baseurl with the provided href. This ensures all subsequent relative links in the document are resolved against this new base URL.
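
As a rough illustration of the idea described above (and not the actual crawl4ai code), the sketch below uses the standard library's html.parser.HTMLParser. The class name BaseAwareLinkCollector, the helper collect_links, and the base_seen flag are hypothetical names introduced here. Honoring only the first <base href> follows the HTML specification; whether the PR does the same is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseAwareLinkCollector(HTMLParser):
    """Hypothetical stand-in for a parser that tracks a mutable base URL."""

    def __init__(self, baseurl):
        super().__init__()
        self.baseurl = baseurl   # starts as the page URL
        self.base_seen = False   # only the first <base href> is honored
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if tag == "base" and href and not self.base_seen:
            # The base href may itself be relative, so resolve it
            # against the current base (the page URL) first.
            self.baseurl = urljoin(self.baseurl, href)
            self.base_seen = True
        elif tag == "a" and href:
            # Links after the <base> tag resolve against the new base.
            self.links.append(urljoin(self.baseurl, href))


def collect_links(html, page_url):
    """Return the absolute URLs of all <a href> links in html."""
    parser = BaseAwareLinkCollector(page_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Keeping the base URL as parser state means every link handled after the <base> tag picks up the new base without any extra bookkeeping, which matches the "all subsequent relative links" behavior described in the bullet above.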

How Has This Been Tested?

I verified this change using a local reproduction script with the following test case:

Test Scenario:

  • Page URL: https://example.com/subfolder/page.html
  • Input HTML:
    <html>
    <head>
        <base href="https://example.com/">
    </head>
    <body>
        <a href="files/document.pdf">Link</a>
    </body>
    </html>
  • Behavior Before Fix: The link resolved relative to the page URL: https://example.com/subfolder/files/document.pdf
  • Behavior After Fix: The link correctly resolves relative to the base tag: https://example.com/files/document.pdf
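
The two results above can be reproduced with urllib.parse.urljoin from the standard library. This is a minimal sketch of the URL arithmetic only, not the reproduction script mentioned in this PR, and the variable names are illustrative.

```python
from urllib.parse import urljoin

page_url = "https://example.com/subfolder/page.html"
base_href = "https://example.com/"
link_href = "files/document.pdf"

# Before the fix: the <base> tag is ignored, so the page URL is the base.
before_fix = urljoin(page_url, link_href)

# After the fix: the <base> href becomes the base for relative links.
effective_base = urljoin(page_url, base_href)
after_fix = urljoin(effective_base, link_href)

print(before_fix)  # https://example.com/subfolder/files/document.pdf
print(after_fix)   # https://example.com/files/document.pdf
```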

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@Joorrit Joorrit closed this Dec 27, 2025

Development

Successfully merging this pull request may close these issues.

[Bug]: html2text ignores <base> tag when resolving relative links
