Python implementation of Defuddle — extract and clean web content as Markdown.
Pass any HTML string (or a URL via the CLI) and get back clean, readable Markdown with rich metadata extracted from the page.
- Content extraction — finds the main article body, removes ads, navbars, sidebars, comments, cookie notices, paywalls, and other clutter
- Metadata extraction — title, author, published date, description, image, favicon, domain, language, site name via OpenGraph, Twitter Cards, Schema.org, and DOM fallbacks
- Markdown conversion — clean ATX-style Markdown with properly fenced code blocks (with language tags), tables, figures, and footnotes
- Code block handling — detects syntax highlighter markup from Prism, Highlight.js, Shiki, and others; normalises indentation and strips UI chrome (copy buttons, toolbars)
- Image processing — promotes lazy-loaded images, picks the highest-res srcset source, removes tracking pixels
- CLI — fetch any URL and copy the Markdown to your clipboard in one command
- Pure Python — depends only on the standard library plus BeautifulSoup4, markdownify, click, httpx, rich, and pyperclip
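To make the "highest-res srcset source" idea from the image processing bullet concrete, here is a minimal standard-library sketch. It is not taken from pydefuddle's code: the function name `pick_largest_srcset_candidate` is hypothetical, and the parsing is simplified (it splits on commas, so URLs that themselves contain commas are not handled).

```python
from typing import Optional


def pick_largest_srcset_candidate(srcset: str) -> Optional[str]:
    """Return the URL whose width (w) or density (x) descriptor is largest.

    Illustrative only; a hypothetical helper, not pydefuddle's implementation.
    """
    best_url: Optional[str] = None
    best_score = -1.0
    for candidate in srcset.split(","):
        parts = candidate.strip().split()
        if not parts:
            continue  # skip empty entries such as a trailing comma
        url = parts[0]
        score = 1.0  # a candidate without a descriptor counts as 1x
        if len(parts) > 1:
            descriptor = parts[1]
            if descriptor[-1] not in ("w", "x"):
                continue  # unknown descriptor type, ignore this candidate
            try:
                score = float(descriptor[:-1])
            except ValueError:
                continue  # malformed number, ignore this candidate
        if score > best_score:
            best_url, best_score = url, score
    return best_url


print(pick_largest_srcset_candidate("small.jpg 480w, large.jpg 1600w"))
```

A real srcset uses either width or density descriptors throughout, so comparing the raw numbers is enough for this sketch.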
```bash
pip install pydefuddle
```

Or with uv:

```bash
uv add pydefuddle
```

```python
from pydefuddle import defuddle

with open("page.html") as f:
    html = f.read()

result = defuddle(html, url="https://example.com/article")
print(result.title)      # "How Python Works"
print(result.author)     # "Jane Smith"
print(result.published)  # "2024-03-15"
print(result.markdown)   # Clean Markdown string
```

| Option | Type | Default | Description |
|---|---|---|---|
| `markdown` | `bool` | `True` | Convert to Markdown (set `False` for clean HTML only) |
| `remove_low_scoring` | `bool` | `True` | Remove low-signal blocks via content scoring |
| `remove_small_images` | `bool` | `True` | Remove tracking pixels and tiny images |
| `remove_hidden_elements` | `bool` | `True` | Remove elements hidden with CSS |
| `content_selector` | `str` | `None` | Override content discovery with a CSS selector |
| `debug` | `bool` | `False` | Include removal debug info in result |
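The fields shown in the basic example (`title`, `author`, `published`, `markdown`) can be combined into a single note with front matter. The sketch below is only an illustration: `to_note` is a hypothetical helper, not part of pydefuddle, and a `SimpleNamespace` stands in for a real result so the snippet runs without the package installed. The quoting is a simplified form of YAML string escaping.

```python
from types import SimpleNamespace


def to_note(result) -> str:
    """Build a Markdown note with front matter from result-like attributes.

    Hypothetical helper; works with any object that has title, author,
    published, and markdown attributes.
    """
    lines = ["---"]
    for name in ("title", "author", "published"):
        value = getattr(result, name, "")
        if value:  # skip fields the extractor left empty
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{name}: "{escaped}"')
    lines.append("---")
    lines.append("")
    lines.append(result.markdown)
    return "\n".join(lines)


# Stand-in for a real result object, using the sample values from the example
sample = SimpleNamespace(
    title="How Python Works",
    author="Jane Smith",
    published="2024-03-15",
    markdown="# How Python Works\n\nBody text.",
)
print(to_note(sample))
```

Passing the object returned by `defuddle(...)` instead of `sample` should work the same way, assuming its fields are plain strings as in the example.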
```python
from pydefuddle import Defuddle, DefuddleOptions

opts = DefuddleOptions(markdown=True, debug=True, content_selector="article")
result = Defuddle(html, url="https://example.com", options=opts).parse()

for removal in result.debug:
    print(removal.name, removal.count, removal.selector)
```

```python
result.content      # str — clean HTML
result.markdown     # str — Markdown (empty if markdown=False)
result.title        # str
result.author       # str
result.published    # str — ISO date / datetime string
result.description  # str
result.image        # str — URL
result.favicon      # str — URL
result.domain       # str
result.language     # str — BCP 47 (e.g. "en", "fr")
result.site_title   # str
result.word_count   # int
result.parse_time   # float — milliseconds
result.debug        # list[DebugRemoval] | None
```

```bash
pydefuddle fetch https://example.com/some-article
```

The Markdown is copied to your clipboard automatically.
```bash
pydefuddle fetch <url> --no-clipboard   # print to stdout instead
pydefuddle fetch <url> --output out.md  # write to file
pydefuddle fetch <url> --preview        # render in terminal with rich
pydefuddle fetch <url> --debug          # show removal steps
pydefuddle fetch <url> --no-markdown    # return clean HTML instead
```

```bash
pydefuddle parse page.html --no-clipboard
pydefuddle parse page.html --output article.md
```

```bash
git clone https://github.com/phalt/pydefuddle
cd pydefuddle
make install  # install deps with uv
make test     # run tests with coverage
make format   # ruff format + lint
```

Based on Defuddle by Steph Ango (@kepano), which is the JavaScript original powering Obsidian Web Clipper.
MIT