pydefuddle

Python implementation of Defuddle — extract and clean web content as Markdown.

Pass any HTML string (or a URL via the CLI) and get back clean, readable Markdown with rich metadata extracted from the page.

Features

Content extraction — finds the main article body, removes ads, navbars, sidebars, comments, cookie notices, paywalls, and other clutter
Metadata extraction — title, author, published date, description, image, favicon, domain, language, site name via OpenGraph, Twitter Cards, Schema.org, and DOM fallbacks
Markdown conversion — clean ATX-style Markdown with properly fenced code blocks (with language tags), tables, figures, and footnotes
Code block handling — detects syntax highlighter markup from Prism, Highlight.js, Shiki, and others; normalises indentation and strips UI chrome (copy buttons, toolbars)
Image processing — promotes lazy-loaded images, picks the highest-res srcset source, removes tracking pixels
CLI — fetch any URL and copy the Markdown to your clipboard in one command
Pure Python — depends only on the standard library plus BeautifulSoup4, markdownify, click, httpx, rich, and pyperclip
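To make the "highest-res srcset source" step concrete, here is a minimal, standard-library-only sketch of one way to choose a candidate from a `srcset` attribute. It is an illustration, not pydefuddle's actual implementation; the function name `pick_largest_srcset_candidate` is hypothetical, and the comma split is a simplification that does not handle URLs containing commas.

```python
from typing import Optional


def pick_largest_srcset_candidate(srcset: str) -> Optional[str]:
    """Return the URL whose width (w) or density (x) descriptor is largest.

    Hypothetical helper for illustration only. Assumes candidates are
    separated by commas and that URLs themselves contain no commas.
    """
    best_url = None
    best_score = -1.0
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue  # skip empty segments such as a trailing comma
        url = parts[0]
        # A candidate without a descriptor counts as "1x".
        descriptor = parts[1] if len(parts) > 1 else "1x"
        try:
            score = float(descriptor[:-1])  # "800w" -> 800.0, "2x" -> 2.0
        except ValueError:
            continue  # ignore malformed descriptors
        if score > best_score:
            best_url, best_score = url, score
    return best_url


print(pick_largest_srcset_candidate("small.jpg 480w, medium.jpg 800w, large.jpg 1600w"))
```

A real extractor would also need to resolve the chosen URL against the page URL and decide what to do when width and density descriptors are mixed.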

Installation

pip install pydefuddle

Or with uv:

uv add pydefuddle

Python API

from pydefuddle import defuddle

with open("page.html", encoding="utf-8") as f:
    html = f.read()

result = defuddle(html, url="https://example.com/article")

print(result.title)      # "How Python Works"
print(result.author)     # "Jane Smith"
print(result.published)  # "2024-03-15"
print(result.markdown)   # Clean Markdown string

`defuddle(html, url="", **options)` — convenience function

Option	Type	Default	Description
`markdown`	bool	`True`	Convert to Markdown (set `False` for clean HTML only)
`remove_low_scoring`	bool	`True`	Remove low-signal blocks via content scoring
`remove_small_images`	bool	`True`	Remove tracking pixels and tiny images
`remove_hidden_elements`	bool	`True`	Remove elements hidden with CSS
`content_selector`	str | None	`None`	Override content discovery with a CSS selector
`debug`	bool	`False`	Include removal debug info in result
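The options in the table are passed as keyword arguments. The sketch below combines a few of them to get cleaned HTML instead of Markdown. The wrapper name `extract_clean_html` and its `parse` parameter are hypothetical and exist only so the example can be exercised without the package installed; by default it calls the documented `defuddle` function.

```python
def extract_clean_html(html, url="", parse=None):
    """Return cleaned HTML for the <article> element, skipping Markdown conversion.

    `parse` is a hypothetical injection point for testing; when omitted,
    the documented `pydefuddle.defuddle` function is used.
    """
    if parse is None:
        from pydefuddle import defuddle as parse

    result = parse(
        html,
        url=url,
        markdown=False,              # keep clean HTML only
        content_selector="article",  # override content discovery
        remove_small_images=False,   # keep small images in the output
    )
    return result.content
```

With `markdown=False`, `result.markdown` is empty, so the cleaned HTML is read from `result.content`.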

`Defuddle` class

from pydefuddle import Defuddle, DefuddleOptions

opts = DefuddleOptions(markdown=True, debug=True, content_selector="article")
result = Defuddle(html, url="https://example.com", options=opts).parse()

for removal in result.debug or []:  # debug is None unless debug=True
    print(removal.name, removal.count, removal.selector)

`DefuddleResult` fields

result.content       # str  — clean HTML
result.markdown      # str  — Markdown (empty if markdown=False)
result.title         # str
result.author        # str
result.published     # str  — ISO date / datetime string
result.description   # str
result.image         # str  — URL
result.favicon       # str  — URL
result.domain        # str
result.language      # str  — BCP 47 (e.g. "en", "fr")
result.site_title    # str
result.word_count    # int
result.parse_time    # float — milliseconds
result.debug         # list[DebugRemoval] | None
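One common use of these fields is to prepend metadata to the Markdown as YAML front matter. The following is a minimal sketch under stated assumptions: `to_front_matter` and `FRONT_MATTER_FIELDS` are hypothetical names, not part of pydefuddle, and the function accepts any object exposing the attributes listed above.

```python
import json

# Metadata fields to emit, in order. All are documented as str on the result.
FRONT_MATTER_FIELDS = ("title", "author", "published", "description", "domain", "language")


def to_front_matter(result) -> str:
    """Return the result's Markdown prefixed with a YAML front matter block.

    Hypothetical helper: empty fields are skipped, and values are quoted
    with json.dumps, whose double-quoted output is also a valid YAML scalar.
    """
    lines = ["---"]
    for field in FRONT_MATTER_FIELDS:
        value = getattr(result, field, "")
        if value:
            lines.append(f"{field}: {json.dumps(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + result.markdown
```

Quoting every value avoids YAML surprises with titles that contain colons or leading special characters.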

CLI

Fetch a URL → clipboard

pydefuddle fetch https://example.com/some-article

The Markdown is copied to your clipboard automatically.

Options

pydefuddle fetch <url> --no-clipboard   # print to stdout instead
pydefuddle fetch <url> --output out.md  # write to file
pydefuddle fetch <url> --preview        # render in terminal with rich
pydefuddle fetch <url> --debug          # show removal steps
pydefuddle fetch <url> --no-markdown    # return clean HTML instead

Parse a local file

pydefuddle parse page.html --no-clipboard
pydefuddle parse page.html --output article.md
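For scripts that need the same result without the CLI, a rough Python equivalent of `pydefuddle fetch <url> --no-clipboard` might look like the sketch below. It is an assumption about the overall flow, not the CLI's actual code: `fetch_markdown` and its `get` and `parse` parameters are hypothetical, while `httpx.get` and `defuddle` are used as documented.

```python
def fetch_markdown(url, get=None, parse=None):
    """Download a page and return its extracted Markdown.

    `get` and `parse` are hypothetical injection points for testing;
    by default the page is fetched with httpx and parsed with defuddle.
    """
    if get is None:
        import httpx

        def get(target):
            # Follow redirects, since article URLs are often redirected.
            return httpx.get(target, follow_redirects=True, timeout=30.0)

    if parse is None:
        from pydefuddle import defuddle as parse

    response = get(url)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return parse(response.text, url=url).markdown
```

Passing the page URL to the parser matters because the URL is an input to `defuddle` alongside the HTML, as in the Python API example.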

Development

git clone https://github.com/phalt/pydefuddle
cd pydefuddle
make install   # install deps with uv
make test      # run tests with coverage
make format    # ruff format + lint

Credits

Based on Defuddle by Steph Ango (@kepano), the original JavaScript library that powers Obsidian Web Clipper.

License

MIT
