yuwei2010/pydefuddle

 
 

pydefuddle

Python implementation of Defuddle — extract and clean web content as Markdown.

Pass any HTML string (or a URL via the CLI) and get back clean, readable Markdown with rich metadata extracted from the page.

Features

  • Content extraction — finds the main article body, removes ads, navbars, sidebars, comments, cookie notices, paywalls, and other clutter
  • Metadata extraction — title, author, published date, description, image, favicon, domain, language, site name via OpenGraph, Twitter Cards, Schema.org, and DOM fallbacks
  • Markdown conversion — clean ATX-style Markdown with properly fenced code blocks (with language tags), tables, figures, and footnotes
  • Code block handling — detects syntax highlighter markup from Prism, Highlight.js, Shiki, and others; normalises indentation and strips UI chrome (copy buttons, toolbars)
  • Image processing — promotes lazy-loaded images, picks the highest-res srcset source, removes tracking pixels
  • CLI — fetch any URL and copy the Markdown to your clipboard in one command
  • Pure Python — depends only on the standard library plus BeautifulSoup4, markdownify, click, httpx, rich, and pyperclip
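
The image-processing feature above mentions picking the highest-resolution srcset source. As a rough illustration of that idea, and not pydefuddle's actual code, the sketch below parses a srcset string using only the standard library. The function name pick_largest_srcset_candidate is hypothetical, and the naive comma split assumes URLs that contain no commas.

```python
from typing import Optional


def pick_largest_srcset_candidate(srcset: str) -> Optional[str]:
    """Return the URL whose width (w) or density (x) descriptor is largest.

    Hypothetical helper for illustration; pydefuddle may do this differently.
    """
    best_url: Optional[str] = None
    best_value = -1.0
    # Assumption: candidates are comma-separated and URLs contain no commas.
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue
        url = parts[0]
        # A candidate without a descriptor counts as "1x".
        descriptor = parts[1] if len(parts) > 1 else "1x"
        if descriptor[-1] not in ("w", "x"):
            continue
        try:
            value = float(descriptor[:-1])
        except ValueError:
            continue  # skip malformed descriptors such as "w" or "abcx"
        if value > best_value:
            best_url, best_value = url, value
    return best_url
```

A valid srcset uses either width or density descriptors throughout, so the sketch compares the numeric values directly without mixing the two kinds.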

Installation

pip install pydefuddle

Or with uv:

uv add pydefuddle

Python API

from pydefuddle import defuddle

with open("page.html", encoding="utf-8") as f:
    html = f.read()

result = defuddle(html, url="https://example.com/article")

print(result.title)      # "How Python Works"
print(result.author)     # "Jane Smith"
print(result.published)  # "2024-03-15"
print(result.markdown)   # Clean Markdown string

defuddle(html, url="", **options) — convenience function

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| markdown | bool | True | Convert to Markdown (set False for clean HTML only) |
| remove_low_scoring | bool | True | Remove low-signal blocks via content scoring |
| remove_small_images | bool | True | Remove tracking pixels and tiny images |
| remove_hidden_elements | bool | True | Remove elements hidden with CSS |
| content_selector | str | None | Override content discovery with a CSS selector |
| debug | bool | False | Include removal debug info in result |
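
The options listed above can be read as a set of keyword defaults. The sketch below shows one plausible way such keywords could be collected and validated with a dataclass. OptionsSketch and collect_options are hypothetical stand-ins written for this example; the real DefuddleOptions class may be structured differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OptionsSketch:
    """Hypothetical stand-in that mirrors the documented options and defaults."""

    markdown: bool = True
    remove_low_scoring: bool = True
    remove_small_images: bool = True
    remove_hidden_elements: bool = True
    content_selector: Optional[str] = None
    debug: bool = False


def collect_options(**options) -> OptionsSketch:
    """Build an options object from keyword arguments.

    Unknown option names raise TypeError from the dataclass constructor,
    which surfaces typos such as `markdwon=False` early.
    """
    return OptionsSketch(**options)
```

Under this sketch, collect_options(markdown=False, content_selector="article") overrides two fields and leaves the remaining defaults untouched.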

Defuddle class

from pydefuddle import Defuddle, DefuddleOptions

opts = DefuddleOptions(markdown=True, debug=True, content_selector="article")
result = Defuddle(html, url="https://example.com", options=opts).parse()

for removal in result.debug:
    print(removal.name, removal.count, removal.selector)

DefuddleResult fields

result.content       # str  — clean HTML
result.markdown      # str  — Markdown (empty if markdown=False)
result.title         # str
result.author        # str
result.published     # str  — ISO date / datetime string
result.description   # str
result.image         # str  — URL
result.favicon       # str  — URL
result.domain        # str
result.language      # str  — BCP 47 (e.g. "en", "fr")
result.site_title    # str
result.word_count    # int
result.parse_time    # float — milliseconds
result.debug         # list[DebugRemoval] | None
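
One common use of these fields is to prepend front matter to the Markdown before saving it as a note. The sketch below is a minimal illustration that relies only on the attribute names listed above. The to_front_matter helper is hypothetical, and a SimpleNamespace stands in for a real result so the example runs without pydefuddle installed.

```python
import json
from types import SimpleNamespace


def to_front_matter(result) -> str:
    """Render selected metadata as a YAML front matter block plus the Markdown.

    Hypothetical helper; empty fields are skipped.
    """
    names = ["title", "author", "published", "description", "domain", "language"]
    lines = ["---"]
    for name in names:
        value = getattr(result, name, "")
        if value:
            # A JSON string is also a valid double-quoted YAML scalar.
            lines.append(f"{name}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + result.markdown


# Stand-in object with the documented attribute names (not a real DefuddleResult).
sample = SimpleNamespace(
    title="How Python Works",
    author="Jane Smith",
    published="2024-03-15",
    description="",
    domain="example.com",
    language="en",
    markdown="# How Python Works\n",
)
note_text = to_front_matter(sample)
```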

CLI

Fetch a URL → clipboard

pydefuddle fetch https://example.com/some-article

The Markdown is copied to your clipboard automatically.

Options

pydefuddle fetch <url> --no-clipboard   # print to stdout instead
pydefuddle fetch <url> --output out.md  # write to file
pydefuddle fetch <url> --preview        # render in terminal with rich
pydefuddle fetch <url> --debug          # show removal steps
pydefuddle fetch <url> --no-markdown    # return clean HTML instead

Parse a local file

pydefuddle parse page.html --no-clipboard
pydefuddle parse page.html --output article.md
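
To convert many saved pages, the parse command can be driven from a short script. The sketch below only builds the argument lists, using the parse subcommand and --output flag shown above. The build_parse_commands helper and the directory names are assumptions made for this example.

```python
from pathlib import Path
from typing import Iterable, List


def build_parse_commands(html_files: Iterable[str], out_dir: str) -> List[List[str]]:
    """Return one `pydefuddle parse <file> --output <file>.md` command per input.

    Hypothetical helper; it does not run anything itself.
    """
    commands = []
    for html_file in html_files:
        # Reuse the input's base name for the Markdown file, e.g. a.html -> a.md
        target = Path(out_dir) / (Path(html_file).stem + ".md")
        commands.append(
            ["pydefuddle", "parse", str(html_file), "--output", str(target)]
        )
    return commands
```

Each returned list can be passed to subprocess.run(command, check=True) once pydefuddle is installed and on the PATH.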

Development

git clone https://github.com/phalt/pydefuddle
cd pydefuddle
make install   # install deps with uv
make test      # run tests with coverage
make format    # ruff format + lint

Credits

Based on Defuddle by Steph Ango (@kepano), the original JavaScript library that powers Obsidian Web Clipper.

License

MIT
