139 changes: 116 additions & 23 deletions packages/markitdown-ocr/README.md
@@ -1,8 +1,11 @@
# MarkItDown OCR Plugin

LLM Vision plugin for MarkItDown that extracts text from images embedded in PDF, DOCX, PPTX, and XLSX files.
OCR plugin for MarkItDown that extracts text from images embedded in PDF, DOCX, PPTX, and XLSX files.

Uses the same `llm_client` / `llm_model` pattern that MarkItDown already supports for image descriptions — no new ML libraries or binary dependencies required.
Supports **two OCR providers**:

- **glm-ocr** — ZhiPu AI's specialized layout parsing model (better table recognition, lower cost)
- **LLM Vision** — Any OpenAI-compatible vision model (GPT-4o, Gemini, etc.)

## Features

@@ -11,28 +14,65 @@ Uses the same `llm_client` / `llm_model` pattern that MarkItDown already support
- **Enhanced PPTX Converter**: OCR for images in PowerPoint presentations
- **Enhanced XLSX Converter**: OCR for images in Excel spreadsheets
- **Context Preservation**: Maintains document structure and flow when inserting extracted text
- **Multiple Providers**: Choose glm-ocr for best table/Chinese text recognition, or LLM Vision for general use

## Installation

```bash
pip install markitdown-ocr
```

The plugin uses whatever OpenAI-compatible client you already have. Install one if you don't have it yet:
Then install at least one OCR provider:

```bash
pip install openai
# Option 1: glm-ocr (recommended for Chinese documents and tables)
pip install markitdown-ocr[glmocr]

# Option 2: LLM Vision (general purpose, any OpenAI-compatible model)
pip install markitdown-ocr[llm]
```

## Usage

### Command Line
### Using glm-ocr Provider (Recommended)

glm-ocr uses ZhiPu AI's specialized layout parsing model — better table recognition, structured output, and lower cost.

**Via environment variable:**

```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
export GLMOCR_API_KEY="your-zhipu-api-key"
markitdown document.pdf --use-plugins
```

**Via Python API:**

```python
from markitdown import MarkItDown

# Option 1: Pass API key directly
md = MarkItDown(
    enable_plugins=True,
    glmocr_api_key="your-zhipu-api-key",
)

# Option 2: Use environment variable (GLMOCR_API_KEY)
md = MarkItDown(enable_plugins=True)

result = md.convert("document_with_tables.pdf")
print(result.text_content)
```

**Via config file** (`pyproject.toml`):

```toml
[tool.markitdown-ocr.glmocr]
# api_key = "" # Recommended: use env var GLMOCR_API_KEY instead
model = "glm-ocr"
timeout = 120
```

### Python API
### Using LLM Vision Provider

Pass `llm_client` and `llm_model` to `MarkItDown()` exactly as you would for image descriptions:

@@ -50,9 +90,22 @@ result = md.convert("document_with_images.pdf")
print(result.text_content)
```

If no `llm_client` is provided the plugin still loads, but OCR is silently skipped — falling back to the standard built-in converter.
### Provider Priority

### Custom Prompt
When both providers are configured, **glm-ocr takes priority**. To use LLM Vision instead, leave `glmocr_api_key` unset and make sure the `GLMOCR_API_KEY` environment variable is not set either:

```python
md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
    # glmocr_api_key not set → uses LLM Vision
)
```

If no provider is configured, the plugin still loads but OCR is silently skipped — falling back to the standard built-in converter.

### Custom Prompt (LLM Vision only)

Override the default extraction prompt for specialized documents:

@@ -85,27 +138,34 @@ md = MarkItDown(

## How It Works

When `MarkItDown(enable_plugins=True, llm_client=..., llm_model=...)` is called:
### Provider Selection

When `MarkItDown(enable_plugins=True, ...)` is called:

1. MarkItDown discovers the plugin via the `markitdown.plugin` entry point group
2. It calls `register_converters()`, forwarding all kwargs including `llm_client` and `llm_model`
3. The plugin creates an `LLMVisionOCRService` from those kwargs
2. It calls `register_converters()`, forwarding all kwargs
3. The plugin selects an OCR provider:
   - If `glmocr_api_key` or `GLMOCR_API_KEY` is set → **GlmOcrService** (zai-sdk + glm-ocr)
   - Else if `llm_client` + `llm_model` are set → **LLMVisionOCRService** (OpenAI-compatible)
   - Else → no OCR (standard text extraction)
4. Four OCR-enhanced converters are registered at **priority -1.0** — before the built-in converters at priority 0.0
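
The selection rules in step 3 can be sketched as a small pure function. This is only an illustration of the documented order, not the plugin's actual code: `select_provider` and its return labels are hypothetical names, and the `env` parameter is added here purely so the logic can be exercised without touching the real environment.

```python
import os
from typing import Any, Mapping, Optional


def select_provider(
    kwargs: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the provider the documented rules would pick (illustrative only)."""
    env = os.environ if env is None else env

    # Rule 1: an explicit kwarg or the environment variable enables glm-ocr.
    if kwargs.get("glmocr_api_key") or env.get("GLMOCR_API_KEY"):
        return "glm-ocr"

    # Rule 2: LLM Vision needs both a client and a model name.
    if kwargs.get("llm_client") is not None and kwargs.get("llm_model"):
        return "llm-vision"

    # Rule 3: nothing configured, so OCR is skipped.
    return "none"
```

Note that under these rules an exported `GLMOCR_API_KEY` selects glm-ocr even when `llm_client` and `llm_model` are passed explicitly.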

### Conversion Flow

When a file is converted:

1. The OCR converter accepts the file
2. It extracts embedded images from the document
3. Each image is sent to the LLM with an extraction prompt
3. Each image is sent to the selected OCR provider
4. The returned text is inserted inline, preserving document structure
5. If the LLM call fails, conversion continues without that image's text
5. If the OCR call fails, conversion continues without that image's text
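
Steps 3 to 5 amount to a per-image loop that tolerates provider failures. The sketch below assumes a hypothetical `ocr_images` helper and a generic `extract_text` callable; it shows the failure-handling pattern only and is not taken from the plugin's converters.

```python
import warnings
from typing import Callable, Iterable, List


def ocr_images(
    images: Iterable[bytes],
    extract_text: Callable[[bytes], str],
) -> List[str]:
    """Run OCR on each image; a failed call yields "" instead of aborting."""
    results: List[str] = []
    for index, image in enumerate(images):
        try:
            results.append(extract_text(image))
        except Exception as exc:
            # A provider error is reported as a warning and the image is
            # skipped, so the rest of the document still converts.
            warnings.warn(f"OCR failed for image {index}: {exc}")
            results.append("")
    return results
```

Keeping one result slot per image, even on failure, lets the caller insert text at the right position without re-indexing.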

## Supported File Formats

### PDF

- Embedded images are extracted by position (via `page.images` / page XObjects) and OCR'd inline, interleaved with the surrounding text in vertical reading order.
- **Scanned PDFs** (pages with no extractable text) are detected automatically: each page is rendered at 300 DPI and sent to the LLM as a full-page image.
- **Scanned PDFs** (pages with no extractable text) are detected automatically: each page is rendered at 300 DPI and sent to the OCR provider as a full-page image.
- **Malformed PDFs** that pdfplumber/pdfminer cannot open (e.g. truncated EOF) are retried with PyMuPDF page rendering, so content is still recovered.

### DOCX
@@ -136,21 +196,45 @@ Every extracted OCR block is wrapped as:
[End OCR]*
```

## Configuration Reference

### glm-ocr Provider

| Parameter | Env Variable | Default | Description |
|-----------|-------------|---------|-------------|
| `glmocr_api_key` | `GLMOCR_API_KEY` | — | ZhiPu AI API key (required) |
| `glmocr_model` | `GLMOCR_MODEL` | `"glm-ocr"` | Model name |
| `glmocr_timeout` | `GLMOCR_TIMEOUT` | `120` | Request timeout (seconds) |

### LLM Vision Provider

| Parameter | Description |
|-----------|-------------|
| `llm_client` | OpenAI-compatible client instance |
| `llm_model` | Model name (e.g., `'gpt-4o'`) |
| `llm_prompt` | Custom extraction prompt |

## Troubleshooting

### OCR text missing from output

The most likely cause is a missing `llm_client` or `llm_model`. Verify:
The most likely cause is a missing provider configuration. Verify:

```python
# For glm-ocr
md = MarkItDown(enable_plugins=True, glmocr_api_key="your-key")

# For LLM Vision
from openai import OpenAI
from markitdown import MarkItDown
md = MarkItDown(enable_plugins=True, llm_client=OpenAI(), llm_model="gpt-4o")
```

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(), # required
    llm_model="gpt-4o", # required
)
### glm-ocr import error

Make sure zai-sdk is installed:

```bash
pip install markitdown-ocr[glmocr]
```

### Plugin not loading
@@ -163,7 +247,7 @@ markitdown --list-plugins # should show: ocr

### API errors

The plugin propagates LLM API errors as warnings and continues conversion. Check your API key, quota, and that the chosen model supports vision inputs.
The plugin propagates OCR API errors as warnings and continues conversion. Check your API key, quota, and that the chosen model supports vision inputs.

## Development

@@ -192,6 +276,15 @@ MIT — see [LICENSE](LICENSE).

## Changelog

### 0.2.0

- **Added glm-ocr provider**: ZhiPu AI layout parsing via zai-sdk
- Provider selection: glm-ocr (priority) → LLM Vision (fallback)
- New `GlmOcrService` class with `extract_text()` interface
- New `GlmOcrConfig` for configuration management (env vars + TOML + kwargs)
- HTML → Markdown conversion for glm-ocr structured output
- Optional dependency: `markitdown-ocr[glmocr]`

### 0.1.0 (Initial Release)

- LLM Vision OCR for PDF, DOCX, PPTX, XLSX
13 changes: 11 additions & 2 deletions packages/markitdown-ocr/pyproject.toml
@@ -5,11 +5,11 @@ build-backend = "hatchling.build"
[project]
name = "markitdown-ocr"
dynamic = ["version"]
description = 'OCR plugin for MarkItDown - Extracts text from images in PDF, DOCX, PPTX, and XLSX via LLM Vision'
description = 'OCR plugin for MarkItDown - Extracts text from images in PDF, DOCX, PPTX, and XLSX via LLM Vision or glm-ocr'
readme = "README.md"
requires-python = ">=3.10"
license = "MIT"
keywords = ["markitdown", "ocr", "pdf", "docx", "xlsx", "pptx", "llm", "vision"]
keywords = ["markitdown", "ocr", "pdf", "docx", "xlsx", "pptx", "llm", "vision", "glm-ocr", "zhipu"]
authors = [
{ name = "Contributors", email = "noreply@github.com" },
]
@@ -43,6 +43,9 @@ dependencies = [
llm = [
"openai>=1.0.0",
]
glmocr = [
"zai-sdk>=0.2.2",
]

[project.urls]
Documentation = "https://github.com/microsoft/markitdown#readme"
@@ -55,3 +58,9 @@ path = "src/markitdown_ocr/__about__.py"
# CRITICAL: Plugin entry point - MarkItDown will discover this plugin through this entry point
[project.entry-points."markitdown.plugin"]
ocr = "markitdown_ocr"

# glm-ocr provider configuration (also supports environment variables)
[tool.markitdown-ocr.glmocr]
# api_key = "" # Recommended: set via environment variable GLMOCR_API_KEY
model = "glm-ocr"
timeout = 600
9 changes: 8 additions & 1 deletion packages/markitdown-ocr/src/markitdown_ocr/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,15 +4,20 @@
"""
markitdown-ocr: OCR plugin for MarkItDown

Adds LLM Vision-based text extraction from images embedded in PDF, DOCX, PPTX, and XLSX files.
Adds text extraction from images embedded in PDF, DOCX, PPTX, and XLSX files.
Supports multiple OCR providers:
- LLM Vision (OpenAI-compatible: GPT-4o, Gemini, etc.)
- glm-ocr (ZhiPu AI layout parsing: better table recognition, lower cost)
"""

from ._plugin import __plugin_interface_version__, register_converters
from .__about__ import __version__
from ._ocr_service import (
    OCRResult,
    LLMVisionOCRService,
    GlmOcrService,
)
from ._glmocr_config import GlmOcrConfig
from ._pdf_converter_with_ocr import PdfConverterWithOCR
from ._docx_converter_with_ocr import DocxConverterWithOCR
from ._pptx_converter_with_ocr import PptxConverterWithOCR
@@ -24,6 +29,8 @@
"register_converters",
"OCRResult",
"LLMVisionOCRService",
"GlmOcrService",
"GlmOcrConfig",
"PdfConverterWithOCR",
"DocxConverterWithOCR",
"PptxConverterWithOCR",
93 changes: 93 additions & 0 deletions packages/markitdown-ocr/src/markitdown_ocr/_glmocr_config.py
@@ -0,0 +1,93 @@
"""Configuration management for glm-ocr provider."""

import os
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

try:
    import tomllib  # Python 3.11+
except ImportError:
    try:
        import tomli as tomllib  # type: ignore[no-redef]
    except ImportError:
        tomllib = None  # type: ignore[assignment]


@dataclass
class GlmOcrConfig:
    """glm-ocr provider configuration for markitdown-ocr.

    Config sources (priority high to low):
    1. kwargs parameters (passed at registration time)
    2. Environment variables
    3. Config file (pyproject.toml [tool.markitdown-ocr.glmocr] section)
    4. Default values
    """

    api_key: str = ""
    model: str = "glm-ocr"
    timeout: int = 120

    @classmethod
    def load(cls, config_path: Optional[str] = None) -> "GlmOcrConfig":
        """Load configuration from multiple sources."""
        config = cls()
        config._load_from_file(config_path)
        config._load_from_env()
        return config

    def _load_from_file(self, config_path: Optional[str] = None) -> None:
        """Load from config file (pyproject.toml)."""
        if tomllib is None:
            return

        search_paths: list[Path] = []

        if config_path:
            search_paths.append(Path(config_path))

        # Current directory pyproject.toml
        search_paths.append(Path("pyproject.toml"))

        # User config directory
        search_paths.append(
            Path.home() / ".config" / "markitdown-ocr" / "config.toml"
        )

        for path in search_paths:
            if path.exists():
                try:
                    with open(path, "rb") as f:
                        data = tomllib.load(f)

                    # Read [tool.markitdown-ocr.glmocr] section
                    if "tool" in data and "markitdown-ocr" in data["tool"]:
                        section = data["tool"]["markitdown-ocr"]
                        glmocr_section = section.get("glmocr", {})
                        self._apply_config(glmocr_section)

                    break  # Use first found config file
                except Exception:
                    pass

    def _apply_config(self, data: dict) -> None:
        """Apply config values from a dict."""
        if "api_key" in data:
            self.api_key = data["api_key"]
        if "model" in data:
            self.model = data["model"]
        if "timeout" in data:
            self.timeout = int(data["timeout"])

    def _load_from_env(self) -> None:
        """Load from environment variables (highest priority)."""
        if os.environ.get("GLMOCR_API_KEY"):
            self.api_key = os.environ["GLMOCR_API_KEY"]
        if os.environ.get("GLMOCR_MODEL"):
            self.model = os.environ["GLMOCR_MODEL"]
        if os.environ.get("GLMOCR_TIMEOUT"):
            try:
                self.timeout = int(os.environ["GLMOCR_TIMEOUT"])
            except ValueError:
                pass
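
The precedence described in the `GlmOcrConfig` docstring (kwargs over environment variables over config file over defaults) can be illustrated with a standalone sketch. `resolve_config` and `DEFAULTS` are hypothetical names used only here. The kwargs layer is an assumption based on the `glmocr_api_key`, `glmocr_model`, and `glmocr_timeout` parameters listed in the README, since `load()` as shown applies only the file and environment layers.

```python
from typing import Any, Dict, Mapping

# Default values mirrored from the GlmOcrConfig dataclass fields.
DEFAULTS: Dict[str, Any] = {"api_key": "", "model": "glm-ocr", "timeout": 120}


def resolve_config(
    file_values: Mapping[str, Any],
    env: Mapping[str, str],
    kwargs: Mapping[str, Any],
) -> Dict[str, Any]:
    """Merge config layers from lowest to highest priority (illustrative only)."""
    config = dict(DEFAULTS)

    # Layer 1: values from the [tool.markitdown-ocr.glmocr] section.
    for key in DEFAULTS:
        if key in file_values:
            value = file_values[key]
            config[key] = int(value) if key == "timeout" else value

    # Layer 2: GLMOCR_* environment variables override the file.
    for key in DEFAULTS:
        raw = env.get(f"GLMOCR_{key.upper()}")
        if not raw:
            continue
        if key == "timeout":
            try:
                config[key] = int(raw)
            except ValueError:
                pass  # invalid number: keep the lower layer's value
        else:
            config[key] = raw

    # Layer 3 (assumed): glmocr_* kwargs override everything else.
    for key in DEFAULTS:
        value = kwargs.get(f"glmocr_{key}")
        if value is not None:
            config[key] = value

    return config
```

For example, with `timeout = 600` in the config file and `GLMOCR_TIMEOUT=abc` in the environment, the sketch keeps 600, which matches how `_load_from_env` ignores a value that fails `int()`.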