
DocumentIndex

Lightweight hierarchical tree index for financial documents with reasoning-based retrieval.

Overview

DocumentIndex builds hierarchical tree structures from financial documents (SEC filings, earnings calls, research reports) and provides two powerful retrieval modes:

  • Agentic QA: Intelligent, iterative question answering that navigates the document structure
  • Provenance Extraction: Exhaustive scan to find ALL evidence related to a topic

Unlike vector similarity search, DocumentIndex uses LLM reasoning to understand document structure and find relevant information.
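Reasoning-based navigation can be pictured as a traversal of the section tree: at each level, a model scores which child sections look relevant to the question and descends into the best one. A minimal sketch of that idea (the scorer below is a stand-in keyword heuristic, not the library's actual LLM call):

```python
# Toy section tree: (title, summary, children)
tree = ("10-K", "Annual report", [
    ("Item 1A. Risk Factors", "competition, supply chain, climate risks", []),
    ("Item 7. MD&A", "revenue grew 8% driven by services", []),
    ("Item 8. Financial Statements", "balance sheet, income statement", []),
])

def score(question: str, summary: str) -> int:
    # Stand-in for an LLM relevance call: count shared words.
    q = set(question.lower().split())
    return len(q & set(summary.lower().replace(",", " ").split()))

def navigate(node, question):
    """Greedily descend to the most relevant leaf section."""
    title, summary, children = node
    if not children:
        return title
    best = max(children, key=lambda c: score(question, c[1]))
    return navigate(best, question)

print(navigate(tree, "What drove revenue growth?"))  # → Item 7. MD&A
```

The real system replaces the keyword scorer with LLM calls over node summaries, which is what lets it handle paraphrase and indirection that lexical overlap would miss.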

Features

  • 📄 Hierarchical Tree Indexing: Understands document structure (PART, ITEM, Note, etc.) with LLM-skip for well-sectioned documents
  • 🤖 Multi-Provider LLM Support: OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Ollama
  • 🧠 Multi-Model Routing: Use cheaper models for scoring/summaries, capable models for structure detection and excerpt extraction
  • 🔍 Dual Retrieval Modes: Agentic QA and Provenance Extraction
  • ⚡ Token-Aware Batching: Intelligent grouping of LLM calls by token budget to minimize API round-trips and cost
  • 📊 Streaming Responses: Real-time progress tracking and streaming outputs
  • 💾 LLM-Level Caching: Response caching across all components (memory, file, and Redis backends)
  • 🔗 Cross-Reference Resolution: Automatically resolves "see Note 15", "refer to Item 1A" with batched scoring
  • 📝 Metadata Extraction: Company info, dates, financial numbers
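The token-aware batching idea — packing as many items as fit under a token budget into each LLM call — can be sketched generically. This is an illustration of the concept, not the library's internal implementation, and the 4-characters-per-token estimate is a rough assumption:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def batch_by_tokens(items, budget):
    """Group items into batches whose estimated token total stays under budget."""
    batches, current, used = [], [], 0
    for item in items:
        cost = estimate_tokens(item)
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += cost
    if current:
        batches.append(current)
    return batches

sections = ["a" * 400, "b" * 400, "c" * 400]  # ~100 estimated tokens each
print([len(b) for b in batch_by_tokens(sections, budget=150)])  # → [1, 1, 1]
```

With a larger budget the same sections would share one call, which is where the API round-trip savings come from.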

Installation

pip install documentindex

With Redis support:

pip install documentindex[cache]

Quick Start

import asyncio
from documentindex import DocumentIndexer, AgenticQA, ProvenanceExtractor

async def main():
    # 1. Index a document
    indexer = DocumentIndexer()
    doc_index = await indexer.index(
        text=your_document_text,
        doc_name="AAPL_10K_2024"
    )
    
    # 2. Ask questions (Agentic QA)
    qa = AgenticQA(doc_index)
    result = await qa.answer("What was the revenue in 2024?")
    print(result.answer)
    print(f"Confidence: {result.confidence}")
    
    # 3. Extract all evidence about a topic (Provenance)
    extractor = ProvenanceExtractor(doc_index)
    evidence = await extractor.extract_all("climate change risks")
    print(f"Found {len(evidence.evidence)} relevant sections")

asyncio.run(main())

Use Cases

Use Case 1: Document Indexing

from documentindex import DocumentIndexer, IndexerConfig, LLMConfig

# Configure indexer with multi-model support
config = IndexerConfig(
    llm_config=LLMConfig(model="gpt-4o"),            # Structure detection
    summary_llm_config=LLMConfig(model="gpt-4o-mini"), # Cheaper model for summaries
    generate_summaries=True,
    extract_metadata=True,
)

indexer = DocumentIndexer(config)
doc_index = await indexer.index(
    text=document_text,
    doc_name="10K_2024",
)

# Access structure
for node in doc_index.structure:
    print(f"[{node.node_id}] {node.title}")
    for child in node.children:
        print(f"  [{child.node_id}] {child.title}")

# Get text for a node
text = doc_index.get_node_text("0001")

Use Case 2: Question Answering

from documentindex import AgenticQA, AgenticQAConfig

qa = AgenticQA(doc_index)

# Simple question
result = await qa.answer("What are the main risk factors?")

# With configuration
config = AgenticQAConfig(
    max_iterations=5,
    confidence_threshold=0.7,
    follow_cross_refs=True,
)
result = await qa.answer("Explain the revenue breakdown", config)

# Access citations
for citation in result.citations:
    print(f"- {citation.node_title}: {citation.excerpt}")

Use Case 3: Streaming Responses

# Stream QA answer
result = await qa.answer_stream("What was the revenue?")
async for chunk in result.answer_stream:
    print(chunk.content, end="", flush=True)

# Progress callbacks
def on_progress(update):
    print(f"[{update.progress_pct:.1f}%] {update.step_name}")

doc_index = await indexer.index_with_progress(
    text=document_text,
    progress_callback=on_progress,
)

Use Case 4: Provenance Extraction

from documentindex import ProvenanceExtractor, ProvenanceConfig, LLMConfig

# Multi-model: cheap model for scoring, capable model for excerpts
extractor = ProvenanceExtractor(
    doc_index,
    llm_config=LLMConfig(model="gpt-4o"),              # Excerpt extraction
    scoring_llm_config=LLMConfig(model="gpt-4o-mini"),  # Scoring + summary
)

# Extract evidence for single topic
result = await extractor.extract_all(
    topic="environmental sustainability",
    config=ProvenanceConfig(
        relevance_threshold=0.6,
        extract_excerpts=True,
        excerpt_threshold=0.75,       # Only extract excerpts for high-confidence matches
        excerpt_token_budget=30000,   # Token budget per excerpt batch
    ),
)

# Multiple topics (scoring cache shared across topics)
topics = {
    "climate": "climate change and environmental risks",
    "regulatory": "regulatory compliance requirements",
    "financial": "revenue and financial performance",
}
results = await extractor.extract_by_category(topics)

LLM Provider Configuration

OpenAI

from documentindex import create_openai_client, LLMConfig

# Using factory
client = create_openai_client(model="gpt-4o")

# Using config
config = LLMConfig(model="gpt-4o")
# Or: model="openai/gpt-4-turbo"

Anthropic

from documentindex import create_anthropic_client

client = create_anthropic_client(model="claude-sonnet-4-20250514")

# Or via config
config = LLMConfig(model="anthropic/claude-sonnet-4-20250514")

AWS Bedrock

from documentindex import create_bedrock_client

client = create_bedrock_client(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    region="us-east-1",
)

# Or via config
config = LLMConfig(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    provider_config={"aws_region_name": "us-east-1"},
)

Azure OpenAI

from documentindex import create_azure_client

client = create_azure_client(
    deployment_name="gpt-4",
    api_base="https://your-resource.openai.azure.com",
    api_version="2024-02-15-preview",
)

Local (Ollama)

from documentindex import create_ollama_client

client = create_ollama_client(
    model="llama2",
    base_url="http://localhost:11434",
)

Caching

Memory Cache (Development)

from documentindex import CacheConfig, CacheManager

config = CacheConfig(backend="memory", memory_max_size=1000)
cache = CacheManager(config)

File Cache (Persistence)

config = CacheConfig(
    backend="file",
    file_cache_dir=".cache/documentindex",
)
cache = CacheManager(config)

Redis Cache (Production)

config = CacheConfig(
    backend="redis",
    redis_host="localhost",
    redis_port=6379,
)
cache = CacheManager(config)

Using Cache with Components

indexer = DocumentIndexer(config, cache_manager=cache)
searcher = NodeSearcher(doc_index, cache_manager=cache)
extractor = ProvenanceExtractor(doc_index, cache_manager=cache)
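Response caching at the LLM level typically keys on the model plus the exact prompt, so any component issuing an identical call gets the stored answer instead of a new API round-trip. A minimal in-memory sketch of that pattern (illustrative only; CacheManager's real keying may differ):

```python
import hashlib

class ResponseCache:
    """Cache LLM responses keyed on (model, prompt)."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def _key(self, model: str, prompt: str) -> str:
        # Hash model and prompt together into a stable cache key.
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def complete(self, model, prompt, call_llm):
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        response = call_llm(prompt)  # only reached on a cache miss
        self._store[key] = response
        return response

cache = ResponseCache()
fake_llm = lambda p: f"answer to: {p}"
cache.complete("gpt-4o-mini", "score this section", fake_llm)
cache.complete("gpt-4o-mini", "score this section", fake_llm)
print(cache.hits)  # second identical call is served from cache → 1
```

Because the key covers the full prompt, this is why sharing one cache manager across indexer, searcher, and extractor pays off: repeated scoring prompts over the same nodes hit the cache regardless of which component issued them.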

Supported Document Types

DocumentIndex automatically detects document types:

  • SEC Filings: 10-K, 10-Q, 8-K, DEF 14A, S-1, 20-F, 6-K
  • Earnings Documents: Earnings calls, earnings releases
  • Analysis: Research reports, press releases
  • Generic: Any text document
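Detection along these lines can be approximated by looking for form-type cues near the top of the document. A toy heuristic sketch (not the library's actual detector):

```python
import re

FORM_TYPES = ["10-K", "10-Q", "8-K", "DEF 14A", "S-1", "20-F", "6-K"]

def detect_doc_type(text: str) -> str:
    """Classify a document by cues in its opening text."""
    head = text[:2000].upper()
    for form in FORM_TYPES:
        # Match the form type as a standalone token, e.g. "FORM 10-K".
        if re.search(rf"\b{re.escape(form)}\b", head):
            return f"SEC filing ({form})"
    if "EARNINGS CALL" in head or "EARNINGS RELEASE" in head:
        return "earnings document"
    return "generic"

print(detect_doc_type("UNITED STATES SEC ... FORM 10-K ANNUAL REPORT"))
# → SEC filing (10-K)
```

A real detector would also weigh structural signals (PART/ITEM headings, speaker turns in transcripts), but the fallback-to-generic shape is the important part: every document gets indexed even when no specific type matches.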

API Reference

Core Classes

| Class | Description |
|-------|-------------|
| DocumentIndexer | Builds hierarchical tree from text |
| DocumentIndex | Container for indexed document |
| NodeSearcher | Searches for related nodes |
| AgenticQA | Question answering with reasoning |
| ProvenanceExtractor | Exhaustive evidence extraction |

Data Models

| Model | Description |
|-------|-------------|
| TreeNode | Node in document tree |
| TextSpan | Maps to original text |
| NodeMatch | Search result with relevance |
| Citation | Citation to document location |
| QAResult | Question answering result |
| ProvenanceResult | Provenance extraction result |

Configuration Classes

| Config | Description |
|--------|-------------|
| LLMConfig | LLM provider settings |
| IndexerConfig | Indexing options |
| AgenticQAConfig | QA behavior settings |
| ProvenanceConfig | Extraction settings |
| CacheConfig | Cache backend settings |

Examples

See the examples/ directory for complete examples:

Comprehensive Tutorials

  • indexer_deep_dive.py - DocumentIndexer deep dive with hierarchical tree visualization, metadata extraction, and cross-reference resolution
  • searcher_showcase.py - NodeSearcher showcase with relevance scoring, batch search, and cross-reference expansion
  • agentic_qa_tutorial.py - AgenticQA tutorial with reasoning traces, multi-hop questions, and confidence scoring
  • provenance_patterns.py - ProvenanceExtractor patterns with multi-category analysis, threshold tuning, and export formats

Quick Start Examples

  • basic_usage.py - Getting started with basic indexing and querying
  • streaming_example.py - Progress tracking and streaming responses
  • multi_provider_example.py - Using different LLM providers (OpenAI, Anthropic, Bedrock, Azure)
  • caching_example.py - Cache configurations (memory, file, Redis)

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=documentindex

# Format code
ruff format src tests

# Lint code
ruff check src tests

License

MIT License

Contributing

Contributions are welcome! Please read our contributing guidelines first.
