# DocumentIndex

A lightweight hierarchical tree index for financial documents with reasoning-based retrieval.
DocumentIndex builds hierarchical tree structures from financial documents (SEC filings, earnings calls, research reports) and provides two powerful retrieval modes:
- Agentic QA: Intelligent, iterative question answering that navigates the document structure
- Provenance Extraction: Exhaustive scan to find ALL evidence related to a topic
Unlike vector similarity search, DocumentIndex uses LLM reasoning to understand document structure and find relevant information.
## Features

- 📄 Hierarchical Tree Indexing: Understands document structure (PART, ITEM, Note, etc.) with LLM-skip for well-sectioned documents
- 🤖 Multi-Provider LLM Support: OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Ollama
- 🧠 Multi-Model Routing: Use cheaper models for scoring/summaries, capable models for structure detection and excerpt extraction
- 🔍 Dual Retrieval Modes: Agentic QA and Provenance Extraction
- ⚡ Token-Aware Batching: Intelligent grouping of LLM calls by token budget to minimize API round-trips and cost
- 📊 Streaming Responses: Real-time progress tracking and streaming outputs
- 💾 LLM-Level Caching: Response caching across all components (memory, file, and Redis backends)
- 🔗 Cross-Reference Resolution: Automatically resolves references such as "see Note 15" or "refer to Item 1A", with batched scoring
- 📝 Metadata Extraction: Company info, dates, and key financial figures
## Installation

```bash
pip install documentindex
```

With Redis support:
```bash
pip install "documentindex[cache]"
```

## Quick Start

```python
import asyncio
from documentindex import DocumentIndexer, AgenticQA, ProvenanceExtractor

async def main():
    # 1. Index a document
    indexer = DocumentIndexer()
    doc_index = await indexer.index(
        text=your_document_text,
        doc_name="AAPL_10K_2024",
    )

    # 2. Ask questions (Agentic QA)
    qa = AgenticQA(doc_index)
    result = await qa.answer("What was the revenue in 2024?")
    print(result.answer)
    print(f"Confidence: {result.confidence}")

    # 3. Extract all evidence about a topic (Provenance)
    extractor = ProvenanceExtractor(doc_index)
    evidence = await extractor.extract_all("climate change risks")
    print(f"Found {len(evidence.evidence)} relevant sections")

asyncio.run(main())
```
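The Quick Start assumes `your_document_text` already holds the raw filing text. For a runnable script, load it from disk first (the file path here is hypothetical):

```python
from pathlib import Path

# Load the raw filing text from a local file (hypothetical path).
your_document_text = Path("filings/AAPL_10K_2024.txt").read_text()
```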
## Indexing

```python
from documentindex import DocumentIndexer, IndexerConfig, LLMConfig

# Configure indexer with multi-model support
config = IndexerConfig(
    llm_config=LLMConfig(model="gpt-4o"),               # Structure detection
    summary_llm_config=LLMConfig(model="gpt-4o-mini"),  # Cheaper model for summaries
    generate_summaries=True,
    extract_metadata=True,
)
indexer = DocumentIndexer(config)
doc_index = await indexer.index(
    text=document_text,
    doc_name="10K_2024",
)

# Access structure
for node in doc_index.structure:
    print(f"[{node.node_id}] {node.title}")
    for child in node.children:
        print(f"  [{child.node_id}] {child.title}")
# Get text for a node
text = doc_index.get_node_text("0001")
```
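For documents nested deeper than two levels, a small recursive helper prints the whole tree. This sketch relies only on the `node_id`, `title`, and `children` attributes used above:

```python
def print_tree(nodes, depth=0):
    """Recursively print each node, indented by its depth in the tree."""
    for node in nodes:
        print(f"{'  ' * depth}[{node.node_id}] {node.title}")
        print_tree(node.children, depth + 1)

print_tree(doc_index.structure)
```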
## Agentic QA

```python
from documentindex import AgenticQA, AgenticQAConfig

qa = AgenticQA(doc_index)

# Simple question
result = await qa.answer("What are the main risk factors?")
# With configuration
config = AgenticQAConfig(
    max_iterations=5,
    confidence_threshold=0.7,
    follow_cross_refs=True,
)
result = await qa.answer("Explain the revenue breakdown", config)
# Access citations
for citation in result.citations:
print(f"- {citation.node_title}: {citation.excerpt}")# Stream QA answer
## Streaming

```python
# Stream QA answer
result = await qa.answer_stream("What was the revenue?")
async for chunk in result.answer_stream:
    print(chunk.content, end="", flush=True)
```
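To keep the complete answer rather than printing chunks as they arrive, accumulate them (same streaming API as above):

```python
# Collect streamed chunks, then join them into the full answer string.
result = await qa.answer_stream("What was the revenue?")
chunks = []
async for chunk in result.answer_stream:
    chunks.append(chunk.content)
full_answer = "".join(chunks)
```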
Progress can also be tracked during indexing via a callback:

```python
# Progress callbacks
def on_progress(update):
print(f"[{update.progress_pct:.1f}%] {update.step_name}")
doc_index = await indexer.index_with_progress(
    text=document_text,
    progress_callback=on_progress,
)
```
## Provenance Extraction

```python
from documentindex import ProvenanceExtractor, ProvenanceConfig, LLMConfig

# Multi-model: cheap model for scoring, capable model for excerpts
extractor = ProvenanceExtractor(
    doc_index,
    llm_config=LLMConfig(model="gpt-4o"),               # Excerpt extraction
    scoring_llm_config=LLMConfig(model="gpt-4o-mini"),  # Scoring + summary
)

# Extract evidence for a single topic
result = await extractor.extract_all(
    topic="environmental sustainability",
    config=ProvenanceConfig(
        relevance_threshold=0.6,
        extract_excerpts=True,
        excerpt_threshold=0.75,       # Only extract excerpts for high-confidence matches
        excerpt_token_budget=30000,   # Token budget per excerpt batch
    ),
)
# Multiple topics (scoring cache shared across topics)
topics = {
    "climate": "climate change and environmental risks",
    "regulatory": "regulatory compliance requirements",
    "financial": "revenue and financial performance",
}
results = await extractor.extract_by_category(topics)
```
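Assuming `extract_by_category` returns one `ProvenanceResult` per topic key (mirroring the single-topic call), per-category evidence counts can be summarized like this:

```python
# Summarize evidence counts per category (sketch; assumes a mapping of
# topic key -> ProvenanceResult).
for name, category_result in results.items():
    print(f"{name}: {len(category_result.evidence)} relevant sections")
```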
## LLM Providers

### OpenAI

```python
from documentindex import create_openai_client, LLMConfig

# Using factory
client = create_openai_client(model="gpt-4o")
# Using config
config = LLMConfig(model="gpt-4o")
# Or: model="openai/gpt-4-turbo"
```

### Anthropic

```python
from documentindex import create_anthropic_client
client = create_anthropic_client(model="claude-sonnet-4-20250514")
# Or via config
config = LLMConfig(model="anthropic/claude-sonnet-4-20250514")from documentindex import create_bedrock_client
client = create_bedrock_client(
model="anthropic.claude-3-sonnet-20240229-v1:0",
region="us-east-1",
)
# Or via config
config = LLMConfig(
model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
provider_config={"aws_region_name": "us-east-1"},
)
```

### Azure OpenAI

```python
from documentindex import create_azure_client
client = create_azure_client(
deployment_name="gpt-4",
api_base="https://your-resource.openai.azure.com",
api_version="2024-02-15-preview",
)
```

### Ollama

```python
from documentindex import create_ollama_client
client = create_ollama_client(
model="llama2",
base_url="http://localhost:11434",
)
```
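Any of these configs plug into the multi-model parameters shown earlier, so different components can use different providers. For example, pairing a hosted model with a local one (a sketch; the `ollama/llama2` prefix is an assumption extrapolated from the provider prefixes above):

```python
from documentindex import DocumentIndexer, IndexerConfig, LLMConfig

# Hosted model for structure detection, local Ollama model for summaries.
# The "ollama/" model prefix is assumed by analogy with "openai/" etc.
config = IndexerConfig(
    llm_config=LLMConfig(model="anthropic/claude-sonnet-4-20250514"),
    summary_llm_config=LLMConfig(model="ollama/llama2"),
)
indexer = DocumentIndexer(config)
```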
config = CacheConfig(backend="memory", memory_max_size=1000)
cache = CacheManager(config)
```

### File

```python
config = CacheConfig(
backend="file",
file_cache_dir=".cache/documentindex",
)
cache = CacheManager(config)
```

### Redis

```python
config = CacheConfig(
backend="redis",
redis_host="localhost",
redis_port=6379,
)
cache = CacheManager(config)
```

### Using the cache

```python
from documentindex import NodeSearcher

indexer = DocumentIndexer(config, cache_manager=cache)
searcher = NodeSearcher(doc_index, cache_manager=cache)
extractor = ProvenanceExtractor(doc_index, cache_manager=cache)
```
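Because caching happens at the LLM level, re-running the same pipeline over an unchanged document can be served largely from cache. A file-backed cache makes this persist across processes (a sketch; assumes `IndexerConfig` works with its defaults):

```python
# File-backed cache: a second run over the same document replays cached
# LLM responses instead of making fresh API calls.
cache = CacheManager(CacheConfig(backend="file", file_cache_dir=".cache/documentindex"))
indexer = DocumentIndexer(IndexerConfig(), cache_manager=cache)
doc_index = await indexer.index(text=document_text, doc_name="10K_2024")
```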
## Supported Document Types

DocumentIndex automatically detects document types:

- SEC Filings: 10-K, 10-Q, 8-K, DEF 14A, S-1, 20-F, 6-K
- Earnings Documents: Earnings calls, earnings releases
- Analysis: Research reports, press releases
- Generic: Any text document
## API Reference

### Core Classes

| Class | Description |
|---|---|
| `DocumentIndexer` | Builds hierarchical tree from text |
| `DocumentIndex` | Container for indexed document |
| `NodeSearcher` | Searches for related nodes |
| `AgenticQA` | Question answering with reasoning |
| `ProvenanceExtractor` | Exhaustive evidence extraction |
### Data Models

| Model | Description |
|---|---|
| `TreeNode` | Node in document tree |
| `TextSpan` | Maps to original text |
| `NodeMatch` | Search result with relevance |
| `Citation` | Citation to document location |
| `QAResult` | Question answering result |
| `ProvenanceResult` | Provenance extraction result |
### Configuration

| Config | Description |
|---|---|
| `LLMConfig` | LLM provider settings |
| `IndexerConfig` | Indexing options |
| `AgenticQAConfig` | QA behavior settings |
| `ProvenanceConfig` | Extraction settings |
| `CacheConfig` | Cache backend settings |
## Examples

See the `examples/` directory for complete examples:
- `indexer_deep_dive.py` - DocumentIndexer deep dive with hierarchical tree visualization, metadata extraction, and cross-reference resolution
- `searcher_showcase.py` - NodeSearcher showcase with relevance scoring, batch search, and cross-reference expansion
- `agentic_qa_tutorial.py` - AgenticQA tutorial with reasoning traces, multi-hop questions, and confidence scoring
- `provenance_patterns.py` - ProvenanceExtractor patterns with multi-category analysis, threshold tuning, and export formats
- `basic_usage.py` - Getting started with basic indexing and querying
- `streaming_example.py` - Progress tracking and streaming responses
- `multi_provider_example.py` - Using different LLM providers (OpenAI, Anthropic, Bedrock, Azure)
- `caching_example.py` - Cache configurations (memory, file, Redis)
## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=documentindex
# Format code
ruff format src tests
# Lint code
ruff check src tests
```

## License

MIT License
## Contributing

Contributions are welcome! Please read our contributing guidelines first.