
DocumentIndex

Lightweight hierarchical tree index for financial documents with reasoning-based retrieval.

Overview

DocumentIndex builds hierarchical tree structures from financial documents (SEC filings, earnings calls, research reports) and provides two powerful retrieval modes:

  • Agentic QA: Intelligent, iterative question answering that navigates the document structure
  • Provenance Extraction: Exhaustive scan to find ALL evidence related to a topic

Unlike vector similarity search, DocumentIndex uses LLM reasoning to understand document structure and find relevant information.
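Reasoning-based navigation can be pictured as a traversal of the section tree: at each level, a model scores which child sections look relevant to the question and descends into the best one. A minimal sketch of that idea (the scorer below is a stand-in keyword heuristic, not the library's actual LLM call):

```python
# Toy section tree: (title, summary, children)
tree = ("10-K", "Annual report", [
    ("Item 1A. Risk Factors", "competition, supply chain, climate risks", []),
    ("Item 7. MD&A", "revenue grew 8% driven by services", []),
    ("Item 8. Financial Statements", "balance sheet, income statement", []),
])

def score(question: str, summary: str) -> int:
    # Stand-in for an LLM relevance call: count shared words.
    q = set(question.lower().split())
    return len(q & set(summary.lower().replace(",", " ").split()))

def navigate(node, question):
    """Greedily descend to the most relevant leaf section."""
    title, summary, children = node
    if not children:
        return title
    best = max(children, key=lambda c: score(question, c[1]))
    return navigate(best, question)

print(navigate(tree, "What drove revenue growth?"))  # → Item 7. MD&A
```

The real system replaces the keyword scorer with LLM calls over node summaries, which is what lets it handle paraphrase and indirection that lexical overlap would miss.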

Features

  • 📄 Hierarchical Tree Indexing: Understands document structure (PART, ITEM, Note, etc.) with LLM-skip for well-sectioned documents
  • 🤖 Multi-Provider LLM Support: OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Ollama
  • 🧠 Multi-Model Routing: Use cheaper models for scoring/summaries, capable models for structure detection and excerpt extraction
  • 🔍 Dual Retrieval Modes: Agentic QA and Provenance Extraction
  • ⚡ Token-Aware Batching: Intelligent grouping of LLM calls by token budget to minimize API round-trips and cost
  • 📊 Streaming Responses: Real-time progress tracking and streaming outputs
  • 💾 LLM-Level Caching: Response caching across all components (memory, file, and Redis backends)
  • 🔗 Cross-Reference Resolution: Automatically resolves "see Note 15", "refer to Item 1A" with batched scoring
  • 📝 Metadata Extraction: Company info, dates, financial numbers
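The token-aware batching idea — packing as many items as fit under a token budget into each LLM call — can be sketched generically. This is an illustration of the concept, not the library's internal implementation, and the 4-characters-per-token estimate is a rough assumption:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def batch_by_tokens(items, budget):
    """Group items into batches whose estimated token total stays under budget."""
    batches, current, used = [], [], 0
    for item in items:
        cost = estimate_tokens(item)
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += cost
    if current:
        batches.append(current)
    return batches

sections = ["a" * 400, "b" * 400, "c" * 400]  # ~100 estimated tokens each
print([len(b) for b in batch_by_tokens(sections, budget=150)])  # → [1, 1, 1]
```

With a larger budget the same sections would share one call, which is where the API round-trip savings come from.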

Installation

pip install documentindex

With Redis support:

pip install documentindex[cache]

Quick Start

import asyncio
from documentindex import DocumentIndexer, AgenticQA, ProvenanceExtractor

async def main():
    # 1. Index a document
    indexer = DocumentIndexer()
    doc_index = await indexer.index(
        text=your_document_text,
        doc_name="AAPL_10K_2024"
    )
    
    # 2. Ask questions (Agentic QA)
    qa = AgenticQA(doc_index)
    result = await qa.answer("What was the revenue in 2024?")
    print(result.answer)
    print(f"Confidence: {result.confidence}")
    
    # 3. Extract all evidence about a topic (Provenance)
    extractor = ProvenanceExtractor(doc_index)
    evidence = await extractor.extract_all("climate change risks")
    print(f"Found {len(evidence.evidence)} relevant sections")

asyncio.run(main())

Use Cases

Use Case 1: Document Indexing

from documentindex import DocumentIndexer, IndexerConfig, LLMConfig

# Configure indexer with multi-model support
config = IndexerConfig(
    llm_config=LLMConfig(model="gpt-4o"),            # Structure detection
    summary_llm_config=LLMConfig(model="gpt-4o-mini"), # Cheaper model for summaries
    generate_summaries=True,
    extract_metadata=True,
)

indexer = DocumentIndexer(config)
doc_index = await indexer.index(
    text=document_text,
    doc_name="10K_2024",
)

# Access structure
for node in doc_index.structure:
    print(f"[{node.node_id}] {node.title}")
    for child in node.children:
        print(f"  [{child.node_id}] {child.title}")

# Get text for a node
text = doc_index.get_node_text("0001")

Use Case 2: Question Answering

from documentindex import AgenticQA, AgenticQAConfig

qa = AgenticQA(doc_index)

# Simple question
result = await qa.answer("What are the main risk factors?")

# With configuration
config = AgenticQAConfig(
    max_iterations=5,
    confidence_threshold=0.7,
    follow_cross_refs=True,
)
result = await qa.answer("Explain the revenue breakdown", config)

# Access citations
for citation in result.citations:
    print(f"- {citation.node_title}: {citation.excerpt}")

Use Case 3: Streaming Responses

# Stream QA answer
result = await qa.answer_stream("What was the revenue?")
async for chunk in result.answer_stream:
    print(chunk.content, end="", flush=True)

# Progress callbacks
def on_progress(update):
    print(f"[{update.progress_pct:.1f}%] {update.step_name}")

doc_index = await indexer.index_with_progress(
    text=document_text,
    progress_callback=on_progress,
)

Use Case 4: Provenance Extraction

from documentindex import ProvenanceExtractor, ProvenanceConfig, LLMConfig

# Multi-model: cheap model for scoring, capable model for excerpts
extractor = ProvenanceExtractor(
    doc_index,
    llm_config=LLMConfig(model="gpt-4o"),              # Excerpt extraction
    scoring_llm_config=LLMConfig(model="gpt-4o-mini"),  # Scoring + summary
)

# Extract evidence for single topic
result = await extractor.extract_all(
    topic="environmental sustainability",
    config=ProvenanceConfig(
        relevance_threshold=0.6,
        extract_excerpts=True,
        excerpt_threshold=0.75,       # Only extract excerpts for high-confidence matches
        excerpt_token_budget=30000,   # Token budget per excerpt batch
    ),
)

# Multiple topics (scoring cache shared across topics)
topics = {
    "climate": "climate change and environmental risks",
    "regulatory": "regulatory compliance requirements",
    "financial": "revenue and financial performance",
}
results = await extractor.extract_by_category(topics)

LLM Provider Configuration

OpenAI

from documentindex import create_openai_client, LLMConfig

# Using factory
client = create_openai_client(model="gpt-4o")

# Using config
config = LLMConfig(model="gpt-4o")
# Or: model="openai/gpt-4-turbo"

Anthropic

from documentindex import create_anthropic_client

client = create_anthropic_client(model="claude-sonnet-4-20250514")

# Or via config
config = LLMConfig(model="anthropic/claude-sonnet-4-20250514")

AWS Bedrock

from documentindex import create_bedrock_client

client = create_bedrock_client(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    region="us-east-1",
)

# Or via config
config = LLMConfig(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    provider_config={"aws_region_name": "us-east-1"},
)

Azure OpenAI

from documentindex import create_azure_client

client = create_azure_client(
    deployment_name="gpt-4",
    api_base="https://your-resource.openai.azure.com",
    api_version="2024-02-15-preview",
)

Local (Ollama)

from documentindex import create_ollama_client

client = create_ollama_client(
    model="llama2",
    base_url="http://localhost:11434",
)

Caching

Memory Cache (Development)

from documentindex import CacheConfig, CacheManager

config = CacheConfig(backend="memory", memory_max_size=1000)
cache = CacheManager(config)

File Cache (Persistence)

config = CacheConfig(
    backend="file",
    file_cache_dir=".cache/documentindex",
)
cache = CacheManager(config)

Redis Cache (Production)

config = CacheConfig(
    backend="redis",
    redis_host="localhost",
    redis_port=6379,
)
cache = CacheManager(config)

Using Cache with Components

indexer = DocumentIndexer(config, cache_manager=cache)
searcher = NodeSearcher(doc_index, cache_manager=cache)
extractor = ProvenanceExtractor(doc_index, cache_manager=cache)
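Response caching at the LLM level typically keys on the model plus the exact prompt, so any component issuing an identical call gets the stored answer instead of a new API round-trip. A minimal in-memory sketch of that pattern (illustrative only; CacheManager's real keying may differ):

```python
import hashlib

class ResponseCache:
    """Cache LLM responses keyed on (model, prompt)."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def _key(self, model: str, prompt: str) -> str:
        # Hash model and prompt together into a stable cache key.
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def complete(self, model, prompt, call_llm):
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        response = call_llm(prompt)  # only reached on a cache miss
        self._store[key] = response
        return response

cache = ResponseCache()
fake_llm = lambda p: f"answer to: {p}"
cache.complete("gpt-4o-mini", "score this section", fake_llm)
cache.complete("gpt-4o-mini", "score this section", fake_llm)
print(cache.hits)  # second identical call is served from cache → 1
```

Because the key covers the full prompt, this is why sharing one cache manager across indexer, searcher, and extractor pays off: repeated scoring prompts over the same nodes hit the cache regardless of which component issued them.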

Supported Document Types

DocumentIndex automatically detects document types:

  • SEC Filings: 10-K, 10-Q, 8-K, DEF 14A, S-1, 20-F, 6-K
  • Earnings Documents: Earnings calls, earnings releases
  • Analysis: Research reports, press releases
  • Generic: Any text document
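Detection along these lines can be approximated by looking for form-type cues near the top of the document. A toy heuristic sketch (not the library's actual detector):

```python
import re

FORM_TYPES = ["10-K", "10-Q", "8-K", "DEF 14A", "S-1", "20-F", "6-K"]

def detect_doc_type(text: str) -> str:
    """Classify a document by cues in its opening text."""
    head = text[:2000].upper()
    for form in FORM_TYPES:
        # Match the form type as a standalone token, e.g. "FORM 10-K".
        if re.search(rf"\b{re.escape(form)}\b", head):
            return f"SEC filing ({form})"
    if "EARNINGS CALL" in head or "EARNINGS RELEASE" in head:
        return "earnings document"
    return "generic"

print(detect_doc_type("UNITED STATES SEC ... FORM 10-K ANNUAL REPORT"))
# → SEC filing (10-K)
```

A real detector would also weigh structural signals (PART/ITEM headings, speaker turns in transcripts), but the fallback-to-generic shape is the important part: every document gets indexed even when no specific type matches.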

API Reference

Core Classes

| Class | Description |
|-------|-------------|
| DocumentIndexer | Builds hierarchical tree from text |
| DocumentIndex | Container for indexed document |
| NodeSearcher | Searches for related nodes |
| AgenticQA | Question answering with reasoning |
| ProvenanceExtractor | Exhaustive evidence extraction |

Data Models

| Model | Description |
|-------|-------------|
| TreeNode | Node in document tree |
| TextSpan | Maps to original text |
| NodeMatch | Search result with relevance |
| Citation | Citation to document location |
| QAResult | Question answering result |
| ProvenanceResult | Provenance extraction result |

Configuration Classes

| Config | Description |
|--------|-------------|
| LLMConfig | LLM provider settings |
| IndexerConfig | Indexing options |
| AgenticQAConfig | QA behavior settings |
| ProvenanceConfig | Extraction settings |
| CacheConfig | Cache backend settings |

Examples

See the examples/ directory for complete examples:

Comprehensive Tutorials

  • indexer_deep_dive.py - DocumentIndexer deep dive with hierarchical tree visualization, metadata extraction, and cross-reference resolution
  • searcher_showcase.py - NodeSearcher showcase with relevance scoring, batch search, and cross-reference expansion
  • agentic_qa_tutorial.py - AgenticQA tutorial with reasoning traces, multi-hop questions, and confidence scoring
  • provenance_patterns.py - ProvenanceExtractor patterns with multi-category analysis, threshold tuning, and export formats

Quick Start Examples

  • basic_usage.py - Getting started with basic indexing and querying
  • streaming_example.py - Progress tracking and streaming responses
  • multi_provider_example.py - Using different LLM providers (OpenAI, Anthropic, Bedrock, Azure)
  • caching_example.py - Cache configurations (memory, file, Redis)

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=documentindex

# Format code
ruff format src tests

# Lint code
ruff check src tests

License

MIT License

Contributing

Contributions are welcome! Please read our contributing guidelines first.
