MCP Doc Builder

Intelligent Documentation Scraping, Vectorization, and Semantic Search for AI Coding Assistants.

Overview

MCP Doc Builder is a Model Context Protocol (MCP) server that provides:

  • Intelligent Web Scraping: LLM-guided crawler that decides which documentation pages to index
  • Semantic Vectorization: Gemini text-embedding-004 for semantic search across documentation
  • Dynamic Ontology: Automatically extracts concepts and relationships from documentation
  • Knowledge Graph: Neo4j-based storage with full graph traversal capabilities
  • Hybrid Search: Combined vector similarity and fulltext search for optimal results

Features

Intelligent Crawling

  • LLM-powered link evaluation decides which pages to follow
  • Respects rate limits to avoid overwhelming documentation servers
  • Configurable depth (1-5 hops from root URL)
  • Smart content extraction with trafilatura
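
The crawl limits above (depth, page budget, pause between requests) can be pictured as a bounded breadth-first traversal. The following is a minimal sketch under stated assumptions, not the project's spider: `crawl`, `get_links`, and `should_follow` are hypothetical names, `should_follow` stands in for the LLM link evaluation, and the in-memory `SITE` dictionary replaces real HTTP fetching.

```python
import time
from collections import deque
from typing import Callable, Iterable

# Hypothetical link graph standing in for a real documentation site.
SITE = {
    "/docs": ["/docs/a", "/docs/b", "/blog"],
    "/docs/a": ["/docs/a/deep"],
    "/docs/b": [],
    "/docs/a/deep": ["/docs/a/deeper"],
}


def crawl(
    root: str,
    get_links: Callable[[str], Iterable[str]],
    should_follow: Callable[[str], bool],
    max_depth: int = 2,
    max_pages: int = 500,
    rate_limit: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Visit pages breadth-first, bounded by depth and page count."""
    visited: list[str] = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if visited:
            sleep(rate_limit)  # wait between consecutive requests
        links = list(get_links(url))  # stands in for fetch + parse
        visited.append(url)
        if depth >= max_depth:
            continue  # page is indexed, but its links are not expanded
        for link in links:
            if link not in seen and should_follow(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


pages = crawl(
    "/docs",
    get_links=lambda url: SITE.get(url, []),
    should_follow=lambda url: url.startswith("/docs"),  # placeholder for the LLM decision
    max_depth=1,
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(pages)  # ['/docs', '/docs/a', '/docs/b']
```

Passing `sleep` as a parameter keeps the pause configurable and lets the demo run instantly; a real crawler would keep the default.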

Semantic Search

  • Gemini text-embedding-004 for 768-dimensional vectors
  • Neo4j Vector Index for fast similarity search
  • Fulltext search with Lucene
  • Hybrid search combining both methods
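
The README does not say how the vector and fulltext results are merged. One common approach is reciprocal rank fusion (RRF), sketched below as an illustration only; the function name, the constant `k = 60`, and the chunk identifiers are assumptions, not the project's implementation.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, limit: int = 10
) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranked list.

    Each list contributes 1 / (k + rank) per item, so items ranked highly
    by both searches rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:limit]


vector_hits = ["c1", "c2", "c3"]    # hypothetical vector-similarity ranking
fulltext_hits = ["c2", "c4", "c1"]  # hypothetical fulltext ranking
print(reciprocal_rank_fusion([vector_hits, fulltext_hits]))
# ['c2', 'c1', 'c4', 'c3']
```

RRF needs only ranks, so it sidesteps the problem that cosine similarities and Lucene scores are on different scales.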

Dynamic Ontology

  • Automatic concept extraction (APIs, patterns, entities)
  • Relationship inference (uses, extends, requires, etc.)
  • Chunk-to-concept linking
  • Concept co-occurrence analysis
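
As a rough illustration of concept co-occurrence analysis, the sketch below counts how often two concepts are mentioned in the same chunk. The function name and the sample chunk data are hypothetical; the project's own extractor and linker may work differently.

```python
from collections import Counter
from itertools import combinations


def concept_cooccurrence(chunk_concepts: dict[str, set[str]]) -> Counter:
    """Count unordered concept pairs that appear together in a chunk."""
    pairs: Counter = Counter()
    for concepts in chunk_concepts.values():
        # Sorting makes (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(concepts), 2):
            pairs[(a, b)] += 1
    return pairs


chunks = {
    "chunk-1": {"useState", "useEffect"},
    "chunk-2": {"useState", "useEffect", "useRef"},
    "chunk-3": {"useRef"},
}
print(concept_cooccurrence(chunks).most_common(1))
# [(('useEffect', 'useState'), 2)]
```

Pairs with high counts are natural candidates for a relationship between the two concepts.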

MCP Integration

  • 6 tools for complete documentation management
  • Resources for graph exploration
  • Workflow prompts for common tasks

Quick Start

1. Prerequisites

  • Python 3.11+
  • Docker (for Neo4j)
  • LiteLLM Gateway or Gemini API key

2. Installation

You can install doc-builder-mcp globally using pipx (recommended) or in a local virtual environment.

Option 1: One-Line Install (Recommended)

# Install the package
pipx install doc-builder-mcp

# Run the interactive Setup Wizard
doc-mcp-setup

The wizard will:

  1. Check for Docker and Neo4j.
  2. Ask for your LiteLLM or Gemini credentials.
  3. Configure the LLM mode (LiteLLM or Gemini Direct).
  4. Generate a secure .env file.

If you don't have pipx, install it first:

macOS:

brew install pipx
pipx ensurepath

Windows:

winget install pipx
pipx ensurepath

Linux (Debian/Ubuntu):

sudo apt install pipx
pipx ensurepath

Restart your terminal after installing pipx.

Alternative: Standard Pip

If you prefer not to use pipx:

pip install doc-builder-mcp
doc-mcp-setup

Option 2: Manual Development Setup

If you want to contribute or modify the code:

git clone https://github.com/Hexecu/mcp-doc-builder.git
cd mcp-doc-builder
make full-setup

3. Setup

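If you did not run the setup wizard during installation, run it now:

doc-mcp-setup

Or configure manually by copying the template:

# Run from the server/ directory; .env.example lives in the repository root
cp ../.env.example ../.env
# Edit .env with your configuration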

4. Start Neo4j

Start the Neo4j database, either directly with Docker or with the provided Makefile target:

make neo4j-up

This target uses docker-compose.yml to start the Neo4j container.

5. Run the Server

# STDIO mode (for IDE integration)
make server-stdio

# HTTP mode (for API access)
make server

Configuration

Environment Variables

| Variable | Description | Default |
|---|---|---|
| NEO4J_URI | Neo4j connection URI | bolt://localhost:7688 |
| NEO4J_USERNAME | Neo4j username | neo4j |
| NEO4J_PASSWORD | Neo4j password | - |
| LLM_MODE | litellm, gemini_direct, or both | litellm |
| LITELLM_BASE_URL | LiteLLM Gateway URL | - |
| LITELLM_API_KEY | LiteLLM API key | - |
| LITELLM_MODEL | Model name | gemini-2.5-flash |
| CRAWLER_MAX_DEPTH | Maximum crawl depth | 2 |
| CRAWLER_RATE_LIMIT | Seconds between requests | 1.0 |
| CRAWLER_MAX_PAGES | Max pages per source | 500 |
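
To make the defaults concrete, here is a minimal sketch of reading these variables in Python. It is not the project's config.py: the `Settings` class and `load_settings` function are hypothetical, and only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_LLM_MODES = {"litellm", "gemini_direct", "both"}


@dataclass(frozen=True)
class Settings:
    neo4j_uri: str
    neo4j_username: str
    neo4j_password: Optional[str]
    llm_mode: str
    litellm_base_url: Optional[str]
    litellm_api_key: Optional[str]
    litellm_model: str
    crawler_max_depth: int
    crawler_rate_limit: float
    crawler_max_pages: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, applying the documented defaults."""
    llm_mode = env.get("LLM_MODE", "litellm")
    if llm_mode not in VALID_LLM_MODES:
        raise ValueError(f"LLM_MODE must be one of {sorted(VALID_LLM_MODES)}")
    return Settings(
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7688"),
        neo4j_username=env.get("NEO4J_USERNAME", "neo4j"),
        neo4j_password=env.get("NEO4J_PASSWORD"),      # no default
        llm_mode=llm_mode,
        litellm_base_url=env.get("LITELLM_BASE_URL"),  # no default
        litellm_api_key=env.get("LITELLM_API_KEY"),    # no default
        litellm_model=env.get("LITELLM_MODEL", "gemini-2.5-flash"),
        crawler_max_depth=int(env.get("CRAWLER_MAX_DEPTH", "2")),
        crawler_rate_limit=float(env.get("CRAWLER_RATE_LIMIT", "1.0")),
        crawler_max_pages=int(env.get("CRAWLER_MAX_PAGES", "500")),
    )


settings = load_settings({"CRAWLER_MAX_DEPTH": "3"})
print(settings.crawler_max_depth, settings.neo4j_uri)
# 3 bolt://localhost:7688
```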

MCP Tools

doc_ingest

Ingest and index a documentation website.

{
  "url": "https://nextjs.org/docs",
  "name": "Next.js Docs",
  "max_depth": 2
}

doc_search

Search indexed documentation.

{
  "query": "how to use React hooks",
  "limit": 10,
  "search_mode": "hybrid"
}

doc_context

Get comprehensive context for a topic.

{
  "topic": "authentication in Next.js",
  "include_related": true
}

doc_sources

List all indexed documentation sources.

doc_refresh

Refresh/re-index a documentation source.

{
  "source_id": "source_abc123",
  "force": false
}

doc_ontology

Explore extracted concepts and relationships.

{
  "source_id": "source_abc123",
  "concept": "useState"
}
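
IDE clients normally send these arguments for you. For readers curious about what goes over the wire, the sketch below wraps a tool's arguments in an MCP `tools/call` JSON-RPC request. The helper name `tool_call_request` is hypothetical, and the initialization handshake and transport framing are omitted.

```python
import json
from itertools import count

_request_ids = count(1)  # simple incrementing JSON-RPC request IDs


def tool_call_request(name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request for the given tool and arguments."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": next(_request_ids),
            "method": "tools/call",
            "params": {"name": name, "arguments": arguments},
        }
    )


print(
    tool_call_request(
        "doc_search",
        {"query": "how to use React hooks", "limit": 10, "search_mode": "hybrid"},
    )
)
```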

IDE Integration

You can use this MCP server with any compatible AI coding assistant.

Antigravity (Google DeepMind)

  1. Open Antigravity Settings or locate your configuration file.
  2. Add the MCP server configuration:
{
  "mcpServers": {
    "doc-builder": {
      "command": "doc-mcp",
      "args": [],
      "env": {
        "NEO4J_URI": "bolt://localhost:7688",
        "NEO4J_PASSWORD": "your-password",
        "LITELLM_API_KEY": "your-key"
      }
    }
  }
}

Recommended Antigravity Custom Rule

To maximize the utility of doc-builder, add this specific rule to your Antigravity global profile or project guidelines (e.g., in .gemini/rules.md or global settings):

Documentation Strategy Rule: "Whenever you encounter a new, unknown, or recently updated library, framework, or API that is not fully covered by your base training, you MUST proactively use the doc_ingest MCP tool to scrape and index its official documentation. Before writing complex implementation code for unfamiliar tools, always query doc_search or doc_context to understand the best practices and latest syntax."

Cursor

Cursor supports MCP natively. To add the server:

  1. Open Cursor Settings (Cmd/Ctrl + Shift + J) > Features > MCP.
  2. Click + Add new MCP server.
  3. Set the Type to command.
  4. Set the Name to doc-builder.
  5. Set the Command to doc-mcp (assuming you installed via pipx).
  6. Add the necessary environment variables (NEO4J_PASSWORD, LITELLM_API_KEY, etc.) directly in the Cursor UI environment section.

VS Code (with Claude Dev / Roo Code)

If you use Claude Dev, Roo Code, or similar MCP clients in VS Code:

  1. Open the MCP configuration file (on macOS, usually ~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json).
  2. Add the server entry:
{
  "mcpServers": {
    "doc-builder": {
      "command": "doc-mcp",
      "args": [],
      "env": {
        "NEO4J_URI": "bolt://localhost:7688",
        "NEO4J_PASSWORD": "your-password",
        "LITELLM_API_KEY": "your-key"
      }
    }
  }
}
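
If the configuration file already lists other MCP servers, the doc-builder entry has to be merged in rather than pasted over them. The sketch below shows one way to do that on an already-parsed config; `with_doc_builder` and the `other-server` entry are hypothetical, and reading or writing the actual file is left to you.

```python
import copy
import json


def with_doc_builder(
    config: dict,
    neo4j_password: str,
    litellm_api_key: str,
    neo4j_uri: str = "bolt://localhost:7688",
) -> dict:
    """Return a copy of config with the doc-builder server entry added or replaced."""
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    servers = updated.setdefault("mcpServers", {})
    servers["doc-builder"] = {
        "command": "doc-mcp",
        "args": [],
        "env": {
            "NEO4J_URI": neo4j_uri,
            "NEO4J_PASSWORD": neo4j_password,
            "LITELLM_API_KEY": litellm_api_key,
        },
    }
    return updated


existing = {"mcpServers": {"other-server": {"command": "other", "args": []}}}
print(json.dumps(with_doc_builder(existing, "your-password", "your-key"), indent=2))
```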

Architecture

mcp-doc-builder/
├── docker-compose.yml        # Neo4j container
├── .env.example              # Configuration template
└── server/
    ├── pyproject.toml        # Python package
    └── src/doc_builder/
        ├── main.py           # MCP server entry
        ├── config.py         # Settings
        ├── cli/              # Setup wizard & status
        ├── crawler/          # Web scraping
        │   ├── spider.py     # Async crawler
        │   ├── parser.py     # HTML parsing
        │   └── agent.py      # LLM link evaluation
        ├── vector/           # Vectorization
        │   ├── embedder.py   # Gemini embeddings
        │   ├── chunker.py    # Smart chunking
        │   └── indexer.py    # Neo4j vector index
        ├── ontology/         # Knowledge extraction
        │   ├── extractor.py  # Concept extraction
        │   ├── metatag.py    # Metatag processing
        │   └── linker.py     # Relationship building
        ├── kg/               # Neo4j graph
        │   ├── neo4j.py      # Async client
        │   ├── repo.py       # Query repository
        │   └── schema.cypher # Database schema
        ├── llm/              # LLM integration
        │   ├── client.py     # LiteLLM wrapper
        │   └── prompts/      # Prompt templates
        ├── mcp/              # MCP protocol
        │   ├── tools.py      # Tool definitions
        │   ├── resources.py  # Resource handlers
        │   └── prompts.py    # Workflow prompts
        └── security/         # Auth & validation

Graph Schema

Nodes (Doc* prefixed for namespace separation)

  • DocSource: Documentation root (URL, name, status)
  • DocPage: Individual pages with metadata
  • DocChunk: Vectorized content chunks with embeddings
  • DocConcept: Extracted concepts (APIs, patterns, entities)
  • DocMetatag: Page metatags (og:, twitter:, etc.)
  • DocCrawlJob: Crawl job tracking

Relationships

  • (DocSource)-[:CONTAINS]->(DocPage)
  • (DocPage)-[:LINKS_TO]->(DocPage)
  • (DocPage)-[:HAS_CHUNK]->(DocChunk)
  • (DocChunk)-[:MENTIONS]->(DocConcept)
  • (DocConcept)-[:RELATES_TO]->(DocConcept)
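
As an illustration of traversing this schema, the sketch below finds chunks that mention a given concept and reports the page and source they belong to. The labels and relationship types come from the lists above; the property names (`name`, `url`, `text`) and the helper functions are assumptions, so check schema.cypher before relying on them. The query runs through the official `neo4j` Python driver, which is imported only when `find_chunks` is called.

```python
CONCEPT_CHUNKS_QUERY = (
    "MATCH (s:DocSource)-[:CONTAINS]->(p:DocPage)"
    "-[:HAS_CHUNK]->(k:DocChunk)-[:MENTIONS]->(c:DocConcept) "
    "WHERE c.name = $concept "          # assumed property name
    "RETURN s.name AS source, p.url AS page, k.text AS chunk "  # assumed properties
    "LIMIT $limit"
)


def concept_chunks_params(concept: str, limit: int = 5) -> dict:
    """Build the parameter map for CONCEPT_CHUNKS_QUERY."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return {"concept": concept, "limit": limit}


def find_chunks(uri: str, user: str, password: str, concept: str, limit: int = 5) -> list[dict]:
    """Run the query against a live Neo4j instance and return plain dicts."""
    from neo4j import GraphDatabase  # requires the neo4j package

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            result = session.run(CONCEPT_CHUNKS_QUERY, **concept_chunks_params(concept, limit))
            return [record.data() for record in result]


print(concept_chunks_params("useState"))
# {'concept': 'useState', 'limit': 5}
```

Passing the concept as a query parameter rather than formatting it into the string avoids Cypher injection and lets Neo4j reuse the query plan.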

CLI Commands

# Interactive setup
doc-mcp-setup

# Health check
doc-mcp-status --doctor

# Run server
doc-mcp

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Type checking
mypy src/

# Linting
ruff check src/

License

MIT
