Intelligent Documentation Scraping, Vectorization, and Semantic Search for AI Coding Assistants.
MCP Doc Builder is a Model Context Protocol (MCP) server that provides:
- Intelligent Web Scraping: LLM-guided crawler that intelligently decides which documentation pages to index
- Semantic Vectorization: Gemini text-embedding-004 for semantic search across documentation
- Dynamic Ontology: Automatically extracts concepts and relationships from documentation
- Knowledge Graph: Neo4j-based storage with full graph traversal capabilities
- Hybrid Search: Combined vector similarity and fulltext search for optimal results
- LLM-powered link evaluation decides which pages to follow
- Respects rate limits to avoid overwhelming documentation servers
- Configurable depth (1-5 hops from root URL)
- Smart content extraction with trafilatura
- Gemini text-embedding-004 for 768-dimensional vectors
- Neo4j Vector Index for fast similarity search
- Fulltext search with Lucene
- Hybrid search combining both methods
- Automatic concept extraction (APIs, patterns, entities)
- Relationship inference (uses, extends, requires, etc.)
- Chunk-to-concept linking
- Concept co-occurrence analysis
- 6 tools for complete documentation management
- Resources for graph exploration
- Workflow prompts for common tasks
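The hybrid search merges two ranked result lists, one from vector similarity and one from Lucene fulltext. One common way to fuse such rankings (a sketch of the general technique, not necessarily the exact method this project uses) is reciprocal rank fusion:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each item scores the sum of 1 / (k + rank)
    over every list it appears in. Higher total = better."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical chunk IDs for illustration.
vector_hits = ["chunk_a", "chunk_b", "chunk_c"]
fulltext_hits = ["chunk_b", "chunk_d", "chunk_a"]
fused = reciprocal_rank_fusion([vector_hits, fulltext_hits])
# fused == ["chunk_b", "chunk_a", "chunk_d", "chunk_c"]
```

Items ranked well by both methods (like `chunk_b` here) float to the top even when neither ranker puts them first.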
- Python 3.11+
- Docker (for Neo4j)
- LiteLLM Gateway or Gemini API key
You can install doc-builder-mcp globally using pipx (recommended) or in a local virtual environment.
```
# Install the package
pipx install doc-builder-mcp

# Run the interactive Setup Wizard
doc-mcp-setup
```

The wizard will:
- Check for Docker and Neo4j.
- Ask for your LiteLLM / Gemini credentials.
- Configure the LLM mode (LiteLLM vs. Gemini Direct).
- Generate a secure `.env` file.
❓ Don't have pipx? Install it first:

macOS:

```
brew install pipx
pipx ensurepath
```

Windows:

```
winget install pipx
pipx ensurepath
```

Linux (Debian/Ubuntu):

```
sudo apt install pipx
pipx ensurepath
```

Restart your terminal after installing pipx.
If you prefer not to use pipx:

```
pip install doc-builder-mcp
doc-mcp-setup
```

If you want to contribute or modify the code:
```
git clone https://github.com/Hexecu/mcp-doc-builder.git
cd mcp-doc-builder
make full-setup
```

Run the interactive setup wizard:

```
doc-mcp-setup
```

Or manually configure:

```
cp ../.env.example ../.env
# Edit .env with your configuration
```

Start the Neo4j database with Docker, either directly or via the provided Makefile:

```
make neo4j-up
```

This uses the `docker-compose.yml` to start the Neo4j instance.
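For reference, a compose file along these lines would start a compatible Neo4j instance. The Bolt host port matches the default `NEO4J_URI` below; the image tag, HTTP port, and volume name are assumptions for illustration — the project ships its own `docker-compose.yml`:

```yaml
services:
  neo4j:
    image: neo4j:5              # assumed tag; check the shipped compose file
    ports:
      - "7475:7474"             # HTTP browser (host port is an assumption)
      - "7688:7687"             # Bolt, matching bolt://localhost:7688
    environment:
      - NEO4J_AUTH=neo4j/your-password
    volumes:
      - neo4j_data:/data
volumes:
  neo4j_data:
```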
```
# STDIO mode (for IDE integration)
make server-stdio

# HTTP mode (for API access)
make server
```

| Variable | Description | Default |
|---|---|---|
| `NEO4J_URI` | Neo4j connection URI | `bolt://localhost:7688` |
| `NEO4J_USERNAME` | Neo4j username | `neo4j` |
| `NEO4J_PASSWORD` | Neo4j password | - |
| `LLM_MODE` | `litellm`, `gemini_direct`, or `both` | `litellm` |
| `LITELLM_BASE_URL` | LiteLLM Gateway URL | - |
| `LITELLM_API_KEY` | LiteLLM API key | - |
| `LITELLM_MODEL` | Model name | `gemini-2.5-flash` |
| `CRAWLER_MAX_DEPTH` | Maximum crawl depth | `2` |
| `CRAWLER_RATE_LIMIT` | Seconds between requests | `1.0` |
| `CRAWLER_MAX_PAGES` | Max pages per source | `500` |
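Putting the table together, a typical `.env` might look like this (all values are placeholders — the setup wizard generates the real file):

```ini
NEO4J_URI=bolt://localhost:7688
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=change-me
LLM_MODE=litellm
LITELLM_BASE_URL=https://litellm.example.com
LITELLM_API_KEY=sk-...
LITELLM_MODEL=gemini-2.5-flash
CRAWLER_MAX_DEPTH=2
CRAWLER_RATE_LIMIT=1.0
CRAWLER_MAX_PAGES=500
```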
Ingest and index a documentation website:

```json
{
  "url": "https://nextjs.org/docs",
  "name": "Next.js Docs",
  "max_depth": 2
}
```

Search indexed documentation:

```json
{
  "query": "how to use React hooks",
  "limit": 10,
  "search_mode": "hybrid"
}
```

Get comprehensive context for a topic:

```json
{
  "topic": "authentication in Next.js",
  "include_related": true
}
```

List all indexed documentation sources.

Refresh/re-index a documentation source:

```json
{
  "source_id": "source_abc123",
  "force": false
}
```

Explore extracted concepts and relationships:

```json
{
  "source_id": "source_abc123",
  "concept": "useState"
}
```

You can use this MCP server with any compatible AI coding assistant.
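Under the hood, every such integration speaks MCP JSON-RPC. For reference, a `tools/call` request invoking the search tool (using the `doc_search` tool name referenced below and the example arguments above) looks like this on the wire:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "doc_search",
    "arguments": {
      "query": "how to use React hooks",
      "limit": 10,
      "search_mode": "hybrid"
    }
  }
}
```

Your MCP client constructs these requests for you; you should never need to write them by hand.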
- Open Antigravity Settings or locate your configuration file.
- Add the MCP server configuration:

```json
{
  "mcpServers": {
    "doc-builder": {
      "command": "doc-mcp",
      "args": [],
      "env": {
        "NEO4J_URI": "bolt://localhost:7688",
        "NEO4J_PASSWORD": "your-password",
        "LITELLM_API_KEY": "your-key"
      }
    }
  }
}
```

To maximize the utility of doc-builder, add this specific rule to your Antigravity global profile or project guidelines (e.g., in `.gemini/rules.md` or global settings):
Documentation Strategy Rule: "Whenever you encounter a new, unknown, or recently updated library, framework, or API that is not fully covered by your base training, you MUST proactively use the `doc_ingest` MCP tool to scrape and index its official documentation. Before writing complex implementation code for unfamiliar tools, always query `doc_search` or `doc_context` to understand the best practices and latest syntax."
Cursor supports MCP natively. To add the server:
- Open Cursor Settings (Cmd/Ctrl + Shift + J) > Features > MCP.
- Click + Add new MCP server.
- Set the Type to `command`.
- Set the Name to `doc-builder`.
- Set the Command to `doc-mcp` (assuming you installed via `pipx`).
- Add the necessary environment variables (`NEO4J_PASSWORD`, `LITELLM_API_KEY`, etc.) directly in the Cursor UI environment section.
If you use Claude Dev, Roo Code, or similar MCP clients in VS Code:
- Open the MCP configuration file (usually found at `~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json` on Mac).
- Add the server entry:

```json
{
  "mcpServers": {
    "doc-builder": {
      "command": "doc-mcp",
      "args": [],
      "env": {
        "NEO4J_URI": "bolt://localhost:7688",
        "NEO4J_PASSWORD": "your-password",
        "LITELLM_API_KEY": "your-key"
      }
    }
  }
}
```

```
mcp-doc-builder/
├── docker-compose.yml          # Neo4j container
├── .env.example                # Configuration template
└── server/
    ├── pyproject.toml          # Python package
    └── src/doc_builder/
        ├── main.py             # MCP server entry
        ├── config.py           # Settings
        ├── cli/                # Setup wizard & status
        ├── crawler/            # Web scraping
        │   ├── spider.py       # Async crawler
        │   ├── parser.py       # HTML parsing
        │   └── agent.py        # LLM link evaluation
        ├── vector/             # Vectorization
        │   ├── embedder.py     # Gemini embeddings
        │   ├── chunker.py      # Smart chunking
        │   └── indexer.py      # Neo4j vector index
        ├── ontology/           # Knowledge extraction
        │   ├── extractor.py    # Concept extraction
        │   ├── metatag.py      # Metatag processing
        │   └── linker.py       # Relationship building
        ├── kg/                 # Neo4j graph
        │   ├── neo4j.py        # Async client
        │   ├── repo.py         # Query repository
        │   └── schema.cypher   # Database schema
        ├── llm/                # LLM integration
        │   ├── client.py       # LiteLLM wrapper
        │   └── prompts/        # Prompt templates
        ├── mcp/                # MCP protocol
        │   ├── tools.py        # Tool definitions
        │   ├── resources.py    # Resource handlers
        │   └── prompts.py      # Workflow prompts
        └── security/           # Auth & validation
```
- DocSource: Documentation root (URL, name, status)
- DocPage: Individual pages with metadata
- DocChunk: Vectorized content chunks with embeddings
- DocConcept: Extracted concepts (APIs, patterns, entities)
- DocMetatag: Page metatags (og:, twitter:, etc.)
- DocCrawlJob: Crawl job tracking
- `(DocSource)-[:CONTAINS]->(DocPage)`
- `(DocPage)-[:LINKS_TO]->(DocPage)`
- `(DocPage)-[:HAS_CHUNK]->(DocChunk)`
- `(DocChunk)-[:MENTIONS]->(DocConcept)`
- `(DocConcept)-[:RELATES_TO]->(DocConcept)`
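As a sketch of how this graph can be traversed — the node labels and relationship types come from the schema above, but the `name` property is an assumption (the authoritative schema lives in `schema.cypher`) — a concept co-occurrence query might look like:

```cypher
// Concepts mentioned in the same chunks as a given concept,
// ranked by how often they co-occur.
MATCH (c:DocConcept {name: "useState"})<-[:MENTIONS]-(chunk:DocChunk)
      -[:MENTIONS]->(other:DocConcept)
RETURN other.name AS concept, count(chunk) AS cooccurrences
ORDER BY cooccurrences DESC
LIMIT 10
```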
```
# Interactive setup
doc-mcp-setup

# Health check
doc-mcp-status --doctor

# Run server
doc-mcp
```

```
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Type checking
mypy src/

# Linting
ruff check src/
```

MIT
- MCP KG Memory: Knowledge graph memory for AI coding assistants
- Model Context Protocol: MCP specification