Knowledge Base Construction and Retrieval Documentation

Overview

The knowledge base system implements Retrieval-Augmented Generation (RAG) on top of vector search: it retrieves semantically relevant passages from documents so that answers can be grounded in that context.

Core Components

1. Knowledge Index (retrieval.knowledge_index)

The KnowledgeIndex uses FAISS for efficient vector retrieval:

  • Vector Storage: FAISS-based efficient vector index
  • Similarity Calculation: Cosine similarity
  • Persistence: Supports index saving and loading
  • Scalability: Supports incremental addition of new documents
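
The four properties above can be illustrated without FAISS. The following is a minimal, dependency-free sketch, not the project's KnowledgeIndex: the class name TinyCosineIndex, its method signatures, and the JSON file format are all assumptions made for illustration. It also assumes the vectors it receives are already unit-length, so the inner product it computes is the cosine similarity.

```python
import json
from typing import List, Tuple


class TinyCosineIndex:
    """Hypothetical stand-in for a FAISS-backed index (illustration only)."""

    def __init__(self) -> None:
        self.vectors: List[List[float]] = []
        self.texts: List[str] = []

    def add(self, vectors: List[List[float]], texts: List[str]) -> None:
        # Incremental addition: new entries are appended to the existing ones.
        # Assumption: every vector is already L2-normalized.
        if len(vectors) != len(texts):
            raise ValueError("vectors and texts must have the same length")
        self.vectors.extend(list(v) for v in vectors)
        self.texts.extend(texts)

    def search(self, query: List[float], top_k: int = 5) -> List[Tuple[float, str]]:
        # On unit vectors the inner product equals cosine similarity.
        scored = [
            (sum(q * x for q, x in zip(query, vec)), text)
            for vec, text in zip(self.vectors, self.texts)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]

    def save(self, path: str) -> None:
        # Persistence: write vectors and texts to a single JSON file.
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"vectors": self.vectors, "texts": self.texts}, f)

    def load(self, path: str) -> None:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        self.vectors, self.texts = data["vectors"], data["texts"]
```

A brute-force scan like this is linear in the number of stored vectors. A dedicated library such as FAISS exists to make the same operation fast at scale.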

2. Text Encoder (model.text_encoder)

The TextEncoder transforms text into vector representations:

  • Backend Support: SentenceTransformers and HuggingFace backends
  • Stability: Optimized for stability on macOS and other platforms
  • Batch Processing: Supports batch encoding for efficiency
  • Normalization: Outputs normalized vectors for similarity calculation
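
Normalization matters because the dot product of two unit-length vectors equals their cosine similarity, so an inner-product search then ranks by cosine. The sketch below shows this on plain Python lists. The helper names l2_normalize and dot are illustrative and are not part of TextEncoder.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])  # approximately [0.6, 0.8]
b = l2_normalize([4.0, 3.0])  # approximately [0.8, 0.6]
similarity = dot(a, b)        # approximately 0.96, the cosine of the angle between them
```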

3. Tool Interface (tools.knowledge_tool)

The KnowledgeBaseTool provides a standardized knowledge retrieval interface for LLMs:

from typing import List

class KnowledgeBaseTool:
    def search(
        self,
        query: str,
        top_k: int = 5
    ) -> List[str]:
        """Return the text chunks retrieved for the query."""
        ...

Returns a list of strings, each element being a text chunk.
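
Because search returns plain strings, the caller decides how to present them to the model. One possible approach, sketched below, is to number the chunks and join them into a single context string. The function build_context and its max_chars limit are hypothetical and do not come from the project.

```python
from typing import List


def build_context(chunks: List[str], max_chars: int = 2000) -> str:
    """Join retrieved chunks into a numbered context block, stopping at max_chars."""
    parts: List[str] = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # keep the context within the assumed budget
        parts.append(entry)
        used += len(entry)
    return "\n".join(parts)
```

Numbering the chunks gives the model a way to refer back to a specific passage in its answer.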

Construction Process

1. Document Preprocessing

  • PDF Parsing: Extract content from PDF documents
  • Text Chunking: Split long documents into smaller passages
  • Metadata Preservation: Maintain source file information
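
The chunking and metadata steps can be sketched as follows. This is an illustration under assumptions, not the project's preprocessing code: the Chunk dataclass, the word-based splitting, and the max_words and overlap parameters are all hypothetical. PDF parsing is left out because it depends on a third-party library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    source: str    # file the chunk came from
    position: int  # index of the chunk within that file


def chunk_text(text: str, source: str, max_words: int = 200, overlap: int = 20) -> List[Chunk]:
    """Split text into overlapping word windows, tagging each with its source."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks: List[Chunk] = []
    for position, start in enumerate(range(0, len(words), step)):
        window = words[start:start + max_words]
        chunks.append(Chunk(" ".join(window), source, position))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of some duplicated text in the index.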

2. Vector Index Construction

  • Batch Encoding: Convert text chunks to vectors
  • Index Building: Establish index in vector space
  • Persistent Storage: Save index files
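
Batch encoding amounts to walking the chunk list in fixed-size slices and encoding each slice in one call. The sketch below shows only that control flow. The toy_encode function is a stand-in that hashes words into buckets so the example runs without a model. It is not a real embedding and is not how TextEncoder works.

```python
import hashlib
from typing import Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def toy_encode(batch: List[str], dim: int = 8) -> List[List[float]]:
    """Stand-in encoder: hashed bag of words, NOT a real embedding model."""
    vectors = []
    for text in batch:
        vec = [0.0] * dim
        for word in text.lower().split():
            digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
            vec[int(digest, 16) % dim] += 1.0
        vectors.append(vec)
    return vectors


chunks = ["vector search", "text chunking", "index building",
          "query encoding", "result ranking"]
vectors: List[List[float]] = []
for batch in batched(chunks, batch_size=2):
    vectors.extend(toy_encode(batch))  # one encoder call per batch
```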

3. Retrieval Process

  • Query Encoding: Transform query to vector
  • Approximate Search: Find similar vectors in index
  • Result Ranking: Sort by similarity score
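
For the ranking step, only the best few results are needed, so a full sort of every score can be avoided. A minimal sketch using the standard library's heapq, with the hypothetical helper name top_k_by_score:

```python
import heapq
from typing import List, Tuple


def top_k_by_score(scores: List[float], texts: List[str], k: int) -> List[Tuple[float, str]]:
    """Return the k highest-scoring (score, text) pairs, best first."""
    return heapq.nlargest(k, zip(scores, texts), key=lambda pair: pair[0])
```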

Script Details

Construction Scripts

  • scripts/build_knowledge_index.py: Build knowledge base index
  • scripts/build_complete_knowledge_rag.py: Build complete RAG system
  • scripts/build_external_knowledge_data.py: Process external knowledge documents

Example Scripts

  • scripts/retrieve_knowledge.py: Knowledge base retrieval demonstration
  • examples/knowledge_rag_example.py: RAG application example
  • examples/knowledge_rag_usage_examples.py: Various usage examples
  • examples/knowledge_rag_validation.py: Validate RAG effectiveness

Configuration Parameters

Relevant parameters in config.py:

  • knowledge_index_path: Knowledge base index path
  • embedding_dim: Vector dimension (default 384 for MiniLM)
  • top_k: Number of results to return
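
One way to group these three parameters is a small frozen dataclass, sketched below. The class name KnowledgeConfig is hypothetical, and the actual layout of config.py may differ. The path default is borrowed from the usage example, the top_k default mirrors the search signature, and only the embedding_dim default of 384 is stated in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeConfig:
    # Assumed default, borrowed from the usage example in this document.
    knowledge_index_path: str = "./data/knowledge_index"
    # Stated default: 384 dimensions for MiniLM.
    embedding_dim: int = 384
    # Assumed default, mirroring KnowledgeBaseTool.search(top_k=5).
    top_k: int = 5

    def __post_init__(self) -> None:
        if self.embedding_dim <= 0:
            raise ValueError("embedding_dim must be positive")
        if self.top_k <= 0:
            raise ValueError("top_k must be positive")
```

The embedding_dim value has to match the encoder that produced the stored vectors. An index built at one dimension cannot be queried with vectors of another.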

Usage Examples

from retrieval.knowledge_index import KnowledgeIndex
from model.text_encoder import TextEncoder
from tools.knowledge_tool import KnowledgeBaseTool

# Load knowledge base
knowledge_index = KnowledgeIndex()
knowledge_index.load("./data/knowledge_index")

# Initialize encoder
text_encoder = TextEncoder()

# Create tool instance
knowledge_tool = KnowledgeBaseTool(knowledge_index, text_encoder)

# Execute retrieval
results = knowledge_tool.search("deep learning fundamentals", top_k=3)