KnowMore is a production-grade Retrieval-Augmented Generation (RAG) system engineered to provide high-fidelity, grounded intelligence without reliance on external APIs or GPU hardware. By integrating a Hierarchical Navigable Small World (HNSW) vector search with quantized Large Language Models (LLMs), this system delivers enterprise-level semantic search and response generation on standard CPU-based infrastructure.
Important Disclaimer: Due to a Non-Disclosure Agreement (NDA), the source code and specific execution logic for this project are not publicly available. This tool was developed as a proprietary internal engineering solution for Nokia-based OLTs, NTs, and LTs. This documentation serves as a conceptual and architectural overview of the work performed.
- AI-Powered Semantic Search: Leverages FAISS HNSW indexing and `sentence-transformers` to move beyond keyword matching, understanding the deep technical intent behind engineering queries.
- Knowledge Ingestion & Structured Storage: Utilizes a SQLite3 backend as a centralized "Source of Truth," allowing for structured management of semi-structured technical data before neural processing.
- Snippet-Based Knowledge Architecture: Employs a granular "snippet" approach to data organization, ensuring the RAG pipeline retrieves the exact technical procedure or error code needed without extraneous "noise."
- Version Control & Contributor Tracking: (Internal Logic) Implements metadata tracking for knowledge entries, ensuring clear visibility into when technical specs were updated and by which engineering contributor.
- PII Anonymization & Data Sanitization: Features a robust pre-processing layer that scrubs sensitive data and PII (Personally Identifiable Information) before embedding, ensuring compliance with strict enterprise data privacy standards (see the sketch after this list).
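The anonymization layer itself is proprietary, but a minimal sketch of the idea is shown below. The regex patterns, placeholder format, and the `scrub_pii` helper name are illustrative assumptions, not the internal implementation:

```python
import re

# Illustrative patterns only; a production scrubber would cover many more
# entity types (names, serial numbers, site codes, account IDs, etc.).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def scrub_pii(text: str) -> str:
    """Replace sensitive substrings with typed placeholders before embedding."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}_REDACTED]", text)
    return text

print(scrub_pii("Contact j.doe@example.com from host 10.0.0.12"))
```

Running the scrubber before embedding means no sensitive token ever reaches the vector index or the LLM context.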
Traditional RAG implementations rely on static document files (PDF/MD) that reside on a file server. Updating these systems is a manual, error-prone process involving regeneration and re-uploading of entire files.
KnowMore pioneers a Dynamic Optimization Loop:
- Structured-to-Neural Bridge: Unlike traditional pipelines, KnowMore uses a SQLite-to-Document conversion layer.
- Instant Re-Indexing: The "Optimise Knowledge Base" feature allows administrators to edit database records and instantly regenerate the chunks JSON and FAISS index (sketched below). This ensures the assistant's intelligence remains "live" and synchronized with the latest engineering changes.
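As an illustration of this structured-to-neural bridge, the sketch below assumes a hypothetical `snippets` table and illustrative file paths and model checkpoint; it is not the internal code, only the shape of the loop:

```python
import json
import sqlite3

import faiss
from sentence_transformers import SentenceTransformer

def optimise_knowledge_base(db_path: str, chunks_path: str, index_path: str) -> None:
    """Export snippet rows from SQLite, rewrite the chunks JSON, and regenerate the FAISS index."""
    # Hypothetical schema: id, title, body (plus metadata columns).
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT id, title, body FROM snippets").fetchall()
    conn.close()

    chunks = [{"id": r[0], "text": f"{r[1]}\n{r[2]}"} for r in rows]
    with open(chunks_path, "w", encoding="utf-8") as f:
        json.dump(chunks, f)

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = model.encode([c["text"] for c in chunks], normalize_embeddings=True)

    index = faiss.IndexHNSWFlat(vectors.shape[1], 32)  # M=32, as used by the system
    index.hnsw.efConstruction = 200
    index.add(vectors)
    faiss.write_index(index, index_path)
```

Because the rebuild reads straight from the database, an edited record is reflected in retrieval as soon as the function completes; no files need to be re-uploaded to a server.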
Deployment in secure or legacy engineering environments often precludes the use of cloud APIs or high-end NVIDIA hardware.
- Compute-Efficient Inference: By utilizing 4-bit Q4_K_M quantization, the system achieves a memory-to-logic "sweet spot" that runs entirely on standard Intel/AMD server CPUs.
- Hardware Democratization: This architecture enables the deployment of Large Language Models within standard enterprise infrastructure, requiring only 16 CPU cores and 30Gi of RAM to deliver professional-grade troubleshooting.
The system is designed to bypass the traditional limitations of static, document-based RAG pipelines. Instead, it utilizes a dynamic data pipeline that facilitates real-time knowledge base optimization.
- Vector Infrastructure: Utilizes FAISS (Facebook AI Similarity Search) with an HNSW index for high-speed, approximate nearest neighbor retrieval.
- Local Inference: Powered by a quantized LLaMA 3 (8B) model in GGUF format via the `llama.cpp` runtime.
- Hardware Agnostic: Specifically optimized for no-GPU environments, achieving low-latency inference on standard 16-core server CPUs.
- Dynamic Knowledge Management: Features an "Optimise Knowledge Base" utility that allows for on-the-fly index regeneration and metadata updates without requiring manual file server uploads.
| Layer | Technology | Implementation Detail |
|---|---|---|
| LLM Engine | `llama.cpp` | GGUF Quantized LLaMA 3 (8B) |
| Embeddings | `sentence-transformers` | MiniLM-L6-v2 for CPU efficiency |
| Vector Search | FAISS | HNSWFlat (M=32, efConstruction=200) |
| Data Logic | Python 3.x | Custom sliding window chunking & hashing |
| Output Rendering | Markdown/HTML | Post-processed deterministic outputs |
The system processes structured technical documentation using a sliding window strategy. Each chunk is set to 250 words with a 50-word overlap. This specific configuration ensures that semantic context is preserved across boundaries, significantly reducing the risk of "lost-in-the-middle" context fragmentation during the retrieval phase.
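A minimal sketch of such a word-level sliding window follows; the function name and the exact boundary handling are illustrative:

```python
def sliding_window_chunks(text: str, size: int = 250, overlap: int = 50) -> list[str]:
    """Split text into chunks of `size` words, with `overlap` words shared between
    consecutive chunks so that phrasing near a boundary appears in both."""
    words = text.split()
    step = size - overlap  # advance 200 words per chunk for 250/50
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```

With a 200-word step, every boundary sentence is embedded twice, once at the tail of one chunk and once at the head of the next, which is what preserves context across chunk edges.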
Retrieval is not merely a distance calculation. KnowMore implements a multi-step refinement process:
- Query Normalization: Strips conversational noise to focus on core semantic intent.
- Similarity Thresholding: A strict threshold (0.45) is applied to ensure only relevant context is passed to the LLM.
- MD5 Deduplication: Hashes the initial characters of each retrieved chunk with MD5 to prevent redundant data from consuming the LLM's limited context window (see the sketch after this list).
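The sketch below shows how these three steps could compose. The stop-phrase list, the 200-character hashing window, the 3-chunk cap, and the function names are illustrative assumptions; only the 0.45 threshold comes from this document:

```python
import hashlib
import re

SIMILARITY_THRESHOLD = 0.45
MAX_CONTEXT_CHUNKS = 3
STOP_PHRASES = ("please", "can you", "could you", "i want to know")  # illustrative

def normalise_query(query: str) -> str:
    """Strip conversational filler so only the technical intent is embedded."""
    q = query.lower().strip()
    for phrase in STOP_PHRASES:
        q = q.replace(phrase, "")
    return re.sub(r"\s+", " ", q).strip()

def select_context(candidates: list[tuple[str, float]]) -> list[str]:
    """Keep chunks above the similarity threshold, drop near-duplicates by
    hashing their opening characters, and cap the context passed to the LLM."""
    selected, seen = [], set()
    for text, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if score < SIMILARITY_THRESHOLD:
            continue
        digest = hashlib.md5(text[:200].encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        selected.append(text)
        if len(selected) == MAX_CONTEXT_CHUNKS:
            break
    return selected
```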
By leveraging 4-bit quantization, the LLaMA 3 (8B) model is capable of running within a 30Gi RAM footprint. The inference engine is tuned for deterministic performance with a temperature of 0.2, ensuring that technical responses remain factual and consistent across multiple queries.
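For illustration, serving a Q4_K_M GGUF model on CPU can be done with `llama-cpp-python` as sketched below. The model path, context size, and prompt template are assumptions; the temperature of 0.2, the 16-thread target, and the mmap strategy are the values described in this document:

```python
from llama_cpp import Llama

# Illustrative model path and settings; the production values are internal.
llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_ctx=8192,      # room for the prompt plus retrieved chunks
    n_threads=16,    # match the 16-core CPU target
    use_mmap=True,   # let the OS page the weights instead of copying them
)

def answer(question: str, context: str) -> str:
    """Generate a grounded answer from the retrieved context."""
    prompt = (
        "Answer strictly from the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    out = llm(prompt, max_tokens=200, temperature=0.2)
    return out["choices"][0]["text"]
```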
Building a production-grade RAG system on a 16-core CPU required deliberate trade-offs to balance accuracy, memory safety, and latency.
Initially, `IndexFlatL2` was used for 100% retrieval accuracy. However, as the knowledge base scaled, search latency increased linearly.
- Decision: Migrated to HNSW (Hierarchical Navigable Small World) with $M=32$ and `efConstruction=200`.
- Trade-off: Accepted a negligible ~0.01% drop in recall in exchange for search time that scales logarithmically rather than linearly. This was vital to keep the retrieval phase under $1\%$ of the total pipeline time, leaving maximum headroom for the LLM (see the comparison sketch after this list).
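The migration can be reproduced in a few lines of FAISS; the corpus here is random stand-in data of MiniLM's 384-dimensional width, not the real knowledge base:

```python
import faiss
import numpy as np

d = 384  # all-MiniLM-L6-v2 embedding width
xb = np.random.rand(10_000, d).astype("float32")  # stand-in corpus vectors

# Exact baseline: linear scan, perfect recall, latency grows with corpus size.
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# HNSW replacement: graph-based ANN with near-logarithmic search behaviour.
hnsw = faiss.IndexHNSWFlat(d, 32)      # M=32 neighbours per graph node
hnsw.hnsw.efConstruction = 200         # build-time beam width
hnsw.add(xb)

query = np.random.rand(1, d).astype("float32")
print(flat.search(query, 5)[1])  # ground-truth neighbours
print(hnsw.search(query, 5)[1])  # ANN neighbours, near-identical in practice
```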
With a local LLaMA 3 (8B) model, the context window is a finite resource.
- Decision: Implemented a Strict Similarity Threshold (0.45) and capped context at 3 chunks.
- Trade-off: While a lower threshold would provide more data, it introduced "noise" that caused the LLM to hallucinate during technical protocol explanations. We prioritized grounding (precision) over breadth (recall) to ensure engineering accuracy.
- Decision: Selected 4-bit Q4_K_M quantization via `llama.cpp`.
- Trade-off: While 8-bit quantization offers higher logic fidelity, it risked OOM (Out of Memory) errors when the FAISS index and the OS buffer cache were active. The 4-bit Medium "K-spec" provided the optimal "sweet spot," retaining 95%+ of the base model's quality (as measured by perplexity) while fitting comfortably within the 30Gi RAM envelope.
- Memory Pressure Management: Operating on 30Gi RAM meant the `llama.cpp` process and the FAISS HNSW index competed for the same heap space. I implemented a strict memory-mapping (mmap) strategy for the GGUF model to ensure the OS could manage paging effectively without crashing the vector search.
- The "Cold Start" Problem: Initial queries were slow due to model loading. I developed a pre-warm script that initializes the model and performs a dummy embedding pass upon system startup, reducing first-response latency by 40% (a sketch follows this list).
- Precision vs. Recall in HNSW: Standard flat indexes were too slow for the document volume. I transitioned to `IndexHNSWFlat` with $M=32$. I found that increasing `efConstruction` to 200 was the "sweet spot" for our specific technical documentation, ensuring that obscure OLT error codes were never missed during retrieval.
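A minimal sketch of such a pre-warm routine, assuming the same `llama-cpp-python` and `sentence-transformers` stack; the function name, checkpoint, and one-token warm-up call are illustrative:

```python
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer

def prewarm(model_path: str) -> tuple[Llama, SentenceTransformer]:
    """Load both models at startup and run dummy passes so the first real
    query does not pay the cold-start cost."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    llm = Llama(model_path=model_path, n_threads=16, use_mmap=True)

    embedder.encode(["warm-up"])                    # page in the embedding weights
    llm("warm-up", max_tokens=1, temperature=0.0)   # touch the mmap'd GGUF weights
    return llm, embedder
```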
Testing was conducted on an Intel Core i7-10700 CPU @ 2.90GHz (8 cores / 16 threads) with 30Gi RAM.
| Metric | Duration (s) | Observations |
|---|---|---|
| Average Time | 33.18 | Typical end-to-end latency for a 200-token generation pass. |
| Minimum Time | 13.80 | Observed on "cache-hit" scenarios or queries requiring short responses. |
| Maximum Time | 53.51 | Occurred during complex multi-document synthesis or cold-start model loading. |
- Memory Footprint: Sustained at ~14GB (Model + Index + Runtime), leaving roughly 50% headroom for system operations.
- CPU Load: Peaked at 85% during the embedding generation and LLM prompt processing phase, normalizing to 40% during token streaming.
- Throughput: Optimized for a single-user engineering assistant; sequential query handling ensured no race conditions on the FAISS index during "Live Data" updates.
| Parameter | Value | Logic / Rationale |
|---|---|---|
| Chunk Size | 250 Words | Fits within the limited context window of LLaMA 3 while preserving technical phrasing. |
| Similarity Threshold | 0.45 | Calibrated via a "Golden Dataset" of 50 manual queries to filter out noise while allowing for semantic variations. |
| Top-K Retrieval | 5 | Determined through testing to be the maximum context count before the LLM began "hallucinating" by merging unrelated document data. |
| Quantization | Q4_K_M | Selected as the optimal balance between 4-bit compression and maintaining the logic required for complex network protocols. |


