Port graphrag-rs core library to Swift by ronaldmannak · Pull Request #1 · PicoMLX/GraphRAG

ronaldmannak · 2026-06-28T01:45:53Z

Summary

This PR ports the core library of graphrag-rs — a graph-based retrieval-augmented generation system — into idiomatic, dependency-free Swift 6 code. The implementation runs fully offline by default and optionally integrates with local Ollama servers for LLM-backed extraction and answer generation.

Key Changes

Core Infrastructure

Added strongly-typed identifiers (DocumentID, EntityID, ChunkID) and domain models (Entity, Relationship, TextChunk, Document)
Implemented unified error handling via GraphRAGError enum
Defined pluggable protocols for LanguageModel, EmbeddingModel, and EntityExtracting

Text Processing

Ported HierarchicalChunker: recursive, separator-aware text chunking with configurable overlap
Implemented TextProcessor for document chunking with metadata enrichment (word counts, keywords)
Added TfIdfKeywordExtractor for corpus-aware keyword extraction with stopword filtering

Entity & Relationship Extraction

Implemented PatternEntityExtractor: offline, heuristic-based entity detection via Title-Case span analysis and suffix/prefix classification
Implemented LLMEntityExtractor: LLM-driven extraction with staged JSON parsing fallbacks and optional gleaning rounds
Added Prompts module with extraction and answer-generation templates

Knowledge Graph

Ported KnowledgeGraph: value-type graph with insertion-order preservation, O(1) entity/relationship lookup, and adjacency indexes
Implemented graph analytics: degree, closeness, and betweenness centrality (Brandes' algorithm)
Added PageRank computation and GraphTraversal (BFS/multi-source BFS with edge-strength filtering)

Retrieval

Implemented BM25Retriever: Okapi BM25 keyword search with configurable k1/b parameters
Implemented InMemoryVectorStore: brute-force cosine-similarity vector search
Implemented HybridRetriever: fused keyword + semantic search with configurable fusion methods (RRF, weighted sum, max-score)

Embeddings

Implemented HashEmbedder: deterministic, offline embedding via FNV-1a hashing with L2 normalization
Ported OllamaConfig and OllamaLanguageModel: HTTP client for local Ollama daemon (completions and embeddings)

High-Level Orchestration

Implemented GraphRAG actor: thread-safe orchestrator managing document ingestion, graph building, and question answering
Implemented GraphRAGBuilder: fluent configuration builder with sensible defaults
Implemented Config hierarchy for chunking, embedding, entity extraction, retrieval, and graph parameters

Testing

Added comprehensive test suite covering chunking, keyword extraction, BM25 ranking, vector similarity, embeddings, entity extraction, and end-to-end QA

Notable Implementation Details

Character-based offsets: The Rust implementation uses UTF-8 byte indices; this port operates on Swift Character arrays (extended grapheme clusters) for Unicode safety by construction
Actor-based concurrency: GraphRAG is an actor to safely share mutable graph/index state across async tasks
Pluggable backends: Embedders, language models, and entity extractors are injected as existentials, allowing offline defaults with optional LLM/Ollama integration
Deterministic hashing: HashEmbedder provides stable, reproducible embeddings without external dependencies
Graceful degradation: TF-IDF scoring and BM25 retrieval degrade gracefully when corpus statistics are unavailable

https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

Reimplement the core of the Rust crate graphrag-rs (https://github.com/automataIA/graphrag-rs) as an idiomatic, dependency-free Swift 6 package. Modules: - Core: Document/TextChunk/Entity/Relationship model, typed IDs, GraphRAGError, and pluggable protocols (LanguageModel, EmbeddingModel, EntityExtracting, ChunkingStrategy). - Text: HierarchicalChunker, TextProcessor, TfIdfKeywordExtractor. - Graph: KnowledgeGraph (value-type adjacency), PageRank, traversal (BFS/DFS/ego/paths), analytics (degree/closeness/betweenness/components). - Retrieval: BM25Retriever, cosine InMemoryVectorStore, HybridRetriever with RRF/weighted/CombSUM/MaxScore fusion. - Entity: PatternEntityExtractor (offline) and LLMEntityExtractor with the upstream extraction prompts and staged JSON-recovery parsing. - Embeddings: deterministic offline HashEmbedder and OllamaEmbedder; OllamaClient for LLM completion. - Orchestration: GraphRAG actor (ingest -> build -> ask), GraphRAGBuilder, Config. Defaults mirror the Rust crate (PageRank d=0.85/tol=1e-6, BM25 k1=1.2/b=0.75, hybrid RRF k=60, weights 0.7/0.3, traversal depth 3, min strength 0.5). Runs fully offline by default; optional local Ollama for LLM-backed extraction and answer generation. Includes Swift Testing coverage and a README. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

gemini-code-assist

Code Review

This pull request introduces a comprehensive Swift 6 port of the Rust graphrag-rs crate, implementing a fully offline, graph-based Retrieval Augmented Generation pipeline. It includes core domain models, hierarchical text chunking, TF-IDF keyword extraction, a value-type knowledge graph, graph algorithms (PageRank, BFS/DFS traversal, centrality analytics), hybrid retrieval (BM25 and vector search fusion), and local LLM integration via Ollama. The review feedback identifies several critical correctness issues and architectural improvements: correcting the BM25 scoring formula to avoid double-penalizing document length, fixing character offset calculations during punctuation trimming in pattern extraction, adopting modern Swift 6 async-native APIs for network requests to support task cancellation, ensuring Unicode-safe offset calculations in LLM extraction, and implementing the documented deduplication behavior for graph neighbors.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

…ighbor dedup - BM25: use raw term frequency; document-length normalization is handled only by the |D|/avgdl denominator factor (previously normalized twice). - PatternExtractor: trim leading/trailing punctuation on the index range so recorded span offsets stay aligned with the source text. - Ollama: replace the continuation-wrapped dataTask with async-native URLSession.data(for:), which supports task cancellation. - LLMExtractor: derive mention endOffset from the matched range's upperBound (case folding can change grapheme counts). - KnowledgeGraph.neighbors(of:): deduplicate by (neighbor, relationType) to match the documented contract. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

chatgpt-codex-connector

💡 Codex Review

https://github.com/PicoMLX/GraphRAG/blob/c2513c089f3a462b4200149487edb526c4acbda9/Sources/GraphRAG/Retrieval/Hybrid.swift#L336-L337
Drop zero-similarity semantic hits before RRF

When keyword search finds nothing and the query embedding has zero or negative cosine similarity to every chunk, vectors.search still returns the nearest maxCandidates; RRF then disables the absolute score threshold and returns those arbitrary chunks. In the offline pipeline this makes ask() answer from unrelated context instead of taking its no-results path for empty or off-topic queries.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Hybrid: drop non-positive cosine hits before RRF so off-topic/empty queries take the no-results path instead of surfacing arbitrary chunks. - Hybrid: reject non-positive search limits (prefix traps on negatives). - Engine: remove a document's stale chunks when it is replaced by the same id. - Engine: guard against reentrant/overlapping builds; mark unbuilt on entry so a failed rebuild can't leave stale state queryable; only mark built if no document was ingested during the build (snapshot chunk ids). - Engine/Builder: honor config.embedding.backend == "ollama" for a manually constructed/deserialized Config, not only the withOllamaEmbeddings() flag. - Traversal: stop DFS expansion at maxDepth so edges to unvisited nodes are not recorded (matches BFS and the documented depth limit). - KeywordExtraction: smoothed IDF (log((N+1)/(df+1))+1) so a fresh extractor ranks by term frequency instead of collapsing every score to zero. - Chunking: reject negative chunk overlap (would skip text between chunks). - PatternExtractor: stop a Title-Case run at separating punctuation so "Alice, Bob" yields two entities rather than one merged node. Adds regression tests for negative overlap, document replacement, DFS depth bounds, and negative top-K. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

ronaldmannak · 2026-06-28T02:03:38Z

All Codex findings are addressed in af6d547 (replies left on each inline thread; threads resolved):

Drop zero-similarity semantic hits before RRF (the top-level review note): HybridRetriever.search now filters out non-positive cosine hits before fusion, so an off-topic/empty query (no keyword matches and all similarities ≤ 0) takes ask()'s no-results path instead of surfacing arbitrary nearest chunks.
Stale chunks removed when a document is replaced (removeChunks(forDocument:)).
Reentrant/overlapping builds rejected; isBuilt=false on entry; build only succeeds if no document was ingested mid-build (snapshot + ingestionVersion).
Builder/Engine honor config.embedding.backend == "ollama".
DFS stops expanding at maxDepth (no dangling edges).
Smoothed TF-IDF so a fresh extractor ranks by term frequency.
Negative chunk overlap and non-positive search limits rejected.
Title-Case runs split on ,/;/: so comma-separated names become separate entities.

Regression tests added for document replacement, DFS depth bounds, negative overlap, and negative top-K.

@codex review

Generated by Claude Code

Runs `swift build`/`swift test` in the official swift:6.3 container on every push and pull request, matching the package's swift-tools-version. Adds a CI status badge to the README. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

The swift:6.3 image does not exist on Docker Hub yet (6.3 is newly announced and not in the stable registry), so the container failed to start. No released toolchain can parse swift-tools-version 6.3 either. The port only uses standard Swift 6 features (actors, Sendable, async/await, swift-testing), so lower the manifest floor to 6.1 and pin the CI container to swift:6.1 so builds and tests actually run. Bump both back to 6.3 once a stable 6.3 toolchain/image ships. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c845d99b90

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Engine: honor `approach` (hybrid/keyword/semantic) by toggling the BM25 and vector stages in retrieval. - Engine: thread `retrieval.similarityThreshold` into the semantic stage so weak matches can be suppressed as configured. - Engine: enforce `maxEntitiesPerChunk` (keep highest-confidence) and drop relationships whose endpoints didn't survive the cap. - Engine: always write a chunk's entity ids during build (empty clears stale ids left by a prior build when extraction now returns none). - Config/Builder: remove redundant chunkSize/overlap from TextConfig so the top-level chunkSize/chunkOverlap is the single source of truth (a manually constructed Config can no longer chunk at the wrong size). - Analytics.density(): count unique undirected pairs so reciprocal/multiple typed edges can't push density above 1.0. - KnowledgeGraph.neighbors(of:): keep the highest-confidence edge per (neighbor, relationType) so a weak A->B can't hide a strong reciprocal B->A from strength-filtered traversals. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 159b1162ac

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

… heuristics Pattern extractor: - Stop Title-Case runs at sentence punctuation (./!/?) with an abbreviation/title exception so "Acme. Bob" splits but "Dr. Smith"/"Acme Inc." stay merged. - Recognize entities preceded by opening punctuation (quotes/parens). - Orient asymmetric inferred relationships (WORKS_FOR/LEADS/BORN_IN/ HEADQUARTERED_IN/LOCATED_IN) by entity role, not text order. Config (single source of truth): - Remove duplicated topK/similarityThreshold from RetrievalConfig; top-level Config.topKResults/similarityThreshold are authoritative (Engine threads the latter into the semantic stage). - Builder: drive the embedding backend solely from config.embedding.backend (no sticky flag, so a later withConfig can switch back to hash), and sync the Ollama embedder dimension from the config. Robustness: - Engine.build re-fetches each chunk before writing so a document replaced during an extraction/embedding await isn't clobbered by the pre-await snapshot. - PageRank clamps negative maxIterations and guards topEntities against negative k. - Traversal findAllPaths treats negative depth as no expansion; egoNetwork de-duplicates emitted edges. - HybridRetriever.index drops a stale vector when reindexing an id without an embedding. - Prompts.fill scans once and never re-substitutes inserted values (keeps literal braces intact). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5cf557671a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-28T02:59:54Z

+                // Keep only relationships whose endpoints exist in the graph
+                // (drops edges to entities removed by the per-chunk cap).
+                for relationship in relationships
+                where graph.contains(relationship.source) && graph.contains(relationship.target) {


Filter capped relationships against selected entities

Fresh evidence after the cap fix is that this filter checks global graph state, so when the per-chunk cap drops an entity that already exists from an earlier chunk, graph.contains still passes and records the relationship. For example with maxEntitiesPerChunk = 1, if chunk 1 added A and chunk 2 extracts B high-confidence, A low-confidence, plus A-B, the cap removes A from chunk 2 but the edge is still added because A is already in the graph; that defeats the cap and attaches evidence to an entity the chunk deliberately discarded.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T02:59:54Z

+                            let word = String(chars[ws..<j])
+                                .trimmingCharacters(in: CharacterSet(charactersIn: ".!?"))
+                                .lowercased()
+                            if !PatternEntityExtractor.abbreviations.contains(word) { break }


Stop org abbreviations from merging the next sentence

For sentence-final organization suffixes such as Acme Inc. Bob joined, inc is in the abbreviation allowlist, so this branch does not break on the period and the next uppercase word is consumed as the same span (Acme Inc. Bob). That drops both the organization and the person into a bogus merged entity for a common company-name pattern; only title/person abbreviations that need a following name should be allowed to cross the period.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T02:59:54Z

+
+    public mutating func addChunk(_ chunk: TextChunk) {
+        if chunksByID[chunk.id] == nil { chunkOrder.append(chunk.id) }
+        chunksByID[chunk.id] = chunk


Keep document chunk copies in sync

When build() later updates a chunk with extracted entity IDs and embeddings, this replaces only chunksByID; the copy inside documentsByID[chunk.documentID].chunks remains the pre-build version. Consumers reading graph.document(id)?.chunks or the documents[].chunks section of saved JSON therefore see stale chunks missing entities/embeddings while graph.chunks contains the enriched data.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T02:59:54Z

+            }
+        }
+        let index = relationships.count
+        relationships.append(relationship)


Reject relationships with missing endpoints

If a public caller adds a relationship whose source exists but target has not been added, the edge is still indexed here and neighbors(of:) exposes the missing target. Traversal from the existing source can then return an EntityID in TraversalResult.entities that is absent from the graph, so relationship insertion should validate both endpoints or ignore dangling edges.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T02:59:54Z

+        for chunk in graph.chunks {
+            index(id: chunk.id.raw, content: chunk.content, embedding: chunk.embedding)


Clear stale ids when indexing a new graph

This public full-graph indexing path only upserts the chunks present in the new graph, so if a caller indexes graph A and later indexes graph B after chunks were removed, the old ids remain in contents, BM25, and the vector store. Searches can still return deleted chunk text unless callers know to call clear() first, which makes index(graph:) unsafe as a reindex operation.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5cf557671a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-28T03:00:04Z

+                for relationship in relationships
+                where graph.contains(relationship.source) && graph.contains(relationship.target) {


Filter relationships against this chunk's capped entities

Fresh evidence after the earlier cap fix is that this condition checks the global graph, not the entities that survived the current chunk's maxEntitiesPerChunk cap. If a low-confidence endpoint is dropped in this chunk but the same entity id was seen in a previous chunk, graph.contains still passes and the over-cap relationship is added with this chunk as context, defeating the per-chunk cap and creating edges for entities excluded from the chunk.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:00:04Z

+    public mutating func addChunk(_ chunk: TextChunk) {
+        if chunksByID[chunk.id] == nil { chunkOrder.append(chunk.id) }
+        chunksByID[chunk.id] = chunk


Keep document chunk snapshots in sync

When build() later calls graph.addChunk(current) to attach entity ids and embeddings, this only updates chunksByID; the copy stored in documentsByID[chunk.documentID].chunks remains the original pre-build chunk. After a successful build, callers reading knowledgeGraph().document(id)?.chunks or the encoded documents section from save(toJSON:) see chunks with stale entities/embedding even though graph.chunks contains the indexed data.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:00:04Z

+            if path.count > maxDepth { continue }
+            for (neighbor, _) in neighbors(of: current) where !visited.contains(neighbor) {
+                let newPath = path + [neighbor]
+                if neighbor == target { return newPath }


Enforce zero-hop maxDepth in relationship paths

For maxDepth == 0, path.count is still 1 at the source, so this loop expands one edge and returns a direct neighbor before checking the new path length. Callers asking for paths within zero hops between different entities can therefore get [source, target], contrary to the documented hop-count limit.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:00:04Z

+        for rawWord in text.split(whereSeparator: { $0.isWhitespace }) {
+            var cleaned = ""
+            for ch in rawWord where ch.isLetter || ch.isNumber {
+                cleaned.append(contentsOf: ch.lowercased())
+            }


Split punctuation-separated BM25 terms

When indexed text contains common punctuation between words, such as graph-based, nodes/edges, or Paris,France, this tokenizer strips the punctuation and concatenates the pieces into one token (graphbased, nodesedges, etc.). Keyword search for the individual terms then cannot match those documents, which is especially visible for keyword retrieval mode where BM25 is the only signal.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:00:04Z

+            if let range = lowerContent.range(of: name.lowercased()) {
+                // Derive both offsets from the matched range; case folding can
+                // change grapheme counts, so `start + name.count` is unreliable.
+                let start = lowerContent.distance(from: lowerContent.startIndex, to: range.lowerBound)
+                let end = lowerContent.distance(from: lowerContent.startIndex, to: range.upperBound)
+                mentions.append(


Match LLM entity mentions on token boundaries

For LLM extraction, this records the first case-insensitive substring match for an entity name. If the name appears inside an earlier token, e.g. the model extracts Ann from Annabelle later met Ann, the mention offsets point at Annabelle instead of the actual entity mention, corrupting the graph's evidence spans for that entity.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:00:04Z

+        for i in 0..<entities.count {
+            for j in (i + 1)..<entities.count {
+                let a = entities[i]
+                let b = entities[j]
+                let relType = relationType(for: a.entityType, b.entityType, context: context)


Scope pattern relationships to nearby mentions

When one chunk contains more than one unrelated fact, these nested loops evaluate every entity pair against the whole chunk context. For example, Alice Smith works for Acme Corporation. Bob Jones works for Beta Corporation. makes the single works for phrase apply to both people and both organizations, so default offline builds add false WORKS_FOR edges whenever multiple facts share a chunk.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:00:04Z

+        self.numCtx = numCtx
+    }
+
+    var baseURL: String { "\(host):\(port)" }


Normalize bare Ollama host names

When callers pass a normal host value like host: "localhost" or host: "127.0.0.1", this constructs URLs such as localhost:11434/api/tags without an HTTP scheme; URL treats localhost as the scheme and leaves the host empty, so URLSession rejects the request and Ollama appears unavailable despite a valid host. Either normalize bare hosts by adding http:// or make the parameter clearly require a full URL base.

Useful? React with 👍 / 👎.

…dation - PatternExtractor: only person titles cross a sentence period (org suffixes like "Acme Inc. Bob" now split); infer relationships only between entities sharing a sentence, using that sentence as keyword context (prevents one "works for" phrase linking every pair in a multi-fact chunk). - KnowledgeGraph: addRelationship ignores dangling edges (both endpoints must be nodes); addChunk/removeChunks keep the document's embedded chunk copies in sync so document(id)?.chunks and saved JSON reflect enrichment. - Engine: scope the per-chunk relationship filter to the entities that survived this chunk's maxEntitiesPerChunk cap (not global graph state). - HybridRetriever.index(graph:) clears first so it's a safe full reindex. - BM25 tokenizer splits on punctuation ("graph-based" -> "graph","based"). - Ollama baseURL normalizes bare hosts ("localhost" -> "http://localhost"). - LLMExtractor matches entity mentions on token boundaries ("Ann" no longer matches inside "Annabelle"). findAllPaths zero-hop was already handled by the `remaining <= 0` guard. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

ronaldmannak · 2026-06-28T03:06:52Z

All Codex round-5 findings are addressed in 74524ba:

Pattern extractor / sentence boundaries — only person titles (Dr.) cross a sentence period; org suffixes like Acme Inc. Bob now split into two entities.
Pattern extractor / relationship scoping — relationships are inferred only between entities that share a sentence, using that sentence as the keyword context, so one works for phrase no longer links every person/org pair in a multi-fact chunk.
KnowledgeGraph / dangling edges — addRelationship ignores edges whose endpoints aren't both nodes.
KnowledgeGraph / chunk sync — addChunk/removeChunks keep the document's embedded chunk copies in sync, so document(id)?.chunks and saved JSON reflect entities/embeddings.
Engine / per-chunk cap — the relationship filter is scoped to the entities that survived this chunk's maxEntitiesPerChunk cap, not global graph state.
HybridRetriever.index(graph:) — clears first, so it's a safe full reindex.
BM25 tokenizer — splits on punctuation (graph-based → graph, based).
Ollama — normalizes bare hosts (localhost → http://localhost).
LLMExtractor — matches entity mentions on token boundaries (Ann no longer matches inside Annabelle).
findAllPaths zero-hop — already prevented by the remaining <= 0 guard from 5cf5576.

Note: CI still can't run on this PR (no GitHub-hosted runner is provisioned for the private org repo), so none of these rounds are compile-verified yet — that needs an org Actions/billing (or self-hosted runner, or public-repo) change.

Generated by Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 74524ba374

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-28T03:20:27Z

+        var sid = 0
+        for k in 0..<chars.count {
+            sentenceID[k] = sid
+            if chars[k] == "." || chars[k] == "!" || chars[k] == "?" { sid += 1 }


Preserve abbreviation periods when sentence-scoping relations

When the pattern extractor handles a titled/entity suffix abbreviation in a real sentence, e.g. Dr. Smith works for Acme Inc., this increments the sentence id at the period in Dr. even though capitalizedSpans intentionally keeps Dr. Smith as one entity. sameSentenceMentions then sees the person mention in sentence 0 and the organization in sentence 1, so the offline extractor drops the WORKS_FOR relationship entirely for common abbreviated names/suffixes.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:20:27Z

+                var lo = min(aOff, bOff)
+                while lo > 0 && sentence(of: lo - 1) == sentence(of: aOff) { lo -= 1 }
+                var hi = max(aOff, bOff)
+                while hi < chars.count && sentence(of: hi) == sentence(of: aOff) { hi += 1 }
+                let context = (lo < hi ? String(chars[lo..<hi]) : "").lowercased()


Scope relation keywords to each mention pair

Fresh evidence after the same-sentence guard is that the keyword context is still the entire sentence, so a sentence like Alice Smith works for Acme Corporation and Bob Jones works for Beta Corporation. makes the single works for phrase apply to every person/org pair in that sentence. This produces false WORKS_FOR edges such as Alice→Beta and Bob→Acme; the context needs to be limited to the text between/around the selected mention pair rather than the whole sentence.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:20:27Z

+        for rawWord in text.split(whereSeparator: { $0.isWhitespace }) {
+            var cleaned = ""
+            for ch in rawWord where ch.isLetter || ch.isNumber {
+                cleaned.append(contentsOf: ch.lowercased())
+            }
+            if !cleaned.isEmpty { tokens.append(cleaned) }


Split hash-embedding tokens on punctuation

In semantic-only mode, the default HashEmbedder is the only retrieval signal, but this tokenizer strips punctuation inside each whitespace-delimited word and concatenates the pieces. Indexed text like graph-based or nodes/edges hashes as graphbased/nodesedges, while queries for graph based or nodes edges hash different tokens and lose the semantic overlap that should retrieve those chunks.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:20:27Z

+            nodeCount: n,
+            edgeCount: relationshipCount,
+            averageDegree: avgDegree,
+            maxDepth: 0


Report the actual graph max depth

For any non-empty connected graph, this public stats method always reports maxDepth as zero, so analytics or monitoring code that uses KnowledgeGraph.stats() gets a diameter/depth value that says the graph has no reachable hops even when relationships exist. This should be computed from the graph topology or omitted rather than hard-coded.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:20:27Z

+    public init(documentFrequencies: [String: Int] = [:], totalDocuments: Int = 1) {
+        self.documentFrequencies = documentFrequencies
+        self.totalDocuments = max(1, totalDocuments)


Start TF-IDF corpus counts at zero

A default TfIdfKeywordExtractor already has totalDocuments == 1, so callers that build corpus statistics solely through addDocumentToCorpus get an off-by-one corpus size: after one added document corpusStats() reports two documents and IDF is computed with an extra phantom document. The fallback for an empty corpus should not be stored as a real document count before any documents are added.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:20:27Z

+        let d = dampingFactor
+        let teleport = (1.0 - d) / Double(n)


Clamp PageRank damping to a probability

Because dampingFactor is public, callers can construct PageRank(dampingFactor: 3) and this uses it directly as a probability; the teleport term becomes negative and PageRank can return negative scores despite the API documenting scores in [0, 1]. Validate or clamp the damping factor to [0, 1] before using it in the iteration.

Useful? React with 👍 / 👎.

…tion scoping - HashEmbedder tokenizer splits on punctuation (matches BM25) so "graph-based" and the query "graph based" share tokens in semantic-only mode. - TfIdfKeywordExtractor starts totalDocuments at 0 (no phantom document); the smoothed IDF still handles an empty corpus. - PageRank clamps dampingFactor to [0, 1] so a bad value can't yield negative scores. - KnowledgeGraph.stats() reports the real maxDepth (undirected graph diameter) instead of a hardcoded 0. - PatternExtractor: an abbreviation period ending a person title ("Dr.") no longer splits the sentence, so "Dr. Smith works for Acme Inc." keeps its WORKS_FOR edge; relationship keyword context is now the span between the two mentions, and pairs with an intervening same-type entity are skipped, removing false edges in multi-fact sentences. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

ronaldmannak · 2026-06-28T03:24:59Z

Codex round-6 findings addressed in 3bfd74f:

HashEmbedder tokenizer — splits on punctuation (like BM25), so graph-based and a graph based query share tokens in semantic-only mode.
TF-IDF corpus count — totalDocuments starts at 0 (no phantom document); the smoothed IDF still handles an empty corpus.
PageRank — dampingFactor is clamped to [0, 1] so a bad value can't produce negative scores.
KnowledgeGraph.stats().maxDepth — now the real undirected graph diameter instead of a hardcoded 0.
PatternExtractor (regression from round 5) — an abbreviation period ending a person title (Dr.) no longer splits the sentence, so Dr. Smith works for Acme Inc. keeps its WORKS_FOR edge.
PatternExtractor relation scoping — keyword context is now the span strictly between the two mentions, and a pair is skipped when another same-type entity lies between them, eliminating the false WORKS_FOR edges in multi-fact sentences.

Note: this is an offline heuristic extractor, not a full relation classifier — the proximity rules trade a little recall for far fewer false edges. CI still can't run on this PR (no runner provisioned for the private org repo), so these remain compile-unverified.

Generated by Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3bfd74fa26

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-28T03:38:59Z

+            return PatternEntityExtractor.personTitles.contains(
+                String(chars[s..<periodIndex]).lowercased())


Keep org suffix periods within the same sentence

Fresh evidence in the current code is that this abbreviation check only consults personTitles, so organization suffixes such as Inc./Corp. still increment the sentence id. When the text is Acme Inc. was founded by Sam Altman, the org mention is placed in a different sentence from the person mention and sameSentenceMentions drops the LEADS relationship even though the abbreviation is mid-sentence.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:39:00Z

+        entitiesByID.removeAll()
+        entityOrder.removeAll()
+        relationships.removeAll()
+        outgoing.removeAll()
+        incoming.removeAll()


Clear chunk entity references too

When this public clear method is called, it removes the entity table but leaves every TextChunk.entities array in chunksByID and the embedded document chunk copies untouched. Callers that clear and then inspect graph.chunks or save the graph get chunks containing dangling entity IDs that graph.contains no longer recognizes; the same inconsistent snapshot can be exposed if a rebuild fails after this clear step.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:39:00Z

+            }
+
+            // Advance with overlap, snapped back to a word boundary.
+            var nextStart = max(0, end - overlap)


Clamp public chunker overlap before advancing

When callers use HierarchicalChunker directly with a negative overlap, this computes the next start as end - overlap, which moves it past the current chunk end and silently leaves source text uncovered; for example, chunkSize: 200, overlap: -50 jumps from roughly 200 to 250 and drops the intervening characters. TextProcessor rejects this input, but the public chunkText/chunkSpans APIs do not, so direct chunking can lose document content.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:39:00Z

+            if current.count > 2, !TfIdfKeywordExtractor.defaultStopwords.contains(current) {
+                tokens.append(current)


Index two-letter acronym terms in BM25

In keyword-only retrieval, important acronym-only queries such as AI, ML, or EU are dropped by this length check, so a document containing AI safety cannot be found by the query AI even though BM25 is the only signal. Common short stopwords are already filtered separately, so excluding every two-letter token loses meaningful technical and location terms.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:39:00Z

+            if var current = graph.chunk(id) {
+                current.entities = entities.map(\.id)
+                graph.addChunk(current)


Don't apply extraction results to replaced chunks

When another task replaces the same document id while extractor.extract is suspended, graph.chunk(id) can now be the new chunk with the same id, but this writes entity ids extracted from the old content onto it; stale entities and relationships were also inserted just above. Even though the final build is marked unbuilt, knowledgeGraph()/save(toJSON:) can expose the new chunk annotated with entities from the replaced text until a later rebuild.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T03:39:00Z

+                Relationship(
+                    source: sourceID, target: targetID, relationType: relType,
+                    confidence: data.strength ?? 0.7, context: [chunk.id]))


Clamp LLM relationship strengths

When an LLM returns a strength outside the expected 0...1 range, such as 10 or -1, this stores it directly as relationship confidence. Those values are later used for traversal filtering and PageRank edge weights, so a single unbounded model output can either overweight an edge massively or make an otherwise valid relationship disappear under the default strength filter.

Useful? React with 👍 / 👎.

…, BM25 - PatternExtractor: sentence segmentation treats org suffixes (Inc./Corp./Ltd.) and common abbreviations as non-sentence-ending, so "Acme Inc. was founded by Sam Altman" keeps its LEADS edge. Entity-span splitting still uses personTitles. - KnowledgeGraph.clearEntitiesAndRelationships clears chunk entity references in both chunksByID and the document copies, so no chunk points at a removed id. - BM25 keeps 2-letter acronym terms (AI/ML/EU); only single chars are dropped. - HierarchicalChunker clamps negative overlap (public API parity with TextProcessor). - LLMExtractor clamps model-provided relationship strength to [0, 1]. - Engine annotates a chunk during build only when its content still matches what was extracted/embedded, so a document replaced mid-await isn't tagged with stale entities/embeddings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

ronaldmannak · 2026-06-28T06:54:31Z

✅ CI is green on b3de614 — swift build --build-tests and swift test both pass in the swift:6.1 container now that the repo is public and a runner is available. This is the first compile-verification of the port; the build is clean under Swift 6 strict concurrency and the test suite passes.

Codex round-7 findings were addressed in b3de614:

Sentence segmentation treats org suffixes (Inc./Corp./Ltd.) and common abbreviations as non-sentence-ending, so Acme Inc. was founded by Sam Altman keeps its LEADS edge (entity-span splitting still uses the narrower person titles).
clearEntitiesAndRelationships now also clears chunk entity references (in chunksByID and the document copies) so no chunk points at a removed entity id.
BM25 keeps 2-letter acronym terms (AI/ML/EU); only single characters are dropped.
HierarchicalChunker clamps a negative overlap (public-API parity with TextProcessor).
LLMExtractor clamps model-provided relationship strength to [0, 1].
Engine build annotates a chunk only when its content still matches what was extracted/embedded, so a document replaced mid-await isn't tagged with stale entities/embeddings.

Generated by Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3de614f12

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-28T07:02:16Z

+        for id in chunkIDs {
+            guard let chunk = graph.chunk(id) else { continue }
+            let extracted = try await extractor.extract(from: chunk)
+            var entities = extracted.entities


Filter all extractors by minConfidence

When GraphRAG uses Ollama/LLM extraction or any injected extractor, EntityConfig.minConfidence is never applied here: LLMEntityExtractor assigns every entity confidence 0.9, and the engine only caps by count before adding. A config such as EntityConfig(minConfidence: 0.95) therefore still indexes those entities and their relationships, so the public threshold silently works only for the default pattern extractor.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T07:02:16Z

+    /// splitting still uses the narrower `personTitles`.
+    static let sentenceAbbreviations: Set<String> = [
+        "dr", "prof", "mr", "mrs", "ms", "jr", "sr", "st",
+        "inc", "corp", "ltd", "llc", "co", "etc", "vs",


Disambiguate sentence-final org suffixes

When a company suffix actually terminates a sentence, e.g. Alice Smith works for Acme Inc. Bob Jones works for Beta Corp., including inc/corp in this unconditional abbreviation set prevents the sentence id from advancing. The same-sentence guard then treats both facts as one sentence and can emit cross-sentence false edges such as Acme→Bob or Alice→Bob; the abbreviation check needs following-token context instead of always suppressing these periods.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T07:02:16Z

+
+    public mutating func addDocument(_ document: Document) {
+        if documentsByID[document.id] == nil { documentOrder.append(document.id) }
+        documentsByID[document.id] = document


Remove old chunks when replacing documents

When KnowledgeGraph is used directly, this overwrite path does not remove chunks already indexed for the same document id. After adding a document and its chunks, then calling addDocument again with the same id for a reingest, graph.chunks and saved JSON can still contain the old chunk text even though document(id) now points at the replacement; the cleanup done by GraphRAG.addDocument is not applied for direct KnowledgeGraph callers.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T07:02:16Z

+            let gleanResponse = try await model.complete(gleanPrompt, params: params)
+            if let extra = LLMEntityExtractor.parse(gleanResponse) {
+                if extra.entities.isEmpty && extra.relationships.isEmpty { break }
+                output.entities.append(contentsOf: extra.entities)


Deduplicate gleaned LLM entities before capping

When gleaningRounds > 0 and a continuation repeats an entity from an earlier pass, this appends the duplicate and convert returns both copies with the same id. The engine applies maxEntitiesPerChunk before KnowledgeGraph.addEntity merges ids, so repeated high-confidence entities can consume the cap and drop distinct entities and their relationships from the chunk.

Useful? React with 👍 / 👎.

…dedup, reingest - PatternExtractor: an abbreviation period ends a sentence when the next word is capitalized, so "Acme Inc. was founded by Sam Altman" stays one sentence while "...Acme Inc. Bob Jones..." splits (person titles still never split). - Engine: apply EntityConfig.minConfidence uniformly to every extractor's output (LLM/injected extractors don't self-filter), before the per-chunk cap. - LLMExtractor: deduplicate entities by id in convert (gleaning passes can repeat an entity) so a duplicate can't consume the per-chunk cap. - KnowledgeGraph.addDocument: when replacing an existing document id, purge the previous version's chunks so direct callers don't retain stale chunk text. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

ronaldmannak · 2026-06-28T07:05:10Z

Codex round-8 findings addressed in f6bdaf4:

Sentence-final org suffixes — an abbreviation period now ends a sentence when the next word is capitalized, so Acme Inc. was founded by Sam Altman stays one sentence (keeps LEADS) while …Acme Inc. Bob Jones… correctly splits. Person titles still never split.
minConfidence for all extractors — EntityConfig.minConfidence is now applied uniformly in the build pipeline (LLM/injected extractors don't self-filter), before the per-chunk cap.
Gleaning dedup — LLMEntityExtractor.convert deduplicates entities by id, so a repeated entity from a continuation pass can't consume the maxEntitiesPerChunk cap.
Direct reingest — KnowledgeGraph.addDocument purges the previous version's chunks when replacing an existing document id, so direct callers don't retain stale chunk text.

(CI now runs on each push; b3de614 was green and f6bdaf4 is building.)

Generated by Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f6bdaf4fea

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-28T07:16:33Z

+            // "... Acme Inc. Bob ..." splits.
+            if PatternEntityExtractor.sentenceAbbreviations.contains(word) {
+                var t = periodIndex + 1
+                while t < chars.count && chars[t] == " " { t += 1 }


Treat newlines after org suffixes as sentence breaks

Fresh evidence in the current code is that this look-ahead skips only literal spaces, so an organization suffix before a line break is treated as non-terminal. For text like Alice Smith works for Acme Inc.\nBob Jones works for Beta Corp., Inc. does not advance the sentence id, causing both lines to be considered one sentence and allowing cross-sentence false relationships to be inferred; skip all non-newline whitespace carefully or treat \n after the suffix as a boundary before checking the next capitalized word.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T07:16:33Z

+        var outWeight = [Double](repeating: 0, count: n)
+        for rel in graph.relationships {
+            guard let s = indexOf[rel.source], let t = indexOf[rel.target] else { continue }
+            let w = Double(max(rel.confidence, 0.0001))


Ignore zero-confidence edges in PageRank

When a relationship has confidence == 0 (for example, an LLM strength of 0 after clamping or a public caller marking an edge as no-strength), this converts it to a tiny positive weight and also uses that same tiny value in the source's total out-weight. If it is the source's only outgoing edge, the edge receives 100% of that node's PageRank contribution despite being explicitly zero-confidence; skip non-positive weights or treat those sources as dangling instead.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T07:16:33Z

+        for ma in a.mentions {
+            for mb in b.mentions where sentence(ma.startOffset) == sentence(mb.startOffset) {
+                return (ma.startOffset, mb.startOffset)
+            }


Consider later mention pairs before fixing relation type

When the same entity pair appears more than once in a chunk, this returns the first same-sentence mentions even if that co-occurrence has only generic context. For example, Alice Smith met Acme Corporation. Later, Alice Smith works for Acme Corporation. records only ASSOCIATED_WITH from the first sentence and never considers the later works for mention pair, so the specific WORKS_FOR edge is missed; evaluate all same-sentence mention pairs or prefer the pair whose local context matches a typed relation.

Useful? React with 👍 / 👎.

…r, PR weights - PatternExtractor: a newline (or end of text) after an org-suffix abbreviation is a sentence boundary, so "Acme Inc.\nBob Jones ..." splits into two sentences. - PatternExtractor: evaluate every same-sentence mention pair for an entity pair and keep the most specific typed relation, so a later "works for" co-occurrence isn't shadowed by an earlier generic one. - PageRank: skip non-positive-confidence edges (treating such sources as dangling) so a single zero-confidence edge can't capture all of a node's mass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

ronaldmannak · 2026-06-28T07:19:08Z

Codex round-9 findings addressed in 420f1cc:

Newline sentence boundaries — a newline (or end of text) after an org-suffix abbreviation now ends the sentence, so Acme Inc.\nBob Jones works for Beta Corp. splits into two sentences (no cross-line false edges).
Best mention pair — for a given entity pair, all same-sentence mention pairs are evaluated and the most specific typed relation wins, so Alice met Acme. Later, Alice works for Acme. yields WORKS_FOR instead of only the earlier ASSOCIATED_WITH.
PageRank zero-confidence edges — non-positive-confidence edges are skipped (their sources treated as dangling), so a single zero-confidence edge can't capture 100% of a node's PageRank mass.

(f6bdaf4 was green; 420f1cc is building.)

Generated by Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 420f1ccf74

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-28T07:31:52Z

+        - entity_name: Name of the entity, capitalized
+        - entity_type: One of the following types: [{entity_types}]
+        - entity_description: Comprehensive description of the entity's attributes and activities


Accept the prompt's entity field names

When the LLM follows these bullet instructions and returns keys like entity_name / entity_type instead of the shorter keys in the later JSON example, EntityData only decodes name / type, defaults both fields to empty strings, and convert drops the entity entirely. This makes Ollama-backed extraction brittle for a response format the prompt itself asks for; align the prompt or decode both key styles.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T07:31:52Z

+                    // (./!/?) also ends it ("Acme. Bob") — unless the word is a
+                    // known abbreviation/title like "Dr." so "Dr. Smith" merges.
+                    if j > runStart, let last = chars[j - 1].unicodeScalars.first {
+                        if CharacterSet(charactersIn: ",;:").contains(last) { break }


Keep comma-suffixed organization names together

For common legal names such as Alice works for Acme, Inc., this unconditional comma break ends the capitalized span at Acme; under the default minConfidence the single-word Acme is discarded and Inc can be extracted as the organization instead. That drops or corrupts organization relationships for normal company-name punctuation, so the comma should not split when it is immediately followed by an organization suffix.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T07:31:52Z

+        if words.count >= 2 {
+            return ("PERSON", 0.8)


Strip leading articles before classifying spans

When a sentence starts with names like The United States or The University of California, the captured span still includes The, so the known-location and organization-prefix checks miss and this fallback labels the whole span as PERSON. That corrupts common location/organization nodes and their downstream relationships; trim leading articles/blocklisted words before applying the location/prefix checks or before falling back to person.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T07:31:52Z

+        // Accept bare hosts ("localhost", "127.0.0.1"): without a scheme, URL
+        // parses the host as the scheme and the request fails.
+        let normalizedHost = host.contains("://") ? host : "http://\(host)"
+        return "\(normalizedHost):\(port)"


Avoid appending a second Ollama port

If callers pass the common full Ollama base URL as the host, e.g. OllamaConfig(host: "http://localhost:11434") or withOllama(host: "http://localhost:11434"), this returns http://localhost:11434:11434, which URL(string:) rejects and makes availability checks and requests fail. Since the code already accepts hosts with a scheme, detect an existing port or document/enforce that host must exclude it.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T07:31:53Z

+        } else {
+            // Replacing an existing id: purge the previous version's chunks so
+            // direct KnowledgeGraph callers don't retain stale chunk text.
+            removeChunks(forDocument: document.id)


Drop stale graph evidence when replacing documents

When KnowledgeGraph is used directly and a document id is replaced after entities/relationships were added for its old chunks, this only removes the chunk records. The old entities still keep mentions pointing at deleted chunk IDs and relationships still keep those chunk IDs in context, so graph traversal/stats and saved JSON can expose facts from a document version that no longer exists; replacement should also remove or scrub evidence tied to the removed chunks.

Useful? React with 👍 / 👎.

…, articles, evidence scrub - PatternExtractor: a comma followed by an org suffix doesn't split the run, so "Acme, Inc." stays one organization span; leading articles ("The United States", "The University of California") are stripped before classification so they map to location/organization instead of person. - LLMExtractor: EntityData/RelationshipData decode both the short JSON-example keys and the prompt's verbose names (entity_name/source_entity/…), so Ollama responses in either shape parse. - Ollama baseURL: don't append a second port when the host already includes one ("http://localhost:11434"). - KnowledgeGraph.removeChunks: scrub entity mentions and relationship context that reference removed chunks, so replacing a document leaves no stale evidence. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

ronaldmannak · 2026-06-28T07:35:33Z

Codex round-10 findings addressed in 38c78b7:

Comma-suffixed org names — a comma followed by an org suffix no longer splits the span, so Acme, Inc. stays one organization.
Leading articles — The United States / The University of California are stripped of the leading article before classification, so they resolve to location/organization rather than person.
LLM key names — EntityData/RelationshipData decode both the short JSON-example keys and the prompt's verbose names (entity_name/source_entity/…), so Ollama responses in either shape parse.
Ollama host with port — baseURL no longer appends a second port when the host already has one (http://localhost:11434).
Reingest evidence — KnowledgeGraph.removeChunks scrubs entity mentions and relationship-context entries referencing removed chunks, so replacing a document leaves no stale evidence.

(420f1cc was green; 38c78b7 is building.)

Generated by Claude Code

The alias CodingKeys added in round 10 (entity_name, source_entity, ...) broke synthesized Encodable conformance, since every CodingKeys case must map to a stored property. These structs are only decoded, so declare them Decodable instead of Codable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f872b93bc4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-28T07:54:48Z

+    public func findRelationshipPath(
+        from source: EntityID, to target: EntityID, maxDepth: Int = 5
+    ) -> [EntityID]? {
+        if source == target { return [source] }


Require endpoints to exist for self paths

When callers ask for a path from an ID to itself and that entity was never added, this returns [source] before checking graph membership, producing a valid-looking zero-hop path for a missing node. Other traversal APIs guard graph.contains(...); this should verify the endpoint exists before returning the self path.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T07:54:48Z

+                        let upper = min(hi + 1, chars.count)
+                        let context = (lo < upper ? String(chars[lo..<upper]) : "").lowercased()


Include trailing relation cues in pair context

When the relation phrase comes after both mentions, such as Alice Smith and Bob Jones are married, hi is only the later mention's start offset and this slice stops after its first character, so relationType never sees cues like married or colleague and downgrades the edge to generic KNOWS. Include the full second mention plus a bounded following/sentence window for the selected pair.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T07:54:48Z

+        self.retriever = HybridRetriever(
+            config: HybridConfig(maxCandidates: max(100, config.topKResults * 10)))


Wire retrieval config into search

When callers set Config(retrieval: ...), the graph-expansion knobs are silently ignored: this creates HybridRetriever with only a derived maxCandidates, and a repo-wide search for config.retrieval/maxExpansionDepth shows no production code reads them. As a result maxExpansionDepth, entityWeight, chunkWeight, and graphWeight have no effect on search/ask, despite being public configuration.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T07:54:48Z

+        for k in 0..<chars.count {
+            sentenceID[k] = sid
+            let c = chars[k]
+            if c == "!" || c == "?" || (c == "." && periodIsSentenceEnd(k)) { sid += 1 }


Treat bare newlines as sentence boundaries

When chunks contain line-separated facts without sentence punctuation, this sentence map never increments on \n, so both lines share one sentence. For example Alice Smith works for Acme Corporation\nBob Jones works for Beta Corporation can emit cross-line generic edges such as Alice↔Bob or Acme↔Bob even though the same-sentence guard was meant to prevent facts in separate lines from being linked.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T07:54:48Z

+                        // Proximity heuristic: skip if another same-type entity
+                        // lies between them — the phrase likely belongs to that
+                        // nearer pair. (Offline extractor, not a full classifier.)
+                        if hasInterveningSameType(a, b, lo: lo, hi: hi, among: entities) { continue }


Keep coordinated subjects linked to shared objects

When a relation applies to a coordinated list, such as Alice Smith and Bob Jones are employed by Acme Corporation, the Alice→Acme pair is skipped solely because Bob is another PERSON mention between them. That leaves only the nearest subject linked to the organization and drops valid relationships for earlier coordinated subjects; the intervening-entity heuristic needs to allow shared predicates rather than unconditionally skipping these pairs.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T07:54:48Z

+        // Stage 1: entity & relationship extraction per chunk.
+        for id in chunkIDs {
+            guard let chunk = graph.chunk(id) else { continue }
+            let extracted = try await extractor.extract(from: chunk)


Skip relationship extraction when disabled

When EntityConfig(extractRelationships: false) is used with an Ollama/LLM extractor, this still invokes the full extractor that prompts the model to identify relationships; the flag only prevents inserting them later. In that configuration users still pay the latency/token cost of relationship extraction even though they explicitly disabled relationships, so the build should avoid requesting or inferring them in the first place.

Useful? React with 👍 / 👎.

… dead config - KnowledgeGraph.findRelationshipPath requires both endpoints to exist before returning the self path. - PatternExtractor: treat bare newlines as sentence boundaries; extend a pair's keyword context to the sentence end when no other entity mention follows the later mention (captures trailing cues like "... are married" without leaking into a following pair). Documented the deliberate precision-over-recall trade-off for coordinated subjects. - Remove the unused RetrievalConfig (maxExpansionDepth / *Weight knobs were never read) so Config exposes no silently-ignored settings. - Engine: documented that extractRelationships gates insertion only (skipping an LLM's relationship prompting would need a protocol change). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

ronaldmannak · 2026-06-28T07:59:11Z

Codex round-11 — addressed in ab37e9a:

Fixed

findRelationshipPath now requires both endpoints to exist before returning the zero-hop self path.
Sentence segmentation treats bare newlines as boundaries (line-per-fact chunks no longer merge).
Relationship keyword context extends to the sentence end when no other entity mention follows the later mention, so trailing cues (… are married) are captured — without leaking into a following pair's facts.
Removed the unused RetrievalConfig (its maxExpansionDepth/*Weight knobs were never read) so Config no longer exposes silently-ignored settings.

Rightfully ignored (with in-code rationale)

Coordinated subjects (Alice and Bob are employed by Acme → only Bob links): the intervening-same-type guard deliberately favors precision; relaxing it reintroduces cross-fact false edges (Alice works for Acme and Bob works for Beta → Alice→Beta). Documented at the guard.
Skip LLM relationship prompting when extractRelationships == false: the EntityExtracting protocol returns entities+relationships in one call, so the flag gates insertion; fully skipping the model's relationship work needs a protocol/prompt change. Documented in the build loop.

(f872b93 is green; ab37e9a building.)

Generated by Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab37e9a485

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-28T08:14:01Z

+            }
+        }
+        for idx in relationships.indices where !relationships[idx].context.isEmpty {
+            relationships[idx].context.removeAll { removed.contains($0) }


Drop relationships once their evidence chunks are removed

Fresh evidence in the current code is that replacement now scrubs removed chunk IDs from context, but it keeps the relationship in relationships/adjacency even when that was its only evidence. When a direct KnowledgeGraph caller replaces a document after adding an extracted edge with context: [oldChunk], relationshipCount, traversal, and saved JSON still expose that stale fact as a context-less edge for the old document version.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T08:14:01Z

+    public var maxConnections: Int
+    public var threshold: Float


Honor or remove dead graph config knobs

Fresh evidence after the unused retrieval config was removed is that these public graph knobs are still silently ignored: a repo-wide search for config.graph/maxConnections finds no production reads outside this declaration. Callers can set Config(graph: GraphConfig(maxConnections: 1, threshold: 0.99)) and get the same graph construction/retrieval behavior, so experiments relying on those limits or thresholds are misconfigured without any signal.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T08:14:01Z

+        for (neighbor, relationship) in graph.neighbors(of: current) {
+            guard passesFilter(relationship) else { continue }
+            if !visited.contains(neighbor) {


Deduplicate entity paths across relation types

When two nodes have multiple relationship types between them, neighbors(of:) returns one entry per type, but findAllPaths returns only [EntityID] paths. This loop therefore recurses once per edge type and appends identical entity paths; with a low maxPaths, duplicates like two copies of [A, B] can consume the cap before distinct paths are returned.

Useful? React with 👍 / 👎.

… config, path dedup - KnowledgeGraph.removeChunks now drops a relationship whose only evidence was a removed chunk (context-less leftover would expose a stale fact) and rebuilds the adjacency index; edges that never had context are preserved. - Remove the unused GraphConfig (maxConnections/threshold were never read) so Config exposes no silently-ignored knobs. - findAllPaths collapses parallel edges (multiple relation types between the same nodes) to one neighbor, so entity-only paths aren't duplicated and don't waste the maxPaths budget. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

ronaldmannak · 2026-06-28T08:16:07Z

Codex round-12 findings addressed in b6dfa55:

Evidence-less edges — removeChunks now drops a relationship whose only context was a removed chunk (and rebuilds adjacency), so replacing a document leaves no stale context-less edge. Edges that never had context are preserved.
Dead graph config — removed the unused GraphConfig (maxConnections/threshold were never read), matching the earlier RetrievalConfig cleanup, so Config exposes no silently-ignored knobs.
Path dedup — findAllPaths collapses parallel edges (multiple relation types between the same nodes) to a single neighbor, so entity-only paths aren't duplicated and don't consume the maxPaths budget.

(ab37e9a is green; b6dfa55 building.)

Generated by Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b6dfa551cd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-28T08:30:28Z

+                entity.mentions = kept
+                entitiesByID[eid] = entity


Drop entities left with no surviving mentions

When a direct KnowledgeGraph caller replaces a document after indexing entities from its old chunks, this loop removes the old chunk mentions but writes the entity back even when kept is empty. The stale entity still appears in entityCount, entities, stats, and saved JSON even though none of its evidence chunks exist anymore; entities whose mentions were exhausted by the replacement should be removed unless they have independent evidence.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T08:30:28Z

+    /// Byte offset of the chunk start within the original document content.
+    public var startOffset: Int
+    /// Byte offset of the chunk end within the original document content.


Stop documenting character offsets as byte offsets

These public fields are populated by TextProcessor/HierarchicalChunker using Array(text) character indices, not UTF-8 byte offsets. For any non-ASCII document, callers following this API documentation will slice or highlight at the wrong byte positions in the original content; either the stored offsets need to be byte counts or the public contract should say they are character offsets.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-28T08:30:28Z

+        // Stage 2: embed chunks.
+        for id in chunkIDs {
+            guard let chunk = graph.chunk(id) else { continue }
+            let embedding = try await embedder.embed(chunk.content)


Skip embeddings for keyword-only builds

When callers configure Config(approach: "keyword"), retrieval later sets includeSemantic to false and never uses query or chunk embeddings, but build() still calls the embedder for every chunk here. With an Ollama or otherwise remote/unavailable embedder, a keyword-only corpus can fail or pay full embedding latency even though BM25 is the only signal used; the embedding stage should be skipped when semantic retrieval is disabled.

Useful? React with 👍 / 👎.

…-only embeds, fix offset docs - KnowledgeGraph.removeChunks removes entities whose mentions are entirely exhausted by the removal (no independent evidence) and drops relationships whose endpoint was removed, alongside the existing context scrub. - Engine.build skips the embedding stage entirely for approach == "keyword", so a keyword-only corpus doesn't pay embedder latency or fail on a remote embedder. - TextChunk start/end offsets are documented as character (grapheme) offsets, not UTF-8 byte offsets, matching how the chunker computes them. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

ronaldmannak · 2026-06-28T08:32:21Z

Codex round-13 findings addressed in 8f51562:

Orphaned entities on reingest — removeChunks now removes an entity whose mentions are entirely exhausted by the removal (no independent evidence) and drops relationships whose endpoint was removed, in addition to the existing context scrub.
Keyword-only builds skip embeddings — build() skips the embedding stage entirely when approach == "keyword", so a keyword-only corpus doesn't pay embedder latency or fail on a remote/unavailable embedder.
Offset docs — TextChunk.startOffset/endOffset are now documented as character (grapheme) offsets, not UTF-8 byte offsets, matching how the chunker computes them.

(b6dfa55 is green; 8f51562 building.)

Generated by Claude Code

gemini-code-assist Bot reviewed Jun 28, 2026

View reviewed changes

claude added 2 commits June 28, 2026 01:49

Update BM25 docstring to match raw-tf scoring

33d42c8

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

chatgpt-codex-connector Bot reviewed Jun 28, 2026

View reviewed changes

claude added 2 commits June 28, 2026 02:05

chatgpt-codex-connector Bot reviewed Jun 28, 2026

View reviewed changes

		for chunk in graph.chunks {
		index(id: chunk.id.raw, content: chunk.content, embedding: chunk.embedding)

		for relationship in relationships
		where graph.contains(relationship.source) && graph.contains(relationship.target) {

		return PatternEntityExtractor.personTitles.contains(
		String(chars[s..<periodIndex]).lowercased())

		if current.count > 2, !TfIdfKeywordExtractor.defaultStopwords.contains(current) {
		tokens.append(current)

		let upper = min(hi + 1, chars.count)
		let context = (lo < upper ? String(chars[lo..<upper]) : "").lowercased()

		self.retriever = HybridRetriever(
		config: HybridConfig(maxCandidates: max(100, config.topKResults * 10)))

Uh oh!

Conversation

ronaldmannak commented Jun 28, 2026

Summary

Key Changes

Notable Implementation Details

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ronaldmannak commented Jun 28, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 28, 2026