Port graphrag-rs core library to Swift#1
Conversation
Reimplement the core of the Rust crate graphrag-rs (https://github.com/automataIA/graphrag-rs) as an idiomatic, dependency-free Swift 6 package. Modules: - Core: Document/TextChunk/Entity/Relationship model, typed IDs, GraphRAGError, and pluggable protocols (LanguageModel, EmbeddingModel, EntityExtracting, ChunkingStrategy). - Text: HierarchicalChunker, TextProcessor, TfIdfKeywordExtractor. - Graph: KnowledgeGraph (value-type adjacency), PageRank, traversal (BFS/DFS/ego/paths), analytics (degree/closeness/betweenness/components). - Retrieval: BM25Retriever, cosine InMemoryVectorStore, HybridRetriever with RRF/weighted/CombSUM/MaxScore fusion. - Entity: PatternEntityExtractor (offline) and LLMEntityExtractor with the upstream extraction prompts and staged JSON-recovery parsing. - Embeddings: deterministic offline HashEmbedder and OllamaEmbedder; OllamaClient for LLM completion. - Orchestration: GraphRAG actor (ingest -> build -> ask), GraphRAGBuilder, Config. Defaults mirror the Rust crate (PageRank d=0.85/tol=1e-6, BM25 k1=1.2/b=0.75, hybrid RRF k=60, weights 0.7/0.3, traversal depth 3, min strength 0.5). Runs fully offline by default; optional local Ollama for LLM-backed extraction and answer generation. Includes Swift Testing coverage and a README. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive Swift 6 port of the Rust graphrag-rs crate, implementing a fully offline, graph-based Retrieval Augmented Generation pipeline. It includes core domain models, hierarchical text chunking, TF-IDF keyword extraction, a value-type knowledge graph, graph algorithms (PageRank, BFS/DFS traversal, centrality analytics), hybrid retrieval (BM25 and vector search fusion), and local LLM integration via Ollama. The review feedback identifies several critical correctness issues and architectural improvements: correcting the BM25 scoring formula to avoid double-penalizing document length, fixing character offset calculations during punctuation trimming in pattern extraction, adopting modern Swift 6 async-native APIs for network requests to support task cancellation, ensuring Unicode-safe offset calculations in LLM extraction, and implementing the documented deduplication behavior for graph neighbors.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
…ighbor dedup - BM25: use raw term frequency; document-length normalization is handled only by the |D|/avgdl denominator factor (previously normalized twice). - PatternExtractor: trim leading/trailing punctuation on the index range so recorded span offsets stay aligned with the source text. - Ollama: replace the continuation-wrapped dataTask with async-native URLSession.data(for:), which supports task cancellation. - LLMExtractor: derive mention endOffset from the matched range's upperBound (case folding can change grapheme counts). - KnowledgeGraph.neighbors(of:): deduplicate by (neighbor, relationType) to match the documented contract. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
There was a problem hiding this comment.
💡 Codex Review
https://github.com/PicoMLX/GraphRAG/blob/c2513c089f3a462b4200149487edb526c4acbda9/Sources/GraphRAG/Retrieval/Hybrid.swift#L336-L337
Drop zero-similarity semantic hits before RRF
When keyword search finds nothing and the query embedding has zero or negative cosine similarity to every chunk, vectors.search still returns the nearest maxCandidates; RRF then disables the absolute score threshold and returns those arbitrary chunks. In the offline pipeline this makes ask() answer from unrelated context instead of taking its no-results path for empty or off-topic queries.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Hybrid: drop non-positive cosine hits before RRF so off-topic/empty queries take the no-results path instead of surfacing arbitrary chunks. - Hybrid: reject non-positive search limits (prefix traps on negatives). - Engine: remove a document's stale chunks when it is replaced by the same id. - Engine: guard against reentrant/overlapping builds; mark unbuilt on entry so a failed rebuild can't leave stale state queryable; only mark built if no document was ingested during the build (snapshot chunk ids). - Engine/Builder: honor config.embedding.backend == "ollama" for a manually constructed/deserialized Config, not only the withOllamaEmbeddings() flag. - Traversal: stop DFS expansion at maxDepth so edges to unvisited nodes are not recorded (matches BFS and the documented depth limit). - KeywordExtraction: smoothed IDF (log((N+1)/(df+1))+1) so a fresh extractor ranks by term frequency instead of collapsing every score to zero. - Chunking: reject negative chunk overlap (would skip text between chunks). - PatternExtractor: stop a Title-Case run at separating punctuation so "Alice, Bob" yields two entities rather than one merged node. Adds regression tests for negative overlap, document replacement, DFS depth bounds, and negative top-K. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
|
All Codex findings are addressed in af6d547 (replies left on each inline thread; threads resolved):
Regression tests added for document replacement, DFS depth bounds, negative overlap, and negative top-K. @codex review Generated by Claude Code |
Runs `swift build`/`swift test` in the official swift:6.3 container on every push and pull request, matching the package's swift-tools-version. Adds a CI status badge to the README. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
The swift:6.3 image does not exist on Docker Hub yet (6.3 is newly announced and not in the stable registry), so the container failed to start. No released toolchain can parse swift-tools-version 6.3 either. The port only uses standard Swift 6 features (actors, Sendable, async/await, swift-testing), so lower the manifest floor to 6.1 and pin the CI container to swift:6.1 so builds and tests actually run. Bump both back to 6.3 once a stable 6.3 toolchain/image ships. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c845d99b90
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Engine: honor `approach` (hybrid/keyword/semantic) by toggling the BM25 and vector stages in retrieval. - Engine: thread `retrieval.similarityThreshold` into the semantic stage so weak matches can be suppressed as configured. - Engine: enforce `maxEntitiesPerChunk` (keep highest-confidence) and drop relationships whose endpoints didn't survive the cap. - Engine: always write a chunk's entity ids during build (empty clears stale ids left by a prior build when extraction now returns none). - Config/Builder: remove redundant chunkSize/overlap from TextConfig so the top-level chunkSize/chunkOverlap is the single source of truth (a manually constructed Config can no longer chunk at the wrong size). - Analytics.density(): count unique undirected pairs so reciprocal/multiple typed edges can't push density above 1.0. - KnowledgeGraph.neighbors(of:): keep the highest-confidence edge per (neighbor, relationType) so a weak A->B can't hide a strong reciprocal B->A from strength-filtered traversals. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 159b1162ac
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
… heuristics Pattern extractor: - Stop Title-Case runs at sentence punctuation (./!/?) with an abbreviation/title exception so "Acme. Bob" splits but "Dr. Smith"/"Acme Inc." stay merged. - Recognize entities preceded by opening punctuation (quotes/parens). - Orient asymmetric inferred relationships (WORKS_FOR/LEADS/BORN_IN/ HEADQUARTERED_IN/LOCATED_IN) by entity role, not text order. Config (single source of truth): - Remove duplicated topK/similarityThreshold from RetrievalConfig; top-level Config.topKResults/similarityThreshold are authoritative (Engine threads the latter into the semantic stage). - Builder: drive the embedding backend solely from config.embedding.backend (no sticky flag, so a later withConfig can switch back to hash), and sync the Ollama embedder dimension from the config. Robustness: - Engine.build re-fetches each chunk before writing so a document replaced during an extraction/embedding await isn't clobbered by the pre-await snapshot. - PageRank clamps negative maxIterations and guards topEntities against negative k. - Traversal findAllPaths treats negative depth as no expansion; egoNetwork de-duplicates emitted edges. - HybridRetriever.index drops a stale vector when reindexing an id without an embedding. - Prompts.fill scans once and never re-substitutes inserted values (keeps literal braces intact). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5cf557671a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Keep only relationships whose endpoints exist in the graph | ||
| // (drops edges to entities removed by the per-chunk cap). | ||
| for relationship in relationships | ||
| where graph.contains(relationship.source) && graph.contains(relationship.target) { |
There was a problem hiding this comment.
Filter capped relationships against selected entities
Fresh evidence after the cap fix is that this filter checks global graph state, so when the per-chunk cap drops an entity that already exists from an earlier chunk, graph.contains still passes and records the relationship. For example with maxEntitiesPerChunk = 1, if chunk 1 added A and chunk 2 extracts B high-confidence, A low-confidence, plus A-B, the cap removes A from chunk 2 but the edge is still added because A is already in the graph; that defeats the cap and attaches evidence to an entity the chunk deliberately discarded.
Useful? React with 👍 / 👎.
| let word = String(chars[ws..<j]) | ||
| .trimmingCharacters(in: CharacterSet(charactersIn: ".!?")) | ||
| .lowercased() | ||
| if !PatternEntityExtractor.abbreviations.contains(word) { break } |
There was a problem hiding this comment.
Stop org abbreviations from merging the next sentence
For sentence-final organization suffixes such as Acme Inc. Bob joined, inc is in the abbreviation allowlist, so this branch does not break on the period and the next uppercase word is consumed as the same span (Acme Inc. Bob). That drops both the organization and the person into a bogus merged entity for a common company-name pattern; only title/person abbreviations that need a following name should be allowed to cross the period.
Useful? React with 👍 / 👎.
|
|
||
| public mutating func addChunk(_ chunk: TextChunk) { | ||
| if chunksByID[chunk.id] == nil { chunkOrder.append(chunk.id) } | ||
| chunksByID[chunk.id] = chunk |
There was a problem hiding this comment.
Keep document chunk copies in sync
When build() later updates a chunk with extracted entity IDs and embeddings, this replaces only chunksByID; the copy inside documentsByID[chunk.documentID].chunks remains the pre-build version. Consumers reading graph.document(id)?.chunks or the documents[].chunks section of saved JSON therefore see stale chunks missing entities/embeddings while graph.chunks contains the enriched data.
Useful? React with 👍 / 👎.
| } | ||
| } | ||
| let index = relationships.count | ||
| relationships.append(relationship) |
There was a problem hiding this comment.
Reject relationships with missing endpoints
If a public caller adds a relationship whose source exists but target has not been added, the edge is still indexed here and neighbors(of:) exposes the missing target. Traversal from the existing source can then return an EntityID in TraversalResult.entities that is absent from the graph, so relationship insertion should validate both endpoints or ignore dangling edges.
Useful? React with 👍 / 👎.
| for chunk in graph.chunks { | ||
| index(id: chunk.id.raw, content: chunk.content, embedding: chunk.embedding) |
There was a problem hiding this comment.
Clear stale ids when indexing a new graph
This public full-graph indexing path only upserts the chunks present in the new graph, so if a caller indexes graph A and later indexes graph B after chunks were removed, the old ids remain in contents, BM25, and the vector store. Searches can still return deleted chunk text unless callers know to call clear() first, which makes index(graph:) unsafe as a reindex operation.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5cf557671a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for relationship in relationships | ||
| where graph.contains(relationship.source) && graph.contains(relationship.target) { |
There was a problem hiding this comment.
Filter relationships against this chunk's capped entities
Fresh evidence after the earlier cap fix is that this condition checks the global graph, not the entities that survived the current chunk's maxEntitiesPerChunk cap. If a low-confidence endpoint is dropped in this chunk but the same entity id was seen in a previous chunk, graph.contains still passes and the over-cap relationship is added with this chunk as context, defeating the per-chunk cap and creating edges for entities excluded from the chunk.
Useful? React with 👍 / 👎.
| public mutating func addChunk(_ chunk: TextChunk) { | ||
| if chunksByID[chunk.id] == nil { chunkOrder.append(chunk.id) } | ||
| chunksByID[chunk.id] = chunk |
There was a problem hiding this comment.
Keep document chunk snapshots in sync
When build() later calls graph.addChunk(current) to attach entity ids and embeddings, this only updates chunksByID; the copy stored in documentsByID[chunk.documentID].chunks remains the original pre-build chunk. After a successful build, callers reading knowledgeGraph().document(id)?.chunks or the encoded documents section from save(toJSON:) see chunks with stale entities/embedding even though graph.chunks contains the indexed data.
Useful? React with 👍 / 👎.
| if path.count > maxDepth { continue } | ||
| for (neighbor, _) in neighbors(of: current) where !visited.contains(neighbor) { | ||
| let newPath = path + [neighbor] | ||
| if neighbor == target { return newPath } |
There was a problem hiding this comment.
Enforce zero-hop maxDepth in relationship paths
For maxDepth == 0, path.count is still 1 at the source, so this loop expands one edge and returns a direct neighbor before checking the new path length. Callers asking for paths within zero hops between different entities can therefore get [source, target], contrary to the documented hop-count limit.
Useful? React with 👍 / 👎.
| for rawWord in text.split(whereSeparator: { $0.isWhitespace }) { | ||
| var cleaned = "" | ||
| for ch in rawWord where ch.isLetter || ch.isNumber { | ||
| cleaned.append(contentsOf: ch.lowercased()) | ||
| } |
There was a problem hiding this comment.
Split punctuation-separated BM25 terms
When indexed text contains common punctuation between words, such as graph-based, nodes/edges, or Paris,France, this tokenizer strips the punctuation and concatenates the pieces into one token (graphbased, nodesedges, etc.). Keyword search for the individual terms then cannot match those documents, which is especially visible for keyword retrieval mode where BM25 is the only signal.
Useful? React with 👍 / 👎.
| if let range = lowerContent.range(of: name.lowercased()) { | ||
| // Derive both offsets from the matched range; case folding can | ||
| // change grapheme counts, so `start + name.count` is unreliable. | ||
| let start = lowerContent.distance(from: lowerContent.startIndex, to: range.lowerBound) | ||
| let end = lowerContent.distance(from: lowerContent.startIndex, to: range.upperBound) | ||
| mentions.append( |
There was a problem hiding this comment.
Match LLM entity mentions on token boundaries
For LLM extraction, this records the first case-insensitive substring match for an entity name. If the name appears inside an earlier token, e.g. the model extracts Ann from Annabelle later met Ann, the mention offsets point at Annabelle instead of the actual entity mention, corrupting the graph's evidence spans for that entity.
Useful? React with 👍 / 👎.
| for i in 0..<entities.count { | ||
| for j in (i + 1)..<entities.count { | ||
| let a = entities[i] | ||
| let b = entities[j] | ||
| let relType = relationType(for: a.entityType, b.entityType, context: context) |
There was a problem hiding this comment.
Scope pattern relationships to nearby mentions
When one chunk contains more than one unrelated fact, these nested loops evaluate every entity pair against the whole chunk context. For example, Alice Smith works for Acme Corporation. Bob Jones works for Beta Corporation. makes the single works for phrase apply to both people and both organizations, so default offline builds add false WORKS_FOR edges whenever multiple facts share a chunk.
Useful? React with 👍 / 👎.
| self.numCtx = numCtx | ||
| } | ||
|
|
||
| var baseURL: String { "\(host):\(port)" } |
There was a problem hiding this comment.
Normalize bare Ollama host names
When callers pass a normal host value like host: "localhost" or host: "127.0.0.1", this constructs URLs such as localhost:11434/api/tags without an HTTP scheme; URL treats localhost as the scheme and leaves the host empty, so URLSession rejects the request and Ollama appears unavailable despite a valid host. Either normalize bare hosts by adding http:// or make the parameter clearly require a full URL base.
Useful? React with 👍 / 👎.
…dation
- PatternExtractor: only person titles cross a sentence period (org suffixes like
"Acme Inc. Bob" now split); infer relationships only between entities sharing a
sentence, using that sentence as keyword context (prevents one "works for"
phrase linking every pair in a multi-fact chunk).
- KnowledgeGraph: addRelationship ignores dangling edges (both endpoints must be
nodes); addChunk/removeChunks keep the document's embedded chunk copies in sync
so document(id)?.chunks and saved JSON reflect enrichment.
- Engine: scope the per-chunk relationship filter to the entities that survived
this chunk's maxEntitiesPerChunk cap (not global graph state).
- HybridRetriever.index(graph:) clears first so it's a safe full reindex.
- BM25 tokenizer splits on punctuation ("graph-based" -> "graph","based").
- Ollama baseURL normalizes bare hosts ("localhost" -> "http://localhost").
- LLMExtractor matches entity mentions on token boundaries ("Ann" no longer
matches inside "Annabelle").
findAllPaths zero-hop was already handled by the `remaining <= 0` guard.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
|
All Codex round-5 findings are addressed in
Note: CI still can't run on this PR (no GitHub-hosted runner is provisioned for the private org repo), so none of these rounds are compile-verified yet — that needs an org Actions/billing (or self-hosted runner, or public-repo) change. Generated by Claude Code |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 74524ba374
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| var sid = 0 | ||
| for k in 0..<chars.count { | ||
| sentenceID[k] = sid | ||
| if chars[k] == "." || chars[k] == "!" || chars[k] == "?" { sid += 1 } |
There was a problem hiding this comment.
Preserve abbreviation periods when sentence-scoping relations
When the pattern extractor handles a titled/entity suffix abbreviation in a real sentence, e.g. Dr. Smith works for Acme Inc., this increments the sentence id at the period in Dr. even though capitalizedSpans intentionally keeps Dr. Smith as one entity. sameSentenceMentions then sees the person mention in sentence 0 and the organization in sentence 1, so the offline extractor drops the WORKS_FOR relationship entirely for common abbreviated names/suffixes.
Useful? React with 👍 / 👎.
| var lo = min(aOff, bOff) | ||
| while lo > 0 && sentence(of: lo - 1) == sentence(of: aOff) { lo -= 1 } | ||
| var hi = max(aOff, bOff) | ||
| while hi < chars.count && sentence(of: hi) == sentence(of: aOff) { hi += 1 } | ||
| let context = (lo < hi ? String(chars[lo..<hi]) : "").lowercased() |
There was a problem hiding this comment.
Scope relation keywords to each mention pair
Fresh evidence after the same-sentence guard is that the keyword context is still the entire sentence, so a sentence like Alice Smith works for Acme Corporation and Bob Jones works for Beta Corporation. makes the single works for phrase apply to every person/org pair in that sentence. This produces false WORKS_FOR edges such as Alice→Beta and Bob→Acme; the context needs to be limited to the text between/around the selected mention pair rather than the whole sentence.
Useful? React with 👍 / 👎.
| for rawWord in text.split(whereSeparator: { $0.isWhitespace }) { | ||
| var cleaned = "" | ||
| for ch in rawWord where ch.isLetter || ch.isNumber { | ||
| cleaned.append(contentsOf: ch.lowercased()) | ||
| } | ||
| if !cleaned.isEmpty { tokens.append(cleaned) } |
There was a problem hiding this comment.
Split hash-embedding tokens on punctuation
In semantic-only mode, the default HashEmbedder is the only retrieval signal, but this tokenizer strips punctuation inside each whitespace-delimited word and concatenates the pieces. Indexed text like graph-based or nodes/edges hashes as graphbased/nodesedges, while queries for graph based or nodes edges hash different tokens and lose the semantic overlap that should retrieve those chunks.
Useful? React with 👍 / 👎.
| nodeCount: n, | ||
| edgeCount: relationshipCount, | ||
| averageDegree: avgDegree, | ||
| maxDepth: 0 |
There was a problem hiding this comment.
Report the actual graph max depth
For any non-empty connected graph, this public stats method always reports maxDepth as zero, so analytics or monitoring code that uses KnowledgeGraph.stats() gets a diameter/depth value that says the graph has no reachable hops even when relationships exist. This should be computed from the graph topology or omitted rather than hard-coded.
Useful? React with 👍 / 👎.
| public init(documentFrequencies: [String: Int] = [:], totalDocuments: Int = 1) { | ||
| self.documentFrequencies = documentFrequencies | ||
| self.totalDocuments = max(1, totalDocuments) |
There was a problem hiding this comment.
Start TF-IDF corpus counts at zero
A default TfIdfKeywordExtractor already has totalDocuments == 1, so callers that build corpus statistics solely through addDocumentToCorpus get an off-by-one corpus size: after one added document corpusStats() reports two documents and IDF is computed with an extra phantom document. The fallback for an empty corpus should not be stored as a real document count before any documents are added.
Useful? React with 👍 / 👎.
| let d = dampingFactor | ||
| let teleport = (1.0 - d) / Double(n) |
There was a problem hiding this comment.
Clamp PageRank damping to a probability
Because dampingFactor is public, callers can construct PageRank(dampingFactor: 3) and this uses it directly as a probability; the teleport term becomes negative and PageRank can return negative scores despite the API documenting scores in [0, 1]. Validate or clamp the damping factor to [0, 1] before using it in the iteration.
Useful? React with 👍 / 👎.
…tion scoping
- HashEmbedder tokenizer splits on punctuation (matches BM25) so "graph-based"
and the query "graph based" share tokens in semantic-only mode.
- TfIdfKeywordExtractor starts totalDocuments at 0 (no phantom document); the
smoothed IDF still handles an empty corpus.
- PageRank clamps dampingFactor to [0, 1] so a bad value can't yield negative
scores.
- KnowledgeGraph.stats() reports the real maxDepth (undirected graph diameter)
instead of a hardcoded 0.
- PatternExtractor: an abbreviation period ending a person title ("Dr.") no
longer splits the sentence, so "Dr. Smith works for Acme Inc." keeps its
WORKS_FOR edge; relationship keyword context is now the span between the two
mentions, and pairs with an intervening same-type entity are skipped, removing
false edges in multi-fact sentences.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
|
Codex round-6 findings addressed in
Note: this is an offline heuristic extractor, not a full relation classifier — the proximity rules trade a little recall for far fewer false edges. CI still can't run on this PR (no runner provisioned for the private org repo), so these remain compile-unverified. Generated by Claude Code |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3bfd74fa26
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return PatternEntityExtractor.personTitles.contains( | ||
| String(chars[s..<periodIndex]).lowercased()) |
There was a problem hiding this comment.
Keep org suffix periods within the same sentence
Fresh evidence in the current code is that this abbreviation check only consults personTitles, so organization suffixes such as Inc./Corp. still increment the sentence id. When the text is Acme Inc. was founded by Sam Altman, the org mention is placed in a different sentence from the person mention and sameSentenceMentions drops the LEADS relationship even though the abbreviation is mid-sentence.
Useful? React with 👍 / 👎.
| entitiesByID.removeAll() | ||
| entityOrder.removeAll() | ||
| relationships.removeAll() | ||
| outgoing.removeAll() | ||
| incoming.removeAll() |
There was a problem hiding this comment.
Clear chunk entity references too
When this public clear method is called, it removes the entity table but leaves every TextChunk.entities array in chunksByID and the embedded document chunk copies untouched. Callers that clear and then inspect graph.chunks or save the graph get chunks containing dangling entity IDs that graph.contains no longer recognizes; the same inconsistent snapshot can be exposed if a rebuild fails after this clear step.
Useful? React with 👍 / 👎.
| } | ||
|
|
||
| // Advance with overlap, snapped back to a word boundary. | ||
| var nextStart = max(0, end - overlap) |
There was a problem hiding this comment.
Clamp public chunker overlap before advancing
When callers use HierarchicalChunker directly with a negative overlap, this computes the next start as end - overlap, which moves it past the current chunk end and silently leaves source text uncovered; for example, chunkSize: 200, overlap: -50 jumps from roughly 200 to 250 and drops the intervening characters. TextProcessor rejects this input, but the public chunkText/chunkSpans APIs do not, so direct chunking can lose document content.
Useful? React with 👍 / 👎.
| if current.count > 2, !TfIdfKeywordExtractor.defaultStopwords.contains(current) { | ||
| tokens.append(current) |
There was a problem hiding this comment.
Index two-letter acronym terms in BM25
In keyword-only retrieval, important acronym-only queries such as AI, ML, or EU are dropped by this length check, so a document containing AI safety cannot be found by the query AI even though BM25 is the only signal. Common short stopwords are already filtered separately, so excluding every two-letter token loses meaningful technical and location terms.
Useful? React with 👍 / 👎.
| if var current = graph.chunk(id) { | ||
| current.entities = entities.map(\.id) | ||
| graph.addChunk(current) |
There was a problem hiding this comment.
Don't apply extraction results to replaced chunks
When another task replaces the same document id while extractor.extract is suspended, graph.chunk(id) can now be the new chunk with the same id, but this writes entity ids extracted from the old content onto it; stale entities and relationships were also inserted just above. Even though the final build is marked unbuilt, knowledgeGraph()/save(toJSON:) can expose the new chunk annotated with entities from the replaced text until a later rebuild.
Useful? React with 👍 / 👎.
| Relationship( | ||
| source: sourceID, target: targetID, relationType: relType, | ||
| confidence: data.strength ?? 0.7, context: [chunk.id])) |
There was a problem hiding this comment.
Clamp LLM relationship strengths
When an LLM returns a strength outside the expected 0...1 range, such as 10 or -1, this stores it directly as relationship confidence. Those values are later used for traversal filtering and PageRank edge weights, so a single unbounded model output can either overweight an edge massively or make an otherwise valid relationship disappear under the default strength filter.
Useful? React with 👍 / 👎.
…, BM25 - PatternExtractor: sentence segmentation treats org suffixes (Inc./Corp./Ltd.) and common abbreviations as non-sentence-ending, so "Acme Inc. was founded by Sam Altman" keeps its LEADS edge. Entity-span splitting still uses personTitles. - KnowledgeGraph.clearEntitiesAndRelationships clears chunk entity references in both chunksByID and the document copies, so no chunk points at a removed id. - BM25 keeps 2-letter acronym terms (AI/ML/EU); only single chars are dropped. - HierarchicalChunker clamps negative overlap (public API parity with TextProcessor). - LLMExtractor clamps model-provided relationship strength to [0, 1]. - Engine annotates a chunk during build only when its content still matches what was extracted/embedded, so a document replaced mid-await isn't tagged with stale entities/embeddings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
|
✅ CI is green on Codex round-7 findings were addressed in
Generated by Claude Code |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b3de614f12
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for id in chunkIDs { | ||
| guard let chunk = graph.chunk(id) else { continue } | ||
| let extracted = try await extractor.extract(from: chunk) | ||
| var entities = extracted.entities |
There was a problem hiding this comment.
Filter all extractors by minConfidence
When GraphRAG uses Ollama/LLM extraction or any injected extractor, EntityConfig.minConfidence is never applied here: LLMEntityExtractor assigns every entity confidence 0.9, and the engine only caps by count before adding. A config such as EntityConfig(minConfidence: 0.95) therefore still indexes those entities and their relationships, so the public threshold silently works only for the default pattern extractor.
Useful? React with 👍 / 👎.
| /// splitting still uses the narrower `personTitles`. | ||
| static let sentenceAbbreviations: Set<String> = [ | ||
| "dr", "prof", "mr", "mrs", "ms", "jr", "sr", "st", | ||
| "inc", "corp", "ltd", "llc", "co", "etc", "vs", |
There was a problem hiding this comment.
Disambiguate sentence-final org suffixes
When a company suffix actually terminates a sentence, e.g. Alice Smith works for Acme Inc. Bob Jones works for Beta Corp., including inc/corp in this unconditional abbreviation set prevents the sentence id from advancing. The same-sentence guard then treats both facts as one sentence and can emit cross-sentence false edges such as Acme→Bob or Alice→Bob; the abbreviation check needs following-token context instead of always suppressing these periods.
Useful? React with 👍 / 👎.
|
|
||
| public mutating func addDocument(_ document: Document) { | ||
| if documentsByID[document.id] == nil { documentOrder.append(document.id) } | ||
| documentsByID[document.id] = document |
There was a problem hiding this comment.
Remove old chunks when replacing documents
When KnowledgeGraph is used directly, this overwrite path does not remove chunks already indexed for the same document id. After adding a document and its chunks, then calling addDocument again with the same id for a reingest, graph.chunks and saved JSON can still contain the old chunk text even though document(id) now points at the replacement; the cleanup done by GraphRAG.addDocument is not applied for direct KnowledgeGraph callers.
Useful? React with 👍 / 👎.
| let gleanResponse = try await model.complete(gleanPrompt, params: params) | ||
| if let extra = LLMEntityExtractor.parse(gleanResponse) { | ||
| if extra.entities.isEmpty && extra.relationships.isEmpty { break } | ||
| output.entities.append(contentsOf: extra.entities) |
There was a problem hiding this comment.
Deduplicate gleaned LLM entities before capping
When gleaningRounds > 0 and a continuation repeats an entity from an earlier pass, this appends the duplicate and convert returns both copies with the same id. The engine applies maxEntitiesPerChunk before KnowledgeGraph.addEntity merges ids, so repeated high-confidence entities can consume the cap and drop distinct entities and their relationships from the chunk.
Useful? React with 👍 / 👎.
…dedup, reingest - PatternExtractor: an abbreviation period ends a sentence when the next word is capitalized, so "Acme Inc. was founded by Sam Altman" stays one sentence while "...Acme Inc. Bob Jones..." splits (person titles still never split). - Engine: apply EntityConfig.minConfidence uniformly to every extractor's output (LLM/injected extractors don't self-filter), before the per-chunk cap. - LLMExtractor: deduplicate entities by id in convert (gleaning passes can repeat an entity) so a duplicate can't consume the per-chunk cap. - KnowledgeGraph.addDocument: when replacing an existing document id, purge the previous version's chunks so direct callers don't retain stale chunk text. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
|
Codex round-8 findings addressed in
(CI now runs on each push; Generated by Claude Code |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f6bdaf4fea
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // "... Acme Inc. Bob ..." splits. | ||
| if PatternEntityExtractor.sentenceAbbreviations.contains(word) { | ||
| var t = periodIndex + 1 | ||
| while t < chars.count && chars[t] == " " { t += 1 } |
There was a problem hiding this comment.
Treat newlines after org suffixes as sentence breaks
Fresh evidence in the current code is that this look-ahead skips only literal spaces, so an organization suffix before a line break is treated as non-terminal. For text like Alice Smith works for Acme Inc.\nBob Jones works for Beta Corp., Inc. does not advance the sentence id, causing both lines to be considered one sentence and allowing cross-sentence false relationships to be inferred; skip all non-newline whitespace carefully or treat \n after the suffix as a boundary before checking the next capitalized word.
Useful? React with 👍 / 👎.
| var outWeight = [Double](repeating: 0, count: n) | ||
| for rel in graph.relationships { | ||
| guard let s = indexOf[rel.source], let t = indexOf[rel.target] else { continue } | ||
| let w = Double(max(rel.confidence, 0.0001)) |
There was a problem hiding this comment.
Ignore zero-confidence edges in PageRank
When a relationship has confidence == 0 (for example, an LLM strength of 0 after clamping or a public caller marking an edge as no-strength), this converts it to a tiny positive weight and also uses that same tiny value in the source's total out-weight. If it is the source's only outgoing edge, the edge receives 100% of that node's PageRank contribution despite being explicitly zero-confidence; skip non-positive weights or treat those sources as dangling instead.
Useful? React with 👍 / 👎.
| for ma in a.mentions { | ||
| for mb in b.mentions where sentence(ma.startOffset) == sentence(mb.startOffset) { | ||
| return (ma.startOffset, mb.startOffset) | ||
| } |
There was a problem hiding this comment.
Consider later mention pairs before fixing relation type
When the same entity pair appears more than once in a chunk, this returns the first same-sentence mentions even if that co-occurrence has only generic context. For example, Alice Smith met Acme Corporation. Later, Alice Smith works for Acme Corporation. records only ASSOCIATED_WITH from the first sentence and never considers the later works for mention pair, so the specific WORKS_FOR edge is missed; evaluate all same-sentence mention pairs or prefer the pair whose local context matches a typed relation.
Useful? React with 👍 / 👎.
…r, PR weights - PatternExtractor: a newline (or end of text) after an org-suffix abbreviation is a sentence boundary, so "Acme Inc.\nBob Jones ..." splits into two sentences. - PatternExtractor: evaluate every same-sentence mention pair for an entity pair and keep the most specific typed relation, so a later "works for" co-occurrence isn't shadowed by an earlier generic one. - PageRank: skip non-positive-confidence edges (treating such sources as dangling) so a single zero-confidence edge can't capture all of a node's mass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
|
Codex round-9 findings addressed in
( Generated by Claude Code |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 420f1ccf74
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - entity_name: Name of the entity, capitalized | ||
| - entity_type: One of the following types: [{entity_types}] | ||
| - entity_description: Comprehensive description of the entity's attributes and activities |
There was a problem hiding this comment.
Accept the prompt's entity field names
When the LLM follows these bullet instructions and returns keys like entity_name / entity_type instead of the shorter keys in the later JSON example, EntityData only decodes name / type, defaults both fields to empty strings, and convert drops the entity entirely. This makes Ollama-backed extraction brittle for a response format the prompt itself asks for; align the prompt or decode both key styles.
Useful? React with 👍 / 👎.
| // (./!/?) also ends it ("Acme. Bob") — unless the word is a | ||
| // known abbreviation/title like "Dr." so "Dr. Smith" merges. | ||
| if j > runStart, let last = chars[j - 1].unicodeScalars.first { | ||
| if CharacterSet(charactersIn: ",;:").contains(last) { break } |
There was a problem hiding this comment.
Keep comma-suffixed organization names together
For common legal names such as Alice works for Acme, Inc., this unconditional comma break ends the capitalized span at Acme; under the default minConfidence the single-word Acme is discarded and Inc can be extracted as the organization instead. That drops or corrupts organization relationships for normal company-name punctuation, so the comma should not split when it is immediately followed by an organization suffix.
Useful? React with 👍 / 👎.
| if words.count >= 2 { | ||
| return ("PERSON", 0.8) |
There was a problem hiding this comment.
Strip leading articles before classifying spans
When a sentence starts with names like The United States or The University of California, the captured span still includes The, so the known-location and organization-prefix checks miss and this fallback labels the whole span as PERSON. That corrupts common location/organization nodes and their downstream relationships; trim leading articles/blocklisted words before applying the location/prefix checks or before falling back to person.
Useful? React with 👍 / 👎.
| // Accept bare hosts ("localhost", "127.0.0.1"): without a scheme, URL | ||
| // parses the host as the scheme and the request fails. | ||
| let normalizedHost = host.contains("://") ? host : "http://\(host)" | ||
| return "\(normalizedHost):\(port)" |
There was a problem hiding this comment.
Avoid appending a second Ollama port
If callers pass the common full Ollama base URL as the host, e.g. OllamaConfig(host: "http://localhost:11434") or withOllama(host: "http://localhost:11434"), this returns http://localhost:11434:11434, which URL(string:) rejects and makes availability checks and requests fail. Since the code already accepts hosts with a scheme, detect an existing port or document/enforce that host must exclude it.
Useful? React with 👍 / 👎.
| } else { | ||
| // Replacing an existing id: purge the previous version's chunks so | ||
| // direct KnowledgeGraph callers don't retain stale chunk text. | ||
| removeChunks(forDocument: document.id) |
There was a problem hiding this comment.
Drop stale graph evidence when replacing documents
When KnowledgeGraph is used directly and a document id is replaced after entities/relationships were added for its old chunks, this only removes the chunk records. The old entities still keep mentions pointing at deleted chunk IDs and relationships still keep those chunk IDs in context, so graph traversal/stats and saved JSON can expose facts from a document version that no longer exists; replacement should also remove or scrub evidence tied to the removed chunks.
Useful? React with 👍 / 👎.
…, articles, evidence scrub
- PatternExtractor: a comma followed by an org suffix doesn't split the run, so
"Acme, Inc." stays one organization span; leading articles ("The United States",
"The University of California") are stripped before classification so they map
to location/organization instead of person.
- LLMExtractor: EntityData/RelationshipData decode both the short JSON-example
keys and the prompt's verbose names (entity_name/source_entity/…), so Ollama
responses in either shape parse.
- Ollama baseURL: don't append a second port when the host already includes one
("http://localhost:11434").
- KnowledgeGraph.removeChunks: scrub entity mentions and relationship context that
reference removed chunks, so replacing a document leaves no stale evidence.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
|
Codex round-10 findings addressed in
( Generated by Claude Code |
The alias CodingKeys added in round 10 (entity_name, source_entity, ...) broke synthesized Encodable conformance, since every CodingKeys case must map to a stored property. These structs are only decoded, so declare them Decodable instead of Codable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f872b93bc4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| public func findRelationshipPath( | ||
| from source: EntityID, to target: EntityID, maxDepth: Int = 5 | ||
| ) -> [EntityID]? { | ||
| if source == target { return [source] } |
There was a problem hiding this comment.
Require endpoints to exist for self paths
When callers ask for a path from an ID to itself and that entity was never added, this returns [source] before checking graph membership, producing a valid-looking zero-hop path for a missing node. Other traversal APIs guard graph.contains(...); this should verify the endpoint exists before returning the self path.
Useful? React with 👍 / 👎.
| let upper = min(hi + 1, chars.count) | ||
| let context = (lo < upper ? String(chars[lo..<upper]) : "").lowercased() |
There was a problem hiding this comment.
Include trailing relation cues in pair context
When the relation phrase comes after both mentions, such as Alice Smith and Bob Jones are married, hi is only the later mention's start offset and this slice stops after its first character, so relationType never sees cues like married or colleague and downgrades the edge to generic KNOWS. Include the full second mention plus a bounded following/sentence window for the selected pair.
Useful? React with 👍 / 👎.
| self.retriever = HybridRetriever( | ||
| config: HybridConfig(maxCandidates: max(100, config.topKResults * 10))) |
There was a problem hiding this comment.
Wire retrieval config into search
When callers set Config(retrieval: ...), the graph-expansion knobs are silently ignored: this creates HybridRetriever with only a derived maxCandidates, and a repo-wide search for config.retrieval/maxExpansionDepth shows no production code reads them. As a result maxExpansionDepth, entityWeight, chunkWeight, and graphWeight have no effect on search/ask, despite being public configuration.
Useful? React with 👍 / 👎.
| for k in 0..<chars.count { | ||
| sentenceID[k] = sid | ||
| let c = chars[k] | ||
| if c == "!" || c == "?" || (c == "." && periodIsSentenceEnd(k)) { sid += 1 } |
There was a problem hiding this comment.
Treat bare newlines as sentence boundaries
When chunks contain line-separated facts without sentence punctuation, this sentence map never increments on \n, so both lines share one sentence. For example Alice Smith works for Acme Corporation\nBob Jones works for Beta Corporation can emit cross-line generic edges such as Alice↔Bob or Acme↔Bob even though the same-sentence guard was meant to prevent facts in separate lines from being linked.
Useful? React with 👍 / 👎.
| // Proximity heuristic: skip if another same-type entity | ||
| // lies between them — the phrase likely belongs to that | ||
| // nearer pair. (Offline extractor, not a full classifier.) | ||
| if hasInterveningSameType(a, b, lo: lo, hi: hi, among: entities) { continue } |
There was a problem hiding this comment.
Keep coordinated subjects linked to shared objects
When a relation applies to a coordinated list, such as Alice Smith and Bob Jones are employed by Acme Corporation, the Alice→Acme pair is skipped solely because Bob is another PERSON mention between them. That leaves only the nearest subject linked to the organization and drops valid relationships for earlier coordinated subjects; the intervening-entity heuristic needs to allow shared predicates rather than unconditionally skipping these pairs.
Useful? React with 👍 / 👎.
| // Stage 1: entity & relationship extraction per chunk. | ||
| for id in chunkIDs { | ||
| guard let chunk = graph.chunk(id) else { continue } | ||
| let extracted = try await extractor.extract(from: chunk) |
There was a problem hiding this comment.
Skip relationship extraction when disabled
When EntityConfig(extractRelationships: false) is used with an Ollama/LLM extractor, this still invokes the full extractor that prompts the model to identify relationships; the flag only prevents inserting them later. In that configuration users still pay the latency/token cost of relationship extraction even though they explicitly disabled relationships, so the build should avoid requesting or inferring them in the first place.
Useful? React with 👍 / 👎.
… dead config - KnowledgeGraph.findRelationshipPath requires both endpoints to exist before returning the self path. - PatternExtractor: treat bare newlines as sentence boundaries; extend a pair's keyword context to the sentence end when no other entity mention follows the later mention (captures trailing cues like "... are married" without leaking into a following pair). Documented the deliberate precision-over-recall trade-off for coordinated subjects. - Remove the unused RetrievalConfig (maxExpansionDepth / *Weight knobs were never read) so Config exposes no silently-ignored settings. - Engine: documented that extractRelationships gates insertion only (skipping an LLM's relationship prompting would need a protocol change). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
|
Codex round-11 — addressed in Fixed
Rightfully ignored (with in-code rationale)
( Generated by Claude Code |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab37e9a485
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| } | ||
| for idx in relationships.indices where !relationships[idx].context.isEmpty { | ||
| relationships[idx].context.removeAll { removed.contains($0) } |
There was a problem hiding this comment.
Drop relationships once their evidence chunks are removed
Fresh evidence in the current code is that replacement now scrubs removed chunk IDs from context, but it keeps the relationship in relationships/adjacency even when that was its only evidence. When a direct KnowledgeGraph caller replaces a document after adding an extracted edge with context: [oldChunk], relationshipCount, traversal, and saved JSON still expose that stale fact as a context-less edge for the old document version.
Useful? React with 👍 / 👎.
| public var maxConnections: Int | ||
| public var threshold: Float |
There was a problem hiding this comment.
Honor or remove dead graph config knobs
Fresh evidence after the unused retrieval config was removed is that these public graph knobs are still silently ignored: a repo-wide search for config.graph/maxConnections finds no production reads outside this declaration. Callers can set Config(graph: GraphConfig(maxConnections: 1, threshold: 0.99)) and get the same graph construction/retrieval behavior, so experiments relying on those limits or thresholds are misconfigured without any signal.
Useful? React with 👍 / 👎.
| for (neighbor, relationship) in graph.neighbors(of: current) { | ||
| guard passesFilter(relationship) else { continue } | ||
| if !visited.contains(neighbor) { |
There was a problem hiding this comment.
Deduplicate entity paths across relation types
When two nodes have multiple relationship types between them, neighbors(of:) returns one entry per type, but findAllPaths returns only [EntityID] paths. This loop therefore recurses once per edge type and appends identical entity paths; with a low maxPaths, duplicates like two copies of [A, B] can consume the cap before distinct paths are returned.
Useful? React with 👍 / 👎.
… config, path dedup - KnowledgeGraph.removeChunks now drops a relationship whose only evidence was a removed chunk (context-less leftover would expose a stale fact) and rebuilds the adjacency index; edges that never had context are preserved. - Remove the unused GraphConfig (maxConnections/threshold were never read) so Config exposes no silently-ignored knobs. - findAllPaths collapses parallel edges (multiple relation types between the same nodes) to one neighbor, so entity-only paths aren't duplicated and don't waste the maxPaths budget. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
|
Codex round-12 findings addressed in
( Generated by Claude Code |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b6dfa551cd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| entity.mentions = kept | ||
| entitiesByID[eid] = entity |
There was a problem hiding this comment.
Drop entities left with no surviving mentions
When a direct KnowledgeGraph caller replaces a document after indexing entities from its old chunks, this loop removes the old chunk mentions but writes the entity back even when kept is empty. The stale entity still appears in entityCount, entities, stats, and saved JSON even though none of its evidence chunks exist anymore; entities whose mentions were exhausted by the replacement should be removed unless they have independent evidence.
Useful? React with 👍 / 👎.
| /// Byte offset of the chunk start within the original document content. | ||
| public var startOffset: Int | ||
| /// Byte offset of the chunk end within the original document content. |
There was a problem hiding this comment.
Stop documenting character offsets as byte offsets
These public fields are populated by TextProcessor/HierarchicalChunker using Array(text) character indices, not UTF-8 byte offsets. For any non-ASCII document, callers following this API documentation will slice or highlight at the wrong byte positions in the original content; either the stored offsets need to be byte counts or the public contract should say they are character offsets.
Useful? React with 👍 / 👎.
| // Stage 2: embed chunks. | ||
| for id in chunkIDs { | ||
| guard let chunk = graph.chunk(id) else { continue } | ||
| let embedding = try await embedder.embed(chunk.content) |
There was a problem hiding this comment.
Skip embeddings for keyword-only builds
When callers configure Config(approach: "keyword"), retrieval later sets includeSemantic to false and never uses query or chunk embeddings, but build() still calls the embedder for every chunk here. With an Ollama or otherwise remote/unavailable embedder, a keyword-only corpus can fail or pay full embedding latency even though BM25 is the only signal used; the embedding stage should be skipped when semantic retrieval is disabled.
Useful? React with 👍 / 👎.
…-only embeds, fix offset docs - KnowledgeGraph.removeChunks removes entities whose mentions are entirely exhausted by the removal (no independent evidence) and drops relationships whose endpoint was removed, alongside the existing context scrub. - Engine.build skips the embedding stage entirely for approach == "keyword", so a keyword-only corpus doesn't pay embedder latency or fail on a remote embedder. - TextChunk start/end offsets are documented as character (grapheme) offsets, not UTF-8 byte offsets, matching how the chunker computes them. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
|
Codex round-13 findings addressed in
( Generated by Claude Code |
Summary
This PR ports the core library of
graphrag-rs— a graph-based retrieval-augmented generation system — into idiomatic, dependency-free Swift 6 code. The implementation runs fully offline by default and optionally integrates with local Ollama servers for LLM-backed extraction and answer generation.Key Changes
Core Infrastructure
DocumentID,EntityID,ChunkID) and domain models (Entity,Relationship,TextChunk,Document)GraphRAGErrorenumLanguageModel,EmbeddingModel, andEntityExtractingText Processing
HierarchicalChunker: recursive, separator-aware text chunking with configurable overlapTextProcessorfor document chunking with metadata enrichment (word counts, keywords)TfIdfKeywordExtractorfor corpus-aware keyword extraction with stopword filteringEntity & Relationship Extraction
PatternEntityExtractor: offline, heuristic-based entity detection via Title-Case span analysis and suffix/prefix classificationLLMEntityExtractor: LLM-driven extraction with staged JSON parsing fallbacks and optional gleaning roundsPromptsmodule with extraction and answer-generation templatesKnowledge Graph
KnowledgeGraph: value-type graph with insertion-order preservation, O(1) entity/relationship lookup, and adjacency indexesPageRankcomputation andGraphTraversal(BFS/multi-source BFS with edge-strength filtering)Retrieval
BM25Retriever: Okapi BM25 keyword search with configurable k1/b parametersInMemoryVectorStore: brute-force cosine-similarity vector searchHybridRetriever: fused keyword + semantic search with configurable fusion methods (RRF, weighted sum, max-score)Embeddings
HashEmbedder: deterministic, offline embedding via FNV-1a hashing with L2 normalizationOllamaConfigandOllamaLanguageModel: HTTP client for local Ollama daemon (completions and embeddings)High-Level Orchestration
GraphRAGactor: thread-safe orchestrator managing document ingestion, graph building, and question answeringGraphRAGBuilder: fluent configuration builder with sensible defaultsConfighierarchy for chunking, embedding, entity extraction, retrieval, and graph parametersTesting
Notable Implementation Details
Characterarrays (extended grapheme clusters) for Unicode safety by constructionGraphRAGis anactorto safely share mutable graph/index state across async tasksHashEmbedderprovides stable, reproducible embeddings without external dependencieshttps://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh