Skip to content

Port graphrag-rs core library to Swift#1

Open
ronaldmannak wants to merge 18 commits into
mainfrom
claude/graphrag-rs-port-j0im9e
Open

Port graphrag-rs core library to Swift#1
ronaldmannak wants to merge 18 commits into
mainfrom
claude/graphrag-rs-port-j0im9e

Conversation

@ronaldmannak

Copy link
Copy Markdown
Contributor

Summary

This PR ports the core library of graphrag-rs — a graph-based retrieval-augmented generation system — into idiomatic, dependency-free Swift 6 code. The implementation runs fully offline by default and optionally integrates with local Ollama servers for LLM-backed extraction and answer generation.

Key Changes

Core Infrastructure

  • Added strongly-typed identifiers (DocumentID, EntityID, ChunkID) and domain models (Entity, Relationship, TextChunk, Document)
  • Implemented unified error handling via GraphRAGError enum
  • Defined pluggable protocols for LanguageModel, EmbeddingModel, and EntityExtracting

Text Processing

  • Ported HierarchicalChunker: recursive, separator-aware text chunking with configurable overlap
  • Implemented TextProcessor for document chunking with metadata enrichment (word counts, keywords)
  • Added TfIdfKeywordExtractor for corpus-aware keyword extraction with stopword filtering

Entity & Relationship Extraction

  • Implemented PatternEntityExtractor: offline, heuristic-based entity detection via Title-Case span analysis and suffix/prefix classification
  • Implemented LLMEntityExtractor: LLM-driven extraction with staged JSON parsing fallbacks and optional gleaning rounds
  • Added Prompts module with extraction and answer-generation templates

Knowledge Graph

  • Ported KnowledgeGraph: value-type graph with insertion-order preservation, O(1) entity/relationship lookup, and adjacency indexes
  • Implemented graph analytics: degree, closeness, and betweenness centrality (Brandes' algorithm)
  • Added PageRank computation and GraphTraversal (BFS/multi-source BFS with edge-strength filtering)

Retrieval

  • Implemented BM25Retriever: Okapi BM25 keyword search with configurable k1/b parameters
  • Implemented InMemoryVectorStore: brute-force cosine-similarity vector search
  • Implemented HybridRetriever: fused keyword + semantic search with configurable fusion methods (RRF, weighted sum, max-score)

Embeddings

  • Implemented HashEmbedder: deterministic, offline embedding via FNV-1a hashing with L2 normalization
  • Ported OllamaConfig and OllamaLanguageModel: HTTP client for local Ollama daemon (completions and embeddings)

High-Level Orchestration

  • Implemented GraphRAG actor: thread-safe orchestrator managing document ingestion, graph building, and question answering
  • Implemented GraphRAGBuilder: fluent configuration builder with sensible defaults
  • Implemented Config hierarchy for chunking, embedding, entity extraction, retrieval, and graph parameters

Testing

  • Added comprehensive test suite covering chunking, keyword extraction, BM25 ranking, vector similarity, embeddings, entity extraction, and end-to-end QA

Notable Implementation Details

  • Character-based offsets: The Rust implementation uses UTF-8 byte indices; this port operates on Swift Character arrays (extended grapheme clusters) for Unicode safety by construction
  • Actor-based concurrency: GraphRAG is an actor to safely share mutable graph/index state across async tasks
  • Pluggable backends: Embedders, language models, and entity extractors are injected as existentials, allowing offline defaults with optional LLM/Ollama integration
  • Deterministic hashing: HashEmbedder provides stable, reproducible embeddings without external dependencies
  • Graceful degradation: TF-IDF scoring and BM25 retrieval degrade gracefully when corpus statistics are unavailable

https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

Reimplement the core of the Rust crate graphrag-rs
(https://github.com/automataIA/graphrag-rs) as an idiomatic, dependency-free
Swift 6 package.

Modules:
- Core: Document/TextChunk/Entity/Relationship model, typed IDs, GraphRAGError,
  and pluggable protocols (LanguageModel, EmbeddingModel, EntityExtracting,
  ChunkingStrategy).
- Text: HierarchicalChunker, TextProcessor, TfIdfKeywordExtractor.
- Graph: KnowledgeGraph (value-type adjacency), PageRank, traversal
  (BFS/DFS/ego/paths), analytics (degree/closeness/betweenness/components).
- Retrieval: BM25Retriever, cosine InMemoryVectorStore, HybridRetriever with
  RRF/weighted/CombSUM/MaxScore fusion.
- Entity: PatternEntityExtractor (offline) and LLMEntityExtractor with the
  upstream extraction prompts and staged JSON-recovery parsing.
- Embeddings: deterministic offline HashEmbedder and OllamaEmbedder; OllamaClient
  for LLM completion.
- Orchestration: GraphRAG actor (ingest -> build -> ask), GraphRAGBuilder, Config.

Defaults mirror the Rust crate (PageRank d=0.85/tol=1e-6, BM25 k1=1.2/b=0.75,
hybrid RRF k=60, weights 0.7/0.3, traversal depth 3, min strength 0.5). Runs
fully offline by default; optional local Ollama for LLM-backed extraction and
answer generation. Includes Swift Testing coverage and a README.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive Swift 6 port of the Rust graphrag-rs crate, implementing a fully offline, graph-based Retrieval Augmented Generation pipeline. It includes core domain models, hierarchical text chunking, TF-IDF keyword extraction, a value-type knowledge graph, graph algorithms (PageRank, BFS/DFS traversal, centrality analytics), hybrid retrieval (BM25 and vector search fusion), and local LLM integration via Ollama. The review feedback identifies several critical correctness issues and architectural improvements: correcting the BM25 scoring formula to avoid double-penalizing document length, fixing character offset calculations during punctuation trimming in pattern extraction, adopting modern Swift 6 async-native APIs for network requests to support task cancellation, ensuring Unicode-safe offset calculations in LLM extraction, and implementing the documented deduplication behavior for graph neighbors.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread Sources/GraphRAG/Retrieval/BM25.swift Outdated
Comment thread Sources/GraphRAG/Entity/PatternExtractor.swift Outdated
Comment thread Sources/GraphRAG/Embeddings/Ollama.swift
Comment thread Sources/GraphRAG/Entity/LLMExtractor.swift Outdated
Comment thread Sources/GraphRAG/Graph/KnowledgeGraph.swift
claude added 2 commits June 28, 2026 01:49
…ighbor dedup

- BM25: use raw term frequency; document-length normalization is handled only by
  the |D|/avgdl denominator factor (previously normalized twice).
- PatternExtractor: trim leading/trailing punctuation on the index range so
  recorded span offsets stay aligned with the source text.
- Ollama: replace the continuation-wrapped dataTask with async-native
  URLSession.data(for:), which supports task cancellation.
- LLMExtractor: derive mention endOffset from the matched range's upperBound
  (case folding can change grapheme counts).
- KnowledgeGraph.neighbors(of:): deduplicate by (neighbor, relationType) to match
  the documented contract.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

https://github.com/PicoMLX/GraphRAG/blob/c2513c089f3a462b4200149487edb526c4acbda9/Sources/GraphRAG/Retrieval/Hybrid.swift#L336-L337
P2 Badge Drop zero-similarity semantic hits before RRF

When keyword search finds nothing and the query embedding has zero or negative cosine similarity to every chunk, vectors.search still returns the nearest maxCandidates; RRF then disables the absolute score threshold and returns those arbitrary chunks. In the offline pipeline this makes ask() answer from unrelated context instead of taking its no-results path for empty or off-topic queries.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread Sources/GraphRAG/GraphRAG/Engine.swift
Comment thread Sources/GraphRAG/Text/KeywordExtraction.swift Outdated
Comment thread Sources/GraphRAG/Graph/Traversal.swift
Comment thread Sources/GraphRAG/Text/Chunking.swift
Comment thread Sources/GraphRAG/GraphRAG/Engine.swift Outdated
Comment thread Sources/GraphRAG/Entity/PatternExtractor.swift
Comment thread Sources/GraphRAG/GraphRAG/Builder.swift Outdated
Comment thread Sources/GraphRAG/GraphRAG/Engine.swift
Comment thread Sources/GraphRAG/Retrieval/Hybrid.swift
- Hybrid: drop non-positive cosine hits before RRF so off-topic/empty queries
  take the no-results path instead of surfacing arbitrary chunks.
- Hybrid: reject non-positive search limits (prefix traps on negatives).
- Engine: remove a document's stale chunks when it is replaced by the same id.
- Engine: guard against reentrant/overlapping builds; mark unbuilt on entry so a
  failed rebuild can't leave stale state queryable; only mark built if no
  document was ingested during the build (snapshot chunk ids).
- Engine/Builder: honor config.embedding.backend == "ollama" for a manually
  constructed/deserialized Config, not only the withOllamaEmbeddings() flag.
- Traversal: stop DFS expansion at maxDepth so edges to unvisited nodes are not
  recorded (matches BFS and the documented depth limit).
- KeywordExtraction: smoothed IDF (log((N+1)/(df+1))+1) so a fresh extractor
  ranks by term frequency instead of collapsing every score to zero.
- Chunking: reject negative chunk overlap (would skip text between chunks).
- PatternExtractor: stop a Title-Case run at separating punctuation so
  "Alice, Bob" yields two entities rather than one merged node.

Adds regression tests for negative overlap, document replacement, DFS depth
bounds, and negative top-K.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

Copy link
Copy Markdown
Contributor Author

All Codex findings are addressed in af6d547 (replies left on each inline thread; threads resolved):

  • Drop zero-similarity semantic hits before RRF (the top-level review note): HybridRetriever.search now filters out non-positive cosine hits before fusion, so an off-topic/empty query (no keyword matches and all similarities ≤ 0) takes ask()'s no-results path instead of surfacing arbitrary nearest chunks.
  • Stale chunks removed when a document is replaced (removeChunks(forDocument:)).
  • Reentrant/overlapping builds rejected; isBuilt=false on entry; build only succeeds if no document was ingested mid-build (snapshot + ingestionVersion).
  • Builder/Engine honor config.embedding.backend == "ollama".
  • DFS stops expanding at maxDepth (no dangling edges).
  • Smoothed TF-IDF so a fresh extractor ranks by term frequency.
  • Negative chunk overlap and non-positive search limits rejected.
  • Title-Case runs split on ,/;/: so comma-separated names become separate entities.

Regression tests added for document replacement, DFS depth bounds, negative overlap, and negative top-K.

@codex review


Generated by Claude Code

claude added 2 commits June 28, 2026 02:05
Runs `swift build`/`swift test` in the official swift:6.3 container on every
push and pull request, matching the package's swift-tools-version. Adds a CI
status badge to the README.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh
The swift:6.3 image does not exist on Docker Hub yet (6.3 is newly announced
and not in the stable registry), so the container failed to start. No released
toolchain can parse swift-tools-version 6.3 either. The port only uses standard
Swift 6 features (actors, Sendable, async/await, swift-testing), so lower the
manifest floor to 6.1 and pin the CI container to swift:6.1 so builds and tests
actually run. Bump both back to 6.3 once a stable 6.3 toolchain/image ships.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c845d99b90

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread Sources/GraphRAG/GraphRAG/Engine.swift Outdated
Comment thread Sources/GraphRAG/Graph/Analytics.swift Outdated
Comment thread Sources/GraphRAG/GraphRAG/Engine.swift
Comment thread Sources/GraphRAG/GraphRAG/Engine.swift
Comment thread Sources/GraphRAG/GraphRAG/Engine.swift
Comment thread Sources/GraphRAG/GraphRAG/Engine.swift Outdated
Comment thread Sources/GraphRAG/Graph/KnowledgeGraph.swift Outdated
- Engine: honor `approach` (hybrid/keyword/semantic) by toggling the BM25 and
  vector stages in retrieval.
- Engine: thread `retrieval.similarityThreshold` into the semantic stage so weak
  matches can be suppressed as configured.
- Engine: enforce `maxEntitiesPerChunk` (keep highest-confidence) and drop
  relationships whose endpoints didn't survive the cap.
- Engine: always write a chunk's entity ids during build (empty clears stale ids
  left by a prior build when extraction now returns none).
- Config/Builder: remove redundant chunkSize/overlap from TextConfig so the
  top-level chunkSize/chunkOverlap is the single source of truth (a manually
  constructed Config can no longer chunk at the wrong size).
- Analytics.density(): count unique undirected pairs so reciprocal/multiple typed
  edges can't push density above 1.0.
- KnowledgeGraph.neighbors(of:): keep the highest-confidence edge per
  (neighbor, relationType) so a weak A->B can't hide a strong reciprocal B->A
  from strength-filtered traversals.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 159b1162ac

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread Sources/GraphRAG/GraphRAG/Engine.swift Outdated
Comment thread Sources/GraphRAG/Entity/PatternExtractor.swift Outdated
Comment thread Sources/GraphRAG/GraphRAG/Builder.swift Outdated
Comment thread Sources/GraphRAG/Retrieval/Hybrid.swift Outdated
Comment thread Sources/GraphRAG/GraphRAG/Engine.swift Outdated
Comment thread Sources/GraphRAG/Entity/PatternExtractor.swift
Comment thread Sources/GraphRAG/Graph/Traversal.swift Outdated
Comment thread Sources/GraphRAG/Graph/PageRank.swift Outdated
Comment thread Sources/GraphRAG/Entity/Prompts.swift Outdated
Comment thread Sources/GraphRAG/Entity/PatternExtractor.swift Outdated
… heuristics

Pattern extractor:
- Stop Title-Case runs at sentence punctuation (./!/?) with an abbreviation/title
  exception so "Acme. Bob" splits but "Dr. Smith"/"Acme Inc." stay merged.
- Recognize entities preceded by opening punctuation (quotes/parens).
- Orient asymmetric inferred relationships (WORKS_FOR/LEADS/BORN_IN/
  HEADQUARTERED_IN/LOCATED_IN) by entity role, not text order.

Config (single source of truth):
- Remove duplicated topK/similarityThreshold from RetrievalConfig; top-level
  Config.topKResults/similarityThreshold are authoritative (Engine threads the
  latter into the semantic stage).
- Builder: drive the embedding backend solely from config.embedding.backend (no
  sticky flag, so a later withConfig can switch back to hash), and sync the
  Ollama embedder dimension from the config.

Robustness:
- Engine.build re-fetches each chunk before writing so a document replaced during
  an extraction/embedding await isn't clobbered by the pre-await snapshot.
- PageRank clamps negative maxIterations and guards topEntities against negative k.
- Traversal findAllPaths treats negative depth as no expansion; egoNetwork
  de-duplicates emitted edges.
- HybridRetriever.index drops a stale vector when reindexing an id without an
  embedding.
- Prompts.fill scans once and never re-substitutes inserted values (keeps literal
  braces intact).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5cf557671a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread Sources/GraphRAG/GraphRAG/Engine.swift Outdated
// Keep only relationships whose endpoints exist in the graph
// (drops edges to entities removed by the per-chunk cap).
for relationship in relationships
where graph.contains(relationship.source) && graph.contains(relationship.target) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Filter capped relationships against selected entities

Fresh evidence after the cap fix is that this filter checks global graph state, so when the per-chunk cap drops an entity that already exists from an earlier chunk, graph.contains still passes and records the relationship. For example with maxEntitiesPerChunk = 1, if chunk 1 added A and chunk 2 extracts B high-confidence, A low-confidence, plus A-B, the cap removes A from chunk 2 but the edge is still added because A is already in the graph; that defeats the cap and attaches evidence to an entity the chunk deliberately discarded.

Useful? React with 👍 / 👎.

let word = String(chars[ws..<j])
.trimmingCharacters(in: CharacterSet(charactersIn: ".!?"))
.lowercased()
if !PatternEntityExtractor.abbreviations.contains(word) { break }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Stop org abbreviations from merging the next sentence

For sentence-final organization suffixes such as Acme Inc. Bob joined, inc is in the abbreviation allowlist, so this branch does not break on the period and the next uppercase word is consumed as the same span (Acme Inc. Bob). That drops both the organization and the person into a bogus merged entity for a common company-name pattern; only title/person abbreviations that need a following name should be allowed to cross the period.

Useful? React with 👍 / 👎.


public mutating func addChunk(_ chunk: TextChunk) {
if chunksByID[chunk.id] == nil { chunkOrder.append(chunk.id) }
chunksByID[chunk.id] = chunk

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep document chunk copies in sync

When build() later updates a chunk with extracted entity IDs and embeddings, this replaces only chunksByID; the copy inside documentsByID[chunk.documentID].chunks remains the pre-build version. Consumers reading graph.document(id)?.chunks or the documents[].chunks section of saved JSON therefore see stale chunks missing entities/embeddings while graph.chunks contains the enriched data.

Useful? React with 👍 / 👎.

}
}
let index = relationships.count
relationships.append(relationship)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject relationships with missing endpoints

If a public caller adds a relationship whose source exists but target has not been added, the edge is still indexed here and neighbors(of:) exposes the missing target. Traversal from the existing source can then return an EntityID in TraversalResult.entities that is absent from the graph, so relationship insertion should validate both endpoints or ignore dangling edges.

Useful? React with 👍 / 👎.

Comment on lines +93 to +94
for chunk in graph.chunks {
index(id: chunk.id.raw, content: chunk.content, embedding: chunk.embedding)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clear stale ids when indexing a new graph

This public full-graph indexing path only upserts the chunks present in the new graph, so if a caller indexes graph A and later indexes graph B after chunks were removed, the old ids remain in contents, BM25, and the vector store. Searches can still return deleted chunk text unless callers know to call clear() first, which makes index(graph:) unsafe as a reindex operation.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5cf557671a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread Sources/GraphRAG/GraphRAG/Engine.swift Outdated
Comment on lines +121 to +122
for relationship in relationships
where graph.contains(relationship.source) && graph.contains(relationship.target) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Filter relationships against this chunk's capped entities

Fresh evidence after the earlier cap fix is that this condition checks the global graph, not the entities that survived the current chunk's maxEntitiesPerChunk cap. If a low-confidence endpoint is dropped in this chunk but the same entity id was seen in a previous chunk, graph.contains still passes and the over-cap relationship is added with this chunk as context, defeating the per-chunk cap and creating edges for entities excluded from the chunk.

Useful? React with 👍 / 👎.

Comment on lines +84 to +86
public mutating func addChunk(_ chunk: TextChunk) {
if chunksByID[chunk.id] == nil { chunkOrder.append(chunk.id) }
chunksByID[chunk.id] = chunk

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep document chunk snapshots in sync

When build() later calls graph.addChunk(current) to attach entity ids and embeddings, this only updates chunksByID; the copy stored in documentsByID[chunk.documentID].chunks remains the original pre-build chunk. After a successful build, callers reading knowledgeGraph().document(id)?.chunks or the encoded documents section from save(toJSON:) see chunks with stale entities/embedding even though graph.chunks contains the indexed data.

Useful? React with 👍 / 👎.

Comment on lines +179 to +182
if path.count > maxDepth { continue }
for (neighbor, _) in neighbors(of: current) where !visited.contains(neighbor) {
let newPath = path + [neighbor]
if neighbor == target { return newPath }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce zero-hop maxDepth in relationship paths

For maxDepth == 0, path.count is still 1 at the source, so this loop expands one edge and returns a direct neighbor before checking the new path length. Callers asking for paths within zero hops between different entities can therefore get [source, target], contrary to the documented hop-count limit.

Useful? React with 👍 / 👎.

Comment thread Sources/GraphRAG/Retrieval/BM25.swift Outdated
Comment on lines +130 to +134
for rawWord in text.split(whereSeparator: { $0.isWhitespace }) {
var cleaned = ""
for ch in rawWord where ch.isLetter || ch.isNumber {
cleaned.append(contentsOf: ch.lowercased())
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Split punctuation-separated BM25 terms

When indexed text contains common punctuation between words, such as graph-based, nodes/edges, or Paris,France, this tokenizer strips the punctuation and concatenates the pieces into one token (graphbased, nodesedges, etc.). Keyword search for the individual terms then cannot match those documents, which is especially visible for keyword retrieval mode where BM25 is the only signal.

Useful? React with 👍 / 👎.

Comment on lines +90 to +95
if let range = lowerContent.range(of: name.lowercased()) {
// Derive both offsets from the matched range; case folding can
// change grapheme counts, so `start + name.count` is unreliable.
let start = lowerContent.distance(from: lowerContent.startIndex, to: range.lowerBound)
let end = lowerContent.distance(from: lowerContent.startIndex, to: range.upperBound)
mentions.append(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match LLM entity mentions on token boundaries

For LLM extraction, this records the first case-insensitive substring match for an entity name. If the name appears inside an earlier token, e.g. the model extracts Ann from Annabelle later met Ann, the mention offsets point at Annabelle instead of the actual entity mention, corrupting the graph's evidence spans for that entity.

Useful? React with 👍 / 👎.

Comment on lines +224 to +228
for i in 0..<entities.count {
for j in (i + 1)..<entities.count {
let a = entities[i]
let b = entities[j]
let relType = relationType(for: a.entityType, b.entityType, context: context)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Scope pattern relationships to nearby mentions

When one chunk contains more than one unrelated fact, these nested loops evaluate every entity pair against the whole chunk context. For example, Alice Smith works for Acme Corporation. Bob Jones works for Beta Corporation. makes the single works for phrase apply to both people and both organizations, so default offline builds add false WORKS_FOR edges whenever multiple facts share a chunk.

Useful? React with 👍 / 👎.

self.numCtx = numCtx
}

var baseURL: String { "\(host):\(port)" }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Normalize bare Ollama host names

When callers pass a normal host value like host: "localhost" or host: "127.0.0.1", this constructs URLs such as localhost:11434/api/tags without an HTTP scheme; URL treats localhost as the scheme and leaves the host empty, so URLSession rejects the request and Ollama appears unavailable despite a valid host. Either normalize bare hosts by adding http:// or make the parameter clearly require a full URL base.

Useful? React with 👍 / 👎.

…dation

- PatternExtractor: only person titles cross a sentence period (org suffixes like
  "Acme Inc. Bob" now split); infer relationships only between entities sharing a
  sentence, using that sentence as keyword context (prevents one "works for"
  phrase linking every pair in a multi-fact chunk).
- KnowledgeGraph: addRelationship ignores dangling edges (both endpoints must be
  nodes); addChunk/removeChunks keep the document's embedded chunk copies in sync
  so document(id)?.chunks and saved JSON reflect enrichment.
- Engine: scope the per-chunk relationship filter to the entities that survived
  this chunk's maxEntitiesPerChunk cap (not global graph state).
- HybridRetriever.index(graph:) clears first so it's a safe full reindex.
- BM25 tokenizer splits on punctuation ("graph-based" -> "graph","based").
- Ollama baseURL normalizes bare hosts ("localhost" -> "http://localhost").
- LLMExtractor matches entity mentions on token boundaries ("Ann" no longer
  matches inside "Annabelle").

findAllPaths zero-hop was already handled by the `remaining <= 0` guard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

Copy link
Copy Markdown
Contributor Author

All Codex round-5 findings are addressed in 74524ba:

  • Pattern extractor / sentence boundaries — only person titles (Dr.) cross a sentence period; org suffixes like Acme Inc. Bob now split into two entities.
  • Pattern extractor / relationship scoping — relationships are inferred only between entities that share a sentence, using that sentence as the keyword context, so one works for phrase no longer links every person/org pair in a multi-fact chunk.
  • KnowledgeGraph / dangling edgesaddRelationship ignores edges whose endpoints aren't both nodes.
  • KnowledgeGraph / chunk syncaddChunk/removeChunks keep the document's embedded chunk copies in sync, so document(id)?.chunks and saved JSON reflect entities/embeddings.
  • Engine / per-chunk cap — the relationship filter is scoped to the entities that survived this chunk's maxEntitiesPerChunk cap, not global graph state.
  • HybridRetriever.index(graph:) — clears first, so it's a safe full reindex.
  • BM25 tokenizer — splits on punctuation (graph-basedgraph, based).
  • Ollama — normalizes bare hosts (localhosthttp://localhost).
  • LLMExtractor — matches entity mentions on token boundaries (Ann no longer matches inside Annabelle).
  • findAllPaths zero-hop — already prevented by the remaining <= 0 guard from 5cf5576.

Note: CI still can't run on this PR (no GitHub-hosted runner is provisioned for the private org repo), so none of these rounds are compile-verified yet — that needs an org Actions/billing (or self-hosted runner, or public-repo) change.


Generated by Claude Code

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 74524ba374

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

var sid = 0
for k in 0..<chars.count {
sentenceID[k] = sid
if chars[k] == "." || chars[k] == "!" || chars[k] == "?" { sid += 1 }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve abbreviation periods when sentence-scoping relations

When the pattern extractor handles a titled/entity suffix abbreviation in a real sentence, e.g. Dr. Smith works for Acme Inc., this increments the sentence id at the period in Dr. even though capitalizedSpans intentionally keeps Dr. Smith as one entity. sameSentenceMentions then sees the person mention in sentence 0 and the organization in sentence 1, so the offline extractor drops the WORKS_FOR relationship entirely for common abbreviated names/suffixes.

Useful? React with 👍 / 👎.

Comment on lines +250 to +254
var lo = min(aOff, bOff)
while lo > 0 && sentence(of: lo - 1) == sentence(of: aOff) { lo -= 1 }
var hi = max(aOff, bOff)
while hi < chars.count && sentence(of: hi) == sentence(of: aOff) { hi += 1 }
let context = (lo < hi ? String(chars[lo..<hi]) : "").lowercased()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Scope relation keywords to each mention pair

Fresh evidence after the same-sentence guard is that the keyword context is still the entire sentence, so a sentence like Alice Smith works for Acme Corporation and Bob Jones works for Beta Corporation. makes the single works for phrase apply to every person/org pair in that sentence. This produces false WORKS_FOR edges such as Alice→Beta and Bob→Acme; the context needs to be limited to the text between/around the selected mention pair rather than the whole sentence.

Useful? React with 👍 / 👎.

Comment on lines +50 to +55
for rawWord in text.split(whereSeparator: { $0.isWhitespace }) {
var cleaned = ""
for ch in rawWord where ch.isLetter || ch.isNumber {
cleaned.append(contentsOf: ch.lowercased())
}
if !cleaned.isEmpty { tokens.append(cleaned) }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Split hash-embedding tokens on punctuation

In semantic-only mode, the default HashEmbedder is the only retrieval signal, but this tokenizer strips punctuation inside each whitespace-delimited word and concatenates the pieces. Indexed text like graph-based or nodes/edges hashes as graphbased/nodesedges, while queries for graph based or nodes edges hash different tokens and lose the semantic overlap that should retrieve those chunks.

Useful? React with 👍 / 👎.

nodeCount: n,
edgeCount: relationshipCount,
averageDegree: avgDegree,
maxDepth: 0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Report the actual graph max depth

For any non-empty connected graph, this public stats method always reports maxDepth as zero, so analytics or monitoring code that uses KnowledgeGraph.stats() gets a diameter/depth value that says the graph has no reachable hops even when relationships exist. This should be computed from the graph topology or omitted rather than hard-coded.

Useful? React with 👍 / 👎.

Comment on lines +16 to +18
public init(documentFrequencies: [String: Int] = [:], totalDocuments: Int = 1) {
self.documentFrequencies = documentFrequencies
self.totalDocuments = max(1, totalDocuments)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Start TF-IDF corpus counts at zero

A default TfIdfKeywordExtractor already has totalDocuments == 1, so callers that build corpus statistics solely through addDocumentToCorpus get an off-by-one corpus size: after one added document corpusStats() reports two documents and IDF is computed with an extra phantom document. The fallback for an empty corpus should not be stored as a real document count before any documents are added.

Useful? React with 👍 / 👎.

Comment thread Sources/GraphRAG/Graph/PageRank.swift Outdated
Comment on lines +41 to +42
let d = dampingFactor
let teleport = (1.0 - d) / Double(n)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clamp PageRank damping to a probability

Because dampingFactor is public, callers can construct PageRank(dampingFactor: 3) and this uses it directly as a probability; the teleport term becomes negative and PageRank can return negative scores despite the API documenting scores in [0, 1]. Validate or clamp the damping factor to [0, 1] before using it in the iteration.

Useful? React with 👍 / 👎.

…tion scoping

- HashEmbedder tokenizer splits on punctuation (matches BM25) so "graph-based"
  and the query "graph based" share tokens in semantic-only mode.
- TfIdfKeywordExtractor starts totalDocuments at 0 (no phantom document); the
  smoothed IDF still handles an empty corpus.
- PageRank clamps dampingFactor to [0, 1] so a bad value can't yield negative
  scores.
- KnowledgeGraph.stats() reports the real maxDepth (undirected graph diameter)
  instead of a hardcoded 0.
- PatternExtractor: an abbreviation period ending a person title ("Dr.") no
  longer splits the sentence, so "Dr. Smith works for Acme Inc." keeps its
  WORKS_FOR edge; relationship keyword context is now the span between the two
  mentions, and pairs with an intervening same-type entity are skipped, removing
  false edges in multi-fact sentences.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

Copy link
Copy Markdown
Contributor Author

Codex round-6 findings addressed in 3bfd74f:

  • HashEmbedder tokenizer — splits on punctuation (like BM25), so graph-based and a graph based query share tokens in semantic-only mode.
  • TF-IDF corpus counttotalDocuments starts at 0 (no phantom document); the smoothed IDF still handles an empty corpus.
  • PageRankdampingFactor is clamped to [0, 1] so a bad value can't produce negative scores.
  • KnowledgeGraph.stats().maxDepth — now the real undirected graph diameter instead of a hardcoded 0.
  • PatternExtractor (regression from round 5) — an abbreviation period ending a person title (Dr.) no longer splits the sentence, so Dr. Smith works for Acme Inc. keeps its WORKS_FOR edge.
  • PatternExtractor relation scoping — keyword context is now the span strictly between the two mentions, and a pair is skipped when another same-type entity lies between them, eliminating the false WORKS_FOR edges in multi-fact sentences.

Note: this is an offline heuristic extractor, not a full relation classifier — the proximity rules trade a little recall for far fewer false edges. CI still can't run on this PR (no runner provisioned for the private org repo), so these remain compile-unverified.


Generated by Claude Code

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3bfd74fa26

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +234 to +235
return PatternEntityExtractor.personTitles.contains(
String(chars[s..<periodIndex]).lowercased())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep org suffix periods within the same sentence

Fresh evidence in the current code is that this abbreviation check only consults personTitles, so organization suffixes such as Inc./Corp. still increment the sentence id. When the text is Acme Inc. was founded by Sam Altman, the org mention is placed in a different sentence from the person mention and sameSentenceMentions drops the LEADS relationship even though the abbreviation is mid-sentence.

Useful? React with 👍 / 👎.

Comment on lines +119 to +123
entitiesByID.removeAll()
entityOrder.removeAll()
relationships.removeAll()
outgoing.removeAll()
incoming.removeAll()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clear chunk entity references too

When this public clear method is called, it removes the entity table but leaves every TextChunk.entities array in chunksByID and the embedded document chunk copies untouched. Callers that clear and then inspect graph.chunks or save the graph get chunks containing dangling entity IDs that graph.contains no longer recognizes; the same inconsistent snapshot can be exposed if a rebuild fails after this clear step.

Useful? React with 👍 / 👎.

}

// Advance with overlap, snapped back to a word boundary.
var nextStart = max(0, end - overlap)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clamp public chunker overlap before advancing

When callers use HierarchicalChunker directly with a negative overlap, this computes the next start as end - overlap, which moves it past the current chunk end and silently leaves source text uncovered; for example, chunkSize: 200, overlap: -50 jumps from roughly 200 to 250 and drops the intervening characters. TextProcessor rejects this input, but the public chunkText/chunkSpans APIs do not, so direct chunking can lose document content.

Useful? React with 👍 / 👎.

Comment thread Sources/GraphRAG/Retrieval/BM25.swift Outdated
Comment on lines +132 to +133
if current.count > 2, !TfIdfKeywordExtractor.defaultStopwords.contains(current) {
tokens.append(current)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Index two-letter acronym terms in BM25

In keyword-only retrieval, important acronym-only queries such as AI, ML, or EU are dropped by this length check, so a document containing AI safety cannot be found by the query AI even though BM25 is the only signal. Common short stopwords are already filtered separately, so excluding every two-letter token loses meaningful technical and location terms.

Useful? React with 👍 / 👎.

Comment thread Sources/GraphRAG/GraphRAG/Engine.swift Outdated
Comment on lines +133 to +135
if var current = graph.chunk(id) {
current.entities = entities.map(\.id)
graph.addChunk(current)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't apply extraction results to replaced chunks

When another task replaces the same document id while extractor.extract is suspended, graph.chunk(id) can now be the new chunk with the same id, but this writes entity ids extracted from the old content onto it; stale entities and relationships were also inserted just above. Even though the final build is marked unbuilt, knowledgeGraph()/save(toJSON:) can expose the new chunk annotated with entities from the replaced text until a later rebuild.

Useful? React with 👍 / 👎.

Comment on lines +115 to +117
Relationship(
source: sourceID, target: targetID, relationType: relType,
confidence: data.strength ?? 0.7, context: [chunk.id]))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clamp LLM relationship strengths

When an LLM returns a strength outside the expected 0...1 range, such as 10 or -1, this stores it directly as relationship confidence. Those values are later used for traversal filtering and PageRank edge weights, so a single unbounded model output can either overweight an edge massively or make an otherwise valid relationship disappear under the default strength filter.

Useful? React with 👍 / 👎.

…, BM25

- PatternExtractor: sentence segmentation treats org suffixes (Inc./Corp./Ltd.)
  and common abbreviations as non-sentence-ending, so "Acme Inc. was founded by
  Sam Altman" keeps its LEADS edge. Entity-span splitting still uses personTitles.
- KnowledgeGraph.clearEntitiesAndRelationships clears chunk entity references in
  both chunksByID and the document copies, so no chunk points at a removed id.
- BM25 keeps 2-letter acronym terms (AI/ML/EU); only single chars are dropped.
- HierarchicalChunker clamps negative overlap (public API parity with TextProcessor).
- LLMExtractor clamps model-provided relationship strength to [0, 1].
- Engine annotates a chunk during build only when its content still matches what
  was extracted/embedded, so a document replaced mid-await isn't tagged with
  stale entities/embeddings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

Copy link
Copy Markdown
Contributor Author

CI is green on b3de614swift build --build-tests and swift test both pass in the swift:6.1 container now that the repo is public and a runner is available. This is the first compile-verification of the port; the build is clean under Swift 6 strict concurrency and the test suite passes.

Codex round-7 findings were addressed in b3de614:

  • Sentence segmentation treats org suffixes (Inc./Corp./Ltd.) and common abbreviations as non-sentence-ending, so Acme Inc. was founded by Sam Altman keeps its LEADS edge (entity-span splitting still uses the narrower person titles).
  • clearEntitiesAndRelationships now also clears chunk entity references (in chunksByID and the document copies) so no chunk points at a removed entity id.
  • BM25 keeps 2-letter acronym terms (AI/ML/EU); only single characters are dropped.
  • HierarchicalChunker clamps a negative overlap (public-API parity with TextProcessor).
  • LLMExtractor clamps model-provided relationship strength to [0, 1].
  • Engine build annotates a chunk only when its content still matches what was extracted/embedded, so a document replaced mid-await isn't tagged with stale entities/embeddings.

Generated by Claude Code

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3de614f12

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread Sources/GraphRAG/GraphRAG/Engine.swift Outdated
for id in chunkIDs {
guard let chunk = graph.chunk(id) else { continue }
let extracted = try await extractor.extract(from: chunk)
var entities = extracted.entities

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Filter all extractors by minConfidence

When GraphRAG uses Ollama/LLM extraction or any injected extractor, EntityConfig.minConfidence is never applied here: LLMEntityExtractor assigns every entity confidence 0.9, and the engine only caps by count before adding. A config such as EntityConfig(minConfidence: 0.95) therefore still indexes those entities and their relationships, so the public threshold silently works only for the default pattern extractor.

Useful? React with 👍 / 👎.

/// splitting still uses the narrower `personTitles`.
static let sentenceAbbreviations: Set<String> = [
"dr", "prof", "mr", "mrs", "ms", "jr", "sr", "st",
"inc", "corp", "ltd", "llc", "co", "etc", "vs",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Disambiguate sentence-final org suffixes

When a company suffix actually terminates a sentence, e.g. Alice Smith works for Acme Inc. Bob Jones works for Beta Corp., including inc/corp in this unconditional abbreviation set prevents the sentence id from advancing. The same-sentence guard then treats both facts as one sentence and can emit cross-sentence false edges such as Acme→Bob or Alice→Bob; the abbreviation check needs following-token context instead of always suppressing these periods.

Useful? React with 👍 / 👎.


public mutating func addDocument(_ document: Document) {
if documentsByID[document.id] == nil { documentOrder.append(document.id) }
documentsByID[document.id] = document

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Remove old chunks when replacing documents

When KnowledgeGraph is used directly, this overwrite path does not remove chunks already indexed for the same document id. After adding a document and its chunks, then calling addDocument again with the same id for a reingest, graph.chunks and saved JSON can still contain the old chunk text even though document(id) now points at the replacement; the cleanup done by GraphRAG.addDocument is not applied for direct KnowledgeGraph callers.

Useful? React with 👍 / 👎.

let gleanResponse = try await model.complete(gleanPrompt, params: params)
if let extra = LLMEntityExtractor.parse(gleanResponse) {
if extra.entities.isEmpty && extra.relationships.isEmpty { break }
output.entities.append(contentsOf: extra.entities)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Deduplicate gleaned LLM entities before capping

When gleaningRounds > 0 and a continuation repeats an entity from an earlier pass, this appends the duplicate and convert returns both copies with the same id. The engine applies maxEntitiesPerChunk before KnowledgeGraph.addEntity merges ids, so repeated high-confidence entities can consume the cap and drop distinct entities and their relationships from the chunk.

Useful? React with 👍 / 👎.

…dedup, reingest

- PatternExtractor: an abbreviation period ends a sentence when the next word is
  capitalized, so "Acme Inc. was founded by Sam Altman" stays one sentence while
  "...Acme Inc. Bob Jones..." splits (person titles still never split).
- Engine: apply EntityConfig.minConfidence uniformly to every extractor's output
  (LLM/injected extractors don't self-filter), before the per-chunk cap.
- LLMExtractor: deduplicate entities by id in convert (gleaning passes can repeat
  an entity) so a duplicate can't consume the per-chunk cap.
- KnowledgeGraph.addDocument: when replacing an existing document id, purge the
  previous version's chunks so direct callers don't retain stale chunk text.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

Copy link
Copy Markdown
Contributor Author

Codex round-8 findings addressed in f6bdaf4:

  • Sentence-final org suffixes — an abbreviation period now ends a sentence when the next word is capitalized, so Acme Inc. was founded by Sam Altman stays one sentence (keeps LEADS) while …Acme Inc. Bob Jones… correctly splits. Person titles still never split.
  • minConfidence for all extractorsEntityConfig.minConfidence is now applied uniformly in the build pipeline (LLM/injected extractors don't self-filter), before the per-chunk cap.
  • Gleaning dedupLLMEntityExtractor.convert deduplicates entities by id, so a repeated entity from a continuation pass can't consume the maxEntitiesPerChunk cap.
  • Direct reingestKnowledgeGraph.addDocument purges the previous version's chunks when replacing an existing document id, so direct callers don't retain stale chunk text.

(CI now runs on each push; b3de614 was green and f6bdaf4 is building.)


Generated by Claude Code

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f6bdaf4fea

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// "... Acme Inc. Bob ..." splits.
if PatternEntityExtractor.sentenceAbbreviations.contains(word) {
var t = periodIndex + 1
while t < chars.count && chars[t] == " " { t += 1 }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat newlines after org suffixes as sentence breaks

Fresh evidence in the current code is that this look-ahead skips only literal spaces, so an organization suffix before a line break is treated as non-terminal. For text like Alice Smith works for Acme Inc.\nBob Jones works for Beta Corp., Inc. does not advance the sentence id, causing both lines to be considered one sentence and allowing cross-sentence false relationships to be inferred; skip all non-newline whitespace carefully or treat \n after the suffix as a boundary before checking the next capitalized word.

Useful? React with 👍 / 👎.

Comment thread Sources/GraphRAG/Graph/PageRank.swift Outdated
var outWeight = [Double](repeating: 0, count: n)
for rel in graph.relationships {
guard let s = indexOf[rel.source], let t = indexOf[rel.target] else { continue }
let w = Double(max(rel.confidence, 0.0001))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Ignore zero-confidence edges in PageRank

When a relationship has confidence == 0 (for example, an LLM strength of 0 after clamping or a public caller marking an edge as no-strength), this converts it to a tiny positive weight and also uses that same tiny value in the source's total out-weight. If it is the source's only outgoing edge, the edge receives 100% of that node's PageRank contribution despite being explicitly zero-confidence; skip non-positive weights or treat those sources as dangling instead.

Useful? React with 👍 / 👎.

Comment on lines +315 to +318
for ma in a.mentions {
for mb in b.mentions where sentence(ma.startOffset) == sentence(mb.startOffset) {
return (ma.startOffset, mb.startOffset)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Consider later mention pairs before fixing relation type

When the same entity pair appears more than once in a chunk, this returns the first same-sentence mentions even if that co-occurrence has only generic context. For example, Alice Smith met Acme Corporation. Later, Alice Smith works for Acme Corporation. records only ASSOCIATED_WITH from the first sentence and never considers the later works for mention pair, so the specific WORKS_FOR edge is missed; evaluate all same-sentence mention pairs or prefer the pair whose local context matches a typed relation.

Useful? React with 👍 / 👎.

…r, PR weights

- PatternExtractor: a newline (or end of text) after an org-suffix abbreviation
  is a sentence boundary, so "Acme Inc.\nBob Jones ..." splits into two sentences.
- PatternExtractor: evaluate every same-sentence mention pair for an entity pair
  and keep the most specific typed relation, so a later "works for" co-occurrence
  isn't shadowed by an earlier generic one.
- PageRank: skip non-positive-confidence edges (treating such sources as
  dangling) so a single zero-confidence edge can't capture all of a node's mass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

Copy link
Copy Markdown
Contributor Author

Codex round-9 findings addressed in 420f1cc:

  • Newline sentence boundaries — a newline (or end of text) after an org-suffix abbreviation now ends the sentence, so Acme Inc.\nBob Jones works for Beta Corp. splits into two sentences (no cross-line false edges).
  • Best mention pair — for a given entity pair, all same-sentence mention pairs are evaluated and the most specific typed relation wins, so Alice met Acme. Later, Alice works for Acme. yields WORKS_FOR instead of only the earlier ASSOCIATED_WITH.
  • PageRank zero-confidence edges — non-positive-confidence edges are skipped (their sources treated as dangling), so a single zero-confidence edge can't capture 100% of a node's PageRank mass.

(f6bdaf4 was green; 420f1cc is building.)


Generated by Claude Code

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 420f1ccf74

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +20 to +22
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Accept the prompt's entity field names

When the LLM follows these bullet instructions and returns keys like entity_name / entity_type instead of the shorter keys in the later JSON example, EntityData only decodes name / type, defaults both fields to empty strings, and convert drops the entity entirely. This makes Ollama-backed extraction brittle for a response format the prompt itself asks for; align the prompt or decode both key styles.

Useful? React with 👍 / 👎.

// (./!/?) also ends it ("Acme. Bob") — unless the word is a
// known abbreviation/title like "Dr." so "Dr. Smith" merges.
if j > runStart, let last = chars[j - 1].unicodeScalars.first {
if CharacterSet(charactersIn: ",;:").contains(last) { break }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep comma-suffixed organization names together

For common legal names such as Alice works for Acme, Inc., this unconditional comma break ends the capitalized span at Acme; under the default minConfidence the single-word Acme is discarded and Inc can be extracted as the organization instead. That drops or corrupts organization relationships for normal company-name punctuation, so the comma should not split when it is immediately followed by an organization suffix.

Useful? React with 👍 / 👎.

Comment on lines +209 to +210
if words.count >= 2 {
return ("PERSON", 0.8)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Strip leading articles before classifying spans

When a sentence starts with names like The United States or The University of California, the captured span still includes The, so the known-location and organization-prefix checks miss and this fallback labels the whole span as PERSON. That corrupts common location/organization nodes and their downstream relationships; trim leading articles/blocklisted words before applying the location/prefix checks or before falling back to person.

Useful? React with 👍 / 👎.

// Accept bare hosts ("localhost", "127.0.0.1"): without a scheme, URL
// parses the host as the scheme and the request fails.
let normalizedHost = host.contains("://") ? host : "http://\(host)"
return "\(normalizedHost):\(port)"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid appending a second Ollama port

If callers pass the common full Ollama base URL as the host, e.g. OllamaConfig(host: "http://localhost:11434") or withOllama(host: "http://localhost:11434"), this returns http://localhost:11434:11434, which URL(string:) rejects and makes availability checks and requests fail. Since the code already accepts hosts with a scheme, detect an existing port or document/enforce that host must exclude it.

Useful? React with 👍 / 👎.

} else {
// Replacing an existing id: purge the previous version's chunks so
// direct KnowledgeGraph callers don't retain stale chunk text.
removeChunks(forDocument: document.id)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Drop stale graph evidence when replacing documents

When KnowledgeGraph is used directly and a document id is replaced after entities/relationships were added for its old chunks, this only removes the chunk records. The old entities still keep mentions pointing at deleted chunk IDs and relationships still keep those chunk IDs in context, so graph traversal/stats and saved JSON can expose facts from a document version that no longer exists; replacement should also remove or scrub evidence tied to the removed chunks.

Useful? React with 👍 / 👎.

…, articles, evidence scrub

- PatternExtractor: a comma followed by an org suffix doesn't split the run, so
  "Acme, Inc." stays one organization span; leading articles ("The United States",
  "The University of California") are stripped before classification so they map
  to location/organization instead of person.
- LLMExtractor: EntityData/RelationshipData decode both the short JSON-example
  keys and the prompt's verbose names (entity_name/source_entity/…), so Ollama
  responses in either shape parse.
- Ollama baseURL: don't append a second port when the host already includes one
  ("http://localhost:11434").
- KnowledgeGraph.removeChunks: scrub entity mentions and relationship context that
  reference removed chunks, so replacing a document leaves no stale evidence.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

Copy link
Copy Markdown
Contributor Author

Codex round-10 findings addressed in 38c78b7:

  • Comma-suffixed org names — a comma followed by an org suffix no longer splits the span, so Acme, Inc. stays one organization.
  • Leading articlesThe United States / The University of California are stripped of the leading article before classification, so they resolve to location/organization rather than person.
  • LLM key namesEntityData/RelationshipData decode both the short JSON-example keys and the prompt's verbose names (entity_name/source_entity/…), so Ollama responses in either shape parse.
  • Ollama host with portbaseURL no longer appends a second port when the host already has one (http://localhost:11434).
  • Reingest evidenceKnowledgeGraph.removeChunks scrubs entity mentions and relationship-context entries referencing removed chunks, so replacing a document leaves no stale evidence.

(420f1cc was green; 38c78b7 is building.)


Generated by Claude Code

The alias CodingKeys added in round 10 (entity_name, source_entity, ...) broke
synthesized Encodable conformance, since every CodingKeys case must map to a
stored property. These structs are only decoded, so declare them Decodable
instead of Codable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f872b93bc4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

public func findRelationshipPath(
from source: EntityID, to target: EntityID, maxDepth: Int = 5
) -> [EntityID]? {
if source == target { return [source] }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require endpoints to exist for self paths

When callers ask for a path from an ID to itself and that entity was never added, this returns [source] before checking graph membership, producing a valid-looking zero-hop path for a missing node. Other traversal APIs guard graph.contains(...); this should verify the endpoint exists before returning the self path.

Useful? React with 👍 / 👎.

Comment on lines +303 to +304
let upper = min(hi + 1, chars.count)
let context = (lo < upper ? String(chars[lo..<upper]) : "").lowercased()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include trailing relation cues in pair context

When the relation phrase comes after both mentions, such as Alice Smith and Bob Jones are married, hi is only the later mention's start offset and this slice stops after its first character, so relationType never sees cues like married or colleague and downgrades the edge to generic KNOWS. Include the full second mention plus a bounded following/sentence window for the selected pair.

Useful? React with 👍 / 👎.

Comment on lines +39 to +40
self.retriever = HybridRetriever(
config: HybridConfig(maxCandidates: max(100, config.topKResults * 10)))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Wire retrieval config into search

When callers set Config(retrieval: ...), the graph-expansion knobs are silently ignored: this creates HybridRetriever with only a derived maxCandidates, and a repo-wide search for config.retrieval/maxExpansionDepth shows no production code reads them. As a result maxExpansionDepth, entityWeight, chunkWeight, and graphWeight have no effect on search/ask, despite being public configuration.

Useful? React with 👍 / 👎.

for k in 0..<chars.count {
sentenceID[k] = sid
let c = chars[k]
if c == "!" || c == "?" || (c == "." && periodIsSentenceEnd(k)) { sid += 1 }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat bare newlines as sentence boundaries

When chunks contain line-separated facts without sentence punctuation, this sentence map never increments on \n, so both lines share one sentence. For example Alice Smith works for Acme Corporation\nBob Jones works for Beta Corporation can emit cross-line generic edges such as Alice↔Bob or Acme↔Bob even though the same-sentence guard was meant to prevent facts in separate lines from being linked.

Useful? React with 👍 / 👎.

// Proximity heuristic: skip if another same-type entity
// lies between them — the phrase likely belongs to that
// nearer pair. (Offline extractor, not a full classifier.)
if hasInterveningSameType(a, b, lo: lo, hi: hi, among: entities) { continue }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep coordinated subjects linked to shared objects

When a relation applies to a coordinated list, such as Alice Smith and Bob Jones are employed by Acme Corporation, the Alice→Acme pair is skipped solely because Bob is another PERSON mention between them. That leaves only the nearest subject linked to the organization and drops valid relationships for earlier coordinated subjects; the intervening-entity heuristic needs to allow shared predicates rather than unconditionally skipping these pairs.

Useful? React with 👍 / 👎.

// Stage 1: entity & relationship extraction per chunk.
for id in chunkIDs {
guard let chunk = graph.chunk(id) else { continue }
let extracted = try await extractor.extract(from: chunk)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip relationship extraction when disabled

When EntityConfig(extractRelationships: false) is used with an Ollama/LLM extractor, this still invokes the full extractor that prompts the model to identify relationships; the flag only prevents inserting them later. In that configuration users still pay the latency/token cost of relationship extraction even though they explicitly disabled relationships, so the build should avoid requesting or inferring them in the first place.

Useful? React with 👍 / 👎.

… dead config

- KnowledgeGraph.findRelationshipPath requires both endpoints to exist before
  returning the self path.
- PatternExtractor: treat bare newlines as sentence boundaries; extend a pair's
  keyword context to the sentence end when no other entity mention follows the
  later mention (captures trailing cues like "... are married" without leaking
  into a following pair). Documented the deliberate precision-over-recall
  trade-off for coordinated subjects.
- Remove the unused RetrievalConfig (maxExpansionDepth / *Weight knobs were never
  read) so Config exposes no silently-ignored settings.
- Engine: documented that extractRelationships gates insertion only (skipping an
  LLM's relationship prompting would need a protocol change).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

Copy link
Copy Markdown
Contributor Author

Codex round-11 — addressed in ab37e9a:

Fixed

  • findRelationshipPath now requires both endpoints to exist before returning the zero-hop self path.
  • Sentence segmentation treats bare newlines as boundaries (line-per-fact chunks no longer merge).
  • Relationship keyword context extends to the sentence end when no other entity mention follows the later mention, so trailing cues (… are married) are captured — without leaking into a following pair's facts.
  • Removed the unused RetrievalConfig (its maxExpansionDepth/*Weight knobs were never read) so Config no longer exposes silently-ignored settings.

Rightfully ignored (with in-code rationale)

  • Coordinated subjects (Alice and Bob are employed by Acme → only Bob links): the intervening-same-type guard deliberately favors precision; relaxing it reintroduces cross-fact false edges (Alice works for Acme and Bob works for Beta → Alice→Beta). Documented at the guard.
  • Skip LLM relationship prompting when extractRelationships == false: the EntityExtracting protocol returns entities+relationships in one call, so the flag gates insertion; fully skipping the model's relationship work needs a protocol/prompt change. Documented in the build loop.

(f872b93 is green; ab37e9a building.)


Generated by Claude Code

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab37e9a485

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

}
}
for idx in relationships.indices where !relationships[idx].context.isEmpty {
relationships[idx].context.removeAll { removed.contains($0) }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Drop relationships once their evidence chunks are removed

Fresh evidence in the current code is that replacement now scrubs removed chunk IDs from context, but it keeps the relationship in relationships/adjacency even when that was its only evidence. When a direct KnowledgeGraph caller replaces a document after adding an extracted edge with context: [oldChunk], relationshipCount, traversal, and saved JSON still expose that stale fact as a context-less edge for the old document version.

Useful? React with 👍 / 👎.

Comment thread Sources/GraphRAG/GraphRAG/Config.swift Outdated
Comment on lines +18 to +19
public var maxConnections: Int
public var threshold: Float

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor or remove dead graph config knobs

Fresh evidence after the unused retrieval config was removed is that these public graph knobs are still silently ignored: a repo-wide search for config.graph/maxConnections finds no production reads outside this declaration. Callers can set Config(graph: GraphConfig(maxConnections: 1, threshold: 0.99)) and get the same graph construction/retrieval behavior, so experiments relying on those limits or thresholds are misconfigured without any signal.

Useful? React with 👍 / 👎.

Comment thread Sources/GraphRAG/Graph/Traversal.swift Outdated
Comment on lines +193 to +195
for (neighbor, relationship) in graph.neighbors(of: current) {
guard passesFilter(relationship) else { continue }
if !visited.contains(neighbor) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Deduplicate entity paths across relation types

When two nodes have multiple relationship types between them, neighbors(of:) returns one entry per type, but findAllPaths returns only [EntityID] paths. This loop therefore recurses once per edge type and appends identical entity paths; with a low maxPaths, duplicates like two copies of [A, B] can consume the cap before distinct paths are returned.

Useful? React with 👍 / 👎.

… config, path dedup

- KnowledgeGraph.removeChunks now drops a relationship whose only evidence was a
  removed chunk (context-less leftover would expose a stale fact) and rebuilds the
  adjacency index; edges that never had context are preserved.
- Remove the unused GraphConfig (maxConnections/threshold were never read) so
  Config exposes no silently-ignored knobs.
- findAllPaths collapses parallel edges (multiple relation types between the same
  nodes) to one neighbor, so entity-only paths aren't duplicated and don't waste
  the maxPaths budget.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

Copy link
Copy Markdown
Contributor Author

Codex round-12 findings addressed in b6dfa55:

  • Evidence-less edgesremoveChunks now drops a relationship whose only context was a removed chunk (and rebuilds adjacency), so replacing a document leaves no stale context-less edge. Edges that never had context are preserved.
  • Dead graph config — removed the unused GraphConfig (maxConnections/threshold were never read), matching the earlier RetrievalConfig cleanup, so Config exposes no silently-ignored knobs.
  • Path dedupfindAllPaths collapses parallel edges (multiple relation types between the same nodes) to a single neighbor, so entity-only paths aren't duplicated and don't consume the maxPaths budget.

(ab37e9a is green; b6dfa55 building.)


Generated by Claude Code

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b6dfa551cd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +128 to +129
entity.mentions = kept
entitiesByID[eid] = entity

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Drop entities left with no surviving mentions

When a direct KnowledgeGraph caller replaces a document after indexing entities from its old chunks, this loop removes the old chunk mentions but writes the entity back even when kept is empty. The stale entity still appears in entityCount, entities, stats, and saved JSON even though none of its evidence chunks exist anymore; entities whose mentions were exhausted by the replacement should be removed unless they have independent evidence.

Useful? React with 👍 / 👎.

Comment thread Sources/GraphRAG/Core/Models.swift Outdated
Comment on lines +38 to +40
/// Byte offset of the chunk start within the original document content.
public var startOffset: Int
/// Byte offset of the chunk end within the original document content.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Stop documenting character offsets as byte offsets

These public fields are populated by TextProcessor/HierarchicalChunker using Array(text) character indices, not UTF-8 byte offsets. For any non-ASCII document, callers following this API documentation will slice or highlight at the wrong byte positions in the original content; either the stored offsets need to be byte counts or the public contract should say they are character offsets.

Useful? React with 👍 / 👎.

Comment thread Sources/GraphRAG/GraphRAG/Engine.swift Outdated
// Stage 2: embed chunks.
for id in chunkIDs {
guard let chunk = graph.chunk(id) else { continue }
let embedding = try await embedder.embed(chunk.content)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip embeddings for keyword-only builds

When callers configure Config(approach: "keyword"), retrieval later sets includeSemantic to false and never uses query or chunk embeddings, but build() still calls the embedder for every chunk here. With an Ollama or otherwise remote/unavailable embedder, a keyword-only corpus can fail or pay full embedding latency even though BM25 is the only signal used; the embedding stage should be skipped when semantic retrieval is disabled.

Useful? React with 👍 / 👎.

…-only embeds, fix offset docs

- KnowledgeGraph.removeChunks removes entities whose mentions are entirely
  exhausted by the removal (no independent evidence) and drops relationships whose
  endpoint was removed, alongside the existing context scrub.
- Engine.build skips the embedding stage entirely for approach == "keyword", so a
  keyword-only corpus doesn't pay embedder latency or fail on a remote embedder.
- TextChunk start/end offsets are documented as character (grapheme) offsets, not
  UTF-8 byte offsets, matching how the chunker computes them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VUzx3BzstYuB68txcHvAUh

Copy link
Copy Markdown
Contributor Author

Codex round-13 findings addressed in 8f51562:

  • Orphaned entities on reingestremoveChunks now removes an entity whose mentions are entirely exhausted by the removal (no independent evidence) and drops relationships whose endpoint was removed, in addition to the existing context scrub.
  • Keyword-only builds skip embeddingsbuild() skips the embedding stage entirely when approach == "keyword", so a keyword-only corpus doesn't pay embedder latency or fail on a remote/unavailable embedder.
  • Offset docsTextChunk.startOffset/endOffset are now documented as character (grapheme) offsets, not UTF-8 byte offsets, matching how the chunker computes them.

(b6dfa55 is green; 8f51562 building.)


Generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants