Add Yarn configuration and enhance TF-IDF vector database with anti-query support #11
Walkthrough
This PR introduces a BM25 scoring mode and negative/anti-query support.
Changes
VectoriaDB BM25 Scoring and Negative Queries
Yarn 4 CI Tooling
Sequence Diagram
sequenceDiagram
participant Caller
participant VectoriaDB
participant TFIDFVectoria
participant TFIDFEmbeddingService
participant maxNegativeSimilarity
rect rgba(100, 149, 237, 0.5)
note over Caller, TFIDFVectoria: TF-IDF / BM25 search path
Caller->>TFIDFVectoria: search(query, { negativeQuery, negativeWeight, scoring })
TFIDFVectoria->>TFIDFEmbeddingService: tokenize / embed query
TFIDFVectoria->>TFIDFEmbeddingService: tokenize / embed negativeQuery
loop each document
TFIDFVectoria->>maxNegativeSimilarity: maxNegSim(docVector, negativeVectors)
maxNegativeSimilarity-->>TFIDFVectoria: penalty
TFIDFVectoria->>TFIDFVectoria: score = baseScore - negativeWeight * penalty
end
TFIDFVectoria-->>Caller: filtered, sorted SearchResults
end
rect rgba(144, 238, 144, 0.5)
note over Caller, VectoriaDB: HNSW / brute-force search path
Caller->>VectoriaDB: search(query, { negativeQuery, negativeWeight })
VectoriaDB->>TFIDFEmbeddingService: embedNegatives(negativeQuery)
VectoriaDB->>VectoriaDB: searchWithHNSW / searchBruteForce(negativeVectors, negativeWeight)
loop each candidate
VectoriaDB->>maxNegativeSimilarity: maxNegSim(docVector, negativeVectors)
maxNegativeSimilarity-->>VectoriaDB: penalty
VectoriaDB->>VectoriaDB: adjustedScore = (1 - distance) - negativeWeight * penalty
end
VectoriaDB->>VectoriaDB: re-sort by adjustedScore, slice top-K
VectoriaDB-->>Caller: SearchResults
end
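The penalty step in both paths of the diagram can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `cosine` helper, the `Vector` alias, and the choice to floor the penalty at 0 are assumptions, and `adjustedScore` is a hypothetical name for the `score = baseScore - negativeWeight * penalty` line.

```typescript
// Hypothetical helpers for illustration; the PR's real implementation may differ.
type Vector = number[];

// Cosine similarity of two equal-length vectors; returns 0 if either has zero norm.
function cosine(a: Vector, b: Vector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Largest similarity between a document and any negative vector.
// Assumption: the penalty is floored at 0, so an empty list means no penalty.
function maxNegativeSimilarity(doc: Vector, negatives: Vector[]): number {
  let max = 0;
  for (const negative of negatives) {
    max = Math.max(max, cosine(doc, negative));
  }
  return max;
}

// Mirrors the diagram: score = baseScore - negativeWeight * penalty.
function adjustedScore(
  baseScore: number,
  doc: Vector,
  negatives: Vector[],
  negativeWeight: number,
): number {
  return baseScore - negativeWeight * maxNegativeSimilarity(doc, negatives);
}
```

Under these assumptions, a document identical to a negative vector loses the full `negativeWeight`, while an orthogonal document keeps its base score. Because the penalty is subtracted after the base score is computed, results have to be re-sorted before slicing to top-K, as the diagram's last step shows.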
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 5
🧹 Nitpick comments (1)
libs/vectoriadb/src/tfidf.embedding.service.ts (1)
123-141: 💤 Low value
Consider validating individual entry structure in `importState`.
The validation checks that `idf` and `df` are arrays, but doesn't verify each entry is a valid `[string, number]` tuple. A malformed snapshot with entries like `["term", "not-a-number"]` or `[null, 1]` would silently create a corrupt model with NaN-poisoned or invalid scores.
🛡️ Optional: Add entry-level validation
```diff
 importState(state: { idf: Array<[string, number]>; df: Array<[string, number]>; documentCount: number }): void {
   if (!Array.isArray(state.idf) || !Array.isArray(state.df)) {
     throw new Error('Invalid TFIDF model state: idf and df must be arrays of [term, number] pairs');
   }
+  for (const entry of state.idf) {
+    if (!Array.isArray(entry) || typeof entry[0] !== 'string' || !Number.isFinite(entry[1])) {
+      throw new Error('Invalid TFIDF model state: idf entries must be [string, number] pairs');
+    }
+  }
+  for (const entry of state.df) {
+    if (!Array.isArray(entry) || typeof entry[0] !== 'string' || !Number.isFinite(entry[1])) {
+      throw new Error('Invalid TFIDF model state: df entries must be [string, number] pairs');
+    }
+  }
   if (!Number.isFinite(state.documentCount) || state.documentCount < 0) {
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@libs/vectoriadb/src/tfidf.embedding.service.ts` around lines 123 - 141, In the importState method, add entry-level validation for the idf and df arrays to ensure each element is a valid [string, number] pair. After validating that idf and df are arrays, iterate through each entry and check that it has exactly two elements where the first is a string and the second is a finite number. This prevents malformed snapshots from creating corrupt models with invalid or NaN-poisoned scores when the entries are passed to the Map constructor.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.yarnrc.yml:
- Around line 1-2: The approvedGitRepositories setting with the wildcard pattern
  "**" is unnecessary and weakens supply-chain security controls since no
  git-based dependencies are currently in use. Remove the
  approvedGitRepositories configuration block from the .yarnrc.yml file. This
  setting should only be included if and when actual git-sourced packages are
  needed, at which point it should be restricted to specific trusted
  repositories or organizations rather than using a permissive wildcard.
In `@libs/vectoriadb/src/vectoria-tfidf.ts`:
- Around line 47-54: The SearchResult.score property documentation needs to be
updated to reflect that scores are no longer always bounded to [0, 1] after
introducing BM25 scoring alongside cosine similarity. Locate the SearchResult
interface/type definition (likely in a shared types or results file) and update
its score field documentation to clarify that score bounds depend on the scoring
method used - cosine scores are bounded in roughly [0, 1], while BM25 scores are
unbounded. This ensures the public API contract accurately represents the actual
behavior of both scoring modes.
- Around line 306-313: The search method in the vectoria-tfidf.ts file does not
validate its input parameters before using them. After destructuring the options
in the search method, add validation checks for the query, topK, and
negativeWeight parameters. Specifically, verify that query is not empty, topK is
a positive number, and negativeWeight is a valid non-negative number that is not
NaN. Throw descriptive errors for any invalid inputs to fail fast rather than
producing incorrect ranking behavior downstream.
- Around line 468-475: The loadSnapshot() method directly assigns untrusted
snapshot data to this.config, this.bm25, and related fields without proper
validation. Before assigning values from snapshot.scoring, snapshot.bm25, and
snapshot.config, add validation to ensure all BM25 parameters are finite numbers
and within valid ranges, that scoring configuration is properly shaped, and that
config values are safe and not malicious. Use validation checks (such as
Number.isFinite()) on numeric BM25 parameters and validate object shapes before
spreading snapshot.bm25 and assigning snapshot.config values to prevent
injection of invalid data that could destabilize the scoring system.
In `@libs/vectoriadb/src/vectoria.ts`:
- Around line 273-277: The negativeWeight variable assigned from
options.negativeWeight is not validated before being used in score arithmetic
operations. Add validation to ensure negativeWeight is a positive finite number
(greater than 0 and not NaN or Infinity), and throw an error if the
value is invalid. This validation should occur immediately after the
negativeWeight assignment to prevent invalid values from propagating downstream
in the semantic search scoring logic.
---
Nitpick comments:
In `@libs/vectoriadb/src/tfidf.embedding.service.ts`:
- Around line 123-141: In the importState method, add entry-level validation for
the idf and df arrays to ensure each element is a valid [string, number] pair.
After validating that idf and df are arrays, iterate through each entry and
check that it has exactly two elements where the first is a string and the
second is a finite number. This prevents malformed snapshots from creating
corrupt models with invalid or NaN-poisoned scores when the entries are passed
to the Map constructor.
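The input-validation findings above (search options and snapshot BM25 parameters) could be addressed along these lines. This is a hedged sketch only: `SearchOptionsSketch`, `Bm25ParamsSketch`, `assertSearchInput`, and `assertBm25Params` are hypothetical names, the `k1` and `b` fields follow the conventional BM25 parameterization rather than anything confirmed in the PR, and requiring `topK` to be an integer is an assumption. Note that the two comments disagree on `negativeWeight` (non-negative in vectoria-tfidf.ts, strictly positive in vectoria.ts); the sketch accepts 0.

```typescript
// Hypothetical shapes for illustration; not the PR's actual types.
interface SearchOptionsSketch {
  topK?: number;
  negativeWeight?: number;
}

interface Bm25ParamsSketch {
  k1: number; // term-frequency saturation (assumed name)
  b: number; // length normalization, conventionally in [0, 1] (assumed name)
}

// Fail fast on bad search input instead of producing a skewed ranking.
function assertSearchInput(query: string, options: SearchOptionsSketch = {}): void {
  if (typeof query !== 'string' || query.trim().length === 0) {
    throw new Error('search: query must be a non-empty string');
  }
  const { topK, negativeWeight } = options;
  if (topK !== undefined && (!Number.isInteger(topK) || topK <= 0)) {
    throw new Error('search: topK must be a positive integer');
  }
  // Accepts 0 (no penalty); tighten to `<= 0` if zero should be rejected.
  if (negativeWeight !== undefined && (!Number.isFinite(negativeWeight) || negativeWeight < 0)) {
    throw new Error('search: negativeWeight must be a finite, non-negative number');
  }
}

// Validate untrusted snapshot data before assigning it to instance state.
function assertBm25Params(value: unknown): asserts value is Bm25ParamsSketch {
  if (typeof value !== 'object' || value === null) {
    throw new Error('snapshot: bm25 must be an object');
  }
  const { k1, b } = value as Record<string, unknown>;
  if (typeof k1 !== 'number' || !Number.isFinite(k1) || k1 < 0) {
    throw new Error('snapshot: bm25.k1 must be a finite number >= 0');
  }
  if (typeof b !== 'number' || !Number.isFinite(b) || b < 0 || b > 1) {
    throw new Error('snapshot: bm25.b must be a finite number in [0, 1]');
  }
}
```

Calling `assertBm25Params(snapshot.bm25)` before spreading the object would keep `NaN` or out-of-range parameters out of the scoring state, and the `asserts` return type narrows the value for the code that follows.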
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a6d14b06-9748-4dff-8786-cfe66e71ddb2
⛔ Files ignored due to path filters (2)
.yarn/install-state.gz is excluded by !**/.yarn/**, !**/*.gz
yarn.lock is excluded by !**/yarn.lock, !**/*.lock
📒 Files selected for processing (18)
.github/actions/setup-node-yarn/action.yml
.github/workflows/create-release-branch.yml
.github/workflows/publish-release.yml
.github/workflows/push.yml
.gitignore
.yarnrc.yml
libs/vectoriadb/package.json
libs/vectoriadb/src/__tests__/similarity.spec.ts
libs/vectoriadb/src/__tests__/vectoria-tfidf-negative.spec.ts
libs/vectoriadb/src/__tests__/vectoria-tfidf-snapshot-bm25.spec.ts
libs/vectoriadb/src/index.ts
libs/vectoriadb/src/interfaces.ts
libs/vectoriadb/src/similarity.utils.ts
libs/vectoriadb/src/tfidf.embedding.service.ts
libs/vectoriadb/src/vectoria-tfidf.ts
libs/vectoriadb/src/vectoria.ts
libs/vectoriadb/src/worker.ts
package.json
Summary by CodeRabbit
Release Notes
New Features
Chores