Add Yarn configuration and enhance TF-IDF vector database with anti-query support#11

Merged
alexmercerpo merged 2 commits into main from fixes
Jun 18, 2026

@alexmercerpo alexmercerpo commented Jun 18, 2026

Collaborator

Summary by CodeRabbit

Release Notes

  • New Features

    • Added BM25 scoring option to TF-IDF vector database (alternative to cosine similarity).
    • Added negative/anti-query search support to demote results matching exclusion terms.
    • Created worker-safe entrypoint for VectoriaDB.
  • Chores

    • Upgraded project to Yarn 4 for improved package management.
    • Consolidated Node and Yarn setup into a reusable GitHub Action for CI workflows.

coderabbitai Bot commented Jun 18, 2026

Warning

Review limit reached

@alexmercerpo, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 31 minutes and 29 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: af745bd9-5fd6-4396-b670-c7c18fdd7331

📥 Commits

Reviewing files that changed from the base of the PR and between 3edbb92 and 0b8357a.

📒 Files selected for processing (5)
  • .yarnrc.yml
  • libs/vectoriadb/src/interfaces.ts
  • libs/vectoriadb/src/tfidf.embedding.service.ts
  • libs/vectoriadb/src/vectoria-tfidf.ts
  • libs/vectoriadb/src/vectoria.ts

Walkthrough

This PR introduces BM25 scoring mode and negative/anti-query support to TFIDFVectoria and VectoriaDB (HNSW and brute-force paths), adds a new maxNegativeSimilarity utility, and extends snapshot serialization to persist BM25 state. Separately, it migrates all CI workflows to a new local composite GitHub Action for Yarn 4 setup.

Changes

VectoriaDB BM25 Scoring and Negative Queries

| Layer | File(s) | Summary |
| --- | --- | --- |
| Public types, SearchOptions, and exports | libs/vectoriadb/src/interfaces.ts, libs/vectoriadb/src/vectoria-tfidf.ts, libs/vectoriadb/src/index.ts, libs/vectoriadb/src/worker.ts, libs/vectoriadb/package.json | SearchOptions gains negativeQuery and negativeWeight; TFIDFDocument gains optional termCounts/length; new TFIDFScoring, BM25Params, updated TFIDFVectoriaConfig and TFIDFSnapshot types are added; index.ts and worker.ts barrel re-exports and ./worker package subpath export are added. |
| maxNegativeSimilarity utility and TFIDFEmbeddingService BM25 internals | libs/vectoriadb/src/similarity.utils.ts, libs/vectoriadb/src/tfidf.embedding.service.ts, libs/vectoriadb/src/__tests__/similarity.spec.ts | New maxNegativeSimilarity computes max cosine similarity against a negative vector set; TFIDFEmbeddingService adds a documentFrequency map, bm25Idf(), exportState()/importState(), and clears df in clear(); utility tests added. |
| TFIDFVectoria BM25 scoring, negative queries, and snapshot persistence | libs/vectoriadb/src/vectoria-tfidf.ts, libs/vectoriadb/src/__tests__/vectoria-tfidf-negative.spec.ts, libs/vectoriadb/src/__tests__/vectoria-tfidf-snapshot-bm25.spec.ts | Constructor wires BM25 defaults; reindex conditionally computes termCounts and avgDocLength; search branches on scoring mode with negative-penalty subtraction; bm25Score(), toSnapshot()/loadSnapshot() (with validation and sanitizeObject) added; full negative-query and BM25 snapshot tests added including hardening tests. |
| VectoriaDB HNSW and brute-force negative query support | libs/vectoriadb/src/vectoria.ts | search() adds embedNegatives() helper; searchWithHNSW overscans when negatives present, applies adjusted score, and re-sorts post-iteration; searchBruteForce subtracts maxNegativeSimilarity penalty from cosine similarity. |
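
As a rough illustration of the BM25 path summarized above, the following is a minimal sketch and not the PR's implementation. The names `bm25Idf` and `bm25Score`, and the fields `termCounts`, `length`, and `avgDocLength`, come from the summary; the interface shapes, the Lucene-style IDF formula, and the default parameters `k1 = 1.2` and `b = 0.75` are assumptions chosen because they are common conventions.

```typescript
// Hypothetical shapes for illustration; the PR's real types may differ.
interface Bm25ParamsSketch {
  k1: number; // term-frequency saturation
  b: number; // strength of document-length normalization
}

interface Bm25DocSketch {
  termCounts: Map<string, number>; // raw term frequencies for the document
  length: number; // token count of the document
}

// Lucene-style BM25 IDF (assumed, not confirmed by the PR). Always non-negative.
function bm25Idf(documentCount: number, documentFrequency: number): number {
  return Math.log(1 + (documentCount - documentFrequency + 0.5) / (documentFrequency + 0.5));
}

// Sums the per-term BM25 contribution for each distinct query term found in the document.
function bm25Score(
  queryTerms: string[],
  doc: Bm25DocSketch,
  df: Map<string, number>,
  documentCount: number,
  avgDocLength: number,
  params: Bm25ParamsSketch = { k1: 1.2, b: 0.75 },
): number {
  const seen = new Set<string>();
  let score = 0;
  for (const term of queryTerms) {
    if (seen.has(term)) continue; // count each query term once
    seen.add(term);
    const tf = doc.termCounts.get(term) ?? 0;
    if (tf === 0) continue;
    const idf = bm25Idf(documentCount, df.get(term) ?? 0);
    // Guard against an empty corpus where avgDocLength is 0.
    const norm = 1 - params.b + params.b * (doc.length / (avgDocLength || 1));
    score += idf * ((tf * (params.k1 + 1)) / (tf + params.k1 * norm));
  }
  return score;
}
```

Unlike cosine similarity, a score computed this way has no fixed upper bound: a rare term in a large corpus can push it well above 1, which is the reason one of the review comments asks for the SearchResult.score documentation to be updated.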

Yarn 4 CI Tooling

| Layer | File(s) | Summary |
| --- | --- | --- |
| Composite action and Yarn 4 config | .github/actions/setup-node-yarn/action.yml, .yarnrc.yml, .gitignore, package.json | New composite action validates inputs, enables Corepack, runs actions/setup-node@v6, and conditionally installs; .yarnrc.yml sets nodeLinker: node-modules; .gitignore adds Yarn berry block; package.json pins packageManager: yarn@4.14.1. |
| Workflow integrations | .github/workflows/push.yml, .github/workflows/create-release-branch.yml, .github/workflows/publish-release.yml | All three CI workflows replace standalone actions/setup-node@v6 + yarn install steps with the local ./.github/actions/setup-node-yarn composite action; test job sets install: "false". |

Sequence Diagram

sequenceDiagram
  participant Caller
  participant VectoriaDB
  participant TFIDFVectoria
  participant TFIDFEmbeddingService
  participant maxNegativeSimilarity

  rect rgba(100, 149, 237, 0.5)
    note over Caller, TFIDFVectoria: TF-IDF / BM25 search path
    Caller->>TFIDFVectoria: search(query, { negativeQuery, negativeWeight, scoring })
    TFIDFVectoria->>TFIDFEmbeddingService: tokenize / embed query
    TFIDFVectoria->>TFIDFEmbeddingService: tokenize / embed negativeQuery
    loop each document
      TFIDFVectoria->>maxNegativeSimilarity: maxNegSim(docVector, negativeVectors)
      maxNegativeSimilarity-->>TFIDFVectoria: penalty
      TFIDFVectoria->>TFIDFVectoria: score = baseScore - negativeWeight * penalty
    end
    TFIDFVectoria-->>Caller: filtered, sorted SearchResults
  end

  rect rgba(144, 238, 144, 0.5)
    note over Caller, VectoriaDB: HNSW / brute-force search path
    Caller->>VectoriaDB: search(query, { negativeQuery, negativeWeight })
    VectoriaDB->>TFIDFEmbeddingService: embedNegatives(negativeQuery)
    VectoriaDB->>VectoriaDB: searchWithHNSW / searchBruteForce(negativeVectors, negativeWeight)
    loop each candidate
      VectoriaDB->>maxNegativeSimilarity: maxNegSim(docVector, negativeVectors)
      maxNegativeSimilarity-->>VectoriaDB: penalty
      VectoriaDB->>VectoriaDB: adjustedScore = (1 - distance) - negativeWeight * penalty
    end
    VectoriaDB->>VectoriaDB: re-sort by adjustedScore, slice top-K
    VectoriaDB-->>Caller: SearchResults
  end
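
The negative-query penalty in the diagram above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the code from the PR: `maxNegativeSimilarity` is named in the walkthrough, but its body here, the floor at 0, and the helper names `cosineSimilarity`, `CandidateSketch`, and `rerankWithNegatives` are hypothetical.

```typescript
// Cosine similarity between two equal-length vectors; returns 0 for zero-norm input.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Highest similarity between a document vector and any negative vector.
// Flooring at 0 is an assumption so that a penalty never boosts a score.
function maxNegativeSimilarity(docVector: number[], negativeVectors: number[][]): number {
  let max = 0;
  for (const negative of negativeVectors) {
    max = Math.max(max, cosineSimilarity(docVector, negative));
  }
  return max;
}

// Hypothetical candidate shape: what an overscanned HNSW lookup might return.
interface CandidateSketch {
  id: string;
  vector: number[];
  distance: number; // cosine distance, so similarity = 1 - distance
}

// Applies adjustedScore = (1 - distance) - negativeWeight * penalty,
// then re-sorts and slices to topK, mirroring the last steps of the diagram.
function rerankWithNegatives(
  candidates: CandidateSketch[],
  negativeVectors: number[][],
  negativeWeight: number,
  topK: number,
): Array<{ id: string; score: number }> {
  return candidates
    .map((c) => ({
      id: c.id,
      score: 1 - c.distance - negativeWeight * maxNegativeSimilarity(c.vector, negativeVectors),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

The re-sort matters because the penalty can reorder candidates: the nearest neighbour by distance may fall below a slightly farther one once its similarity to a negative vector is subtracted. That is also why fetching more than topK candidates before rescoring is useful when negatives are present.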

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 Hop hop! The rabbit scores with flair,
BM25 floats through the air.
Negative queries? Demote that noise!
Yarn 4 combs through cache with poise.
Snapshots saved, the worker runs free —
VectoriaDB, as sharp as can be!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: adding Yarn configuration (multiple Yarn-related files) and enhancing TF-IDF with anti-query/negative-query support (new SearchOptions fields, similarity utilities, BM25 support, and worker export). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fixes

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (1)
libs/vectoriadb/src/tfidf.embedding.service.ts (1)

123-141: 💤 Low value

Consider validating individual entry structure in importState.

The validation checks that idf and df are arrays, but doesn't verify each entry is a valid [string, number] tuple. A malformed snapshot with entries like ["term", "not-a-number"] or [null, 1] would silently create a corrupt model with NaN-poisoned or invalid scores.

🛡️ Optional: Add entry-level validation
   importState(state: { idf: Array<[string, number]>; df: Array<[string, number]>; documentCount: number }): void {
     if (!Array.isArray(state.idf) || !Array.isArray(state.df)) {
       throw new Error('Invalid TFIDF model state: idf and df must be arrays of [term, number] pairs');
     }
+    for (const entry of state.idf) {
+      if (!Array.isArray(entry) || typeof entry[0] !== 'string' || !Number.isFinite(entry[1])) {
+        throw new Error('Invalid TFIDF model state: idf entries must be [string, number] pairs');
+      }
+    }
+    for (const entry of state.df) {
+      if (!Array.isArray(entry) || typeof entry[0] !== 'string' || !Number.isFinite(entry[1])) {
+        throw new Error('Invalid TFIDF model state: df entries must be [string, number] pairs');
+      }
+    }
     if (!Number.isFinite(state.documentCount) || state.documentCount < 0) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/vectoriadb/src/tfidf.embedding.service.ts` around lines 123 - 141, In
the importState method, add entry-level validation for the idf and df arrays to
ensure each element is a valid [string, number] pair. After validating that idf
and df are arrays, iterate through each entry and check that it has exactly two
elements where the first is a string and the second is a finite number. This
prevents malformed snapshots from creating corrupt models with invalid or
NaN-poisoned scores when the entries are passed to the Map constructor.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.yarnrc.yml:
- Around line 1-2: The approvedGitRepositories setting with the wildcard pattern
"**" is unnecessary and weakens supply-chain security controls, since no
git-based dependencies are currently in use. Remove the
approvedGitRepositories configuration block from the .yarnrc.yml file entirely.
Add this setting back only if git-sourced packages are needed in the future,
and then restrict it to specific trusted repositories or organizations rather
than using a permissive wildcard.

In `@libs/vectoriadb/src/vectoria-tfidf.ts`:
- Around line 47-54: The SearchResult.score property documentation needs to be
updated to reflect that scores are no longer always bounded to [0, 1] after
introducing BM25 scoring alongside cosine similarity. Locate the SearchResult
interface/type definition (likely in a shared types or results file) and update
its score field documentation to clarify that score bounds depend on the scoring
method used - cosine scores are bounded in roughly [0, 1], while BM25 scores are
unbounded. This ensures the public API contract accurately represents the actual
behavior of both scoring modes.
- Around line 306-313: The search method in the vectoria-tfidf.ts file does not
validate its input parameters before using them. After destructuring the options
in the search method, add validation checks for the query, topK, and
negativeWeight parameters. Specifically, verify that query is not empty, topK is
a positive number, and negativeWeight is a valid non-negative number that is not
NaN. Throw descriptive errors for any invalid inputs to fail fast rather than
producing incorrect ranking behavior downstream.
- Around line 468-475: The loadSnapshot() method directly assigns untrusted
snapshot data to this.config, this.bm25, and related fields without proper
validation. Before assigning values from snapshot.scoring, snapshot.bm25, and
snapshot.config, add validation to ensure all BM25 parameters are finite numbers
and within valid ranges, that scoring configuration is properly shaped, and that
config values are safe and not malicious. Use validation checks (such as
Number.isFinite()) on numeric BM25 parameters and validate object shapes before
spreading snapshot.bm25 and assigning snapshot.config values to prevent
injection of invalid data that could destabilize the scoring system.

In `@libs/vectoriadb/src/vectoria.ts`:
- Around line 273-277: The negativeWeight variable assigned from
options.negativeWeight is not validated before being used in score arithmetic
operations. Add validation to ensure negativeWeight is a positive finite number
(greater than 0 and not NaN or Infinity), and throw or reject an error if the
value is invalid. This validation should occur immediately after the
negativeWeight assignment to prevent invalid values from propagating downstream
in the semantic search scoring logic.

---

Nitpick comments:
In `@libs/vectoriadb/src/tfidf.embedding.service.ts`:
- Around line 123-141: In the importState method, add entry-level validation for
the idf and df arrays to ensure each element is a valid [string, number] pair.
After validating that idf and df are arrays, iterate through each entry and
check that it has exactly two elements where the first is a string and the
second is a finite number. This prevents malformed snapshots from creating
corrupt models with invalid or NaN-poisoned scores when the entries are passed
to the Map constructor.
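
The input checks requested in the inline comments above for search() and loadSnapshot() could look roughly like the sketch below. It is a hedged illustration only: the function names `validateSearchInput` and `validateBm25Params`, the option shape, the defaults `topK = 10` and `negativeWeight = 1`, and the range `0 <= b <= 1` are assumptions, not taken from the PR. The two comments disagree on whether negativeWeight must be strictly positive or merely non-negative; this sketch follows the non-negative reading.

```typescript
// Hypothetical option shape; the real SearchOptions has more fields.
interface SearchOptionsSketch {
  topK?: number;
  negativeWeight?: number;
}

// Fails fast on invalid search input instead of letting NaN or negative
// values flow into score arithmetic. Defaults are assumed for illustration.
function validateSearchInput(
  query: string,
  options: SearchOptionsSketch = {},
): { topK: number; negativeWeight: number } {
  const { topK = 10, negativeWeight = 1 } = options;
  if (typeof query !== 'string' || query.trim().length === 0) {
    throw new Error('query must be a non-empty string');
  }
  if (!Number.isInteger(topK) || topK <= 0) {
    throw new Error('topK must be a positive integer');
  }
  if (!Number.isFinite(negativeWeight) || negativeWeight < 0) {
    throw new Error('negativeWeight must be a finite, non-negative number');
  }
  return { topK, negativeWeight };
}

// Validates untrusted BM25 parameters from a snapshot before they are assigned.
// Returns a fresh object so no extra snapshot keys are carried along.
function validateBm25Params(value: unknown): { k1: number; b: number } {
  if (typeof value !== 'object' || value === null) {
    throw new Error('bm25 must be an object');
  }
  const { k1, b } = value as { k1?: unknown; b?: unknown };
  if (typeof k1 !== 'number' || !Number.isFinite(k1) || k1 < 0) {
    throw new Error('bm25.k1 must be a finite, non-negative number');
  }
  if (typeof b !== 'number' || !Number.isFinite(b) || b < 0 || b > 1) {
    throw new Error('bm25.b must be a finite number between 0 and 1');
  }
  return { k1, b };
}
```

Returning a new `{ k1, b }` object rather than spreading the snapshot value keeps unexpected keys out of the configuration, which addresses the same concern as the comment about spreading snapshot.bm25 directly.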
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a6d14b06-9748-4dff-8786-cfe66e71ddb2

📥 Commits

Reviewing files that changed from the base of the PR and between 8473469 and 3edbb92.

⛔ Files ignored due to path filters (2)
  • .yarn/install-state.gz is excluded by !**/.yarn/**, !**/*.gz
  • yarn.lock is excluded by !**/yarn.lock, !**/*.lock
📒 Files selected for processing (18)
  • .github/actions/setup-node-yarn/action.yml
  • .github/workflows/create-release-branch.yml
  • .github/workflows/publish-release.yml
  • .github/workflows/push.yml
  • .gitignore
  • .yarnrc.yml
  • libs/vectoriadb/package.json
  • libs/vectoriadb/src/__tests__/similarity.spec.ts
  • libs/vectoriadb/src/__tests__/vectoria-tfidf-negative.spec.ts
  • libs/vectoriadb/src/__tests__/vectoria-tfidf-snapshot-bm25.spec.ts
  • libs/vectoriadb/src/index.ts
  • libs/vectoriadb/src/interfaces.ts
  • libs/vectoriadb/src/similarity.utils.ts
  • libs/vectoriadb/src/tfidf.embedding.service.ts
  • libs/vectoriadb/src/vectoria-tfidf.ts
  • libs/vectoriadb/src/vectoria.ts
  • libs/vectoriadb/src/worker.ts
  • package.json

Comment thread .yarnrc.yml Outdated
Comment thread libs/vectoriadb/src/vectoria-tfidf.ts
Comment thread libs/vectoriadb/src/vectoria-tfidf.ts
Comment thread libs/vectoriadb/src/vectoria-tfidf.ts
Comment thread libs/vectoriadb/src/vectoria.ts
@alexmercerpo alexmercerpo merged commit 1d52f44 into main Jun 18, 2026
3 checks passed
@alexmercerpo alexmercerpo deleted the fixes branch June 18, 2026 21:35