Implement determinant-diversity search in the new post-processing pipeline by narendatha · Pull Request #858 · microsoft/DiskANN

narendatha · 2026-03-23T11:14:50Z

Add Determinant-Diversity Search Support

Summary

This PR implements diversity-maximizing search using determinant-diversity post-processing.
Instead of returning only the k nearest neighbors by distance, determinant-diversity search selects k vectors that maximize the volume spanned by query-scaled vectors, producing more diverse and complementary results.

Motivation

Standard nearest neighbor search often returns highly similar or redundant results.
Determinant-diversity search helps retrieve complementary results that cover different aspects of a query.
This PR uses determinant maximization of the Gram matrix of query-scaled vectors.

Algorithm

Each candidate vector is scaled by (similarity_to_query)^power, then vectors are greedily selected to maximize the determinant of the Gram matrix (equivalently maximizing spanned volume).

Two variants are implemented:

- eta = 0: pure greedy orthogonalization (pivoted QR style)
- eta > 0: ridge-regularized variant balancing diversity and numerical stability

Both variants run in O(n * k * d) where n is number of candidates, k is number of selected results, and d is vector dimension.
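A minimal sketch of the eta = 0 variant, to make the greedy step concrete. It is not the PR's implementation: the function names (`select_diverse`, `cosine`, `dot`) are hypothetical, the similarity is assumed to be cosine similarity clamped at zero, and the ridge term (eta > 0) is not covered. Each round picks the candidate with the largest residual norm, then projects that direction out of the remaining candidates, which costs O(n * d) per round and O(n * k * d) overall.

```rust
/// Dot product of two equal-length slices.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine similarity; returns 0.0 if either vector has zero norm.
/// (Assumption: the PR may use a different similarity.)
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let denom = (dot(a, a) * dot(b, b)).sqrt();
    if denom > 0.0 { dot(a, b) / denom } else { 0.0 }
}

/// Greedy volume-maximizing selection for the eta = 0 case only.
/// Returns the indices of the selected candidates in selection order.
fn select_diverse(query: &[f32], candidates: &[Vec<f32>], k: usize, power: f32) -> Vec<usize> {
    // Scale each candidate by similarity^power (negative similarities clamp to 0).
    let mut residuals: Vec<Vec<f32>> = candidates
        .iter()
        .map(|c| {
            let w = cosine(query, c).max(0.0).powf(power);
            c.iter().map(|x| x * w).collect()
        })
        .collect();
    let mut taken = vec![false; candidates.len()];
    let mut selected = Vec::with_capacity(k);

    for _ in 0..k.min(candidates.len()) {
        // Pivot: the untaken candidate with the largest squared residual norm.
        let mut best: Option<(usize, f32)> = None;
        for (i, r) in residuals.iter().enumerate() {
            if taken[i] {
                continue;
            }
            let n2 = dot(r, r);
            if best.map_or(true, |(_, b)| n2 > b) {
                best = Some((i, n2));
            }
        }
        let Some((pivot, norm2)) = best else { break };
        if norm2 <= f32::EPSILON {
            break; // remaining candidates add (numerically) no volume
        }
        taken[pivot] = true;
        selected.push(pivot);

        // Project the pivot direction out of every remaining residual.
        let norm = norm2.sqrt();
        let q: Vec<f32> = residuals[pivot].iter().map(|x| x / norm).collect();
        for (i, r) in residuals.iter_mut().enumerate() {
            if taken[i] {
                continue;
            }
            let proj = dot(r.as_slice(), &q);
            for (x, qx) in r.iter_mut().zip(&q) {
                *x -= proj * qx;
            }
        }
    }
    selected
}
```

For the query `[1, 0, 0]` and candidates `[1, 0, 0]`, `[0.99, 0.1, 0]`, `[0.6, 0.8, 0]` with `k = 2` and `power = 1.0`, this sketch returns indices `[0, 2]`: the near-duplicate at index 1 is skipped in favor of the less similar but more complementary candidate.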

Performance Characteristics

In benchmark comparison runs (normal search vs determinant-diversity search), we observe the expected trade-off:

- lower QPS due to diversity optimization overhead
- recall trade-off depending on search parameters

Parameter Tuning Guide

`determinant_diversity_power` (default: 2.0)

Controls relevance vs diversity:

- Higher (e.g., 3.0+): stronger relevance weighting, typically less diversity
- Lower (e.g., 1.0): more diversity emphasis
- 0: pure diversity maximization without relevance weighting
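The role of the power can be read off the determinant directly. The notation here is an assumption introduced for illustration (it does not appear in the PR): x_i are the k selected candidate vectors, s_i > 0 their similarities to the query, p the power, G the Gram matrix of the unscaled vectors, and G' the Gram matrix of the scaled vectors. The identity applies to the unregularized objective (eta = 0).

```latex
\begin{align*}
y_i &= s_i^{p}\, x_i, \qquad
G'_{ij} = y_i^{\top} y_j = s_i^{p}\, s_j^{p}\, x_i^{\top} x_j
\;\Longrightarrow\;
G' = D\, G\, D, \quad D = \operatorname{diag}\!\left(s_1^{p}, \dots, s_k^{p}\right) \\
\det G' &= \Bigl(\prod_{i=1}^{k} s_i^{p}\Bigr)^{2} \det G \\
\log \det G' &= \underbrace{2p \sum_{i=1}^{k} \log s_i}_{\text{relevance}}
\;+\; \underbrace{\log \det G}_{\text{diversity}}
\end{align*}
```

With p = 0 the relevance term vanishes and only the diversity term remains; larger p gives the relevance term more weight. As a concrete case, two candidates with similarities 0.9 and 0.6 have scale factors in the ratio 1.5 at p = 1, 2.25 at p = 2, and 3.375 at p = 3.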

`determinant_diversity_eta` (default: 0.01)

Ridge regularization for numerical robustness and relevance bias:

- Higher (e.g., 0.1+): more robust, typically more relevance-biased
- Lower (e.g., 0.001): closer to pure determinant maximization
- 0: pure greedy orthogonalization (no ridge regularization)

Recommended Settings

Use Case	determinant_diversity_power	determinant_diversity_eta
Balanced (default)	2.0	0.01
High diversity	1.0	0.001
High relevance	3.0	0.1

Usage

Benchmark Configuration (disk-index)

{
  "search_phase": {
    "search_list": [100, 200, 400],
    "recall_at": 10,
    "is_determinant_diversity_search": true,
    "determinant_diversity_eta": 0.01,
    "determinant_diversity_power": 2.0
  }
}

Testing

- unit tests for determinant-diversity post-processing and parameter validation
- cargo fmt --all
- cargo clippy --workspace --all-targets -- -D warnings
- cargo check --package diskann-benchmark --features disk-index

How to try

- build this branch
- set benchmark input paths to your dataset/index
- run diskann-benchmark with determinant-diversity parameters

- Simplify Search trait: move processor/output buffer to method-level generics
- Remove Internal<FullPrecision> strategy split; use RemoveDeletedIdsAndCopy for delete ops
- Add DefaultSearchStrategy aggregate trait combining SearchStrategy + HasDefaultProcessor
- Update benchmark-core helpers to use aggregate trait (reduce recurring bounds)
- Wire range search output buffer through to caller (support dynamic output handling)
- Add no-op SearchOutputBuffer impl for () type to preserve compatibility

…roviders

This commit moves the determinant_diversity_post_process module from diskann to diskann-providers, as it does not depend on diskann internals and logically belongs with other post-processing logic in the providers layer.

Changes:
- Move determinant_diversity_post_process.rs from diskann/src/graph/search/ to diskann-providers/src/model/graph/provider/async_/
- Update all imports across workspace to use diskann_providers location
- Add diskann-providers dependency to diskann-benchmark-core (required for DeterminantDiversitySearchParams access)
- Remove old module reference from diskann/src/graph/search/mod.rs
- Update diskann-benchmark, diskann-disk imports to use new location

Validated with:
- cargo clippy --workspace --all-targets -- -D warnings
- cargo fmt --all

This results in cleaner architectural separation where determinant-diversity search parameters stay with the provider infrastructure that implements them.

…uce clones

- Add DeterminantDiversityError enum for parameter validation
- Convert DeterminantDiversitySearchParams::new() to return Result<Self, Error>
- Validate top_k > 0, eta >= 0.0 and finite, power > 0.0 and finite
- Optimize post_process_with_eta_f32: precompute projections to eliminate vector clones
- Optimize post_process_greedy_orthogonalization_f32: single r_star_copy before projection loop
- Expand test suite from 3 to 11 tests (7 validation + 4 algorithm tests)
- Update callsites in disk_index/search.rs and index/search/knn.rs for error handling
- Add early validation checks in main router function
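The validation rules listed in this commit message could look roughly like the sketch below. Only the type names `DeterminantDiversityError` and `DeterminantDiversitySearchParams` and the three rules come from the commit message; the enum variants, field names, field types, and argument order are assumptions made for illustration and may not match the actual code.

```rust
/// Hypothetical error variants; the real enum may be shaped differently.
#[derive(Debug, PartialEq)]
enum DeterminantDiversityError {
    ZeroTopK,
    InvalidEta(f32),
    InvalidPower(f32),
}

/// Hypothetical field layout for the search parameters.
#[derive(Debug, Clone, Copy, PartialEq)]
struct DeterminantDiversitySearchParams {
    top_k: usize,
    eta: f32,
    power: f32,
}

impl DeterminantDiversitySearchParams {
    /// Fallible constructor: rejects parameters outside the documented ranges.
    fn new(top_k: usize, eta: f32, power: f32) -> Result<Self, DeterminantDiversityError> {
        // top_k > 0
        if top_k == 0 {
            return Err(DeterminantDiversityError::ZeroTopK);
        }
        // eta >= 0.0 and finite (NaN and infinities are rejected)
        if !eta.is_finite() || eta < 0.0 {
            return Err(DeterminantDiversityError::InvalidEta(eta));
        }
        // power > 0.0 and finite
        if !power.is_finite() || power <= 0.0 {
            return Err(DeterminantDiversityError::InvalidPower(power));
        }
        Ok(Self { top_k, eta, power })
    }
}
```

Returning a `Result` from the constructor means an invalid configuration fails once, at construction time, rather than inside the search loop.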

- Extract shared run-loop logic into reusable helpers
- Route both knn and determinant-diversity through closure-based parameter builders
- Preserve determinant-diversity parameter validation/error propagation
- Reduce duplicated benchmark orchestration code

- Promote DelegateDefaultPostProcessor as the canonical trait in glue
- Remove compatibility alias layer for HasDefaultProcessor
- Rename all trait bounds/impls/usages across diskann, providers, disk, benchmark, and label-filter
- Keep delegate_default_post_process! macro usage aligned with trait naming

- Add runtime filter_start_points flag to RemoveDeletedIdsAndCopy and Rerank
- Route default search through runtime-configurable processors (no FilterStartPoints pipeline)
- Set inplace-delete search processors to filter_start_points=false
- Remove Internal<T> strategy indirection and update async providers accordingly

Determinant diversity post-processing should be landed separately. This removes:
- determinant_diversity_post_process.rs from diskann-providers
- determinant_diversity.rs from diskann-benchmark-core
- All diversity-related PostProcess impls (full_precision, disk_provider)
- Diversity benchmark infrastructure (run_determinant_diversity, DeterminantDiversityKnn trait, search_determinant_diversity)
- Diversity input parsing and validation from async_ and disk inputs
- Diversity example JSON files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace the PostProcess trait indirection with HRTB bounds directly on search methods. The key insight is that SearchAccessor<'a> has no 'where Self: 'a' clause, making the HRTB 'for<'a> PP: SearchPostProcess<S::SearchAccessor<'a>, T, O>' safe even with generic S.

Changes in diskann/:
- Remove PostProcess trait, DelegateDefaultPostProcessor trait, DefaultPostProcess ZST, blanket impl, delegate_default_post_process! macro
- Add HasDefaultProcessor trait and has_default_processor! macro
- Update DefaultSearchStrategy = SearchStrategy + HasDefaultProcessor
- Update InplaceDeleteStrategy: SearchPostProcessor now carries the HRTB SearchPostProcess bound directly
- Search trait now requires S: SearchStrategy (needed for GAT projection)
- All Search impls (Knn, RecordedKnn, Range, Diverse, Multihop) call processor.post_process() directly instead of strategy.post_process_with()
- DiskANNIndex::search() uses HasDefaultProcessor::create_processor()
- DiskANNIndex::search_with() takes PP with HRTB bound
- Update test provider to use HasDefaultProcessor + CopyIds

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Delete all forwarding PostProcess impl blocks (10 impls), rename DelegateDefaultPostProcessor to HasDefaultProcessor across all providers, delete the CachedPostProcess<P> newtype (replaced by Pipeline<Unwrap, S::SearchPostProcessor>), and replace longhand SearchStrategy + HasDefaultProcessor with DefaultSearchStrategy where appropriate.

Net: -383 lines of pure forwarding boilerplate eliminated.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Convert 3 SearchPostProcess implementations from manual async desugaring (fn -> impl Future + Send with async move block) to native async fn. The recursive test_spawning in provider.rs is kept manual because it needs an explicit 'static bound on the returned future.

Added T: Sync bound to RemoveDeletedIdsAndCopy impl because async fn captures all parameters (including unused &T) in the future, requiring &T: Send → T: Sync. This is always satisfied at call sites.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Restore main's structural approach: post-processors (RemoveDeletedIdsAndCopy, Rerank) are clean ZSTs that only handle deletion filtering and reranking. Start-point filtering is composed via Pipeline<FilterStartPoints, Base> at the type level:
- HasDefaultProcessor returns Pipeline<FilterStartPoints, Base> (filters start points during regular search)
- InplaceDeleteStrategy returns Base directly (no start-point filtering during delete operations)

This eliminates the runtime 'filter_start_points: bool' flag, makes the post-processors synchronous again (no .await needed), and restores their Error types to Infallible/Panics instead of ANNError. Also reverts diskann-benchmark/src/backend/index/search/knn.rs to main.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove the unnecessary search_core helper that the PR introduced. Inline the search body back into Knn::search, matching main's structure. The only semantic difference from main is the Option A change: processor is now a method parameter instead of coming from strategy.post_processor().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

codecov-commenter · 2026-03-26T09:50:25Z

Codecov Report

❌ Patch coverage is 55.43478% with 287 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.04%. Comparing base (b616aa3) to head (58e8917).

Files with missing lines	Patch %	Lines
diskann-disk/src/search/provider/disk_provider.rs	26.31%	84 Missing ⚠️
...vider/async_/determinant_diversity_post_process.rs	83.42%	58 Missing ⚠️
diskann-benchmark-core/src/search/graph/knn.rs	0.00%	47 Missing ⚠️
diskann-benchmark/src/inputs/disk.rs	0.00%	31 Missing ⚠️
diskann-benchmark/src/inputs/async_.rs	22.85%	27 Missing ⚠️
diskann-benchmark/src/backend/index/benchmarks.rs	29.41%	24 Missing ⚠️
diskann-benchmark/src/backend/index/search/knn.rs	50.00%	14 Missing ⚠️
diskann-tools/src/utils/search_disk_index.rs	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #858      +/-   ##
==========================================
- Coverage   90.45%   89.04%   -1.41%     
==========================================
  Files         442      443       +1     
  Lines       83248    83872     +624     
==========================================
- Hits        75301    74683     -618     
- Misses       7947     9189    +1242

Flag	Coverage Δ
miri	`89.04% <55.43%> (-1.41%)`	⬇️
unittests	`88.88% <55.43%> (-1.53%)`	⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
diskann-benchmark-runner/src/any.rs	`100.00% <100.00%> (ø)`
diskann-benchmark/src/backend/index/spherical.rs	`100.00% <ø> (ø)`
diskann-disk/src/build/builder/core.rs	`95.26% <100.00%> (+0.01%)`	⬆️
diskann-tools/src/utils/search_disk_index.rs	`0.00% <0.00%> (ø)`
diskann-benchmark/src/backend/index/search/knn.rs	`66.66% <50.00%> (-9.90%)`	⬇️
diskann-benchmark/src/backend/index/benchmarks.rs	`44.52% <29.41%> (-2.05%)`	⬇️
diskann-benchmark/src/inputs/async_.rs	`36.83% <22.85%> (-0.90%)`	⬇️
diskann-benchmark/src/inputs/disk.rs	`3.87% <0.00%> (-0.60%)`	⬇️
diskann-benchmark-core/src/search/graph/knn.rs	`75.00% <0.00%> (-19.92%)`	⬇️
...vider/async_/determinant_diversity_post_process.rs	`83.42% <83.42%> (ø)`
... and 1 more

... and 39 files with indirect coverage changes

Copilot

Pull request overview

This PR adds determinant-diversity (DPP-like) reranking as an explicit search post-processing option, wiring it through the async provider pipeline, the disk-index searcher, and benchmark configuration/execution paths.

Changes:

Introduces a determinant-diversity post-processor (with params + validation) for async graph search.
Adds an optional determinant-diversity rerank path to the disk-index searcher API and benchmark “disk-index” runner.
Extends benchmark input schemas and search harnesses to support running KNN with an explicit post-processor.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.

File	Description
tmp/wiki_compare_determinant_diversity.json	Adds a sample config comparing baseline vs determinant-diversity for async index build/search.
diskann-tools/src/utils/search_disk_index.rs	Updates disk-index search invocation for the expanded search() signature.
diskann-providers/src/model/graph/provider/async_/mod.rs	Exposes the new determinant-diversity post-processing module and types.
diskann-providers/src/model/graph/provider/async_/determinant_diversity_post_process.rs	Implements determinant-diversity reranking + parameter validation + unit tests.
diskann-disk/src/search/provider/disk_provider.rs	Adds determinant-diversity rerank post-processing to disk-index search via search_with().
diskann-disk/src/build/builder/core.rs	Updates tests/call sites for the expanded disk-index search API.
diskann-benchmark/src/inputs/disk.rs	Adds disk-index benchmark config flags/params + validation + display output.
diskann-benchmark/src/inputs/async_.rs	Adds async topk benchmark params for determinant-diversity reranking + validation.
diskann-benchmark/src/backend/index/spherical.rs	Enables determinant-diversity post-processing in spherical benchmark backend via KNNWithPostProcessor.
diskann-benchmark/src/backend/index/search/knn.rs	Refactors KNN runner to support both plain KNN and KNN-with-postprocessor runners.
diskann-benchmark/src/backend/index/benchmarks.rs	Enables determinant-diversity reranking in the general benchmark backend.
diskann-benchmark/src/backend/disk_index/search.rs	Threads determinant-diversity parameters into disk-index benchmark search execution and stats output.
diskann-benchmark/example/openai-disk-determinant-diversity-compare.json	Adds an example config comparing disk-index baseline vs determinant-diversity.
diskann-benchmark-runner/src/any.rs	Updates a dispatch error test string.
diskann-benchmark-core/src/search/graph/knn.rs	Adds KNNWithPostProcessor helper to run graph KNN searches using search_with().

diskann-disk/src/search/provider/disk_provider.rs

diskann-benchmark/src/inputs/async_.rs

diskann-providers/src/model/graph/provider/async_/determinant_diversity_post_process.rs

diskann-benchmark/src/inputs/disk.rs

diskann-benchmark/src/inputs/async_.rs

narendatha and others added 30 commits March 9, 2026 19:22

Refactor search post-processing API and default processors

e116635

Fix nextest -Dwarnings build and cached delete postprocess bounds

855d673

Add determinant-diversity search mode to benchmark pipeline

ecd81bb

Refactor search API to explicit processors and add search_with

f3d308a

Add disk determinant-diversity search wiring and benchmark inputs

4a74343

Refactor processor passing by value and remove KnnWith

2729ac7

Fix: Align SearchOutputBuffer bound with trait definition (+?Sized)

2149388

Merge remote-tracking branch 'origin/main' into u/narendatha/post_process_refactor_3

0540f42

merge issues fix

6a1872d

fix merge issues, clippy, fmt

750d542

Fix cached inplace-delete post-processor wiring

68e4bfd

- Preserve inner search_post_processor for Cached<S> inplace-delete path
- Add CachedPostProcess wrapper to avoid PostProcess impl overlap
- Keep default post-processing delegation unchanged for normal search

Revert unrelated quantization changes

48cd833

The post-processing refactor should not touch diskann-quantization. Restores license headers on generated flatbuffer files and reverts a lazy_format! call-site change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix unused imports from PostProcess removal

5e2db52

Remove leftover SearchOutputBuffer, IntoANNResult, and Neighbor imports in debug_provider.rs that were only used by the deleted PostProcess impls. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove some more JSONs.

349d0b6

Remove misc files.

008c09b

Almost there.

46f05b3

Last cleanups (outside of caching).

6dec4bc

Renames.

1cc2816

Mark Hildebrand and others added 13 commits March 14, 2026 13:17

Get caching provider working again.

2ad1ed4

Clean up some stragglers.

0ee76b1

Unify naming.

5fef991

Potential fix for pull request finding

ca84f41

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

0fcfde6

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

7ee952b

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

bc11046

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

4d44eb9

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Merge remote-tracking branch 'origin/main' into mhildebr/poster

cbb2deb

Bump version.

1f688bb

Reset back to 0.49.1

635012d

SearchStrategy no longer needs the output "ID" type.

3145904

Add determinant-diversity search post-process and benchmark integration

58579bb

narendatha changed the title ~~Port determinant-diversity post-process to new search pipeline~~ Implement determinant-diversity search in the new post-processing pipeline Mar 23, 2026

narendatha added 2 commits March 23, 2026 16:50

Remove temporary benchmark results artifact

329793b

Rename rag terminology to determinant_diversity

595e949

Base automatically changed from mhildebr/poster to main March 23, 2026 17:21

Merge main into branch and resolve conflicts

f4bc307

hildebrandmw reviewed Mar 24, 2026

diskann-benchmark-core/Cargo.toml (outdated)

narendatha added 3 commits March 26, 2026 15:00

Generalize benchmark KNN post-processing

43ccd91

bug fix in determinant diversity algorithm.

3ff406d

Merge remote-tracking branch 'origin/main' into u/narendatha/determinant_diversity

55d3a3e

Reduce determinant rerank allocations

75f9aaa

narendatha marked this pull request as ready for review March 26, 2026 11:00

narendatha requested review from a team and Copilot March 26, 2026 11:00

Copilot started reviewing on behalf of narendatha March 26, 2026 11:01

Copilot AI reviewed Mar 26, 2026

determinant diversity: disk index support and refactoring

58e8917

narendatha wants to merge 51 commits into main from u/narendatha/determinant_diversity

4 participants
