Refactor parallel writer #595

MarkWolters · 2026-01-14T21:14:28Z

Sequence diagram of current OnDiskGraphIndexWriter usage by Cassandra:
Cassandra_OnDiskGraphIndexWriter_CurrentState_SequenceDiagram.md

Sequence diagram of proposed future OnDiskParallelGraphIndexWriter usage:
OnDiskParallelGraphIndexWriter_SequenceDiagram.md

Perf test results:
refactor_parallel.tar.gz

Refactoring of the parallelization of the graph index writer.

This PR splits the parallel writer into a separate class rather than maintaining if-based branches throughout a single class (OnDiskGraphIndexWriter). A large amount of common code has been moved into the new RandomAccessOnDiskGraphIndexWriter, making the hierarchy cleaner and easier to understand and maintain.

Previously, it was discovered that calling write() after writeInline() would result in the features written by writeInline() being overwritten with zeroes. This PR resolves that by checking whether the feature provider is null, emulating how sequential writes handle it.
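To make the overwrite problem concrete, here is a minimal sketch of a two-phase write against an in-memory buffer. Everything in it (the class name, the fixed 4-byte feature and neighbor slots, the `IntFunction` supplier) is a hypothetical stand-in and not jvector's API; it only illustrates the idea of skipping the feature slot when the supplier is null instead of filling it with zeroes.

```java
import java.nio.ByteBuffer;
import java.util.function.IntFunction;

/** Toy model of a two-phase write; none of these names come from jvector. */
final class InlineSkipSketch {
    static final int FEATURE_SIZE = Integer.BYTES;   // inline feature bytes per node
    static final int NEIGHBOR_SIZE = Integer.BYTES;  // neighbor bytes per node
    static final int RECORD_SIZE = FEATURE_SIZE + NEIGHBOR_SIZE;

    /** Phase 1: write one node's feature early, at its final offset. */
    static void writeInline(ByteBuffer file, int ordinal, int feature) {
        file.putInt(ordinal * RECORD_SIZE, feature);
    }

    /**
     * Phase 2: write every record. A null featureSupplier is taken to mean
     * "features were already written", so the feature slot is left untouched.
     */
    static void write(ByteBuffer file, int nodeCount, IntFunction<Integer> featureSupplier) {
        for (int ordinal = 0; ordinal < nodeCount; ordinal++) {
            int offset = ordinal * RECORD_SIZE;
            if (featureSupplier != null) {
                file.putInt(offset, featureSupplier.apply(ordinal));
            }
            // Stand-in neighbor data, always written in phase 2.
            file.putInt(offset + FEATURE_SIZE, 100 + ordinal);
        }
    }

    public static void main(String[] args) {
        ByteBuffer file = ByteBuffer.allocate(2 * RECORD_SIZE);
        writeInline(file, 0, 7);
        writeInline(file, 1, 9);
        write(file, 2, null); // null supplier: keep the pre-written features
        System.out.println(file.getInt(0) + " " + file.getInt(RECORD_SIZE)); // prints "7 9"
    }
}
```

A writer that unconditionally wrote the feature slot in phase 2 would replace 7 and 9 with whatever the empty supplier produced, which is the zero-fill behavior described above.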

github-actions · 2026-01-14T21:14:41Z

Before you submit for review:

Does your PR follow guidelines from CONTRIBUTIONS.md?
Did you summarize what this PR does clearly and concisely?
Did you include performance data for changes which may be performance impacting?
Did you include useful docs for any user-facing changes or features?
Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
Did you trigger and review regression testing results against the base branch via Run Bench Main?
Did you adhere to the code formatting guidelines (TBD)?
Did you group your changes for easy review, providing meaningful descriptions for each commit?
Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

Copilot

Pull request overview

This PR refactors the parallel writing functionality in the graph index writer by introducing a cleaner class hierarchy. The main change is the extraction of parallel writing logic into a dedicated OnDiskParallelGraphIndexWriter class, replacing the previous approach of using conditional branches within a single OnDiskGraphIndexWriter class.

Changes:

- Introduced RandomAccessOnDiskGraphIndexWriter as a base class containing common functionality for random access writers
- Created OnDiskParallelGraphIndexWriter as a separate class for parallel writing operations
- Simplified OnDiskGraphIndexWriter to focus on sequential writing only
- Updated examples and benchmarks to use the new OnDiskParallelGraphIndexWriter.Builder instead of the withParallelWrites() flag
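The shape of this split can be sketched with a small template-method hierarchy. The classes below are hypothetical stand-ins (`BaseWriterSketch`, `SequentialWriterSketch`, `ParallelWriterSketch`) and do not mirror the real jvector signatures; they only illustrate how shared steps can live in a base class while each subclass supplies its own record-writing strategy, in place of if-based branches in a single class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Hypothetical stand-ins; not the real jvector classes or their signatures. */
abstract class BaseWriterSketch {
    protected final List<String> log = new ArrayList<>();

    /** Shared steps live in the base class. */
    final List<String> write(int nodeCount) {
        log.add("header");
        writeRecords(nodeCount);
        log.add("footer");
        return log;
    }

    /** Only the record-writing strategy differs between subclasses. */
    protected abstract void writeRecords(int nodeCount);

    public static void main(String[] args) {
        System.out.println(new SequentialWriterSketch().write(3));
        System.out.println(new ParallelWriterSketch().write(3));
    }
}

final class SequentialWriterSketch extends BaseWriterSketch {
    @Override
    protected void writeRecords(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            log.add("record " + i);
        }
    }
}

final class ParallelWriterSketch extends BaseWriterSketch {
    @Override
    protected void writeRecords(int nodeCount) {
        // Build records concurrently; the collector keeps them in ordinal order.
        List<String> records = IntStream.range(0, nodeCount)
                .parallel()
                .mapToObj(i -> "record " + i)
                .collect(Collectors.toList());
        log.addAll(records);
    }
}
```

Both subclasses produce the same logical output here; the point is only that callers pick a writer class up front rather than toggling a flag on one class.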

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| RandomAccessOnDiskGraphIndexWriter.java | New base class abstracting common functionality for random access graph index writers |
| OnDiskParallelGraphIndexWriter.java | New dedicated class for parallel graph index writing with async I/O support |
| OnDiskGraphIndexWriter.java | Simplified to sequential writing only, removing parallel write logic and extending the new base class |
| ParallelGraphWriter.java | Updated to accept `featuresPreWritten` parameter for handling pre-written features |
| NodeRecordTask.java | Enhanced to handle cases where features are pre-written via `writeInline()` |
| GraphIndexWriterTypes.java | Updated enum values from `ON_DISK_SEQUENTIAL`/`ON_DISK_PARALLEL` to `RANDOM_ACCESS`/`RANDOM_ACCESS_PARALLEL` |
| GraphIndexWriter.java | Refactored factory methods to align with new writer types |
| ParallelWriteExample.java | Updated to use `OnDiskParallelGraphIndexWriter.Builder` instead of `withParallelWrites()` |
| Grid.java | Changed type references from `OnDiskGraphIndexWriter` to `OnDiskParallelGraphIndexWriter` |
| ParallelWriteBenchmark.java | Updated to instantiate appropriate writer class based on parallel flag |

Comments suppressed due to low confidence (1)

jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/NodeRecordTask.java:1

Multiple buffer allocations are created for each node when featuresPreWritten is true. Consider object pooling or buffer reuse strategies to reduce allocation overhead, especially for large graphs with many nodes.

...e/src/main/java/io/github/jbellis/jvector/graph/disk/RandomAccessOnDiskGraphIndexWriter.java

jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/GraphIndexWriterTypes.java

jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/GraphIndexWriter.java

benchmarks-jmh/src/main/java/io/github/jbellis/jvector/bench/ParallelWriteBenchmark.java

ashkrisk

A few comments, mostly around leftover artifacts from the refactor.

jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/GraphIndexWriter.java

jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskGraphIndexWriter.java

...e/src/main/java/io/github/jbellis/jvector/graph/disk/RandomAccessOnDiskGraphIndexWriter.java

...-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskParallelGraphIndexWriter.java

...s/src/test/java/io/github/jbellis/jvector/graph/disk/TestOnDiskParallelGraphIndexWriter.java

ashkrisk · 2026-01-23T15:55:08Z

...-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskParallelGraphIndexWriter.java

+    @Override
+    public synchronized void writeFeaturesInline(int ordinal, Map<FeatureId, Feature.State> stateMap) throws IOException {
+        super.writeFeaturesInline(ordinal, stateMap);
+        featuresPreWritten = true;


I'm not plugged in to any discussion around this, but doesn't the (now sequential) OnDiskGraphIndexWriter have the same problem with pre-written features? Was it a deliberate choice to have this flag used only by the parallel writer and not the sequential one?

MarkWolters · 2026-01-23

The sequential version can use seek() to advance past the area with pre-written features, but the parallel version cannot, because its records are prebuilt in memory before being written to disk. There is a version of the parallel writer that uses seek() rather than prebuilding in memory, but that reduces parallelism and was slower in testing (see the parallel_writer_v2 branch if you are curious).
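The difference can be illustrated with positional writes on a `java.nio.channels.FileChannel`. The sketch below is an assumption-laden toy, not the PR's implementation: the record layout (one feature int followed by one neighbor int), the class name, and the helper methods are all hypothetical. It shows why a record prebuilt in memory clobbers an earlier feature write when written whole, and one way a prebuilt record could avoid that by writing only the part after the feature slot.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Toy record layout: [feature int][neighbor int]. Names are illustrative only. */
final class PrebuiltRecordSketch {
    static final int FEATURE_SIZE = Integer.BYTES;
    static final int RECORD_SIZE = 2 * Integer.BYTES;

    /** Writes the buffer's remaining bytes at an absolute file position. */
    static void writeAt(FileChannel ch, ByteBuffer buf, long pos) throws IOException {
        while (buf.hasRemaining()) {
            pos += ch.write(buf, pos);
        }
    }

    /** Returns the feature of record 0 after an early feature write plus a final record write. */
    static int featureAfterWrite(boolean featuresPreWritten) {
        try {
            Path path = Files.createTempFile("record-sketch", ".bin");
            try (FileChannel ch = FileChannel.open(path,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // Phase 1: the feature value 42 is written early, at its final offset.
                writeAt(ch, ByteBuffer.allocate(FEATURE_SIZE).putInt(0, 42), 0);

                // Phase 2: the record is prebuilt in memory; its feature slot is all zeroes.
                ByteBuffer record = ByteBuffer.allocate(RECORD_SIZE).putInt(FEATURE_SIZE, 7);
                if (featuresPreWritten) {
                    record.position(FEATURE_SIZE);     // drop the zeroed feature slot
                    writeAt(ch, record, FEATURE_SIZE); // write only the neighbor part
                } else {
                    writeAt(ch, record, 0);            // clobbers the early write
                }

                ByteBuffer readBack = ByteBuffer.allocate(FEATURE_SIZE);
                ch.read(readBack, 0);
                return readBack.getInt(0);
            } finally {
                Files.deleteIfExists(path);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(featureAfterWrite(false)); // feature overwritten with zeroes
        System.out.println(featureAfterWrite(true));  // feature preserved
    }
}
```

A sequential writer gets the same effect by advancing its position with seek(); a prebuilt record has no cursor to advance, so the decision has to be made when the record bytes are assembled or submitted.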

ashkrisk · 2026-01-27T09:44:48Z

...e/src/main/java/io/github/jbellis/jvector/graph/disk/RandomAccessOnDiskGraphIndexWriter.java

+     * @param ordinal the (new) ordinal whose inline features should be written
+     * @param stateMap mapping of configured {@link FeatureId}s to their {@link Feature.State}
+     *
+     * @throws IllegalStateException if no file path was provided at construction time;


nit: Does this method throw IllegalStateException? It writes to the existing RandomAccessWriter and doesn't seem to consider any file paths.

ashkrisk · 2026-01-27T10:03:59Z

...-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskParallelGraphIndexWriter.java

+    @Override
+    public synchronized void writeFeaturesInline(int ordinal, Map<FeatureId, Feature.State> stateMap) throws IOException {
+        super.writeFeaturesInline(ordinal, stateMap);
+        featuresPreWritten = true;


Looking at the sequential version, it selectively seeks past specific pre-written features by checking for nulls in the featureStateSuppliers:

jvector/jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskGraphIndexWriter.java

Lines 121 to 122 in dc54ea9

    if (supplier == null) {
        out.seek(out.position() + feature.featureSize());

The parallel version will skip over all features as long as writeFeaturesInline is called once for any ordinal, regardless of what values the user supplied in featureStateSuppliers at the time of the final write. This isn't exactly the same behavior as the sequential version.

It might be helpful to document the valid ways in which writeFeaturesInline and the final featureStateSuppliers can be combined? Especially if some combinations should be considered "undefined behavior" or subclass-dependent.
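The behavioral gap described in this comment can be shown with a small comparison of the two skip policies. This is a toy model with hypothetical names (`SkipPolicySketch`, string feature keys, `Supplier<byte[]>` values), not jvector code: one method skips exactly the features whose supplier is null, the other skips every feature once a single "pre-written" flag is set.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

/** Toy comparison of the two skip policies; names are illustrative, not jvector API. */
final class SkipPolicySketch {
    /** Per-feature policy: skip exactly the features whose supplier is null. */
    static Set<String> skippedPerFeature(Map<String, Supplier<byte[]>> suppliers) {
        Set<String> skipped = new LinkedHashSet<>();
        for (Map.Entry<String, Supplier<byte[]>> e : suppliers.entrySet()) {
            if (e.getValue() == null) {
                skipped.add(e.getKey());
            }
        }
        return skipped;
    }

    /** Global-flag policy: once anything was pre-written, skip every feature. */
    static Set<String> skippedWithGlobalFlag(Map<String, Supplier<byte[]>> suppliers,
                                             boolean featuresPreWritten) {
        Set<String> skipped = new LinkedHashSet<>();
        if (featuresPreWritten) {
            skipped.addAll(suppliers.keySet());
        }
        return skipped;
    }

    public static void main(String[] args) {
        Map<String, Supplier<byte[]>> suppliers = new LinkedHashMap<>();
        suppliers.put("A", null);                 // pre-written, nothing to supply
        suppliers.put("B", () -> new byte[] {1}); // caller still wants this written
        System.out.println(skippedPerFeature(suppliers));           // prints [A]
        System.out.println(skippedWithGlobalFlag(suppliers, true)); // prints [A, B]
    }
}
```

In this model, feature B is written under the per-feature policy but silently skipped under the global flag, which is the kind of combination the documentation could call out as supported, unsupported, or subclass-dependent.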

MarkWolters added 2 commits January 13, 2026 14:08

refactor of parallel on disk graph index writes

4947b0e

refactor of parallel graph index writes

bfa658d

MarkWolters requested a review from Copilot January 15, 2026 13:08

Copilot AI reviewed Jan 15, 2026

MarkWolters added 3 commits January 16, 2026 10:20

added unit test of 1 v 2 phase writes

d2dd213

update for review comments

d6b9c8a

fix for bug that could result in numTasks=0

60dde25

MarkWolters marked this pull request as ready for review January 16, 2026 17:21

MarkWolters requested review from jshook, marianotepper and tlwillke as code owners January 16, 2026 17:21

ashkrisk reviewed Jan 23, 2026

MarkWolters added 3 commits January 23, 2026 13:00

initial impl

031716a

updates to javadoc comments and imports

0d4c4a1

deprecated writeInline and moved error block

dc54ea9

ashkrisk reviewed Jan 27, 2026

MarkWolters added 5 commits January 27, 2026 09:03

cleanup

8395737

Merge branch 'parallel_writer_v2' into refactor_parallel_writer

c89177f

use asyncFileChannel for inline writes

175453e

preallocate disk space

f18f30c

Revert "preallocate disk space"

b23e9f6

This reverts commit f18f30c.

MarkWolters closed this Jan 28, 2026
