Skip to content

Conversation

@MarkWolters
Copy link
Contributor

@MarkWolters MarkWolters commented Jan 14, 2026

Sequence diagram of current OnDiskGraphIndexWriter usage by Cassandra:
Cassandra_OnDiskGraphIndexWriter_CurrentState_SequenceDiagram.md

Sequence diagram of proposed future OnDiskParallelGraphIndexWriter usage:
OnDiskParallelGraphIndexWriter_SequenceDiagram.md

Perf test results:
refactor_parallel.tar.gz

Refactoring of the parallelization of graph index writer.

This PR splits the parallel writer into a separate class rather than maintaining if-based branches throughout a single class (OnDiskGraphIndexWriter). A large amount of common code has been abstracted into the new RandomAccessOnDiskGraphIndexWriter making the hierarchy cleaner and easier to understand and maintain.

Previously it was discovered that calling write() after calling writeInline() would results in the features from writeInline() being overwritten with zeroes. This is resolved in this case by checking for feature provider being null, emulating how it is done in sequential writes.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 14, 2026

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

@MarkWolters MarkWolters requested a review from Copilot January 15, 2026 13:08
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the parallel writing functionality in the graph index writer by introducing a cleaner class hierarchy. The main change is the extraction of parallel writing logic into a dedicated OnDiskParallelGraphIndexWriter class, replacing the previous approach of using conditional branches within a single OnDiskGraphIndexWriter class.

Changes:

  • Introduced RandomAccessOnDiskGraphIndexWriter as a base class containing common functionality for random access writers
  • Created OnDiskParallelGraphIndexWriter as a separate class for parallel writing operations
  • Simplified OnDiskGraphIndexWriter to focus on sequential writing only
  • Updated examples and benchmarks to use the new OnDiskParallelGraphIndexWriter.Builder instead of the withParallelWrites() flag

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
RandomAccessOnDiskGraphIndexWriter.java New base class abstracting common functionality for random access graph index writers
OnDiskParallelGraphIndexWriter.java New dedicated class for parallel graph index writing with async I/O support
OnDiskGraphIndexWriter.java Simplified to sequential writing only, removing parallel write logic and extending the new base class
ParallelGraphWriter.java Updated to accept featuresPreWritten parameter for handling pre-written features
NodeRecordTask.java Enhanced to handle cases where features are pre-written via writeInline()
GraphIndexWriterTypes.java Updated enum values from ON_DISK_SEQUENTIAL/ON_DISK_PARALLEL to RANDOM_ACCESS/RANDOM_ACCESS_PARALLEL
GraphIndexWriter.java Refactored factory methods to align with new writer types
ParallelWriteExample.java Updated to use OnDiskParallelGraphIndexWriter.Builder instead of withParallelWrites()
Grid.java Changed type references from OnDiskGraphIndexWriter to OnDiskParallelGraphIndexWriter
ParallelWriteBenchmark.java Updated to instantiate appropriate writer class based on parallel flag
Comments suppressed due to low confidence (1)

jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/NodeRecordTask.java:1

  • Multiple buffer allocations are created for each node when featuresPreWritten is true. Consider object pooling or buffer reuse strategies to reduce allocation overhead, especially for large graphs with many nodes.
/*

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@MarkWolters MarkWolters marked this pull request as ready for review January 16, 2026 17:21
Copy link
Contributor

@ashkrisk ashkrisk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, mostly around leftover artifacts from the refactor.

@Override
public synchronized void writeFeaturesInline(int ordinal, Map<FeatureId, Feature.State> stateMap) throws IOException {
super.writeFeaturesInline(ordinal, stateMap);
featuresPreWritten = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not plugged in to any discussion around this, but doesn't the (now sequential) OnDiskGraphIndexWriter have the same problem with pre-written features? Was it a deliberate choice to have this flag used only by the parallel writer and not the sequential one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sequential version can use seek() to advance past the area with pre-written features but the parallel version cannot, as the records are prebuilt in memory before being written to disk. There is a version of the parallel writer that does use seek rather than prebuilding in memory but this cuts down on the parallelization and is slower in testing (see the parallel_writer_v2 branch if you are curious).

Copy link
Contributor

@ashkrisk ashkrisk Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the sequential version, it selectively seeks past specific pre-written features by checking for nulls in the featureStateSuppliers:

if (supplier == null) {
out.seek(out.position() + feature.featureSize());

The parallel version will skip over all features as long as writeFeaturesInline is called once for any ordinal, regardless of what values the user supplied in featureStateSuppliers at the time of the final write. This isn't exactly the same behavior as the sequential version.

It might be helpful to document the valid ways in which writeFeaturesInline and the final featureStateSuppliers can be combined? Especially if some combinations should be considered "undefined behavior" or subclass-dependent.

* @param ordinal the (new) ordinal whose inline features should be written
* @param stateMap mapping of configured {@link FeatureId}s to their {@link Feature.State}
*
* @throws IllegalStateException if no file path was provided at construction time;
Copy link
Contributor

@ashkrisk ashkrisk Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Does this method throw IllegalStateException? It writes to the existing RandomAccessWriter and doesn't seem to consider any file paths.

@Override
public synchronized void writeFeaturesInline(int ordinal, Map<FeatureId, Feature.State> stateMap) throws IOException {
super.writeFeaturesInline(ordinal, stateMap);
featuresPreWritten = true;
Copy link
Contributor

@ashkrisk ashkrisk Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the sequential version, it selectively seeks past specific pre-written features by checking for nulls in the featureStateSuppliers:

if (supplier == null) {
out.seek(out.position() + feature.featureSize());

The parallel version will skip over all features as long as writeFeaturesInline is called once for any ordinal, regardless of what values the user supplied in featureStateSuppliers at the time of the final write. This isn't exactly the same behavior as the sequential version.

It might be helpful to document the valid ways in which writeFeaturesInline and the final featureStateSuppliers can be combined? Especially if some combinations should be considered "undefined behavior" or subclass-dependent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants