-
Notifications
You must be signed in to change notification settings - Fork 146
Refactor parallel writer #595
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Before you submit for review:
If you did not complete any of these, then please explain below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the parallel writing functionality in the graph index writer by introducing a cleaner class hierarchy. The main change is the extraction of parallel writing logic into a dedicated OnDiskParallelGraphIndexWriter class, replacing the previous approach of using conditional branches within a single OnDiskGraphIndexWriter class.
Changes:
- Introduced
RandomAccessOnDiskGraphIndexWriteras a base class containing common functionality for random access writers - Created
OnDiskParallelGraphIndexWriteras a separate class for parallel writing operations - Simplified
OnDiskGraphIndexWriterto focus on sequential writing only - Updated examples and benchmarks to use the new
OnDiskParallelGraphIndexWriter.Builderinstead of thewithParallelWrites()flag
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| RandomAccessOnDiskGraphIndexWriter.java | New base class abstracting common functionality for random access graph index writers |
| OnDiskParallelGraphIndexWriter.java | New dedicated class for parallel graph index writing with async I/O support |
| OnDiskGraphIndexWriter.java | Simplified to sequential writing only, removing parallel write logic and extending the new base class |
| ParallelGraphWriter.java | Updated to accept featuresPreWritten parameter for handling pre-written features |
| NodeRecordTask.java | Enhanced to handle cases where features are pre-written via writeInline() |
| GraphIndexWriterTypes.java | Updated enum values from ON_DISK_SEQUENTIAL/ON_DISK_PARALLEL to RANDOM_ACCESS/RANDOM_ACCESS_PARALLEL |
| GraphIndexWriter.java | Refactored factory methods to align with new writer types |
| ParallelWriteExample.java | Updated to use OnDiskParallelGraphIndexWriter.Builder instead of withParallelWrites() |
| Grid.java | Changed type references from OnDiskGraphIndexWriter to OnDiskParallelGraphIndexWriter |
| ParallelWriteBenchmark.java | Updated to instantiate appropriate writer class based on parallel flag |
Comments suppressed due to low confidence (1)
jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/NodeRecordTask.java:1
- Multiple buffer allocations are created for each node when
featuresPreWrittenis true. Consider object pooling or buffer reuse strategies to reduce allocation overhead, especially for large graphs with many nodes.
/*
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
...e/src/main/java/io/github/jbellis/jvector/graph/disk/RandomAccessOnDiskGraphIndexWriter.java
Outdated
Show resolved
Hide resolved
jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/GraphIndexWriterTypes.java
Show resolved
Hide resolved
jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/GraphIndexWriter.java
Show resolved
Hide resolved
benchmarks-jmh/src/main/java/io/github/jbellis/jvector/bench/ParallelWriteBenchmark.java
Show resolved
Hide resolved
ashkrisk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments, mostly around leftover artifacts from the refactor.
jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/GraphIndexWriter.java
Outdated
Show resolved
Hide resolved
jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskGraphIndexWriter.java
Outdated
Show resolved
Hide resolved
...e/src/main/java/io/github/jbellis/jvector/graph/disk/RandomAccessOnDiskGraphIndexWriter.java
Show resolved
Hide resolved
...-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskParallelGraphIndexWriter.java
Outdated
Show resolved
Hide resolved
...-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskParallelGraphIndexWriter.java
Outdated
Show resolved
Hide resolved
...-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskParallelGraphIndexWriter.java
Show resolved
Hide resolved
...s/src/test/java/io/github/jbellis/jvector/graph/disk/TestOnDiskParallelGraphIndexWriter.java
Outdated
Show resolved
Hide resolved
| @Override | ||
| public synchronized void writeFeaturesInline(int ordinal, Map<FeatureId, Feature.State> stateMap) throws IOException { | ||
| super.writeFeaturesInline(ordinal, stateMap); | ||
| featuresPreWritten = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not plugged in to any discussion around this, but doesn't the (now sequential) OnDiskGraphIndexWriter have the same problem with pre-written features? Was it a deliberate choice to have this flag used only by the parallel writer and not the sequential one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sequential version can use seek() to advance past the area with pre-written features but the parallel version cannot, as the records are prebuilt in memory before being written to disk. There is a version of the parallel writer that does use seek rather than prebuilding in memory but this cuts down on the parallelization and is slower in testing (see the parallel_writer_v2 branch if you are curious).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the sequential version, it selectively seeks past specific pre-written features by checking for nulls in the featureStateSuppliers:
jvector/jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskGraphIndexWriter.java
Lines 121 to 122 in dc54ea9
| if (supplier == null) { | |
| out.seek(out.position() + feature.featureSize()); |
The parallel version will skip over all features as long as writeFeaturesInline is called once for any ordinal, regardless of what values the user supplied in featureStateSuppliers at the time of the final write. This isn't exactly the same behavior as the sequential version.
It might be helpful to document the valid ways in which writeFeaturesInline and the final featureStateSuppliers can be combined? Especially if some combinations should be considered "undefined behavior" or subclass-dependent.
| * @param ordinal the (new) ordinal whose inline features should be written | ||
| * @param stateMap mapping of configured {@link FeatureId}s to their {@link Feature.State} | ||
| * | ||
| * @throws IllegalStateException if no file path was provided at construction time; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Does this method throw IllegalStateException? It writes to the existing RandomAccessWriter and doesn't seem to consider any file paths.
| @Override | ||
| public synchronized void writeFeaturesInline(int ordinal, Map<FeatureId, Feature.State> stateMap) throws IOException { | ||
| super.writeFeaturesInline(ordinal, stateMap); | ||
| featuresPreWritten = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the sequential version, it selectively seeks past specific pre-written features by checking for nulls in the featureStateSuppliers:
jvector/jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskGraphIndexWriter.java
Lines 121 to 122 in dc54ea9
| if (supplier == null) { | |
| out.seek(out.position() + feature.featureSize()); |
The parallel version will skip over all features as long as writeFeaturesInline is called once for any ordinal, regardless of what values the user supplied in featureStateSuppliers at the time of the final write. This isn't exactly the same behavior as the sequential version.
It might be helpful to document the valid ways in which writeFeaturesInline and the final featureStateSuppliers can be combined? Especially if some combinations should be considered "undefined behavior" or subclass-dependent.
Sequence diagram of current OnDiskGraphIndexWriter usage by Cassandra:
Cassandra_OnDiskGraphIndexWriter_CurrentState_SequenceDiagram.md
Sequence diagram of proposed future OnDiskParallelGraphIndexWriter usage:
OnDiskParallelGraphIndexWriter_SequenceDiagram.md
Perf test results:
refactor_parallel.tar.gz
Refactoring of the parallelization of graph index writer.
This PR splits the parallel writer into a separate class rather than maintaining
if-based branches throughout a single class (OnDiskGraphIndexWriter). A large amount of common code has been abstracted into the new RandomAccessOnDiskGraphIndexWriter making the hierarchy cleaner and easier to understand and maintain.Previously it was discovered that calling write() after calling writeInline() would results in the features from writeInline() being overwritten with zeroes. This is resolved in this case by checking for feature provider being null, emulating how it is done in sequential writes.