
chore(cuda): stream migration refactoring #2662

Merged
gaxiom merged 45 commits into develop-v2.0.0-rc.1 from chore/stream-migration on Apr 13, 2026

Conversation

@gaxiom (Contributor) commented Apr 6, 2026

Closes INT-6464

CUDA Explicit Stream Migration (OpenVM side)

Companion to stark-backend#317. Adds cudaStream_t stream to all OpenVM CUDA launchers and FFI wrappers, injects DeviceContext through the VM builder and chip construction APIs, removes all PTDS references, and lifts DeviceCtx to a first-class associated type on ProverDevice — eliminating MaybeDeviceContext, #[cfg] duplication, and conditional device_ctx: Option<&DeviceContext> parameters.

224 files changed, +2902 / -1353

See cuda-stream-migration-design.md for full design rationale.


CUDA launchers (~67 .cu files + 2 .cuh headers)

Every extern "C" launcher gains cudaStream_t stream as final parameter. All kernel launches use <<<grid, block, 0, stream>>>. All CUB calls (DeviceScan, DeviceReduce, DeviceMergeSort) pass explicit stream. scan.cuh and affine_scan.cuh had = cudaStreamPerThread default parameters removed — stream is now required.


FFI wrappers (15 cuda_abi.rs / abi.rs files)

Every extern "C" declaration and safe Rust wrapper gains stream: cudaStream_t. Covers crates/vm, crates/circuits/primitives, crates/circuits/poseidon2-air, crates/recursion (6 files), and all extensions (rv32im, keccak256, bigint, sha2, deferral).
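A minimal sketch of the wrapper pattern from the Rust side, mirroring the launcher signature; the function name and arguments are hypothetical, and cudaStream_t is aliased by hand here rather than imported from the real bindings:

```rust
use std::ffi::c_void;

// Stand-in for the raw stream handle exposed by the CUDA runtime bindings.
#[allow(non_camel_case_types)]
pub type cudaStream_t = *mut c_void;

extern "C" {
    // Hypothetical launcher: `stream` is always the final parameter,
    // matching the `.cu` side's <<<grid, block, 0, stream>>> launches.
    fn example_tracegen_launch(d_trace: *mut u32, height: usize, stream: cudaStream_t) -> i32;
}

/// Safe-ish wrapper: the caller must now supply a stream explicitly;
/// there is no cudaStreamPerThread default left to fall back on.
pub unsafe fn example_tracegen(
    d_trace: *mut u32,
    height: usize,
    stream: cudaStream_t,
) -> Result<(), i32> {
    match example_tracegen_launch(d_trace, height, stream) {
        0 => Ok(()),
        code => Err(code),
    }
}
```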


VM builder — DeviceContext injection

VmBuilder::create_chip_complex gains device: &E::PD parameter. VirtualMachine::new passes engine.device() through. 25 implementations updated (19 CPU impls ignore it, 6 GPU impls extract DeviceContext).
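A sketch of the shape of this change, with simplified stand-ins for the real stark-backend and OpenVM traits; only the new device parameter and its threading are the point:

```rust
// Simplified stand-ins for the engine/device traits (the real ones live in
// stark-backend and openvm).
pub trait ProverDevice {}

pub trait StarkEngine {
    type PD: ProverDevice;
    fn device(&self) -> &Self::PD;
}

pub trait VmBuilder<E: StarkEngine> {
    type ChipComplex;
    // New parameter: GPU builders extract the DeviceContext from the device,
    // CPU builders simply ignore it.
    fn create_chip_complex(&self, device: &E::PD) -> Self::ChipComplex;
}

pub struct VirtualMachine<E, C> {
    pub engine: E,
    pub chip_complex: C,
}

impl<E: StarkEngine, C> VirtualMachine<E, C> {
    pub fn new<B: VmBuilder<E, ChipComplex = C>>(engine: E, builder: &B) -> Self {
        // The engine-owned device is threaded straight through to the builder.
        let chip_complex = builder.create_chip_complex(engine.device());
        Self { engine, chip_complex }
    }
}
```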


Stateful GPU chip constructors

GPU chips that allocate persistent DeviceBuffers now store device_ctx: DeviceContext and use _on allocation variants:

  • BitwiseOperationLookupChipGPU — histogram accumulator
  • VariableRangeCheckerChipGPU — range check accumulator
  • RangeTupleCheckerChipGPU — range tuple accumulator
  • Poseidon2ChipGPU — shared hash records buffer + index counter
  • MemoryMerkleTree — merkle tree device state

The Chip::generate_proving_ctx(&self, records) trait method is unchanged — chips access self.device_ctx internally.
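A sketch of the stateful-chip pattern for one of the chips above; the buffer type, table size, and constructor signature are illustrative stand-ins rather than the real openvm CUDA types:

```rust
use std::marker::PhantomData;

// Illustrative stand-ins for the CUDA context and buffer types.
#[derive(Clone)]
pub struct DeviceContext; // owns the stream (Arc-backed in the real type)

pub struct DeviceBuffer<T>(PhantomData<T>);

impl<T> DeviceBuffer<T> {
    // `_on` variant: allocates using the given context's stream rather than
    // an implicit per-thread default stream.
    pub fn with_capacity_on(_len: usize, _device_ctx: &DeviceContext) -> Self {
        DeviceBuffer(PhantomData)
    }
    pub fn fill_zero_on(&self, _device_ctx: &DeviceContext) -> Result<(), i32> {
        Ok(()) // memory-pool buffers hold stale data, so zeroing is required
    }
}

pub struct BitwiseOperationLookupChipGPU {
    // Persistent histogram accumulator, allocated once at construction.
    pub count: DeviceBuffer<u32>,
    // Stored so generate_proving_ctx can launch on the right stream without
    // any change to the Chip trait itself.
    pub device_ctx: DeviceContext,
}

impl BitwiseOperationLookupChipGPU {
    pub fn new(num_bits: usize, device_ctx: DeviceContext) -> Self {
        let count = DeviceBuffer::with_capacity_on(1 << (2 * num_bits), &device_ctx);
        count.fill_zero_on(&device_ctx).unwrap();
        Self { count, device_ctx }
    }
}
```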


VM system CUDA modules (crates/vm/src/system/cuda/)

All GPU system modules use self.device_ctx.stream.as_raw(): poseidon2.rs, memory.rs, boundary.rs, phantom.rs, program.rs, merkle_tree/mod.rs.


DeviceCtx as associated type on ProverDevice (87 files, -630 net lines)

The final refactor lifts DeviceContext from a #[cfg(feature = "cuda")]-gated parameter to a first-class associated type on ProverDevice (defined in stark-backend#317):

```rust
// In ProverDevice (stark-backend):
type DeviceCtx: Clone + Send + Sync;
fn device_ctx(&self) -> &Self::DeviceCtx;
// GpuDevice: DeviceCtx = DeviceContext
// CpuDevice: DeviceCtx = ()
```

This eliminates:

  • MaybeDeviceContext trait + all 3 impls + device_ctx_for_engine() helper — deleted entirely
  • ~6 pairs of #[cfg(feature = "cuda")] / #[cfg(not)] duplicated methods in continuations provers (inner/mod.rs, deferral/hook/mod.rs, deferral/inner/mod.rs) — unified into single versions
  • ~30 #[cfg(feature = "cuda")] device_ctx: Option<&DeviceContext> conditional parameters — replaced with unconditional &DC generic
  • cfg-duplicated VerifierTraceGen trait methods in recursion/src/system/mod.rs — unified

Callers now use engine.device().device_ctx() directly. The EngineDeviceCtx<E> type alias avoids spelling out the full associated type path.
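A plausible sketch of the alias and its call-site effect, again using simplified stand-ins for the stark-backend traits:

```rust
pub trait ProverDevice {
    type DeviceCtx: Clone + Send + Sync;
    fn device_ctx(&self) -> &Self::DeviceCtx;
}

pub trait StarkEngine {
    type PD: ProverDevice;
    fn device(&self) -> &Self::PD;
}

// The alias collapses the otherwise verbose associated-type path.
pub type EngineDeviceCtx<E> =
    <<E as StarkEngine>::PD as ProverDevice>::DeviceCtx;

// Caller side: no cfg gates and no Option; the same code serves CPU
// (DeviceCtx = ()) and GPU (DeviceCtx = DeviceContext).
pub fn engine_device_ctx<E: StarkEngine>(engine: &E) -> &EngineDeviceCtx<E> {
    engine.device().device_ctx()
}
```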


Recursion — VerifierTraceGen and module tracegen

VerifierTraceGen and InnerTraceGen traits gain a DC generic parameter for the device context type. generate_proving_ctxs accepts &DC unconditionally. Stream synchronization at proof boundaries via device_ctx.stream.synchronize().
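An illustrative sketch of the trait shape; the context and proof types are stand-ins, and the real trait carries more methods and generics:

```rust
// Simplified stand-ins for the recursion types.
pub struct Proof;
pub struct ProvingCtx;

pub trait VerifierTraceGen<DC> {
    // &DC is unconditional: () for CPU devices, the GPU context otherwise,
    // so no #[cfg(feature = "cuda")] parameter duplication is needed.
    // GPU impls end with device_ctx.stream.synchronize() at proof
    // boundaries so nothing is dropped with transfers still in flight.
    fn generate_proving_ctxs(&self, proofs: &[Proof], device_ctx: &DC) -> Vec<ProvingCtx>;
}
```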


Continuations + guest verifier

InnerTraceGen, DeferralHookTraceGen, DeferralInnerTraceGen, RootTraceGen — all accept &DC and use the engine-owned context. agg_prove, from_pk, new — single versions, no #[cfg] duplication. Guest verifier circuit trace generation passes device context through.


Extensions

All extension GPU modules pass stream to FFI calls: rv32im (all ALU/branch/load/store cuda modules), sha2, keccak256, bigint, deferral.


Removed

  • --default-stream=per-thread from scripts/bin/ptx_details.rs
  • All cudaStreamPerThread imports and usages
  • All current_stream_sync() / current_stream_id() usages
  • MaybeDeviceContext trait + device_ctx_for_engine() helper
  • Temporary DeviceContext escape hatches (fresh streams created outside engine)
  • ~400 lines of #[cfg] duplication in continuations provers

branch-rebase-bot Bot force-pushed the develop-v2.0.0-rc.1 branch from 078d162 to a2b01fb on April 6, 2026 20:21
gaxiom force-pushed the chore/stream-migration branch from 7201f5f to 16724df on April 7, 2026 19:41
gaxiom marked this pull request as ready for review on April 7, 2026 21:59
branch-rebase-bot Bot force-pushed the develop-v2.0.0-rc.1 branch 3 times, most recently from dd42a19 to 312fe84 on April 8, 2026 19:44
gaxiom force-pushed the chore/stream-migration branch from 83e2cd1 to 7b0f7ae on April 8, 2026 20:53
shuklaayush force-pushed the develop-v2.0.0-rc.1 branch from a5750ef to 534de3a on April 8, 2026 21:19
gaxiom force-pushed the chore/stream-migration branch from 7b0f7ae to 0d0d427 on April 9, 2026 15:39

github-actions Bot commented Apr 9, 2026

Claude finished @gaxiom's task in 8m 53s


Code Review: chore(cuda): stream migration refactoring

  • Gather context — review latest commits and diff against base branch
  • Review core infrastructure changes (primitives, utils, hybrid_chip)
  • Review CUDA kernel and FFI wrapper changes
  • Review VM builder and DeviceContext injection
  • Review recursion crate changes
  • Review continuations crate changes
  • Review extension changes
  • Post review summary

Summary

220 files changed across 29 commits. The migration from per-thread-default-stream (PTDS) to explicit DeviceContext stream passing is clean, consistent, and correct. No critical issues found. LGTM.


Architecture

The PR establishes a well-structured pattern for explicit CUDA stream management:

  1. MaybeDeviceContext trait (recursion/src/system/mod.rs:61-84): GpuDevice → Some(&device_ctx), CpuDevice/ReferenceDevice → None. The device_ctx_for_engine helper provides clean extraction from any engine.

  2. GPU chip pattern: All GPU chips (VariableRangeCheckerChipGPU, BitwiseOperationLookupChipGPU, RangeTupleCheckerChipGPU, Poseidon2ChipGPU, MemoryMerkleTree, etc.) store DeviceContext and use _on(&self.device_ctx) for buffer ops and self.device_ctx.stream.as_raw() for kernel launches.

  3. CUDA FFI: All extern "C" launchers (~65 .cu files) accept cudaStream_t stream as the last parameter. All kernel launches use 4-arg syntax <<<grid, block, shmem, stream>>>. All CUB calls pass the explicit stream. Headers (scan.cuh, affine_scan.cuh) have no default parameter values.

  4. VM builder: VmBuilder::create_chip_complex gains device: &E::PD, threaded from engine.device() in VirtualMachine::new. All 25 implementations updated.

  5. Continuations: All four trace generators (InnerTraceGen, RootTraceGen, DeferralHookTraceGen, DeferralInnerTraceGen) accept #[cfg(feature = "cuda")] device_ctx: Option<&DeviceContext>. GPU impls .expect() with descriptive messages.

  6. Synchronization: Serial-on-same-stream approach with device_ctx.stream.synchronize() at proof boundaries. Well-documented with // PERF and // NOTE comments.


Verification Results

  • Zero leftover cudaStreamPerThread references — confirmed via grep across the entire workspace
  • Zero leftover current_stream_sync() / current_stream_id() references
  • All FFI wrappers in cuda_abi.rs files consistently pass stream: cudaStream_t as the final parameter
  • All buffer allocations use _on() variants (with_capacity_on, to_device_on, to_host_on, fill_zero_on)
  • All extensions (rv32im, sha2, keccak256, bigint, deferral) properly pass ctx.stream.as_raw() to FFI calls
  • DeviceContext::clone() is safe — StreamGuard uses Arc internally, so clones share the same underlying CUDA stream handle without double-free risk
  • Latest commit (0d0d427) is a clean rename of ctx → device_ctx for clarity — no logic changes

Observations (non-blocking)

1. #[cfg(feature = "cuda")] duplication in prover constructors

Files like continuations/src/prover/inner/mod.rs, deferral/hook/mod.rs, and deferral/inner/mod.rs have duplicate new(), from_pk(), agg_prove(), and agg_prove_no_def() implementations — one for #[cfg(not(feature = "cuda"))] and one for #[cfg(feature = "cuda")]. The bodies are identical; only the where-clause differs (E::PD: MaybeDeviceContext). Three prover types × four methods = ~12 duplicated function bodies. A macro or conditional trait bound could reduce this in a follow-up.

2. touchemall uses device_synchronize() (global sync)

primitives/src/utils.rs — The debug utility calls device_synchronize() then reads the trace on a fresh temporary stream. A stream-specific stream.synchronize() on the proving context's stream would be more precise. Since this is behind feature = "touchemall" and is a debug-only utility, it's fine as-is.

3. .unwrap() on GPU allocation in chip constructors

GPU chips use .unwrap() on fill_zero_on(), to_device_on(), and with_capacity_on() calls. These would only fail on catastrophic GPU errors (OOM, driver crash), so panicking is defensible. If graceful GPU error recovery is ever needed, these would need to propagate Result.


Verdict

LGTM. The DeviceContext threading is correct and consistent across all ~220 changed files. The synchronization model is sound. No correctness, safety, or lifetime issues found. Benchmarks show no regression. Clean removal of all PTDS references confirmed.


@stephenh-axiom-xyz (Contributor) left a comment

Got about 25% through, but in general I think the pattern we have where DeviceContext only works with GpuDevice is really hurting us. I think DeviceContext should be a per-device concept, and this would allow us to avoid essentially everywhere we're doing specific #[cfg(not(feature = "cuda"))]

Comment thread crates/circuits/primitives/src/bitwise_op_lookup/cuda.rs
```rust
// Before: implicit per-thread default stream.
let trace = DeviceMatrix::<F>::with_capacity(self.count.len(), N + 1);
let d_sizes = self.sizes.to_device().unwrap();
// After: explicit device context.
let trace = DeviceMatrix::<F>::with_capacity_on(self.count.len(), N + 1, &self.device_ctx);
trace.buffer().fill_zero_on(&self.device_ctx).unwrap();
```
Contributor:
nit: Same comment as above, do we need to zero this?

@gaxiom (Contributor, Author):
Same as above — needed because GPU memory pool buffers contain stale data. Added comments explaining the rationale.

Comment thread crates/circuits/primitives/src/var_range/cuda.rs
Comment thread crates/circuits/primitives/src/utils.rs Outdated
Comment thread crates/continuations/src/prover/deferral/hook/mod.rs Outdated
Comment thread crates/recursion/src/gkr/mod.rs Outdated
Comment thread crates/continuations/src/circuit/deferral/hook/trace.rs Outdated
Comment thread crates/recursion/src/system/mod.rs Outdated
Comment thread crates/recursion/src/system/mod.rs Outdated
Comment thread crates/recursion/src/system/mod.rs Outdated

gaxiom and others added 21 commits April 13, 2026 16:31
… duplication in provers

- Add ctx.stream.synchronize() before return in GPU RootTraceGen methods
  to prevent async race when DeviceContext is dropped with in-flight transfers
- Merge identical new() and from_pk() cfg variants in InnerAggregationProver,
  DeferralInnerProver, and DeferralHookProver (removed unnecessary
  MaybeDeviceContext bound from methods that don't use device_ctx_for_engine)
- Remove redundant #[cfg(feature = "cuda")] on device_ctx_for_engine() calls
  inside methods already gated by #[cfg(feature = "cuda")]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…or cuda builds

commit_child_vk requires E::PD: MaybeDeviceContext when the cuda feature
is enabled, so the cfg duplication on new() and from_pk() cannot be removed.
Restores the two-variant pattern for these methods.

The agg_prove/prove cleanup (removing redundant #[cfg] on device_ctx_for_engine
inside already-cfg'd methods) remains.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nstead of creating temp streams

The GPU RootTraceGen impl was creating throwaway DeviceContext with fresh
streams for H2D transfers. This is inconsistent with the rest of the
codebase (InnerTraceGen, DeferralHookTraceGen) which accept the engine's
DeviceContext.

- Add #[cfg(feature = "cuda")] device_ctx parameter to RootTraceGen trait
- GPU impl uses the passed-in context instead of allocating a new stream
- CPU impl ignores the parameter
- Caller in prover/root/trace.rs passes device_ctx through
- Removes the synchronize() workaround since the stream is no longer temporary

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename all DeviceContext-typed variables, parameters, and struct fields
from `ctx` to `device_ctx` across the entire openvm codebase. This
disambiguates from other context types (ProvingContext, AirProvingContext,
etc.) that coexist in the same files.

Also updates Cargo.lock for stark-backend device_ctx rename.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Make device context a first-class associated type on ProverDevice so
every device carries its own context type — GpuDeviceCtx for GPU, () for
CPU. This eliminates MaybeDeviceContext, device_ctx_for_engine, and all
#[cfg(feature = "cuda")] device_ctx parameter duplication.

Key changes:
- ProverDevice gains DeviceCtx associated type + device_ctx() method
- VerifierTraceGen, InnerTraceGen, RootTraceGen, DeferralHookTraceGen,
  DeferralInnerTraceGen, DeferredVerifyTraceGen all gain DC generic
- ~12 cfg-duplicated method pairs unified into single methods
- DeviceContext renamed to GpuDeviceCtx
- GkrExpBitsLenSink + WhirExpBitsLenSink consolidated into ExpBitsLenSink
- gpu_ctx variables renamed for consistency
- Zero-fill comments added explaining necessity

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Restore Cargo.lock to avoid bumping AWS deps beyond rustc 1.90.0 MSRV,
  only updating stark-backend git refs
- Rename `ctx` -> `device_ctx` in gkr/mod.rs GPU tracegen (missed in
  the previous rename pass, flagged by CI review)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix generate_proving_ctxs_base calls in CUDA test block: None -> &(),
  Some(&ctx) -> &ctx
- Fix generate_proving_ctx turbofish: ::<PB> -> ::<PB, _> for new DC
  generic parameter
- Use .device_ctx field directly instead of trait method to avoid
  ambiguous ProverDevice resolution
- Remove unused ProverDevice imports in test files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This reverts commit 824c41ab46c0567c9d0cad3d264f539d140d9cdc.
gaxiom force-pushed the chore/stream-migration branch from ad81c01 to 36ad944 on April 13, 2026 16:31
github-actions Bot commented Apr 13, 2026

Claude finished @gaxiom's task in 4m 4s


Code Review: chore(cuda): stream migration refactoring

  • Gather context — review latest commits and diff against base branch
  • Review core infrastructure changes (primitives, utils, hybrid_chip)
  • Review CUDA kernel and FFI wrapper changes
  • Review VM builder and DeviceContext injection
  • Review recursion crate changes
  • Review continuations crate changes
  • Review extension and test changes
  • Post review summary

Summary

45 commits, 221 files changed (+2870/-1396). The migration from PTDS to explicit DeviceContext stream passing is clean, consistent, and correct. Three prior review rounds have been substantively addressed. The most recent commit (36ad944 resolve comment) partially resolves one of the two approval-pending style nits from @stephenh-axiom-xyz, but in a way that differs from the reviewer's concrete suggestion — flagging that below. LGTM overall; see notes below for follow-ups.


Architecture (unchanged from prior approvals)

| Layer | Pattern |
| --- | --- |
| Device context type | DeviceCtx associated type on ProverDevice: () for CPU, GpuDeviceCtx for GPU |
| VM builder | VmBuilder::create_chip_complex takes device_ctx: &EngineDeviceCtx<E> (crates/vm/src/arch/config.rs:87-95) |
| GPU chip storage | device_ctx: GpuDeviceCtx field, _on(&self.device_ctx) for buffer ops |
| CUDA FFI | cudaStream_t stream as last parameter in all extern "C" functions |
| Prover late binding | No DC generic stored on provers; extracted at prove time via engine.device().device_ctx() |
| Trait-level generics | VerifierTraceGen<PB, SC, DC>, InnerTraceGen<PB, DC>, RootTraceGen<PB, DC>, DeferralHookTraceGen<PB, DC>, DeferralInnerTraceGen<PB, DC>: unified, no cfg duplication |
| Module context | TraceGenModule::ModuleSpecificCtx GAT per module |
| Synchronization | Serial-on-same-stream with stream.synchronize() at proof boundaries |

Verification Results

  • Zero remaining cudaStreamPerThread / current_stream_sync() / current_stream_id() / MaybeDeviceContext / ExpBitsLenSink / GpuDeviceCtx::for_current_device() references — confirmed via grep
  • All CUDA kernel launchers use 4-arg syntax <<<grid, block, 0, stream>>>
  • All FFI wrappers consistently pass stream: cudaStream_t as final parameter
  • All buffer allocations use _on() variants (with_capacity_on, to_device_on, fill_zero_on)
  • scan.cuh (crates/recursion/cuda/include/scan.cuh:32,47) and affine_scan.cuh — default cudaStreamPerThread parameters correctly removed; stream required
  • scripts/bin/ptx_details.rs: --default-stream=per-thread removed
  • All GPU chip zero-fills have explanatory comments (e.g., crates/circuits/primitives/src/var_range/cuda.rs:66,77)
  • DeviceContext::clone() is safe — StreamGuard uses Arc internally

Issues with the latest commit (36ad944 resolve comment)

The reviewer's approval left two concrete style nits. The most recent commit addresses one of them, but in a way that does not match the reviewer's explicit suggestion:

1. apply_merkle_precomputation — reviewer's suggested pattern not applied (crates/recursion/src/system/mod.rs:682-705)

The reviewer asked for a single unified function with a #[cfg(feature = "cuda")]-gated parameter:

```rust
fn apply_merkle_precomputation(
    proof: &Proof<BabyBearPoseidon2Config>,
    preflight: &mut Preflight,
    #[cfg(feature = "cuda")] device_ctx: &GpuDeviceCtx,
) { ... }
```

The commit instead split the function into two separate functions — apply_merkle_precomputation_cpu (always-on, line 682) and apply_merkle_precomputation (cuda-gated, line 693) — and deleted the private MerklePrecomputationDeviceCtx dispatch trait. The bodies are identical except for one line (compute_merkle_precomputation vs compute_merkle_precomputation_cuda), so this reintroduces ~11 lines of duplication.

The reviewer's #[cfg]-on-parameter pattern is cleaner, avoids the _cpu suffix, and avoids the duplicate 4-field assignment body. Fix this →

2. _device_ctx prefix still present (crates/recursion/src/system/mod.rs:1088)

Still _device_ctx: &() with the _ prefix. Reviewer explicitly asked for #[cfg_attr(not(feature = "cuda"), allow(unused_variables))] on device_ctx instead. Unaddressed by the latest commit. Fix this →
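For reference, the suggested attribute is standard Rust (lint attributes are permitted on function parameters); a minimal illustration with a hypothetical function:

```rust
// Hypothetical CPU-side function. The parameter keeps its real name; the
// lint is silenced only in the configuration where it goes unused, instead
// of renaming the parameter to `_device_ctx`.
fn generate_trace(
    #[cfg_attr(not(feature = "cuda"), allow(unused_variables))] device_ctx: &(),
) {
    #[cfg(feature = "cuda")]
    {
        let _ = device_ctx; // only the cuda path touches the context
    }
}
```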


Other findings (non-blocking)

1. Missing size_of::<F>() == size_of::<u32>() assertion in two primitives

var_range/cuda.rs:49 correctly asserts the size match before casting DeviceBuffer<F> to *const u32. The same cast is performed unchecked in:

  • bitwise_op_lookup — crates/circuits/primitives/src/cuda_abi.rs:48 (d_count.as_ptr() as *const u32)
  • range_tuple — crates/circuits/primitives/src/cuda_abi.rs:111 (d_count.as_ptr() as *const u32)

Since F = BabyBear is 32-bit today, this is currently a non-issue, but for consistency the same assert_eq!(size_of::<F>(), size_of::<u32>()) should be added to the generate_proving_ctx bodies of those two chips. Not introduced by this PR, but worth flagging while the file is under review.
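A sketch of the suggested guard; the helper name is hypothetical and the assertion is the point:

```rust
use std::mem::size_of;

// Guard the F -> u32 reinterpret-cast before handing the pointer to FFI.
// BabyBear is 32-bit today, so this only documents (and enforces) the
// assumption rather than changing behavior.
fn as_u32_ptr<F>(ptr: *const F) -> *const u32 {
    assert_eq!(size_of::<F>(), size_of::<u32>());
    ptr as *const u32
}
```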

2. Stacking ModuleSpecificCtx is an owned type (crates/recursion/src/stacking/mod.rs:585)

Defines type ModuleSpecificCtx<'a> = GpuDeviceCtx (owned), while GkrModule (crates/recursion/src/gkr/mod.rs:665) uses a tuple of references (&'a GpuExpBitsLenTraceGenerator, &'a GpuDeviceCtx). Since the trait method takes ctx: &Self::ModuleSpecificCtx<'_>, the parameter type for Stacking is &GpuDeviceCtx — correct, just inconsistent with the rest of the codebase. Works fine given GpuDeviceCtx: Clone with Arc internals.
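A sketch contrasting the two GAT bindings, with stand-in types; only the owned-versus-references distinction is the point:

```rust
// Stand-ins for the real recursion types.
pub struct GpuDeviceCtx;
pub struct GpuExpBitsLenTraceGenerator;

pub trait TraceGenModule {
    type ModuleSpecificCtx<'a>;
    fn generate_traces<'a>(&self, ctx: &Self::ModuleSpecificCtx<'a>);
}

pub struct Stacking;
impl TraceGenModule for Stacking {
    // Owned context type: the method parameter is still &GpuDeviceCtx.
    type ModuleSpecificCtx<'a> = GpuDeviceCtx;
    fn generate_traces<'a>(&self, _ctx: &GpuDeviceCtx) {}
}

pub struct GkrModule;
impl TraceGenModule for GkrModule {
    // Tuple of references, the convention used elsewhere in the crate.
    type ModuleSpecificCtx<'a> = (&'a GpuExpBitsLenTraceGenerator, &'a GpuDeviceCtx);
    fn generate_traces<'a>(&self, _ctx: &(&'a GpuExpBitsLenTraceGenerator, &'a GpuDeviceCtx)) {}
}
```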

3. #[cfg] bifurcation inside non-cfg'd CPU VerifierTraceGen impl (crates/recursion/src/system/mod.rs:1104-1113, 1120-1123)

The CPU VerifierTraceGen impl's preflight thread closure contains a #[cfg(feature = "cuda")] / #[cfg(not(feature = "cuda"))] block. With cuda enabled, the CPU impl runs run_preflight_without_merkle and applies merkle precomputation separately afterward (calling apply_merkle_precomputation_cpu). With cuda disabled, it calls run_preflight directly. The feature-flag bifurcation inside an impl that is otherwise generic over DC = () is subtle — behavior depends on the feature flag rather than the type parameter. Worth documenting inline for future maintainers.

4. .unwrap() on GPU allocations (pre-existing pattern)

All GPU chips use .unwrap() on fill_zero_on(), to_device_on(), with_capacity_on(). These would only fail on catastrophic GPU errors (OOM, driver crash), so panicking is defensible. Not introduced by this PR.

5. touchemall uses device_synchronize() (global sync) (crates/circuits/primitives/src/utils.rs:140)

Debug utility behind feature = "touchemall". Global sync is safer here since the utility may need to read data written by any stream. Fine as-is.


Verdict

LGTM. The DeviceContext threading is correct and consistent across all 221 changed files. The DeviceCtx associated type on ProverDevice design elegantly resolves the cfg-duplication concern from earlier review rounds. The late-binding pattern for provers is clean. Benchmarks show no regression. Clean removal of all PTDS references confirmed.

Pending items:

  • Address reviewer's #[cfg]-on-parameter suggestion for apply_merkle_precomputation (commit 36ad944 took a different approach)
  • Replace _device_ctx prefix with #[cfg_attr(not(feature = "cuda"), allow(unused_variables))] at system/mod.rs:1088
  • Update the stark-backend reference in Cargo.toml from branch chore/stream-migration to the v2.0.0-beta.2 tag before final merge

@github-actions

| group | app.proof_time_ms | app.cycles | leaf.proof_time_ms |
| --- | --- | --- | --- |
| fibonacci | 3,849 | 12,000,265 | 959 |
| keccak | 18,626 | 18,655,329 | 3,319 |
| sha2_bench | 9,847 | 14,793,960 | 1,393 |
| regex | 1,418 | 4,137,067 | 376 |
| ecrecover | 651 | 123,583 | 276 |
| pairing | 907 | 1,745,757 | 284 |
| kitchen_sink | 2,139 | 2,579,903 | 432 |

Note: cells_used metrics omitted because CUDA tracegen does not expose unpadded trace heights.

Commit: 36ad944

Benchmark Workflow

gaxiom merged commit 749e37d into develop-v2.0.0-rc.1 on Apr 13, 2026
72 checks passed
gaxiom deleted the chore/stream-migration branch on April 13, 2026 16:47
branch-rebase-bot Bot pushed a commit that referenced this pull request Apr 24, 2026