feat: GPU witness generation (RV32IM + Keccak + ShardRam)#1259
Open
hero78119
reviewed
Mar 25, 2026
Collaborator
hero78119
left a comment
A quick review of the tracer and the SortedNextAccesses field.
1a84578 to
4f4d4f4
Compare
Resolve 6 conflicting files; the others are pure merges.
CONFLICT (content): Merge conflict in ceno_emul/src/lib.rs
CONFLICT (content): Merge conflict in ceno_zkvm/src/e2e.rs
CONFLICT (content): Merge conflict in ceno_zkvm/src/instructions/riscv/ecall.rs
CONFLICT (content): Merge conflict in ceno_zkvm/src/precompiles/mod.rs
CONFLICT (content): Merge conflict in ceno_zkvm/src/structs.rs
CONFLICT (content): Merge conflict in ceno_zkvm/src/tables/ram/ram_circuit.rs
---------
Co-authored-by: Velaciela <git.rover@outlook.com>
Co-authored-by: xkx <xiakunxian130@gmail.com>
Co-authored-by: Ray Gao <qg2153@columbia.edu>
3f5c562 to
beb98a2
Compare
Collaborator
Author
GPU witgen now runs successfully on the reth-benchmark machine. Notes:
related: #1265
GPU Witness Generation
Accelerate witness generation by offloading computation from CPU to GPU.
This module (ceno_zkvm/src/instructions/gpu/) contains all GPU-side dispatch, caching, and utility code for the witness generation pipeline.
The CUDA backend lives in the sibling repo ceno-gpu/ (cuda_hal/src/common/witgen/).
Architecture
Module Layout
Data Flow
Per-Shard Pipeline
Within generate_witness() (e2e.rs), each shard executes:
- Vec<StepRecord> (cached, shared across all chips)
- gpu_fill_witness matches GpuWitgenKind → 22 kernel variants
- ShardContext
GPU/CPU Decision (dispatch.rs)
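A minimal sketch of how a per-chip GPU/CPU decision could be derived from CENO_GPU_ENABLE_WITGEN and CENO_GPU_DISABLE_WITGEN_KINDS. The WitgenConfig type, its methods, and the rule that any non-empty value other than "0" enables the GPU path are assumptions made for illustration; they are not taken from dispatch.rs.

```rust
use std::collections::HashSet;

/// Hypothetical parsed form of the two environment variables.
struct WitgenConfig {
    enabled: bool,
    disabled_kinds: HashSet<String>,
}

impl WitgenConfig {
    /// Build the config from raw values, e.g. `std::env::var(..).ok().as_deref()`.
    fn from_raw(enable: Option<&str>, disable_kinds: Option<&str>) -> Self {
        // Assumption: any non-empty value other than "0" turns GPU witgen on.
        let enabled = matches!(enable, Some(v) if !v.is_empty() && v != "0");
        // Comma-separated kind names, normalised to lowercase.
        let disabled_kinds = disable_kinds
            .unwrap_or("")
            .split(',')
            .map(|s| s.trim().to_ascii_lowercase())
            .filter(|s| !s.is_empty())
            .collect();
        Self { enabled, disabled_kinds }
    }

    /// True when this chip kind should take the GPU path; false means CPU fallback.
    fn use_gpu(&self, kind: &str) -> bool {
        self.enabled && !self.disabled_kinds.contains(&kind.to_ascii_lowercase())
    }
}
```

With the disable list set to add,keccak,lw, a call such as use_gpu("keccak") returns false in this sketch and the chip falls back to the CPU path, while kinds not on the list stay on the GPU.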
Keccak Dispatch
Keccak has a dedicated GPU dispatch path (chips/keccak.rs::gpu_assign_keccak_instances), separate from try_gpu_assign_instances, because:
- new_by_rotation
- packed_instances with syscall_witnesses
The LK/shardram collection logic is identical to the standard path.
Lk and Shardram Collection
After the GPU computes the witness matrix, LK multiplicities and shard RAM records are collected through one of several paths (priority order):
- cpu_collect_shardram
- cpu_collect_lk_and_shardram
- assign_instance
- assign_instance
Currently all non-Keccak kinds use Path A. Paths B-E are fallback/debug paths.
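The priority-ordered selection described above can be sketched as "try each path in order and take the first one that yields a result". The helper name first_available and the closure-based representation of a path are hypothetical; the actual conditions under which each of Paths A-E applies are not reproduced here.

```rust
/// Try each collection path in priority order and return the label and
/// output of the first path that produces a result.
/// Hypothetical helper: a path is modelled as a closure returning
/// `None` when it does not apply.
fn first_available<'a, T>(
    paths: &[(&'a str, &dyn Fn() -> Option<T>)],
) -> Option<(&'a str, T)> {
    paths
        .iter()
        .find_map(|(label, run)| run().map(|out| (*label, out)))
}
```

Because find_map stops at the first Some, later (fallback) paths are never evaluated once a higher-priority path succeeds.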
E2E Pipeline Modes (e2e.rs)
Environment Variables
- CENO_GPU_ENABLE_WITGEN
- CENO_GPU_DISABLE_WITGEN_KINDS: add,keccak,lw. Falls back to CPU for those chips.
- CENO_GPU_DEBUG_COMPARE_WITGEN
CENO_GPU_DEBUG_COMPARE_WITGEN Coverage
When set, all failures are collected into a DebugCompareReport (thread-local). Detailed mismatches are logged via tracing::error! in real time; at pipeline end, assert_debug_compare_report() prints a summary table and panics if any failures exist.
Per-chip (in dispatch.rs, for each opcode circuit):
- debug_compare_final_lk: GPU LK multiplicity vs CPU assign_instance baseline (all 8 lookup tables)
- debug_compare_witness: GPU witness matrix vs CPU witness (element-by-element)
- debug_compare_shardram: GPU shard records (read_records, write_records, addr_accessed) vs CPU
- debug_compare_shard_ec: GPU compact EC records vs CPU-computed EC points (nonce, x[7], y[7])
Per-chip, Keccak-specific (in chips/keccak.rs):
- debug_compare_keccak: Combined witness + LK + shard comparison for keccak's rotation-aware layout
ShardRamCircuit (in chips/shard_ram.rs):
- debug_compare_shard_ram_witness: GPU ShardRam witness vs CPU baseline (from ShardRamInput)
- debug_compare_shard_ram_witness_from_device: GPU ShardRam witness vs CPU baseline (D2H device buffer → convert → CPU assign)
Per-shard, E2E level (in e2e.rs, all chips combined):
- log_shard_ctx_diff: Aggregated addr_accessed comparison (write/read_records skipped when GPU witgen enabled)
- log_combined_lk_diff: Merged LK multiplicities after finalize_lk_multiplicities() (catches cross-chip merge issues)
Tests
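Both the debug hooks and the tests compare GPU output against the CPU baseline element by element. A minimal sketch of that pattern, with mismatches collected into a thread-local report and checked once at the end, is given here. The names Mismatch, REPORT, compare_witness, and assert_report_clean are hypothetical stand-ins rather than the PR's DebugCompareReport API, and the row-major u64 layout is an assumption.

```rust
use std::cell::RefCell;

/// One differing witness cell (hypothetical record layout).
#[derive(Debug, Clone, PartialEq)]
struct Mismatch {
    chip: String,
    row: usize,
    col: usize,
    gpu: u64,
    cpu: u64,
}

thread_local! {
    /// Per-thread collection of all mismatches seen so far.
    static REPORT: RefCell<Vec<Mismatch>> = RefCell::new(Vec::new());
}

/// Compare two row-major matrices cell by cell, record every difference,
/// and return how many cells differed.
fn compare_witness(chip: &str, num_cols: usize, gpu: &[u64], cpu: &[u64]) -> usize {
    assert!(num_cols > 0, "num_cols must be positive");
    assert_eq!(gpu.len(), cpu.len(), "witness sizes differ for {chip}");
    let mut found = 0;
    for (i, (g, c)) in gpu.iter().zip(cpu).enumerate() {
        if g != c {
            found += 1;
            REPORT.with(|r| {
                r.borrow_mut().push(Mismatch {
                    chip: chip.to_string(),
                    row: i / num_cols,
                    col: i % num_cols,
                    gpu: *g,
                    cpu: *c,
                })
            });
        }
    }
    found
}

/// Print every recorded mismatch and panic if there were any.
fn assert_report_clean() {
    let failures = REPORT.with(|r| r.borrow().clone());
    for m in &failures {
        eprintln!(
            "{} [row {}, col {}]: gpu={} cpu={}",
            m.chip, m.row, m.col, m.gpu, m.cpu
        );
    }
    assert!(failures.is_empty(), "{} GPU/CPU mismatches", failures.len());
}
```

Deferring the panic to a single final check lets one run surface mismatches from every chip instead of stopping at the first failing one.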
79 tests total (cargo test --features gpu,u16limb_circuit -p ceno_zkvm --lib -- "gpu")
- chips/*.rs (31 via test_colmap! macro + 2 manual)
- chips/*.rs: assign_instance (element-by-element witness comparison)
- gpu/mod.rs: collect_lk_and_shardram / collect_shardram vs assign_instance baseline
- utils/mod.rs: LkOp::encode_all() produces correct table/key pairs
- scheme/septic_curve.rs: to_ec_point
- scheme/septic_curve.rs
- scheme/septic_curve.rs: septic_point_from_x vs CPU
Running Tests
Per-Chip Boilerplate Macros
Three macros in instructions.rs reduce per-chip GPU integration to ~3 lines: