inst64: Drive on-the-fly transpose; add the snitch integration harness #113
Draft
DanielKellerM wants to merge 10 commits into
Pull request overview
This PR wires the on-the-fly transpose feature end-to-end for the Snitch inst64 frontend and adds a standalone Snitch integration harness, while also extending the backend generation flow to optionally include compute support in selected variants.
Changes:
- Add a typed per-transfer `opt.compute` capability (transpose op + params) and route it through legalizer/backend/transport to a write-seam compute dispatcher and transpose engine.
- Extend the `inst64` frontend to decode transpose requests from spare `DMCPY` `argb` bits, expand transpose geometry via a new `idma_transpose_midend`, and reject malformed transpose requests.
- Add new SV/DPI-C testbenches (engine-level + ND/back-to-back) and a Snitch `inst64` integration harness + Makefile flow (`snitch_transpose_sweep`), plus docs and Bender target support (`split_rtl`).
Reviewed changes
Copilot reviewed 36 out of 37 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| util/mario/util.py | Add parsing for --compute-ids configuration strings (ops + fd/hd). |
| util/mario/transport_layer.py | Pass compute enable/op info into transport-layer templating context. |
| util/mario/legalizer.py | Pass compute enable flag into legalizer templating context. |
| util/mario/backend.py | Enforce “single AXI write port” constraint for compute-enabled variants and pass ops into backend templating context. |
| util/gen_idma.py | Add --compute-ids CLI and propagate compute config into generators. |
| test/tb_idma_transpose_nd.sv | Multi-tile end-to-end transpose test via ND midend → compute backend → AXI sim mem. |
| test/tb_idma_transpose_b2b.sv | End-to-end back-to-back transpose regression to distinct destinations. |
| test/tb_idma_otf_transpose.sv | Standalone transpose engine SV testbench using DPI-C golden model. |
| test/midend/tb_idma_transpose_midend.sv | Unit test for transpose geometry expansion midend. |
| test/midend/tb_idma_nd_midend_b2b.sv | Back-to-back ND midend base-address reload regression under backpressure. |
| test/idma_transpose_dpi.c | DPI-C golden model for element-granular transpose verification. |
| test/idma_test.sv | Extend request-driving task to optionally program transpose compute options. |
| systems/snitch/test/tb_idma_inst64_transpose.sv | Snitch inst64 end-to-end transpose integration test (incl. rejects + no-leak). |
| systems/snitch/test/tb_idma_inst64_copy.sv | Snitch inst64 plain-copy regression. |
| systems/snitch/test/idma_inst64_tb_pkg.sv | Package/types/constants for the standalone Snitch harness. |
| systems/snitch/test/idma_inst64_drv_if.sv | Accelerator-bus BFM tasks, including DMCPY-encoded transpose launch helpers. |
| systems/snitch/test/idma_inst64_base.sv | Base harness instantiating idma_inst64_top + AXI sim memories. |
| systems/snitch/README.md | Document Snitch harness purpose, build flow, and transpose contract. |
| systems/snitch/Makefile | Standalone build + sim/sweep targets for the Snitch harness. |
| systems/snitch/.gitignore | Ignore build products for the Snitch harness flow. |
| src/midend/idma_transpose_midend.sv | New combinational transpose geometry expander (NumDim=4) for ND midend. |
| src/midend/idma_nd_midend.sv | Add non-synth assertion enforcing stride width == address width. |
| src/include/idma/typedef.svh | Extend options_t with typed compute options field. |
| src/idma_pkg.sv | Define compute op enum, transpose params, compute options, and feature enable struct. |
| src/frontend/inst64/idma_inst64_top.sv | Add ComputeEnable param, decode/validate transpose from DMCPY, splice transpose midend, widen strides to addr width, add backend capability cross-check. |
| src/db/idma_tilelink.yml | Forward compute options into write datapath request struct where needed. |
| src/db/idma_axi.yml | Forward compute options; extend AXI write template to accept strobe mask + beat-done pulse. |
| src/backend/tpl/idma_transport_layer.sv.tpl | Add write-seam compute integration (dispatcher + mask/beat-done plumbing). |
| src/backend/tpl/idma_legalizer.sv.tpl | Force decouple on compute transfers; propagate compute options into mutable transfer opts and write datapath req. |
| src/backend/tpl/idma_backend.sv.tpl | Add compute-enabled variant metadata (ComputeEnable), enforce NO_ERROR_HANDLING, increase meta FIFO depth for compute latency, propagate compute options into write datapath req. |
| src/backend/idma_otf_transpose.sv | New transpose engine (tile ping-pong) producing per-byte strobe mask. |
| src/backend/idma_otf_compute.sv | New write-seam compute dispatcher (currently transpose only). |
| src/backend/idma_axi_write.sv | Add external strobe mask input and a strobe-independent “beat accepted” pulse output. |
| jobs/backend_rw_axi/transpose_none.txt | Add job artifact/marker for transpose-none configuration (empty in this diff). |
| idma.mk | Add compute-enabled variant list (IDMA_VIDMA_IDS), propagate to generator, add simulation targets for transpose regressions, include split_rtl in vsim script target set. |
| doc/transpose-engine-routing-plan.md | Detailed routing/signaling plan and rationale for transpose integration. |
| Bender.yml | Add compute RTL, new midend, Snitch harness sources, transpose tests, and introduce split_rtl generated-file selection. |
Comment on lines +173 to +185
    /// Extra write-descriptor slots covering the compute (transpose) tile-fill latency
    localparam int unsigned ComputeFifoDepth = ${"StrbWidth" if enable_compute else "32'd0"};
    % if enable_compute:

    /// Per-op compute set baked into this variant (frontends may cross-check)
    localparam idma_pkg::compute_enable_t ComputeEnable =
        '{${', '.join("%s: 1'b1" % op for op in compute_ops)}};
    `ifndef SYNTHESIS
    // no engine flush on abort: compute is incompatible with error handling
    initial assert (ErrorCap == idma_pkg::NO_ERROR_HANDLING) else
        $fatal(1, "compute requires ErrorCap == NO_ERROR_HANDLING");
    `endif
    % endif
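The `ComputeEnable` assignment pattern in the template above is built by a plain Python expression inside the Mako substitution. The sketch below evaluates that same expression outside the template to show what it renders to. The helper name `render_compute_enable` and the op name `transpose` as a field of `compute_enable_t` are assumptions made for illustration only.

```python
def render_compute_enable(compute_ops):
    """Evaluate the template's join expression for a list of op names.

    Hypothetical helper: it only mirrors the expression
    ', '.join("%s: 1'b1" % op for op in compute_ops) from the template.
    """
    body = ", ".join("%s: 1'b1" % op for op in compute_ops)
    # The template wraps the joined fields in a SystemVerilog assignment pattern.
    return "'{" + body + "};"


if __name__ == "__main__":
    # "transpose" is an assumed field name, not confirmed by the diff.
    print(render_compute_enable(["transpose"]))  # '{transpose: 1'b1};
```

With an empty op list the expression would render `'{};`, which is presumably why the template emits this block only under `% if enable_compute:`.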
Comment on lines +177 to +181
    // full/empty token
    always_ff @(posedge clk_i or negedge rst_ni) begin
      if (!rst_ni || clear_i || exec_done) begin
        full_q <= 2'b00;
      end else begin
Verbatim copy of rtl/datamover_engine.sv from pulp-platform/datamover@d58a985 (branch cdurrer/konark), the transpose core of the Ratha HWPE.
Co-authored-by: Sergio Mazzola <smazzola@iis.ee.ethz.ch>
Co-authored-by: Cyrill Durrer <cdurrer@iis.ee.ethz.ch>
Rework the imported datamover_engine.sv to iDMA conventions: plain valid/ready with byte/strb ports, no hwpe_stream/hci dependencies, transpose only. Runtime element size (int8/fp16/fp32), element-granular edge strobe, ping-pong tile banks with a half-area FullDuplex=0 option, and a standalone DPI-C golden regression.
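For readers who want a feel for what an element-granular golden model checks, here is a minimal Python sketch. It is not the DPI-C model from `test/idma_transpose_dpi.c`; the function name `transpose_bytes` and the row-major source layout are assumptions. The element sizes 1, 2 and 4 bytes correspond to the int8/fp16/fp32 modes named in the commit message.

```python
def transpose_bytes(src, rows, cols, elem_bytes):
    """Transpose a row-major rows x cols tensor stored as raw bytes.

    Hypothetical reference model: each element is elem_bytes wide and is
    moved as an opaque byte group, so the element type never matters.
    """
    if elem_bytes not in (1, 2, 4):  # int8 / fp16 / fp32
        raise ValueError("unsupported element size: %d" % elem_bytes)
    if len(src) != rows * cols * elem_bytes:
        raise ValueError("buffer length does not match the tensor shape")

    dst = bytearray(len(src))
    for r in range(rows):
        for c in range(cols):
            s = (r * cols + c) * elem_bytes  # source offset of element (r, c)
            d = (c * rows + r) * elem_bytes  # destination offset of element (c, r)
            dst[d:d + elem_bytes] = src[s:s + elem_bytes]
    return bytes(dst)


if __name__ == "__main__":
    # 2 x 3 tensor of 2-byte elements
    print(list(transpose_bytes(bytes(range(12)), 2, 3, 2)))
```

Transposing twice with swapped dimensions returns the original buffer, which makes a cheap self-check for any such model.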
compute_options_t carries {enable, op, params} in the request options;
transpose_options_t packs the element mode and tensor shape; compute_enable_t
is the compile-time per-op build gate.
idma_transpose_midend derives the NumDim=4 tiled walk (row / row-tile / col-tile) from the tensor shape and the bus StrbWidth, leaving the generic nd_midend to walk it; the geometry folds to shifts except one stride product. Guards the domain (StrbWidth >= 4, reserved mode, zero dims) and documents the tile-padded access contract; nd_midend asserts strides match the address width.
idma_otf_compute latches the per-transfer compute options and runtime-selects one op per transfer; the AXI write manager gains an external strobe mask and a strobe-independent beat-done so edge tiles drain. Compute support is decided at generation time: IDMA_VIDMA_IDS entries (variant[:ops][:fd|hd]) render the seam, the per-op ComputeEnable set and the transpose duplex into the listed variants only; non-compute variants are untouched. The write-side FIFOs grow by a tile to clear the legalizer in-flight bound, and compute variants require NO_ERROR_HANDLING.
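To make the `variant[:ops][:fd|hd]` entry shape concrete, here is a small parsing sketch. It is not the parser added to `util/mario/util.py`. The function name `parse_compute_id`, the comma as separator between multiple ops, the default op list when `ops` is omitted, and `None` for an unspecified duplex are all assumptions.

```python
def parse_compute_id(entry, default_ops=("transpose",)):
    """Split one 'variant[:ops][:fd|hd]' entry into its parts.

    Hypothetical sketch; separator and defaults are assumed, not taken
    from the generator sources.
    """
    fields = entry.strip().split(":")
    if not fields[0] or len(fields) > 3:
        raise ValueError("malformed compute id: %r" % entry)

    variant, rest = fields[0], fields[1:]

    # A trailing 'fd' or 'hd' selects the transpose duplex mode.
    duplex = None
    if rest and rest[-1] in ("fd", "hd"):
        duplex = rest.pop()

    if len(rest) > 1:
        raise ValueError("malformed compute id: %r" % entry)

    # Assumed: several ops in one entry are comma-separated.
    ops = tuple(rest[0].split(",")) if rest else tuple(default_ops)
    if any(not op for op in ops):
        raise ValueError("empty op name in compute id: %r" % entry)

    return {"variant": variant, "ops": ops, "duplex": duplex}


if __name__ == "__main__":
    print(parse_compute_id("rw_axi:transpose:hd"))
    print(parse_compute_id("rw_axi:fd"))
```

Treating a trailing `fd`/`hd` as the duplex field is one way to resolve the ambiguity of a two-field entry such as `rw_axi:fd`; the real generator may resolve it differently.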
Multi-tile aligned and edge transposes through the rw_axi backend, back-to-back geometry-leak checks, an nd_midend burst-address regression, a field-for-field midend unit test and launch_tf transpose options; the engine regression runs in both duplex modes.
Decode the transpose from spare DMCPY argb bits into opt.compute, expand NumDim to 4 with addr-width strides and splice the transpose midend between the request FIFO and the nd_midend, gated by a ComputeEnable parameter. Malformed requests (no hardware, reserved mode, zero dims, unaligned dst) get an error response and the backend's baked compute set is cross-checked at elaboration.
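The four reject cases listed above can be summarized as a small software model. This is a sketch under stated assumptions, not the RTL check: the name `reject_reason`, the encoding of the reserved mode as `3`, the check order, and the alignment being passed in as a parameter are all hypothetical, since the commit message does not give those details.

```python
RESERVED_MODE = 3  # assumption: the commit message does not give the encoding


def reject_reason(compute_enabled, mode, tensor_m, tensor_n, dst_addr, dst_align):
    """Return why a transpose request would be rejected, or None if accepted.

    Hypothetical model of the malformed-request cases named in the commit
    message; the priority among the checks is an assumption.
    """
    if not compute_enabled:
        return "no hardware"       # frontend built without compute support
    if mode == RESERVED_MODE:
        return "reserved mode"
    if tensor_m == 0 or tensor_n == 0:
        return "zero dims"
    if dst_addr % dst_align != 0:
        return "unaligned dst"     # required alignment is assumed to be a parameter
    return None


if __name__ == "__main__":
    print(reject_reason(True, 0, 16, 16, 0x1000, 64))  # None
    print(reject_reason(True, 0, 0, 16, 0x1000, 64))   # zero dims
```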
Standalone BFM harness driving the accelerator port: copy and transpose testbenches and a sweep covering all element sizes, tiling, edge, back-to-back, leak and reject cases, registered behind the snitch_cluster target; the flow regenerates the RTL before compiling.
Force-pushed from 30bf0a1 to 854427a.
lint-authors requires a blank line after the Authors block (the YAML folded header-regex carries a trailing newline) and the plural "Authors:" tag. Normalize the new transpose/snitch files accordingly.
Summary
Stacked on #112 (the first 9 commits are that PR; review the last 2 commits here). This adds the frontend and system side of the on-the-fly transpose: the Snitch `inst64` frontend learns to issue transpose transfers, and a standalone integration harness verifies the full path end-to-end.

inst64 frontend

- Transpose request carried in spare `DMCPY` `argb` bits: `{enable[5], mode[7:6], tensor_m[19:8], tensor_n[31:20]}` (register form only)
- `ComputeEnable` parameter gates everything at compile time: with it cleared the frontend elaborates exactly as before (`NumDim` stays 2, no expander)
- With it set, `NumDim=4`, strides widen to the address width, and `idma_transpose_midend` is spliced between the request FIFO and the `nd_midend`
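The bit layout above can be exercised with a short pack/unpack sketch. The field positions come from the description; the helper names `pack_transpose_argb` and `unpack_transpose_argb` are hypothetical, and `mode` is treated as a raw 2-bit value because its mapping to element sizes is not given here. Bits outside the listed fields are left at zero.

```python
# Field layout from the PR description:
# {enable[5], mode[7:6], tensor_m[19:8], tensor_n[31:20]}
ENABLE_BIT = 5
MODE_LSB, MODE_WIDTH = 6, 2
M_LSB, M_WIDTH = 8, 12
N_LSB, N_WIDTH = 20, 12


def _field(value, lsb, width):
    """Range-check a field value and shift it into position."""
    if not 0 <= value < (1 << width):
        raise ValueError("value %d does not fit in %d bits" % (value, width))
    return value << lsb


def pack_transpose_argb(mode, tensor_m, tensor_n, enable=True):
    """Build the transpose fields of a 32-bit argb word (hypothetical helper)."""
    word = int(bool(enable)) << ENABLE_BIT
    word |= _field(mode, MODE_LSB, MODE_WIDTH)
    word |= _field(tensor_m, M_LSB, M_WIDTH)
    word |= _field(tensor_n, N_LSB, N_WIDTH)
    return word


def unpack_transpose_argb(word):
    """Extract the transpose fields from a 32-bit argb word."""
    return {
        "enable": bool((word >> ENABLE_BIT) & 1),
        "mode": (word >> MODE_LSB) & ((1 << MODE_WIDTH) - 1),
        "tensor_m": (word >> M_LSB) & ((1 << M_WIDTH) - 1),
        "tensor_n": (word >> N_LSB) & ((1 << N_WIDTH) - 1),
    }


if __name__ == "__main__":
    word = pack_transpose_argb(mode=1, tensor_m=16, tensor_n=32)
    print(hex(word), unpack_transpose_argb(word))
```

The 12-bit `tensor_m` and `tensor_n` fields cap each dimension at 4095 in this encoding, so the packer rejects larger values rather than silently truncating them.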