Add on-the-fly compute support with a transpose engine at the write seam #112
Draft
DanielKellerM wants to merge 7 commits into
Verbatim copy of rtl/datamover_engine.sv from pulp-platform/datamover@d58a985 (branch cdurrer/konark), the transpose core of the Ratha HWPE. Co-authored-by: Sergio Mazzola <smazzola@iis.ee.ethz.ch> Co-authored-by: Cyrill Durrer <cdurrer@iis.ee.ethz.ch>
Rework the imported datamover_engine.sv to iDMA conventions: plain valid/ready with byte/strb ports, no hwpe_stream/hci dependencies, transpose only. Runtime element size (int8/fp16/fp32), element-granular edge strobe, ping-pong tile banks with a half-area FullDuplex=0 option, and a standalone DPI-C golden regression.
compute_options_t carries {enable, op, params} in the request options;
transpose_options_t packs the element mode and tensor shape; compute_enable_t
is the compile-time per-op build gate.
idma_transpose_midend derives the NumDim=4 tiled walk (row / row-tile / col-tile) from the tensor shape and the bus StrbWidth, leaving the generic nd_midend to walk it; the geometry folds to shifts except one stride product. Guards the domain (StrbWidth >= 4, reserved mode, zero dims) and documents the tile-padded access contract; nd_midend asserts strides match the address width.
idma_otf_compute latches the per-transfer compute options and runtime-selects one op per transfer; the AXI write manager gains an external strobe mask and a strobe-independent beat-done so edge tiles drain. Compute support is decided at generation time: IDMA_VIDMA_IDS entries (variant[:ops][:fd|hd]) render the seam, the per-op ComputeEnable set and the transpose duplex into the listed variants only; non-compute variants are untouched. The write-side FIFOs grow by a tile to clear the legalizer in-flight bound, and compute variants require NO_ERROR_HANDLING.
Multi-tile aligned and edge transposes through the rw_axi backend, back-to-back geometry-leak checks, an nd_midend burst-address regression, a field-for-field midend unit test and launch_tf transpose options; the engine regression runs in both duplex modes.
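As a rough software illustration of what the aligned and edge transpose checks compare against, the sketch below transposes a matrix tile by tile and clips partial edge tiles to the matrix bounds. It is not the DPI-C golden model from this PR; the function name `transpose_tiled` and the list-of-lists representation are assumptions made for illustration only.

```python
def transpose_tiled(matrix, ne):
    """Transpose an M x N matrix tile by tile, using NE x NE tiles.

    Edge tiles are clipped to the matrix bounds, which stands in for the
    write-side edge mask in this software-only sketch (an assumption, not
    the RTL behaviour).
    """
    m = len(matrix)
    n = len(matrix[0]) if m else 0
    out = [[None] * m for _ in range(n)]      # result is N x M
    for row0 in range(0, m, ne):              # walk row-tiles
        for col0 in range(0, n, ne):          # walk col-tiles
            for r in range(row0, min(row0 + ne, m)):
                for c in range(col0, min(col0 + ne, n)):
                    out[c][r] = matrix[r][c]
    return out


# 5 x 3 matrix with 2 x 2 tiles: both dimensions end in a partial edge tile.
src = [[r * 10 + c for c in range(3)] for r in range(5)]
dst = transpose_tiled(src, ne=2)
assert dst == [list(col) for col in zip(*src)]
```

Choosing a shape that is not a multiple of the tile size in either dimension exercises the same edge cases the end-to-end regressions target.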
Pull request overview
This PR adds optional on-the-fly compute support to generated iDMA backends by introducing a transpose operation that is dispatched at the AXI write seam and carried per-transfer via new opt.compute request fields. It extends the generator to selectively include compute logic per backend variant, adds a transpose-geometry midend, updates write masking/beat retirement for edge tiles, and provides multiple new regressions (engine-only, ND end-to-end, and back-to-back).
Changes:
- Extend request/options types with `compute_options_t` and add generator plumbing to enable compute per backend variant (`--compute-ids` / `IDMA_VIDMA_IDS`).
- Add transpose datapath blocks (`idma_otf_compute`, `idma_otf_transpose`) and integrate them into the transport write seam, including strobe-mask support and strobe-independent beat retirement in `idma_axi_write`.
- Add transpose geometry expander midend plus new directed/unit testbenches and a DPI-C golden model.
Reviewed changes
Copilot reviewed 27 out of 28 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| util/mario/util.py | Adds parsing for compute-enabled backend variant IDs (--compute-ids). |
| util/mario/transport_layer.py | Threads compute configuration into transport-layer template context. |
| util/mario/legalizer.py | Threads compute enable into legalizer template context. |
| util/mario/backend.py | Enforces compute placement constraints (single AXI write port) and passes op set into backend template context. |
| util/gen_idma.py | Adds --compute-ids CLI support and forwards compute configuration into renderers. |
| src/include/idma/typedef.svh | Extends options_t with compute field to carry per-transfer compute config. |
| src/idma_pkg.sv | Introduces compute op enums and packed option/enable types for on-the-fly compute. |
| src/midend/idma_transpose_midend.sv | New midend that expands transpose requests into a NumDim=4 tiled ND walk. |
| src/midend/idma_nd_midend.sv | Adds a simulation-time stride/address width consistency assert. |
| src/backend/idma_otf_compute.sv | New write-seam dispatcher that latches per-transfer compute options and selects the active engine. |
| src/backend/idma_otf_transpose.sv | New transpose engine (tile ping-pong) with edge masking. |
| src/backend/idma_axi_write.sv | Adds external strobe mask input and strobe-independent beat-done output for correct draining of masked beats. |
| src/db/idma_axi.yml | Wires compute into write datapath request and connects new idma_axi_write ports. |
| src/db/idma_tilelink.yml | Forwards compute into write datapath request struct literal. |
| src/backend/tpl/idma_transport_layer.sv.tpl | Integrates compute at the write seam and carries shifted external mask to the write manager. |
| src/backend/tpl/idma_legalizer.sv.tpl | Forces decouple signals when compute is enabled and forwards opt.compute through mutable options. |
| src/backend/tpl/idma_backend.sv.tpl | Adds compute fields to datapath request structs and increases meta FIFO depth for compute latency. |
| test/idma_test.sv | Extends test driver task to optionally enable/parameterize transpose per transfer. |
| test/idma_transpose_dpi.c | Adds DPI-C golden transpose model for standalone engine verification. |
| test/tb_idma_otf_transpose.sv | New standalone transpose-engine self-checking regression using DPI golden. |
| test/tb_idma_transpose_nd.sv | New end-to-end ND→backend transpose regression with edge/padding checks. |
| test/tb_idma_transpose_b2b.sv | New end-to-end back-to-back transpose regression to catch stale per-transfer state. |
| test/midend/tb_idma_transpose_midend.sv | Unit test verifying transpose midend geometry expansion and passthrough behavior. |
| test/midend/tb_idma_nd_midend_b2b.sv | Back-to-back ND midend regression under backpressure to catch base-address reuse. |
| doc/transpose-engine-routing-plan.md | Detailed routing/signaling design doc for transpose integration. |
| idma.mk | Adds compute-ids generation hook, split-RTL sim flow tag, and new Questa/VSIM targets for transpose regressions. |
| Bender.yml | Adds new RTL sources and introduces a split_rtl target flow for per-variant generated files and transpose tests. |
| jobs/backend_rw_axi/transpose_none.txt | Adds a placeholder job file for the rw_axi backend. |
Comment on lines 54 to 58

    # prepare database and ids
    protocol_ids = prepare_ids(args.ids)
    compute_cfg = prepare_compute_ids(args.compute_ids)
    frontend_ids = prepare_fids(args.fids)
    protocol_db = read_database(args.db)

Comment on lines +140 to +157

    def prepare_compute_ids(compute_id_strs: list) -> dict:
        """Parses compute configuration IDs: <variant>[:<op>[,<op>...]][:fd|hd]"""
        res = {}
        for cid_str in (compute_id_strs or []):
            parts = cid_str.split(':')
            ops = ['transpose']
            full_duplex = True
            for part in parts[1:]:
                if part in ('fd', 'hd'):
                    full_duplex = part == 'fd'
                else:
                    ops = part.split(',')
            for op in ops:
                if op not in ('transpose',):
                    print(f'[MARIO] {op} is a non-supported compute op in {cid_str}', file=sys.stderr)
                    sys.exit(1)
            res[parts[0]] = {'ops': ops, 'full_duplex': full_duplex}
        return res

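To make the `<variant>[:<op>[,<op>...]][:fd|hd]` format concrete, here is a minimal standalone sketch that mirrors the quoted parsing logic for a single entry. The helper name `parse_compute_id` is hypothetical, and it raises `ValueError` where the quoted function prints to stderr and exits.

```python
def parse_compute_id(cid_str):
    """Parse one '<variant>[:<op>[,<op>...]][:fd|hd]' entry (illustrative helper)."""
    variant, *fields = cid_str.split(':')
    ops, full_duplex = ['transpose'], True    # same defaults as the quoted function
    for field in fields:
        if field in ('fd', 'hd'):
            full_duplex = field == 'fd'
        else:
            ops = field.split(',')
    unknown = [op for op in ops if op != 'transpose']
    if unknown:
        # The quoted function reports the error and exits; an exception is used here.
        raise ValueError(f'unsupported compute op(s) {unknown} in {cid_str!r}')
    return variant, {'ops': ops, 'full_duplex': full_duplex}


print(parse_compute_id('rw_axi'))
# ('rw_axi', {'ops': ['transpose'], 'full_duplex': True})
print(parse_compute_id('rw_axi:transpose:hd'))
# ('rw_axi', {'ops': ['transpose'], 'full_duplex': False})
```

A bare variant name therefore selects the transpose op in full-duplex mode, and `hd` is the only field that changes the duplex setting.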
Comment on lines +48 to +53

    // geometry: NE is a power of two, so only shifts and AND-masks
    logic [1:0] eff_mode; // element-size mode, saturated at LaneW
    logic [LaneW:0] ne_m1; // NE-1
    logic [3:0] log2_ne; // log2(NE) = LaneW - eff_mode
    logic [DimWidth-1:0] y_tiles, n_tiles; // row-tiles, col-tiles
    logic [LaneW:0] leftover_rows, leftover_cols; // M%NE, N%NE (run-global)

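The shift-and-mask arithmetic named in these declarations can be sketched in Python as follows. This is an illustration only, under several assumptions not confirmed by the snippet: the mode encodes the element size as `1 << mode` bytes (0 = int8, 1 = fp16, 2 = fp32), `LaneW` equals `log2(StrbWidth)`, saturation is a simple `min`, and the tile counts round up to cover partial edge tiles.

```python
def transpose_geometry(m, n, strb_width, mode):
    """Derive tile counts and leftovers using only shifts and AND-masks."""
    lane_w = strb_width.bit_length() - 1       # assumed: LaneW = log2(StrbWidth)
    assert strb_width == 1 << lane_w, 'StrbWidth must be a power of two'
    eff_mode = min(mode, lane_w)               # assumed meaning of "saturated at LaneW"
    log2_ne = lane_w - eff_mode                # log2(NE) = LaneW - eff_mode
    ne_m1 = (1 << log2_ne) - 1                 # NE - 1, doubles as the AND-mask
    return {
        'ne': ne_m1 + 1,
        'y_tiles': (m + ne_m1) >> log2_ne,     # ceil(M / NE) as add-then-shift
        'n_tiles': (n + ne_m1) >> log2_ne,     # ceil(N / NE)
        'leftover_rows': m & ne_m1,            # M % NE
        'leftover_cols': n & ne_m1,            # N % NE
    }


# 100 x 70 tensor of 4-byte elements on a 64-byte (512-bit) bus: NE = 16.
geo = transpose_geometry(100, 70, strb_width=64, mode=2)
assert geo == {'ne': 16, 'y_tiles': 7, 'n_tiles': 5,
               'leftover_rows': 4, 'leftover_cols': 6}
```

Because `NE` is a power of two, division and modulo by `NE` never need a divider, which matches the comment's claim that only shifts and AND-masks are required.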
Summary
This PR adds optional on-the-fly compute to the generated iDMA backends: transfers can be transformed while they stream through the DMA, with no extra memory passes. The first (and currently only) compute op is a matrix transpose; the architecture is an extension point for further ops.
The transpose datapath is adapted from the `datamover` (Ratha) HWPE: it was imported verbatim from pulp-platform/datamover@d58a985 in the first commit (original authors credited), then reworked to iDMA conventions.
Architecture
- Request options (`idma_pkg`): `opt.compute = {enable, op, params}` is carried per transfer; the transpose params are the element mode (int8/fp16/fp32) and the tensor shape (M, N up to 4095).
- Midend: `idma_transpose_midend` derives the `NumDim=4` ND walk (row / row-tile / col-tile) for the unmodified `idma_nd_midend`; the geometry strength-reduces to shifts except one 12x12 stride product.
- Write seam: `idma_otf_compute` latches the per-transfer options and selects one op; the AXI write manager gains an external strobe mask plus a strobe-independent beat-done so partial edge tiles drain correctly.
- Transpose engine (`idma_otf_transpose`): NE x NE element tiles (`NE = StrbWidth/elem`), runtime element size, element-granular edge masking. Steady state `1 + 1/NE` cycles per tile (~98% of bus peak at NE=64).
Generation-time configuration
Compute is a generation decision, not an SV parameter: `IDMA_VIDMA_IDS` entries take `variant[:ops][:fd|hd]` (default `rw_axi` = transpose, full duplex). Non-listed variants render without any compute logic. The `hd` option builds a single tile bank: half the buffer area (`StrbWidth^2` bytes per bank, e.g. 4 KiB at 512 bit) for roughly half the streaming rate.
Constraints, enforced at generation/elaboration time: compute variants need a single AXI write port and `NO_ERROR_HANDLING`; sources/destinations must be readable/writable up to the tile-padded bounds (documented contract; writes are strobe-masked, reads are not).
Verification
- Multi-tile aligned and edge transposes through the `rw_axi` backend, back-to-back geometry-leak checks, an `nd_midend` burst-address regression and a field-for-field midend unit test
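The throughput and buffer-area figures quoted in the architecture and configuration notes can be sanity-checked with a few lines of arithmetic. The sketch below assumes that the `1 + 1/NE` figure is the average cycle cost per bus beat, and that the full-duplex build holds two tile banks where `hd` holds one; both are readings of the description, not statements taken from the RTL.

```python
def steady_state_efficiency(ne):
    """Fraction of bus peak if each beat costs 1 + 1/NE cycles on average."""
    return 1.0 / (1.0 + 1.0 / ne)


def bank_bytes(bus_bits):
    """Bytes in one tile bank: StrbWidth^2, with StrbWidth in bytes."""
    strb_width = bus_bits // 8
    return strb_width * strb_width


def buffer_bytes(bus_bits, full_duplex=True):
    """Total tile buffer: two banks for full duplex (assumed), one for hd."""
    return bank_bytes(bus_bits) * (2 if full_duplex else 1)


print(f'{steady_state_efficiency(64):.1%}')      # 98.5%
print(bank_bytes(512))                           # 4096
print(buffer_bytes(512, full_duplex=False))      # 4096
```

At NE=64 the efficiency is 64/65, which agrees with the quoted ~98%, and a 512-bit bus gives the quoted 4 KiB per bank.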