Add on-the-fly compute support with a transpose engine at the write seam by DanielKellerM · Pull Request #112 · pulp-platform/iDMA

DanielKellerM · 2026-06-10T14:13:49Z

Summary

This PR adds optional on-the-fly compute to the generated iDMA backends: transfers can be transformed while they stream through the DMA, with no extra memory passes. The first (and currently only) compute op is a matrix transpose; the architecture is an extension point for further ops.

The transpose datapath is adapted from the datamover (Ratha) HWPE — imported verbatim from pulp-platform/datamover@d58a985 in the first commit (original authors credited), then reworked to iDMA conventions.

Architecture

opt.compute (request)  →  idma_transpose_midend  →  idma_nd_midend  →  backend
                           (tensor shape → 4-D walk)                     │
                                       write seam: idma_otf_compute ── idma_otf_transpose

Request model (idma_pkg): opt.compute = {enable, op, params} carried per transfer; transpose params are the element mode (int8/fp16/fp32) and the tensor shape (M, N up to 4095).
Geometry midend: expands a compact transpose request into the tiled NumDim=4 ND walk (row / row-tile / col-tile) for the unmodified idma_nd_midend; the geometry strength-reduces to shifts except one 12x12 stride product.
Write-seam dispatcher: idma_otf_compute latches the per-transfer options and selects one op; the AXI write manager gains an external strobe mask plus a strobe-independent beat-done so partial edge tiles drain correctly.
Engine: NE x NE tile ping-pong (NE = StrbWidth/elem), runtime element size, element-granular edge masking. Steady state 1 + 1/NE cycles per tile (~98% of bus peak at NE=64).
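The quoted efficiency can be checked with a few lines of arithmetic. The sketch below assumes the steady-state figure means NE + 1 cycles for the NE beats of one tile (equivalently 1 + 1/NE cycles per beat); that reading and the function name `steady_state_efficiency` are assumptions for illustration, not taken from the RTL.

```python
from fractions import Fraction


def steady_state_efficiency(strb_width_bytes: int, elem_bytes: int) -> Fraction:
    """Fraction of bus peak, assuming NE + 1 cycles per NE-beat tile."""
    ne = strb_width_bytes // elem_bytes  # NE = StrbWidth / elem
    return Fraction(ne, ne + 1)


# 512-bit bus (StrbWidth = 64 bytes) for the three element modes
for elem_bytes, name in ((1, "int8"), (2, "fp16"), (4, "fp32")):
    eff = steady_state_efficiency(64, elem_bytes)
    print(f"{name}: NE={64 // elem_bytes}, {float(eff):.1%} of bus peak")
```

Under this reading the efficiency drops as elements get wider, since NE shrinks while the per-tile overhead stays at one cycle.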

Generation-time configuration

Compute is a generation decision, not an SV parameter: IDMA_VIDMA_IDS entries take the form variant[:ops][:fd|hd] (default: rw_axi, i.e. transpose, full duplex). Variants that are not listed render without any compute logic. The hd option builds a single tile bank: half the buffer area (StrbWidth^2 bytes per bank, e.g. 4 KiB at 512 bit) for roughly half the streaming rate.
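As a quick sanity check of the buffer-area figure, the following sketch computes the tile storage from the bus width. It assumes full duplex uses two ping-pong banks and half duplex one, as the description implies; `tile_buffer_bytes` is a hypothetical helper, not part of the generator.

```python
def tile_buffer_bytes(data_width_bits: int, full_duplex: bool = True) -> int:
    """Total tile storage in bytes: StrbWidth^2 bytes per bank."""
    strb_width = data_width_bits // 8       # bytes per bus beat
    bank_bytes = strb_width ** 2            # one NE x NE tile at int8
    banks = 2 if full_duplex else 1         # assumption: fd = 2 banks, hd = 1
    return banks * bank_bytes


print(tile_buffer_bytes(512, full_duplex=False))  # one bank at 512 bit
print(tile_buffer_bytes(512, full_duplex=True))   # two banks at 512 bit
```

Because the bank size grows with the square of the strobe width, the hd option matters most on wide buses.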

Constraints, enforced at generation/elaboration time: compute variants need a single AXI write port and NO_ERROR_HANDLING; sources/destinations must be readable/writable up to the tile-padded bounds (documented contract; writes are strobe-masked, reads are not).

Verification

Standalone engine regression against a DPI-C golden model, in both duplex modes
Multi-tile aligned + edge ND transposes through the generated rw_axi backend, back-to-back geometry-leak checks, an nd_midend burst-address regression and a field-for-field midend unit test
Non-compute variants regenerate logic-identical to the base branch
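For readers who want to reproduce the expected data outside the testbench, a reference transpose is only a few lines. The sketch below is a Python stand-in, not the DPI-C model in test/idma_transpose_dpi.c; it assumes a densely packed row-major source and the function name `transpose_bytes` is hypothetical.

```python
def transpose_bytes(src: bytes, m: int, n: int, elem_bytes: int) -> bytes:
    """Transpose a row-major M x N matrix of elem_bytes-wide elements."""
    if len(src) != m * n * elem_bytes:
        raise ValueError("source length does not match M * N * elem_bytes")
    dst = bytearray(len(src))
    for r in range(m):
        for c in range(n):
            s = (r * n + c) * elem_bytes   # element (r, c) in the M x N source
            d = (c * m + r) * elem_bytes   # element (c, r) in the N x M result
            dst[d:d + elem_bytes] = src[s:s + elem_bytes]
    return bytes(dst)


# 2 x 3 int8 matrix [[1, 2, 3], [4, 5, 6]]
print(list(transpose_bytes(bytes([1, 2, 3, 4, 5, 6]), 2, 3, 1)))
```

Elements are moved as opaque byte groups, so the same routine covers the int8, fp16 and fp32 modes by passing 1, 2 or 4 as `elem_bytes`.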

Verbatim copy of rtl/datamover_engine.sv from pulp-platform/datamover@d58a985 (branch cdurrer/konark), the transpose core of the Ratha HWPE.

Co-authored-by: Sergio Mazzola <smazzola@iis.ee.ethz.ch>
Co-authored-by: Cyrill Durrer <cdurrer@iis.ee.ethz.ch>

Rework the imported datamover_engine.sv to iDMA conventions: plain valid/ready with byte/strb ports, no hwpe_stream/hci dependencies, transpose only. Runtime element size (int8/fp16/fp32), element-granular edge strobe, ping-pong tile banks with a half-area FullDuplex=0 option, and a standalone DPI-C golden regression.

compute_options_t carries {enable, op, params} in the request options; transpose_options_t packs the element mode and tensor shape; compute_enable_t is the compile-time per-op build gate.

idma_transpose_midend derives the NumDim=4 tiled walk (row / row-tile / col-tile) from the tensor shape and the bus StrbWidth, leaving the generic nd_midend to walk it; the geometry folds to shifts except one stride product. Guards the domain (StrbWidth >= 4, reserved mode, zero dims) and documents the tile-padded access contract; nd_midend asserts strides match the address width.

idma_otf_compute latches the per-transfer compute options and runtime-selects one op per transfer; the AXI write manager gains an external strobe mask and a strobe-independent beat-done so edge tiles drain. Compute support is decided at generation time: IDMA_VIDMA_IDS entries (variant[:ops][:fd|hd]) render the seam, the per-op ComputeEnable set and the transpose duplex into the listed variants only; non-compute variants are untouched. The write-side FIFOs grow by a tile to clear the legalizer in-flight bound, and compute variants require NO_ERROR_HANDLING.

Multi-tile aligned and edge transposes through the rw_axi backend, back-to-back geometry-leak checks, an nd_midend burst-address regression, a field-for-field midend unit test and launch_tf transpose options; the engine regression runs in both duplex modes.

Copilot

Pull request overview

This PR adds optional on-the-fly compute support to generated iDMA backends by introducing a transpose operation that is dispatched at the AXI write seam and carried per-transfer via new opt.compute request fields. It extends the generator to selectively include compute logic per backend variant, adds a transpose-geometry midend, updates write masking/beat retirement for edge tiles, and provides multiple new regressions (engine-only, ND end-to-end, and back-to-back).

Changes:

Extend request/options types with compute_options_t and add generator plumbing to enable compute per backend variant (--compute-ids / IDMA_VIDMA_IDS).
Add transpose datapath blocks (idma_otf_compute, idma_otf_transpose) and integrate them into the transport write seam, including strobe-mask support and strobe-independent beat retirement in idma_axi_write.
Add transpose geometry expander midend plus new directed/unit testbenches and a DPI-C golden model.

Reviewed changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated 3 comments.

File	Description
util/mario/util.py	Adds parsing for compute-enabled backend variant IDs (`--compute-ids`).
util/mario/transport_layer.py	Threads compute configuration into transport-layer template context.
util/mario/legalizer.py	Threads compute enable into legalizer template context.
util/mario/backend.py	Enforces compute placement constraints (single AXI write port) and passes op set into backend template context.
util/gen_idma.py	Adds `--compute-ids` CLI support and forwards compute configuration into renderers.
src/include/idma/typedef.svh	Extends `options_t` with `compute` field to carry per-transfer compute config.
src/idma_pkg.sv	Introduces compute op enums and packed option/enable types for on-the-fly compute.
src/midend/idma_transpose_midend.sv	New midend that expands transpose requests into a NumDim=4 tiled ND walk.
src/midend/idma_nd_midend.sv	Adds a simulation-time stride/address width consistency assert.
src/backend/idma_otf_compute.sv	New write-seam dispatcher that latches per-transfer compute options and selects the active engine.
src/backend/idma_otf_transpose.sv	New transpose engine (tile ping-pong) with edge masking.
src/backend/idma_axi_write.sv	Adds external strobe mask input and strobe-independent beat-done output for correct draining of masked beats.
src/db/idma_axi.yml	Wires compute into write datapath request and connects new `idma_axi_write` ports.
src/db/idma_tilelink.yml	Forwards compute into write datapath request struct literal.
src/backend/tpl/idma_transport_layer.sv.tpl	Integrates compute at the write seam and carries shifted external mask to the write manager.
src/backend/tpl/idma_legalizer.sv.tpl	Forces decouple signals when compute is enabled and forwards `opt.compute` through mutable options.
src/backend/tpl/idma_backend.sv.tpl	Adds compute fields to datapath request structs and increases meta FIFO depth for compute latency.
test/idma_test.sv	Extends test driver task to optionally enable/parameterize transpose per transfer.
test/idma_transpose_dpi.c	Adds DPI-C golden transpose model for standalone engine verification.
test/tb_idma_otf_transpose.sv	New standalone transpose-engine self-checking regression using DPI golden.
test/tb_idma_transpose_nd.sv	New end-to-end ND→backend transpose regression with edge/padding checks.
test/tb_idma_transpose_b2b.sv	New end-to-end back-to-back transpose regression to catch stale per-transfer state.
test/midend/tb_idma_transpose_midend.sv	Unit test verifying transpose midend geometry expansion and passthrough behavior.
test/midend/tb_idma_nd_midend_b2b.sv	Back-to-back ND midend regression under backpressure to catch base-address reuse.
doc/transpose-engine-routing-plan.md	Detailed routing/signaling design doc for transpose integration.
idma.mk	Adds compute-ids generation hook, split-RTL sim flow tag, and new Questa/VSIM targets for transpose regressions.
Bender.yml	Adds new RTL sources and introduces a `split_rtl` target flow for per-variant generated files and transpose tests.
jobs/backend_rw_axi/transpose_none.txt	Adds a placeholder job file for the rw_axi backend.

    # prepare database and ids
    protocol_ids = prepare_ids(args.ids)
+    compute_cfg = prepare_compute_ids(args.compute_ids)
    frontend_ids = prepare_fids(args.fids)
    protocol_db = read_database(args.db)


+def prepare_compute_ids(compute_id_strs: list) -> dict:
+    """Parses compute configuration IDs: <variant>[:<op>[,<op>...]][:fd|hd]"""
+    res = {}
+    for cid_str in (compute_id_strs or []):
+        parts = cid_str.split(':')
+        ops = ['transpose']
+        full_duplex = True
+        for part in parts[1:]:
+            if part in ('fd', 'hd'):
+                full_duplex = part == 'fd'
+            else:
+                ops = part.split(',')
+        for op in ops:
+            if op not in ('transpose',):
+                print(f'[MARIO] {op} is a non-supported compute op in {cid_str}', file=sys.stderr)
+                sys.exit(1)
+        res[parts[0]] = {'ops': ops, 'full_duplex': full_duplex}
+    return res
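To see what the ID syntax resolves to, the parsing rules in the hunk above can be exercised in isolation. The sketch below is a condensed single-entry copy of that logic with the `sys.exit` replaced by an exception so it runs standalone; the name `parse_compute_id` is hypothetical and not part of the PR.

```python
def parse_compute_id(cid_str: str) -> tuple:
    """Parse one <variant>[:<op>[,<op>...]][:fd|hd] entry."""
    parts = cid_str.split(':')
    ops, full_duplex = ['transpose'], True      # defaults from the hunk above
    for part in parts[1:]:
        if part in ('fd', 'hd'):
            full_duplex = part == 'fd'
        else:
            ops = part.split(',')
    for op in ops:
        if op not in ('transpose',):
            raise ValueError(f'{op} is a non-supported compute op in {cid_str}')
    return parts[0], {'ops': ops, 'full_duplex': full_duplex}


for entry in ('rw_axi', 'rw_axi:hd', 'rw_axi:transpose:fd'):
    print(entry, '->', parse_compute_id(entry))
```

Note that the order of the optional fields is not enforced: any field equal to fd or hd sets the duplex mode, and any other field is treated as the op list.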


+  // geometry: NE is a power of two, so only shifts and AND-masks
+  logic [1:0]       eff_mode;         // element-size mode, saturated at LaneW
+  logic [LaneW:0]   ne_m1;            // NE-1
+  logic [3:0]       log2_ne;          // log2(NE) = LaneW - eff_mode
+  logic [DimWidth-1:0] y_tiles, n_tiles; // row-tiles, col-tiles
+  logic [LaneW:0]   leftover_rows, leftover_cols;  // M%NE, N%NE (run-global)
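The shift-and-mask arithmetic named in these declarations can be modelled in a few lines. This is a sketch under stated assumptions: LaneW is taken to be log2(StrbWidth), the mode encoding 0/1/2 for int8/fp16/fp32 is assumed, and tile counts are assumed to round up; none of this is copied from idma_transpose_midend.sv.

```python
def transpose_geometry(m: int, n: int, strb_width: int = 64, mode: int = 0) -> dict:
    """Tile counts and leftovers using only shifts and AND-masks."""
    lane_w = strb_width.bit_length() - 1    # log2(StrbWidth), power of two assumed
    eff_mode = min(mode, lane_w)            # element-size mode, saturated at LaneW
    log2_ne = lane_w - eff_mode             # log2(NE)
    ne_m1 = (1 << log2_ne) - 1              # NE - 1, used as round-up term and mask
    return {
        "ne": ne_m1 + 1,
        "y_tiles": (m + ne_m1) >> log2_ne,  # row-tiles, ceil(M / NE)
        "n_tiles": (n + ne_m1) >> log2_ne,  # col-tiles, ceil(N / NE)
        "leftover_rows": m & ne_m1,         # M % NE
        "leftover_cols": n & ne_m1,         # N % NE
    }


print(transpose_geometry(100, 64, strb_width=64, mode=0))
print(transpose_geometry(100, 64, strb_width=64, mode=2))
```

A non-zero leftover marks a partial edge tile, which is where the element-granular strobe masking described in the PR comes into play.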


FrancescoConti and others added 7 commits June 10, 2026 16:03

idma_pkg: Add the per-transfer compute request model

9f4141e

doc: Add the transpose engine routing plan

577c381

DanielKellerM requested a review from thommythomaso as a code owner June 10, 2026 14:13

Copilot AI review requested due to automatic review settings June 10, 2026 14:13

Copilot started reviewing on behalf of DanielKellerM June 10, 2026 14:14 View session

DanielKellerM mentioned this pull request Jun 10, 2026

inst64: Drive on-the-fly transpose; add the snitch integration harness #113

Draft

Copilot AI reviewed Jun 10, 2026

DanielKellerM marked this pull request as draft June 10, 2026 14:26

DanielKellerM wants to merge 7 commits into pulp-platform:devel from DanielKellerM:compute/transpose-engine
