Add on-the-fly compute support with a transpose engine at the write seam #112

Draft
DanielKellerM wants to merge 7 commits into pulp-platform:devel from DanielKellerM:compute/transpose-engine


@DanielKellerM

Collaborator

Summary

This PR adds optional on-the-fly compute to the generated iDMA backends: transfers can be transformed while they stream through the DMA, with no extra memory passes. The first (and currently only) compute op is a matrix transpose; the architecture is an extension point for further ops.

The transpose datapath is adapted from the datamover (Ratha) HWPE — imported verbatim from pulp-platform/datamover@d58a985 in the first commit (original authors credited), then reworked to iDMA conventions.

Architecture

opt.compute (request)  →  idma_transpose_midend  →  idma_nd_midend  →  backend
                           (tensor shape → 4-D walk)                     │
                                       write seam: idma_otf_compute ── idma_otf_transpose
  • Request model (idma_pkg): opt.compute = {enable, op, params} carried per transfer; transpose params are the element mode (int8/fp16/fp32) and the tensor shape (M, N up to 4095).
  • Geometry midend: expands a compact transpose request into the tiled NumDim=4 ND walk (row / row-tile / col-tile) for the unmodified idma_nd_midend; the geometry strength-reduces to shifts except one 12x12 stride product.
  • Write-seam dispatcher: idma_otf_compute latches the per-transfer options and selects one op; the AXI write manager gains an external strobe mask plus a strobe-independent beat-done so partial edge tiles drain correctly.
  • Engine: NE x NE tile ping-pong (NE = StrbWidth/elem), runtime element size, element-granular edge masking. Steady state costs 1 + 1/NE cycles per beat, i.e. NE + 1 cycles per NE-beat tile (~98% of bus peak at NE=64).
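
The engine numbers above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function names and the mode-to-byte-size table are hypothetical, not taken from the RTL. It also assumes the steady-state cost is 1 + 1/NE cycles per bus beat (NE + 1 cycles per NE-beat tile), which is the reading that reproduces the quoted ~98% at NE=64.

```python
# Illustrative arithmetic only; names are hypothetical, not taken from the RTL.
ELEM_BYTES = {"int8": 1, "fp16": 2, "fp32": 4}


def elems_per_beat(bus_bits: int, mode: str) -> int:
    """NE = StrbWidth / element size, with StrbWidth counted in bytes."""
    strb_width = bus_bits // 8
    return strb_width // ELEM_BYTES[mode]


def steady_state_efficiency(ne: int) -> float:
    """Fraction of bus peak if every beat costs 1 + 1/NE cycles (assumed reading)."""
    return 1.0 / (1.0 + 1.0 / ne)


for mode in ELEM_BYTES:
    ne = elems_per_beat(512, mode)
    print(f"{mode}: NE={ne}, efficiency={steady_state_efficiency(ne):.3f}")
```

Under this reading the efficiency is NE / (NE + 1), so it drops as elements get wider on the same bus, because the tile edge shrinks.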

Generation-time configuration

Compute is a generation decision, not an SV parameter: IDMA_VIDMA_IDS entries take variant[:ops][:fd|hd] (default rw_axi = transpose, full duplex). Non-listed variants render without any compute logic. The hd option builds a single tile bank — half the buffer area (StrbWidth^2 bytes per bank, e.g. 4 KiB at 512 bit) for roughly half the streaming rate.
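
As a rough sizing aid, the buffer figures quoted above can be reproduced as follows. The helper name is hypothetical, and the sketch assumes that full duplex means two ping-pong banks and hd means one, which is how "half the buffer area" reads here.

```python
def tile_buffer_bytes(bus_bits: int, full_duplex: bool = True) -> int:
    """Total tile storage: StrbWidth^2 bytes per bank."""
    strb_width = bus_bits // 8           # bus width in bytes
    bank_bytes = strb_width ** 2         # one NE x NE tile of bytes
    banks = 2 if full_duplex else 1      # assumption: fd = two ping-pong banks
    return banks * bank_bytes


print(tile_buffer_bytes(512, full_duplex=False))  # single bank at 512 bit
print(tile_buffer_bytes(512, full_duplex=True))   # two banks at 512 bit
```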

Constraints, enforced at generation/elaboration time: compute variants need a single AXI write port and NO_ERROR_HANDLING; sources/destinations must be readable/writable up to the tile-padded bounds (documented contract; writes are strobe-masked, reads are not).
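
One way to picture the tile-padded contract is to round both tensor dimensions up to the tile edge. The sketch below assumes that "tile-padded bounds" means exactly that rounding; the helper name is hypothetical, and the real byte footprint also depends on strides that are not modelled here.

```python
def padded_dims(m: int, n: int, ne: int) -> tuple:
    """Round M and N up to the next multiple of the tile edge NE."""
    def pad(x: int) -> int:
        return -(-x // ne) * ne          # ceiling division, then scale back up
    return pad(m), pad(n)


# A 100 x 70 int8 tensor on a 512-bit bus (NE = 64) would be accessed as if it
# were 128 x 128 under this assumption; reads beyond the real edge are unmasked.
print(padded_dims(100, 70, 64))
```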

Verification

  • Standalone engine regression against a DPI-C golden model, in both duplex modes
  • Multi-tile aligned + edge ND transposes through the generated rw_axi backend, back-to-back geometry-leak checks, an nd_midend burst-address regression and a field-for-field midend unit test
  • Non-compute variants regenerate logic-identical to the base branch
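
For orientation, a reference transpose of the kind a golden model computes can be written in a few lines. This is a Python sketch, not the PR's DPI-C model: it assumes densely packed row-major data, and the function name is hypothetical.

```python
def transpose_golden(src: bytes, m: int, n: int, elem_bytes: int) -> bytes:
    """Transpose a row-major M x N matrix whose elements are elem_bytes wide."""
    assert len(src) == m * n * elem_bytes
    dst = bytearray(len(src))
    for r in range(m):
        for c in range(n):
            s = (r * n + c) * elem_bytes     # source offset of element (r, c)
            d = (c * m + r) * elem_bytes     # destination offset of element (c, r)
            dst[d:d + elem_bytes] = src[s:s + elem_bytes]
    return bytes(dst)


matrix = bytes([1, 2, 3, 4, 5, 6])           # 2 x 3 matrix of int8 elements
print(list(transpose_golden(matrix, 2, 3, 1)))
```

Element bytes are moved as opaque groups, so the same routine covers int8, fp16 and fp32 by changing only elem_bytes.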

FrancescoConti and others added 7 commits June 10, 2026 16:03
Verbatim copy of rtl/datamover_engine.sv from pulp-platform/datamover@d58a985
(branch cdurrer/konark), the transpose core of the Ratha HWPE.

Co-authored-by: Sergio Mazzola <smazzola@iis.ee.ethz.ch>
Co-authored-by: Cyrill Durrer <cdurrer@iis.ee.ethz.ch>

Rework the imported datamover_engine.sv to iDMA conventions: plain valid/ready
with byte/strb ports, no hwpe_stream/hci dependencies, transpose only. Runtime
element size (int8/fp16/fp32), element-granular edge strobe, ping-pong tile
banks with a half-area FullDuplex=0 option, and a standalone DPI-C golden
regression.

compute_options_t carries {enable, op, params} in the request options;
transpose_options_t packs the element mode and tensor shape; compute_enable_t
is the compile-time per-op build gate.

idma_transpose_midend derives the NumDim=4 tiled walk (row / row-tile /
col-tile) from the tensor shape and the bus StrbWidth, leaving the generic
nd_midend to walk it; the geometry folds to shifts except one stride product.
Guards the domain (StrbWidth >= 4, reserved mode, zero dims) and documents the
tile-padded access contract; nd_midend asserts strides match the address width.

idma_otf_compute latches the per-transfer compute options and runtime-selects
one op per transfer; the AXI write manager gains an external strobe mask and a
strobe-independent beat-done so edge tiles drain. Compute support is decided
at generation time: IDMA_VIDMA_IDS entries (variant[:ops][:fd|hd]) render the
seam, the per-op ComputeEnable set and the transpose duplex into the listed
variants only, non-compute variants are untouched. The write-side FIFOs grow
by a tile to clear the legalizer in-flight bound and compute variants require
NO_ERROR_HANDLING.

Multi-tile aligned and edge transposes through the rw_axi backend, back-to-back
geometry-leak checks, an nd_midend burst-address regression, a field-for-field
midend unit test and launch_tf transpose options; the engine regression runs in
both duplex modes.

Copilot AI left a comment


Pull request overview

This PR adds optional on-the-fly compute support to generated iDMA backends by introducing a transpose operation that is dispatched at the AXI write seam and carried per-transfer via new opt.compute request fields. It extends the generator to selectively include compute logic per backend variant, adds a transpose-geometry midend, updates write masking/beat retirement for edge tiles, and provides multiple new regressions (engine-only, ND end-to-end, and back-to-back).

Changes:

  • Extend request/options types with compute_options_t and add generator plumbing to enable compute per backend variant (--compute-ids / IDMA_VIDMA_IDS).
  • Add transpose datapath blocks (idma_otf_compute, idma_otf_transpose) and integrate them into the transport write seam, including strobe-mask support and strobe-independent beat retirement in idma_axi_write.
  • Add transpose geometry expander midend plus new directed/unit testbenches and a DPI-C golden model.

Reviewed changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| util/mario/util.py | Adds parsing for compute-enabled backend variant IDs (--compute-ids). |
| util/mario/transport_layer.py | Threads compute configuration into transport-layer template context. |
| util/mario/legalizer.py | Threads compute enable into legalizer template context. |
| util/mario/backend.py | Enforces compute placement constraints (single AXI write port) and passes op set into backend template context. |
| util/gen_idma.py | Adds --compute-ids CLI support and forwards compute configuration into renderers. |
| src/include/idma/typedef.svh | Extends options_t with compute field to carry per-transfer compute config. |
| src/idma_pkg.sv | Introduces compute op enums and packed option/enable types for on-the-fly compute. |
| src/midend/idma_transpose_midend.sv | New midend that expands transpose requests into a NumDim=4 tiled ND walk. |
| src/midend/idma_nd_midend.sv | Adds a simulation-time stride/address width consistency assert. |
| src/backend/idma_otf_compute.sv | New write-seam dispatcher that latches per-transfer compute options and selects the active engine. |
| src/backend/idma_otf_transpose.sv | New transpose engine (tile ping-pong) with edge masking. |
| src/backend/idma_axi_write.sv | Adds external strobe mask input and strobe-independent beat-done output for correct draining of masked beats. |
| src/db/idma_axi.yml | Wires compute into write datapath request and connects new idma_axi_write ports. |
| src/db/idma_tilelink.yml | Forwards compute into write datapath request struct literal. |
| src/backend/tpl/idma_transport_layer.sv.tpl | Integrates compute at the write seam and carries shifted external mask to the write manager. |
| src/backend/tpl/idma_legalizer.sv.tpl | Forces decouple signals when compute is enabled and forwards opt.compute through mutable options. |
| src/backend/tpl/idma_backend.sv.tpl | Adds compute fields to datapath request structs and increases meta FIFO depth for compute latency. |
| test/idma_test.sv | Extends test driver task to optionally enable/parameterize transpose per transfer. |
| test/idma_transpose_dpi.c | Adds DPI-C golden transpose model for standalone engine verification. |
| test/tb_idma_otf_transpose.sv | New standalone transpose-engine self-checking regression using DPI golden. |
| test/tb_idma_transpose_nd.sv | New end-to-end ND→backend transpose regression with edge/padding checks. |
| test/tb_idma_transpose_b2b.sv | New end-to-end back-to-back transpose regression to catch stale per-transfer state. |
| test/midend/tb_idma_transpose_midend.sv | Unit test verifying transpose midend geometry expansion and passthrough behavior. |
| test/midend/tb_idma_nd_midend_b2b.sv | Back-to-back ND midend regression under backpressure to catch base-address reuse. |
| doc/transpose-engine-routing-plan.md | Detailed routing/signaling design doc for transpose integration. |
| idma.mk | Adds compute-ids generation hook, split-RTL sim flow tag, and new Questa/VSIM targets for transpose regressions. |
| Bender.yml | Adds new RTL sources and introduces a split_rtl target flow for per-variant generated files and transpose tests. |
| jobs/backend_rw_axi/transpose_none.txt | Adds a placeholder job file for the rw_axi backend. |


Comment thread util/gen_idma.py
Comment on lines 54 to 58
# prepare database and ids
protocol_ids = prepare_ids(args.ids)
compute_cfg = prepare_compute_ids(args.compute_ids)
frontend_ids = prepare_fids(args.fids)
protocol_db = read_database(args.db)
Comment thread util/mario/util.py
Comment on lines +140 to +157
def prepare_compute_ids(compute_id_strs: list) -> dict:
    """Parses compute configuration IDs: <variant>[:<op>[,<op>...]][:fd|hd]"""
    res = {}
    for cid_str in (compute_id_strs or []):
        parts = cid_str.split(':')
        ops = ['transpose']
        full_duplex = True
        for part in parts[1:]:
            if part in ('fd', 'hd'):
                full_duplex = part == 'fd'
            else:
                ops = part.split(',')
        for op in ops:
            if op not in ('transpose',):
                print(f'[MARIO] {op} is a non-supported compute op in {cid_str}', file=sys.stderr)
                sys.exit(1)
        res[parts[0]] = {'ops': ops, 'full_duplex': full_duplex}
    return res
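
To make the accepted ID syntax concrete, here is a self-contained restatement of the parser with a few sample inputs. It is a sketch under assumptions: the name parse_compute_ids is hypothetical, it raises ValueError instead of calling sys.exit so it can be exercised in isolation, and it assumes the op check runs once per ID after all segments have been read.

```python
def parse_compute_ids(compute_id_strs):
    """Hypothetical restatement of the reviewed parser; raises instead of exiting."""
    res = {}
    for cid_str in (compute_id_strs or []):
        variant, *opts = cid_str.split(':')
        ops, full_duplex = ['transpose'], True    # defaults when no option is given
        for part in opts:
            if part in ('fd', 'hd'):
                full_duplex = part == 'fd'
            else:
                ops = part.split(',')
        unknown = [op for op in ops if op != 'transpose']
        if unknown:
            raise ValueError(f'unsupported compute op(s) {unknown} in {cid_str}')
        res[variant] = {'ops': ops, 'full_duplex': full_duplex}
    return res


print(parse_compute_ids(['rw_axi']))                 # defaults: transpose, full duplex
print(parse_compute_ids(['rw_axi:hd']))              # duplex given without an op list
print(parse_compute_ids(['rw_axi:transpose:hd']))    # explicit op list and duplex
```

Because any segment that is not fd or hd is treated as an op list, the duplex flag and the op list can appear in either order.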
Comment on lines +48 to +53
// geometry: NE is a power of two, so only shifts and AND-masks
logic [1:0] eff_mode; // element-size mode, saturated at LaneW
logic [LaneW:0] ne_m1; // NE-1
logic [3:0] log2_ne; // log2(NE) = LaneW - eff_mode
logic [DimWidth-1:0] y_tiles, n_tiles; // row-tiles, col-tiles
logic [LaneW:0] leftover_rows, leftover_cols; // M%NE, N%NE (run-global)
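
The shift-and-mask geometry these declarations describe can be modelled in software as below. Several points are assumptions rather than facts from the RTL: LaneW is taken to be log2(StrbWidth), the mode is taken to be log2 of the element size in bytes (int8 = 0, fp16 = 1, fp32 = 2), and the tile counts are taken to be ceiling divisions. The function name is hypothetical.

```python
def transpose_geometry(m: int, n: int, strb_width: int, mode: int) -> dict:
    """Model the tile geometry using only shifts and AND-masks."""
    lane_w = strb_width.bit_length() - 1    # log2(StrbWidth); power of two assumed
    eff_mode = min(mode, lane_w)            # element-size mode, saturated at LaneW
    log2_ne = lane_w - eff_mode             # log2(NE)
    ne_m1 = (1 << log2_ne) - 1              # NE - 1, also the modulo mask
    return {
        "ne": ne_m1 + 1,
        "y_tiles": (m + ne_m1) >> log2_ne,  # row-tiles: ceil(M / NE), assumed
        "n_tiles": (n + ne_m1) >> log2_ne,  # col-tiles: ceil(N / NE), assumed
        "leftover_rows": m & ne_m1,         # M % NE
        "leftover_cols": n & ne_m1,         # N % NE
    }


print(transpose_geometry(100, 70, strb_width=64, mode=0))
```

A leftover of zero means the dimension is tile-aligned, so no edge masking would be needed along that axis.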
DanielKellerM marked this pull request as draft June 10, 2026 14:26