inst64: Drive on-the-fly transpose; add the snitch integration harness #113

Draft
DanielKellerM wants to merge 10 commits into pulp-platform:devel from DanielKellerM:systems/snitch-integration

@DanielKellerM DanielKellerM commented Jun 10, 2026

Collaborator

Summary

Stacked on #112 (the first 9 commits are that PR — review the last 2 commits here). This adds the frontend and system side of the on-the-fly transpose: the Snitch inst64 frontend learns to issue transpose transfers, and a standalone integration harness verifies the full path end-to-end.

inst64 frontend

  • Transpose requests are encoded in spare DMCPY argb bits: {enable[5], mode[7:6], tensor_m[19:8], tensor_n[31:20]} (register form only)
  • A ComputeEnable parameter gates everything at compile time: with it cleared the frontend elaborates exactly as before (NumDim stays 2, no expander)
  • With transpose enabled, NumDim=4, strides widen to the address width, and idma_transpose_midend is spliced between the request FIFO and the nd_midend
  • Malformed requests are rejected with an accelerator error response instead of mis-executing: transpose without hardware, reserved element mode, zero tensor dimensions, dst not bus-aligned
  • An elaboration assert cross-checks the generated backend's baked compute capability against the frontend's
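The argb field layout and the reject conditions listed above can be sketched in a few lines of Python. This is only an illustration, not the RTL: the field positions come from the summary, while the function names, the choice of `3` as the reserved mode code, the order of the checks, and the `bus_bytes` parameter are assumptions made for the example.

```python
# Field positions as stated in the PR summary:
# {enable[5], mode[7:6], tensor_m[19:8], tensor_n[31:20]}
ENABLE_BIT = 5
MODE_LSB, MODE_WIDTH = 6, 2
M_LSB, M_WIDTH = 8, 12
N_LSB, N_WIDTH = 20, 12

# Assumption: the summary says one element mode is reserved but not which code.
RESERVED_MODE = 3


def encode_transpose(mode, tensor_m, tensor_n):
    """Pack a transpose request into the spare DMCPY argb bits (hypothetical helper)."""
    if not 0 <= mode < (1 << MODE_WIDTH):
        raise ValueError("mode does not fit in 2 bits")
    if not 0 <= tensor_m < (1 << M_WIDTH) or not 0 <= tensor_n < (1 << N_WIDTH):
        raise ValueError("tensor dimension does not fit in 12 bits")
    return ((1 << ENABLE_BIT) | (mode << MODE_LSB)
            | (tensor_m << M_LSB) | (tensor_n << N_LSB))


def decode_transpose(argb):
    """Extract the transpose fields; bits [4:0] are left to the existing DMCPY config."""
    return {
        "enable": (argb >> ENABLE_BIT) & 1,
        "mode": (argb >> MODE_LSB) & ((1 << MODE_WIDTH) - 1),
        "tensor_m": (argb >> M_LSB) & ((1 << M_WIDTH) - 1),
        "tensor_n": (argb >> N_LSB) & ((1 << N_WIDTH) - 1),
    }


def reject_reason(argb, dst, *, has_transpose_hw, bus_bytes):
    """Return why a request would be rejected, or None if it is accepted.

    The four reasons mirror the summary; their priority order is an assumption.
    """
    fields = decode_transpose(argb)
    if not fields["enable"]:
        return None  # plain copy, no transpose checks apply
    if not has_transpose_hw:
        return "transpose without hardware"
    if fields["mode"] == RESERVED_MODE:
        return "reserved element mode"
    if fields["tensor_m"] == 0 or fields["tensor_n"] == 0:
        return "zero tensor dimension"
    if dst % bus_bytes != 0:
        return "dst not bus-aligned"
    return None
```

For instance, `encode_transpose(1, 16, 8)` sets bit 5, writes `1` into bits [7:6], `16` into bits [19:8] and `8` into bits [31:20], giving `0x801060`.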

Copilot AI review requested due to automatic review settings June 10, 2026 14:14
@DanielKellerM DanielKellerM marked this pull request as draft June 10, 2026 14:18

Copilot AI left a comment

Pull request overview

This PR wires the on-the-fly transpose feature end-to-end for the Snitch inst64 frontend and adds a standalone Snitch integration harness, while also extending the backend generation flow to optionally include compute support in selected variants.

Changes:

  • Add a typed per-transfer opt.compute capability (transpose op + params) and route it through legalizer/backend/transport to a write-seam compute dispatcher and transpose engine.
  • Extend the inst64 frontend to decode transpose requests from spare DMCPY argb bits, expand transpose geometry via a new idma_transpose_midend, and reject malformed transpose requests.
  • Add new SV/DPI-C testbenches (engine-level + ND/back-to-back) and a Snitch inst64 integration harness + Makefile flow (snitch_transpose_sweep), plus docs and Bender target support (split_rtl).
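As a rough picture of what an element-granular golden model checks, the following Python sketch transposes a dense tensor byte by byte. It is not the DPI-C model from this PR: the function name, the dense row-major layout, and the restriction to 1-, 2- and 4-byte elements (matching the int8/fp16/fp32 sizes named in the commit messages) are assumptions, and the tile-padded access contract of the hardware is not modeled.

```python
def transpose_golden(src, rows, cols, elem_bytes):
    """Transpose a dense row-major rows x cols tensor of elem_bytes-wide elements.

    Hypothetical reference model: each element moves as an opaque group of
    bytes, so the byte order inside an element is preserved.
    """
    if elem_bytes not in (1, 2, 4):
        raise ValueError("unsupported element size")
    if len(src) != rows * cols * elem_bytes:
        raise ValueError("source length does not match the tensor shape")
    dst = bytearray(len(src))
    for r in range(rows):
        for c in range(cols):
            s = (r * cols + c) * elem_bytes  # element (r, c) in the source
            d = (c * rows + r) * elem_bytes  # element (c, r) in the destination
            dst[d:d + elem_bytes] = src[s:s + elem_bytes]
    return bytes(dst)
```

Transposing twice with swapped dimensions returns the original buffer, which makes a cheap self-check for any testbench that reuses such a model.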

Reviewed changes

Copilot reviewed 36 out of 37 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| util/mario/util.py | Add parsing for --compute-ids configuration strings (ops + fd/hd). |
| util/mario/transport_layer.py | Pass compute enable/op info into transport-layer templating context. |
| util/mario/legalizer.py | Pass compute enable flag into legalizer templating context. |
| util/mario/backend.py | Enforce “single AXI write port” constraint for compute-enabled variants and pass ops into backend templating context. |
| util/gen_idma.py | Add --compute-ids CLI and propagate compute config into generators. |
| test/tb_idma_transpose_nd.sv | Multi-tile end-to-end transpose test via ND midend → compute backend → AXI sim mem. |
| test/tb_idma_transpose_b2b.sv | End-to-end back-to-back transpose regression to distinct destinations. |
| test/tb_idma_otf_transpose.sv | Standalone transpose engine SV testbench using DPI-C golden model. |
| test/midend/tb_idma_transpose_midend.sv | Unit test for transpose geometry expansion midend. |
| test/midend/tb_idma_nd_midend_b2b.sv | Back-to-back ND midend base-address reload regression under backpressure. |
| test/idma_transpose_dpi.c | DPI-C golden model for element-granular transpose verification. |
| test/idma_test.sv | Extend request-driving task to optionally program transpose compute options. |
| systems/snitch/test/tb_idma_inst64_transpose.sv | Snitch inst64 end-to-end transpose integration test (incl. rejects + no-leak). |
| systems/snitch/test/tb_idma_inst64_copy.sv | Snitch inst64 plain-copy regression. |
| systems/snitch/test/idma_inst64_tb_pkg.sv | Package/types/constants for the standalone Snitch harness. |
| systems/snitch/test/idma_inst64_drv_if.sv | Accelerator-bus BFM tasks, including DMCPY-encoded transpose launch helpers. |
| systems/snitch/test/idma_inst64_base.sv | Base harness instantiating idma_inst64_top + AXI sim memories. |
| systems/snitch/README.md | Document Snitch harness purpose, build flow, and transpose contract. |
| systems/snitch/Makefile | Standalone build + sim/sweep targets for the Snitch harness. |
| systems/snitch/.gitignore | Ignore build products for the Snitch harness flow. |
| src/midend/idma_transpose_midend.sv | New combinational transpose geometry expander (NumDim=4) for ND midend. |
| src/midend/idma_nd_midend.sv | Add non-synth assertion enforcing stride width == address width. |
| src/include/idma/typedef.svh | Extend options_t with typed compute options field. |
| src/idma_pkg.sv | Define compute op enum, transpose params, compute options, and feature enable struct. |
| src/frontend/inst64/idma_inst64_top.sv | Add ComputeEnable param, decode/validate transpose from DMCPY, splice transpose midend, widen strides to addr width, add backend capability cross-check. |
| src/db/idma_tilelink.yml | Forward compute options into write datapath request struct where needed. |
| src/db/idma_axi.yml | Forward compute options; extend AXI write template to accept strobe mask + beat-done pulse. |
| src/backend/tpl/idma_transport_layer.sv.tpl | Add write-seam compute integration (dispatcher + mask/beat-done plumbing). |
| src/backend/tpl/idma_legalizer.sv.tpl | Force decouple on compute transfers; propagate compute options into mutable transfer opts and write datapath req. |
| src/backend/tpl/idma_backend.sv.tpl | Add compute-enabled variant metadata (ComputeEnable), enforce NO_ERROR_HANDLING, increase meta FIFO depth for compute latency, propagate compute options into write datapath req. |
| src/backend/idma_otf_transpose.sv | New transpose engine (tile ping-pong) producing per-byte strobe mask. |
| src/backend/idma_otf_compute.sv | New write-seam compute dispatcher (currently transpose only). |
| src/backend/idma_axi_write.sv | Add external strobe mask input and a strobe-independent “beat accepted” pulse output. |
| jobs/backend_rw_axi/transpose_none.txt | Add job artifact/marker for transpose-none configuration (empty in this diff). |
| idma.mk | Add compute-enabled variant list (IDMA_VIDMA_IDS), propagate to generator, add simulation targets for transpose regressions, include split_rtl in vsim script target set. |
| doc/transpose-engine-routing-plan.md | Detailed routing/signaling plan and rationale for transpose integration. |
| Bender.yml | Add compute RTL, new midend, Snitch harness sources, transpose tests, and introduce split_rtl generated-file selection. |


Comment on lines +173 to +185
/// Extra write-descriptor slots covering the compute (transpose) tile-fill latency
localparam int unsigned ComputeFifoDepth = ${"StrbWidth" if enable_compute else "32'd0"};
% if enable_compute:

/// Per-op compute set baked into this variant (frontends may cross-check)
localparam idma_pkg::compute_enable_t ComputeEnable =
    '{${', '.join("%s: 1'b1" % op for op in compute_ops)}};
`ifndef SYNTHESIS
// no engine flush on abort: compute is incompatible with error handling
initial assert (ErrorCap == idma_pkg::NO_ERROR_HANDLING) else
    $fatal(1, "compute requires ErrorCap == NO_ERROR_HANDLING");
`endif
% endif
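To see what the two `${...}` expressions in the template excerpt above expand to, the Python below evaluates the same expressions outside the template engine. It is a sketch for illustration only: `render_compute_params` is a hypothetical helper, the op name `transpose` is an assumed member name of `compute_enable_t`, and the assertion block of the template is omitted.

```python
def render_compute_params(enable_compute, compute_ops):
    """Mimic the two substitutions from the backend template excerpt (illustrative only)."""
    lines = []
    # Same conditional expression as in the template.
    depth = "StrbWidth" if enable_compute else "32'd0"
    lines.append("localparam int unsigned ComputeFifoDepth = %s;" % depth)
    if enable_compute:
        # Same join expression as in the template: one "<op>: 1'b1" entry per op.
        ops = ", ".join("%s: 1'b1" % op for op in compute_ops)
        lines.append("localparam idma_pkg::compute_enable_t ComputeEnable =")
        lines.append("    '{" + ops + "};")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_compute_params(True, ["transpose"]))
    print(render_compute_params(False, []))
```

With a single op, the struct pattern carries one `name: 1'b1` entry; without compute, only the zero-depth FIFO parameter is emitted.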
Comment on lines +177 to +181
// full/empty token
always_ff @(posedge clk_i or negedge rst_ni) begin
  if (!rst_ni || clear_i || exec_done) begin
    full_q <= 2'b00;
  end else begin
FrancescoConti and others added 9 commits June 11, 2026 16:50
  • Verbatim copy of rtl/datamover_engine.sv from pulp-platform/datamover@d58a985
    (branch cdurrer/konark), the transpose core of the Ratha HWPE.

    Co-authored-by: Sergio Mazzola <smazzola@iis.ee.ethz.ch>
    Co-authored-by: Cyrill Durrer <cdurrer@iis.ee.ethz.ch>
  • Rework the imported datamover_engine.sv to iDMA conventions: plain valid/ready
    with byte/strb ports, no hwpe_stream/hci dependencies, transpose only. Runtime
    element size (int8/fp16/fp32), element-granular edge strobe, ping-pong tile
    banks with a half-area FullDuplex=0 option, and a standalone DPI-C golden
    regression.
  • compute_options_t carries {enable, op, params} in the request options;
    transpose_options_t packs the element mode and tensor shape; compute_enable_t
    is the compile-time per-op build gate.
  • idma_transpose_midend derives the NumDim=4 tiled walk (row / row-tile /
    col-tile) from the tensor shape and the bus StrbWidth, leaving the generic
    nd_midend to walk it; the geometry folds to shifts except one stride product.
    Guards the domain (StrbWidth >= 4, reserved mode, zero dims) and documents the
    tile-padded access contract; nd_midend asserts strides match the address width.
  • idma_otf_compute latches the per-transfer compute options and runtime-selects
    one op per transfer; the AXI write manager gains an external strobe mask and a
    strobe-independent beat-done so edge tiles drain. Compute support is decided
    at generation time: IDMA_VIDMA_IDS entries (variant[:ops][:fd|hd]) render the
    seam, the per-op ComputeEnable set and the transpose duplex into the listed
    variants only; non-compute variants are untouched. The write-side FIFOs grow
    by a tile to clear the legalizer in-flight bound, and compute variants require
    NO_ERROR_HANDLING.
  • Multi-tile aligned and edge transposes through the rw_axi backend, back-to-back
    geometry-leak checks, an nd_midend burst-address regression, a field-for-field
    midend unit test and launch_tf transpose options; the engine regression runs in
    both duplex modes.
  • Decode the transpose from spare DMCPY argb bits into opt.compute, expand
    NumDim to 4 with addr-width strides and splice the transpose midend between
    the request FIFO and the nd_midend, gated by a ComputeEnable parameter.
    Malformed requests (no hardware, reserved mode, zero dims, unaligned dst) get
    an error response and the backend's baked compute set is cross-checked at
    elaboration.
  • Standalone BFM harness driving the accelerator port: copy and transpose
    testbenches and a sweep covering all element sizes, tiling, edge, back-to-back,
    leak and reject cases, registered behind the snitch_cluster target; the flow
    regenerates the RTL before compiling.
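The commit messages above name the IDMA_VIDMA_IDS entry shape `variant[:ops][:fd|hd]` but this page does not show the parser. The sketch below is one way such an entry could be split. It is not the code in util/mario/util.py: the function name, the comma as op separator, the defaults (`transpose`, `fd`), and the reading of `fd`/`hd` as full/half duplex are all assumptions.

```python
def parse_compute_id(entry, default_ops=("transpose",), default_duplex="fd"):
    """Split one 'variant[:ops][:fd|hd]' entry into its parts (hypothetical parser)."""
    fields = entry.split(":")
    variant, rest = fields[0], fields[1:]
    if not variant or len(rest) > 2:
        raise ValueError("expected variant[:ops][:fd|hd], got %r" % entry)

    duplex = default_duplex
    # A trailing fd/hd token selects the duplex mode; anything else is an ops list.
    if rest and rest[-1] in ("fd", "hd"):
        duplex = rest.pop()
    if len(rest) > 1:
        raise ValueError("unknown duplex token in %r" % entry)

    # Assumption: several ops would be comma-separated inside the ops field.
    ops = tuple(rest[0].split(",")) if rest else tuple(default_ops)
    if any(not op for op in ops):
        raise ValueError("empty op name in %r" % entry)
    return {"variant": variant, "ops": ops, "duplex": duplex}
```

Under these assumptions, `parse_compute_id("rw_axi:transpose:hd")` selects the `rw_axi` variant with the transpose op in half-duplex mode, while a bare `rw_axi` falls back to the defaults.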
@DanielKellerM DanielKellerM force-pushed the systems/snitch-integration branch from 30bf0a1 to 854427a Compare June 11, 2026 14:52
lint-authors requires a blank line after the Authors block (the YAML
folded header-regex carries a trailing newline) and the plural
"Authors:" tag. Normalize the new transpose/snitch files accordingly.
