Contributor

@arai713 arai713 commented Jan 5, 2026

This PR introduces the generation of unit tests for Stream-K using Tile Engine. This will allow us to scale up the unit tests we have for better coverage and maintainability. It establishes a small test targeting fp16 and bf16 data types and covers both atomic and parallel reduction strategies within the compv3 pipeline. It also lays the groundwork for expanding to the full set of Stream-K smoke and extended tests.

These are the supported instances:

test_gemm_streamk_tile_engine_bf16_ccr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_ccr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_ccr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_ccr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_crr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_crr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_crr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_crr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rcr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rcr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rcr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rcr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rrr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rrr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rrr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_bf16_rrr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_ccr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_ccr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_ccr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_ccr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_crr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_crr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_crr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_crr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rcr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rcr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rcr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rcr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rrr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rrr_small_datatype_config_compv3_default_intrawave_atomic_False_False_False_True_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rrr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_False_128x128x64_2x2x1_16x16x16
test_gemm_streamk_tile_engine_fp16_rrr_small_datatype_config_compv3_default_intrawave_reduction_False_False_False_True_128x128x64_2x2x1_16x16x16
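
The 32 names above look like the cross product of four axes: data type, layout, reduction strategy, and one trailing boolean flag. The sketch below rebuilds the list on that assumption. It is not the generator used in this PR, and the meaning of the four boolean fields is not stated here, so the varying one is simply called `LAST_FLAG`.

```python
from itertools import product

# Axes that vary across the instance names listed above.
DTYPES = ["bf16", "fp16"]
LAYOUTS = ["ccr", "crr", "rcr", "rrr"]
STRATEGIES = ["atomic", "reduction"]
LAST_FLAG = ["False", "True"]  # hypothetical name; the flag's meaning is not given in the PR


def instance_names():
    """Return the instance names in the same order as the list above."""
    names = []
    for dtype, layout, strategy, flag in product(DTYPES, LAYOUTS, STRATEGIES, LAST_FLAG):
        names.append(
            f"test_gemm_streamk_tile_engine_{dtype}_{layout}"
            f"_small_datatype_config_compv3_default_intrawave_{strategy}"
            f"_False_False_False_{flag}_128x128x64_2x2x1_16x16x16"
        )
    return names


names = instance_names()
print(len(names))  # 2 dtypes * 4 layouts * 2 strategies * 2 flag values
```

Adding a value to any axis multiplies the number of generated test executables.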

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation that helps maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

Comment on lines +7 to +9
{"m": 256, "n": 256, "k": 128, "split_k": 1},
{"m": 512, "n": 256, "k": 256, "split_k": 1},
{"m": 256, "n": 512, "k": 256, "split_k": 1}
Collaborator

Should these sizes cover the existing cases in test/ck_tile/gemm_streamk?
If so, did we want to remove the old test files and replace them with these to avoid adding redundant tests?
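
When comparing these sizes with the existing cases, it may help to see how much work each one represents. The sketch below is only an illustration. It assumes the `128x128x64` in the instance names is the M x N x K block tile, which this thread does not confirm, and it counts output tiles and K iterations per tile for the three sizes quoted above.

```python
import json
from math import ceil

# The three problem sizes quoted above, wrapped in a list for parsing.
PROBLEM_SIZES = json.loads("""
[
  {"m": 256, "n": 256, "k": 128, "split_k": 1},
  {"m": 512, "n": 256, "k": 256, "split_k": 1},
  {"m": 256, "n": 512, "k": 256, "split_k": 1}
]
""")

# Assumption: 128x128x64 from the instance names is the M x N x K block tile.
TILE_M, TILE_N, TILE_K = 128, 128, 64


def tile_counts(size):
    """Return (output tiles, K iterations per tile) for one problem size."""
    tiles = ceil(size["m"] / TILE_M) * ceil(size["n"] / TILE_N)
    k_iters = ceil(size["k"] / TILE_K)
    return tiles, k_iters


for size in PROBLEM_SIZES:
    tiles, k_iters = tile_counts(size)
    print(size["m"], size["n"], size["k"], "->", tiles, "tiles,", k_iters, "K iterations each")
```

Under that assumption every size is an exact multiple of the tile, so none of the three would exercise partial tiles.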

@arai713 arai713 force-pushed the arai/ck_tile/streamk_tile_engine_test branch 2 times, most recently from 7db8519 to 0e6e6a7 Compare January 9, 2026 19:52
@arai713 arai713 marked this pull request as ready for review January 9, 2026 19:52
- **SKIPPED**: Kernel validation returned "Arguments not supported" (expected for certain problem sizes/configurations) ⚠️
- **FAILED**: Actual error or incorrect computation results ❌

When a kernel's `IsSupportedArgument()` check fails (e.g., due to vector alignment requirements, dimension constraints, or padding limitations), the test is automatically skipped rather than failed. This allows comprehensive testing across various problem sizes while gracefully handling configurations that don't meet specific kernel requirements.
Collaborator

My opinion is there should never be a set of arguments passed to the test suite that are not supported.
We should have a set of black and white tests that we know will either absolutely pass or absolutely fail.

Consider a scenario where the implementation of IsSupportedArgument() changes, for instance because someone unintentionally reduces the vector alignment requirements. Valid kernels would then just be skipped rather than failing, so we might miss the regression.

IsSupportedArgument() is pretty fragile in a sense because it calls a lot of getter functions from all over the place.
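
The concern can be made concrete with a small model of the two policies. This is a sketch, not code from the PR: `classify`, `Outcome`, and `must_be_supported` are hypothetical names, and the booleans stand in for the result of the support check and the numerical comparison.

```python
from enum import Enum


class Outcome(Enum):
    PASSED = "PASSED"
    SKIPPED = "SKIPPED"
    FAILED = "FAILED"


def classify(is_supported, result_correct, must_be_supported=False):
    """Classify one test case.

    is_supported      -- what the support check returned for these arguments
    result_correct    -- whether the kernel output matched the reference
    must_be_supported -- if True, a rejected configuration is a failure
                         instead of a skip
    """
    if not is_supported:
        return Outcome.FAILED if must_be_supported else Outcome.SKIPPED
    return Outcome.PASSED if result_correct else Outcome.FAILED


# A regression makes the support check reject a configuration that used to run.
print(classify(False, True).value)                          # skip-on-unsupported policy
print(classify(False, True, must_be_supported=True).value)  # strict policy
```

With the skip policy the regression shows up only as a different skip count, while the strict policy turns it into a failing test.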

@arai713 arai713 force-pushed the arai/ck_tile/streamk_tile_engine_test branch from 0e6e6a7 to 89e4aba Compare January 12, 2026 19:57
Comment on lines +250 to +270
# Enable parallel compilation optimizations
# Set up job pools for better parallel compilation control
set_property(GLOBAL PROPERTY JOB_POOLS
    compile_heavy=4    # Limit heavy compilations to prevent OOM
    compile_normal=16  # Allow more parallel normal compilations
)

# Enable compiler cache if available and explicitly requested
# Disabled by default due to permission issues in CI environments
option(ENABLE_CCACHE_TESTS "Enable ccache for test compilation" OFF)
if(ENABLE_CCACHE_TESTS)
    find_program(CCACHE_PROGRAM ccache)
    if(CCACHE_PROGRAM)
        set(CMAKE_CXX_COMPILER_LAUNCHER ${CCACHE_PROGRAM})
        message(STATUS "Using ccache for faster test compilation")
    else()
        message(WARNING "ccache requested but not found")
    endif()
else()
    message(STATUS "ccache disabled for tests (use -DENABLE_CCACHE_TESTS=ON to enable)")
endif()
Collaborator

Tabbed in.

Also there are job pools set up here. Where are they used?

Contributor Author

I believe these were set up with the purpose of reducing the load at build time, but I don't think they are actually being used anywhere.

Comment on lines +20 to +21
- **`--gen_individual`**: Generate all kernel headers in parallel during CMake configuration
- **`--gen_single`**: Generate individual kernel header for each configuration
Collaborator

The difference between these two is not very clear to me.

Contributor Author

gen_single creates one single instance, while gen_individual creates individual headers for all the combinations available


## Data Type Support
- **fp16, bf16**: Fully supported - all layouts (rcr, rrr, ccr, crr)
- **fp64**: Not supported (hardware MFMA limitation)
Collaborator

fp64 mfma is supported on gfx9 (minus gfx908). Is this more of a CK limitation?

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces Stream-K GEMM unit tests using the Tile Engine infrastructure, establishing a test generation framework that mirrors tile_engine's kernel generation methodology. The tests cover fp16 and bf16 data types across all matrix layouts (rcr, rrr, ccr, crr) for both atomic and parallel reduction strategies, as well as a newly added TreeReduction strategy.

Changes:

  • Added support for TreeReduction strategy across profiler, instance builder, and configuration files
  • Created a comprehensive test framework that generates individual test executables for each kernel configuration
  • Established test parameter extraction from JSON configurations for flexible test definition

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `tile_engine/ops/gemm_streamk/gemm_streamk_profiler.hpp` | Added TreeReduction to the reduction strategy mapping for kernel profiling |
| `tile_engine/ops/gemm_streamk/gemm_streamk_instance_builder.py` | Added TreeReduction mapping and workspace reset logic |
| `tile_engine/ops/gemm_streamk/configs/default_config.json` | Added "tree" to the list of supported reduction strategies |
| `test/ck_tile/gemm_streamk_tile_engine/test_gemm_streamk_simple.cpp` | New test implementation using GTest with tile_engine error thresholds |
| `test/ck_tile/gemm_streamk_tile_engine/extract_test_params.py` | Python utility to extract test parameters from JSON config files |
| `test/ck_tile/gemm_streamk_tile_engine/configs/simple_test_config.json` | Test configuration defining problem sizes and kernel traits |
| `test/ck_tile/gemm_streamk_tile_engine/README.md` | Comprehensive documentation of the test framework and its integration |
| `test/ck_tile/gemm_streamk_tile_engine/CMakeLists.txt` | Build system integration for generating and compiling test targets |
| `test/ck_tile/CMakeLists.txt` | Added gemm_streamk_tile_engine subdirectory to test build |

@arai713 arai713 force-pushed the arai/ck_tile/streamk_tile_engine_test branch from 89e4aba to 01b28de Compare January 17, 2026 10:31
@arai713 arai713 force-pushed the arai/ck_tile/streamk_tile_engine_test branch from 01b28de to cb76173 Compare January 19, 2026 18:21
This change adds an implementation for generating Stream-K tests using Tile Engine.
This will generate various test executables for different combinations based on the
config files. This addition has simple tests running for bf16 and fp16, with both
atomic and reduction strategies and compv3 pipeline. The tests rely on the implementation
of Stream-K in Tile Engine.
@arai713 arai713 force-pushed the arai/ck_tile/streamk_tile_engine_test branch from cb76173 to 64f4d20 Compare January 19, 2026 19:26

5 participants