v1.2.5

@lisaong released this 24 May 06:26 · 44 commits to main since this release

What's Changed

  • Merged PR 2593: [docs] [release] bump docs version to 1.2.5 in
    preparation for release. [Lisa Ong]

  • Merged PR 2586: Loop order and indices as parameters. [Denny Sun]

    With this change, the user can write a schedule with loop_order parameterized:

        # i, j, k, ii, jj, kk are the schedule's loop indices, defined elsewhere
        loop_order = create_parameters()
        schedule.reorder(order=loop_order)

        parameter_grid = {
            loop_order: (j, k, i, ii, jj, kk)
        }

        # Keep only valid loop orders, then randomly sample 5 candidates
        parameters = create_parameter_grid(parameter_grid,
                                           filter_func=lambda *p: schedule.is_valid_loop_order(p[0][0]),
                                           sample=5)

        # Add another function to the package
        package.add(
            plan,
            args=(A, B, C),
            parameters=parameters,
            base_name="matmul_256_256_256"
        )

    Related work items: #3693

  • Merged PR 2591: Fixes more warnings. Enables STRICT_MODE for Linux PR
    CI. [Kern Handa]

  • Merged PR 2588: [test] Trim out redundant tests from ROCm pipeline.
    [Lisa Ong]

    The ROCm pipeline currently runs on a single agent, so CPU tests that already run in other pipelines are skipped to speed up pipeline execution.

  • Merged PR 2590: [nfc] Fixes a bunch of warnings in C++ layer. [Kern
    Handa]

  • Merged PR 2589: [test] Adds DSL tests for Schedule.pad. [Kern Handa]

  • Merged PR 2587: Sync Github to ADO. [Lisa Ong]

    commit b934ad05f6b8cd84420226b93f57b8ac3229eadc

  • Merged PR 2585: Use conditional instead of loop-unswitching on GPU.
    [Chuck Jacobs]

    This PR changes how boundary conditions are handled on GPU-bound loop indices. If a loop's increment doesn't evenly divide its bounds, the body is guarded by a conditional instead of unswitching that loop.

    Related work items: #3703
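
    A minimal plain-Python sketch of the difference (conceptual only, not Accera-emitted code), for a bound of 10 and a step of 4:

        def body(idx):
            print(idx)

        # Before: unswitching splits the loop into a main part and a boundary part.
        for i in range(0, 8, 4):        # full blocks only
            for ii in range(4):
                body(i + ii)
        for ii in range(8, 10):         # separate remainder loop
            body(ii)

        # After (on GPU-bound indices): one loop whose body is guarded.
        for i in range(0, 12, 4):
            for ii in range(4):
                if i + ii < 10:         # conditional guard instead of unswitching
                    body(i + ii)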

  • Merged PR 2571: Add random seed to enable reproducible sampling.
    [Denny Sun]

    Adds a random seed for parameter-grid sampling so that sampled configurations are reproducible, giving users more control over sampling strategies.
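
    As a hedged sketch building on the parameter-grid example above, fixing the seed should make the sampled subset repeatable across runs; the seed keyword name is an assumption based on this PR's description, not something spelled out in these notes.

        parameters = create_parameter_grid(
            parameter_grid,
            filter_func=lambda *p: schedule.is_valid_loop_order(p[0][0]),
            sample=5,
            seed=2022,  # assumed keyword: the same seed should reproduce the same 5 samples
        )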

  • Merged PR 2581: Add CUDA tensor core support. [Ritwik Das]

    • Added CUDA tensor ops (no caching)
    • Added validation tests
    • Changed MMA enum names
    • A snippet of the generated tensor op code in CUDA:
    ...
    // Load the 16x16 accumulator tile (C) from global memory
    vhalf *var11 = (vhalf*)arg2;
    wmma::fragment<wmma::accumulator, 16, 16, 16, vhalf> mmaMatrix_12;
    wmma::load_matrix_sync(mmaMatrix_12, var11 + var9 * 16 + var10, 16, wmma::layout_t::mem_row_major);
    // Load the 16x16 A tile
    vhalf *var13 = (vhalf*)arg0;
    wmma::fragment<wmma::matrix_a, 16, 16, 16, vhalf, wmma::row_major> mmaMatrix_14;
    wmma::load_matrix_sync(mmaMatrix_14, var13 + var9 * 16 + 0, 16);
    // Load the 16x16 B tile
    vhalf *var15 = (vhalf*)arg1;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, vhalf, wmma::row_major> mmaMatrix_16;
    wmma::load_matrix_sync(mmaMatrix_16, var15 + 0 * 16 + var10, 16);
    // D = A * B + C on the tensor cores, then store the result back
    wmma::fragment<wmma::accumulator, 16, 16, 16, vhalf> mmaMatrix_17;
    wmma::mma_sync(mmaMatrix_17, mmaMatrix_14, mmaMatrix_16, mmaMatrix_12);
    wmma::store_matrix_sync(var11 + var9 * 16 + var10, mmaMatrix_17, 16, wmma::layout_t::mem_row_major);
    

    Related work items: #3694

  • Merged PR 2584: Adds cublas_gemm benchmarking tool. [Kern Handa]

  • Merged PR 2583: Don't hold ResolveWarpSize results with rvalue. [Mason
    Remy]

    Don't hold ResolveWarpSize results with rvalue

    gcc appears to be inlining ResolveWarpSize incorrectly in some cases, and
    not holding the result in an rvalue pair appears to fix it.

    The miscompile was producing mod-by-0 and floordiv-by-0 operations where
    the warp size constant was expected to be exactly 32 or 64.

  • Merged PR 2580: Fixes rocblas_gemm's fp32 -> fp16 conversion. [Kern
    Handa]

  • Merged PR 2579: Improves accera_gemm.py's handling of unsupported
    configs. [Kern Handa]

  • Merged PR 2578: Fixes time unit conversions in accera_gemm.py. [Kern
    Handa]

    Also addresses comments for the previous rocblas_gemm PR

  • Merged PR 2577: Fixes accera_gemm.py code after Plan.tensorize API
    change. [Kern Handa]

  • Merged PR 2575: Adds library warmup to rocblas_gemm benchmarker. [Kern
    Handa]

  • Merged PR 2572: [nfc] Move accera/viz -> tools/viz. [Kern Handa]

  • Merged PR 2573: Update setup.cfg hatlib dependency version. [Mason
    Remy]

  • Merged PR 2557: Overhauls the benchmarking tool. [Kern Handa]

    This change moves the benchmarking tool to a top-level tools/benchmarkers directory. The tool has also been split up so that the accera portion is in its own file, while the driver portion of the tool remains intact and has gained the ability to run a rocblas gemm benchmarking utility.

    The aforementioned rocblas gemm benchmarking utility is also added in this change. rocblas_gemm is a new executable that is not built by default since it relies on the rocblas library, which may not be available everywhere. Once this tool has been explicitly built, it can be passed in as an argument to the benchmarker tool, which will use it to generate a comparison between accera's benchmark results and rocblas's.

    An example:

    <build accera like usual>
    ninja -C `git rev-parse --show-toplevel`/build/temp.linux-x86_64-3.8 rocblas_gemm
    cd tools/benchmarkers
    mkdir ~/accera_benchmarks
    ./gpu_benchmark_tool.py -i sgemm_bert_assorted.csv -t 'AMD MI100' -o ~/accera_benchmarks/results -r `git rev-parse --show-toplevel`/build/temp.linux-x86_64-3.8/tools/benchmarkers/rocblas/rocblas_gemm

    Related work items: #3685

  • Merged PR 2569: Make tensorization passes configurable, remove
    dependency from split indices. [Ritwik Das]

    • Make the mfma type a required parameter for tensorize() - this only chooses the underlying mfma op to use
    • Additionally, the user can pass in the total number of passes to run (defaults to 1) instead of implicitly calculating a square tile.
    • Added documentation for the new enum type.
    • Added some tests
    • Current code does not work with K > M (still investigating this, but should not block this PR)

    Related work items: #3688
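
    A hedged sketch of what the updated call might look like; the parameter and enum names below (mfma_type, MMAShape, num_total_passes) are illustrative assumptions rather than names taken from these notes.

        plan.tensorize(
            indices=(iii, jjj, kkk),            # innermost indices covering the MMA tile
            mfma_type=MMAShape.M16xN16xK4_B1,   # required: selects the underlying mfma op (assumed name)
            num_total_passes=4,                 # optional: number of passes to run, defaults to 1
        )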

  • Merged PR 2567: Fix vectorized access of LAST_MAJOR arrays. [Mason
    Remy]

    Fix vectorized access of LAST_MAJOR arrays

    • mlir::vector::LoadOp and mlir::vector::StoreOp only support unit
      strides on the minor dimension of the memref they access, so
      reinterpret_cast the memref to a flat buffer to pass that check
    • add translation for ReinterpretCastOp
    • improve vectorization of LAST_MAJOR matrices in cache accesses by
      changing the traversal order of the cache region (when
      filling/reducing) based on the memory ordering of the outer array
      being acted on.
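
    A quick NumPy stand-in (not Accera or MLIR code) for the unit-stride issue: the minor dimension of a LAST_MAJOR (column-major-like) array is not contiguous, but a flat view of the same buffer is, which is what the reinterpret cast to a flat buffer provides.

        import numpy as np

        A = np.asfortranarray(np.zeros((4, 8), dtype=np.float32))  # column-major layout
        print(A.strides)                  # (4, 16): the last (minor) dimension is not unit-stride
        flat = A.reshape(-1, order="F")   # flat view over the same buffer
        print(flat.strides)               # (4,): unit stride, suitable for vector load/store
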
  • Merged PR 2568: [Compliance] [nfc] Switch to Azure Container Registry
    for ROCm build agent. [Lisa Ong]

  • Merged PR 2560: Make register allocation during tensorization tunable.
    [Ritwik Das]

    • Add controllable number of fused mfma passes
    • Add controllable scheduling policy of mfma ops
    • Add tests

    Related work items: #3687

  • Merged PR 2565: [build] bump hatlib dependency to 0.0.13. [Lisa Ong]

    hatlib 0.0.13 contains a fix to unblock ROCm buddy builds

Full Changelog: v1.2.4...v1.2.5