v1.2.5

@lisaong released this 24 May 06:26 · 44 commits to main since this release

What's Changed

  • Merged PR 2593: [docs] [release] bump docs version to 1.2.5 in
    preparation for release. [Lisa Ong]

  • Merged PR 2586: Loop order and indices as parameters. [Denny Sun]

    With this change, the user can write a schedule with loop_order parameterized:

        # i, j, k, ii, jj, kk are the schedule's loop indices, defined elsewhere
        loop_order = create_parameters()
        schedule.reorder(order=loop_order)

        parameter_grid = {
            loop_order: (j, k, i, ii, jj, kk)
        }

        # Keep only valid loop orders, then randomly sample 5 candidates
        parameters = create_parameter_grid(parameter_grid,
                                           filter_func=lambda *p: schedule.is_valid_loop_order(p[0][0]),
                                           sample=5)

        # Add another function to the package
        package.add(
            plan,
            args=(A, B, C),
            parameters=parameters,
            base_name="matmul_256_256_256"
        )

    Related work items: #3693

  • Merged PR 2591: Fixes more warnings. Enables STRICT_MODE for Linux PR
    CI. [Kern Handa]

  • Merged PR 2588: [test] Trim out redundant tests from ROCm pipeline.
    [Lisa Ong]

    The ROCm pipeline currently runs on a single agent, so CPU tests that already run in other pipelines are skipped to speed up pipeline execution.

  • Merged PR 2590: [nfc] Fixes a bunch of warnings in C++ layer. [Kern
    Handa]

  • Merged PR 2589: [test] Adds DSL tests for Schedule.pad. [Kern Handa]

  • Merged PR 2587: Sync Github to ADO. [Lisa Ong]

    commit b934ad05f6b8cd84420226b93f57b8ac3229eadc

  • Merged PR 2585: Use conditional instead of loop-unswitching on GPU.
    [Chuck Jacobs]

    This PR changes how boundary conditions are handled on GPU-bound loop indices. If a loop's increment doesn't evenly divide its bounds, the body is guarded by a conditional instead of unswitching that loop.

    Related work items: #3703
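
    A minimal plain-Python sketch of the difference (conceptual only, not Accera-emitted code), for a bound of 10 and a step of 4:

        def body(idx):
            print(idx)

        # Before: unswitching splits the loop into a main part and a boundary part.
        for i in range(0, 8, 4):        # full blocks only
            for ii in range(4):
                body(i + ii)
        for ii in range(8, 10):         # separate remainder loop
            body(ii)

        # After (on GPU-bound indices): one loop whose body is guarded.
        for i in range(0, 12, 4):
            for ii in range(4):
                if i + ii < 10:         # conditional guard instead of unswitching
                    body(i + ii)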

  • Merged PR 2571: Add random seed to enable reproducible sampling.
    [Denny Sun]

    Adds a random seed for parameter-grid sampling so that sampled configurations are reproducible, giving users more control over sampling strategies.
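
    As a hedged sketch building on the parameter-grid example above, fixing the seed should make the sampled subset repeatable across runs; the seed keyword name is an assumption based on this PR's description, not something spelled out in these notes.

        parameters = create_parameter_grid(
            parameter_grid,
            filter_func=lambda *p: schedule.is_valid_loop_order(p[0][0]),
            sample=5,
            seed=2022,  # assumed keyword: the same seed should reproduce the same 5 samples
        )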

  • Merged PR 2581: Add CUDA tensor core support. [Ritwik Das]

    • Added CUDA tensor ops (no caching)
    • Added validation tests
    • Changed MMA enum names
    • A snippet of the generated tensor op code in CUDA:
    ...
    // Load the 16x16 accumulator tile (C) from global memory
    vhalf *var11 = (vhalf*)arg2;
    wmma::fragment<wmma::accumulator, 16, 16, 16, vhalf> mmaMatrix_12;
    wmma::load_matrix_sync(mmaMatrix_12, var11 + var9 * 16 + var10, 16, wmma::layout_t::mem_row_major);
    // Load the 16x16 A tile
    vhalf *var13 = (vhalf*)arg0;
    wmma::fragment<wmma::matrix_a, 16, 16, 16, vhalf, wmma::row_major> mmaMatrix_14;
    wmma::load_matrix_sync(mmaMatrix_14, var13 + var9 * 16 + 0, 16);
    // Load the 16x16 B tile
    vhalf *var15 = (vhalf*)arg1;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, vhalf, wmma::row_major> mmaMatrix_16;
    wmma::load_matrix_sync(mmaMatrix_16, var15 + 0 * 16 + var10, 16);
    // D = A * B + C on the tensor cores, then store the result back
    wmma::fragment<wmma::accumulator, 16, 16, 16, vhalf> mmaMatrix_17;
    wmma::mma_sync(mmaMatrix_17, mmaMatrix_14, mmaMatrix_16, mmaMatrix_12);
    wmma::store_matrix_sync(var11 + var9 * 16 + var10, mmaMatrix_17, 16, wmma::layout_t::mem_row_major);
    

    Related work items: #3694

  • Merged PR 2584: Adds cublas_gemm benchmarking tool. [Kern Handa]

  • Merged PR 2583: Don't hold ResolveWarpSize results with rvalue. [Mason
    Remy]

    Don't hold ResolveWarpSize results with rvalue

    gcc appears to be inlining ResolveWarpSize incorrectly in some cases, and
    not holding the result in an rvalue pair appears to fix it.

    The miscompile was producing mod-by-0 and floordiv-by-0 operations where
    the warp size constant was expected to be exactly 32 or 64.

  • Merged PR 2580: Fixes rocblas_gemm's fp32 -> fp16 conversion. [Kern
    Handa]

  • Merged PR 2579: Improves accera_gemm.py's handling of unsupported
    configs. [Kern Handa]

  • Merged PR 2578: Fixes time unit conversions in accera_gemm.py. [Kern
    Handa]

    Also addresses comments for the previous rocblas_gemm PR

  • Merged PR 2577: Fixes accera_gemm.py code after Plan.tensorize API
    change. [Kern Handa]

  • Merged PR 2575: Adds library warmup to rocblas_gemm benchmarker. [Kern
    Handa]

  • Merged PR 2572: [nfc] Move accera/viz -> tools/viz. [Kern Handa]

  • Merged PR 2573: Update setup.cfg hatlib dependency version. [Mason
    Remy]

  • Merged PR 2557: Overhauls the benchmarking tool. [Kern Handa]

    This change moves the benchmarking tool to a top-level tools/benchmarkers directory. The tool has also been split up so that the accera portion is in its own file, while the driver portion of the tool remains intact and has gained the ability to run a rocblas gemm benchmarking utility.

    The aforementioned rocblas gemm benchmarking utility is also added in this change. rocblas_gemm is a new executable that is not built by default since it relies on the rocblas library, which may not be available everywhere. Once this tool has been explicitly built, it can be passed in as an argument to the benchmarker tool, which will use it to generate a comparison between accera's benchmark results and rocblas's.

    An example:

    <build accera like usual>
    ninja -C `git rev-parse --show-toplevel`/build/temp.linux-x86_64-3.8 rocblas_gemm
    cd tools/benchmarkers
    mkdir ~/accera_benchmarks
    ./gpu_benchmark_tool.py -i sgemm_bert_assorted.csv -t 'AMD MI100' -o ~/accera_benchmarks/results -r `git rev-parse --show-toplevel`/build/temp.linux-x86_64-3.8/tools/benchmarkers/rocblas/rocblas_gemm

    Related work items: #3685

  • Merged PR 2569: Make tensorization passes configurable, remove
    dependency from split indices. [Ritwik Das]

    • Make the mfma type a required parameter for tensorize() - this only chooses the underlying mfma op to use
    • Additionally, the user can pass in the total number of passes to run (defaults to 1) instead of implicitly calculating a square tile.
    • Added documentation for the new enum type.
    • Added some tests
    • Current code does not work with K > M (still investigating this, but should not block this PR)

    Related work items: #3688
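
    A hedged sketch of what the updated call might look like; the parameter and enum names below (mfma_type, MMAShape, num_total_passes) are illustrative assumptions rather than names taken from these notes.

        plan.tensorize(
            indices=(iii, jjj, kkk),            # innermost indices covering the MMA tile
            mfma_type=MMAShape.M16xN16xK4_B1,   # required: selects the underlying mfma op (assumed name)
            num_total_passes=4,                 # optional: number of passes to run, defaults to 1
        )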

  • Merged PR 2567: Fix vectorized access of LAST_MAJOR arrays. [Mason
    Remy]

    Fix vectorized access of LAST_MAJOR arrays

    • mlir::vector::LoadOp and mlir::vector::StoreOp only support unit
      strides on the minor dimension of the memref they access, so
      reinterpret_cast the memref to a flat buffer to pass that check
    • add translation for ReinterpretCastOp
    • improve vectorization of LAST_MAJOR matrices in cache accesses by
      changing the traversal order of the cache region (when
      filling/reducing) based on the memory ordering of the outer array
      being acted on.
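
    A quick NumPy stand-in (not Accera or MLIR code) for the unit-stride issue: the minor dimension of a LAST_MAJOR (column-major-like) array is not contiguous, but a flat view of the same buffer is, which is what the reinterpret cast to a flat buffer provides.

        import numpy as np

        A = np.asfortranarray(np.zeros((4, 8), dtype=np.float32))  # column-major layout
        print(A.strides)                  # (4, 16): the last (minor) dimension is not unit-stride
        flat = A.reshape(-1, order="F")   # flat view over the same buffer
        print(flat.strides)               # (4,): unit stride, suitable for vector load/store
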
  • Merged PR 2568: [Compliance] [nfc] Switch to Azure Container Registry
    for ROCm build agent. [Lisa Ong]

  • Merged PR 2560: Make register allocation during tensorization tunable.
    [Ritwik Das]

    • Add controllable number of fused mfma passes
    • Add controllable scheduling policy of mfma ops
    • Add tests

    Related work items: #3687

  • Merged PR 2565: [build] bump hatlib dependency to 0.0.13. [Lisa Ong]

    hatlib 0.0.13 contains a fix to unblock ROCm buddy builds

Full Changelog: v1.2.4...v1.2.5