v1.2.8
What's Changed
- Set license field in metadata of package by @tonybaloney in #46
- GitHub Codespaces configuration by @lisaong in #48
- Merged PR 2814: Parameterize batch_size in GPU benchmarks. [Ritwik Das]
- Merged PR 2810: [release] [nfc] Bump docs version to 1.2.8, bump GitHub Actions to LLVM 14.0.6. [Lisa Ong]
  Preparation for the 1.2.8 release.
- Merged PR 2808: [ci] Add vcpkg caching for buddy builds, disable flaky parallelized tests. [Lisa Ong]
  - Enable vcpkg binary caching for CI pipelines that use non-custom agents. This reduces vcpkg install time from 2-3 minutes to ~30 seconds.
  - ctest --parallel on macOS can sometimes fail randomly. The tests will need to be updated to support running in parallel.
- Merged PR 2804: [ci] Reduce runtimes of PR Buddy Builds. [Lisa Ong]
  - Remove redundant setup.py builds in pipelines with cmake builds
  - Build debug for Linux only (the fastest config)
  - Add pipeline caching for ccache, conan, and pip where applicable
  - Add parallel configs where applicable
  - Filter out some tests on Windows due to slow runtimes. These should have coverage on Linux and macOS.
- Merged PR 2807: Enable verification for CK baselines. [Ritwik Das]
  - Increase timeout for CUDA ResNet
  - Add functionality for extracting kernel code from Cosmos DB
- Merged PR 2802: Fix barrier optimization pass. [Chuck Jacobs]
  This PR fixes a couple of barrier-related issues:
  - The barrier optimization pass wasn't keeping barriers that protected vector load/store ops
  - Multiple barriers were getting generated when hoisting barriers out of conditionals
  Related work items: #3732
- Merged PR 2800: Add max_threads to parallelize and change default behavior. [Ritwik Das]
  - Add `max_threads` to `parallelize` (see the sketch below)
  - Change the default behavior to count the number of iterations of the given indices
  - Update documentation
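  A minimal sketch of how the new knob might be used in the Python DSL; the shapes and the thread cap here are illustrative assumptions, not taken from the PR:

  ```python
  import accera as acc

  N = 256
  A = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(N, N))
  B = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, element_type=acc.ScalarType.float32, shape=(N, N))

  nest = acc.Nest(shape=(N, N))
  i, j = nest.get_indices()

  @nest.iteration_logic
  def _():
      B[i, j] += A[i, j]

  plan = nest.create_plan()
  # Cap the number of worker threads explicitly; if max_threads is omitted,
  # the default now derives the count from the iterations of the given indices.
  plan.parallelize(indices=i, max_threads=4)

  package = acc.Package()
  package.add(plan, args=(A, B), base_name="parallel_add")
  ```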
- Merged PR 2801: Remove verification on cuda-fp32-big benchmark. [Ritwik Das]
- Merged PR 2798: LLVM 14.0.6 upgrade. [Lisa Ong]
  An incremental upgrade with minimal or no changes to MLIR.
- Merged PR 2796: Makes NestedPassAdaptor's pipeline consistent. [Kern Handa]
  This change makes NestedPassAdaptor create a new pass manager every time a new pass is added. Prior to this change, if dumpPasses was false, the same nested pass manager would be reused; if dumpPasses was true, a new nested pass manager was created per call to addPass. This difference in behavior also caused the lowering pipeline to differ depending on the value of dumpPasses. For example, in the following code in AcceraPasses.cpp, all the passes that are added to `funcOpPM` run BEFORE `createConvertSCFToOpenMPPass` if `dumpPasses` was false.

  ```cpp
  auto funcOpPM = pmAdaptor.nestPassManager([&]() -> OpPassManager& { return pm.nest<v::ValueModuleOp>().nest<FuncOp>(); });
  funcOpPM.addPass(createConvertLinalgToAffineLoopsPass());
  funcOpPM.addPass(createSimplifyAffineStructuresPass());
  funcOpPM.addPass(createCanonicalizerPass());
  funcOpPM.addPass(createLoopInvariantCodeMotionPass());
  funcOpPM.addPass(createCSEPass());
  pmAdaptor.addPass(createConvertSCFToOpenMPPass());
  pmAdaptor.addPass(value::createValueToStdPass(options.enableProfile));
  funcOpPM.addPass(value::createBarrierOptPass(options.writeBarrierGraph.getValue(), options.barrierGraphFilename.getValue()));
  pmAdaptor.addPass(value::createRangeValueOptimizePass());
  pmAdaptor.addPass(createCanonicalizerPass());
  pmAdaptor.addPass(createCSEPass());
  ```

  Additionally, this change exposed the fact that the BarrierOpt pass is incorrectly erasing barriers, so it has been made into a no-op until this correctness issue has been fixed.
- Merged PR 2795: [docs] Cleanup viz scripts, clarify reorder illustrations. [Lisa Ong]
  - Clarify the labels while working on the animated version
  - Clean up and rename .js files for (slightly) easier lookup
- Merged PR 2475: LLVM 14.0.0 upgrade. [Lisa Ong]
  Tag: llvmorg-14.0.0
  Notable changes:
  - std dialect ops are now moved to the arith and math dialects
  - StrEnumAttribute is now replaced by simple enums. This affects things like gpu.dimension.x
  - [Issue] linalg.copy is removed, replaced by memref.copy, which introduces a runtime dependency on a `memrefCopy` C function for non-identity layout copies. This affects Array.sub_array in debug mode.
  - [Regression] OMP to LLVM lowering will crash in mlir-translate findAlloc due to an empty set of blocks being emitted. This only affects dynamic scheduling with collapsed loops.
  - Lots of renames
  - Upgraded macOS to macOS-12
  Related work items: #3646
- Merged PR 2753: accera.Dimension and runtime-sized Arrays in the Python DSL. [Denny Sun]
  With this change, Accera is able to generate the initial MLIR for runtime-sized Arrays. The IR lowering is not fully working due to a bug, which can be fixed in later changes.
  ```python
  M = Dim()
  N = Dim()
  K = Dim()

  A = Array(shape=(M, K), element_type=ScalarType.float32, role=Array.Role.INPUT)
  B = Array(shape=(K, N), element_type=ScalarType.float32, role=Array.Role.INPUT)
  C = Array(shape=(M, N), element_type=ScalarType.float32, role=Array.Role.INPUT_OUTPUT)

  nest = Nest((M, N, K))
  i, j, k = nest.get_indices()

  @nest.iteration_logic
  def _():
      C[i, j] += A[i, k] * B[k, j]

  package.add()
  package.build()
  ```
  Generated MLIR (abridged):

  ```mlir
  #domain0 = #accln<"idomain{{i,3}={0:{op_idx:0}:1}, {j,4}={0:{op_idx:1}:1}, {k,5}={0:{op_idx:2}:1}}">
  #domain1 = #accln<"idomain{{i,9}={0:{op_idx:0}:1}, {j,10}={0:{op_idx:1}:1}}">
  #domain2 = #accln<"idomain{{i,6}={0:1:1}}">
  #map = affine_map<(d0, d1)[s0] -> (d0 * s0 + d1)>
  #xdomain0 = #accln<"xfdomain{dims: {{i,3}, {j,4}, {k,5}}, indices: {{{i,3} : {0:{op_idx:0}:1}}, {{j,4} : {0:{op_idx:1}:1}}, {{k,5} : {0:{op_idx:2}:1}}}}">
  #xdomain1 = #accln<"xfdomain{dims: {{i,9}, {j,10}}, indices: {{{i,9} : {0:{op_idx:0}:1}}, {{j,10} : {0:{op_idx:1}:1}}}}">
  #xdomain2 = #accln<"xfdomain{dims: {{i,6}}, indices: {{{i,6} : {0:1:1}}}}">
  module @test_runtimesizes attributes {llvm.data_layout = "... ..."} {
    accv.module "test_runtimesizes" {
      accv.func nested @runtimesizes_..._impl_...(%arg0: index loc(unknown), %arg1: index loc(unknown), %arg2: index loc(unknown), %arg3: memref<?x?xf32, #map> loc(unknown), %arg4: memref<?x?xf32, #map> loc(unknown), %arg5: memref<?x?xf32, #map> loc(unknown)) attributes {accv.output_verifiers = ["", "", "", "", "", "_debug_check_allclose_<accera.lang.Dim.Dim object at ...>_<accera.lang.Dim.Dim object at ...>_..."], exec_target = 0 : i64} {
        %0 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(#loc)
        %1 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(#loc)
        %2 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(#loc)
        "accln.nest"(%0, %1, %2) ( {
          %3 = accln.sym_index {name = "i"} #accln<"index{i,3}"> loc(#loc)
          %4 = accln.sym_index {name = "j"} #accln<"index{j,4}"> loc(#loc)
          %5 = accln.sym_index {name = "k"} #accln<"index{k,5}"> loc(#loc)
          "accln.kernel"() ( {
            %7 = "accv.slice"(%arg5, %3, %4) {sliceDimensions = [0, 1]} : (memref<?x?xf32, #map>, index, index) -> memref<f32> loc(#loc)
            ... ...
            accln.terminator loc(#loc)
          }) {sym_name = "_"} : () -> () loc(#loc)
          ... ...
          accln.terminator loc(#loc)
        }) {domain = #domain0, exec_target = 0 : i64, kernels = []} : (index, index, index) -> () loc(#loc)
        accv.return loc(#loc)
      } loc(#loc)
      accv.func @runtimesizes_...(%arg0: index loc(unknown), %arg1: index loc(unknown), %arg2: index lo...
  ```
- Merged PR 2793: Support sign extend op in canVectorize() function to improve generated MLIR. [JUBI TANEJA]
  While trying to optimize an int16 MatMul with the vectorize transformation in the DSL, we noticed an unrolled loop with load, binop, sexti, and store instructions. No vector instruction was emitted, which hinted that the sign extend instruction was not supported in the `canVectorize` function. With this op now supported, we can emit some vector instructions in the MLIR (a sketch of the scenario follows).
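  A minimal sketch of the kind of int16 MatMul this affects, assuming the standard Accera Python DSL; the shapes and split factor are hypothetical:

  ```python
  import accera as acc

  M, N, K = 64, 64, 64

  # int16 inputs accumulated into an int32 output, which requires a sign extend
  A = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.int16, shape=(M, K))
  B = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.int16, shape=(K, N))
  C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, element_type=acc.ScalarType.int32, shape=(M, N))

  nest = acc.Nest(shape=(M, N, K))
  i, j, k = nest.get_indices()

  @nest.iteration_logic
  def _():
      C[i, j] += A[i, k] * B[k, j]

  schedule = nest.create_schedule()
  jj = schedule.split(j, 8)
  schedule.reorder(i, j, k, jj)

  plan = schedule.create_plan()
  # Previously the sexti emitted for the int16 -> int32 widening blocked
  # canVectorize(), leaving an unrolled scalar loop; vector ops can now be emitted.
  plan.vectorize(jj)
  ```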
- Merged PR 2790: Filter invalid kernels from GPU benchmarks. [Ritwik Das]
  - Disable verification on CUDA f16 benchmarks
  - Remove frequent cleanups
- Merged PR 2787: Remove MLIR flag from package format in benchmarks. [Ritwik Das]
- Merged PR 2784: Merge GitHub changes to ADO. [Lisa Ong]
- Merged PR 2776: Make fusing more efficient. [Chuck Jacobs]
  This PR refactors the code generation for schedules and makes it more efficient. This makes a big difference for complex schedules with constraints on the kernels (like the ones generated when fusing schedules).
  Here are some timings on a few tests (modified versions of Mason's example script) I ran:

  | test | main branch | PR branch |
  | --- | --- | --- |
  | 3 fused schedules, tile first only | 18.8s | 5.8s |
  | 3 fused schedules, tile 1 & 2 | 190s | 6.2s |
  | 3 fused schedules, tile all 3 | ???? | 7.2s |

  Related work items: #3731
- Merged PR 2781: Fix benchmark with MLIR format and add repro test. [Ritwik Das]
- Merged PR 2780: Type support for tensor ops in CUDA. [Ritwik Das]
  - Add support for FP32 input (TF32 compute)
  - Add support for bfloat16 input/FP32 output
  - Add support for integer types
  Related work items: #3709, #3710
- Merged PR 2779: Some assorted benchmark fixes. [Ritwik Das]
  - Build Accera in release mode
  - Shuffle GEMM sizes to run small sizes first
  - Increase tolerance to account for floating-point drift for a large k-split
- Merged PR 2774: Add input caching tests for CUDA, enable tests in PR pipelines. [Ritwik Das]
  Related work items: #3725
- Merged PR 2677: Unify ROCm/CUDA tensor ops lowering under the accv dialect. [Ritwik Das]
  - Remove gpu dialect lowering (CUDA)
  - Add accv dialect lowering for CUDA
  - ROCm and CUDA lowering use the same semantics
  Related work items: #3728
- Merged PR 2764: [doc] Rename acc.Dim to acc.Dimension and add create_dimensions() [Lisa Ong]
  - Rename `acc.Dim` to `acc.Dimension`, and `acc.Dim.Role` to `acc.Dimension.Role`
  - Add the simplified `acc.create_dimensions()` construction pattern (see the sketch below)
  - Kept the `acc.Dimension` constructor for advanced use cases involving generator patterns
  Related work items: #3720
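  A minimal sketch of the simplified construction pattern, assuming the runtime-sized Array DSL described in the docs; the shapes, names, and argument order are illustrative:

  ```python
  import accera as acc

  # Create placeholder dimensions whose sizes are supplied at runtime
  M, N, K = acc.create_dimensions()

  A = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(M, K))
  B = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(K, N))
  C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, element_type=acc.ScalarType.float32, shape=(M, N))

  nest = acc.Nest(shape=(M, N, K))
  i, j, k = nest.get_indices()

  @nest.iteration_logic
  def _():
      C[i, j] += A[i, k] * B[k, j]

  package = acc.Package()
  # The dimensions become extra function arguments alongside the arrays
  package.add(nest, args=(M, N, K, A, B, C), base_name="runtime_sized_matmul")
  ```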
- Merged PR 2752: Add nargs to input args in benchmark tool. [Ritwik Das]
- Merged PR 2680: [doc] Manual and Reference doc updates for Runtime Array DSL. [Lisa Ong]
  Proposed DSL changes for supporting runtime array sizes:
  - Adds a new dimension type that serves as a placeholder for runtime dimension sizes for `Array` and `Nest`. Supports both input and output dimensions
  - Adds output-only Arrays
  - Adds the Scalar type
  - Example kernels demonstrating different aspects:
    - Gather: basic features
    - Range: scalar function arguments
    - ReduceMean: fusion
  Related work items: #3720
- Merged PR 2683: Support conditionals in Logic Function. [Denny Sun]
  Before this change, there was no way to emit conditionals in a logic function.
  With this change, the user is able to write the following logic function:

  ```python
  def if_func():
      T[i, j] = A[i, j] + B[i, j]
      C[i, j] += T[i, j]**2.

  def elseif_func():
      T[i, j] = A[i, j] - B[i, j]
      C[i, j] += T[i, j]**2.

  def else_func():
      C[i, j] = A[i, j] + B[i, j]

  @nest.iteration_logic
  def _():
      _If(j < 100, if_func).ElseIf(i > 100, elseif_func).Else(else_func)
  ```

  Related work items: #3706
New Contributors
- @tonybaloney made their first contribution in #46
Full Changelog: v1.2.7...v1.2.8