v1.2.8

@CaptnJackSparrow released this on 10 Aug 03:29 · 29 commits to main since this release

What's Changed


  • Merged PR 2814: Parameterize batch_size in GPU benchmarks. [Ritwik
    Das]

  • Merged PR 2810: [release] [nfc] Bump docs version to 1.2.8, bump
    github actions to llvm 14.0.6. [Lisa Ong]

    Preparation for 1.2.8 release

  • Merged PR 2808: [ci] Add vcpkg caching for buddy builds, disable flaky
    parallelized tests. [Lisa Ong]

    • Enable vcpkg binary caching for CI pipelines that use non-custom agents. This reduces vcpkg install time from 2-3 minutes to ~30 seconds
    • ctest --parallel on macOS can sometimes fail randomly; the affected tests will need to be updated to support running in parallel
  • Merged PR 2804: [ci] Reduce runtimes of PR Buddy Builds. [Lisa Ong]

    • Remove redundant setup.py builds in pipelines that also run cmake builds
    • Build the debug config for Linux only (the fastest config)
    • Add pipeline caching for ccache, conan, and pip where applicable
    • Add parallel configs where applicable
    • Filter out some tests on Windows due to slow runtimes; these should still have coverage on Linux and macOS
  • Merged PR 2807: Enable verification for CK baselines. [Ritwik Das]

    • Enable verification for CK baselines
    • Increase the timeout for the CUDA ResNet benchmark
    • Add functionality for extracting kernel code from Cosmos DB
  • Merged PR 2802: Fix barrier optimization pass. [Chuck Jacobs]

    This PR fixes a couple of barrier-related issues:

    • The barrier optimization pass wasn't keeping barriers that protected vector load/store ops
    • Multiple barriers were getting generated when hoisting barriers out of conditionals

    Related work items: #3732

  • Merged PR 2800: Add max_threads to parallelize and change default
    behavior. [Ritwik Das]

    • Add a max_threads argument to parallelize (see the sketch below)
    • Change the default behavior to count the number of iterations of the given indices
    • Update documentation
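
    As a rough illustration (not taken from the PR itself; the matmul, sizes, and index choices below are made up), the new parameter might be used from the Python DSL like this:

        import accera as acc

        # Illustrative 256x256x256 matmul; only the parallelize() call at
        # the end is the point of this sketch.
        A = acc.Array(role=acc.Array.Role.INPUT, shape=(256, 256))
        B = acc.Array(role=acc.Array.Role.INPUT, shape=(256, 256))
        C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(256, 256))

        nest = acc.Nest(shape=(256, 256, 256))
        i, j, k = nest.get_indices()

        @nest.iteration_logic
        def _():
            C[i, j] += A[i, k] * B[k, j]

        plan = nest.create_schedule().create_plan()

        # Cap the number of worker threads; without max_threads the default
        # now derives the thread count from the iteration count of the
        # given indices.
        plan.parallelize(indices=(i, j), max_threads=8)
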
  • Merged PR 2801: Remove verification on cuda-fp32-big benchmark.
    [Ritwik Das]

  • Merged PR 2798: LLVM 14.0.6 upgrade. [Lisa Ong]

    An incremental upgrade with minimal or no changes to MLIR

  • Merged PR 2796: Makes NestedPassAdaptor's pipeline consistent. [Kern
    Handa]

    This change makes NestedPassAdaptor create a new nested pass manager
    every time a pass is added. Prior to this change, if dumpPasses was
    false, the same nested pass manager was reused; if dumpPasses was true,
    a new nested pass manager was created per call to addPass. This
    difference in behavior also caused the lowering pipeline to differ
    depending on the value of dumpPasses.

    For example, in the following code in AcceraPasses.cpp, all the passes
    added to funcOpPM ran BEFORE createConvertSCFToOpenMPPass when
    dumpPasses was false.

        auto funcOpPM = pmAdaptor.nestPassManager([&]() -> OpPassManager& { return pm.nest<v::ValueModuleOp>().nest<FuncOp>(); });
        funcOpPM.addPass(createConvertLinalgToAffineLoopsPass());
        funcOpPM.addPass(createSimplifyAffineStructuresPass());
        funcOpPM.addPass(createCanonicalizerPass());
        funcOpPM.addPass(createLoopInvariantCodeMotionPass());
        funcOpPM.addPass(createCSEPass());
    
        pmAdaptor.addPass(createConvertSCFToOpenMPPass());
        pmAdaptor.addPass(value::createValueToStdPass(options.enableProfile));
        funcOpPM.addPass(value::createBarrierOptPass(options.writeBarrierGraph.getValue(), options.barrierGraphFilename.getValue()));
        pmAdaptor.addPass(value::createRangeValueOptimizePass());
        pmAdaptor.addPass(createCanonicalizerPass());
        pmAdaptor.addPass(createCSEPass());

    Additionally, this change exposed the fact that the BarrierOpt pass was
    incorrectly erasing barriers, so it has been made a no-op until that
    correctness issue is fixed.

  • Merged PR 2795: [docs] Cleanup viz scripts, clarify reorder
    illustrations. [Lisa Ong]

    • Clarify the labels while working on the animated version

    • Clean up and rename .js files for (slightly) easier lookup

  • Merged PR 2475: LLVM 14.0.0 upgrade. [Lisa Ong]

    Tag: llvmorg-14.0.0

    Notable changes:

    • std dialect ops have moved to the arith and math dialects
    • StrEnumAttribute has been replaced by simple enums. This affects things like gpu.dimension.x
    • [Issue] linalg.copy is removed and replaced by memref.copy, which introduces a runtime dependency on a memrefCopy C function for non-identity layout copies. This affects Array.sub_array in debug mode.
    • [Regression] OMP to LLVM lowering crashes in mlir-translate findAlloc due to an empty set of blocks being emitted. This only affects dynamic scheduling with collapsed loops.
    • Lots of renames
    • Upgraded macOS to macOS-12

    Related work items: #3646

  • Merged PR 2753: accera.Dimension and runtime-sized Arrays in the
    Python DSL. [Denny Sun]

    With this change, Accera is able to generate the initial MLIR for runtime-sized Arrays. The IR lowering is not fully working yet due to a bug, which will be fixed in later changes.

            M = Dim()
            N = Dim()
            K = Dim()
    
            A = Array(shape=(M, K), element_type=ScalarType.float32, role=Array.Role.INPUT)
            B = Array(shape=(K, N), element_type=ScalarType.float32, role=Array.Role.INPUT)
            C = Array(shape=(M, N), element_type=ScalarType.float32, role=Array.Role.INPUT_OUTPUT)
    
            nest = Nest((M, N, K))
            i, j, k = nest.get_indices()
    
            @nest.iteration_logic
            def _():
                C[i, j] += A[i, k] * B[k, j]
    
            package.add()
            package.build()
    
    #domain0 = #accln<"idomain{{i,3}={0:{op_idx:0}:1}, {j,4}={0:{op_idx:1}:1}, {k,5}={0:{op_idx:2}:1}}">
    #domain1 = #accln<"idomain{{i,9}={0:{op_idx:0}:1}, {j,10}={0:{op_idx:1}:1}}">
    #domain2 = #accln<"idomain{{i,6}={0:1:1}}">
    
    #map = affine_map<(d0, d1)[s0] -> (d0 * s0 + d1)>
    #xdomain0 = #accln<"xfdomain{dims: {{i,3}, {j,4}, {k,5}}, indices: {{{i,3} : {0:{op_idx:0}:1}}, {{j,4} : {0:{op_idx:1}:1}}, {{k,5} : {0:{op_idx:2}:1}}}}">
    #xdomain1 = #accln<"xfdomain{dims: {{i,9}, {j,10}}, indices: {{{i,9} : {0:{op_idx:0}:1}}, {{j,10} : {0:{op_idx:1}:1}}}}">
    #xdomain2 = #accln<"xfdomain{dims: {{i,6}}, indices: {{{i,6} : {0:1:1}}}}">
    module @test_runtimesizes attributes {llvm.data_layout = "... ..."}  {
      accv.module "test_runtimesizes"  {
        accv.func nested @runtimesizes_..._impl_...(%arg0: index loc(unknown), %arg1: index loc(unknown), %arg2: index loc(unknown), %arg3: memref<?x?xf32, #map> loc(unknown), %arg4: memref<?x?xf32, #map> loc(unknown), %arg5: memref<?x?xf32, #map> loc(unknown)) attributes {accv.output_verifiers = ["", "", "", "", "", "_debug_check_allclose_<accera.lang.Dim.Dim object at ...>_<accera.lang.Dim.Dim object at ...>_..."], exec_target = 0 : i64} {
          %0 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(#loc)
          %1 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(#loc)
          %2 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(#loc)
          "accln.nest"(%0, %1, %2) ( {
            %3 = accln.sym_index {name = "i"} #accln<"index{i,3}"> loc(#loc)
            %4 = accln.sym_index {name = "j"} #accln<"index{j,4}"> loc(#loc)
            %5 = accln.sym_index {name = "k"} #accln<"index{k,5}"> loc(#loc)
            "accln.kernel"() ( {
              %7 = "accv.slice"(%arg5, %3, %4) {sliceDimensions = [0, 1]} : (memref<?x?xf32, #map>, index, index) -> memref<f32> loc(#loc)
              ... ...
              accln.terminator loc(#loc)
            }) {sym_name = "_"} : () -> () loc(#loc)
            ... ...
            accln.terminator loc(#loc)
          }) {domain = #domain0, exec_target = 0 : i64, kernels = []} : (index, index, index) -> () loc(#loc)
          accv.return loc(#loc)
        } loc(#loc)
        accv.func @runtimesizes_...(%arg0: index loc(unknown), %arg1: index loc(unknown), %arg2: index lo...
    
  • Merged PR 2793: support sign extend op in canVectorize() function to
    improve generated MLIR. [JUBI TANEJA]

    While trying to optimize an int16 MatMul with the vectorize transformation in the DSL, we noticed an unrolled loop with load, binop, sexti, and store instructions. No vector instructions were emitted, which hinted that the sign-extend instruction was not supported in the canVectorize function. With this op now supported, we can emit some vector instructions in the MLIR.
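
    For context, a minimal sketch of the kind of kernel this affects (shapes, split size, and names below are illustrative, not taken from the PR):

        import accera as acc

        # int16 inputs accumulated into int32: the inner loop previously
        # lowered to scalar load / mul / sext / store instructions.
        A = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.int16, shape=(64, 64))
        B = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.int16, shape=(64, 64))
        C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, element_type=acc.ScalarType.int32, shape=(64, 64))

        nest = acc.Nest(shape=(64, 64, 64))
        i, j, k = nest.get_indices()

        @nest.iteration_logic
        def _():
            C[i, j] += A[i, k] * B[k, j]

        schedule = nest.create_schedule()
        jj = schedule.split(j, 8)

        plan = schedule.create_plan()
        # With sign-extend recognized by canVectorize(), this can now emit
        # vector ops in the generated MLIR instead of an unrolled scalar loop.
        plan.vectorize(jj)
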

  • Merged PR 2790: Filter invalid kernels from GPU benchmarks. [Ritwik
    Das]

    • Filter invalid kernels from GPU benchmarks
    • Disable verification on CUDA FP16 benchmarks
    • Remove frequent cleanups
  • Merged PR 2787: Remove MLIR flag from package format in benchmarks.
    [Ritwik Das]

  • Merged PR 2784: Merge Github changes to ADO. [Lisa Ong]

  • Merged PR 2776: Make fusing more efficient. [Chuck Jacobs]

    This PR refactors the code generation for schedules and makes it more efficient. This makes a big difference for complex schedules with constraints on the kernels (like the ones generated when fusing schedules).

    Here are some timings from a few tests I ran (modified versions of Mason's example script); a sketch of the fused-schedule pattern being timed follows the table:

    Test                                   Main branch   PR branch
    3 fused schedules, tile first only     18.8s         5.8s
    3 fused schedules, tile 1 & 2          190s          6.2s
    3 fused schedules, tile all 3          ????          7.2s
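
    The operations, sizes, and names in this sketch are illustrative rather than the exact benchmark script; it only shows the full-fuse-then-tile pattern whose code generation this PR speeds up:

        import accera as acc

        A = acc.Array(role=acc.Array.Role.INPUT, shape=(256, 256))
        B = acc.Array(role=acc.Array.Role.INPUT, shape=(256, 256))
        C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(256, 256))

        # Schedule 0: C = A + B
        nest0 = acc.Nest(shape=(256, 256))
        i0, j0 = nest0.get_indices()

        @nest0.iteration_logic
        def _():
            C[i0, j0] = A[i0, j0] + B[i0, j0]

        # Schedule 1: C *= B
        nest1 = acc.Nest(shape=(256, 256))
        i1, j1 = nest1.get_indices()

        @nest1.iteration_logic
        def _():
            C[i1, j1] *= B[i1, j1]

        # Fully fuse the two schedules, then tile the fused iteration space.
        # The fused schedule's kernels carry constraints, which is the case
        # made faster by this PR.
        fused = acc.fuse(nest0.create_schedule(), nest1.create_schedule())
        f, i, j = fused.get_indices()
        ii, jj = fused.tile({i: 16, j: 16})
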

    Related work items: #3731

  • Merged PR 2781: Fix benchmark with MLIR format and add repro test.
    [Ritwik Das]

  • Merged PR 2780: Type support for tensor ops in CUDA. [Ritwik Das]

    • Add support for FP32 input (TF32 compute)
    • Add support for bfloat16 input/FP32 output
    • Add support for integer types

    Related work items: #3709, #3710

  • Merged PR 2779: Some assorted benchmark fixes. [Ritwik Das]

    • Build Accera in release mode
    • Shuffle GEMM sizes to run the small sizes first
    • Increase tolerance to account for floating-point drift with large k-splits
  • Merged PR 2774: Add input caching tests for CUDA, enable tests in PR
    pipelines. [Ritwik Das]

    Add input caching tests in CUDA

    Related work items: #3725

  • Merged PR 2677: Unify rocm/cuda tensor ops lowering under accv
    dialect. [Ritwik Das]

    • Remove gpu dialect lowering (CUDA)
    • Add accv dialect lowering for CUDA
    • ROCm and CUDA lowering now use the same semantics

    Related work items: #3728

  • Merged PR 2764: [doc] Rename acc.Dim to acc.Dimension and add
    create_dimensions() [Lisa Ong]

    • Rename acc.Dim to acc.Dimension, and acc.Dim.Role to acc.Dimension.Role
    • Add the simplified acc.create_dimensions() construction pattern (sketched below)
    • Keep the acc.Dimension constructor for advanced use cases involving generator patterns

    Related work items: #3720
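
    A minimal sketch of the simplified construction pattern (the surrounding matmul and names are illustrative, not from the PR):

        import accera as acc

        # create_dimensions() returns placeholder dimensions whose concrete
        # values are supplied at runtime as extra function arguments.
        M, N, K = acc.create_dimensions()

        A = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(M, K))
        B = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(K, N))
        C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, element_type=acc.ScalarType.float32, shape=(M, N))

        nest = acc.Nest(shape=(M, N, K))
        i, j, k = nest.get_indices()

        @nest.iteration_logic
        def _():
            C[i, j] += A[i, k] * B[k, j]

        package = acc.Package()
        # The runtime dimensions are passed alongside the arrays.
        package.add(nest, args=(M, N, K, A, B, C), base_name="runtime_sized_matmul")
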

  • Merged PR 2752: Add nargs to input args in benchmark tool. [Ritwik
    Das]

  • Merged PR 2680: [doc] Manual and Reference doc updates for Runtime
    Array DSL. [Lisa Ong]

    Proposed DSL changes for supporting runtime array sizes:

    • Add a new dimension type that serves as a placeholder for runtime dimension sizes for Array and Nest. Supports both input and output dimensions
    • Add output-only Arrays
    • Add the Scalar type
    • Example kernels demonstrating different aspects:
      • Gather: basic features
      • Range: scalar function arguments
      • ReduceMean: fusion

    Related work items: #3720

  • Merged PR 2683: Support conditionals in Logic Function. [Denny Sun]

    Before this change, there was no way to emit conditionals in a logic function.

    With this change, the user is able to write the following logic function:

                def if_func():
                    T[i, j] = A[i, j] + B[i, j]
                    C[i, j] += T[i, j]**2.
    
                def elseif_func():
                    T[i, j] = A[i, j] - B[i, j]
                    C[i, j] += T[i, j]**2.
    
                def else_func():
                    C[i, j] = A[i, j] + B[i, j]
    
                @nest.iteration_logic
                def _():
                    _If(j<100, if_func).ElseIf(i>100, elseif_func).Else(else_func)
    

    Related work items: #3706

Full Changelog: v1.2.7...v1.2.8