
v1.2.9

@masonremy masonremy released this 17 Sep 03:37
· 28 commits to main since this release
  • Merged PR 2862: write runtime size of index type to Hat. [Denny Sun]

    write runtime size of index type to Hat

  • Merged PR 2861: Fix cache_C benchmark variable which is not getting
    set properly for CUDA. [Ritwik Das]

    Fix cache_C benchmark variable which is not getting set properly for CUDA

  • Merged PR 2864: [build]: fix breaks due to agent image updates. [Lisa
    Ong]

    Latest version of azure pipelines images now set VCPKG_ROOT, which overrides the submodule used by Accera.

    See: actions/runner-images@ef638dd

    • Only pipelines that rely on azure build agents are affected.
    • We still need to keep the submodule around to enable external builds from the Github repo.
    • Remove defunct pipeline
    • Update vcpkg submodule while we're here
  • Merged PR 2839: Enable CUDA output caching. [Ritwik Das]

    • Add Tensor memory space type to denote memory fragments for caching (e.g. C in gemm). This might go away in the future and be replaced with Private once the caching code is unified with the ROCM behavior.
    • Change caching code to generate MMALoad/StoreOps for caching of the output.

    Related work items: #3725

  • Merged PR 2813: Add pass to recognize patterns that look like int16
    matrix multiply. [Chuck Jacobs]

    This PR adds a pass to rewrite GEMM-like loops that multiply-accumulate int16 matrices into an int32 result. If this pattern gets invoked, the output should contain the much-sought vpmaddwd instruction.

    It also fixes some old low-level tests of integer arithmetic.
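The arithmetic that the pass above targets can be sketched in plain Python. This is only an illustration of the int16 multiply-accumulate pattern and of what vpmaddwd computes (products of adjacent int16 pairs summed into int32 lanes); it is not the pass itself, and the helper names `to_int32`, `pmaddwd`, `dot_int16` and `matmul_int16` are hypothetical. It assumes the reduction dimension has even length.

```python
def to_int32(x):
    """Wrap a Python int to signed 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def pmaddwd(a, b):
    """Emulate vpmaddwd on two equal, even-length lists of int16 values.

    Each output lane is a[2i]*b[2i] + a[2i+1]*b[2i+1], wrapped to int32.
    """
    assert len(a) == len(b) and len(a) % 2 == 0
    return [to_int32(a[i] * b[i] + a[i + 1] * b[i + 1])
            for i in range(0, len(a), 2)]


def dot_int16(a, b):
    """int16 dot product accumulated in int32, built from pairwise sums."""
    acc = 0
    for partial in pmaddwd(a, b):
        acc = to_int32(acc + partial)
    return acc


def matmul_int16(A, B):
    """C[i][j] = sum_k A[i][k] * B[k][j] with int16 inputs, int32 result."""
    cols = list(zip(*B))
    return [[dot_int16(list(row), list(col)) for col in cols] for row in A]
```

For example, `matmul_int16([[1, 2], [3, 4]], [[5, 6], [7, 8]])` walks the same multiply-accumulate structure a GEMM-like loop nest would.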

  • Merged PR 2847: [release] Bump docs version to 1.2.9 and update github
    action container. [Lisa Ong]

    • Rev docs to 1.2.9

    • Update github workflow to reference updated tag for 14.0.6-1

  • Merged PR 2845: Filter GPU benchmarks by de-parameterizing cache
    layouts. [Ritwik Das]

    Filter GPU benchmarks by de-parameterizing cache layouts

  • Merged PR 2843: Fix bug in GPU benchmark to calculate valid variant.
    [Ritwik Das]

    • Fix bug in GPU benchmark to calculate valid variant
    • Add cosmosdb util to cleanup old entries
  • Merged PR 2835: Merge in MLIR fixes for LocationSnapshot and
    MemRefCastOp. [Lisa Ong]

    From 1abc4a981067ef1fd9bf717d7fabc4f6d75520d1 Mon Sep 17 00:00:00 2001

  • Merged PR 2842: Parameterize cache strategy in GPU benchmarks and fix
    kernel filters. [Ritwik Das]

    Parameterize cache strategy in GPU benchmarks and fix kernel filters

  • Merged PR 2836: Value DSL support for runtime sized output arrays.
    [Lisa Ong]

    • This adds memref-in-memref support for output arrays that are allocated in the function
    • A new "Pointer" Value wrapper class with a Store() function which creates an accv.StoreOp, similar to Array, Scalar
    • Update accv.StoreOp to support memrefs-in-memrefs

    Value pointer levels are defined as follows:

    | Layout | Example | Pointer level | C-type |
    |---|---|---|---|
    | scalar | int16, float32, index, ... | 0 | int16_t, float32_t, int64_t, ... |
    | single-level memref | memref<1xindex>, memref<3x2xf32>, memref<10x16x11x?xf32> | 1 | int64_t*, float32_t*, float32_t* |
    | memref in memref | memref<memref<?x?x?xf32>> | at least 2 (= the number of levels of memrefs) | float32_t** |
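
The pointer-level rule above (one level per nested memref, zero for a scalar) can be sketched as a small string-based helper. This is an illustration only: `pointer_level`, `c_type` and the `C_TYPES` mapping are hypothetical names, not part of the Value DSL, and the parsing assumes type strings shaped like the examples in the table.

```python
import re

# Hypothetical element-name to C-type mapping, following the table's examples.
C_TYPES = {
    "int16": "int16_t", "i16": "int16_t",
    "float32": "float32_t", "f32": "float32_t",
    "index": "int64_t",
}


def pointer_level(type_str):
    """Number of nested memref levels; 0 for a scalar type."""
    return type_str.count("memref<")


def c_type(type_str):
    """C type for a scalar or (nested) memref type string."""
    level = pointer_level(type_str)
    inner = type_str
    # Peel off one "memref<...>" wrapper per level.
    for _ in range(level):
        inner = inner[len("memref<"):-1]
    # Drop leading static ("3x") or dynamic ("?x") dimensions.
    element = re.sub(r"^(?:(?:\d+|\?)x)*", "", inner)
    return C_TYPES[element] + "*" * level
```

For instance, `c_type("memref<memref<?x?xf32>>")` peels two memref wrappers and appends two `*` to the element's C type.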

    Future work:

    • End-to-end lowering through Python DSL
    • Bare pointer convention for output arrays
    • Custom allocator functions. Currently we use the built-in std alloc.

    Related work items: #3730

  • Merged PR 2840: [nfc] Remove redundant ACR info from docker scripts.
    [Lisa Ong]

    The container registry allows pull-only access

  • Merged PR 2838: Runtime sized Array lowering to LLVM, accv.alloc to
    LLVM malloc. [Denny Sun]

    1. Make a deep copy of the range end of value type when cloning ops
    2. Plumb the runtime size through to LLVM
    3. Transform memref.alloc to LLVM malloc
    4. Convert between block arguments and symbol names

    The generated IR at each stage:

    Initial.mlir

    %2 = "accv.alloc"(%arg0, %arg1) {sym_name = "diff"} : (index, index) -> memref<?x?xf32> loc(#loc)

    LoopNestToValueFunc.mlir

    %2 = "accv.alloc"(%arg0, %arg1) {sym_name = "diff"} : (index, index) -> memref<?x?xf32> loc(#loc)
    affine.for %arg4 = 0 to %arg0 {
        affine.for %arg5 = 0 to %arg1 {
        }
    }
    

    ConvertValueToStd.mlir

    %0 = memref.alloc(%arg0, %arg1) : memref<?x?xf32>
    

    ConvertValueToLLVM.mlir

    %8 = llvm.mul %arg1, %arg0  : i64
    %9 = llvm.mlir.null : !llvm.ptr<f32>
    %10 = llvm.getelementptr %9[%8] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
    %11 = llvm.ptrtoint %10 : !llvm.ptr<f32> to i64
    %12 = llvm.call @malloc(%11) : (i64) -> !llvm.ptr<i8>
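
The LLVM snippet above multiplies the runtime dimensions and then uses getelementptr on a null pointer to turn the element count into a byte count for malloc. A minimal Python sketch of that size computation follows; `alloc_size_bytes` is a hypothetical helper name, and a 4-byte float is assumed for the f32 element.

```python
import struct


def alloc_size_bytes(dims, elem_format="f"):
    """Byte count for a dense buffer with the given runtime dimensions.

    elem_format is a struct format character; "f" stands in for f32.
    """
    num_elements = 1
    for d in dims:
        num_elements *= d  # mirrors llvm.mul over the runtime sizes
    # getelementptr from a null base yields num_elements * sizeof(element);
    # ptrtoint then reads that offset back as an integer for malloc.
    return num_elements * struct.calcsize(elem_format)
```

Calling `alloc_size_bytes([rows, cols])` gives the value that would be passed to malloc for a `memref<?x?xf32>` with those runtime sizes.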
    

    Related work items: #3733

  • Merged PR 2831: Record unique IDs so that different processes acting
    on a value module don't produce conflicting IDs. [Mason Remy]

    Record unique IDs so that different processes acting on a value module
    don't produce conflicting IDs

  • Merged PR 2837: Fix WPT calculation to prevent 0 work and filter
    benchmarks. [Ritwik Das]

    Fix WPT calculation to prevent 0 work and filter benchmarks

  • Merged PR 2832: Caching strategy flag and thread ID optimization (GPU)
    [Ritwik Das]

    • Add a flag to plan.cache() to expose the different thread <--> data arrangements
    • Optimize thread ID calculation to check blockdim first
  • Merged PR 2829: Add handwritten caching implementation for GPU.
    [Ritwik Das]

    Add GPUBlockCacheOp, which lowers to a handwritten caching implementation on the GPU. The implementation supports access patterns that minimize bank conflicts in shared memory and maximize coalescing of global memory accesses.
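
To make the bank-conflict concern concrete, here is a generic sketch (not the GPUBlockCacheOp implementation) that counts how many threads of a warp hit the same shared-memory bank. It assumes 32 banks of 4-byte words and a 32-thread warp, which is typical of NVIDIA GPUs but is an assumption here; `max_bank_conflict` and `column_access` are hypothetical helpers, and the padded row stride is a common textbook remedy rather than something the PR states it uses.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, one 4-byte word per bank per cycle


def max_bank_conflict(word_addresses):
    """Largest number of distinct words that map to a single bank."""
    counts = Counter(addr % NUM_BANKS for addr in set(word_addresses))
    return max(counts.values())


def column_access(row_stride, column=0, threads=32):
    """Word addresses touched when thread t reads tile[t][column]."""
    return [t * row_stride + column for t in range(threads)]
```

With a row stride of 32 words every thread in the column read lands in the same bank, while a stride of 33 spreads the reads across all banks.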

  • Merged PR 2821: Fixes constraint logic for fusion of more than two
    schedules. [Kern Handa]

    Fixes constraint logic for fusion of more than two schedules

  • Merged PR 2830: Fixes macOS CI build. [Kern Handa]

    Fixes macOS CI build

  • Merged PR 2806: Enable specifying cache element type. [Mason Remy]

    Enable specifying cache element type

    • Supports accumulating and/or computing in a different element type and
      batching up the casts for those types
    • Also adds support for binop/castop expansion and castop folding
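
As a rough illustration of why accumulating in a different element type and batching the casts can matter, the sketch below compares rounding to float32 after every addition with accumulating in a wider type and casting once at the end. It is a generic numeric example, not Accera code; `to_f32`, `accumulate_narrow` and `accumulate_wide` are hypothetical names, and Python's float (a 64-bit double) stands in for the wider accumulation type.

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate_narrow(values):
    """Accumulate in float32: one rounding step per addition."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + to_f32(v))
    return acc


def accumulate_wide(values):
    """Accumulate in the wider type, then cast once at the end."""
    acc = 0.0
    for v in values:
        acc += to_f32(v)
    return to_f32(acc)
```

Summing many small values such as `[0.1] * 10000` both ways shows the rounding error that per-step casts accumulate.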
  • Merged PR 2818: Upgrade hatlib dependency to v0.0.23. [Ritwik Das]

    Upgrade hatlib dependency to v0.0.23

  • Merged PR 2792: Refactor cast to a value cast op. [Mason Remy]

    Refactor cast to a value cast op

  • Merged PR 2788: Re-enabled fusing test that was taking too long.
    [Chuck Jacobs]

    This PR just re-enables a skipped test that was taking too long

  • Merged PR 2816: Upgrade hatlib requirement to 0.0.22. [Ritwik Das]

    Upgrade hatlib requirement to 0.0.22

  • Merged PR 2811: [nfc] Upgrade CUDA to 11.7 on NVidia benchmark
    machines. [Lisa Ong]

    According to https://hub.docker.com/r/nvidia/cuda/tags, 11.7.0 is still the latest.

Full Changelog: v1.2.8...v1.2.9