v1.2.9
-
Merged PR 2862: write runtime size of index type to Hat. [Denny Sun]
-
Merged PR 2861: Fix cache_C benchmark variable which is not getting
set properly for CUDA. [Ritwik Das]
-
Merged PR 2864: [build]: fix breaks due to agent image updates. [Lisa
Ong]
The latest Azure Pipelines images now set VCPKG_ROOT, which overrides the submodule used by Accera.
See: actions/runner-images@ef638dd
- Only pipelines that rely on azure build agents are affected.
- We still need to keep the submodule around to enable external builds from the GitHub repo.
- Remove defunct pipeline
- Update vcpkg submodule while we're here
-
Merged PR 2839: Enable CUDA output caching. [Ritwik Das]
- Add a Tensor memory space type to denote memory fragments for caching (e.g. C in GEMM). This might go away in the future and be replaced with Private once the caching code is unified with the ROCm behavior.
- Change caching code to generate MMALoad/StoreOps for caching of the output.
Related work items: #3725
-
Merged PR 2813: Add pass to recognize patterns that look like int16
matrix multiply. [Chuck Jacobs]
This PR adds a pass to rewrite GEMM-like loops that multiply-accumulate int16 matrices into an int32 result. When this pattern is recognized, the output should contain the much-sought vpmaddwd instruction. It also fixes some old low-level tests of integer arithmetic.
-
Merged PR 2847: [release] Bump docs version to 1.2.9 and update github
action container. [Lisa Ong]
- Rev docs to 1.2.9
- Update GitHub workflow to reference the updated tag for 14.0.6-1
-
Merged PR 2845: Filter GPU benchmarks by de-parameterizing cache
layouts. [Ritwik Das]
-
Merged PR 2843: Fix bug in GPU benchmark to calculate valid variant.
[Ritwik Das]
- Fix bug in GPU benchmark to calculate the valid variant
- Add a Cosmos DB util to clean up old entries
-
Merged PR 2835: Merge in MLIR fixes for LocationSnapshot and
MemRefCastOp. [Lisa Ong]
From commit 1abc4a981067ef1fd9bf717d7fabc4f6d75520d1
-
Merged PR 2842: Parameterize cache strategy in GPU benchmarks and fix
kernel filters. [Ritwik Das]
-
Merged PR 2836: Value DSL support for runtime sized output arrays.
[Lisa Ong]
- This adds memref-in-memref support for output arrays that are allocated in the function
- A new "Pointer" Value wrapper class with a Store() function which creates an accv.StoreOp, similar to Array and Scalar
- Update accv.StoreOp to support memrefs-in-memrefs

Value pointer levels are defined as follows:

Layout              | Example                                                  | Pointer level                                  | C-type
scalar              | int16, float32, index, ...                               | 0                                              | int16_t, float32_t, int64_t, ...
single-level memref | memref<1xindex>, memref<3x2xf32>, memref<10x16x11x?xf32> | 1                                              | int64_t*, float32_t*, float32_t*
memref in memref    | memref<memref<?x?x?f32>>                                 | at least 2 (= the number of levels of memrefs) | float32_t**

Future work:
- End-to-end lowering through Python DSL
- Bare pointer convention for output arrays
- Custom allocator functions. Currently we use the built-in std alloc.
Related work items: #3730
-
Merged PR 2840: [nfc] Remove redundant ACR info from docker scripts.
[Lisa Ong]
The container registry allows pull-only access.
-
Merged PR 2838: Runtime sized Array lowering to LLVM, accv.alloc to
LLVM malloc. [Denny Sun]
- Make a deep copy of the range end of the Value type when cloning ops
- Plumb the runtime size through to LLVM
- Transform memref.alloc to LLVM malloc
- Convert between block arguments and symbol names
The generated IR at each stage:
Initial.mlir
%2 = "accv.alloc"(%arg0, %arg1) {sym_name = "diff"} : (index, index) -> memref<?x?xf32> loc(#loc)
LoopNestToValueFunc.mlir
%2 = "accv.alloc"(%arg0, %arg1) {sym_name = "diff"} : (index, index) -> memref<?x?xf32> loc(#loc)
affine.for %arg4 = 0 to %arg0 {
  affine.for %arg5 = 0 to %arg1 {
  }
}
ConvertValueToStd.mlir
%0 = memref.alloc(%arg0, %arg1) : memref<?x?xf32>
ConvertValueToLLVM.mlir
%8 = llvm.mul %arg1, %arg0 : i64
%9 = llvm.mlir.null : !llvm.ptr<f32>
%10 = llvm.getelementptr %9[%8] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
%11 = llvm.ptrtoint %10 : !llvm.ptr<f32> to i64
%12 = llvm.call @malloc(%11) : (i64) -> !llvm.ptr<i8>
Related work items: #3733
-
Merged PR 2831: Record unique IDs so that different processes acting
on a value module. [Mason Remy]
Record unique IDs so that different processes acting on a value module don't produce conflicting IDs.
-
Merged PR 2837: Fix WPT calculation to prevent 0 work and filter
benchmarks. [Ritwik Das]
-
Merged PR 2832: Caching strategy flag and thread ID optimization (GPU)
[Ritwik Das]
- Add a flag to plan.cache() to expose the different thread <--> data arrangements
- Optimize thread ID calculation to check blockdim first
-
Merged PR 2829: Add handwritten caching implementation for GPU.
[Ritwik Das]
Add GPUBlockCacheOp, which lowers to a handwritten caching implementation on the GPU that supports access patterns for minimizing bank conflicts in shared memory and maximizing coalesced global memory accesses.
-
Merged PR 2821: Fixes constraint logic for fusion of more than two
schedules. [Kern Handa]
-
Merged PR 2830: Fixes macOS CI build. [Kern Handa]
-
Merged PR 2806: Enable specifying cache element type. [Mason Remy]
- Supports accumulating and/or computing in a different element type and batching up the casts for those types
- Also adds support for binop/castop expansion and castop folding
-
Merged PR 2818: Upgrade hatlib dependency to v0.0.23. [Ritwik Das]
-
Merged PR 2792: Refactor cast to a value cast op. [Mason Remy]
-
Merged PR 2788: Re-enabled fusing test that was taking too long.
[Chuck Jacobs]
This PR just re-enables a skipped test that was taking too long.
-
Merged PR 2816: Upgrade hatlib requirement to 0.0.22. [Ritwik Das]
-
Merged PR 2811: [nfc] Upgrade CUDA to 11.7 on NVidia benchmark
machines. [Lisa Ong]
According to https://hub.docker.com/r/nvidia/cuda/tags, 11.7.0 is still the latest.
Full Changelog: v1.2.8...v1.2.9