Release v1.2.9 · microsoft/Accera

Merged PR 2862: write runtime size of index type to Hat. [Denny Sun]

write runtime size of index type to Hat
Merged PR 2861: Fix cache_C benchmark variable which is not getting
set properly for CUDA. [Ritwik Das]

Fix cache_C benchmark variable which is not getting set properly for CUDA
Merged PR 2864: [build]: fix breaks due to agent image updates. [Lisa
Ong]

Latest version of azure pipelines images now set VCPKG_ROOT, which overrides the submodule used by Accera.

See: actions/runner-images@ef638dd
- Only pipelines that rely on azure build agents are affected.
- We still need to keep the submodule around to enable external builds from the Github repo.
- Remove defunct pipeline
- Update vcpkg submodule while we're here
Merged PR 2839: Enable CUDA output caching. [Ritwik Das]
- Add Tensor memory space type to denote memory fragments for caching (e.g. C in gemm). This might go away in the future and be replaced with Private once the caching code is unified with ROCm behavior.
- Change caching code to generate MMALoad/StoreOps for caching of the output.
Related work items: #3725
Merged PR 2813: Add pass to recognize patterns that look like int16
matrix multiply. [Chuck Jacobs]

This PR adds a pass to rewrite GEMM-like loops that multiply-accumulate int16 matrices into an int32 result. If this pattern gets invoked, the output should contain the much-sought vpmaddwd instruction.

It also fixes some old low-level tests of integer arithmetic.
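
As a rough illustration of the pattern this pass looks for, the sketch below emulates what `vpmaddwd` computes (multiply adjacent int16 pairs, sum each pair into one int32) and checks that a GEMM loop regrouped into such pairs matches the plain multiply-accumulate loop. The function names are hypothetical and this is not the pass itself, only the arithmetic it targets, assuming an even reduction dimension `K`.

```python
def wrap_i32(x):
    """Wrap a Python int to signed 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def pmaddwd(a, b):
    """Emulate vpmaddwd on two equal-length lists of int16 values.

    Adjacent pairs of 16-bit products are summed into one 32-bit result.
    """
    assert len(a) == len(b) and len(a) % 2 == 0
    return [
        wrap_i32(a[i] * b[i] + a[i + 1] * b[i + 1])
        for i in range(0, len(a), 2)
    ]


def gemm_i16_reference(A, B, M, N, K):
    """Plain GEMM-like loop nest: int16 inputs, int32 accumulator."""
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] = wrap_i32(C[i][j] + A[i][k] * B[k][j])
    return C


def gemm_i16_pairwise(A, B, M, N, K):
    """Same result, with the k loop regrouped into vpmaddwd-style pairs."""
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            column = [B[k][j] for k in range(K)]
            for partial in pmaddwd(A[i], column):
                C[i][j] = wrap_i32(C[i][j] + partial)
    return C


A = [[1, 2, 3, 4], [-5, 6, -7, 8]]
B = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(gemm_i16_reference(A, B, 2, 2, 4))
print(gemm_i16_pairwise(A, B, 2, 2, 4))
```

Both calls print the same matrix, which is the equivalence a rewrite of this kind relies on.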
Merged PR 2847: [release] Bump docs version to 1.2.9 and update github
action container. [Lisa Ong]
- Rev docs to 1.2.9
- Update github workflow to reference updated tag for 14.0.6-1
Merged PR 2845: Filter GPU benchmarks by de-parameterizing cache
layouts. [Ritwik Das]

Filter GPU benchmarks by de-parameterizing cache layouts
Merged PR 2843: Fix bug in GPU benchmark to calculate valid variant.
[Ritwik Das]
- Fix bug in GPU benchmark to calculate valid variant
- Add cosmosdb util to cleanup old entries
Merged PR 2835: Merge in MLIR fixes for LocationSnapshot and
MemRefCastOp. [Lisa Ong]

Merged PR 2842: Parameterize cache strategy in GPU benchmarks and fix
kernel filters. [Ritwik Das]

Parameterize cache strategy in GPU benchmarks and fix kernel filters

Merged PR 2836: Value DSL support for runtime sized output arrays.
[Lisa Ong]

This adds memref-in-memref support for output arrays that are allocated in the function:
- Add a new "Pointer" Value wrapper class with a Store() function that creates an accv.StoreOp, similar to Array and Scalar
- Update accv.StoreOp to support memrefs-in-memrefs

Value pointer levels are defined as follows:

| Layout | Example | Pointer level | C-type |
|---|---|---|---|
| scalar | `int16`, `float32`, `index`, ... | 0 | `int16_t`, `float32_t`, `int64_t`, ... |
| single-level memref | `memref<1xindex>`, `memref<3x2xf32>`, `memref<10x16x11x?xf32>` | 1 | `int64_t`, `float32_t`, `float32_t*` |
| memref in memref | `memref<memref<?x?x?xf32>>` | at least 2 (= the number of levels of memrefs) | `float32_t**` |
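
The pointer levels listed above can be read off the textual form of a type by counting how many `memref<...>` wrappers surround the element type. The helper below is a minimal sketch of that counting rule; `pointer_level` is a hypothetical name for illustration and is not part of the Value DSL.

```python
def pointer_level(mlir_type: str) -> int:
    """Count the memref levels wrapping the element type (0 for scalars)."""
    level = 0
    t = mlir_type.strip()
    # Peel one "memref<...>" wrapper per iteration.
    while t.startswith("memref<") and t.endswith(">"):
        level += 1
        t = t[len("memref<"):-1].strip()
    return level


for ty in ("index", "memref<3x2xf32>", "memref<memref<?x?x?xf32>>"):
    print(ty, pointer_level(ty))
```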

Future work:

- End-to-end lowering through Python DSL
- Bare pointer convention for output arrays
- Custom allocator functions. Currently we use the built-in std alloc.

Related work items: #3730

Merged PR 2840: [nfc] Remove redundant ACR info from docker scripts.
[Lisa Ong]

The container registry allows pull-only access
Merged PR 2838: Runtime sized Array lowering to LLVM, accv.alloc to
LLVM malloc. [Denny Sun]
1. make deep copy of range end of value type when cloning ops
2. plumbing runtime size to LLVM
3. transform memref.alloc to LLVM malloc
4. conversion between block argument and symbol name
The generated IRs:

Initial.mlir

%2 = "accv.alloc"(%arg0, %arg1) {sym_name = "diff"} : (index, index) -> memref<?x?xf32> loc(#loc)

LoopNestToValueFunc.mlir
```
%2 = "accv.alloc"(%arg0, %arg1) {sym_name = "diff"} : (index, index) -> memref<?x?xf32> loc(#loc)
affine.for %arg4 = 0 to %arg0 {
    affine.for %arg5 = 0 to %arg1 {
    }
}
```
ConvertValueToStd.mlir
```
%0 = memref.alloc(%arg0, %arg1) : memref<?x?xf32>
```
ConvertValueToLLVM.mlir
```
%8 = llvm.mul %arg1, %arg0  : i64
%9 = llvm.mlir.null : !llvm.ptr<f32>
%10 = llvm.getelementptr %9[%8] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
%11 = llvm.ptrtoint %10 : !llvm.ptr<f32> to i64
%12 = llvm.call @malloc(%11) : (i64) -> !llvm.ptr<i8>
```
Related work items: #3733
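
The ConvertValueToLLVM.mlir snippet above uses the null-pointer `getelementptr` idiom: indexing a null `f32` pointer by the element count and converting the result to an integer yields the byte size passed to `malloc`. The sketch below mirrors only that arithmetic for a two-dimensional `f32` array; the function name is hypothetical, and `ctypes.c_float` stands in for `f32`.

```python
import ctypes


def malloc_size_bytes(dim0: int, dim1: int, elem_ctype=ctypes.c_float) -> int:
    """Byte count handed to malloc for a runtime-sized dim0 x dim1 array."""
    num_elements = dim1 * dim0  # corresponds to: llvm.mul %arg1, %arg0
    # getelementptr on a null pointer followed by ptrtoint is equivalent to
    # num_elements * sizeof(element type).
    return num_elements * ctypes.sizeof(elem_ctype)


print(malloc_size_bytes(3, 5))
```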
Merged PR 2831: Record unique IDs so that different processes acting
on a value module. [Mason Remy]

Record unique IDs so that different processes acting on a value module
don't produce conflicting IDs
Merged PR 2837: Fix WPT calculation to prevent 0 work and filter
benchmarks. [Ritwik Das]

Fix WPT calculation to prevent 0 work and filter benchmarks
Merged PR 2832: Caching strategy flag and thread ID optimization (GPU)
[Ritwik Das]
- Add a flag to plan.cache() to expose the different thread <--> data arrangements
- Optimize thread ID calculation to check blockdim first
Merged PR 2829: Add handwritten caching implementation for GPU.
[Ritwik Das]

Add GPUBlockCacheOp, which lowers to a handwritten caching implementation on the GPU. The implementation supports access patterns that minimize bank conflicts in shared memory and maximize coalesced global memory access.
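
To make the bank-conflict concern concrete, the sketch below counts how many threads land on the same shared memory bank when each thread reads one element of a column in a row-major tile. It assumes 32 banks of one element each and uses row padding as one common way to spread accesses across banks; this is an illustration of the general problem, not a description of what GPUBlockCacheOp generates.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 shared memory banks, one element wide


def max_bank_conflict(rows: int, row_stride: int, col: int) -> int:
    """Worst-case number of threads hitting one bank.

    Thread r reads tile[r][col] from a row-major tile whose rows are
    row_stride elements apart.
    """
    banks = Counter((r * row_stride + col) % NUM_BANKS for r in range(rows))
    return max(banks.values())


print(max_bank_conflict(32, 32, 0))  # unpadded 32-wide rows
print(max_bank_conflict(32, 33, 0))  # rows padded by one element
```

With a stride of 32 every thread maps to the same bank, while a stride of 33 gives each thread its own bank.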
Merged PR 2821: Fixes constraint logic for fusion of more than two
schedules. [Kern Handa]

Fixes constraint logic for fusion of more than two schedules
Merged PR 2830: Fixes macOS CI build. [Kern Handa]

Fixes macOS CI build
Merged PR 2806: Enable specifying cache element type. [Mason Remy]

Enable specifying cache element type
- Supports accumulating and/or computing in a different element type and
  batching up the casts for those types
- Also adds support for binop/castop expansion and castop folding
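
One reason to accumulate in a wider element type and cast once at the end is precision. The sketch below compares a dot product that rounds to float16 after every step with one that accumulates in float32 and casts the final result a single time. It uses only the standard `struct` module to emulate the rounding; the function names are hypothetical and do not reflect the Accera API.

```python
import struct


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE float16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_f32(x: float) -> float:
    """Round a Python float to the nearest IEEE float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_cast_every_step(a, b):
    """Accumulate in float16: every partial sum is cast back to float16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f16(acc + to_f16(x * y))
    return acc


def dot_cast_once(a, b):
    """Accumulate in float32, then cast to float16 once at the end."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + x * y)
    return to_f16(acc)


ones = [1.0] * 4096
print(dot_cast_every_step(ones, ones))
print(dot_cast_once(ones, ones))
```

The float16 accumulator stalls once the sum reaches 2048, because 2049 is not representable in float16 and rounds back down, whereas the float32 accumulator reaches 4096 exactly.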
Merged PR 2818: Upgrade hatlib dependency to v0.0.23. [Ritwik Das]

Upgrade hatlib dependency to v0.0.23
Merged PR 2792: Refactor cast to a value cast op. [Mason Remy]

Refactor cast to a value cast op
Merged PR 2788: Re-enabled fusing test that was taking too long.
[Chuck Jacobs]

This PR just re-enables a skipped test that was taking too long
Merged PR 2816: Upgrade hatlib requirement to 0.0.22. [Ritwik Das]

Upgrade hatlib requirement to 0.0.22
Merged PR 2811: [nfc] Upgrade CUDA to 11.7 on NVidia benchmark
machines. [Lisa Ong]

According to https://hub.docker.com/r/nvidia/cuda/tags, 11.7.0 is still the latest.

Full Changelog: v1.2.8...v1.2.9
