v1.2.2
What's Changed
Full Changelog: v1.2.1...v1.2.2
- Add Ubuntu CI workflow by @lisaong in #9
- Rework documentation sections by @lisaong in #8
- Manually run script to update doc versions by @lisaong in #10
- Show more perf gains for the quickstart example by @lisaong in #12
- Fix post merge build break by @lisaong in #14
- README.md refactoring by @Arslan-e-Mustafa in #13
- Complete refactoring of file array.md and simple affine loop nests.md file in manual docs by @Arslan-e-Mustafa in #16
- Complete refactoring of introduction.md file in manual docs by @Arslan-e-Mustafa in #15
- Complete refactoring of vectorization and parallelization of manual docs by @Arslan-e-Mustafa in #20
- Complete refactoring of targets.md and previous typos by @Arslan-e-Mustafa in #18
- Complete refactoring of caching.md file by @Arslan-e-Mustafa in #19
- Complete refactoring of schedule.md file from manual docs by @Arslan-e-Mustafa in #17
- Complete refactoring of manual docs: Deferred Layout of Constant Arrays by @Arslan-e-Mustafa in #21
- Refactoring of packages.md from manual docs by @Arslan-e-Mustafa in #23
- Complete refactoring of parameters.md from manual docs by @Arslan-e-Mustafa in #22
-
Merged PR 2439: Downstream doc changes from github/main. [Lisa Ong]
Squashed commit of the following:
commit 8a6e553
-
Merged PR 2440: Enable tensorization for Rocm target. [Abdul Dakkak]
-
Merged PR 2470: Adds support for the execution of GPU (CUDA only)
functions via hat. [Kern Handa]
-
Merged PR 2467: Adding multiple functions in package.add() can't work with stateful auxiliary metadata and index_map. [Denny Sun]
These bugs all stem from sharing Python objects, such as auxiliary metadata and a schedule's indices, among different functions. When package.add() is called to add multiple parameterized functions, we add the functions one by one and then emit them one by one. At each step the state of the shared Python objects changes, which results in only the first function added being emitted correctly. To make _add_function work, these shared Python objects need to be made stateless.
Related work items: #3662
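A minimal sketch of the underlying pitfall in plain Python (illustrative only, not the Accera implementation; all names here are ours):

```python
# One shared mutable object is captured by several deferred emit steps,
# mirroring functions that are added one by one and emitted later.
class AuxMetadata:
    def __init__(self):
        self.data = {}

shared = AuxMetadata()
emitters = []
for tile in (16, 32, 64):
    shared.data["tile"] = tile                 # mutates the single shared object
    emitters.append(lambda m=shared: m.data["tile"])

# Every emitter observes whatever state the shared object ended up in,
# so at most one of the three functions is emitted with its own values.
print([emit() for emit in emitters])           # [64, 64, 64]

# Stateless alternative: snapshot the metadata per function at add() time.
snapshots = [{"tile": tile} for tile in (16, 32, 64)]
print([s["tile"] for s in snapshots])          # [16, 32, 64]
```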
-
Merged PR 2469: Convert 'Local' memory space to 'Private' [Mason Remy]
Convert 'Local' memory space to 'Private'
-
Merged PR 2463: Enable specifying double buffer memory space. [Mason Remy]
Enable specifying double buffer memory space
-
Merged PR 2468: Move to VS2022 for builds. [Kern Handa]
Move to VS2022 for builds
-
Merged PR 2465: extend gpu target spec. [Abdul Dakkak]
extend gpu target spec
-
Merged PR 2464: Compute a stable hash for function name suffixes. [Lisa Ong]
Create a stable hash using md5 and json serialization of these stringized entries:
- Array args: shape, type, role, layout
- Parameter dictionary
- Target
Example output:
test_unequal_iteration_space_fusing_1 (__main__.DSLTest_04Fusing) ...
DEBUG:root:Adding wrapped function
DEBUG:root:Adding wrapped function
Building function fusing_test_32d12fb1a01061ec
DEBUG:root:Detected logic function _ uses indices i,j
DEBUG:root:Detected logic function _ uses indices i,j
Building function _debug_check_allclose_16_16_4cfd65a8b606655b
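A sketch of how such a stable suffix can be computed (illustrative; the helper name and entry layout are ours, not necessarily the Accera source):

```python
import hashlib
import json

def stable_suffix(array_args, parameters, target):
    # Stringize every entry, then serialize canonically with sorted keys;
    # md5 of that string is stable across processes and platforms, unlike
    # Python's built-in (randomized) string hashing.
    entries = {
        "args": [{k: str(v) for k, v in arg.items()} for arg in array_args],
        "parameters": {str(k): str(v) for k, v in parameters.items()},
        "target": str(target),
    }
    blob = json.dumps(entries, sort_keys=True)
    return hashlib.md5(blob.encode("utf-8")).hexdigest()[:16]

args = [{"shape": (16, 16), "type": "float32", "role": "INPUT_OUTPUT", "layout": "FIRST_MAJOR"}]
print("fusing_test_" + stable_suffix(args, {"split": 4}, "HOST"))
```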
-
Merged PR 2460: [nfc] Fix build.sh setting for vcpkg debug builds. [Lisa Ong]
-
Merged PR 2461: Replace MemoryType with MemorySpace for consistency. [Mason Remy]
Replace MemoryType with MemorySpace for consistency
-
Merged PR 2416: Implement initial thrifty caching support. [Mason Remy]
Implement initial thrifty caching support
- This is a simple brute-force approach where each thrifty cache is examined element-by-element alongside the array it is caching to check whether there is a stride of 1 between every access (see the sketch below).
- Currently this thrifty analysis and the potential erasing of thrifty caches happen after the cache ops have been created. This is due to needing the cache mapping to have already run in order to support hierarchical caching scenarios. Eventually this should be refactored so that the thrifty analysis prevents creating the cache ops in the first place, but that is a larger refactor than the scope of this task.
- When creating affine loads and stores into caches, this change also tacks some attributes onto the load/store ops to indicate how the original load or store accessed the base array. Since the base array -> cache position mapping is not always invertible (consider coefficient cache layout cases), this is one of the only ways to encode this information. Unfortunately, canonicalization on affine load/store ops will scrub away these attributes, so any reliance on them has to occur before a canonicalization pass. Similarly, the MakeCacheOps' recording of which arguments to their accesses are the base array positions depends on the operand list being unchanged; however, canonicalization may remove operands if it determines they are not used. While this is fine for the load/store op itself, any assumption like "base array indices are at positions N...N+K in the operand list" is no longer valid.
Related work items: #3575
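A sketch of the element-by-element stride check at the heart of the thrifty analysis (ours, in plain Python):

```python
from itertools import product

def is_thrifty(cache_shape, cache_to_array_flat_index):
    """Return True when walking the cache element-by-element visits the
    base array with a stride of exactly 1 between consecutive accesses,
    i.e. the cache is a contiguous copy and can be elided."""
    flat_indices = [
        cache_to_array_flat_index(idx)
        for idx in product(*(range(d) for d in cache_shape))
    ]
    return all(b - a == 1 for a, b in zip(flat_indices, flat_indices[1:]))

# A 4x8 cache over a row-major 4x8 array region is contiguous...
print(is_thrifty((4, 8), lambda ij: ij[0] * 8 + ij[1]))   # True
# ...but the same cache over rows of a wider (stride-16) array is not.
print(is_thrifty((4, 8), lambda ij: ij[0] * 16 + ij[1]))  # False
```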
-
Merged PR 2459: Changes the order of the LLVM_SETUP_VARIANT detection. [Kern Handa]
Changes the order of the LLVM_SETUP_VARIANT detection
-
Merged PR 2458: Fixes building with clang++ on Linux/WSL. [Kern Handa]
Fixes building with clang++ on Linux/WSL
-
Merged PR 2438: Support for double-buffer caching. [Mason Remy]
Support for double-buffer caching
- Adds plumbing from the python dsl for the double_buffer flag to the cache API
- Implements double buffering by hoisting the initial cache fill outside of the cache trigger loop parent, then creating a prologue subnest that fills a temporary buffer with the (i+1)'st iteration's data and an epilogue subnest that moves that temporary buffer data into the main cache buffer. The last iteration of the trigger loop parent loop is unswitched and no cache filling is done in that loop. (See the sketch after this list.)
- On GPU the temporary buffer is allocated in private memory, and if the cache is in shared memory each thread holds onto its own contribution to the cache in its own private memory buffer until the epilogue fill nest.
- Barrier ops are hoisted out of conditionals to avoid potential for deadlocks. The conditionals introduced in this PR should be always-true or always-false, but this is added as a safety measure. Currently the hoisting is naive: any barrier within a conditional is erased and barriers are placed before and after the conditional block. This is not correct for all future conditional scenarios, since any operations within the conditional that depend on the barrier existing will be broken, but it works for how conditionals are used currently and can be improved over time.
Related work items: #3659
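A schematic of the double-buffered loop structure this produces (illustrative Python, not the generated IR):

```python
def compute(tile):
    pass  # stand-in for the kernel work on the current cache contents

def double_buffered(tiles):
    # Hoisted initial fill: load tile 0 before entering the trigger loop.
    cache = list(tiles[0])
    for i in range(len(tiles)):
        if i + 1 < len(tiles):
            # Prologue subnest: prefetch the (i+1)'st tile into a temp buffer.
            temp = list(tiles[i + 1])
        compute(cache)
        if i + 1 < len(tiles):
            # Epilogue subnest: move the temp buffer into the main cache.
            cache = temp
    # The last iteration is effectively unswitched: it performs no cache fill.

double_buffered([[1, 2], [3, 4], [5, 6]])
```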
-
Merged PR 2450: Automatically add parameter dict as auxiliary data. [Denny Sun]
Automatically add parameter dict as auxiliary data
Related work items: #3662
-
Merged PR 2456: Updates CUDA source emission based on testing with nvrtc. [Kern Handa]
Updates CUDA source emission based on testing with nvrtc
-
Merged PR 2453: Sets CPU targets to default to openmp. [Kern Handa]
Sets CPU targets to default to openmp
-
Merged PR 2443: Add FP16 support. [Abdul Dakkak]
Preparation for adding mfma support for CUDA, which only operates on FP16
-
Merged PR 2452: Updates GPU source emitting path to emit host launcher and device function pairs. [Kern Handa]
-
Merged PR 2451: Updates IR util ResolveExec[Target,Runtime] to allow for exact matches. [Kern Handa]
Updates IR util ResolveExec[Target,Runtime] to allow for exact matches
-
Merged PR 2447: Makes Vulkan specific behavior pred. on Runtime. [Kern Handa]
Makes Vulkan specific behavior predicated on Runtime
-
Merged PR 2446: Updates Runtime enum in Targets.py to be more comprehensive. [Kern Handa]
Updates Runtime enum in Targets.py to be more comprehensive
-
Merged PR 2449: [Cleanup] Replace "rc*" prefixes with "acc*" prefixes in tablegen'ed code. [Lisa Ong]
For *.td, perform the following replacements for ops:
- s/rcv_/accv_/g
- s/rc_/acc_/g
- s/rcxp_/accxp_/g
- s/rcln_/accln_/g
-
Merged PR 2448: fix typo in the condition for mod in range analysis. [Abdul Dakkak]
Fix typo in the condition for mod in range analysis
-
Merged PR 2445: Fix bind command when index is further split. [Abdul Dakkak]
-
Merged PR 2444: add range remainder. [Abdul Dakkak]
add range remainder
-
Merged PR 2441: Fix APInt usage in RangeValueOptimizePass. [Mason Remy]
Run the RangeValueOptimizePass as part of acc-to-llvm
-
Merged PR 2442: Move ExecutionOptions to ir lib and create arrayattr <-> struct utils. [Mason Remy]
Move ExecutionOptions to ir lib and create arrayattr <-> struct utils
-
Simplify target passthrough layer. [Mason Remy]
-
Move ExecutionOptions to ir lib and create arrayattr <-> struct utils. [Mason Remy]
-
Merged PR 2430: Remove unnecessary barrier ops. [Chuck Jacobs]
This PR adds an optimization pass that removes redundant / unnecessary barrier ops around shared memory usage.
The optimization pass in this PR is pretty simple and has a couple of limitations:
- it only works on straight-line code (that is, when all the loads, stores, and barriers are at the same loop level as each other).
- it considers all accesses to a specific array to be conflicts (that is, any write to an array followed by a read of that array will want to have a barrier in between them, even if the writes and reads are to different elements in the array)
I should be following up with a PR that deals with barrier and memory ops at different loop levels pretty soon after this.
Related work items: #3648
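A sketch of the straight-line analysis (ours, in plain Python): a barrier is kept only when some array written since the previous kept barrier is accessed again afterward, at whole-array granularity as described above.

```python
def remove_redundant_barriers(ops):
    """ops is straight-line code: ("load", name), ("store", name), "barrier".
    Keep a barrier only if an array stored before it (since the last kept
    barrier) is accessed again after it, before the next barrier."""
    kept = []
    pending_writes = set()  # arrays written since the last kept barrier
    for i, op in enumerate(ops):
        if op == "barrier":
            after = set()
            for nxt in ops[i + 1:]:
                if nxt == "barrier":
                    break
                after.add(nxt[1])
            if pending_writes & after:
                kept.append(op)
                pending_writes = set()
            # else: no write/read conflict spans this barrier; drop it
        else:
            kept.append(op)
            if op[0] == "store":
                pending_writes.add(op[1])
    return kept

ops = [("store", "shared"), "barrier", ("load", "shared"),
       "barrier", ("load", "other")]
print(remove_redundant_barriers(ops))  # the second barrier is removed
```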
-
PR comments. [Charles Jacobs]
-
Fixed lit test. [Charles Jacobs]
-
Remove duplicated code. Remove hardcoded numeric memory space values. [Charles Jacobs]
-
Remove multiprocessing import. [Abdul Dakkak]
-
Add python test cases. [Abdul Dakkak]
-
Cleanup. [Charles Jacobs]
-
Added lit test. Removed debug output. [Charles Jacobs]
-
Basic version working (straight-line code, all-or-nothing memory accesses). Moved much of the analysis from the "analysis" class to the pass. [Charles Jacobs]
Added try/except in init for gpu and llvm submodules
-
Propagating read info. [Charles Jacobs]
-
Removed barrier op rewrite pattern. [Charles Jacobs]
-
Simple case of all-or-nothing write-only barriers on inline code working (I think). [Charles Jacobs]
-
Tweaked barrier opt pass. Added BarrierOp to Python DSL. Added BarrierScope to Python DSL. [Charles Jacobs]
-
Add trivial barrier optimization pass. [Charles Jacobs]
-
Propagate the runtime info throughout the accera pipeline. [Abdul Dakkak]
-
Merged PR 2431: Support split after fusing unequal iteration spaces. [Lisa Ong]
- For each correspondence index entry, perform end-padding to the largest-sized index range (see the sketch below)
- Support different sized iteration spaces in any fusion order
- Add boundary block splits whenever the InRange predicate is applied on an outer split index. Currently, this can incur more splits than necessary to ensure correctness
Related work items: #3476
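A sketch of the end-padding idea (ours, in plain Python): fused index ranges are padded to the largest range, and an InRange-style predicate guards the kernels of the smaller spaces so the padding iterations are no-ops.

```python
def fuse_unequal(range_a, range_b, kernel_a, kernel_b):
    # Pad the fused index range to the larger of the two ranges; guard
    # each kernel with an in-range check for its own iteration space.
    for i in range(max(range_a, range_b)):
        if i < range_a:   # InRange predicate for space A
            kernel_a(i)
        if i < range_b:   # InRange predicate for space B
            kernel_b(i)

fuse_unequal(16, 10, lambda i: print("A", i), lambda i: print("B", i))
```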
-
Merged PR 2436: [Debug mode] Support subarrays as function arguments. [Lisa Ong]
Force an identity affine map if it cannot be canonicalized
Related work items: #3647
-
Merged PR 2435: Binding thread and block ids updates launch params. [Kern Handa]
Binding thread and block ids updates launch params
-
Merged PR 2433: Remove defunct accera dialect transforms. [Kern Handa]
Remove defunct accera dialect transforms
-
Merged PR 2432: split range optimization into two parts (analysis and optimization) [Abdul Dakkak]
This uses the analysis pipeline to implement the range analysis. The range optimization is one instance of code that uses the range analysis info.
-
Merged PR 2428: Perform Optimization using Range Analysis. [Abdul Dakkak]
This uses range reduction to remove provably true/false conditions. For example, given the following mlir file:

```mlir
gpu.module @NestFunction_0_module attributes {gpu.binary = "HSACO"} {
  gpu.func @NestFunction_0(%arg0: memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, %arg1: memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>) kernel attributes {blockSize = [16 : i32, 16 : i32, 1 : i32], gridSize = [2 : i32, 2 : i32, 1 : i32]} {
    %c16 = constant 16 : index
    %c8 = constant 8 : index
    %c0 = constant 0 : index
    %c15 = constant 15 : index
    %c-1 = constant -1 : index
    %c1 = constant 1 : index
    %c2 = constant 2 : index
    %0 = "gpu.thread_id"() {dimension = "y"} : () -> index
    %1 = "gpu.thread_id"() {dimension = "x"} : () -> index
    %2 = "gpu.block_id"() {dimension = "y"} : () -> index
    %3 = "gpu.block_id"() {dimension = "x"} : () -> index
    %4 = memref.alloc() : memref<32x16xf32, 3>
    scf.for %arg2 = %c0 to %c2 step %c1 {
      %10 = cmpi sge, %arg2, %c0 : index
      %11 = muli %arg2, %c-1 : index
      %12 = addi %11, %c1 : index
      %13 = cmpi sge, %12, %c0 : index
      %14 = and %10, %13 : i1
      %15 = cmpi sge, %0, %c0 : index
      %16 = and %14, %15 : i1
      %17 = muli %0, %c-1 : index
      %18 = addi %17, %c15 : index
      %19 = cmpi sge, %18, %c0 : index
      %20 = and %16, %19 : i1
      %21 = cmpi sge, %1, %c0 : index
      %22 = and %20, %21 : i1
      %23 = muli %1, %c-1 : index
      %24 = addi %23, %c15 : index
      %25 = cmpi sge, %24, %c0 : index
      %26 = and %22, %25 : i1
      scf.if %26 {
        %27 = muli %3, %c16 : index
        %28 = addi %27, %c8 : index
        %29 = memref.load %arg0[%28, %c0] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>
        memref.store %29, %4[%c0, %c8] : memref<32x16xf32, 3>
      }
    }
    gpu.barrier
    %5 = muli %2, %c16 : index
    %6 = addi %0, %5 : index
    %7 = memref.load %4[%6, %1] : memref<32x16xf32, 3>
    %8 = muli %3, %c16 : index
    %9 = addi %1, %8 : index
    memref.store %7, %arg1[%9, %6] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>
    gpu.return
  }
}
```

one can run it through the

--optimize-range-value --cse --sccp --symbol-dce

pass sequence to get:

```mlir
#map = affine_map<(d0, d1) -> (d0 * 32 + d1)>
module {
  gpu.module @NestFunction_0_module attributes {gpu.binary = "HSACO"} {
    gpu.func @NestFunction_0(%arg0: memref<32x32xf32, #map>, %arg1: memref<32x32xf32, #map>) kernel attributes {blockSize = [16 : i32, 16 : i32, 1 : i32], gridSize = [2 : i32, 2 : i32, 1 : i32]} {
      %true = constant true
      %c2 = constant 2 : index
      %c1 = constant 1 : index
      %c-1 = constant -1 : index
      %c15 = constant 15 : index
      %c0 = constant 0 : index
      %c8 = constant 8 : index
      %c16 = constant 16 : index
      %0 = "gpu.thread_id"() {dimension = "y"} : () -> index
      %1 = "gpu.thread_id"() {dimension = "x"} : () -> index
      %2 = "gpu.block_id"() {dimension = "y...
```
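A sketch of the interval reasoning behind such rewrites (ours, in Python): when an operand's value range is known, comparisons that hold over the whole range fold to constants.

```python
class Range:
    """Closed integer interval [lo, hi], e.g. a thread id x in [0, 15]."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

def fold_sge(x, y):
    """Fold 'cmpi sge, x, y' when the ranges make it provably true/false."""
    if x.lo >= y.hi:
        return True    # x >= y for every value in the ranges
    if x.hi < y.lo:
        return False   # x >= y never holds
    return None        # unknown: keep the compare

tid = Range(0, 15)     # from blockSize = [16, 16, 1]
zero = Range(0, 0)
print(fold_sge(tid, zero))  # True: the guard condition reduces to %true
```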
-
Merged PR 2395: Add strided View implementation for containers. [Ritwik Das]
- Add strided View implementation for containers
- Added tests
Related work items: #3641
-
Merged PR 2429: Merged changes from GitHub remote. [Lisa Ong]
commit 8c0fb54
-
Squashed commit of the following: [Lisa Ong]
commit 8c0fb54
-
Squashed commit of the following: [Lisa Ong]
commit 4fccbb9
-
Merged PR 2427: Disable CI triggers for weekly SDL pipelines. [Lisa Ong]
The `trigger` needs to be explicitly set to `none` to be disabled, otherwise it will run as a CI trigger on the default branch.
-
Merged PR 2426: Expose and plumb support for ARM Cortex-M4F. [Kern Handa]
Expose and plumb support for ARM Cortex-M4F
-
Merged PR 2425: Set non-CI pipelines on a weekly schedule. [Lisa Ong]
Setup a weekly schedule to conserve resources. The pipelines will run at the beginning of the week to verify the payload from the previous week.
Also removed the main trigger from Linux Package Build as it is only used in CI.
-
Merged PR 2424: Const arrays reused in two different functions result in duplicate symbol names. [Denny Sun]
valueModuleOp.lookupSymbol() should be called here to look for an existing symbol, but so far it doesn't work as expected, so we manually walk the top-level ops inside the ValueModuleOp to look for the symbol. Replace this workaround with a ValueModuleOp SymbolTable lookup once the issues with comparing mlir::Identifiers are resolved.
Related work items: #3583
-
Fix the typo and simplify for loop. [Denny Sun]
-
Fix the build break caused by typo. [Denny Sun]
-
Walk the top level ops inside the ValueModuleOp to look for the symbol. [Denny Sun]
-
Correct the logic. [Denny Sun]
-
Draft PR for fixing dup global const. [Denny Sun]
-
Merged PR 2422: Simplify implicit type casting check and add support for more conversions. [Kern Handa]
Simplify implicit type casting check and add support for more conversions
-
. [Kern Handa]
-
Change controversial impl. [Kern Handa]
-
Simplify implicit type casting check and add support for more conversions. [Kern Handa]
-
Merged PR 2421: Update accera-translate to emit CUDA code from acc-opt output. [Kern Handa]
Update accera-translate to emit CUDA code from acc-opt output
This change adds support for new ops and refactors support for existing ops, with the primary goal being the emission of CUDA code from the IR output from acc-opt (when given the proper CL args). A secondary goal was to clean up the code and make it easier to expand the support for both C++ and CUDA-like code in the future.
-
Merged PR 2419: Add automatic type casting for basic scalar types. [Denny Sun]
This change makes Scalar able to upcast, e.g. Int8 to Float, Int8 to Int32, Float to Double, etc.
With this fix, our users can write the following Python code without explicit type casting, where A is an array object of Float type:
A[i, j] = 5
Related work items: #3570
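A sketch of the widening-only rule (ours for illustration; the actual conversion set lives in the Scalar implementation):

```python
# Illustrative promotion order: implicit casts go from narrower to wider
# types only (Int8 -> Int32, Int8 -> Float, Float -> Double, ...).
WIDENING_ORDER = ["int8", "int16", "int32", "float32", "float64"]

def can_implicitly_cast(src, dst):
    return WIDENING_ORDER.index(src) <= WIDENING_ORDER.index(dst)

print(can_implicitly_cast("int8", "float32"))   # True: Int8 -> Float
print(can_implicitly_cast("float32", "int8"))   # False: would narrow
```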
-
Add more type cast. [Denny Sun]
-
Type casting support in Scalar. [Denny Sun]
-
Merged PR 2420: Add smoke test repro'ing CONST array bug. [Mason Remy]
Add smoke test repro'ing CONST array bug
-
Add smoke test repro'ing CONST array bug. [Mason Remy]
-
Merged PR 2418: Fix Package format enum and memory space tablegen enum. [Mason Remy]
Fix Package format enum and memory space tablegen enum
-
Fix Package format enum and memory space tablegen enum. [Mason Remy]
-
Merged PR 2415: propagate the runtime info throughout the accera pipeline. [Abdul Dakkak]
Propagate the runtime info throughout the accera pipeline
-
Updated smoke_test.py. [Abdul Dakkak]
-
Resolve comments. [Abdul Dakkak]
-
Propagate the runtime info throughout the accera pipeline. [Abdul Dakkak]
-
Merged PR 2410: Initial work on supporting GPU Caching. [Abdul Dakkak]
This merges Alex's work on GPU caching. Tests do not run for reasons unrelated to caching
-
Resolve merge conflicts. [Abdul Dakkak]
-
Rename memory_space to location per the docs. [Abdul Dakkak]
-
Checkpoint. [Abdul Dakkak]
-
Continue merge. [Abdul Dakkak]
-
Continue merge. [Abdul Dakkak]
-
Merge from main. [Abdul Dakkak]
-
Resolve comments. [Abdul Dakkak]
-
Disable rocm test for now. [Abdul Dakkak]
-
Fix typo. [Abdul Dakkak]
-
Do not generate the global helpers if we are generating only gpu code and the target is not an object. [Abdul Dakkak]
-
Merge branch 'dev/kerha/fix_quiet_and_gpu_only' into dev/adakkak/initial_gpu_caching. [Abdul Dakkak]
-
Fix quiet and gpu_only integrations in FE. [Kern Handa]
-
Fix setting of target category. [Abdul Dakkak]
-
Checkpoint. [Abdul Dakkak]
-
Initial work on merging Alex's branch. [Abdul Dakkak]
-
Merged PR 2412: Merged commits from GitHub main. [Lisa Ong]
Changes:
commit fa63d6e
-
Squashed commit of the following: [Lisa Ong]
commit 2668b4b
-
Update vcpkg. [Lisa Ong]
-
Removed stale files from merge. [Lisa Ong]
-
Squashed commit of the following: [Lisa Ong]
commit fa63d6e
-
Merged PR 2414: LLVM 13.0.1 update. [Lisa Ong]
Dependent PR: !2413
Incremental version update to LLVM, no changes in MLIR: llvm/llvm-project@llvmorg-13.0.0...llvmorg-13.0.1
Related work items: #3620
-
Merged PR 2409: [nfc] Pull out ACCERA_TOOLS_DIR to the main CMakeLists.txt file. [Kern Handa]
[nfc] Pull out ACCERA_TOOLS_DIR to the main CMakeLists.txt file
-
Merged PR 2408: [nfc] Update python style config, apply it on tree. [Kern Handa]
[nfc] Update python style config, apply it on tree
-
Merged PR 2407: Fix quiet and gpu_only integrations in FE. [Kern Handa]
Fix quiet and gpu_only integrations in FE
-
Merged PR 2406: Update accc.py to build with gpu_only mode. [Kern Handa]
Update accc.py to build with gpu_only mode
-
Merged PR 2404: Add support for translation of accera dialect to cpp/cuda. [Kern Handa]
-
Merged PR 2403: Update Python API for GPU support. [Kern Handa]
Update Python API for GPU support
Changes:
- Add support for gpu_only compilation mode in python layer
- Add support for specifying the execution runtime as a compiler option
- Adds the _gpu submodule to the accera module, which adds support for GPU specific ops - MFMA and Barrier, along with Python hooks for GPU Indices
-
Merged PR 2402: Update Value library to support GPU related changes. [Kern Handa]
Update Value library to support GPU related changes
- Add support for performing only GPU related compilations to compiler front-end
- Add support for strided views to Value DSL
- Matrix::SubMatrix has been updated to accept row, col strides
- Add support for MFMA ops in Value DSL (WIP)
- Add support for tensorization to Value DSL's GPUPlan
-
Merged PR 2401: Adds the GPU pass pipeline and related transforms. [Kern Handa]
Adds the GPU pass pipeline and related transforms
Additionally:
- Added support for adding execution runtime annotations to functions, enabling target compilation control
- Adds support for Tensorization
- Adds support for MFMA (WIP)
- Adds a debug utility transform to dump the module, but disabled from the build by default
- Improved subview op semantics, enabling support for strided views
-
Merged PR 2400: Add support for MFMA, Tensorization, and Exec Runtime in Value IR. [Kern Handa]
Add support for MFMA, Tensorization, and Exec Runtime in Value IR
-
Merged PR 2399: Integrate acc-translate into the python layer. [Kern Handa]
This adds build support for calling acc-translate after the MLIR transformations are done. Work is still pending for front-end and back-end support for GPU.
-
Merged PR 2385: Add Execution Runtime to module. [Abdul Dakkak]
This continues on the merge path. It adds an execution runtime option to a GPUPlan which gets propagated and added to the module. Subsequent PRs will use this feature to dispatch to Rocm or SPIRV.
-
Merged PR 2398: quiet the build output from python by default. [Kern Handa]
Quiet the build output from python by default
Related work items: #3557
-
Merged PR 2397: Disable command line execution of accc.py. [Kern Handa]
Disable command line execution of accc.py
-
Merged PR 2396: AVX512 was being incorrectly gated. [Kern Handa]
AVX512 was being incorrectly gated
-
Merged PR 2393: Pick up demo fixups from github/main. [Lisa Ong]
The binder demo now runs in a separate branch github/demos so that binder does not try to build the accera package.
The github/demos branch will not be merged into github/main (similar to github/gh-pages), and lives to host binder demos.
-
Merged PR 2394: Support Python 3.10. [Lisa Ong]
- Add Python 3.10 to package builds and as the default Python version for all pipelines
- Exception: Linux, macOS, and Windows buddy builds rely on onnxruntime, which only supports up to 3.9. Since these pipelines only build 1 python version, we'll keep them at 3.9 so that there's test coverage for ORT.
Related work items: #3643
New Contributors
- @Arslan-e-Mustafa made their first contribution in #13