
v1.2.2

@lisaong lisaong released this 18 Mar 05:32
· 73 commits to main since this release

What's Changed

Full Changelog: v1.2.1...v1.2.2

  • Merged PR 2439: Downstream doc changes from github/main. [Lisa Ong]

    Squashed commit of the following:

    commit 8a6e553

  • Merged PR 2440: Enable tensorization for Rocm target. [Abdul Dakkak]

  • Merged PR 2470: Adds support for the execution of GPU (CUDA only)
    functions via hat. [Kern Handa]

  • Merged PR 2467: Adding multiple functions in package.add() can't work
    with stateful auxiliary metadata and index_map. [Denny Sun]

    These bugs all stem from sharing Python objects, such as auxiliary metadata and the schedule's indices, among different functions. When we call package.add() to add multiple parameterized functions, we add the functions one by one and then emit them one by one; at each step the state of the shared Python objects changes, which results in only the first function added being emitted correctly. To make _add_function work, we need to make these shared Python objects stateless (see the sketch below).

    Related work items: #3662
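
    A minimal sketch of the hazard, using hypothetical Package/add names rather than Accera's actual implementation: if every added function keeps a reference to the same mutable dict, mutating it between add() calls silently rewrites the metadata recorded for the earlier functions, so each function must get its own stateless copy.

    import copy

    class Package:
        def __init__(self):
            self._functions = []

        def add(self, fn, auxiliary=None):
            # Deep-copying makes each function's metadata stateless with
            # respect to later mutations; appending `auxiliary` directly
            # would share one mutable object across all added functions.
            self._functions.append((fn, copy.deepcopy(auxiliary or {})))

    pkg = Package()
    meta = {"variant": 0}
    for i in range(3):
        meta["variant"] = i          # caller reuses one dict for every add()
        pkg.add(lambda: None, auxiliary=meta)

    print([aux["variant"] for _, aux in pkg._functions])  # [0, 1, 2], not [2, 2, 2]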

  • Merged PR 2469: Convert 'Local' memory space to 'Private' [Mason Remy]

    Convert 'Local' memory space to 'Private'

  • Merged PR 2463: Enable specifying double buffer memory space. [Mason
    Remy]

    Enable specifying double buffer memory space

  • Merged PR 2468: Move to VS2022 for builds. [Kern Handa]

    Move to VS2022 for builds

  • Merged PR 2465: extend gpu target spec. [Abdul Dakkak]

    extend gpu target spec

  • Merged PR 2464: Compute a stable hash for function name suffixes.
    [Lisa Ong]

    Create a stable hash using md5 and json serialization of these stringized entries:

    • Array args: shape, type, role, layout
    • parameter dictionary
    • Target

    Example output:

    test_unequal_iteration_space_fusing_1 (__main__.DSLTest_04Fusing) ... DEBUG:root:Adding wrapped function
    DEBUG:root:Adding wrapped function
    Building function fusing_test_32d12fb1a01061ec
    DEBUG:root:Detected logic function _ uses indices i,j
    DEBUG:root:Detected logic function _ uses indices i,j
    Building function _debug_check_allclose_16_16_4cfd65a8b606655b
    
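    A minimal sketch of this hashing scheme, with illustrative field names (the exact serialized entries are internal details):

    import hashlib
    import json

    def stable_suffix(arrays, parameters, target):
        entries = {
            "args": [f"{a['shape']}/{a['type']}/{a['role']}/{a['layout']}"
                     for a in arrays],
            "parameters": {k: str(v) for k, v in parameters.items()},
            "target": str(target),
        }
        # sort_keys gives a deterministic serialization, hence a stable hash
        payload = json.dumps(entries, sort_keys=True)
        return hashlib.md5(payload.encode("utf-8")).hexdigest()[:16]

    args = [{"shape": (16, 16), "type": "float32", "role": "input",
             "layout": "FIRST_MAJOR"}]
    print(stable_suffix(args, {"split": 4}, "HOST"))  # 16-hex-digit suffix
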
  • Merged PR 2460: [nfc] Fix build.sh setting for vcpkg debug builds.
    [Lisa Ong]

  • Merged PR 2461: Replace MemoryType with MemorySpace for consistency.
    [Mason Remy]

    Replace MemoryType with MemorySpace for consistency

  • Merged PR 2416: Implement initial thrifty caching support. [Mason
    Remy]

    Implement initial thrifty caching support

    • This is a simple brute-force approach where each thrifty cache is
      examined element-by-element alongside the array it is caching to check
      whether there is a stride of 1 between every access
    • Currently this thrifty analysis and the potential erasing of thrifty
      caches happens after the cache ops have been created. This is due to
      needing the cache mapping to have already run in order to support
      hierarchical caching scenarios. Eventually this should be refactored
      and the thrifty analysis should be used to prevent creating the cache
      ops, but that is a larger refactor than the scope for this task.
    • When creating affine loads and stores into caches, this change also
      tacks some attributes onto the load/store ops to indicate how the
      original load or store accessed the base array. Since the base array
      -> cache position mapping is not always invertible (consider
      coefficient cache layout cases), this is one of the only ways to
      encode this information. Unfortunately, canonicalization on affine
      load/store ops will scrub away these attributes, so any reliance on
      them has to occur before a canonicalization pass. Similarly, the
      MakeCacheOps' record of which arguments to their accesses are the
      base array positions depends on the operand list being unchanged;
      however, canonicalization may remove operands if it determines they
      are not used. While this is fine for the load/store op itself, any
      assumption like "base array indices are at positions N...N+K in the
      operand list" is no longer valid.

    Related work items: #3575
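
    A hypothetical sketch of the brute-force check from the first bullet, reduced to flattened base-array offsets: a cache is thrifty (and can be erased) only if consecutive cache elements map to consecutive base-array elements, i.e. the cache is just a contiguous copy.

    def is_thrifty(base_flat_offsets):
        """base_flat_offsets: the flattened base-array offset touched by each
        successive cache element, in cache iteration order."""
        return all(b - a == 1
                   for a, b in zip(base_flat_offsets, base_flat_offsets[1:]))

    # A cache of a contiguous row adds no layout benefit...
    print(is_thrifty([32, 33, 34, 35]))   # True  -> erase the thrifty cache
    # ...but a strided (e.g. transposed) cache does.
    print(is_thrifty([32, 64, 96, 128]))  # False -> keep the cache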

  • Merged PR 2459: Changes the order of the LLVM_SETUP_VARIANT detection.
    [Kern Handa]

    Changes the order of the LLVM_SETUP_VARIANT detection

  • Merged PR 2458: Fixes building with clang++ on Linux/WSL. [Kern Handa]

    Fixes building with clang++ on Linux/WSL

  • Merged PR 2438: Support for double-buffer caching. [Mason Remy]

    Support for double-buffer caching

    • Adds plumbing from python dsl for double_buffer flag to cache API
    • Implements double buffering by hoisting the initial cache fill outside
      of the cache trigger loop parent, then creating a prologue subnest
      that fills a temporary buffer with the (i+1)'st iteration's data and
      an epilogue subnest that moves that temporary buffer's data into the
      main cache buffer. The last iteration of the trigger loop's parent
      loop is unswitched, and no cache filling is done in that iteration.
    • On GPU the temporary buffer is allocated in private memory, and if
      the cache is in shared memory, each thread holds onto its own
      contribution to the cache in its own private memory buffer until the
      epilogue fill nest.
    • Barrier ops are hoisted out of conditionals to avoid potential for
      deadlocks. The conditionals introduced in this PR should be
      always-true or always-false, but this is added as a safety measure.
      Currently the hoisting is naive - any barrier within a conditional is
      erased and barriers are placed before and after the conditional block.
      This is not correct for all future conditional scenarios as any
      operations that happen within the conditional that depend on the
      barrier existing will be broken, however it works for how conditionals
      are used currently and can be improved on over time

    Related work items: #3659
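
    A schematic sketch of the loop structure described above; fill and compute are stand-ins for the cache-fill and kernel subnests, not the generated IR:

    def double_buffered(trigger_count, fill, compute):
        cache = fill(0)                     # hoisted initial cache fill
        for i in range(trigger_count - 1):  # all but the last trigger iteration
            temp = fill(i + 1)              # prologue: prefetch the (i+1)'st data
            compute(cache, i)               # work on the current cache contents
            cache = temp                    # epilogue: temp -> main cache buffer
        compute(cache, trigger_count - 1)   # unswitched last iteration: no fill

    double_buffered(
        trigger_count=4,
        fill=lambda i: f"tile {i}",
        compute=lambda c, i: print(f"iteration {i} computes on {c!r}"),
    )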

  • Merged PR 2450: Automatically add parameter dict as auxiliary data.
    [Denny Sun]

    Automatically add parameter dict as auxiliary data

    Related work items: #3662

  • Merged PR 2456: Updates CUDA source emission based on testing with
    nvrtc. [Kern Handa]

    Updates CUDA source emission based on testing with nvrtc

  • Merged PR 2453: Sets CPU targets to default to openmp. [Kern Handa]

    Sets CPU targets to default to openmp

  • Merged PR 2443: Add FP16 support. [Abdul Dakkak]

    Preparation for adding MFMA support for CUDA, which only operates on FP16.

  • Merged PR 2452: Updates GPU source emitting path to emit host launcher
    and device function pairs. [Kern Handa]

  • Merged PR 2451: Updates IR util ResolveExec[Target,Runtime] to allow
    for exact matches. [Kern Handa]

    Updates IR util ResolveExec[Target,Runtime] to allow for exact matches

  • Merged PR 2447: Makes Vulkan specific behavior pred. on Runtime. [Kern
    Handa]

    Makes Vulkan specific behavior pred. on Runtime

  • Merged PR 2446: Updates Runtime enum in Targets.py to be more
    comprehensive. [Kern Handa]

    Updates Runtime enum in Targets.py to be more comprehensive

  • Merged PR 2449: [Cleanup] Replace "rc*" prefixes with "acc*"
    prefixes in tablegen'ed code. [Lisa Ong]

    For *.td, perform the following replacements for ops:

    s/rcv_/accv_/g
    s/rc_/acc_/g
    s/rcxp_/accxp_/g
    s/rcln_/accln_/g

  • Merged PR 2448: fix typo in the condition for mod in range analysis.
    [Abdul Dakkak]

    fix typo in the condition for mod in range analysis

  • Merged PR 2445: Fix bind command when index is further split. [Abdul
    Dakkak]

  • Merged PR 2444: add range remainder. [Abdul Dakkak]

    add range remainder
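
    As a sketch of what a remainder range rule can look like (the exact rule in the pass may differ), here is the standard interval bound for a % b with a in [lo, hi] and a constant b > 0:

    def remainder_range(lo, hi, b):
        assert b > 0 and lo <= hi
        # If the interval spans fewer than b values and does not wrap across
        # a multiple of b, the remainders stay ordered; otherwise fall back
        # to the full conservative range [0, b - 1].
        if hi - lo < b and lo % b <= hi % b:
            return (lo % b, hi % b)
        return (0, b - 1)

    print(remainder_range(10, 13, 8))  # (2, 5)  -- tight bound
    print(remainder_range(6, 10, 8))   # (0, 7)  -- wraps past 8, conservative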

  • Merged PR 2441: Fix APInt usage in RangeValueOptimizePass. [Mason
    Remy]

    Run the RangeValueOptimizePass as part of acc-to-llvm

  • Merged PR 2442: Move ExecutionOptions to ir lib and create arrayattr
    <-> struct utils. [Mason Remy]

    Move ExecutionOptions to ir lib and create arrayattr <-> struct utils

  • Simplify target passthrough layer. [Mason Remy]

  • Move ExecutionOptions to ir lib and create arrayattr <-> struct utils.
    [Mason Remy]

  • Merged PR 2430: Remove unnecessary barrier ops. [Chuck Jacobs]

    This PR adds an optimization pass that removes redundant / unnecessary barrier ops around shared memory usage.

    The optimization pass in this PR is pretty simple and has a couple of limitations:

    • it only works on straight-line code (that is, when all the loads, stores, and barriers are at the same loop level as each other).
    • it considers all accesses to a specific array to be conflicts (that is, any write to an array followed by a read of that array will want to have a barrier in between them, even if the writes and reads are to different elements in the array)

    I should be following up soon with a PR that deals with barrier and memory ops at different loop levels.

    Related work items: #3648
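
    A simplified sketch of the straight-line rule, with a hypothetical op encoding rather than MLIR ops: a barrier is kept only if some shared array was written since the previous barrier, matching the all-accesses-conflict assumption above.

    def remove_redundant_barriers(ops):
        """ops: sequence of ('read', arr), ('write', arr), or ('barrier',)."""
        result, written = [], set()       # arrays written since the last barrier
        for op in ops:
            if op[0] == "barrier":
                if written:               # keep only if it orders a real conflict
                    result.append(op)
                    written.clear()
            else:
                if op[0] == "write":
                    written.add(op[1])
                result.append(op)
        return result

    ops = [("write", "A"), ("barrier",), ("read", "A"), ("barrier",), ("barrier",)]
    print(remove_redundant_barriers(ops))  # the last two barriers are dropped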

  • PR comments. [Charles Jacobs]

  • Fixed lit test. [Charles Jacobs]

  • Remove duplicated code Remove hardcoded numeric memory space values.
    [Charles Jacobs]

  • Remove multiprocessing import. [Abdul Dakkak]

  • Add python test cases. [Abdul Dakkak]

  • Cleanup. [Charles Jacobs]

  • Added lit test Removed debug output. [Charles Jacobs]

  • Basic version working (straight-line code, all-or-nothing memory
    accesses) Moved much of the analysis from the "analysis" class to the
    pass. [Charles Jacobs]

    added try/except in init for gpu and llvm submodules

  • Propagating read info. [Charles Jacobs]

  • Removed barrier op rewrite pattern. [Charles Jacobs]

  • Simple case of all-or-nothing write-only barriers on inline code
    working (I think) [Charles Jacobs]

  • Tweaked barrier opt pass Added BarrierOp to Python DSL Added
    BarrierScope to Python DSL. [Charles Jacobs]

  • Add trivial barrier optimization pass. [Charles Jacobs]

  • Propagate the runtime info throughout the accera pipeline. [Abdul
    Dakkak]

  • Merged PR 2431: Support split after fusing unequal iteration spaces.
    [Lisa Ong]

    • For each correspondence index entry, perform end-padding to the largest-sized index range
    • Support different sized iteration spaces in any fusion order
    • Add boundary block splits whenever the InRange predicate is applied on an outer split index. Currently, this can incur more splits than necessary to ensure correctness

    Related work items: #3476
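
    An illustrative sketch of end-padding plus the InRange predicate, using toy scalar loops rather than Accera iteration spaces:

    def fused(extents, kernels):
        fused_extent = max(extents)       # end-pad to the largest index range
        for i in range(fused_extent):
            for extent, kernel in zip(extents, kernels):
                if i < extent:            # InRange predicate guards each kernel
                    kernel(i)

    fused([4, 2], [lambda i: print("k0", i),   # runs for i = 0..3
                   lambda i: print("k1", i)])  # runs only for i = 0..1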

  • Merged PR 2436: [Debug mode] Support subarray's as function arguments.
    [Lisa Ong]

    Force an identity affine map if it cannot be canonicalized

    Related work items: #3647

  • Merged PR 2435: Binding thread and block ids updates launch params.
    [Kern Handa]

    Binding thread and block ids updates launch params

  • Merged PR 2433: Remove defunct accera dialect transforms. [Kern Handa]

    Remove defunct accera dialect transforms

  • Merged PR 2432: split range optimization into two parts (analysis and
    optimization) [Abdul Dakkak]

    This uses the analysis pipeline to implement the range analysis. The range optimization is one instance of code that uses the range analysis info.

  • Merged PR 2428: Perform Optimization using Range Analysis. [Abdul
    Dakkak]

    This uses range reduction to remove provably true/false conditions. For example, given the following MLIR file:

    gpu.module @NestFunction_0_module attributes {gpu.binary = "HSACO"} {
      gpu.func @NestFunction_0(%arg0: memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, %arg1: memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>) kernel attributes {blockSize = [16 : i32, 16 : i32, 1 : i32], gridSize = [2 : i32, 2 : i32, 1 : i32]} {
        %c16 = constant 16 : index
        %c8 = constant 8 : index
        %c0 = constant 0 : index
        %c15 = constant 15 : index
        %c-1 = constant -1 : index
        %c1 = constant 1 : index
        %c2 = constant 2 : index
        %0 = "gpu.thread_id"() {dimension = "y"} : () -> index
        %1 = "gpu.thread_id"() {dimension = "x"} : () -> index
        %2 = "gpu.block_id"() {dimension = "y"} : () -> index
        %3 = "gpu.block_id"() {dimension = "x"} : () -> index
        %4 = memref.alloc() : memref<32x16xf32, 3>
        scf.for %arg2 = %c0 to %c2 step %c1 {
          %10 = cmpi sge, %arg2, %c0 : index
          %11 = muli %arg2, %c-1 : index
          %12 = addi %11, %c1 : index
          %13 = cmpi sge, %12, %c0 : index
          %14 = and %10, %13 : i1
          %15 = cmpi sge, %0, %c0 : index
          %16 = and %14, %15 : i1
          %17 = muli %0, %c-1 : index
          %18 = addi %17, %c15 : index
          %19 = cmpi sge, %18, %c0 : index
          %20 = and %16, %19 : i1
          %21 = cmpi sge, %1, %c0 : index
          %22 = and %20, %21 : i1
          %23 = muli %1, %c-1 : index
          %24 = addi %23, %c15 : index
          %25 = cmpi sge, %24, %c0 : index
          %26 = and %22, %25 : i1
          scf.if %26 {
            %27 = muli %3, %c16 : index
            %28 = addi %27, %c8 : index
            %29 = memref.load %arg0[%28, %c0] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>
            memref.store %29, %4[%c0, %c8] : memref<32x16xf32, 3>
          }
        }
        gpu.barrier
        %5 = muli %2, %c16 : index
        %6 = addi %0, %5 : index
        %7 = memref.load %4[%6, %1] : memref<32x16xf32, 3>
        %8 = muli %3, %c16 : index
        %9 = addi %1, %8 : index
        memref.store %7, %arg1[%9, %6] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>
        gpu.return
      }
    }
    
    

    One can run it through the --optimize-range-value --cse --sccp --symbol-dce pass sequence to get:

    #map = affine_map<(d0, d1) -> (d0 * 32 + d1)>
    module  {
      gpu.module @NestFunction_0_module attributes {gpu.binary = "HSACO"} {
        gpu.func @NestFunction_0(%arg0: memref<32x32xf32, #map>, %arg1: memref<32x32xf32, #map>) kernel attributes {blockSize = [16 : i32, 16 : i32, 1 : i32], gridSize = [2 : i32, 2 : i32, 1 : i32]} {
          %true = constant true
          %c2 = constant 2 : index
          %c1 = constant 1 : index
          %c-1 = constant -1 : index
          %c15 = constant 15 : index
          %c0 = constant 0 : index
          %c8 = constant 8 : index
          %c16 = constant 16 : index
          %0 = "gpu.thread_id"() {dimension = "y"} : () -> index
          %1 = "gpu.thread_id"() {dimension = "x"} : () -> index
          %2 = "gpu.block_id"() {dimension = "y...
    
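    A toy sketch of why these guards fold away: every operand of the cmpi chain has a known interval (the thread ids lie in [0, 15] given the 16x16 block size), so each comparison's truth value is provable.

    def cmp_sge_range(lo, hi, rhs=0):
        if lo >= rhs:
            return True      # provably true  -> fold to constant true
        if hi < rhs:
            return False     # provably false -> fold to constant false
        return None          # unknown -> leave the compare in place

    tid = (0, 15)                                   # blockSize is 16x16
    print(cmp_sge_range(*tid))                      # tid >= 0      -> True
    print(cmp_sge_range(15 - tid[1], 15 - tid[0]))  # 15 - tid >= 0 -> True
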
  • Merged PR 2395: Add strided View implementation for containers.
    [Ritwik Das]

    • Add strided View implementation for containers
    • added tests

    Related work items: #3641

  • Merged PR 2429: Merged changes from GitHub remote. [Lisa Ong]

    commit 8c0fb54

  • Squashed commit of the following: [Lisa Ong]

    commit 8c0fb54

  • Squashed commit of the following: [Lisa Ong]

    commit 4fccbb9

  • Merged PR 2427: Disable CI triggers for weekly SDL pipelines. [Lisa
    Ong]

    The trigger needs to be explicitly set to none to be disabled; otherwise it will run as a CI trigger on the default branch.

  • Merged PR 2426: Expose and plumb support for ARM Cortex-M4F. [Kern
    Handa]

    Expose and plumb support for ARM Cortex-M4F

  • Merged PR 2425: Set non-CI pipelines on a weekly schedule. [Lisa Ong]

    Set up a weekly schedule to conserve resources. The pipelines will run at the beginning of the week to verify the payload from the previous week.

    Also removed the main trigger from Linux Package Build as it is only used in CI.

  • Merged PR 2424: Const arrays reused in two different functions results
    in duplicate symbol name. [Denny Sun]

    valueModuleOp.lookupSymbol() should be called here to look for an existing symbol, but so far it doesn't work as expected, so we manually walk the top-level ops inside the ValueModuleOp to look for the symbol. This workaround should be replaced with a ValueModuleOp SymbolTable lookup once the issues with comparing mlir::Identifiers are resolved.

    Related work items: #3583

  • Fix the typo and simplify for loop. [Denny Sun]

  • Fix the build break caused by typo. [Denny Sun]

  • Walk the top level ops inside the ValueModuleOp to look for the
    symbol. [Denny Sun]

  • Correct the logic. [Denny Sun]

  • Draft PR for fixing dup global const. [Denny Sun]

  • Merged PR 2422: Simplify implicit type casting check and add support
    for more conversions. [Kern Handa]

    Simplify implicit type casting check and add support for more conversions

  • . [Kern Handa]

  • Change controversial impl. [Kern Handa]

  • Simplify implicit type casting check and add support for more
    conversions. [Kern Handa]

  • Merged PR 2421: Update accera-translate to emit CUDA code from acc-opt
    output. [Kern Handa]

    Update accera-translate to emit CUDA code from acc-opt output

    This change adds support for new ops and refactors support for existing
    ops, with the primary goal being the emission of CUDA code from the IR
    output from acc-opt (when given the proper CL args). A secondary goal
    was to clean up the code and make it easier to expand the support for
    both C++ and CUDA-like code in the future.

  • Merged PR 2419: Add automatic type casting for basic scalar types.
    [Denny Sun]

    This change makes Scalar able to upcast, e.g. Int8 to Float, Int8 to Int32, Float to Double, etc.
    With this fix, users can write the following Python code without explicit type casting, where A is an array object with Float type:
    A[i, j] = 5

    Related work items: #3570
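
    A toy model of the widening rule, with illustrative type names rather than Accera's actual Scalar implementation:

    WIDENING = {
        "Int8":    {"Int16", "Int32", "Float32", "Float64"},
        "Int16":   {"Int32", "Float32", "Float64"},
        "Int32":   {"Float32", "Float64"},
        "Float32": {"Float64"},
    }

    def implicit_cast(src, dst):
        # Upcasts are inserted automatically; narrowing must be explicit.
        if src == dst or dst in WIDENING.get(src, set()):
            return dst
        raise TypeError(f"no implicit cast {src} -> {dst}")

    # A is Float32 and the literal 5 fits in Int8, so A[i, j] = 5 upcasts:
    print(implicit_cast("Int8", "Float32"))    # Float32

    try:
        implicit_cast("Float32", "Int8")       # narrowing is rejected
    except TypeError as err:
        print(err)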

  • Add more type cast. [Denny Sun]

  • Type casting support in Scalar. [Denny Sun]

  • Merged PR 2420: Add smoke test repro'ing CONST array bug. [Mason Remy]

    Add smoke test repro'ing CONST array bug

  • Add smoke test repro'ing CONST array bug. [Mason Remy]

  • Merged PR 2418: Fix Package format enum and memory space tablegen
    enum. [Mason Remy]

    Fix Package format enum and memory space tablegen enum

  • Fix Package format enum and memory space tablegen enum. [Mason Remy]

  • Merged PR 2415: propagate the runtime info throughout the accera
    pipeline. [Abdul Dakkak]

    propagate the runtime info throughout the accera pipeline

  • Updated smoke_test.py. [Abdul Dakkak]

  • Resolve comments. [Abdul Dakkak]

  • Propagate the runtime info throughout the accera pipeline. [Abdul
    Dakkak]

  • Merged PR 2410: Initial work on supporting GPU Caching. [Abdul Dakkak]

    This merges Alex's work on GPU caching. Tests do not run for reasons unrelated to caching

  • Resolve merge conflicts. [Abdul Dakkak]

  • Rename memory_space to location per the docs. [Abdul Dakkak]

  • Checkpoint. [Abdul Dakkak]

  • Continue merge. [Abdul Dakkak]

  • Continue merge. [Abdul Dakkak]

  • Merge from main. [Abdul Dakkak]

  • Resolve comments. [Abdul Dakkak]

  • Disable rocm test for now. [Abdul Dakkak]

  • Fix typo. [Abdul Dakkak]

  • Do not generate the global helpers if we are generating only gpu code
    and the target is not an object. [Abdul Dakkak]

  • Merge branch 'dev/kerha/fix_quiet_and_gpu_only' into
    dev/adakkak/initial_gpu_caching. [Abdul Dakkak]

  • Fix quiet and gpu_only integrations in FE. [Kern Handa]

  • Fix setting of target category. [Abdul Dakkak]

  • Checkpoint. [Abdul Dakkak]

  • Initial work on merging Alex's branch. [Abdul Dakkak]

  • Merged PR 2412: Merged commits from GitHub main. [Lisa Ong]

    Changes:

    commit fa63d6e

  • Squashed commit of the following: [Lisa Ong]

    commit 2668b4b

  • Update vcpkg. [Lisa Ong]

  • Removed stale files from merge. [Lisa Ong]

  • Squashed commit of the following: [Lisa Ong]

    commit fa63d6e

  • Merged PR 2414: LLVM 13.0.1 update. [Lisa Ong]

    Dependent PR: !2413

    Incremental version update to LLVM, no changes in MLIR: llvm/llvm-project@llvmorg-13.0.0...llvmorg-13.0.1

    Related work items: #3620

  • Merged PR 2409: [nfc] Pull out ACCERA_TOOLS_DIR to the main
    CMakeLists.txt file. [Kern Handa]

    [nfc] Pull out ACCERA_TOOLS_DIR to the main CMakeLists.txt file

  • Merged PR 2408: [nfc] Update python style config, apply it on tree.
    [Kern Handa]

    [nfc] Update python style config, apply it on tree

  • Merged PR 2407: Fix quiet and gpu_only integrations in FE. [Kern
    Handa]

    Fix quiet and gpu_only integrations in FE

  • Merged PR 2406: Update accc.py to build with gpu_only mode. [Kern
    Handa]

    Update accc.py to build with gpu_only mode

  • Merged PR 2404: Add support for translation of accera dialect to
    cpp/cuda. [Kern Handa]

  • Merged PR 2403: Update Python API for GPU support. [Kern Handa]

    Update Python API for GPU support

    Changes:

    • Add support for gpu_only compilation mode in python layer
    • Add support for specifying the execution runtime as a compiler option
    • Adds the _gpu submodule to the accera module, which adds support for GPU specific ops - MFMA and Barrier, along with Python hooks for GPU Indices

  • Merged PR 2402: Update Value library to support GPU related changes.
    [Kern Handa]

    Update Value library to support GPU related changes

    • Add support for performing only GPU related compilations to compiler front-end
    • Add support for strided views to Value DSL
      • Matrix::SubMatrix has been updated to accept row, col strides
    • Add support for MFMA ops in Value DSL (WIP)
    • Add support for tensorization to Value DSL's GPUPlan

  • Merged PR 2401: Adds the GPU pass pipeline and related transforms.
    [Kern Handa]

    Adds the GPU pass pipeline and related transforms

    Additionally:

    • Added support for adding execution runtime annotations to functions, enabling target compilation control
    • Adds support for Tensorization
    • Adds support for MFMA (WIP)
    • Adds a debug utility transform to dump the module, but disabled from the build by default
    • Improved subview op semantics, enabling support for strided views

  • Merged PR 2400: Add support for MFMA, Tensorization, and Exec Runtime
    in Value IR. [Kern Handa]

    Add support for MFMA, Tensorization, and Exec Runtime in Value IR

  • Merged PR 2399: Integrate acc-translate into the python layer. [Kern
    Handa]

    This adds build support for calling acc-translate after the MLIR transformations are done. Work is still pending for front-end and back-end support for GPU.

  • Merged PR 2385: Add Execution Runtime to module. [Abdul Dakkak]

    This continues on the merge path. It adds an execution runtime option to a GPUPlan which gets propagated and added to the module. Subsequent PRs will use this feature to dispatch to Rocm or SPIRV.

  • Merged PR 2398: quiet the build output from python by default. [Kern
    Handa]

    quiet the build output from python by default

    Related work items: #3557

  • Merged PR 2397: Disable command line execution of accc.py. [Kern
    Handa]

    Disable command line execution of accc.py

  • Merged PR 2396: AVX512 was being incorrectly gated. [Kern Handa]

    AVX512 was being incorrectly gated

  • Merged PR 2393: Pick up demo fixups from github/main. [Lisa Ong]

    The binder demo now runs in a separate branch github/demos so that binder does not try to build the accera package.

    The github/demos branch will not be merged into github/main (similar to github/gh-pages), and lives to host binder demos.

  • Merged PR 2394: Support Python 3.10. [Lisa Ong]

    • Add Python 3.10 to package builds and as the default Python version for all pipelines
    • Exception: Linux, macOS, and Windows buddy builds rely on onnxruntime, which only supports up to Python 3.9. Since these pipelines build only one Python version, we'll keep them at 3.9 so that there's test coverage for ORT.

    Related work items: #3643

New Contributors