Squashed commit of the following:

commit 37efd7c8223542c3d953f6127308542013c159b8
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Fri Mar 18 00:34:18 2022 +0000

    Merged PR 2439: Downstream doc changes from github/main

    Squashed commit of the following:

    commit 8a6e553
    Author: Arslan-e-Mustafa <70168134+Arslan-e-Mustafa@users.noreply.github.com>
    Date:   Sat Feb 26 16:50:57 2022 +0500

        complete refactoring of introduction.md file in manual docs (#15)

        * Feedback addressed
        * Addressed the pending comments

    commit 329d695
    Author: Arslan-e-Mustafa <70168134+Arslan-e-Mustafa@users.noreply.github.com>
    Date:   Fri Feb 25 21:37:19 2022 +0500

        Complete refactoring of file array.md and simple affine loop nests.md file in manual docs (#16)

        * complete refactoring of introduction.md file
        * completed array.md and simple affine loop nests.md files
        * Took care of extra semicolon

    commit 04af790
    Author: Arslan-e-Mustafa <70168134+Arslan-e-Mustafa@users.noreply.github.com>
    Date:   Tue Feb 22 05:42:22 2022 +0500

        README.md refactoring (#13)

        * initial commit
        * worked on README.md until goals of accera section. Took the liberty of changing some headings, restructuring the paragraphs, and adding one more goal
        * Feedback addressed regarding README.md file
        * Take care of last comment and completed the whole file from my side

        Co-authored-by: Lisa Ong <11318241+lisaong@users.noreply.github.com>

commit 356872bf787b3b076ac45aa86d2275ffcd15364e
Author: Abdul Dakkak <adakkak@microsoft.com>
Date:   Thu Mar 17 12:35:33 2022 +0000

    Merged PR 2440: Enable tensorization for Rocm target

commit 5557ff59f398ddad818e9c5b93cd00408bd7637c
Author: Kern Handa <kerha@microsoft.com>
Date:   Wed Mar 16 22:03:29 2022 +0000

    Merged PR 2470: Adds support for the execution of GPU (CUDA only) functions via hat

commit fb803a9fbaf0bfa7f809f5bdd8366629febb9bd0
Author: Denny Sun <dennys@microsoft.com>
Date:   Wed Mar 16 20:18:23 2022 +0000

    Merged PR 2467: Adding multiple functions in package.add() can't work with stateful auxiliary metadata and index_map

    These bugs are all about sharing Python objects among different functions, like auxiliary metadata and schedule's indexes. When we call package.add() to add multiple parameterized functions, we add functions one by one, then emit functions one by one. At each step, the state of the shared Python object is changed, which results in only the first function added being correctly emitted. To make _add_function work, we need to make these shared Python objects stateless.

    Related work items: #3662

commit e149bac1147d160b05aa55ad8ef4416423c20925
Author: Mason Remy <masonr@microsoft.com>
Date:   Wed Mar 16 06:31:10 2022 +0000

    Merged PR 2469: Convert 'Local' memory space to 'Private'

    Convert 'Local' memory space to 'Private'

commit 65363d35f7a31dfc682366ba70caaf301806a44b
Author: Mason Remy <masonr@microsoft.com>
Date:   Wed Mar 16 02:41:31 2022 +0000

    Merged PR 2463: Enable specifying double buffer memory space

    Enable specifying double buffer memory space

commit f80b46af2b12689ff617ba3a491fee6ae9aad010
Author: Kern Handa <kerha@microsoft.com>
Date:   Wed Mar 16 01:57:46 2022 +0000

    Merged PR 2468: Move to VS2022 for builds

    Move to VS2022 for builds

commit 0870cb27ccbe52fa8182b960140f5b6d562ab929
Author: Abdul Dakkak <adakkak@microsoft.com>
Date:   Tue Mar 15 14:01:15 2022 +0000

    Merged PR 2465: extend gpu target spec

    extend gpu target spec

commit 07088ecd0700fee16efdab677a581aa47a6a8690
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Mar 15 09:30:22 2022 +0000

    Merged PR 2464: Compute a stable hash for function name suffixes

    Create a stable hash using md5 and json serialization of these stringized entries:
    - Array args: shape, type, role, layout
    - parameter dictionary
    - Target

    Example output:
    ```
    test_unequal_iteration_space_fusing_1 (__main__.DSLTest_04Fusing) ... DEBUG:root:Adding wrapped function
    DEBUG:root:Adding wrapped function
    Building function fusing_test_32d12fb1a01061ec
    DEBUG:root:Detected logic function _ uses indices i,j
    DEBUG:root:Detected logic function _ uses indices i,j
    Building function _debug_check_allclose_16_16_4cfd65a8b606655b
    ```

commit 63e82be5e7b92f750fdf6c19347609c119cc5642
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Mar 15 00:25:13 2022 +0000

    Merged PR 2460: [nfc] Fix build.sh setting for vcpkg debug builds

commit d5ca516084dd68966e8c14b6d64d4402f572349a
Author: Mason Remy <masonr@microsoft.com>
Date:   Mon Mar 14 19:53:46 2022 +0000

    Merged PR 2461: Replace MemoryType with MemorySpace for consistency

    Replace MemoryType with MemorySpace for consistency

commit fdb503611bd235ca59c7769bd0d752519ce42bf5
Author: Mason Remy <masonr@microsoft.com>
Date:   Mon Mar 14 18:42:45 2022 +0000

    Merged PR 2416: Implement initial thrifty caching support

    Implement initial thrifty caching support

    - This is a simple brute-force approach where each thrifty cache is examined element-by-element alongside the array it is caching to check whether there is a stride of 1 between every access
    - Currently this thrifty analysis and the potential erasing of thrifty caches happens after the cache ops have been created. This is due to needing the cache mapping to have already run in order to support hierarchical caching scenarios. Eventually this should be refactored and the thrifty analysis should be used to prevent creating the cache ops, but that is a larger refactor than the scope for this task.
    - When creating affine loads and stores into caches, this change also tacks on some attributes onto the load/store ops to indicate how the original load or store accessed the base array. Since the base array -> cache position mapping is not always invertible (consider coefficient cache layout cases), this is one of the only ways to encode this information. Unfortunately, canonicalization on affine load/store ops will scrub away these attributes, so any reliance on them has to occur before a canonicalization pass. Similarly, the MakeCacheOps recording which argument to their accesses are the base array positions depends on the operand list being unchanged, however canonicalization may remove operands if it determines they are not used - while this is fine for the load/store op itself, any assumption like "base array indices are at positions N...N+K in the operand list" is no longer valid

    Related work items: #3575

commit 3591856bf285c90195eae7431a2c25314820669f
Author: Kern Handa <kerha@microsoft.com>
Date:   Mon Mar 14 04:31:13 2022 +0000

    Merged PR 2459: Changes the order of the LLVM_SETUP_VARIANT detection

    Changes the order of the LLVM_SETUP_VARIANT detection

commit fa1a527b549bd15431d59ca7c4946562d485a3fa
Author: Kern Handa <kerha@microsoft.com>
Date:   Sat Mar 12 00:50:39 2022 +0000

    Merged PR 2458: Fixes building with clang++ on Linux/WSL

    Fixes building with clang++ on Linux/WSL

commit a8b98da932216aa74b8356e44191eb0b247d227e
Author: Mason Remy <masonr@microsoft.com>
Date:   Sat Mar 12 00:08:40 2022 +0000

    Merged PR 2438: Support for double-buffer caching

    Support for double-buffer caching

    - Adds plumbing from python dsl for double_buffer flag to cache API
    - Implements double buffering by hoisting the initial cache fill outside of the cache trigger loop parent, then creating a prologue subnest that fills a temporary buffer with the i+1'st iteration's data and an epilogue subnest that moves that temporary buffer data into the main cache buffer. The last iteration of the trigger loop parent loop is unswitched and no cache filling is done in that loop.
    - On GPU the temporary buffer is allocated in private memory and if the cache is in shared memory each thread just holds onto their own contribution to the cache in their own private memory buffer until the epilogue fill nest
    - Barrier ops are hoisted out of conditionals to avoid potential for deadlocks. The conditionals introduced in this PR should be always-true or always-false, but this is added as a safety measure. Currently the hoisting is naive - any barrier within a conditional is erased and barriers are placed before and after the conditional block. This is not correct for all future conditional scenarios as any operations that happen within the conditional that depend on the barrier existing will be broken, however it works for how conditionals are used currently and can be improved on over time

    Related work items: #3659

commit b6db90faabf919b46b32eb822bf5620450797bab
Author: Denny Sun <dennys@microsoft.com>
Date:   Fri Mar 11 00:39:58 2022 +0000

    Merged PR 2450: Automatically add parameter dict as auxiliary data

    Automatically add parameter dict as auxiliary data

    Related work items: #3662

commit 52dadbfa73c4db94928bb17723184e7d16f93305
Author: Kern Handa <kerha@microsoft.com>
Date:   Thu Mar 10 16:49:53 2022 +0000

    Merged PR 2456: Updates CUDA source emission based on testing with nvrtc

    Updates CUDA source emission based on testing with nvrtc

commit 9c48b11b59b5a38f00c0f5ffb371ad2232b14e00
Author: Kern Handa <kerha@microsoft.com>
Date:   Wed Mar 9 21:54:55 2022 +0000

    Merged PR 2453: Sets CPU targets to default to openmp

    Sets CPU targets to default to openmp

commit 40fe9516f6c946ba72434cba286033b16bc4476b
Author: Abdul Dakkak <adakkak@microsoft.com>
Date:   Wed Mar 9 14:02:43 2022 +0000

    Merged PR 2443: Add FP16 support

    preparation for adding mfma support for CUDA which only operates on FP16

commit 6b79fdc5f060bb7dbf1d97a74ad334a248090dc6
Author: Kern Handa <kerha@microsoft.com>
Date:   Wed Mar 9 08:48:12 2022 +0000

    Merged PR 2452: Updates GPU source emitting path to emit host launcher and device function pairs

commit 4a345df664d45c2015585cf1a51449afae955617
Author: Kern Handa <kerha@microsoft.com>
Date:   Wed Mar 9 02:17:17 2022 +0000

    Merged PR 2451: Updates IR util ResolveExec[Target,Runtime] to allow for exact matches

    Updates IR util ResolveExec[Target,Runtime] to allow for exact matches

commit 710efe2cb7eb95eaac4e6400dbf847ae0440745b
Author: Kern Handa <kerha@microsoft.com>
Date:   Tue Mar 8 23:44:01 2022 +0000

    Merged PR 2447: Makes Vulkan specific behavior pred. on Runtime

    Makes Vulkan specific behavior pred. on Runtime

commit 5ae4ae88ee7a92c069f2789f25724943d6444259
Author: Kern Handa <kerha@microsoft.com>
Date:   Tue Mar 8 23:03:46 2022 +0000

    Merged PR 2446: Updates Runtime enum in Targets.py to be more comprehensive

    Updates Runtime enum in Targets.py to be more comprehensive

commit 52c7d6355cbdb448c65876c3d840b3953c410f27
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Mar 8 12:42:02 2022 +0000

    Merged PR 2449: [Cleanup] Replace "rc*_" prefixes with "acc*_" prefixes in tablegen'ed code

    For *.td, perform the following replacements for ops:

    s/rcv_/accv_/g
    s/rc_/acc_/g
    s/rcxp_/accxp_/g
    s/rcln_/accln_/g

commit d345616611e8294863ca7df7f609db899b203b9c
Author: Abdul Dakkak <adakkak@microsoft.com>
Date:   Tue Mar 8 09:03:09 2022 +0000

    Merged PR 2448: fix typo in the condition for mod in range analysis

    fix typo in the condition for mod in range analysis

commit c18aee909e83656a9650bdfc1a1a167687c0d7e2
Author: Abdul Dakkak <adakkak@microsoft.com>
Date:   Mon Mar 7 23:04:23 2022 +0000

    Merged PR 2445: Fix bind command when index is further split

commit 62d10e9214f4be7ad31e5507002957b78a1f3b76
Author: Abdul Dakkak <adakkak@microsoft.com>
Date:   Mon Mar 7 21:11:11 2022 +0000

    Merged PR 2444: add range remainder

    add range remainder

commit a77c9c0a24b6f66e7563ad8269542ee75b2cab15
Author: Mason Remy <masonr@microsoft.com>
Date:   Fri Mar 4 05:07:01 2022 +0000

    Merged PR 2441: Fix APInt usage in RangeValueOptimizePass

    Run the RangeValueOptimizePass as part of acc-to-llvm

commit 5b9e7020ad774447a4970a823b1103656d0d2e93
Merge: e6088d9 1dba1b7
Author: Mason Remy <masonr@microsoft.com>
Date:   Fri Mar 4 02:02:51 2022 +0000

    Merged PR 2442: Move ExecutionOptions to ir lib and create arrayattr <-> struct utils

    Move ExecutionOptions to ir lib and create arrayattr <-> struct utils

commit 1dba1b7e4e50d343f03dde1b1527bafdef1bed82
Author: Mason Remy <masonr@microsoft.com>
Date:   Thu Mar 3 14:59:49 2022 -0800

    simplify target passthrough layer

commit e6088d9b8ebe36792c508c8b88b72eb42414e41a
Merge: 9f9f912 7dc3591
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Thu Mar 3 22:45:41 2022 +0000

    Merged PR 2430: Remove unnecessary barrier ops

    This PR adds an optimization pass that removes redundant / unnecessary barrier ops around shared memory usage.

    The optimization pass in this PR is pretty simple and has a couple of limitations:
    - it only works on straight-line code (that is, when all the loads, stores, and barriers are at the same loop level as each other).
    - it considers all accesses to a specific array to be conflicts (that is, any write to an array followed by a read of that array will want to have a barrier in between them, even if the writes and reads are to different elements in the array)

    I should be following up with a PR that deals with barrier and memory ops at different loop levels pretty soon after this.

    Related work items: #3648

commit 8a0c0aa82bed26547757579b56fe82f5f9f54d77
Author: Mason Remy <masonr@microsoft.com>
Date:   Thu Mar 3 13:33:27 2022 -0800

    Move ExecutionOptions to ir lib and create arrayattr <-> struct utils

commit 7dc3591080644c5c906454e4605585a6e2a7c650
Author: Charles Jacobs <cjacobs@microsoft.com>
Date:   Thu Mar 3 13:31:02 2022 -0800

    PR comments
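
Note on PR 2464 in the commit message above: the message only says the suffix is built from md5 over a json serialization of the stringized array args (shape, type, role, layout), the parameter dictionary, and the target. A minimal sketch of that idea follows. The function name `stable_suffix`, the dict-based argument records, and the 16-character truncation (inferred from the suffix length in the example output) are assumptions for illustration, not Accera's actual code.

```python
import hashlib
import json


def stable_suffix(array_args, parameters, target, length=16):
    """Return a deterministic hex suffix for a function name.

    Hypothetical helper: the field names and the truncation length are
    assumptions, not taken from the Accera sources.
    """
    entries = {
        # Stringize every field so the JSON payload does not depend on object identity.
        "args": [
            [str(a["shape"]), str(a["type"]), str(a["role"]), str(a["layout"])]
            for a in array_args
        ],
        "parameters": {str(k): str(v) for k, v in parameters.items()},
        "target": str(target),
    }
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(entries, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()[:length]


# Usage with made-up argument records.
args = [{"shape": (16, 16), "type": "float32", "role": "input", "layout": "first_major"}]
suffix = stable_suffix(args, {"m": 16, "n": 16}, "HOST")
print(f"fusing_test_{suffix}")
```

A digest over a canonical serialization is used here rather than Python's built-in `hash()`, because string hashing in CPython is randomized per process unless `PYTHONHASHSEED` is fixed, so `hash()` would not give the same suffix across runs.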
microsoft · Mar 18, 2022
commit 2060948 · 1 parent 458225d

Showing 134 changed files with 8,355 additions and 2,301 deletions.
diff --git a/.azure/win-accera.yml b/.azure/win-accera.yml
@@ -253,7 +253,7 @@ steps:
     workingDirectory: "$(Build.SourcesDirectory)/"
 
 - script: |
-    call "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Auxiliary\Build\vcvars64.bat"
+    call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvars64.bat"
     set PATH=%VULKAN_SDK%\bin;%PATH%
     python -m accera.test.smoke_test
   displayName: Smoke test

diff --git a/.azure/win-pr.yml b/.azure/win-pr.yml
@@ -39,7 +39,7 @@ steps:
     continueOnError: false
     inputs:
       workingDirectory: 'build\RelWithDebInfo'
-      cmakeArgs: '..\.. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_LIT_ARGS=-vv -G"Visual Studio 16 2019" -Ax64 -DLLVM_SETUP_VARIANT=$(LLVM_SETUP_VARIANT)'
+      cmakeArgs: '..\.. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_LIT_ARGS=-vv -G"Visual Studio 17 2022" -Ax64 -DLLVM_SETUP_VARIANT=$(LLVM_SETUP_VARIANT)'
     condition: eq( variables['Agent.OS'], 'Windows_NT' )
 
   - task: CMake@1
@@ -70,7 +70,7 @@ steps:
       workingDirectory: "$(Build.SourcesDirectory)/"
 
   - script: |
-      call "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Auxiliary\Build\vcvars64.bat"
+      call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvars64.bat"
       python -m pip install -r $(Build.SourcesDirectory)/accera/onnx-emitter/test/requirements.txt
       ctest -C RelWithDebInfo -T test -VV -LE benchmark
     displayName: Run all ctest targets

diff --git a/CMake/LLVMSetup.cmake b/CMake/LLVMSetup.cmake
@@ -7,14 +7,14 @@
 ####################################################################################################
 #
 # Gets the following variables:
-# 
+#
 # LLVM_SETUP_VARIANT: An optional environment variable or CMake define
 # that specifies the LLVM build source:
 #   LLVM_SETUP_VARIANT="Default" - uses vcpkg to acquire LLVM
 #                                  Pre-requisite: `vcpkg install accera-llvm` or
 #                                  `vcpkg install accera-llvm:x64-windows`
 #
-#   LLVM_SETUP_VARIANT="Conan"   - uses Conan to acquire LLVM 
+#   LLVM_SETUP_VARIANT="Conan"   - uses Conan to acquire LLVM
 #                                  (for internal use only)
 #
 # Sets the following variables:
@@ -34,10 +34,10 @@
 # Include guard so we don't try to find or download LLVM more than once
 include_guard()
 
+set(LLVM_SETUP_VARIANT "Default" CACHE STRING "Source for LLVM binaries")
 if(DEFINED ENV{LLVM_SETUP_VARIANT})
-  set(LLVM_SETUP_VARIANT $ENV{LLVM_SETUP_VARIANT} )
+  set(LLVM_SETUP_VARIANT $ENV{LLVM_SETUP_VARIANT} CACHE STRING "" FORCE)
 endif()
-set(LLVM_SETUP_VARIANT "Default" CACHE STRING "Source for LLVM binaries")
 
 message(STATUS "Using LLVMSetup${LLVM_SETUP_VARIANT}.cmake")
 

diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -35,8 +35,10 @@ option(USE_MKL "Build with Intel MKL" OFF)
 
 option(USE_LIBCXX "Build with libc++ if using the Clang compiler" OFF)
 if(CMAKE_CXX_COMPILER_ID STREQUAL Clang)
+  if(USE_LIBCXX OR (CMAKE_HOST_SYSTEM_NAME STREQUAL Darwin))
     add_compile_options(-stdlib=libc++)
     link_libraries(-lc++ -lc++abi)
+  endif(USE_LIBCXX OR (CMAKE_HOST_SYSTEM_NAME STREQUAL Darwin))
 endif(CMAKE_CXX_COMPILER_ID STREQUAL Clang)
 
 # Try to create a compilation database, which is useful to have when working
@@ -156,11 +158,14 @@ else()
   set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO} -ggdb3")
   set(CMAKE_C_FLAGS_RELWITHDEBINFO "${CMAKE_C_FLAGS_RELWITHDEBINFO} -ggdb3")
   if(${CMAKE_CXX_COMPILER_ID} STREQUAL Clang)
+    if(CMAKE_BUILD_TYPE STREQUAL Debug)
+    # Set options for Control Flow Integrity
+      add_compile_options(-fsanitize=cfi)
+    endif(CMAKE_BUILD_TYPE STREQUAL Debug)
+
     add_compile_options(-Wno-backslash-newline-escape)
     add_compile_options(-Wno-self-assign)
     add_compile_options(-fcolor-diagnostics)
-    # Set options for Control Flow Integrity
-    add_compile_options(-fsanitize=cfi)
     # Enable Shadow Stack mitigation
     add_compile_options(-fsanitize=shadow-call-stack)
     # Exit after the first 2 errors are reported

diff --git a/accera/acc-opt/test/barrier_opt.mlir b/accera/acc-opt/test/barrier_opt.mlir
@@ -0,0 +1,54 @@
+// RUN: acc-opt --verify-each=false --optimize-barriers %s | FileCheck %s
+
+// CHECK-LABEL: module @barrier_test_1
+// CHECK: %2 = "accv.alloc"()
+// CHECK-NEXT: %3 = "accv.alloc"() : () -> memref<16xf32, 3>
+// CHECK-NEXT: %4 = affine.load %arg0[symbol(%0) + symbol(%1) * 16] : memref<1xf32>
+// CHECK-NEXT: affine.store %4, %2[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3>
+// CHECK-NEXT: %5 = affine.load %arg1[symbol(%0) + symbol(%1) * 16] : memref<1xf32>
+// CHECK-NEXT: affine.store %5, %2[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3>
+// CHECK-NEXT: "accv.barrier"() {scope = "Block"} : () -> ()
+// CHECK-NEXT: %6 = affine.load %2[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3>
+// CHECK-NEXT: %7 = affine.load %3[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3>
+// CHECK-NEXT: %8 = "accv.bin_op"(%6, %7) {predicate = 0 : i64} : (f32, f32) -> f32
+// CHECK-NEXT: affine.store %8, %arg2[symbol(%0) + symbol(%1) * 16] : memref<1xf32>
+// CHECK: accv.return
+module @barrier_test_1 attributes {llvm.data_layout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"}  {
+  accv.module "barrier_test_1"  {
+    accv.func nested @barrier_test_1_d9502818_impl_8438933964186859281(%arg0: memref<1xf32>, %arg1: memref<1xf32>, %arg2: memref<1xf32>) attributes {exec_target = 0 : i64} {
+      "accv.lambda"() ( {
+        %0 = "gpu.thread_id"() {dimension = "x"} : () -> index
+        %1 = "gpu.block_id"() {dimension = "x"} : () -> index
+        affine.for %arg3 = 0 to 1 {
+          affine.for %arg4 = 0 to 1 {
+            affine.for %arg5 = 0 to 1 {
+              affine.for %arg6 = 0 to 1 {
+                %2 = "accv.alloc"() : () -> memref<16xf32, 3>
+                %3 = "accv.alloc"() : () -> memref<16xf32, 3>
+                "accv.barrier"() {scope = "Block"} : () -> ()
+                %4 = affine.load %arg0[symbol(%0) + symbol(%1) * 16] : memref<1xf32>
+                affine.store %4, %2[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3>
+                "accv.barrier"() {scope = "Block"} : () -> ()
+                %5 = affine.load %arg1[symbol(%0) + symbol(%1) * 16] : memref<1xf32>
+                affine.store %5, %2[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3>
+                "accv.barrier"() {scope = "Block"} : () -> ()
+                %6 = affine.load %2[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3>
+                %7 = affine.load %3[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3>
+                %8 = "accv.bin_op"(%6, %7) {predicate = 0 : i64} : (f32, f32) -> f32
+                affine.store %8, %arg2[symbol(%0) + symbol(%1) * 16] : memref<1xf32>
+                "accv.barrier"() {scope = "Block"} : () -> ()
+              } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{j_i,5}">, kernels = ["_"], accv_gpu_map = "ThreadY", subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">], subdomainSize = [1, 1]}
+            } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{i_i,3}">, accv_gpu_map = "ThreadX", subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">], subdomainSize = [1, 16]}
+          } {begin = 0 : i64, end = 256 : i64, index = #accln<"index{j_o,4}">, accv_gpu_map = "BlockY", subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">], subdomainSize = [16, 16]}
+        } {begin = 0 : i64, end = 256 : i64, index = #accln<"index{i_o,2}">, accv_gpu_map = "BlockX", subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">], subdomainSize = [16, 256]}
+        accv.return
+      }) {exec_target = 1 : i64, gpu_launch = [16 : index, 16 : index, 1 : index, 16 : index, 16 : index, 1 : index], sym_name = "NestFunction_0", type = () -> ()} : () -> ()
+      accv.return
+    }
+    accv.func @barrier_test_1_d9502818(%arg0: memref<1xf32>, %arg1: memref<1xf32>, %arg2: memref<1xf32>) attributes {accv.base_name = "barrier_test_1", accv.emit_header_decl, accv.emit_raw_pointer_api, exec_target = 0 : i64} {
+      accv.launch_func @barrier_test_1_d9502818_impl_8438933964186859281(%arg0, %arg1, %arg2) {exec_target = 0 : i64, gpu_launch = "gpu_launch"} : (memref<1xf32>, memref<1xf32>, memref<1xf32>) -> ()
+      accv.return
+    }
+  }
+}
+
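
Note on `barrier_opt.mlir` above: the test feeds four block-scope barriers through `--optimize-barriers` and expects only the one between the stores to `%2` and the load from `%2` to survive. The sketch below models that outcome on a flat list of ops. It is a simplified illustration, not the pass's implementation: the `Op` record and function names are hypothetical, it handles straight-line code only, and it treats a barrier as needed when a shared array is written on one side and read on the other (read-after-write as described in PR 2430; write-after-read is an added assumption).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str             # "barrier", "load", or "store"
    array: str = ""       # array accessed (unused for barriers)
    shared: bool = False  # True when the array is in shared memory


def conflicts(before, after):
    """True if some shared array is stored on one side and loaded on the other."""
    return any(
        a.array == b.array and {a.kind, b.kind} == {"load", "store"}
        for a in before
        for b in after
    )


def remove_redundant_barriers(ops):
    """Drop barriers that do not separate conflicting shared-memory accesses."""
    kept = []
    since_last_kept = []  # shared accesses seen since the last barrier we kept
    for i, op in enumerate(ops):
        if op.kind != "barrier":
            kept.append(op)
            if op.shared:
                since_last_kept.append(op)
            continue
        # Shared accesses between this barrier and the next one (or the end).
        segment = []
        for nxt in ops[i + 1:]:
            if nxt.kind == "barrier":
                break
            if nxt.shared:
                segment.append(nxt)
        if conflicts(since_last_kept, segment):
            kept.append(op)
            since_last_kept = []
    return kept


# The op sequence of barrier_test_1, with accv.bin_op omitted.
ops = [
    Op("barrier"),
    Op("load", "arg0"), Op("store", "%2", shared=True),
    Op("barrier"),
    Op("load", "arg1"), Op("store", "%2", shared=True),
    Op("barrier"),
    Op("load", "%2", shared=True), Op("load", "%3", shared=True),
    Op("store", "arg2"),
    Op("barrier"),
]
result = remove_redundant_barriers(ops)
print([op.kind for op in result])
```

Because conflicts are judged per array rather than per element, this keeps a barrier even when the write and the read touch different elements, which mirrors the second limitation listed in the PR 2430 description.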
diff --git a/accera/acc-opt/test/thrifty_caching.mlir b/accera/acc-opt/test/thrifty_caching.mlir
@@ -0,0 +1,96 @@
+// RUN: acc-opt --verify-each=false --pass-pipeline="accv.module(accv.func(loopnest-to-value-func))" %s | FileCheck %s
+
+// This function has two caches initially, both marked thrifty, and one of them should
+// get elided based on thrifty checks but the other should not
+
+// This is the graph at the LoopNestToValueFuncPass_Subpasses_0_10_Canonicalize.mlir stage,
+// which is the last canonicalize stage before the thrifty checks and the subpasses 
+// before the thrifty phase create ops that the thrifty check depends on not being
+// canonicalized before it runs
+module @test_thrifty_caching_simple_input_cache attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"}  {
+  accv.module "test_thrifty_caching_simple_input_cache"  {
+    accv.func nested @test_thrifty_caching_simple_input_cache_1127a105_impl_6891397719071098712(%arg0: memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, %arg1: memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, %arg2: memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>) attributes {exec_target = 0 : i64} {
+      %0 = accln.sym_index {name = "i_i"} #accln<"index{i_i,4}">
+      %1 = accln.sym_index {name = "i_o"} #accln<"index{i_o,3}">
+      %2 = accln.sym_index {name = "k_o"} #accln<"index{k_o,7}">
+      %3 = accln.sym_index {name = "j_i"} #accln<"index{j_i,6}">
+      %4 = accln.sym_index {name = "k_i"} #accln<"index{k_i,8}">
+      %5 = accln.sym_index {name = "j_o"} #accln<"index{j_o,5}">
+      "accv.lambda"() ( {
+        %6 = "accxp.make_cache"() {memorySpace = 0 : i64, multiCacheAccessIndices = [], offsetAccessIndices = [], offsetArrayToCacheAccessMap = affine_map<(d0) -> (d0)>} : () -> memref<?xf32, 3>
+        %7 = "accxp.begin_cache_region"(%arg0, %6, %arg0, %1, %2, %0, %4, %1, %2) {activeBlockCache, cacheAccessMaps = {manualCacheDimOrder = [0, 1]}, cacheHierarchyLevel = 0 : i64, cacheIndex = #accln<"index{i_i,4}">, cacheRegionBaseIndices = [[#accln<"index{i,0}">], [#accln<"index{k,2}">]], cacheRegionRelevantIndexRanges = [#accln<"indexrange{i_i,4}={0:4:1}">, #accln<"indexrange{k_i,8}={0:32:1}">], dimReorderCache, id = 0 : i64, operand_segment_sizes = dense<[1, 1, 1, 4, 2]> : vector<5xi32>, thrifty, triggerIndex = #accln<"index{i_i,4}">} : (memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, memref<?xf32, 3>, memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, index, index, index, index, index, index) -> index
+        "accxp.end_cache_region"(%7) : (index) -> ()
+        %8 = "accxp.make_cache"() {memorySpace = 0 : i64, multiCacheAccessIndices = [], offsetAccessIndices = [], offsetArrayToCacheAccessMap = affine_map<(d0) -> (d0)>} : () -> memref<?xf32, 3>
+        %9 = "accxp.begin_cache_region"(%arg1, %8, %arg1, %5, %2, %3, %4, %5) {activeBlockCache, cacheAccessMaps = {manualCacheDimOrder = [0, 1]}, cacheHierarchyLevel = 0 : i64, cacheIndex = #accln<"index{k_o,7}">, cacheRegionBaseIndices = [[#accln<"index{k,2}">], [#accln<"index{j,1}">], [#accln<"index{k,2}">]], cacheRegionRelevantIndexRanges = [#accln<"indexrange{k_o,7}={0:32:32}">, #accln<"indexrange{j_i,6}={0:16:1}">, #accln<"indexrange{k_i,8}={0:32:1}">], dimReorderCache, id = 1 : i64, operand_segment_sizes = dense<[1, 1, 1, 4, 1]> : vector<5xi32>, thrifty, triggerIndex = #accln<"index{k_o,7}">} : (memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, memref<?xf32, 3>, memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, index, index, index, index, index) -> index
+        "accxp.end_cache_region"(%9) : (index) -> ()
+        affine.for %arg3 = 0 to 32 step 4 {
+          affine.for %arg4 = 0 to 32 step 16 {
+            %10 = "accxp.begin_cache_region"(%arg1, %8, %arg1, %arg4, %2, %3, %4, %arg4) {activeBlockCache, cacheAccessMaps = {manualCacheDimOrder = [0, 1]}, cacheHierarchyLevel = 0 : i64, cacheIndex = #accln<"index{k_o,7}">, cacheRegionBaseIndices = [[#accln<"index{k,2}">], [#accln<"index{j,1}">], [#accln<"index{k,2}">]], cacheRegionRelevantIndexRanges = [#accln<"indexrange{k_o,7}={0:32:32}">, #accln<"indexrange{j_i,6}={0:16:1}">, #accln<"indexrange{k_i,8}={0:32:1}">], dimReorderCache, id = 1 : i64, operand_segment_sizes = dense<[1, 1, 1, 4, 1]> : vector<5xi32>, thrifty, triggerIndex = #accln<"index{k_o,7}">} : (memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, memref<?xf32, 3>, memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, index, index, index, index, index) -> index
+            affine.for %arg5 = 0 to 32 step 32 {
+              %11 = "accxp.begin_cache_region"(%arg0, %6, %arg0, %arg3, %arg5, %0, %4, %arg3, %arg5) {activeBlockCache, cacheAccessMaps = {manualCacheDimOrder = [0, 1]}, cacheHierarchyLevel = 0 : i64, cacheIndex = #accln<"index{i_i,4}">, cacheRegionBaseIndices = [[#accln<"index{i,0}">], [#accln<"index{k,2}">]], cacheRegionRelevantIndexRanges = [#accln<"indexrange{i_i,4}={0:4:1}">, #accln<"indexrange{k_i,8}={0:32:1}">], dimReorderCache, id = 0 : i64, operand_segment_sizes = dense<[1, 1, 1, 4, 2]> : vector<5xi32>, thrifty, triggerIndex = #accln<"index{i_i,4}">} : (memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, memref<?xf32, 3>, memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, index, index, index, index, index, index) -> index
+              affine.for %arg6 = 0 to 4 {
+                affine.for %arg7 = 0 to 16 {
+                  affine.for %arg8 = 0 to 32 {
+                    %12 = affine.load %arg0[%arg3 + %arg6, %arg5 + %arg8] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>
+                    %13 = affine.load %arg1[%arg5 + %arg8, %arg4 + %arg7] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>
+                    %14 = "accv.bin_op"(%12, %13) {predicate = 2 : i64} : (f32, f32) -> f32
+                    %15 = affine.load %arg2[%arg3 + %arg6, %arg4 + %arg7] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>
+                    %16 = "accv.bin_op"(%15, %14) {predicate = 0 : i64} : (f32, f32) -> f32
+                    affine.store %16, %arg2[%arg3 + %arg6, %arg4 + %arg7] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>
+                    %17 = affine.load %arg2[%arg3 + %arg6, %arg4 + %arg7] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>
+                    affine.store %17, %arg2[%arg3 + %arg6, %arg4 + %arg7] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>
+                  } {begin = 0 : i64, end = 32 : i64, index = #accln<"index{k_i,8}">, kernels = ["_"], subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 1, 1]}
+                } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{j_i,6}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 1, 32]}
+              } {begin = 0 : i64, end = 4 : i64, index = #accln<"index{i_i,4}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 16, 32]}
+              "accxp.end_cache_region"(%11) : (index) -> ()
+            } {begin = 0 : i64, end = 32 : i64, index = #accln<"index{k_o,7}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [4, 16, 32]}
+            "accxp.end_cache_region"(%10) : (index) -> ()
+          } {begin = 0 : i64, end = 32 : i64, index = #accln<"index{j_o,5}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [4, 16, 32]}
+        } {begin = 0 : i64, end = 32 : i64, index = #accln<"index{i_o,3}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [4, 32, 32]}
+        accv.return
+      }) {exec_target = 0 : i64, sym_name = "NestFunction_0", type = () -> ()} : () -> ()
+      accv.return
+    }
+  }
+}
+
+// CHECK: #map = affine_map<(d0, d1) -> (d0 * 32 + d1)>
+// CHECK: module @test_thrifty_caching_simple_input_cache attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"}  {
+// CHECK:   accv.module "test_thrifty_caching_simple_input_cache"  {
+// CHECK:     "accv.global"() {sym_name = "cache_3", type = memref<32x16xf32, 3>} : () -> ()
+// CHECK:     accv.func nested @test_thrifty_caching_simple_input_cache_1127a105_impl_6891397719071098712(%arg0: memref<32x32xf32, #map>, %arg1: memref<32x32xf32, #map>, %arg2: memref<32x32xf32, #map>) attributes {exec_target = 0 : i64} {
+// CHECK:       "accv.lambda"() ( {
+// CHECK:         %0 = "accv.ref_global"() {global_name = @cache_3} : () -> memref<32x16xf32, 3>
+// CHECK:         affine.for %arg3 = 0 to 32 step 4 {
+// CHECK:           affine.for %arg4 = 0 to 32 step 16 {
+// CHECK:             "accv.lambda"() ( {
+// CHECK:               affine.for %arg5 = 0 to 32 {
+// CHECK:                 affine.for %arg6 = 0 to 16 {
+// CHECK:                   %1 = affine.load %arg1[%arg5, %arg4 + %arg6] : memref<32x32xf32, #map>
+// CHECK:                   affine.store %1, %0[%arg5, %arg6] : memref<32x16xf32, 3>
+// CHECK:                 } {accxp.access_bounds_check, begin = 0 : i64, end = 16 : i64, index = #accln<"index{j,5}">, kernels = ["cache_internal_loopnest_kernel_active_block_copy"], scheduledIndex = #accln<"index{j,5}">, subdomainIndexOrder = [#accln<"index{i,4}">, #accln<"index{j,5}">], subdomainSize = [1, 1]}
+// CHECK:               } {accxp.access_bounds_check, begin = 0 : i64, end = 32 : i64, index = #accln<"index{i,4}">, scheduledIndex = #accln<"index{i,4}">, subdomainIndexOrder = [#accln<"index{i,4}">, #accln<"index{j,5}">], subdomainSize = [1, 16]}
+// CHECK:               accv.return
+// CHECK:             }) {exec_target = 0 : i64, sym_name = "NestFunction_2", type = () -> ()} : () -> ()
+// CHECK:             affine.for %arg5 = 0 to 4 {
+// CHECK:               affine.for %arg6 = 0 to 16 {
+// CHECK:                 affine.for %arg7 = 0 to 32 {
+// CHECK:                   %1 = affine.load %arg0[%arg3 + %arg5, %arg7] : memref<32x32xf32, #map>
+// CHECK:                   %2 = affine.load %0[%arg7, %arg6] : memref<32x16xf32, 3>
+// CHECK:                   %3 = "accv.bin_op"(%1, %2) {predicate = 2 : i64} : (f32, f32) -> f32
+// CHECK:                   %4 = affine.load %arg2[%arg3 + %arg5, %arg4 + %arg6] : memref<32x32xf32, #map>
+// CHECK:                   %5 = "accv.bin_op"(%4, %3) {predicate = 0 : i64} : (f32, f32) -> f32
+// CHECK:                   affine.store %5, %arg2[%arg3 + %arg5, %arg4 + %arg6] : memref<32x32xf32, #map>
+// CHECK:                   %6 = affine.load %arg2[%arg3 + %arg5, %arg4 + %arg6] : memref<32x32xf32, #map>
+// CHECK:                   affine.store %6, %arg2[%arg3 + %arg5, %arg4 + %arg6] : memref<32x32xf32, #map>
+// CHECK:                 } {begin = 0 : i64, end = 32 : i64, index = #accln<"index{k_i,8}">, kernels = ["_"], subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 1, 1]}
+// CHECK:               } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{j_i,6}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 1, 32]}
+// CHECK:             } {begin = 0 : i64, end = 4 : i64, index = #accln<"index{i_i,4}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 16, 32]}
+// CHECK:           } {begin = 0 : i64, end = 32 : i64, index = #accln<"index{j_o,5}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [4, 16, 32]}
+// CHECK:         } {begin = 0 : i64, end = 32 : i64, index = #accln<"index{i_o,3}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [4, 32, 32]}
+// CHECK:         accv.return
+// CHECK:       }) {exec_target = 0 : i64, sym_name = "NestFunction_0", type = () -> ()} : () -> ()
+// CHECK:       accv.return
+// CHECK:     }
+// CHECK:   }
+// CHECK: }