From 2060948bbd136a816f48b2612dc6e7a0973e96cc Mon Sep 17 00:00:00 2001 From: Lisa Ong Date: Fri, 18 Mar 2022 08:53:38 +0800 Subject: [PATCH] Squashed commit of the following: commit 37efd7c8223542c3d953f6127308542013c159b8 Author: Lisa Ong Date: Fri Mar 18 00:34:18 2022 +0000 Merged PR 2439: Downstream doc changes from github/main Squashed commit of the following: commit 8a6e5535efe7cdf11e614b11abc5bde14ee76d5b Author: Arslan-e-Mustafa <70168134+Arslan-e-Mustafa@users.noreply.github.com> Date: Sat Feb 26 16:50:57 2022 +0500 complete refactoring of introduction.md file in manual docs (#15) * Feedback addressed * Addressed the pending comments commit 329d69516f31ec47ae282c2ae01221eb18fb18b8 Author: Arslan-e-Mustafa <70168134+Arslan-e-Mustafa@users.noreply.github.com> Date: Fri Feb 25 21:37:19 2022 +0500 Complete refactoring of file array.md and simple affine loop nests.md file in manual docs (#16) * complete refactoring of introduction.md file * completed array.md and simple affine loop nests.md files * Took care of extra semicolon commit 04af790b7d42834eb17affa70a8dd6bce035c2fb Author: Arslan-e-Mustafa <70168134+Arslan-e-Mustafa@users.noreply.github.com> Date: Tue Feb 22 05:42:22 2022 +0500 README.md refactoring (#13) * initial commit * worked on README.md until goals of accera section. Took the liberty of changing some headings, restructuring the paragraphs, and adding one more goal * Feedback addressed regarding README.md file * Took care of the last comment and completed the whole file from my side Co-authored-by: Lisa Ong <11318241+lisaong@users.noreply.github.com> commit 356872bf787b3b076ac45aa86d2275ffcd15364e Author: Abdul Dakkak Date: Thu Mar 17 12:35:33 2022 +0000 Merged PR 2440: Enable tensorization for Rocm target commit 5557ff59f398ddad818e9c5b93cd00408bd7637c Author: Kern Handa Date: Wed Mar 16 22:03:29 2022 +0000 Merged PR 2470: Adds support for the execution of GPU (CUDA only) functions via hat commit fb803a9fbaf0bfa7f809f5bdd8366629febb9bd0 Author: Denny Sun Date: Wed Mar 16 20:18:23 2022 +0000 Merged PR 2467: Adding multiple functions in package.add() can't work with stateful auxiliary metadata and index_map These bugs are all about sharing Python objects, such as auxiliary metadata and a schedule's indices, among different functions. When we call package.add() to add multiple parameterized functions, we add the functions one by one and then emit them one by one; at each step the state of the shared Python object is changed, which results in only the first function added being correctly emitted. To make _add_function work, we need to make these shared Python objects stateless.
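To make the failure mode concrete, here is a minimal sketch in plain Python (hypothetical names; not the actual Accera Package/Schedule API) of how one shared mutable object lets only the first added function be emitted correctly:

```python
# Minimal sketch of the shared-state bug (hypothetical names, not the Accera API).
import copy

class Schedule:
    def __init__(self):
        self.indices = ["i", "j", "k"]   # shared, mutable state

def emit_function(schedule):
    # Emission transforms the schedule's indices in place, so the shared
    # object no longer matches what the next function expects to see.
    emitted = list(schedule.indices)
    schedule.indices = [f"{idx}_split" for idx in schedule.indices]
    return emitted

shared = Schedule()
added = [shared, shared, shared]          # three added functions aliasing one schedule
print([emit_function(s) for s in added])  # only the first result is correct

template = Schedule()
fixed = [copy.deepcopy(template) for _ in range(3)]  # per-function snapshot of the shared state
print([emit_function(s) for s in fixed])             # all three are correct
```

The fix described in the commit corresponds to the second form: each added function carries its own copy of the metadata and indices instead of mutating one shared object.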
Related work items: #3662 commit e149bac1147d160b05aa55ad8ef4416423c20925 Author: Mason Remy Date: Wed Mar 16 06:31:10 2022 +0000 Merged PR 2469: Convert 'Local' memory space to 'Private' Convert 'Local' memory space to 'Private' commit 65363d35f7a31dfc682366ba70caaf301806a44b Author: Mason Remy Date: Wed Mar 16 02:41:31 2022 +0000 Merged PR 2463: Enable specifying double buffer memory space Enable specifying double buffer memory space commit f80b46af2b12689ff617ba3a491fee6ae9aad010 Author: Kern Handa Date: Wed Mar 16 01:57:46 2022 +0000 Merged PR 2468: Move to VS2022 for builds Move to VS2022 for builds commit 0870cb27ccbe52fa8182b960140f5b6d562ab929 Author: Abdul Dakkak Date: Tue Mar 15 14:01:15 2022 +0000 Merged PR 2465: extend gpu target spec extend gpu target spec commit 07088ecd0700fee16efdab677a581aa47a6a8690 Author: Lisa Ong Date: Tue Mar 15 09:30:22 2022 +0000 Merged PR 2464: Compute a stable hash for function name suffixes Create a stable hash using md5 and json serialization of these stringized entries: - Array args: shape, type, role, layout - parameter dictionary - Target Example output: ``` test_unequal_iteration_space_fusing_1 (__main__.DSLTest_04Fusing) ... DEBUG:root:Adding wrapped function DEBUG:root:Adding wrapped function Building function fusing_test_32d12fb1a01061ec DEBUG:root:Detected logic function _ uses indices i,j DEBUG:root:Detected logic function _ uses indices i,j Building function _debug_check_allclose_16_16_4cfd65a8b606655b ``` commit 63e82be5e7b92f750fdf6c19347609c119cc5642 Author: Lisa Ong Date: Tue Mar 15 00:25:13 2022 +0000 Merged PR 2460: [nfc] Fix build.sh setting for vcpkg debug builds commit d5ca516084dd68966e8c14b6d64d4402f572349a Author: Mason Remy Date: Mon Mar 14 19:53:46 2022 +0000 Merged PR 2461: Replace MemoryType with MemorySpace for consistency Replace MemoryType with MemorySpace for consistency commit fdb503611bd235ca59c7769bd0d752519ce42bf5 Author: Mason Remy Date: Mon Mar 14 18:42:45 2022 +0000 Merged PR 2416: Implement initial thrifty caching support Implement initial thrifty caching support - This is a simple brute-force approach where each thrifty cache is examined element-by-element alongside the array it is caching to check whether there is a stride of 1 between every access - Currently this thrifty analysis and the potential erasing of thrifty caches happens after the cache ops have been created. This is due to needing the cache mapping to have already run in order to support hierarchical caching scenarios. Eventually this should be refactored and the thrifty analysis should be used to prevent creating the cache ops, but that is a larger refactor than the scope for this task. - When creating affine loads and stores into caches, this change also tacks on some attributes onto the load/store ops to indicate how the original load or store accessed the base array. Since the base array -> cache position mapping is not always invertible (consider coefficient cache layout cases), this is one of the only ways to encode this information. Unfortunately, canonicalization on affine load/store ops will scrub away these attributes, so any reliance on them has to occur before a canonicalization pass. 
Similarly, the way MakeCacheOps record which arguments of their accesses are the base array positions depends on the operand list being unchanged; however, canonicalization may remove operands if it determines they are not used - while this is fine for the load/store op itself, any assumption like "base array indices are at positions N...N+K in the operand list" is no longer valid Related work items: #3575 commit 3591856bf285c90195eae7431a2c25314820669f Author: Kern Handa Date: Mon Mar 14 04:31:13 2022 +0000 Merged PR 2459: Changes the order of the LLVM_SETUP_VARIANT detection Changes the order of the LLVM_SETUP_VARIANT detection commit fa1a527b549bd15431d59ca7c4946562d485a3fa Author: Kern Handa Date: Sat Mar 12 00:50:39 2022 +0000 Merged PR 2458: Fixes building with clang++ on Linux/WSL Fixes building with clang++ on Linux/WSL commit a8b98da932216aa74b8356e44191eb0b247d227e Author: Mason Remy Date: Sat Mar 12 00:08:40 2022 +0000 Merged PR 2438: Support for double-buffer caching Support for double-buffer caching - Adds plumbing from python dsl for double_buffer flag to cache API - Implements double buffering by hoisting the initial cache fill outside of the cache trigger loop parent, then creating a prologue subnest that fills a temporary buffer with the i+1'st iteration's data and an epilogue subnest that moves that temporary buffer data into the main cache buffer. The last iteration of the trigger loop's parent loop is unswitched and no cache filling is done in that loop. - On GPU, the temporary buffer is allocated in private memory, and if the cache is in shared memory, each thread holds onto its own contribution to the cache in its own private memory buffer until the epilogue fill nest - Barrier ops are hoisted out of conditionals to avoid the potential for deadlocks. The conditionals introduced in this PR should be always-true or always-false, but this is added as a safety measure. Currently the hoisting is naive - any barrier within a conditional is erased and barriers are placed before and after the conditional block.
This is not correct for all future conditional scenarios as any operations that happen within the conditional that depend on the barrier existing will be broken, however it works for how conditionals are used currently and can be improved on over time Related work items: #3659 commit b6db90faabf919b46b32eb822bf5620450797bab Author: Denny Sun Date: Fri Mar 11 00:39:58 2022 +0000 Merged PR 2450: Automatically add parameter dict as auxiliary data Automatically add parameter dict as auxiliary data Related work items: #3662 commit 52dadbfa73c4db94928bb17723184e7d16f93305 Author: Kern Handa Date: Thu Mar 10 16:49:53 2022 +0000 Merged PR 2456: Updates CUDA source emission based on testing with nvrtc Updates CUDA source emission based on testing with nvrtc commit 9c48b11b59b5a38f00c0f5ffb371ad2232b14e00 Author: Kern Handa Date: Wed Mar 9 21:54:55 2022 +0000 Merged PR 2453: Sets CPU targets to default to openmp Sets CPU targets to default to openmp commit 40fe9516f6c946ba72434cba286033b16bc4476b Author: Abdul Dakkak Date: Wed Mar 9 14:02:43 2022 +0000 Merged PR 2443: Add FP16 support preparation for adding mfma support for CUDA which only operates on FP16 commit 6b79fdc5f060bb7dbf1d97a74ad334a248090dc6 Author: Kern Handa Date: Wed Mar 9 08:48:12 2022 +0000 Merged PR 2452: Updates GPU source emitting path to emit host launcher and device function pairs commit 4a345df664d45c2015585cf1a51449afae955617 Author: Kern Handa Date: Wed Mar 9 02:17:17 2022 +0000 Merged PR 2451: Updates IR util ResolveExec[Target,Runtime] to allow for exact matches Updates IR util ResolveExec[Target,Runtime] to allow for exact matches commit 710efe2cb7eb95eaac4e6400dbf847ae0440745b Author: Kern Handa Date: Tue Mar 8 23:44:01 2022 +0000 Merged PR 2447: Makes Vulkan specific behavior pred. on Runtime Makes Vulkan specific behavior pred. 
on Runtime commit 5ae4ae88ee7a92c069f2789f25724943d6444259 Author: Kern Handa Date: Tue Mar 8 23:03:46 2022 +0000 Merged PR 2446: Updates Runtime enum in Targets.py to be more comprehensive Updates Runtime enum in Targets.py to be more comprehensive commit 52c7d6355cbdb448c65876c3d840b3953c410f27 Author: Lisa Ong Date: Tue Mar 8 12:42:02 2022 +0000 Merged PR 2449: [Cleanup] Replace "rc*_" prefixes with "acc*_" prefixes in tablegen'ed code For *.td, perform the following replacements for ops: s/rcv_/accv_/g s/rc_/acc_/g s/rcxp_/accxp_/g s/rcln_/accln_/g commit d345616611e8294863ca7df7f609db899b203b9c Author: Abdul Dakkak Date: Tue Mar 8 09:03:09 2022 +0000 Merged PR 2448: fix typo in the condition for mod in range analysis fix typo in the condition for mod in range analysis commit c18aee909e83656a9650bdfc1a1a167687c0d7e2 Author: Abdul Dakkak Date: Mon Mar 7 23:04:23 2022 +0000 Merged PR 2445: Fix bind command when index is further split commit 62d10e9214f4be7ad31e5507002957b78a1f3b76 Author: Abdul Dakkak Date: Mon Mar 7 21:11:11 2022 +0000 Merged PR 2444: add range remainder add range remainder commit a77c9c0a24b6f66e7563ad8269542ee75b2cab15 Author: Mason Remy Date: Fri Mar 4 05:07:01 2022 +0000 Merged PR 2441: Fix APInt usage in RangeValueOptimizePass Run the RangeValueOptimizePass as part of acc-to-llvm commit 5b9e7020ad774447a4970a823b1103656d0d2e93 Merge: e6088d9 1dba1b7 Author: Mason Remy Date: Fri Mar 4 02:02:51 2022 +0000 Merged PR 2442: Move ExecutionOptions to ir lib and create arrayattr <-> struct utils Move ExecutionOptions to ir lib and create arrayattr <-> struct utils commit 1dba1b7e4e50d343f03dde1b1527bafdef1bed82 Author: Mason Remy Date: Thu Mar 3 14:59:49 2022 -0800 simplify target passthrough layer commit e6088d9b8ebe36792c508c8b88b72eb42414e41a Merge: 9f9f912 7dc3591 Author: Chuck Jacobs Date: Thu Mar 3 22:45:41 2022 +0000 Merged PR 2430: Remove unnecessary barrier ops This PR adds an optimization pass that removes redundant / unnecessary barrier ops around shared memory usage. The optimization pass in this PR is pretty simple and has a couple of limitations: - it only works on straight-line code (that is, when all the loads, stores, and barriers are at the same loop level as each other). - it considers all accesses to a specific array to be conflicts (that is, any write to an array followed by a read of that array will want to have a barrier in between them, even if the writes and reads are to different elements in the array) I should be following up with a PR that deals with barrier and memory ops at different loops levels pretty soon after this. 
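As a rough illustration of that straight-line analysis, the following toy model in plain Python (not the actual MLIR BarrierOptPass; op and array names are made up) keeps a barrier only when it separates a write to a shared array from a later read of that array:

```python
# Toy model of the straight-line barrier analysis described above
# (not the actual MLIR pass; ops are modelled as simple tuples).
def optimize_barriers(ops):
    """ops: a list of ("barrier",), ("load", array) or ("store", array) tuples."""
    kept = []
    written_before = set()            # arrays stored to since the last kept barrier
    for i, op in enumerate(ops):
        if op[0] != "barrier":
            if op[0] == "store":
                written_before.add(op[1])
            kept.append(op)
            continue
        # Arrays read between this barrier and the next one (or the end of the block).
        read_after = set()
        for later in ops[i + 1:]:
            if later[0] == "barrier":
                break
            if later[0] == "load":
                read_after.add(later[1])
        if written_before & read_after:
            kept.append(op)           # separates a write from a later read: keep it
            written_before.clear()
        # otherwise the barrier is redundant and is dropped
    return kept

ops = [
    ("barrier",),                                  # nothing written yet: redundant
    ("load", "A"), ("store", "shared_cache"),
    ("barrier",),                                  # only more writes follow: redundant
    ("load", "B"), ("store", "shared_cache"),
    ("barrier",),                                  # shared_cache is read next: kept
    ("load", "shared_cache"), ("store", "out"),
    ("barrier",),                                  # nothing read afterwards: redundant
]
print(optimize_barriers(ops))                      # only the third barrier survives
```

This roughly mirrors the behavior checked by the new barrier_opt.mlir test further down, where only the barrier between the shared-memory cache fills and the subsequent loads survives.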
Related work items: #3648 commit 8a0c0aa82bed26547757579b56fe82f5f9f54d77 Author: Mason Remy Date: Thu Mar 3 13:33:27 2022 -0800 Move ExecutionOptions to ir lib and create arrayattr <-> struct utils commit 7dc3591080644c5c906454e4605585a6e2a7c650 Author: Charles Jacobs Date: Thu Mar 3 13:31:02 2022 -0800 PR comments --- .azure/win-accera.yml | 2 +- .azure/win-pr.yml | 4 +- CMake/LLVMSetup.cmake | 8 +- CMakeLists.txt | 9 +- accera/acc-opt/test/barrier_opt.mlir | 54 + accera/acc-opt/test/thrifty_caching.mlir | 96 + accera/acc-opt/test/value_mlir_test.cpp | 4 +- .../Target/Cpp/AcceraDialectCppPrinter.cpp | 68 +- .../src/Target/Cpp/AcceraDialectCppPrinter.h | 4 +- .../Target/Cpp/AffineDialectCppPrinter.cpp | 49 +- .../src/Target/Cpp/AffineDialectCppPrinter.h | 7 +- .../src/Target/Cpp/CppPrinter.cpp | 62 +- .../acc-translate/src/Target/Cpp/CppPrinter.h | 29 +- .../src/Target/Cpp/GpuDialectCppPrinter.cpp | 59 +- .../src/Target/Cpp/StdDialectCppPrinter.cpp | 16 +- .../src/Target/Cpp/StdDialectCppPrinter.h | 5 +- .../src/Target/Cpp/TranslateToCpp.cpp | 21 + .../Target/Cpp/VectorDialectCppPrinter.cpp | 60 + .../src/Target/Cpp/VectorDialectCppPrinter.h | 3 + accera/accc/CMakeLists.txt | 2 +- accera/accc/accc.py | 4 +- accera/ir/CMakeLists.txt | 1 + accera/ir/include/Common.td | 82 +- accera/ir/include/IRUtil.h | 55 +- accera/ir/include/accera/AcceraOps.h | 8 +- accera/ir/include/exec/ExecutionOptions.h | 82 + .../ir/include/exec/ExecutionPlanAttributes.h | 8 +- accera/ir/include/exec/ExecutionPlanAttrs.td | 6 +- .../include/exec/ExecutionPlanInterfaces.td | 4 +- accera/ir/include/exec/ExecutionPlanOps.h | 2 + accera/ir/include/exec/ExecutionPlanOps.td | 141 +- accera/ir/include/exec/TensorizationInfo.h | 6 +- accera/ir/include/nest/Index.h | 3 +- accera/ir/include/nest/LoopNestAttrs.td | 20 +- .../nest/LoopNestExportedInterfaces.td | 2 +- accera/ir/include/nest/LoopNestInterfaces.td | 6 +- accera/ir/include/nest/LoopNestOps.td | 86 +- accera/ir/include/value/ValueAttrs.td | 33 +- accera/ir/include/value/ValueBase.td | 8 +- accera/ir/include/value/ValueDialect.h | 4 + accera/ir/include/value/ValueMFMAOp.h | 24 +- accera/ir/include/value/ValueOps.td | 360 ++- accera/ir/src/IRUtil.cpp | 210 +- accera/ir/src/TranslateToHeader.cpp | 10 + .../ir/src/exec/ExecutionPlanAttributes.cpp | 6 +- accera/ir/src/exec/ExecutionPlanOps.cpp | 102 +- accera/ir/src/nest/Index.cpp | 2 +- accera/ir/src/nest/LoopNestBuilder.cpp | 8 +- accera/ir/src/nest/TransformedDomain.cpp | 2 +- accera/ir/src/value/ValueDialect.cpp | 54 +- accera/ir/test/ir_tests/ir_tests.cpp | 412 ++- accera/python/accera/Constants.py | 2 + accera/python/accera/Package.py | 98 +- accera/python/accera/Parameter.py | 9 +- accera/python/accera/Targets.py | 1161 ++++----- accera/python/accera/lang/Array.py | 11 + accera/python/accera/lang/Cache.py | 17 +- accera/python/accera/lang/Nest.py | 10 + accera/python/accera/lang/Plan.py | 133 +- accera/python/accera/lang/Schedule.py | 36 +- accera/python/accera/test/dsl_tests.py | 93 +- accera/python/accera/test/smoke_test.py | 1945 ++++++++++++++- accera/python/accera/test/unit_tests.py | 10 +- accera/python/gpu/src/__init__.py | 6 +- accera/python/lib/src/ContainerTypes.cpp | 38 +- accera/python/lib/src/ExecutionPlanTypes.cpp | 71 +- accera/python/lib/src/Operations.cpp | 7 +- accera/python/lib/src/SchedulingTypes.cpp | 2 +- accera/python/llvm/src/__init__.py | 6 +- accera/transforms/CMakeLists.txt | 8 +- accera/transforms/include/AcceraPasses.h | 14 +- accera/transforms/include/AcceraPasses.td | 41 +- 
.../exec/ExecutionPlanToAffineLoweringPass.h | 3 + .../transforms/include/gpu/AcceraToGPUPass.h | 2 + .../include/gpu/AcceraToSPIRVPass.h | 31 - .../transforms/include/value/BarrierOptPass.h | 19 + accera/transforms/src/AcceraPasses.cpp | 17 +- .../ExecutionPlanToAffineLoweringPass.cpp | 2218 ++++++++++++++--- accera/transforms/src/gpu/AcceraToGPUPass.cpp | 788 ++++-- .../transforms/src/gpu/AcceraToSPIRVPass.cpp | 196 -- accera/transforms/src/nest/LoopNestPasses.cpp | 2 +- .../transforms/src/nest/LoopNestToValue.cpp | 4 +- accera/transforms/src/nest/LoopNestToValue.td | 2 +- .../src/nest/LoopNestToValueFunc.cpp | 40 +- .../transforms/src/util/VectorizationUtil.cpp | 52 +- .../transforms/src/value/BarrierOptPass.cpp | 235 ++ .../src/value/RangeValueOptimizePass.cpp | 106 +- .../transforms/src/value/ValueConversion.td | 6 +- .../src/value/ValueFuncToTargetPass.cpp | 4 +- .../src/value/ValueToStandardLoweringPass.cpp | 23 +- accera/utilities/include/MemoryLayout.h | 5 +- accera/value/include/Cache.h | 6 + accera/value/include/CompilerOptions.h | 2 +- accera/value/include/EmitterContext.h | 38 +- accera/value/include/ExecutionOptions.h | 44 +- accera/value/include/FunctionDeclaration.h | 2 +- accera/value/include/MLIREmitterContext.h | 7 +- accera/value/include/Plan.h | 53 +- accera/value/include/Scalar.h | 2 + accera/value/include/Schedule.h | 2 +- accera/value/include/Value.h | 5 + accera/value/include/ValueType.h | 10 + accera/value/src/ArrayOperations.cpp | 4 +- accera/value/src/Cache.cpp | 107 +- accera/value/src/CompilerOptions.cpp | 10 +- accera/value/src/EmitterContext.cpp | 38 +- accera/value/src/MLIREmitterContext.cpp | 125 +- accera/value/src/Plan.cpp | 101 +- accera/value/src/Scalar.cpp | 5 +- accera/value/src/ScalarOperations.cpp | 2 + accera/value/src/Value.cpp | 4 +- accera/value/test/src/TestUtil.cpp | 5 +- build.sh | 2 +- docs/Manual/00 Introduction.md | 4 +- docs/Manual/02 Simple Affine Loop Nests.md | 68 +- docs/Manual/04 Fusing.md | 2 +- docs/Manual/06 Plans - Caching.md | 55 +- ...ans - Vectorization and Parallelization.md | 21 + .../classes/Array/deferred_layout.md | 4 +- docs/Reference/classes/Array/sub_array.md | 51 + docs/Reference/classes/Plan/cache.md | 19 +- docs/Reference/classes/Plan/tensorize.md | 31 + docs/Reference/enumerations/ScalarType.md | 1 + .../mlir/2_LoopNestToValueFunc.mlir | 2 +- .../optimized_matmul/mlir/0_Initial.mlir | 2 +- .../mlir/1_Canonicalizer.mlir | 2 +- .../mlir/2_LoopNestToValueFunc.mlir | 20 +- .../mlir/3_ValueFuncToTarget.mlir | 4 +- .../optimized_matmul/mlir/4_SymbolDCE.mlir | 2 +- .../mlir/5_LinalgLowerToAffineLoops.mlir | 2 +- .../mlir/6_SimplifyAffineStructures.mlir | 2 +- .../mlir/7_Canonicalizer.mlir | 2 +- requirements.txt | 3 +- setup.cfg | 1 + 134 files changed, 8355 insertions(+), 2301 deletions(-) create mode 100644 accera/acc-opt/test/barrier_opt.mlir create mode 100644 accera/acc-opt/test/thrifty_caching.mlir create mode 100644 accera/ir/include/exec/ExecutionOptions.h delete mode 100644 accera/transforms/include/gpu/AcceraToSPIRVPass.h create mode 100644 accera/transforms/include/value/BarrierOptPass.h delete mode 100644 accera/transforms/src/gpu/AcceraToSPIRVPass.cpp create mode 100644 accera/transforms/src/value/BarrierOptPass.cpp create mode 100644 docs/Reference/classes/Array/sub_array.md create mode 100644 docs/Reference/classes/Plan/tensorize.md diff --git a/.azure/win-accera.yml b/.azure/win-accera.yml index 211c113c..b05da5bd 100644 --- a/.azure/win-accera.yml +++ b/.azure/win-accera.yml @@ -253,7 +253,7 @@ steps: 
workingDirectory: "$(Build.SourcesDirectory)/" - script: | - call "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Auxiliary\Build\vcvars64.bat" + call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvars64.bat" set PATH=%VULKAN_SDK%\bin;%PATH% python -m accera.test.smoke_test displayName: Smoke test diff --git a/.azure/win-pr.yml b/.azure/win-pr.yml index 548f9d6e..ec4a34a6 100644 --- a/.azure/win-pr.yml +++ b/.azure/win-pr.yml @@ -39,7 +39,7 @@ steps: continueOnError: false inputs: workingDirectory: 'build\RelWithDebInfo' - cmakeArgs: '..\.. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_LIT_ARGS=-vv -G"Visual Studio 16 2019" -Ax64 -DLLVM_SETUP_VARIANT=$(LLVM_SETUP_VARIANT)' + cmakeArgs: '..\.. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_LIT_ARGS=-vv -G"Visual Studio 17 2022" -Ax64 -DLLVM_SETUP_VARIANT=$(LLVM_SETUP_VARIANT)' condition: eq( variables['Agent.OS'], 'Windows_NT' ) - task: CMake@1 @@ -70,7 +70,7 @@ steps: workingDirectory: "$(Build.SourcesDirectory)/" - script: | - call "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Auxiliary\Build\vcvars64.bat" + call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvars64.bat" python -m pip install -r $(Build.SourcesDirectory)/accera/onnx-emitter/test/requirements.txt ctest -C RelWithDebInfo -T test -VV -LE benchmark displayName: Run all ctest targets diff --git a/CMake/LLVMSetup.cmake b/CMake/LLVMSetup.cmake index 01110002..c531d989 100644 --- a/CMake/LLVMSetup.cmake +++ b/CMake/LLVMSetup.cmake @@ -7,14 +7,14 @@ #################################################################################################### # # Gets the following variables: -# +# # LLVM_SETUP_VARIANT: An optional environment variable or CMake define # that specifies the LLVM build source: # LLVM_SETUP_VARIANT="Default" - uses vcpkg to acquire LLVM # Pre-requisite: `vcpkg install accera-llvm` or # `vcpkg install accera-llvm:x64-windows` # -# LLVM_SETUP_VARIANT="Conan" - uses Conan to acquire LLVM +# LLVM_SETUP_VARIANT="Conan" - uses Conan to acquire LLVM # (for internal use only) # # Sets the following variables: @@ -34,10 +34,10 @@ # Include guard so we don't try to find or download LLVM more than once include_guard() +set(LLVM_SETUP_VARIANT "Default" CACHE STRING "Source for LLVM binaries") if(DEFINED ENV{LLVM_SETUP_VARIANT}) - set(LLVM_SETUP_VARIANT $ENV{LLVM_SETUP_VARIANT} ) + set(LLVM_SETUP_VARIANT $ENV{LLVM_SETUP_VARIANT} CACHE STRING "" FORCE) endif() -set(LLVM_SETUP_VARIANT "Default" CACHE STRING "Source for LLVM binaries") message(STATUS "Using LLVMSetup${LLVM_SETUP_VARIANT}.cmake") diff --git a/CMakeLists.txt b/CMakeLists.txt index 96609bc1..d12744f0 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -35,8 +35,10 @@ option(USE_MKL "Build with Intel MKL" OFF) option(USE_LIBCXX "Build with libc++ if using the Clang compiler" OFF) if(CMAKE_CXX_COMPILER_ID STREQUAL Clang) + if(USE_LIBCXX OR (CMAKE_HOST_SYSTEM_NAME STREQUAL Darwin)) add_compile_options(-stdlib=libc++) link_libraries(-lc++ -lc++abi) + endif(USE_LIBCXX OR (CMAKE_HOST_SYSTEM_NAME STREQUAL Darwin)) endif(CMAKE_CXX_COMPILER_ID STREQUAL Clang) # Try to create a compilation database, which is useful to have when working @@ -156,11 +158,14 @@ else() set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO} -ggdb3") set(CMAKE_C_FLAGS_RELWITHDEBINFO "${CMAKE_C_FLAGS_RELWITHDEBINFO} -ggdb3") if(${CMAKE_CXX_COMPILER_ID} STREQUAL Clang) + if(CMAKE_BUILD_TYPE STREQUAL Debug) + # Set options for Control Flow 
Integrity + add_compile_options(-fsanitize=cfi) + endif(CMAKE_BUILD_TYPE STREQUAL Debug) + add_compile_options(-Wno-backslash-newline-escape) add_compile_options(-Wno-self-assign) add_compile_options(-fcolor-diagnostics) - # Set options for Control Flow Integrity - add_compile_options(-fsanitize=cfi) # Enable Shadow Stack mitigation add_compile_options(-fsanitize=shadow-call-stack) # Exit after the first 2 errors are reported diff --git a/accera/acc-opt/test/barrier_opt.mlir b/accera/acc-opt/test/barrier_opt.mlir new file mode 100644 index 00000000..6c7600aa --- /dev/null +++ b/accera/acc-opt/test/barrier_opt.mlir @@ -0,0 +1,54 @@ +// RUN: acc-opt --verify-each=false --optimize-barriers %s | FileCheck %s + +// CHECK-LABEL: module @barrier_test_1 +// CHECK: %2 = "accv.alloc"() +// CHECK-NEXT: %3 = "accv.alloc"() : () -> memref<16xf32, 3> +// CHECK-NEXT: %4 = affine.load %arg0[symbol(%0) + symbol(%1) * 16] : memref<1xf32> +// CHECK-NEXT: affine.store %4, %2[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3> +// CHECK-NEXT: %5 = affine.load %arg1[symbol(%0) + symbol(%1) * 16] : memref<1xf32> +// CHECK-NEXT: affine.store %5, %2[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3> +// CHECK-NEXT: "accv.barrier"() {scope = "Block"} : () -> () +// CHECK-NEXT: %6 = affine.load %2[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3> +// CHECK-NEXT: %7 = affine.load %3[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3> +// CHECK-NEXT: %8 = "accv.bin_op"(%6, %7) {predicate = 0 : i64} : (f32, f32) -> f32 +// CHECK-NEXT: affine.store %8, %arg2[symbol(%0) + symbol(%1) * 16] : memref<1xf32> +// CHECK: accv.return +module @barrier_test_1 attributes {llvm.data_layout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"} { + accv.module "barrier_test_1" { + accv.func nested @barrier_test_1_d9502818_impl_8438933964186859281(%arg0: memref<1xf32>, %arg1: memref<1xf32>, %arg2: memref<1xf32>) attributes {exec_target = 0 : i64} { + "accv.lambda"() ( { + %0 = "gpu.thread_id"() {dimension = "x"} : () -> index + %1 = "gpu.block_id"() {dimension = "x"} : () -> index + affine.for %arg3 = 0 to 1 { + affine.for %arg4 = 0 to 1 { + affine.for %arg5 = 0 to 1 { + affine.for %arg6 = 0 to 1 { + %2 = "accv.alloc"() : () -> memref<16xf32, 3> + %3 = "accv.alloc"() : () -> memref<16xf32, 3> + "accv.barrier"() {scope = "Block"} : () -> () + %4 = affine.load %arg0[symbol(%0) + symbol(%1) * 16] : memref<1xf32> + affine.store %4, %2[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3> + "accv.barrier"() {scope = "Block"} : () -> () + %5 = affine.load %arg1[symbol(%0) + symbol(%1) * 16] : memref<1xf32> + affine.store %5, %2[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3> + "accv.barrier"() {scope = "Block"} : () -> () + %6 = affine.load %2[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3> + %7 = affine.load %3[symbol(%0) + symbol(%1) * 16] : memref<16xf32, 3> + %8 = "accv.bin_op"(%6, %7) {predicate = 0 : i64} : (f32, f32) -> f32 + affine.store %8, %arg2[symbol(%0) + symbol(%1) * 16] : memref<1xf32> + "accv.barrier"() {scope = "Block"} : () -> () + } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{j_i,5}">, kernels = ["_"], accv_gpu_map = "ThreadY", subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">], subdomainSize = [1, 1]} + } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{i_i,3}">, accv_gpu_map = "ThreadX", subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">], subdomainSize = [1, 16]} + } {begin = 0 : i64, end = 256 : i64, index = #accln<"index{j_o,4}">, accv_gpu_map = 
"BlockY", subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">], subdomainSize = [16, 16]} + } {begin = 0 : i64, end = 256 : i64, index = #accln<"index{i_o,2}">, accv_gpu_map = "BlockX", subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">], subdomainSize = [16, 256]} + accv.return + }) {exec_target = 1 : i64, gpu_launch = [16 : index, 16 : index, 1 : index, 16 : index, 16 : index, 1 : index], sym_name = "NestFunction_0", type = () -> ()} : () -> () + accv.return + } + accv.func @barrier_test_1_d9502818(%arg0: memref<1xf32>, %arg1: memref<1xf32>, %arg2: memref<1xf32>) attributes {accv.base_name = "barrier_test_1", accv.emit_header_decl, accv.emit_raw_pointer_api, exec_target = 0 : i64} { + accv.launch_func @barrier_test_1_d9502818_impl_8438933964186859281(%arg0, %arg1, %arg2) {exec_target = 0 : i64, gpu_launch = "gpu_launch"} : (memref<1xf32>, memref<1xf32>, memref<1xf32>) -> () + accv.return + } + } +} + diff --git a/accera/acc-opt/test/thrifty_caching.mlir b/accera/acc-opt/test/thrifty_caching.mlir new file mode 100644 index 00000000..4c1583d2 --- /dev/null +++ b/accera/acc-opt/test/thrifty_caching.mlir @@ -0,0 +1,96 @@ +// RUN: acc-opt --verify-each=false --pass-pipeline="accv.module(accv.func(loopnest-to-value-func))" %s | FileCheck %s + +// This function has two caches initially, both marked thrifty, and one of them should +// get elided based on thrifty checks but the other should not + +// This is the graph at the LoopNestToValueFuncPass_Subpasses_0_10_Canonicalize.mlir stage, +// which is the last canonicalize stage before the thrifty checks and the subpasses +// before the thrifty phase create ops that the thrifty check depends on not being +// canonicalized before it runs +module @test_thrifty_caching_simple_input_cache attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"} { + accv.module "test_thrifty_caching_simple_input_cache" { + accv.func nested @test_thrifty_caching_simple_input_cache_1127a105_impl_6891397719071098712(%arg0: memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, %arg1: memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, %arg2: memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>) attributes {exec_target = 0 : i64} { + %0 = accln.sym_index {name = "i_i"} #accln<"index{i_i,4}"> + %1 = accln.sym_index {name = "i_o"} #accln<"index{i_o,3}"> + %2 = accln.sym_index {name = "k_o"} #accln<"index{k_o,7}"> + %3 = accln.sym_index {name = "j_i"} #accln<"index{j_i,6}"> + %4 = accln.sym_index {name = "k_i"} #accln<"index{k_i,8}"> + %5 = accln.sym_index {name = "j_o"} #accln<"index{j_o,5}"> + "accv.lambda"() ( { + %6 = "accxp.make_cache"() {memorySpace = 0 : i64, multiCacheAccessIndices = [], offsetAccessIndices = [], offsetArrayToCacheAccessMap = affine_map<(d0) -> (d0)>} : () -> memref + %7 = "accxp.begin_cache_region"(%arg0, %6, %arg0, %1, %2, %0, %4, %1, %2) {activeBlockCache, cacheAccessMaps = {manualCacheDimOrder = [0, 1]}, cacheHierarchyLevel = 0 : i64, cacheIndex = #accln<"index{i_i,4}">, cacheRegionBaseIndices = [[#accln<"index{i,0}">], [#accln<"index{k,2}">]], cacheRegionRelevantIndexRanges = [#accln<"indexrange{i_i,4}={0:4:1}">, #accln<"indexrange{k_i,8}={0:32:1}">], dimReorderCache, id = 0 : i64, operand_segment_sizes = dense<[1, 1, 1, 4, 2]> : vector<5xi32>, thrifty, triggerIndex = #accln<"index{i_i,4}">} : (memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, memref, memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, index, index, index, index, index, 
index) -> index + "accxp.end_cache_region"(%7) : (index) -> () + %8 = "accxp.make_cache"() {memorySpace = 0 : i64, multiCacheAccessIndices = [], offsetAccessIndices = [], offsetArrayToCacheAccessMap = affine_map<(d0) -> (d0)>} : () -> memref + %9 = "accxp.begin_cache_region"(%arg1, %8, %arg1, %5, %2, %3, %4, %5) {activeBlockCache, cacheAccessMaps = {manualCacheDimOrder = [0, 1]}, cacheHierarchyLevel = 0 : i64, cacheIndex = #accln<"index{k_o,7}">, cacheRegionBaseIndices = [[#accln<"index{k,2}">], [#accln<"index{j,1}">], [#accln<"index{k,2}">]], cacheRegionRelevantIndexRanges = [#accln<"indexrange{k_o,7}={0:32:32}">, #accln<"indexrange{j_i,6}={0:16:1}">, #accln<"indexrange{k_i,8}={0:32:1}">], dimReorderCache, id = 1 : i64, operand_segment_sizes = dense<[1, 1, 1, 4, 1]> : vector<5xi32>, thrifty, triggerIndex = #accln<"index{k_o,7}">} : (memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, memref, memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, index, index, index, index, index) -> index + "accxp.end_cache_region"(%9) : (index) -> () + affine.for %arg3 = 0 to 32 step 4 { + affine.for %arg4 = 0 to 32 step 16 { + %10 = "accxp.begin_cache_region"(%arg1, %8, %arg1, %arg4, %2, %3, %4, %arg4) {activeBlockCache, cacheAccessMaps = {manualCacheDimOrder = [0, 1]}, cacheHierarchyLevel = 0 : i64, cacheIndex = #accln<"index{k_o,7}">, cacheRegionBaseIndices = [[#accln<"index{k,2}">], [#accln<"index{j,1}">], [#accln<"index{k,2}">]], cacheRegionRelevantIndexRanges = [#accln<"indexrange{k_o,7}={0:32:32}">, #accln<"indexrange{j_i,6}={0:16:1}">, #accln<"indexrange{k_i,8}={0:32:1}">], dimReorderCache, id = 1 : i64, operand_segment_sizes = dense<[1, 1, 1, 4, 1]> : vector<5xi32>, thrifty, triggerIndex = #accln<"index{k_o,7}">} : (memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, memref, memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, index, index, index, index, index) -> index + affine.for %arg5 = 0 to 32 step 32 { + %11 = "accxp.begin_cache_region"(%arg0, %6, %arg0, %arg3, %arg5, %0, %4, %arg3, %arg5) {activeBlockCache, cacheAccessMaps = {manualCacheDimOrder = [0, 1]}, cacheHierarchyLevel = 0 : i64, cacheIndex = #accln<"index{i_i,4}">, cacheRegionBaseIndices = [[#accln<"index{i,0}">], [#accln<"index{k,2}">]], cacheRegionRelevantIndexRanges = [#accln<"indexrange{i_i,4}={0:4:1}">, #accln<"indexrange{k_i,8}={0:32:1}">], dimReorderCache, id = 0 : i64, operand_segment_sizes = dense<[1, 1, 1, 4, 2]> : vector<5xi32>, thrifty, triggerIndex = #accln<"index{i_i,4}">} : (memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, memref, memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, index, index, index, index, index, index) -> index + affine.for %arg6 = 0 to 4 { + affine.for %arg7 = 0 to 16 { + affine.for %arg8 = 0 to 32 { + %12 = affine.load %arg0[%arg3 + %arg6, %arg5 + %arg8] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>> + %13 = affine.load %arg1[%arg5 + %arg8, %arg4 + %arg7] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>> + %14 = "accv.bin_op"(%12, %13) {predicate = 2 : i64} : (f32, f32) -> f32 + %15 = affine.load %arg2[%arg3 + %arg6, %arg4 + %arg7] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>> + %16 = "accv.bin_op"(%15, %14) {predicate = 0 : i64} : (f32, f32) -> f32 + affine.store %16, %arg2[%arg3 + %arg6, %arg4 + %arg7] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>> + %17 = affine.load %arg2[%arg3 + %arg6, %arg4 + %arg7] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>> + affine.store %17, %arg2[%arg3 + %arg6, 
%arg4 + %arg7] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>> + } {begin = 0 : i64, end = 32 : i64, index = #accln<"index{k_i,8}">, kernels = ["_"], subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 1, 1]} + } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{j_i,6}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 1, 32]} + } {begin = 0 : i64, end = 4 : i64, index = #accln<"index{i_i,4}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 16, 32]} + "accxp.end_cache_region"(%11) : (index) -> () + } {begin = 0 : i64, end = 32 : i64, index = #accln<"index{k_o,7}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [4, 16, 32]} + "accxp.end_cache_region"(%10) : (index) -> () + } {begin = 0 : i64, end = 32 : i64, index = #accln<"index{j_o,5}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [4, 16, 32]} + } {begin = 0 : i64, end = 32 : i64, index = #accln<"index{i_o,3}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [4, 32, 32]} + accv.return + }) {exec_target = 0 : i64, sym_name = "NestFunction_0", type = () -> ()} : () -> () + accv.return + } + } +} + +// CHECK: #map = affine_map<(d0, d1) -> (d0 * 32 + d1)> +// CHECK: module @test_thrifty_caching_simple_input_cache attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"} { +// CHECK: accv.module "test_thrifty_caching_simple_input_cache" { +// CHECK: "accv.global"() {sym_name = "cache_3", type = memref<32x16xf32, 3>} : () -> () +// CHECK: accv.func nested @test_thrifty_caching_simple_input_cache_1127a105_impl_6891397719071098712(%arg0: memref<32x32xf32, #map>, %arg1: memref<32x32xf32, #map>, %arg2: memref<32x32xf32, #map>) attributes {exec_target = 0 : i64} { +// CHECK: "accv.lambda"() ( { +// CHECK: %0 = "accv.ref_global"() {global_name = @cache_3} : () -> memref<32x16xf32, 3> +// CHECK: affine.for %arg3 = 0 to 32 step 4 { +// CHECK: affine.for %arg4 = 0 to 32 step 16 { +// CHECK: "accv.lambda"() ( { +// CHECK: affine.for %arg5 = 0 to 32 { +// CHECK: affine.for %arg6 = 0 to 16 { +// CHECK: %1 = affine.load %arg1[%arg5, %arg4 + %arg6] : memref<32x32xf32, #map> +// CHECK: affine.store %1, %0[%arg5, %arg6] : memref<32x16xf32, 3> +// CHECK: } {accxp.access_bounds_check, begin = 0 : i64, end = 16 : i64, index = #accln<"index{j,5}">, kernels = ["cache_internal_loopnest_kernel_active_block_copy"], scheduledIndex = #accln<"index{j,5}">, subdomainIndexOrder = [#accln<"index{i,4}">, #accln<"index{j,5}">], subdomainSize = [1, 1]} +// CHECK: } {accxp.access_bounds_check, begin = 0 : i64, end = 32 : i64, index = #accln<"index{i,4}">, scheduledIndex = #accln<"index{i,4}">, subdomainIndexOrder = [#accln<"index{i,4}">, #accln<"index{j,5}">], subdomainSize = [1, 16]} +// CHECK: accv.return +// CHECK: }) {exec_target = 0 : i64, sym_name = "NestFunction_2", type = () -> ()} : () -> () +// CHECK: affine.for %arg5 = 0 to 4 { +// CHECK: affine.for %arg6 = 0 to 16 { +// CHECK: affine.for %arg7 = 0 to 32 { +// CHECK: %1 = affine.load %arg0[%arg3 + %arg5, %arg7] : memref<32x32xf32, #map> +// CHECK: %2 = affine.load %0[%arg7, %arg6] : memref<32x16xf32, 3> +// CHECK: %3 = "accv.bin_op"(%1, %2) {predicate = 2 : i64} : (f32, f32) -> f32 +// CHECK: %4 
= affine.load %arg2[%arg3 + %arg5, %arg4 + %arg6] : memref<32x32xf32, #map> +// CHECK: %5 = "accv.bin_op"(%4, %3) {predicate = 0 : i64} : (f32, f32) -> f32 +// CHECK: affine.store %5, %arg2[%arg3 + %arg5, %arg4 + %arg6] : memref<32x32xf32, #map> +// CHECK: %6 = affine.load %arg2[%arg3 + %arg5, %arg4 + %arg6] : memref<32x32xf32, #map> +// CHECK: affine.store %6, %arg2[%arg3 + %arg5, %arg4 + %arg6] : memref<32x32xf32, #map> +// CHECK: } {begin = 0 : i64, end = 32 : i64, index = #accln<"index{k_i,8}">, kernels = ["_"], subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 1, 1]} +// CHECK: } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{j_i,6}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 1, 32]} +// CHECK: } {begin = 0 : i64, end = 4 : i64, index = #accln<"index{i_i,4}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 16, 32]} +// CHECK: } {begin = 0 : i64, end = 32 : i64, index = #accln<"index{j_o,5}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [4, 16, 32]} +// CHECK: } {begin = 0 : i64, end = 32 : i64, index = #accln<"index{i_o,3}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [4, 32, 32]} +// CHECK: accv.return +// CHECK: }) {exec_target = 0 : i64, sym_name = "NestFunction_0", type = () -> ()} : () -> () +// CHECK: accv.return +// CHECK: } +// CHECK: } +// CHECK: } diff --git a/accera/acc-opt/test/value_mlir_test.cpp b/accera/acc-opt/test/value_mlir_test.cpp index 48aebdd2..690530d4 100644 --- a/accera/acc-opt/test/value_mlir_test.cpp +++ b/accera/acc-opt/test/value_mlir_test.cpp @@ -152,7 +152,7 @@ TEST_CASE("gpu_module2") { auto gpu_f1 = DeclareFunction("gpu_f1") - .Target(targets::GPU( {32, 32, 32}, {1, 1, 1 })) + .Target(targets::GPU({ 32, 32, 32 }, { 1, 1, 1 })) .Parameters(Value{ ValueType::Float, MemoryLayout{ { 16384 } } }, Value{ ValueType::Float, MemoryLayout{ { 16384 } } }, Value{ ValueType::Float, MemoryLayout{ { 16384 } } }) @@ -204,7 +204,7 @@ TEST_CASE("gpu_module3") { auto gpu_f1 = DeclareFunction("gpu_f1") - .Target(targets::GPU( {128, 1, 1}, {128, 1, 1 })) + .Target(targets::GPU({ 128, 1, 1 }, { 128, 1, 1 })) .Parameters(Value{ ValueType::Float, MemoryLayout{ { 16384 } } }, Value{ ValueType::Float, MemoryLayout{ { 16384 } } }, Value{ ValueType::Float, MemoryLayout{ { 16384 } } }) diff --git a/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.cpp b/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.cpp index 975a2e52..a1869143 100644 --- a/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.cpp +++ b/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.cpp @@ -1,33 +1,73 @@ //////////////////////////////////////////////////////////////////////////////////////////////////// // Copyright (c) Microsoft Corporation. All rights reserved. // Licensed under the MIT License. See LICENSE in the project root for license information. 
+// Authors: Kern Handa //////////////////////////////////////////////////////////////////////////////////////////////////// #include "AcceraDialectCppPrinter.h" +#include "AMDGPU.h" +#include "NVGPU.h" +#include "ir/include/value/ValueDialect.h" + #include #include +#include + #include +#include #include #include -#include "AMDGPU.h" -#include "NVGPU.h" +#include using namespace mlir::argo; +namespace vir = accera::ir::value; + namespace mlir { namespace cpp_printer { + LogicalResult AcceraDialectCppPrinter::printOp(vir::CallOp callOp) + { + auto callInterface = dyn_cast(callOp.getOperation()); + auto callee = callInterface.resolveCallable(); + if (!callee) return callOp->emitError("Cannot find callee function"); + + (void)printer->printDeclarationForOpResult(callOp); + if (callOp->getNumResults() > 0) + os << " = "; + + os << callOp.getCallee() << "("; + RETURN_IF_FAILED(printer->printOperationOperands(callOp)); + os << ")"; + + return success(); + } - static bool isMFMAComputeOp(Operation* op) + LogicalResult AcceraDialectCppPrinter::printOp(vir::ReturnOp returnOp) { - return llvm::isa(op); + os << "return"; + + if (auto numOperands = returnOp.getNumOperands(); numOperands == 0) + { + // Nothing to do + } + else if (numOperands == 1) + { + os << " " << state.nameState.getName(returnOp.getOperand(0)); + } + else + { + return returnOp.emitOpError() << "<>"; + } + + return success(); } - LogicalResult AcceraDialectCppPrinter::printMFMAComputeOp(Operation* op) + + LogicalResult AcceraDialectCppPrinter::printOp(vir::MFMAComputeOp mfmaOp) { - accera::ir::value::MFMAComputeOp mfmaOp = mlir::dyn_cast_or_null(op); assert(mfmaOp); auto accumInputTy = mfmaOp.opC().getType(); auto accumOutputTy = mfmaOp.res().getType(); @@ -60,10 +100,17 @@ namespace cpp_printer bool* /*skipped*/, bool* consumed) { - *consumed = true; - if (isMFMAComputeOp(op)) - return printMFMAComputeOp(op); - *consumed = false; + auto handler = [&, this](auto op_) { + printOp(op_); + *consumed = true; + }; + + TypeSwitch(op) + .Case(handler) + .Case(handler) + .Case(handler) + .Default([&](Operation*) { *consumed = false; }); + return success(); } @@ -169,7 +216,6 @@ namespace cpp_printer LogicalResult AcceraDialectCppPrinter::printEpilogue() { // TODO: add a cmdline option to skip generating host launch func - RETURN_IF_FAILED(printHostLaunchFunc()); return success(); } diff --git a/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.h b/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.h index e2dae765..5c4d11da 100644 --- a/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.h +++ b/accera/acc-translate/src/Target/Cpp/AcceraDialectCppPrinter.h @@ -32,7 +32,9 @@ namespace cpp_printer std::string getName() override { return "Accera"; } - LogicalResult printMFMAComputeOp(Operation* op); + LogicalResult printOp(accera::ir::value::MFMAComputeOp op); + LogicalResult printOp(accera::ir::value::CallOp op); + LogicalResult printOp(accera::ir::value::ReturnOp op); LogicalResult printDialectOperation(Operation* op, bool* skipped, bool* consumed) override; diff --git a/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.cpp b/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.cpp index f27f25f7..155f836b 100644 --- a/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.cpp +++ b/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.cpp @@ -5,7 +5,9 @@ //////////////////////////////////////////////////////////////////////////////////////////////////// #include 
"AffineDialectCppPrinter.h" + #include +#include #include namespace @@ -444,11 +446,22 @@ namespace cpp_printer LogicalResult AffineDialectCppPrinter::printAffineForOp(AffineForOp affineForOp) { - // TODO: handle returned values from affine.yield after we support - // affine.yield op if (!affineForOp.getResults().empty()) { - affineForOp.emitError("cannot yield values from the op"); + if (affineForOp.getIterOperands().size() > 1 || affineForOp->getResults().size() > 1) + { + os << "AffineForOp with multiple iter operands or results is not supported yet.\n"; + return failure(); + } + auto resultVar = affineForOp.getResults()[0]; + auto iterVar = affineForOp.getRegionIterArgs()[0]; + auto initVal = affineForOp.getIterOperands()[0]; + + StringRef iterVarName = state.nameState.getOrCreateName( + iterVar, SSANameState::SSANameKind::Variable); + + RETURN_IF_FAILED(printer->printType(resultVar.getType())); + os << " " << iterVarName << " = " << state.nameState.getName(initVal) << ";\n"; } if (!affineForOp.hasConstantLowerBound()) @@ -483,9 +496,34 @@ namespace cpp_printer auto& loopRegion = affineForOp.region(); RETURN_IF_FAILED(printer->printRegion(loopRegion, /*printParens*/ false, - /*printBlockTerminator*/ false)); + /*printBlockTerminator*/ true)); os << "}\n"; + if (!affineForOp.getResults().empty()) + { + auto resultVar = affineForOp.getResults()[0]; + auto iterVar = affineForOp.getRegionIterArgs()[0]; + StringRef resultName = state.nameState.getOrCreateName( + resultVar, SSANameState::SSANameKind::Variable); + RETURN_IF_FAILED(printer->printType(resultVar.getType())); + os << " " << resultName << " = " << state.nameState.getName(iterVar) << ";\n"; + } + + return success(); + } + + LogicalResult AffineDialectCppPrinter::printAffineYieldOp(AffineYieldOp affineYieldOp) + { + if (affineYieldOp.getNumOperands() == 0) + { + return success(); + } + auto affineForOp = affineYieldOp->getParentOfType(); + auto iterVar = affineForOp.getRegionIterArgs()[0]; + auto result = affineYieldOp.getOperand(0); + + os << state.nameState.getName(iterVar) << " = " << state.nameState.getName(result); + return success(); } @@ -510,6 +548,9 @@ namespace cpp_printer if (auto affineVectorStoreOp = dyn_cast(op)) return printAffineVectorStoreOp(affineVectorStoreOp); + if (auto affineYieldOp = dyn_cast(op)) + return printAffineYieldOp(affineYieldOp); + if (auto affineForOp = dyn_cast(op)) { *skipped = true; diff --git a/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.h b/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.h index 2f775aeb..fc461903 100644 --- a/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.h +++ b/accera/acc-translate/src/Target/Cpp/AffineDialectCppPrinter.h @@ -6,11 +6,12 @@ #ifndef AFFINE_DIALECT_CPP_PRINTER_H_ #define AFFINE_DIALECT_CPP_PRINTER_H_ + +#include "CppPrinter.h" -#include #include +#include -#include "CppPrinter.h" namespace mlir { @@ -41,6 +42,8 @@ namespace cpp_printer LogicalResult printAffineForOp(AffineForOp affineForOp); + LogicalResult printAffineYieldOp(AffineYieldOp affineYieldOp); + LogicalResult printAffineMapFunc(AffineMap map, StringRef funcName); LogicalResult printAffineExpr(AffineExpr affineExpr); diff --git a/accera/acc-translate/src/Target/Cpp/CppPrinter.cpp b/accera/acc-translate/src/Target/Cpp/CppPrinter.cpp index 76dfb260..58cd7294 100644 --- a/accera/acc-translate/src/Target/Cpp/CppPrinter.cpp +++ b/accera/acc-translate/src/Target/Cpp/CppPrinter.cpp @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ 
-29,6 +30,10 @@ using namespace llvm; +namespace ir = accera::ir; +namespace utilir = accera::ir::util; +namespace vir = accera::ir::value; + namespace mlir { namespace cpp_printer @@ -460,19 +465,7 @@ namespace cpp_printer { RETURN_IF_FAILED(checkMemRefType(memRefType)); RETURN_IF_FAILED(printType(memRefType.getElementType())); - auto rank = memRefType.getRank(); - if (rank <= 1) - { - os << " *" << arrayName; - return success(); - } - - auto shape = memRefType.getShape(); - os << " (*" << arrayName << ")"; - for (auto d : shape.drop_front()) - { - os << "[" << d << "]"; - } + os << " *" << arrayName; return success(); } @@ -571,17 +564,29 @@ namespace cpp_printer LogicalResult CppPrinter::printMemRefLoadOrStore(bool isLoad, Value memref, MemRefType memRefType, Operation::operand_range indices, Value targetOrSrc) { auto rank = memRefType.getRank(); + auto srcTargetsIsVectorTy = targetOrSrc.getType().isa(); + std::string vectorTypeName = ""; + if (srcTargetsIsVectorTy) + { + SmallString<128> nameStr(""); + llvm::raw_svector_ostream strm(nameStr); + CppPrinter cppPrinter(strm); + (void)cppPrinter.printType(targetOrSrc.getType()); + vectorTypeName = strm.str().str(); + } + auto memrefAccessPrefix = srcTargetsIsVectorTy ? std::string("*((") + vectorTypeName + "*)(&(" : std::string(""); + auto memrefAccessSuffix = srcTargetsIsVectorTy ? ")))" : ""; if (rank == 0) { if (isLoad) { RETURN_IF_FAILED(printDeclarationForValue(targetOrSrc)); os << " = "; - os << "*" << state.nameState.getName(memref); + os << memrefAccessPrefix << state.nameState.getName(memref) << memrefAccessSuffix; } else { - os << "*" << state.nameState.getName(memref); + os << memrefAccessPrefix << state.nameState.getName(memref) << memrefAccessSuffix; os << " = "; RETURN_IF_FAILED(printDeclarationForValue(targetOrSrc)); } @@ -650,11 +655,11 @@ namespace cpp_printer if (isLoad) { RETURN_IF_FAILED(printDeclarationForValue(targetOrSrc)); - os << " = " << state.nameState.getName(memref) << offsetStr; + os << " = " << memrefAccessPrefix << state.nameState.getName(memref) << offsetStr << memrefAccessSuffix; } else { - os << state.nameState.getName(memref) << offsetStr; + os << memrefAccessPrefix << state.nameState.getName(memref) << offsetStr << memrefAccessSuffix; os << " = " << state.nameState.getName(targetOrSrc); } } @@ -665,11 +670,15 @@ namespace cpp_printer { RETURN_IF_FAILED(printDeclarationForValue(targetOrSrc)); os << " = "; + os << memrefAccessPrefix; RETURN_IF_FAILED(printMemRefAccess(memref, memRefType, offsetVarName)); + os << memrefAccessSuffix; } else { + os << memrefAccessPrefix; RETURN_IF_FAILED(printMemRefAccess(memref, memRefType, offsetVarName)); + os << memrefAccessSuffix; os << " = " << state.nameState.getName(targetOrSrc); } } @@ -820,6 +829,19 @@ namespace cpp_printer LogicalResult CppPrinter::printFunctionDeclaration(FuncOp funcOp, bool trailingSemiColon) { + if (funcOp->hasAttr(ir::HeaderDeclAttrName) && funcOp->hasAttr(ir::RawPointerAPIAttrName)) + { + os << "extern \"C\" "; + } + + if (auto execRuntime = utilir::ResolveExecutionRuntime(funcOp, /* exact */ true); + execRuntime && + (execRuntime == vir::ExecutionRuntime::ROCM) && + utilir::ResolveExecutionTarget(funcOp, /* exact */ true) == vir::ExecutionTarget::CPU) + { + os << "__host__ "; + } + if (failed(printTypes(funcOp.getType().getResults()))) { return funcOp.emitOpError() << "<>"; @@ -999,12 +1021,11 @@ namespace cpp_printer return success(); } - if (isa(op) || isa(op)) + if (isConstantScalarOp(op)) { *skipped = true; } - - if (!isa(op) && !isa(op) 
&& op->getNumRegions() == 0) + else if (op->getNumRegions() == 0) { os << "/*" << *op << "*/\n"; } @@ -1031,6 +1052,7 @@ namespace cpp_printer void CppPrinter::registerAllDialectPrinters() { [[maybe_unused]] static bool init_once = [&]() { + registerDialectPrinter(); registerDialectPrinter(); registerDialectPrinter(); registerDialectPrinter(); diff --git a/accera/acc-translate/src/Target/Cpp/CppPrinter.h b/accera/acc-translate/src/Target/Cpp/CppPrinter.h index 58bc2f30..f088ce82 100644 --- a/accera/acc-translate/src/Target/Cpp/CppPrinter.h +++ b/accera/acc-translate/src/Target/Cpp/CppPrinter.h @@ -8,6 +8,7 @@ #define CPP_PRINTER_H_ #include +#include #include #include #include @@ -117,13 +118,19 @@ namespace cpp_printer llvm::BumpPtrAllocator nameAllocator; }; + // This is a bitmask flag because right now, the printer goes through the module as a whole and "discovers" the runtimes + // used within the module. This isn't the best system. Eventually, we should move to a system where it can be queried the runtimes + // that are enabled for the current function. Until we move to that design, however, we remain subscribed to the current paradigm. enum class Runtime { - None = 0, + NONE = 0, CUDA = 1 << 0, + ROCM = 1 << 1, + VULKAN = 1 << 2, + OPENMP = 1 << 3, + DEFAULT = 1 << 4, - LLVM_MARK_AS_BITMASK_ENUM(/* LargestValue = */ CUDA) - // TODO: add OpenMP? ROCM? + LLVM_MARK_AS_BITMASK_ENUM(/* LargestValue = */ DEFAULT) }; /// Holding the states for the printer such as SSA names, type alias, etc @@ -166,7 +173,7 @@ namespace cpp_printer llvm::SmallPtrSet intrinsicDecls; // TODO: add more state kinds - Runtime runtimesDetected = Runtime::None; + Runtime runtimesDetected = Runtime::NONE; }; /// Print the given MLIR into C++ code. Formatting is not a concern @@ -487,6 +494,20 @@ namespace cpp_printer return interleaveWithError(c.begin(), c.end(), each_fn, [&]() { os << ", "; }); } + [[maybe_unused]] static bool isConstantScalarOp(Operation* op) + { + if (isa(op)) + { + return true; + } + else if (auto constantOp = dyn_cast(op)) + { + auto resTy = constantOp.value().getType(); + return resTy.isIntOrFloat() || resTy.isIndex(); + } + return false; + } + } // namespace cpp_printer } // namespace mlir diff --git a/accera/acc-translate/src/Target/Cpp/GpuDialectCppPrinter.cpp b/accera/acc-translate/src/Target/Cpp/GpuDialectCppPrinter.cpp index 80d93dc4..7ff75288 100644 --- a/accera/acc-translate/src/Target/Cpp/GpuDialectCppPrinter.cpp +++ b/accera/acc-translate/src/Target/Cpp/GpuDialectCppPrinter.cpp @@ -15,6 +15,10 @@ using namespace mlir::gpu; +namespace ir = accera::ir; +namespace utilir = accera::ir::util; +namespace vir = accera::ir::value; + namespace mlir { namespace cpp_printer @@ -33,22 +37,11 @@ namespace cpp_printer static int dimIndexToInteger(llvm::StringRef dim) { - if (dim == "x") - { - return 0; - } - else if (dim == "y") - { - return 1; - } - else if (dim == "z") - { - return 2; - } - else - { - return -1; - } + return StringSwitch(dim) + .Case("x", 0) + .Case("y", 1) + .Case("z", 2) + .Default(-1); } static Optional getGridDim(Operation* op, llvm::StringRef dim) @@ -60,7 +53,7 @@ namespace cpp_printer { return llvm::None; } - auto arrayAttr = accera::ir::util::ArrayAttrToVector(fn->getAttrOfType("gridSize")); + auto arrayAttr = utilir::ArrayAttrToVector(fn->getAttrOfType("gridSize")); auto idx = dimIndexToInteger(dim); if (idx == -1) return llvm::None; return arrayAttr[idx].getInt(); @@ -76,7 +69,7 @@ namespace cpp_printer { return llvm::None; } - auto arrayAttr = 
accera::ir::util::ArrayAttrToVector(fn->getAttrOfType("blockSize")); + auto arrayAttr = utilir::ArrayAttrToVector(fn->getAttrOfType("blockSize")); auto idx = dimIndexToInteger(dim); if (idx == -1) return llvm::None; return arrayAttr[idx].getInt(); @@ -94,7 +87,7 @@ namespace cpp_printer const std::string varPrefix = std::string("gridDim_") + gridDimOp.dimension().str() + "_"; auto idx = state.nameState.getOrCreateName( gridDimOp.getResult(), SSANameState::SSANameKind::Variable, varPrefix); - os << "const uint " << idx << " = "; + os << "const unsigned int " << idx << " = "; if (auto c = getGridDim(gridDimOp, gridDimOp.dimension()); c) { os << c.getValue(); @@ -116,7 +109,7 @@ namespace cpp_printer const std::string varPrefix = std::string("blockDim_") + blockDimOp.dimension().str() + "_"; auto idx = state.nameState.getOrCreateName( blockDimOp.getResult(), SSANameState::SSANameKind::Variable, varPrefix); - os << "const uint " << idx << " = "; + os << "const unsigned int " << idx << " = "; if (auto c = getBlockDim(blockDimOp, blockDimOp.dimension()); c) { os << c.getValue(); @@ -138,7 +131,7 @@ namespace cpp_printer const std::string varPrefix = std::string("blockIdx_") + bidOp.dimension().str() + "_"; auto idx = state.nameState.getOrCreateName( bidOp.getResult(), SSANameState::SSANameKind::Variable, varPrefix); - os << "const uint " << idx << " = "; + os << "const unsigned int " << idx << " = "; if (auto c = getGridDim(bidOp, bidOp.dimension()); c) { os << "(blockIdx." << bidOp.dimension() << "%" << c.getValue() << ")"; @@ -161,7 +154,7 @@ namespace cpp_printer const std::string varPrefix = std::string("threadIdx_") + tidOp.dimension().str() + "_"; auto idx = state.nameState.getOrCreateName( tidOp.getResult(), SSANameState::SSANameKind::Variable, varPrefix); - os << "const uint " << idx << " = "; + os << "const unsigned int " << idx << " = "; if (auto c = getBlockDim(tidOp, tidOp.dimension()); c) { os << "(threadIdx." << tidOp.dimension() << "%" << c.getValue() << ")"; @@ -190,10 +183,10 @@ namespace cpp_printer .Case(handler) .Case(handler) .Case(handler) + .Case(handler) .Case(handler) .Case(handler) .Case(handler) - .Case(handler) .Case(handler) .Default([&](Operation*) { *consumed = false; }); @@ -289,7 +282,7 @@ namespace cpp_printer using vfloatx2_t = float __attribute__((ext_vector_type(2))); using vfloatx4_t = float __attribute__((ext_vector_type(4))); using vfloatx16_t = float __attribute__((ext_vector_type(16))); -#else +#elif defined(__CUDA__) #include "cuda_fp16.h" #endif // !defined(__HIP_PLATFORM_AMD__) @@ -304,13 +297,27 @@ using vfloatx16_t = float __attribute__((ext_vector_type(16))); gpu::GPUFuncOp funcOp, bool trailingSemiColon) { + auto execRuntime = utilir::ResolveExecutionRuntime(funcOp, /* exact */ false); + if (execRuntime && (execRuntime != vir::ExecutionRuntime::CUDA && + execRuntime != vir::ExecutionRuntime::ROCM && + // TODO: ugh. remove + execRuntime != vir::ExecutionRuntime::DEFAULT)) + { + return funcOp.emitError("Expected either CUDA or ROCm runtimes on GPU function"); + } + + if (funcOp->hasAttr(ir::HeaderDeclAttrName) && funcOp->hasAttr(ir::RawPointerAPIAttrName)) + { + os << "extern \"C\" "; + } + // TODO: We treat all functions to be CUDA global functions. 
// Need to add support for device functions os << "__global__ "; if (state.hasRuntime(Runtime::CUDA) && funcOp->hasAttrOfType("blockSize")) { - auto arrayAttr = accera::ir::util::ArrayAttrToVector(funcOp->getAttrOfType("blockSize")); + auto arrayAttr = utilir::ArrayAttrToVector(funcOp->getAttrOfType("blockSize")); auto blockSizeX = arrayAttr[0].getInt(); auto blockSizeY = arrayAttr[1].getInt(); auto blockSizeZ = arrayAttr[2].getInt(); @@ -390,7 +397,7 @@ using vfloatx16_t = float __attribute__((ext_vector_type(16))); return success(); } - LogicalResult GpuDialectCppPrinter::printOp(ReturnOp) + LogicalResult GpuDialectCppPrinter::printOp(gpu::ReturnOp) { return success(); } diff --git a/accera/acc-translate/src/Target/Cpp/StdDialectCppPrinter.cpp b/accera/acc-translate/src/Target/Cpp/StdDialectCppPrinter.cpp index 2aed9649..960316ff 100644 --- a/accera/acc-translate/src/Target/Cpp/StdDialectCppPrinter.cpp +++ b/accera/acc-translate/src/Target/Cpp/StdDialectCppPrinter.cpp @@ -148,7 +148,21 @@ namespace cpp_printer LogicalResult StdDialectCppPrinter::printConstantOp(ConstantOp constOp) { - state.nameState.addConstantValue(constOp.getResult(), constOp.getValue()); + if (isConstantScalarOp(constOp)) + { + state.nameState.addConstantValue(constOp.getResult(), constOp.getValue()); + } + else + { + RETURN_IF_FAILED(printer->printType(constOp.getType())); + os << " " + << state.nameState.getOrCreateName(constOp.getResult(), + SSANameState::SSANameKind::Constant); + // Now print out the constant value + os << " = "; + if (failed(printer->printAttribute(constOp.getValue()))) + return constOp.emitOpError("<>"); + } return success(); } diff --git a/accera/acc-translate/src/Target/Cpp/StdDialectCppPrinter.h b/accera/acc-translate/src/Target/Cpp/StdDialectCppPrinter.h index 94bfda16..cefb7fdd 100644 --- a/accera/acc-translate/src/Target/Cpp/StdDialectCppPrinter.h +++ b/accera/acc-translate/src/Target/Cpp/StdDialectCppPrinter.h @@ -6,10 +6,7 @@ #ifndef STD_DIALECT_CPP_PRINTER_H_ #define STD_DIALECT_CPP_PRINTER_H_ - -// #include "CppPrinter.h" -// #include "mlir/Dialect/StandardOps/IR/Ops.h" - + #include #include #include diff --git a/accera/acc-translate/src/Target/Cpp/TranslateToCpp.cpp b/accera/acc-translate/src/Target/Cpp/TranslateToCpp.cpp index efa83f83..e47e06ad 100644 --- a/accera/acc-translate/src/Target/Cpp/TranslateToCpp.cpp +++ b/accera/acc-translate/src/Target/Cpp/TranslateToCpp.cpp @@ -7,6 +7,13 @@ #include "TranslateToCpp.h" #include "CppPrinter.h" +#include +#include +#include +#include +#include +#include + using namespace llvm; namespace mlir @@ -15,7 +22,21 @@ namespace mlir LogicalResult translateModuleToCpp(Operation* m, raw_ostream& os) { cpp_printer::CppPrinter printer(os); +#if 0 + auto context = m->getContext(); + + PassManager pm(context); + auto& optPM = pm.nest(); + optPM.addPass(memref::createFoldSubViewOpsPass()); + optPM.addPass(createAffineScalarReplacementPass()); + pm.addPass(createCSEPass()); + pm.addPass(createCanonicalizerPass()); + if (failed(pm.run(m))) + { + return failure(); + } +#endif return printer.process(m); } diff --git a/accera/acc-translate/src/Target/Cpp/VectorDialectCppPrinter.cpp b/accera/acc-translate/src/Target/Cpp/VectorDialectCppPrinter.cpp index 20cbb7d5..bd263efa 100644 --- a/accera/acc-translate/src/Target/Cpp/VectorDialectCppPrinter.cpp +++ b/accera/acc-translate/src/Target/Cpp/VectorDialectCppPrinter.cpp @@ -5,6 +5,8 @@ //////////////////////////////////////////////////////////////////////////////////////////////////// #include 
"VectorDialectCppPrinter.h" +#include "AffineDialectCppPrinter.h" +#include using namespace mlir; @@ -29,10 +31,61 @@ namespace cpp_printer LogicalResult VectorDialectCppPrinter::printInsertElementOp(vector::InsertElementOp op) { + auto result = op.getResult(); + auto idx = state.nameState.getOrCreateName( + result, SSANameState::SSANameKind::Variable); + os << state.nameState.getName(op.dest()) << "["; os << state.nameState.getName(op.position()) << "]"; os << " = "; os << state.nameState.getName(op.source()); + os << ";\n"; + + RETURN_IF_FAILED(printer->printType(result.getType())); + os << " " << idx; + os << " = "; + os << state.nameState.getName(op.dest()); + + return success(); + } + + LogicalResult VectorDialectCppPrinter::printLoadOp(vector::LoadOp op) + { + return printer->printMemRefLoadOrStore(true, op.base(), op.getMemRefType(), op.indices(), op.getResult()); + } + + LogicalResult VectorDialectCppPrinter::printStoreOp(vector::StoreOp op) + { + return printer->printMemRefLoadOrStore(false, op.base(), op.getMemRefType(), op.indices(), op.valueToStore()); + } + + LogicalResult VectorDialectCppPrinter::printBroadcastOp(vector::BroadcastOp op) + { + auto vecTy = op.getVectorType(); + if (vecTy.getRank() != 1) { + os << "[[ only rank 1 vector is supported ]]"; + return failure(); + } + + auto vec = op.vector(); + auto source = op.source(); + + + auto idx = state.nameState.getOrCreateName( + vec, SSANameState::SSANameKind::Variable); + + RETURN_IF_FAILED(printer->printType(vecTy)); + os << " " << idx; + os << "{"; + for (int i = 0; i < vecTy.getNumElements(); i++) { + os << " "; + os << state.nameState.getName(source); + if (i != vecTy.getNumElements() - 1) { + os << ","; + } + } + os << "}"; + return success(); } @@ -46,6 +99,13 @@ namespace cpp_printer return printExtractElementOp(extractElementOp); if (auto insertElementOp = dyn_cast(op)) return printInsertElementOp(insertElementOp); + if (auto loadOp = dyn_cast(op)) + return printLoadOp(loadOp); + if (auto storeOp = dyn_cast(op)) + return printStoreOp(storeOp); + if (auto broadcastOp = dyn_cast(op)) + return printBroadcastOp(broadcastOp); + *consumed = false; return success(); diff --git a/accera/acc-translate/src/Target/Cpp/VectorDialectCppPrinter.h b/accera/acc-translate/src/Target/Cpp/VectorDialectCppPrinter.h index ef66667d..838bd823 100644 --- a/accera/acc-translate/src/Target/Cpp/VectorDialectCppPrinter.h +++ b/accera/acc-translate/src/Target/Cpp/VectorDialectCppPrinter.h @@ -27,6 +27,9 @@ namespace cpp_printer LogicalResult printDialectOperation(Operation* op, bool* skipped, bool* consumed) override; LogicalResult printExtractElementOp(vector::ExtractElementOp op); LogicalResult printInsertElementOp(vector::InsertElementOp op); + LogicalResult printLoadOp(vector::LoadOp op); + LogicalResult printStoreOp(vector::StoreOp op); + LogicalResult printBroadcastOp(vector::BroadcastOp op); }; } // namespace cpp_printer diff --git a/accera/accc/CMakeLists.txt b/accera/accc/CMakeLists.txt index a5d757e3..c3a5a4a1 100644 --- a/accera/accc/CMakeLists.txt +++ b/accera/accc/CMakeLists.txt @@ -37,7 +37,7 @@ if(MSVC) set(OBJ_EXTENSION .obj) set(EXE_EXTENSION .exe) set(CONFIG_IN_BUILT_PATH True) - set(ADDITIONAL_CMAKE_INIT_ARGS "-G \"Visual Studio 16 2019\" -A x64 -T host=x64") + set(ADDITIONAL_CMAKE_INIT_ARGS "-G \"Visual Studio 17 2022\" -A x64 -T host=x64") elseif(XCODE) set(CONFIG_IN_BUILT_PATH True) endif() diff --git a/accera/accc/accc.py b/accera/accc/accc.py index 12d3ae13..08788acd 100644 --- a/accera/accc/accc.py +++ 
b/accera/accc/accc.py @@ -32,10 +32,12 @@ class SystemTarget(Enum): class Runtime(Enum): - DEFAULT = "default" + NONE = "none" CUDA = "cuda" ROCM = "rocm" VULKAN = "vulkan" + OPENMP = "openmp" + DEFAULT = "default" system_target_options = [t.value for t in SystemTarget] diff --git a/accera/ir/CMakeLists.txt b/accera/ir/CMakeLists.txt index f3635519..4de63f7f 100644 --- a/accera/ir/CMakeLists.txt +++ b/accera/ir/CMakeLists.txt @@ -86,6 +86,7 @@ set(rcexec_src set(rcexec_include include/exec/CacheAccessMaps.h + include/exec/ExecutionOptions.h include/exec/ExecutionPlanAttributes.h include/exec/ExecutionPlanOps.h include/exec/VectorizationInfo.h diff --git a/accera/ir/include/Common.td b/accera/ir/include/Common.td index 557e668c..9fafa2e4 100644 --- a/accera/ir/include/Common.td +++ b/accera/ir/include/Common.td @@ -14,84 +14,84 @@ include "mlir/Interfaces/CallInterfaces.td" include "mlir/Interfaces/LoopLikeInterface.td" include "mlir/Interfaces/SideEffectInterfaces.td" -class rc_HasShape shape> : +class acc_HasShape shape> : CPred<"$_self.cast().hasStaticShape({" # !interleave(shape, ",") # "})">; -class rc_HasNumElements : +class acc_HasNumElements : CPred<"$_self.cast().getNumElements() == " # length>; -class rc_MemRefOfTypeWithShape allowedTypes, list shape> : - Type.predicate, rc_HasShape]>, +class acc_MemRefOfTypeWithShape allowedTypes, list shape> : + Type.predicate, acc_HasShape]>, MemRefOf.description # " with shape { " # !interleave(shape, ",") # " }">; -class rc_MemRefOfTypeWithNumElements allowedTypes, int length> : - Type.predicate, rc_HasNumElements]>, +class acc_MemRefOfTypeWithNumElements allowedTypes, int length> : + Type.predicate, acc_HasNumElements]>, MemRefOf.description # " with " # length # " elements">; -class rc_MemRefWithShape shape> : - Type]>, +class acc_MemRefWithShape shape> : + Type]>, AnyStaticShapeMemRef.description # " with shape { " # !interleave(shape, ",") # " }">; -class rc_MemRefWithNumElements : - Type]>, +class acc_MemRefWithNumElements : + Type]>, AnyStaticShapeMemRef.description # " with " # length # " elements">; -class rc_TensorOfTypeWithShape allowedTypes, list shape> : - Type.predicate, rc_HasShape]>, +class acc_TensorOfTypeWithShape allowedTypes, list shape> : + Type.predicate, acc_HasShape]>, TensorOf.description # " with shape { " # !interleave(shape, ",") # " }">; -class rc_TensorOfTypeWithNumElements allowedTypes, int length> : - Type.predicate, rc_HasNumElements]>, +class acc_TensorOfTypeWithNumElements allowedTypes, int length> : + Type.predicate, acc_HasNumElements]>, TensorOf.description # " with " # length # " elements">; -class rc_TensorWithShape shape> : - Type]>, +class acc_TensorWithShape shape> : + Type]>, AnyStaticShapeTensor.description # " with shape { " # !interleave(shape, ",") # " }">; -class rc_TensorWithNumElements : - Type]>, +class acc_TensorWithNumElements : + Type]>, AnyStaticShapeTensor.description # " with " # length # " elements">; -class rc_ContainerOfTypeWithShape allowedTypes, list shape> : - Type.predicate, - rc_TensorOfTypeWithShape.predicate]>, +class acc_ContainerOfTypeWithShape allowedTypes, list shape> : + Type.predicate, + acc_TensorOfTypeWithShape.predicate]>, MemRefOf.description # " or " # TensorOf.description # " with shape { " # !interleave(shape, ",") # " }">; -class rc_ContainerOfTypeWithNumElements allowedTypes, int length> : - Type.predicate, - rc_TensorOfTypeWithNumElements.predicate]>, +class acc_ContainerOfTypeWithNumElements allowedTypes, int length> : + Type.predicate, + 
acc_TensorOfTypeWithNumElements.predicate]>, MemRefOf.description # " or " # TensorOf.description # " with " # length # " elements">; -class rc_ContainerWithShape shape> : - Type.predicate, rc_TensorWithShape.predicate]>, - rc_MemRefWithShape.description # " or " # - rc_TensorWithShape.description>; +class acc_ContainerWithShape shape> : + Type.predicate, acc_TensorWithShape.predicate]>, + acc_MemRefWithShape.description # " or " # + acc_TensorWithShape.description>; -class rc_ContainerWithNumElements : - Type.predicate, rc_TensorWithNumElements.predicate]>, - rc_MemRefWithNumElements.description # " or " # - rc_TensorWithNumElements.description>; +class acc_ContainerWithNumElements : + Type.predicate, acc_TensorWithNumElements.predicate]>, + acc_MemRefWithNumElements.description # " or " # + acc_TensorWithNumElements.description>; -def rc_NumericType : +def acc_NumericType : Type, "Arithmetic type">; -def rc_ScalarOrVectorNumericType : - AnyTypeOf<[rc_NumericType, VectorOf<[rc_NumericType]>]>; +def acc_ScalarOrVectorNumericType : + AnyTypeOf<[acc_NumericType, VectorOf<[acc_NumericType]>]>; -class rc_Scalarlike : - AnyTypeOf<[type, rc_ContainerOfTypeWithNumElements<[type], 1>]>; +class acc_Scalarlike : + AnyTypeOf<[type, acc_ContainerOfTypeWithNumElements<[type], 1>]>; -def rc_Indexlike : AnyTypeOf<[Index, rc_Scalarlike]>; +def acc_Indexlike : AnyTypeOf<[Index, acc_Scalarlike]>; -def rc_BoolType : Type; +def acc_BoolType : Type; -def rc_ScalarOrVectorBoolType : - AnyTypeOf<[rc_BoolType, VectorOf<[rc_BoolType]>]>; +def acc_ScalarOrVectorBoolType : + AnyTypeOf<[acc_BoolType, VectorOf<[acc_BoolType]>]>; #endif // ACCERA_COMMON_TD diff --git a/accera/ir/include/IRUtil.h b/accera/ir/include/IRUtil.h index c731d01b..51094085 100644 --- a/accera/ir/include/IRUtil.h +++ b/accera/ir/include/IRUtil.h @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -31,7 +32,15 @@ namespace mlir { class AffineForOp; +class AffineLoadOp; +class AffineStoreOp; class OpBuilder; + +namespace memref +{ + class LoadOp; + class StoreOp; +} // namespace memref } // namespace mlir namespace accera::ir @@ -248,16 +257,17 @@ namespace util mlir::Operation* CloneRecursively(mlir::OpBuilder& builder, mlir::Operation* op, mlir::BlockAndValueMapping& mapping); - std::optional ResolveExecutionTarget(mlir::Operation* op); - std::optional ResolveExecutionRuntime(mlir::Operation* op); + std::optional ResolveExecutionTarget(mlir::Operation* op, bool exact = false); + std::optional ResolveExecutionRuntime(mlir::Operation* op, bool exact = false); + std::optional ResolveWarpSize(mlir::Operation* op); mlir::Operation* CreateGPUControlBarrier(mlir::OpBuilder& builder, const std::string scope, std::optional loc = std::nullopt); std::optional GetDimSizeAt(const loopnest::Index& dimensionIndex, mlir::Operation* where); - std::vector GetCurrentIndexIVs(const std::vector& loopIndices, mlir::Operation* where); + std::vector GetCurrentIndexIVs(const std::vector& loopIndices, mlir::Operation* where, const std::vector>& unrealizedLoopNestIndices = {}); - std::vector GetCurrentIndexIVs(const std::vector& loopIndices, mlir::Block* where); + std::vector GetCurrentIndexIVs(const std::vector& loopIndices, mlir::Block* where, const std::vector>& unrealizedLoopNestIndices = {}); std::vector GetIndicesForLoopIVs(const std::vector& loopIVs); @@ -335,5 +345,42 @@ namespace util return true; } + mlir::AffineMap GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::AffineStoreOp op); + mlir::AffineMap 
GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::AffineLoadOp op); + mlir::AffineMap GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::memref::StoreOp op); + mlir::AffineMap GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::memref::LoadOp op); + + mlir::AffineMap ComposeAffineMapSequence(const std::vector& maps); + + struct TempOpCleanupGuard + { + TempOpCleanupGuard(std::stack* opStack, mlir::PatternRewriter& rewriter); + ~TempOpCleanupGuard(); + std::stack* _opStack; + mlir::PatternRewriter& _rewriter; + }; + + mlir::Attribute MemorySpaceToAttribute(const value::MemorySpace& memorySpace, mlir::MLIRContext* context); + value::MemorySpace AttributeToMemorySpace(mlir::Attribute memorySpaceAttr); + + // Similar to mlir::AffineMap::getMinorIdentityMap(), but instead this creates the mapping + // (d0, ..., dn) -> (d0, ... , dk) where n = (num dims - 1) and k = (num results - 1) + mlir::AffineMap GetMajorIdentityMap(unsigned dims, unsigned results, mlir::MLIRContext* context); + + template + mlir::Operation* GetHighestAncestorOfType(mlir::Operation* op) + { + mlir::Operation* current = nullptr; + mlir::Operation* parent = op->getParentOfType(); + while (parent != nullptr) + { + current = parent; + parent = current->getParentOfType(); + } + return current; + } + + void EraseAllOpsInBlock(mlir::PatternRewriter& rewriter, mlir::Block& block); + } // namespace util } // namespace accera::ir diff --git a/accera/ir/include/accera/AcceraOps.h b/accera/ir/include/accera/AcceraOps.h index 188a7d38..e1b30e45 100644 --- a/accera/ir/include/accera/AcceraOps.h +++ b/accera/ir/include/accera/AcceraOps.h @@ -6,11 +6,12 @@ #pragma once +#include #include -#include #include -#include #include +#include +#include #include #include @@ -29,7 +30,10 @@ using llvm::SmallVectorImpl; using llvm::StringRef; using mlir::AffineMap; +using mlir::AffineMapAccessInterface; using mlir::AffineMapAttr; +using mlir::AffineReadOpInterface; +using mlir::AffineWriteOpInterface; using mlir::Attribute; using mlir::Block; using mlir::Builder; diff --git a/accera/ir/include/exec/ExecutionOptions.h b/accera/ir/include/exec/ExecutionOptions.h new file mode 100644 index 00000000..06f891e4 --- /dev/null +++ b/accera/ir/include/exec/ExecutionOptions.h @@ -0,0 +1,82 @@ +//////////////////////////////////////////////////////////////////////////////////////////////////// +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. See LICENSE in the project root for license information. 
+// Authors: Abdul Dakkak, Mason Remy +//////////////////////////////////////////////////////////////////////////////////////////////////// + +#pragma once + +#include + +#include + +#include +#include +#include + +namespace accera::ir +{ +namespace targets +{ + // A struct encapsulating x, y, z indices for a GPU processor + struct Dim3 + { + /// The x index + int64_t x; + /// The y index + int64_t y; + /// The z index + int64_t z; + + Dim3(int64_t x_ = 1, int64_t y_ = 1, int64_t z_ = 1) : + x(x_), y(y_), z(z_) {} + }; + + /// The CPU execution options + struct CPU + {}; + + /// The GPU execution options + struct GPU + { + /// Indicates the grid + Dim3 grid; + + /// Indicates the block + Dim3 block; + + GPU(Dim3 grid_ = Dim3(1, 1, 1), Dim3 block_ = Dim3(1, 1, 1)) : + grid(grid_), block(block_){}; + + static GPU FromArrayAttr(const mlir::ArrayAttr& arrayAttr) + { + auto launchParams = util::ConvertArrayAttrToIntVector(arrayAttr); + Dim3 gridDimSizes(launchParams[0], launchParams[1], launchParams[2]); + Dim3 blockDimSizes(launchParams[3], launchParams[4], launchParams[5]); + return { gridDimSizes, blockDimSizes }; + } + + mlir::ArrayAttr ToArrayAttr(mlir::MLIRContext* context) + { + std::vector gridAndBlockDims{ grid.x, grid.y, grid.z, block.x, block.y, block.z }; + return util::VectorToArrayAttr( + gridAndBlockDims, [&](const int64_t& intVal) { + return mlir::IntegerAttr::get(mlir::IntegerType::get(context, 64), intVal); + }, + context); + } + }; + + using Target = std::variant; + enum class Runtime : int + { + NONE, + CUDA, + ROCM, + VULKAN, + OPENMP, + DEFAULT + }; + +} // namespace targets +} // namespace accera::ir \ No newline at end of file diff --git a/accera/ir/include/exec/ExecutionPlanAttributes.h b/accera/ir/include/exec/ExecutionPlanAttributes.h index 0015132d..64aa53a2 100644 --- a/accera/ir/include/exec/ExecutionPlanAttributes.h +++ b/accera/ir/include/exec/ExecutionPlanAttributes.h @@ -134,7 +134,7 @@ namespace executionPlan ValueType getValue() const; - static llvm::StringRef getKeyName() { return "rcxp_vectorizationInfo"; } + static llvm::StringRef getKeyName() { return "accxp_vectorizationInfo"; } }; class ParallelizationInfoAttr @@ -148,7 +148,7 @@ namespace executionPlan ValueType getValue() const; - static llvm::StringRef getKeyName() { return "rcxp_parallelizationInfo"; } + static llvm::StringRef getKeyName() { return "accxp_parallelizationInfo"; } }; class TensorizationInfoAttr @@ -162,7 +162,7 @@ namespace executionPlan ValueType getValue() const; - static llvm::StringRef getKeyName() { return "rcxp_tensorizationInfo"; } + static llvm::StringRef getKeyName() { return "accxp_tensorizationInfo"; } }; class InPlaceUnrollInfoAttr @@ -176,7 +176,7 @@ namespace executionPlan ValueType getValue() const; - static llvm::StringRef getKeyName() { return "rcxp_inPlaceUnrollInfo"; } + static llvm::StringRef getKeyName() { return "accxp_inPlaceUnrollInfo"; } }; // diff --git a/accera/ir/include/exec/ExecutionPlanAttrs.td b/accera/ir/include/exec/ExecutionPlanAttrs.td index 46d30d18..339e3af8 100644 --- a/accera/ir/include/exec/ExecutionPlanAttrs.td +++ b/accera/ir/include/exec/ExecutionPlanAttrs.td @@ -16,7 +16,7 @@ def CACHE_MAPPING_LOGICAL_TO_PHYSICAL : I64EnumAttrCase<"LogicalToPhysical", 1>; def CACHE_MAPPING_LOGICAL_TO_GLOBAL : I64EnumAttrCase<"LogicalToGlobal", 2>; def CACHE_MAPPING_NONE : I64EnumAttrCase<"None", 3>; -def rcxp_CacheMappingAttr : I64EnumAttr< +def accxp_CacheMappingAttr : I64EnumAttr< "CacheIndexing", "An attribute containing a cache mapping type enum. 
This indicates which cache access map to use to access a cache.", [CACHE_MAPPING_GLOBAL_TO_PHYSICAL, CACHE_MAPPING_LOGICAL_TO_PHYSICAL, CACHE_MAPPING_LOGICAL_TO_GLOBAL, CACHE_MAPPING_NONE]> { @@ -26,7 +26,7 @@ def rcxp_CacheMappingAttr : I64EnumAttr< def CACHE_ALLOCATION_AUTOMATIC : I64EnumAttrCase<"Automatic", 0>; def CACHE_ALLOCATION_NONE : I64EnumAttrCase<"None", 1>; -def rcxp_CacheAllocationAttr : I64EnumAttr< +def accxp_CacheAllocationAttr : I64EnumAttr< "CacheAllocation", "An attribute containing a cache allocation type enum", [CACHE_ALLOCATION_AUTOMATIC, CACHE_ALLOCATION_NONE]> { let cppNamespace = "::accera::ir::executionPlan"; @@ -35,7 +35,7 @@ def rcxp_CacheAllocationAttr : I64EnumAttr< def CACHE_COPY_SRC_DIMS : I64EnumAttrCase<"Source", 0>; def CACHE_COPY_DST_DIMS : I64EnumAttrCase<"Destination", 1>; -def rcxp_CacheCopyDimensionsAttr : I64EnumAttr< +def accxp_CacheCopyDimensionsAttr : I64EnumAttr< "CacheCopyDimensions", "An attribute containing a cache copy shape dimensions assignment. This indicates which side of the copy is represented by the copy dimension sizes.", [CACHE_COPY_SRC_DIMS, CACHE_COPY_DST_DIMS]> { let cppNamespace = "::accera::ir::executionPlan"; diff --git a/accera/ir/include/exec/ExecutionPlanInterfaces.td b/accera/ir/include/exec/ExecutionPlanInterfaces.td index 90c751ca..cb54efc3 100644 --- a/accera/ir/include/exec/ExecutionPlanInterfaces.td +++ b/accera/ir/include/exec/ExecutionPlanInterfaces.td @@ -7,7 +7,7 @@ include "mlir/IR/OpBase.td" // Cache region common interface // -def rcxp_BeginCacheRegionOpInterface : OpInterface<"BeginCacheRegion"> { +def accxp_BeginCacheRegionOpInterface : OpInterface<"BeginCacheRegion"> { let description = [{ Interface for the cache region begin ops }]; @@ -26,7 +26,7 @@ def rcxp_BeginCacheRegionOpInterface : OpInterface<"BeginCacheRegion"> { ]; } -def rcxp_EndCacheRegionOpInterface : OpInterface<"EndCacheRegion"> { +def accxp_EndCacheRegionOpInterface : OpInterface<"EndCacheRegion"> { let description = [{ Interface for the cache region end ops }]; diff --git a/accera/ir/include/exec/ExecutionPlanOps.h b/accera/ir/include/exec/ExecutionPlanOps.h index 3f76635e..47695276 100644 --- a/accera/ir/include/exec/ExecutionPlanOps.h +++ b/accera/ir/include/exec/ExecutionPlanOps.h @@ -177,4 +177,6 @@ CacheAccessContext MakeCacheAccessContext( mlir::Value cache, CacheInfo& cacheInfo); +DelayedMappingRegionOp MakeDelayedMappingRegion(mlir::OpBuilder& builder, mlir::Value from, mlir::Value to, std::function body); + } // namespace accera::ir::executionPlan diff --git a/accera/ir/include/exec/ExecutionPlanOps.td b/accera/ir/include/exec/ExecutionPlanOps.td index fc704e7a..495d52e3 100644 --- a/accera/ir/include/exec/ExecutionPlanOps.td +++ b/accera/ir/include/exec/ExecutionPlanOps.td @@ -25,7 +25,7 @@ def ExecutionPlan_Dialect : Dialect { // Attributes // -def rcxp_VectorizationInfoAttr : DialectAttr< +def accxp_VectorizationInfoAttr : DialectAttr< ExecutionPlan_Dialect, CPred<"$_self.isa()">, "Vectorization info attribute"> { @@ -35,7 +35,7 @@ def rcxp_VectorizationInfoAttr : DialectAttr< let constBuilderCall = "$0"; } -def rcxp_InPlaceUnrollInfoAttr : DialectAttr< +def accxp_InPlaceUnrollInfoAttr : DialectAttr< ExecutionPlan_Dialect, CPred<"$_self.isa()">, "In-place unroll info attribute"> { @@ -45,7 +45,7 @@ def rcxp_InPlaceUnrollInfoAttr : DialectAttr< let constBuilderCall = "$0"; } -def rcxp_ParallelizationInfoAttr : DialectAttr< +def accxp_ParallelizationInfoAttr : DialectAttr< ExecutionPlan_Dialect, CPred<"$_self.isa()">, 
"Parallelization info attribute"> { @@ -55,7 +55,7 @@ def rcxp_ParallelizationInfoAttr : DialectAttr< let constBuilderCall = "$0"; } -def rcxp_TensorizationInfoAttr : DialectAttr< +def accxp_TensorizationInfoAttr : DialectAttr< ExecutionPlan_Dialect, CPred<"$_self.isa()">, "Parallelization info attribute"> { @@ -70,17 +70,28 @@ def rcxp_TensorizationInfoAttr : DialectAttr< // * The parent dialect of the operation. // * The mnemonic for the operation, or the name without the dialect prefix. // * A list of traits for the operation. -class rcxp_Op traits = []> : +class accxp_Op traits = []> : Op; // // ExecutionPlan Operations // + +// +// TerminatorOp +// +def accxp_TerminatorOp : accxp_Op<"terminator", [Terminator]> { + let summary = "cf terminator operation"; + let description = [{ + "accxp.terminator" is a terminator operation for blocks used in ExecutionPlan op regions. + }]; +} + // // MakeCacheOp // -def rcxp_MakeCacheOp : rcxp_Op<"make_cache"> { +def accxp_MakeCacheOp : accxp_Op<"make_cache"> { let summary = "Operation to infer and allocate a cache shape and viewing maps"; let description = [{ The "accxp.make_cache" operation lowers to an allocated cache in global memory. @@ -109,8 +120,8 @@ def rcxp_MakeCacheOp : rcxp_Op<"make_cache"> { let extraClassDeclaration = [{ mlir::AffineValueMap insertCachePosition(const std::vector& multiCacheIndexIterationCounters, const std::vector& offsetAccessIVs, const std::vector& baseArrayIndices); - mlir::AffineValueMap insertCachePosition(mlir::Operation* where, const std::vector& baseArrayIndices); - mlir::AffineValueMap insertCachePosition(mlir::Block* where, const std::vector& baseArrayIndices); + mlir::AffineValueMap insertCachePosition(mlir::Operation* where, const std::vector& baseArrayIndices, const std::vector>& unrealizedLoopNestIndices); + mlir::AffineValueMap insertCachePosition(mlir::Block* where, const std::vector& baseArrayIndices, const std::vector>& unrealizedLoopNestIndices); std::vector getBaseArrayPosition(mlir::AffineLoadOp loadOp); std::vector getBaseArrayPosition(mlir::AffineStoreOp storeOp); }]; @@ -119,7 +130,7 @@ def rcxp_MakeCacheOp : rcxp_Op<"make_cache"> { // // ActiveElementCacheCopyOp // -def rcxp_ActiveElementCacheCopyOp : rcxp_Op<"active_element_cache_copy", []> { +def accxp_ActiveElementCacheCopyOp : accxp_Op<"active_element_cache_copy", []> { let summary = "memory reshaping and cache data copying operation"; let description = [{ The "accxp.active_element_cache_copy" operation describes a memory shape for a cached piece of data and produces code to copy to or from the cache. @@ -157,7 +168,7 @@ def rcxp_ActiveElementCacheCopyOp : rcxp_Op<"active_element_cache_copy", []> { // // ActiveBlockCacheCopyOp // -def rcxp_ActiveBlockCacheCopyOp : rcxp_Op<"active_block_cache_copy", [AttrSizedOperandSegments]> { +def accxp_ActiveBlockCacheCopyOp : accxp_Op<"active_block_cache_copy", [AttrSizedOperandSegments]> { let summary = "memory reshaping and cache data copying operation"; let description = [{ The "accxp.active_block_cache_copy" operation describes a memory shape for a cached piece of data and produces code to copy to or from the cache. 
@@ -171,13 +182,16 @@ def rcxp_ActiveBlockCacheCopyOp : rcxp_Op<"active_block_cache_copy", [AttrSizedO ArrayAttr:$lbMaps, ArrayAttr:$ubMaps, AffineMapAttr:$activeBlockToCacheMap, - UnitAttr:$toCache); + UnitAttr:$toCache, + StrAttr:$activeBlockTag, + UnitAttr:$thrifty, + UnitAttr:$skipBarriers); // TODO : remove this once barrier analysis hoists barriers out of loops } // // MultiCacheCopyOp // -def rcxp_MultiCacheCopyOp : rcxp_Op<"multi_cache_copy"> { +def accxp_MultiCacheCopyOp : accxp_Op<"multi_cache_copy"> { let summary = "memory reshaping and cache data copying operation"; let description = [{ The "accxp.multi_cache_copy" operation describes a memory shape for a cached piece of data and produces code to copy to or from the multicache. @@ -194,25 +208,30 @@ def rcxp_MultiCacheCopyOp : rcxp_Op<"multi_cache_copy"> { ArrayAttr:$activeBlockLowerBoundMaps, ArrayAttr:$activeBlockUpperBoundMaps, AffineMapAttr:$externalSymbolsPermutationMap, - AffineMapAttr:$activeBlockToCacheMap); + AffineMapAttr:$activeBlockToCacheMap, + StrAttr:$activeBlockTag, + UnitAttr:$thrifty, + UnitAttr:$toCache); } // // CacheZeroOp // -def rcxp_CacheZeroOp : rcxp_Op<"cache_zero"> { +def accxp_CacheZeroOp : accxp_Op<"cache_zero"> { let summary = "cache zeroing operation"; let description = [{ The "accxp.cacheZeroOp" operation zeros out a cache }]; - let arguments = (ins AnyMemRef:$cache); + let arguments = (ins AnyMemRef:$cache, + StrAttr:$activeBlockTag, + UnitAttr:$thrifty); } // // ActiveElementCacheReduceOp // -def rcxp_ActiveElementCacheReduceOp : rcxp_Op<"active_element_cache_reduce", [AttrSizedOperandSegments]> { +def accxp_ActiveElementCacheReduceOp : accxp_Op<"active_element_cache_reduce", [AttrSizedOperandSegments]> { let summary = "memory reshaping and cache reducing operation"; let description = [{ The "accxp.active_element_cache_reduce" operation describes a memory shape for a cached piece of data and produces code to copy from the cache back to the output. @@ -252,7 +271,7 @@ def rcxp_ActiveElementCacheReduceOp : rcxp_Op<"active_element_cache_reduce", [At // // ActiveBlockCacheReduceOp // -def rcxp_ActiveBlockCacheReduceOp : rcxp_Op<"active_block_cache_reduce", [AttrSizedOperandSegments]> { +def accxp_ActiveBlockCacheReduceOp : accxp_Op<"active_block_cache_reduce", [AttrSizedOperandSegments]> { let summary = "memory reshaping and cache reducing operation"; let description = [{ The "accxp.active_block_cache_reduce" operation describes a memory shape for a cached piece of data and produces code to copy from the cache back to the output. 
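Aside: the cache copy, zero, and reduce ops in this file gain a string `activeBlockTag` plus UnitAttr flags such as `thrifty` (and `skipBarriers` on the active-block copy). A minimal sketch of how a later pass might test such a UnitAttr flag on a generic operation; the helper below is illustrative and not part of the patch:

```cpp
#include "mlir/IR/Operation.h"

// A UnitAttr reads back as presence or absence of the named attribute:
// present means the flag is set, absent means it is not.
static bool hasThriftyFlag(mlir::Operation* op)
{
    return op->hasAttr("thrifty");
}
```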
@@ -265,7 +284,9 @@ def rcxp_ActiveBlockCacheReduceOp : rcxp_Op<"active_block_cache_reduce", [AttrSi ArrayAttr:$lbMaps, ArrayAttr:$ubMaps, AffineMapAttr:$activeBlockToCacheMap, - Variadic:$scaleValues); + Variadic:$scaleValues, + StrAttr:$activeBlockTag, + UnitAttr:$thrifty); let builders = [ OpBuilder<(ins @@ -275,16 +296,18 @@ def rcxp_ActiveBlockCacheReduceOp : rcxp_Op<"active_block_cache_reduce", [AttrSi "ValueRange":$ubOperands, "mlir::ArrayAttr":$lbMaps, "mlir::ArrayAttr":$ubMaps, - "mlir::AffineMap":$activeBlockToCacheMap + "mlir::AffineMap":$activeBlockToCacheMap, + "StringRef":$activeBlockTag, + "bool":$thrifty )>]; } // // BeginCacheMappingOp // -def rcxp_BeginCacheMappingOp : rcxp_Op<"begin_cache_mapping", +def accxp_BeginCacheMappingOp : accxp_Op<"begin_cache_mapping", [AttrSizedOperandSegments, - DeclareOpInterfaceMethods]> { + DeclareOpInterfaceMethods]> { let summary = "Maps input replacement value to cache value and replaces registered operations as appropriate"; let description = [{ The "accxp.begin_cache_mapping" operation marks the start of the subgraph graph to @@ -325,8 +348,8 @@ def rcxp_BeginCacheMappingOp : rcxp_Op<"begin_cache_mapping", // // EndCacheMappingOp // -def rcxp_EndCacheMappingOp : rcxp_Op<"end_cache_mapping", - [DeclareOpInterfaceMethods]> { +def accxp_EndCacheMappingOp : accxp_Op<"end_cache_mapping", + [DeclareOpInterfaceMethods]> { let summary = "Denotes the end of the graph section to replace mappings for"; let description = [{ The "accxp.end_cache_mapping" operation marks the end the of the subgraph to @@ -340,10 +363,10 @@ def rcxp_EndCacheMappingOp : rcxp_Op<"end_cache_mapping", // // BeginCacheRegionOp // -def rcxp_BeginCacheRegionOp : rcxp_Op<"begin_cache_region", +def accxp_BeginCacheRegionOp : accxp_Op<"begin_cache_region", [AttrSizedOperandSegments, - DeclareOpInterfaceMethods, - DeclareOpInterfaceMethods]> { + DeclareOpInterfaceMethods, + DeclareOpInterfaceMethods]> { let summary = "Denotes the beginning of the subgraph where a cache is active. Lowers to the appropriate cache data moving ops and cache mapping ops"; let description = [{ The "accxp.begin_cache_region" operation marks the beginning of the subraph where the cache is active. 
@@ -360,12 +383,15 @@ def rcxp_BeginCacheRegionOp : rcxp_Op<"begin_cache_region", ArrayAttr:$cacheRegionRelevantIndexRanges, ArrayAttr:$cacheRegionBaseIndices, DictionaryAttr:$cacheAccessMaps, - rcln_IndexAttr:$triggerIndex, - rcln_IndexAttr:$cacheIndex, + accln_IndexAttr:$triggerIndex, + accln_IndexAttr:$cacheIndex, I64Attr:$id, I64Attr:$cacheHierarchyLevel, UnitAttr:$activeBlockCache, - UnitAttr:$dimReorderCache); + UnitAttr:$dimReorderCache, + UnitAttr:$thrifty, + UnitAttr:$doubleBufferCache, + OptionalAttr:$doubleBufferMemorySpace); let results = (outs Index:$resultId); @@ -380,7 +406,10 @@ def rcxp_BeginCacheRegionOp : rcxp_Op<"begin_cache_region", "int64_t":$id, "int64_t":$cacheHierarchyLevel, "bool":$activeBlockCache, - "bool":$dimReorderCache)> + "bool":$dimReorderCache, + "bool":$thrifty, + "bool":$doubleBufferCache, + "MemorySpace":$doubleBufferMemorySpace)> ]; let extraClassDeclaration = [{ @@ -392,8 +421,8 @@ def rcxp_BeginCacheRegionOp : rcxp_Op<"begin_cache_region", // // EndCacheRegionOp // -def rcxp_EndCacheRegionOp : rcxp_Op<"end_cache_region", - [DeclareOpInterfaceMethods]> { +def accxp_EndCacheRegionOp : accxp_Op<"end_cache_region", + [DeclareOpInterfaceMethods]> { let summary = "Denotes the end of the graph section the cache is active for"; let description = [{ The "accxp.end_cache_region" operation marks the end the of the subgraph to @@ -407,9 +436,9 @@ def rcxp_EndCacheRegionOp : rcxp_Op<"end_cache_region", // // BeginMaxElementCacheRegionOp // -def rcxp_BeginMaxElementCacheRegionOp : rcxp_Op<"begin_max_element_cache_region", - [DeclareOpInterfaceMethods, - DeclareOpInterfaceMethods]> { +def accxp_BeginMaxElementCacheRegionOp : accxp_Op<"begin_max_element_cache_region", + [DeclareOpInterfaceMethods, + DeclareOpInterfaceMethods]> { let summary = "Denotes the beginning of the subgraph where a max element cache is active. Lowers to a begin_cache_region op at the appropriate level for the element budget"; let description = [{ The "accxp.begin_max_element_cache_region" operation marks the beginning of the subraph where the max element cache is active. 
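Aside: begin_cache_region (and, below, begin_max_element_cache_region) now also carries `thrifty`, `doubleBufferCache`, and an optional `doubleBufferMemorySpace`. A rough sketch of consuming the optional attribute with the AttributeToMemorySpace helper declared in IRUtil.h earlier in this patch; the surrounding function is assumed, not taken from the patch:

```cpp
#include "mlir/IR/Operation.h"
// Also assumes the declarations added to accera's IRUtil.h above are visible.

// An OptionalAttr argument may simply be absent, in which case getAttr returns
// a null Attribute, so the null case has to be handled explicitly.
static void inspectDoubleBuffering(mlir::Operation* op)
{
    if (auto spaceAttr = op->getAttr("doubleBufferMemorySpace"))
    {
        auto space = accera::ir::util::AttributeToMemorySpace(spaceAttr);
        (void)space; // e.g. decide where the second buffer lives (Shared vs. Private)
    }
}
```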
@@ -422,11 +451,14 @@ def rcxp_BeginMaxElementCacheRegionOp : rcxp_Op<"begin_max_element_cache_region" AnyMemRef:$cache, AnyMemRef:$baseInput, DictionaryAttr:$cacheAccessMaps, - rcln_IndexAttr:$innermostLoopNestIndex, // The max element cache is initially positioned around the innermost loop index, but will be hoisted out as part of lowering + accln_IndexAttr:$innermostLoopNestIndex, // The max element cache is initially positioned around the innermost loop index, but will be hoisted out as part of lowering I64Attr:$id, I64Attr:$cacheHierarchyLevel, I64Attr:$maxElements, - UnitAttr:$dimReorderCache); + UnitAttr:$dimReorderCache, + UnitAttr:$thrifty, + UnitAttr:$doubleBufferCache, + OptionalAttr:$doubleBufferMemorySpace); let results = (outs Index:$resultId); @@ -441,8 +473,41 @@ def rcxp_BeginMaxElementCacheRegionOp : rcxp_Op<"begin_max_element_cache_region" "loopnest::Index":$innermostLoopNestIndex, "int64_t":$id, "int64_t":$cacheHierarchyLevel, - "bool":$dimReorderCache)> + "bool":$dimReorderCache, + "bool":$thrifty, + "bool":$doubleBufferCache, + "MemorySpace":$doubleBufferMemorySpace)> + ]; +} + +// +// DelayedMappingRegionOp +// +def accxp_DelayedMappingRegionOp : accxp_Op<"delayed_mapping_region_op", [SingleBlockImplicitTerminator<"TerminatorOp">]> { + let summary = "Holds a mapping from one value to another to be applied to all the operations within its region"; + let description = [{ + The "accxp.delayed_mapping_region_op" operation will map one value to another for all the ops in its region. + This op exists as a way to replace one value with another in ops that haven't been fully expanded to consume + the "from" SSA value yet, as is sometimes the case with accesses into cache memrefs. + }]; + + let arguments = (ins AnyType:$from, + AnyType:$to); + + let regions = (region AnyRegion:$region); + + let skipDefaultBuilders = 1; + let builders = [ + OpBuilder<(ins + "Value":$from, + "Value":$to)> ]; + + let extraClassDeclaration = [{ + mlir::OpBuilder getBodyBuilder() { + return mlir::OpBuilder(®ion().front(), std::prev(region().front().end())); + } + }]; } #endif // EXECUTIONPLAN_OPS diff --git a/accera/ir/include/exec/TensorizationInfo.h b/accera/ir/include/exec/TensorizationInfo.h index 68c8f73c..2154a643 100644 --- a/accera/ir/include/exec/TensorizationInfo.h +++ b/accera/ir/include/exec/TensorizationInfo.h @@ -5,17 +5,19 @@ #pragma once +#include + namespace accera::ir { namespace executionPlan { struct TensorizationInfo { - std::vector dim{16,16,16}; + std::array dim{0,0,0}; private: friend inline bool operator==(const TensorizationInfo& p1, const TensorizationInfo& p2) { - return p1.dim[0] == p2.dim[0] && p1.dim[1] == p2.dim[1] && p1.dim[2] == p2.dim[2]; + return p1.dim == p2.dim; } friend inline bool operator!=(const TensorizationInfo& p1, const TensorizationInfo& p2) { diff --git a/accera/ir/include/nest/Index.h b/accera/ir/include/nest/Index.h index dbc68a9c..b17d8382 100644 --- a/accera/ir/include/nest/Index.h +++ b/accera/ir/include/nest/Index.h @@ -34,6 +34,7 @@ namespace loopnest Id GetId() const; static Index none; + static constexpr Id DefaultID = -1; private: static int GetNextId(); @@ -43,7 +44,7 @@ namespace loopnest friend inline bool operator<(const Index& i1, const Index& i2) { return i1.GetId() < i2.GetId(); } std::string _name; - Id _id = -1; + Id _id = Index::DefaultID; }; struct SplitIndex diff --git a/accera/ir/include/nest/LoopNestAttrs.td b/accera/ir/include/nest/LoopNestAttrs.td index 16326603..c5248756 100644 --- a/accera/ir/include/nest/LoopNestAttrs.td 
+++ b/accera/ir/include/nest/LoopNestAttrs.td @@ -25,7 +25,7 @@ def FRAGMENT_ALL : I64EnumAttrCase<"all", 3>; def FRAGMENT_SELECT : I64EnumAttrCase<"select", 4>; def FRAGMENT_RANGE : I64EnumAttrCase<"range", 5>; -def rcln_FragmentTypeAttr : I64EnumAttr< +def accln_FragmentTypeAttr : I64EnumAttr< "FragmentType", "An attribute containing a Fragment enum", [FRAGMENT_FIRST, FRAGMENT_LAST, FRAGMENT_END_BOUNDARY, FRAGMENT_ALL, FRAGMENT_SELECT, FRAGMENT_RANGE]> { let cppNamespace = "::accera::ir::loopnest"; @@ -34,7 +34,7 @@ def rcln_FragmentTypeAttr : I64EnumAttr< def PLACEMENT_BEFORE : I64EnumAttrCase<"before", 0>; def PLACEMENT_AFTER : I64EnumAttrCase<"after", 1>; -def rcln_PlacementPredicateAttr : I64EnumAttr< +def accln_PlacementPredicateAttr : I64EnumAttr< "PlacementType", "An attribute containing a Placement enum", [PLACEMENT_BEFORE, PLACEMENT_AFTER]> { let cppNamespace = "::accera::ir::loopnest"; @@ -44,13 +44,13 @@ def POSITION_PROLOGUE : I64EnumAttrCase<"prologue", 0>; def POSITION_BODY : I64EnumAttrCase<"body", 1>; def POSITION_EPILOGUE : I64EnumAttrCase<"epilogue", 2>; -def rcln_PositionAttr : I64EnumAttr< +def accln_PositionAttr : I64EnumAttr< "Position", "An attribute containing a Placement enum", [POSITION_PROLOGUE, POSITION_BODY, POSITION_EPILOGUE]> { let cppNamespace = "::accera::ir::loopnest"; } -def rcln_IndexAttr : DialectAttr< +def accln_IndexAttr : DialectAttr< LoopNest_Dialect, CPred<"$_self.isa()">, "Symbolic index attribute"> { @@ -60,7 +60,7 @@ def rcln_IndexAttr : DialectAttr< let constBuilderCall = "$0"; } -def rcln_IndexRangeAttr : DialectAttr< +def accln_IndexRangeAttr : DialectAttr< LoopNest_Dialect, CPred<"$_self.isa()">, "Index range attribute"> { @@ -70,7 +70,7 @@ def rcln_IndexRangeAttr : DialectAttr< let constBuilderCall = "$0"; } -def rcln_RangeAttr : DialectAttr< +def accln_RangeAttr : DialectAttr< LoopNest_Dialect, CPred<"$_self.isa()">, "Range attribute"> { @@ -80,7 +80,7 @@ def rcln_RangeAttr : DialectAttr< let constBuilderCall = "$0"; } -def rcln_IterationDomainAttr : DialectAttr< +def accln_IterationDomainAttr : DialectAttr< LoopNest_Dialect, CPred<"$_self.isa()">, "IterationDomain attribute"> { @@ -90,7 +90,7 @@ def rcln_IterationDomainAttr : DialectAttr< let constBuilderCall = "$0"; } -def rcln_SplitIndexAttr : DialectAttr< +def accln_SplitIndexAttr : DialectAttr< LoopNest_Dialect, CPred<"$_self.isa()">, "Split index attribute"> { @@ -100,7 +100,7 @@ def rcln_SplitIndexAttr : DialectAttr< let constBuilderCall = "$0"; } -def rcln_TransformedDomainAttr : DialectAttr< +def accln_TransformedDomainAttr : DialectAttr< LoopNest_Dialect, CPred<"$_self.isa()">, "TransformedDomain attribute"> { @@ -110,7 +110,7 @@ def rcln_TransformedDomainAttr : DialectAttr< let constBuilderCall = "$0"; } -def rcln_OperandIndexAttr : DialectAttr< +def accln_OperandIndexAttr : DialectAttr< LoopNest_Dialect, CPred<"$_self.isa()">, "Attribute referencing an operand of an op"> { diff --git a/accera/ir/include/nest/LoopNestExportedInterfaces.td b/accera/ir/include/nest/LoopNestExportedInterfaces.td index 2e297fa7..8083a6dd 100644 --- a/accera/ir/include/nest/LoopNestExportedInterfaces.td +++ b/accera/ir/include/nest/LoopNestExportedInterfaces.td @@ -11,7 +11,7 @@ include "mlir/IR/OpBase.td" // // Mapping operation interface // -def rcln_InjectableMappingOpInterface : OpInterface<"InjectableMapping"> { +def accln_InjectableMappingOpInterface : OpInterface<"InjectableMapping"> { let description = [{ Interface for mapping ops to be injected into a loopnest schedule }]; diff --git 
a/accera/ir/include/nest/LoopNestInterfaces.td b/accera/ir/include/nest/LoopNestInterfaces.td index 4bfd03a3..6adcef95 100644 --- a/accera/ir/include/nest/LoopNestInterfaces.td +++ b/accera/ir/include/nest/LoopNestInterfaces.td @@ -8,7 +8,7 @@ include "mlir/IR/OpBase.td" -def rcln_EvaluatablePredicateOpInterface : OpInterface<"EvaluatablePredicateOpInterface"> { +def accln_EvaluatablePredicateOpInterface : OpInterface<"EvaluatablePredicateOpInterface"> { let description = [{ Evaluatable predicate at emit-time }]; @@ -23,7 +23,7 @@ def rcln_EvaluatablePredicateOpInterface : OpInterface<"EvaluatablePredicateOpIn ]; } -def rcln_KernelPredicateOpInterface : OpInterface<"KernelPredicateOpInterface"> { +def accln_KernelPredicateOpInterface : OpInterface<"KernelPredicateOpInterface"> { let description = [{ Predicate determining if a Kernel should be run }]; @@ -45,7 +45,7 @@ def rcln_KernelPredicateOpInterface : OpInterface<"KernelPredicateOpInterface"> // TODO : this interface can't surface the size because the op needs to be lowered first, rename it something related to // what it does rather than how it will be used -def rcln_EmitTimeSizeOpInterface : OpInterface<"EmitTimeSize"> { +def accln_EmitTimeSizeOpInterface : OpInterface<"EmitTimeSize"> { let description = [{ Evaluatable size at conversion-time }]; diff --git a/accera/ir/include/nest/LoopNestOps.td b/accera/ir/include/nest/LoopNestOps.td index 78718e71..281d9a6d 100644 --- a/accera/ir/include/nest/LoopNestOps.td +++ b/accera/ir/include/nest/LoopNestOps.td @@ -21,19 +21,19 @@ include "ir/include/value/ValueAttrs.td" // * The parent dialect of the operation. // * The mnemonic for the operation, or the name without the dialect prefix. // * A list of traits for the operation. -class rcln_Op traits = []> : +class accln_Op traits = []> : Op; // // SymbolicIndexOp // -def rcln_SymbolicIndexOp : rcln_Op<"sym_index", [NoSideEffect, ConstantLike]> { +def accln_SymbolicIndexOp : accln_Op<"sym_index", [NoSideEffect, ConstantLike]> { let summary = "symbolic loop index"; let description = [{ The "accln.sym_index" builtin operation creates a symbolic "placeholder" index for a loop. }]; - let arguments = (ins rcln_IndexAttr:$index); + let arguments = (ins accln_IndexAttr:$index); let results = (outs Index:$result); let builders = [ @@ -59,7 +59,7 @@ def rcln_SymbolicIndexOp : rcln_Op<"sym_index", [NoSideEffect, ConstantLike]> { // // ScheduleOp // -def rcln_ScheduleOp : rcln_Op<"schedule", +def accln_ScheduleOp : accln_Op<"schedule", [SingleBlockImplicitTerminator<"TerminatorOp">]> { let summary = "loopnest scheduling operation"; let description = [{ @@ -169,7 +169,7 @@ def rcln_ScheduleOp : rcln_Op<"schedule", // // NestOp // -def rcln_NestOp : rcln_Op<"nest",[SingleBlockImplicitTerminator<"TerminatorOp">, SymbolTable]> { +def accln_NestOp : accln_Op<"nest",[SingleBlockImplicitTerminator<"TerminatorOp">, SymbolTable]> { let summary = "nest operation"; let description = [{ The "accln.nest" operation produces a loop nest. Takes a variadic number of values indicating the upper bound for the loops. @@ -177,7 +177,7 @@ def rcln_NestOp : rcln_Op<"nest",[SingleBlockImplicitTerminator<"TerminatorOp">, The "accln.nest" operation has a single region indicating the code at the middle of the loop nest. 
}]; - let arguments = (ins SymbolRefArrayAttr:$kernels, rcln_IterationDomainAttr:$domain, OptionalAttr:$exec_target, Variadic:$rangeOperands); + let arguments = (ins SymbolRefArrayAttr:$kernels, accln_IterationDomainAttr:$domain, OptionalAttr:$exec_target, Variadic:$rangeOperands); let results = (outs); let regions = (region SizedRegion<1>:$body); @@ -219,14 +219,14 @@ def rcln_NestOp : rcln_Op<"nest",[SingleBlockImplicitTerminator<"TerminatorOp">, // // ScheduledLoopOp // -def rcln_ScheduledLoopOp : rcln_Op<"scheduled_loop", +def accln_ScheduledLoopOp : accln_Op<"scheduled_loop", [SingleBlockImplicitTerminator<"TerminatorOp">]> { let summary = "Represents a scheduled loop and separates prologue, body, and epilogue regions"; let description = [{ The "accln.ScheduledLoopOp" operation represents a scheduled loop level and separates prolgoue, body, and epilogue regions }]; - let arguments = (ins I64Attr:$begin, I64Attr:$end, I64Attr:$step, rcln_IndexAttr:$index, Index:$symbolicIndex, ArrayAttr:$subdomainSize, ArrayAttr:$subdomainIndexOrder); + let arguments = (ins I64Attr:$begin, I64Attr:$end, I64Attr:$step, accln_IndexAttr:$index, Index:$symbolicIndex, ArrayAttr:$subdomainSize, ArrayAttr:$subdomainIndexOrder); let regions = (region AnyRegion:$prologue, AnyRegion:$body, AnyRegion:$epilogue); let printer = [{ return ::print(p, *this); }]; @@ -311,7 +311,7 @@ def rcln_ScheduledLoopOp : rcln_Op<"scheduled_loop", // Then add the kernel to the nest (via n.addKernel(kernel)), which moves %n to the current point and adds an invoke call // TODO: add CallableOpInterface, FunctionLike ? -def rcln_KernelOp : rcln_Op<"kernel", +def accln_KernelOp : accln_Op<"kernel", [SingleBlockImplicitTerminator<"TerminatorOp">, Symbol]> { let summary = "kernel operation"; let description = [{ @@ -342,7 +342,7 @@ def rcln_KernelOp : rcln_Op<"kernel", std::vector getIndices(); // FunctionLike trait needs access to the functions below. - // friend class OpTrait::FunctionLike; + // friend class OpTrait::FunctionLike; // Hooks for the input/output type enumeration in FunctionLike trait // unsigned getNumFuncArguments() { return getType().getNumInputs(); } @@ -364,7 +364,7 @@ def rcln_KernelOp : rcln_Op<"kernel", // // ScheduledKernelOp // -def rcln_ScheduledKernelOp : rcln_Op<"scheduled_kernel", [Symbol]> { +def accln_ScheduledKernelOp : accln_Op<"scheduled_kernel", [Symbol]> { let summary = "scheduled kernel operation"; let description = [{ The "loopnest.scheduled_kernel" operation associates a predicate with a kernel. 
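Aside: the kernel and predicate ops in this file keep declaring the loop-nest predicate interfaces under the renamed accln_ prefixes. A rough illustration of why that matters downstream: ops declaring DeclareOpInterfaceMethods can be handled through the interface class instead of enumerating each concrete op. The namespace below is an assumption (following the dialect's ::accera::ir::loopnest convention), and the helper is illustrative only:

```cpp
#include "mlir/IR/Operation.h"
// Assumes the generated interface header for LoopNestInterfaces.td is included
// and that the interface class lives in ::accera::ir::loopnest.
using accera::ir::loopnest::KernelPredicateOpInterface;

// True if `op` implements the kernel-predicate interface, regardless of which
// concrete accln.* predicate op it is.
static bool isKernelPredicate(mlir::Operation* op)
{
    return static_cast<bool>(llvm::dyn_cast<KernelPredicateOpInterface>(op));
}
```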
@@ -401,7 +401,7 @@ def rcln_ScheduledKernelOp : rcln_Op<"scheduled_kernel", [Symbol]> { // InvokeKernelOp // -def rcln_InvokeKernelOp : rcln_Op<"invoke_kernel", []> { +def accln_InvokeKernelOp : accln_Op<"invoke_kernel", []> { let summary = "Invoke a kernel"; let description = [{}]; let arguments = (ins FlatSymbolRefAttr:$kernel); @@ -423,8 +423,8 @@ def rcln_InvokeKernelOp : rcln_Op<"invoke_kernel", []> { // DimSizeOp // -def rcln_DimSizeOp : rcln_Op<"dim_size", - [DeclareOpInterfaceMethods]> { +def accln_DimSizeOp : accln_Op<"dim_size", + [DeclareOpInterfaceMethods]> { let summary = "Evaluate the given dimension range in the current subdomain"; let description = [{ The "accln.dim_size" op lowers to the constant value of the size of the requested subdomain dimension at the point it is inserted @@ -438,7 +438,7 @@ def rcln_DimSizeOp : rcln_Op<"dim_size", static StringRef getIndexAttrName(); }]; - let arguments = (ins rcln_IndexAttr:$dimensionIndex); + let arguments = (ins accln_IndexAttr:$dimensionIndex); let results = (outs Index:$size); } @@ -446,12 +446,12 @@ def rcln_DimSizeOp : rcln_Op<"dim_size", // Kernel predicates // -class rcln_KernelPredicateOp traits = []> : rcln_Op { +class accln_KernelPredicateOp traits = []> : accln_Op { let results = (outs I1:$result); } -def rcln_NullPredicateOp : rcln_KernelPredicateOp<"null", - [DeclareOpInterfaceMethods]> { +def accln_NullPredicateOp : accln_KernelPredicateOp<"null", + [DeclareOpInterfaceMethods]> { let summary = ""; let description = [{ blah }]; let arguments = (ins); @@ -464,8 +464,8 @@ def rcln_NullPredicateOp : rcln_KernelPredicateOp<"null", let verifier = [{ return ::verify(*this); }]; } -def rcln_ConstantPredicateOp : rcln_KernelPredicateOp<"const", - [DeclareOpInterfaceMethods]> { +def accln_ConstantPredicateOp : accln_KernelPredicateOp<"const", + [DeclareOpInterfaceMethods]> { let arguments = (ins BoolAttr:$value); let builders = [ OpBuilder<(ins "BoolAttr":$value), [{ @@ -475,9 +475,9 @@ def rcln_ConstantPredicateOp : rcln_KernelPredicateOp<"const", let verifier = [{ return ::verify(*this); }]; } -def rcln_FragmentTypePredicateOp : rcln_KernelPredicateOp<"frag", - [DeclareOpInterfaceMethods]> { - let arguments = (ins rcln_FragmentTypeAttr:$fragment, rcln_IndexAttr:$index, I64ArrayAttr:$indexValues); +def accln_FragmentTypePredicateOp : accln_KernelPredicateOp<"frag", + [DeclareOpInterfaceMethods]> { + let arguments = (ins accln_FragmentTypeAttr:$fragment, accln_IndexAttr:$index, I64ArrayAttr:$indexValues); let builders = [ OpBuilder<(ins "IntegerAttr":$fragment, @@ -517,10 +517,10 @@ def rcln_FragmentTypePredicateOp : rcln_KernelPredicateOp<"frag", }]; } -def rcln_ProloguePredicateOp : rcln_KernelPredicateOp<"prologue", - [DeclareOpInterfaceMethods] +def accln_ProloguePredicateOp : accln_KernelPredicateOp<"prologue", + [DeclareOpInterfaceMethods] > { - let arguments = (ins rcln_IndexAttr:$index); + let arguments = (ins accln_IndexAttr:$index); let builders = [ OpBuilder<(ins "Index":$index), [{ build($_builder, $_state, $_builder.getI1Type(), IndexAttr::get(index, $_builder.getContext())); @@ -532,10 +532,10 @@ def rcln_ProloguePredicateOp : rcln_KernelPredicateOp<"prologue", let verifier = [{ return ::verify(*this); }]; } -def rcln_EpiloguePredicateOp : rcln_KernelPredicateOp<"epilogue", - [DeclareOpInterfaceMethods] +def accln_EpiloguePredicateOp : accln_KernelPredicateOp<"epilogue", + [DeclareOpInterfaceMethods] > { - let arguments = (ins rcln_IndexAttr:$index); + let arguments = (ins accln_IndexAttr:$index); let builders = 
[ OpBuilder<(ins "Index":$index), [{ build($_builder, $_state, $_builder.getI1Type(), IndexAttr::get(index, $_builder.getContext())); @@ -547,9 +547,9 @@ def rcln_EpiloguePredicateOp : rcln_KernelPredicateOp<"epilogue", let verifier = [{ return ::verify(*this); }]; } -def rcln_PlacementPredicateOp : rcln_KernelPredicateOp<"place", - [DeclareOpInterfaceMethods]> { - let arguments = (ins rcln_PlacementPredicateAttr:$placement, rcln_IndexAttr:$index); +def accln_PlacementPredicateOp : accln_KernelPredicateOp<"place", + [DeclareOpInterfaceMethods]> { + let arguments = (ins accln_PlacementPredicateAttr:$placement, accln_IndexAttr:$index); let builders = [ OpBuilder<(ins "IntegerAttr":$placement, "Index":$index), [{ auto placementAttr = placement.cast(); @@ -563,10 +563,10 @@ def rcln_PlacementPredicateOp : rcln_KernelPredicateOp<"place", let verifier = [{ return ::verify(*this); }]; } -def rcln_IndexDefinedPredicateOp : rcln_KernelPredicateOp<"indexdef", - [DeclareOpInterfaceMethods, - DeclareOpInterfaceMethods]> { - let arguments = (ins rcln_IndexAttr:$index); +def accln_IndexDefinedPredicateOp : accln_KernelPredicateOp<"indexdef", + [DeclareOpInterfaceMethods, + DeclareOpInterfaceMethods]> { + let arguments = (ins accln_IndexAttr:$index); let builders = [ OpBuilder<(ins "Index":$index), [{ build($_builder, $_state, $_builder.getI1Type(), IndexAttr::get(index, $_builder.getContext())); @@ -578,8 +578,8 @@ def rcln_IndexDefinedPredicateOp : rcln_KernelPredicateOp<"indexdef", let verifier = [{ return ::verify(*this); }]; } -def rcln_ConjunctionPredicateOp : rcln_KernelPredicateOp<"conj", - [DeclareOpInterfaceMethods]> { +def accln_ConjunctionPredicateOp : accln_KernelPredicateOp<"conj", + [DeclareOpInterfaceMethods]> { let arguments = (ins Variadic:$values); let builders = [ OpBuilder<(ins "ValueRange":$values), [{ @@ -600,8 +600,8 @@ def rcln_ConjunctionPredicateOp : rcln_KernelPredicateOp<"conj", let verifier = [{ return ::verify(*this); }]; } -def rcln_DisjunctionPredicateOp : rcln_KernelPredicateOp<"disj", - [DeclareOpInterfaceMethods]> { +def accln_DisjunctionPredicateOp : accln_KernelPredicateOp<"disj", + [DeclareOpInterfaceMethods]> { let arguments = (ins Variadic:$values); let builders = [ OpBuilder<(ins "ValueRange":$values), [{ @@ -625,7 +625,7 @@ def rcln_DisjunctionPredicateOp : rcln_KernelPredicateOp<"disj", // // PrintOp // -def rcln_PrintOp : rcln_Op<"print"> { +def accln_PrintOp : accln_Op<"print"> { let summary = "print operation"; let description = [{ The "print" builtin operation prints a given input tensor, and produces @@ -650,7 +650,7 @@ def rcln_PrintOp : rcln_Op<"print"> { // // TerminatorOp // -def rcln_TerminatorOp : rcln_Op<"terminator", [Terminator]> { +def accln_TerminatorOp : accln_Op<"terminator", [Terminator]> { let summary = "cf terminator operation"; let description = [{ "accln.terminator" is a special terminator operation for blocks inside @@ -673,7 +673,7 @@ def rcln_TerminatorOp : rcln_Op<"terminator", [Terminator]> { // // ExecPlanOp // -def rcln_ExecPlanOp : rcln_Op<"exec_plan", [ParentOneOf<["NestOp", "ScheduleOp"]>]> { +def accln_ExecPlanOp : accln_Op<"exec_plan", [ParentOneOf<["NestOp", "ScheduleOp"]>]> { let summary = "loopnest execution plan operation"; let description = [{ The "accln.exec_plan" operation describes the execution plan for a loop nest. 
diff --git a/accera/ir/include/value/ValueAttrs.td b/accera/ir/include/value/ValueAttrs.td index 499897a9..8cfa4e3d 100644 --- a/accera/ir/include/value/ValueAttrs.td +++ b/accera/ir/include/value/ValueAttrs.td @@ -4,8 +4,8 @@ // Authors: Kern Handa //////////////////////////////////////////////////////////////////////////////////////////////////// -#ifndef ACCERA_rcv_ATTRS -#define ACCERA_rcv_ATTRS +#ifndef ACCERA_accv_ATTRS +#define ACCERA_accv_ATTRS include "ir/include/Common.td" @@ -16,12 +16,23 @@ def ExecutionTargetAttr : I64EnumAttr<"ExecutionTarget", "target for function", let cppNamespace = "::accera::ir::value"; } -def ExecutionRuntimeDefault : StrEnumAttrCase<"Default">; -def ExecutionRuntimeVulkan : StrEnumAttrCase<"Vulkan">; -def ExecutionRuntimeRocm : StrEnumAttrCase<"Rocm">; +def ExecutionRuntimeNone : StrEnumAttrCase<"NONE">; def ExecutionRuntimeCUDA : StrEnumAttrCase<"CUDA">; +def ExecutionRuntimeRocm : StrEnumAttrCase<"ROCM">; +def ExecutionRuntimeVulkan : StrEnumAttrCase<"VULKAN">; +def ExecutionRuntimeOpenMP : StrEnumAttrCase<"OPENMP">; +def ExecutionRuntimeDefault : StrEnumAttrCase<"DEFAULT">; -def ExecutionRuntimeAttr : StrEnumAttr<"ExecutionRuntime", "execution runtime for function", [ExecutionRuntimeDefault, ExecutionRuntimeVulkan, ExecutionRuntimeRocm, ExecutionRuntimeCUDA]> { + +def ExecutionRuntimeAttr : StrEnumAttr<"ExecutionRuntime", "execution runtime for function", + [ + ExecutionRuntimeNone, + ExecutionRuntimeCUDA, + ExecutionRuntimeRocm, + ExecutionRuntimeVulkan, + ExecutionRuntimeOpenMP, + ExecutionRuntimeDefault + ]> { let cppNamespace = "::accera::ir::value"; let genSpecializedAttr = 1; } @@ -51,18 +62,18 @@ def MemoryAllocTypeAttr : I64EnumAttr< def MEMORY_SPACE_NONE : I64EnumAttrCase<"None", 0>; def MEMORY_SPACE_GLOBAL : I64EnumAttrCase<"Global", 1>; def MEMORY_SPACE_SHARED : I64EnumAttrCase<"Shared", 3>; -def MEMORY_SPACE_LOCAL : I64EnumAttrCase<"Local", 5>; +def MEMORY_SPACE_PRIVATE : I64EnumAttrCase<"Private", 5>; def MemorySpaceAttr : I64EnumAttr< "MemorySpace", "Describes the memory space in which an allocation resides.", - [ MEMORY_SPACE_NONE, MEMORY_SPACE_GLOBAL, MEMORY_SPACE_SHARED, MEMORY_SPACE_LOCAL]> { + [ MEMORY_SPACE_NONE, MEMORY_SPACE_GLOBAL, MEMORY_SPACE_SHARED, MEMORY_SPACE_PRIVATE]> { let cppNamespace = "::accera::ir::value"; } def BARRIER_SCOPE_BLOCK : StrEnumAttrCase<"Block", 0>; -def BARRIER_SCOPE_WARP : StrEnumAttrCase<"Warp", 1>; -def BARRIER_SCOPE_THREADFENCE : StrEnumAttrCase<"Threadfence", 2>; +def BARRIER_SCOPE_WARP : StrEnumAttrCase<"Warp", 1>; +def BARRIER_SCOPE_THREADFENCE : StrEnumAttrCase<"Threadfence", 2>; def BarrierScopeAttr : StrEnumAttr< "BarrierScope", @@ -71,4 +82,4 @@ def BarrierScopeAttr : StrEnumAttr< let cppNamespace = "::accera::ir::value"; let genSpecializedAttr = 1; } -#endif // ACCERA_rcv_ATTRS +#endif // ACCERA_accv_ATTRS diff --git a/accera/ir/include/value/ValueBase.td b/accera/ir/include/value/ValueBase.td index 8b299e98..febbb762 100644 --- a/accera/ir/include/value/ValueBase.td +++ b/accera/ir/include/value/ValueBase.td @@ -4,8 +4,8 @@ // Authors: Kern Handa //////////////////////////////////////////////////////////////////////////////////////////////////// -#ifndef ACCERA_rcv_BASE -#define ACCERA_rcv_BASE +#ifndef ACCERA_accv_BASE +#define ACCERA_accv_BASE include "ir/include/Common.td" @@ -28,6 +28,6 @@ def Value_Dialect : Dialect { // * The parent dialect of the operation. // * The mnemonic for the operation, or the name without the dialect prefix. // * A list of traits for the operation. 
-class rcv_Op traits = []> : Op; +class accv_Op traits = []> : Op; -#endif // ACCERA_rcv_BASE +#endif // ACCERA_accv_BASE diff --git a/accera/ir/include/value/ValueDialect.h b/accera/ir/include/value/ValueDialect.h index 639f8274..342dc500 100644 --- a/accera/ir/include/value/ValueDialect.h +++ b/accera/ir/include/value/ValueDialect.h @@ -6,6 +6,7 @@ #pragma once +#include #include #include #include @@ -38,7 +39,10 @@ using llvm::StringRef; using llvm::iterator_range; using mlir::AffineMap; +using mlir::AffineMapAccessInterface; using mlir::AffineMapAttr; +using mlir::AffineReadOpInterface; +using mlir::AffineWriteOpInterface; using mlir::ArrayAttr; using mlir::Attribute; using mlir::Block; diff --git a/accera/ir/include/value/ValueMFMAOp.h b/accera/ir/include/value/ValueMFMAOp.h index 4388d7ad..23beefac 100644 --- a/accera/ir/include/value/ValueMFMAOp.h +++ b/accera/ir/include/value/ValueMFMAOp.h @@ -23,8 +23,8 @@ using llvm::StringRef; using mlir::Type; using mlir::LogicalResult; -/// MFMAMatrixType storage and uniquing. Array is uniqued based on its shape -/// and type. +/// MFMAMatrixType storage and uniquing. Array is uniqued based on its shape, +/// type, and operand. struct MFMAMatrixStorageType : public mlir::TypeStorage { MFMAMatrixStorageType(unsigned numDims, const int64_t* dimShapes, Type elementType, StringRef operand) : @@ -71,7 +71,7 @@ struct MFMAMatrixStorageType : public mlir::TypeStorage StringRef operand; }; -/// MFMAMatrix represents a matrix held by a subgroup for matrix-matrix multiply +/// MFMAMatrix represents a matrix held by for matrix-matrix multiply /// accumulate operations. MFMAMatrices are taken as direct operands by these /// operations and are also produced as results. These matrices are meant to /// reside in the registers. A limited number of pointwise operations can be @@ -81,8 +81,8 @@ struct MFMAMatrixStorageType : public mlir::TypeStorage /// inside the matrix is opaque i.e., the elements may be present in the /// matrix in any order. The general usage of this type is shown as follows:- /// -/// %0 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {leadDimension = 16 : -/// index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp"> +/// %0 = accv.mfma_load_matrix %arg0[%c0, %c0] {leadDimension = 16 : +/// index} : memref<16x16xf16> -> !accv.mfma_matrix<16x16xf16, "AOp"> /// /// The MFMAMatrixType describes the shape of the matrix being loaded and the /// operand being loaded too. The operand needs to be specified to aid the @@ -92,14 +92,13 @@ struct MFMAMatrixStorageType : public mlir::TypeStorage /// and 8 f32s for f32 data type of MFMAMatrix. Some other instances of usage /// are:- /// -/// %3 = gpu.subgroup_mma_compute %0, %1, %2 : -/// !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -/// -> !gpu.mma_matrix<16x16xf32, "COp"> +/// %3 = accv.mfma_compute %0, %1, %2 : +/// !accv.mfma_matrix<16x16xf16, "AOp">, !accv.mfma_matrix<16x16xf16, "BOp"> +/// -> !accv.mfma_matrix<16x16xf32, "COp"> /// /// -/// gpu.subgroup_mma_store_matrix %3, %arg22[%c0, %c0] {leadDimension = 16 -/// : index}: !gpu.mma_matrix<16x16xf32, "COp">, memref<16x16xf32> -// TODO: consider moving this to ODS. +/// accv.mfma_store_matrix %3, %arg22[%c0, %c0] {leadDimension = 16 +/// : index}: !accv.mfma_matrix<16x16xf32, "COp">, memref<16x16xf32> class MFMAMatrixType : public Type::TypeBase { @@ -139,5 +138,8 @@ class MFMAMatrixType /// C += A*B. This function returns which operand in the given equation is /// held by this type. 
String returned can be one of"AOp", "BOp" and "COp". StringRef getOperand() const; + + /// The stride between consecutive rows of the MFMA matrix. + int64_t getLeadingDim() const; }; } // namespace accera::ir::value diff --git a/accera/ir/include/value/ValueOps.td b/accera/ir/include/value/ValueOps.td index 72ec0ef7..4101d6ff 100644 --- a/accera/ir/include/value/ValueOps.td +++ b/accera/ir/include/value/ValueOps.td @@ -4,16 +4,17 @@ // Authors: Kern Handa //////////////////////////////////////////////////////////////////////////////////////////////////// -#ifndef ACCERA_rcv_OPS -#define ACCERA_rcv_OPS +#ifndef ACCERA_accv_OPS +#define ACCERA_accv_OPS include "ir/include/value/ValueBase.td" include "ir/include/value/ValueAttrs.td" include "mlir/Interfaces/ControlFlowInterfaces.td" +include "mlir/Dialect/Affine/IR/AffineMemoryOpInterfaces.td" -def rcv_ValueLambdaOp : rcv_Op<"lambda", [ +def accv_ValueLambdaOp : accv_Op<"lambda", [ SymbolTable, Symbol, FunctionLike, @@ -53,7 +54,7 @@ def rcv_ValueLambdaOp : rcv_Op<"lambda", [ }]; } -def rcv_ValueModuleOp : rcv_Op<"module", [ +def accv_ValueModuleOp : accv_Op<"module", [ IsolatedFromAbove, SymbolTable, Symbol, @@ -79,12 +80,12 @@ def rcv_ValueModuleOp : rcv_Op<"module", [ }]; } -def rcv_ModuleTerminatorOp : rcv_Op<"module_terminator", [Terminator, HasParent<"ValueModuleOp">]> { +def accv_ModuleTerminatorOp : accv_Op<"module_terminator", [Terminator, HasParent<"ValueModuleOp">]> { let summary = "A pseudo op that marks the end of a gpu.module."; let description = [{}]; } -def rcv_PrintOp : rcv_Op<"print"> { +def accv_PrintOp : accv_Op<"print"> { let summary = "print operation"; let description = [{ The `accv.print` operation prints a given input tensor, and produces @@ -93,12 +94,12 @@ def rcv_PrintOp : rcv_Op<"print"> { // The print operation takes an input tensor to print. let arguments = (ins - AnyTypeOf<[AnyMemRef, AnyTensor, rc_NumericType]>:$input, + AnyTypeOf<[AnyMemRef, AnyTensor, acc_NumericType]>:$input, UnitAttr:$to_stderr ); } -def rcv_PrintFOp : rcv_Op<"printf"> { +def accv_PrintFOp : accv_Op<"printf"> { let summary = "printf operation"; let description = [{ The `printf` builtin operation prints a scalar value and returns no results. 
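Note (illustration only, not part of the patch): the new `getLeadingDim()` accessor declared above returns the stride between consecutive rows, which for the row-major tiles used in the MFMA examples is simply the last dimension of the shape. A minimal standalone C++ sketch of that convention; `offsetInTile` is an illustrative name, not an Accera API:

```
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative only: address element (row, col) of a row-major tile whose
// leading dimension (row stride) equals the number of columns, matching
// MFMAMatrixType::getLeadingDim() == getShape().back() for such layouts.
int64_t offsetInTile(int64_t row, int64_t col, int64_t leadingDim) {
    return row * leadingDim + col;
}

int main() {
    const int64_t rows = 16, cols = 16; // 16x16 tile, as in the doc examples
    const int64_t leadingDim = cols;    // row stride
    std::vector<float> tile(rows * cols, 0.0f);
    tile[offsetInTile(2, 3, leadingDim)] = 1.0f;
    assert(tile[2 * 16 + 3] == 1.0f);
    return 0;
}
```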
@@ -117,7 +118,7 @@ def rcv_PrintFOp : rcv_Op<"printf"> { ]; } -def rcv_GlobalOp : rcv_Op<"global", [IsolatedFromAbove, Symbol]> { +def accv_GlobalOp : accv_Op<"global", [IsolatedFromAbove, Symbol]> { let arguments = (ins TypeAttrBase<"MemRefType", "MemRef">:$type, UnitAttr:$constant, @@ -145,7 +146,7 @@ def rcv_GlobalOp : rcv_Op<"global", [IsolatedFromAbove, Symbol]> { }]; } -def rcv_ReferenceGlobalOp : rcv_Op<"ref_global"> { +def accv_ReferenceGlobalOp : accv_Op<"ref_global"> { let arguments = (ins FlatSymbolRefAttr:$global_name); let results = (outs AnyStaticShapeMemRef); @@ -170,7 +171,7 @@ def rcv_ReferenceGlobalOp : rcv_Op<"ref_global"> { }]; } -def rcv_AllocOp : rcv_Op<"alloc"> { +def accv_AllocOp : accv_Op<"alloc"> { let summary = "Memory allocation operation"; let description = [{ The `accv.alloc` operation allocates a region of memory, as specified by its @@ -224,7 +225,7 @@ def rcv_AllocOp : rcv_Op<"alloc"> { } -def rcv_BitcastOp : rcv_Op<"bitcast_op", +def accv_BitcastOp : accv_Op<"bitcast_op", [NoSideEffect]> { let summary = "bitcast operation"; let description = [{ @@ -232,21 +233,21 @@ def rcv_BitcastOp : rcv_Op<"bitcast_op", }]; let arguments = (ins - rc_ScalarOrVectorNumericType:$input + acc_ScalarOrVectorNumericType:$input ); - let results = (outs rc_ScalarOrVectorNumericType:$result); + let results = (outs acc_ScalarOrVectorNumericType:$result); } -def rcv_UNARY_OP_NOT : I64EnumAttrCase<"NOT", 0>; +def accv_UNARY_OP_NOT : I64EnumAttrCase<"NOT", 0>; -def rcv_UnaryOpPredicateAttr : I64EnumAttr< +def accv_UnaryOpPredicateAttr : I64EnumAttr< "UnaryOpPredicate", "", - [rcv_UNARY_OP_NOT]> { + [accv_UNARY_OP_NOT]> { let cppNamespace = "::accera::ir::value"; } -def rcv_UnaryOp : rcv_Op<"unary_op", +def accv_UnaryOp : accv_Op<"unary_op", [NoSideEffect]> { let summary = "unary operation"; let description = [{ @@ -255,10 +256,10 @@ def rcv_UnaryOp : rcv_Op<"unary_op", }]; let arguments = (ins - rcv_UnaryOpPredicateAttr:$predicate, - rc_NumericType:$input + accv_UnaryOpPredicateAttr:$predicate, + acc_NumericType:$input ); - let results = (outs rc_NumericType:$result); + let results = (outs acc_NumericType:$result); let builders = [OpBuilder<(ins "UnaryOpPredicate":$predicate, "Value":$input), [{ ::buildUnaryOp($_builder, $_state, predicate, input); @@ -274,22 +275,22 @@ def rcv_UnaryOp : rcv_Op<"unary_op", }]; } -def rcv_BINARY_OP_ADD : I64EnumAttrCase<"ADD", 0>; -def rcv_BINARY_OP_SUB : I64EnumAttrCase<"SUB", 1>; -def rcv_BINARY_OP_MUL : I64EnumAttrCase<"MUL", 2>; -def rcv_BINARY_OP_DIV : I64EnumAttrCase<"DIV", 3>; -def rcv_BINARY_OP_MOD : I64EnumAttrCase<"MOD", 4>; -def rcv_BINARY_OP_AND : I64EnumAttrCase<"LOGICAL_AND", 5>; -def rcv_BINARY_OP_OR : I64EnumAttrCase<"LOGICAL_OR", 6>; +def accv_BINARY_OP_ADD : I64EnumAttrCase<"ADD", 0>; +def accv_BINARY_OP_SUB : I64EnumAttrCase<"SUB", 1>; +def accv_BINARY_OP_MUL : I64EnumAttrCase<"MUL", 2>; +def accv_BINARY_OP_DIV : I64EnumAttrCase<"DIV", 3>; +def accv_BINARY_OP_MOD : I64EnumAttrCase<"MOD", 4>; +def accv_BINARY_OP_AND : I64EnumAttrCase<"LOGICAL_AND", 5>; +def accv_BINARY_OP_OR : I64EnumAttrCase<"LOGICAL_OR", 6>; -def rcv_BinaryOpPredicateAttr : I64EnumAttr< +def accv_BinaryOpPredicateAttr : I64EnumAttr< "BinaryOpPredicate", "", - [rcv_BINARY_OP_ADD, rcv_BINARY_OP_SUB, rcv_BINARY_OP_MUL, rcv_BINARY_OP_DIV, rcv_BINARY_OP_MOD, - rcv_BINARY_OP_AND, rcv_BINARY_OP_OR]> { + [accv_BINARY_OP_ADD, accv_BINARY_OP_SUB, accv_BINARY_OP_MUL, accv_BINARY_OP_DIV, accv_BINARY_OP_MOD, + accv_BINARY_OP_AND, accv_BINARY_OP_OR]> { let cppNamespace = 
"::accera::ir::value"; } -def rcv_BinOp : rcv_Op<"bin_op", +def accv_BinOp : accv_Op<"bin_op", [NoSideEffect]> { let summary = "binary operation"; let description = [{ @@ -298,11 +299,11 @@ def rcv_BinOp : rcv_Op<"bin_op", }]; let arguments = (ins - rcv_BinaryOpPredicateAttr:$predicate, - rc_ScalarOrVectorNumericType:$lhs, - rc_ScalarOrVectorNumericType:$rhs + accv_BinaryOpPredicateAttr:$predicate, + acc_ScalarOrVectorNumericType:$lhs, + acc_ScalarOrVectorNumericType:$rhs ); - let results = (outs rc_ScalarOrVectorNumericType:$result); + let results = (outs acc_ScalarOrVectorNumericType:$result); let builders = [OpBuilder<(ins "BinaryOpPredicate":$predicate, "Value":$lhs, "Value":$rhs), [{ ::buildBinOp($_builder, $_state, predicate, lhs, rhs); @@ -320,20 +321,20 @@ def rcv_BinOp : rcv_Op<"bin_op", }]; } -def rcv_CMP_P_EQ : I64EnumAttrCase<"EQ", 0>; -def rcv_CMP_P_NE : I64EnumAttrCase<"NE", 1>; -def rcv_CMP_P_LT : I64EnumAttrCase<"LT", 2>; -def rcv_CMP_P_LE : I64EnumAttrCase<"LE", 3>; -def rcv_CMP_P_GT : I64EnumAttrCase<"GT", 4>; -def rcv_CMP_P_GE : I64EnumAttrCase<"GE", 5>; +def accv_CMP_P_EQ : I64EnumAttrCase<"EQ", 0>; +def accv_CMP_P_NE : I64EnumAttrCase<"NE", 1>; +def accv_CMP_P_LT : I64EnumAttrCase<"LT", 2>; +def accv_CMP_P_LE : I64EnumAttrCase<"LE", 3>; +def accv_CMP_P_GT : I64EnumAttrCase<"GT", 4>; +def accv_CMP_P_GE : I64EnumAttrCase<"GE", 5>; -def rcv_CmpOpPredicateAttr : I64EnumAttr< +def accv_CmpOpPredicateAttr : I64EnumAttr< "CmpOpPredicate", "", - [rcv_CMP_P_EQ, rcv_CMP_P_NE, rcv_CMP_P_LT, rcv_CMP_P_LE, rcv_CMP_P_GT, rcv_CMP_P_GE]> { + [accv_CMP_P_EQ, accv_CMP_P_NE, accv_CMP_P_LT, accv_CMP_P_LE, accv_CMP_P_GT, accv_CMP_P_GE]> { let cppNamespace = "::accera::ir::value"; } -def rcv_CmpOp : rcv_Op<"cmp", +def accv_CmpOp : accv_Op<"cmp", [NoSideEffect]> { let summary = "comparison operation"; let description = [{ @@ -344,11 +345,11 @@ def rcv_CmpOp : rcv_Op<"cmp", }]; let arguments = (ins - rcv_CmpOpPredicateAttr:$predicate, - rc_ScalarOrVectorNumericType:$lhs, - rc_ScalarOrVectorNumericType:$rhs + accv_CmpOpPredicateAttr:$predicate, + acc_ScalarOrVectorNumericType:$lhs, + acc_ScalarOrVectorNumericType:$rhs ); - let results = (outs rc_ScalarOrVectorBoolType:$result); + let results = (outs acc_ScalarOrVectorBoolType:$result); let builders = [OpBuilder<(ins "CmpOpPredicate":$predicate, "Value":$lhs, "Value":$rhs), [{ ::buildCmpOp($_builder, $_state, predicate, lhs, rhs); @@ -364,7 +365,7 @@ def rcv_CmpOp : rcv_Op<"cmp", }]; } -def rcv_CopyOp : rcv_Op<"copy"> { +def accv_CopyOp : accv_Op<"copy"> { let description = [{ Copies the data in the input view into the output view. 
@@ -405,7 +406,7 @@ def rcv_CopyOp : rcv_Op<"copy"> { let hasCanonicalizer = 1; } -def rcv_IfOp : rcv_Op<"if", +def accv_IfOp : accv_Op<"if", [SingleBlockImplicitTerminator<"YieldOp">]> { let summary = "if-then-else operation"; let description = [{ @@ -449,7 +450,7 @@ def rcv_IfOp : rcv_Op<"if", } ``` }]; - let arguments = (ins rc_Scalarlike:$condition); + let arguments = (ins acc_Scalarlike:$condition); let results = (outs Variadic:$results); let regions = (region SizedRegion<1>:$thenRegion, AnyRegion:$elseRegion); @@ -472,7 +473,7 @@ def rcv_IfOp : rcv_Op<"if", }]; } -def rcv_LoadOp : rcv_Op<"load", +def accv_LoadOp : accv_Op<"load", [AllElementTypesMatch<["memref", "result"]>]> { let summary = "load operation"; let description = [{ @@ -489,8 +490,8 @@ def rcv_LoadOp : rcv_Op<"load", let arguments = (ins AnyTypeOf<[AnyMemRef,AnyStaticShapeTensor]>:$memref, - Variadic:$indices); - let results = (outs rc_NumericType:$result); + Variadic:$indices); + let results = (outs acc_NumericType:$result); let builders = [OpBuilder<(ins "Value":$data, CArg<"ValueRange", "{}">:$indices), [{ auto shapedType = data.getType().cast(); @@ -510,7 +511,7 @@ def rcv_LoadOp : rcv_Op<"load", }]; } -def rcv_GetElementOp : rcv_Op<"get_element", [NoSideEffect]> { +def accv_GetElementOp : accv_Op<"get_element", [NoSideEffect]> { let summary = "load element from memref or tensor"; let description =[{ Returns the element at index 0 of a Scalar sized MemRef or Tensor @@ -536,7 +537,7 @@ def rcv_GetElementOp : rcv_Op<"get_element", [NoSideEffect]> { let hasCanonicalizer = 1; } -def rcv_LaunchFuncOp : rcv_Op<"launch_func", [CallOpInterface]> { +def accv_LaunchFuncOp : accv_Op<"launch_func", [CallOpInterface]> { let summary = "call operation for multi-target func ops"; let description = [{}]; @@ -584,7 +585,7 @@ def rcv_LaunchFuncOp : rcv_Op<"launch_func", [CallOpInterface]> { }]; } -def rcv_CallOp : rcv_Op<"call", [CallOpInterface]> { +def accv_CallOp : accv_Op<"call", [CallOpInterface]> { let summary = "call operation"; let description = [{ The `accv.call` operation represents a direct call to a function that is within @@ -636,7 +637,7 @@ def rcv_CallOp : rcv_Op<"call", [CallOpInterface]> { }]; } -def rcv_MemRefCastOp : rcv_Op<"memref_cast", [SameOperandsAndResultShape]> { +def accv_MemRefCastOp : accv_Op<"memref_cast", [SameOperandsAndResultShape]> { let summary = "memref element casting operation"; let description = [{ The `accv.memref_cast` operation converts the elements in a memref to another @@ -660,7 +661,7 @@ def rcv_MemRefCastOp : rcv_Op<"memref_cast", [SameOperandsAndResultShape]> { }]; } -def rcv_OffsetOp : rcv_Op<"offset", [NoSideEffect]> { +def accv_OffsetOp : accv_Op<"offset", [NoSideEffect]> { let summary = "memref offset operation"; let description = [{ The `accv.offset` operation converts a memref type to another memref type @@ -675,7 +676,7 @@ def rcv_OffsetOp : rcv_Op<"offset", [NoSideEffect]> { let arguments = (ins AnyMemRef:$source, - Variadic:$offsets + Variadic:$offsets ); let results = (outs AnyMemRef); @@ -706,7 +707,7 @@ def rcv_OffsetOp : rcv_Op<"offset", [NoSideEffect]> { }]; } -def rcv_ViewOp : rcv_Op<"view", [NoSideEffect]> { +def accv_ViewOp : accv_Op<"view", [NoSideEffect]> { let summary = "memref view operation"; let description = [{ The `accv.view` operation converts a memref type to another memref type @@ -753,7 +754,7 @@ def rcv_ViewOp : rcv_Op<"view", [NoSideEffect]> { }]; } -def rcv_SliceOp : rcv_Op<"slice", [NoSideEffect]> { +def accv_SliceOp : accv_Op<"slice", [NoSideEffect]> { 
let summary = "memref slice operation"; let description = [{ The `accv.slice` operation converts a memref type to another memref type @@ -769,7 +770,7 @@ def rcv_SliceOp : rcv_Op<"slice", [NoSideEffect]> { let arguments = (ins AnyMemRef:$source, ArrayAttr:$sliceDimensions, - Variadic:$offsets + Variadic:$offsets ); let results = (outs AnyMemRef:$result); @@ -805,7 +806,7 @@ def rcv_SliceOp : rcv_Op<"slice", [NoSideEffect]> { let hasCanonicalizer = 1; } -def rcv_MergeDimOp : rcv_Op<"merge_dim", [NoSideEffect]> { +def accv_MergeDimOp : accv_Op<"merge_dim", [NoSideEffect]> { let summary = "memref merge dimensions operation"; let description = [{ The `accv.merge_dim` operation converts a memref type to another memref type @@ -831,7 +832,7 @@ def rcv_MergeDimOp : rcv_Op<"merge_dim", [NoSideEffect]> { }]; } -def rcv_SplitDimOp : rcv_Op<"split_dim", [NoSideEffect]> { +def accv_SplitDimOp : accv_Op<"split_dim", [NoSideEffect]> { let summary = "memref split dimension operation"; let description = [{ The `accv.split_dim` operation converts a memref type to another memref type @@ -857,7 +858,7 @@ def rcv_SplitDimOp : rcv_Op<"split_dim", [NoSideEffect]> { }]; } -def rcv_StoreOp : rcv_Op<"store", +def accv_StoreOp : accv_Op<"store", [AllElementTypesMatch<["memref", "value"]>]> { let summary = "store operation"; let description = [{ @@ -875,9 +876,9 @@ def rcv_StoreOp : rcv_Op<"store", }]; let arguments = (ins - rc_NumericType:$value, + acc_NumericType:$value, AnyMemRef:$memref, - Variadic:$indices); + Variadic:$indices); let builders = [OpBuilder<(ins "Value":$valueToStore, "Value":$memref), [{ $_state.addOperands(valueToStore); @@ -899,14 +900,14 @@ def rcv_StoreOp : rcv_Op<"store", }]; } -def rcv_StoreToMemRefOp : rcv_Op<"store_to_memref", [NoSideEffect]> { +def accv_StoreToMemRefOp : accv_Op<"store_to_memref", [NoSideEffect]> { let summary = "store element in memref"; let description =[{ Stores the element at index 0 of a Scalar sizes MemRef }]; let arguments = (ins AnyType:$value); - let results = (outs rc_MemRefWithShape<[1]>:$result); + let results = (outs acc_MemRefWithShape<[1]>:$result); let builders = [OpBuilder<(ins "Value":$value), [{ auto elementType = value.getType(); @@ -915,14 +916,14 @@ def rcv_StoreToMemRefOp : rcv_Op<"store_to_memref", [NoSideEffect]> { }]>]; } -def rcv_UnsafeOffsetOp : rcv_Op<"unsafe_offset"> { +def accv_UnsafeOffsetOp : accv_Op<"unsafe_offset"> { let summary = ""; let description = ""; - let arguments = (ins AnyMemRef:$input, rc_Indexlike:$offset); + let arguments = (ins AnyMemRef:$input, acc_Indexlike:$offset); let results = (outs AnyMemRef:$result); } -def rcv_ReshapeOp : rcv_Op<"reshape"> { +def accv_ReshapeOp : accv_Op<"reshape"> { let summary = ""; let description = ""; let arguments = (ins AnyMemRef:$source); @@ -939,7 +940,7 @@ def rcv_ReshapeOp : rcv_Op<"reshape"> { }]; } -def rcv_ReorderOp : rcv_Op<"reorder"> { +def accv_ReorderOp : accv_Op<"reorder"> { let summary = ""; let description = ""; let arguments = (ins @@ -966,7 +967,7 @@ def rcv_ReorderOp : rcv_Op<"reorder"> { }]; } -def rcv_YieldOp : rcv_Op<"yield", [Terminator]> { +def accv_YieldOp : accv_Op<"yield", [Terminator]> { let summary = "loop yield and termination operation"; let description = [{ The `accv.yield` op yields an SSA value from a loop dialect op region and @@ -987,7 +988,7 @@ def rcv_YieldOp : rcv_Op<"yield", [Terminator]> { ]; } -def rcv_EarlyReturnOp : rcv_Op<"early_return", [MemRefsNormalizable,]> { +def accv_EarlyReturnOp : accv_Op<"early_return", [MemRefsNormalizable,]> { let 
summary = "early return operation"; let description = [{}]; @@ -999,7 +1000,7 @@ def rcv_EarlyReturnOp : rcv_Op<"early_return", [MemRefsNormalizable,]> { let assemblyFormat = "attr-dict ($operands^ `:` type($operands))?"; } -def rcv_ReturnOp : rcv_Op<"return", [ +def accv_ReturnOp : accv_Op<"return", [ NoSideEffect, MemRefsNormalizable, ReturnLike, Terminator]> { let summary = "return operation"; let description = [{}]; @@ -1013,7 +1014,7 @@ def rcv_ReturnOp : rcv_Op<"return", [ } // Note: only works on vectors (1-D memrefs) currently -def rcv_ReduceOp : rcv_Op<"reduce", +def accv_ReduceOp : accv_Op<"reduce", [SingleBlockImplicitTerminator<"YieldOp">]> { let summary = "Reduction operation"; let description = ""; @@ -1044,7 +1045,7 @@ def rcv_ReduceOp : rcv_Op<"reduce", } // Note: only works on vectors (1-D memrefs) currently -def rcv_MapReduceOp : rcv_Op<"map_reduce", +def accv_MapReduceOp : accv_Op<"map_reduce", [SingleBlockImplicitTerminator<"YieldOp">]> { let summary = "Map-reduce operation"; let description = ""; @@ -1088,27 +1089,27 @@ def rcv_MapReduceOp : rcv_Op<"map_reduce", }]; } -def rcv_ReduceMaxOp : rcv_Op<"reduce_max"> { +def accv_ReduceMaxOp : accv_Op<"reduce_max"> { let summary = ""; let description = ""; let arguments = (ins AnyMemRef:$input); let results = (outs AnyType:$result); } -def rcv_ReduceSumOp : rcv_Op<"reduce_sum"> { +def accv_ReduceSumOp : accv_Op<"reduce_sum"> { let summary = ""; let description = ""; let arguments = (ins AnyMemRef:$input); let results = (outs AnyType:$result); } -def rcv_BarrierOp : rcv_Op<"barrier"> { +def accv_BarrierOp : accv_Op<"barrier"> { let summary = "Block synchronization primitive."; let hasCanonicalizer = 1; let arguments = (ins BarrierScopeAttr:$scope); } -def rcv_GetTimeOp : rcv_Op<"gettime"> { +def accv_GetTimeOp : accv_Op<"gettime"> { let summary = "Get current clock time"; let results = (outs AnyFloat:$result); let builders = [ @@ -1117,98 +1118,235 @@ def rcv_GetTimeOp : rcv_Op<"gettime"> { }]>]; } -def rcv_EnterProfileRegionOp : rcv_Op<"enter_profile"> { +def accv_EnterProfileRegionOp : accv_Op<"enter_profile"> { let summary = "Enter a profile region"; let arguments = (ins StrAttr:$regionName); } -def rcv_ExitProfileRegionOp : rcv_Op<"exit_profile"> { +def accv_ExitProfileRegionOp : accv_Op<"exit_profile"> { let summary = "Exit a profile region"; let arguments = (ins StrAttr:$regionName); } -def rcv_PrintProfileResultsOp : rcv_Op<"print_profile"> { +def accv_PrintProfileResultsOp : accv_Op<"print_profile"> { let summary = "Print out a summary of the profile counters"; } // matrix-fuse-multiply-add -// Predicate to check if type is rcv::MMAMatrixType. +// Predicate to check if type is accv::MMAMatrixType. def IsMMFAMatrixTypePred : CPred<"$_self.isa()">; -def rcv_MFMAMatrix : DialectType; class MFMAMatrixOf allowedTypes> : ContainerType, IsMMFAMatrixTypePred, "$_self.cast<::accera::ir::value::MFMAMatrixType>().getElementType()", - "rcv.mfma_matrix", "::accera::ir::value::MFMAMatrixType">; + "accv.mfma_matrix", "::accera::ir::value::MFMAMatrixType">; -def rcv_MFMAComputeOp: rcv_Op<"mfma_compute", [NoSideEffect, AllTypesMatch<["opC", "res"]>]> { +def accv_MFMAComputeOp: accv_Op<"mfma_compute", [NoSideEffect, AllTypesMatch<["opC", "res"]>]> { let summary = "MFMA Compute"; let description = [{ - The `rcv.mfma_compute` op is an abstraction of TensorCore ops (currently GPU centric). + The `accv.mfma_compute` op is an abstraction of TensorCore ops (currently GPU centric). 
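For reference (illustration only, not code from this patch): `accv.mfma_compute` abstracts a tile-level D = A*B + C multiply-accumulate. Below is a scalar C++ sketch of the same computation on the 16x16 tile shapes used in the MFMAMatrixType documentation earlier in this change, with element types simplified to `float`; `mfmaReference` is an illustrative name, not an Accera API:

```
#include <array>
#include <cstddef>

// Scalar reference for the tile multiply-accumulate that accv.mfma_compute
// represents: C += A * B over a 16x16x16 tile.
constexpr std::size_t kTile = 16;
using Tile = std::array<std::array<float, kTile>, kTile>;

void mfmaReference(const Tile& A, const Tile& B, Tile& C) {
    for (std::size_t i = 0; i < kTile; ++i)
        for (std::size_t j = 0; j < kTile; ++j)
            for (std::size_t k = 0; k < kTile; ++k)
                C[i][j] += A[i][k] * B[k][j];
}
```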
}]; - let arguments = (ins Arg>:$opA, + let arguments = (ins + Arg>:$opA, Arg>:$opB, Arg>:$opC); - let results = (outs rcv_MFMAMatrix:$res); + let results = (outs accv_MFMAMatrix:$res); let assemblyFormat = [{ - $opA`,` $opB`,` $opC attr-dict `:` type($opA)`,` type($opB) `->` type($res) + $opA `,` $opB`,` $opC attr-dict `:` type($opA) `,` type($opB) `,` type($opC) `->` type($res) }]; let verifier = [{ return ::verify(*this); }]; } + +def accv_MFMAConstantOp : accv_Op<"mfma_constant_matrix", + [ + NoSideEffect, + TypesMatchWith<"value type matches element type of mfma_matrix", + "result", "value", + "$_self.cast().getElementType()"> + ]>{ + + let summary = "creates a constant MFMA matrix"; + + let description = [{ + The `accv.mfma_constant_matrix` creates a constant MFMA matrix that can be subsequently + using within an accv.mfma_compute operation. + + This operation takes a constant as its operand. The op returns a `!accv.mfma_matrix`. + }]; + + let arguments = (ins + AnyTypeOf<[I8, F16, F32]>:$value + ); + + let results = (outs accv_MFMAMatrix:$result); + + code extraClassDeclaration = [{ + MFMAMatrixType getMFMAMatrixType() { + return result().getType().template cast(); + } + ArrayRef getMFMAMatrixShape() { + return getMFMAMatrixType().getShape(); + } + }]; + + let assemblyFormat = [{ + $value attr-dict `:` type($value) `->` type($result) + }]; + let verifier = [{ return ::verify(*this); }]; +} + // load and store submatrix -def rcv_MFMALoadMatrixOp : rcv_Op<"mfma_load_matrix", [MemoryEffects<[MemRead]>]>{ +def accv_MFMALoadOp : accv_Op<"mfma_load_matrix", + [ + MemoryEffects<[MemRead]>, + DeclareOpInterfaceMethods, + DeclareOpInterfaceMethods + ]>{ let summary = "matrix load"; let description = [{ - The `rcv.mfma_load_matrix` operation loads a sub-matrix to be used by the rcv.mfma_compute operation. + The `accv.mfma_load_matrix` operation loads a sub-matrix to be used by the accv.mfma_compute operation. This operation takes a memref as its first operand: it is the source matrix - from which data is to be loaded. The op returns a `!rcv.mfma_matrix`. + from which data is to be loaded. The op returns a `!accv.mfma_matrix`. }]; - let arguments = (ins Arg, "", [MemRead]>:$srcMemref); + let arguments = (ins + Arg, "", [MemRead]>:$memref, + Variadic:$indices, + OptionalAttr:$map + ); - let results = (outs rcv_MFMAMatrix:$res); + let results = (outs accv_MFMAMatrix:$result); - let assemblyFormat = [{ - $srcMemref attr-dict `:` type($srcMemref) `->` type($res) + let builders = [ + OpBuilder<(ins "Type":$resultType, "Value":$memref, "AffineMap":$map, "ValueRange":$mapOperands), [{ + assert(map.getNumInputs() == mapOperands.size() && "inconsistent index info"); + $_state.addOperands(memref); + $_state.addOperands(mapOperands); + $_state.addAttribute(getMapAttrName(), AffineMapAttr::get(map)); + $_state.addTypes(resultType); + }]>, + OpBuilder<(ins "Type":$resultType, "Value":$memref, "ValueRange":$indices), [{ + auto memrefType = memref.getType().cast(); + int64_t rank = memrefType.getRank(); + // Create identity map for memrefs with at least one dimension or () -> () + // for zero-dimensional memrefs. + auto map = + rank ? 
$_builder.getMultiDimIdentityMap(rank) : $_builder.getEmptyAffineMap(); + return build($_builder, $_state, resultType, memref, map, indices); + }]> + ]; + + code extraClassDeclaration = [{ + MFMAMatrixType getMFMAMatrixType() { + return result().getType().template cast(); + } + ArrayRef getMFMAMatrixShape() { + return getMFMAMatrixType().getShape(); + } + + /// Returns the operand index of the memref. + unsigned getMemRefOperandIndex() { return 0; } + + void setMemRef(Value value) { setOperand(getMemRefOperandIndex(), value); } + + /// Returns the affine map used to index the memref for this operation. + AffineMapAttr getAffineMapAttr() { + return (*this)->getAttr(getMapAttrName()).cast(); + } + + static StringRef getMapAttrName() { return "map"; } }]; + let assemblyFormat = [{ + $memref `[` $indices `]` attr-dict `:` type($memref) `[` type($indices) `]` `->` type($result) + }]; let verifier = [{ return ::verify(*this); }]; } -def rcv_MFMAStoreMatrixOp : rcv_Op<"subgroup_mma_store_matrix", - [MemoryEffects<[MemWrite]>]>{ +def accv_MFMAStoreOp : accv_Op<"mfma_store_matrix", [ + MemoryEffects<[MemWrite]>, + DeclareOpInterfaceMethods, + DeclareOpInterfaceMethods + ]>{ let summary = "matrix store"; let description = [{ - The `rcv.mfma_load_matrix` operation loads a sub-matrix. + The `accv.mfma_store_matrix` operation stores an mfma matrix into a matrix. - This operation takes a `rcv.mfma_matrix` and a memref as operands. - `!rcv.mma_matrix` is the source value containing the data to be stored into the + This operation takes a `accv.mfma_matrix` and a memref as operands. + `!accv.mfma_matrix` is the source value containing the data to be stored into the destination memref which can be in global or shared memory. }]; - let arguments = (ins Arg>:$src, - Arg, "",[MemWrite]>:$dstMemref); + let arguments = (ins + Arg>:$value, + Arg, "",[MemWrite]>:$memref, + Variadic:$indices, + OptionalAttr:$map); + + let builders = [ + OpBuilder<(ins "Value":$value, "Value":$memref, "AffineMap":$map, "ValueRange":$mapOperands), [{ + assert(map.getNumInputs() == mapOperands.size() && "inconsistent index info"); + $_state.addOperands(value); + $_state.addOperands(memref); + $_state.addOperands(mapOperands); + $_state.addAttribute(getMapAttrName(), AffineMapAttr::get(map)); + }]>, + OpBuilder<(ins "Value":$value, "Value":$memref, "ValueRange":$indices), [{ + auto memrefType = memref.getType().cast(); + int64_t rank = memrefType.getRank(); + // Create identity map for memrefs with at least one dimension or () -> () + // for zero-dimensional memrefs. + auto map = + rank ? $_builder.getMultiDimIdentityMap(rank) : $_builder.getEmptyAffineMap(); + return build($_builder, $_state, value, memref, map, indices); + }]> + ]; + + code extraClassDeclaration = [{ + MFMAMatrixType getMFMAMatrixType() { + return value().getType().template cast(); + } + ArrayRef getMFMAMatrixShape() { + return getMFMAMatrixType().getShape(); + } + + /// Returns the operand index of the value to be stored. + unsigned getStoredValOperandIndex() { return 0; } + + /// Returns the operand index of the memref. + unsigned getMemRefOperandIndex() { return 1; } + + void setMemRef(Value value) { setOperand(getMemRefOperandIndex(), value); } + + /// Returns the affine map used to index the memref for this operation. 
+ AffineMapAttr getAffineMapAttr() { + return (*this)->getAttr(getMapAttrName()).cast(); + } + + static StringRef getMapAttrName() { return "map"; } + }]; let assemblyFormat = [{ - $src`,` $dstMemref attr-dict `:` type($src)`,` type($dstMemref) + $value`,` $memref `[` $indices `]` attr-dict `:` type($value)`,` type($memref) `[` type($indices) `]` }]; let verifier = [{ return ::verify(*this); }]; } -#endif // ACCERA_rcv_OPS +#endif // ACCERA_accv_OPS diff --git a/accera/ir/src/IRUtil.cpp b/accera/ir/src/IRUtil.cpp index 15094732..c6bb0dfd 100644 --- a/accera/ir/src/IRUtil.cpp +++ b/accera/ir/src/IRUtil.cpp @@ -291,15 +291,18 @@ namespace util return builder.clone(*op, mapping); } - std::optional ResolveExecutionTarget(mlir::Operation* op) + std::optional ResolveExecutionTarget(mlir::Operation* op, bool exact /* = false */) { - // modules can define the execution runtime - // search if the current module specifies the execution runtime + // modules can define the execution target + // search if the current module specifies the execution target auto getExecTarget = [](Operation* op) { return op->getAttrOfType(vir::ValueFuncOp::getExecTargetAttrName()); }; Operation* execAwareOp = op; auto execTargetAttr = getExecTarget(execAwareOp); - while (execAwareOp && !execAwareOp->hasTrait() && !execTargetAttr) + while (!exact && + execAwareOp && + !execAwareOp->hasTrait() && + !execTargetAttr) { if ((execAwareOp = execAwareOp->getParentWithTrait())) { @@ -314,7 +317,7 @@ namespace util assert(execAwareOp && "Unable to find a function-like op which surrounds the curent op"); return mlir::TypeSwitch>(execAwareOp) - .Case([=](mlir::gpu::GPUFuncOp op) { + .Case([](mlir::gpu::GPUFuncOp op) { return vir::ExecutionTarget::GPU; }) .Case([](mlir::spirv::FuncOp op) { @@ -323,53 +326,65 @@ namespace util .Case([](mlir::FuncOp op) { return vir::ExecutionTarget::CPU; }) + .Case([](mlir::LLVM::LLVMFuncOp op) { + return vir::ExecutionTarget::CPU; + }) .Default([](Operation* op) { op->emitWarning("Couldn't determine execution environment"); return std::nullopt; }); } - std::optional ResolveExecutionRuntime(mlir::Operation* op) + std::optional ResolveExecutionRuntime(mlir::Operation* op, bool exact /* = false */) { - // search the rcv.Module for the runtime - std::function getExecRuntime = [](Operation* op) { - return op->getAttrOfType(vir::ValueModuleOp::getExecRuntimeAttrName()); + auto execRuntimeAttrName = ir::value::ValueModuleOp::getExecRuntimeAttrName(); + + auto getExecRuntime = [&](Operation* op) { + return op->getAttrOfType(execRuntimeAttrName); }; + Operation* moduleLikeOp = op; auto execRuntimeAttr = getExecRuntime(moduleLikeOp); - while (moduleLikeOp && !execRuntimeAttr) + // if the runtime attribute is not found in the rcv.module, then + // search the mlir.module for the runtime (using a fully qualified attribute name) + if (!exact && op && !execRuntimeAttr) { - if ((moduleLikeOp = moduleLikeOp->getParentOfType())) + if ((moduleLikeOp = op->getParentOfType())) { execRuntimeAttr = getExecRuntime(moduleLikeOp); } - } - - // if the runtime attribute is not found in the rcv.module, then - // search the mlir.module for the runtime (using a fully qualified attribute name) - if (!execRuntimeAttr) - { - auto execRuntimeAttrName = ir::value::ValueModuleOp::getExecRuntimeAttrName(); - getExecRuntime = [=](Operation* op) { return op->getAttrOfType(execRuntimeAttrName.str()); }; - - moduleLikeOp = op; - execRuntimeAttr = getExecRuntime(moduleLikeOp); - while (moduleLikeOp && !execRuntimeAttr) + if (!execRuntimeAttr 
&& (moduleLikeOp = op->getParentOfType())) { - if ((moduleLikeOp = moduleLikeOp->getParentOfType())) - execRuntimeAttr = getExecRuntime(moduleLikeOp); + execRuntimeAttr = getExecRuntime(moduleLikeOp); } } - // the runtime attribute was not set by the user, so set it as default + // the runtime attribute was not set by the user, so set it to NONE if (!execRuntimeAttr) { - return vir::ExecutionRuntime::Default; + return vir::ExecutionRuntime::NONE; } return execRuntimeAttr.getValue(); } + std::optional ResolveWarpSize(mlir::Operation* op) + { + auto runtime = ResolveExecutionRuntime(op); + if (runtime == vir::ExecutionRuntime::CUDA) + { + return 32; + } + else if (runtime == vir::ExecutionRuntime::ROCM) + { + return 64; + } + else + { + return std::nullopt; + } + } + mlir::Operation* CreateGPUControlBarrier(mlir::OpBuilder& builder, const std::string scope, std::optional loc /*= std::nullopt*/) { auto barrierScope = vir::symbolizeEnum(scope); @@ -409,15 +424,33 @@ namespace util return {}; } - std::vector GetCurrentIndexIVs(const std::vector& loopIndices, mlir::Operation* where) + std::vector GetCurrentIndexIVs(const std::vector& loopIndices, mlir::Operation* where, const std::vector>& unrealizedLoopNestIndices) { - return GetCurrentIndexIVs(loopIndices, where->getBlock()); + return GetCurrentIndexIVs(loopIndices, where->getBlock(), unrealizedLoopNestIndices); } - std::vector GetCurrentIndexIVs(const std::vector& loopIndices, mlir::Block* where) + std::vector GetCurrentIndexIVs(const std::vector& loopIndices, mlir::Block* where, const std::vector>& unrealizedLoopNestIndices) { std::vector ivs(loopIndices.size()); + // First check the unrealizedLoopNestIndices for any loopnest indices that haven't been resolved to full AffineForOps yet + for (const auto& indexIVPair : unrealizedLoopNestIndices) + { + const auto& currentIndex = indexIVPair.first; + const auto& currentIV = indexIVPair.second; + auto it = std::find_if(loopIndices.begin(), loopIndices.end(), [&](const loopnest::Index& searchIndex) { + return (searchIndex == currentIndex) || + (searchIndex.GetId() == loopnest::Index::DefaultID && + searchIndex.GetName() == currentIndex.GetName()); + }); + if (it != loopIndices.end()) + { + size_t idx = std::distance(loopIndices.begin(), it); + assert(ivs[idx] == nullptr && "Found same index on multiple loops"); + ivs[idx] = currentIV; + } + } + auto blockParentOp = where->getParentOp(); mlir::AffineForOp currentParentLoop; if (mlir::isa(blockParentOp)) @@ -434,7 +467,16 @@ namespace util if (auto indexAttr = currentParentLoop->getAttrOfType("index")) { auto currentIndex = indexAttr.getValue(); - auto it = std::find(loopIndices.begin(), loopIndices.end(), currentIndex); + + // If the indices we're looking for have a default ID, then only compare by the name of the index + // This is to support well-known named loops created internally by Accera + // If the ID's are not the default, then compare IDs as well + auto it = std::find_if(loopIndices.begin(), loopIndices.end(), [&](const loopnest::Index& searchIndex) { + return (searchIndex == currentIndex) || + (searchIndex.GetId() == loopnest::Index::DefaultID && + searchIndex.GetName() == currentIndex.GetName()); + }); + if (it != loopIndices.end()) { size_t idx = std::distance(loopIndices.begin(), it); @@ -624,8 +666,7 @@ namespace util } // Move the loop body operations, except for its terminator, to the loop's // containing block. 
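Aside (illustration only): the `ResolveWarpSize` helper added earlier in this hunk encodes the usual lane-width convention of 32 for CUDA warps and 64 for ROCm wavefronts, and reports no fixed warp size otherwise. A standalone C++ sketch of that mapping; the `Runtime` enum and `warpSizeFor` are local stand-ins here, not the `vir::ExecutionRuntime` type used in the patch:

```
#include <cstdint>
#include <optional>

enum class Runtime { None, CUDA, ROCm, Vulkan, OpenMP, Default };

// Mirrors the mapping in ResolveWarpSize: 32 lanes per CUDA warp,
// 64 lanes per ROCm wavefront, and no fixed warp size otherwise.
std::optional<int64_t> warpSizeFor(Runtime runtime) {
    switch (runtime) {
    case Runtime::CUDA: return 32;
    case Runtime::ROCm: return 64;
    default: return std::nullopt;
    }
}
```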
- - rewriter.eraseOp(&forOp.getBody()->back()); + rewriter.eraseOp(forOp.getBody()->getTerminator()); parentBlock->getOperations().splice(mlir::Block::iterator(forOp), forOp.getBody()->getOperations()); @@ -715,5 +756,108 @@ namespace util assert(false && "Neither op found in block"); } + mlir::AffineMap ComposeAffineMapSequence(const std::vector& maps) + { + if (maps.empty()) + { + return mlir::AffineMap(); + } + else + { + auto accessMapComposition = maps.front(); + for (size_t mapIdx = 1; mapIdx < maps.size(); ++mapIdx) + { + accessMapComposition = maps[mapIdx].compose(accessMapComposition); + } + return accessMapComposition; + } + } + + template + mlir::AffineMap GetMemRefIndexToMemoryLocationMap(mlir::MLIRContext* context, MemoryOp op) + { + auto memRefType = op.memref().getType().template cast(); + std::vector memRefMaps = memRefType.getAffineMaps().vec(); + if (memRefMaps.empty()) + { + auto stridedLayout = mlir::makeCanonicalStridedLayoutExpr(memRefType.getShape(), context); + memRefMaps.push_back(mlir::AffineMap::get(memRefType.getRank(), 0, stridedLayout)); + } + auto accessMapComposition = ComposeAffineMapSequence(memRefMaps); + assert(accessMapComposition.getNumResults() == 1); + return accessMapComposition; + } + + template + mlir::AffineMap GetAffineOpIndexToMemoryLocationMap(mlir::MLIRContext* context, AffineMemoryOp op) + { + auto composedMemRefMap = GetMemRefIndexToMemoryLocationMap(context, op); + mlir::AffineMap affineOpMap = op.getAffineMapAttr().getValue(); + mlir::AffineMap accessMapComposition = composedMemRefMap.compose(affineOpMap); + assert(accessMapComposition.getNumResults() == 1); + return accessMapComposition; + } + + mlir::AffineMap GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::AffineStoreOp op) + { + return GetAffineOpIndexToMemoryLocationMap(context, op); + } + + mlir::AffineMap GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::AffineLoadOp op) + { + return GetAffineOpIndexToMemoryLocationMap(context, op); + } + + mlir::AffineMap GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::memref::StoreOp op) + { + return GetMemRefIndexToMemoryLocationMap(context, op); + } + + mlir::AffineMap GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::memref::LoadOp op) + { + return GetMemRefIndexToMemoryLocationMap(context, op); + } + + TempOpCleanupGuard::TempOpCleanupGuard(std::stack* opStack, mlir::PatternRewriter& rewriter) : + _opStack(opStack), + _rewriter(rewriter) + {} + + TempOpCleanupGuard::~TempOpCleanupGuard() + { + while (!_opStack->empty()) + { + auto eraseOp = _opStack->top(); + assert(eraseOp->use_empty()); + _rewriter.eraseOp(eraseOp); + _opStack->pop(); + } + } + + mlir::Attribute MemorySpaceToAttribute(const value::MemorySpace& memorySpace, mlir::MLIRContext* context) + { + return mlir::IntegerAttr::get(mlir::IntegerType::get(context, 64), static_cast(memorySpace)); + } + + value::MemorySpace AttributeToMemorySpace(mlir::Attribute memorySpaceAttr) + { + return static_cast(memorySpaceAttr.cast().getInt()); + } + + mlir::AffineMap GetMajorIdentityMap(unsigned dims, unsigned results, mlir::MLIRContext* context) + { + assert(dims >= results && "Dimension mismatch"); + auto id = mlir::AffineMap::getMultiDimIdentityMap(dims, context); + return mlir::AffineMap::get(dims, 0, id.getResults().take_front(results), context); + } + + void EraseAllOpsInBlock(mlir::PatternRewriter& rewriter, mlir::Block& block) + { + for (auto& op : llvm::make_early_inc_range(llvm::reverse(block))) + { + assert(op.use_empty() && "expected 
'op' to have no uses"); + rewriter.eraseOp(&op); + } + } } // namespace util } // namespace accera::ir diff --git a/accera/ir/src/TranslateToHeader.cpp b/accera/ir/src/TranslateToHeader.cpp index 5346152f..c044dba6 100644 --- a/accera/ir/src/TranslateToHeader.cpp +++ b/accera/ir/src/TranslateToHeader.cpp @@ -84,6 +84,12 @@ namespace ir os << "//\n\n"; os << "#include \n\n"; + // for float16_t + os << "#if !defined(ACCERA_FLOAT)\n"; + os << "#define ACCERA_FLOAT 1\n"; + os << "typedef uint16_t float16_t;\n"; + os << "#endif // !defined(ACCERA_FLOAT)\n"; + os << "#if defined(__cplusplus)\n"; os << "extern \"C\"\n"; os << "{\n"; @@ -285,6 +291,10 @@ namespace ir { WriteIntegerType(os, t); } + else if (t.isF16()) + { + os << "float16_t"; + } else if (t.isF32()) { os << "float"; diff --git a/accera/ir/src/exec/ExecutionPlanAttributes.cpp b/accera/ir/src/exec/ExecutionPlanAttributes.cpp index 07b7550f..d4243e8e 100644 --- a/accera/ir/src/exec/ExecutionPlanAttributes.cpp +++ b/accera/ir/src/exec/ExecutionPlanAttributes.cpp @@ -159,7 +159,7 @@ namespace executionPlan return getImpl()->getValue(); } - TensorizationInfoAttr parseTensorizeInfo(mlir::DialectAsmParser& parser) + TensorizationInfoAttr parseTensorizationInfo(mlir::DialectAsmParser& parser) { int dim0, dim1, dim2; if (failed(parser.parseLBrace())) @@ -176,12 +176,12 @@ namespace executionPlan return {}; if (failed(parser.parseRBrace())) return {}; - return TensorizationInfoAttr::get(TensorizationInfo{std::vector{dim0, dim1, dim2}}, parser.getBuilder().getContext()); + return TensorizationInfoAttr::get(TensorizationInfo{std::array{dim0, dim1, dim2}}, parser.getBuilder().getContext()); } void print(TensorizationInfoAttr attr, mlir::DialectAsmPrinter& printer) { - printer << "tensorizeinfo"; + printer << "tensorizationinfo"; auto tensorizelInfo = attr.cast().getValue(); printer << tensorizelInfo; } diff --git a/accera/ir/src/exec/ExecutionPlanOps.cpp b/accera/ir/src/exec/ExecutionPlanOps.cpp index 47ac6fb3..80b319d7 100644 --- a/accera/ir/src/exec/ExecutionPlanOps.cpp +++ b/accera/ir/src/exec/ExecutionPlanOps.cpp @@ -4,9 +4,10 @@ //////////////////////////////////////////////////////////////////////////////////////////////////// #include "exec/ExecutionPlanOps.h" -#include "exec/ExecutionPlanDialect.cpp.inc" + #include "IRUtil.h" #include "exec/ExecutionPlanAttributes.h" +#include "exec/ExecutionPlanDialect.cpp.inc" #include "exec/ExecutionPlanEnums.cpp.inc" #include "nest/Index.h" #include "nest/LoopNestAttributes.h" @@ -839,7 +840,7 @@ namespace executionPlan case MemorySpace::Shared: memoryLocation = gpu::GPUDialect::getWorkgroupAddressSpace(); break; - case MemorySpace::Local: + case MemorySpace::Private: memoryLocation = gpu::GPUDialect::getPrivateAddressSpace(); break; } @@ -925,8 +926,8 @@ namespace executionPlan case MemorySpace::Shared: memoryLocation = (int)value::MemorySpace::Shared; break; - case MemorySpace::Local: - memoryLocation = (int)value::MemorySpace::Local; + case MemorySpace::Private: + memoryLocation = (int)value::MemorySpace::Private; break; } @@ -969,13 +970,14 @@ namespace executionPlan mlir::MemRefType cacheType, accera::ir::value::MemorySpace memorylocation) { - build(builder, - result, - cacheType, - memorylocation, - mlir::AffineMap::getMultiDimIdentityMap(cacheType.getRank(), builder.getContext()), - std::vector{}, - std::vector{}); + build( + builder, + result, + cacheType, + memorylocation, + mlir::AffineMap::getMultiDimIdentityMap(cacheType.getRank(), builder.getContext()), + std::vector{}, + std::vector{}); 
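Aside (illustration only): the `GetMemRefIndexToMemoryLocationMap` helpers added to IRUtil.cpp above compose a memref's layout maps into a single index-to-location map, falling back to the canonical strided (row-major) layout when the memref declares no layout maps. The standalone C++ sketch below models only that fallback case; `canonicalStrides` and `linearLocation` are illustrative names, not Accera or MLIR APIs:

```
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Canonical row-major strides for a static shape (the layout described by the
// makeCanonicalStridedLayoutExpr fallback).
std::vector<int64_t> canonicalStrides(const std::vector<int64_t>& shape) {
    std::vector<int64_t> strides(shape.size(), 1);
    for (std::size_t i = shape.size(); i-- > 1;)
        strides[i - 1] = strides[i] * shape[i];
    return strides;
}

// The single-result index -> memory location map: a dot product of the
// access indices with the strides.
int64_t linearLocation(const std::vector<int64_t>& indices,
                       const std::vector<int64_t>& strides) {
    int64_t loc = 0;
    for (std::size_t i = 0; i < indices.size(); ++i)
        loc += indices[i] * strides[i];
    return loc;
}

int main() {
    const auto strides = canonicalStrides({ 4, 8, 16 }); // {128, 16, 1}
    assert(linearLocation({ 1, 2, 3 }, strides) == 1 * 128 + 2 * 16 + 3);
    return 0;
}
```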
} void MakeCacheOp::build(OpBuilder& builder, @@ -988,6 +990,7 @@ namespace executionPlan { auto offsetAccessIndexAttrs = util::ConvertIndexVectorToArrayAttr(offsetAccessIndices, builder.getContext()); auto multiCacheAccessIndexAttrs = util::ConvertIndexVectorToArrayAttr(multiCacheAccessIndices, builder.getContext()); + build(builder, result, cacheType, @@ -1007,18 +1010,19 @@ namespace executionPlan return result; } - mlir::AffineValueMap MakeCacheOp::insertCachePosition(mlir::Operation* where, const std::vector& baseArrayIndices) + mlir::AffineValueMap MakeCacheOp::insertCachePosition(mlir::Operation* where, const std::vector& baseArrayIndices, const std::vector>& unrealizedLoopnestIndices) { - return insertCachePosition(where->getBlock(), baseArrayIndices); + return insertCachePosition(where->getBlock(), baseArrayIndices, unrealizedLoopnestIndices); } - mlir::AffineValueMap MakeCacheOp::insertCachePosition(mlir::Block* where, const std::vector& baseArrayIndices) + mlir::AffineValueMap MakeCacheOp::insertCachePosition(mlir::Block* where, const std::vector& baseArrayIndices, const std::vector>& unrealizedLoopnestIndices) { + // The unrealizedLoopnestIndices contain indices for a loopnest that hasn't been fully constructed yet std::vector cacheMultiCacheIndices = util::ConvertArrayAttrToIndexVector(multiCacheAccessIndices()); std::vector cacheOffsetAccessIndices = util::ConvertArrayAttrToIndexVector(offsetAccessIndices()); - std::vector multiCacheIVs = util::GetCurrentIndexIVs(cacheMultiCacheIndices, where); - std::vector offsetAccessIVs = util::GetCurrentIndexIVs(cacheOffsetAccessIndices, where); + std::vector multiCacheIVs = util::GetCurrentIndexIVs(cacheMultiCacheIndices, where, unrealizedLoopnestIndices); + std::vector offsetAccessIVs = util::GetCurrentIndexIVs(cacheOffsetAccessIndices, where, unrealizedLoopnestIndices); mlir::OpBuilder builder = mlir::OpBuilder::atBlockBegin(where); @@ -1032,6 +1036,7 @@ namespace executionPlan return insertCachePosition(multiCacheIterationCounters, offsetAccessIVs, baseArrayIndices); } + // Note : this doesn't always work after canonicalization potentially removes operands template std::vector GetBaseArrayLoadStorePosition(OpType op, const mlir::ArrayAttr& multiCacheAccessIndices, const mlir::ArrayAttr& offsetAccessIndices) { @@ -1214,7 +1219,9 @@ namespace executionPlan ValueRange ubOperands, mlir::ArrayAttr lbMaps, mlir::ArrayAttr ubMaps, - AffineMap activeBlockToCacheMap) + AffineMap activeBlockToCacheMap, + StringRef activeBlockTag, + bool thrifty) { build(builder, result, @@ -1225,7 +1232,9 @@ namespace executionPlan lbMaps, ubMaps, activeBlockToCacheMap, - llvm::None); // scaleValues + llvm::None, // scaleValues + activeBlockTag, + thrifty); } // @@ -1332,7 +1341,10 @@ namespace executionPlan int64_t id, int64_t cacheHierarchyLevel, bool activeBlockCache, - bool dimReorderCache) + bool dimReorderCache, + bool thrifty, + bool doubleBufferCache, + accera::ir::value::MemorySpace doubleBufferMemorySpace) { auto cacheRegionRelevantIndexRangeAttrs = util::VectorToArrayAttr( cacheAccessContext.cacheRegionRelevantScheduleIndexRanges, @@ -1369,6 +1381,15 @@ namespace executionPlan { result.addAttribute("dimReorderCache", builder.getUnitAttr()); } + if (thrifty) + { + result.addAttribute("thrifty", builder.getUnitAttr()); + } + if (doubleBufferCache) + { + result.addAttribute("doubleBufferCache", builder.getUnitAttr()); + result.addAttribute("doubleBufferMemorySpace", value::MemorySpaceAttr::get(builder.getContext(), doubleBufferMemorySpace)); + } 
result.addAttribute("operand_segment_sizes", builder.getI32VectorAttr({ 1 /* fromValue */, 1 /* toValue */, 1 /* baseInput */, static_cast(cacheAccessContext.fullRelevantScheduleIndices.size()), static_cast(cacheAccessContext.externalRelevantScheduleIndices.size()) })); } @@ -1438,7 +1459,10 @@ namespace executionPlan loopnest::Index innermostLoopNestIndex, int64_t id, int64_t cacheHierarchyLevel, - bool dimReorderCache) + bool dimReorderCache, + bool thrifty, + bool doubleBufferCache, + accera::ir::value::MemorySpace doubleBufferMemorySpace) { result.addTypes(builder.getIndexType()); result.addOperands(input); @@ -1453,6 +1477,15 @@ namespace executionPlan { result.addAttribute("dimReorderCache", builder.getUnitAttr()); } + if (thrifty) + { + result.addAttribute("thrifty", builder.getUnitAttr()); + } + if (doubleBufferCache) + { + result.addAttribute("doubleBufferCache", builder.getUnitAttr()); + result.addAttribute("doubleBufferMemorySpace", value::MemorySpaceAttr::get(builder.getContext(), doubleBufferMemorySpace)); + } } Index BeginMaxElementCacheRegionOp::index() @@ -1492,6 +1525,31 @@ namespace executionPlan return op; } + // + // DelayedMappingRegionOp + // + void DelayedMappingRegionOp::build(mlir::OpBuilder& builder, + mlir::OperationState& result, + mlir::Value from, + mlir::Value to) + { + result.addOperands({ from, to }); + mlir::Region* region = result.addRegion(); + mlir::Block* bodyBlock = new mlir::Block; + region->getBlocks().push_back(bodyBlock); + ensureTerminator(*region, builder, result.location); + } + + DelayedMappingRegionOp MakeDelayedMappingRegion(mlir::OpBuilder& builder, mlir::Value from, mlir::Value to, std::function body) + { + auto loc = from.getLoc(); + auto mappingRegionOp = builder.create(loc, from, to); + auto bodyBuilder = mappingRegionOp.getBodyBuilder(); + body(bodyBuilder); + + return mappingRegionOp; + } + // Parse an instance of an attribute registered to the execution plan dialect. 
mlir::Attribute ExecutionPlanDialect::parseAttribute(mlir::DialectAsmParser& parser, mlir::Type type) const { @@ -1508,6 +1566,10 @@ namespace executionPlan { return parseParallelizationInfo(parser); } + else if (keyword == "tensorizationinfo") + { + return parseTensorizationInfo(parser); + } else if (keyword == "inplaceunrollinfo") { return parseInPlaceUnrollInfo(parser); diff --git a/accera/ir/src/nest/Index.cpp b/accera/ir/src/nest/Index.cpp index e1ecf344..ee0fb13f 100644 --- a/accera/ir/src/nest/Index.cpp +++ b/accera/ir/src/nest/Index.cpp @@ -10,7 +10,7 @@ namespace accera::ir { namespace loopnest { - Index Index::none = Index("", -1); + Index Index::none = Index("", Index::DefaultID); Index::Index(const std::string& name) : Index(name, Index::GetNextId()) diff --git a/accera/ir/src/nest/LoopNestBuilder.cpp b/accera/ir/src/nest/LoopNestBuilder.cpp index 9b7c9730..77c4a955 100644 --- a/accera/ir/src/nest/LoopNestBuilder.cpp +++ b/accera/ir/src/nest/LoopNestBuilder.cpp @@ -499,18 +499,18 @@ namespace loopnest { if (constRange.NumIterations() < (int64_t)*val) { - loop->setAttr("rcv_unrolled", builder.getUnitAttr()); + loop->setAttr("accv_unrolled", builder.getUnitAttr()); } } if (auto val = GetUnrollAndJamFactor(loopIndex)) { - loop->setAttr("rcv_unroll_jam", builder.getI64IntegerAttr((int64_t)*val)); + loop->setAttr("accv_unroll_jam", builder.getI64IntegerAttr((int64_t)*val)); } if (IsSaturated(loopIndex)) { - loop->setAttr("rcv_saturated", builder.getUnitAttr()); + loop->setAttr("accv_saturated", builder.getUnitAttr()); } auto execPlan = GetScheduleOp().getOrCreateExecPlan(); @@ -528,7 +528,7 @@ namespace loopnest auto indexAttr = val.dyn_cast(); if (loopIndex.GetId() == indexAttr.getValue().GetId()) { - loop->setAttr("rcv_gpu_map", builder.getStringAttr(key.str())); + loop->setAttr("accv_gpu_map", builder.getStringAttr(key.str())); } } } diff --git a/accera/ir/src/nest/TransformedDomain.cpp b/accera/ir/src/nest/TransformedDomain.cpp index eae20421..d73f4a5f 100644 --- a/accera/ir/src/nest/TransformedDomain.cpp +++ b/accera/ir/src/nest/TransformedDomain.cpp @@ -205,7 +205,7 @@ namespace loopnest if (_indices.count(index) == 0) throw accera::utilities::InputException(accera::utilities::InputExceptionErrors::invalidArgument, "Splitting an unknown index"); if (!IsLoopIndex(index)) - throw accera::utilities::InputException(accera::utilities::InputExceptionErrors::invalidArgument, "Can't split an already-transformed index"); + throw accera::utilities::InputException(accera::utilities::InputExceptionErrors::invalidArgument, "Cannot split an already-transformed index"); auto parentRange = _indices[index].range; auto parentIncrement = parentRange.Increment(); diff --git a/accera/ir/src/value/ValueDialect.cpp b/accera/ir/src/value/ValueDialect.cpp index 116cff49..995c2dd3 100644 --- a/accera/ir/src/value/ValueDialect.cpp +++ b/accera/ir/src/value/ValueDialect.cpp @@ -500,6 +500,10 @@ bool MFMAMatrixType::isValidElementType(Type elementType) return elementType.isF16() || elementType.isF32(); } +int64_t MFMAMatrixType::getLeadingDim() const { + return getShape().back(); +} + LogicalResult MFMAMatrixType::verify(llvm::function_ref emitError, ArrayRef shape, @@ -510,9 +514,6 @@ MFMAMatrixType::verify(llvm::function_ref emitError, !operand.equals("COp")) return emitError() << "operand expected to be one of AOp, BOp or COp"; - if (shape.size() != 2) - return emitError() << "MFMAMatrixType must have exactly two dimensions"; - if (!MFMAMatrixType::isValidElementType(elementType)) return emitError() << 
"MFMAMatrixType elements must be F16 or F32"; @@ -537,7 +538,7 @@ static LogicalResult verify(MFMAComputeOp op) }; SmallVector opTypes; - auto populateOpInfo = [&opTypes, &op]() { + auto populateOpInfo = [&opTypes, &op]() { opTypes.push_back(op.opA().getType().cast()); opTypes.push_back(op.opB().getType().cast()); opTypes.push_back(op.opC().getType().cast()); @@ -554,25 +555,37 @@ static LogicalResult verify(MFMAComputeOp op) bShape = opTypes[B].getShape(); cShape = opTypes[C].getShape(); - if (aShape[1] != bShape[0] || aShape[0] != cShape[0] || + if (aShape[1] != bShape[0] || + aShape[0] != cShape[0] || bShape[1] != cShape[1]) return op.emitError("operand shapes do not satisfy matmul constraints"); return success(); } -static LogicalResult verify(MFMALoadMatrixOp op) +static LogicalResult verify(MFMAConstantOp op) { - auto srcType = op.srcMemref().getType(); - auto resType = op.res().getType(); - auto resMatrixType = resType.cast(); + auto value = op.value(); + auto valueType = value.getType(); + auto resMatrixType = op.getMFMAMatrixType(); auto operand = resMatrixType.getOperand(); - auto srcMemrefType = srcType.cast(); - auto srcMemSpace = srcMemrefType.getMemorySpaceAsInt(); - if (!srcMemrefType.getAffineMaps().empty() && - !srcMemrefType.getAffineMaps().front().isIdentity()) - return op.emitError("expected identity layout map for source memref"); + if (!operand.equals("AOp") && !operand.equals("BOp") && + !operand.equals("COp")) + return op.emitError("only AOp, BOp and COp can be constant filled"); + + if (valueType != resMatrixType.getElementType()) + return op.emitError("value type must match matrix element type"); + + return success(); +} + +static LogicalResult verify(MFMALoadOp op) +{ + auto srcType = op.getMemRefType(); + auto resMatrixType = op.getMFMAMatrixType(); + auto operand = resMatrixType.getOperand(); + auto srcMemSpace = srcType.getMemorySpaceAsInt(); if (srcMemSpace != kGenericMemorySpace && srcMemSpace != kSharedMemorySpace && srcMemSpace != kGlobalMemorySpace) @@ -587,17 +600,12 @@ static LogicalResult verify(MFMALoadMatrixOp op) return success(); } -static LogicalResult verify(MFMAStoreMatrixOp op) +static LogicalResult verify(MFMAStoreOp op) { - auto srcType = op.src().getType(); - auto dstType = op.dstMemref().getType(); - auto srcMatrixType = srcType.cast(); - auto dstMemrefType = dstType.cast(); + auto srcMatrixType = op.getMFMAMatrixType(); + auto dstMemrefType = op.getMemRefType(); auto dstMemSpace = dstMemrefType.getMemorySpaceAsInt(); - if (!dstMemrefType.getAffineMaps().empty() && - !dstMemrefType.getAffineMaps().front().isIdentity()) - return op.emitError("expected identity layout map for destination memref"); - + if (dstMemSpace != kGenericMemorySpace && dstMemSpace != kSharedMemorySpace && dstMemSpace != kGlobalMemorySpace) return op.emitError( diff --git a/accera/ir/test/ir_tests/ir_tests.cpp b/accera/ir/test/ir_tests/ir_tests.cpp index 1adfb6ba..355948d6 100644 --- a/accera/ir/test/ir_tests/ir_tests.cpp +++ b/accera/ir/test/ir_tests/ir_tests.cpp @@ -253,6 +253,66 @@ TEST_CASE_METHOD(Fixture, "Test4", "[cpu][lang]") << debugString(module)); } +TEST_CASE_METHOD(Fixture, "vectorized_vector_add", "[cpu][nest]") +{ + auto target = GENERATE(ConversionTarget::accera, ConversionTarget::mlir, ConversionTarget::llvm); + + constexpr auto N = 1024; + + using namespace accera::value; + using namespace accera::utilities; + using accera::value::Value, accera::value::Matrix; + + DeclareFunction("NestVectorAdd") + .Public(true) + .Parameters( + Value({ 
ValueType::Float, MemoryLayout(MemoryShape{ N }) }), + Value({ ValueType::Float, MemoryLayout(MemoryShape{ N }) }), + Value({ ValueType::Float, MemoryLayout(MemoryShape{ N }) })) + .Define([=](Vector Out, Vector A, Vector B) { + // Declare and/or calculate constants + const int n = (int)(A.Size()); // N + + // Schedule constants + const int vectorSize = 8; // AVX-2 gives 256-bit registers, which can hold 8 floats + const int vectorBytes = vectorSize * 4; // 4 bytes per float + const int vectorUnits = 16; // AVX-2 has 16 256-bit registers + const int innerLoopSize = 128; + + // Define Nest + Nest nest(MemoryShape{ N }); + + // Get indexes + auto indices = nest.GetIndices(); + Scalar i = indices[0]; + + nest.Set([&]() { Out(i) = A(i) + B(i); }); + + auto schedule = nest.CreateSchedule(); + + // Declare splits + auto [iCache, iInner1] = schedule.Split(i, innerLoopSize); + auto [iKernelOuter2, iInner2] = schedule.Split(iInner1, 2 * vectorSize); + auto [iKernelOuter, iInner3] = schedule.Split(iInner2, vectorSize); + + // Set the order + schedule.SetOrder({ iCache, iKernelOuter2, iKernelOuter, iInner3 }); + + auto plan = schedule.CreatePlan(); + plan.AddCache(A, iKernelOuter2); + plan.AddCache(B, iKernelOuter2); + plan.AddCache(Out, iKernelOuter2); + + // Set unrolling + schedule.Unroll(iKernelOuter); + plan.Vectorize(iInner3, { vectorBytes, vectorUnits }); + }); + + RunConversionPasses(target, "vectorized_vector_add_" + stringify(target)); + SUCCEED("targeting " << stringify(target) << ":\n\n" + << debugString(module)); +} + TEST_CASE_METHOD(Fixture, "strided_subvector", "[cpu][lang]") { auto target = GENERATE(ConversionTarget::accera, ConversionTarget::mlir, ConversionTarget::llvm); @@ -281,7 +341,6 @@ TEST_CASE_METHOD(Fixture, "strided_subvector2", "[cpu][lang]") using namespace accera::value; using namespace accera::utilities; using accera::value::Value; - using accera::value::Value, accera::value::Matrix; const auto N = 16; @@ -289,7 +348,7 @@ TEST_CASE_METHOD(Fixture, "strided_subvector2", "[cpu][lang]") DeclareFunction("func_test") .Parameters(Value({ ValueType::Float, 0 }, { N })) .Define([=](Vector x) { - Vector y = x.SubVector(1, N/2, 2); + Vector y = x.SubVector(1, N / 2, 2); y(0) = 4.0f; }); @@ -322,6 +381,54 @@ TEST_CASE_METHOD(Fixture, "strided_submatrix", "[cpu][lang]") << debugString(module)); } +TEST_CASE_METHOD(Fixture, "fp32_vector_add", "[cpu][lang]") +{ + auto target = GENERATE(ConversionTarget::accera, ConversionTarget::mlir, ConversionTarget::llvm); + + using namespace accera::value; + using namespace accera::utilities; + using accera::value::Value; + + [[maybe_unused]] auto f = + DeclareFunction("func_test") + .Public(true) + .Parameters( + Value({ ValueType::Float, 0 }, { 2 }), + Value({ ValueType::Float, 0 }, { 2 }), + Value({ ValueType::Float, 0 }, { 2 })) + .Define([](Vector a, Vector b, Vector c) { + c[0] = a[0] + b[0]; + }); + + RunConversionPasses(target, "fp32_vector_add_" + stringify(target)); + SUCCEED("targeting " << stringify(target) << ":\n\n" + << debugString(module)); +} + +TEST_CASE_METHOD(Fixture, "fp16_vector_add", "[cpu][lang]") +{ + auto target = GENERATE(ConversionTarget::accera, ConversionTarget::mlir, ConversionTarget::llvm); + + using namespace accera::value; + using namespace accera::utilities; + using accera::value::Value; + + [[maybe_unused]] auto f = + DeclareFunction("func_test") + .Public(true) + .Parameters( + Value({ ValueType::Float16, 0 }, { 1 }), + Value({ ValueType::Float16, 0 }, { 1 }), + Value({ ValueType::Float16, 0 }, { 1 })) + 
.Define([](Vector a, Vector b, Vector c) { + c[0] = a[0] + b[0]; + }); + + RunConversionPasses(target, "fp16_vector_add_" + stringify(target)); + SUCCEED("targeting " << stringify(target) << ":\n\n" + << debugString(module)); +} + TEST_CASE_METHOD(Fixture, "vector_add", "[gpu][lang]") { auto target = GENERATE(ConversionTarget::accera, ConversionTarget::mlir, ConversionTarget::llvm); @@ -384,7 +491,7 @@ TEST_CASE_METHOD(Fixture, "vector_add_rocm", "[gpu][lang]") auto gpu_f1 = DeclareFunction("gpu_f1") .Target(targets::GPU({ 128, 1, 1 }, { 128, 1, 1 })) - .Runtime(ExecutionRuntime::Rocm) + .Runtime(ExecutionRuntime::ROCM) .Parameters(Value{ ValueType::Float, MemoryLayout{ { 16384 } } }, Value{ ValueType::Float, MemoryLayout{ { 16384 } } }, Value{ ValueType::Float, MemoryLayout{ { 16384 } } }) @@ -399,7 +506,7 @@ TEST_CASE_METHOD(Fixture, "vector_add_rocm", "[gpu][lang]") C[offset] = summed; }); accera::transforms::AcceraPassPipelineOptions opts{}; - opts.runtime = accera::value::ExecutionRuntime::Rocm; + opts.runtime = accera::value::ExecutionRuntime::ROCM; RunConversionPasses(target, "vector_sum_rocm_" + stringify(target), opts); SUCCEED("targeting " << stringify(target) << ":\n\n" << debugString(module)); @@ -1518,7 +1625,7 @@ TEST_CASE_METHOD(Fixture, "basic_gemm_loopnest", "[cpu][nest]") << debugString(module)); } -TEST_CASE_METHOD(Fixture, "matmul_value_gpu_local_mem", "[gpu][lang]") +TEST_CASE_METHOD(Fixture, "matmul_value_gpu_private_mem", "[gpu][lang]") { const int64_t M = 32; const int64_t N = 32; @@ -1555,7 +1662,7 @@ TEST_CASE_METHOD(Fixture, "matmul_value_gpu_local_mem", "[gpu][lang]") auto i = blockIdX * blockDimX + threadIdX; auto j = blockIdY * blockDimY + threadIdY; - Vector accum_ref = Allocate(C.GetType(), MemoryLayout{ { 1 } }.SetMemorySpace(MemorySpace::Local)); + Vector accum_ref = Allocate(C.GetType(), MemoryLayout{ { 1 } }.SetMemorySpace(MemorySpace::Private)); accum_ref[0] = Cast(0, C.GetType()); ForRange(K, [&](Scalar k) { @@ -1597,7 +1704,7 @@ TEST_CASE_METHOD(Fixture, "matmul_value_gpu_local_mem", "[gpu][lang]") PrintMemref(C); }); - RunConversionPasses(target, "matmul_value_gpu_local_mem_" + stringify(target)); + RunConversionPasses(target, "matmul_value_gpu_private_mem_" + stringify(target)); SUCCEED("targeting " << stringify(target) << ":\n\n" << debugString(module)); } @@ -2559,3 +2666,294 @@ TEST_CASE_METHOD(Fixture, "parallelize_gemm_mlas_value", "[cpu][nest]") SUCCEED("targeting " << stringify(target) << ":\n\n" << debugString(module)); } + +TEST_CASE_METHOD(Fixture, "mlir_nest_test_gemm_tiled_mfma_rocm", "[gpu][nest][mfma][main]") +{ + const int64_t M = 32; + const int64_t N = 32; + const int64_t K = 32; + + const int64_t blockDim = 16; + const int64_t tileSize = blockDim; + const int64_t mfmaOutLen = 4; + + REQUIRE(M % tileSize == 0); + REQUIRE(N % tileSize == 0); + + const int64_t gridDimX = N / blockDim; + const int64_t gridDimY = M / blockDim; + + auto target = GENERATE(ConversionTarget::accera, ConversionTarget::mlir, ConversionTarget::llvm); + + using namespace accera::value; + using namespace accera::utilities; + using accera::utilities::MemorySpace; + using accera::value::Value, accera::value::Matrix; + + auto gpuConfig = targets::GPU{}; + gpuConfig.grid = targets::Dim3(gridDimX, gridDimY); + gpuConfig.block = targets::Dim3(blockDim, blockDim); + + auto matmul = + DeclareFunction("NestMatMul") + .Target(gpuConfig) + .Runtime(ExecutionRuntime::ROCM) + .Decorated(false) + .Public(true) + .Parameters( + Value({ ValueType::Float, 
MemoryLayout(MemoryShape{ M, K }) }), + Value({ ValueType::Float, MemoryLayout(MemoryShape{ K, N }) }), + Value({ ValueType::Float, MemoryLayout(MemoryShape{ M, N }) })) + .Define([=](Matrix A, Matrix B, Matrix C) { + Nest matmul({ M, N }); + auto indices = matmul.GetIndices(); + Scalar i = indices[0]; + Scalar j = indices[1]; + + matmul.Set([&]() { + Scalar tidX = GPU::ThreadId().X(); + Scalar tidY = GPU::ThreadId().Y(); + + auto mfmaAMatrix = MFMALoad(A.GetValue(), { 16, 16 }, "AOp"); + auto mfmaBMatrix = MFMALoad(B.GetValue(), { 16, 16 }, "BOp"); + auto mfmaCMatrix = MFMALoad(C.GetValue(), { 16, 16 }, "COp"); + auto mfmaDMatrix = MFMACompute(mfmaAMatrix, mfmaBMatrix, mfmaCMatrix); + MFMAStore(mfmaDMatrix, C.GetValue()); + }); + + auto sched = matmul.CreateSchedule(); + auto [iOuter, iInner] = sched.Split(i, blockDim); + auto [jOuter, jInner] = sched.Split(j, blockDim); + auto plan = sched.CreateGPUPlan(gpuConfig); + plan.MapIndexToProcessor(iOuter, Processor::BlockY); + plan.MapIndexToProcessor(jOuter, Processor::BlockX); + plan.MapIndexToProcessor(iInner, Processor::ThreadY); + plan.MapIndexToProcessor(jInner, Processor::ThreadX); + }); + + accera::transforms::AcceraPassPipelineOptions opts{}; + opts.dumpPasses = true; + opts.dumpIntraPassIR = false; + opts.gpuOnly = true; + opts.runtime = accera::value::ExecutionRuntime::ROCM; + + RunConversionPasses(target, "mlir_nest_test_gemm_tiled_mfma_rocm_", opts); + SUCCEED("targeting " << stringify(target) << ":\n\n" + << debugString(module)); +} + +TEST_CASE_METHOD(Fixture, "mlir_nest_test_tensorize_rocm_single_block_single_warp", "[gpu][nest][mfma][main]") +{ + const int64_t M = 16; + const int64_t N = 16; + const int64_t K = 16; + + const int64_t blockDim = 16; + const int64_t tileSize = blockDim; + + REQUIRE(M % tileSize == 0); + REQUIRE(N % tileSize == 0); + + const int64_t gridDimX = N / blockDim; + const int64_t gridDimY = M / blockDim; + + auto target = GENERATE(ConversionTarget::accera, ConversionTarget::mlir, ConversionTarget::llvm); + + using namespace accera::value; + using namespace accera::utilities; + using accera::utilities::MemorySpace; + using accera::value::Value, accera::value::Matrix; + + auto gpuConfig = targets::GPU{}; + gpuConfig.grid = targets::Dim3(gridDimX, gridDimY); + gpuConfig.block = targets::Dim3(blockDim, blockDim); + + auto matmul = + DeclareFunction("NestMatMul") + .Target(gpuConfig) + .Runtime(ExecutionRuntime::ROCM) + .Decorated(false) + .Public(true) + .Parameters( + Value({ ValueType::Float, MemoryLayout(MemoryShape{ M, K }) }), + Value({ ValueType::Float, MemoryLayout(MemoryShape{ K, N }) }), + Value({ ValueType::Float, MemoryLayout(MemoryShape{ M, N }) })) + .Define([=](Matrix A, Matrix B, Matrix C) { + Nest nest({ M, N, K }); + auto indices = nest.GetIndices(); + Scalar i = indices[0]; + Scalar j = indices[1]; + Scalar k = indices[2]; + + nest.Set([&]() { C(i, j) += A(i, k) * B(k, j); }); + + auto sched = nest.CreateSchedule(); + auto [iOuter, iInner] = sched.Split(i, blockDim); + auto [jOuter, jInner] = sched.Split(j, blockDim); + auto [kOuter, kInner] = sched.Split(k, 16); + auto [iInnerOuter, iInner2] = sched.Split(iInner, 2); + auto [jInnerOuter, jInner2] = sched.Split(jInner, 2); + sched.SetOrder({ iOuter, jOuter, iInnerOuter, jInnerOuter, kOuter, iInner2, jInner2, kInner }); + + auto plan = sched.CreateGPUPlan(gpuConfig); + plan.MapIndexToProcessor(iOuter, Processor::BlockY); + plan.MapIndexToProcessor(jOuter, Processor::BlockX); + plan.MapIndexToProcessor(iInnerOuter, Processor::ThreadY); + 
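The tensorize tests in this file all use the same decomposition: i and j are split by blockDim and then by 2, k is split by 16, and the innermost triple is handed to Tensorize with shape {2, 2, 16}, so the (ThreadY, ThreadX) grid exactly covers one blockDim x blockDim x 16 step. A small Python check of that invariant (illustrative only, not part of the patch):

```python
# Illustrative only -- not part of the patch. Checks the tile arithmetic behind
# the Tensorize({iInner2, jInner2, kInner}, {2, 2, 16}) calls in these tests.
def tensorize_coverage(block_dim: int, k_split: int = 16, frag=(2, 2, 16)):
    fi, fj, fk = frag
    threads = (block_dim // fi) * (block_dim // fj)  # iInnerOuter x jInnerOuter
    per_thread = fi * fj * fk                        # elements each thread owns per k step
    tile = block_dim * block_dim * k_split           # one blockDim x blockDim x k_split step
    assert threads * per_thread == tile
    return threads, per_thread

for bd in (16, 32, 64):   # block dims used by the tensorize tests
    print(bd, tensorize_coverage(bd))
```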
plan.MapIndexToProcessor(jInnerOuter, Processor::ThreadX); + plan.Tensorize({ iInner2, jInner2, kInner }, { 2, 2, 16 }); + }); + + accera::transforms::AcceraPassPipelineOptions opts{}; + opts.dumpPasses = true; + opts.dumpIntraPassIR = false; + opts.gpuOnly = true; + opts.runtime = accera::value::ExecutionRuntime::ROCM; + + RunConversionPasses(target, "mlir_nest_test_tensorize_rocm_single_block_single_warp_", opts); + SUCCEED("targeting " << stringify(target) << ":\n\n" + << debugString(module)); +} + +TEST_CASE_METHOD(Fixture, "mlir_nest_test_tensorize_rocm_single_block_multiple_warp", "[gpu][nest][mfma][main]") +{ + const int64_t M = 64; + const int64_t N = 64; + const int64_t K = 64; + + const int64_t blockDim = 64; + const int64_t tileSize = blockDim; + + REQUIRE(M % tileSize == 0); + REQUIRE(N % tileSize == 0); + + const int64_t gridDimX = N / blockDim; + const int64_t gridDimY = M / blockDim; + + auto target = GENERATE(ConversionTarget::accera, ConversionTarget::mlir, ConversionTarget::llvm); + + using namespace accera::value; + using namespace accera::utilities; + using accera::utilities::MemorySpace; + using accera::value::Value, accera::value::Matrix; + + auto gpuConfig = targets::GPU{}; + gpuConfig.grid = targets::Dim3(gridDimX, gridDimY); + gpuConfig.block = targets::Dim3(blockDim, blockDim); + + auto matmul = + DeclareFunction("NestMatMul") + .Target(gpuConfig) + .Runtime(ExecutionRuntime::ROCM) + .Decorated(false) + .Public(true) + .Parameters( + Value({ ValueType::Float, MemoryLayout(MemoryShape{ M, K }) }), + Value({ ValueType::Float, MemoryLayout(MemoryShape{ K, N }) }), + Value({ ValueType::Float, MemoryLayout(MemoryShape{ M, N }) })) + .Define([=](Matrix A, Matrix B, Matrix C) { + Nest nest({ M, N, K }); + auto indices = nest.GetIndices(); + Scalar i = indices[0]; + Scalar j = indices[1]; + Scalar k = indices[2]; + + nest.Set([&]() { C(i, j) += A(i, k) * B(k, j); }); + + auto sched = nest.CreateSchedule(); + auto [iOuter, iInner] = sched.Split(i, blockDim); + auto [jOuter, jInner] = sched.Split(j, blockDim); + auto [kOuter, kInner] = sched.Split(k, 16); + auto [iInnerOuter, iInner2] = sched.Split(iInner, 2); + auto [jInnerOuter, jInner2] = sched.Split(jInner, 2); + sched.SetOrder({ iOuter, jOuter, iInnerOuter, jInnerOuter, kOuter, iInner2, jInner2, kInner }); + + auto plan = sched.CreateGPUPlan(gpuConfig); + plan.MapIndexToProcessor(iOuter, Processor::BlockY); + plan.MapIndexToProcessor(jOuter, Processor::BlockX); + plan.MapIndexToProcessor(iInnerOuter, Processor::ThreadY); + plan.MapIndexToProcessor(jInnerOuter, Processor::ThreadX); + plan.Tensorize({ iInner2, jInner2, kInner }, { 2, 2, 16 }); + }); + + accera::transforms::AcceraPassPipelineOptions opts{}; + opts.dumpPasses = true; + opts.dumpIntraPassIR = false; + opts.gpuOnly = true; + opts.runtime = accera::value::ExecutionRuntime::ROCM; + + RunConversionPasses(target, "mlir_nest_test_tensorize_rocm_single_block_multiple_warp_", opts); + SUCCEED("targeting " << stringify(target) << ":\n\n" + << debugString(module)); +} + +TEST_CASE_METHOD(Fixture, "mlir_nest_test_tensorize_rocm_multiple_block_multiple_warp", "[gpu][nest][mfma][main]") +{ + const int64_t M = 1024; + const int64_t N = 1024; + const int64_t K = 1024; + + const int64_t blockDim = 32; + const int64_t tileSize = blockDim; + + REQUIRE(M % tileSize == 0); + REQUIRE(N % tileSize == 0); + + const int64_t gridDimX = N / blockDim; + const int64_t gridDimY = M / blockDim; + + auto target = GENERATE(ConversionTarget::accera, ConversionTarget::mlir, 
ConversionTarget::llvm); + + using namespace accera::value; + using namespace accera::utilities; + using accera::utilities::MemorySpace; + using accera::value::Value, accera::value::Matrix; + + auto gpuConfig = targets::GPU{}; + gpuConfig.grid = targets::Dim3(gridDimX, gridDimY); + gpuConfig.block = targets::Dim3(blockDim, blockDim); + + auto matmul = + DeclareFunction("NestMatMul") + .Target(gpuConfig) + .Runtime(ExecutionRuntime::ROCM) + .Decorated(false) + .Public(true) + .Parameters( + Value({ ValueType::Float, MemoryLayout(MemoryShape{ M, K }) }), + Value({ ValueType::Float, MemoryLayout(MemoryShape{ K, N }) }), + Value({ ValueType::Float, MemoryLayout(MemoryShape{ M, N }) })) + .Define([=](Matrix A, Matrix B, Matrix C) { + Nest nest({ M, N, K }); + auto indices = nest.GetIndices(); + Scalar i = indices[0]; + Scalar j = indices[1]; + Scalar k = indices[2]; + + nest.Set([&]() { C(i, j) += A(i, k) * B(k, j); }); + + auto sched = nest.CreateSchedule(); + auto [iOuter, iInner] = sched.Split(i, blockDim); + auto [jOuter, jInner] = sched.Split(j, blockDim); + auto [kOuter, kInner] = sched.Split(k, 16); + auto [iInnerOuter, iInner2] = sched.Split(iInner, 2); + auto [jInnerOuter, jInner2] = sched.Split(jInner, 2); + sched.SetOrder({ iOuter, jOuter, iInnerOuter, jInnerOuter, kOuter, iInner2, jInner2, kInner }); + + auto plan = sched.CreateGPUPlan(gpuConfig); + plan.MapIndexToProcessor(iOuter, Processor::BlockY); + plan.MapIndexToProcessor(jOuter, Processor::BlockX); + plan.MapIndexToProcessor(iInnerOuter, Processor::ThreadY); + plan.MapIndexToProcessor(jInnerOuter, Processor::ThreadX); + plan.Tensorize({ iInner2, jInner2, kInner }, { 2, 2, 16 }); + }); + + accera::transforms::AcceraPassPipelineOptions opts{}; + opts.dumpPasses = true; + opts.dumpIntraPassIR = false; + opts.gpuOnly = true; + opts.runtime = accera::value::ExecutionRuntime::ROCM; + + RunConversionPasses(target, "mlir_nest_test_tensorize_rocm_multiple_block_multiple_warp_", opts); + SUCCEED("targeting " << stringify(target) << ":\n\n" + << debugString(module)); +} \ No newline at end of file diff --git a/accera/python/accera/Constants.py b/accera/python/accera/Constants.py index ec1f82b8..dd87ced6 100644 --- a/accera/python/accera/Constants.py +++ b/accera/python/accera/Constants.py @@ -4,3 +4,5 @@ #################################################################################################### inf = float('inf') + +AUTO = object() diff --git a/accera/python/accera/Package.py b/accera/python/accera/Package.py index d07f5fae..9127737a 100644 --- a/accera/python/accera/Package.py +++ b/accera/python/accera/Package.py @@ -3,14 +3,18 @@ # Licensed under the MIT License. See LICENSE in the project root for license information. #################################################################################################### +import hatlib as hat +import json import logging +import os +import re +import shutil from collections import OrderedDict from enum import Enum, Flag, auto from functools import wraps, singledispatch +from hashlib import md5 +from secrets import token_hex from typing import * -import os -import shutil -import hatlib as hat from . 
import _lang_python, lang from .Targets import Target, Runtime @@ -18,6 +22,10 @@ from .Constants import inf from .Platforms import Platform, get_library_reference +_R_DIM3 = r'dim3\((\d+),\s*(\d+),\s*(\d+)\)' +_R_GPU_LAUNCH = f"<<<{_R_DIM3},\s*{_R_DIM3}>>>" +del _R_DIM3 + @singledispatch def _convert_arg(arg: _lang_python._lang._Valor): @@ -206,11 +214,14 @@ def _add_function( function_opts: A dictionary of advanced options to set on the function, e.g. {"no_inline" : True} auxiliary: A dictionary of auxiliary metadata to include in the HAT package. """ - - from secrets import token_hex - + + # Auxiliary data should be one copy per function + auxiliary_metadata = auxiliary.copy() + param_value_dict = {} for delayed_param, value in parameters.items(): delayed_param.set_value(value) + param_value_dict[delayed_param._name] = value if isinstance(value, int) else str(value) + auxiliary_metadata['accera'] = param_value_dict def validate_target(target: Target): # can't use set because targets are mutable (therefore unhashable) @@ -220,9 +231,24 @@ def validate_target(target: Target): "Function target being added is currently incompatible with existing functions in package" ) - # Function names must begin with an _ or alphabetical character - name = token_hex(4) - name = (f"{base_name}_{name}" if base_name else f"_{name}") + def get_function_name(target: Target): + # Get a function name using a stable hash of [base_name, signature, target, and parameters] + # If no base_name is provided, use a unique identifier to avoid collisions (assume user + # does not care about the function name in this case) + # ref: https://death.andgravity.com/stable-hashing + suffix = md5( + json.dumps( + tuple( + map( + lambda x: str(x), [base_name or token_hex(4), target, auxiliary_metadata['accera']] + + [(a.role, a.element_type, a.shape, a.layout) for a in args] + ) + ) + ).encode('utf-8') + ).digest().hex()[:16] # truncate + + # Function names must begin with an _ or alphabetical character + return (f"{base_name}_{suffix}" if base_name else f"_{suffix}") # Resolve any undefined argument shapes based on the source usage pattern for arr in args: @@ -248,9 +274,9 @@ def validate_target(target: Target): native_array_args = [arg._get_native_array() for arg in args] assert source.public - source.name = name + source.name = get_function_name(source.target) source.base_name = base_name - source.auxiliary = auxiliary + source.auxiliary = auxiliary_metadata source.param_overrides = parameters source.args = tuple(native_array_args) source.requested_args = args @@ -266,6 +292,7 @@ def validate_target(target: Target): def wrapper_fn(args): source(*map(_convert_arg, args)) + name = get_function_name(Target.HOST) logging.debug(f"[API] Added {name}") wrapped_func = lang.Function( @@ -277,7 +304,7 @@ def wrapper_fn(args): args=tuple(map(_convert_arg, args)), requested_args=args, definition=wrapper_fn, - auxiliary=auxiliary, + auxiliary=auxiliary_metadata, target=Target.HOST, ) @@ -398,6 +425,9 @@ def build( target, target_device, compiler_options, dynamic_dependencies = self._generate_target_options(platform, mode) + if target.category == Target.Category.GPU and target.runtime == Target.Runtime.NONE: + raise RuntimeError("GPU targets must specify a runtime") + cross_compile = platform != Platform.HOST format_is_default = bool( @@ -429,6 +459,7 @@ def build( package_module.EmitDebugFunction(fn_name, utilities) # Emit the package module + # TODO: Update Format enum to use SOURCE instead and then this should take runtime into 
consideration if format & Package.Format.CPP: output_type = accc.ModuleOutputType.CPP elif format & Package.Format.CUDA: @@ -442,7 +473,8 @@ def build( supporting_hats.append( Package._emit_default_module(compiler_options, target, mode, output_dir, f"{name}_Globals") ) - if any(fn.target.category == Target.Category.GPU for fn in self._fns.values()): + if any(fn.target.category == Target.Category.GPU and fn.target.runtime == Target.Runtime.VULKAN + for fn in self._fns.values()): supporting_hats.append(self._create_gpu_utility_module(compiler_options, target, mode, output_dir)) proj = accc.AcceraProject(output_dir=working_dir, library_name=name, output_type=output_type) @@ -477,7 +509,7 @@ def build( package_module.WriteHeader(header_path) # Complete the HAT file with information we have stored at this layer - hat_file = hat.HATFile.Deserialize(header_path) + hat_file: hat.HATFile = hat.HATFile.Deserialize(header_path) if format & (Package.Format.DYNAMIC_LIBRARY | Package.Format.STATIC_LIBRARY): hat_file.dependencies.link_target = os.path.basename(proj.module_file_sets[0].object_filepath) @@ -503,11 +535,45 @@ def build( hat_file.declaration.code = decl_code._new('\n'.join(map(str, ['', decl_code] + supporting_decls))) for fn_name in self._fns: - if self._fns[fn_name].public: + fn: lang.Function = self._fns[fn_name] + + if fn.public: hat_func = hat_file.function_map.get(fn_name) + if hat_func is None: raise ValueError(f"Couldn't find header-declared function {fn_name} in emitted HAT file") - hat_func.auxiliary = self._fns[fn_name].auxiliary + + hat_func.auxiliary = fn.auxiliary + + if fn.target.category == Target.Category.GPU and fn.target.runtime != Target.Runtime.VULKAN: + # TODO: Remove this when the header is emitted as part of the compilation + gpu_source = proj.module_file_sets[0].translated_source_filepath + gpu_device_func = fn_name + "__gpu__" + with open(gpu_source) as gpu_source_f: + s = re.search(gpu_device_func + _R_GPU_LAUNCH, gpu_source_f.read()) + if not s: + raise RuntimeError("Couldn't parse emitted source code") + launch_parameters = list(map(int, [s[n] for n in range(1, 7)])) + gpu_source = os.path.split(gpu_source)[1] + + hat_target: hat.Target = hat_file.target + hat_target.required.gpu.runtime = fn.target.runtime.name + hat_target.required.gpu.model = fn.target.name + + hat_func.runtime = fn.target.runtime.name + hat_func.launches = gpu_device_func + + hat_file.device_function_map[gpu_device_func] = hat.Function( + name=gpu_device_func, + description=f"Device function launched by {fn_name}", + calling_convention=hat.CallingConventionType.Device, + arguments=hat_func.arguments, + return_info=hat_func.return_info, + launch_parameters=launch_parameters, + provider=gpu_source, + runtime=fn.target.runtime.name + ) + if target_device.is_windows(): hat_os = hat.OperatingSystem.Windows elif target_device.is_macOS(): diff --git a/accera/python/accera/Parameter.py b/accera/python/accera/Parameter.py index 7c2852a4..08fc78a4 100644 --- a/accera/python/accera/Parameter.py +++ b/accera/python/accera/Parameter.py @@ -4,11 +4,12 @@ #################################################################################################### from typing import List - +from varname import varname class DelayedParameter: - def __init__(self): + def __init__(self, name=None): self._value = None + self._name = name def get_value(self): return self._value @@ -20,7 +21,9 @@ def set_value(self, value): def create_parameters(count: int): if count < 1: raise ValueError("Invalid parameters count") - 
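Two of the Package.py changes above are easiest to see in isolation: function names now get a stable md5-derived suffix instead of a random token_hex, and GPU launch parameters are recovered from the emitted source with a dim3 regex. A small self-contained sketch of both, with made-up sample inputs (not part of the patch):

```python
# Illustrative only -- not part of the patch. Sketches the two mechanisms added
# to Package.py above: the stable md5-based function-name suffix and the
# dim3 launch-parameter regex. The sample inputs below are made up.
import json
import re
from hashlib import md5

def stable_suffix(parts) -> str:
    # md5 over a JSON dump of stringized entries, truncated to 16 hex chars
    return md5(json.dumps(tuple(map(str, parts))).encode("utf-8")).digest().hex()[:16]

print(stable_suffix(["fusing_test", "Target.HOST", {"N": 16}]))  # same inputs -> same suffix

_R_DIM3 = r"dim3\((\d+),\s*(\d+),\s*(\d+)\)"
_R_GPU_LAUNCH = f"<<<{_R_DIM3},\\s*{_R_DIM3}>>>"
m = re.search("vector_add__gpu__" + _R_GPU_LAUNCH,
              "vector_add__gpu__<<<dim3(128, 1, 1), dim3(128, 1, 1)>>>(a, b, c);")
print([int(m[n]) for n in range(1, 7)])  # [128, 1, 1, 128, 1, 1]
```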
return (tuple([DelayedParameter() for i in range(count)]) if count > 1 else DelayedParameter()) + names = varname(multi_vars=True) + return (tuple([DelayedParameter(name) for name in names]) + if count > 1 else DelayedParameter(names[0])) def get_parameters_from_grid(parameter_grid: dict) -> List[dict]: diff --git a/accera/python/accera/Targets.py b/accera/python/accera/Targets.py index e6163c44..faefc610 100644 --- a/accera/python/accera/Targets.py +++ b/accera/python/accera/Targets.py @@ -7,7 +7,8 @@ from typing import List, Union from dataclasses import dataclass, field, fields from enum import Enum, auto -from ._lang_python._lang import (BLOCK_X, BLOCK_Y, BLOCK_Z, THREAD_X, THREAD_Y, THREAD_Z) +from ._lang_python import ScalarType +from ._lang_python._lang import (BLOCK_X, BLOCK_Y, BLOCK_Z, THREAD_X, THREAD_Y, THREAD_Z, _MemorySpace, _ExecutionRuntime as Runtime) class Category(Enum): @@ -23,16 +24,9 @@ class Architecture(Enum): # AARCH64 = auto() -class Runtime(Enum): - DEFAULT = auto() - CUDA = auto() - ROCM = auto() - VULKAN = auto() - - # Branding is currently unused KNOWN_CPUS_HEADER = \ - ["Model", "Family", "Branding", "Base Freq", "Turbo Freq", "Cores", "Threads", "Cache Lines", "Cache Sizes", "Vector Bytes", "Vector Registers", "Extensions", "ISA"] + ["Model", "Family", "Branding", "Base Freq", "Turbo Freq", "Cores", "Threads", "Cache Lines", "Cache Sizes", "Vector Bytes", "Vector Registers", "Extensions", "ISA", "Runtime"] # yapf: disable KNOWN_CPUS = [ @@ -40,644 +34,663 @@ class Runtime(Enum): # Intel Skylake # ref: https://en.wikipedia.org/wiki/Skylake_(microarchitecture) # Mainstream desktop processors - ["Intel 6700K", "Skylake", "Core i7", 4.0, {1: 4.2, 2: 4.0, 4: 4.0}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6785R", "Skylake", "Core i7", 3.3, {1: 3.9, 2: 3.8, 4: 3.5}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], # Has 128 MB L4, unaccounted/untested - ["Intel 6700", "Skylake", "Core i7", 3.4, {1: 4.0, 2: 3.9, 4: 3.7}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6700T", "Skylake", "Core i7", 2.8, {1: 3.6, 2: 3.5, 4: 3.4}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - - ["Intel 6600K", "Skylake", "Core i5", 3.5, {1: 3.9, 2: 3.8, 4: 3.6}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6685R", "Skylake", "Core i5", 3.2, {1: 3.8, 2: 3.7, 4: 3.3}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6600", "Skylake", "Core i5", 3.3, {1: 3.9, 2: 3.8, 4: 3.6}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6585R", "Skylake", "Core i5", 2.8, {1: 3.6, 2: 3.5, 4: 3.1}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6500", "Skylake", "Core i5", 3.2, {1: 3.6, 2: 3.5, 4: 3.3}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6600T", "Skylake", "Core i5", 2.7, {1: 3.5, 2: 3.4, 4: 3.3}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6500T", "Skylake", "Core i5", 2.5, {1: 3.1, 2: 3.0, 4: 2.8}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6402P", "Skylake", "Core i5", 2.8, {1: 3.4, 2: 3.4, 4: 3.2}, 4, 4, [32, 256, 6 * 1024], 
[64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6400T", "Skylake", "Core i5", 2.2, {1: 2.8, 2: 2.7, 4: 2.5}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6400", "Skylake", "Core i5", 2.7, {1: 3.3, 2: 3.3, 4: 3.1}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - - ["Intel 6320", "Skylake", "Core i3", 3.9, {}, 2, 4, [32, 256, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6300", "Skylake", "Core i3", 3.8, {}, 2, 4, [32, 256, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6100", "Skylake", "Core i3", 3.7, {}, 2, 4, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6300T", "Skylake", "Core i3", 3.3, {}, 2, 4, [32, 256, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6100T", "Skylake", "Core i3", 3.2, {}, 2, 4, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 6098P", "Skylake", "Core i3", 3.6, {}, 2, 4, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - - ["Intel G4520", "Skylake", "Pentium", 3.6, {}, 2, 2, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64"], - ["Intel G4500", "Skylake", "Pentium", 3.5, {}, 2, 2, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64"], - ["Intel G4500T", "Skylake", "Pentium", 3.0, {}, 2, 2, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64"], - ["Intel G4400", "Skylake", "Pentium", 3.3, {}, 2, 2, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64"], - ["Intel G4400T", "Skylake", "Pentium", 2.9, {}, 2, 2, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64"], - ["Intel G4400TE", "Skylake", "Pentium", 2.4, {}, 2, 2, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64"], - - ["Intel G3920", "Skylake", "Celeron", 2.9, {}, 2, 2, [32, 256, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64"], - ["Intel G3900", "Skylake", "Celeron", 2.8, {}, 2, 2, [32, 256, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64"], - ["Intel G3900TE", "Skylake", "Celeron", 2.3, {}, 2, 2, [32, 256, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64"], - ["Intel G3900T", "Skylake", "Celeron", 2.6, {}, 2, 2, [32, 256, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64"], + ["Intel 6700K", "Skylake", "Core i7", 4.0, {1: 4.2, 2: 4.0, 4: 4.0}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6785R", "Skylake", "Core i7", 3.3, {1: 3.9, 2: 3.8, 4: 3.5}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], # Has 128 MB L4, unaccounted/untested + ["Intel 6700", "Skylake", "Core i7", 3.4, {1: 4.0, 2: 3.9, 4: 3.7}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6700T", "Skylake", "Core i7", 2.8, {1: 3.6, 2: 3.5, 4: 3.4}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + + ["Intel 6600K", "Skylake", "Core i5", 3.5, {1: 3.9, 2: 3.8, 4: 3.6}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6685R", "Skylake", "Core i5", 3.2, {1: 3.8, 2: 3.7, 4: 3.3}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, 
["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6600", "Skylake", "Core i5", 3.3, {1: 3.9, 2: 3.8, 4: 3.6}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6585R", "Skylake", "Core i5", 2.8, {1: 3.6, 2: 3.5, 4: 3.1}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6500", "Skylake", "Core i5", 3.2, {1: 3.6, 2: 3.5, 4: 3.3}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6600T", "Skylake", "Core i5", 2.7, {1: 3.5, 2: 3.4, 4: 3.3}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6500T", "Skylake", "Core i5", 2.5, {1: 3.1, 2: 3.0, 4: 2.8}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6402P", "Skylake", "Core i5", 2.8, {1: 3.4, 2: 3.4, 4: 3.2}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6400T", "Skylake", "Core i5", 2.2, {1: 2.8, 2: 2.7, 4: 2.5}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6400", "Skylake", "Core i5", 2.7, {1: 3.3, 2: 3.3, 4: 3.1}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + + ["Intel 6320", "Skylake", "Core i3", 3.9, {}, 2, 4, [32, 256, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6300", "Skylake", "Core i3", 3.8, {}, 2, 4, [32, 256, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6100", "Skylake", "Core i3", 3.7, {}, 2, 4, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6300T", "Skylake", "Core i3", 3.3, {}, 2, 4, [32, 256, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6100T", "Skylake", "Core i3", 3.2, {}, 2, 4, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 6098P", "Skylake", "Core i3", 3.6, {}, 2, 4, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + + ["Intel G4520", "Skylake", "Pentium", 3.6, {}, 2, 2, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64", "OPENMP"], + ["Intel G4500", "Skylake", "Pentium", 3.5, {}, 2, 2, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64", "OPENMP"], + ["Intel G4500T", "Skylake", "Pentium", 3.0, {}, 2, 2, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64", "OPENMP"], + ["Intel G4400", "Skylake", "Pentium", 3.3, {}, 2, 2, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64", "OPENMP"], + ["Intel G4400T", "Skylake", "Pentium", 2.9, {}, 2, 2, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64", "OPENMP"], + ["Intel G4400TE", "Skylake", "Pentium", 2.4, {}, 2, 2, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64", "OPENMP"], + + ["Intel G3920", "Skylake", "Celeron", 2.9, {}, 2, 2, [32, 256, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64", "OPENMP"], + ["Intel G3900", "Skylake", "Celeron", 2.8, {}, 2, 2, [32, 256, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64", "OPENMP"], + ["Intel G3900TE", "Skylake", "Celeron", 2.3, {}, 2, 2, [32, 256, 2 * 1024], [64, 64, 64], 32, 16, 
["SSE4.1", "SSE4.2"], "X86_64", "OPENMP"], + ["Intel G3900T", "Skylake", "Celeron", 2.6, {}, 2, 2, [32, 256, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2"], "X86_64", "OPENMP"], # High-end desktop processors (Skylake-X) # 7th generation Skylake-X high-end desktop CPUs - ["Intel 7980XE", "Skylake-X", "Core i9", 2.6, {2: 4.2, 1: 4.4}, 18, 36, [32, 1024, 1408], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 7960X", "Skylake-X", "Core i9", 2.8, {2: 4.2, 1: 4.4}, 16, 32, [32, 1024, 1408], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 7940X", "Skylake-X", "Core i9", 3.1, {2: 4.3, 1: 4.4}, 14, 28, [32, 1024, 1408], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 7920X", "Skylake-X", "Core i9", 2.9, {2: 4.3, 1: 4.4}, 12, 24, [32, 1024, 1408], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 7900X", "Skylake-X", "Core i9", 3.3, {2: 4.3, 1: 4.5}, 10, 20, [32, 1024, 1408], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], + ["Intel 7980XE", "Skylake-X", "Core i9", 2.6, {2: 4.2, 1: 4.4}, 18, 36, [32, 1024, 1408], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 7960X", "Skylake-X", "Core i9", 2.8, {2: 4.2, 1: 4.4}, 16, 32, [32, 1024, 1408], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 7940X", "Skylake-X", "Core i9", 3.1, {2: 4.3, 1: 4.4}, 14, 28, [32, 1024, 1408], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 7920X", "Skylake-X", "Core i9", 2.9, {2: 4.3, 1: 4.4}, 12, 24, [32, 1024, 1408], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 7900X", "Skylake-X", "Core i9", 3.3, {2: 4.3, 1: 4.5}, 10, 20, [32, 1024, 1408], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], - ["Intel 7820X", "Skylake-X", "Core i7", 3.6, {2: 4.3, 1: 4.5}, 8, 26, [32, 1024, 1408], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 7800X", "Skylake-X", "Core i7", 3.5, {1: 4.0}, 6, 12, [32, 1024, 1408], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], + ["Intel 7820X", "Skylake-X", "Core i7", 3.6, {2: 4.3, 1: 4.5}, 8, 26, [32, 1024, 1408], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 7800X", "Skylake-X", "Core i7", 3.5, {1: 4.0}, 6, 12, [32, 1024, 1408], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], # 9th generation Skylake-X high-end desktop CPUs - ["Intel 9990XE", "Skylake-X", "Core i9", 4.0, {2: 5.0, 1: 5.0}, 14, 28, [32, 1024, 19.25 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 9980XE", "Skylake-X", "Core i9", 3.0, {2: 4.4, 1: 4.5}, 18, 36, [32, 1024, 24.75 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 9960X", "Skylake-X", "Core i9", 3.1, {2: 4.4, 1: 4.5}, 16, 32, [32, 1024, 22.00 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 9940X", "Skylake-X", "Core i9", 3.3, {2: 4.4, 1: 4.5}, 14, 32, [32, 1024, 19.25 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 9920X", "Skylake-X", "Core i9", 3.5, {2: 4.4, 1: 4.5}, 12, 24, [32, 1024, 19.25 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 9900X", 
"Skylake-X", "Core i9", 3.5, {2: 4.4, 1: 4.5}, 10, 20, [32, 1024, 19.25 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 9820X", "Skylake-X", "Core i9", 3.3, {2: 4.1, 1: 4.2}, 10, 20, [32, 1024, 16.50 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], + ["Intel 9990XE", "Skylake-X", "Core i9", 4.0, {2: 5.0, 1: 5.0}, 14, 28, [32, 1024, 19.25 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 9980XE", "Skylake-X", "Core i9", 3.0, {2: 4.4, 1: 4.5}, 18, 36, [32, 1024, 24.75 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 9960X", "Skylake-X", "Core i9", 3.1, {2: 4.4, 1: 4.5}, 16, 32, [32, 1024, 22.00 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 9940X", "Skylake-X", "Core i9", 3.3, {2: 4.4, 1: 4.5}, 14, 32, [32, 1024, 19.25 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 9920X", "Skylake-X", "Core i9", 3.5, {2: 4.4, 1: 4.5}, 12, 24, [32, 1024, 19.25 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 9900X", "Skylake-X", "Core i9", 3.5, {2: 4.4, 1: 4.5}, 10, 20, [32, 1024, 19.25 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 9820X", "Skylake-X", "Core i9", 3.3, {2: 4.1, 1: 4.2}, 10, 20, [32, 1024, 16.50 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], - ["Intel 9800X", "Skylake-X", "Core i7", 3.8, {2: 4.4, 1: 4.5}, 8, 16, [32, 1024, 16.50 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], + ["Intel 9800X", "Skylake-X", "Core i7", 3.8, {2: 4.4, 1: 4.5}, 8, 16, [32, 1024, 16.50 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], # Xeon High-end desktop processors (Skylake-X) - ["Intel W-3175X", "Skylake-X", "Xeon", 3.1, {2: 3.8, 1: 4.3}, 28, 56, [32, 1024, 38.50 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], + ["Intel W-3175X", "Skylake-X", "Xeon", 3.1, {2: 3.8, 1: 4.3}, 28, 56, [32, 1024, 38.50 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], # TODO: Fill in Mobile, Workstation, Server, Skylake-SP Processors # Intel Kaby Lake # ref: https://en.wikipedia.org/wiki/Kaby_Lake # Desktop processors - ["Intel 7740X", "Kaby Lake", "Core i7", 4.3, {1: 4.5, 2: 4.5, 4: 4.5}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7700K", "Kaby Lake", "Core i7", 4.2, {1: 4.5, 2: 4.4, 4: 4.5}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7700", "Kaby Lake", "Core i7", 3.6, {1: 4.2, 2: 4.1, 4: 4.0}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7700T", "Kaby Lake", "Core i7", 2.9, {1: 3.8, 2: 3.7, 4: 3.6}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - - ["Intel 7640X", "Kaby Lake", "Core i5", 4.0, {1: 4.2, 2: 4.2, 4: 4.0}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7600K", "Kaby Lake", "Core i5", 3.8, {1: 4.2, 2: 4.1, 4: 4.0}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7600", "Kaby Lake", "Core i5", 3.5, {1: 4.1, 2: 4.0, 4: 
3.9}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7600T", "Kaby Lake", "Core i5", 2.8, {1: 3.7, 2: 3.6, 4: 3.5}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7500", "Kaby Lake", "Core i5", 3.4, {1: 3.8, 2: 3.7, 4: 3.6}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7500T", "Kaby Lake", "Core i5", 2.7, {1: 3.3, 2: 3.2, 4: 3.1}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7400", "Kaby Lake", "Core i5", 3.0, {1: 3.5, 2: 3.4, 4: 3.3}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7400T", "Kaby Lake", "Core i5", 2.4, {1: 3.0, 2: 2.9, 4: 2.7}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - - ["Intel 7350K", "Kaby Lake", "Core i3", 4.2, {}, 2, 4, [32, 256, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7320", "Kaby Lake", "Core i3", 4.1, {}, 2, 4, [32, 256, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7300", "Kaby Lake", "Core i3", 4.0, {}, 2, 4, [32, 256, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7300T", "Kaby Lake", "Core i3", 3.5, {}, 2, 4, [32, 256, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7100", "Kaby Lake", "Core i3", 3.9, {}, 2, 4, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7100T", "Kaby Lake", "Core i3", 3.4, {}, 2, 4, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7101E", "Kaby Lake", "Core i3", 3.9, {}, 2, 4, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 7101TE", "Kaby Lake", "Core i3", 3.4, {}, 2, 4, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], + ["Intel 7740X", "Kaby Lake", "Core i7", 4.3, {1: 4.5, 2: 4.5, 4: 4.5}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7700K", "Kaby Lake", "Core i7", 4.2, {1: 4.5, 2: 4.4, 4: 4.5}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7700", "Kaby Lake", "Core i7", 3.6, {1: 4.2, 2: 4.1, 4: 4.0}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7700T", "Kaby Lake", "Core i7", 2.9, {1: 3.8, 2: 3.7, 4: 3.6}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + + ["Intel 7640X", "Kaby Lake", "Core i5", 4.0, {1: 4.2, 2: 4.2, 4: 4.0}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7600K", "Kaby Lake", "Core i5", 3.8, {1: 4.2, 2: 4.1, 4: 4.0}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7600", "Kaby Lake", "Core i5", 3.5, {1: 4.1, 2: 4.0, 4: 3.9}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7600T", "Kaby Lake", "Core i5", 2.8, {1: 3.7, 2: 3.6, 4: 3.5}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7500", "Kaby Lake", "Core i5", 3.4, {1: 3.8, 2: 3.7, 4: 3.6}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", 
"SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7500T", "Kaby Lake", "Core i5", 2.7, {1: 3.3, 2: 3.2, 4: 3.1}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7400", "Kaby Lake", "Core i5", 3.0, {1: 3.5, 2: 3.4, 4: 3.3}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7400T", "Kaby Lake", "Core i5", 2.4, {1: 3.0, 2: 2.9, 4: 2.7}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + + ["Intel 7350K", "Kaby Lake", "Core i3", 4.2, {}, 2, 4, [32, 256, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7320", "Kaby Lake", "Core i3", 4.1, {}, 2, 4, [32, 256, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7300", "Kaby Lake", "Core i3", 4.0, {}, 2, 4, [32, 256, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7300T", "Kaby Lake", "Core i3", 3.5, {}, 2, 4, [32, 256, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7100", "Kaby Lake", "Core i3", 3.9, {}, 2, 4, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7100T", "Kaby Lake", "Core i3", 3.4, {}, 2, 4, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7101E", "Kaby Lake", "Core i3", 3.9, {}, 2, 4, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 7101TE", "Kaby Lake", "Core i3", 3.4, {}, 2, 4, [32, 256, 3 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], # TODO: Fill in Pentium, Celeron Processors # TODO: Fill in Mobile Processors # Server/workstation Xeon processors - ["Intel E3-1285 v6", "Kaby Lake", "Xeon", 4.1, {1: 4.5}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel E3-1280 v6", "Kaby Lake", "Xeon", 3.9, {1: 4.2}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel E3-1275 v6", "Kaby Lake", "Xeon", 3.8, {1: 4.2}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel E3-1270 v6", "Kaby Lake", "Xeon", 3.8, {1: 4.2}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel E3-1245 v6", "Kaby Lake", "Xeon", 3.7, {1: 4.1}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel E3-1240 v6", "Kaby Lake", "Xeon", 3.7, {1: 4.1}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel E3-1230 v6", "Kaby Lake", "Xeon", 3.5, {1: 3.9}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], + ["Intel E3-1285 v6", "Kaby Lake", "Xeon", 4.1, {1: 4.5}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel E3-1280 v6", "Kaby Lake", "Xeon", 3.9, {1: 4.2}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel E3-1275 v6", "Kaby Lake", "Xeon", 3.8, {1: 4.2}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel E3-1270 v6", "Kaby Lake", "Xeon", 3.8, {1: 4.2}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", 
"OPENMP"], + ["Intel E3-1245 v6", "Kaby Lake", "Xeon", 3.7, {1: 4.1}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel E3-1240 v6", "Kaby Lake", "Xeon", 3.7, {1: 4.1}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel E3-1230 v6", "Kaby Lake", "Xeon", 3.5, {1: 3.9}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], - ["Intel E3-1225 v6", "Kaby Lake", "Xeon", 3.3, {1: 3.7}, 4, 4, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel E3-1220 v6", "Kaby Lake", "Xeon", 3.0, {1: 3.5}, 4, 4, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], + ["Intel E3-1225 v6", "Kaby Lake", "Xeon", 3.3, {1: 3.7}, 4, 4, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel E3-1220 v6", "Kaby Lake", "Xeon", 3.0, {1: 3.5}, 4, 4, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], # TODO: Fill in remaining Kaby Lake data # Intel Coffee Lake # ref: https://en.wikipedia.org/wiki/Coffee_Lake # Desktop processors (Coffee Lake S) - ["Intel 8086K", "Coffee Lake", "Core i7", 4.0, {1: 5.0, 2: 4.6, 3: 4.5, 4: 4.4, 5: 4.4, 6: 4.3}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 8700K", "Coffee Lake", "Core i7", 3.7, {1: 4.7, 2: 4.6, 3: 4.5, 4: 4.4, 5: 4.4, 6: 4.3}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 8700", "Coffee Lake", "Core i7", 3.2, {1: 4.6, 2: 4.5, 3: 4.4, 4: 4.3, 5: 4.3, 6: 4.3}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 8700T", "Coffee Lake", "Core i7", 2.4, {1: 4.0, 2: 4.0, 3: 3.9, 4: 3.9, 5: 3.8, 6: 3.8}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - - ["Intel 8600K", "Coffee Lake", "Core i5", 3.6, {1: 4.3, 2: 4.2, 3: 4.2, 4: 4.2, 5: 4.1, 6: 4.1}, 6, 6, [32, 256, 9 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 8600", "Coffee Lake", "Core i5", 3.1, {1: 4.3, 2: 4.2, 3: 4.2, 4: 4.2, 5: 4.1, 6: 4.1}, 6, 6, [32, 256, 9 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 8600T", "Coffee Lake", "Core i5", 2.3, {1: 3.7, 2: 3.6, 3: 3.6, 4: 3.6, 5: 3.5, 6: 3.5}, 6, 6, [32, 256, 9 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 8500", "Coffee Lake", "Core i5", 3.0, {1: 4.1, 2: 4.0, 3: 4.0, 4: 4.0, 5: 3.9, 6: 3.9}, 6, 6, [32, 256, 9 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 8500T", "Coffee Lake", "Core i5", 2.1, {1: 3.5, 2: 3.4, 3: 3.3, 4: 3.3, 5: 3.2, 6: 3.2}, 6, 6, [32, 256, 9 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 8400", "Coffee Lake", "Core i5", 2.8, {1: 4.0, 2: 3.9, 3: 3.9, 4: 3.9, 5: 3.8, 6: 3.8}, 6, 6, [32, 256, 9 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 8400T", "Coffee Lake", "Core i5", 1.7, {1: 3.3, 2: 3.2, 3: 3.1, 4: 3.1, 5: 3.0, 6: 3.0}, 6, 6, [32, 256, 9 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - - ["Intel 8350K", "Coffee Lake", "Core i3", 4.0, {}, 4, 4, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 8300", "Coffee Lake", "Core i3", 3.7, {}, 4, 4, [32, 256, 8 * 
1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 8300T", "Coffee Lake", "Core i3", 3.2, {}, 4, 4, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 8100", "Coffee Lake", "Core i3", 3.6, {}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 8100F", "Coffee Lake", "Core i3", 3.6, {}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 8100T", "Coffee Lake", "Core i3", 3.1, {}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], + ["Intel 8086K", "Coffee Lake", "Core i7", 4.0, {1: 5.0, 2: 4.6, 3: 4.5, 4: 4.4, 5: 4.4, 6: 4.3}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 8700K", "Coffee Lake", "Core i7", 3.7, {1: 4.7, 2: 4.6, 3: 4.5, 4: 4.4, 5: 4.4, 6: 4.3}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 8700", "Coffee Lake", "Core i7", 3.2, {1: 4.6, 2: 4.5, 3: 4.4, 4: 4.3, 5: 4.3, 6: 4.3}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 8700T", "Coffee Lake", "Core i7", 2.4, {1: 4.0, 2: 4.0, 3: 3.9, 4: 3.9, 5: 3.8, 6: 3.8}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + + ["Intel 8600K", "Coffee Lake", "Core i5", 3.6, {1: 4.3, 2: 4.2, 3: 4.2, 4: 4.2, 5: 4.1, 6: 4.1}, 6, 6, [32, 256, 9 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 8600", "Coffee Lake", "Core i5", 3.1, {1: 4.3, 2: 4.2, 3: 4.2, 4: 4.2, 5: 4.1, 6: 4.1}, 6, 6, [32, 256, 9 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 8600T", "Coffee Lake", "Core i5", 2.3, {1: 3.7, 2: 3.6, 3: 3.6, 4: 3.6, 5: 3.5, 6: 3.5}, 6, 6, [32, 256, 9 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 8500", "Coffee Lake", "Core i5", 3.0, {1: 4.1, 2: 4.0, 3: 4.0, 4: 4.0, 5: 3.9, 6: 3.9}, 6, 6, [32, 256, 9 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 8500T", "Coffee Lake", "Core i5", 2.1, {1: 3.5, 2: 3.4, 3: 3.3, 4: 3.3, 5: 3.2, 6: 3.2}, 6, 6, [32, 256, 9 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 8400", "Coffee Lake", "Core i5", 2.8, {1: 4.0, 2: 3.9, 3: 3.9, 4: 3.9, 5: 3.8, 6: 3.8}, 6, 6, [32, 256, 9 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 8400T", "Coffee Lake", "Core i5", 1.7, {1: 3.3, 2: 3.2, 3: 3.1, 4: 3.1, 5: 3.0, 6: 3.0}, 6, 6, [32, 256, 9 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + + ["Intel 8350K", "Coffee Lake", "Core i3", 4.0, {}, 4, 4, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 8300", "Coffee Lake", "Core i3", 3.7, {}, 4, 4, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 8300T", "Coffee Lake", "Core i3", 3.2, {}, 4, 4, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 8100", "Coffee Lake", "Core i3", 3.6, {}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 8100F", "Coffee Lake", "Core i3", 3.6, {}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, 
["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 8100T", "Coffee Lake", "Core i3", 3.1, {}, 4, 4, [32, 256, 6 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], # TODO: Fill in Pentium, Celeron Processors # Workstation processors (Coffee Lake S) - ["Intel 2186G", "Coffee Lake", "Xeon E", 3.8, {i + 1:4.7 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 2176G", "Coffee Lake", "Xeon E", 3.7, {i + 1:4.7 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 2146G", "Coffee Lake", "Xeon E", 3.5, {i + 1:4.5 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 2136", "Coffee Lake", "Xeon E", 3.3, {i + 1:4.5 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 2126G", "Coffee Lake", "Xeon E", 3.3, {i + 1:4.5 for i in range(6)}, 6, 6, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 2174G", "Coffee Lake", "Xeon E", 3.8, {i + 1:4.7 for i in range(6)}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 2144G", "Coffee Lake", "Xeon E", 3.6, {i + 1:4.5 for i in range(6)}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 2134", "Coffee Lake", "Xeon E", 3.5, {i + 1:4.5 for i in range(6)}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 2124G", "Coffee Lake", "Xeon E", 3.4, {i + 1:4.5 for i in range(6)}, 4, 4, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 2124", "Coffee Lake", "Xeon E", 3.3, {i + 1:4.3 for i in range(6)}, 4, 4, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 2104G", "Coffee Lake", "Xeon E", 3.2, {}, 4, 4, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], + ["Intel 2186G", "Coffee Lake", "Xeon E", 3.8, {i + 1:4.7 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 2176G", "Coffee Lake", "Xeon E", 3.7, {i + 1:4.7 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 2146G", "Coffee Lake", "Xeon E", 3.5, {i + 1:4.5 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 2136", "Coffee Lake", "Xeon E", 3.3, {i + 1:4.5 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 2126G", "Coffee Lake", "Xeon E", 3.3, {i + 1:4.5 for i in range(6)}, 6, 6, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 2174G", "Coffee Lake", "Xeon E", 3.8, {i + 1:4.7 for i in range(6)}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 2144G", "Coffee Lake", "Xeon E", 3.6, {i + 1:4.5 for i in range(6)}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 2134", "Coffee Lake", "Xeon E", 3.5, {i + 1:4.5 for i in range(6)}, 4, 8, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", 
"OPENMP"], + ["Intel 2124G", "Coffee Lake", "Xeon E", 3.4, {i + 1:4.5 for i in range(6)}, 4, 4, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 2124", "Coffee Lake", "Xeon E", 3.3, {i + 1:4.3 for i in range(6)}, 4, 4, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 2104G", "Coffee Lake", "Xeon E", 3.2, {}, 4, 4, [32, 256, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], # TODO: Fill in remaining Coffee Lake data # Intel Comet Lake # https://en.wikipedia.org/wiki/Comet_Lake_(microprocessor) # Desktop processors - ["Intel 10900K", "Comet Lake", "Core i9", 3.7, {i+1:4.8 for i in range(10)}, 10, 20, [32, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10900KF", "Comet Lake", "Core i9", 3.7, {i+1:4.8 for i in range(10)}, 10, 20, [32, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10910", "Comet Lake", "Core i9", 3.6, {i+1:4.7 for i in range(10)}, 10, 20, [32, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10900", "Comet Lake", "Core i9", 2.8, {i+1:4.5 for i in range(10)}, 10, 20, [32, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10900F", "Comet Lake", "Core i9", 2.8, {i+1:4.5 for i in range(10)}, 10, 20, [32, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10900T", "Comet Lake", "Core i9", 1.9, {i+1:3.7 for i in range(10)}, 10, 20, [32, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10850K", "Comet Lake", "Core i9", 3.6, {i+1:4.7 for i in range(10)}, 10, 20, [32, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - - ["Intel 10700K", "Comet Lake", "Core i7", 3.8, {i+1:4.7 for i in range(8)}, 8, 16, [32, 256, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10700KF", "Comet Lake", "Core i7", 3.8, {i+1:4.7 for i in range(8)}, 8, 16, [32, 256, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10700", "Comet Lake", "Core i7", 2.9, {i+1:4.6 for i in range(8)}, 8, 16, [32, 256, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10700F", "Comet Lake", "Core i7", 2.9, {i+1:4.6 for i in range(8)}, 8, 16, [32, 256, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10700T", "Comet Lake", "Core i7", 2.0, {i+1:3.7 for i in range(8)}, 8, 16, [32, 256, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - - ["Intel 10600K", "Comet Lake", "Core i5", 4.1, {i+1:4.5 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10600KF", "Comet Lake", "Core i5", 4.1, {i+1:4.5 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10600", "Comet Lake", "Core i5", 3.3, {i+1:4.4 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10600T", "Comet Lake", "Core i5", 2.4, {i+1:3.7 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10500", "Comet Lake", "Core i5", 3.1, {i+1:4.2 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", 
"AVX2"], "X86_64"], - ["Intel 10500T", "Comet Lake", "Core i5", 2.3, {i+1:3.5 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10400", "Comet Lake", "Core i5", 2.9, {i+1:4.0 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10400F", "Comet Lake", "Core i5", 2.9, {i+1:4.0 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10400T", "Comet Lake", "Core i5", 2.0, {i+1:3.2 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - - ["Intel 10320", "Comet Lake", "Core i3", 3.8, {i+1:4.4 for i in range(4)}, 4, 8, [32, 356, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10300", "Comet Lake", "Core i3", 3.7, {i+1:4.2 for i in range(4)}, 4, 8, [32, 356, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10300T", "Comet Lake", "Core i3", 3.0, {i+1:3.6 for i in range(4)}, 4, 8, [32, 356, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10100", "Comet Lake", "Core i3", 3.6, {i+1:4.1 for i in range(4)}, 4, 8, [32, 356, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10100F", "Comet Lake", "Core i3", 3.6, {i+1:4.1 for i in range(4)}, 4, 8, [32, 356, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 10100T", "Comet Lake", "Core i3", 3.0, {i+1:3.5 for i in range(4)}, 4, 8, [32, 356, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], + ["Intel 10900K", "Comet Lake", "Core i9", 3.7, {i+1:4.8 for i in range(10)}, 10, 20, [32, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10900KF", "Comet Lake", "Core i9", 3.7, {i+1:4.8 for i in range(10)}, 10, 20, [32, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10910", "Comet Lake", "Core i9", 3.6, {i+1:4.7 for i in range(10)}, 10, 20, [32, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10900", "Comet Lake", "Core i9", 2.8, {i+1:4.5 for i in range(10)}, 10, 20, [32, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10900F", "Comet Lake", "Core i9", 2.8, {i+1:4.5 for i in range(10)}, 10, 20, [32, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10900T", "Comet Lake", "Core i9", 1.9, {i+1:3.7 for i in range(10)}, 10, 20, [32, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10850K", "Comet Lake", "Core i9", 3.6, {i+1:4.7 for i in range(10)}, 10, 20, [32, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + + ["Intel 10700K", "Comet Lake", "Core i7", 3.8, {i+1:4.7 for i in range(8)}, 8, 16, [32, 256, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10700KF", "Comet Lake", "Core i7", 3.8, {i+1:4.7 for i in range(8)}, 8, 16, [32, 256, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10700", "Comet Lake", "Core i7", 2.9, {i+1:4.6 for i in range(8)}, 8, 16, [32, 256, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10700F", "Comet Lake", 
"Core i7", 2.9, {i+1:4.6 for i in range(8)}, 8, 16, [32, 256, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10700T", "Comet Lake", "Core i7", 2.0, {i+1:3.7 for i in range(8)}, 8, 16, [32, 256, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + + ["Intel 10600K", "Comet Lake", "Core i5", 4.1, {i+1:4.5 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10600KF", "Comet Lake", "Core i5", 4.1, {i+1:4.5 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10600", "Comet Lake", "Core i5", 3.3, {i+1:4.4 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10600T", "Comet Lake", "Core i5", 2.4, {i+1:3.7 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10500", "Comet Lake", "Core i5", 3.1, {i+1:4.2 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10500T", "Comet Lake", "Core i5", 2.3, {i+1:3.5 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10400", "Comet Lake", "Core i5", 2.9, {i+1:4.0 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10400F", "Comet Lake", "Core i5", 2.9, {i+1:4.0 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10400T", "Comet Lake", "Core i5", 2.0, {i+1:3.2 for i in range(6)}, 6, 12, [32, 256, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + + ["Intel 10320", "Comet Lake", "Core i3", 3.8, {i+1:4.4 for i in range(4)}, 4, 8, [32, 356, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10300", "Comet Lake", "Core i3", 3.7, {i+1:4.2 for i in range(4)}, 4, 8, [32, 356, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10300T", "Comet Lake", "Core i3", 3.0, {i+1:3.6 for i in range(4)}, 4, 8, [32, 356, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10100", "Comet Lake", "Core i3", 3.6, {i+1:4.1 for i in range(4)}, 4, 8, [32, 356, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10100F", "Comet Lake", "Core i3", 3.6, {i+1:4.1 for i in range(4)}, 4, 8, [32, 356, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 10100T", "Comet Lake", "Core i3", 3.0, {i+1:3.5 for i in range(4)}, 4, 8, [32, 356, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], # TODO: Fill in Pentium, Celeron Processors # Workstation processors - ["Intel 1290P", "Comet Lake", "Xeon W", 3.7, {i+1:4.8 for i in range(10)}, 10, 20, [32, 356, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 1290", "Comet Lake", "Xeon W", 3.2, {i+1:4.6 for i in range(10)}, 10, 20, [32, 356, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 1290T", "Comet Lake", "Xeon W", 1.9, {i+1:3.8 for i in range(10)}, 10, 20, [32, 356, 20 * 1024], [64, 64, 64], 32, 16, 
["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 1270P", "Comet Lake", "Xeon W", 3.8, {i+1:4.7 for i in range(10)}, 10, 20, [32, 356, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 1270", "Comet Lake", "Xeon W", 3.4, {i+1:4.7 for i in range(10)}, 10, 20, [32, 356, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 1250P", "Comet Lake", "Xeon W", 4.1, {i+1:4.5 for i in range(10)}, 10, 20, [32, 356, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["Intel 1250", "Comet Lake", "Xeon W", 3.3, {i+1:4.4 for i in range(10)}, 10, 20, [32, 356, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], + ["Intel 1290P", "Comet Lake", "Xeon W", 3.7, {i+1:4.8 for i in range(10)}, 10, 20, [32, 356, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 1290", "Comet Lake", "Xeon W", 3.2, {i+1:4.6 for i in range(10)}, 10, 20, [32, 356, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 1290T", "Comet Lake", "Xeon W", 1.9, {i+1:3.8 for i in range(10)}, 10, 20, [32, 356, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 1270P", "Comet Lake", "Xeon W", 3.8, {i+1:4.7 for i in range(10)}, 10, 20, [32, 356, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 1270", "Comet Lake", "Xeon W", 3.4, {i+1:4.7 for i in range(10)}, 10, 20, [32, 356, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 1250P", "Comet Lake", "Xeon W", 4.1, {i+1:4.5 for i in range(10)}, 10, 20, [32, 356, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["Intel 1250", "Comet Lake", "Xeon W", 3.3, {i+1:4.4 for i in range(10)}, 10, 20, [32, 356, 12 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], # TODO: Fill in remaining Comet Lake data # Intel Rocket Lake # https://en.wikipedia.org/wiki/Rocket_Lake # Desktop processors - ["Intel 11600T", "Rocket Lake", "Core i5", 1.7, {**{i+1:3.5 for i in range(6)}, **{1: 4.1}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11600KF", "Rocket Lake", "Core i5", 3.9, {**{i+1:4.6 for i in range(6)}, **{1: 4.9}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11600K", "Rocket Lake", "Core i5", 3.9, {**{i+1:4.6 for i in range(6)}, **{1: 4.9}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11600", "Rocket Lake", "Core i5", 2.8, {**{i+1:4.3 for i in range(6)}, **{1: 4.8}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11500T", "Rocket Lake", "Core i5", 1.5, {**{i+1:3.4 for i in range(6)}, **{1: 3.9}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11500", "Rocket Lake", "Core i5", 2.7, {**{i+1:4.2 for i in range(6)}, **{1: 4.6}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11400T", "Rocket Lake", "Core i5", 1.3, {**{i+1:3.3 for i in range(6)}, **{1: 3.7}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11400F", "Rocket Lake", "Core i5", 2.6, 
{**{i+1:4.2 for i in range(6)}, **{1: 4.4}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11400", "Rocket Lake", "Core i5", 2.6, {**{i+1:4.2 for i in range(6)}, **{1: 4.4}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11700T", "Rocket Lake", "Core i7", 1.4, {**{i+1:3.6 for i in range(8)}, **{1: 4.5}, **{2: 4.6}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11700KF", "Rocket Lake", "Core i7", 3.6, {**{i+1:4.6 for i in range(8)}, **{1: 4.9}, **{2: 5.0}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11700K", "Rocket Lake", "Core i7", 3.6, {**{i+1:4.6 for i in range(8)}, **{1: 4.9}, **{2: 5.0}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11700F", "Rocket Lake", "Core i7", 2.5, {**{i+1:4.4 for i in range(8)}, **{1: 4.8}, **{2: 4.9}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11700", "Rocket Lake", "Core i7", 2.5, {**{i+1:4.4 for i in range(8)}, **{1: 4.8}, **{2: 4.9}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11900T", "Rocket Lake", "Core i9", 1.5, {**{i+1:3.7 for i in range(8)}, **{1: 4.8}, **{2: 4.9}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11900KF", "Rocket Lake", "Core i9", 3.5, {**{i+1:4.8 for i in range(8)}, **{1: 5.1}, **{2: 5.2}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11900K", "Rocket Lake", "Core i9", 3.5, {**{i+1:4.8 for i in range(8)}, **{1: 5.1}, **{2: 5.2}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11900F", "Rocket Lake", "Core i9", 2.5, {**{i+1:4.7 for i in range(8)}, **{1: 5.0}, **{2: 5.1}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 11900", "Rocket Lake", "Core i9", 2.5, {**{i+1:4.7 for i in range(8)}, **{1: 5.0}, **{2: 5.1}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], + ["Intel 11600T", "Rocket Lake", "Core i5", 1.7, {**{i+1:3.5 for i in range(6)}, **{1: 4.1}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11600KF", "Rocket Lake", "Core i5", 3.9, {**{i+1:4.6 for i in range(6)}, **{1: 4.9}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11600K", "Rocket Lake", "Core i5", 3.9, {**{i+1:4.6 for i in range(6)}, **{1: 4.9}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11600", "Rocket Lake", "Core i5", 2.8, {**{i+1:4.3 for i in range(6)}, **{1: 4.8}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11500T", "Rocket Lake", "Core i5", 1.5, {**{i+1:3.4 for i in range(6)}, **{1: 3.9}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11500", "Rocket Lake", "Core i5", 2.7, {**{i+1:4.2 for i in 
range(6)}, **{1: 4.6}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11400T", "Rocket Lake", "Core i5", 1.3, {**{i+1:3.3 for i in range(6)}, **{1: 3.7}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11400F", "Rocket Lake", "Core i5", 2.6, {**{i+1:4.2 for i in range(6)}, **{1: 4.4}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11400", "Rocket Lake", "Core i5", 2.6, {**{i+1:4.2 for i in range(6)}, **{1: 4.4}, }, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11700T", "Rocket Lake", "Core i7", 1.4, {**{i+1:3.6 for i in range(8)}, **{1: 4.5}, **{2: 4.6}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11700KF", "Rocket Lake", "Core i7", 3.6, {**{i+1:4.6 for i in range(8)}, **{1: 4.9}, **{2: 5.0}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11700K", "Rocket Lake", "Core i7", 3.6, {**{i+1:4.6 for i in range(8)}, **{1: 4.9}, **{2: 5.0}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11700F", "Rocket Lake", "Core i7", 2.5, {**{i+1:4.4 for i in range(8)}, **{1: 4.8}, **{2: 4.9}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11700", "Rocket Lake", "Core i7", 2.5, {**{i+1:4.4 for i in range(8)}, **{1: 4.8}, **{2: 4.9}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11900T", "Rocket Lake", "Core i9", 1.5, {**{i+1:3.7 for i in range(8)}, **{1: 4.8}, **{2: 4.9}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11900KF", "Rocket Lake", "Core i9", 3.5, {**{i+1:4.8 for i in range(8)}, **{1: 5.1}, **{2: 5.2}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11900K", "Rocket Lake", "Core i9", 3.5, {**{i+1:4.8 for i in range(8)}, **{1: 5.1}, **{2: 5.2}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11900F", "Rocket Lake", "Core i9", 2.5, {**{i+1:4.7 for i in range(8)}, **{1: 5.0}, **{2: 5.1}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 11900", "Rocket Lake", "Core i9", 2.5, {**{i+1:4.7 for i in range(8)}, **{1: 5.0}, **{2: 5.1}}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], # Workstation processors - ["Intel 1350", "Rocket Lake", "Xeon W", 3.3, {i+1:5.0 for i in range(6)}, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 1350P", "Rocket Lake", "Xeon W", 4.0, {i+1:5.1 for i in range(6)}, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 1370", "Rocket Lake", "Xeon W", 2.9, {i+1:5.1 for i in range(8)}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 1370P", "Rocket Lake", "Xeon 
W", 3.6, {i+1:5.2 for i in range(8)}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 1390", "Rocket Lake", "Xeon W", 2.8, {i+1:5.2 for i in range(8)}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 1390P", "Rocket Lake", "Xeon W", 3.5, {i+1:5.3 for i in range(8)}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 1390T", "Rocket Lake", "Xeon W", 1.5, {i+1:4.9 for i in range(8)}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], + ["Intel 1350", "Rocket Lake", "Xeon W", 3.3, {i+1:5.0 for i in range(6)}, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 1350P", "Rocket Lake", "Xeon W", 4.0, {i+1:5.1 for i in range(6)}, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 1370", "Rocket Lake", "Xeon W", 2.9, {i+1:5.1 for i in range(8)}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 1370P", "Rocket Lake", "Xeon W", 3.6, {i+1:5.2 for i in range(8)}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 1390", "Rocket Lake", "Xeon W", 2.8, {i+1:5.2 for i in range(8)}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 1390P", "Rocket Lake", "Xeon W", 3.5, {i+1:5.3 for i in range(8)}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 1390T", "Rocket Lake", "Xeon W", 1.5, {i+1:4.9 for i in range(8)}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], # Server processors - ["Intel 2314", "Rocket Lake", "Xeon E", 2.8, {1: 4.5}, 4, 4, [48, 512, 8 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 2324G", "Rocket Lake", "Xeon E", 3.1, {1: 4.6}, 4, 4, [48, 512, 8 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 2334", "Rocket Lake", "Xeon E", 3.4, {1: 4.8}, 4, 8, [48, 512, 8 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 2336", "Rocket Lake", "Xeon E", 2.9, {1: 4.8}, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 2356G", "Rocket Lake", "Xeon E", 3.2, {1: 5.0}, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 2374G", "Rocket Lake", "Xeon E", 3.7, {1: 5.0}, 4, 8, [48, 512, 8 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 2378", "Rocket Lake", "Xeon E", 2.6, {1: 4.8}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 2378G", "Rocket Lake", "Xeon E", 2.8, {1: 5.1}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 2386G", "Rocket Lake", "Xeon E", 3.5, {1: 5.1}, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], - ["Intel 2388G", "Rocket Lake", "Xeon E", 3.2, {1: 5.1}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, 
["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64"], + ["Intel 2314", "Rocket Lake", "Xeon E", 2.8, {1: 4.5}, 4, 4, [48, 512, 8 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 2324G", "Rocket Lake", "Xeon E", 3.1, {1: 4.6}, 4, 4, [48, 512, 8 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 2334", "Rocket Lake", "Xeon E", 3.4, {1: 4.8}, 4, 8, [48, 512, 8 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 2336", "Rocket Lake", "Xeon E", 2.9, {1: 4.8}, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 2356G", "Rocket Lake", "Xeon E", 3.2, {1: 5.0}, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 2374G", "Rocket Lake", "Xeon E", 3.7, {1: 5.0}, 4, 8, [48, 512, 8 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 2378", "Rocket Lake", "Xeon E", 2.6, {1: 4.8}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 2378G", "Rocket Lake", "Xeon E", 2.8, {1: 5.1}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 2386G", "Rocket Lake", "Xeon E", 3.5, {1: 5.1}, 6, 12, [48, 512, 12 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], + ["Intel 2388G", "Rocket Lake", "Xeon E", 3.2, {1: 5.1}, 8, 16, [48, 512, 16 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512"], "X86_64", "OPENMP"], # Intel Ice Lake # ref: https://en.wikipedia.org/wiki/Ice_Lake_(microprocessor) # ref: https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(client) - ["Intel 1000G1", "Ice Lake", "Core i3", 1.1, {1: 3.2, 2: 3.2 }, 2, 4, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 1000G4", "Ice Lake", "Core i3", 1.1, {1: 3.2, 2: 3.2 }, 2, 4, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 1005G1", "Ice Lake", "Core i3", 1.2, {1: 3.4, 2: 3.4 }, 2, 4, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 1030G4", "Ice Lake", "Core i5", 0.7, {1: 3.5, 4: 3.2}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 1030G7", "Ice Lake", "Core i5", 0.8, {1: 3.5, 4: 3.2}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 1035G1", "Ice Lake", "Core i5", 1.0, {1: 3.6, 4: 3.3}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 1035G4", "Ice Lake", "Core i5", 1.1, {1: 3.7, 4: 3.3}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 1035G7", "Ice Lake", "Core i5", 1.2, {1: 3.7, 4: 3.3}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 1060G7", "Ice Lake", "Core i7", 1.0, {1: 3.8, 4: 3.4}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 1065G7", "Ice Lake", "Core i7", 1.3, {1: 3.9, 2: 
3.8, 4: 3.5}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 1068G7", "Ice Lake", "Core i7", 2.3, {1: 4.1, 4: 3.6}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - - ["Intel 8351N", "Ice Lake", "Xeon Platinum", 2.40, {36: 3.10}, 36, 72, [48, 512, 54 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8352S", "Ice Lake", "Xeon Platinum", 2.20, {32: 2.80}, 32, 64, [48, 512, 48 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8352V", "Ice Lake", "Xeon Platinum", 2.10, {36: 2.50}, 36, 72, [48, 512, 54 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8352Y", "Ice Lake", "Xeon Platinum", 2.20, {32: 2.80}, 32, 64, [48, 512, 48 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8358", "Ice Lake", "Xeon Platinum", 2.60, {32: 3.30}, 32, 64, [48, 512, 48 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8358P", "Ice Lake", "Xeon Platinum", 2.60, {32: 3.20}, 32, 64, [48, 512, 48 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8360Y", "Ice Lake", "Xeon Platinum", 2.40, {36: 3.10}, 36, 72, [48, 512, 54 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8362", "Ice Lake", "Xeon Platinum", 2.80, {32: 3.50}, 32, 64, [48, 512, 48 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8368", "Ice Lake", "Xeon Platinum", 2.40, {38: 3.20}, 38, 76, [48, 512, 57 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8368Q", "Ice Lake", "Xeon Platinum", 2.60, {38: 3.30}, 38, 76, [48, 512, 57 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8380", "Ice Lake", "Xeon Platinum", 2.30, {40: 3.00}, 40, 80, [48, 512, 60 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], + ["Intel 1000G1", "Ice Lake", "Core i3", 1.1, {1: 3.2, 2: 3.2 }, 2, 4, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 1000G4", "Ice Lake", "Core i3", 1.1, {1: 3.2, 2: 3.2 }, 2, 4, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 1005G1", "Ice Lake", "Core i3", 1.2, {1: 3.4, 2: 3.4 }, 2, 4, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 1030G4", "Ice Lake", "Core i5", 0.7, {1: 3.5, 4: 3.2}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 1030G7", "Ice Lake", "Core i5", 0.8, {1: 3.5, 4: 3.2}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 1035G1", "Ice Lake", "Core i5", 1.0, {1: 3.6, 4: 3.3}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 1035G4", "Ice Lake", "Core i5", 1.1, {1: 3.7, 4: 3.3}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", 
"AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 1035G7", "Ice Lake", "Core i5", 1.2, {1: 3.7, 4: 3.3}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 1060G7", "Ice Lake", "Core i7", 1.0, {1: 3.8, 4: 3.4}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 1065G7", "Ice Lake", "Core i7", 1.3, {1: 3.9, 2: 3.8, 4: 3.5}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 1068G7", "Ice Lake", "Core i7", 2.3, {1: 4.1, 4: 3.6}, 4, 8, [48, 512, 2 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + + ["Intel 8351N", "Ice Lake", "Xeon Platinum", 2.40, {36: 3.10}, 36, 72, [48, 512, 54 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8352S", "Ice Lake", "Xeon Platinum", 2.20, {32: 2.80}, 32, 64, [48, 512, 48 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8352V", "Ice Lake", "Xeon Platinum", 2.10, {36: 2.50}, 36, 72, [48, 512, 54 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8352Y", "Ice Lake", "Xeon Platinum", 2.20, {32: 2.80}, 32, 64, [48, 512, 48 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8358", "Ice Lake", "Xeon Platinum", 2.60, {32: 3.30}, 32, 64, [48, 512, 48 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8358P", "Ice Lake", "Xeon Platinum", 2.60, {32: 3.20}, 32, 64, [48, 512, 48 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8360Y", "Ice Lake", "Xeon Platinum", 2.40, {36: 3.10}, 36, 72, [48, 512, 54 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8362", "Ice Lake", "Xeon Platinum", 2.80, {32: 3.50}, 32, 64, [48, 512, 48 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8368", "Ice Lake", "Xeon Platinum", 2.40, {38: 3.20}, 38, 76, [48, 512, 57 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8368Q", "Ice Lake", "Xeon Platinum", 2.60, {38: 3.30}, 38, 76, [48, 512, 57 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8380", "Ice Lake", "Xeon Platinum", 2.30, {40: 3.00}, 40, 80, [48, 512, 60 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], # Intel Cascade Lake # ref: https://en.wikipedia.org/wiki/Cascade_Lake_(microarchitecture) # ref: https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake - ["Intel 6209U", "Cascade Lake", "Xeon Gold", 2.1, {20: 3.9}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6210U", "Cascade Lake", "Xeon Gold", 2.5, {20: 3.9}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6212U", "Cascade Lake", "Xeon Gold", 2.4, {24: 3.9}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", 
"SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel W-3223", "Cascade Lake", "Xeon W", 3.5, { 8: 4.0}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel W-3225", "Cascade Lake", "Xeon W", 3.7, { 8: 4.3}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel W-3235", "Cascade Lake", "Xeon W", 3.3, {12: 4.4}, 12, 24, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel W-3245", "Cascade Lake", "Xeon W", 3.2, {16: 4.4}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel W-3245M", "Cascade Lake", "Xeon W", 3.2, {16: 4.4}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel W-3265", "Cascade Lake", "Xeon W", 2.7, {24: 4.4}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel W-3265M", "Cascade Lake", "Xeon W", 2.7, {24: 4.4}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel W-3275", "Cascade Lake", "Xeon W", 2.5, {28: 4.4}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel W-3275M", "Cascade Lake", "Xeon W", 2.5, {28: 4.4}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - - ["Intel 3204", "Cascade Lake", "Xeon Bronze", 1.9, {}, 6, 6, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 5218R", "Cascade Lake", "Xeon Gold", 2.1, {20: 4.0}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 5220R", "Cascade Lake", "Xeon Gold", 2.2, {24: 4.0}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6226R", "Cascade Lake", "Xeon Gold", 2.9, {16: 3.9}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6230R", "Cascade Lake", "Xeon Gold", 2.1, {26: 4.0}, 26, 52, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6238R", "Cascade Lake", "Xeon Gold", 2.2, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6240R", "Cascade Lake", "Xeon Gold", 2.4, {24: 4.0}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6242R", "Cascade Lake", "Xeon Gold", 3.1, {20: 4.1}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6246R", "Cascade Lake", "Xeon Gold", 3.4, {16: 4.1}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6248R", "Cascade Lake", "Xeon Gold", 3, {24: 4.0}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6258R", "Cascade Lake", "Xeon Gold", 2.7, {28: 4.0}, 28, 
56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 9221", "Cascade Lake", "Xeon Platinum", 2.1, {32: 3.7}, 32, 64, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 9222", "Cascade Lake", "Xeon Platinum", 2.3, {32: 3.7}, 32, 64, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 9242", "Cascade Lake", "Xeon Platinum", 2.3, {48: 3.8}, 48, 96, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 9282", "Cascade Lake", "Xeon Platinum", 2.6, {56: 3.8}, 56, 112, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 4208", "Cascade Lake", "Xeon Silver", 2.1, { 8: 3.2}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 4209T", "Cascade Lake", "Xeon Silver", 2.2, { 8: 3.2}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 4210", "Cascade Lake", "Xeon Silver", 2.2, {10: 3.2}, 10, 20, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 4210R", "Cascade Lake", "Xeon Silver", 2.4, {10: 3.2}, 10, 20, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 4214", "Cascade Lake", "Xeon Silver", 2.2, {12: 3.2}, 12, 24, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 4214R", "Cascade Lake", "Xeon Silver", 2.4, {12: 3.5}, 12, 24, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 4214Y", "Cascade Lake", "Xeon Silver", 2.2, {12: 3.2}, 12, 24, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 4215", "Cascade Lake", "Xeon Silver", 2.5, { 8: 3.5}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 4215R", "Cascade Lake", "Xeon Silver", 3.2, { 8: 4.0}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 4216", "Cascade Lake", "Xeon Silver", 2.1, {16: 3.2}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - - ["Intel 5215", "Cascade Lake", "Xeon Gold", 2.5, {10: 3.4}, 10, 20, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 5215L", "Cascade Lake", "Xeon Gold", 2.5, {10: 3.4}, 10, 20, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 5215M", "Cascade Lake", "Xeon Gold", 2.5, {10: 3.4}, 10, 20, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 5217", "Cascade Lake", "Xeon Gold", 3.0, { 8: 3.7}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 5218", "Cascade Lake", "Xeon Gold", 2.3, {16: 3.9}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", 
"AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 5218B", "Cascade Lake", "Xeon Gold", 2.3, {16: 3.9}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 5218N", "Cascade Lake", "Xeon Gold", 2.3, {16: 3.7}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 5218T", "Cascade Lake", "Xeon Gold", 2.1, {16: 3.8}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 5220", "Cascade Lake", "Xeon Gold", 2.2, {18: 3.9}, 18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 5220S", "Cascade Lake", "Xeon Gold", 2.7, {18: 3.9}, 18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 5220T", "Cascade Lake", "Xeon Gold", 1.9, {18: 3.9}, 18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 5222", "Cascade Lake", "Xeon Gold", 3.8, { 4: 3.9}, 4, 8, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6222V", "Cascade Lake", "Xeon Gold", 1.8, {20: 3.6}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6226", "Cascade Lake", "Xeon Gold", 2.7, {12: 3.7}, 12, 24, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6230", "Cascade Lake", "Xeon Gold", 2.1, {20: 3.9}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6230N", "Cascade Lake", "Xeon Gold", 2.3, {20: 3.9}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6230T", "Cascade Lake", "Xeon Gold", 2.1, {20: 3.9}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6234", "Cascade Lake", "Xeon Gold", 3.3, { 8: 4.0}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6238", "Cascade Lake", "Xeon Gold", 2.1, {22: 3.7}, 22, 44, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6238L", "Cascade Lake", "Xeon Gold", 2.1, {22: 3.7}, 22, 44, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6238M", "Cascade Lake", "Xeon Gold", 2.1, {22: 3.7}, 22, 44, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6238T", "Cascade Lake", "Xeon Gold", 1.9, {22: 3.7}, 22, 44, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6240", "Cascade Lake", "Xeon Gold", 2.6, {18: 3.9}, 18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6240L", "Cascade Lake", "Xeon Gold", 2.6, {18: 3.9}, 18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6240M", "Cascade Lake", "Xeon Gold", 2.6, {18: 3.9}, 
18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6240Y", "Cascade Lake", "Xeon Gold", 2.6, {18: 3.9}, 18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6242", "Cascade Lake", "Xeon Gold", 2.8, {16: 3.9}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6244", "Cascade Lake", "Xeon Gold", 3.6, { 8: 4.4}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6246", "Cascade Lake", "Xeon Gold", 3.3, {12: 4.2}, 12, 24, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6248", "Cascade Lake", "Xeon Gold", 2.5, {20: 3.9}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6252", "Cascade Lake", "Xeon Gold", 2.1, {24: 3.7}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6252N", "Cascade Lake", "Xeon Gold", 2.3, {24: 3.6}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6254", "Cascade Lake", "Xeon Gold", 3.1, {18: 4.0}, 18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 6262V", "Cascade Lake", "Xeon Gold", 1.9, {24: 3.6}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - - ["Intel 8253", "Cascade Lake", "Xeon Platinum", 2.2, {16: 3.0}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8256", "Cascade Lake", "Xeon Platinum", 3.8, { 4: 3.9}, 4, 8, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8260", "Cascade Lake", "Xeon Platinum", 2.4, {24: 3.9}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8260L", "Cascade Lake", "Xeon Platinum", 2.4, {24: 3.9}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8260M", "Cascade Lake", "Xeon Platinum", 2.4, {24: 3.9}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8260Y", "Cascade Lake", "Xeon Platinum", 2.4, {24: 3.9}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8268", "Cascade Lake", "Xeon Platinum", 2.9, {24: 3.9}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8270", "Cascade Lake", "Xeon Platinum", 2.7, {26: 4.0}, 26, 52, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8276", "Cascade Lake", "Xeon Platinum", 2.2, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8276L", "Cascade Lake", "Xeon Platinum", 2.2, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", 
"SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8276M", "Cascade Lake", "Xeon Platinum", 2.2, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8280", "Cascade Lake", "Xeon Platinum", 2.7, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8280L", "Cascade Lake", "Xeon Platinum", 2.7, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8280M", "Cascade Lake", "Xeon Platinum", 2.7, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], - ["Intel 8284", "Cascade Lake", "Xeon Platinum", 3.0, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64"], + ["Intel 6209U", "Cascade Lake", "Xeon Gold", 2.1, {20: 3.9}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6210U", "Cascade Lake", "Xeon Gold", 2.5, {20: 3.9}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6212U", "Cascade Lake", "Xeon Gold", 2.4, {24: 3.9}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel W-3223", "Cascade Lake", "Xeon W", 3.5, { 8: 4.0}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel W-3225", "Cascade Lake", "Xeon W", 3.7, { 8: 4.3}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel W-3235", "Cascade Lake", "Xeon W", 3.3, {12: 4.4}, 12, 24, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel W-3245", "Cascade Lake", "Xeon W", 3.2, {16: 4.4}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel W-3245M", "Cascade Lake", "Xeon W", 3.2, {16: 4.4}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel W-3265", "Cascade Lake", "Xeon W", 2.7, {24: 4.4}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel W-3265M", "Cascade Lake", "Xeon W", 2.7, {24: 4.4}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel W-3275", "Cascade Lake", "Xeon W", 2.5, {28: 4.4}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel W-3275M", "Cascade Lake", "Xeon W", 2.5, {28: 4.4}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + + ["Intel 3204", "Cascade Lake", "Xeon Bronze", 1.9, {}, 6, 6, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 5218R", "Cascade Lake", "Xeon Gold", 2.1, {20: 4.0}, 20, 40, [32, 1024, 1.375 * 1024], [64, 
64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 5220R", "Cascade Lake", "Xeon Gold", 2.2, {24: 4.0}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6226R", "Cascade Lake", "Xeon Gold", 2.9, {16: 3.9}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6230R", "Cascade Lake", "Xeon Gold", 2.1, {26: 4.0}, 26, 52, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6238R", "Cascade Lake", "Xeon Gold", 2.2, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6240R", "Cascade Lake", "Xeon Gold", 2.4, {24: 4.0}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6242R", "Cascade Lake", "Xeon Gold", 3.1, {20: 4.1}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6246R", "Cascade Lake", "Xeon Gold", 3.4, {16: 4.1}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6248R", "Cascade Lake", "Xeon Gold", 3, {24: 4.0}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6258R", "Cascade Lake", "Xeon Gold", 2.7, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 9221", "Cascade Lake", "Xeon Platinum", 2.1, {32: 3.7}, 32, 64, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 9222", "Cascade Lake", "Xeon Platinum", 2.3, {32: 3.7}, 32, 64, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 9242", "Cascade Lake", "Xeon Platinum", 2.3, {48: 3.8}, 48, 96, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 9282", "Cascade Lake", "Xeon Platinum", 2.6, {56: 3.8}, 56, 112, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 4208", "Cascade Lake", "Xeon Silver", 2.1, { 8: 3.2}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 4209T", "Cascade Lake", "Xeon Silver", 2.2, { 8: 3.2}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 4210", "Cascade Lake", "Xeon Silver", 2.2, {10: 3.2}, 10, 20, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 4210R", "Cascade Lake", "Xeon Silver", 2.4, {10: 3.2}, 10, 20, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 4214", "Cascade Lake", "Xeon Silver", 2.2, {12: 3.2}, 12, 24, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", 
"OPENMP"], + ["Intel 4214R", "Cascade Lake", "Xeon Silver", 2.4, {12: 3.5}, 12, 24, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 4214Y", "Cascade Lake", "Xeon Silver", 2.2, {12: 3.2}, 12, 24, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 4215", "Cascade Lake", "Xeon Silver", 2.5, { 8: 3.5}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 4215R", "Cascade Lake", "Xeon Silver", 3.2, { 8: 4.0}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 4216", "Cascade Lake", "Xeon Silver", 2.1, {16: 3.2}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + + ["Intel 5215", "Cascade Lake", "Xeon Gold", 2.5, {10: 3.4}, 10, 20, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 5215L", "Cascade Lake", "Xeon Gold", 2.5, {10: 3.4}, 10, 20, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 5215M", "Cascade Lake", "Xeon Gold", 2.5, {10: 3.4}, 10, 20, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 5217", "Cascade Lake", "Xeon Gold", 3.0, { 8: 3.7}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 5218", "Cascade Lake", "Xeon Gold", 2.3, {16: 3.9}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 5218B", "Cascade Lake", "Xeon Gold", 2.3, {16: 3.9}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 5218N", "Cascade Lake", "Xeon Gold", 2.3, {16: 3.7}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 5218T", "Cascade Lake", "Xeon Gold", 2.1, {16: 3.8}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 5220", "Cascade Lake", "Xeon Gold", 2.2, {18: 3.9}, 18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 5220S", "Cascade Lake", "Xeon Gold", 2.7, {18: 3.9}, 18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 5220T", "Cascade Lake", "Xeon Gold", 1.9, {18: 3.9}, 18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 5222", "Cascade Lake", "Xeon Gold", 3.8, { 4: 3.9}, 4, 8, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6222V", "Cascade Lake", "Xeon Gold", 1.8, {20: 3.6}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6226", "Cascade Lake", "Xeon Gold", 2.7, {12: 3.7}, 12, 24, [32, 1024, 1.375 
* 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6230", "Cascade Lake", "Xeon Gold", 2.1, {20: 3.9}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6230N", "Cascade Lake", "Xeon Gold", 2.3, {20: 3.9}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6230T", "Cascade Lake", "Xeon Gold", 2.1, {20: 3.9}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6234", "Cascade Lake", "Xeon Gold", 3.3, { 8: 4.0}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6238", "Cascade Lake", "Xeon Gold", 2.1, {22: 3.7}, 22, 44, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6238L", "Cascade Lake", "Xeon Gold", 2.1, {22: 3.7}, 22, 44, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6238M", "Cascade Lake", "Xeon Gold", 2.1, {22: 3.7}, 22, 44, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6238T", "Cascade Lake", "Xeon Gold", 1.9, {22: 3.7}, 22, 44, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6240", "Cascade Lake", "Xeon Gold", 2.6, {18: 3.9}, 18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6240L", "Cascade Lake", "Xeon Gold", 2.6, {18: 3.9}, 18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6240M", "Cascade Lake", "Xeon Gold", 2.6, {18: 3.9}, 18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6240Y", "Cascade Lake", "Xeon Gold", 2.6, {18: 3.9}, 18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6242", "Cascade Lake", "Xeon Gold", 2.8, {16: 3.9}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6244", "Cascade Lake", "Xeon Gold", 3.6, { 8: 4.4}, 8, 16, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6246", "Cascade Lake", "Xeon Gold", 3.3, {12: 4.2}, 12, 24, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6248", "Cascade Lake", "Xeon Gold", 2.5, {20: 3.9}, 20, 40, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6252", "Cascade Lake", "Xeon Gold", 2.1, {24: 3.7}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6252N", "Cascade Lake", "Xeon Gold", 2.3, {24: 3.6}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + 
["Intel 6254", "Cascade Lake", "Xeon Gold", 3.1, {18: 4.0}, 18, 36, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 6262V", "Cascade Lake", "Xeon Gold", 1.9, {24: 3.6}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + + ["Intel 8253", "Cascade Lake", "Xeon Platinum", 2.2, {16: 3.0}, 16, 32, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8256", "Cascade Lake", "Xeon Platinum", 3.8, { 4: 3.9}, 4, 8, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8260", "Cascade Lake", "Xeon Platinum", 2.4, {24: 3.9}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8260L", "Cascade Lake", "Xeon Platinum", 2.4, {24: 3.9}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8260M", "Cascade Lake", "Xeon Platinum", 2.4, {24: 3.9}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8260Y", "Cascade Lake", "Xeon Platinum", 2.4, {24: 3.9}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8268", "Cascade Lake", "Xeon Platinum", 2.9, {24: 3.9}, 24, 48, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8270", "Cascade Lake", "Xeon Platinum", 2.7, {26: 4.0}, 26, 52, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8276", "Cascade Lake", "Xeon Platinum", 2.2, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8276L", "Cascade Lake", "Xeon Platinum", 2.2, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8276M", "Cascade Lake", "Xeon Platinum", 2.2, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8280", "Cascade Lake", "Xeon Platinum", 2.7, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8280L", "Cascade Lake", "Xeon Platinum", 2.7, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8280M", "Cascade Lake", "Xeon Platinum", 2.7, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], + ["Intel 8284", "Cascade Lake", "Xeon Platinum", 3.0, {28: 4.0}, 28, 56, [32, 1024, 1.375 * 1024], [64, 64, 64], 64, 32, ["SSE4.1", "SSE4.2", "AVX2", "AVX512", "AVX-VNNI"], "X86_64", "OPENMP"], # AMD Zen # ref: https://en.wikipedia.org/wiki/Zen_(first_generation) # ref: https://en.wikichip.org/wiki/amd/microarchitectures/zen - ["AMD 200GE", "Zen", "Athlon", 3.2, {}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", 
"SSE4.2", "AVX2"], "X86_64"], - ["AMD 220GE", "Zen", "Athlon", 3.4, {}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 240GE", "Zen", "Athlon", 3.5, {}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 300U", "Zen", "Athlon", 2.4, {}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3150U", "Zen", "Athlon Gold", 2.4, {}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 200GE", "Zen", "Athlon", 3.2, {}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3050U", "Zen", "Athlon Silver", 2.3, {}, 2, 2, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7351P", "Zen", "EPYC", 2.4, {1: 2.9}, 16, 32, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7401P", "Zen", "EPYC", 2, {1: 3}, 24, 48, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7551P", "Zen", "EPYC", 2, {1: 3}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3101", "Zen", "EPYC Embedded", 2.1, {1: 2.9}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3151", "Zen", "EPYC Embedded", 2.7, {1: 2.9}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3201", "Zen", "EPYC Embedded", 1.5, {1: 3.1}, 8, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3251", "Zen", "EPYC Embedded", 2.5, {1: 3.1}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3255", "Zen", "EPYC Embedded", 2.5, {}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3301", "Zen", "EPYC Embedded", 2, {1: 3}, 12, 12, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3351", "Zen", "EPYC Embedded", 1.9, {1: 3}, 12, 24, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3401", "Zen", "EPYC Embedded", 1.85, {1: 3}, 16, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3451", "Zen", "EPYC Embedded", 2.15, {1: 3}, 16, 32, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD FireFlight", "Zen", "", 3, {}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 1200", "Zen", "Ryzen 3", 3.1, {1: 3.4}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 1300X", "Zen", "Ryzen 3", 3.5, {1: 3.7}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2200G", "Zen", "Ryzen 3", 3.5, {1: 3.7}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2200GE", "Zen", "Ryzen 3", 3.2, {1: 3.6}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2200U", "Zen", "Ryzen 3", 2.5, {1: 3.4}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2300U", "Zen", "Ryzen 3", 2, {1: 3.4}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3250U", "Zen", "Ryzen 3", 2.6, {}, 2, 4, [32, 512, 
2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 1200", "Zen", "Ryzen 3", 3.1, {1: 3.4}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 1300", "Zen", "Ryzen 3", 3.5, {1: 3.7}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 2200G", "Zen", "Ryzen 3", 3.5, {1: 3.7}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 2200GE", "Zen", "Ryzen 3", 3.2, {1: 3.6}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 2300U", "Zen", "Ryzen 3", 2, {1: 3.4}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 1400", "Zen", "Ryzen 5", 3.2, {1: 3.4}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 1500X", "Zen", "Ryzen 5", 3.5, {1: 3.7}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 1600", "Zen", "Ryzen 5", 3.2, {1: 3.6}, 6, 12, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 1600X", "Zen", "Ryzen 5", 3.6, {1: 4}, 6, 12, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2400G", "Zen", "Ryzen 5", 3.6, {1: 3.9}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2400GE", "Zen", "Ryzen 5", 3.2, {1: 3.8}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2500U", "Zen", "Ryzen 5", 2, {1: 3.6}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2600H", "Zen", "Ryzen 5", 3.2, {1: 3.6}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 1500", "Zen", "Ryzen 5", 3.5, {1: 3.7}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 1600", "Zen", "Ryzen 5", 3.2, {1: 3.6}, 6, 12, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 2400G", "Zen", "Ryzen 5", 3.6, {1: 3.9}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 2400GE", "Zen", "Ryzen 5", 3.2, {1: 3.8}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 2500U", "Zen", "Ryzen 5", 2, {1: 3.6}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 1700", "Zen", "Ryzen 7", 3, {1: 3.7}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 1700X", "Zen", "Ryzen 7", 3.4, {1: 3.8}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 1800X", "Zen", "Ryzen 7", 3.6, {1: 4}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2700U", "Zen", "Ryzen 7", 2.2, {1: 3.8}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2800H", "Zen", "Ryzen 7", 3.3, {1: 3.8}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 1700", "Zen", "Ryzen 7", 3, {1: 3.7}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 1700X", "Zen", "Ryzen 7", 3.4, {1: 3.8}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, 
["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 2700U", "Zen", "Ryzen 7", 2.2, {1: 3.8}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD R1102G", "Zen", "Ryzen Embedded", 1.2, {1: 2.6}, 2, 2, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD R1305G", "Zen", "Ryzen Embedded", 1.5, {1: 2.8}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD R1505G", "Zen", "Ryzen Embedded", 2.4, {1: 3.3}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD R1606G", "Zen", "Ryzen Embedded", 2.6, {1: 3.5}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD V1202B", "Zen", "Ryzen Embedded", 2.3, {1: 3.2}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD V1404I", "Zen", "Ryzen Embedded", 2, {1: 3.6}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD V1500B", "Zen", "Ryzen Embedded", 2.2, {}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD V1605B", "Zen", "Ryzen Embedded", 2, {1: 3.6}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD V1756B", "Zen", "Ryzen Embedded", 3.25, {1: 3.6}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD V1780B", "Zen", "Ryzen Embedded", 3.35, {1: 3.6}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD V1807B", "Zen", "Ryzen Embedded", 3.35, {1: 3.8}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 1900X", "Zen", "Ryzen Threadripper", 3.8, {1: 4}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 1920X", "Zen", "Ryzen Threadripper", 3.5, {1: 4}, 12, 24, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 1950X", "Zen", "Ryzen Threadripper", 3.4, {1: 4}, 16, 32, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7251", "Zen", "EPYC", 2.1, {1: 2.9}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7261", "Zen", "EPYC", 2.5, {1: 2.9}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7281", "Zen", "EPYC", 2.1, {1: 2.7}, 16, 32, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7301", "Zen", "EPYC", 2.2, {1: 2.7}, 16, 32, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7351", "Zen", "EPYC", 2.4, {1: 2.9}, 16, 32, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7371", "Zen", "EPYC", 3.1, {1: 3.8}, 16, 32, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7401", "Zen", "EPYC", 2, {1: 3}, 24, 48, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7451", "Zen", "EPYC", 2.3, {1: 3.2}, 24, 48, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7501", "Zen", "EPYC", 2, {1: 3}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7551", "Zen", "EPYC", 2, {1: 3}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 
16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7601", "Zen", "EPYC", 2.2, {1: 3.2}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], + ["AMD 200GE", "Zen", "Athlon", 3.2, {}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 220GE", "Zen", "Athlon", 3.4, {}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 240GE", "Zen", "Athlon", 3.5, {}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 300U", "Zen", "Athlon", 2.4, {}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3150U", "Zen", "Athlon Gold", 2.4, {}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 200GE", "Zen", "Athlon", 3.2, {}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3050U", "Zen", "Athlon Silver", 2.3, {}, 2, 2, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7351P", "Zen", "EPYC", 2.4, {1: 2.9}, 16, 32, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7401P", "Zen", "EPYC", 2, {1: 3}, 24, 48, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7551P", "Zen", "EPYC", 2, {1: 3}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3101", "Zen", "EPYC Embedded", 2.1, {1: 2.9}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3151", "Zen", "EPYC Embedded", 2.7, {1: 2.9}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3201", "Zen", "EPYC Embedded", 1.5, {1: 3.1}, 8, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3251", "Zen", "EPYC Embedded", 2.5, {1: 3.1}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3255", "Zen", "EPYC Embedded", 2.5, {}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3301", "Zen", "EPYC Embedded", 2, {1: 3}, 12, 12, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3351", "Zen", "EPYC Embedded", 1.9, {1: 3}, 12, 24, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3401", "Zen", "EPYC Embedded", 1.85, {1: 3}, 16, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3451", "Zen", "EPYC Embedded", 2.15, {1: 3}, 16, 32, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD FireFlight", "Zen", "", 3, {}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 1200", "Zen", "Ryzen 3", 3.1, {1: 3.4}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 1300X", "Zen", "Ryzen 3", 3.5, {1: 3.7}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2200G", "Zen", "Ryzen 3", 3.5, {1: 3.7}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", 
"SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2200GE", "Zen", "Ryzen 3", 3.2, {1: 3.6}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2200U", "Zen", "Ryzen 3", 2.5, {1: 3.4}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2300U", "Zen", "Ryzen 3", 2, {1: 3.4}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3250U", "Zen", "Ryzen 3", 2.6, {}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 1200", "Zen", "Ryzen 3", 3.1, {1: 3.4}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 1300", "Zen", "Ryzen 3", 3.5, {1: 3.7}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 2200G", "Zen", "Ryzen 3", 3.5, {1: 3.7}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 2200GE", "Zen", "Ryzen 3", 3.2, {1: 3.6}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 2300U", "Zen", "Ryzen 3", 2, {1: 3.4}, 4, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 1400", "Zen", "Ryzen 5", 3.2, {1: 3.4}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 1500X", "Zen", "Ryzen 5", 3.5, {1: 3.7}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 1600", "Zen", "Ryzen 5", 3.2, {1: 3.6}, 6, 12, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 1600X", "Zen", "Ryzen 5", 3.6, {1: 4}, 6, 12, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2400G", "Zen", "Ryzen 5", 3.6, {1: 3.9}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2400GE", "Zen", "Ryzen 5", 3.2, {1: 3.8}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2500U", "Zen", "Ryzen 5", 2, {1: 3.6}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2600H", "Zen", "Ryzen 5", 3.2, {1: 3.6}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 1500", "Zen", "Ryzen 5", 3.5, {1: 3.7}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 1600", "Zen", "Ryzen 5", 3.2, {1: 3.6}, 6, 12, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 2400G", "Zen", "Ryzen 5", 3.6, {1: 3.9}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 2400GE", "Zen", "Ryzen 5", 3.2, {1: 3.8}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 2500U", "Zen", "Ryzen 5", 2, {1: 3.6}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 1700", "Zen", "Ryzen 7", 3, {1: 3.7}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 1700X", "Zen", "Ryzen 7", 3.4, {1: 3.8}, 8, 16, [32, 
512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 1800X", "Zen", "Ryzen 7", 3.6, {1: 4}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2700U", "Zen", "Ryzen 7", 2.2, {1: 3.8}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2800H", "Zen", "Ryzen 7", 3.3, {1: 3.8}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 1700", "Zen", "Ryzen 7", 3, {1: 3.7}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 1700X", "Zen", "Ryzen 7", 3.4, {1: 3.8}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 2700U", "Zen", "Ryzen 7", 2.2, {1: 3.8}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD R1102G", "Zen", "Ryzen Embedded", 1.2, {1: 2.6}, 2, 2, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD R1305G", "Zen", "Ryzen Embedded", 1.5, {1: 2.8}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD R1505G", "Zen", "Ryzen Embedded", 2.4, {1: 3.3}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD R1606G", "Zen", "Ryzen Embedded", 2.6, {1: 3.5}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD V1202B", "Zen", "Ryzen Embedded", 2.3, {1: 3.2}, 2, 4, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD V1404I", "Zen", "Ryzen Embedded", 2, {1: 3.6}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD V1500B", "Zen", "Ryzen Embedded", 2.2, {}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD V1605B", "Zen", "Ryzen Embedded", 2, {1: 3.6}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD V1756B", "Zen", "Ryzen Embedded", 3.25, {1: 3.6}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD V1780B", "Zen", "Ryzen Embedded", 3.35, {1: 3.6}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD V1807B", "Zen", "Ryzen Embedded", 3.35, {1: 3.8}, 4, 8, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 1900X", "Zen", "Ryzen Threadripper", 3.8, {1: 4}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 1920X", "Zen", "Ryzen Threadripper", 3.5, {1: 4}, 12, 24, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 1950X", "Zen", "Ryzen Threadripper", 3.4, {1: 4}, 16, 32, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7251", "Zen", "EPYC", 2.1, {1: 2.9}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7261", "Zen", "EPYC", 2.5, {1: 2.9}, 8, 16, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7281", "Zen", "EPYC", 2.1, {1: 2.7}, 16, 32, [32, 512, 2 * 
1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7301", "Zen", "EPYC", 2.2, {1: 2.7}, 16, 32, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7351", "Zen", "EPYC", 2.4, {1: 2.9}, 16, 32, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7371", "Zen", "EPYC", 3.1, {1: 3.8}, 16, 32, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7401", "Zen", "EPYC", 2, {1: 3}, 24, 48, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7451", "Zen", "EPYC", 2.3, {1: 3.2}, 24, 48, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7501", "Zen", "EPYC", 2, {1: 3}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7551", "Zen", "EPYC", 2, {1: 3}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7601", "Zen", "EPYC", 2.2, {1: 3.2}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], # AMD Zen+ # ref: https://en.wikichip.org/wiki/amd/microarchitectures/zen%2B#All_Zen.2B_Chips - ["AMD 3000G", "Zen+", "Athlon", 3.5, {}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 300GE", "Zen+", "Athlon", 3.4, {}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 300U", "Zen+", "Athlon", 2.4, {1: 3.3}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2300X", "Zen+", "Ryzen 3", 3.5, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3200G", "Zen+", "Ryzen 3", 3.6, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3200U", "Zen+", "Ryzen 3", 2.6, {1: 3.5}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3300U", "Zen+", "Ryzen 3", 2.1, {1: 3.5}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 3200G", "Zen+", "Ryzen 3", 3.6, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 3200GE", "Zen+", "Ryzen 3", 3.3, {1: 3.8}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 3300U", "Zen+", "Ryzen 3", 2.1, {1: 3.5}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2500X", "Zen+", "Ryzen 5", 3.6, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2600", "Zen+", "Ryzen 5", 3.4, {1: 3.9}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2600E", "Zen+", "Ryzen 5", 3.1, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2600X", "Zen+", "Ryzen 5", 3.6, {1: 4.2}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3400G", "Zen+", "Ryzen 5", 3.7, {1: 4.2}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3500U", "Zen+", "Ryzen 5", 2.1, {1: 3.7}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 
32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3550H", "Zen+", "Ryzen 5", 2.1, {1: 3.7}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3580U", "Zen+", "Ryzen 5", 2.1, {1: 3.7}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 2600", "Zen+", "Ryzen 5", 3.4, {1: 3.9}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 3400G", "Zen+", "Ryzen 5", 3.7, {1: 4.2}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 3400GE", "Zen+", "Ryzen 5", 3.3, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 3500U", "Zen+", "Ryzen 5", 2.1, {1: 3.7}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2700", "Zen+", "Ryzen 7", 3.2, {1: 4.1}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2700E", "Zen+", "Ryzen 7", 2.8, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2700X Gold Edition", "Zen+", "Ryzen 7", 3.7, {1: 4.3}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2700X", "Zen+", "Ryzen 7", 3.7, {1: 4.3}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3700U", "Zen+", "Ryzen 7", 2.3, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3750H", "Zen+", "Ryzen 7", 2.3, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3780U", "Zen+", "Ryzen 7", 2.3, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 2700", "Zen+", "Ryzen 7", 3.2, {1: 4.1}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 2700X", "Zen+", "Ryzen 7", 3.6, {1: 4.1}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 3700U", "Zen+", "Ryzen 7", 2.3, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2920X", "Zen+", "Ryzen Threadripper", 3.5, {1: 4.3}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2950X", "Zen+", "Ryzen Threadripper", 3.5, {1: 4.4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2970WX", "Zen+", "Ryzen Threadripper", 3, {1: 4.2}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 2990WX", "Zen+", "Ryzen Threadripper", 3, {1: 4.2}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], + ["AMD 3000G", "Zen+", "Athlon", 3.5, {}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 300GE", "Zen+", "Athlon", 3.4, {}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 300U", "Zen+", "Athlon", 2.4, {1: 3.3}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2300X", "Zen+", "Ryzen 3", 3.5, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", 
"OPENMP"], + ["AMD 3200G", "Zen+", "Ryzen 3", 3.6, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3200U", "Zen+", "Ryzen 3", 2.6, {1: 3.5}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3300U", "Zen+", "Ryzen 3", 2.1, {1: 3.5}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 3200G", "Zen+", "Ryzen 3", 3.6, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 3200GE", "Zen+", "Ryzen 3", 3.3, {1: 3.8}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 3300U", "Zen+", "Ryzen 3", 2.1, {1: 3.5}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2500X", "Zen+", "Ryzen 5", 3.6, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2600", "Zen+", "Ryzen 5", 3.4, {1: 3.9}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2600E", "Zen+", "Ryzen 5", 3.1, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2600X", "Zen+", "Ryzen 5", 3.6, {1: 4.2}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3400G", "Zen+", "Ryzen 5", 3.7, {1: 4.2}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3500U", "Zen+", "Ryzen 5", 2.1, {1: 3.7}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3550H", "Zen+", "Ryzen 5", 2.1, {1: 3.7}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3580U", "Zen+", "Ryzen 5", 2.1, {1: 3.7}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 2600", "Zen+", "Ryzen 5", 3.4, {1: 3.9}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 3400G", "Zen+", "Ryzen 5", 3.7, {1: 4.2}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 3400GE", "Zen+", "Ryzen 5", 3.3, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 3500U", "Zen+", "Ryzen 5", 2.1, {1: 3.7}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2700", "Zen+", "Ryzen 7", 3.2, {1: 4.1}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2700E", "Zen+", "Ryzen 7", 2.8, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2700X Gold Edition", "Zen+", "Ryzen 7", 3.7, {1: 4.3}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2700X", "Zen+", "Ryzen 7", 3.7, {1: 4.3}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3700U", "Zen+", "Ryzen 7", 2.3, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3750H", 
"Zen+", "Ryzen 7", 2.3, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3780U", "Zen+", "Ryzen 7", 2.3, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 2700", "Zen+", "Ryzen 7", 3.2, {1: 4.1}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 2700X", "Zen+", "Ryzen 7", 3.6, {1: 4.1}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 3700U", "Zen+", "Ryzen 7", 2.3, {1: 4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2920X", "Zen+", "Ryzen Threadripper", 3.5, {1: 4.3}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2950X", "Zen+", "Ryzen Threadripper", 3.5, {1: 4.4}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2970WX", "Zen+", "Ryzen Threadripper", 3, {1: 4.2}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 2990WX", "Zen+", "Ryzen Threadripper", 3, {1: 4.2}, 32, 64, [32, 512, 2 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], # AMD Zen2 # ref: https://en.wikichip.org/wiki/amd/microarchitectures/zen_2 - ["AMD 7232P", "Zen2", "EPYC", 3.1, {1: 3.2}, 8, 16, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7302P", "Zen2", "EPYC", 3, {1: 3.3}, 16, 32, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7402P", "Zen2", "EPYC", 2.8, {1: 3.35}, 24, 48, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7502P", "Zen2", "EPYC", 2.5, {1: 3.35}, 32, 64, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7702P", "Zen2", "EPYC", 2, {1: 3.35}, 64, 128, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4300G", "Zen2", "Ryzen 3", 3.8, {1: 4}, 4, 8, [32, 2 * 1024, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4300GE", "Zen2", "Ryzen 3", 3.5, {1: 4}, 4, 8, [32, 2 * 1024, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4300U", "Zen2", "Ryzen 3", 2.7, {1: 3.7}, 4, 4, [32, 2 * 1024, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5300U", "Zen2", "Ryzen 3", 2.6, {1: 3.8}, 4, 8, [32, 2 * 1024, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 4350G", "Zen2", "Ryzen 3", 3.8, {1: 4}, 4, 8, [32, 2 * 1024, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 4350GE", "Zen2", "Ryzen 3", 3.5, {1: 4}, 4, 8, [32, 2 * 1024, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 4450U", "Zen2", "Ryzen 3", 2.5, {1: 3.7}, 4, 8, [32, 2 * 1024, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3500", "Zen2", "Ryzen 5", 3.6, {1: 4.1}, 6, 6, [32, 2 * 1024, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3500X", "Zen2", "Ryzen 5", 3.6, {1: 4.1}, 6, 6, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3600", "Zen2", "Ryzen 5", 3.6, 
{1: 4.2}, 6, 12, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3600X", "Zen2", "Ryzen 5", 3.8, {1: 4.4}, 6, 12, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3600XT", "Zen2", "Ryzen 5", 3.8, {1: 4.5}, 6, 12, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4500U", "Zen2", "Ryzen 5", 2.3, {1: 4}, 6, 6, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4600G", "Zen2", "Ryzen 5", 3.7, {1: 4.2}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4600GE", "Zen2", "Ryzen 5", 3.3, {1: 4.2}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4600H", "Zen2", "Ryzen 5", 3, {1: 4}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4600HS", "Zen2", "Ryzen 5", 3, {1: 4}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4600U", "Zen2", "Ryzen 5", 2.1, {1: 4}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4680U", "Zen2", "Ryzen 5", 2.1, {1: 4}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5500U", "Zen2", "Ryzen 5", 2.1, {1: 4}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 3600", "Zen2", "Ryzen 5", 3.6, {1: 4.2}, 6, 12, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 4650G", "Zen2", "Ryzen 5", 3.7, {1: 4.2}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 4650GE", "Zen2", "Ryzen 5", 3.3, {1: 4.2}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 4650U", "Zen2", "Ryzen 5", 2.1, {1: 4}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3700X", "Zen2", "Ryzen 7", 3.6, {1: 4.4}, 8, 16, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3800X", "Zen2", "Ryzen 7", 3.9, {1: 4.5}, 8, 16, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3800XT", "Zen2", "Ryzen 7", 3.9, {1: 4.7}, 8, 16, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4700G", "Zen2", "Ryzen 7", 3.6, {1: 4.4}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4700GE", "Zen2", "Ryzen 7", 3.1, {1: 4.3}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4700U", "Zen2", "Ryzen 7", 2, {1: 4.1}, 8, 8, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4800H", "Zen2", "Ryzen 7", 2.9, {1: 4.2}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4800HS", "Zen2", "Ryzen 7", 2.9, {1: 4.2}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4800U", "Zen2", "Ryzen 7", 1.8, {1: 4.2}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4980U", "Zen2", "Ryzen 7", 2, {1: 4.4}, 8, 16, [32, 2 * 1024, 8 * 1024], 
[64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5700U", "Zen2", "Ryzen 7", 1.8, {1: 4.3}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 3700", "Zen2", "Ryzen 7", 3.6, {1: 4.4}, 8, 16, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 4750G", "Zen2", "Ryzen 7", 3.6, {1: 4.4}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 4750GE", "Zen2", "Ryzen 7", 3.1, {1: 4.3}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 4750U", "Zen2", "Ryzen 7", 1.7, {1: 4.1}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3900", "Zen2", "Ryzen 9", 3.1, {1: 4.3}, 12, 24, [32, 2 * 1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3900X", "Zen2", "Ryzen 9", 3.8, {1: 4.6}, 12, 24, [32, 2 * 1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3900XT", "Zen2", "Ryzen 9", 3.8, {1: 4.7}, 12, 24, [32, 2 * 1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3950X", "Zen2", "Ryzen 9", 3.5, {1: 4.7}, 16, 32, [32, 2 * 1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4900H", "Zen2", "Ryzen 9", 3.3, {1: 4.4}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 4900HS", "Zen2", "Ryzen 9", 3, {1: 4.3}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 3900", "Zen2", "Ryzen 9", 3.1, {1: 4.3}, 12, 24, [32, 2 * 1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD V2516", "Zen2", "Ryzen Embedded", 2.1, {}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD V2546", "Zen2", "Ryzen Embedded", 3, {}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD V2718", "Zen2", "Ryzen Embedded", 1.7, {}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD V2748", "Zen2", "Ryzen Embedded", 2.9, {}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3960X", "Zen2", "Ryzen Threadripper", 3.8, {1: 4.5}, 24, 48, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3970X", "Zen2", "Ryzen Threadripper", 3.7, {1: 4.5}, 32, 64, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3980X", "Zen2", "Ryzen Threadripper", 3.2, {1: 4.5}, 48, 96, [32, 2 * 1024, ], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 3990X", "Zen2", "Ryzen Threadripper", 2.9, {1: 4.3}, 64, 128, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7252", "Zen2", "EPYC", 3.1, {1: 3.2}, 8, 16, [32, 2 * 1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7262", "Zen2", "EPYC", 3.2, {1: 3.4}, 8, 16, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7272", "Zen2", "EPYC", 2.9, {1: 3.2}, 12, 24, [32, 2 * 1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7282", "Zen2", "EPYC", 2.8, {1: 3.2}, 16, 32, [32, 2 * 
1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7302", "Zen2", "EPYC", 3, {1: 3.3}, 16, 32, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7352", "Zen2", "EPYC", 2.3, {1: 3.2}, 24, 48, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7402", "Zen2", "EPYC", 2.8, {1: 3.35}, 24, 48, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7452", "Zen2", "EPYC", 2.35, {1: 3.35}, 32, 64, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7502", "Zen2", "EPYC", 2.5, {1: 3.35}, 32, 64, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7532", "Zen2", "EPYC", 2.4, {1: 3.3}, 32, 64, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7542", "Zen2", "EPYC", 2.9, {1: 3.4}, 32, 64, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7552", "Zen2", "EPYC", 2.2, {1: 3.35}, 48, 96, [32, 2 * 1024, 192 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7642", "Zen2", "EPYC", 2.3, {1: 3.3}, 48, 96, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7662", "Zen2", "EPYC", 2, {1: 3.3}, 64, 128, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7702", "Zen2", "EPYC", 2, {1: 3.35}, 64, 128, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7742", "Zen2", "EPYC", 2.25, {1: 3.4}, 64, 128, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7F32", "Zen2", "EPYC", 3.7, {1: 3.9}, 8, 16, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7F52", "Zen2", "EPYC", 3.5, {1: 3.9}, 16, 32, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7F72", "Zen2", "EPYC", 3.2, {1: 3.7}, 24, 48, [32, 2 * 1024, 192 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7H12", "Zen2", "EPYC", 2.6, {1: 3.3}, 64, 128, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], + ["AMD 7232P", "Zen2", "EPYC", 3.1, {1: 3.2}, 8, 16, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7302P", "Zen2", "EPYC", 3, {1: 3.3}, 16, 32, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7402P", "Zen2", "EPYC", 2.8, {1: 3.35}, 24, 48, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7502P", "Zen2", "EPYC", 2.5, {1: 3.35}, 32, 64, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7702P", "Zen2", "EPYC", 2, {1: 3.35}, 64, 128, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4300G", "Zen2", "Ryzen 3", 3.8, {1: 4}, 4, 8, [32, 2 * 1024, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4300GE", "Zen2", "Ryzen 3", 3.5, {1: 4}, 4, 8, [32, 2 * 1024, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4300U", "Zen2", "Ryzen 3", 2.7, {1: 3.7}, 4, 4, [32, 2 * 
1024, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5300U", "Zen2", "Ryzen 3", 2.6, {1: 3.8}, 4, 8, [32, 2 * 1024, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 4350G", "Zen2", "Ryzen 3", 3.8, {1: 4}, 4, 8, [32, 2 * 1024, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 4350GE", "Zen2", "Ryzen 3", 3.5, {1: 4}, 4, 8, [32, 2 * 1024, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 4450U", "Zen2", "Ryzen 3", 2.5, {1: 3.7}, 4, 8, [32, 2 * 1024, 4 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3500", "Zen2", "Ryzen 5", 3.6, {1: 4.1}, 6, 6, [32, 2 * 1024, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3500X", "Zen2", "Ryzen 5", 3.6, {1: 4.1}, 6, 6, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3600", "Zen2", "Ryzen 5", 3.6, {1: 4.2}, 6, 12, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3600X", "Zen2", "Ryzen 5", 3.8, {1: 4.4}, 6, 12, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3600XT", "Zen2", "Ryzen 5", 3.8, {1: 4.5}, 6, 12, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4500U", "Zen2", "Ryzen 5", 2.3, {1: 4}, 6, 6, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4600G", "Zen2", "Ryzen 5", 3.7, {1: 4.2}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4600GE", "Zen2", "Ryzen 5", 3.3, {1: 4.2}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4600H", "Zen2", "Ryzen 5", 3, {1: 4}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4600HS", "Zen2", "Ryzen 5", 3, {1: 4}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4600U", "Zen2", "Ryzen 5", 2.1, {1: 4}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4680U", "Zen2", "Ryzen 5", 2.1, {1: 4}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5500U", "Zen2", "Ryzen 5", 2.1, {1: 4}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 3600", "Zen2", "Ryzen 5", 3.6, {1: 4.2}, 6, 12, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 4650G", "Zen2", "Ryzen 5", 3.7, {1: 4.2}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 4650GE", "Zen2", "Ryzen 5", 3.3, {1: 4.2}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 4650U", "Zen2", "Ryzen 5", 2.1, {1: 4}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3700X", "Zen2", "Ryzen 7", 3.6, {1: 4.4}, 8, 16, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3800X", "Zen2", 
"Ryzen 7", 3.9, {1: 4.5}, 8, 16, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3800XT", "Zen2", "Ryzen 7", 3.9, {1: 4.7}, 8, 16, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4700G", "Zen2", "Ryzen 7", 3.6, {1: 4.4}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4700GE", "Zen2", "Ryzen 7", 3.1, {1: 4.3}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4700U", "Zen2", "Ryzen 7", 2, {1: 4.1}, 8, 8, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4800H", "Zen2", "Ryzen 7", 2.9, {1: 4.2}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4800HS", "Zen2", "Ryzen 7", 2.9, {1: 4.2}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4800U", "Zen2", "Ryzen 7", 1.8, {1: 4.2}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4980U", "Zen2", "Ryzen 7", 2, {1: 4.4}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5700U", "Zen2", "Ryzen 7", 1.8, {1: 4.3}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 3700", "Zen2", "Ryzen 7", 3.6, {1: 4.4}, 8, 16, [32, 2 * 1024, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 4750G", "Zen2", "Ryzen 7", 3.6, {1: 4.4}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 4750GE", "Zen2", "Ryzen 7", 3.1, {1: 4.3}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 4750U", "Zen2", "Ryzen 7", 1.7, {1: 4.1}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3900", "Zen2", "Ryzen 9", 3.1, {1: 4.3}, 12, 24, [32, 2 * 1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3900X", "Zen2", "Ryzen 9", 3.8, {1: 4.6}, 12, 24, [32, 2 * 1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3900XT", "Zen2", "Ryzen 9", 3.8, {1: 4.7}, 12, 24, [32, 2 * 1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3950X", "Zen2", "Ryzen 9", 3.5, {1: 4.7}, 16, 32, [32, 2 * 1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4900H", "Zen2", "Ryzen 9", 3.3, {1: 4.4}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 4900HS", "Zen2", "Ryzen 9", 3, {1: 4.3}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 3900", "Zen2", "Ryzen 9", 3.1, {1: 4.3}, 12, 24, [32, 2 * 1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD V2516", "Zen2", "Ryzen Embedded", 2.1, {}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD V2546", "Zen2", "Ryzen Embedded", 3, {}, 6, 12, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", 
"SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD V2718", "Zen2", "Ryzen Embedded", 1.7, {}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD V2748", "Zen2", "Ryzen Embedded", 2.9, {}, 8, 16, [32, 2 * 1024, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3960X", "Zen2", "Ryzen Threadripper", 3.8, {1: 4.5}, 24, 48, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3970X", "Zen2", "Ryzen Threadripper", 3.7, {1: 4.5}, 32, 64, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3980X", "Zen2", "Ryzen Threadripper", 3.2, {1: 4.5}, 48, 96, [32, 2 * 1024, ], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 3990X", "Zen2", "Ryzen Threadripper", 2.9, {1: 4.3}, 64, 128, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7252", "Zen2", "EPYC", 3.1, {1: 3.2}, 8, 16, [32, 2 * 1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7262", "Zen2", "EPYC", 3.2, {1: 3.4}, 8, 16, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7272", "Zen2", "EPYC", 2.9, {1: 3.2}, 12, 24, [32, 2 * 1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7282", "Zen2", "EPYC", 2.8, {1: 3.2}, 16, 32, [32, 2 * 1024, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7302", "Zen2", "EPYC", 3, {1: 3.3}, 16, 32, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7352", "Zen2", "EPYC", 2.3, {1: 3.2}, 24, 48, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7402", "Zen2", "EPYC", 2.8, {1: 3.35}, 24, 48, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7452", "Zen2", "EPYC", 2.35, {1: 3.35}, 32, 64, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7502", "Zen2", "EPYC", 2.5, {1: 3.35}, 32, 64, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7532", "Zen2", "EPYC", 2.4, {1: 3.3}, 32, 64, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7542", "Zen2", "EPYC", 2.9, {1: 3.4}, 32, 64, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7552", "Zen2", "EPYC", 2.2, {1: 3.35}, 48, 96, [32, 2 * 1024, 192 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7642", "Zen2", "EPYC", 2.3, {1: 3.3}, 48, 96, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7662", "Zen2", "EPYC", 2, {1: 3.3}, 64, 128, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7702", "Zen2", "EPYC", 2, {1: 3.35}, 64, 128, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7742", "Zen2", "EPYC", 2.25, {1: 3.4}, 64, 128, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7F32", "Zen2", "EPYC", 3.7, {1: 3.9}, 8, 
16, [32, 2 * 1024, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7F52", "Zen2", "EPYC", 3.5, {1: 3.9}, 16, 32, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7F72", "Zen2", "EPYC", 3.2, {1: 3.7}, 24, 48, [32, 2 * 1024, 192 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7H12", "Zen2", "EPYC", 2.6, {1: 3.3}, 64, 128, [32, 2 * 1024, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], # AMD Zen3 # ref: https://en.wikichip.org/wiki/amd/microarchitectures/zen_3 - ["AMD 7313P", "Zen3", "Milan", 3, {1: 3.7}, 16, 32, [32, 512, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7443P", "Zen3", "Milan", 2.85, {1: 4}, 24, 48, [32, 512, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7543P", "Zen3", "Milan", 2.8, {1: 3.7}, 32, 64, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7713P", "Zen3", "Milan", 2, {1: 3.675}, 64, 128, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5300G", "Zen3", "Cezanne", 4, {1: 4.2}, 4, 8, [32, 512, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5300GE", "Zen3", "Cezanne", 3.6, {1: 4.2}, 4, 8, [32, 512, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5400U", "Zen3", "Cezanne", 2.6, {1: 4}, 4, 8, [32, 512, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 5350G", "Zen3", "Cezanne", 4, {1: 4.2}, 4, 8, [32, 512, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 5350GE", "Zen3", "Cezanne", 3.6, {1: 4.2}, 4, 8, [32, 512, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 5450U", "Zen3", "Cezanne", 2.6, {1: 4}, 4, 8, [32, 512, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5600G", "Zen3", "Cezanne", 3.9, {1: 4.4}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5600GE", "Zen3", "Cezanne", 3.4, {1: 4.4}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5600H", "Zen3", "Cezanne", 3.3, {1: 4.2}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5600HS", "Zen3", "Cezanne", 3, {1: 4.2}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5600U", "Zen3", "Cezanne", 2.3, {1: 4.2}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5600X", "Zen3", "Vermeer", 3.7, {1: 4.6}, 6, 12, [32, 512, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 5650G", "Zen3", "Cezanne", 3.9, {1: 4.4}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 5650GE", "Zen3", "Cezanne", 3.4, {1: 4.4}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 5650U", "Zen3", "Cezanne", 2.3, {1: 4.2}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5700G", "Zen3", "Cezanne", 3.8, {1: 4.6}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5700GE", "Zen3", "Cezanne", 3.2, {1: 4.6}, 8, 16, [32, 
512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5800", "Zen3", "Vermeer", 3.4, {1: 4.6}, 8, 16, [32, 512, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5800H", "Zen3", "Cezanne", 3.2, {1: 4.4}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5800HS", "Zen3", "Cezanne", 2.8, {1: 4.4}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5800U", "Zen3", "Cezanne", 1.9, {1: 4.4}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5800X", "Zen3", "Vermeer", 3.8, {1: 4.7}, 8, 16, [32, 512, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 5750G", "Zen3", "Cezanne", 3.8, {1: 4.6}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 5750GE", "Zen3", "Cezanne", 3.2, {1: 4.6}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD PRO 5850U", "Zen3", "Cezanne", 1.9, {1: 4.4}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5900", "Zen3", "Vermeer", 3, {1: 4.7}, 12, 24, [32, 512, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5900HS", "Zen3", "Cezanne", 3, {1: 4.6}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5900HX", "Zen3", "Cezanne", 3.3, {1: 4.6}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5900X", "Zen3", "Vermeer", 3.7, {1: 4.8}, 12, 24, [32, 512, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5950X", "Zen3", "Vermeer", 3.4, {1: 4.9}, 16, 32, [32, 512, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5980HS", "Zen3", "Cezanne", 3, {1: 4.8}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 5980HX", "Zen3", "Cezanne", 3.3, {1: 4.8}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 72F3", "Zen3", "Milan", 3.7, {1: 4.1}, 8, 16, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7313", "Zen3", "Milan", 3, {1: 3.7}, 16, 32, [32, 512, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7343", "Zen3", "Milan", 3.2, {1: 3.9}, 16, 32, [32, 512, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 73F3", "Zen3", "Milan", 3.5, {1: 4}, 16, 32, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7413", "Zen3", "Milan", 2.65, {1: 3.6}, 24, 48, [32, 512, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7443", "Zen3", "Milan", 2.85, {1: 4}, 24, 48, [32, 512, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7453", "Zen3", "Milan", 2.75, {1: 3.45}, 28, 56, [32, 512, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 74F3", "Zen3", "Milan", 3.2, {1: 4}, 24, 48, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7513", "Zen3", "Milan", 2.6, {1: 3.65}, 32, 64, [32, 512, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7543", "Zen3", "Milan", 2.8, {1: 3.7}, 32, 64, [32, 512, 
256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 75F3", "Zen3", "Milan", 2.95, {1: 4}, 32, 64, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7643", "Zen3", "Milan", 2.3, {1: 3.6}, 48, 96, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7663", "Zen3", "Milan", 2, {1: 3.5}, 56, 112, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7713", "Zen3", "Milan", 2, {1: 3.675}, 64, 128, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], - ["AMD 7763", "Zen3", "Milan", 2.45, {1: 3.5}, 64, 128, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64"], + ["AMD 7313P", "Zen3", "Milan", 3, {1: 3.7}, 16, 32, [32, 512, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7443P", "Zen3", "Milan", 2.85, {1: 4}, 24, 48, [32, 512, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7543P", "Zen3", "Milan", 2.8, {1: 3.7}, 32, 64, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7713P", "Zen3", "Milan", 2, {1: 3.675}, 64, 128, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5300G", "Zen3", "Cezanne", 4, {1: 4.2}, 4, 8, [32, 512, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5300GE", "Zen3", "Cezanne", 3.6, {1: 4.2}, 4, 8, [32, 512, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5400U", "Zen3", "Cezanne", 2.6, {1: 4}, 4, 8, [32, 512, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 5350G", "Zen3", "Cezanne", 4, {1: 4.2}, 4, 8, [32, 512, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 5350GE", "Zen3", "Cezanne", 3.6, {1: 4.2}, 4, 8, [32, 512, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 5450U", "Zen3", "Cezanne", 2.6, {1: 4}, 4, 8, [32, 512, 8 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5600G", "Zen3", "Cezanne", 3.9, {1: 4.4}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5600GE", "Zen3", "Cezanne", 3.4, {1: 4.4}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5600H", "Zen3", "Cezanne", 3.3, {1: 4.2}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5600HS", "Zen3", "Cezanne", 3, {1: 4.2}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5600U", "Zen3", "Cezanne", 2.3, {1: 4.2}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5600X", "Zen3", "Vermeer", 3.7, {1: 4.6}, 6, 12, [32, 512, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 5650G", "Zen3", "Cezanne", 3.9, {1: 4.4}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 5650GE", "Zen3", "Cezanne", 3.4, {1: 4.4}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 5650U", "Zen3", 
"Cezanne", 2.3, {1: 4.2}, 6, 12, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5700G", "Zen3", "Cezanne", 3.8, {1: 4.6}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5700GE", "Zen3", "Cezanne", 3.2, {1: 4.6}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5800", "Zen3", "Vermeer", 3.4, {1: 4.6}, 8, 16, [32, 512, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5800H", "Zen3", "Cezanne", 3.2, {1: 4.4}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5800HS", "Zen3", "Cezanne", 2.8, {1: 4.4}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5800U", "Zen3", "Cezanne", 1.9, {1: 4.4}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5800X", "Zen3", "Vermeer", 3.8, {1: 4.7}, 8, 16, [32, 512, 32 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 5750G", "Zen3", "Cezanne", 3.8, {1: 4.6}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 5750GE", "Zen3", "Cezanne", 3.2, {1: 4.6}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD PRO 5850U", "Zen3", "Cezanne", 1.9, {1: 4.4}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5900", "Zen3", "Vermeer", 3, {1: 4.7}, 12, 24, [32, 512, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5900HS", "Zen3", "Cezanne", 3, {1: 4.6}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5900HX", "Zen3", "Cezanne", 3.3, {1: 4.6}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5900X", "Zen3", "Vermeer", 3.7, {1: 4.8}, 12, 24, [32, 512, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5950X", "Zen3", "Vermeer", 3.4, {1: 4.9}, 16, 32, [32, 512, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5980HS", "Zen3", "Cezanne", 3, {1: 4.8}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 5980HX", "Zen3", "Cezanne", 3.3, {1: 4.8}, 8, 16, [32, 512, 16 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 72F3", "Zen3", "Milan", 3.7, {1: 4.1}, 8, 16, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7313", "Zen3", "Milan", 3, {1: 3.7}, 16, 32, [32, 512, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7343", "Zen3", "Milan", 3.2, {1: 3.9}, 16, 32, [32, 512, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 73F3", "Zen3", "Milan", 3.5, {1: 4}, 16, 32, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7413", "Zen3", "Milan", 2.65, {1: 3.6}, 24, 48, [32, 512, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7443", "Zen3", "Milan", 2.85, {1: 4}, 24, 48, [32, 512, 128 * 1024], 
[64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7453", "Zen3", "Milan", 2.75, {1: 3.45}, 28, 56, [32, 512, 64 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 74F3", "Zen3", "Milan", 3.2, {1: 4}, 24, 48, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7513", "Zen3", "Milan", 2.6, {1: 3.65}, 32, 64, [32, 512, 128 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7543", "Zen3", "Milan", 2.8, {1: 3.7}, 32, 64, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 75F3", "Zen3", "Milan", 2.95, {1: 4}, 32, 64, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7643", "Zen3", "Milan", 2.3, {1: 3.6}, 48, 96, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7663", "Zen3", "Milan", 2, {1: 3.5}, 56, 112, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7713", "Zen3", "Milan", 2, {1: 3.675}, 64, 128, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], + ["AMD 7763", "Zen3", "Milan", 2.45, {1: 3.5}, 64, 128, [32, 512, 256 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"], # Raspberry Pi # ref: https://www.raspberrypi.org/app/uploads/2012/02/BCM2835-ARM-Peripherals.pdf # ref: http://sandsoftwaresound.net/raspberry-pi/raspberry-pi-gen-1/memory-hierarchy/ - ["Raspberry Pi Zero", "Pi0", "Broadcom BCM2835", 0.7, {1: 1}, 1, 2, [16], [32], 0, 0, [], "ARM"], # pi0 has a 128 KB L2, but it's usually reserved for the GPU + ["Raspberry Pi Zero", "Pi0", "Broadcom BCM2835", 0.7, {1: 1}, 1, 2, [16], [32], 0, 0, [], "ARM", ""], # pi0 has a 128 KB L2, but it's usually reserved for the GPU # ref: https://patchwork.kernel.org/project/linux-arm-kernel/patch/20211218200009.16856-1-rs@noreya.tech/ - ["Raspberry Pi 3B", "Pi3", "Broadcom BCM2837B0", 1.4, {}, 4, 8, [32, 512], [64, 64], 0, 0, [], "ARM"], # pi3 has a 128 KB L2, but it's usually reserved for GPU + ["Raspberry Pi 3B", "Pi3", "Broadcom BCM2837B0", 1.4, {}, 4, 8, [32, 512], [64, 64], 0, 0, [], "ARM", "OPENMP"], # pi3 has a 128 KB L2, but it's usually reserved for GPU # ref: https://patchwork.kernel.org/project/linux-arm-kernel/patch/20211221224830.16746-1-rs@noreya.tech/ - ["Raspberry Pi 4B", "Pi4", "Broadcom BCM2711", 1.5, {}, 4, 8, [32, 1024], [64, 64], 0, 0, [], "ARM"], + ["Raspberry Pi 4B", "Pi4", "Broadcom BCM2711", 1.5, {}, 4, 8, [32, 1024], [64, 64], 0, 0, [], "ARM", "OPENMP"], - ["ARM Cortex-M4", "Cortex-M4", "ARM Cortex-M4", .008, {}, 1, 1, [], [], 0, 0, [], "ARM"], - ["ARM Cortex-M4F", "Cortex-M4", "ARM Cortex-M4F", .008, {}, 1, 1, [], [], 0, 0, ["fpu"], "ARM"], + ["ARM Cortex-M4", "Cortex-M4", "ARM Cortex-M4", .008, {}, 1, 1, [], [], 0, 0, [], "ARM", ""], + ["ARM Cortex-M4F", "Cortex-M4", "ARM Cortex-M4F", .008, {}, 1, 1, [], [], 0, 0, ["fpu"], "ARM", ""], ] # yapf: enable +@dataclass(frozen=True, eq=True) +class TensorCoreInformationEntry: + input_type : ScalarType + output_type : ScalarType + shape : List[int] + +@dataclass(frozen=True) +class TensorCoreInformation: + entries : List[TensorCoreInformationEntry] = field(default_factory=list) + + def supports(self, input_type : ScalarType, output_type : ScalarType, shape : List[int]) -> bool: + return TensorCoreInformationEntry(input_type, output_type, 
shape) in self.entries + + +MI100_TENSORCORE_INFO = TensorCoreInformation([ + TensorCoreInformationEntry(ScalarType.float32, ScalarType.float32, [2,2,16]) # maps to the 16x16x4 warp mfma instruction +]) + # Tensor Cores is current unused -KNOWN_GPUS_HEADER = ["Runtime", "Model", "Family", "Cores", "Block Size"] +KNOWN_GPUS_HEADER = ["Runtime", "Model", "Branding", "Family", "Cores", "MaxThreadsPerBlock", "MaxBlockSize", "MaxSharedMemoryPerBlock", "WarpSize", "Base Freq", "MaxRegistersPerBlock", "TensorCoreInformation"] KNOWN_GPUS = [ # NVIDIA - ["CUDA", "NVidia P100", "Pascal", 56, 16], - ["CUDA", "NVidia V100", "Volta", 80, 16], - ["CUDA", "NVidia A100", "Ampere", 108, 16], + ["CUDA", "NVidia P100", "Pascal", "sm60", 56, 1024, [1024, 1024, 64], 49152, 32, 1.328500, 65536, None], + ["CUDA", "NVidia V100", "Volta", "sm70", 80, 1024, [1024, 1024, 64], 49152, 32, 1.380000, 65536, None], + ["CUDA", "NVidia A100", "Ampere", "sm80", 108, 1024, [1024, 1024, 64], 49152, 32, 1.410000, 65536, None], # AMD - ["ROCM", "AMD MI50", "CDNA", 60, 16], - ["ROCM", "AMD MI100", "CDNA2", 120, 16] + ["ROCM", "AMD Radeon7", "CDNA", "gfx906", 60, 1024, [1024, 1024, 1024], 65536, 64, 1.801000, 65536, None], + ["ROCM", "AMD MI50", "CDNA", "gfx906", 60, 1024, [1024, 1024, 1024], 65536, 64, 1.725000, 65536, None], + ["ROCM", "AMD MI100", "CDNA2", "gfx908", 120, 1024, [1024, 1024, 1024], 65536, 64, 1.502000, 65536, MI100_TENSORCORE_INFO] ] # yapf: enable @@ -689,17 +702,23 @@ class _TargetContainer: cache_lines: List[int] = field(default_factory=list) cache_sizes: List[int] = field(default_factory=list) category: Category = None - runtime: Runtime = Runtime.DEFAULT - default_block_size: int = 0 + runtime: Runtime = Runtime.NONE extensions: List[str] = field(default_factory=list) family: str = "" frequency_GHz: float = 0.0 name: str = "" num_cores: int = 0 num_threads: int = 0 + tensor_core : TensorCoreInformation = field(default_factory=TensorCoreInformation) turbo_frequency_GHz: dict = field(default_factory=dict) # Dictionary of number of cores needed => Turbo frequency vector_bytes: int = 0 vector_registers: int = 0 + warp_size: int = 0 + max_threads_per_block: int = 0 + max_block_size: List[int] = field(default_factory=list) + max_shared_memory_per_block: int = 0 + max_registers_per_block: int = 0 + _device_name: str = "host" # used internally for emitting known targets @@ -764,6 +783,7 @@ def _recompute_known_devices(): name=device["Model"], num_cores=device["Cores"], num_threads=device["Threads"], + runtime=Runtime.__members__[device["Runtime"]] if device["Runtime"] else Runtime.NONE, turbo_frequency_GHz=device["Turbo Freq"], vector_bytes=device["Vector Bytes"], vector_registers=device["Vector Registers"], @@ -777,11 +797,17 @@ def _recompute_known_devices(): for i, v in enumerate(KNOWN_GPUS_HEADER)} target = _TargetContainer( category=Category.GPU, - runtime=Runtime[device["Runtime"]], - default_block_size=device["Block Size"], + runtime=Runtime.__members__[device["Runtime"]], family=device["Family"], name=device["Model"], num_cores=device["Cores"], + warp_size=device["WarpSize"], + max_threads_per_block=device["MaxThreadsPerBlock"], + max_block_size=device["MaxBlockSize"], + max_shared_memory_per_block=device["MaxSharedMemoryPerBlock"], + frequency_GHz=device["Base Freq"], + max_registers_per_block=device["MaxRegistersPerBlock"], + tensor_core=device["TensorCoreInformation"], ) KNOWN_DEVICES[target.category][target.name] = target model_names.append((target.name, target.name)) @@ -829,10 +855,10 @@ def __init__( 
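
The GPU table above now carries per-device limits (`WarpSize`, `MaxThreadsPerBlock`, `MaxBlockSize`, `MaxSharedMemoryPerBlock`, `Base Freq`, `MaxRegistersPerBlock`) plus an optional `TensorCoreInformation`, instead of the old single `Block Size` column. The snippet below is a minimal, self-contained sketch of how the `supports` lookup behaves for the MI100 row; it mirrors the dataclasses added in this hunk, but the `ScalarType` enum here is a local stand-in (an assumption) so the example runs outside of Accera.

```
# Sketch: how TensorCoreInformation.supports() answers shape queries.
# The ScalarType enum below is a local stand-in, not accera's real type.
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ScalarType(Enum):
    float16 = "float16"
    float32 = "float32"


@dataclass(frozen=True, eq=True)
class TensorCoreInformationEntry:
    input_type: ScalarType
    output_type: ScalarType
    shape: List[int]


@dataclass(frozen=True)
class TensorCoreInformation:
    entries: List[TensorCoreInformationEntry] = field(default_factory=list)

    def supports(self, input_type: ScalarType, output_type: ScalarType, shape: List[int]) -> bool:
        # Relies on dataclass equality, so the queried shape must match an entry exactly.
        return TensorCoreInformationEntry(input_type, output_type, shape) in self.entries


# Mirrors the MI100 entry: a (2, 2, 16) tile maps to the 16x16x4 warp mfma instruction.
MI100_TENSORCORE_INFO = TensorCoreInformation([
    TensorCoreInformationEntry(ScalarType.float32, ScalarType.float32, [2, 2, 16])
])

assert MI100_TENSORCORE_INFO.supports(ScalarType.float32, ScalarType.float32, [2, 2, 16])
assert not MI100_TENSORCORE_INFO.supports(ScalarType.float32, ScalarType.float32, [64, 64, 64])
```

`Plan.tensorize`, added further down in the Plan.py changes, performs exactly this kind of `target.tensor_core.supports(...)` check against the extents of its three tensorization indices and raises `ValueError` for unsupported shapes.
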
vector_bytes: int = 0, vector_registers: int = None, frequency_GHz: float = None, + tensor_core : TensorCoreInformation = None, turbo_frequency_GHz: float = None, cache_sizes: List[int] = None, - cache_lines: List[int] = None, - default_block_size: int = None + cache_lines: List[int] = None ): "Factory-like constructor that uses the model parameter to fill-in known defaults" @@ -886,19 +912,20 @@ def __init__( self.cache_sizes = cache_sizes or self.cache_sizes self.category = category or self.category self.runtime = runtime or self.runtime - self.default_block_size = default_block_size or self.default_block_size self.extensions = extensions or self.extensions self.family = family or self.family self.frequency_GHz = frequency_GHz or self.frequency_GHz self.name = name or self.name self.num_cores = num_cores or self.num_cores self.num_threads = num_threads or self.num_threads + self.tensor_core = tensor_core or self.tensor_core self.turbo_frequency_GHz = turbo_frequency_GHz or self.turbo_frequency_GHz self.vector_bytes = vector_bytes or self.vector_bytes self.vector_registers = vector_registers or self.vector_registers # TODO: inspect target characteristics of HOST rather than assuming these defaults if self.category == Target.Category.GPU: self.GridUnit = copy.deepcopy(GridUnits) + self.MemorySpace = copy.deepcopy(_MemorySpace) # If known_name was provided, we should override the internal fields too if known_name: diff --git a/accera/python/accera/lang/Array.py b/accera/python/accera/lang/Array.py index 8ba72bd9..ee9166cd 100644 --- a/accera/python/accera/lang/Array.py +++ b/accera/python/accera/lang/Array.py @@ -84,6 +84,7 @@ def __init__( ScalarType.int16: "int16", ScalarType.int32: "int32", ScalarType.int64: "int64", + ScalarType.float16: "float16", ScalarType.float32: "float32", ScalarType.float64: "float64", # TODO: more types @@ -218,6 +219,16 @@ def _build_native_context(self, context: NativeLoopNestContext): context.function_args = tuple(args_list) def _replay_delayed_calls(self): + ''' + This method is called once per adding function, so it can be called multiple times when + multiple functions get added. In order for the functions to be added correctly, we need to make sure all + the residual states are cleared between different method calls. + + For example, in Schedule class, we identify that Schedule._index_map can have residual states, so we need to reset self._index_map + before we replay the delayed methods. + + If there is no residual state between different method calls, no need to reset. 
+ ''' for delayed_call in self._delayed_calls: params = self._delayed_calls[delayed_call] if isinstance(params, Tuple): diff --git a/accera/python/accera/lang/Cache.py b/accera/python/accera/lang/Cache.py index 7d4029a8..49a136b0 100644 --- a/accera/python/accera/lang/Cache.py +++ b/accera/python/accera/lang/Cache.py @@ -23,31 +23,40 @@ class Cache: layout: Union[Array.Layout, Tuple[int]] = None max_elements: int = None thrifty: bool = False + double_buffer: bool = False + double_buffer_location: _MemorySpace = _MemorySpace.NONE offset: int = 0 native_cache: Any = None location: _MemorySpace = _MemorySpace.NONE indexing: CacheIndexing = CacheIndexing.GLOBAL_TO_PHYSICAL allocation: _CacheAllocation = _CacheAllocation.AUTO + @property def target_shape(self): if isinstance(self.target, Cache): - return self.target.target_shape() + return self.target.target_shape else: return self.target.shape + @property + def target_role(self): + if isinstance(self.target, Cache): + return self.target.target_role + else: + return self.target.role @property def memory_map(self): if isinstance(self.layout, tuple): from .Layout import MemoryMapLayout - mmap_layout = MemoryMapLayout(self.layout, self.target_shape(), self.offset) + mmap_layout = MemoryMapLayout(self.layout, self.target_shape, self.offset) return _MemoryAffineCoefficients(mmap_layout.coefficients, mmap_layout.offset) return None @property def dimension_permutation(self): if isinstance(self.layout, Array.Layout) and self.layout is not Array.Layout.DEFERRED: - first_major = list(range(len(self.target_shape()))) + first_major = list(range(len(self.target_shape))) dim_orders = { Array.Layout.FIRST_MAJOR: first_major, Array.Layout.LAST_MAJOR: list(reversed(first_major)), @@ -68,6 +77,8 @@ def complete(self, cache: Cache): self.layout = cache.layout self.max_elements = cache.max_elements self.thrifty = cache.thrifty + self.double_buffer = cache.double_buffer + self.double_buffer_location = cache.double_buffer_location self.offset = cache.offset self.native_cache = cache.native_cache self.location = cache.location diff --git a/accera/python/accera/lang/Nest.py b/accera/python/accera/lang/Nest.py index 2975600a..922a3ae4 100644 --- a/accera/python/accera/lang/Nest.py +++ b/accera/python/accera/lang/Nest.py @@ -174,6 +174,16 @@ def _init_delayed(self, shape: List[int]): self._shape = resolved_shape def _replay_delayed_calls(self): + ''' + This method is called once per adding function, so it can be called multiple times when + multiple functions get added. In order for the functions to be added correctly, we need to make sure all + the residual states are cleared between different method calls. + + For example, in Schedule class, we identify that Schedule._index_map can have residual states, so we need to reset self._index_map + before we replay the delayed methods. + + If there is no residual state between different method calls, no need to reset. 
+ ''' for delayed_call in self._delayed_calls: params = self._delayed_calls[delayed_call] diff --git a/accera/python/accera/lang/Plan.py b/accera/python/accera/lang/Plan.py index 8b997329..06bf24e4 100644 --- a/accera/python/accera/lang/Plan.py +++ b/accera/python/accera/lang/Plan.py @@ -17,9 +17,9 @@ from .NativeLoopNestContext import NativeLoopNestContext from ..Targets import Target from ..Platforms import LibraryDependency +from ..Constants import AUTO -from .._lang_python._lang import CacheIndexing, _MemorySpace - +from .._lang_python._lang import BarrierScope, CacheIndexing, _MemorySpace class Plan: def __init__(self, schedule: Schedule, target: Target = Target.HOST): @@ -31,7 +31,7 @@ def __init__(self, schedule: Schedule, target: Target = Target.HOST): self._dynamic_dependencies = set() self._bindings = {} - if target.category == Target.Category.GPU: + if target.category == Target.Category.GPU and target.runtime == Target.Runtime.VULKAN: self._dynamic_dependencies.add(LibraryDependency.VULKAN) def _add_index_attr(self, index: LoopIndex, attr: str): @@ -137,6 +137,52 @@ def _parallelize(self, indices, policy, context: NativeLoopNestContext): idxs, num_threads, _ParallelizationPolicy.DYNAMIC if policy == "dynamic" else _ParallelizationPolicy.STATIC ) + + def tensorize( + self, + indices: Union[LoopIndex, Tuple[LoopIndex]] + ): + if self._target.category != Target.Category.GPU: + raise ValueError("tensorization currently only supported on GPU targets") + + indices = [indices] if isinstance(indices, LoopIndex) else list(indices) + + if len(indices) != 3: + raise ValueError("tensorization requires three input indices") + + # ensure the indices are contiguous and follow the Schedule ordering + start = self._sched._indices.index(indices[0]) + end = start + len(indices) + if end > len(self._sched._indices) or indices != self._sched._indices[start:end]: + raise ValueError("indices must be contiguous in the Schedule dimension order") + + for index in indices: + self._add_index_attr(index, "tensorized") + + self._commands.append(partial(self._tensorize, indices)) + + def _tensorize(self, indices, context: NativeLoopNestContext): + from .._lang_python import ScalarType + + tensorize_dims = [] + for index in list(map(self._sched._resolve_index, indices)): + index_map = self._sched._index_map + inners = index_map[index].inners + if len(inners) != 0: + raise ValueError("The tensorization index cannot be split") + start, stop, step = self._sched.get_index_range(index) + if start != 0: + raise ValueError("The tensorization index must start at 0") + if step != 1: + raise ValueError("The tensorization index stride must be contiguous") + tensorize_dims.append(stop) + if not self._target.tensor_core.supports(input_type=ScalarType.float32, output_type=ScalarType.float32, shape=tensorize_dims): + raise ValueError("The target does not support the given tensorization dimensions") + + idxs = [context.mapping[id(index)] for index in indices] + + context.plan.tensorize(indices=idxs, dims=tensorize_dims) + def cache( self, source: Union[Array, Cache], @@ -144,10 +190,12 @@ def cache( trigger_index: Union[LoopIndex, DelayedParameter] = None, layout: Array.Layout = None, max_elements: int = None, - thrifty: bool = None, + thrifty: Union[bool, DelayedParameter] = None, location: _MemorySpace = _MemorySpace.NONE, level: Union[int, DelayedParameter] = None, trigger_level: Union[int, DelayedParameter] = None, + double_buffer: Union[bool, DelayedParameter] = False, + double_buffer_location: Union[object, 
_MemorySpace, DelayedParameter] = AUTO, _delayed_cache: DelayedCache = None ): """Adds a cache for a view target @@ -160,9 +208,16 @@ def cache( level: The key-slice level to cache (the number of wildcard dimensions in a key-slice). Specify one and only one of `index`, `level`, `max_elements`. trigger_level: The key-slice level to fill the cache at. `trigger_level` can't be smaller than `level`, and will default to `level` if not specified. Specify at most one of `trigger_index` or `trigger_level`. max_elements: The maximum elements to include in the cached region. Specify one and only one of `index`, `level`, `max_elements`. - thrifty: Use thrifty caching (copy data into a cache only if the cached data differs from the original active block). + thrifty: Use thrifty caching (copy data into a cache only if the cached data differs from the original active block). This defaults to False as it slows down compilation speed so it is intended as an opt-in feature. + double_buffer: Make this a double buffer cache by copying data one iteration ahead and using private memory on GPU for this procedure. + double_buffer_location: The memory space used for storing iteration data for the double buffer cache. Requires that double_buffer is set to True. Defaults to AUTO. + AUTO will configure the double buffering location based on the following: + | location | double_buffer | double_buffer_location = `AUTO` | + | ------------------- | ------------- | ------------------------------- | + | MemorySpace.SHARED | True | MemorySpace.PRIVATE | + | !MemorySpace.SHARED | True | Same value as location | """ - if any([isinstance(arg, DelayedParameter) for arg in (index, trigger_index, level, trigger_level)]) or \ + if any([isinstance(arg, DelayedParameter) for arg in (index, trigger_index, level, trigger_level, thrifty, double_buffer, double_buffer_location)]) or \ (isinstance(source, DelayedCache) and not source.completed): # If any of the cache level arguments are parameters, then this cache call is incomplete until those parameters # have values. 
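
The `AUTO` behavior tabulated in the docstring above reduces to a small decision rule. The helper below is a hedged, standalone paraphrase of that rule; the function name `resolve_double_buffer_location` and the local `_MemorySpace`/`AUTO` stand-ins are illustrative only, and the real resolution happens inline in `cache()` as shown further down in this hunk.

```
# Illustrative paraphrase of the double_buffer_location = AUTO rules documented above.
# _MemorySpace and AUTO are local stand-ins for the accera equivalents.
from enum import Enum


class _MemorySpace(Enum):
    NONE = "none"
    SHARED = "shared"
    PRIVATE = "private"


AUTO = object()    # sentinel


def resolve_double_buffer_location(is_gpu_target: bool, location: _MemorySpace,
                                   double_buffer: bool, double_buffer_location=AUTO):
    if not double_buffer:
        if double_buffer_location is not AUTO:
            raise ValueError("double_buffer_location is only valid when double_buffer is True")
        return _MemorySpace.NONE
    if double_buffer_location is not AUTO:
        return double_buffer_location
    # AUTO: a shared-memory cache on a GPU double-buffers through private memory;
    # any other cache double-buffers in the same memory space as the cache itself.
    if is_gpu_target and location == _MemorySpace.SHARED:
        return _MemorySpace.PRIVATE
    return location


assert resolve_double_buffer_location(True, _MemorySpace.SHARED, True) is _MemorySpace.PRIVATE
assert resolve_double_buffer_location(False, _MemorySpace.NONE, True) is _MemorySpace.NONE
```

Note that, per the validation added just below, double buffering is only accepted for CONST and INPUT arrays, and `thrifty` stays an opt-in flag because the element-wise cache analysis slows compilation.
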
Additionally, if this is a hierarchical cache and an outer cache is parameterized, @@ -176,26 +231,45 @@ def cache( source=source, layout=layout, max_elements=max_elements, - thrifty=thrifty, location=location, _delayed_cache=delayed_cache )] = { "index": index, "trigger_index": trigger_index, "level": level, - "trigger_level": trigger_level + "trigger_level": trigger_level, + "thrifty": thrifty, + "double_buffer": double_buffer, + "double_buffer_location" : double_buffer_location } return delayed_cache - if thrifty: - raise NotImplementedError("Thrifty caching is not yet implemented") # TODO - if sum(i is not None for i in [index, level, max_elements]) != 1: raise ValueError("Specify one and only one of index, level, or max_elements") if max_elements is not None and max_elements <= 0: raise ValueError("Max element count specified as a cache budget must be greater than 0") + if isinstance(source, Array): + array_role = source.role + elif isinstance(source, Cache): + array_role = source.target_role + + if double_buffer and array_role not in [Array.Role.CONST, Array.Role.INPUT]: + raise ValueError("Double-buffering is only supported for CONST and INPUT arrays") + + if not double_buffer and double_buffer_location != AUTO: + raise ValueError("double_buffer_location is only valid to specify when double_buffer is set to True") + + if double_buffer_location is AUTO: + if double_buffer: + if self._target.category == Target.Category.GPU and location == _MemorySpace.SHARED: + double_buffer_location = _MemorySpace.PRIVATE + else: + double_buffer_location = location + else: + double_buffer_location = _MemorySpace.NONE + if max_elements is None: # Validate or set index / level values @@ -217,9 +291,8 @@ def cache( index_pos = self._sched._indices.index(index) level = len(self._sched._indices) - index_pos - if (trigger_level or trigger_index): - if isinstance(source, Array) and source.role not in [Array.Role.CONST, Array.Role.INPUT]: - raise ValueError("Multicaching is only supported for CONST and INPUT arrays") + if (trigger_level or trigger_index) and array_role not in [Array.Role.CONST, Array.Role.INPUT]: + raise ValueError("Multicaching is only supported for CONST and INPUT arrays") if layout is None: layout = source._requested_layout @@ -283,7 +356,9 @@ def cache( layout=layout, max_elements=max_elements, thrifty=thrifty, - location=location + location=location, + double_buffer=double_buffer, + double_buffer_location=double_buffer_location ) if _delayed_cache: @@ -305,7 +380,6 @@ def _add_cache(self, cache, context: NativeLoopNestContext): else: target = cache.target.native_cache - # TODO: support layout, location, thrifty if (isinstance(self._target, Target) and self._target.category == Target.Category.GPU): cache.native_cache = context.plan.add_cache( target=target, @@ -316,7 +390,10 @@ def _add_cache(self, cache, context: NativeLoopNestContext): allocation=cache.allocation, location=cache.location, memory_map=cache.memory_map, - dim_order=cache.dimension_permutation + dim_order=cache.dimension_permutation, + thrifty=cache.thrifty, + double_buffer=cache.double_buffer, + double_buffer_location=cache.double_buffer_location ) else: cache.native_cache = context.plan.add_cache( @@ -328,7 +405,10 @@ def _add_cache(self, cache, context: NativeLoopNestContext): allocation=cache.allocation, location=cache.location, memory_map=cache.memory_map, - dim_order=cache.dimension_permutation + dim_order=cache.dimension_permutation, + thrifty=cache.thrifty, + double_buffer=cache.double_buffer, + 
double_buffer_location=cache.double_buffer_location ) def pack_and_embed_buffer( @@ -493,12 +573,11 @@ def units_to_dim(units, dims): for i, u in enumerate(units): index = self._bindings.get(u) if index is not None: - - if index in index_to_splitfactor_map: + begin, end, step = self._sched.get_index_range(index) + if step == 1 and index in index_to_splitfactor_map: dims[i] = index_to_splitfactor_map[index] else: - begin, end, step = self._sched.get_index_range(index) dims[i], rem = divmod(end - begin, step) if rem: @@ -537,7 +616,7 @@ def units_to_dim(units, dims): raise RuntimeError(f"Shape {shape} must be a multiple of split factor {block_dims[i]}") context.options = _GPU(grid=_Dim3(*grid_dims), block=_Dim3(*block_dims)) - context.plan = context.schedule.create_gpu_plan(context.options) + context.plan = context.schedule.create_gpu_plan(gpu_options=context.options, runtime=target.runtime) else: context.plan = context.schedule.create_plan() @@ -546,6 +625,16 @@ def _build_with_native_context(self, context: NativeLoopNestContext): cmd(context) def _replay_delayed_calls(self): + ''' + This method is called once per adding function, so it can be called multiple times when + multiple functions get added. In order for the functions to be added correctly, we need to make sure all + the residual states are cleared between different method calls. + + For example, in Schedule class, we identify that Schedule._index_map can have residual states, so we need to reset self._index_map + before we replay the delayed methods. + + If there is no residual state between different method calls, no need to reset. + ''' for delayed_call in self._delayed_calls: params = self._delayed_calls[delayed_call] if isinstance(params, dict): diff --git a/accera/python/accera/lang/Schedule.py b/accera/python/accera/lang/Schedule.py index 9b6bbbaa..c0f6ef3c 100644 --- a/accera/python/accera/lang/Schedule.py +++ b/accera/python/accera/lang/Schedule.py @@ -48,6 +48,7 @@ class Schedule: def __init__(self, nest: Nest): self._nest = nest self._delayed_calls = {} + self._parameterized_index_map = {} # nest.get_indices gives us a single index if there's only one index self._indices = nest.get_indices() @@ -357,13 +358,29 @@ def _skew_delayed(self, index: LoopIndex, reference_index: LoopIndex, unroll_loo # If this function is updated to return something, fused schedule needs to be updated as well def _replay_delayed_calls(self): - for delayed_call in self._delayed_calls: - params = self._delayed_calls[delayed_call] - - if isinstance(params, DelayedParameter): - delayed_call(params.get_value()) + ''' + This method is called once per adding function, so it can be called multiple times when + multiple functions get added. In order for the functions to be added correctly, we need to make sure all + the residual states are cleared between different method calls. + + In Schedule class, we identify that Schedule._index_map can have residual states, so we need to reset self._index_map + before we replay the delayed methods.
+ ''' + + if self._delayed_calls: + # Reset the index map to its pre-parameterized state before applying function-specific parameters + if self._parameterized_index_map: + self._index_map = self._deep_copy_index_map(self._parameterized_index_map) else: - delayed_call(params) + self._parameterized_index_map = self._deep_copy_index_map(self._index_map) + + for delayed_call in self._delayed_calls: + params = self._delayed_calls[delayed_call] + + if isinstance(params, DelayedParameter): + delayed_call(params.get_value()) + else: + delayed_call(params) def _resolve_index(self, index): if index not in self._index_map: @@ -422,6 +439,13 @@ def _get_num_split_blocks(self, indices: List[LoopIndex]): result += self._get_index_num_blocks(i) return result + def _deep_copy_index_map(self, index_map): + index_map_copy = {} + for index, entry in index_map.items(): + inners_copy = [idx for idx in entry.inners] + index_map_copy[index] = IndexEntry(entry.stop, entry.start, entry.step, inners_copy, entry.parent, entry.transform) + + return index_map_copy class FusedSchedule(Schedule): diff --git a/accera/python/accera/test/dsl_tests.py b/accera/python/accera/test/dsl_tests.py index 3273acb5..d8ce013c 100644 --- a/accera/python/accera/test/dsl_tests.py +++ b/accera/python/accera/test/dsl_tests.py @@ -49,6 +49,7 @@ def expectedFailure(reason: FailedReason, msg: str, condition: bool = True) -> C "Extends the unittest.expectedFailure decorator to print failure details and takes an optional condition" def _decorator(func): + @unittest.expectedFailure def _wrapper(x): print(f"\n{reason.value}: {msg}") @@ -64,6 +65,7 @@ def _wrapper(x): class DSLTest_01Arrays(unittest.TestCase): + def _verify_nest(self, nest, args: Tuple[Array], package_name, correctness_check_values=None) -> None: # create a HAT package and add the function to it @@ -150,8 +152,8 @@ def test_const_array_type_layout(self) -> None: import numpy as np D = np.ones((128, 256), dtype=np.float64) - for t in [ScalarType.int8, ScalarType.int16, ScalarType.int32, ScalarType.int64, ScalarType.float32, - ScalarType.float64]: + for t in [ScalarType.int8, ScalarType.int16, ScalarType.int32, ScalarType.int64, ScalarType.float16, + ScalarType.float32, ScalarType.float64]: A = Array(role=Array.Role.CONST, element_type=t, layout=Array.Layout.LAST_MAJOR, data=D) self.assertIsNotNone(A) @@ -429,6 +431,7 @@ def main(arr): class DSLTest_02SimpleAffineLoopNests(unittest.TestCase): + def _create_nest(self, shape: Tuple[int], type=ScalarType.float32) -> Tuple: # helper function to create a nest so that we can focus on the logic function from accera import Nest @@ -453,8 +456,8 @@ def _build_nest(self, nest, args: Tuple[Array], package_name) -> None: package.build(package_name, format=TEST_FORMAT, mode=TEST_MODE, output_dir=TEST_PACKAGE_DIR) def test_signed_types(self) -> None: - for t in [ScalarType.int8, ScalarType.int16, ScalarType.int32, ScalarType.int64, ScalarType.float32, - ScalarType.float64]: + for t in [ScalarType.int8, ScalarType.int16, ScalarType.int32, ScalarType.int64, ScalarType.float16, + ScalarType.float32, ScalarType.float64]: nest, A, B, C = self._create_nest((16, 10, 11), type=t) i, j, k = nest.get_indices() @@ -513,8 +516,8 @@ def _(): self._build_nest(nest, [A, B, C], f"test_types_{t}") def test_arithmetic_operations_1(self) -> None: - for t in [ScalarType.int8, ScalarType.int16, ScalarType.int32, ScalarType.int64, ScalarType.float32, - ScalarType.float64]: + for t in [ScalarType.int8, ScalarType.int16, ScalarType.int32, ScalarType.int64, 
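
The snapshot-and-restore added to `Schedule._replay_delayed_calls` above is the heart of the multi-function `package.add()` fix: the first replay captures the pre-parameterized index map in `_parameterized_index_map`, and every later replay restores a deep copy of it before re-running the delayed calls. The toy class below illustrates why the reset matters; everything in it apart from the `_parameterized_index_map` idea is invented for illustration.

```
# Toy illustration of the reset-before-replay pattern: delayed calls mutate shared
# state, so each replay must start from the pre-parameterized snapshot.
import copy


class ToySchedule:
    def __init__(self):
        self.index_map = {"i": []}               # shared mutable state (split factors per index)
        self._parameterized_index_map = None     # snapshot, analogous to Schedule._parameterized_index_map
        self._delayed_calls = [lambda: self.index_map["i"].append(4)]    # e.g. a delayed split(i, 4)

    def replay_delayed_calls(self):
        if self._parameterized_index_map is None:
            # first function being added: remember the pre-parameterized state
            self._parameterized_index_map = copy.deepcopy(self.index_map)
        else:
            # subsequent functions: reset to that state before replaying
            self.index_map = copy.deepcopy(self._parameterized_index_map)
        for call in self._delayed_calls:
            call()
        return self.index_map["i"]


sched = ToySchedule()
assert sched.replay_delayed_calls() == [4]    # first package.add()
assert sched.replay_delayed_calls() == [4]    # second package.add(): without the reset this would be [4, 4]
```
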
ScalarType.float16, + ScalarType.float32, ScalarType.float64]: nest, A, B, C = self._create_nest((16, 10, 11), type=t) i, j, k = nest.get_indices() @@ -539,6 +542,7 @@ def test_relational_operations(self) -> None: @nest.iteration_logic def _(): + def f1(): C[i, j] += A[i, k] + B[k, j] @@ -595,8 +599,8 @@ def _(): def test_intrinsics_1(self) -> None: from accera import max, min - for t in [ScalarType.float32, ScalarType.float64, ScalarType.int8, ScalarType.int16, ScalarType.int32, - ScalarType.int64]: + for t in [ScalarType.float16, ScalarType.float32, ScalarType.float64, ScalarType.int8, ScalarType.int16, + ScalarType.int32, ScalarType.int64]: nest, A, B, C = self._create_nest((16, 10, 11), type=t) i, j, k = nest.get_indices() @@ -613,7 +617,7 @@ def test_intrinsics_2(self) -> None: from accera import abs, sqrt, exp, log, log10, log2, sin, cos, ceil, floor, tan, cosh, sinh, tanh # from accera._lang_python import fast_exp, fast_exp_mlas - for t in [ScalarType.float32, ScalarType.float64]: + for t in [ScalarType.float16, ScalarType.float32, ScalarType.float64]: nest, A, B, C = self._create_nest((16, 10, 11), type=t) i, j, k = nest.get_indices() @@ -675,6 +679,7 @@ def _(): class DSLTest_03Schedules(unittest.TestCase): + def _create_nest(self, shape: Tuple[int], type=ScalarType.float32) -> Tuple: from accera import Nest @@ -939,6 +944,7 @@ def _(): class DSLTest_04Fusing(unittest.TestCase): + def _verify_schedule(self, schedule, args: Tuple[Array], package_name, correctness_check_values) -> None: # create a HAT package and add the function to it package = Package() @@ -1571,6 +1577,7 @@ def _(): class DSLTest_05Targets(unittest.TestCase): + def test_known_targets(self) -> None: intel_name = "Intel 6400" intel = Target(known_name=intel_name, num_threads=44) @@ -1609,11 +1616,21 @@ def test_gpu_targets(self) -> None: v100_name = "NVidia V100" v100 = Target(Target.Model.NVIDIA_V100, category=Target.Category.GPU) self.assertEqual(v100.name, v100_name) - self.assertEqual(v100.default_block_size, 16) self.assertEqual(v100.category, Target.Category.GPU) + self.assertEqual(v100.warp_size, 32) + + + mi100 = Target(Target.Model.AMD_MI100) + self.assertEqual(mi100.warp_size, 64) + self.assertEqual(mi100.frequency_GHz, 1.502) + + a100 = Target(Target.Model.NVIDIA_A100) + self.assertEqual(a100.warp_size, 32) + class DSLTest_06PlansCaching(unittest.TestCase): + def _create_plan(self, shape: Tuple[int], type=ScalarType.float32) -> Tuple: from accera import Nest @@ -1694,7 +1711,6 @@ def test_caching_by_element_budget(self) -> None: self._verify_plan(plan, [A, B, C], "test_caching_by_element_budget") - @expectedFailure(FailedReason.NOT_IN_CORE, "thrifty caching") def test_thrifty_caching(self) -> None: plan, args, indices = self._create_plan((16, 10, 11)) A, B, C = args @@ -1726,7 +1742,7 @@ def _(): v100 = Target(Target.Model.NVIDIA_V100, category=Target.Category.GPU, num_threads=16) plan = nest.create_plan(v100) - plan.cache(i, type=v100.MemoryType.SHARED) + plan.cache(i, type=v100.MemorySpace.SHARED) self._verify_plan(plan, [A], "test_cache_mapping") def test_cache_trigger_level(self) -> None: @@ -1847,6 +1863,7 @@ def _(): class DSLTest_07PlansVectorizationParallelization(unittest.TestCase): + def _verify_plan(self, plan, args: Tuple[int], package_name, correctness_check_values=None) -> None: package = Package() function = package.add(plan, args, base_name="vectorization_parallelization_test") @@ -1900,6 +1917,7 @@ def _(): plan = nest.create_plan(my_target) plan.vectorize(index=i) 
self._verify_plan(plan, [A, B, C], "test_vectorize") + def test_kernelize(self) -> None: from accera import Target, Nest @@ -2062,6 +2080,7 @@ def _(): class DSLTest_08DeferredLayout(unittest.TestCase): + def _verify_package(self, plan, args, package_name, correctness_check_values) -> None: package = Package() function = package.add(plan, args, base_name="deferred_layout") @@ -2152,6 +2171,7 @@ def _(): class DSLTest_09Parameters(unittest.TestCase): + def test_parameterization_1(self) -> None: from accera import create_parameters, Nest @@ -2864,8 +2884,57 @@ def _(): with verifiers.VerifyPackage(self, package_name, TEST_PACKAGE_DIR): package.build(name=package_name, format=TEST_FORMAT, mode=TEST_MODE, output_dir=TEST_PACKAGE_DIR) + def test_parameterization_auxiliary_data(self) -> None: + from accera import create_parameters, get_parameters_from_grid, Nest, Schedule + from hatlib import HATPackage + + P0, P1, P2, P3, P4 = create_parameters(5) + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(P0, P2)) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(P2, P1)) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(P0, P1)) + + nest = Nest(shape=(P0, P1, P2)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += P3 * A[i, k] * B[k, j] + + sched: Schedule = nest.create_schedule() + sched.split(j, P4) + + package = Package() + package_name = "test_parameterization_auxiliary_data" + + parameter_grid = { + P0: [8, 16], + P1: [16, 32], + P2: [16], + P3: [1.0, 2.0], + P4: [3, 5, 7] + } + + parameters = get_parameters_from_grid(parameter_grid) + package.add(sched, args=(A, B, C), base_name="matmul", parameters=parameters) + + with verifiers.VerifyPackage(self, package_name, TEST_PACKAGE_DIR): + package.build(name=package_name, format=TEST_FORMAT, mode=TEST_MODE, output_dir=TEST_PACKAGE_DIR) + + hat_package = HATPackage(pathlib.Path(TEST_PACKAGE_DIR) / f"{package_name}.hat") + functions = [fn for fn in hat_package.get_functions()] + for function in functions: + data_point = function.auxiliary['accera'] + if data_point: + self.assertIn(int(data_point["P0"]), [8, 16]) + self.assertIn(int(data_point["P1"]), [16, 32]) + self.assertIn(int(data_point["P2"]), [16]) + self.assertIn(float(data_point["P3"]), [1.0, 2.0]) + self.assertIn(int(data_point["P4"]), [3, 5, 7]) + class DSLTest_10Packages(unittest.TestCase): + def _create_plan(self, target=Target.HOST) -> Function: from accera import Nest diff --git a/accera/python/accera/test/smoke_test.py b/accera/python/accera/test/smoke_test.py index 9623204c..1facc1fa 100644 --- a/accera/python/accera/test/smoke_test.py +++ b/accera/python/accera/test/smoke_test.py @@ -12,7 +12,14 @@ import shutil import numpy as np from enum import Enum -from typing import Callable +from typing import Callable, List + +try: + import cuda, pynvrtc +except: + CUDA_AVAILABLE = False +else: + CUDA_AVAILABLE = True DEV_MODE = False if "@CMAKE_INSTALL_PREFIX@"[1:-1] != "CMAKE_INSTALL_PREFIX": @@ -38,12 +45,14 @@ class FailedReason(Enum): NOT_IN_PY = "Not yet implemented (python)" UNKNOWN = "Unknown failure" BUG = "Bug" + INVALID = "Invalid" def expectedFailure(reason: FailedReason, msg: str) -> Callable: "Extends the unittest.expectedFailure decorator to print failure details" def _decorator(func): + @unittest.expectedFailure def _wrapper(x): print(f"\n{reason.value}: {msg}") @@ -561,19 +570,13 @@ def test_const_array_shared_across_functions(self) -> None: const_matrix_shape = (K, 
N) data = np.random.random(const_matrix_shape).astype(np.float32) - const_matrix = Array(role=Array.Role.CONST, - element_type=ScalarType.float32, - data=data) + const_matrix = Array(role=Array.Role.CONST, element_type=ScalarType.float32, data=data) # Matmul function - matmul_input_matrix = Array(role=Array.Role.INPUT, - element_type=ScalarType.float32, - shape=(M, K)) + matmul_input_matrix = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K)) - matmul_output_matrix = Array(role=Array.Role.INPUT_OUTPUT, - element_type=ScalarType.float32, - shape=(M, N)) + matmul_output_matrix = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N)) matmul_nest = Nest(shape=(M, N, K)) i, j, k = matmul_nest.get_indices() @@ -587,16 +590,11 @@ def _(): package.add(matmul_plan, args=(matmul_input_matrix, matmul_output_matrix), base_name="matmul_fn") - # Elementwise add function - ew_add_input_matrix = Array(role=Array.Role.INPUT, - element_type=ScalarType.float32, - shape=(K, N)) + ew_add_input_matrix = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N)) - ew_add_output_matrix = Array(role=Array.Role.INPUT_OUTPUT, - element_type=ScalarType.float32, - shape=(K, N)) + ew_add_output_matrix = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(K, N)) ew_add_nest = Nest(shape=(K, N)) x, y = ew_add_nest.get_indices() @@ -614,7 +612,6 @@ def _(): with verifiers.VerifyPackage(self, package_name, TEST_PACKAGE_DIR): package.build(package_name, format=self.PACKAGE_FORMAT, mode=self.PACKAGE_MODE, output_dir=TEST_PACKAGE_DIR) - def test_gpu_matmul(self) -> None: import math from accera import Target @@ -656,6 +653,7 @@ def round_up(number, multiple): @nest.iteration_logic def _(): + def if_block(): C[i, j] += A[i, k] * B[k, j] @@ -1095,8 +1093,7 @@ def test_strided_sub_array(self) -> None: subArrayNumRows = 2 subArrayNumCols = 3 - Input = Array(role=Array.Role.INPUT_OUTPUT, - element_type=ScalarType.float32, shape=(N, N)) + Input = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(N, N)) # Zero out a sub array of size [2, 3]: # xxxxx @@ -1116,15 +1113,14 @@ def _(): schedule = out_nest.create_schedule() package = Package() - function = package.add(schedule, args=(Input,), base_name="strided_sub_array") + function = package.add(schedule, args=(Input, ), base_name="strided_sub_array") package_name = "test_strided_sub_array" output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name shutil.rmtree(output_dir, ignore_errors=True) with verifiers.VerifyPackage(self, package_name, output_dir) as v: - package.build(name=package_name, format=self.PACKAGE_FORMAT, - mode=self.PACKAGE_MODE, output_dir=output_dir) + package.build(name=package_name, format=self.PACKAGE_FORMAT, mode=self.PACKAGE_MODE, output_dir=output_dir) # correctness check Data = np.random.random([N, N]).astype(np.float32) @@ -1135,7 +1131,7 @@ def _(): DataStrided[3, 1] = 0.0 DataStrided[3, 2] = 0.0 DataStrided[3, 3] = 0.0 - v.check_correctness(function.name, before=(Data,), after=(DataStrided,)) + v.check_correctness(function.name, before=(Data, ), after=(DataStrided, )) def test_padded_nchwc_conv2d_manual_cache(self) -> None: input_channels = 64 @@ -1432,6 +1428,50 @@ def _verify_matrix_multiplication_function( v.check_correctness(function.name, before=(A_test, B_test, C_test), after=(A_test, B_test, C_ref)) + def _verify_convolution_function( + self, function: "accera.Function", package: Package, package_name: str, buffer_padding: 
List[int], + conv_padding: List[int], stride: List[int] + ) -> None: + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name + shutil.rmtree(output_dir, ignore_errors=True) + + row_stride, column_stride = stride + channel_padding, row_padding, column_padding = conv_padding + channel_buffer_padding, row_buffer_padding, column_buffer_padding = buffer_padding + + # correctness check + def naive_convolution_ref(input, kernel, output): + input_channels, input_rows, input_columns = input.shape + out_filters, output_rows, output_columns = output.shape + _, _, kernel_rows, kernel_columns = kernel.shape + output_ref = output.copy() + for out_f in range(out_filters): + for out_r in range(output_rows - 2 * row_buffer_padding): + for out_c in range(output_columns - 2 * column_buffer_padding): + for in_ch in range(input_channels): + for k_r in range(kernel_rows): + for k_c in range(kernel_columns): + in_r = out_r * row_stride + k_r - row_padding + in_c = out_c * column_stride + k_c - column_padding + output_ref[out_f, out_r + row_buffer_padding, out_c + column_buffer_padding] += \ + input[in_ch, in_r + row_buffer_padding, in_c + column_buffer_padding] * \ + kernel[out_f, in_ch, k_r, k_c] + return output_ref + + # unpadded_Input_test, unpadded_Kernel_test, unpadded_Output_test = (np.random.random(p.shape).astype(np.float32) for p in function.args) + + with verifiers.VerifyPackage(self, package_name, output_dir) as v: + package.build(name=package_name, format=self.PACKAGE_FORMAT, mode=self.PACKAGE_MODE, output_dir=output_dir) + + Input_test, Kernel_test, Output_test = (np.random.random(p.shape).astype(np.float32) for p in function.args) + Output_ref = naive_convolution_ref(Input_test, Kernel_test, Output_test) + + v.check_correctness( + function.name, + before=(Input_test, Kernel_test, Output_test), + after=(Input_test, Kernel_test, Output_ref) + ) + def _multicache_matmul_common(self, M, N, K, name_suffix, jjj_split=16) -> None: import accera as acc @@ -2061,7 +2101,7 @@ def _(): ii = schedule.split(i, block_x) schedule.reorder(i, ii) - target = Target(category=Target.Category.GPU) + target = Target(category=Target.Category.GPU, runtime=Target.Runtime.CUDA) plan = schedule.create_plan(target) plan.bind((i, ii), grid=(target.GridUnit.BLOCK_X, target.GridUnit.THREAD_X)) @@ -2072,14 +2112,21 @@ def _(): output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name shutil.rmtree(output_dir, ignore_errors=True) - with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu"]) as v: + with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu", + f"{test_name}.hat"]) as v: package.build( name=test_name, - format=Package.Format.MLIR | Package.Format.CUDA, + format=Package.Format.MLIR | Package.Format.CUDA | Package.Format.HAT_PACKAGE, mode=Package.Mode.RELEASE, output_dir=output_dir ) + if CUDA_AVAILABLE: + before = [np.random.rand(*p.shape).astype(np.float32) for p in function.args] + after = [before[0], before[1]] + [before[0] + before[1]] + + v.check_correctness(function.name, before=before, after=after) + def test_cuda_module_output(self) -> None: from accera import Array, Nest, Package, ScalarType, Target from accera._lang_python._lang import _MemorySpace @@ -2102,6 +2149,7 @@ def _(): schedule = nest.create_schedule() ii, jj = schedule.tile((i, j), (block_x, block_y)) + schedule.reorder(i, j, ii, jj) target = Target(Target.Model.NVIDIA_V100) plan = schedule.create_plan(target=target) @@ -2118,17 +2166,23 @@ def _(): output_dir = 
pathlib.Path(TEST_PACKAGE_DIR) / test_name shutil.rmtree(output_dir, ignore_errors=True) - with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu"]) as v: + with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu", + f"{test_name}.hat"]) as v: package.build( name=test_name, - format=Package.Format.MLIR | Package.Format.CUDA, + format=Package.Format.MLIR | Package.Format.CUDA | Package.Format.HAT_PACKAGE, mode=Package.Mode.RELEASE, # Package.Mode.DEBUG, output_dir=output_dir ) + if CUDA_AVAILABLE: + Input_test, Output_test = (np.random.uniform(-1, 1, p.shape).astype(np.float32) for p in function.args) + Input_ref = Output_ref = Input_test + + v.check_correctness(function.name, before=(Input_test, Output_test), after=(Input_ref, Output_ref)) + def test_rocm_module_output(self) -> None: from accera import Array, Nest, Package, ScalarType, Target - from accera.lang import CacheIndexing, BLOCK_X, BLOCK_Y, THREAD_X, THREAD_Y from accera._lang_python._lang import _MemorySpace # Define our vector sizes @@ -2173,6 +2227,1837 @@ def _(): output_dir=output_dir ) + def test_rocm_tensorize_single_block_single_warp_output(self) -> None: + from accera import Array, Nest, Package, ScalarType, Target + + M = 16 + N = M + K = M + outer_tile_x = 16 + outer_tile_y = outer_tile_x + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K)) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N)) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N)) + + nest = Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + + ii, jj = schedule.tile((i, j), (outer_tile_x, outer_tile_y)) + iii, jjj, kk = schedule.tile((ii, jj, k), (2, 2, 16)) + + schedule.reorder((i, j, ii, jj, k, iii, jjj, kk)) + + target = Target(Target.Model.AMD_MI100) + plan = schedule.create_plan(target=target) + plan.bind( + (i, j, ii, jj), + grid=(target.GridUnit.BLOCK_Y, target.GridUnit.BLOCK_X, target.GridUnit.THREAD_Y, target.GridUnit.THREAD_X) + ) + plan.tensorize(indices=(iii, jjj, kk)) + + test_name = "test_rocm_tensorize_single_block_single_warp_output" + package = Package() + function = package.add(plan, args=(A, B, C), base_name=test_name) + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + shutil.rmtree(output_dir, ignore_errors=True) + + with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu", + f"{test_name}.hat"]) as v: + package.build( + name=test_name, + format=Package.Format.MLIR | Package.Format.CUDA | Package.Format.HAT_PACKAGE, + mode=Package.Mode.RELEASE, # Package.Mode.DEBUG, + output_dir=output_dir + ) + + def test_rocm_tensorize_single_block_multi_warp_output(self) -> None: + from accera import Array, Nest, Package, ScalarType, Target + + M = 64 + N = M + K = M + outer_tile_x = 64 + outer_tile_y = outer_tile_x + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K)) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N)) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N)) + + nest = Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + + ii, jj = schedule.tile((i, j), (outer_tile_x, outer_tile_y)) + iii, jjj, kk = schedule.tile((ii, jj, k), (2, 2, 
16)) + + schedule.reorder((i, j, k, ii, jj, iii, jjj, kk)) + + target = Target(Target.Model.AMD_MI100) + plan = schedule.create_plan(target=target) + plan.bind( + (i, j, ii, jj), + grid=(target.GridUnit.BLOCK_Y, target.GridUnit.BLOCK_X, target.GridUnit.THREAD_Y, target.GridUnit.THREAD_X) + ) + plan.tensorize(indices=(iii, jjj, kk)) + + test_name = "test_rocm_tensorize_single_block_multi_warp_output" + package = Package() + function = package.add(plan, args=(A, B, C), base_name=test_name) + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + shutil.rmtree(output_dir, ignore_errors=True) + + with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu", + f"{test_name}.hat"]) as v: + package.build( + name=test_name, + format=Package.Format.MLIR | Package.Format.CUDA | Package.Format.HAT_PACKAGE, + mode=Package.Mode.RELEASE, # Package.Mode.DEBUG, + output_dir=output_dir + ) + + def test_rocm_tensorize_multi_block_multi_warp_output(self) -> None: + from accera import Array, Nest, Package, ScalarType, Target + + M = 1024 + N = M + K = M + outer_tile_x = 64 + outer_tile_y = outer_tile_x + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K)) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N)) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N)) + + nest = Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + + ii, jj = schedule.tile((i, j), (outer_tile_x, outer_tile_y)) + iii, jjj, kk = schedule.tile((ii, jj, k), (2, 2, 16)) + + schedule.reorder((i, j, ii, jj, k, iii, jjj, kk)) + + target = Target(Target.Model.AMD_MI100) + plan = schedule.create_plan(target=target) + plan.bind( + (i, j, ii, jj), + grid=(target.GridUnit.BLOCK_Y, target.GridUnit.BLOCK_X, target.GridUnit.THREAD_Y, target.GridUnit.THREAD_X) + ) + plan.tensorize(indices=(iii, jjj, kk)) + + test_name = "test_rocm_tensorize_multi_block_multi_warp_output" + package = Package() + function = package.add(plan, args=(A, B, C), base_name=test_name) + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + shutil.rmtree(output_dir, ignore_errors=True) + + with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu", + f"{test_name}.hat"]) as v: + package.build( + name=test_name, + format=Package.Format.MLIR | Package.Format.CUDA | Package.Format.HAT_PACKAGE, + mode=Package.Mode.RELEASE, # Package.Mode.DEBUG, + output_dir=output_dir + ) + + @expectedFailure(FailedReason.INVALID, "the hardware does not support the requested tensorcore shape") + def test_rocm_tensorize_invalid_shape_output(self) -> None: + from accera import Array, Nest, Package, ScalarType, Target + + M = 256 + N = M + K = M + block_x = 64 + block_y = block_x + tile_size = 64 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K)) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N)) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N)) + + nest = Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + + ii, jj = schedule.tile((i, j), (block_x, block_y)) + iii, jjj, kk = schedule.tile((ii, jj, k), (tile_size, tile_size, tile_size)) + + schedule.reorder((i, j, ii, jj, k, iii, jjj, kk)) + + target = 
Target(Target.Model.AMD_MI100) + plan = schedule.create_plan(target=target) + plan.bind( + (i, j, ii, jj), + grid=(target.GridUnit.BLOCK_X, target.GridUnit.BLOCK_Y, target.GridUnit.THREAD_X, target.GridUnit.THREAD_Y) + ) + plan.tensorize(indices=(iii, jjj, kk)) + + test_name = "test_rocm_tensorize_invalid_shape_output" + package = Package() + function = package.add(plan, args=(A, B, C), base_name=test_name) + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + shutil.rmtree(output_dir, ignore_errors=True) + + with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu", + f"{test_name}.hat"]) as v: + package.build( + name=test_name, + format=Package.Format.MLIR | Package.Format.CUDA | Package.Format.HAT_PACKAGE, + mode=Package.Mode.RELEASE, # Package.Mode.DEBUG, + output_dir=output_dir + ) + + def test_gpu_cache_simple(self) -> None: + from accera import Array, Nest, Package, ScalarType, Target + from accera.lang import CacheIndexing, BLOCK_X, BLOCK_Y, THREAD_X, THREAD_Y + from accera._lang_python._lang import _MemorySpace + + M = 1024 + N = 1024 + K = 1024 + block_x = 16 + block_y = block_x + k_tile_size = 32 + + m_tile_size = block_x + n_tile_size = block_y + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K), layout=Array.Layout.FIRST_MAJOR) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N), layout=Array.Layout.FIRST_MAJOR) + C = Array( + role=Array.Role.INPUT_OUTPUT, + element_type=ScalarType.float32, + shape=(M, N), + layout=Array.Layout.FIRST_MAJOR + ) + + nest = Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + + ii, jj, kk = schedule.tile((i, j, k), (m_tile_size, n_tile_size, k_tile_size)) + schedule.reorder(i, j, k, ii, jj, kk) + + target = Target(Target.Model.AMD_MI100) + plan = schedule.create_plan(target=target) + plan.bind( + (i, j, ii, jj), + grid=(target.GridUnit.BLOCK_X, target.GridUnit.BLOCK_Y, target.GridUnit.THREAD_X, target.GridUnit.THREAD_Y) + ) + plan.cache(A, index=ii, location=_MemorySpace.SHARED, layout=Array.Layout.FIRST_MAJOR) + plan.cache(B, index=ii, location=_MemorySpace.SHARED, layout=Array.Layout.FIRST_MAJOR) + + test_name = "test_gpu_cache_simple" + package = Package() + function = package.add(plan, args=(A, B, C), base_name=test_name) + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + shutil.rmtree(output_dir, ignore_errors=True) + + with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu", + f"{test_name}.hat"]) as v: + package.build( + name=test_name, + format=Package.Format.MLIR_VERBOSE | Package.Format.CUDA | Package.Format.HAT_PACKAGE, + mode=Package.Mode.RELEASE, # Package.Mode.DEBUG, + output_dir=output_dir, + _quiet=False + ) + + def test_gpu_cache_double_buffering(self) -> None: + from accera import Array, Nest, Package, ScalarType, Target + from accera.lang import CacheIndexing, BLOCK_X, BLOCK_Y, THREAD_X, THREAD_Y + from accera._lang_python._lang import _MemorySpace + + M = 2560 + N = 1536 + K = 2048 + block_x = 16 + block_y = block_x + k_tile_size = 32 + + m_tile_size = block_x + n_tile_size = block_y + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K), layout=Array.Layout.FIRST_MAJOR) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N), layout=Array.Layout.FIRST_MAJOR) + C = Array( + role=Array.Role.INPUT_OUTPUT, + 
element_type=ScalarType.float32, + shape=(M, N), + layout=Array.Layout.FIRST_MAJOR + ) + + nest = Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + + ii, jj, kk = schedule.tile((i, j, k), (m_tile_size, n_tile_size, k_tile_size)) + schedule.reorder(i, j, k, ii, jj, kk) + + target = Target(Target.Model.AMD_MI100) + plan = schedule.create_plan(target=target) + plan.bind( + (i, j, ii, jj), + grid=(target.GridUnit.BLOCK_X, target.GridUnit.BLOCK_Y, target.GridUnit.THREAD_X, target.GridUnit.THREAD_Y) + ) + plan.cache(A, index=ii, double_buffer=True, location=_MemorySpace.SHARED, layout=Array.Layout.FIRST_MAJOR) + plan.cache(B, index=ii, double_buffer=True, location=_MemorySpace.SHARED, layout=Array.Layout.FIRST_MAJOR) + + test_name = "test_gpu_cache_double_buffering" + package = Package() + function = package.add(plan, args=(A, B, C), base_name=test_name) + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + shutil.rmtree(output_dir, ignore_errors=True) + + with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu", + f"{test_name}.hat"]) as v: + package.build( + name=test_name, + format=Package.Format.MLIR_VERBOSE | Package.Format.CUDA | Package.Format.HAT_PACKAGE, + mode=Package.Mode.RELEASE, # Package.Mode.DEBUG, + output_dir=output_dir, + _quiet=False + ) + + def test_gpu_cache_double_buffering_trigger_index(self) -> None: + from accera import Array, Nest, Package, ScalarType, Target + from accera.lang import CacheIndexing, BLOCK_X, BLOCK_Y, THREAD_X, THREAD_Y + from accera._lang_python._lang import _MemorySpace + + M = 2560 + N = 1536 + K = 2048 + block_x = 16 + block_y = block_x + k_outer_tile_size = 512 + k_inner_tile_size = 32 + + m_tile_size = block_x + n_tile_size = block_y + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K), layout=Array.Layout.FIRST_MAJOR) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N), layout=Array.Layout.FIRST_MAJOR) + C = Array( + role=Array.Role.INPUT_OUTPUT, + element_type=ScalarType.float32, + shape=(M, N), + layout=Array.Layout.FIRST_MAJOR + ) + + nest = Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + + ii, jj, kk = schedule.tile((i, j, k), (m_tile_size, n_tile_size, k_outer_tile_size)) + kkk = schedule.split(kk, k_inner_tile_size) + schedule.reorder(i, j, k, kk, ii, jj, kkk) + + target = Target(Target.Model.AMD_MI100) + plan = schedule.create_plan(target=target) + plan.bind( + (i, j, ii, jj), + grid=(target.GridUnit.BLOCK_X, target.GridUnit.BLOCK_Y, target.GridUnit.THREAD_X, target.GridUnit.THREAD_Y) + ) + plan.cache( + A, + index=ii, + trigger_index=kk, + double_buffer=True, + location=_MemorySpace.SHARED, + layout=Array.Layout.FIRST_MAJOR + ) + plan.cache( + B, + index=ii, + trigger_index=kk, + double_buffer=True, + location=_MemorySpace.SHARED, + layout=Array.Layout.FIRST_MAJOR + ) + + test_name = "test_gpu_cache_double_buffering_trigger_index" + package = Package() + function = package.add(plan, args=(A, B, C), base_name=test_name) + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + shutil.rmtree(output_dir, ignore_errors=True) + + with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu", + f"{test_name}.hat"]) as v: + package.build( + name=test_name, + format=Package.Format.MLIR_VERBOSE | 
Package.Format.CUDA | Package.Format.HAT_PACKAGE, + mode=Package.Mode.RELEASE, # Package.Mode.DEBUG, + output_dir=output_dir, + _quiet=False + ) + + def test_gpu_cache_double_buffering_mem_space(self) -> None: + from accera import Array, Nest, Package, ScalarType, Target + from accera.lang import CacheIndexing, BLOCK_X, BLOCK_Y, THREAD_X, THREAD_Y + from accera._lang_python._lang import _MemorySpace + + M = 2560 + N = 1536 + K = 2048 + block_x = 16 + block_y = block_x + k_tile_size = 32 + + m_tile_size = block_x + n_tile_size = block_y + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K), layout=Array.Layout.FIRST_MAJOR) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N), layout=Array.Layout.FIRST_MAJOR) + C = Array( + role=Array.Role.INPUT_OUTPUT, + element_type=ScalarType.float32, + shape=(M, N), + layout=Array.Layout.FIRST_MAJOR + ) + + nest = Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + + ii, jj, kk = schedule.tile((i, j, k), (m_tile_size, n_tile_size, k_tile_size)) + schedule.reorder(i, j, k, ii, jj, kk) + + target = Target(Target.Model.AMD_MI100) + plan = schedule.create_plan(target=target) + plan.bind( + (i, j, ii, jj), + grid=(target.GridUnit.BLOCK_X, target.GridUnit.BLOCK_Y, target.GridUnit.THREAD_X, target.GridUnit.THREAD_Y) + ) + plan.cache( + A, index=ii, double_buffer=True, location=_MemorySpace.SHARED, layout=Array.Layout.FIRST_MAJOR + ) # Double buffer should be in private mem + plan.cache( + B, + index=ii, + double_buffer=True, + location=_MemorySpace.SHARED, + double_buffer_location=_MemorySpace.SHARED, + layout=Array.Layout.FIRST_MAJOR + ) + + test_name = "test_gpu_cache_double_buffering_mem_space" + package = Package() + function = package.add(plan, args=(A, B, C), base_name=test_name) + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + shutil.rmtree(output_dir, ignore_errors=True) + + with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu", + f"{test_name}.hat"]) as v: + package.build( + name=test_name, + format=Package.Format.MLIR_VERBOSE | Package.Format.CUDA | Package.Format.HAT_PACKAGE, + mode=Package.Mode.RELEASE, # Package.Mode.DEBUG, + output_dir=output_dir, + _quiet=False + ) + + def test_cpu_cache_double_buffering_trigger_index(self) -> None: + from accera import Array, Nest, Package, ScalarType + + M = 1024 + N = 1024 + K = 1024 + m_tile_size = 16 + n_tile_size = 16 + k_outer_tile_size = 256 + k_inner_tile_size = 32 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K), layout=Array.Layout.FIRST_MAJOR) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N), layout=Array.Layout.FIRST_MAJOR) + C = Array( + role=Array.Role.INPUT_OUTPUT, + element_type=ScalarType.float32, + shape=(M, N), + layout=Array.Layout.FIRST_MAJOR + ) + + nest = Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + + ii, jj, kk = schedule.tile((i, j, k), (m_tile_size, n_tile_size, k_outer_tile_size)) + kkk = schedule.split(kk, k_inner_tile_size) + schedule.reorder(i, j, k, kk, ii, jj, kkk) + + plan = schedule.create_plan() + plan.cache(A, index=ii, trigger_index=kk, double_buffer=True, layout=Array.Layout.FIRST_MAJOR) + plan.cache(B, index=ii, trigger_index=kk, double_buffer=True, layout=Array.Layout.FIRST_MAJOR) + + 
test_name = "test_cpu_cache_double_buffering_trigger_index" + package = Package() + function = package.add(plan, args=(A, B, C), base_name=test_name) + + self._verify_matrix_multiplication_function(function, package, test_name) + + def test_gpu_barrier_opt(self): + from accera import Array, Nest, Package, ScalarType, Target + from accera._lang_python._lang import Allocate, _MemorySpace, Array as NativeArray + from accera._lang_python._lang._gpu import Barrier + from accera._lang_python import _MemoryLayout + + N = 256 + block_x = 16 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(N, )) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(N, )) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(N, )) + + nest = Nest(shape=(N, )) + i = nest.get_indices() + + @nest.iteration_logic + def _(): + # Performs excessive barriers. + shA = NativeArray( + Allocate( + type=ScalarType.float32, layout=_MemoryLayout([block_x]).set_memory_space(_MemorySpace.SHARED) + ) + ) + shB = NativeArray( + Allocate( + type=ScalarType.float32, layout=_MemoryLayout([block_x]).set_memory_space(_MemorySpace.SHARED) + ) + ) + Barrier() + shA[i] = A[i] + Barrier() + Barrier() + shA[i] = B[i] + Barrier() # Only this is needed. + C[i] = shA[i] + shB[i] + Barrier() + + schedule = nest.create_schedule() + ii = schedule.split(i, block_x) + schedule.reorder(i, ii) + + target = Target(category=Target.Category.GPU, runtime=Target.Runtime.ROCM) + plan = schedule.create_plan(target) + plan.bind((i, ii), grid=(target.GridUnit.BLOCK_X, target.GridUnit.THREAD_X)) + + test_name = "test_gpu_barrier_opt" + package = Package() + function = package.add(plan, args=(A, B, C), base_name=test_name) + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + shutil.rmtree(output_dir, ignore_errors=True) + + with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu"]) as v: + package.build( + name=test_name, + format=Package.Format.MLIR | Package.Format.CUDA, + mode=Package.Mode.RELEASE, + output_dir=output_dir + ) + + def test_rocm_gemm_tiled_output(self) -> None: + from accera import Array, Nest, Package, ScalarType, Target + + M = 16 + N = M + K = M + block_x = 16 + block_y = block_x + tile_size = 16 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K)) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N)) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N)) + + nest = Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + ii, jj = schedule.tile((i, j), (block_x, block_y)) + iii, jjj, kk = schedule.tile((ii, jj, k), (tile_size, tile_size, tile_size)) + + schedule.reorder((i, j, ii, jj, k, iii, jjj, kk)) + + target = Target(Target.Model.AMD_MI100) + plan = schedule.create_plan(target=target) + plan.bind( + (i, j, ii, jj), + grid=(target.GridUnit.BLOCK_X, target.GridUnit.BLOCK_Y, target.GridUnit.THREAD_X, target.GridUnit.THREAD_Y) + ) + + test_name = "test_rocm_gemm_tiled_output" + package = Package() + function = package.add(plan, args=(A, B, C), base_name=test_name) + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + shutil.rmtree(output_dir, ignore_errors=True) + + # We expect the output to have a block dim = [1,1,1] and grid dim = [1,1,1] + # there will be an inner 16x16x16 loop that performs the actual computation. + # i.e. 
the computation is performed by a single block which contains a single thread + with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu", + f"{test_name}.hat"]) as v: + package.build( + name=test_name, + format=Package.Format.MLIR | Package.Format.CUDA | Package.Format.HAT_PACKAGE, + mode=Package.Mode.RELEASE, # Package.Mode.DEBUG, + output_dir=output_dir + ) + + # Thrifty caching + # Note: these tests will only verify that the thrifty caching cases compile and compute the correct result; + # however, they will not validate whether a cache buffer is successfully elided due to the delayed lowering + # model we have. Currently the only way to verify this is manual IR inspection following the LoopNestToValueFunc + # lowering pass + + def test_thrifty_caching_simple_input_cache(self) -> None: + import accera as acc + + package = Package() + + M = 32 + N = 32 + K = 32 + + A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K), layout=acc.Array.Layout.FIRST_MAJOR) + B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N), layout=acc.Array.Layout.FIRST_MAJOR) + C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N), layout=acc.Array.Layout.FIRST_MAJOR) + + nest = acc.Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + + ii = schedule.split(i, 4) + jj = schedule.split(j, 16) + kk = schedule.split(k, 32) + + order = [i, j, k, ii, jj, kk] + schedule.reorder(order) + + plan = schedule.create_plan() + + # This cache should get elided because at ii the active block is of shape 4x32, which is a contiguous subarray of the 32x32 base array A + AA = plan.cache(A, index=ii, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR) + + # This cache should not get elided because at k the active block is of shape 32x16, which is a discontiguous subarray of the 32x32 base array B + BB = plan.cache(B, index=k, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR) + + function = package.add(plan, args=(A, B, C), base_name=f"test_thrifty_caching_simple_input_cache") + + self._verify_matrix_multiplication_function(function, package, f"test_thrifty_caching_simple_input_cache") + + def test_thrifty_caching_simple_output_cache_elide(self) -> None: + import accera as acc + + package = Package() + + M = 32 + N = 32 + K = 32 + + A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K), layout=acc.Array.Layout.FIRST_MAJOR) + B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N), layout=acc.Array.Layout.FIRST_MAJOR) + C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N), layout=acc.Array.Layout.FIRST_MAJOR) + + nest = acc.Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + + ii = schedule.split(i, 4) + jj = schedule.split(j, 32) + kk = schedule.split(k, 8) + + order = [i, j, k, ii, jj, kk] + schedule.reorder(order) + + plan = schedule.create_plan() + + # This cache should get elided because at ii the active block has the shape 4x32, which is a contiguous subarray of the 32x32 base array C + CC = plan.cache(C, index=ii, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR) + + function = package.add(plan, args=(A, B, C), base_name=f"test_thrifty_caching_simple_output_cache_elide") + + self._verify_matrix_multiplication_function( + function, package, f"test_thrifty_caching_simple_output_cache_elide" + ) + + # Note: The following thrifty cache tests are commented out as they increase the 
runtime of the smoke_test by too much + # TODO : move these to a new exhaustive test suite that isn't run as part of the buddy build + + # def test_thrifty_caching_simple_output_cache_no_elide(self) -> None: + # import accera as acc + + # package = Package() + + # M = 32 + # N = 32 + # K = 32 + + # A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K)) + # B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N)) + # C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N)) + + # nest = acc.Nest(shape=(M, N, K)) + # i, j, k = nest.get_indices() + + # @nest.iteration_logic + # def _(): + # C[i, j] += A[i, k] * B[k, j] + + # schedule = nest.create_schedule() + + # ii = schedule.split(i, 4) + # jj = schedule.split(j, 8) + # kk = schedule.split(k, 32) + + # order = [i, j, k, ii, jj, kk] + # schedule.reorder(order) + + # plan = schedule.create_plan() + + # # This cache should not get elided because at ii the active block is of shape 4x8, which is a discontiguous subarray of the base array C + # CC = plan.cache(C, index=ii, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR) + + # function = package.add(plan, + # args=(A,B,C), + # base_name=f"test_thrifty_caching_simple_output_cache_no_elide") + + # self._verify_matrix_multiplication_function(function, package, f"test_thrifty_caching_simple_output_cache_no_elide") + + # def test_thrifty_caching_transpose_input_no_elide(self) -> None: + # import accera as acc + + # package = Package() + + # M = 32 + # N = 32 + # K = 32 + + # A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K), layout=acc.Array.Layout.FIRST_MAJOR) + # B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N), layout=acc.Array.Layout.FIRST_MAJOR) + # C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N), layout=acc.Array.Layout.FIRST_MAJOR) + + # nest = acc.Nest(shape=(M, N, K)) + # i, j, k = nest.get_indices() + + # @nest.iteration_logic + # def _(): + # C[i, j] += A[i, k] * B[k, j] + + # schedule = nest.create_schedule() + + # ii = schedule.split(i, 4) + # jj = schedule.split(j, 8) + # kk = schedule.split(k, 32) + + # order = [i, j, k, ii, jj, kk] + # schedule.reorder(order) + + # plan = schedule.create_plan() + + # # Note that at index ii, the active block is of shape 4x32 and is a contiguous sub-buffer of the base array A, + # # however the cache stride is different between elements so it should not get elided + # AA = plan.cache(A, index=ii, thrifty=True, layout=acc.Array.Layout.LAST_MAJOR) + + # function = package.add(plan, + # args=(A,B,C), + # base_name=f"test_thrifty_caching_transpose_input_no_elide") + + # self._verify_matrix_multiplication_function(function, package, f"test_thrifty_caching_transpose_input_no_elide") + + # def test_thrifty_caching_transpose_input_elide(self) -> None: + # # This case transposes the shape of the input cache, however the schedule and cache are constructed such that the active + # # block is a 1-D slice that is contiguous in the base array + + # import accera as acc + + # package = Package() + + # M = 32 + # N = 32 + # K = 32 + + # A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K), layout=acc.Array.Layout.FIRST_MAJOR) + # B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N), layout=acc.Array.Layout.FIRST_MAJOR) + # C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N), layout=acc.Array.Layout.FIRST_MAJOR) + + # nest = acc.Nest(shape=(M, N, K)) + # i, j, k = nest.get_indices() + + # @nest.iteration_logic + # def _(): + # C[i, j] += A[i, k] * B[k, j] + + # schedule = nest.create_schedule() + + # ii = schedule.split(i, 1) + # jj 
= schedule.split(j, 8) + # kk = schedule.split(k, 32) + + # order = [i, j, k, ii, jj, kk] + # schedule.reorder(order) + + # plan = schedule.create_plan() + + # # Note that at index ii, the active block is of shape 1x32 and is a contiguous sub-buffer of the base array A, + # # and even though the cache transposes the active block, it still has a stride of 1 between the elements as + # # so it should get elided + # AA = plan.cache(A, index=ii, thrifty=True, layout=acc.Array.Layout.LAST_MAJOR) + + # function = package.add(plan, + # args=(A,B,C), + # base_name=f"test_thrifty_caching_transpose_input_elide") + + # self._verify_matrix_multiplication_function(function, package, f"test_thrifty_caching_transpose_input_elide") + + # def test_thrifty_caching_with_trigger_index_elide(self) -> None: + # import accera as acc + + # package = Package() + + # M = 32 + # N = 32 + # K = 32 + + # A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K), layout=acc.Array.Layout.FIRST_MAJOR) + # B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N), layout=acc.Array.Layout.FIRST_MAJOR) + # C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N), layout=acc.Array.Layout.FIRST_MAJOR) + + # nest = acc.Nest(shape=(M, N, K)) + # i, j, k = nest.get_indices() + + # @nest.iteration_logic + # def _(): + # C[i, j] += A[i, k] * B[k, j] + + # schedule = nest.create_schedule() + + # ii = schedule.split(i, 16) + # iii = schedule.split(ii, 4) + # jj = schedule.split(j, 8) + # kk = schedule.split(k, 32) + + # order = [i, k, ii, j, iii, jj, kk] + # schedule.reorder(order) + + # plan = schedule.create_plan() + + # # This cache should get elided because at ii the active block is of shape 4x32, which is a contiguous subarray of the 32x32 base array A + # # and the successive 4x32 active blocks within the 16x32 region covered at index ii are sequential contiguous subarrays of the 32x32 base array A + # # Note: index j between ii and iii in the order should have no effect on this + # AA = plan.cache(A, index=iii, trigger_index=ii, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR) + + # function = package.add(plan, + # args=(A,B,C), + # base_name=f"test_thrifty_caching_with_trigger_index_elide") + + # self._verify_matrix_multiplication_function(function, package, f"test_thrifty_caching_with_trigger_index_elide") + + # def test_thrifty_caching_convolution_duplication_no_elide(self) -> None: + # # A cache with a trigger index can result in input element duplication if the successive active blocks overlap. 
+ # # When the active blocks overlap, the resulting multi-cache with duplication is certainly not a strict + # # subbuffer of the original base array, and therefore the cache should not get elided + + # # Caching the input array in a Conv2D operation produces duplication as long as the + # # kernel rows and/or kernel columns loops are inside of the cache region in the loopnest + + # input_channels = 32 + # base_input_shape = (input_channels, 14, 14) + # buffer_padding = (0, 1, 1) + # conv_padding = (0, 1, 1) + # stride = (2, 2) + # kernel_shape = (3, 3) + # output_filters = 32 + + # import math + # unpadded_output_rows = math.floor( + # ((base_input_shape[1] + (2 * conv_padding[1]) - (kernel_shape[0] - 1) - 1) / stride[0]) + 1) + # unpadded_output_columns = math.floor( + # ((base_input_shape[2] + (2 * conv_padding[2]) - (kernel_shape[1] - 1) - 1) / stride[1]) + 1) + # base_output_shape = (output_filters, unpadded_output_rows, unpadded_output_columns) + + # # Pad the buffers so we don't need to deal with conditionals at the edges in this test + # padded_input_shape = [base_input_shape[i] + 2*buffer_padding[i] + # for i in range(len(base_input_shape))] + # padded_output_shape = [base_output_shape[i] + 2*buffer_padding[i] + # for i in range(len(base_output_shape))] + + # weights_shape = (output_filters, input_channels, kernel_shape[0], kernel_shape[1]) + + # Input = Array(role=Array.Role.INPUT, + # element_type=ScalarType.float32, shape=padded_input_shape) + # Kernel = Array(role=Array.Role.INPUT, + # element_type=ScalarType.float32, shape=weights_shape) + # Output = Array(role=Array.Role.INPUT_OUTPUT, + # element_type=ScalarType.float32, shape=padded_output_shape) + + # nest = Nest(shape=(output_filters, + # input_channels, + # unpadded_output_rows, + # unpadded_output_columns, + # kernel_shape[0], + # kernel_shape[1])) + + # out_f, in_ch, out_r, out_c, k_r, k_c = nest.get_indices() + + # row_stride, column_stride = stride + # channel_padding, row_padding, column_padding = conv_padding + # channel_buffer_padding, row_buffer_padding, column_buffer_padding = buffer_padding + # # Define the iteration logic + + # @nest.iteration_logic + # def _(): + # in_r = out_r * row_stride - row_padding + k_r + # in_c = out_c * column_stride - column_padding + k_c + # Output[out_f, out_r + row_buffer_padding, out_c + column_buffer_padding] += \ + # Input[in_ch, in_r + row_buffer_padding, in_c + column_buffer_padding] * \ + # Kernel[out_f, in_ch, k_r, k_c] + + # schedule = nest.create_schedule() + + # # We don't need to reorder as the kernel row and kernel column loops are already inside + # # of the tensor shape loops in the nest + + # plan = schedule.create_plan() + + # # Cache input array at a level that will produce duplication + # # This thrifty cache should not be elided as it is duplicating input elements + # # so it has an inconsistent stride between the base input and the cache + # # The active cache here should be a 3x3 subarray in the rows/columns dimension of the input + # # and the trigger level should cause duplication between the column sections as there is overlap + # plan.cache(Input, trigger_index=out_c, index=k_r, thrifty=True, layout=Array.Layout.FIRST_MAJOR) + + # package = Package() + # function = package.add(plan, args=( + # Input, Kernel, Output), base_name="test_thrifty_caching_convolution_duplication_no_elide") + + # package_name = "test_thrifty_caching_convolution_duplication_no_elide" + + # self._verify_convolution_function(function, package, package_name, buffer_padding, 
conv_padding, stride) + + # def test_thrifty_caching_convolution_no_duplication_elide(self) -> None: + # # This test creates a convolution loopnest but with the kernel row and kernel column loops + # # outside of the input cache region, so there is no element duplication and the cache is a subbuffer + # # of the input, so the thrifty cache can be elided + + # input_channels = 32 + # base_input_shape = (input_channels, 14, 14) + # buffer_padding = (0, 1, 1) + # conv_padding = (0, 1, 1) + # stride = (2, 2) + # kernel_shape = (3, 3) + # output_filters = 32 + + # import math + # unpadded_output_rows = math.floor( + # ((base_input_shape[1] + (2 * conv_padding[1]) - (kernel_shape[0] - 1) - 1) / stride[0]) + 1) + # unpadded_output_columns = math.floor( + # ((base_input_shape[2] + (2 * conv_padding[2]) - (kernel_shape[1] - 1) - 1) / stride[1]) + 1) + # base_output_shape = ( + # output_filters, unpadded_output_rows, unpadded_output_columns) + + # # Pad the buffers so we don't need to deal with conditionals at the edges in this test + # padded_input_shape = [base_input_shape[i] + 2*buffer_padding[i] + # for i in range(len(base_input_shape))] + # padded_output_shape = [base_output_shape[i] + 2*buffer_padding[i] + # for i in range(len(base_output_shape))] + + # weights_shape = (output_filters, input_channels, kernel_shape[0], kernel_shape[1]) + + # Input = Array(role=Array.Role.INPUT, + # element_type=ScalarType.float32, shape=padded_input_shape) + # Kernel = Array(role=Array.Role.INPUT, + # element_type=ScalarType.float32, shape=weights_shape) + # Output = Array(role=Array.Role.INPUT_OUTPUT, + # element_type=ScalarType.float32, shape=padded_output_shape) + + # nest = Nest(shape=(output_filters, + # input_channels, + # unpadded_output_rows, + # unpadded_output_columns, + # kernel_shape[0], + # kernel_shape[1])) + + # out_f, in_ch, out_r, out_c, k_r, k_c = nest.get_indices() + + # row_stride, column_stride = stride + # channel_padding, row_padding, column_padding = conv_padding + # channel_buffer_padding, row_buffer_padding, column_buffer_padding = buffer_padding + # # Define the iteration logic + + # @nest.iteration_logic + # def _(): + # in_r = out_r * row_stride - row_padding + k_r + # in_c = out_c * column_stride - column_padding + k_c + # Output[out_f, out_r + row_buffer_padding, out_c + column_buffer_padding] += \ + # Input[in_ch, in_r + row_buffer_padding, in_c + column_buffer_padding] * \ + # Kernel[out_f, in_ch, k_r, k_c] + + # schedule = nest.create_schedule() + + # # Reorder the schedule to put the kernel row and kernel column loops outside the + # # row, and column loops + # schedule.reorder(out_f, in_ch, k_r, k_c, out_r, out_c) + + # plan = schedule.create_plan() + + # # This thrifty cache should be elided as it is a strict subbuffer of the original input array + # plan.cache(Input, index=out_c, thrifty=True, layout=Array.Layout.FIRST_MAJOR) + + # package = Package() + # function = package.add(plan, args=( + # Input, Kernel, Output), base_name="test_thrifty_caching_convolution_no_duplication_elide") + + # package_name = "test_thrifty_caching_convolution_no_duplication_elide" + + # self._verify_convolution_function(function, package, package_name, buffer_padding, conv_padding, stride) + + # def test_thrifty_caching_convolution_no_duplication_no_elide_padding(self) -> None: + # # This test creates a convolution loopnest but with the kernel row and kernel column loops + # # outside of the input cache region, so there is no element duplication and the cache is a subbuffer + # # of the input, 
but the cached region doesn't include the padding in the input buffer, so the cache + # should not be elided + + # input_channels = 32 + # base_input_shape = (input_channels, 14, 14) + # buffer_padding = (0, 1, 1) + # conv_padding = (0, 1, 1) + # stride = (2, 2) + # kernel_shape = (3, 3) + # output_filters = 32 + + # import math + # unpadded_output_rows = math.floor( + # ((base_input_shape[1] + (2 * conv_padding[1]) - (kernel_shape[0] - 1) - 1) / stride[0]) + 1) + # unpadded_output_columns = math.floor( + # ((base_input_shape[2] + (2 * conv_padding[2]) - (kernel_shape[1] - 1) - 1) / stride[1]) + 1) + # base_output_shape = ( + # output_filters, unpadded_output_rows, unpadded_output_columns) + + # # Pad the buffers so we don't need to deal with conditionals at the edges in this test + # padded_input_shape = [base_input_shape[i] + 2*buffer_padding[i] + # for i in range(len(base_input_shape))] + # padded_output_shape = [base_output_shape[i] + 2*buffer_padding[i] + # for i in range(len(base_output_shape))] + + # weights_shape = (output_filters, input_channels, kernel_shape[0], kernel_shape[1]) + + # Input = Array(role=Array.Role.INPUT, + # element_type=ScalarType.float32, shape=padded_input_shape) + # Kernel = Array(role=Array.Role.INPUT, + # element_type=ScalarType.float32, shape=weights_shape) + # Output = Array(role=Array.Role.INPUT_OUTPUT, + # element_type=ScalarType.float32, shape=padded_output_shape) + + # nest = Nest(shape=(output_filters, + # input_channels, + # unpadded_output_rows, + # unpadded_output_columns, + # kernel_shape[0], + # kernel_shape[1])) + + # out_f, in_ch, out_r, out_c, k_r, k_c = nest.get_indices() + + # row_stride, column_stride = stride + # channel_padding, row_padding, column_padding = conv_padding + # channel_buffer_padding, row_buffer_padding, column_buffer_padding = buffer_padding + # # Define the iteration logic + + # @nest.iteration_logic + # def _(): + # in_r = out_r * row_stride - row_padding + k_r + # in_c = out_c * column_stride - column_padding + k_c + # Output[out_f, out_r + row_buffer_padding, out_c + column_buffer_padding] += \ + # Input[in_ch, in_r + row_buffer_padding, in_c + column_buffer_padding] * \ + # Kernel[out_f, in_ch, k_r, k_c] + + # schedule = nest.create_schedule() + + # # Reorder the schedule to put the kernel row and kernel column loops outside the + # # row and column loops + # schedule.reorder(out_f, in_ch, k_r, k_c, out_r, out_c) + + # plan = schedule.create_plan() + + # # This thrifty cache should not be elided because the cached region does not include the padding in the input buffer + # plan.cache(Input, index=out_r, thrifty=True, layout=Array.Layout.FIRST_MAJOR) + + # package = Package() + # function = package.add(plan, args=( + # Input, Kernel, Output), base_name="test_thrifty_caching_convolution_no_duplication_no_elide_padding") + + # package_name = "test_thrifty_caching_convolution_no_duplication_no_elide_padding" + + # self._verify_convolution_function(function, package, package_name, buffer_padding, conv_padding, stride) + + # def test_thrifty_caching_max_elements_elide(self) -> None: + # import accera as acc + + # package = Package() + + # M = 32 + # N = 32 + # K = 32 + + # A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K), layout=acc.Array.Layout.FIRST_MAJOR) + # B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N), layout=acc.Array.Layout.FIRST_MAJOR) + # C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N), layout=acc.Array.Layout.FIRST_MAJOR) + + # nest = acc.Nest(shape=(M, N, K)) + # i, j, k = nest.get_indices() + + # 
@nest.iteration_logic + # def _(): + # C[i, j] += A[i, k] * B[k, j] + + # schedule = nest.create_schedule() + + # ii = schedule.split(i, 4) + # kk = schedule.split(k, 32) + # kkk = schedule.split(kk, 8) + + # order = [i, k, j, kk, ii, kkk] + # schedule.reorder(order) + + # plan = schedule.create_plan() + + # # This cache should get elided because with a budget of 4*32 the cache level will be + # # at index kk, where the active block is of shape 4x32, which is a contiguous subarray of the 32x32 base array A + # AA = plan.cache(A, max_elements=4*32, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR) + + # function = package.add(plan, + # args=(A,B,C), + # base_name=f"test_thrifty_caching_max_elements_elide") + + # self._verify_matrix_multiplication_function(function, package, f"test_thrifty_caching_max_elements_elide") + + # def test_thrifty_caching_max_elements_no_elide(self) -> None: + # import accera as acc + + # package = Package() + + # M = 32 + # N = 32 + # K = 32 + + # A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K), layout=acc.Array.Layout.FIRST_MAJOR) + # B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N), layout=acc.Array.Layout.FIRST_MAJOR) + # C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N), layout=acc.Array.Layout.FIRST_MAJOR) + + # nest = acc.Nest(shape=(M, N, K)) + # i, j, k = nest.get_indices() + + # @nest.iteration_logic + # def _(): + # C[i, j] += A[i, k] * B[k, j] + + # schedule = nest.create_schedule() + + # ii = schedule.split(i, 4) + # kk = schedule.split(k, 32) + # kkk = schedule.split(kk, 8) + + # order = [i, k, j, kk, ii, kkk] + # schedule.reorder(order) + + # plan = schedule.create_plan() + + # # This cache should not get elided because with a budget of (4*32 - 1) the cache level will be + # # at index ii, where the active block is of shape 4x32, which is not a contiguous subarray of the 32x32 base array A + # AA = plan.cache(A, max_elements=(4*32 - 1), thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR) + + # function = package.add(plan, + # args=(A,B,C), + # base_name=f"test_thrifty_caching_max_elements_no_elide") + + # self._verify_matrix_multiplication_function(function, package, f"test_thrifty_caching_max_elements_no_elide") + + # def test_thrifty_caching_coefficient_layout_elide(self) -> None: + # import accera as acc + + # package = Package() + + # M = 32 + # N = 32 + # K = 32 + + # A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K), layout=acc.Array.Layout.FIRST_MAJOR) + # B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N), layout=acc.Array.Layout.FIRST_MAJOR) + # C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N), layout=acc.Array.Layout.FIRST_MAJOR) + + # nest = acc.Nest(shape=(M, N, K)) + # i, j, k = nest.get_indices() + + # @nest.iteration_logic + # def _(): + # C[i, j] += A[i, k] * B[k, j] + + # schedule = nest.create_schedule() + + # ii = schedule.split(i, 4) + # kk = schedule.split(k, 32) + # kkk = schedule.split(kk, 8) + + # order = [i, k, j, kk, ii, kkk] + # schedule.reorder(order) + + # plan = schedule.create_plan() + + # # This cache should get elided because with a cache level of kk, the active block is of shape 4x32, which is a contiguous subarray of the 32x32 base array A + # # and the cache layout is a coefficient-specified layout which is equivalent to FIRST_MAJOR for this case + # AA = plan.cache(A, index=kk, thrifty=True, layout=(32, 1)) + + # function = package.add(plan, + # args=(A,B,C), + # base_name=f"test_thrifty_caching_coefficient_layout_elide") + + # 
self._verify_matrix_multiplication_function(function, package, f"test_thrifty_caching_coefficient_layout_elide") + + # @expectedFailure(FailedReason.BUG, "Coefficient caches with gaps don't create sufficiently large buffers") + # def test_thrifty_caching_coefficient_layout_no_elide_gaps(self) -> None: + # import accera as acc + + # package = Package() + + # M = 32 + # N = 32 + # K = 32 + + # A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K), layout=acc.Array.Layout.FIRST_MAJOR) + # B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N), layout=acc.Array.Layout.FIRST_MAJOR) + # C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N), layout=acc.Array.Layout.FIRST_MAJOR) + + # nest = acc.Nest(shape=(M, N, K)) + # i, j, k = nest.get_indices() + + # @nest.iteration_logic + # def _(): + # C[i, j] += A[i, k] * B[k, j] + + # schedule = nest.create_schedule() + + # ii = schedule.split(i, 4) + # kk = schedule.split(k, 32) + # kkk = schedule.split(kk, 8) + + # order = [i, k, j, kk, ii, kkk] + # schedule.reorder(order) + + # plan = schedule.create_plan() + + # # This cache should not get elided because with a cache level of kk, the active block is of shape 4x32, which is a contiguous subarray of the 32x32 base array A + # # but the cache layout is a coefficient-specified layout which is almost equivalent to FIRST_MAJOR for this case but has + # # 2 additional empty elements between rows + # AA = plan.cache(A, index=kk, thrifty=True, layout=(32 + 2, 1)) + + # function = package.add(plan, + # args=(A,B,C), + # base_name=f"test_thrifty_caching_coefficient_layout_no_elide_gaps") + + # self._verify_matrix_multiplication_function(function, package, f"test_thrifty_caching_coefficient_layout_no_elide_gaps") + + # def test_thrifty_caching_different_memory_space_no_elide(self) -> None: + # import accera as acc + + # # TODO : update once MemorySpace is better surfaced + # from accera._lang_python._lang import _MemorySpace + + # package = Package() + + # M = 32 + # N = 32 + # K = 32 + + # A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K), layout=acc.Array.Layout.FIRST_MAJOR) + # B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N), layout=acc.Array.Layout.FIRST_MAJOR) + # C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N), layout=acc.Array.Layout.FIRST_MAJOR) + + # nest = acc.Nest(shape=(M, N, K)) + # i, j, k = nest.get_indices() + + # @nest.iteration_logic + # def _(): + # C[i, j] += A[i, k] * B[k, j] + + # schedule = nest.create_schedule() + + # ii = schedule.split(i, 4) + # kk = schedule.split(k, 32) + + # order = [i, j, k, ii, kk] + # schedule.reorder(order) + + # target = acc.Target(category=acc.Target.Category.GPU, runtime=acc.Target.Runtime.ROCM) + # plan = schedule.create_plan(target) + + # # With a cache level of ii, the active block is 4x32 and is a contiguous subarray of the base array A, however + # # the cache should not get elided because it resides in a different memory space from the base array + # AA = plan.cache(A, index=ii, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR, location=_MemorySpace.SHARED) + + # # Shared -> PRIVATE move so this should not get elided even though with a cache index of kk it is a single contiguous row + # # copy from the outer cache + # AAA = plan.cache(AA, index=kk, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR, location=_MemorySpace.PRIVATE) + + # function = package.add(plan, + # args=(A,B,C), + # base_name=f"test_thrifty_caching_different_memory_space_no_elide") + + # package_name = 
f"test_thrifty_caching_different_memory_space_no_elide" + # output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name + # shutil.rmtree(output_dir, ignore_errors=True) + + # gpu_package_format = Package.Format.MLIR | Package.Format.CUDA | Package.Format.HAT_PACKAGE + # with verifiers.VerifyPackage(self, package_name, output_dir) as v: + # package.build(name=package_name, format=gpu_package_format, mode=self.PACKAGE_MODE, output_dir=output_dir) + + # def test_thrifty_caching_multiple_memory_spaces_elide(self) -> None: + # import accera as acc + + # # TODO : update once MemorySpace is better surfaced + # from accera._lang_python._lang import _MemorySpace + + # package = Package() + + # M = 32 + # N = 32 + # K = 32 + + # A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K), layout=acc.Array.Layout.FIRST_MAJOR) + # B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N), layout=acc.Array.Layout.FIRST_MAJOR) + # C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N), layout=acc.Array.Layout.FIRST_MAJOR) + + # nest = acc.Nest(shape=(M, N, K)) + # i, j, k = nest.get_indices() + + # @nest.iteration_logic + # def _(): + # C[i, j] += A[i, k] * B[k, j] + + # schedule = nest.create_schedule() + + # ii = schedule.split(i, 4) + # kk = schedule.split(k, 32) + + # order = [i, j, k, ii, kk] + # schedule.reorder(order) + + # target = acc.Target(category=acc.Target.Category.GPU, runtime=acc.Target.Runtime.ROCM) + # plan = schedule.create_plan(target) + + # # With a cache level of ii, the active block is 4x32 and is a contiguous subarray of the base array A, however + # # the cache should not get elided because it resides in a different memory space from the base array + # AA = plan.cache(A, index=ii, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR, location=_MemorySpace.SHARED) + + # # This cache is a contigous subarray of the outer cache and it is in the same memory space, so it should get elided + # AAA = plan.cache(AA, index=kk, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR, location=_MemorySpace.SHARED) + + # function = package.add(plan, + # args=(A,B,C), + # base_name=f"test_thrifty_caching_multiple_memory_spaces_elide") + + # package_name = f"test_thrifty_caching_multiple_memory_spaces_elide" + # output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name + # shutil.rmtree(output_dir, ignore_errors=True) + + # gpu_package_format = Package.Format.MLIR | Package.Format.CUDA | Package.Format.HAT_PACKAGE + # with verifiers.VerifyPackage(self, package_name, output_dir) as v: + # package.build(name=package_name, format=gpu_package_format, mode=self.PACKAGE_MODE, output_dir=output_dir, _quiet=False) + + # def test_thrifty_caching_hierarchical_elide_outer(self) -> None: + # import accera as acc + + # package = Package() + + # M = 32 + # N = 32 + # K = 32 + + # A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K), layout=acc.Array.Layout.FIRST_MAJOR) + # B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N), layout=acc.Array.Layout.FIRST_MAJOR) + # C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N), layout=acc.Array.Layout.FIRST_MAJOR) + + # nest = acc.Nest(shape=(M, N, K)) + # i, j, k = nest.get_indices() + + # @nest.iteration_logic + # def _(): + # C[i, j] += A[i, k] * B[k, j] + + # schedule = nest.create_schedule() + + # ii = schedule.split(i, 4) + # kk = schedule.split(k, 8) + + # order = [i, j, k, ii, kk] + # schedule.reorder(order) + + # plan = schedule.create_plan() + + # # With a cache level of k, the active block is 4x32 and is a contiguous subarray of the base array A, + # # so 
it should get elided + # AA = plan.cache(A, index=k, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR) + + # # With a cache level of ii, the active block is 4x8 inside a cache of size 4x32, so the cache should not get elided + # AAA = plan.cache(AA, index=ii, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR) + + # function = package.add(plan, + # args=(A, B, C), + # base_name=f"test_thrifty_caching_hierarchical_elide_outer") + + # self._verify_matrix_multiplication_function(function, package, f"test_thrifty_caching_hierarchical_elide_outer") + + # def test_thrifty_caching_hierarchical_elide_inner(self) -> None: + # import accera as acc + + # package = Package() + + # M = 32 + # N = 32 + # K = 32 + + # A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K), layout=acc.Array.Layout.FIRST_MAJOR) + # B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N), layout=acc.Array.Layout.FIRST_MAJOR) + # C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N), layout=acc.Array.Layout.FIRST_MAJOR) + + # nest = acc.Nest(shape=(M, N, K)) + # i, j, k = nest.get_indices() + + # @nest.iteration_logic + # def _(): + # C[i, j] += A[i, k] * B[k, j] + + # schedule = nest.create_schedule() + + # ii = schedule.split(i, 4) + # kk = schedule.split(k, 8) + + # order = [i, j, k, ii, kk] + # schedule.reorder(order) + + # plan = schedule.create_plan() + + # # With a cache level of ii, the active block is 4x8 and is not a contiguous subarray of the base array A, + # # so it should not get elided + # AA = plan.cache(A, index=ii, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR) + + # # With a cache level of kk, the active block is 1x8 inside a cache of size 4x8, so it is a contiguous subarray + # # of the cache AA so this inner cache should get elided + # AAA = plan.cache(AA, index=kk, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR) + + # function = package.add(plan, + # args=(A, B, C), + # base_name=f"test_thrifty_caching_hierarchical_elide_inner") + + # self._verify_matrix_multiplication_function(function, package, f"test_thrifty_caching_hierarchical_elide_inner") + + # def test_thrifty_caching_hierarchical_elide_middle(self) -> None: + # import accera as acc + + # package = Package() + + # M = 32 + # N = 32 + # K = 32 + + # A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K), layout=acc.Array.Layout.FIRST_MAJOR) + # B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N), layout=acc.Array.Layout.FIRST_MAJOR) + # C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N), layout=acc.Array.Layout.FIRST_MAJOR) + + # nest = acc.Nest(shape=(M, N, K)) + # i, j, k = nest.get_indices() + + # @nest.iteration_logic + # def _(): + # C[i, j] += A[i, k] * B[k, j] + + # schedule = nest.create_schedule() + + # ii = schedule.split(i, 8) + # iii = schedule.split(ii, 2) + # kk = schedule.split(k, 16) + # kkk = schedule.split(kk, 4) + + # order = [i, j, k, ii, kk, iii, kkk] + # schedule.reorder(order) + + # plan = schedule.create_plan() + + # # With a cache level of ii, the active block is 8x16 and is not a contiguous subarray of the base array A, + # # so it should not get elided + # AA = plan.cache(A, index=ii, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR) + + # # With a cache level of kk, the active block is 2x16 inside a cache of size 8x16, so it is a contiguous subarray + # # of the cache array AA so this cache should get elided + # AAA = plan.cache(AA, index=kk, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR) + + # # With a cache level of iii, the active block is 2x4 inside a cache of size 2x16, or really 8x16 
after the middle cache is elided. + # In either case this is a discontiguous subarray so this cache should not get elided + # AAAA = plan.cache(AAA, index=iii, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR) + + # function = package.add(plan, + # args=(A, B, C), + # base_name=f"test_thrifty_caching_hierarchical_elide_middle") + + # self._verify_matrix_multiplication_function(function, package, f"test_thrifty_caching_hierarchical_elide_middle") + + # def test_thrifty_caching_elide_boundary_no_elide_main(self) -> None: + # # This case creates a loopnest where in a boundary condition the cached segment is a strict subarray of the base array + # # but the main section of the loop is not, so the boundary section of the cache should be elided, but the main + # # section should not + + # import accera as acc + + # package = Package() + + # M = 33 # 33 so that M % i_split_size = 1 and a boundary condition is created + # N = 32 + # K = 32 + + # A = acc.Array(role=acc.Array.Role.INPUT, shape=(M, K), layout=acc.Array.Layout.FIRST_MAJOR) + # B = acc.Array(role=acc.Array.Role.INPUT, shape=(K, N), layout=acc.Array.Layout.FIRST_MAJOR) + # C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, shape=(M, N), layout=acc.Array.Layout.FIRST_MAJOR) + + # nest = acc.Nest(shape=(M, N, K)) + # i, j, k = nest.get_indices() + + # @nest.iteration_logic + # def _(): + # C[i, j] += A[i, k] * B[k, j] + + # schedule = nest.create_schedule() + + # ii = schedule.split(i, 4) + # kk = schedule.split(k, 8) + + # order = [i, j, k, ii, kk] + # schedule.reorder(order) + + # plan = schedule.create_plan() + + # # With a cache level of ii, the active block is 4x8 in the main part of the loopnest and is a discontiguous subarray of the base array A, + # # so it should not get elided. + # # However in the boundary condition on ii created because (M % i_split_size) = (33 % 4) = 1, the active block is 1x8, which is a contiguous + # # subarray of the base array A, so the boundary cache should get elided + # AA = plan.cache(A, index=ii, thrifty=True, layout=acc.Array.Layout.FIRST_MAJOR) + + # function = package.add(plan, + # args=(A, B, C), + # base_name=f"test_thrifty_caching_elide_boundary_no_elide_main") + + # self._verify_matrix_multiplication_function(function, package, f"test_thrifty_caching_elide_boundary_no_elide_main") + + def test_rocm_cache_tensorize(self) -> None: + from accera import Array, Nest, Package, ScalarType, Target + + M = 1024 + N = 1024 + K = 1024 + outer_tile_x = 64 + outer_tile_y = outer_tile_x + outer_tile_k = 64 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K), layout=Array.Layout.FIRST_MAJOR) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N), layout=Array.Layout.FIRST_MAJOR) + C = Array( + role=Array.Role.INPUT_OUTPUT, + element_type=ScalarType.float32, + shape=(M, N), + layout=Array.Layout.FIRST_MAJOR + ) + + nest = Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + + ii, jj, kk = schedule.tile((i, j, k), (outer_tile_x, outer_tile_y, outer_tile_k)) + iii, jjj, kkk = schedule.tile((ii, jj, kk), (2, 2, 16)) + + schedule.reorder(i, j, k, ii, jj, kk, iii, jjj, kkk) + + target = Target(Target.Model.AMD_MI100) + plan = schedule.create_plan(target=target) + plan.bind( + (i, j, ii, jj), + grid=(target.GridUnit.BLOCK_Y, target.GridUnit.BLOCK_X, target.GridUnit.THREAD_Y, target.GridUnit.THREAD_X) + ) + plan.tensorize(indices=(iii, jjj, kkk)) + 
plan.cache( + A, index=ii, double_buffer=False, location=target.MemorySpace.SHARED, layout=Array.Layout.FIRST_MAJOR + ) + plan.cache( + B, index=ii, double_buffer=False, location=target.MemorySpace.SHARED, layout=Array.Layout.FIRST_MAJOR + ) + + test_name = "test_rocm_cache_tensorize" + package = Package() + function = package.add(plan, args=(A, B, C), base_name=test_name) + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + shutil.rmtree(output_dir, ignore_errors=True) + + with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu", + f"{test_name}.hat"]) as v: + package.build( + name=test_name, + format=Package.Format.MLIR_VERBOSE | Package.Format.CUDA | Package.Format.HAT_PACKAGE, + mode=Package.Mode.RELEASE, # Package.Mode.DEBUG, + output_dir=output_dir, + _quiet=False + ) + + def test_rocm_cache_double_buffering_tensorize(self) -> None: + from accera import Array, Nest, Package, ScalarType, Target + + M = 1024 + N = 1024 + K = 1024 + outer_tile_x = 64 + outer_tile_y = outer_tile_x + outer_tile_k = 64 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(M, K), layout=Array.Layout.FIRST_MAJOR) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float32, shape=(K, N), layout=Array.Layout.FIRST_MAJOR) + C = Array( + role=Array.Role.INPUT_OUTPUT, + element_type=ScalarType.float32, + shape=(M, N), + layout=Array.Layout.FIRST_MAJOR + ) + + nest = Nest(shape=(M, N, K)) + i, j, k = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i, j] += A[i, k] * B[k, j] + + schedule = nest.create_schedule() + + ii, jj, kk = schedule.tile((i, j, k), (outer_tile_x, outer_tile_y, outer_tile_k)) + iii, jjj, kkk = schedule.tile((ii, jj, kk), (2, 2, 16)) + + schedule.reorder(i, j, k, ii, jj, kk, iii, jjj, kkk) + + target = Target(Target.Model.AMD_MI100) + plan = schedule.create_plan(target=target) + plan.bind( + (i, j, ii, jj), + grid=(target.GridUnit.BLOCK_Y, target.GridUnit.BLOCK_X, target.GridUnit.THREAD_Y, target.GridUnit.THREAD_X) + ) + plan.tensorize(indices=(iii, jjj, kkk)) + plan.cache(A, index=ii, double_buffer=True, location=target.MemorySpace.SHARED, layout=Array.Layout.FIRST_MAJOR) + plan.cache(B, index=ii, double_buffer=True, location=target.MemorySpace.SHARED, layout=Array.Layout.FIRST_MAJOR) + + test_name = "test_rocm_cache_double_buffering_tensorize" + package = Package() + function = package.add(plan, args=(A, B, C), base_name=test_name) + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / test_name + shutil.rmtree(output_dir, ignore_errors=True) + + with verifiers.VerifyPackage(self, test_name, output_dir, file_list=[f"{test_name}.cu", + f"{test_name}.hat"]) as v: + package.build( + name=test_name, + format=Package.Format.MLIR_VERBOSE | Package.Format.CUDA | Package.Format.HAT_PACKAGE, + mode=Package.Mode.RELEASE, # Package.Mode.DEBUG, + output_dir=output_dir, + _quiet=False + ) + + def test_fill_fp16(self): + from accera import Array, Nest, Package, ScalarType + from accera import _cast + + # Define our vector sizes + N = 2**16 + + Out = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float16, shape=(N, )) + + nest = Nest(shape=(N, )) + i = nest.get_indices() + + @nest.iteration_logic + def _(): + Out[i] = _cast(2, ScalarType.float16) + + schedule = nest.create_schedule() + plan = schedule.create_plan() + + package = Package() + package_name = "test_fill_fp16" + function = package.add(plan, args=(Out, ), base_name=package_name) + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name + 
shutil.rmtree(output_dir, ignore_errors=True) + + with verifiers.VerifyPackage(self, package_name, output_dir) as v: + package.build(name=package_name, format=self.PACKAGE_FORMAT, mode=self.PACKAGE_MODE, output_dir=output_dir) + + def fill_fp16(): + return 2 * np.ones(N).astype(np.float16) + + Output_test = np.random.random(N).astype(np.float16) + Output_ref = fill_fp16() + + v.check_correctness(function.name, before=(Output_test, ), after=(Output_ref, )) + + def test_abs_fp16(self): + from accera import Array, Nest, Package, ScalarType, Target + from accera import abs + + # Define our vector sizes + N = 16 + + In = Array(role=Array.Role.INPUT, element_type=ScalarType.float16, shape=(N, )) + Out = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float16, shape=(N, )) + + nest = Nest(shape=(N, )) + i = nest.get_indices() + + @nest.iteration_logic + def _(): + Out[i] = abs(In[i]) + + schedule = nest.create_schedule() + plan = schedule.create_plan() + + package = Package() + package_name = "test_abs_fp16" + function = package.add(plan, args=(In, Out), base_name=package_name) + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name + shutil.rmtree(output_dir, ignore_errors=True) + + with verifiers.VerifyPackage(self, package_name, output_dir) as v: + package.build(name=package_name, format=self.PACKAGE_FORMAT, mode=self.PACKAGE_MODE, output_dir=output_dir) + + def abs_fp16(a): + return np.abs(a) + + Input_test, Output_test = (np.random.uniform(-1, 1, p.shape).astype(np.float16) for p in function.args) + Output_ref = abs_fp16(Input_test) + + v.check_correctness(function.name, before=(Input_test, Output_test), after=(Input_test, Output_ref)) + + def test_vec_add_fp16(self): + from accera import Array, Nest, Package, ScalarType + + # Define our vector sizes + N = 2**16 + + A = Array(role=Array.Role.INPUT, element_type=ScalarType.float16, shape=(N, )) + B = Array(role=Array.Role.INPUT, element_type=ScalarType.float16, shape=(N, )) + C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float16, shape=(N, )) + + nest = Nest(shape=(N, )) + i = nest.get_indices() + + @nest.iteration_logic + def _(): + C[i] = A[i] + B[i] + + schedule = nest.create_schedule() + plan = schedule.create_plan() + + package = Package() + package_name = "test_vec_add_fp16" + function = package.add(plan, args=(A, B, C), base_name=package_name) + + output_dir = pathlib.Path(TEST_PACKAGE_DIR) / package_name + shutil.rmtree(output_dir, ignore_errors=True) + + with verifiers.VerifyPackage(self, package_name, output_dir) as v: + package.build(name=package_name, format=self.PACKAGE_FORMAT, mode=self.PACKAGE_MODE, output_dir=output_dir) + + def vecadd_ref(a, b): + return a + b + + Input0_test, Input1_test, Output_test = ( + np.random.random(p.shape).astype(np.float16) for p in function.args + ) + Output_ref = vecadd_ref(Input0_test, Input1_test) + + v.check_correctness( + function.name, + before=(Input0_test, Input1_test, Output_test), + after=(Input0_test, Input1_test, Output_ref) + ) + if __name__ == '__main__': unittest.main(verbosity=10) diff --git a/accera/python/accera/test/unit_tests.py b/accera/python/accera/test/unit_tests.py index 80c77d6e..23af7a45 100644 --- a/accera/python/accera/test/unit_tests.py +++ b/accera/python/accera/test/unit_tests.py @@ -24,6 +24,7 @@ class ModuleScope: """Ensures that the global Package module is restored when using private APIs to set and clear the active module """ + def __init__(self, module): self.module = module @@ -39,6 +40,7 @@ def __exit__(self,
exc_type, exc_val, exc_tb): class ContainerTypesTests(unittest.TestCase): + def test_valor(self) -> None: from accera import ScalarType from accera._lang_python import _MemoryLayout @@ -83,7 +85,8 @@ def test_scalar(self) -> None: self.assertEqual(s, val) # test scalar creation from value with no layout - for t in [ScalarType.int8, ScalarType.int16, ScalarType.int32, ScalarType.float32, ScalarType.float64]: + for t in [ScalarType.int8, ScalarType.int16, ScalarType.int32, ScalarType.float16, ScalarType.float32, + ScalarType.float64]: x = _Valor(t, _MemoryLayout()) s = Scalar(x) self.assertIsNotNone(s) @@ -101,6 +104,7 @@ def test_scalar_conditionals(self) -> None: class PackagingTypesTests(unittest.TestCase): + def test_compiler_options(self) -> None: from accera import CompilerOptions, _GetTargetDeviceFromName @@ -380,6 +384,7 @@ def main_test(arr): class ExecutionPlanTypesTests(unittest.TestCase): + def test_gpu_config(self) -> None: from accera._lang_python._lang import _GPU, _Dim3 gpu_config = _GPU(grid=_Dim3(x=8, y=16, z=1), block=_Dim3(16, 32, 2)) @@ -394,6 +399,7 @@ def test_gpu_config(self) -> None: class LogicTypesTests(unittest.TestCase): + def test_if_context(self) -> None: from accera._lang_python._lang import _If, Scalar @@ -434,6 +440,7 @@ def test_conditional_logic(self) -> None: i, j = nest.get_indices() def test_fn(): + def if_block(): A[i, j] = 42 @@ -447,6 +454,7 @@ def if_block(): class TargetsTest(unittest.TestCase): + def test_equivalence_check(self) -> None: from accera import Target t1 = Target() diff --git a/accera/python/gpu/src/__init__.py b/accera/python/gpu/src/__init__.py index 9a4a0ec6..9f5c98df 100644 --- a/accera/python/gpu/src/__init__.py +++ b/accera/python/gpu/src/__init__.py @@ -3,4 +3,8 @@ # Licensed under the MIT License. See LICENSE in the project root for license information. #################################################################################################### -from ._version import __version__ +try: + from ._version import __version__ +except: + # CMake-driven builds do not generate _version.py yet + __version__ = None diff --git a/accera/python/lib/src/ContainerTypes.cpp b/accera/python/lib/src/ContainerTypes.cpp index 7cbcac7f..5b2ffe6f 100644 --- a/accera/python/lib/src/ContainerTypes.cpp +++ b/accera/python/lib/src/ContainerTypes.cpp @@ -5,8 +5,8 @@ #include "AcceraTypes.h" -#include #include +#include namespace py = pybind11; namespace value = accera::value; @@ -30,6 +30,7 @@ namespace .value("int32", value::ValueType::Int32, "4 byte signed integer") .value("int64", value::ValueType::Int64, "8 byte signed integer") .value("index", value::ValueType::Index, "index type") + .value("float16", value::ValueType::Float16, "2 byte floating point") .value("float32", value::ValueType::Float, "4 byte floating point") .value("float64", value::ValueType::Double, "8 byte floating point"); @@ -103,6 +104,9 @@ General constructor. 
.def("__repr__", [](const util::MemoryLayout& layout) { return layout.ToString(); }) + .def("set_memory_space", [](util::MemoryLayout& layout, util::MemorySpace space) { + return layout.SetMemorySpace(space); + }) .def(py::self == py::self) .def_static("get_subarray_layout", [](const util::MemoryLayout& originalLayout, std::vector size) { return util::MemoryLayout( @@ -178,19 +182,19 @@ Constructs an instance from a 1D list reshaped into the given array shape "value"_a, "name"_a = "") .def(py::init([](const py::buffer buffer, const std::optional layout, const std::string& name) { - // constructor for np.float32 python buffers, a special case because python floats are 64-bit - py::buffer_info info = buffer.request(); - if (info.format != py::format_descriptor::format()) - { - throw std::runtime_error("Unsupported buffer format"); - } - assert(info.itemsize == sizeof(float)); - std::vector v(static_cast(info.ptr), static_cast(info.ptr) + info.size); - return value::Array(v, layout, name); - }), - "buffer"_a, - "memory_layout"_a, - "name"_a = "") + // constructor for np.float32 python buffers, a special case because python floats are 64-bit + py::buffer_info info = buffer.request(); + if (info.format != py::format_descriptor::format()) + { + throw std::runtime_error("Unsupported buffer format"); + } + assert(info.itemsize == sizeof(float)); + std::vector v(static_cast(info.ptr), static_cast(info.ptr) + info.size); + return value::Array(v, layout, name); + }), + "buffer"_a, + "memory_layout"_a, + "name"_a = "") // .ADD_CTOR(bool) // BUG: bool requires std::vector nonsense .ADD_CTOR(int8_t) .ADD_CTOR(int16_t) @@ -343,13 +347,13 @@ specific to the EmitterContext, specified by the Emittable type. }) .def("__floordiv__", [](value::Scalar& a, value::Scalar& b) { return (a.GetType() == value::ValueType::Float || a.GetType() == value::ValueType::Double) ? 
- // Floor is limited to floating point types - value::Floor(value::Divide(a, b)) : value::Divide(a, b); + // Floor is limited to floating point types + value::Floor(value::Divide(a, b)) : value::Divide(a, b); }) .def("__and__", &value::BitwiseAnd) .def("__or__", &value::BitwiseOr) .def("__invert__", &value::BitwiseNot) - .def("__xor__",&value::BitwiseXOr) + .def("__xor__", &value::BitwiseXOr) .def("__pow__", &value::Pow) .def("copy", &value::Scalar::Copy) .def_property("name", &value::Scalar::GetName, &value::Scalar::SetName) diff --git a/accera/python/lib/src/ExecutionPlanTypes.cpp b/accera/python/lib/src/ExecutionPlanTypes.cpp index e1cae49a..84b62651 100644 --- a/accera/python/lib/src/ExecutionPlanTypes.cpp +++ b/accera/python/lib/src/ExecutionPlanTypes.cpp @@ -5,6 +5,7 @@ #include "AcceraTypes.h" +#include #include #include @@ -37,7 +38,7 @@ namespace .value("NONE", value::MemorySpace::None) .value("GLOBAL", value::MemorySpace::Global) .value("SHARED", value::MemorySpace::Shared) - .value("LOCAL", value::MemorySpace::Local); + .value("PRIVATE", value::MemorySpace::Private); py::enum_(module, "Processor", "An enumeration of processors for loop index mapping") .value("BLOCK_X", ir::value::Processor::BlockX) @@ -54,10 +55,17 @@ namespace .value("DYNAMIC", value::ParallelizationPolicy::Dynamic); py::enum_(module, "_ExecutionRuntime", "Used for specifying the execution runtime of the module") - .value("DEFAULT", value::ExecutionRuntime::Default) - .value("VULKAN", value::ExecutionRuntime::Vulkan) - .value("ROCM", value::ExecutionRuntime::Rocm) - .value("CUDA", value::ExecutionRuntime::CUDA); + .value("DEFAULT", value::ExecutionRuntime::DEFAULT) + .value("VULKAN", value::ExecutionRuntime::VULKAN) + .value("ROCM", value::ExecutionRuntime::ROCM) + .value("CUDA", value::ExecutionRuntime::CUDA) + .value("OPENMP", value::ExecutionRuntime::OPENMP) + .value("NONE", value::ExecutionRuntime::NONE); + + py::enum_(module, "BarrierScope", "An enumeration of barrier scopes") + .value("BLOCK", value::GPU::BarrierScope::Block) + .value("WARP", value::GPU::BarrierScope::Warp) + .value("THREADFENCE", value::GPU::BarrierScope::Threadfence); } void DefineExecutionPlanStructs(py::module& module) @@ -106,36 +114,39 @@ namespace value::CacheAllocation allocation, value::MemorySpace memorySpace, const std::optional& memoryMap, - const std::optional& dimOrder) { + const std::optional& dimOrder, + bool thrifty, + bool doubleBuffer, + value::MemorySpace doubleBufferMemorySpace) { if (outermostIncludedSplitIndex.has_value()) { value::ScalarIndex resolvedTriggerIndex = triggerIndex.has_value() ? 
*triggerIndex : *outermostIncludedSplitIndex; if (memoryMap.has_value()) { - return plan.AddCache(target, *outermostIncludedSplitIndex, resolvedTriggerIndex, *memoryMap, indexing, allocation, memorySpace); + return plan.AddCache(target, *outermostIncludedSplitIndex, resolvedTriggerIndex, *memoryMap, thrifty, doubleBuffer, indexing, allocation, memorySpace, doubleBufferMemorySpace); } else if (dimOrder.has_value()) { - return plan.AddCache(target, *outermostIncludedSplitIndex, resolvedTriggerIndex, *dimOrder, indexing, allocation, memorySpace); + return plan.AddCache(target, *outermostIncludedSplitIndex, resolvedTriggerIndex, *dimOrder, thrifty, doubleBuffer, indexing, allocation, memorySpace, doubleBufferMemorySpace); } else { - return plan.AddCache(target, *outermostIncludedSplitIndex, indexing, allocation, memorySpace); + return plan.AddCache(target, *outermostIncludedSplitIndex, thrifty, doubleBuffer, indexing, allocation, memorySpace, doubleBufferMemorySpace); } } else { if (memoryMap.has_value()) { - return plan.AddCache(target, *maxElements, *memoryMap, indexing, allocation, memorySpace); + return plan.AddCache(target, *maxElements, *memoryMap, thrifty, doubleBuffer, indexing, allocation, memorySpace, doubleBufferMemorySpace); } else if (dimOrder.has_value()) { - return plan.AddCache(target, *maxElements, *dimOrder, indexing, allocation, memorySpace); + return plan.AddCache(target, *maxElements, *dimOrder, thrifty, doubleBuffer, indexing, allocation, memorySpace, doubleBufferMemorySpace); } else { - return plan.AddCache(target, *maxElements, indexing, allocation, memorySpace); + return plan.AddCache(target, *maxElements, thrifty, doubleBuffer, indexing, allocation, memorySpace, doubleBufferMemorySpace); } } }, @@ -147,7 +158,10 @@ namespace "allocation"_a, "location"_a, "memory_map"_a, - "dim_order"_a) + "dim_order"_a, + "thrifty"_a, + "double_buffer"_a, + "double_buffer_location"_a) .def("emit_runtime_init_packing", py::overload_cast(&value::Plan::EmitRuntimeInitPacking), "target"_a, "packing_func_name"_a, "packed_buf_size_func_name"_a, "indexing"_a = value::CacheIndexing::GlobalToPhysical) .def("pack_and_embed_buffer", py::overload_cast(&value::Plan::PackAndEmbedBuffer), "target"_a, "constant_data_buffer"_a, "wrapper_fn_name"_a, "packed_buffer_name"_a, "indexing"_a = value::CacheIndexing::GlobalToPhysical) .def("vectorize", &value::Plan::Vectorize, "i"_a, "vectorization_info"_a) @@ -159,7 +173,8 @@ namespace }), py::return_value_policy::move) .def( - "add_cache", [](value::GPUPlan& plan, + "add_cache", + [](value::GPUPlan& plan, const std::variant& target, const std::optional& outermostIncludedSplitIndex, const std::optional& triggerIndex, @@ -168,10 +183,25 @@ namespace value::CacheAllocation allocation, value::MemorySpace memorySpace, const std::optional& memoryMap, - const std::optional& dimOrder) { - value::ScalarIndex resolvedTriggerIndex = triggerIndex.has_value() ? *triggerIndex : *outermostIncludedSplitIndex; - return plan.AddCache(target, *outermostIncludedSplitIndex, resolvedTriggerIndex, *dimOrder, indexing, allocation, memorySpace); - //return outermostIncludedSplitIndex.has_value() ? plan.AddCache(target, *outermostIncludedSplitIndex, memorySpace) : plan.AddCache(target, *maxElements, memorySpace); + const std::optional& dimOrder, + bool thrifty, + bool doubleBuffer, + value::MemorySpace doubleBufferMemorySpace) { + value::ScalarIndex resolvedTriggerIndex = triggerIndex.has_value() ? 
*triggerIndex : *outermostIncludedSplitIndex; + if (outermostIncludedSplitIndex.has_value()) + { + return plan.AddCache(target, *outermostIncludedSplitIndex, resolvedTriggerIndex, *dimOrder, thrifty, doubleBuffer, indexing, allocation, memorySpace, doubleBufferMemorySpace); + } + else if (maxElements.has_value()) + { + // TODO : convert all GPUPlan::AddCache() impls to use manual caching rather than automatic, then plumb remaining arguments + return plan.AddCache(std::get(target), *maxElements, memorySpace); + } + else + { + // TODO : reach parity with GPUPlan::AddCache() and Plan::AddCache() functions + throw utilities::LogicException(utilities::LogicExceptionErrors::notImplemented); + } }, "target"_a, "index"_a, @@ -181,7 +211,10 @@ namespace "allocation"_a, "location"_a, "memory_map"_a, - "dim_order"_a) + "dim_order"_a, + "thrifty"_a, + "double_buffer"_a, + "double_buffer_location"_a) .def("tensorize", &value::GPUPlan::Tensorize, "indices"_a, "dims"_a) .def("map_index_to_processor", &value::GPUPlan::MapIndexToProcessor, "index"_a, "proc"_a); } diff --git a/accera/python/lib/src/Operations.cpp b/accera/python/lib/src/Operations.cpp index be9fd943..9370a972 100644 --- a/accera/python/lib/src/Operations.cpp +++ b/accera/python/lib/src/Operations.cpp @@ -119,7 +119,10 @@ void DefineOperations(py::module& module) [=](std::string pos) { return getFromGPUIndex(value::GPU::ThreadId(), pos); }) - .def("Barrier", &value::GPU::Barrier) - .def("MFMA", &value::MFMA); + .def( + "Barrier", [=](value::GPU::BarrierScope scope) { + return value::GPU::Barrier(scope); + }, + "scope"_a = value::GPU::BarrierScope::Block); } } // namespace accera::python::lang diff --git a/accera/python/lib/src/SchedulingTypes.cpp b/accera/python/lib/src/SchedulingTypes.cpp index 076087cc..49e06428 100644 --- a/accera/python/lib/src/SchedulingTypes.cpp +++ b/accera/python/lib/src/SchedulingTypes.cpp @@ -76,7 +76,7 @@ Fuse other schedules into this one, destroying the other ones. .def("pad", py::overload_cast(&value::Schedule::Pad), "i"_a, "size"_a, "pad_front"_a) .def("skew", py::overload_cast(&value::Schedule::Skew), "i"_a, "reference_index"_a) .def("create_plan", &value::Schedule::CreatePlan, "Creates a plan for the host") - .def("create_gpu_plan", &value::Schedule::CreateGPUPlan, "Creates a plan for the GPU", "gpu_options"_a, "runtime"_a = value::ExecutionRuntime::Default) + .def("create_gpu_plan", &value::Schedule::CreateGPUPlan, "Creates a plan for the GPU", "gpu_options"_a, "runtime"_a = value::ExecutionRuntime::DEFAULT) .def( "get_indices", [](value::Schedule& sched) { return sched.GetIndices(); }, "Returns the indices for this schedule, starting from the outermost index"); } diff --git a/accera/python/llvm/src/__init__.py b/accera/python/llvm/src/__init__.py index 9a4a0ec6..9f5c98df 100644 --- a/accera/python/llvm/src/__init__.py +++ b/accera/python/llvm/src/__init__.py @@ -3,4 +3,8 @@ # Licensed under the MIT License. See LICENSE in the project root for license information. 
#################################################################################################### -from ._version import __version__ +try: + from ._version import __version__ +except: + # CMake-driven builds do not generate _version.py yet + __version__ = None diff --git a/accera/transforms/CMakeLists.txt b/accera/transforms/CMakeLists.txt index 0fbe6851..56ae9f36 100644 --- a/accera/transforms/CMakeLists.txt +++ b/accera/transforms/CMakeLists.txt @@ -18,6 +18,7 @@ add_subdirectory(src) set(src src/AcceraPasses.cpp) set(rcvalue_src + src/value/BarrierOptPass.cpp src/value/FunctionPointerResolutionPass.cpp src/value/RangeValueOptimizePass.cpp src/value/ValueFuncToTargetPass.cpp @@ -27,6 +28,7 @@ set(rcvalue_src ) set(rcvalue_include + include/value/BarrierOptPass.h include/value/FunctionPointerResolutionPass.h include/value/RangeValueOptimizePass.h include/value/ValueFuncToTargetPass.h @@ -54,17 +56,11 @@ set(rcgpu_src src/gpu/ConvertLaunchFuncToVulkanCalls.cpp src/gpu/EmitVulkanWrappers.cpp src/gpu/SerializeToHSACO.cpp - - # Disabled - # src/gpu/AcceraToSPIRVPass.cpp ) set(rcgpu_include include/gpu/AcceraToGPUPass.h include/gpu/AcceraVulkanPasses.h - - # Disabled - # include/gpu/AcceraToSPIRVPass.h ) set(util_src diff --git a/accera/transforms/include/AcceraPasses.h b/accera/transforms/include/AcceraPasses.h index defc6879..e52bacfe 100644 --- a/accera/transforms/include/AcceraPasses.h +++ b/accera/transforms/include/AcceraPasses.h @@ -8,11 +8,11 @@ #include "exec/ExecutionPlanToAffineLoweringPass.h" #include "gpu/AcceraToGPUPass.h" -#include "gpu/AcceraToSPIRVPass.h" #include "gpu/AcceraVulkanPasses.h" #include "ir/include/value/ValueEnums.h" #include "nest/LoopNestPasses.h" #include "nest/LoopNestToValueFunc.h" +#include "value/BarrierOptPass.h" #include "value/FunctionPointerResolutionPass.h" #include "value/RangeValueOptimizePass.h" #include "value/ValueFuncToTargetPass.h" @@ -74,11 +74,13 @@ struct AcceraPassPipelineOptions : mlir::PassPipelineOptions enableAsync{ *this, "enable-async", llvm::cl::init(false) }; Option enableProfile{ *this, "enable-profiling", llvm::cl::init(false) }; diff --git a/accera/transforms/include/AcceraPasses.td b/accera/transforms/include/AcceraPasses.td index de1d1e69..b4b14d03 100644 --- a/accera/transforms/include/AcceraPasses.td +++ b/accera/transforms/include/AcceraPasses.td @@ -208,6 +208,20 @@ def ConvertRangeValueOptimize : Pass<"optimize-range-value"> { ]; } +//===----------------------------------------------------------------------===// +// BarrierOpt +//===----------------------------------------------------------------------===// + +def BarrierOpt : Pass<"optimize-barriers"> { + let summary = "Optimize Barrier ops"; + let constructor = "accera::transforms::value::createBarrierOptPass()"; + let dependentDialects = [ + "mlir::StandardOpsDialect", + "mlir::AffineDialect", + "mlir::memref::MemRefDialect" + ]; +} + //===----------------------------------------------------------------------===// // ValueToLLVM //===----------------------------------------------------------------------===// @@ -346,11 +360,14 @@ def FunctionPointerResolution : accModulePass<"resolve-function-pointers"> { def SerializeToHSACO : Pass<"serialize-to-hsaco", "::mlir::gpu::GPUModuleOp"> { let summary = "Serializes the GPU kernel to HSACO object (WIP)"; let constructor = "accera::transforms::createSerializeToHSACOPass()"; - let dependentDialects = ["mlir::ROCDL::ROCDLDialect", "mlir::gpu::GPUDialect"]; + let dependentDialects = [ + "mlir::gpu::GPUDialect", + 
"mlir::ROCDL::ROCDLDialect" + ]; let options = [ Option<"chip", "chip", "std::string", // TODO: Should this default to something else? - "\"gfx906\"", + "\"gfx908\"", "The GPU target architecture."> ]; } @@ -372,7 +389,15 @@ def ConvertAcceraToSPIRV : Pass<"convert-accera-to-spirv", "::mlir::ModuleOp"> { def ConvertAcceraToROCDL : Pass<"convert-accera-to-rocdl", "::mlir::ModuleOp"> { let summary = "Convert Accera dialects to ROCDL dialect"; let constructor = "accera::transforms::createAcceraToROCDLPass()"; - let dependentDialects = ["mlir::ROCDL::ROCDLDialect", "mlir::gpu::GPUDialect"]; + let dependentDialects = [ + "accera::ir::value::ValueDialect", + "mlir::StandardOpsDialect", + "mlir::AffineDialect", + "mlir::vector::VectorDialect", + "mlir::memref::MemRefDialect", + "mlir::gpu::GPUDialect", + "mlir::ROCDL::ROCDLDialect" + ]; } @@ -383,7 +408,15 @@ def ConvertAcceraToROCDL : Pass<"convert-accera-to-rocdl", "::mlir::ModuleOp"> { def ConvertAcceraToNVVM : Pass<"convert-accera-to-nvvm", "::mlir::ModuleOp"> { let summary = "Convert Accera dialects to NVVM dialect"; let constructor = "accera::transforms::createAcceraToNVVMPass()"; - let dependentDialects = ["mlir::NVVM::NVVMDialect", "mlir::gpu::GPUDialect"]; + let dependentDialects = [ + "accera::ir::value::ValueDialect", + "mlir::StandardOpsDialect", + "mlir::AffineDialect", + "mlir::vector::VectorDialect", + "mlir::memref::MemRefDialect", + "mlir::gpu::GPUDialect", + "mlir::NVVM::NVVMDialect" + ]; } #endif // ACCERA_CONVERSION_PASSES diff --git a/accera/transforms/include/exec/ExecutionPlanToAffineLoweringPass.h b/accera/transforms/include/exec/ExecutionPlanToAffineLoweringPass.h index 5a922495..1de77020 100644 --- a/accera/transforms/include/exec/ExecutionPlanToAffineLoweringPass.h +++ b/accera/transforms/include/exec/ExecutionPlanToAffineLoweringPass.h @@ -33,6 +33,9 @@ void populateExecutionPlanParallelizePatterns(mlir::OwningRewritePatternList& pa void populateExecutionPlanScaleHoistingPatterns(mlir::OwningRewritePatternList& patterns); void populateOutOfBoundsAccessHandlingPatterns(mlir::OwningRewritePatternList& patterns); void populateConvergeLoadStoresPatterns(mlir::OwningRewritePatternList& patterns); +void populateExecutionPlanThriftyCachePatterns(mlir::OwningRewritePatternList& patterns); +void populateExecutionPlanDelayedMappingPatterns(mlir::OwningRewritePatternList& patterns); +void populateExecutionPlanLoopUnswitchingPatterns(mlir::OwningRewritePatternList& patterns); std::unique_ptr createExecutionPlanMakeCachePass(); std::unique_ptr createExecutionPlanCopyReducePass(); diff --git a/accera/transforms/include/gpu/AcceraToGPUPass.h b/accera/transforms/include/gpu/AcceraToGPUPass.h index f7a5af9c..76a23728 100644 --- a/accera/transforms/include/gpu/AcceraToGPUPass.h +++ b/accera/transforms/include/gpu/AcceraToGPUPass.h @@ -41,6 +41,8 @@ void populateAcceraToSPIRVPatterns( mlir::OwningRewritePatternList& patterns); std::unique_ptr> createAcceraToSPIRVPass(); +void populateGPUSimplificationPatterns(mlir::OwningRewritePatternList& patterns); + void populateAcceraToNVVMPatterns(mlir::OwningRewritePatternList& patterns); std::unique_ptr> createAcceraToNVVMPass(); diff --git a/accera/transforms/include/gpu/AcceraToSPIRVPass.h b/accera/transforms/include/gpu/AcceraToSPIRVPass.h deleted file mode 100644 index a234a204..00000000 --- a/accera/transforms/include/gpu/AcceraToSPIRVPass.h +++ /dev/null @@ -1,31 +0,0 @@ -//////////////////////////////////////////////////////////////////////////////////////////////////// -// Copyright (c) 
Microsoft Corporation. All rights reserved. -// Licensed under the MIT License. See LICENSE in the project root for license information. -// Authors: Abdul Dakkak, Kern Handa -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#pragma once - -#include - -namespace mlir -{ -class MLIRContext; -class ModuleOp; -class RewritePatternSet; -class Pass; -class PassManager; -class SPIRVTypeConverter; -using OwningRewritePatternList = RewritePatternSet; - -template -class OperationPass; - -} // namespace mlir - -namespace accera::transforms -{ -void populateAcceraToSPIRVPatterns(mlir::SPIRVTypeConverter& typeConverter, mlir::MLIRContext* context, mlir::OwningRewritePatternList& patterns); -std::unique_ptr> createAcceraToSPIRVPass(); - -} // namespace accera::transforms diff --git a/accera/transforms/include/value/BarrierOptPass.h b/accera/transforms/include/value/BarrierOptPass.h new file mode 100644 index 00000000..7746860f --- /dev/null +++ b/accera/transforms/include/value/BarrierOptPass.h @@ -0,0 +1,19 @@ +//////////////////////////////////////////////////////////////////////////////////////////////////// +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. See LICENSE in the project root for license information. +//////////////////////////////////////////////////////////////////////////////////////////////////// + +#pragma once + +#include + +// fwd decls +namespace mlir +{ +class Pass; +} // namespace mlir + +namespace accera::transforms::value +{ +std::unique_ptr createBarrierOptPass(); +} // namespace accera::transforms::value diff --git a/accera/transforms/src/AcceraPasses.cpp b/accera/transforms/src/AcceraPasses.cpp index 016c516d..65ae4220 100644 --- a/accera/transforms/src/AcceraPasses.cpp +++ b/accera/transforms/src/AcceraPasses.cpp @@ -152,6 +152,7 @@ void addAcceraToLLVMPassPipeline(OpPassManager& pm, const AcceraPassPipelineOpti valueFuncOpPM.addPass(createCanonicalizerPass()); valueFuncOpPM.addPass(loopnest::createLoopNestToValueFuncPass({ { options.dumpIntraPassIR.getValue(), options.basename + "LoopNestToValueFuncPass_Subpasses" }, options.printLoops.getValue(), options.printVecOpDetails.getValue() })); + valueFuncOpPM.addPass(value::createBarrierOptPass()); pmAdaptor.addPass(value::createValueFuncToTargetPass()); pmAdaptor.addPass(createSymbolDCEPass()); @@ -163,14 +164,21 @@ void addAcceraToLLVMPassPipeline(OpPassManager& pm, const AcceraPassPipelineOpti funcOpPM.addPass(createLowerAffinePass()); funcOpPM.addPass(createConvertSCFToOpenMPPass()); + // Or perhaps we should put the barrier optimization here + pmAdaptor.addPass(value::createValueToStdPass(options.enableProfile)); + pmAdaptor.addPass(value::createRangeValueOptimizePass()); pmAdaptor.addPass(createCanonicalizerPass()); pmAdaptor.addPass(createCSEPass()); pmAdaptor.addPass(createGpuKernelOutliningPass()); - pmAdaptor.addPass(createAcceraToGPUPass(execRuntime)); + auto gpuPass = createAcceraToGPUPass(execRuntime); + if (gpuPass) + { + pmAdaptor.addPass(std::move(gpuPass)); + } - if (execRuntime == accera::value::ExecutionRuntime::Vulkan) + if (execRuntime == accera::value::ExecutionRuntime::VULKAN) { OpPassManager& spirvModulePM = pm.nest(); spirvModulePM.addPass(spirv::createLowerABIAttributesPass()); @@ -185,13 +193,12 @@ void addAcceraToLLVMPassPipeline(OpPassManager& pm, const AcceraPassPipelineOpti gpuModulePM.addPass(createStripDebugInfoPass()); if (options.gpuOnly) return; } - 
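    // Past this point the pipeline finishes lowering toward LLVM: canonicalization,
    // vector-to-SCF conversion, and lowering of structured control flow to CFG, followed by
    // the runtime-specific branches (ROCDL lowering for non-Vulkan GPU modules, Vulkan
    // launch-call conversion otherwise).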
pmAdaptor.addPass(createCanonicalizerPass()); funcOpPM.addPass(createConvertVectorToSCFPass( VectorTransferToSCFOptions{} /*.setLowerPermutationMaps(true) .setLowerTensors(true).setUnroll(true) */)); pmAdaptor.addPass(createLowerToCFGPass()); - if (execRuntime != accera::value::ExecutionRuntime::Vulkan) + if (execRuntime != accera::value::ExecutionRuntime::VULKAN) { PassManagerAdaptor gpuModulePM(pm.nest(), options.dumpPasses.getValue(), options.basename + "_rocm_module"); gpuModulePM.addPass(createLowerGpuOpsToROCDLOpsPass(kDeriveIndexBitwidthFromDataLayout)); @@ -212,7 +219,7 @@ void addAcceraToLLVMPassPipeline(OpPassManager& pm, const AcceraPassPipelineOpti pmAdaptor.addPass(LLVM::createLegalizeForExportPass()); pmAdaptor.addPass(value::createFunctionPointerResolutionPass()); - if (execRuntime == accera::value::ExecutionRuntime::Vulkan) + if (execRuntime == accera::value::ExecutionRuntime::VULKAN) { pmAdaptor.addPass(vulkan::createConvertVulkanLaunchFuncToVulkanCallsWithTimingPass({ false })); pmAdaptor.addPass(createGpuToLLVMConversionPass()); diff --git a/accera/transforms/src/exec/ExecutionPlanToAffineLoweringPass.cpp b/accera/transforms/src/exec/ExecutionPlanToAffineLoweringPass.cpp index dc9f4b98..ac3161d3 100644 --- a/accera/transforms/src/exec/ExecutionPlanToAffineLoweringPass.cpp +++ b/accera/transforms/src/exec/ExecutionPlanToAffineLoweringPass.cpp @@ -8,6 +8,7 @@ #include "util/VectorizationUtil.h" #include +#include #include #include #include @@ -64,9 +65,38 @@ using namespace accera::transforms; using namespace mlir; using namespace accera::utilities; +#define DEBUG_TYPE "execution-plan-to-affine-lowering" + namespace { -const mlir::StringRef BoundsCheckedAttrName = "rcxp_bounds_checked"; +// Here we prefer using std::string for these attr names so they can be used more flexibly +// in internal utilities as well as MLIR APIs as a mlir::StringRef. Note that mlir::StringRef +// has a constructor that takes a const std::string& for convenience + +const std::string BoundsCheckedAttrName = "accxp_bounds_checked"; +const std::string BaseArrayAccessMapAttrName = "accxp_base_array_access_map"; +const std::string BaseArrayAccessIndicesAttrName = "accxp_base_array_access_indices"; + +// These strings are used to create predictable index names for internally-generated GPU-related loops +// for the purposes of cache accesses. MakeCacheOps identify the loop indices to look for and combine those +// with a map to access the appropriate position in the cache; however, that mechanism does not currently +// distinguish between a general active block position and a specific GPU thread's responsibility region +// within that active block. +// E.g. 
suppose you're mapping from a shared memory cache to a private memory cache but instead of having +// different loop levels with different active blocks, you want the private memory cache to hold only the +// region that each thread is responsible for in the shared memory cache, so instead of being a new active +// block, it is a subset of an existing active block identified by thread indices +// TODO : come up with a better way of standardizing how a GPU thread maps from a shared active block +// to the subset of the active block that it is responsible for +const std::string ActionsPerThreadIndexName = "accxp_actions_per_thread_loop_index"; +const std::string ThreadVectorizationIndexName = "accxp_thread_vectorization_loop_index"; +const std::string ThreadXIndexName = "accxp_thread_x_loop_index"; +const std::string ThreadYIndexName = "accxp_thread_y_loop_index"; +const std::string ThreadZIndexName = "accxp_thread_z_loop_index"; + +// Attribute names used for partially unrolling loops +const std::string UnswitchPrefixItersName = "accxp_unswitch_prefix_iters"; +const std::string UnswitchSuffixItersName = "accxp_unswitch_suffix_iters"; struct MakeCacheOpLowering : public OpRewritePattern { @@ -89,6 +119,27 @@ struct ActiveElementCacheCopyOpRewrite : public OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; + + LogicalResult matchAndRewrite(MultiCacheCopyOp cacheCopyOp, PatternRewriter& rewriter) const final; +}; + +struct ThriftyCacheCopyOpRewrite : public OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; + + LogicalResult matchAndRewrite(ActiveBlockCacheCopyOp cacheCopyOp, PatternRewriter& rewriter) const final; +}; + +struct ThriftyCacheReduceOpRewrite : public OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; + + LogicalResult matchAndRewrite(ActiveBlockCacheReduceOp cacheReduceOp, PatternRewriter& rewriter) const final; +}; + struct MultiCacheCopyOpRewrite : public OpRewritePattern { using OpRewritePattern::OpRewritePattern; @@ -289,6 +340,20 @@ struct ConvertValueStoresToAffineRewrite : public OpRewritePattern LogicalResult matchAndRewrite(v::StoreOp storeOp, PatternRewriter& rewriter) const final; }; +struct DelayedMappingRegionOpRewrite : public OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; + + LogicalResult matchAndRewrite(DelayedMappingRegionOp mappingRegionOp, PatternRewriter& rewriter) const final; +}; + +struct LoopUnswitchingOpRewrite : public OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; + + LogicalResult matchAndRewrite(mlir::AffineForOp forOp, PatternRewriter& rewriter) const final; +}; + struct ExecutionPlanMakeCacheLoweringPass : public ConvertExecutionPlanMakeCacheBase { void runOnFunction() final; @@ -447,6 +512,13 @@ TensorizationInfo GetTensorizationInfo(Operation* op) return tensorizeInfoAttr.getValue(); } +void RemoveTensorizationInfo(Operation* op) +{ + OpBuilder builder(op); + auto tensorizationInfoIdentifier = builder.getIdentifier(TensorizationInfoAttr::getKeyName()); + op->removeAttr(tensorizationInfoIdentifier); +} + // Parallelization-related functions bool HasParallelizationInfo(Operation* op) @@ -623,10 +695,24 @@ LoopnestInfo ConstructCacheLoopnestInfo(Operation* baseOp, const std::vector& domainDimNames, + const std::vector& domainDimSizes) +{ + assert(domainDimNames.size() == domainDimSizes.size()); + std::vector indexRanges; + for (auto [domainDimName, domainDimSize] : llvm::zip(domainDimNames, domainDimSizes)) + { + accera::ir::loopnest::Range dimRange(0, domainDimSize); + 
indexRanges.emplace_back(domainDimName, dimRange); + } + return { indexRanges }; +} + std::tuple CreateCacheLoopnestHelper( OpBuilder& builder, Location loc, const LoopnestInfo& loopnestInfo, + const std::vector& activeBlockDimNames, const std::optional& vectorizationInfoOpt, int elementByteWidth, const v::ExecutionTarget& execTarget, @@ -636,7 +722,17 @@ std::tuple CreateCacheLoopnestHelper( // TODO : make this more like a loopnest that the DSL could create // this currently requires all the split indices as separate values, // which requires the schedule to be set before the kernel is created - auto cacheNest = MakeNest(builder, loopnestInfo.baseIterationShape); + NestOp cacheNest; + if (!activeBlockDimNames.empty()) + { + // If we have dim names, then make the nest via a custom IterationDomain + auto iterationDomain = CreateLoopNestIterationDomain(activeBlockDimNames, loopnestInfo.baseIterationShape); + cacheNest = MakeNest(builder, iterationDomain); + } + else + { + cacheNest = MakeNest(builder, loopnestInfo.baseIterationShape); + } auto cacheNestBodyBuilder = cacheNest.getBodyBuilder(); auto cacheNestSchedule = cacheNest.getOrCreateSchedule(); @@ -735,6 +831,7 @@ std::tuple CreateActiveBlockCacheLoopnest( mlir::OpBuilder& builder, Location loc, const std::vector& activeBlockShape, + const std::vector& activeBlockDimNames, const std::optional& vectorizationInfoOpt, int elementByteWidth, const v::ExecutionTarget& execTarget, @@ -756,7 +853,7 @@ std::tuple CreateActiveBlockCacheLoopnest( }); std::string fullKernelSuffix = "active_block_" + kernelSuffix; - return CreateCacheLoopnestHelper(builder, loc, loopnestInfo, vectorizationInfoOpt, elementByteWidth, execTarget, fullKernelSuffix, kernelFn); + return CreateCacheLoopnestHelper(builder, loc, loopnestInfo, activeBlockDimNames, vectorizationInfoOpt, elementByteWidth, execTarget, fullKernelSuffix, kernelFn); } template @@ -793,7 +890,7 @@ std::tuple CreateActiveElementCacheLoopnest(OpBu LoopnestInfo loopnestInfo = ConstructCacheLoopnestInfo(cacheOp, cacheRegionIndexRanges, cacheRegionBaseIndices); std::string fullKernelSuffix = "active_element_" + kernelSuffix; - return CreateCacheLoopnestHelper(builder, loc, loopnestInfo, vectorizationInfoOpt, elementByteWidth, execTarget, fullKernelSuffix, kernelFn); + return CreateCacheLoopnestHelper(builder, loc, loopnestInfo, {}, vectorizationInfoOpt, elementByteWidth, execTarget, fullKernelSuffix, kernelFn); } // Contains shape and access information for an active block of an array @@ -1316,6 +1413,39 @@ bool ShouldMergeMultiCacheInfos(const MultiCacheInfo& lhs, const MultiCacheInfo& return !keepSeparateCacheBuffers; } +mlir::Value GetOriginalIV(mlir::Value possiblyOffsetIV) +{ + // Requires that possiblyOffsetIV is constructed from a single IV and constants + if (possiblyOffsetIV.isa()) + { + return possiblyOffsetIV; + } + else + { + auto definingOp = possiblyOffsetIV.getDefiningOp(); + assert(definingOp != nullptr); + if (auto affineApplyOp = mlir::dyn_cast(definingOp)) + { + for (auto operand : affineApplyOp.getOperands()) + { + if (auto originalIV = GetOriginalIV(operand)) + { + return originalIV; + } + } + return nullptr; + } + else if (auto constantOp = mlir::dyn_cast(definingOp)) + { + return nullptr; + } + else + { + assert(false && "Offset IVs must be offset with AffineApplyOps and constants"); + } + } +} + mlir::AffineMap ComputeLoopIVToDefinitionOrderMap(const std::vector& ivs, mlir::MLIRContext* context) { // This is currently limited to nested AffineForOp induction variables for 
simplicity @@ -1336,8 +1466,10 @@ mlir::AffineMap ComputeLoopIVToDefinitionOrderMap(const std::vector // current is defined in a higher loop level than other const auto& currentIV = ivs[currentIdx]; const auto& otherIV = ivs[otherIdx]; - auto currentDefiningOp = mlir::getForInductionVarOwner(currentIV); - auto otherDefiningOp = mlir::getForInductionVarOwner(otherIV); + auto currentOriginalIV = GetOriginalIV(currentIV); + auto otherOriginalIV = GetOriginalIV(otherIV); + auto currentDefiningOp = mlir::getForInductionVarOwner(currentOriginalIV); + auto otherDefiningOp = mlir::getForInductionVarOwner(otherOriginalIV); assert(currentDefiningOp != nullptr); assert(otherDefiningOp != nullptr); bool currentIsAncestor = currentDefiningOp->isAncestor(otherDefiningOp); @@ -1388,11 +1520,15 @@ mlir::AffineMap ComputeLoopIVDefinitionOrderToCurrentOrderMap(const std::vector< } // Create an AffineLoadOp that understands how to access caches -mlir::AffineLoadOp CreateLoad(mlir::OpBuilder& builder, mlir::Location loc, mlir::Value src, const std::vector& baseArrayPosition) +mlir::AffineLoadOp CreateLoad(mlir::OpBuilder& builder, + mlir::Location loc, + mlir::Value src, + const std::vector& baseArrayPosition, + const std::vector>& unrealizedLoopNestIndices = {}) { if (auto srcCacheOp = mlir::dyn_cast_or_null(src.getDefiningOp())) { - mlir::AffineValueMap loadAccessInfo = srcCacheOp.insertCachePosition(builder.getInsertionBlock(), baseArrayPosition); + mlir::AffineValueMap loadAccessInfo = srcCacheOp.insertCachePosition(builder.getInsertionBlock(), baseArrayPosition, unrealizedLoopNestIndices); return builder.create(loc, src, loadAccessInfo.getAffineMap(), loadAccessInfo.getOperands()); } else @@ -1402,11 +1538,16 @@ mlir::AffineLoadOp CreateLoad(mlir::OpBuilder& builder, mlir::Location loc, mlir } // Create an AffineStoreOp that understands how to access caches -mlir::AffineStoreOp CreateStore(mlir::OpBuilder& builder, mlir::Location loc, mlir::Value value, mlir::Value dst, const std::vector& baseArrayPosition) +mlir::AffineStoreOp CreateStore(mlir::OpBuilder& builder, + mlir::Location loc, + mlir::Value value, + mlir::Value dst, + const std::vector& baseArrayPosition, + const std::vector>& unrealizedLoopNestIndices = {}) { if (auto dstCacheOp = mlir::dyn_cast_or_null(dst.getDefiningOp())) { - mlir::AffineValueMap storeAccessInfo = dstCacheOp.insertCachePosition(builder.getInsertionBlock(), baseArrayPosition); + mlir::AffineValueMap storeAccessInfo = dstCacheOp.insertCachePosition(builder.getInsertionBlock(), baseArrayPosition, unrealizedLoopNestIndices); return builder.create(loc, value, dst, storeAccessInfo.getAffineMap(), storeAccessInfo.getOperands()); } else @@ -1415,22 +1556,63 @@ mlir::AffineStoreOp CreateStore(mlir::OpBuilder& builder, mlir::Location loc, ml } } +bool HasBaseArrayAccessAttrs(mlir::Operation* op) +{ + return op->hasAttr(BaseArrayAccessMapAttrName) && op->hasAttr(BaseArrayAccessIndicesAttrName); +} + +void SetBaseArrayAccessAttrs(mlir::Operation* op, mlir::AffineMap accessMap, const std::vector& indices) +{ + auto indexArrayAttr = util::VectorToArrayAttr(indices, op->getContext()); + op->setAttr(BaseArrayAccessMapAttrName, mlir::AffineMapAttr::get(accessMap)); + op->setAttr(BaseArrayAccessIndicesAttrName, indexArrayAttr); +} + +void CopyBaseArrayAccessAttrs(mlir::Operation* from, mlir::Operation* to) +{ + assert(HasBaseArrayAccessAttrs(from)); + to->setAttr(BaseArrayAccessMapAttrName, from->getAttr(BaseArrayAccessMapAttrName)); + to->setAttr(BaseArrayAccessIndicesAttrName, 
from->getAttr(BaseArrayAccessIndicesAttrName)); +} + +mlir::AffineValueMap GetBaseArrayAccessAffineValueMap(mlir::Operation* op) +{ + assert(HasBaseArrayAccessAttrs(op)); + auto affineMapAttr = op->getAttrOfType(BaseArrayAccessMapAttrName); + auto indexArrayAttr = op->getAttrOfType(BaseArrayAccessIndicesAttrName); + auto affineMap = affineMapAttr.getValue(); + auto indices = util::ConvertArrayAttrToIndexVector(indexArrayAttr); + auto indexValues = util::GetCurrentIndexIVs(indices, op); + + return mlir::AffineValueMap(affineMap, indexValues); +} + // Get the base array position for an AffineLoadOp that understands how to access caches template std::vector GetBaseArrayPosition(mlir::OpBuilder& builder, mlir::Location loc, LoadStoreOp loadStoreOp) { - auto memref = loadStoreOp.memref(); - typename LoadStoreOp::Adaptor adaptor{ loadStoreOp }; - if (auto cache = mlir::dyn_cast_or_null(memref.getDefiningOp())) + if (HasBaseArrayAccessAttrs(loadStoreOp)) { - return cache.getBaseArrayPosition(loadStoreOp); + mlir::AffineValueMap accessAffineValueMap = GetBaseArrayAccessAffineValueMap(loadStoreOp); + auto map = accessAffineValueMap.getAffineMap(); + auto operands = accessAffineValueMap.getOperands().vec(); + return util::MultiDimAffineApply(builder, loc, map, operands); } else { - auto accessMap = loadStoreOp.getAffineMapAttr().getValue(); - std::vector affineIndices(adaptor.indices().begin(), adaptor.indices().end()); - auto resolvedAccessIndices = util::MultiDimAffineApply(builder, loc, accessMap, affineIndices); - return resolvedAccessIndices; + auto memref = loadStoreOp.memref(); + typename LoadStoreOp::Adaptor adaptor{ loadStoreOp }; + if (auto cache = mlir::dyn_cast_or_null(memref.getDefiningOp())) + { + // Note : this doesn't always work after canonicalization has run and omitted some operands + return cache.getBaseArrayPosition(loadStoreOp); + } + else + { + auto accessMap = loadStoreOp.getAffineMapAttr().getValue(); + std::vector affineIndices(adaptor.indices().begin(), adaptor.indices().end()); + return util::MultiDimAffineApply(builder, loc, accessMap, affineIndices); + } } } @@ -1449,12 +1631,176 @@ bool IsCacheRegionEmpty(BeginCacheRegion beginRegion) return emptyRegion; } +template +mlir::AffineMap GetLoadStoreAccessMap(LoadStoreOp op) +{ + return op.getAffineMapAttr().getValue(); +} + +template +std::vector GetLoadStoreAccessIndexAttrs(LoadStoreOp op) +{ + std::vector indexValues(op.indices().begin(), op.indices().end()); + std::vector indexAttrs; + for (auto indexValue : indexValues) + { + mlir::AffineForOp forOp = mlir::getForInductionVarOwner(indexValue); + assert(forOp != nullptr); + assert(forOp->hasAttrOfType("index")); + auto indexAttr = forOp->getAttrOfType("index"); + indexAttrs.push_back(indexAttr); + } + return indexAttrs; +} + +template +void TransferOrSetAccessAttrs(LoadStoreOp from, LoadStoreOp to) +{ + if (HasBaseArrayAccessAttrs(from)) + { + CopyBaseArrayAccessAttrs(from, to); + } + else + { + auto accessMap = GetLoadStoreAccessMap(from); + auto accessIndexAttrs = GetLoadStoreAccessIndexAttrs(from); + SetBaseArrayAccessAttrs(to, accessMap, accessIndexAttrs); + } +} + +struct MultiCacheLoopInfo +{ + std::vector multiCacheLoops; + std::vector multiCacheIVs; + std::vector multiCacheIterCounters; + std::vector activeBlockExternalSymbols; + std::vector multiCacheShape; + std::vector multiCacheStepSizes; +}; + +MultiCacheLoopInfo CreateMultiCacheLoops(mlir::OpBuilder& builder, MultiCacheCopyOp copyOp, const std::function& fn) +{ + mlir::OpBuilder::InsertionGuard 
guard(builder); + MultiCacheCopyOp::Adaptor adaptor{ copyOp }; + MultiCacheLoopInfo result; + + auto loc = copyOp.getLoc(); + + std::optional execTargetOpt = util::ResolveExecutionTarget(copyOp); + auto execTarget = *execTargetOpt; + if (execTarget == v::ExecutionTarget::GPU) + { + mlir::OpBuilder::InsertionGuard guard(builder); + builder.setInsertionPoint(copyOp); + (void)util::CreateGPUControlBarrier(builder, "Block", loc); + builder.setInsertionPointAfter(copyOp); + (void)util::CreateGPUControlBarrier(builder, "Block", loc); + } + + auto multiCacheLBMapsArrayAttr = adaptor.multiCacheLoopLowerBoundMaps(); + auto multiCacheUBMapsArrayAttr = adaptor.multiCacheLoopUpperBoundMaps(); + auto multiCacheStepsArrayAttr = adaptor.multiCacheLoopStepSizes(); + auto multiCacheLBMaps = util::ArrayAttrToVector(multiCacheLBMapsArrayAttr, [](const mlir::AffineMapAttr& mapAttr) -> mlir::AffineMap { + return mapAttr.getValue(); + }); + auto multiCacheUBMaps = util::ArrayAttrToVector(multiCacheUBMapsArrayAttr, [](const mlir::AffineMapAttr& mapAttr) -> mlir::AffineMap { + return mapAttr.getValue(); + }); + result.multiCacheStepSizes = util::ConvertArrayAttrToIntVector(multiCacheStepsArrayAttr); + + std::vector multiCacheIndexIds = util::ConvertArrayAttrToIndexVector(adaptor.multiCacheLoopIndexIds()); + + assert(multiCacheLBMaps.size() == multiCacheUBMaps.size()); + assert(multiCacheLBMaps.size() == result.multiCacheStepSizes.size()); + assert(multiCacheLBMaps.size() == multiCacheIndexIds.size()); + auto multiCacheLoopCount = multiCacheLBMaps.size(); + + // Construct the multiCache loops + // Are we able to replace these with loopnests? we don't have a way to construct loopnests with affine map lower/upper bounds currently + mlir::OpBuilder currentBuilder = builder; + mlir::ValueRange emptyOperands; + for (unsigned multiCacheDim = 0; multiCacheDim < multiCacheLoopCount; ++multiCacheDim) + { + auto forOp = mlir::createCanonicalizedAffineForOp(currentBuilder, loc, emptyOperands, multiCacheLBMaps[multiCacheDim], emptyOperands, multiCacheUBMaps[multiCacheDim], result.multiCacheStepSizes[multiCacheDim]); + forOp->setAttr("index", IndexAttr::get(multiCacheIndexIds[multiCacheDim], currentBuilder.getContext())); + currentBuilder = mlir::OpBuilder::atBlockTerminator(forOp.getBody()); + mlir::Value iterCounter = util::CreateConstantRangeForOpIterationCounter(currentBuilder, loc, forOp); + result.multiCacheIterCounters.push_back(iterCounter); + + result.multiCacheIVs.push_back(forOp.getInductionVar()); + + auto constantTripCountOpt = mlir::getConstantTripCount(forOp); + assert(constantTripCountOpt.hasValue() && "AffineForOps in Accera loop nests must have constant trip counts"); + result.multiCacheShape.push_back(constantTripCountOpt.getValue()); + + result.multiCacheLoops.push_back(forOp); + } + + // Now that we have the multiCache IVs we can permute the multiCache external symbols and these IVs to make the full external symbols for the ActiveBlockCacheCopyOp + auto externalSymbolsPermutationMap = copyOp.externalSymbolsPermutationMap(); + auto multiCacheExternalSymbolsValueRange = adaptor.multiCacheExternalSymbols(); + std::vector unpermutedExternalSymbols(multiCacheExternalSymbolsValueRange.begin(), multiCacheExternalSymbolsValueRange.end()); + unpermutedExternalSymbols.insert(unpermutedExternalSymbols.end(), result.multiCacheIVs.begin(), result.multiCacheIVs.end()); + + // Permute the external symbols into their creation order + // as the externalSymbolsPermutationMap will map from their creation + // order to their 
expected order for the maps + + if (!unpermutedExternalSymbols.empty()) + { + auto externalSymbolsToDefOrderMap = ComputeLoopIVToDefinitionOrderMap(unpermutedExternalSymbols, currentBuilder.getContext()); + std::vector activeBlockExternalSymbolDefinitionOrdered = util::MultiDimAffineApply(currentBuilder, loc, externalSymbolsToDefOrderMap, unpermutedExternalSymbols); + result.activeBlockExternalSymbols = util::MultiDimAffineApply(currentBuilder, loc, externalSymbolsPermutationMap, activeBlockExternalSymbolDefinitionOrdered); + } + + fn(currentBuilder, result); + + return result; +} + +bool SameMemorySpace(mlir::Value left, mlir::Value right) +{ + auto leftType = left.getType(); + assert(leftType.isa()); + auto leftMemRefType = leftType.cast(); + auto rightType = right.getType(); + assert(rightType.isa()); + auto rightMemRefType = rightType.cast(); + + return leftMemRefType.getMemorySpace() == rightMemRefType.getMemorySpace(); +} + +bool IsCacheOp(mlir::Operation* op) +{ + return mlir::isa(op); +} + } // namespace LogicalResult MakeCacheOpLowering::matchAndRewrite(MakeCacheOp makeCacheOp, PatternRewriter& rewriter) const { auto loc = makeCacheOp.getLoc(); + auto cacheArray = makeCacheOp.cache(); + + if (cacheArray.use_empty()) + { + // No uses of the cache array anymore, so just erase this cache and move on + rewriter.eraseOp(makeCacheOp); + return success(); + } + auto cacheBaseType = makeCacheOp.cache().getType(); assert(cacheBaseType.isa() && "Cache must be a memref"); auto cacheType = cacheBaseType.cast(); @@ -1491,7 +1837,7 @@ LogicalResult MakeCacheOpLowering::matchAndRewrite(MakeCacheOp makeCacheOp, Patt } else { - // Shared or Local + // Shared or Private cacheGlobalBuffer = rewriter.create(loc, cacheType, llvm::None); } @@ -1576,7 +1922,7 @@ LogicalResult ActiveElementCacheCopyOpRewrite::matchAndRewrite(ActiveElementCach } } - if (execTarget == v::ExecutionTarget::GPU && dstMemRefSpace != static_cast(v::MemorySpace::Local)) + if (execTarget == v::ExecutionTarget::GPU && dstMemRefSpace != static_cast(v::MemorySpace::Private)) { // TODO : should this be in a better place? This barrier is trying to prevent loops from getting // too far ahead of their counterparts and trying to fill a cache before every thread is @@ -1606,7 +1952,7 @@ LogicalResult ActiveElementCacheCopyOpRewrite::matchAndRewrite(ActiveElementCach copyScheduleOp.addLoopAttribute(loopIndex, rewriter.getIdentifier(AccessBoundsCheckAttrName), rewriter.getUnitAttr()); } - if (execTarget == v::ExecutionTarget::GPU && dstMemRefSpace != static_cast(v::MemorySpace::Local)) + if (execTarget == v::ExecutionTarget::GPU && dstMemRefSpace != static_cast(v::MemorySpace::Private)) { // Create thread mappings for the different levels of the copy loopnest // TODO : restructure the loopnest to ensure that there is always @@ -1620,9 +1966,8 @@ LogicalResult ActiveElementCacheCopyOpRewrite::matchAndRewrite(ActiveElementCach auto launchAttr = vLambdaOp->getAttrOfType(vLambdaOp.getGPULaunchAttrName()); assert(launchAttr != nullptr); - auto launchParams = util::ConvertArrayAttrToIntVector(launchAttr); - [[maybe_unused]] std::vector gridDimSizes = { launchParams[0], launchParams[1], launchParams[2] }; - std::vector blockDimSizes = { launchParams[3], launchParams[4], launchParams[5] }; + auto gpuParams = accera::ir::targets::GPU::FromArrayAttr(launchAttr); + std::vector blockDimSizes = { gpuParams.block.x, gpuParams.block.y, gpuParams.block.z }; // Assign thread dimensions if it's not a private memory cache. 
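        // The copy loopnest levels created above are mapped onto the ThreadX/ThreadY/ThreadZ
        // processors using the block dimensions taken from the GPU launch attribute, so that
        // each thread copies its own slice of the active block; private-memory caches skip
        // this mapping because each thread owns its buffer outright.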
auto threadXProcStr = v::stringifyEnum(v::Processor::ThreadX); @@ -1705,73 +2050,21 @@ LogicalResult MultiCacheCopyOpRewrite::matchAndRewrite(MultiCacheCopyOp multiCac return success(); } - // Construct the multiCache loops - // TODO : do we benefit from having this layer be an Accera loopnest? there are no splits so there are no boundary conditions to account for - auto multiCacheLBMapsArrayAttr = adaptor.multiCacheLoopLowerBoundMaps(); - auto multiCacheUBMapsArrayAttr = adaptor.multiCacheLoopUpperBoundMaps(); - auto multiCacheStepsArrayAttr = adaptor.multiCacheLoopStepSizes(); - auto multiCacheLBMaps = util::ArrayAttrToVector(multiCacheLBMapsArrayAttr, [](const mlir::AffineMapAttr& mapAttr) -> mlir::AffineMap { - return mapAttr.getValue(); - }); - auto multiCacheUBMaps = util::ArrayAttrToVector(multiCacheUBMapsArrayAttr, [](const mlir::AffineMapAttr& mapAttr) -> mlir::AffineMap { - return mapAttr.getValue(); + MultiCacheLoopInfo multiCacheInfo = CreateMultiCacheLoops(rewriter, multiCacheCopyOp, [&](mlir::OpBuilder& currentBuilder, const MultiCacheLoopInfo& info) { + currentBuilder.create(loc, + multiCacheCopyOp.array(), + multiCacheCopyOp.cache(), + info.activeBlockExternalSymbols, + info.activeBlockExternalSymbols, + info.multiCacheIterCounters, + multiCacheCopyOp.activeBlockLowerBoundMaps(), + multiCacheCopyOp.activeBlockUpperBoundMaps(), + multiCacheCopyOp.activeBlockToCacheMap(), + multiCacheCopyOp.toCache(), + multiCacheCopyOp.activeBlockTag(), + multiCacheCopyOp.thrifty(), + true); // skipBarriers : this copy will already be guarded by barriers at the multicache level, so skip creating them internally }); - auto multiCacheStepSizes = util::ConvertArrayAttrToIntVector(multiCacheStepsArrayAttr); - - std::vector multiCacheIndexIds = util::ConvertArrayAttrToIndexVector(adaptor.multiCacheLoopIndexIds()); - - assert(multiCacheLBMaps.size() == multiCacheUBMaps.size()); - assert(multiCacheLBMaps.size() == multiCacheStepSizes.size()); - assert(multiCacheLBMaps.size() == multiCacheIndexIds.size()); - auto multiCacheLoopCount = multiCacheLBMaps.size(); - - std::vector multiCacheIVs; - std::vector multiCacheIterCounters; - - // Are we able to replace these with loopnests? 
we don't have a way to construct loopnests with affine map lower/upper bounds currently - mlir::OpBuilder currentBuilder = rewriter; - mlir::ValueRange emptyOperands; - for (unsigned multiCacheDim = 0; multiCacheDim < multiCacheLoopCount; ++multiCacheDim) - { - auto forOp = mlir::createCanonicalizedAffineForOp(currentBuilder, loc, emptyOperands, multiCacheLBMaps[multiCacheDim], emptyOperands, multiCacheUBMaps[multiCacheDim], multiCacheStepSizes[multiCacheDim]); - forOp->setAttr("index", IndexAttr::get(multiCacheIndexIds[multiCacheDim], currentBuilder.getContext())); - currentBuilder = mlir::OpBuilder::atBlockTerminator(forOp.getBody()); - mlir::Value iterCounter = util::CreateConstantRangeForOpIterationCounter(currentBuilder, loc, forOp); - multiCacheIterCounters.push_back(iterCounter); - - multiCacheIVs.push_back(forOp.getInductionVar()); - } - - // Now that we have the multiCache IVs we can permute the multiCache external symbols and these IVs to make the full external symbols for the ActiveBlockCacheCopyOp - auto externalSymbolsPermutationMap = multiCacheCopyOp.externalSymbolsPermutationMap(); - auto multiCacheExternalSymbolsValueRange = adaptor.multiCacheExternalSymbols(); - std::vector unpermutedExternalSymbols(multiCacheExternalSymbolsValueRange.begin(), multiCacheExternalSymbolsValueRange.end()); - unpermutedExternalSymbols.insert(unpermutedExternalSymbols.end(), multiCacheIVs.begin(), multiCacheIVs.end()); - - // Permute the external symbols into their creation order - // as the externalSymbolsPermutationMap will map from their creation - // order to their expected order for the maps - - std::vector activeBlockExternalSymbols; - if (!unpermutedExternalSymbols.empty()) - { - auto externalSymbolsToDefOrderMap = ComputeLoopIVToDefinitionOrderMap(unpermutedExternalSymbols, rewriter.getContext()); - std::vector activeBlockExternalSymbolDefinitionOrdered = util::MultiDimAffineApply(currentBuilder, loc, externalSymbolsToDefOrderMap, unpermutedExternalSymbols); - activeBlockExternalSymbols = util::MultiDimAffineApply(currentBuilder, loc, externalSymbolsPermutationMap, activeBlockExternalSymbolDefinitionOrdered); - } - - // TODO : rewrite this using slices/views (rather than plumbing the multicache access dims all the way into the individual loads/stores) once linalg.slice issues are worked out - - currentBuilder.create(loc, - multiCacheCopyOp.array(), - multiCacheCopyOp.cache(), - activeBlockExternalSymbols, - activeBlockExternalSymbols, - multiCacheIterCounters, - multiCacheCopyOp.activeBlockLowerBoundMaps(), - multiCacheCopyOp.activeBlockUpperBoundMaps(), - multiCacheCopyOp.activeBlockToCacheMap(), - true); // True because a MultiCacheCopyOp only copies into the cache rewriter.eraseOp(multiCacheCopyOp); @@ -1801,7 +2094,9 @@ LogicalResult ActiveBlockCacheCopyOpRewrite::matchAndRewrite(ActiveBlockCacheCop auto array = cacheCopyOp.array(); assert(array.getType().isa()); auto memRefType = array.getType().cast(); + unsigned outerArrayMemRefSpace = memRefType.getMemorySpaceAsInt(); [[maybe_unused]] auto baseArrayElementType = GetInnerElementType(array); // e.g. f32 + unsigned outerArrayRank = memRefType.getRank(); auto elementBitWidth = memRefType.getElementTypeBitWidth(); auto elementByteWidth = elementBitWidth / 8; @@ -1811,6 +2106,7 @@ LogicalResult ActiveBlockCacheCopyOpRewrite::matchAndRewrite(ActiveBlockCacheCop auto cacheMemRefType = cache.getType().cast(); unsigned cacheMemRefSpace = cacheMemRefType.getMemorySpaceAsInt(); auto baseCacheElementType = GetInnerElementType(cache); // e.g. 
f32 + unsigned fullCacheRank = cacheMemRefType.getRank(); assert(baseArrayElementType == baseCacheElementType && "Copy source and dest data types don't match"); @@ -1838,8 +2134,8 @@ LogicalResult ActiveBlockCacheCopyOpRewrite::matchAndRewrite(ActiveBlockCacheCop return ubMap.getNumInputs() == ubOperands.size(); })); - unsigned rank = memRefType.getRank(); assert(lbMaps.size() == ubMaps.size() && "mismatched number of lb and ub maps"); + unsigned activeBlockRank = lbMaps.size(); OpBuilder currentBuilder = rewriter; @@ -1854,6 +2150,10 @@ LogicalResult ActiveBlockCacheCopyOpRewrite::matchAndRewrite(ActiveBlockCacheCop if (execTarget == v::ExecutionTarget::GPU) { + if (!cacheCopyOp.skipBarriers()) + { + (void)util::CreateGPUControlBarrier(rewriter, "Block", loc); + } auto vLambdaOp = cacheCopyOp->getParentOfType(); // If we're inside a lambda then our ultimate exec target may be different // from the ValueFuncOp target. E.g. for GPU loopnests, the loopnest lambda @@ -1862,21 +2162,42 @@ LogicalResult ActiveBlockCacheCopyOpRewrite::matchAndRewrite(ActiveBlockCacheCop auto launchAttr = vLambdaOp->getAttrOfType(vLambdaOp.getGPULaunchAttrName()); assert(launchAttr != nullptr); - auto launchParams = util::ConvertArrayAttrToIntVector(launchAttr); - [[maybe_unused]] std::vector gridDimSizes = { launchParams[0], launchParams[1], launchParams[2] }; - std::vector blockDimSizes = { launchParams[3], launchParams[4], launchParams[5] }; + auto gpuParams = accera::ir::targets::GPU::FromArrayAttr(launchAttr); + std::vector blockDimSizes = { gpuParams.block.x, gpuParams.block.y, gpuParams.block.z }; + + int64_t totalLoadsPerThread = 0; + auto vectorSizePerThread = 1; // TODO: Plumb hardware supported vector size + auto activeBlockVolume = std::accumulate(activeBlockShape.begin(), activeBlockShape.end(), 1, std::multiplies()); + + // Use thread mappings any time one of the arrays we're indexing into is non-private + bool useThreadMappings = outerArrayMemRefSpace != static_cast(v::MemorySpace::Private) || + cacheMemRefSpace != static_cast(v::MemorySpace::Private); - if (cacheMemRefSpace == static_cast(v::MemorySpace::Local)) + if (useThreadMappings) + { + totalLoadsPerThread = activeBlockVolume / (blockDimSizes[0] * blockDimSizes[1] * blockDimSizes[2]); + } + else { + // If we're copying from private memory to private memory, then don't consider the block sizes as we won't + // have any threads to map relative to either of these buffers blockDimSizes = { 1, 1, 1 }; + totalLoadsPerThread = activeBlockVolume; } - auto totalLoadsPerThread = cacheMemRefType.getNumElements() / (blockDimSizes[0] * blockDimSizes[1] * blockDimSizes[2]); - auto vectorSizePerThread = 1; // TODO: Plumb hardware supported vector size - - auto loadsPerThread = totalLoadsPerThread / vectorSizePerThread; - - auto [copyNestOp, copyScheduleOp, copyExecPlanOp] = CreateActiveBlockCacheLoopnest(rewriter, loc, { loadsPerThread, blockDimSizes[2], blockDimSizes[1], blockDimSizes[0], vectorSizePerThread }, std::nullopt, elementByteWidth, execTarget, "copy", [&](OpBuilder& currentBuilder, const std::vector& orderedSymbolicIndexOpValues) { + auto loadsPerThread = std::max((int64_t)1, (int64_t)(totalLoadsPerThread / vectorSizePerThread)); + + std::vector activeBlockIterationShape{ loadsPerThread, + blockDimSizes[2], + blockDimSizes[1], + blockDimSizes[0], + vectorSizePerThread }; + std::vector activeBlockDimNames{ ActionsPerThreadIndexName, + ThreadZIndexName, + ThreadYIndexName, + ThreadXIndexName, + ThreadVectorizationIndexName }; + auto [copyNestOp, 
copyScheduleOp, copyExecPlanOp] = CreateActiveBlockCacheLoopnest(rewriter, loc, activeBlockIterationShape, activeBlockDimNames, std::nullopt, elementByteWidth, execTarget, "copy", [&](OpBuilder& currentBuilder, const std::vector& orderedSymbolicIndexOpValues) { // The induction variables have been shifted to represent the constant iteration space // however, the maps expect they are constructed based on the original mappings so we // need to offset each IV by its lower bound map applied to its lower bound operands @@ -1902,13 +2223,12 @@ LogicalResult ActiveBlockCacheCopyOpRewrite::matchAndRewrite(ActiveBlockCacheCop mlir::AffineMap cacheFillNestMap = mlir::AffineMap::get(5, 0, cacheFillNestToFlatExpr); // TODO: Handle arbitrary memory order input - auto numDims = memRefType.getRank(); auto cumulativeStride = 1; std::vector flatToActiveExprs; - for (int dim_counter = 0; dim_counter < numDims; ++dim_counter) + for (int dim_counter = 0; dim_counter < activeBlockRank; ++dim_counter) { - auto curDimSize = activeBlockShape[(numDims - 1) - dim_counter]; + auto curDimSize = activeBlockShape[(activeBlockRank - 1) - dim_counter]; mlir::AffineExpr flatToActiveBlockExpr = ((currentBuilder.getAffineDimExpr(0).floorDiv(cumulativeStride)) % curDimSize); flatToActiveExprs.push_back(flatToActiveBlockExpr); cumulativeStride *= curDimSize; @@ -1936,19 +2256,28 @@ LogicalResult ActiveBlockCacheCopyOpRewrite::matchAndRewrite(ActiveBlockCacheCop lowerBoundOffsetIVs.push_back(lbOffsetIV); } + // Get the pairs of loopnest Index objects and their corresponding mlir::Values to use to access the + // caches if needed + std::vector> unrealizedLoopNestIndices; + for (auto& loopnestIV : orderedSymbolicIndexOpValues) + { + auto indexOp = mlir::dyn_cast(loopnestIV.getDefiningOp()); + auto index = indexOp.index().getValue(); + unrealizedLoopNestIndices.emplace_back(index, loopnestIV); + } if (arrayToCache) { - mlir::Value loadedValue = CreateLoad(currentBuilder, loc, array, lowerBoundOffsetIVs); - CreateStore(currentBuilder, loc, loadedValue, cache, lowerBoundOffsetIVs); + mlir::Value loadedValue = CreateLoad(currentBuilder, loc, array, lowerBoundOffsetIVs, unrealizedLoopNestIndices); + CreateStore(currentBuilder, loc, loadedValue, cache, lowerBoundOffsetIVs, unrealizedLoopNestIndices); } else { - mlir::Value loadedValue = CreateLoad(currentBuilder, loc, cache, lowerBoundOffsetIVs); - CreateStore(currentBuilder, loc, loadedValue, array, lowerBoundOffsetIVs); + mlir::Value loadedValue = CreateLoad(currentBuilder, loc, cache, lowerBoundOffsetIVs, unrealizedLoopNestIndices); + CreateStore(currentBuilder, loc, loadedValue, array, lowerBoundOffsetIVs, unrealizedLoopNestIndices); } }); - if (cacheMemRefSpace != static_cast(v::MemorySpace::Local)) + if (useThreadMappings) { auto threadZProcStr = v::stringifyEnum(v::Processor::ThreadZ); auto threadYProcStr = v::stringifyEnum(v::Processor::ThreadY); @@ -1969,7 +2298,10 @@ LogicalResult ActiveBlockCacheCopyOpRewrite::matchAndRewrite(ActiveBlockCacheCop auto procMap = rewriter.getDictionaryAttr({ mappings }); copyExecPlanOp->setAttr(procMapAttrName, procMap); - (void)util::CreateGPUControlBarrier(rewriter, "Block", loc); + if (!cacheCopyOp.skipBarriers()) + { + (void)util::CreateGPUControlBarrier(rewriter, "Block", loc); + } } } else @@ -1981,7 +2313,7 @@ LogicalResult ActiveBlockCacheCopyOpRewrite::matchAndRewrite(ActiveBlockCacheCop vecInfo = GetVectorizationInfo(vLambdaOp); } - auto [copyNestOp, copyScheduleOp, copyExecPlanOp] = CreateActiveBlockCacheLoopnest(rewriter, loc, 
activeBlockShape, vecInfo, elementByteWidth, execTarget, "copy", [&](OpBuilder& currentBuilder, const std::vector& orderedSymbolicIndexOpValues) { + auto [copyNestOp, copyScheduleOp, copyExecPlanOp] = CreateActiveBlockCacheLoopnest(rewriter, loc, activeBlockShape, {}, vecInfo, elementByteWidth, execTarget, "copy", [&](OpBuilder& currentBuilder, const std::vector& orderedSymbolicIndexOpValues) { // The induction variables have been shifted to represent the constant iteration space // however, the maps expect they are constructed based on the original mappings so we // need to offset each IV by its lower bound map applied to its lower bound operands @@ -2029,7 +2361,7 @@ LogicalResult ActiveBlockCacheCopyOpRewrite::matchAndRewrite(ActiveBlockCacheCop std::vector copyIVs; // Are we able to replace these with loopnests? we don't have a way to construct loopnests with affine map lower/upper bounds currently - for (unsigned arrayDim = 0; arrayDim < rank; ++arrayDim) + for (unsigned arrayDim = 0; arrayDim < outerArrayRank; ++arrayDim) { auto forOp = mlir::createCanonicalizedAffineForOp(currentBuilder, loc, lbOperands, lbMaps[arrayDim], ubOperands, ubMaps[arrayDim]); currentBuilder = mlir::OpBuilder::atBlockTerminator(forOp.getBody()); @@ -2136,7 +2468,7 @@ LogicalResult ActiveBlockCacheReduceOpRewrite::matchAndRewrite(ActiveBlockCacheR vecInfo = GetVectorizationInfo(vLambdaOp); } - auto [reduceNestOp, reduceScheduleOp, reduceExecPlanOp] = CreateActiveBlockCacheLoopnest(rewriter, loc, activeBlockShape, vecInfo, elementByteWidth, execTarget, "reduce", [&](OpBuilder& currentBuilder, const std::vector& orderedSymbolicIndexOpValues) { + auto [reduceNestOp, reduceScheduleOp, reduceExecPlanOp] = CreateActiveBlockCacheLoopnest(rewriter, loc, activeBlockShape, {}, vecInfo, elementByteWidth, execTarget, "reduce", [&](OpBuilder& currentBuilder, const std::vector& orderedSymbolicIndexOpValues) { // The induction variables have been shifted to represent the constant iteration space // however, the maps expect they are constructed based on the original mappings so we // need to offset each IV by its lower bound map applied to its lower bound operands @@ -2192,82 +2524,843 @@ LogicalResult ActiveBlockCacheReduceOpRewrite::matchAndRewrite(ActiveBlockCacheR auto accumulatedValue = currentBuilder.create(loc, BinaryOpPredicate::ADD, currentArrayValue, scaledCacheValue); CreateStore(currentBuilder, loc, accumulatedValue, array, IVs); } - rewriter.eraseOp(cacheReduceOp); + rewriter.eraseOp(cacheReduceOp); + + return success(); +} + +LogicalResult ActiveElementCacheReduceOpRewrite::matchAndRewrite(ActiveElementCacheReduceOp cacheReduceOp, PatternRewriter& rewriter) const +{ + // Reduce data from the source cache buffer to the destination buffer by iterating over the cache region shape + // and mapping from cache region indices to the source cache buffer and destination buffer + + auto loc = cacheReduceOp.getLoc(); + + ActiveElementCacheReduceOp::Adaptor adaptor{ cacheReduceOp }; + + [[maybe_unused]] auto dst = cacheReduceOp.dst(); + assert(dst.getType().isa()); + auto baseOutputMemRefType = dst.getType().cast(); + auto baseOutputShape = baseOutputMemRefType.getShape(); + auto baseOutputElementType = GetInnerElementType(dst); + + auto elementBitWidth = baseOutputMemRefType.getElementTypeBitWidth(); + auto elementByteWidth = elementBitWidth / 8; + + auto cache = cacheReduceOp.srcCache(); + auto cacheMemRefType = cache.getType().cast(); + auto cacheElementType = cacheMemRefType.getElementType(); // either something like 
vector< n x f32 > or f32 + auto cacheShape = cacheMemRefType.getShape(); + auto baseCacheElementType = GetInnerElementType(cache); // e.g. f32 + + auto cacheRegionIndexRanges = util::ArrayAttrToVector( + cacheReduceOp.cacheRegionRelevantIndexRanges(), + [](const IndexRangeAttr& indexRangeAttr) { + return indexRangeAttr.getValue(); + }); + + auto cacheRegionBaseIndices = util::ArrayAttrToVector, mlir::ArrayAttr>( + cacheReduceOp.cacheRegionBaseIndices(), + util::ConvertArrayAttrToIndexVector); + assert(cacheRegionIndexRanges.size() == cacheRegionBaseIndices.size()); + + // If this op has no volume to operate over due to unswitched boundary conditions, just erase the op and return + for (const auto& indexRange : cacheRegionIndexRanges) + { + if (indexRange.Size() == 0) + { + rewriter.eraseOp(cacheReduceOp); + return success(); + } + } + + auto scaleValue = CreateProductOfValues(rewriter, loc, baseOutputElementType, adaptor.scaleValues()); + + auto [reduceNestOp, reduceScheduleOp, reduceExecPlanOp] = CreateActiveElementCacheLoopnest(rewriter, cacheReduceOp, elementByteWidth, "reduce", [&](OpBuilder& cacheReduceBuilder, const std::vector& orderedSymbolicIndexOpValues) { + std::vector combinedRelevantIndices; + combinedRelevantIndices.insert( + combinedRelevantIndices.end(), + adaptor.externalRelevantIndices().begin(), + adaptor.externalRelevantIndices().end()); + combinedRelevantIndices.insert(combinedRelevantIndices.end(), orderedSymbolicIndexOpValues.begin(), orderedSymbolicIndexOpValues.end()); + + auto loadedCacheValue = cacheReduceBuilder.create(loc, cache, cacheReduceOp.relevantIndicesToSrcCacheMap(), combinedRelevantIndices); + auto scaledCacheValue = cacheReduceBuilder.create(loc, BinaryOpPredicate::MUL, scaleValue, loadedCacheValue); + auto currentOutputValue = cacheReduceBuilder.create(loc, dst, cacheReduceOp.relevantIndicesToDstMap(), combinedRelevantIndices); + auto accumulatedValue = cacheReduceBuilder.create(loc, BinaryOpPredicate::ADD, scaledCacheValue, currentOutputValue); + cacheReduceBuilder.create(loc, accumulatedValue, dst, cacheReduceOp.relevantIndicesToDstMap(), combinedRelevantIndices); + }); + + // Bounds check cache reduce loads/stores so we don't introduce + // a bug by adding a cache reduce + auto reduceOrder = reduceScheduleOp.getOrder(); + for (const auto& loopIndex : reduceOrder) + { + reduceScheduleOp.addLoopAttribute(loopIndex, rewriter.getIdentifier(AccessBoundsCheckAttrName), rewriter.getUnitAttr()); + } + + rewriter.eraseOp(cacheReduceOp); + + return success(); +} + +template +mlir::Operation* GetOpIfActiveBlockTagMatches(CacheOp cacheOp, llvm::StringRef cacheOpTag) +{ + if (cacheOp.activeBlockTag() == cacheOpTag) + { + return cacheOp; + } + else + { + return nullptr; + } +} + +template +mlir::Operation* GetCacheOpPair(CacheOp cacheOp) +{ + // Search for ops in the same block that have the same active block tag + + auto cacheOpTag = cacheOp.activeBlockTag(); + auto parentBlock = cacheOp->getBlock(); + for (auto& op : parentBlock->getOperations()) + { + if (&op != cacheOp.getOperation()) + { + mlir::Operation* pairOp = nullptr; + TypeSwitch(&op) + .Case([&](MultiCacheCopyOp copyOp) { + pairOp = GetOpIfActiveBlockTagMatches(copyOp, cacheOpTag); + }) + .Case([&](ActiveBlockCacheCopyOp copyOp) { + pairOp = GetOpIfActiveBlockTagMatches(copyOp, cacheOpTag); + }) + .Case([&](ActiveBlockCacheReduceOp reduceOp) { + pairOp = GetOpIfActiveBlockTagMatches(reduceOp, cacheOpTag); + }) + .Case([&](CacheZeroOp zeroOp) { + pairOp = GetOpIfActiveBlockTagMatches(zeroOp, 
cacheOpTag); + }) + .Default([&](mlir::Operation* op) { + // Not a cache op, so nothing to do here + }); + if (pairOp != nullptr) + { + return pairOp; + } + } + } + return nullptr; +} + +template +bool InMemoryLoopnestRecursiveRunner(const std::vector& loopnestShape, const std::vector& stepSizes, size_t currentDim, std::vector& currentIVs, KernelFn&& fn) +{ + // Returns true if the loopnest should continue running, false if it should exit early + if (currentDim < loopnestShape.size()) + { + for (int64_t idx = 0; idx < loopnestShape[currentDim]; idx += stepSizes[currentDim]) + { + currentIVs[currentDim] = idx; + if (!InMemoryLoopnestRecursiveRunner(loopnestShape, stepSizes, currentDim + 1, currentIVs, fn)) + { + return false; + } + } + return true; + } + else + { + // We're inside the innermost loop at this point, so just invoke the given kernel function with the current loop IV values + // The kernel function should return true if the loopnest should continue running, false if it should exit early + return fn(currentIVs); + } +} + +template +bool InMemoryLoopnestRunner(const std::vector& loopnestShape, const std::vector& stepSizes, KernelFn&& fn) +{ + // Returns true if the loopnest ran completely, false if it exited early + std::vector currentIVs(loopnestShape.size(), 0); + return InMemoryLoopnestRecursiveRunner(loopnestShape, stepSizes, 0, currentIVs, fn); +} + +std::vector GetConstantActiveBlockShapeHelper(mlir::ArrayAttr lbMapsArrayAttr, mlir::ArrayAttr ubMapsArrayAttr, mlir::ValueRange lbOperands, mlir::ValueRange ubOperands) +{ + auto lbMaps = util::ArrayAttrToVector(lbMapsArrayAttr, [](const mlir::AffineMapAttr& mapAttr) -> mlir::AffineMap { + return mapAttr.getValue(); + }); + auto ubMaps = util::ArrayAttrToVector(ubMapsArrayAttr, [](const mlir::AffineMapAttr& mapAttr) -> mlir::AffineMap { + return mapAttr.getValue(); + }); + + assert(llvm::all_of(lbMaps, [&](mlir::AffineMap lbMap) { + return lbMap.getNumInputs() == lbOperands.size(); + })); + assert(llvm::all_of(ubMaps, [&](mlir::AffineMap ubMap) { + return ubMap.getNumInputs() == ubOperands.size(); + })); + + assert(lbMaps.size() == ubMaps.size() && "mismatched number of lb and ub maps"); + + auto constantShapeOpt = GetConstantActiveBlockShape(lbMaps, ubMaps); + assert(constantShapeOpt.has_value() && "Only constant active block shapes are supported"); + return *constantShapeOpt; +} + +template +std::vector GetConstantActiveBlockShapeHelper(CacheOp cacheOp) +{ + typename CacheOp::Adaptor adaptor{ cacheOp }; + auto lbMapsArrayAttr = adaptor.lbMaps(); + auto ubMapsArrayAttr = adaptor.ubMaps(); + auto lbOperands = adaptor.lbOperands(); + auto ubOperands = adaptor.ubOperands(); + return GetConstantActiveBlockShapeHelper(lbMapsArrayAttr, ubMapsArrayAttr, lbOperands, ubOperands); +} + +std::vector GetFullCacheShapeHelper(const std::vector& multiCacheShape, + const std::vector& lbOperands, + const std::vector& ubOperands, + mlir::ArrayAttr lbMapsArrayAttr, + mlir::ArrayAttr ubMapsArrayAttr) +{ + auto activeBlockShape = GetConstantActiveBlockShapeHelper(lbMapsArrayAttr, ubMapsArrayAttr, lbOperands, ubOperands); + std::vector combinedCacheShape = multiCacheShape; + combinedCacheShape.insert(combinedCacheShape.end(), activeBlockShape.begin(), activeBlockShape.end()); + return combinedCacheShape; +} + +bool ThriftyCacheAllSingleElementStridesHelper(mlir::PatternRewriter& rewriter, + mlir::OpBuilder& currentBuilder, // Builder positioned inside of the temp multicache loops (if there are any) + mlir::Location loc, + mlir::Value outerArray, + 
mlir::Value cacheArray, + const std::vector& multiCacheIVs, + const std::vector& fullCacheShape, + const std::vector& fullCacheStepSizes, + const std::vector& activeBlockExternalSymbols, + mlir::ArrayAttr lbMapsArrayAttr, + mlir::ArrayAttr ubMapsArrayAttr) +{ + mlir::ValueRange lbOperands = activeBlockExternalSymbols; + mlir::ValueRange ubOperands = activeBlockExternalSymbols; + + auto lbMaps = util::ArrayAttrToVector(lbMapsArrayAttr, [](const mlir::AffineMapAttr& mapAttr) -> mlir::AffineMap { + return mapAttr.getValue(); + }); + auto ubMaps = util::ArrayAttrToVector(ubMapsArrayAttr, [](const mlir::AffineMapAttr& mapAttr) -> mlir::AffineMap { + return mapAttr.getValue(); + }); + + // Walk the combinedCacheStepSizes dimensions from innermost to outermost and loop over the loads and stores in that order + + // Create temporary op stacks to hold the ops created as part of computing the difference in accesses between iterations. + // We create two so that one holds the current iteration accesses and one holds the previuos iteration accesses. + // Then we keep track of them with two pointers so the "current" can become the "previous" with a pointer swap and then + // one can be cleared out before examining the next iteration + // Note: here we prefer stacks over other data structures as the accesses may depend on affine apply op computations, so in + // general we want to erase the ops in the reverse order they were constructed. + std::stack temporaryOpsOne; + std::stack temporaryOpsTwo; + std::stack* prevTemporaryOps = &temporaryOpsOne; + std::stack* currentTemporaryOps = &temporaryOpsTwo; + auto computeGlobalIndices = [&](const std::vector& activeBlockCurrentIVs, std::stack* temporaryOps) { + std::vector lowerBoundOffsetIVs; + lowerBoundOffsetIVs.reserve(activeBlockCurrentIVs.size()); + assert(lbMaps.size() == activeBlockCurrentIVs.size()); + mlir::AffineExpr sumExpr = currentBuilder.getAffineDimExpr(0) + currentBuilder.getAffineDimExpr(1); + mlir::AffineMap sumMap = mlir::AffineMap::get(2, 0, sumExpr); + for (unsigned arrayDim = 0; arrayDim < activeBlockCurrentIVs.size(); ++arrayDim) + { + mlir::Value lbMapApplied = currentBuilder.create(loc, lbMaps[arrayDim], lbOperands); + mlir::Value constantIV = currentBuilder.create(loc, activeBlockCurrentIVs[arrayDim]); + mlir::Value lbOffsetIV = currentBuilder.create(loc, sumMap, mlir::ValueRange{ lbMapApplied, constantIV }); + lowerBoundOffsetIVs.push_back(lbOffsetIV); + + temporaryOps->push(lbMapApplied.getDefiningOp()); + temporaryOps->push(constantIV.getDefiningOp()); + temporaryOps->push(lbOffsetIV.getDefiningOp()); + } + return lowerBoundOffsetIVs; + }; + + auto activeBlockDims = fullCacheShape.size() - multiCacheIVs.size(); + std::vector zeroActiveBlockIndices(activeBlockDims, 0); + + auto setMultiCacheIndices = [&](mlir::Operation* op, const std::vector& multiCacheCurrentIVs, std::stack* temporaryOps) { + assert(multiCacheIVs.size() == multiCacheCurrentIVs.size()); + for (const auto& [multiCacheCurrentIV, multiCacheIV] : llvm::zip(multiCacheCurrentIVs, multiCacheIVs)) + { + mlir::Value constantIV = currentBuilder.create(loc, multiCacheCurrentIV); + op->replaceUsesOfWith(multiCacheIV, constantIV); + + temporaryOps->push(constantIV.getDefiningOp()); + } + }; + std::vector zeroMultiCacheIndices(multiCacheIVs.size(), 0); + + // For the purposes of this check we don't care which array we're supposed to be loading from or storing to, as we only care about the memory address strides + // Therefore, just create affine loads for both arrays for 
simplicity + std::vector initGlobalIndices = computeGlobalIndices(zeroActiveBlockIndices, prevTemporaryOps); + mlir::AffineLoadOp prevOuterArrayAccessOp = CreateLoad(currentBuilder, loc, outerArray, initGlobalIndices); + mlir::AffineLoadOp prevCacheArrayAccessOp = CreateLoad(currentBuilder, loc, cacheArray, initGlobalIndices); + // Set the multicache index constants + setMultiCacheIndices(prevOuterArrayAccessOp, zeroMultiCacheIndices, prevTemporaryOps); + setMultiCacheIndices(prevCacheArrayAccessOp, zeroMultiCacheIndices, prevTemporaryOps); + + bool allSingleElementStrides = InMemoryLoopnestRunner(fullCacheShape, fullCacheStepSizes, [&](const std::vector& currentIVs) { + // Returns true if the loopnest should continue running, false if it should exit early + if (std::all_of(currentIVs.begin(), currentIVs.end(), [](int64_t idx) { return idx == 0; })) + { + // Don't compute anything for the first iteration of the loops, as we're already holding the initial access ops + return true; + } + util::TempOpCleanupGuard prevOpCleanupGuard(prevTemporaryOps, rewriter); + std::vector multiCacheCurrentIVs(currentIVs.begin(), currentIVs.begin() + multiCacheIVs.size()); + std::vector activeBlockCurrentIVs(currentIVs.begin() + multiCacheIVs.size(), currentIVs.end()); + + auto lowerBoundOffsetIVs = computeGlobalIndices(activeBlockCurrentIVs, currentTemporaryOps); + mlir::AffineLoadOp currentOuterArrayAccessOp = CreateLoad(currentBuilder, loc, outerArray, lowerBoundOffsetIVs); + mlir::AffineLoadOp currentCacheArrayAccessOp = CreateLoad(currentBuilder, loc, cacheArray, lowerBoundOffsetIVs); + setMultiCacheIndices(currentOuterArrayAccessOp, multiCacheCurrentIVs, currentTemporaryOps); + setMultiCacheIndices(currentCacheArrayAccessOp, multiCacheCurrentIVs, currentTemporaryOps); + + // Resolve the position in the memref for each access + std::vector prevOuterArrayIndicesVec(prevOuterArrayAccessOp.indices().begin(), prevOuterArrayAccessOp.indices().end()); + std::vector prevCacheArrayIndicesVec(prevCacheArrayAccessOp.indices().begin(), prevCacheArrayAccessOp.indices().end()); + std::vector currentOuterArrayIndicesVec(currentOuterArrayAccessOp.indices().begin(), currentOuterArrayAccessOp.indices().end()); + std::vector currentCacheArrayIndicesVec(currentCacheArrayAccessOp.indices().begin(), currentCacheArrayAccessOp.indices().end()); + + auto prevOuterArrayAccessMapComposition = util::GetIndexToMemoryLocationMap(currentBuilder.getContext(), prevOuterArrayAccessOp); + auto prevCacheArrayAccessMapComposition = util::GetIndexToMemoryLocationMap(currentBuilder.getContext(), prevCacheArrayAccessOp); + auto currentOuterArrayAccessMapComposition = util::GetIndexToMemoryLocationMap(currentBuilder.getContext(), currentOuterArrayAccessOp); + auto currentCacheArrayAccessMapComposition = util::GetIndexToMemoryLocationMap(currentBuilder.getContext(), currentCacheArrayAccessOp); + + auto prevOuterArrayAccess = util::MultiDimAffineApply(currentBuilder, loc, prevOuterArrayAccessMapComposition, prevOuterArrayIndicesVec); + auto prevCacheArrayAccess = util::MultiDimAffineApply(currentBuilder, loc, prevCacheArrayAccessMapComposition, prevCacheArrayIndicesVec); + auto currentOuterArrayAccess = util::MultiDimAffineApply(currentBuilder, loc, currentOuterArrayAccessMapComposition, currentOuterArrayIndicesVec); + auto currentCacheArrayAccess = util::MultiDimAffineApply(currentBuilder, loc, currentCacheArrayAccessMapComposition, currentCacheArrayIndicesVec); + + assert(prevOuterArrayAccess.size() == 1); + assert(prevCacheArrayAccess.size() 
== 1); + assert(currentOuterArrayAccess.size() == 1); + assert(currentCacheArrayAccess.size() == 1); + + prevTemporaryOps->push(prevOuterArrayAccess[0].getDefiningOp()); + prevTemporaryOps->push(prevCacheArrayAccess[0].getDefiningOp()); + prevTemporaryOps->push(currentOuterArrayAccess[0].getDefiningOp()); + prevTemporaryOps->push(currentCacheArrayAccess[0].getDefiningOp()); + + mlir::AffineExpr diffExpr = currentBuilder.getAffineDimExpr(1) - currentBuilder.getAffineDimExpr(0); + auto outerArrayDiffMap = mlir::AffineMap::get(2, 0, diffExpr); + auto cacheArrayDiffMap = mlir::AffineMap::get(2, 0, diffExpr); + + mlir::SmallVector compareOuterArrayAccesses{ prevOuterArrayAccess[0], currentOuterArrayAccess[0] }; + mlir::SmallVector compareCacheArrayAccesses{ prevCacheArrayAccess[0], currentCacheArrayAccess[0] }; + mlir::fullyComposeAffineMapAndOperands(&outerArrayDiffMap, &compareOuterArrayAccesses); + mlir::fullyComposeAffineMapAndOperands(&cacheArrayDiffMap, &compareCacheArrayAccesses); + + // At this point we don't need the load ops anymore so hold the current accesses as the next iteration's previous accesses + // and erase the previous access ops + rewriter.eraseOp(prevOuterArrayAccessOp); + rewriter.eraseOp(prevCacheArrayAccessOp); + prevOuterArrayAccessOp = currentOuterArrayAccessOp; + prevCacheArrayAccessOp = currentCacheArrayAccessOp; + // Erase the previous temporary ops as we're going along so we don't allocate too much excess memory and leave too many ops around during this procedure + + std::swap(prevTemporaryOps, currentTemporaryOps); // The currentTemporaryOps are the prevTemporaryOps in the next iteration + + assert(outerArrayDiffMap.getNumResults() == 1); + assert(cacheArrayDiffMap.getNumResults() == 1); + + auto outerArrayResultExpr = outerArrayDiffMap.getResult(0); + auto cacheArrayResultExpr = cacheArrayDiffMap.getResult(0); + if (outerArrayResultExpr.isa() && cacheArrayResultExpr.isa()) + { + auto outerArrayConstExpr = outerArrayResultExpr.dyn_cast(); + auto cacheArrayConstExpr = cacheArrayResultExpr.dyn_cast(); + if (outerArrayConstExpr.getValue() != cacheArrayConstExpr.getValue()) + { + // The outer array and cache array have a different stride between these two accesses, therefore the cache + // will not be a strict sub-buffer copy of the outer array + return false; + } + else if (outerArrayConstExpr.getValue() != 1) + { + // As a conservative check, additionally only interpret a stride of 1 between the accesses as indicating the cache is a strict subbuffer of the outer array + return false; + } + } + else + { + // One of the strides was non-constant so we can't assert that it is a strict subbuffer + return false; + } + + // At this point, both strides were constant 1's, so continue on to the next index + return true; + }); + + // Do a final cleanup of the ops we created for this check + rewriter.eraseOp(prevOuterArrayAccessOp); + rewriter.eraseOp(prevCacheArrayAccessOp); + + while (!prevTemporaryOps->empty()) + { + auto eraseOp = prevTemporaryOps->top(); + assert(eraseOp->use_empty()); + rewriter.eraseOp(eraseOp); + prevTemporaryOps->pop(); + } + assert(temporaryOpsOne.empty()); + assert(temporaryOpsTwo.empty()); + + return allSingleElementStrides; +} + +std::pair GetCacheRegionIterators(MultiCacheCopyOp copyOp) +{ + auto pairOp = GetCacheOpPair(copyOp); + + auto cacheArray = copyOp.cache(); + auto parentBlock = copyOp->getBlock(); + mlir::Block::iterator beginReplace; + mlir::Block::iterator endReplace; + if (pairOp) + { + auto firstOp = util::GetFirstOp(copyOp, 
pairOp); + auto secondOp = copyOp == firstOp ? pairOp : copyOp; + beginReplace = firstOp->getIterator(); + beginReplace++; + endReplace = secondOp->getIterator(); + } + else + { + // This op doesn't have a pair op, so if we want to replace all uses of the cache that could be impacted by this op + // then we need to examine all uses of the cache after this op since a multi-cache copy is only a copy-in cache copy op, + // but without stepping past other cache ops for this cache. + // Note: multiple cache ops for the same cache can occur on the same level of the loopnest since separate active block + // regions or separate trigger regions for the same cache can occur on the same level due to boundary conditions. Since + // each of these regions should be considered independently, we only deal with our current op and therefore current region + // in each instance of this lowering + beginReplace = copyOp->getIterator(); + beginReplace++; + endReplace = beginReplace; + for (auto iter = beginReplace; iter != parentBlock->end(); ++iter) + { + // Only break on other cache ops + if (!IsCacheOp(&(*iter)) || + (std::find(iter->operand_begin(), iter->operand_end(), cacheArray) == iter->operand_end())) + { + // This op doesn't use the cache so continue iterating past this op + endReplace = iter; + } + else + { + // Don't advance the endReplace iterator in this case as we always advance it once after the loop + break; + } + } + // Advance the endIterator as the last op we had it pointing at is the last one to consider replacing the cache + // usage in, so advancing it now will have it point to the op after the last one we'll make the replacement in + endReplace++; + } + + return std::make_pair(beginReplace, endReplace); +} + +std::pair GetCacheRegionIterators(ActiveBlockCacheCopyOp copyOp) +{ + auto pairOp = GetCacheOpPair(copyOp); + auto parentBlock = copyOp->getBlock(); + auto cacheArray = copyOp.cache(); + mlir::Block::iterator beginReplace; + mlir::Block::iterator endReplace; + if (pairOp) + { + auto firstOp = util::GetFirstOp(copyOp, pairOp); + auto secondOp = copyOp == firstOp ? pairOp : copyOp; + beginReplace = firstOp->getIterator(); + beginReplace++; + endReplace = secondOp->getIterator(); + } + else + { + // This op doesn't have a pair op, so if we want to replace all uses of the cache that could be impacted by this op + // then we need to examine all uses of the cache either before this op if it is a copy-out cache copy op, + // or after this op if it is a copy-in cache copy op, but without stepping past other cache ops for this cache. + // Note: multiple cache ops for the same cache can occur on the same level of the loopnest since separate active block + // regions or separate trigger regions for the same cache can occur on the same level due to boundary conditions. Since + // each of these regions should be considered independently, we only deal with our current op and therefore current region + // in each instance of this lowering + bool copyIn = copyOp.toCache(); + if (copyIn) + { + // A copy-in cache copy op without a pair op occurs in a graph like: + // for ... { + // cache_copy(outer array -> cache, copy_in = true) + // for ... { + // ... // read from cache + // } + // ... + // for ... { + // ... 
// read from cache + // } + // (end of cache region) + // } + // In this case, we need to search forward in the graph until we either reach the end of the block + // or find another cache op that is using the cache to determine the region of ops that are affected + // by this cache + + beginReplace = copyOp->getIterator(); + beginReplace++; + endReplace = beginReplace; + for (auto iter = beginReplace; iter != parentBlock->end(); ++iter) + { + if (!IsCacheOp(&(*iter)) || + (std::find(iter->operand_begin(), iter->operand_end(), cacheArray) == iter->operand_end())) + { + // This op doesn't use the cache so continue iterating past this op + endReplace = iter; + } + else + { + // Don't advance the endReplace iterator in this case as we always advance it once after the loop + break; + } + } + // Advance the endIterator as the last op we had it pointing at is the last one to consider replacing the cache + // usage in, so advancing it now will have it point to the op after the last one we'll make the replacement in + endReplace++; + } + else + { + // A copy-out cache copy op without a pair op occurs in a graph like: + // for ... { + // (beginning of cache region) + // for ... { + // ... // write to cache + // } + // ... + // for ... { + // ... // write to cache + // } + // cache_copy(cache -> outer array, copy_in = false) + // } + // In this case, we need to search backward in the graph until we either reach the beginning of the block + // or find another cache op that is using the cache to determine the region of ops that are affected + // by this cache + + endReplace = copyOp->getIterator(); + beginReplace = endReplace; + if (endReplace != parentBlock->begin()) + { + for (auto iter = --mlir::Block::iterator(endReplace); iter != --(parentBlock->begin()); --iter) + { + if (!IsCacheOp(&(*iter)) || + (std::find(iter->operand_begin(), iter->operand_end(), cacheArray) == iter->operand_end())) + { + // This op doesn't use the cache so continue iterating past this op + beginReplace = iter; + } + else + { + // Don't advance the beginReplace iterator in this case as we always advance it once after the loop + break; + } + } + } + } + } + + return std::make_pair(beginReplace, endReplace); +} + +std::pair GetCacheRegionIterators(ActiveBlockCacheReduceOp reduceOp) +{ + auto pairOp = GetCacheOpPair(reduceOp); + assert(pairOp != nullptr); + auto parentBlock = reduceOp->getBlock(); + mlir::Block::iterator beginReplace = pairOp->getIterator(); + beginReplace++; + mlir::Block::iterator endReplace = reduceOp->getIterator(); + + return std::make_pair(beginReplace, endReplace); +} + +template +void EraseThriftyCache(PatternRewriter& rewriter, CacheOp cacheOp, mlir::Value outerArray, mlir::Value cacheArray) +{ + // To do so, we need to remove this cache op and the pair cache op if it exists, and update any uses of this cache within this op's block scope + // to use the outer array instead of the cache + + auto pairOp = GetCacheOpPair(cacheOp); + auto loc = cacheOp.getLoc(); + auto parentBlock = cacheOp->getBlock(); + auto [beginReplace, endReplace] = GetCacheRegionIterators(cacheOp); + + parentBlock->walk(beginReplace, endReplace, [&](Operation* op) { + mlir::OpBuilder::InsertionGuard guard(rewriter); + if (std::find(op->operand_begin(), op->operand_end(), cacheArray) != op->operand_end()) + { + // This op uses the cacheArray, only AffineLoad and AffineStore on cache arrays are supported + TypeSwitch(op) + .Case([&](mlir::AffineLoadOp affineLoadOp) { + rewriter.setInsertionPoint(affineLoadOp); + auto baseArrayPosition 
= GetBaseArrayPosition(rewriter, loc, affineLoadOp); + + mlir::AffineLoadOp newLoadOp = CreateLoad(rewriter, loc, outerArray, baseArrayPosition); + affineLoadOp.replaceAllUsesWith(newLoadOp.getResult()); + rewriter.eraseOp(affineLoadOp); + }) + .Case([&](mlir::AffineStoreOp affineStoreOp) { + mlir::AffineStoreOp::Adaptor storeAdaptor{ affineStoreOp }; + rewriter.setInsertionPoint(affineStoreOp); + auto baseArrayPosition = GetBaseArrayPosition(rewriter, loc, affineStoreOp); + + CreateStore(rewriter, loc, storeAdaptor.value(), outerArray, baseArrayPosition); + rewriter.eraseOp(affineStoreOp); + }) + .Case([&](MultiCacheCopyOp copyOp) { + copyOp->replaceUsesOfWith(cacheArray, outerArray); + }) + .Case([&](ActiveBlockCacheCopyOp copyOp) { + copyOp->replaceUsesOfWith(cacheArray, outerArray); + }) + .Case([&](ActiveBlockCacheReduceOp reduceOp) { + reduceOp->replaceUsesOfWith(cacheArray, outerArray); + }) + .Default([&](Operation* defaultOp) { + assert(false && "Usage of mapped op found that doesn't have an op conversion registered!"); + }); + } + }); + + rewriter.eraseOp(cacheOp); + if (pairOp) + { + rewriter.eraseOp(pairOp); + } +} + +LogicalResult ThriftyCacheMultiCopyOpRewrite::matchAndRewrite(MultiCacheCopyOp multiCacheCopyOp, PatternRewriter& rewriter) const +{ + // If this cache op is for a thrifty cache: + // - Check if there is a pair op for this, e.g. a cache-copy-in paired with a cache-copy-out + // - Check if the source array and cache array cover the same sequence over the active block. I.e. as we step in the cache region domain both arrays have the same stride for every element + // - If the source and cache cover a consistent stride-1 region and therefore should be elided according to the thrifty definition, + // then erase this cache op and its corresponding pair op (if one exists) + // If this cache op is not a thrifty cache, then return success and do nothing + + if (!multiCacheCopyOp.thrifty()) + { + // Not a thrifty cache, so we'll always realize the cache regardless of memory ordering + return failure(); + } + + MultiCacheCopyOp::Adaptor adaptor{ multiCacheCopyOp }; + auto outerArray = multiCacheCopyOp.array(); + auto cacheArray = multiCacheCopyOp.cache(); + + std::optional execTargetOpt = util::ResolveExecutionTarget(multiCacheCopyOp); + assert(execTargetOpt.has_value()); + auto execTarget = *execTargetOpt; + if (execTarget == v::ExecutionTarget::GPU && !SameMemorySpace(outerArray, cacheArray)) + { + // The cache array is in a different memory space than the outer array, so don't elide this cache + return failure(); + } + + auto loc = multiCacheCopyOp.getLoc(); + + MultiCacheLoopInfo multiCacheInfo = CreateMultiCacheLoops(rewriter, multiCacheCopyOp, [&](mlir::OpBuilder& currentBuilder, const MultiCacheLoopInfo& info) { + auto lbMapsArrayAttr = adaptor.activeBlockLowerBoundMaps(); + auto ubMapsArrayAttr = adaptor.activeBlockUpperBoundMaps(); + + auto fullCacheShape = GetFullCacheShapeHelper(info.multiCacheShape, info.activeBlockExternalSymbols, info.activeBlockExternalSymbols, lbMapsArrayAttr, ubMapsArrayAttr); + if (fullCacheShape.size() != 0 && std::find(fullCacheShape.begin(), fullCacheShape.end(), 0) == fullCacheShape.end()) + { + assert(fullCacheShape.size() >= info.multiCacheShape.size()); + std::vector activeBlockStepSizes(fullCacheShape.size() - info.multiCacheShape.size(), 1); // TODO : do we have a scenario where the multicache has step sizes > 1 ? 
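Editor's note (illustrative, not part of the patch): the thrifty check assembled below walks fullCacheShape/fullCacheStepSizes with the InMemoryLoopnestRunner template defined earlier in this hunk. A rough usage sketch follows, assuming the element type stripped from the template signatures above is int64_t; CountLoopnestIterations and the counting kernel are hypothetical names.

#include <cstdint>
#include <vector>

// Relies on the InMemoryLoopnestRunner defined earlier in this patch.
static int64_t CountLoopnestIterations()
{
    std::vector<int64_t> shape{ 2, 4 }; // a 2x4 iteration space
    std::vector<int64_t> steps{ 1, 1 }; // unit step per dimension
    int64_t visited = 0;
    bool ranToCompletion = InMemoryLoopnestRunner(shape, steps, [&](const std::vector<int64_t>& currentIVs) {
        (void)currentIVs;
        ++visited;
        return true; // returning false stops the walk early, as the stride check does on a mismatch
    });
    return ranToCompletion ? visited : -1; // 8 for this shape
}

Returning false from the kernel is how the stride comparison bails out as soon as a non-unit or mismatched stride is observed.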
+ auto fullCacheStepSizes = info.multiCacheStepSizes; + fullCacheStepSizes.insert(fullCacheStepSizes.end(), activeBlockStepSizes.begin(), activeBlockStepSizes.end()); + bool allSingleElementStrides = ThriftyCacheAllSingleElementStridesHelper(rewriter, + currentBuilder, + loc, + outerArray, + cacheArray, + info.multiCacheIVs, + fullCacheShape, + fullCacheStepSizes, + info.activeBlockExternalSymbols, + lbMapsArrayAttr, + ubMapsArrayAttr); + + if (allSingleElementStrides) + { + // If the accesses into the arrays all had strides of 1, then the cache is a strict subbuffer of the outer array. + // Since it is a thrifty cache we should therefore elide this cache. + EraseThriftyCache(rewriter, multiCacheCopyOp, outerArray, cacheArray); + } + } + }); + // Clean up the MultiCacheLoops now that we're done with them + // Erase innermost-to-outermost loop, so reverse the multiCacheLoops list then erase each element + std::vector loopsToErase = multiCacheInfo.multiCacheLoops; + std::reverse(loopsToErase.begin(), loopsToErase.end()); + for (auto& loop : loopsToErase) + { + rewriter.eraseOp(loop); + } + + return success(); +} + +LogicalResult ThriftyCacheCopyOpRewrite::matchAndRewrite(ActiveBlockCacheCopyOp cacheCopyOp, PatternRewriter& rewriter) const +{ + // If this cache op is for a thrifty cache: + // - Check if there is a pair op for this, e.g. a cache-copy-in paired with a cache-copy-out + // - Check if the source array and cache array cover the same sequence over the active block. I.e. as we step in the cache region domain both arrays have the same stride for every element + // - If the source and cache cover a consistent stride-1 region and therefore should be elided according to the thrifty definition, + // then erase this cache op and its corresponding pair op (if one exists) + // If this cache op is not a thrifty cache, then return success and do nothing + + if (!cacheCopyOp.thrifty()) + { + // Not a thrifty cache, so we'll always realize the cache regardless of memory ordering + return failure(); + } + + // Get pair op if it exists + auto pairOp = GetCacheOpPair(cacheCopyOp); + + ActiveBlockCacheCopyOp::Adaptor adaptor{ cacheCopyOp }; + auto loc = cacheCopyOp.getLoc(); + + auto outerArray = cacheCopyOp.array(); + auto cacheArray = cacheCopyOp.cache(); + + std::optional execTargetOpt = util::ResolveExecutionTarget(cacheCopyOp); + assert(execTargetOpt.has_value()); + auto execTarget = *execTargetOpt; + if (execTarget == v::ExecutionTarget::GPU && !SameMemorySpace(outerArray, cacheArray)) + { + // The cache array is in a different memory space than the outer array, so don't elide this cache + return failure(); + } + + // Check if the cache copy op covers the outer array in the same memory order + + auto lbMapsArrayAttr = adaptor.lbMaps(); + auto ubMapsArrayAttr = adaptor.ubMaps(); + auto lbMaps = util::ArrayAttrToVector(lbMapsArrayAttr, [](const mlir::AffineMapAttr& mapAttr) -> mlir::AffineMap { + return mapAttr.getValue(); + }); + auto lbOperands = adaptor.lbOperands(); + auto activeBlockShape = GetConstantActiveBlockShapeHelper(cacheCopyOp); + std::vector activeBlockStepSizes(activeBlockShape.size(), 1); // TODO : do we have a scenario where the active block has step sizes > 1 ? 
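Editor's note (illustrative, not part of the patch): the single-element-stride condition that ThriftyCacheAllSingleElementStridesHelper evaluates through affine maps can be pictured over plain row-major offsets. The standalone sketch below uses hypothetical names and no MLIR types; it walks an active block and requires that both the outer array and a dense cache of that block advance by exactly one element per step.

#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Row-major flattened offset of `indices` in an array of the given `shape`.
static int64_t RowMajorOffset(const std::vector<int64_t>& shape, const std::vector<int64_t>& indices)
{
    int64_t offset = 0;
    for (std::size_t d = 0; d < shape.size(); ++d)
        offset = offset * shape[d] + indices[d];
    return offset;
}

// True if every step through the active block moves both buffers forward by exactly one element,
// i.e. the cache would be a strict contiguous sub-buffer of the outer array.
static bool AllSingleElementStrides(const std::vector<int64_t>& outerShape,
                                    const std::vector<int64_t>& blockOrigin,
                                    const std::vector<int64_t>& blockShape)
{
    const int64_t volume = std::accumulate(blockShape.begin(), blockShape.end(), int64_t{ 1 }, std::multiplies<int64_t>());
    int64_t prevOuter = 0;
    int64_t prevCache = 0;
    for (int64_t flat = 0; flat < volume; ++flat)
    {
        // Unflatten `flat` into block indices, innermost dimension fastest.
        std::vector<int64_t> blockIndices(blockShape.size());
        int64_t remaining = flat;
        for (int64_t d = static_cast<int64_t>(blockShape.size()) - 1; d >= 0; --d)
        {
            blockIndices[d] = remaining % blockShape[d];
            remaining /= blockShape[d];
        }
        std::vector<int64_t> outerIndices = blockIndices;
        for (std::size_t d = 0; d < outerIndices.size(); ++d)
            outerIndices[d] += blockOrigin[d];
        const int64_t outerOffset = RowMajorOffset(outerShape, outerIndices);
        const int64_t cacheOffset = RowMajorOffset(blockShape, blockIndices);
        if (flat > 0 && (outerOffset - prevOuter != 1 || cacheOffset - prevCache != 1))
            return false;
        prevOuter = outerOffset;
        prevCache = cacheOffset;
    }
    return true;
}

For example, a full 4x8 block of a 4x8 array passes (both strides are 1 everywhere), while the same 4x8 block inside a 16x32 array jumps 25 elements at each row boundary and fails, so that cache would be kept rather than elided.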
+ + if (activeBlockShape.size() == 0 || std::find(activeBlockShape.begin(), activeBlockShape.end(), 0) != activeBlockShape.end()) + { + // There's either no active block shape or at least one of the dimensions is 0, resulting in 0 volume, so just skip over this cache op + return failure(); + } + + std::vector lbOperandsVec(lbOperands.begin(), lbOperands.end()); + + bool allSingleElementStrides = ThriftyCacheAllSingleElementStridesHelper(rewriter, + rewriter, + loc, + outerArray, + cacheArray, + std::vector{}, + activeBlockShape, + activeBlockStepSizes, + lbOperandsVec, + lbMapsArrayAttr, + ubMapsArrayAttr); + + if (allSingleElementStrides) + { + // If the accesses into the arrays all had strides of 1, then the cache is a strict subbuffer of the outer array. + // Since it is a thrifty cache we should therefore elide this cache. + EraseThriftyCache(rewriter, cacheCopyOp, outerArray, cacheArray); + } return success(); } -LogicalResult ActiveElementCacheReduceOpRewrite::matchAndRewrite(ActiveElementCacheReduceOp cacheReduceOp, PatternRewriter& rewriter) const +LogicalResult ThriftyCacheReduceOpRewrite::matchAndRewrite(ActiveBlockCacheReduceOp cacheReduceOp, PatternRewriter& rewriter) const { - // Reduce data from the source cache buffer to the destination buffer by iterating over the cache region shape - // and mapping from cache region indices to the source cache buffer and destination buffer - - auto loc = cacheReduceOp.getLoc(); - - ActiveElementCacheReduceOp::Adaptor adaptor{ cacheReduceOp }; - - [[maybe_unused]] auto dst = cacheReduceOp.dst(); - assert(dst.getType().isa()); - auto baseOutputMemRefType = dst.getType().cast(); - auto baseOutputShape = baseOutputMemRefType.getShape(); - auto baseOutputElementType = GetInnerElementType(dst); + // If this cache op is for a thrifty cache: + // - Check if there is a pair op for this, e.g. a cache-copy-in or a cache-zero paired with this cache reduce + // - Check if the source array and cache array cover the same sequence over the active block. I.e. as we step in the cache region domain both arrays have the same stride for every element + // - If the source and cache cover a consistent stride-1 region and therefore should be elided according to the thrifty definition, + // then erase this cache op and its corresponding pair op (if one exists) + // If this cache op is not a thrifty cache, then return success and do nothing - auto elementBitWidth = baseOutputMemRefType.getElementTypeBitWidth(); - auto elementByteWidth = elementBitWidth / 8; + if (!cacheReduceOp.thrifty()) + { + // Not a thrifty cache, so we'll always realize the cache regardless of memory ordering + return failure(); + } - auto cache = cacheReduceOp.srcCache(); - auto cacheMemRefType = cache.getType().cast(); - auto cacheElementType = cacheMemRefType.getElementType(); // either something like vector< n x f32 > or f32 - auto cacheShape = cacheMemRefType.getShape(); - auto baseCacheElementType = GetInnerElementType(cache); // e.g. 
f32 + // Get pair op if it exists + auto pairOp = GetCacheOpPair(cacheReduceOp); + assert(pairOp != nullptr && "ActiveBlockCacheReduceOp must have a pair op"); - auto cacheRegionIndexRanges = util::ArrayAttrToVector( - cacheReduceOp.cacheRegionRelevantIndexRanges(), - [](const IndexRangeAttr& indexRangeAttr) { - return indexRangeAttr.getValue(); - }); + ActiveBlockCacheReduceOp::Adaptor adaptor{ cacheReduceOp }; + auto loc = cacheReduceOp.getLoc(); - auto cacheRegionBaseIndices = util::ArrayAttrToVector, mlir::ArrayAttr>( - cacheReduceOp.cacheRegionBaseIndices(), - util::ConvertArrayAttrToIndexVector); - assert(cacheRegionIndexRanges.size() == cacheRegionBaseIndices.size()); + auto outerArray = cacheReduceOp.array(); + auto cacheArray = cacheReduceOp.cache(); - // If this op has no volume to operate over due to unswitched boundary conditions, just erase the op and return - for (const auto& indexRange : cacheRegionIndexRanges) + std::optional execTargetOpt = util::ResolveExecutionTarget(cacheReduceOp); + assert(execTargetOpt.has_value()); + auto execTarget = *execTargetOpt; + if (execTarget == v::ExecutionTarget::GPU && !SameMemorySpace(outerArray, cacheArray)) { - if (indexRange.Size() == 0) - { - rewriter.eraseOp(cacheReduceOp); - return success(); - } + // The cache array is in a different memory space than the outer array, so don't elide this cache + return failure(); } - auto scaleValue = CreateProductOfValues(rewriter, loc, baseOutputElementType, adaptor.scaleValues()); - - auto [reduceNestOp, reduceScheduleOp, reduceExecPlanOp] = CreateActiveElementCacheLoopnest(rewriter, cacheReduceOp, elementByteWidth, "reduce", [&](OpBuilder& cacheReduceBuilder, const std::vector& orderedSymbolicIndexOpValues) { - std::vector combinedRelevantIndices; - combinedRelevantIndices.insert( - combinedRelevantIndices.end(), - adaptor.externalRelevantIndices().begin(), - adaptor.externalRelevantIndices().end()); - combinedRelevantIndices.insert(combinedRelevantIndices.end(), orderedSymbolicIndexOpValues.begin(), orderedSymbolicIndexOpValues.end()); + // Check if the cache copy op covers the outer array in the same memory order - auto loadedCacheValue = cacheReduceBuilder.create(loc, cache, cacheReduceOp.relevantIndicesToSrcCacheMap(), combinedRelevantIndices); - auto scaledCacheValue = cacheReduceBuilder.create(loc, BinaryOpPredicate::MUL, scaleValue, loadedCacheValue); - auto currentOutputValue = cacheReduceBuilder.create(loc, dst, cacheReduceOp.relevantIndicesToDstMap(), combinedRelevantIndices); - auto accumulatedValue = cacheReduceBuilder.create(loc, BinaryOpPredicate::ADD, scaledCacheValue, currentOutputValue); - cacheReduceBuilder.create(loc, accumulatedValue, dst, cacheReduceOp.relevantIndicesToDstMap(), combinedRelevantIndices); + auto lbMapsArrayAttr = adaptor.lbMaps(); + auto ubMapsArrayAttr = adaptor.ubMaps(); + auto lbMaps = util::ArrayAttrToVector(lbMapsArrayAttr, [](const mlir::AffineMapAttr& mapAttr) -> mlir::AffineMap { + return mapAttr.getValue(); }); - - // Bounds check cache reduce loads/stores so we don't introduce - // a bug by adding a cache reduce - auto reduceOrder = reduceScheduleOp.getOrder(); - for (const auto& loopIndex : reduceOrder) + auto lbOperands = adaptor.lbOperands(); + auto activeBlockShape = GetConstantActiveBlockShapeHelper(cacheReduceOp); + std::vector activeBlockStepSizes(activeBlockShape.size(), 1); // TODO : do we have a scenario where the active block has step sizes > 1 ? 
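Editor's note (illustrative, not part of the patch): the reduce-style caches being tested here accumulate scaled cache contents back into the outer array, as the ActiveBlockCacheReduceOp lowering earlier in this hunk shows. Reduced to a flat element walk with hypothetical names, the per-element effect is roughly the sketch below; the real lowering drives it through affine maps, schedules, and bounds-checked loads/stores.

#include <cstddef>

// Sketch of the cache-reduce semantics: array[i] += scale * cache[i] over the active block volume.
static void CacheReduceReference(float* array, const float* cache, float scale, std::size_t volume)
{
    for (std::size_t i = 0; i < volume; ++i)
    {
        array[i] += scale * cache[i];
    }
}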
+ if (activeBlockShape.size() == 0 || std::find(activeBlockShape.begin(), activeBlockShape.end(), 0) != activeBlockShape.end()) { - reduceScheduleOp.addLoopAttribute(loopIndex, rewriter.getIdentifier(AccessBoundsCheckAttrName), rewriter.getUnitAttr()); + // There's either no active block shape or at least one of the dimensions is 0, resulting in 0 volume, so just skip over this cache op + return failure(); } + std::vector lbOperandsVec(lbOperands.begin(), lbOperands.end()); - rewriter.eraseOp(cacheReduceOp); + bool allSingleElementStrides = ThriftyCacheAllSingleElementStridesHelper(rewriter, + rewriter, + loc, + outerArray, + cacheArray, + std::vector{}, + activeBlockShape, + activeBlockStepSizes, + lbOperandsVec, + lbMapsArrayAttr, + ubMapsArrayAttr); + + if (allSingleElementStrides) + { + // If the accesses into the arrays all had strides of 1, then the cache is a strict subbuffer of the outer array. + // Since it is a thrifty cache we should therefore elide this cache. + EraseThriftyCache(rewriter, cacheReduceOp, outerArray, cacheArray); + } return success(); } @@ -2650,6 +3743,7 @@ LogicalResult BeginCacheMappingOpRewrite::matchAndRewrite(BeginCacheMappingOp be auto baseArrayPosition = GetBaseArrayPosition(rewriter, loc, loadOp); mlir::AffineLoadOp newLoadOp = CreateLoad(rewriter, loc, toValue, baseArrayPosition); loadOp.replaceAllUsesWith(newLoadOp.getResult()); + TransferOrSetAccessAttrs(loadOp, newLoadOp); rewriter.eraseOp(loadOp); } else @@ -2664,7 +3758,8 @@ LogicalResult BeginCacheMappingOpRewrite::matchAndRewrite(BeginCacheMappingOp be if (isActiveBlockCache) { auto baseArrayPosition = GetBaseArrayPosition(rewriter, loc, storeOp); - CreateStore(rewriter, loc, storeAdaptor.value(), toValue, baseArrayPosition); + auto newStoreOp = CreateStore(rewriter, loc, storeAdaptor.value(), toValue, baseArrayPosition); + TransferOrSetAccessAttrs(storeOp, newStoreOp); rewriter.eraseOp(storeOp); } else @@ -3125,6 +4220,133 @@ LogicalResult MergeCacheRegionOpsRewrite::matchAndRewrite(BeginCacheRegionOp beg return success(); } +MakeCacheOp CreateDoubleBufferTempArray(mlir::OpBuilder& builder, + MultiCacheInfo& info, + BeginCacheRegionOp& cacheRegionOp) +{ + mlir::Value cache = info.multiCache.cache(); + + auto cacheType = cache.getType(); + assert(cacheType.isa()); + auto cacheMemRefType = cacheType.cast(); + + auto multiCacheShape = info.multiCacheIterationCounts; + auto fullCacheShape = cacheMemRefType.getShape().vec(); + std::vector activeBlockCacheShape(fullCacheShape.begin() + multiCacheShape.size(), fullCacheShape.end()); + + auto sharedMemSpaceAttr = util::MemorySpaceToAttribute(v::MemorySpace::Shared, builder.getContext()); + auto privateMemSpaceAttr = util::MemorySpaceToAttribute(v::MemorySpace::Private, builder.getContext()); + auto cacheMemSpaceAttr = cacheMemRefType.getMemorySpace(); + auto tempArrayMemSpaceAttrOpt = cacheRegionOp.doubleBufferMemorySpace(); + assert(tempArrayMemSpaceAttrOpt.hasValue() && "Can't create a double buffer cache without a double buffer memory space set"); + auto tempArrayMemSpaceAttr = util::MemorySpaceToAttribute(tempArrayMemSpaceAttrOpt.getValue(), builder.getContext()); + + auto cacheAccessMap = info.multiCache.offsetArrayToCacheAccessMap(); + auto multiCacheAccessIndices = util::ConvertArrayAttrToIndexVector(info.multiCache.multiCacheAccessIndices()); + auto cacheOffsetIndices = util::ConvertArrayAttrToIndexVector(info.multiCache.offsetAccessIndices()); + + auto tempArrayOffsetIndices = cacheOffsetIndices; + auto tempArrayMultiCacheAccessIndices = 
multiCacheAccessIndices; + auto tempArrayAccessMap = cacheAccessMap; + + size_t arrayRank = cacheAccessMap.getNumDims() - cacheOffsetIndices.size() - multiCacheAccessIndices.size(); + + std::vector tempArrayActiveBlockShape = activeBlockCacheShape; + std::vector tempMemrefMaps = cacheMemRefType.getAffineMaps(); + + std::optional execTargetOpt = util::ResolveExecutionTarget(cacheRegionOp); + auto execTarget = *execTargetOpt; + auto parentLambda = cacheRegionOp->getParentOfType(); + + if (execTarget == v::ExecutionTarget::GPU) + { + // If this is for a GPU target, then our temp array memory space should be set to PRIVATE + + if (cacheMemSpaceAttr != privateMemSpaceAttr && tempArrayMemSpaceAttr == privateMemSpaceAttr) + { + // If the double buffer temp array is in the PRIVATE memory space and the cache is not + // then the different threads will each load a different segment of the cache into their + // double-buffering temp buffer. To support this, the private memory temp array needs to + // be shrunk to just hold a single thread's contribution to the cache + + auto launchAttr = parentLambda->getAttrOfType(parentLambda.getGPULaunchAttrName()); + assert(launchAttr != nullptr); + auto gpuParams = accera::ir::targets::GPU::FromArrayAttr(launchAttr); + std::vector blockDimSizes = { gpuParams.block.x, gpuParams.block.y, gpuParams.block.z }; + + auto vectorSizePerThread = 1; // TODO : support vectorized loads changing the volume + int64_t activeBlockVolume = std::accumulate(activeBlockCacheShape.begin(), activeBlockCacheShape.end(), 1, std::multiplies()); + auto loadsPerThread = activeBlockVolume / (blockDimSizes[0] * blockDimSizes[1] * blockDimSizes[2] * vectorSizePerThread); + loadsPerThread = std::max((int64_t)1, (int64_t)loadsPerThread); + tempArrayActiveBlockShape = { loadsPerThread, vectorSizePerThread }; + + std::vector tempArrayIndexPlaceholders; + tempArrayIndexPlaceholders.emplace_back(ActionsPerThreadIndexName, Index::DefaultID); + tempArrayIndexPlaceholders.emplace_back(ThreadVectorizationIndexName, Index::DefaultID); + tempArrayOffsetIndices = tempArrayIndexPlaceholders; + + // Need to create the temp array access expressions such that the ActionsPerThreadIndex and ThreadVectorizationIndex indices + // are used to index into the array and everything else is ignored. + // To do this, set the cacheOffsetIndices to be the placeholders (which will need to get fully resolved later once the loopnest is created) + // and construct the map to only pay attention to those inputs + + size_t multiCacheIndexPos = 0; + size_t cacheOffsetIndexPos = multiCacheAccessIndices.size(); + size_t multiCacheDimAndPrivateThreadDimCount = multiCacheAccessIndices.size() + 2; // + 2 because there is one ActionPerThread dim and one ThreadVectorization dim + size_t totalDimCount = multiCacheDimAndPrivateThreadDimCount + arrayRank; + + // Map { multiCacheIndices..., ActionPerThread idx, ThreadVectorization idx, arrayRank global indices... 
} to { multiCacheIndices..., ActionPerThread idx, ThreadVectorization idx } + tempArrayAccessMap = util::GetMajorIdentityMap(totalDimCount, multiCacheDimAndPrivateThreadDimCount, builder.getContext()); + } + } + + auto fullTempArrayShape = tempArrayActiveBlockShape; + fullTempArrayShape.insert(fullTempArrayShape.begin(), multiCacheShape.begin(), multiCacheShape.end()); + mlir::MemRefType tempArrayType = mlir::MemRefType::get(fullTempArrayShape, cacheMemRefType.getElementType(), tempMemrefMaps, tempArrayMemSpaceAttr); + + mlir::OpBuilder::InsertionGuard insertGuard(builder); + builder.setInsertionPointToStart(&parentLambda.body().front()); + auto memorySpaceEnum = util::AttributeToMemorySpace(tempArrayMemSpaceAttr); + return builder.create(parentLambda.getLoc(), + tempArrayType, + memorySpaceEnum, + tempArrayAccessMap, + tempArrayOffsetIndices, + tempArrayMultiCacheAccessIndices); +} + +void CreateCacheMappingRegionHelper(mlir::PatternRewriter& rewriter, + BeginCacheRegionOp& beginCacheRegionOp, + MultiCacheInfo& multiCacheInfo) +{ + for (auto activeCacheLoopEntry : multiCacheInfo.activeBlockRegionInfos) + { + auto cacheLevelLoopOp = activeCacheLoopEntry.first; + auto cacheLevelLoop = mlir::cast(cacheLevelLoopOp); + auto& currentActiveBlockRegionInfo = activeCacheLoopEntry.second; + + mlir::Block* cacheLevelBlock = cacheLevelLoop.getOperation()->getBlock(); + mlir::Block::iterator cacheLevelStartOp(cacheLevelLoop); + mlir::Block::iterator cacheLevelEndOp(cacheLevelLoop); + cacheLevelEndOp++; + + rewriter.setInsertionPoint(cacheLevelBlock, cacheLevelStartOp); + + // TODO : refactor out CacheAccessContext and simplify this + currentActiveBlockRegionInfo.cacheAccessContext.externalRelevantScheduleIndices = currentActiveBlockRegionInfo.allCacheExternalSymbols; + BeginCacheMappingOp cacheMappingOp = rewriter.create(beginCacheRegionOp.getLoc(), + beginCacheRegionOp.input(), + multiCacheInfo.originalCacheOp, + beginCacheRegionOp.baseInput(), + currentActiveBlockRegionInfo.cacheAccessContext, + beginCacheRegionOp.id(), + beginCacheRegionOp.activeBlockCache()); + + rewriter.setInsertionPoint(cacheLevelBlock, cacheLevelEndOp); + EndCacheMappingOp endCacheMappingOp = rewriter.create(beginCacheRegionOp.getLoc(), cacheMappingOp.getResult()); + } +} + LogicalResult BeginCacheRegionOpRewrite::matchAndRewrite(BeginCacheRegionOp beginCacheRegionOp, PatternRewriter& rewriter) const { // CacheRegionOp examines the uses of the input value within its region and determines which cache data movements ops are necessary to support that usage @@ -3440,113 +4662,221 @@ LogicalResult BeginCacheRegionOpRewrite::matchAndRewrite(BeginCacheRegionOp begi for (auto& multiCacheInfo : multiCacheInfos) { + std::string activeBlockTag = "active_block_" + std::to_string(util::GetUniqueId()); if (multiCacheInfo.arrayAccessInfo.cacheUsedInRegion) { mlir::Block::iterator cacheRegionStart(beginCacheRegionOp); mlir::Block::iterator cacheRegionEnd(endOp); - rewriter.setInsertionPoint(triggerLevelBlock, cacheRegionStart); - - // Create the prologue cache data moving op - if (!multiCacheInfo.arrayAccessInfo.valueRead || multiCacheInfo.arrayAccessInfo.onlyReadsAreAccumulates) + // Get the next loop outside of the trigger level loop + // We can only double-buffer if there is a loop outside of the trigger level loop + auto triggerLoopParentLoop = util::CastOrGetParentOfType(triggerLevelBlock->getParentOp()); + if (beginCacheRegionOp.doubleBufferCache() && triggerLoopParentLoop != nullptr) { - rewriter.create(loc, multiCacheInfo.multiCache); 
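Editor's note (illustrative, not part of the patch): the double-buffering branch that follows clones the trigger loop's parent loop to perform the iteration-0 copy into the cache before the loop, prefetches iteration i+1 into a temp array at the top of each main iteration, stages that temp data into the cache at the bottom, and guards both copies with affine.if so the unswitched final iteration skips them. A rough host-side sketch of that schedule, with hypothetical names:

#include <cstddef>

// Illustrative only; the real lowering expresses the guards as affine.if ops and
// unswitches the final iteration instead of branching.
template <typename CopyIn, typename Prefetch, typename Stage, typename Compute>
static void DoubleBufferedSchedule(std::size_t numIters,
                                   CopyIn copyIterIntoCache,
                                   Prefetch copyNextIterIntoTemp,
                                   Stage moveTempIntoCache,
                                   Compute compute)
{
    if (numIters == 0)
        return;
    copyIterIntoCache(0); // prologue: fill the cache for iteration 0 (the cloned single-iteration parent loop)
    for (std::size_t i = 0; i < numIters; ++i)
    {
        const bool lastIter = (i + 1 == numIters);
        if (!lastIter)
            copyNextIterIntoTemp(i + 1); // prefetch the next iteration's data while this iteration computes
        compute(i);                      // the mapped cache region reads from the already-filled cache
        if (!lastIter)
            moveTempIntoCache(i + 1);    // epilogue: stage the prefetched data into the cache for the next iteration
    }
}

Because the compute phase only reads from the cache (the assert just below requires an input-only cache), the prefetch into the temp array can safely overlap it.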
+ [[maybe_unused]] bool inputOnlyCache = !multiCacheInfo.arrayAccessInfo.valueWritten; + assert(inputOnlyCache && "Double buffering is only supported for read-only caches"); + + auto doubleBufferTempArray = CreateDoubleBufferTempArray(rewriter, multiCacheInfo, beginCacheRegionOp); + + // Create the 0'th iteration copy just before the triggerLoopParentLoop + auto parentLoopBlock = triggerLoopParentLoop->getBlock(); + + rewriter.setInsertionPoint(parentLoopBlock, triggerLoopParentLoop->getIterator()); + + mlir::Value triggerLoopParentIV = triggerLoopParentLoop.getInductionVar(); + int64_t triggerLoopParentFirstIterIntValue = triggerLoopParentLoop.getConstantLowerBound(); + int64_t triggerLoopParentStepSize = triggerLoopParentLoop.getStep(); + + // Clone the parent loop and wrap it around this ActiveBlockCacheCopyOp for cache access resolution + // However, limit the range to just a single iteration and remove everything inside the loop body + auto clonedtriggerLoopParentLoop = dyn_cast(rewriter.clone(*(triggerLoopParentLoop.getOperation()))); + clonedtriggerLoopParentLoop.setConstantUpperBound(triggerLoopParentStepSize); + + // Erase the ops in the cloned loop body and put only the first iteration's cache copy in it followed by an affine.yield + util::EraseAllOpsInBlock(rewriter, clonedtriggerLoopParentLoop.getLoopBody().front()); + + auto loopBuilder = util::MakeBodyBuilder(clonedtriggerLoopParentLoop); + + auto firstIterCopy = loopBuilder.create(loc, + beginCacheRegionOp.input(), + multiCacheInfo.multiCache, + multiCacheInfo.multiCacheExternalSymbols, + rewriter.getAffineMapArrayAttr(multiCacheInfo.multiCacheLBMaps), + rewriter.getAffineMapArrayAttr(multiCacheInfo.multiCacheUBMaps), + rewriter.getI64ArrayAttr(multiCacheInfo.multiCacheStepSizes), + util::ConvertIndexVectorToArrayAttr(multiCacheInfo.multiCacheLoopIndexIds, rewriter.getContext()), + rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.lbMaps), + rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.ubMaps), + multiCacheInfo.multiCacheExternalSymbolsPermutationMap, + multiCacheInfo.activeBlockToCacheMap, + activeBlockTag, + beginCacheRegionOp.thrifty(), + true); // toCache + // Re-create the affine yield op at the end of the block that we erased + loopBuilder.create(loc); + firstIterCopy->replaceUsesOfWith(triggerLoopParentIV, clonedtriggerLoopParentLoop.getInductionVar()); + + rewriter.setInsertionPoint(triggerLevelBlock, cacheRegionStart); + // Create the i+1 iteration copy to the temp buffer + + // Create the prologue cache data moving op + auto loopStepIncrementExpr = rewriter.getAffineDimExpr(0) + rewriter.getAffineConstantExpr(triggerLoopParentStepSize); + auto loopStepIncrementMap = mlir::AffineMap::get(1, 0, loopStepIncrementExpr); + mlir::Value triggerLoopParentNextIterValue = rewriter.create(loc, loopStepIncrementMap, mlir::ValueRange{ triggerLoopParentIV }); + + // Create an AffineIfOp to guard the cache fills so that it doesn't happen in the final iteration + // We want to load if triggerLoopParentLoop < parentLoopLastIterInt + int64_t parentLoopLastIterInt = triggerLoopParentLoop.getConstantUpperBound() - triggerLoopParentLoop.getStep(); + + // the inequality will be ((lastIterIVValue - 1) - triggerLoopParentLoopIV >= 0) + // The -1 is because it's a >= comparison and in the final iteration of the loop we want this check to return false + mlir::AffineExpr lastIterIntMinusIVExpr = rewriter.getAffineConstantExpr(parentLoopLastIterInt - 1) - rewriter.getAffineDimExpr(0); + std::vector 
conditionalLoadConstraintExprs{ lastIterIntMinusIVExpr }; + SmallVector constraintEqFlags(1, false); // false indicating the checks should be >= 0 inequalities rather than == 0 equalities + + auto nonLastIterCheckSet = mlir::IntegerSet::get(1, 0, conditionalLoadConstraintExprs, constraintEqFlags); + + auto prologueCopyIfOp = rewriter.create(loc, nonLastIterCheckSet, ValueRange{ triggerLoopParentIV }, false); // false indicating we do not want an "else" region + auto prologueThenBuilder = prologueCopyIfOp.getThenBodyBuilder(); + + MakeDelayedMappingRegion(prologueThenBuilder, triggerLoopParentIV, triggerLoopParentNextIterValue, [&](mlir::OpBuilder& builder) { + auto prologueTempCopy = builder.create(loc, + beginCacheRegionOp.input(), + doubleBufferTempArray, + multiCacheInfo.multiCacheExternalSymbols, + rewriter.getAffineMapArrayAttr(multiCacheInfo.multiCacheLBMaps), + rewriter.getAffineMapArrayAttr(multiCacheInfo.multiCacheUBMaps), + rewriter.getI64ArrayAttr(multiCacheInfo.multiCacheStepSizes), + util::ConvertIndexVectorToArrayAttr(multiCacheInfo.multiCacheLoopIndexIds, rewriter.getContext()), + rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.lbMaps), + rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.ubMaps), + multiCacheInfo.multiCacheExternalSymbolsPermutationMap, + multiCacheInfo.activeBlockToCacheMap, + activeBlockTag, + beginCacheRegionOp.thrifty(), + true); // toCache + }); + + // Create mapping ops for each cache active block region associated with this multiCache + CreateCacheMappingRegionHelper(rewriter, beginCacheRegionOp, multiCacheInfo); + + // Create the i+1 iteration copy from the temp buffer to the cache + rewriter.setInsertionPoint(triggerLevelBlock, cacheRegionEnd); + + auto epilogueCopyIfOp = rewriter.create(loc, nonLastIterCheckSet, ValueRange{ triggerLoopParentIV }, false); // false indicating we do not want an "else" region + auto epilogueThenBuilder = epilogueCopyIfOp.getThenBodyBuilder(); + + MakeDelayedMappingRegion(epilogueThenBuilder, triggerLoopParentIV, triggerLoopParentNextIterValue, [&](mlir::OpBuilder& builder) { + auto epilogueTempCopy = builder.create(loc, + multiCacheInfo.multiCache, + doubleBufferTempArray, + multiCacheInfo.multiCacheExternalSymbols, + rewriter.getAffineMapArrayAttr(multiCacheInfo.multiCacheLBMaps), + rewriter.getAffineMapArrayAttr(multiCacheInfo.multiCacheUBMaps), + rewriter.getI64ArrayAttr(multiCacheInfo.multiCacheStepSizes), + util::ConvertIndexVectorToArrayAttr(multiCacheInfo.multiCacheLoopIndexIds, rewriter.getContext()), + rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.lbMaps), + rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.ubMaps), + multiCacheInfo.multiCacheExternalSymbolsPermutationMap, + multiCacheInfo.activeBlockToCacheMap, + activeBlockTag, + beginCacheRegionOp.thrifty(), + false); // toCache + }); + // Mark the trigger loop parent loop to unswitch the last iteration so that our affine.if checks + // are always true in the main loop and always false in the unswitched final iteration + triggerLoopParentLoop->setAttr(UnswitchSuffixItersName, rewriter.getI64IntegerAttr(1)); } else { - if (beginCacheRegionOp.activeBlockCache()) + // Non-double-buffering case + + rewriter.setInsertionPoint(triggerLevelBlock, cacheRegionStart); + + // Create the prologue cache data moving op + if (!multiCacheInfo.arrayAccessInfo.valueRead || multiCacheInfo.arrayAccessInfo.onlyReadsAreAccumulates) { - rewriter.create(loc, - beginCacheRegionOp.input(), - multiCacheInfo.multiCache, - 
multiCacheInfo.multiCacheExternalSymbols, - rewriter.getAffineMapArrayAttr(multiCacheInfo.multiCacheLBMaps), - rewriter.getAffineMapArrayAttr(multiCacheInfo.multiCacheUBMaps), - rewriter.getI64ArrayAttr(multiCacheInfo.multiCacheStepSizes), - util::ConvertIndexVectorToArrayAttr(multiCacheInfo.multiCacheLoopIndexIds, rewriter.getContext()), - rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.lbMaps), - rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.ubMaps), - multiCacheInfo.multiCacheExternalSymbolsPermutationMap, - multiCacheInfo.activeBlockToCacheMap); + rewriter.create(loc, multiCacheInfo.multiCache, activeBlockTag, beginCacheRegionOp.thrifty()); } else - { - rewriter.create(loc, beginCacheRegionOp.input(), cacheAccessContext); - } - } - - // Create mapping ops for each cache active block region associated with this multiCache - for (auto activeCacheLoopEntry : multiCacheInfo.activeBlockRegionInfos) - { - auto cacheLevelLoopOp = activeCacheLoopEntry.first; - auto cacheLevelLoop = mlir::cast(cacheLevelLoopOp); - auto& currentActiveBlockRegionInfo = activeCacheLoopEntry.second; - - mlir::Block* cacheLevelBlock = cacheLevelLoop.getOperation()->getBlock(); - mlir::Block::iterator cacheLevelStartOp(cacheLevelLoop); - mlir::Block::iterator cacheLevelEndOp(cacheLevelLoop); - cacheLevelEndOp++; - - rewriter.setInsertionPoint(cacheLevelBlock, cacheLevelStartOp); - - // TODO : refactor out CacheAccessContext and simplify this - currentActiveBlockRegionInfo.cacheAccessContext.externalRelevantScheduleIndices = currentActiveBlockRegionInfo.allCacheExternalSymbols; - BeginCacheMappingOp cacheMappingOp = rewriter.create(loc, - beginCacheRegionOp.input(), - multiCacheInfo.originalCacheOp, - beginCacheRegionOp.baseInput(), - currentActiveBlockRegionInfo.cacheAccessContext, - beginCacheRegionOp.id(), - beginCacheRegionOp.activeBlockCache()); - - rewriter.setInsertionPoint(cacheLevelBlock, cacheLevelEndOp); - EndCacheMappingOp endCacheMappingOp = rewriter.create(loc, cacheMappingOp.getResult()); - } - - rewriter.setInsertionPoint(triggerLevelBlock, cacheRegionEnd); - - // Create the epilogue cache data moving op - // If we never wrote to the value, then don't bother copying data out via any method - if (multiCacheInfo.arrayAccessInfo.valueWritten) - { - // Note: onlyReadsAreAccumulates defaults to true, but if no reads are seen don't want to use a CacheReduceOp - // so check that reads occurred and that they were all used for accumulates - if (multiCacheInfo.arrayAccessInfo.valueRead && multiCacheInfo.arrayAccessInfo.onlyReadsAreAccumulates) { if (beginCacheRegionOp.activeBlockCache()) { - rewriter.create(loc, - beginCacheRegionOp.input(), - multiCacheInfo.multiCache, - multiCacheInfo.activeBlockInfo.externalSymbols, - multiCacheInfo.activeBlockInfo.externalSymbols, - rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.lbMaps), - rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.ubMaps), - multiCacheInfo.activeBlockToCacheMap); + rewriter.create(loc, + beginCacheRegionOp.input(), + multiCacheInfo.multiCache, + multiCacheInfo.multiCacheExternalSymbols, + rewriter.getAffineMapArrayAttr(multiCacheInfo.multiCacheLBMaps), + rewriter.getAffineMapArrayAttr(multiCacheInfo.multiCacheUBMaps), + rewriter.getI64ArrayAttr(multiCacheInfo.multiCacheStepSizes), + util::ConvertIndexVectorToArrayAttr(multiCacheInfo.multiCacheLoopIndexIds, rewriter.getContext()), + rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.lbMaps), + 
rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.ubMaps), + multiCacheInfo.multiCacheExternalSymbolsPermutationMap, + multiCacheInfo.activeBlockToCacheMap, + activeBlockTag, + beginCacheRegionOp.thrifty(), + true); // toCache } else { - rewriter.create(loc, cacheAccessContext, beginCacheRegionOp.input()); + rewriter.create(loc, beginCacheRegionOp.input(), cacheAccessContext); } } - else + + // Create mapping ops for each cache active block region associated with this multiCache + CreateCacheMappingRegionHelper(rewriter, beginCacheRegionOp, multiCacheInfo); + + rewriter.setInsertionPoint(triggerLevelBlock, cacheRegionEnd); + + // Create the epilogue cache data moving op + // If we never wrote to the value, then don't bother copying data out via any method + if (multiCacheInfo.arrayAccessInfo.valueWritten) { - if (beginCacheRegionOp.activeBlockCache()) + // Note: onlyReadsAreAccumulates defaults to true, but if no reads are seen don't want to use a CacheReduceOp + // so check that reads occurred and that they were all used for accumulates + if (multiCacheInfo.arrayAccessInfo.valueRead && multiCacheInfo.arrayAccessInfo.onlyReadsAreAccumulates) { - rewriter.create(loc, - beginCacheRegionOp.input(), - multiCacheInfo.multiCache, - multiCacheInfo.activeBlockInfo.externalSymbols, - multiCacheInfo.activeBlockInfo.externalSymbols, - mlir::ValueRange{}, - rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.lbMaps), - rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.ubMaps), - multiCacheInfo.activeBlockToCacheMap, - false); + if (beginCacheRegionOp.activeBlockCache()) + { + rewriter.create(loc, + beginCacheRegionOp.input(), + multiCacheInfo.multiCache, + multiCacheInfo.activeBlockInfo.externalSymbols, + multiCacheInfo.activeBlockInfo.externalSymbols, + rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.lbMaps), + rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.ubMaps), + multiCacheInfo.activeBlockToCacheMap, + activeBlockTag, + beginCacheRegionOp.thrifty()); + } + else + { + rewriter.create(loc, cacheAccessContext, beginCacheRegionOp.input()); + } } else { - rewriter.create(loc, cacheAccessContext, beginCacheRegionOp.input()); + if (beginCacheRegionOp.activeBlockCache()) + { + rewriter.create(loc, + beginCacheRegionOp.input(), + multiCacheInfo.multiCache, + multiCacheInfo.activeBlockInfo.externalSymbols, + multiCacheInfo.activeBlockInfo.externalSymbols, + mlir::ValueRange{}, + rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.lbMaps), + rewriter.getAffineMapArrayAttr(multiCacheInfo.activeBlockInfo.ubMaps), + multiCacheInfo.activeBlockToCacheMap, + false, // toCache : this copy will copy from the cache back to the outer array + activeBlockTag, + beginCacheRegionOp.thrifty(), + false); // skipBarriers : this copy isn't already guarded by barriers, so don't skip them + } + else + { + rewriter.create(loc, cacheAccessContext, beginCacheRegionOp.input()); + } } } } @@ -3685,6 +5015,12 @@ LogicalResult MaxElementCacheRegionOpRewrite::matchAndRewrite(BeginMaxElementCac rewriter.replaceOpWithNewOp(endOp, endOp.regionId()); rewriter.setInsertionPoint(newBlock, newBeginPoint); + auto doubleBufferMemorySpaceOpt = beginMaxElementCacheRegionOp.doubleBufferMemorySpace(); + auto doubleBufferMemorySpace = accera::ir::value::MemorySpace::None; + if (doubleBufferMemorySpaceOpt.hasValue()) + { + doubleBufferMemorySpace = doubleBufferMemorySpaceOpt.getValue(); + } auto newBeginOp = rewriter.create(loc, input, cacheAccessContext, @@ -3694,7 +5030,10 
@@ LogicalResult MaxElementCacheRegionOpRewrite::matchAndRewrite(BeginMaxElementCac beginMaxElementCacheRegionOp.id(), beginMaxElementCacheRegionOp.cacheHierarchyLevel(), true, // activeBlockCache - beginMaxElementCacheRegionOp.dimReorderCache()); + beginMaxElementCacheRegionOp.dimReorderCache(), + beginMaxElementCacheRegionOp.thrifty(), + beginMaxElementCacheRegionOp.doubleBufferCache(), + doubleBufferMemorySpace); // This new cache region op has already been hoisted as high as we want to hoist it newBeginOp->setAttr("hoisted", rewriter.getUnitAttr()); @@ -3859,7 +5198,7 @@ LogicalResult VectorizeAffineForOpConversion::matchAndRewrite(AffineForOp affine if (!erasedBaseLoop) { - util::PromoteIfSingleIteration(rewriter, affineForOp); + (void)util::PromoteIfSingleIteration(rewriter, affineForOp); } rewriter.finalizeRootUpdate(affineForOp); @@ -4117,6 +5456,12 @@ LogicalResult TensorizeAffineForOpConversion::matchAndRewrite(AffineForOp affine return success(); } + // currently only available for rocm target + if (util::ResolveExecutionRuntime(affineForOp) != ExecutionRuntime::ROCM) + { + return failure(); + } + auto tensorizationInfo = GetTensorizationInfo(affineForOp); SmallVector loops; @@ -4128,6 +5473,10 @@ LogicalResult TensorizeAffineForOpConversion::matchAndRewrite(AffineForOp affine for (auto& en : llvm::enumerate(loops)) { auto loop = en.value(); + if (!HasTensorizationInfo(loop)) + { + return failure(); + } if (!loop.hasConstantBounds()) { return failure(); @@ -4136,6 +5485,10 @@ LogicalResult TensorizeAffineForOpConversion::matchAndRewrite(AffineForOp affine { return failure(); } + if (loop.getStep() != 1) + { + return failure(); + } if (loop.getConstantUpperBound() != tensorizationInfo.dim[en.index()]) { return failure(); @@ -4144,122 +5497,244 @@ LogicalResult TensorizeAffineForOpConversion::matchAndRewrite(AffineForOp affine auto innerLoop = loops[2]; // the inner most loop auto innerLoopBodyIter = innerLoop.getBody()->begin(); + auto innerLoopBodyEnd = innerLoop.getBody()->end(); + + std::stack opsToErase; // 1. load from A matrix - if (!isa(innerLoopBodyIter)) + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) { llvm::dbgs() << "While processing " << *innerLoopBodyIter << ". Failed to match the load from A Op\n"; return rewriter.notifyMatchFailure(&*innerLoopBodyIter, "Failed to match the load from A Op"); } auto loadAOp = cast(innerLoopBodyIter); + opsToErase.push(loadAOp); - (void)++innerLoopBodyIter; + innerLoopBodyIter++; // 1. load from B matrix - if (!isa(innerLoopBodyIter)) + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) { llvm::dbgs() << "While processing " << *innerLoopBodyIter << ". Failed to match the load from B Op\n"; return rewriter.notifyMatchFailure(&*innerLoopBodyIter, "Failed to match the load from B Op"); } auto loadBOp = cast(innerLoopBodyIter); + opsToErase.push(loadBOp); - (void)++innerLoopBodyIter; + (void)innerLoopBodyIter++; // 1. muliply A * B - if (!isa(innerLoopBodyIter)) + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) { llvm::dbgs() << "While processing " << *innerLoopBodyIter << ". Failed to match the binary A*C multiplication op\n"; return rewriter.notifyMatchFailure(&*innerLoopBodyIter, "Failed to match the binary A*C multiplication op"); } - auto mulAB = cast(innerLoopBodyIter); + auto mulAB = cast(*innerLoopBodyIter); if (mulAB.predicate() != v::BinaryOpPredicate::MUL) { llvm::dbgs() << "While processing " << *innerLoopBodyIter << ". 
Failed to match the multiplication op\n"; return rewriter.notifyMatchFailure(&*innerLoopBodyIter, "Failed to match the multiplication op"); } + opsToErase.push(mulAB); - (void)++innerLoopBodyIter; + (void)innerLoopBodyIter++; // 4. load C - if (!isa(innerLoopBodyIter)) + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) { llvm::dbgs() << "While processing " << *innerLoopBodyIter << ". Failed to match the load from C Op\n"; return rewriter.notifyMatchFailure(&*innerLoopBodyIter, "Failed to match the load from C Op"); } auto loadCOp = cast(innerLoopBodyIter); + opsToErase.push(loadCOp); - (void)++innerLoopBodyIter; + (void)innerLoopBodyIter++; // 1. add A * B + C - if (!isa(innerLoopBodyIter)) + if (innerLoopBodyIter == innerLoopBodyEnd || !isa(*innerLoopBodyIter)) { llvm::dbgs() << "While processing " << *innerLoopBodyIter << ". Failed to match the binary C accumulation op\n"; return rewriter.notifyMatchFailure(&*innerLoopBodyIter, "Failed to match the binary C accumulation op"); } - auto accumC = cast(innerLoopBodyIter); + auto accumC = cast(*innerLoopBodyIter); if (accumC.predicate() != v::BinaryOpPredicate::ADD) { llvm::dbgs() << "While processing " << accumC << ". Failed to match the accumulation op\n"; return rewriter.notifyMatchFailure(accumC, "Failed to match the accumulation op"); } + opsToErase.push(accumC); - (void)++innerLoopBodyIter; + (void)innerLoopBodyIter++; // 4. store C - if (!isa(innerLoopBodyIter)) + auto storeC = cast(*innerLoopBodyIter); + if (innerLoopBodyIter == innerLoopBodyEnd || !storeC) { llvm::dbgs() << "While processing " << *innerLoopBodyIter << ". Failed to match the store into C\n"; return rewriter.notifyMatchFailure(&*innerLoopBodyIter, "Failed to match the store into C"); } - [[maybe_unused]] auto storeCOp = cast(innerLoopBodyIter); + opsToErase.push(storeC); + + (void)innerLoopBodyIter++; - (void)++innerLoopBodyIter; - // Ignore the yeild op at the end - if (isa(innerLoopBodyIter)) + // for some reason there sometimes is an extra AffineStoreOp / AffineLoadOp pair being redundantly generated, we need to ignore those + if (innerLoopBodyIter != innerLoopBodyEnd && isa(*innerLoopBodyIter)) { - (void)++innerLoopBodyIter; + auto loadOp = cast(*innerLoopBodyIter); // TODO: check this is still a load from the C matrix + opsToErase.push(loadOp); + (void)innerLoopBodyIter++; + if (innerLoopBodyIter != innerLoopBodyEnd && isa(*innerLoopBodyIter)) + { + auto storeOp = cast(*innerLoopBodyIter); // TODO: check this is still a store into the C matrix + opsToErase.push(storeOp); + (void)innerLoopBodyIter++; + } } - if (innerLoopBodyIter != innerLoop.getBody()->end()) + + // Ignore the yield op at the end + if (innerLoopBodyIter != innerLoopBodyEnd && isa(*innerLoopBodyIter)) { - llvm::dbgs() << "While processing " << *innerLoopBodyIter << ". The store into C was not the last instruction\n"; - return rewriter.notifyMatchFailure(&*innerLoopBodyIter, "The store into C was not the last instruction"); + (void)innerLoopBodyIter++; } - - for (auto loop : loops) + if (innerLoopBodyIter != innerLoopBodyEnd) { - loop.setConstantUpperBound(1); + LLVM_DEBUG(llvm::dbgs() << "While processing " << *innerLoopBodyIter << ". 
The store into C was not the last instruction\n"; + llvm::dbgs() << "affine for : " << *affineForOp << "\n"; + llvm::dbgs() << "current inst " << *innerLoopBodyIter << "\n"); + return rewriter.notifyMatchFailure(&*innerLoopBodyIter, "The store into C was not the last instruction"); } mlir::OpBuilder::InsertionGuard guard(rewriter); + rewriter.setInsertionPoint(innerLoop.getBody(), innerLoop.getBody()->getTerminator()->getIterator()); - rewriter.setInsertionPoint(innerLoop.getBody()->getTerminator()); - - // initialize the accum C vector - auto loc = affineForOp.getLoc(); + rewriter.startRootUpdate(affineForOp); + auto loc = innerLoop.getLoc(); auto ctx = rewriter.getContext(); - auto cTy = loadCOp.getMemRefType(); - auto cElementTy = cTy.getElementType(); - auto zero = rewriter.create(loc, rewriter.getZeroAttr(cElementTy)); - - // TODO: Update TensorizationInfo to provide values for all the "4" literals below - auto cVec = rewriter.create(loc, VectorType::get({ 4 }, cElementTy), zero); - v::MFMAComputeOp mfmaComputeOp; - for (int ii = 0; ii < 4; ii++) - { - auto loadAAccessMap = loadAOp.getAffineMap(); - auto shiftALoad = mlir::AffineMap::get( - loadAAccessMap.getNumDims(), - loadAAccessMap.getNumSymbols(), - { loadAAccessMap.getResult(0), loadAAccessMap.getResult(1) + 4 * ii }, - ctx); - auto loadAElem = rewriter.create(loc, loadAOp.getMemRef(), shiftALoad, loadAOp.indices()); - - auto loadBAccessMap = loadBOp.getAffineMap(); - auto shiftBLoad = mlir::AffineMap::get( - loadBAccessMap.getNumDims(), - loadBAccessMap.getNumSymbols(), - { loadBAccessMap.getResult(0) + 4 * ii, loadBAccessMap.getResult(1) }, - ctx); - auto loadBElem = rewriter.create(loc, loadBOp.getMemRef(), shiftBLoad, loadBOp.indices()); - mfmaComputeOp = rewriter.create(loc, cVec.getType(), loadAElem.result(), loadBElem.result(), ii != 0 ? 
mfmaComputeOp.res() : cVec.vector()); - } - - (void)mlir::promoteIfSingleIteration(affineForOp); + + auto warpSize = rewriter.create(loc, util::ResolveWarpSize(affineForOp).value()); + auto four = rewriter.create(loc, 4); + auto sixteen = rewriter.create(loc, tensorizationInfo.dim.back()); + auto tidX = rewriter.create(loc, rewriter.getIndexType(), "x"); + auto tidY = rewriter.create(loc, rewriter.getIndexType(), "y"); + auto bidX = rewriter.create(loc, rewriter.getIndexType(), "x"); + auto bidY = rewriter.create(loc, rewriter.getIndexType(), "y"); + auto bdimX = rewriter.create(loc, rewriter.getIndexType(), "x"); + auto blockTid = rewriter.create(loc, tidX, rewriter.create(loc, tidY, bdimX)); + auto warpId = rewriter.create(loc, blockTid, warpSize); + auto warpX = rewriter.create(loc, warpId, four); + auto warpY = rewriter.create(loc, warpId, four); + auto rowOffset = rewriter.create(loc, + rewriter.create(loc, warpY, sixteen), + rewriter.create(loc, bidX, warpSize)); + auto colOffset = rewriter.create(loc, + rewriter.create(loc, warpX, sixteen), + rewriter.create(loc, bidY, warpSize)); + + std::vector mfmaMatrixShape{ tensorizationInfo.dim[0], tensorizationInfo.dim[1], tensorizationInfo.dim[2] }; + auto getMatrixTypeOfMemref = [=](mlir::MemRefType memrefType, const std::vector& shape, llvm::StringRef operand) -> v::MFMAMatrixType { + return v::MFMAMatrixType::get(shape, memrefType.getElementType(), operand); + }; + + auto loadMatrixOp = [&](AffineLoadOp loadOp, StringRef kind) { + auto mfmaMatrixType = getMatrixTypeOfMemref(loadOp.getMemRefType(), mfmaMatrixShape, kind); + if (kind == "AOp") + { + [[maybe_unused]] auto d0 = rewriter.getAffineDimExpr(0); + auto d1 = rewriter.getAffineDimExpr(1); + auto rowOffsetSym = rewriter.getAffineSymbolExpr(0); + auto offsetMap = AffineMap::get(2, 1, { rowOffsetSym, d1 }, ctx); + SmallVector loadOpOperands(loadOp.getMapOperands()); + loadOpOperands.push_back(rowOffset); + // llvm::dbgs() << "AOp with offset " << offsetMap << " with load " + // << loadOp.getAffineMap() << " and composed = " << offsetMap.compose(loadOp.getAffineMap()) << "\n"; + return rewriter.create(loc, + mfmaMatrixType, + loadOp.memref(), + offsetMap.compose(loadOp.getAffineMap()), + loadOpOperands); + } + else if (kind == "BOp") + { + auto d0 = rewriter.getAffineDimExpr(0); + [[maybe_unused]] auto d1 = rewriter.getAffineDimExpr(1); + auto colOffsetSym = rewriter.getAffineSymbolExpr(0); + auto offsetMap = AffineMap::get(2, 1, { d0, colOffsetSym }, ctx); + SmallVector loadOpOperands(loadOp.getMapOperands()); + loadOpOperands.push_back(colOffset); + // llvm::dbgs() << "BOp with offset " << offsetMap << " with load " + // << loadOp.getAffineMap() << " and composed = " << offsetMap.compose(loadOp.getAffineMap()) << "\n"; + return rewriter.create(loc, + mfmaMatrixType, + loadOp.memref(), + offsetMap.compose(loadOp.getAffineMap()), + loadOpOperands); + } + else if (kind == "COp") + { + [[maybe_unused]] auto d0 = rewriter.getAffineDimExpr(0); + [[maybe_unused]] auto d1 = rewriter.getAffineDimExpr(1); + auto rowOffsetSym = rewriter.getAffineSymbolExpr(0); + auto colOffsetSym = rewriter.getAffineSymbolExpr(1); + auto offsetMap = AffineMap::get(2, 2, { rowOffsetSym, colOffsetSym }, ctx); + SmallVector loadOpOperands(loadOp.getMapOperands()); + loadOpOperands.push_back(rowOffset); + loadOpOperands.push_back(colOffset); + // llvm::dbgs() << "COp with offset " << offsetMap << " with load " + // << loadOp.getAffineMap() << " and composed = " << offsetMap.compose(loadOp.getAffineMap()) << "\n"; + 
return rewriter.create(loc, + mfmaMatrixType, + loadOp.memref(), + offsetMap.compose(loadOp.getAffineMap()), + loadOpOperands); + } + else + { + llvm::report_fatal_error("Unknown kind of matrix"); + } + }; + + auto StoreOp = [&](AffineStoreOp storeOp, Value value) { + [[maybe_unused]] auto d0 = rewriter.getAffineDimExpr(0); + [[maybe_unused]] auto d1 = rewriter.getAffineDimExpr(1); + auto rowOffsetSym = rewriter.getAffineSymbolExpr(0); + auto colOffsetSym = rewriter.getAffineSymbolExpr(1); + auto offsetMap = AffineMap::get(2, 2, { rowOffsetSym, colOffsetSym }, ctx); + SmallVector storeOpOperands(storeOp.getMapOperands()); + storeOpOperands.push_back(rowOffset); + storeOpOperands.push_back(colOffset); + // llvm::dbgs() << "COpOut with offset " << offsetMap << " with load " + // << storeOp.getAffineMap() << " and composed = " << offsetMap.compose(storeOp.getAffineMap()) << "\n"; + return rewriter.create(loc, + value, + storeOp.memref(), + offsetMap.compose(storeOp.getAffineMap()), + storeOpOperands); + }; + + auto aMfmaMatrix = loadMatrixOp(loadAOp, "AOp"); + auto bMfmaMatrix = loadMatrixOp(loadBOp, "BOp"); + auto cMfmaMatrix = loadMatrixOp(loadCOp, "COp"); + + auto destMfmaMatrix = rewriter.create(loc, cMfmaMatrix.getType(), aMfmaMatrix, bMfmaMatrix, cMfmaMatrix); + + [[maybe_unused]] auto mfmaStoreOp = StoreOp(storeC, destMfmaMatrix); + + while (!opsToErase.empty()) + { + auto eraseOp = opsToErase.top(); + if (eraseOp->use_empty()) + { + rewriter.eraseOp(eraseOp); + } + opsToErase.pop(); + } + + for (auto loop : loops) + { + // change loop step so that the loop runs once + loop.setConstantUpperBound(1); + + // remove the tensorization annotation + RemoveTensorizationInfo(loop); + } + RemoveTensorizationInfo(affineForOp); + (void)util::PromoteIfSingleIteration(rewriter, affineForOp); + rewriter.finalizeRootUpdate(affineForOp); return success(); } @@ -4667,8 +6142,10 @@ LogicalResult HoistScalingToCacheReduceRewrite::matchAndRewrite(mlir::AffineStor cacheReduceOpAdaptor.ubOperands(), cacheReduceOpAdaptor.lbMaps(), cacheReduceOpAdaptor.ubMaps(), - cacheReduceOpAdaptor.activeBlockToCacheMap(), - scaleValues); + activeBlockCacheReduceOp.activeBlockToCacheMap(), + scaleValues, + activeBlockCacheReduceOp.activeBlockTag(), + activeBlockCacheReduceOp.thrifty()); } } @@ -4732,7 +6209,7 @@ LogicalResult OutOfBoundsLoadRewriteCommon(mlir::AffineLoadOp affineLoadOp, Patt mlir::Value tmpBuffer; if (execTarget == v::ExecutionTarget::GPU) { - tmpElementType = mlir::MemRefType::get(tmpBufferShape, loadResultType, {}, static_cast(v::MemorySpace::Local)); + tmpElementType = mlir::MemRefType::get(tmpBufferShape, loadResultType, {}, static_cast(v::MemorySpace::Private)); tmpBuffer = rewriter.create(loc, tmpElementType, llvm::None); } else @@ -4926,6 +6403,82 @@ LogicalResult ConvertValueStoresToAffineRewrite::matchAndRewrite(v::StoreOp stor return ConvertStoreToAffine(rewriter, storeOp); } +LogicalResult DelayedMappingRegionOpRewrite::matchAndRewrite(DelayedMappingRegionOp mappingRegionOp, PatternRewriter& rewriter) const +{ + auto fromValue = mappingRegionOp.from(); + auto toValue = mappingRegionOp.to(); + mappingRegionOp.region().walk([&](mlir::Operation* op) { + op->replaceUsesOfWith(fromValue, toValue); + }); + util::InlineAllRegionOpsBeforeOp(rewriter, mappingRegionOp.region(), mappingRegionOp); + rewriter.eraseOp(mappingRegionOp); + return success(); +} + +// Returns the second loop, which goes from [n, end), and changes the given loop to go from [begin, n) +mlir::AffineForOp 
SegmentLoopAtIteration(mlir::AffineForOp forOp, int64_t n) +{ + // To segment the loop into two loops at the n'th iteration + // 1) Compute the loop IV value at the n'th iteration + // 2) Clone the loop + // 3) Update the original loop's end value to be the n'th iteration value + // 4) Update the cloned loop's begin value to be the n'th iteration value + + // Position a builder in the block containing this forOp just after the loop + auto iter = forOp->getIterator(); + iter++; + auto loopParentBlock = forOp->getBlock(); + mlir::OpBuilder builder(loopParentBlock, iter); + + auto constantTripCountOpt = mlir::getConstantTripCount(forOp); + + assert(constantTripCountOpt.hasValue() && "AffineForOps in Accera loop nests must have constant trip counts"); + auto constantTripCount = constantTripCountOpt.getValue(); + if (constantTripCount < n) + { + // Can't unswitch more iterations than this loop has, so don't bother unswitching + return nullptr; + } + + assert(forOp.hasConstantBounds() && "Only constant-bounded AffineForOps are supported for unswitching"); + + auto nthIterValue = forOp.getConstantLowerBound() + (forOp.getStep() * n); + + auto segmentedSecondLoop = mlir::dyn_cast(builder.clone(*(forOp.getOperation()))); + forOp.setConstantUpperBound(nthIterValue); + segmentedSecondLoop.setConstantLowerBound(nthIterValue); + + return segmentedSecondLoop; +} + +LogicalResult LoopUnswitchingOpRewrite::matchAndRewrite(mlir::AffineForOp forOp, PatternRewriter& rewriter) const +{ + if (forOp->hasAttrOfType(UnswitchSuffixItersName) || + forOp->hasAttrOfType(UnswitchPrefixItersName)) + { + // Unswitch the last n iterations first if a suffix unswitch is desired + if (auto unswitchSuffix = forOp->getAttrOfType(UnswitchSuffixItersName)) + { + forOp->removeAttr(UnswitchSuffixItersName); + + auto constantTripCountOpt = mlir::getConstantTripCount(forOp); + assert(constantTripCountOpt.hasValue() && "AffineForOps in Accera loop nests must have constant trip counts"); + auto constantTripCount = constantTripCountOpt.getValue(); + + int64_t iter = constantTripCount - unswitchSuffix.getInt(); + [[maybe_unused]] auto secondLoop = SegmentLoopAtIteration(forOp, iter); + } + + // Now unswitch the first n iterations if a prefix unswitch is desired. 
Note: requesting both is also supported + if (auto unswitchPrefix = forOp->getAttrOfType(UnswitchPrefixItersName)) + { + forOp->removeAttr(UnswitchPrefixItersName); + [[maybe_unused]] auto secondLoop = SegmentLoopAtIteration(forOp, unswitchPrefix.getInt()); + } + } + return success(); +} + void ExecutionPlanCacheRegionLoweringPass::runOnOperation() { auto operation = getOperation(); @@ -4993,24 +6546,14 @@ void ExecutionPlanParallelizationPass::runOnOperation() void ExecutionPlanTensorizationPass::runOnOperation() { + auto* ctx = &getContext(); auto operation = getOperation(); - mlir::OpBuilder builder(operation); - ConversionTarget target(getContext()); - - target.addLegalDialect(); - target.addDynamicallyLegalOp([&](AffineForOp op) { - // An AffineForOp is legal if it does not have the ExecutionPlan tensorize attributes - return !HasTensorizationInfo(op); - }); - OwningRewritePatternList patterns(&getContext()); + OwningRewritePatternList patterns(ctx); accera::transforms::executionPlan::populateExecutionPlanTensorizePatterns(patterns); - (void)applyPatternsAndFoldGreedily(operation, std::move(patterns)); + if (failed(applyPatternsAndFoldGreedily(operation, std::move(patterns)))) + return signalPassFailure(); } void ExecutionPlanMakeCacheLoweringPass::runOnFunction() @@ -5124,6 +6667,13 @@ void populateExecutionPlanMakeCachePatterns(mlir::OwningRewritePatternList& patt patterns.insert(patterns.getContext()); } +void populateExecutionPlanThriftyCachePatterns(mlir::OwningRewritePatternList& patterns) +{ + patterns.insert(patterns.getContext()); + patterns.insert(patterns.getContext()); + patterns.insert(patterns.getContext()); +} + void populateExecutionPlanMultiCachePatterns(mlir::OwningRewritePatternList& patterns) { patterns.insert(patterns.getContext()); @@ -5138,6 +6688,16 @@ void populateExecutionPlanCopyReducePatterns(mlir::OwningRewritePatternList& pat CacheZeroOpRewrite>(patterns.getContext()); } +void populateExecutionPlanDelayedMappingPatterns(mlir::OwningRewritePatternList& patterns) +{ + patterns.insert(patterns.getContext()); +} + +void populateExecutionPlanLoopUnswitchingPatterns(mlir::OwningRewritePatternList& patterns) +{ + patterns.insert(patterns.getContext()); +} + void populateExecutionPlanMaxElementCacheRegionPatterns(mlir::OwningRewritePatternList& patterns) { patterns.insert(patterns.getContext()); diff --git a/accera/transforms/src/gpu/AcceraToGPUPass.cpp b/accera/transforms/src/gpu/AcceraToGPUPass.cpp index c226ba33..77408b62 100644 --- a/accera/transforms/src/gpu/AcceraToGPUPass.cpp +++ b/accera/transforms/src/gpu/AcceraToGPUPass.cpp @@ -7,6 +7,7 @@ #include "gpu/AcceraToGPUPass.h" #include "AcceraPasses.h" +#include "ir/include/value/ValueDialect.h" #include "ir/include/value/ValueEnums.h" #include "ir/include/value/ValueMFMAOp.h" @@ -14,34 +15,46 @@ #include -#include - #include #include #include #include +#include #include #include +#include #include +#include #include #include #include #include +#include #include +#include +#include #include +#include #include #include #include #include +#include +#include +#include + #include -#include + +#define DEBUG_TYPE "accera-to-gpu" using namespace mlir; using accera::transforms::populateAcceraToNVVMPatterns; using accera::transforms::populateAcceraToROCDLPatterns; using accera::transforms::populateAcceraToSPIRVPatterns; +using accera::transforms::populateGPUSimplificationPatterns; +namespace ir = accera::ir; namespace utilir = accera::ir::util; namespace vir = accera::ir::value; @@ -56,7 +69,7 @@ const char 
kPrivateMemoryVarPrefix[] = "__private_mem__"; /// Returns true if the allocations of type `t` can be lowered to SPIR-V. static bool isSPIRVFunctionAllocationSupported(MemRefType t) { - // Currently only support workgroup local memory allocations with static + // Currently only support workgroup private memory allocations with static // shape and int or float or vector of int or float element type. if (!(t.hasStaticShape() && SPIRVTypeConverter::getMemorySpaceForStorageClass(spirv::StorageClass::Function) == t.getMemorySpaceAsInt())) return false; @@ -68,57 +81,23 @@ static bool isSPIRVFunctionAllocationSupported(MemRefType t) static std::optional getGPURuntimeTarget(mlir::Operation* op) { - // TODO: Add tests, verify, enable generic version -#if 1 - return vir::ExecutionRuntime::Rocm; -#else - auto target = utilir::ResolveExecutionTarget(op); - if (!target || target != vir::ExecutionTarget::GPU) - { - return std::nullopt; - } return utilir::ResolveExecutionRuntime(op); -#endif -} - -static std::optional getRuntimeTarget(mlir::ModuleOp* op) -{ - auto funOps = op->getOps(); - for (auto funOp : funOps) - { - auto runtime = getGPURuntimeTarget(funOp); - if (runtime) - { - return runtime; - } - } - return std::nullopt; } template static bool hasRuntimeTarget(mlir::Operation* op) { - auto runtime = getGPURuntimeTarget(op); - if (!runtime) - { - return false; - } - return *runtime == Runtime; + auto runtime = getGPURuntimeTarget(op).value_or(vir::ExecutionRuntime::NONE); + return runtime == Runtime; } -static bool hasVulkanRuntimeTarget(mlir::Operation* op) +int dimIndexToInteger(llvm::StringRef dim) { - return hasRuntimeTarget(op); -} - -static bool hasNVVMRuntimeTarget(mlir::Operation* op) -{ - return hasRuntimeTarget(op); -} - -static bool hasROCDLRuntimeTarget(mlir::Operation* op) -{ - return hasRuntimeTarget(op); + return ::llvm::StringSwitch(dim) + .Case("x", 0) + .Case("y", 1) + .Case("z", 2) + .Default(-1); } struct PrivateAllocToSPIRVConversion : public OpConversionPattern @@ -130,11 +109,6 @@ struct PrivateAllocToSPIRVConversion : public OpConversionPattern operands, ConversionPatternRewriter& rewriter) const final { - if (!hasVulkanRuntimeTarget(op)) - { - return failure(); - } - // cf mlir/lib/Conversion/StandardToSPIRV/ConvertStandardToSPIRV.cpp MemRefType allocType = op.getType(); @@ -158,10 +132,6 @@ struct PrivateDeallocToSPIRVConversion final : public OpConversionPattern operands, ConversionPatternRewriter& rewriter) const final { - if (!hasVulkanRuntimeTarget(op)) - { - return failure(); - } // cf mlir/lib/Conversion/StandardToSPIRV/ConvertStandardToSPIRV.cpp @@ -181,10 +151,6 @@ struct EarlyReturnToSPIRVReturnPattern : public OpConversionPattern operands, ConversionPatternRewriter& rewriter) const final { - if (!hasVulkanRuntimeTarget(op)) - { - return failure(); - } if (operands.empty()) { @@ -205,10 +171,6 @@ struct EarlyReturnToGPUReturnPattern : public OpRewritePattern(op, op->getOperands()); @@ -216,6 +178,69 @@ struct EarlyReturnToGPUReturnPattern : public OpRewritePattern +{ + CreateDeviceFuncLauncherPairPattern(vir::ExecutionRuntime targetRuntime, MLIRContext* context, PatternBenefit benefit = 1) : + OpRewritePattern(context, benefit), _target(targetRuntime) {} + + LogicalResult matchAndRewrite(FuncOp op, PatternRewriter& rewriter) const final + { + if (!op->hasAttr(ir::HeaderDeclAttrName) || + !op->hasAttr(ir::RawPointerAPIAttrName)) return failure(); + + auto fnBodyOpIterator = op.front().without_terminator(); + if (!llvm::hasSingleElement(fnBodyOpIterator)) return 
failure(); + + if (auto callOp = dyn_cast(fnBodyOpIterator.begin())) + { + auto calleeFnOp = dyn_cast_or_null(SymbolTable::lookupNearestSymbolFrom(op, callOp.callee())); + if (!calleeFnOp) return failure(); + + auto calleeFnBodyOpIterator = calleeFnOp.front().back().getReverseIterator(); + assert(calleeFnBodyOpIterator->hasTrait()); + + ++calleeFnBodyOpIterator; + if (auto launchOp = dyn_cast(*calleeFnBodyOpIterator)) + { + auto launchedGPUFnOp = dyn_cast_or_null(SymbolTable::lookupNearestSymbolFrom(calleeFnOp, launchOp.kernel())); + if (!launchedGPUFnOp) return failure(); + + auto gpuTargetFuncName = op.getName().str() + "__gpu__"; + if (SymbolTable::lookupNearestSymbolFrom(launchedGPUFnOp, gpuTargetFuncName)) return failure(); + + auto context = rewriter.getContext(); + auto execRuntimeAttr = vir::ExecutionRuntimeAttr::get(context, _target); + auto execTargetAttr = vir::ExecutionTargetAttr::get(context, vir::ExecutionTarget::GPU); + launchedGPUFnOp->setAttr(vir::ValueModuleOp::getExecRuntimeAttrName(), execRuntimeAttr); + launchedGPUFnOp->setAttr(vir::ValueFuncOp::getExecTargetAttrName(), execTargetAttr); + launchedGPUFnOp->setAttr(ir::HeaderDeclAttrName, rewriter.getUnitAttr()); + launchedGPUFnOp->setAttr(ir::RawPointerAPIAttrName, rewriter.getUnitAttr()); + + launchedGPUFnOp.setName(gpuTargetFuncName); + auto kernelSymAttr = launchOp.kernel(); + auto root = kernelSymAttr.getRootReference(); + launchOp.kernelAttr(rewriter.getSymbolRefAttr(root, rewriter.getSymbolRefAttr(gpuTargetFuncName))); + + rewriter.updateRootInPlace(op, [&] { + op->setAttr(vir::ValueModuleOp::getExecRuntimeAttrName(), execRuntimeAttr); + }); + + return success(); + } + } + + return failure(); + } + +private: + vir::ExecutionRuntime _target; +}; + struct ValueBarrierToSPIRVBarrierConversion final : public OpConversionPattern { ValueBarrierToSPIRVBarrierConversion(SPIRVTypeConverter& typeConverter, MLIRContext* context) : @@ -224,10 +249,6 @@ struct ValueBarrierToSPIRVBarrierConversion final : public OpConversionPattern, ConversionPatternRewriter& rewriter) const final { - if (!hasVulkanRuntimeTarget(op)) - { - return failure(); - } switch (op.scope()) { case vir::BarrierScope::Block: @@ -274,174 +295,417 @@ struct ValueBarrierToGPUBarrierConversion final : public OpRewritePattern +struct ValueMFMALoadOpToRocDLConversion final : public OpConversionPattern { - using OpRewritePattern::OpRewritePattern; + using OpConversionPattern::OpConversionPattern; - LogicalResult matchAndRewrite(vir::MFMALoadMatrixOp op, PatternRewriter& rewriter) const final + LogicalResult matchAndRewrite(vir::MFMALoadOp op, + ArrayRef operands, + ConversionPatternRewriter& rewriter) const final { - using namespace accera::utilities; + auto ctx = rewriter.getContext(); + auto loc = op.getLoc(); + vir::MFMALoadOp::Adaptor MFMALoadOpAdaptor(operands, op->getAttrDictionary()); + auto memref = MFMALoadOpAdaptor.memref(); + auto mfmaMatrixType = op.getMFMAMatrixType(); + auto mfmaMatrixOperand = mfmaMatrixType.getOperand(); + auto elementType = mfmaMatrixType.getElementType(); + + auto leadingDim = mfmaMatrixType.getLeadingDim(); + if (leadingDim != 16) // 16x16x4 + { + return rewriter.notifyMatchFailure(op, "unhandled matrix shape"); + } - throw LogicException(LogicExceptionErrors::notImplemented); + mlir::OpBuilder::InsertionGuard guard(rewriter); + rewriter.setInsertionPoint(op); - if (!hasROCDLRuntimeTarget(op)) + const auto warpSize = 64; + + auto strideOffset = rewriter.getAffineSymbolExpr(0); + auto threadIdxX = 
rewriter.getAffineSymbolExpr(0); + auto threadIdxY = rewriter.getAffineSymbolExpr(1); + auto blockDimX = rewriter.getAffineSymbolExpr(2); + auto blockTid = threadIdxX + threadIdxY * blockDimX; + auto warpTid = blockTid % warpSize; + auto wmmaM = leadingDim; + auto m = warpTid % wmmaM; + auto ks = warpTid.floorDiv(wmmaM); + + auto vecSize = 4; + auto vecTy = mlir::VectorType::get({ vecSize }, elementType); + // For AOp load from the input memref with a column stride of 4 + // + // for AOp this transformation is equivalent to: + // float4 result; + // memrefView = &memred[loadOperands] + // for (int i = 0; i < 4; i++) { + // result[i] = memrefView[m, ks + 4*i]; + // } + //////////////////////////////////////// + // For BOp load from the input memref with a row stride of 4 + // + // for BOp this transformation is equivalent to: + // float4 result; + // memrefView = &memred[loadOperands] + // for (int i = 0; i < 4; i++) { + // result[i] = memrefView[ks + 4*i, m]; + // } + // + //////////////////////////////////////// + // For COp load + // + // for COp this transformation is equivalent to: + // float4 result; + // memrefView = &memred[loadOperands] + // for (int i = 0; i < 4; i++) { + // result[i] = memrefView[ks * 4 + i, m]; + // } + // + + auto d0 = rewriter.getAffineDimExpr(0); + auto d1 = rewriter.getAffineDimExpr(1); + auto loadAffineMap = MFMALoadOpAdaptor.map().getValue(); // [d0, d1, d2, sa, sb] + auto offsetAOpMap = AffineMap::get(2, 3, { d0 + m, d1 + ks }, ctx); // [d0, d1, sx, sy, sz] + auto strideAOpMap = AffineMap::get(2, 1, { d0, d1 + strideOffset * 4 }, ctx); // [d0, d1, s0] + auto offsetBOpMap = AffineMap::get(2, 3, { d0 + ks, d1 + m }, ctx); // [d0, d1, sx, sy, sz] + auto strideBOpMap = AffineMap::get(2, 1, { d0 + strideOffset * 4, d1 }, ctx); // [d0, d1, s0] + auto offsetCOpMap = AffineMap::get(2, 3, { d0 + ks * 4, d1 + m }, ctx); // [d0, d1, sx, sy, sz] + auto strideCOpMap = AffineMap::get(2, 1, { d0 + strideOffset, d1 }, ctx); // [d0, d1, s0] + auto matrixLayoutMap = ::llvm::StringSwitch(mfmaMatrixOperand.str()) + .Case("AOp", strideAOpMap.compose(offsetAOpMap)) + .Case("BOp", strideBOpMap.compose(offsetBOpMap)) + .Case("COp", strideCOpMap.compose(offsetCOpMap)) + .Default(/*this is really an error */ AffineMap()); + auto composedMap = matrixLayoutMap.compose(loadAffineMap); // [d0, d1, d2, s0, sx, sy, sz, sa, sb] + + LLVM_DEBUG(llvm::dbgs() << "op: " << *op << "\n" + << "loadAffineMap: " << loadAffineMap << "\n" + << "offsetAOpMap: " << offsetAOpMap << "\n" + << "strideAOpMap: " << strideAOpMap << "\n" + << "offsetBOpMap: " << offsetBOpMap << "\n" + << "strideBOpMap: " << strideBOpMap << "\n" + << "offsetCOpMap: " << offsetCOpMap << "\n" + << "strideCOpMap: " << strideCOpMap << "\n" + << "matrixLayoutMap: " << matrixLayoutMap << "\n" + << "composedMap: " << composedMap << "\n" + << "simplify(composedMap): " << simplifyAffineMap(composedMap) << "\n"); + auto indices = MFMALoadOpAdaptor.indices(); + std::vector mapOperands; + for (size_t i = 0; i < loadAffineMap.getNumDims(); i++) { - return failure(); + mapOperands.push_back(indices[i]); + } + mapOperands.push_back(rewriter.create(loc, 0)); + mapOperands.push_back(rewriter.create(loc, rewriter.getIndexType(), "x")); + mapOperands.push_back(rewriter.create(loc, rewriter.getIndexType(), "y")); + mapOperands.push_back(rewriter.create(loc, rewriter.getIndexType(), "x")); + for (size_t i = loadAffineMap.getNumDims(); i < loadAffineMap.getNumInputs(); i++) + { + mapOperands.push_back(indices[i]); } - auto loc = op.getLoc(); - 
auto memref = op.srcMemref(); - auto memrefType = memref.getType().cast(); - auto shape = memrefType.getShape(); - auto elementType = memrefType.getElementType(); - // TODO: Literal constants should be provided by a helper struct (TensorizationInfo?) - auto vecSize = shape[0] == 16 ? 4 : 16; + auto zero = rewriter.create(loc, elementType, rewriter.getZeroAttr(elementType)); + mlir::Value vec = rewriter.create(loc, vecTy, zero); - auto vecTy = mlir::VectorType::get({ vecSize }, elementType); + auto i32Ty = rewriter.getIntegerType(32); + auto loop = rewriter.replaceOpWithNewOp(op, 0, 4, 1, vec); + auto loopBuilder = utilir::MakeBodyBuilder(loop); + auto inductionVar = loop.getInductionVar(); + auto regionIterArg = loop.getRegionIterArgs()[0]; + auto laneIndex = loopBuilder.create(loc, inductionVar, i32Ty); + mapOperands[loadAffineMap.getNumDims()] = inductionVar; // we override the strideOffset symbol with the current index value + + LLVM_DEBUG(llvm::dbgs() << "mapOperands: [" + << "\n"; + for (auto op + : mapOperands) { + llvm::dbgs() << " " << op << "\n"; + } llvm::dbgs() + << "]\n"); + + auto mappedOperands = utilir::MultiDimAffineApply(loopBuilder, loc, composedMap, mapOperands); + auto load = loopBuilder.create(loc, memref, mappedOperands); + vec = loopBuilder.create(loc, load, regionIterArg, laneIndex); + loopBuilder.create(loc, ValueRange{ vec }); + + return success(); + } +}; + +struct ValueMFMAStoreOpToRocDLConversion final : public OpConversionPattern +{ + using OpConversionPattern::OpConversionPattern; + + LogicalResult matchAndRewrite(vir::MFMAStoreOp op, + ArrayRef operands, + ConversionPatternRewriter& rewriter) const final + { + auto ctx = rewriter.getContext(); + auto loc = op.getLoc(); + vir::MFMAStoreOp::Adaptor mfmaStoreOpAdaptor(operands, op->getAttrDictionary()); + auto value = mfmaStoreOpAdaptor.value(); + auto memref = mfmaStoreOpAdaptor.memref(); + auto indices = mfmaStoreOpAdaptor.indices(); + auto mfmaMatrixType = op.getMFMAMatrixType(); + + auto leadingDim = mfmaMatrixType.getLeadingDim(); + if (leadingDim != 16) // 16x16x4 + { + return rewriter.notifyMatchFailure(op, "unhandled matrix shape"); + } + + const auto warpSize = utilir::ResolveWarpSize(op).value(); + + auto d0 = rewriter.getAffineDimExpr(0); + auto d1 = rewriter.getAffineDimExpr(1); + auto strideOffset = rewriter.getAffineSymbolExpr(0); + auto threadIdxX = rewriter.getAffineSymbolExpr(1); + auto threadIdxY = rewriter.getAffineSymbolExpr(2); + auto blockDimX = rewriter.getAffineSymbolExpr(3); + auto blockTid = threadIdxY * blockDimX + threadIdxX; + auto warpTid = blockTid % warpSize; + auto wmmaM = leadingDim; + auto m = warpTid % wmmaM; + auto ks = warpTid.floorDiv(wmmaM); + auto offsetMap = AffineMap::get(2, 4, { d0 + ks * 4 + strideOffset, d1 + m }, ctx); + + auto storeAffineMap = op.getAffineMap(); + auto composedMap = offsetMap.compose(storeAffineMap); + + std::vector mapOperands; + for (size_t i = 0; i < storeAffineMap.getNumDims(); i++) + { + mapOperands.push_back(indices[i]); + } + mapOperands.push_back(rewriter.create(loc, 0)); + mapOperands.push_back(rewriter.create(loc, rewriter.getIndexType(), "x")); + mapOperands.push_back(rewriter.create(loc, rewriter.getIndexType(), "y")); + mapOperands.push_back(rewriter.create(loc, rewriter.getIndexType(), "x")); + for (size_t i = storeAffineMap.getNumDims(); i < storeAffineMap.getNumInputs(); i++) + { + mapOperands.push_back(indices[i]); + } auto i32Ty = rewriter.getIntegerType(32); - Value zero = rewriter.create(loc, i32Ty, 
rewriter.getZeroAttr(i32Ty)); + auto loop = rewriter.replaceOpWithNewOp(op, 0, 4, 1); + auto loopBuilder = utilir::MakeBodyBuilder(loop); + auto inductionVar = loop.getInductionVar(); + auto laneIndex = loopBuilder.create(loc, inductionVar, i32Ty); + mapOperands[storeAffineMap.getNumDims()] = inductionVar; // we override the strideOffset symbol with the current index value + auto mappedOperands = utilir::MultiDimAffineApply(loopBuilder, loc, composedMap, mapOperands); + auto elem = loopBuilder.create(loc, value, laneIndex); + loopBuilder.create(loc, elem, memref, mappedOperands); + + return success(); + } +}; + +struct ValueMFMAConstantOpToRocDLConversion final : public OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; + + LogicalResult matchAndRewrite(vir::MFMAConstantOp op, PatternRewriter& rewriter) const final + { + + auto mfmaType = op.getMFMAMatrixType(); + + auto leadingDim = mfmaType.getLeadingDim(); + if (leadingDim != 16) // 16x16x4 + { + return rewriter.notifyMatchFailure(op, "unhandled matrix shape"); + } + + auto vecSize = 4; + auto vecTy = VectorType::get({ vecSize }, mfmaType.getElementType()); + + rewriter.replaceOpWithNewOp(op, vecTy, op.value()); + + return success(); + } +}; + +struct ValueMFMAComputeToRocDLConversion final : public OpConversionPattern +{ + using OpConversionPattern::OpConversionPattern; + + LogicalResult matchAndRewrite(vir::MFMAComputeOp op, + ArrayRef operands, + ConversionPatternRewriter& rewriter) const final + { + using namespace accera::utilities; + auto loc = op.getLoc(); + vir::MFMAComputeOp::Adaptor mfmaComputeMatrixOpAdaptor(operands, op->getAttrDictionary()); + auto opA = mfmaComputeMatrixOpAdaptor.opA(); + auto opB = mfmaComputeMatrixOpAdaptor.opB(); + auto opC = mfmaComputeMatrixOpAdaptor.opC(); + if (!opA.getType().isa()) + { + return rewriter.notifyMatchFailure(op, "expecting a vector type for OpA"); + } + if (!opB.getType().isa()) + { + return rewriter.notifyMatchFailure(op, "expecting a vector type for OpB"); + } + if (!opC.getType().isa()) + { + return rewriter.notifyMatchFailure(op, "expecting a vector type for OpC"); + } mlir::OpBuilder::InsertionGuard guard(rewriter); rewriter.setInsertionPoint(op); - // TODO: Literal constants should be provided by a helper struct (TensorizationInfo?) 
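(Editor's note, not part of the patch.) The load lowering above derives each lane's coordinates from its position in the 64-wide warp: m = warpTid % 16 selects the row (or column) and ks = warpTid / 16 selects which of the four strided elements the lane starts from. The small standalone C++ check below prints the AOp and BOp element coordinates a few lanes would touch, following the in-code comments; tile origins and the memref view offsets are dropped, and the COp case (which uses ks * 4 + i) is omitted for brevity.

```
#include <cstdio>
#include <initializer_list>

int main()
{
    const int wmmaM = 16; // leading dim of the 16x16x4 shape; 64 lanes = 16 values of m x 4 values of ks
    for (int warpTid : { 0, 1, 17, 63 }) // warpTid = blockTid % 64
    {
        const int m = warpTid % wmmaM;
        const int ks = warpTid / wmmaM;
        std::printf("lane %2d:", warpTid);
        for (int i = 0; i < 4; ++i)
            std::printf("  A[%d,%d] B[%d,%d]", m, ks + 4 * i, ks + 4 * i, m); // AOp: [m, ks+4i], BOp: [ks+4i, m]
        std::printf("\n");
    }
    return 0;
}
```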
- if (vecSize == 4) + auto i32Ty = rewriter.getIntegerType(32); + Value zero = rewriter.create(loc, i32Ty, rewriter.getZeroAttr(i32Ty)); + + auto destVectorType = opC.getType().cast(); + auto numElems = destVectorType.getShape()[0]; + if (numElems != 4) { - llvm::SmallVector offsets{ 0, 0 }; - llvm::SmallVector sizes{ 16, 16 }; - llvm::SmallVector strides{ 4, 1 }; - auto rowMemrefTy = MemRefType::get({ 4, 4 }, elementType); - [[maybe_unused]] auto row = rewriter.create(loc, rowMemrefTy, memref, offsets, sizes, strides); - // auto vec = rewriter.replaceOpWithNewOp(op, vecTy, row, ValueRange{ rewriter.create(loc, 0) }); + return rewriter.notifyMatchFailure(op, "expecting a 16x16 matrix type for OpC"); + } + + if (numElems == 4) + { + auto numIterations = 4; + auto result = opC; + // + // equivalent to: + // result = opC; + // for (int i = 0; i < 4; i++) { + // result = mfma_f32_16x16x4f32(opA[i], opB[i], result); + // } + // + auto loop = rewriter.replaceOpWithNewOp(op, 0, numIterations, 1, result); + auto loopBuilder = utilir::MakeBodyBuilder(loop); + auto inductionVar = loop.getInductionVar(); + auto regionIterArg = loop.getRegionIterArgs()[0]; + auto laneIndex = loopBuilder.create(loc, inductionVar, i32Ty); + auto elemA = loopBuilder.create(loc, opA, laneIndex); + auto elemB = loopBuilder.create(loc, opB, laneIndex); + auto mfmaOp = loopBuilder.create(loc, result.getType(), ValueRange{ elemA, elemB, regionIterArg, zero, zero, zero }); + loopBuilder.create(loc, ValueRange{ mfmaOp }); } else { - return rewriter.notifyMatchFailure(op, "unhandled vector size"); + return rewriter.notifyMatchFailure(op, "Unsupported op size."); } - return success(); } }; -struct ValueMFMAStoreMatrixOpToRocDLConversion final : public OpRewritePattern +struct ValueMFMAStoreOpToGPUConversion final : public OpConversionPattern { - using OpRewritePattern::OpRewritePattern; + using OpConversionPattern::OpConversionPattern; - LogicalResult matchAndRewrite(vir::MFMAStoreMatrixOp op, PatternRewriter& rewriter) const final + LogicalResult matchAndRewrite(vir::MFMAStoreOp op, + ArrayRef operands, + ConversionPatternRewriter& rewriter) const final { - using namespace accera::utilities; + return success(); + } +}; - throw LogicException(LogicExceptionErrors::notImplemented); +struct ValueMFMALoadOpToGPUConversion final : public OpConversionPattern +{ + using OpConversionPattern::OpConversionPattern; + LogicalResult matchAndRewrite(vir::MFMALoadOp op, + ArrayRef operands, + ConversionPatternRewriter& rewriter) const final + { + rewriter.eraseOp(op); return success(); } }; -struct ValueMFMAComputeToRocDLConversion final : public OpRewritePattern +struct ValueMFMAConstantOpToGPUConversion final : public OpConversionPattern { - using OpRewritePattern::OpRewritePattern; + using OpConversionPattern::OpConversionPattern; - LogicalResult matchAndRewrite(vir::MFMAComputeOp op, PatternRewriter& rewriter) const final + LogicalResult matchAndRewrite(vir::MFMAConstantOp op, + ArrayRef operands, + ConversionPatternRewriter& rewriter) const final { - using namespace accera::utilities; + rewriter.eraseOp(op); + return success(); + } +}; +struct ValueMFMAComputeToGPUConversion final : public OpConversionPattern +{ + using OpConversionPattern::OpConversionPattern; - throw LogicException(LogicExceptionErrors::notImplemented); + LogicalResult matchAndRewrite(vir::MFMAComputeOp op, + ArrayRef operands, + ConversionPatternRewriter& rewriter) const final + { + rewriter.replaceOpWithNewOp(op, operands[2].getType(), operands, op->getAttrs()); + 
return success(); + } +}; + +struct ResolveBlockDimPattern final : public OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; - if (!hasROCDLRuntimeTarget(op)) + LogicalResult matchAndRewrite(gpu::BlockDimOp op, PatternRewriter& rewriter) const final + { + auto gpuFunc = op->getParentOfType(); + if (!gpuFunc) { return failure(); } - auto adaptor = vir::MFMAComputeOpAdaptor(op); - auto loc = op.getLoc(); - - auto opA = adaptor.opA(); - auto opB = adaptor.opB(); - auto opC = adaptor.opC(); - if (!opA.getType().isa()) + auto blockSizeAttr = gpuFunc->getAttrOfType("blockSize"); + auto blockDimIdx = dimIndexToInteger(op.dimension()); + if (!blockSizeAttr || blockDimIdx == -1) { - return rewriter.notifyMatchFailure(op, "expecting a matrix type for OpA"); + return failure(); } - if (!opB.getType().isa()) + auto val = blockSizeAttr.getValue()[blockDimIdx].cast().getInt(); + rewriter.replaceOpWithNewOp(op, val); + return success(); + } +}; +struct ConditionalBarrierHoistingPattern : public OpRewritePattern +{ + using OpRewritePattern::OpRewritePattern; + + mlir::Operation* GetAncestorIfOp(vir::BarrierOp op) const + { + mlir::Operation* parentAffineIfOp = utilir::GetHighestAncestorOfType(op); + mlir::Operation* parentSCFIfOp = utilir::GetHighestAncestorOfType(op); + + if (parentAffineIfOp && parentSCFIfOp) { - return rewriter.notifyMatchFailure(op, "expecting a matrix type for OpB"); + // There are both affine.if and scf.if parents, so return the highest ancestor between the two + return parentAffineIfOp->isAncestor(parentSCFIfOp) ? parentAffineIfOp : parentSCFIfOp; } - if (!opC.getType().isa()) + else { - return rewriter.notifyMatchFailure(op, "expecting a matrix type for OpC"); + // Return whichever is nonnull, or return nullptr if both are null + return parentAffineIfOp == nullptr ? 
parentSCFIfOp : parentAffineIfOp; } + } - auto i32Ty = rewriter.getIntegerType(32); - [[maybe_unused]] Value zero = rewriter.create(loc, i32Ty, rewriter.getZeroAttr(i32Ty)); - - // Value accumVecIn; - // if (accumIn.getType().dyn_cast()) - // { - // auto memRefType = accumIn.getType().cast(); - // unsigned rank = memRefType.getRank(); - // if (rank != 1) - // { - // return rewriter.notifyMatchFailure(op, "accumulation type for the MFMA op must be a vector (memref rank = 1)."); - // } - // if (!memRefType.hasStaticShape()) - // { - // return rewriter.notifyMatchFailure(op, "accumulation type for the MFMA op must have a static shape."); - // } - // if (memRefType.getElementType() != rewriter.getF32Type()) - // { - // return rewriter.notifyMatchFailure(op, "accumulation type for the MFMA op must be a floating point number."); - // } - // auto numElements = memRefType.getNumElements(); - // if (numElements != 4 && numElements != 16) - // { - // return rewriter.notifyMatchFailure(op, "accumulation type for the MFMA op must be a floating point vector of size 4 or 16."); - // } - - // auto vecF32Ty = mlir::VectorType::get({ numElements }, rewriter.getF32Type()); - // accumVecIn = rewriter.create(loc, vecF32Ty, accumIn, ValueRange{ rewriter.create(loc, 0) }); - // } - // else - // { - // accumVecIn = accumIn; - // } - // auto accumVecInTy = accumVecIn.getType().cast(); - // Operation* newOp; + LogicalResult matchAndRewrite(vir::BarrierOp op, PatternRewriter& rewriter) const final + { + // Hoist barrier ops outside of any affine.if or scf.if conditional blocks they are contained inside of + + // As a simple hoist, remove all barriers inside of the conditional and place a barrier before and after the conditional block + // TODO : instead of hoisting this way, split conditional blocks at the barriers to keep the same relative + + // Get the highest level affine.if or scf.if op that contains this barrier, if one exists + if (auto ancestorIfOp = GetAncestorIfOp(op)) + { + // This barrier is contained within a conditional, so clone it before and after the conditional then erase it + rewriter.setInsertionPoint(ancestorIfOp); + rewriter.clone(*(op.getOperation())); + rewriter.setInsertionPointAfter(ancestorIfOp); + rewriter.clone(*(op.getOperation())); + + rewriter.eraseOp(op); + } - // if (accumVecInTy.getNumElements() == 16) - // { - // newOp = rewriter.replaceOpWithNewOp(op, accumVecIn.getType(), ValueRange{ adaptor.opA(), adaptor.opB(), accumVecIn, zero, zero, zero }); - // } - // else - // { - // newOp = rewriter.replaceOpWithNewOp(op, accumVecIn.getType(), ValueRange{ adaptor.opA(), adaptor.opB(), accumVecIn, zero, zero, zero }); - // } - // rewriter.setInsertionPointAfter(newOp); - - // assert(op->getNumResults() > 0); - // auto result = newOp->getResult(0); - - // for (auto user : result.getUsers()) - // { - // if (memref::LoadOp loadOp = dyn_cast(user)) - // { - // mlir::OpBuilder::InsertionGuard guard(rewriter); - // rewriter.setInsertionPoint(loadOp); - // auto idx = loadOp.indices().front(); - // auto idxAsInt = rewriter.create(loadOp->getLoc(), idx, rewriter.getI64Type()); - // rewriter.replaceOpWithNewOp(loadOp, result, idxAsInt); - // } - // else if (memref::StoreOp storeOp = dyn_cast(user)) - // { - // mlir::OpBuilder::InsertionGuard guard(rewriter); - // rewriter.setInsertionPoint(storeOp); - // auto value = storeOp.getValueToStore(); - // auto idx = loadOp.indices().front(); - // auto idxAsInt = rewriter.create(loadOp->getLoc(), idx, rewriter.getI64Type()); - // 
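
The `ConditionalBarrierHoistingPattern` above clones a barrier immediately before and after its outermost enclosing `affine.if`/`scf.if` and erases the nested copy. The effect, shown on ordinary C-like kernel code with a `barrier()` stand-in for `accv.barrier`/`__syncthreads()` (illustrative only):

```cpp
void barrier();          // stand-in for accv.barrier / __syncthreads()
void conditionalWork();  // stand-in for the body of the conditional

// Before: only the threads that take the branch reach the barrier.
void beforeHoisting(bool cond)
{
    if (cond)
    {
        barrier();       // barrier nested inside the conditional
        conditionalWork();
    }
}

// After: the barrier is cloned before and after the conditional and the nested
// copy is erased, so every thread reaches the same barriers.
void afterHoisting(bool cond)
{
    barrier();
    if (cond)
    {
        conditionalWork();
    }
    barrier();
}
```

The TODO in the pattern notes the more precise alternative: splitting the conditional at each barrier rather than duplicating a barrier around the whole block.
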
rewriter.replaceOpWithNewOp(storeOp, value, result, idxAsInt); - // } else { - // return rewriter.notifyMatchFailure(op, "Unsupported op. Users for the result of the MFMA op must be either store or load instructions."); - // } - // } return success(); } }; @@ -451,14 +715,21 @@ struct AcceraToSPIRVPass : public accera::transforms::ConvertAcceraToSPIRVBase(module)) { return; } - // cf mlir/lib/Conversion/GPUToSPIRV/ConvertGPUToSPIRVPass.cpp -- GPUToSPIRVPass::runOnOperation MLIRContext* context = &getContext(); + + { + RewritePatternSet patterns(context); + populateGPUSimplificationPatterns(patterns); + (void)applyPatternsAndFoldGreedily(module, std::move(patterns)); + } + + // cf mlir/lib/Conversion/GPUToSPIRV/ConvertGPUToSPIRVPass.cpp -- GPUToSPIRVPass::runOnOperation SmallVector kernelModules; OpBuilder builder(context); module.walk([&builder, &kernelModules](gpu::GPUModuleOp moduleOp) { @@ -483,7 +754,7 @@ struct AcceraToSPIRVPass : public accera::transforms::ConvertAcceraToSPIRVBase { @@ -491,18 +762,51 @@ struct AcceraToROCDLPass : public accera::transforms::ConvertAcceraToROCDLBase(module)) + { + return; + } - RewritePatternSet patterns(context); - populateAcceraToROCDLPatterns(patterns); + target.addLegalOp(); + target.addIllegalOp< + vir::EarlyReturnOp, + vir::MFMAComputeOp, + vir::MFMAConstantOp, + vir::MFMALoadOp, + vir::MFMAStoreOp, + vir::BarrierOp, + gpu::BlockDimOp + >(); + target.addLegalDialect< + mlir::AffineDialect, + mlir::BuiltinDialect, + mlir::gpu::GPUDialect, + mlir::memref::MemRefDialect, + mlir::ROCDL::ROCDLDialect, + mlir::scf::SCFDialect, + mlir::StandardOpsDialect, + mlir::vector::VectorDialect, + omp::OpenMPDialect, + vir::ValueDialect>(); - (void)applyPatternsAndFoldGreedily(module, std::move(patterns)); + { + RewritePatternSet patterns(context); + populateGPUSimplificationPatterns(patterns); + (void)applyPatternsAndFoldGreedily(module, std::move(patterns)); + } + { + RewritePatternSet patterns(context); + patterns.insert(vir::ExecutionRuntime::ROCM, context); + (void)applyPatternsAndFoldGreedily(module, std::move(patterns)); + } + { + RewritePatternSet patterns(context); + populateAcceraToROCDLPatterns(patterns); + if (failed(applyFullConversion(module, target, std::move(patterns)))) + signalPassFailure(); + } } }; @@ -513,18 +817,52 @@ struct AcceraToNVVMPass : public accera::transforms::ConvertAcceraToNVVMBase(module)) + { + return; + } - RewritePatternSet patterns(context); - populateAcceraToNVVMPatterns(patterns); + target.addLegalOp(); + target.addIllegalOp< + vir::EarlyReturnOp, + vir::MFMAComputeOp, + vir::MFMAConstantOp, + vir::MFMALoadOp, + vir::MFMAStoreOp, + vir::BarrierOp, + gpu::BlockDimOp + >(); + target.addLegalDialect< + mlir::AffineDialect, + mlir::BuiltinDialect, + mlir::gpu::GPUDialect, + mlir::memref::MemRefDialect, + mlir::NVVM::NVVMDialect, + mlir::scf::SCFDialect, + mlir::StandardOpsDialect, + mlir::vector::VectorDialect, + omp::OpenMPDialect, + vir::ValueDialect>(); + + { + RewritePatternSet patterns(context); + populateGPUSimplificationPatterns(patterns); + (void)applyPatternsAndFoldGreedily(module, std::move(patterns)); + } + { + RewritePatternSet patterns(context); + patterns.insert(vir::ExecutionRuntime::CUDA, context); + (void)applyPatternsAndFoldGreedily(module, std::move(patterns)); + } + { + RewritePatternSet patterns(context); + populateAcceraToNVVMPatterns(patterns); - (void)applyPatternsAndFoldGreedily(module, std::move(patterns)); + if (failed(applyFullConversion(module, target, std::move(patterns)))) + signalPassFailure(); 
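
The ROCDL and NVVM passes above now run in two phases: first a greedy application of the GPU simplification patterns (barrier hoisting, block-dim resolution), then a full dialect conversion against an explicit legality target, so any leftover `accv` MFMA/barrier op becomes a hard error instead of being silently carried through. A minimal outline of that structure, with empty stand-in pattern populators rather than the Accera ones:

```cpp
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Transforms/DialectConversion.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

using namespace mlir;

// Stand-ins; the real pass adds ConditionalBarrierHoistingPattern,
// ResolveBlockDimPattern, the MFMA conversions, etc.
inline void populateMySimplificationPatterns(RewritePatternSet&) {}
inline void populateMyLoweringPatterns(RewritePatternSet&) {}

LogicalResult lowerModuleSketch(ModuleOp module)
{
    MLIRContext* context = module.getContext();

    // Phase 1: local cleanups applied greedily to a fixpoint.
    {
        RewritePatternSet patterns(context);
        populateMySimplificationPatterns(patterns);
        (void)applyPatternsAndFoldGreedily(module, std::move(patterns));
    }

    // Phase 2: a full conversion -- every op remaining in the module must be
    // legal for the target, otherwise the pass fails.
    ConversionTarget target(*context);
    target.addLegalOp<ModuleOp>();
    // target.addIllegalOp<...>(); target.addLegalDialect<...>();
    RewritePatternSet patterns(context);
    populateMyLoweringPatterns(patterns);
    return applyFullConversion(module, target, std::move(patterns));
}
```
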
+ } } }; } // namespace @@ -544,18 +882,30 @@ void populateAcceraToSPIRVPatterns(mlir::SPIRVTypeConverter& typeConverter, mlir void populateAcceraToROCDLPatterns(mlir::OwningRewritePatternList& patterns) { patterns.insert< + ResolveBlockDimPattern, EarlyReturnToGPUReturnPattern, ValueBarrierToGPUBarrierConversion, - ValueMFMALoadMatrixOpToRocDLConversion, + ValueMFMALoadOpToRocDLConversion, ValueMFMAComputeToRocDLConversion, - ValueMFMAStoreMatrixOpToRocDLConversion>(patterns.getContext()); + ValueMFMAStoreOpToRocDLConversion, + ValueMFMAConstantOpToRocDLConversion>(patterns.getContext()); } void populateAcceraToNVVMPatterns(mlir::OwningRewritePatternList& patterns) { patterns.insert< + ResolveBlockDimPattern, EarlyReturnToGPUReturnPattern, - ValueBarrierToGPUBarrierConversion>(patterns.getContext()); + ValueBarrierToGPUBarrierConversion, + ValueMFMALoadOpToGPUConversion, + ValueMFMAComputeToGPUConversion, + ValueMFMAStoreOpToGPUConversion, + ValueMFMAConstantOpToGPUConversion>(patterns.getContext()); +} + +void populateGPUSimplificationPatterns(mlir::OwningRewritePatternList& patterns) +{ + patterns.insert(patterns.getContext()); } std::unique_ptr> createAcceraToSPIRVPass() @@ -578,17 +928,21 @@ std::unique_ptr> createAcceraToGPUPass(accer using accera::value::ExecutionRuntime; switch (runtime) { - case ExecutionRuntime::CUDA: - return createAcceraToNVVMPass(); - case ExecutionRuntime::Rocm: + case ExecutionRuntime::DEFAULT: // TODO: default gpu runtime is rocm [[fallthrough]]; - case ExecutionRuntime::Default: + case ExecutionRuntime::ROCM: return createAcceraToROCDLPass(); - case ExecutionRuntime::Vulkan: + case ExecutionRuntime::CUDA: + return createAcceraToNVVMPass(); + case ExecutionRuntime::VULKAN: return createAcceraToSPIRVPass(); + case ExecutionRuntime::NONE: + [[fallthrough]]; + case ExecutionRuntime::OPENMP: + [[fallthrough]]; default: - llvm::llvm_unreachable_internal("The execution runtime must be specified."); + return {}; } } diff --git a/accera/transforms/src/gpu/AcceraToSPIRVPass.cpp b/accera/transforms/src/gpu/AcceraToSPIRVPass.cpp deleted file mode 100644 index 56d367c1..00000000 --- a/accera/transforms/src/gpu/AcceraToSPIRVPass.cpp +++ /dev/null @@ -1,196 +0,0 @@ -//////////////////////////////////////////////////////////////////////////////////////////////////// -// Copyright (c) Microsoft Corporation. All rights reserved. -// Licensed under the MIT License. See LICENSE in the project root for license information. -// Authors: Kern Handa -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#include "AcceraPasses.h" - -#include - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include - -using namespace mlir; -using accera::transforms::populateAcceraToSPIRVPatterns; - -namespace utilir = accera::ir::util; -namespace vir = accera::ir::value; - -namespace -{ - -// We need to make this greater than 1 to preempt builtin patterns -constexpr unsigned kAcceraSPIRVPatternBenefit = 10; -const char kPrivateMemoryVarPrefix[] = "__private_mem__"; - -// cf mlir/lib/Conversion/StandardToSPIRV/ConvertStandardToSPIRV.cpp -/// Returns true if the allocations of type `t` can be lowered to SPIR-V. -static bool isAllocationSupported(MemRefType t) -{ - // Currently only support workgroup local memory allocations with static - // shape and int or float or vector of int or float element type. 
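
`createAcceraToGPUPass` above now returns an empty pass for runtimes that have no GPU lowering (`NONE`, `OPENMP`) instead of reaching `llvm_unreachable`. A self-contained model of the dispatch, with a stand-in `Pass` type in place of the real `mlir::OperationPass` pointers:

```cpp
#include <memory>

// Mirrors the runtime enum naming used in this patch.
enum class ExecutionRuntime { NONE, DEFAULT, ROCM, CUDA, VULKAN, OPENMP };

struct Pass { const char* name; }; // stand-in for an MLIR pass

std::unique_ptr<Pass> createAcceraToGPUPassModel(ExecutionRuntime runtime)
{
    switch (runtime)
    {
    case ExecutionRuntime::DEFAULT: // default GPU runtime is currently ROCm
    case ExecutionRuntime::ROCM:
        return std::make_unique<Pass>(Pass{ "acc-to-rocdl" });
    case ExecutionRuntime::CUDA:
        return std::make_unique<Pass>(Pass{ "acc-to-nvvm" });
    case ExecutionRuntime::VULKAN:
        return std::make_unique<Pass>(Pass{ "acc-to-spirv" });
    default:
        return {}; // NONE / OPENMP: no GPU lowering pass
    }
}
```

Callers should therefore check the returned pointer before adding the pass to a pass manager.
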
- if (!(t.hasStaticShape() && SPIRVTypeConverter::getMemorySpaceForStorageClass(spirv::StorageClass::Function) == t.getMemorySpaceAsInt())) - return false; - Type elementType = t.getElementType(); - if (auto vecType = elementType.dyn_cast()) - elementType = vecType.getElementType(); - return elementType.isIntOrFloat(); -} - -struct PrivateAllocToSPIRVConversion : public OpConversionPattern -{ - PrivateAllocToSPIRVConversion(SPIRVTypeConverter& typeConverter, MLIRContext* context) : - OpConversionPattern(typeConverter, context, kAcceraSPIRVPatternBenefit) - {} - - LogicalResult matchAndRewrite(memref::AllocOp op, ArrayRef operands, ConversionPatternRewriter& rewriter) const final - { - // cf mlir/lib/Conversion/StandardToSPIRV/ConvertStandardToSPIRV.cpp - - MemRefType allocType = op.getType(); - if (!isAllocationSupported(allocType)) - return failure(); - - // Get the SPIR-V type for the allocation. - Type spirvType = getTypeConverter()->convertType(allocType); - - rewriter.replaceOpWithNewOp(op, spirvType, *SPIRVTypeConverter::getStorageClassForMemorySpace(allocType.getMemorySpaceAsInt()), mlir::Value{}); - return success(); - } -}; - -/// Removes a deallocation if it is a supported allocation -struct PrivateDeallocToSPIRVConversion final : public OpConversionPattern -{ - PrivateDeallocToSPIRVConversion(SPIRVTypeConverter& typeConverter, MLIRContext* context) : - OpConversionPattern(typeConverter, context, kAcceraSPIRVPatternBenefit) - {} - - LogicalResult matchAndRewrite(memref::DeallocOp op, ArrayRef operands, ConversionPatternRewriter& rewriter) const final - { - // cf mlir/lib/Conversion/StandardToSPIRV/ConvertStandardToSPIRV.cpp - - MemRefType deallocType = op.memref().getType().cast(); - if (!isAllocationSupported(deallocType)) - { - return op.emitError("unhandled deallocation type"); - } - rewriter.eraseOp(op); - return success(); - } -}; - -struct GPUEarlyReturnRewritePattern : public OpConversionPattern -{ - using OpConversionPattern::OpConversionPattern; - - LogicalResult matchAndRewrite(vir::EarlyReturnOp op, ArrayRef operands, ConversionPatternRewriter& rewriter) const final - { - if (auto target = utilir::ResolveExecutionTarget(op); !target || *target != vir::ExecutionTarget::GPU) - { - return failure(); - } - - if (operands.empty()) - { - rewriter.replaceOpWithNewOp(op); - } - else - { - assert(operands.size() == 1); - rewriter.replaceOpWithNewOp(op, operands[0]); - } - return success(); - } -}; - -struct ValueBarrierToSPIRVBarrierConversion final : public OpConversionPattern -{ - ValueBarrierToSPIRVBarrierConversion(SPIRVTypeConverter& typeConverter, MLIRContext* context) : - OpConversionPattern(typeConverter, context, kAcceraSPIRVPatternBenefit) - {} - - LogicalResult matchAndRewrite(vir::BarrierOp op, ArrayRef, ConversionPatternRewriter& rewriter) const final - { - switch (op.scope()) - { - case vir::BarrierScope::Block: - rewriter.replaceOpWithNewOp( - op, - op->getAttrOfType("execution_scope").cast(), - op->getAttrOfType("memory_scope").cast(), - op->getAttrOfType("memory_semantics").cast()); - break; - default: - assert(true); - break; - } - return success(); - } -}; - -struct AcceraToSPIRVPass : public accera::transforms::ConvertAcceraToSPIRVBase -{ - void runOnOperation() final - { - // cf mlir/lib/Conversion/GPUToSPIRV/ConvertGPUToSPIRVPass.cpp -- GPUToSPIRVPass::runOnOperation - MLIRContext* context = &getContext(); - ModuleOp module = getOperation(); - - SmallVector kernelModules; - OpBuilder builder(context); - module.walk([&builder, 
&kernelModules](gpu::GPUModuleOp moduleOp) { - // For each kernel module (should be only 1 for now, but that is not a - // requirement here), clone the module for conversion because the - // gpu.launch function still needs the kernel module. - builder.setInsertionPoint(moduleOp.getOperation()); - kernelModules.push_back(builder.clone(*moduleOp.getOperation())); - }); - - auto targetAttr = spirv::lookupTargetEnvOrDefault(module); - std::unique_ptr target = SPIRVConversionTarget::get(targetAttr); - - SPIRVTypeConverter typeConverter(targetAttr); - ScfToSPIRVContext scfContext; - RewritePatternSet patterns(context); - populateAcceraToSPIRVPatterns(typeConverter, context, patterns); - populateGPUToSPIRVPatterns(typeConverter, patterns); - populateSCFToSPIRVPatterns(typeConverter, scfContext, patterns); - populateStandardToSPIRVPatterns(typeConverter, patterns); - - if (failed(applyFullConversion(kernelModules, *target, std::move(patterns)))) - return signalPassFailure(); - } -}; -} // namespace - -namespace accera::transforms -{ - -void populateAcceraToSPIRVPatterns(mlir::SPIRVTypeConverter& typeConverter, mlir::MLIRContext* context, mlir::OwningRewritePatternList& patterns) -{ - patterns.insert< - GPUEarlyReturnRewritePattern, - ValueBarrierToSPIRVBarrierConversion, - PrivateAllocToSPIRVConversion, - PrivateDeallocToSPIRVConversion>(typeConverter, context); -} - -std::unique_ptr> createAcceraToSPIRVPass() -{ - return std::make_unique(); -} - -} // namespace accera::transforms diff --git a/accera/transforms/src/nest/LoopNestPasses.cpp b/accera/transforms/src/nest/LoopNestPasses.cpp index da8a22bb..96c8a59e 100644 --- a/accera/transforms/src/nest/LoopNestPasses.cpp +++ b/accera/transforms/src/nest/LoopNestPasses.cpp @@ -177,7 +177,7 @@ void LoopNestOptPass::runOnOperation() auto func = getOperation(); func.walk([&](AffineForOp op) { - if (op->getAttrOfType("rcv_unrolled")) + if (op->getAttrOfType("accv_unrolled")) { auto tripCount = getConstantTripCount(op); if (tripCount && *tripCount >= 1) diff --git a/accera/transforms/src/nest/LoopNestToValue.cpp b/accera/transforms/src/nest/LoopNestToValue.cpp index 6e6c924d..f3c899ba 100644 --- a/accera/transforms/src/nest/LoopNestToValue.cpp +++ b/accera/transforms/src/nest/LoopNestToValue.cpp @@ -452,7 +452,7 @@ LogicalResult ScheduleOpConversion::matchAndRewrite(ScheduleOp op, PatternRewrit LogicalResult SaturatedAccumulateLoopRewrite::matchAndRewrite(AffineForOp loopOp, PatternRewriter& rewriter) const { - if (!loopOp->getAttr("rcv_saturated")) + if (!loopOp->getAttr("accv_saturated")) return success(); if (!loopOp.hasConstantBounds()) @@ -787,7 +787,7 @@ LogicalResult GPUMappedAffineForOpRewrite::matchAndRewrite(mlir::AffineForOp aff { auto loc = affineForOp.getLoc(); - if (auto gpuMapAttr = affineForOp->getAttrOfType("rcv_gpu_map")) + if (auto gpuMapAttr = affineForOp->getAttrOfType("accv_gpu_map")) { auto iv = affineForOp.getInductionVar(); int64_t begin = affineForOp.getConstantLowerBound(); diff --git a/accera/transforms/src/nest/LoopNestToValue.td b/accera/transforms/src/nest/LoopNestToValue.td index 7c24a01e..319fa289 100644 --- a/accera/transforms/src/nest/LoopNestToValue.td +++ b/accera/transforms/src/nest/LoopNestToValue.td @@ -9,6 +9,6 @@ include "ir/include/nest/LoopNestOps.td" include "ir/include/value/ValueOps.td" -def : Pat<(rcln_PrintOp $input, $to_stderr), (rcv_PrintOp $input, $to_stderr)>; +def : Pat<(accln_PrintOp $input, $to_stderr), (accv_PrintOp $input, $to_stderr)>; #endif // LOOPNEST_TO_VALUE diff --git 
a/accera/transforms/src/nest/LoopNestToValueFunc.cpp b/accera/transforms/src/nest/LoopNestToValueFunc.cpp index 6e255aa6..3bcd37cd 100644 --- a/accera/transforms/src/nest/LoopNestToValueFunc.cpp +++ b/accera/transforms/src/nest/LoopNestToValueFunc.cpp @@ -73,6 +73,7 @@ struct LoopNestToValueFuncPass : public accera::transforms::LoopNestToValueFuncB (void)applyPatternsAndFoldGreedily(vFuncOp, std::move(patterns)); snapshotter.Snapshot("RangeResolution", vFuncOp); } + { OwningRewritePatternList patterns(context); tr::populateScheduledOperationsPatterns(patterns); @@ -188,6 +189,15 @@ struct LoopNestToValueFuncPass : public accera::transforms::LoopNestToValueFuncB snapshotter.Snapshot("ExecutionPlanCacheMapping", vFuncOp); } + { + // Note: A canonicalization cannot happen between ExecutionPlanCacheMapping and ExecutionPlanCheckAndElideThriftyCaches + // otherwise attributes on loads will be removed that this pass depends on + OwningRewritePatternList patterns(context); + xptr::populateExecutionPlanThriftyCachePatterns(patterns); + applyPatternsAndFoldGreedily(vFuncOp, std::move(patterns)); + snapshotter.Snapshot("ExecutionPlanCheckAndElideThriftyCaches", vFuncOp); + } + { OwningRewritePatternList patterns(context); tr::populateScheduledOperationsPatterns(patterns); @@ -223,6 +233,20 @@ struct LoopNestToValueFuncPass : public accera::transforms::LoopNestToValueFuncB snapshotter.Snapshot("ExecutionPlanCopyReduce", vFuncOp); } + { + OwningRewritePatternList patterns(context); + xptr::populateExecutionPlanDelayedMappingPatterns(patterns); + (void)applyPatternsAndFoldGreedily(vFuncOp, std::move(patterns)); + snapshotter.Snapshot("ExecutionPlanDelayedMapping", vFuncOp); + } + + { + OwningRewritePatternList patterns(context); + xptr::populateExecutionPlanLoopUnswitchingPatterns(patterns); + (void)applyPatternsAndFoldGreedily(vFuncOp, std::move(patterns)); + snapshotter.Snapshot("ExecutionPlanLoopUnswitching", vFuncOp); + } + { OwningRewritePatternList patterns(context); tr::populateLoopSimplificationPatterns(patterns); @@ -249,6 +273,14 @@ struct LoopNestToValueFuncPass : public accera::transforms::LoopNestToValueFuncB snapshotter.Snapshot("GPUIndexMapping", vFuncOp); } + { + OwningRewritePatternList patterns(context); + xptr::populateExecutionPlanTensorizePatterns(patterns); + utilir::FillCanonicalPatternsRecursively(vFuncOp, patterns); + (void)applyPatternsAndFoldGreedily(vFuncOp, std::move(patterns)); + snapshotter.Snapshot("ExecutionPlanTensorize", vFuncOp); + } + { OwningRewritePatternList patterns(context); xptr::populateExecutionPlanVectorizePatterns(printVecOpDetails, patterns); @@ -278,14 +310,6 @@ struct LoopNestToValueFuncPass : public accera::transforms::LoopNestToValueFuncB (void)applyPatternsAndFoldGreedily(vFuncOp, std::move(patterns)); snapshotter.Snapshot("ExecutionPlanParallelize", vFuncOp); } - - { - OwningRewritePatternList patterns(context); - xptr::populateExecutionPlanTensorizePatterns(patterns); - utilir::FillCanonicalPatternsRecursively(vFuncOp, patterns); - (void)applyPatternsAndFoldGreedily(vFuncOp, std::move(patterns)); - snapshotter.Snapshot("ExecutionPlanTensorize", vFuncOp); - } } tr::IRSnapshotter _intrapassSnapshotter; diff --git a/accera/transforms/src/util/VectorizationUtil.cpp b/accera/transforms/src/util/VectorizationUtil.cpp index 3e0a2b84..9c22d60c 100644 --- a/accera/transforms/src/util/VectorizationUtil.cpp +++ b/accera/transforms/src/util/VectorizationUtil.cpp @@ -253,56 +253,6 @@ std::optional VectorizeConstantOp(mlir::PatternRewriter& rewri return 
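
On the LoopNestToValueFunc pipeline changes above: the thrifty-cache check is inserted directly after cache mapping, and the comment in the patch stresses that no canonicalization may run in between, because the check depends on attributes that canonicalization strips from affine loads and stores. Conceptually the check is a contiguity test on the would-be cache copy; here is a minimal model of that idea (an assumption for illustration, not the actual pattern):

```cpp
#include <cstdint>
#include <vector>

// Model of a "thrifty cache" test: walk the cached region in cache order and
// record each element's flat offset in the source array. If every step advances
// by exactly 1, the cache would be a verbatim, contiguous copy of the source
// region and can be elided without changing the access pattern.
bool cacheCopyIsContiguous(const std::vector<int64_t>& sourceOffsetsInCacheOrder)
{
    for (size_t i = 1; i < sourceOffsetsInCacheOrder.size(); ++i)
        if (sourceOffsetsInCacheOrder[i] - sourceOffsetsInCacheOrder[i - 1] != 1)
            return false;
    return true;
}
```
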
constVec; } -template -mlir::AffineMap GetMemRefIndexToMemoryLocationMap(mlir::MLIRContext* context, MemoryOp op) -{ - assert(op.memref().getType().template isa()); - auto memRefType = op.memref().getType().template cast(); - std::vector memRefMaps = memRefType.getAffineMaps().vec(); - if (memRefMaps.empty()) - { - auto stridedLayout = mlir::makeCanonicalStridedLayoutExpr(memRefType.getShape(), context); - memRefMaps.push_back(mlir::AffineMap::get(memRefType.getRank(), 0, stridedLayout)); - } - mlir::AffineMap accessMapComposition = memRefMaps.front(); - for (size_t mapIdx = 1; mapIdx < memRefMaps.size(); ++mapIdx) - { - accessMapComposition = memRefMaps[mapIdx].compose(accessMapComposition); - } - assert(accessMapComposition.getNumResults() == 1); - return accessMapComposition; -} - -template -mlir::AffineMap GetAffineOpIndexToMemoryLocationMap(mlir::MLIRContext* context, AffineMemoryOp op) -{ - auto composedMemRefMap = GetMemRefIndexToMemoryLocationMap(context, op); - mlir::AffineMap affineOpMap = op.getAffineMapAttr().getValue(); - mlir::AffineMap accessMapComposition = composedMemRefMap.compose(affineOpMap); - assert(accessMapComposition.getNumResults() == 1); - return accessMapComposition; -} - -mlir::AffineMap GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::AffineStoreOp op) -{ - return GetAffineOpIndexToMemoryLocationMap(context, op); -} - -mlir::AffineMap GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::AffineLoadOp op) -{ - return GetAffineOpIndexToMemoryLocationMap(context, op); -} - -mlir::AffineMap GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::memref::StoreOp op) -{ - return GetMemRefIndexToMemoryLocationMap(context, op); -} - -mlir::AffineMap GetIndexToMemoryLocationMap(mlir::MLIRContext* context, mlir::memref::LoadOp op) -{ - return GetMemRefIndexToMemoryLocationMap(context, op); -} - template bool IsUnrolledAccessSequential(mlir::PatternRewriter& rewriter, OpType op, @@ -319,7 +269,7 @@ bool IsUnrolledAccessSequential(mlir::PatternRewriter& rewriter, } // Check if the temporary clones are all accessing sequential memory - auto accessMapComposition = GetIndexToMemoryLocationMap(rewriter.getContext(), op); + auto accessMapComposition = ir::util::GetIndexToMemoryLocationMap(rewriter.getContext(), op); if (accessMapComposition.getNumSymbols() > 0) { diff --git a/accera/transforms/src/value/BarrierOptPass.cpp b/accera/transforms/src/value/BarrierOptPass.cpp new file mode 100644 index 00000000..a1aadf48 --- /dev/null +++ b/accera/transforms/src/value/BarrierOptPass.cpp @@ -0,0 +1,235 @@ +//////////////////////////////////////////////////////////////////////////////////////////////////// +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. See LICENSE in the project root for license information. 
+// Authors: Chuck Jacobs +//////////////////////////////////////////////////////////////////////////////////////////////////// + +#include "AcceraPasses.h" + +#include + +#include + +#include + +#include +#include + +#include + +#include + +#include +#include +#include +#include +#include + +using namespace mlir; + +namespace +{ +using namespace scf; +#include "value/ValueConversion.inc" +} // namespace + +using namespace accera::ir; +using namespace accera::transforms; +using namespace accera::ir::value; +using namespace accera::utilities; + +using ValueBarrierOp = accera::ir::value::BarrierOp; + +struct BarrierOptPass : public BarrierOptBase +{ + enum class MemoryAccessType + { + Read, + Write, + }; + + // Maybe make this a map from op->operands? + struct MemoryAccessInfo + { + Operation* op; + mlir::Value baseMemRef; + mlir::ValueRange indices; + mlir::AffineMap accessMap; + MemoryAccessType accessType; + }; + + struct BarrierInfo + { + ValueBarrierOp barrier; + std::vector activeWrites; + std::vector activeReads; + }; + + using MemoryOpInfo = std::variant; + + void runOnOperation() final; + +private: + bool _debug = false; + + std::vector GatherMemoryOps(Operation* parentOp) + { + std::vector memoryOps; + + parentOp->walk([&](Operation* op) { + if (auto barrierOp = dyn_cast(op)) + { + memoryOps.push_back(BarrierInfo{ barrierOp, {}, {} }); + } + else if (auto memInfo = GetSharedMemoryAccessInfo(op)) + { + memoryOps.push_back(*memInfo); + } + }); + + return memoryOps; + } + + llvm::Optional + GetSharedMemoryAccessInfo(Operation* op) + { + auto getAffineAccessInfo = [](auto affineOp, MemoryAccessType accessType) -> llvm::Optional { + auto memRefType = affineOp.getMemRefType(); + auto memSpace = memRefType.getMemorySpaceAsInt(); + if (memSpace == gpu::GPUDialect::getWorkgroupAddressSpace()) + { + MemoryAccessInfo info; + info.op = affineOp.getOperation(); + info.baseMemRef = affineOp.getMemRef(); + info.indices = affineOp.indices(); + info.accessMap = affineOp.getAffineMap(); + info.accessType = accessType; + return info; + } + return llvm::None; + }; + + if (auto affineLoadOp = dyn_cast(op)) + { + return getAffineAccessInfo(affineLoadOp, MemoryAccessType::Read); + } + else if (auto affineStoreOp = dyn_cast(op)) + { + return getAffineAccessInfo(affineStoreOp, MemoryAccessType::Write); + } + + return llvm::None; + } +}; + +void BarrierOptPass::runOnOperation() +{ + auto memoryOps = GatherMemoryOps(getOperation()); + + auto usesSameMemory = [&](const MemoryAccessInfo& access1, const MemoryAccessInfo& access2) { + return access1.baseMemRef == access2.baseMemRef; + }; + + auto contains = [&](const std::vector& activeAccesses, const MemoryAccessInfo& access) { + return std::find_if(activeAccesses.begin(), activeAccesses.end(), [&](const MemoryAccessInfo& activeAccess) { + return usesSameMemory(access, activeAccess); + }) != activeAccesses.end(); + }; + + std::vector activeReads; + std::vector activeWrites; + BarrierInfo prevBarrier; + + auto commitPrevBarrier = [&]() { + prevBarrier = {}; + activeReads.clear(); + activeWrites.clear(); + }; + + for (auto memoryOp : memoryOps) + { + std::visit( + VariantVisitor{ + [&](BarrierInfo& barrierInfo) { + if (_debug) + { + auto out = barrierInfo.barrier.emitRemark("Barrier found with ") << activeWrites.size() << " active writes and " << activeReads.size() << " active reads\n"; + + if (activeWrites.size() > 0) + out << "Active writes:\n"; + for (auto& memOpInfo : activeWrites) + { + out << memOpInfo.op << "\n"; + } + + if (activeReads.size() > 0) + out 
<< "Active reads:\n"; + for (auto& memOpInfo : activeReads) + { + out << memOpInfo.op << "\n"; + } + } + + if (prevBarrier.barrier) + { + if (_debug) + prevBarrier.barrier.emitRemark("BarrierOpRewrite: removing redundant barrier"); + prevBarrier.barrier.erase(); + } + + prevBarrier = barrierInfo; + }, + [&](MemoryAccessInfo& memOpInfo) { + if (memOpInfo.accessType == MemoryAccessType::Write) + { + // If this is a write to memory in the active reads list, we need a barrier + if (contains(activeReads, memOpInfo)) + { + if (prevBarrier.barrier) + { + if (_debug) + prevBarrier.barrier.emitRemark("Barrier needed because of write to memory in active reads"); + commitPrevBarrier(); + } + } + + activeWrites.push_back(memOpInfo); + } + else + { + // If this is a read to memory in the active writes list, we need a barrier + if (contains(activeWrites, memOpInfo)) + { + if (prevBarrier.barrier) + { + if (_debug) + prevBarrier.barrier.emitRemark("Barrier needed because of read to memory in active writes"); + commitPrevBarrier(); + } + } + + activeReads.push_back(memOpInfo); + } + }, + }, + memoryOp); + } + + // Delete prevBarrier.barrier if necessary + if (prevBarrier.barrier) + { + if (_debug) + prevBarrier.barrier.emitRemark("BarrierOpRewrite: removing redundant barrier"); + prevBarrier.barrier.erase(); + } +} + +namespace accera::transforms::value +{ + +std::unique_ptr createBarrierOptPass() +{ + return std::make_unique(); +} + +} // namespace accera::transforms::value diff --git a/accera/transforms/src/value/RangeValueOptimizePass.cpp b/accera/transforms/src/value/RangeValueOptimizePass.cpp index 930f99cf..50906e62 100644 --- a/accera/transforms/src/value/RangeValueOptimizePass.cpp +++ b/accera/transforms/src/value/RangeValueOptimizePass.cpp @@ -60,8 +60,20 @@ struct RangeValue min = APInt(maxBitWidth, min_, true); max = APInt(maxBitWidth, max_, true); } - RangeValue(APInt min, APInt max) : - min(min), max(max) {} + RangeValue(APInt min_, APInt max_) + { + if (min_.isSingleWord() && max_.isSingleWord()) + { + min = APInt(maxBitWidth, min_.getSExtValue(), true); + max = APInt(maxBitWidth, max_.getSExtValue(), true); + } + else + { + // is not an int64_t, then the range is not valid + min = negInf(); + max = inf(); + } + } RangeValue(DictionaryAttr dict) { @@ -187,19 +199,48 @@ struct RangeValue return RangeValue(lowerbound, upperbound); } - RangeValue join(const RangeValue& other) const + RangeValue operator%(const RangeValue& other) const { - if (isUnBounded()) + if (isUnBounded() || other.isUnBounded()) { - return other; + return RangeValue(); } - if (other.isUnBounded()) + auto zero = APInt(maxBitWidth, 0, true); + if (other.isConstant(zero)) { - return *this; + // handle mod 0 + return RangeValue(); + } + + SmallVector cases; + if (isBoundedLower()) + { + if (other.isBoundedLower()) + { + cases.emplace_back(min.srem(other.min)); + } + if (other.isBoundedUpper()) + { + cases.emplace_back(min.srem(other.max)); + } + } + if (isBoundedUpper()) + { + if (other.isBoundedLower()) + { + cases.emplace_back(max.srem(other.min)); + } + if (other.isBoundedUpper()) + { + cases.emplace_back(max.srem(other.max)); + } } - return RangeValue(min.slt(other.min) ? min : other.min, - max.sgt(other.max) ? 
max : other.max); + auto [minElem, maxElem] = std::minmax_element(cases.begin(), cases.end(), [](const APInt& a, const APInt& b) { + return a.slt(b); + }); + return RangeValue(*minElem, *maxElem); } + bool operator==(const RangeValue& other) const { return min.eq(other.min) && max.eq(other.max); @@ -239,14 +280,34 @@ struct RangeValue { return !isUnBounded(); } + bool isBoundedLower() const + { + return !isUnBoundedLower(); + } + bool isBoundedUpper() const + { + return !isUnBoundedUpper(); + } + bool isUnBoundedLower() const + { + return min.eq(negInf()); + } + bool isUnBoundedUpper() const + { + return max.eq(inf()); + } bool isUnBounded() const { - return min.eq(negInf()) || max.eq(inf()); + return isUnBoundedLower() || isUnBoundedUpper(); } bool isConstant() const { return isBounded() && min.eq(max); } + bool isConstant(APInt val) const + { + return isConstant() && min.eq(val); + } APInt getConstant() const { assert(isConstant()); @@ -295,7 +356,7 @@ struct RangeValueAnalysis if (nextOp == worklist.end()) break; Operation* op = *nextOp; - worklist.erase(op); + worklist.erase(op); auto range = resolveRangeValue(op); @@ -365,11 +426,13 @@ struct RangeValueAnalysis .Case([&](ConstantOp op) { return resolveRangeValue(op); }) .Case([&](ConstantIndexOp op) { return resolveRangeValue(op); }) .Case([&](ConstantIntOp op) { return resolveRangeValue(op); }) + .Case([&](IndexCastOp op) { return resolveRangeValue(op); }) .Case([&](gpu::ThreadIdOp op) { return resolveRangeValue(op); }) .Case([&](gpu::BlockIdOp op) { return resolveRangeValue(op); }) .Case([&](AddIOp op) { return resolveRangeValue(op); }) .Case([&](SubIOp op) { return resolveRangeValue(op); }) .Case([&](MulIOp op) { return resolveRangeValue(op); }) + .Case([&](SignedRemIOp op) { return resolveRangeValue(op); }) .Case([&](scf::ForOp op) { return resolveRangeValue(op); }) .Case([&](AffineForOp op) { return resolveRangeValue(op); }) .Default([&](Operation*) { return RangeValue(); }); @@ -394,6 +457,16 @@ struct RangeValueAnalysis auto value = op.getValue(); return RangeValue(value, value); } + RangeValue resolveRangeValue(IndexCastOp op) + { + auto val = op.in(); + if (auto defOp = val.getDefiningOp()) + { + return resolveRangeValue(defOp); + } + // otherwise this is a BlockArgument which conservatively we assume has no range + return RangeValue(); + } RangeValue resolveRangeValue(gpu::ThreadIdOp op) { auto gpuMod = op->getParentOfType(); @@ -441,7 +514,12 @@ struct RangeValueAnalysis auto operands = resolveOperands(op); return operands[0] * operands[1]; } - RangeValue resolveLoopBounds(AffineForOp op) + RangeValue resolveRangeValue(SignedRemIOp op) + { + auto operands = resolveOperands(op); + return operands[0] % operands[1]; + } + RangeValue resolveRangeValue(AffineForOp op) { return op.hasConstantBounds() ? 
RangeValue(op.getConstantLowerBound(), op.getConstantUpperBound()) : RangeValue(); } @@ -471,7 +549,7 @@ struct RangeValueOptimizePass : public ConvertRangeValueOptimizeBase(op->getLoc(), i1Ty, builder.getBoolAttr(classification == CmpIOpClassification::AlwaysTrue)); op.replaceAllUsesWith(val); op.erase(); - } + } }); } @@ -498,7 +576,7 @@ struct RangeValueOptimizePass : public ConvertRangeValueOptimizeBase; -def : Pat<(rcv_StoreOp $val, (rcv_OffsetOp $source, $args), $indices), (MemRef_StoreOp $val, $source, $args)>; -def : Pat<(MemRef_StoreOp $val, (rcv_OffsetOp $source, $args), $indices), (MemRef_StoreOp $val, $source, $args)>; +def : Pat<(accv_GetElementOp (accv_OffsetOp $source, $args)), (LoadOp $source, $args)>; +def : Pat<(accv_StoreOp $val, (accv_OffsetOp $source, $args), $indices), (MemRef_StoreOp $val, $source, $args)>; +def : Pat<(MemRef_StoreOp $val, (accv_OffsetOp $source, $args), $indices), (MemRef_StoreOp $val, $source, $args)>; #endif // ACCERA_VALUE_CONVERSION diff --git a/accera/transforms/src/value/ValueFuncToTargetPass.cpp b/accera/transforms/src/value/ValueFuncToTargetPass.cpp index 293ae994..c8701847 100644 --- a/accera/transforms/src/value/ValueFuncToTargetPass.cpp +++ b/accera/transforms/src/value/ValueFuncToTargetPass.cpp @@ -104,13 +104,13 @@ struct ValueFuncToTargetPass : public tr::ValueFuncToTargetBasegetAttrOfType("rcv_unrolled")) + if (op->getAttrOfType("accv_unrolled")) { auto tripCount = mlir::getConstantTripCount(op); if (tripCount && *tripCount >= 1) (void)mlir::loopUnrollFull(op); } - else if (auto jammed = op->getAttrOfType("rcv_unroll_jam")) + else if (auto jammed = op->getAttrOfType("accv_unroll_jam")) { (void)mlir::loopUnrollJamByFactor(op, (uint64_t)jammed.getInt()); } diff --git a/accera/transforms/src/value/ValueToStandardLoweringPass.cpp b/accera/transforms/src/value/ValueToStandardLoweringPass.cpp index 071e80bb..32662212 100644 --- a/accera/transforms/src/value/ValueToStandardLoweringPass.cpp +++ b/accera/transforms/src/value/ValueToStandardLoweringPass.cpp @@ -524,6 +524,8 @@ struct ValueModuleOpRewritePattern : OpRewritePattern { gpuModOp->setAttr(mlir::gpu::getDefaultGpuBinaryAnnotation(), rewriter.getStringAttr("HSACO")); + gpuModOp->setAttr(vir::ValueModuleOp::getExecRuntimeAttrName(), + ir::value::ExecutionRuntimeAttr::get(getContext(), vir::ExecutionRuntime::ROCM)); } } void AddNVVMAnnotations(vir::ValueModuleOp module, PatternRewriter& rewriter) const @@ -533,6 +535,8 @@ struct ValueModuleOpRewritePattern : OpRewritePattern { gpuModOp->setAttr(mlir::gpu::getDefaultGpuBinaryAnnotation(), rewriter.getStringAttr("CUBIN")); + gpuModOp->setAttr(vir::ValueModuleOp::getExecRuntimeAttrName(), + ir::value::ExecutionRuntimeAttr::get(getContext(), vir::ExecutionRuntime::CUDA)); } } @@ -556,6 +560,8 @@ struct ValueModuleOpRewritePattern : OpRewritePattern module->setAttr( mlir::spirv::getTargetEnvAttrName(), targetEnvAttr); + module->setAttr(vir::ValueModuleOp::getExecRuntimeAttrName(), + ir::value::ExecutionRuntimeAttr::get(getContext(), vir::ExecutionRuntime::VULKAN)); } LogicalResult matchAndRewrite(vir::ValueModuleOp vModuleOp, PatternRewriter& rewriter) const final @@ -569,7 +575,7 @@ struct ValueModuleOpRewritePattern : OpRewritePattern AddGPUAnnotations(module, rewriter); const auto runtime = utilir::ResolveExecutionRuntime(vModuleOp); - if (runtime == vir::ExecutionRuntime::Vulkan) + if (runtime == vir::ExecutionRuntime::VULKAN) { AddVulkanAnnotations(module, rewriter); } @@ -577,7 +583,7 @@ struct ValueModuleOpRewritePattern : 
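
The `RangeValueOptimizePass` changes above extend the interval analysis with `IndexCastOp` and `SignedRemIOp` support and then fold comparisons whose outcome is implied by the operand ranges. A toy `int64_t` version of those two pieces (the real code uses `APInt` and tracks bit widths; the srem bounds below also bail out on any zero divisor bound, which is slightly more conservative than the patch, which only bails on a constant-zero divisor):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <optional>

struct Range
{
    std::optional<int64_t> lo, hi; // nullopt = unbounded on that side

    static Range unknown() { return {}; }

    // Signed remainder: take the min/max over the four bound combinations,
    // mirroring the approach in the patch; give up on unbounded inputs.
    Range srem(const Range& other) const
    {
        if (!lo || !hi || !other.lo || !other.hi)
            return unknown();
        if (*other.lo == 0 || *other.hi == 0)
            return unknown(); // avoid dividing by zero in the candidate set
        int64_t cases[] = { *lo % *other.lo, *lo % *other.hi,
                            *hi % *other.lo, *hi % *other.hi };
        auto [mn, mx] = std::minmax_element(std::begin(cases), std::end(cases));
        return Range{ *mn, *mx };
    }
};

// With ranges for both operands, a comparison such as `a < b` can sometimes be
// folded to a constant, which is how the pass removes always-true/false guards.
std::optional<bool> foldLessThan(const Range& a, const Range& b)
{
    if (a.hi && b.lo && *a.hi < *b.lo) return true;    // always true
    if (a.lo && b.hi && *a.lo >= *b.hi) return false;  // always false
    return std::nullopt;                               // cannot decide
}
```
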
OpRewritePattern { AddNVVMAnnotations(vModuleOp, rewriter); } - else if (runtime == vir::ExecutionRuntime::Rocm) + else if (runtime == vir::ExecutionRuntime::ROCM) { AddRocmAnnotations(vModuleOp, rewriter); } @@ -608,16 +614,16 @@ auto GetGPUModuleBinaryAnnotationAttrValue(vir::ExecutionRuntime runtime) switch (runtime) { // ref: mlir/test/Conversion/GPUToROCm/lower-rocdl-kernel-to-hsaco.mlir - case vir::ExecutionRuntime::Rocm: + case vir::ExecutionRuntime::ROCM: return "HSACO"; // ref: mlir/test/Conversion/GPUToCUDA/lower-nvvm-kernel-to-cubin.mlir case vir::ExecutionRuntime::CUDA: return "CUBIN"; - case vir::ExecutionRuntime::Vulkan: + case vir::ExecutionRuntime::VULKAN: [[fallthrough]]; - case vir::ExecutionRuntime::Default: + case vir::ExecutionRuntime::DEFAULT: [[fallthrough]]; default: return ""; @@ -635,8 +641,7 @@ struct GPUTargetedFuncRewritePattern : OpRewritePattern void rewrite(FuncOp funcOp, PatternRewriter& rewriter) const final { - // TODO: Make this an attribute on the gpu.module - const auto gpuRuntime = vir::ExecutionRuntime::Rocm; + auto gpuRuntime = utilir::ResolveExecutionRuntime(funcOp).value_or(vir::ExecutionRuntime::NONE); auto loc = rewriter.getFusedLoc({ funcOp.getLoc(), RC_FILE_LOC(rewriter) }); OpBuilder::InsertionGuard guard(rewriter); @@ -647,6 +652,8 @@ struct GPUTargetedFuncRewritePattern : OpRewritePattern auto gpuModule = rewriter.create(loc, newFuncName + "_module"); gpuModule->setAttr(GetGPUModuleBinaryAnnotationAttrName(), rewriter.getStringAttr(GetGPUModuleBinaryAnnotationAttrValue(gpuRuntime))); + gpuModule->setAttr(vir::ValueModuleOp::getExecRuntimeAttrName(), + ir::value::ExecutionRuntimeAttr::get(getContext(), gpuRuntime)); gpuModule.setVisibility(mlir::SymbolTable::Visibility::Public); auto insertPt = utilir::GetTerminalInsertPoint(gpuModule); @@ -672,7 +679,7 @@ struct GPUTargetedFuncRewritePattern : OpRewritePattern fnAttrs.emplace_back(rewriter.getIdentifier(mlir::gpu::GPUDialect::getKernelFuncAttrName()), rewriter.getUnitAttr()); - if (gpuRuntime == vir::ExecutionRuntime::Vulkan) + if (gpuRuntime == vir::ExecutionRuntime::VULKAN) { auto entryPointLocalSize = blockDimsLaunchConfig; assert(entryPointLocalSize.size() == kLocalSizeDimSize); diff --git a/accera/utilities/include/MemoryLayout.h b/accera/utilities/include/MemoryLayout.h index b5db1720..d8346164 100644 --- a/accera/utilities/include/MemoryLayout.h +++ b/accera/utilities/include/MemoryLayout.h @@ -256,7 +256,7 @@ namespace utilities None = 0, Global = 1, Shared = 3, - Local = 5, + Private = 5, }; /// A class representing layout of a block of data in memory where the block can also @@ -344,6 +344,8 @@ namespace utilities /// the canonical row-major ordering of 2D arrays, and [1, 0] for column-major. 
MemoryLayout(const MemoryShape& size, const MemoryShape& extent, const MemoryShape& offset, const DimensionOrder& order); + MemoryLayout(const MemoryShape& size, const MemoryShape& extent, const MemoryShape& offset, const MemoryShape& increment); + /// Returns the number of dimensions in this memory layout /// /// The number of dimensions @@ -559,7 +561,6 @@ namespace utilities std::string ToString() const; private: - MemoryLayout(const MemoryShape& size, const MemoryShape& extent, const MemoryShape& offset, const MemoryShape& increment); MemoryLayout(const MemoryShape& size, const MemoryShape& extent, const MemoryShape& offset, const MemoryShape& increment, const DimensionOrder& order); void BoundsCheckDimensionIndex(size_t index) const; size_t GetDataOffset() const; // offset for entry {0,0,0...} diff --git a/accera/value/include/Cache.h b/accera/value/include/Cache.h index b0f1c0fa..2536e58b 100644 --- a/accera/value/include/Cache.h +++ b/accera/value/include/Cache.h @@ -61,9 +61,12 @@ namespace value const std::optional& triggerIndex, const std::optional& maxElements, const MemoryAffineCoefficients& memoryCoefficients, + bool thrifty, + bool doubleBufferCache = false, CacheIndexing mapping = CacheIndexing::GlobalToPhysical, CacheAllocation allocation = CacheAllocation::Automatic, MemorySpace memorySpace = MemorySpace::None, + MemorySpace doubleBufferMemorySpace = MemorySpace::None, ExecutionTarget execTarget = targets::CPU{}); Cache(accera::ir::loopnest::ScheduleOp schedule, @@ -72,9 +75,12 @@ namespace value const std::optional& triggerIndex, const std::optional& maxElements, const DimensionOrder& dimOrder, + bool thrifty, + bool doubleBufferCache = false, CacheIndexing mapping = CacheIndexing::GlobalToPhysical, CacheAllocation allocation = CacheAllocation::Automatic, MemorySpace memorySpace = MemorySpace::None, + MemorySpace doubleBufferMemorySpace = MemorySpace::None, ExecutionTarget execTarget = targets::CPU{}); // Runtime-Init caching version diff --git a/accera/value/include/CompilerOptions.h b/accera/value/include/CompilerOptions.h index 85453ef8..af054c68 100644 --- a/accera/value/include/CompilerOptions.h +++ b/accera/value/include/CompilerOptions.h @@ -49,7 +49,7 @@ namespace value TargetDevice targetDevice = { "host" }; /// Name of the target runtime. - ExecutionRuntime executionRuntime = ExecutionRuntime::Default; + ExecutionRuntime executionRuntime = ExecutionRuntime::DEFAULT; // Options that can be changed during code generation (e.g., per function) /// Emit code that calls an external BLAS library. diff --git a/accera/value/include/EmitterContext.h b/accera/value/include/EmitterContext.h index 088a7346..6c741193 100644 --- a/accera/value/include/EmitterContext.h +++ b/accera/value/include/EmitterContext.h @@ -317,21 +317,27 @@ namespace value Value LogicalOperation(ValueLogicalOperation op, Value source1, Value source2); - /// Performs matrix multiply accumulate operation D = A.B + C. + /// Performs matrix multiply load operation. /// There are restrictions on the input types and sizes. - /// The result destination matrix - /// The input A matrix - /// The input B matrix - /// The input C matrix - void MFMA(Matrix& dest, Matrix A, Matrix B, Matrix C); + /// The input memref + /// The shape of the load + /// The kind of the mfma matrix + Matrix MFMALoad(Value source, const std::vector & shape, const std::string & operand); + + /// Performs matrix multiply store operation. + /// There are restrictions on the source type. 
+ /// The input mfma matrix + /// The target memref + void MFMAStore(Matrix source, Value target); /// Performs matrix multiply accumulate compute operation D = A.B + C. + /// This operation assumes that A, B, C, and D have been loaded using the MFMALoad operation. /// There are restrictions on the input types and sizes. - /// The input A matrix - /// The input B matrix - /// The input C matrix - /// An instance of Matrix pointing to the result - // Matrix MFMACompute(Matrix A, Matrix B, Matrix C); + /// The input A mfma matrix + /// The input B mfma matrix + /// The input C mfma matrix + /// The result destination mfma matrix + Matrix MFMACompute(Matrix A, Matrix B, Matrix C); Scalar Max(Vector input); @@ -461,7 +467,11 @@ namespace value virtual Value LogicalOperationImpl(ValueLogicalOperation op, Value source1, Value source2) = 0; - virtual void MFMAImpl(Matrix& dest, Matrix A, Matrix B, Matrix C) = 0; + virtual Matrix MFMALoadImpl(Value source, const std::vector & shape, const std::string & operand) = 0; + + virtual void MFMAStoreImpl(Matrix source, Value target) = 0; + + virtual Matrix MFMAComputeImpl(Matrix A, Matrix B, Matrix C) = 0; virtual Scalar MaxImpl(Vector input) = 0; @@ -768,7 +778,9 @@ namespace value void ForRanges(std::vector range_ends, std::function)> fn); - void MFMA(Matrix& dest, Matrix A, Matrix B, Matrix C); + Matrix MFMALoad(Value source, const std::vector & shape, const std::string & operand); + void MFMAStore(Matrix source, Value target); + Matrix MFMACompute(Matrix A, Matrix B, Matrix C); /// Runs the provided function, in parallel if possible /// The types that represent the captured values. Must be `Value` or types that provide a member diff --git a/accera/value/include/ExecutionOptions.h b/accera/value/include/ExecutionOptions.h index d22392e4..cbcd7fa9 100644 --- a/accera/value/include/ExecutionOptions.h +++ b/accera/value/include/ExecutionOptions.h @@ -6,8 +6,7 @@ #pragma once -#include -#include +#include namespace accera { @@ -15,46 +14,7 @@ namespace value { namespace targets { - // A struct encapsulating x, y, z indices for a GPU processor - struct Dim3 - { - /// The x index - int64_t x; - /// The y index - int64_t y; - /// The z index - int64_t z; - - Dim3(int64_t x_ = 1, int64_t y_ = 1, int64_t z_ = 1) : - x(x_), y(y_), z(z_) {} - }; - - /// The CPU execution options - struct CPU - {}; - - /// The GPU execution options - struct GPU - { - /// Indicates the grid - Dim3 grid; - - /// Indicates the block - Dim3 block; - - GPU(Dim3 grid_ = Dim3(1, 1, 1), Dim3 block_ = Dim3(1, 1, 1)) : - grid(grid_), block(block_){}; - }; - - using Target = std::variant; - enum class Runtime : int - { - Default, - Vulkan, - Rocm, - CUDA - }; - + using namespace ir::targets; } // namespace targets using ExecutionTarget = targets::Target; diff --git a/accera/value/include/FunctionDeclaration.h b/accera/value/include/FunctionDeclaration.h index 4c62cd63..442b21e6 100644 --- a/accera/value/include/FunctionDeclaration.h +++ b/accera/value/include/FunctionDeclaration.h @@ -224,7 +224,7 @@ namespace value std::optional _pointer; ExecutionTarget _execTarget; - ExecutionRuntime _execRuntime = ExecutionRuntime::Default; + ExecutionRuntime _execRuntime = ExecutionRuntime::DEFAULT; FunctionInlining _inlineState = FunctionInlining::defaultInline; bool _isDecorated = true; bool _isPublic = false; diff --git a/accera/value/include/MLIREmitterContext.h b/accera/value/include/MLIREmitterContext.h index efa93cbb..a4eb2a1f 100644 --- a/accera/value/include/MLIREmitterContext.h +++ 
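
The `EmitterContext` API above replaces the single `MFMA(dest, A, B, C)` entry point with a fragment-style `MFMALoad` / `MFMACompute` / `MFMAStore` triple. A schematic call sequence with stand-in types, so the flow is visible on its own (the shape values and operand tags are guesses for illustration, not documented values):

```cpp
#include <string>
#include <vector>

// Stand-ins modeling the emitter's Value/Matrix handles; not the real classes.
struct Value {};
struct Matrix {};

Matrix MFMALoad(Value source, const std::vector<int>& shape, const std::string& operand);
void   MFMAStore(Matrix source, Value target);
Matrix MFMACompute(Matrix A, Matrix B, Matrix C);

// Load A, B and the accumulator C as MFMA fragments, run the fused
// multiply-accumulate, then store the result fragment back to memory.
void mfmaTileSketch(Value A, Value B, Value C, Value D)
{
    Matrix aFrag = MFMALoad(A, { 16, 16 }, "A");
    Matrix bFrag = MFMALoad(B, { 16, 16 }, "B");
    Matrix cFrag = MFMALoad(C, { 16, 16 }, "C");
    Matrix dFrag = MFMACompute(aFrag, bFrag, cFrag); // D = A.B + C
    MFMAStore(dFrag, D);
}
```
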
b/accera/value/include/MLIREmitterContext.h @@ -165,8 +165,11 @@ namespace value Value LogicalOperationImpl(ValueLogicalOperation op, Value source1, Value source2) override; - void MFMAImpl(Matrix & dest, Matrix A, Matrix B, Matrix C) override; - // Value MFMAComputeImpl(Value A, Value B, Value C) override; + Matrix MFMALoadImpl(Value source, const std::vector& shape, const std::string& operand) override; + + void MFMAStoreImpl(Matrix source, Value target) override; + + Matrix MFMAComputeImpl(Matrix A, Matrix B, Matrix C) override; Scalar CastImpl(Scalar value, ValueType type, bool srcSigned); Scalar CastImpl(Scalar value, ValueType type) override; diff --git a/accera/value/include/Plan.h b/accera/value/include/Plan.h index 96db5726..09af9401 100644 --- a/accera/value/include/Plan.h +++ b/accera/value/include/Plan.h @@ -65,67 +65,77 @@ namespace value /// The outermost index in one of the cached dimensions to include in the cache /// The index to fill the cache at, must be the same as outermostIncludedSplitIndex or precede it in the schedule order /// The affine coefficients to use to map from active block position to cache position in the cache buffer + /// Whether to make this a thrifty cache + /// Whether or not to use double-buffering to fill this cache /// The cache indexing /// The cache allocation /// The memory space - /// The affine layout + /// The memory space to put the double buffer temporary buffer in /// An instance of Cache - Cache AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, const ScalarIndex& triggerIndex, const MemoryAffineCoefficients& memoryMap, CacheIndexing indexing = CacheIndexing::GlobalToPhysical, CacheAllocation allocation = CacheAllocation::Automatic, MemorySpace memorySpace = MemorySpace::None); + Cache AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, const ScalarIndex& triggerIndex, const MemoryAffineCoefficients& memoryMap, bool thrifty, bool doubleBuffer = false, CacheIndexing indexing = CacheIndexing::GlobalToPhysical, CacheAllocation allocation = CacheAllocation::Automatic, MemorySpace memorySpace = MemorySpace::None, MemorySpace doubleBufferMemorySpace = MemorySpace::None); /// Adds a manual active block cache for a view target or different cache /// The target being cached (e.g Array, Matrix, etc) /// The outermost index in one of the cached dimensions to include in the cache /// The index to fill the cache at, must be the same as outermostIncludedSplitIndex or precede it in the schedule order /// The dimension order permutation to use to map from active block position to cache position in the cache buffer + /// Whether to make this a thrifty cache + /// Whether or not to use double-buffering to fill this cache /// The cache indexing /// The cache allocation /// The memory space - /// The affine layout + /// The memory space to put the double buffer temporary buffer in /// An instance of Cache - Cache AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, const ScalarIndex& triggerIndex, const DimensionOrder& dimOrder, CacheIndexing indexing = CacheIndexing::GlobalToPhysical, CacheAllocation allocation = CacheAllocation::Automatic, MemorySpace memorySpace = MemorySpace::None); + Cache AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, const ScalarIndex& triggerIndex, const DimensionOrder& dimOrder, bool thrifty, bool doubleBuffer = false, CacheIndexing indexing = CacheIndexing::GlobalToPhysical, CacheAllocation allocation = 
CacheAllocation::Automatic, MemorySpace memorySpace = MemorySpace::None, MemorySpace doubleBufferMemorySpace = MemorySpace::None); /// Adds a manual active block cache for a view target or different cache with an identity dimension ordering /// The target being cached (e.g Array, Matrix, etc) /// The outermost index in one of the cached dimensions to include in the cache - /// The affine coefficients to use to map from active block position to cache position in the cache buffer + /// Whether to make this a thrifty cache /// The cache indexing /// The cache allocation /// The memory space - /// The affine layout + /// The memory space to put the double buffer temporary buffer in /// An instance of Cache - Cache AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, CacheIndexing indexing = CacheIndexing::GlobalToPhysical, CacheAllocation allocation = CacheAllocation::Automatic, MemorySpace memorySpace = MemorySpace::None); + Cache AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, bool thrifty = false, bool doubleBuffer = false, CacheIndexing indexing = CacheIndexing::GlobalToPhysical, CacheAllocation allocation = CacheAllocation::Automatic, MemorySpace memorySpace = MemorySpace::None, MemorySpace doubleBufferMemorySpace = MemorySpace::None); /// Adds a manual active block cache for a view target or different cache /// The target being cached (e.g Array, Matrix, etc) /// A cutoff budget that will be used to select the outermost index in one of the cached dimensions to include in the cache (in order not to exceed the budget) /// The affine coefficients to use to map from active block position to cache position in the cache buffer + /// Whether to make this a thrifty cache + /// Whether or not to use double-buffering to fill this cache /// The cache indexing /// The cache allocation /// The memory space - /// The affine layout + /// The memory space to put the double buffer temporary buffer in /// An instance of Cache - Cache AddCache(std::variant target, int64_t maxElements, const MemoryAffineCoefficients& memoryMap, CacheIndexing indexing = CacheIndexing::GlobalToPhysical, CacheAllocation allocation = CacheAllocation::Automatic, MemorySpace memorySpace = MemorySpace::None); + Cache AddCache(std::variant target, int64_t maxElements, const MemoryAffineCoefficients& memoryMap, bool thrifty, bool doubleBuffer = false, CacheIndexing indexing = CacheIndexing::GlobalToPhysical, CacheAllocation allocation = CacheAllocation::Automatic, MemorySpace memorySpace = MemorySpace::None, MemorySpace doubleBufferMemorySpace = MemorySpace::None); /// Adds a manual active block cache for a view target or different cache /// The target being cached (e.g Array, Matrix, etc) /// A cutoff budget that will be used to select the outermost index in one of the cached dimensions to include in the cache (in order not to exceed the budget) /// The dimension order permutation to use to map from active block position to cache position in the cache buffer + /// Whether to make this a thrifty cache + /// Whether or not to use double-buffering to fill this cache /// The cache indexing /// The cache allocation /// The memory space - /// The affine layout + /// The memory space to put the double buffer temporary buffer in /// An instance of Cache - Cache AddCache(std::variant target, int64_t maxElements, const DimensionOrder& dimOrder, CacheIndexing indexing = CacheIndexing::GlobalToPhysical, CacheAllocation allocation = CacheAllocation::Automatic, MemorySpace memorySpace = 
MemorySpace::None); + Cache AddCache(std::variant target, int64_t maxElements, const DimensionOrder& dimOrder, bool thrifty, bool doubleBuffer = false, CacheIndexing indexing = CacheIndexing::GlobalToPhysical, CacheAllocation allocation = CacheAllocation::Automatic, MemorySpace memorySpace = MemorySpace::None, MemorySpace doubleBufferMemorySpace = MemorySpace::None); /// Adds a manual active element cache for a view target or different cache with an identity dimension ordering /// The target being cached (e.g Array, Matrix, etc) /// A cutoff budget that will be used to select the outermost index in one of the cached dimensions to include in the cache (in order not to exceed the budget) + /// Whether to make this a thrifty cache + /// Whether or not to use double-buffering to fill this cache /// The cache indexing /// The cache allocation /// The memory space - /// The affine layout + /// The memory space to put the double buffer temporary buffer in /// An instance of Cache - Cache AddCache(std::variant target, int64_t maxElements, CacheIndexing indexing = CacheIndexing::GlobalToPhysical, CacheAllocation allocation = CacheAllocation::Automatic, MemorySpace memorySpace = MemorySpace::None); + Cache AddCache(std::variant target, int64_t maxElements, bool thrifty = false, bool doubleBuffer = false, CacheIndexing indexing = CacheIndexing::GlobalToPhysical, CacheAllocation allocation = CacheAllocation::Automatic, MemorySpace memorySpace = MemorySpace::None, MemorySpace doubleBufferMemorySpace = MemorySpace::None); /// Emits an offline packing function for the given target and changes its usage in the function to assume a packed representation /// The target being cached (e.g Array, Matrix, etc) @@ -157,7 +167,7 @@ namespace value private: friend class Schedule; - Plan(Schedule& sched, ExecutionRuntime execRuntime = ExecutionRuntime::Default); + Plan(Schedule& sched, ExecutionRuntime execRuntime = ExecutionRuntime::DEFAULT); std::unique_ptr _impl; }; @@ -169,18 +179,21 @@ namespace value GPUPlan(GPUPlan&&) noexcept; GPUPlan& operator=(const GPUPlan&) = delete; GPUPlan& operator=(GPUPlan&&) noexcept; - ~GPUPlan(); + ~GPUPlan(); /// Adds a cache for a view target /// The target being cached (e.g Array, Matrix, etc) /// The outermost index in one of the cached dimensions to include in the cache /// The index to fill the cache at, must be the same as outermostIncludedSplitIndex or precede it in the schedule order /// The dimension order permutation to use to map from active block position to cache position in the cache buffer + /// Whether to make this a thrifty cache + /// Whether or not to use double-buffering to fill this cache /// The cache mapping /// The cache allocation /// The memory space + /// The memory space to put the double buffer temporary buffer in /// An instance of Cache - Cache AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, const value::ScalarIndex& triggerIndex, const DimensionOrder& dimOrder, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace); + Cache AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, const value::ScalarIndex& triggerIndex, const DimensionOrder& dimOrder, bool thrifty, bool doubleBuffer, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace, MemorySpace doubleBufferMemorySpace = MemorySpace::None); /// Adds a cache for a view target /// The target being cached (e.g Array, Matrix, etc) @@ -194,14 +207,14 @@ namespace value /// The GPU processor, indicating a 
block or thread void MapIndexToProcessor(ScalarIndex index, Processor proc); - /// Tensorize two iteration space dimensions - /// The scalar indices to tensorize. Two indicies must be specified whose dimensions must be contiguous in the iteration space dimension order. + /// Tensorize three iteration space dimensions + /// The scalar indices to tensorize. Three indices must be specified whose dimensions must be contiguous in the iteration space dimension order. /// The dimension of the tensor operation. - void Tensorize(std::vector indices, std::vector dims); + void Tensorize(std::vector indices, std::array dims); private: friend class Schedule; - GPUPlan(targets::GPU gpuOptions, Schedule& sched, ExecutionRuntime execRuntime = ExecutionRuntime::Default); + GPUPlan(targets::GPU gpuOptions, Schedule& sched, ExecutionRuntime execRuntime = ExecutionRuntime::DEFAULT); std::unique_ptr _impl; }; diff --git a/accera/value/include/Scalar.h b/accera/value/include/Scalar.h index 73959cc8..1dfeb03f 100644 --- a/accera/value/include/Scalar.h +++ b/accera/value/include/Scalar.h @@ -139,6 +139,8 @@ namespace value return Scalar(static_cast(t)); case ValueType::Index: return Scalar(static_cast(t)); + case ValueType::Float16: + return Scalar(float16_t{static_cast(t)}); case ValueType::Float: return Scalar(static_cast(t)); case ValueType::Double: diff --git a/accera/value/include/Schedule.h b/accera/value/include/Schedule.h index fed34dfc..976becef 100644 --- a/accera/value/include/Schedule.h +++ b/accera/value/include/Schedule.h @@ -153,7 +153,7 @@ namespace value /// The target GPU options /// The target execution runtime /// The execution plan - GPUPlan CreateGPUPlan(targets::GPU gpuOptions, ExecutionRuntime execRuntime = ExecutionRuntime::Default); + GPUPlan CreateGPUPlan(targets::GPU gpuOptions, ExecutionRuntime execRuntime = ExecutionRuntime::DEFAULT); void dump(); diff --git a/accera/value/include/Value.h b/accera/value/include/Value.h index de6ce936..0adce08b 100644 --- a/accera/value/include/Value.h +++ b/accera/value/include/Value.h @@ -42,6 +42,7 @@ namespace value std::vector, std::vector, std::vector, + std::vector, std::vector, std::vector>; @@ -66,6 +67,7 @@ namespace value template inline static constexpr bool IsAcceptableDataType = std::is_same_v, T> && (std::is_arithmetic_v || + std::is_same_v, float16_t> || std::is_same_v, index_t> || std::is_same_v, Boolean>); @@ -328,6 +330,9 @@ namespace value /// Returns true if the instance's type is a floating point type bool IsFloatingPoint() const; + /// Returns true if the instance's type is a 16-bit float + bool IsFloat16() const; + /// Returns true if the instance's type is a 32-bit float bool IsFloat32() const; diff --git a/accera/value/include/ValueType.h b/accera/value/include/ValueType.h index a9f09d31..dd5e069b 100644 --- a/accera/value/include/ValueType.h +++ b/accera/value/include/ValueType.h @@ -19,6 +19,10 @@ namespace value { enum class index_t : int64_t {}; + struct float16_t { + using underlying_type = float; + float data; + }; /// An enumeration of primitive types supported by the value library enum class ValueType @@ -41,6 +45,8 @@ namespace value Int32, /// 8 byte signed integer Int64, + /// 2 byte floating point + Float16, /// 4 byte floating point Float, /// 8 byte floating point @@ -117,6 +123,10 @@ namespace value { return ValueType::Index; } + else if constexpr (std::is_same_v) + { + return ValueType::Float16; + } else if constexpr (std::is_same_v) { return ValueType::Float; diff --git a/accera/value/src/ArrayOperations.cpp 
b/accera/value/src/ArrayOperations.cpp index 38502322..6a879f78 100644 --- a/accera/value/src/ArrayOperations.cpp +++ b/accera/value/src/ArrayOperations.cpp @@ -297,9 +297,9 @@ namespace value [[maybe_unused]] auto bColumnStride = B.GetLayout().GetIncrement(0); if (((N * K) > (128 * 128)) || (B.GetLayout().GetIncrement(0) < B.GetLayout().GetIncrement(1))) { - plan.AddCache(B, jKernelOuter2, CacheIndexing::GlobalToPhysical, CacheAllocation::Automatic, MemorySpace::Shared); + plan.AddCache(B, jKernelOuter2, false /* thrifty */, false /* doubleBuffer */, CacheIndexing::GlobalToPhysical, CacheAllocation::Automatic, MemorySpace::Shared); } - plan.AddCache(C, iInner, CacheIndexing::GlobalToPhysical, CacheAllocation::Automatic, MemorySpace::Shared); + plan.AddCache(C, iInner, false /* thrifty */, false /* doubleBuffer */, CacheIndexing::GlobalToPhysical, CacheAllocation::Automatic, MemorySpace::Shared); // Set unrolling schedule.Unroll(jKernelOuter); diff --git a/accera/value/src/Cache.cpp b/accera/value/src/Cache.cpp index c8c25b68..82271c3c 100644 --- a/accera/value/src/Cache.cpp +++ b/accera/value/src/Cache.cpp @@ -182,7 +182,19 @@ namespace value _cacheAccessContext.cacheRegionRelevantScheduleIndexRanges = _cacheInfo.cacheRegionRelevantScheduleIndexRanges; _cacheAccessContext.cacheRegionBaseIndices = _cacheInfo.cacheRegionBaseIndices; - BeginCacheRegionOp regionOp = builder.create(loc, _mlirValueInput, _cacheAccessContext, _mlirValueInput, *_cacheInfo.cacheIndex, *_cacheInfo.triggerIndex, _cacheId, _hierarchicalCacheLevel, false, false); + BeginCacheRegionOp regionOp = builder.create(loc, + _mlirValueInput, + _cacheAccessContext, + _mlirValueInput, + *_cacheInfo.cacheIndex, + *_cacheInfo.triggerIndex, + _cacheId, + _hierarchicalCacheLevel, + false, // activeBlockCache + false, // dimReorderCache + false, // thrifty + false, // doubleBufferCache + ir::value::MemorySpace::None); // doubleBufferMemorySpace [[maybe_unused]] auto endOp = builder.create(loc, regionOp); _scheduleOp.injectMapping(regionOp); } @@ -200,9 +212,12 @@ namespace value const std::optional& triggerIndex, const std::optional& maxElements, const std::variant& cacheMapping, + bool thrifty, + bool doubleBufferCache, CacheIndexing mapping, CacheAllocation allocation, MemorySpace dslMemorySpace, + MemorySpace dslDoubleBufferMemorySpace, ExecutionTarget execTarget) : CacheImpl(schedule, value, mapping), _execTarget(execTarget) @@ -210,6 +225,7 @@ namespace value auto builder = GetBuilder(); auto loc = builder.getUnknownLoc(); auto memorySpace = *ir::value::symbolizeMemorySpace((uint64_t)dslMemorySpace); + auto doubleBufferMemorySpace = *ir::value::symbolizeMemorySpace((uint64_t)dslDoubleBufferMemorySpace); _cacheInfo = MakeManualCacheInfo(builder, _baseMlirValueInput, allocation, schedule, keySliceIndex, triggerIndex, maxElements, cacheMapping, memorySpace); @@ -245,12 +261,27 @@ namespace value innermostIndex, _cacheId, _hierarchicalCacheLevel, - _cacheInfo.dimReorderCache); + _cacheInfo.dimReorderCache, + thrifty, + doubleBufferCache, + doubleBufferMemorySpace); cacheRegionOp = regionOp; } else { - BeginCacheRegionOp regionOp = builder.create(loc, _mlirValueInput, _cacheAccessContext, _baseMlirValueInput, *_cacheInfo.cacheIndex, *_cacheInfo.triggerIndex, _cacheId, _hierarchicalCacheLevel, true, _cacheInfo.dimReorderCache); + BeginCacheRegionOp regionOp = builder.create(loc, + _mlirValueInput, + _cacheAccessContext, + _baseMlirValueInput, + *_cacheInfo.cacheIndex, + *_cacheInfo.triggerIndex, + _cacheId, + _hierarchicalCacheLevel, 
+ true, // activeBlockCache + _cacheInfo.dimReorderCache, + thrifty, + doubleBufferCache, + doubleBufferMemorySpace); cacheRegionOp = regionOp; } auto regionHandle = cacheRegionOp->getResult(0); @@ -451,7 +482,10 @@ namespace value void AddCacheZero(mlir::OpBuilder& builder, mlir::Value cache) { auto loc = builder.getUnknownLoc(); - [[maybe_unused]] auto cacheZero = builder.create(loc, cache); + [[maybe_unused]] auto cacheZero = builder.create(loc, + cache, + "", // activeBlockTag + false); // thrifty } void AddCacheCopy(mlir::OpBuilder& builder, mlir::Value input, CacheAccessContext cacheAccessContext, CopyDirection direction) @@ -547,6 +581,9 @@ namespace value case ValueType::Int64: _packedBuffer = EmbedPackedBuffer(builder, constData, packedBufferName); break; + case ValueType::Float16: + _packedBuffer = EmbedPackedBuffer(builder, constData, packedBufferName); + break; case ValueType::Float: _packedBuffer = EmbedPackedBuffer(builder, constData, packedBufferName); break; @@ -740,9 +777,12 @@ namespace value const std::optional& triggerIndex, const std::optional& maxElements, const MemoryAffineCoefficients& memoryMap, + bool thrifty, + bool doubleBufferCache, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace, + MemorySpace doubleBufferMemorySpace, ExecutionTarget execTarget) { std::optional keySlice; @@ -757,11 +797,35 @@ namespace value } if (std::holds_alternative(value)) { - _impl = std::make_unique(schedule, std::get(value), keySlice, resolvedTriggerIndex, maxElements, memoryMap, mapping, allocation, memorySpace, execTarget); + _impl = std::make_unique(schedule, + std::get(value), + keySlice, + resolvedTriggerIndex, + maxElements, + memoryMap, + thrifty, + doubleBufferCache, + mapping, + allocation, + memorySpace, + doubleBufferMemorySpace, + execTarget); } else { - _impl = std::make_unique(schedule, std::get(value)->_impl.get(), keySlice, resolvedTriggerIndex, maxElements, memoryMap, mapping, allocation, memorySpace, execTarget); + _impl = std::make_unique(schedule, + std::get(value)->_impl.get(), + keySlice, + resolvedTriggerIndex, + maxElements, + memoryMap, + thrifty, + doubleBufferCache, + mapping, + allocation, + memorySpace, + doubleBufferMemorySpace, + execTarget); } } @@ -771,9 +835,12 @@ namespace value const std::optional& triggerIndex, const std::optional& maxElements, const DimensionOrder& dimOrder, + bool thrifty, + bool doubleBufferCache, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace, + MemorySpace doubleBufferMemorySpace, ExecutionTarget execTarget) { std::optional keySlice; @@ -789,11 +856,35 @@ namespace value if (std::holds_alternative(value)) { - _impl = std::make_unique(schedule, std::get(value), keySlice, resolvedTriggerIndex, maxElements, dimOrder, mapping, allocation, memorySpace, execTarget); + _impl = std::make_unique(schedule, + std::get(value), + keySlice, + resolvedTriggerIndex, + maxElements, + dimOrder, + thrifty, + doubleBufferCache, + mapping, + allocation, + memorySpace, + doubleBufferMemorySpace, + execTarget); } else { - _impl = std::make_unique(schedule, std::get(value)->_impl.get(), keySlice, resolvedTriggerIndex, maxElements, dimOrder, mapping, allocation, memorySpace, execTarget); + _impl = std::make_unique(schedule, + std::get(value)->_impl.get(), + keySlice, + resolvedTriggerIndex, + maxElements, + dimOrder, + thrifty, + doubleBufferCache, + mapping, + allocation, + memorySpace, + doubleBufferMemorySpace, + execTarget); } } diff --git a/accera/value/src/CompilerOptions.cpp 
b/accera/value/src/CompilerOptions.cpp index 11f1ea12..b22eda33 100644 --- a/accera/value/src/CompilerOptions.cpp +++ b/accera/value/src/CompilerOptions.cpp @@ -21,11 +21,13 @@ namespace value static ExecutionRuntime GetExecutionRuntime(std::string runtimeName) { return ::llvm::StringSwitch(runtimeName) - .Case("Default", ExecutionRuntime::Default) - .Case("Vulkan", ExecutionRuntime::Vulkan) - .Case("Rocm", ExecutionRuntime::Rocm) + .Case("Default", ExecutionRuntime::DEFAULT) + .Case("Vulkan", ExecutionRuntime::VULKAN) + .Case("Rocm", ExecutionRuntime::ROCM) .Case("CUDA", ExecutionRuntime::CUDA) - .Default(ExecutionRuntime::Default); + .Case("None", ExecutionRuntime::NONE) + .Case("OpenMP", ExecutionRuntime::OPENMP) + .Default(ExecutionRuntime::DEFAULT); } /// Constructor from a property bag diff --git a/accera/value/src/EmitterContext.cpp b/accera/value/src/EmitterContext.cpp index a6126dd2..5319cbc6 100644 --- a/accera/value/src/EmitterContext.cpp +++ b/accera/value/src/EmitterContext.cpp @@ -221,15 +221,33 @@ namespace value return LogicalOperationImpl(op, source1, source2); } - void EmitterContext::MFMA(Matrix& dest, Matrix A, Matrix B, Matrix C) + Matrix EmitterContext::MFMALoad(Value source, const std::vector& shape, const std::string& operand) + { + if (operand != "AOp" && operand != "BOp" && operand != "COp") + { + throw InputException(InputExceptionErrors::invalidArgument); + } + return MFMALoadImpl(source, shape, operand); + } + + void EmitterContext::MFMAStore(Matrix source, Value target) + { + if (source.GetType() != target.GetBaseType()) + { + throw InputException(InputExceptionErrors::invalidArgument); + } + MFMAStoreImpl(source, target); + } + + Matrix EmitterContext::MFMACompute(Matrix A, Matrix B, Matrix C) { - if (A.GetType() != B.GetType() || dest.GetType() != C.GetType()) + if (A.GetType() != B.GetType()) { throw InputException(InputExceptionErrors::invalidArgument); } - return MFMAImpl(dest, A, B, C); + return MFMAComputeImpl(A, B, C); } Scalar EmitterContext::Cast(Scalar value, ValueType type) @@ -465,9 +483,19 @@ namespace value } } - void MFMA(Matrix& dest, Matrix A, Matrix B, Matrix C) + Matrix MFMALoad(Value source, const std::vector& shape, const std::string& operand) + { + return GetContext().MFMALoad(source, shape, operand); + } + + void MFMAStore(Matrix source, Value target) + { + GetContext().MFMAStore(source, target); + } + + Matrix MFMACompute(Matrix A, Matrix B, Matrix C) { - return GetContext().MFMA(dest, A, B, C); + return GetContext().MFMACompute(A, B, C); } void DebugBreak() diff --git a/accera/value/src/MLIREmitterContext.cpp b/accera/value/src/MLIREmitterContext.cpp index 87f7e766..b2b4430f 100644 --- a/accera/value/src/MLIREmitterContext.cpp +++ b/accera/value/src/MLIREmitterContext.cpp @@ -17,7 +17,6 @@ #include #include -#include #include #include @@ -57,12 +56,15 @@ #include #include #include +#include +#include #include #include using namespace accera; using namespace accera::utilities; using namespace accera::value; + using ConstantData = accera::value::detail::ConstantData; namespace @@ -112,6 +114,8 @@ mlir::Type ToMLIRType(mlir::OpBuilder& builder, ValueType type) return builder.getIntegerType(64); case ValueType::Index: return builder.getIndexType(); + case ValueType::Float16: + return builder.getF16Type(); case ValueType::Float: return builder.getF32Type(); case ValueType::Double: @@ -433,6 +437,14 @@ auto ConstantDataToDenseElementAttr(mlir::ShapedType shape, const ConstantData& { throw 
InputException(InputExceptionErrors::invalidArgument, "Can't store an array of index type"); } + else if constexpr (std::is_same_v) + { + using float16_underlying_type = typename float16_t::underlying_type; + std::vector fp16Data(data.size()); + std::transform(data.begin(), data.end(), fp16Data.begin(), [](float16_t value) { return value.data; }); + + return mlir::DenseElementsAttr::get(shape, llvm::makeArrayRef(fp16Data)); + } else { return mlir::DenseElementsAttr::get(shape, llvm::makeArrayRef(data)); @@ -1015,7 +1027,7 @@ EmitterContext::DefinedFunction MLIRContext::CreateFunctionImpl(FunctionDeclarat if constexpr (std::is_same_v) { - if (funcRuntime != ExecutionRuntime::Default) + if (funcRuntime != ExecutionRuntime::DEFAULT) { auto execRuntimeAttrName = ir::value::ValueModuleOp::getExecRuntimeAttrName(); auto execRuntimeAttrValue = ir::value::ExecutionRuntimeAttr::get(b.getContext(), (ir::value::ExecutionRuntime)funcRuntime); @@ -1031,14 +1043,7 @@ EmitterContext::DefinedFunction MLIRContext::CreateFunctionImpl(FunctionDeclarat fnOp->setAttr( fnOp.getGPULaunchAttrName(), - b.getIndexArrayAttr({ - target.grid.x, - target.grid.y, - target.grid.z, - target.block.x, - target.block.y, - target.block.z, - })); + target.ToArrayAttr(b.getContext())); } return std::pair{ fnOp.getOperation(), &fnOp.body().back() }; @@ -1047,6 +1052,8 @@ EmitterContext::DefinedFunction MLIRContext::CreateFunctionImpl(FunctionDeclarat { auto fnContext = _impl->CreateNewScope({ entryBlock, entryBlock->begin() }); + mlir::OpBuilder::InsertionGuard guard(b); + b.restoreInsertionPoint({ entryBlock, entryBlock->begin() }); { std::lock_guard lock{ _mutex }; @@ -1260,6 +1267,13 @@ Value MLIRContext::StoreConstantDataImpl(ConstantData data, MemoryLayout layout, { op = b.create(loc, static_cast(data[0])); } + else if constexpr (std::is_same_v) + { + bool losesInfo = false; + auto f = llvm::APFloat(data[0].data); + f.convert(llvm::APFloat::IEEEhalf(), llvm::APFloat::rmNearestTiesToEven, &losesInfo); + op = b.create(loc, f, mlirElemTy.cast()); + } else if constexpr (std::is_integral_v || std::is_same_v) { auto elem = static_cast(data[0]); @@ -1303,6 +1317,14 @@ Value MLIRContext::StoreConstantDataImpl(ConstantData data, MemoryLayout layout, dataAttribute = mlir::DenseElementsAttr::get(flattenedTensorShapeTy, llvm::makeArrayRef(indexData)); } + else if constexpr (std::is_same_v) + { + using float16_underlying_type = typename float16_t::underlying_type; + std::vector fp16Data(data.size()); + std::transform(data.begin(), data.end(), fp16Data.begin(), [](float16_t value) { return value.data; }); + + dataAttribute = mlir::DenseElementsAttr::get(flattenedTensorShapeTy, llvm::makeArrayRef(fp16Data)); + } else { dataAttribute = mlir::DenseElementsAttr::get(flattenedTensorShapeTy, llvm::makeArrayRef(data)); @@ -1332,13 +1354,13 @@ Value MLIRContext::ResolveConstantDataReferenceImpl(Value constantDataSource) auto valueModuleOp = _impl->module(); auto searchSymName = mlir::dyn_cast(sourceRefGlobalOp).getGlobal().sym_name(); - // TODO: valueModuleOp.lookupSymbol() should be called here to look for an existing symbol, but so far, - // it doesn't work as expected. So manually walk the top level ops inside the ValueModuleOp to look for the symbol. + // TODO: valueModuleOp.lookupSymbol() should be called here to look for an existing symbol, but so far, + // it doesn't work as expected. So manually walk the top level ops inside the ValueModuleOp to look for the symbol. 
// Replace this workaround with a ValueModuleOp SymbolTable lookup once issues with comparing mlir::Identifiers is resolved. bool foundMatch = false; - for (auto globalOp : valueModuleOp.getOps()) + for (auto globalOp : valueModuleOp.getOps()) { - if (globalOp.sym_name() == searchSymName) + if (globalOp.sym_name() == searchSymName) { foundMatch = true; break; @@ -1372,10 +1394,10 @@ Value MLIRContext::ResolveConstantDataReferenceImpl(Value constantDataSource) auto refGlobalOp = mlir::dyn_cast(clonedRefGlobalOp); EmittableInfo& emittableInfo = StoreLocalEmittable({ const_cast( - refGlobalOp - .getResult() - .getAsOpaquePointer()), - { constantDataSource.GetBaseType(), 1 } }); + refGlobalOp + .getResult() + .getAsOpaquePointer()), + { constantDataSource.GetBaseType(), 1 } }); Emittable emittable{ &emittableInfo }; return Value(emittable, constantDataSource.GetLayout()); @@ -1826,38 +1848,59 @@ Value MLIRContext::LogicalOperationImpl(ValueLogicalOperation op, Value source1, return { emittable, ScalarLayout }; } -void MLIRContext::MFMAImpl(Matrix& dest, Matrix A, Matrix B, Matrix C) +static ir::value::MFMAMatrixType getMatrixTypeOfMemref(mlir::Value val, const std::vector& shape, llvm::StringRef operand) { - using namespace accera::ir::value; + auto memrefType = val.getType().cast(); + return ir::value::MFMAMatrixType::get(shape, memrefType.getElementType(), operand); +}; +Matrix MLIRContext::MFMALoadImpl(Value source, const std::vector& shape, const std::string& operand) +{ auto& builder = _impl->builder; auto loc = builder.getUnknownLoc(); - auto destValue = ToMLIRValue(builder, dest); - auto aValue = ToMLIRValue(builder, A); - auto bValue = ToMLIRValue(builder, B); - auto cValue = ToMLIRValue(builder, C); + auto matValue = ToMLIRValue(builder, source); + auto mfmaMatTy = getMatrixTypeOfMemref(matValue, shape, operand); + auto mfmaMatShape = mfmaMatTy.getShape(); + auto mfmaMatLayout = MemoryLayout(mfmaMatShape[0], mfmaMatShape[1]); - auto getMatrixTypeOfMemref = [=](mlir::Value val, llvm::StringRef kind) { - auto memrefType = val.getType().cast(); - return MFMAMatrixType::get( - memrefType.getShape(), memrefType.getElementType(), kind); - }; + auto zeroIdx = builder.create(loc, 0); + + mlir::Value result = builder.create(loc, mfmaMatTy, matValue, mlir::ValueRange{ zeroIdx, zeroIdx }); - mlir::Value aMatrix = builder.create(loc, getMatrixTypeOfMemref(aValue, "AOp"), aValue); - mlir::Value bMatrix = builder.create(loc, getMatrixTypeOfMemref(aValue, "BOp"), bValue); - mlir::Value cMatrix = builder.create(loc, getMatrixTypeOfMemref(aValue, "COp"), cValue); + EmittableInfo& emittableInfo = StoreLocalEmittable({ result.getAsOpaquePointer(), { source.GetBaseType(), 1 } }); + Emittable emittable{ &emittableInfo }; - auto result = builder.create(loc, cValue.getType(), aMatrix, bMatrix, cMatrix); + return Matrix(Value(emittable, mfmaMatLayout)); +} - builder.create(loc, result, destValue); +void MLIRContext::MFMAStoreImpl(Matrix source, Value target) +{ + auto& builder = _impl->builder; + auto loc = builder.getUnknownLoc(); - throw LogicException(LogicExceptionErrors::notImplemented); + auto sourceValue = ToMLIRValue(builder, source); + auto targetValue = ToMLIRValue(builder, target); + auto zeroIdx = builder.create(loc, 0); - // EmittableInfo& emittableInfo = StoreLocalEmittable({ result.getAsOpaquePointer(), { C.GetBaseType(), 1 } }); - // Emittable emittable{ &emittableInfo }; + builder.create(loc, sourceValue, targetValue, mlir::ValueRange{ zeroIdx, zeroIdx }); +} - // return Value( emittable, 
C.GetLayout() ); +Matrix MLIRContext::MFMAComputeImpl(Matrix A, Matrix B, Matrix C) +{ + auto& builder = _impl->builder; + auto loc = builder.getUnknownLoc(); + + auto aValue = ToMLIRValue(builder, A); + auto bValue = ToMLIRValue(builder, B); + auto cValue = ToMLIRValue(builder, C); + + mlir::Value result = builder.create(loc, cValue.getType(), aValue, bValue, cValue); + + EmittableInfo& emittableInfo = StoreLocalEmittable({ result.getAsOpaquePointer(), { C.GetType(), 1 } }); + Emittable emittable{ &emittableInfo }; + + return Matrix(Value(emittable, C.GetValue().GetLayout())); } Scalar MLIRContext::CastImpl(Scalar value, ValueType type, bool srcSigned) @@ -2485,7 +2528,7 @@ void MLIRContext::EmitNestDebugFunction(FunctionDeclaration targetFunc, const st valueMap.map(refGlobalOp.getResult(), newOp.getResult()); }); }); - + // Create the reference schedule(s) auto targetNestOp = scheduleOp.getNest(); if (auto fusedDomains = scheduleOp.getFusedDomains(); !fusedDomains.empty()) @@ -2721,7 +2764,9 @@ ValueType MLIRTypeToValueType(mlir::Type ty) return ValueType::Index; }) .Case([](mlir::FloatType fTy) { - if (fTy.isF32()) + if (fTy.isF16()) + return ValueType::Float16; + else if (fTy.isF32()) return ValueType::Float; else if (fTy.isF64()) return ValueType::Double; diff --git a/accera/value/src/Plan.cpp b/accera/value/src/Plan.cpp index 8179a40f..ad44c8a4 100644 --- a/accera/value/src/Plan.cpp +++ b/accera/value/src/Plan.cpp @@ -71,7 +71,7 @@ namespace value nestOp.exec_targetAttr(execTargetAttr); _execPlanOp.exec_targetAttr(execTargetAttr); - if (_execRuntime != ExecutionRuntime::Default) + if (_execRuntime != ExecutionRuntime::DEFAULT && _execRuntime != ExecutionRuntime::NONE && _execRuntime != ExecutionRuntime::OPENMP) { auto execRuntimeAttrName = ValueModuleOp::getExecRuntimeAttrName(); auto execRuntimeAttrValue = ir::value::ExecutionRuntimeAttr::get( @@ -86,16 +86,7 @@ namespace value } } - _execPlanOp->setAttr( - _execPlanOp.getGPULaunchAttrName(), - b.getIndexArrayAttr({ - options.grid.x, - options.grid.y, - options.grid.z, - options.block.x, - options.block.y, - options.block.z, - })); + _execPlanOp->setAttr(_execPlanOp.getGPULaunchAttrName(), options.ToArrayAttr(b.getContext())); } else llvm_unreachable("Unexpected"); @@ -108,14 +99,58 @@ namespace value return { _scheduleOp, target, keySliceIndex, maxElements, mapping, allocation, memorySpace, _execTarget }; } - Cache AddManualCache(std::variant target, const std::optional& keySliceIndex, const std::optional& triggerIndex, const std::optional& maxElements, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace, const MemoryAffineCoefficients& memoryMap) + Cache AddManualCache(std::variant target, + const std::optional& keySliceIndex, + const std::optional& triggerIndex, + const std::optional& maxElements, + bool thrifty, + bool doubleBuffer, + CacheIndexing mapping, + CacheAllocation allocation, + MemorySpace memorySpace, + MemorySpace doubleBufferMemorySpace, + const MemoryAffineCoefficients& memoryMap) { - return { _scheduleOp, target, keySliceIndex, triggerIndex, maxElements, memoryMap, mapping, allocation, memorySpace, _execTarget }; + return { _scheduleOp, + target, + keySliceIndex, + triggerIndex, + maxElements, + memoryMap, + thrifty, + doubleBuffer, + mapping, + allocation, + memorySpace, + doubleBufferMemorySpace, + _execTarget }; } - Cache AddManualCache(std::variant target, const std::optional& keySliceIndex, const std::optional& triggerIndex, const std::optional& maxElements, CacheIndexing mapping, 
CacheAllocation allocation, MemorySpace memorySpace, const DimensionOrder& dimOrder) + Cache AddManualCache(std::variant target, + const std::optional& keySliceIndex, + const std::optional& triggerIndex, + const std::optional& maxElements, + bool thrifty, + bool doubleBuffer, + CacheIndexing mapping, + CacheAllocation allocation, + MemorySpace memorySpace, + MemorySpace doubleBufferMemorySpace, + const DimensionOrder& dimOrder) { - return { _scheduleOp, target, keySliceIndex, triggerIndex, maxElements, dimOrder, mapping, allocation, memorySpace, _execTarget }; + return { _scheduleOp, + target, + keySliceIndex, + triggerIndex, + maxElements, + dimOrder, + thrifty, + doubleBuffer, + mapping, + allocation, + memorySpace, + doubleBufferMemorySpace, + _execTarget }; } Cache AddRuntimeInitCache(ViewAdapter target, const std::string& packingFnName, const std::string& packedBufferSizeFnName, CacheIndexing indexing) @@ -162,7 +197,7 @@ namespace value } } - void Tensorize(std::vector indices, std::vector dims) + void Tensorize(std::vector indices, std::array dims) { auto& builder = GetBuilder(); @@ -249,36 +284,36 @@ namespace value Plan::Plan( Schedule& schedule, - value::ExecutionRuntime runtime /* = value::ExecutionRuntime::Default */) : + value::ExecutionRuntime runtime /* = value::ExecutionRuntime::DEFAULT */) : _impl(std::make_unique( value::targets::CPU{}, schedule.GetOp(), - value::ExecutionRuntime::Default)) + runtime)) {} Plan::~Plan() = default; - Cache Plan::AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, const ScalarIndex& triggerIndex, const MemoryAffineCoefficients& memoryMap, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace) + Cache Plan::AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, const ScalarIndex& triggerIndex, const MemoryAffineCoefficients& memoryMap, bool thrifty, bool doubleBuffer, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace, MemorySpace doubleBufferMemorySpace) { - return _impl->AddManualCache(target, outermostIncludedSplitIndex, triggerIndex, std::nullopt, mapping, allocation, memorySpace, memoryMap); + return _impl->AddManualCache(target, outermostIncludedSplitIndex, triggerIndex, std::nullopt, thrifty, doubleBuffer, mapping, allocation, memorySpace, doubleBufferMemorySpace, memoryMap); } - Cache Plan::AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, const ScalarIndex& triggerIndex, const DimensionOrder& dimOrder, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace) + Cache Plan::AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, const ScalarIndex& triggerIndex, const DimensionOrder& dimOrder, bool thrifty, bool doubleBuffer, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace, MemorySpace doubleBufferMemorySpace) { - return _impl->AddManualCache(target, outermostIncludedSplitIndex, triggerIndex, std::nullopt, mapping, allocation, memorySpace, dimOrder); + return _impl->AddManualCache(target, outermostIncludedSplitIndex, triggerIndex, std::nullopt, thrifty, doubleBuffer, mapping, allocation, memorySpace, doubleBufferMemorySpace, dimOrder); } - Cache Plan::AddCache(std::variant target, int64_t maxElements, const MemoryAffineCoefficients& memoryMap, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace) + Cache Plan::AddCache(std::variant target, int64_t maxElements, const MemoryAffineCoefficients& memoryMap, bool thrifty, bool 
doubleBuffer, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace, MemorySpace doubleBufferMemorySpace) { - return _impl->AddManualCache(target, std::nullopt, std::nullopt, maxElements, mapping, allocation, memorySpace, memoryMap); + return _impl->AddManualCache(target, std::nullopt, std::nullopt, maxElements, thrifty, doubleBuffer, mapping, allocation, memorySpace, doubleBufferMemorySpace, memoryMap); } - Cache Plan::AddCache(std::variant target, int64_t maxElements, const DimensionOrder& dimOrder, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace) + Cache Plan::AddCache(std::variant target, int64_t maxElements, const DimensionOrder& dimOrder, bool thrifty, bool doubleBuffer, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace, MemorySpace doubleBufferMemorySpace) { - return _impl->AddManualCache(target, std::nullopt, std::nullopt, maxElements, mapping, allocation, memorySpace, dimOrder); + return _impl->AddManualCache(target, std::nullopt, std::nullopt, maxElements, thrifty, doubleBuffer, mapping, allocation, memorySpace, doubleBufferMemorySpace, dimOrder); } - Cache Plan::AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace) + Cache Plan::AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, bool thrifty, bool doubleBuffer, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace, MemorySpace doubleBufferMemorySpace) { Value baseValue; if (std::holds_alternative(target)) @@ -293,10 +328,10 @@ namespace value } int64_t rank = baseValue.GetLayout().NumDimensions(); DimensionOrder dimOrder(rank); - return _impl->AddManualCache(target, outermostIncludedSplitIndex, outermostIncludedSplitIndex, std::nullopt, mapping, allocation, memorySpace, dimOrder); + return _impl->AddManualCache(target, outermostIncludedSplitIndex, outermostIncludedSplitIndex, std::nullopt, thrifty, doubleBuffer, mapping, allocation, memorySpace, doubleBufferMemorySpace, dimOrder); } - Cache Plan::AddCache(std::variant target, int64_t maxElements, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace) + Cache Plan::AddCache(std::variant target, int64_t maxElements, bool thrifty, bool doubleBuffer, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace, MemorySpace doubleBufferMemorySpace) { Value baseValue; if (std::holds_alternative(target)) @@ -312,7 +347,7 @@ namespace value int64_t rank = baseValue.GetLayout().NumDimensions(); DimensionOrder dimOrder(rank); auto viewAdapter = std::get(target); - return _impl->AddManualCache(target, std::nullopt, std::nullopt, maxElements, mapping, allocation, memorySpace, dimOrder); + return _impl->AddManualCache(target, std::nullopt, std::nullopt, maxElements, thrifty, doubleBuffer, mapping, allocation, memorySpace, doubleBufferMemorySpace, dimOrder); } Cache Plan::EmitRuntimeInitPacking(ViewAdapter target, const std::string& packingFnName, const std::string& packedBufferSizeFnName, CacheIndexing indexing) @@ -359,9 +394,9 @@ namespace value { } - Cache GPUPlan::AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, const value::ScalarIndex& triggerIndex, const DimensionOrder& dimOrder, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace) + Cache GPUPlan::AddCache(std::variant target, const ScalarIndex& outermostIncludedSplitIndex, const value::ScalarIndex& triggerIndex, const 
DimensionOrder& dimOrder, bool thrifty, bool doubleBuffer, CacheIndexing mapping, CacheAllocation allocation, MemorySpace memorySpace, MemorySpace doubleBufferMemorySpace) { - return _impl->AddManualCache(target, outermostIncludedSplitIndex, triggerIndex, std::nullopt, mapping, allocation, memorySpace, dimOrder); + return _impl->AddManualCache(target, outermostIncludedSplitIndex, triggerIndex, std::nullopt, thrifty, doubleBuffer, mapping, allocation, memorySpace, doubleBufferMemorySpace, dimOrder); } Cache GPUPlan::AddCache(ViewAdapter target, int64_t maxElements, MemorySpace memorySpace) @@ -369,7 +404,7 @@ namespace value return _impl->AddAutomaticCache(target, std::nullopt, maxElements, CacheIndexing::GlobalToPhysical, CacheAllocation::Automatic, memorySpace); } - void GPUPlan::Tensorize(std::vector indices, std::vector dims) + void GPUPlan::Tensorize(std::vector indices, std::array dims) { _impl->Tensorize(indices, dims); } diff --git a/accera/value/src/Scalar.cpp b/accera/value/src/Scalar.cpp index 012d3f4e..e47faf2e 100644 --- a/accera/value/src/Scalar.cpp +++ b/accera/value/src/Scalar.cpp @@ -37,8 +37,9 @@ namespace value MAP_TARGET_TO_POSSIBLE_SOURCES(ValueType::Int16, ValueType::Boolean, ValueType::Int8); MAP_TARGET_TO_POSSIBLE_SOURCES(ValueType::Int32, ValueType::Boolean, ValueType::Int8, ValueType::Int16); MAP_TARGET_TO_POSSIBLE_SOURCES(ValueType::Int64, ValueType::Boolean, ValueType::Int8, ValueType::Int16, ValueType::Int32); - MAP_TARGET_TO_POSSIBLE_SOURCES(ValueType::Float, ValueType::Boolean, ValueType::Int8, ValueType::Int16, ValueType::Int32); - MAP_TARGET_TO_POSSIBLE_SOURCES(ValueType::Double, ValueType::Boolean, ValueType::Int8, ValueType::Int16, ValueType::Int32, ValueType::Int64, ValueType::Float); + MAP_TARGET_TO_POSSIBLE_SOURCES(ValueType::Float16, ValueType::Boolean, ValueType::Int8, ValueType::Int16); + MAP_TARGET_TO_POSSIBLE_SOURCES(ValueType::Float, ValueType::Boolean, ValueType::Int8, ValueType::Int16, ValueType::Int32, ValueType::Float16); + MAP_TARGET_TO_POSSIBLE_SOURCES(ValueType::Double, ValueType::Boolean, ValueType::Int8, ValueType::Int16, ValueType::Int32, ValueType::Int64, ValueType::Float16, ValueType::Float); default: return false; diff --git a/accera/value/src/ScalarOperations.cpp b/accera/value/src/ScalarOperations.cpp index c86ea48a..0bb9d21c 100644 --- a/accera/value/src/ScalarOperations.cpp +++ b/accera/value/src/ScalarOperations.cpp @@ -137,6 +137,8 @@ namespace value { switch (s.GetType()) { + case ValueType::Float16: + [[fallthrough]]; case ValueType::Float: [[fallthrough]]; case ValueType::Double: diff --git a/accera/value/src/Value.cpp b/accera/value/src/Value.cpp index 9e4da025..6b428f90 100644 --- a/accera/value/src/Value.cpp +++ b/accera/value/src/Value.cpp @@ -294,9 +294,11 @@ namespace value bool Value::IsFloatingPoint() const { - return (_type.first == ValueType::Float || _type.first == ValueType::Double); + return (_type.first == ValueType::Float16 || _type.first == ValueType::Float || _type.first == ValueType::Double); } + bool Value::IsFloat16() const { return _type.first == ValueType::Float16; } + bool Value::IsFloat32() const { return _type.first == ValueType::Float; } bool Value::IsDouble() const { return _type.first == ValueType::Double; } diff --git a/accera/value/test/src/TestUtil.cpp b/accera/value/test/src/TestUtil.cpp index 62553973..3bf17612 100644 --- a/accera/value/test/src/TestUtil.cpp +++ b/accera/value/test/src/TestUtil.cpp @@ -110,6 +110,9 @@ void PrintMatrix(std::string indent, Matrix e) case ValueType::Int64: 
std::cout << s.Get(); break; + case ValueType::Float16: + std::cout << s.Get(); + break; case ValueType::Float: std::cout << s.Get(); break; @@ -148,7 +151,7 @@ Scalar EqualEpsilon(Scalar x, Scalar y, double epsilon) #endif // 0 result = 1; }).Else([&] { - if (auto type = x.GetType(); type == ValueType::Float || type == ValueType::Double) + if (auto type = x.GetType(); type == ValueType::Float16 || type == ValueType::Float || type == ValueType::Double) { auto tolerance = Cast(epsilon, type); If((x - y) <= tolerance, [&] { diff --git a/build.sh b/build.sh index c9bda3e6..f52af881 100644 --- a/build.sh +++ b/build.sh @@ -27,7 +27,7 @@ else export LLVM_SETUP_VARIANT=Default # Uncomment these lines below to build a debug version (will include release as well, due to vcpkg quirks) - # export VCPKG_BUILD_TYPE=debug + # export LLVM_BUILD_TYPE=debug # export VCPKG_KEEP_ENV_VARS=LLVM_BUILD_TYPE # Install LLVM (takes a couple of hours and ~20GB of space) diff --git a/docs/Manual/00 Introduction.md b/docs/Manual/00 Introduction.md index 5181a513..e64d35cd 100644 --- a/docs/Manual/00 Introduction.md +++ b/docs/Manual/00 Introduction.md @@ -2,9 +2,9 @@ [//]: # (Version: v1.2.1) # Introduction -Accera is a Python-based embedded domain-specific programming language (eDSL) that enables cross-compiler optimization for compute-intensive code. Currently, optimization of nested for-loops is the primary focus of Accera for CPU and GPU targets. +Accera is a framework with a Python-based Domain-specific Language (eDSL) that produces optimized compute-intensive code. Currently, optimization of nested for-loops is the primary focus of Accera for CPU and GPU targets. -Optimization of compute-intensive code in a traditional programming language is not only difficult and time-consuming, but manual optimization of simplest numerical algorithms demands significant engineering effort from an engineer who has an advanced understanding of computer architecture and fluency in Assembly Language. Even with all these efforts, implemented code is prone to critical bugs and requires extensive engineering effort for maintenance. Accera aims at resolving all these issues by providing optimized solutions for compute-intensive algorithms that are highly efficient, readable, and maintainable. +Optimization of compute-intensive code in a traditional programming language is not only difficult and time-consuming, but manual optimization of simplest numerical algorithms demands significant engineering effort and requires an advanced understanding of computer architecture and fluency in C++, C, or Assembly Language. Even with all these efforts, implemented code is prone to critical bugs and requires extensive engineering effort for maintenance. Accera aims at resolving all these issues by providing optimized solutions for compute-intensive algorithms that are highly efficient, readable, and maintainable. 
Accera has THREE primary goals: diff --git a/docs/Manual/02 Simple Affine Loop Nests.md b/docs/Manual/02 Simple Affine Loop Nests.md index e8158925..5ada8e2c 100644 --- a/docs/Manual/02 Simple Affine Loop Nests.md +++ b/docs/Manual/02 Simple Affine Loop Nests.md @@ -86,7 +86,7 @@ The iteration logic can include the following operations (assuming `accera` was | Operation | Types (Operands must be of same type) | Description | |----------|----------|--------------| -| `a = b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Assigns the value of scalar *b* to scalar *a* | +| `a = b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Assigns the value of scalar *b* to scalar *a* | __Not yet implemented:__ unsigned types (`acc.ScalarType.uint8/16/32/64`) @@ -94,14 +94,14 @@ __Not yet implemented:__ unsigned types (`acc.ScalarType.uint8/16/32/64`) | Operation | Types (Operands must be of same type) | Description | |----------|----------|--------------| -| `a + b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns the sum of scalars *a* and *b* | -| `a - b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns the difference between scalars *a* and *b* | -| `a * b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns the product of scalars *a* and *b* | -| `a / b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns the quotient of scalars *a* and *b*. If the operands are integers, an integer division result is returned | -| `a ** b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns the *b*'th power of scalar *a* | -| `a // b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns the floor of the quotient of scalars *a* and *b* | -| `a % b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns the signed remainder after dividing scalar *a* by scalar *b* | -| `-a` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns the additive inverse of scalar *a* | +| `a + b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns the sum of scalars *a* and *b* | +| `a - b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns the difference between scalars *a* and *b* | +| `a * b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns the product of scalars *a* and *b* | +| `a / b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns the quotient of scalars *a* and *b*. 
If the operands are integers, an integer division result is returned | +| `a ** b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns the *b*'th power of scalar *a* | +| `a // b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns the floor of the quotient of scalars *a* and *b* | +| `a % b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns the signed remainder after dividing scalar *a* by scalar *b* | +| `-a` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns the additive inverse of scalar *a* | __Not yet implemented:__ unsigned types (`acc.ScalarType.uint8/16/32/64`) @@ -111,12 +111,12 @@ Comment: Accera also supports the corresponding compound-assignment operators, s | Operation | Types (Operands must be of same type) | Description | |----------|----------|--------------| -| `a == b` | `acc.ScalarType.bool, acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns True if scalar *a* equals scalar *b*, else False | -| `a != b` | `acc.ScalarType.bool, acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns True if scalar *a* is not equal to scalar *b*, else False | -| `a < b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns True if scalar *a* is strictly smaller than scalar *b*, else False | -| `a <= b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns True if scalar *a* is smaller than or equal to scalar *b*, else False | -| `a > b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns True if scalar *a* is strictly greater than scalar *b*, else False | -| `a >= b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns True if scalar *a* is greater than or equal to scalar *b*, else False | +| `a == b` | `acc.ScalarType.bool, acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns True if scalar *a* equals scalar *b*, else False | +| `a != b` | `acc.ScalarType.bool, acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns True if scalar *a* is not equal to scalar *b*, else False | +| `a < b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns True if scalar *a* is strictly smaller than scalar *b*, else False | +| `a <= b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns True if scalar *a* is smaller than or equal to scalar *b*, else False | +| `a > b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns True if scalar *a* is strictly greater than scalar *b*, else False | +| `a >= b` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns True if scalar *a* is greater than or equal to scalar *b*, else False | __Not yet implemented:__ unsigned types (`acc.ScalarType.uint8/16/32/64`) @@ -124,9 +124,9 @@ __Not yet implemented:__ unsigned types (`acc.ScalarType.uint8/16/32/64`) | Operation | Types (Operands must be of same type) | Description | |----------|----------|--------------| -| `acc.logical_and(a, b)` | `acc.ScalarType.bool, acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns True if scalars *a* and *b* are non-zero, else False | -| `acc.logical_or(a, b)` | `acc.ScalarType.bool, acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns True if either scalar *a* or scalar *b* are non-zero, else False | -| `acc.logical_not(a)` | `acc.ScalarType.bool, acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns True if *a* is zero, else False | +| `acc.logical_and(a, b)` | 
`acc.ScalarType.bool, acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns True if scalars *a* and *b* are non-zero, else False | +| `acc.logical_or(a, b)` | `acc.ScalarType.bool, acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns True if either scalar *a* or scalar *b* are non-zero, else False | +| `acc.logical_not(a)` | `acc.ScalarType.bool, acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns True if *a* is zero, else False | __Not yet implemented:__ unsigned types (`acc.ScalarType.uint8/16/32/64`) @@ -149,22 +149,22 @@ __Not yet implemented:__ unsigned types (`acc.ScalarType.uint8/16/32/64`) | Operation | Types (Operands must be of same type) | Description | |----------|----------|--------------| -| `acc.abs(a)` | `acc.ScalarType.float32/64` | Returns the absolute value of scalar *a* | -| `acc.max(a, b)` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns the larger of the two scalars *a* and *b* | -| `acc.min(a, b)` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float32/64` | Returns the smaller of the two scalars *a* and *b* | -| `acc.ceil(a)` | `acc.ScalarType.float32/64` | Returns the value of scalar *a* rounded up to the nearest integer as an int64 type | -| `acc.floor(a)` | `acc.ScalarType.float32/64` | Returns the value of scalar *a* rounded down to the nearest integer as an int64 type | -| `acc.sqrt(a)` | `acc.ScalarType.float32/64` | Returns the square root of scalar *a* | -| `acc.exp(a)` | `acc.ScalarType.float32/64` | Returns the exponential *e* raised to the scalar *a* | -| `acc.log(a)` | `acc.ScalarType.float32/64` | Returns the natural logarithm (base *e*) of scalar *a* | -| `acc.log10(a)` | `acc.ScalarType.float32/64` | Returns the common logarithm (base 10) of scalar *a* | -| `acc.log2(a)` | `acc.ScalarType.float32/64` | Returns the binary logarithm (base 2) of scalar *a* | -| `acc.sin(a)` | `acc.ScalarType.float32/64` | Returns the sine of scalar *a*, where *a* is in radians | -| `acc.cos(a)` | `acc.ScalarType.float32/64` | Returns the cosine of scalar *a*, where *a* is in radians | -| `acc.tan(a)` | `acc.ScalarType.float32/64` | Returns the tangent of scalar *a*, where *a* is in radians | -| `acc.sinh(a)` | `acc.ScalarType.float32/64` | Returns the hyperbolic sine of scalar *a*, where *a* is in radians | -| `acc.cosh(a)` | `acc.ScalarType.float32/64` | Returns the hyperbolic cosine of scalar *a*, where *a* is in radians | -| `acc.tanh(a)` | `acc.ScalarType.float32/64` | Returns the hyperbolic tangent of scalar *a*, where *a* is in radians | +| `acc.abs(a)` | `acc.ScalarType.float16/32/64` | Returns the absolute value of scalar *a* | +| `acc.max(a, b)` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns the larger of the two scalars *a* and *b* | +| `acc.min(a, b)` | `acc.ScalarType.int8/16/32/64, acc.ScalarType.float16/32/64` | Returns the smaller of the two scalars *a* and *b* | +| `acc.ceil(a)` | `acc.ScalarType.float16/32/64` | Returns the value of scalar *a* rounded up to the nearest integer as an int64 type | +| `acc.floor(a)` | `acc.ScalarType.float16/32/64` | Returns the value of scalar *a* rounded down to the nearest integer as an int64 type | +| `acc.sqrt(a)` | `acc.ScalarType.float16/32/64` | Returns the square root of scalar *a* | +| `acc.exp(a)` | `acc.ScalarType.float16/32/64` | Returns the exponential *e* raised to the scalar *a* | +| `acc.log(a)` | `acc.ScalarType.float16/32/64` | Returns the natural logarithm (base *e*) of scalar *a* | +| `acc.log10(a)` | 
`acc.ScalarType.float16/32/64` | Returns the common logarithm (base 10) of scalar *a* | +| `acc.log2(a)` | `acc.ScalarType.float16/32/64` | Returns the binary logarithm (base 2) of scalar *a* | +| `acc.sin(a)` | `acc.ScalarType.float16/32/64` | Returns the sine of scalar *a*, where *a* is in radians | +| `acc.cos(a)` | `acc.ScalarType.float16/32/64` | Returns the cosine of scalar *a*, where *a* is in radians | +| `acc.tan(a)` | `acc.ScalarType.float16/32/64` | Returns the tangent of scalar *a*, where *a* is in radians | +| `acc.sinh(a)` | `acc.ScalarType.float16/32/64` | Returns the hyperbolic sine of scalar *a*, where *a* is in radians | +| `acc.cosh(a)` | `acc.ScalarType.float16/32/64` | Returns the hyperbolic cosine of scalar *a*, where *a* is in radians | +| `acc.tanh(a)` | `acc.ScalarType.float16/32/64` | Returns the hyperbolic tangent of scalar *a*, where *a* is in radians | __Not yet implemented:__ unsigned types (`acc.ScalarType.uint8/16/32/64`) diff --git a/docs/Manual/04 Fusing.md b/docs/Manual/04 Fusing.md index 4427f361..59848903 100644 --- a/docs/Manual/04 Fusing.md +++ b/docs/Manual/04 Fusing.md @@ -119,7 +119,7 @@ for i in range(0, 16, 4): ``` ### Constraint 1: the fusing dimension is executed sequentially -The fusing dimension has a special constraint, which does not apply to other dimensions. Specifically, the fusing dimension cannot be parallelized or vectorized (parallelization and vectorization are presented in [Section 7](<07%20Plans%20-%20Vectorization%20and%20Parallelization.md>) ) and it must be executed sequentially. This constraint enables the safety guarantee discussed below. +The fusing dimension has a special constraint, which does not apply to other dimensions. Specifically, the fusing dimension cannot be parallelized, vectorized, or tensorized (see [Section 7](<07%20Plans%20-%20Vectorization%20and%20Parallelization.md>) ) and it must be executed sequentially. This constraint enables the safety guarantee discussed below. ### Safety The fused schedule (before applying any subsequent transformations) is always logically equivalent to executing the original schedules one-by-one. However, is it safe? Recall that a schedule is considered safe if its underlying logic is guaranteed not to change, regardless of how we transform it. The safety of a fully fused schedule depends on the circumstances: diff --git a/docs/Manual/06 Plans - Caching.md b/docs/Manual/06 Plans - Caching.md index e5f15842..a4ae2b2f 100644 --- a/docs/Manual/06 Plans - Caching.md +++ b/docs/Manual/06 Plans - Caching.md @@ -57,7 +57,7 @@ AA = plan.cache(A, max_elements=1024) ``` -## __Not yet implemented:__ Thrifty caching +## Thrifty caching By default, Accera caching strategies are *thrifty* in the sense that the data is physically copied into an allocated cache only if the cached data somehow differs from the original active block. Therefore, if the original active block is already in the correct memory layout and resides contiguously in memory, Accera skips the caching steps and uses the original array instead. Note that a physical copy is created on a GPU if the cache is supposed to be allocated a different type of memory than the original array (e.g., the array is in global memory, but the cache is supposed to be in shared memory). For example, assume that `A` is a two-dimensional array and its active block at the chosen level is always one of its rows. If `A` is row-major, the rows are already stored contiguously.
Additionally, the data in the active block and the data to be copied to cache are identical: both are contiguous and share the same memory layout. In this case, there is no benefit in using cache over the original array. The thrifty caching strategy will skip the caching steps and use the original array instead. @@ -88,11 +88,58 @@ For example, AA = plan.cache(A, level=2, trigger_level=4) ``` -## __Not yet implemented:__ Mapping caches to specific types of memory -Some target platforms have different types of memory that can hold Accera caches. In the case of a GPU target, caches can be located in *global or shared memory*. Following Python code can be used to specify the location of a cache: +## Mapping caches to specific types of memory +Some target platforms have different types of memory that can hold Accera caches. In the case of a GPU target, caches can be located in *global or shared memory*. To explicitly choose the location of the cache, we write: ```python -AA = plan.cache(A, level=4, location=v100.MemoryType.SHARED) +AA = plan.cache(A, level=4, location=v100.MemorySpace.SHARED) ``` +## Double buffering +Caches can double-buffer data by loading the next active block's cache data into a temporary buffer during the current active block's usage and then moving that data into the cache buffer after the current active block is done being used. If the cache trigger level is the highest level in the loopnest then this does nothing as it is dependent on having another loop outside of the cache trigger loop. In shared memory caches on GPU this temporary buffer will automatically be allocated in private memory. Since the next iteration's data is loaded into a temporary buffer while the current iteration's data is in the cache buffer, any overlap in these active blocks would result in a write coherency issue similar to what occurs with Multicaching. Because of this, `double_buffer` may only be specified on an `INPUT` or `CONST` array as Accera does not perform multicache write coherence. +```python +AA = plan.cache(A, level=3, double_buffer=True) +``` + +Full schedule with equivalent pseudo-code: +```python +... +M, N, K = 1024, 1024, 1024 +m_tile, n_tile, k_tile = 32, 64, 128 +nest = Nest((M, N, K)) +i, j, k = nest.get_indices() +@nest.iteration_logic +def _(): + C[i,j] += A[i,k] * B[k,j] +schedule = nest.create_schedule() +schedule.tile((i, j, k), (m_tile, n_tile, k_tile)) +schedule.reorder(i, j, k, ii, jj, kk) + +plan = schedule.create_plan() +plan.cache(A, index=ii, double_buffer=True) +... +``` +equivalent to: +```python +for i in range(0, M, m_tile): + for j in range(0, N, n_tile): + for ii_cache in range(0, m_tile): + for kk_cache in range(0, k_tile): + cache_A[ii_cache, kk_cache] = A[i+ii_cache, kk_cache] + for k in range(0, K-k_tile, k_tile): # Note: this loop doesn't run for the final K tile + for ii_cache in range(0, m_tile): + for kk_cache in range(0, k_tile): + temp_A[ii_cache, kk_cache] = A[i+ii_cache, (k + k_tile) + kk_cache] + for ii in range(0, m_tile): + for jj in range(0, n_tile): + for kk in range(0, k_tile): + C[i+ii, j+jj] += cache_A[ii, kk] * B[k+kk, j+jj] + for ii_cache in range(0, m_tile): + for kk_cache in range(0, k_tile): + cache_A[ii_cache, kk_cache] = temp_A[ii_cache, kk_cache] + for ii in range(0, m_tile): + for jj in range(0, n_tile): + for kk in range(0, k_tile): + C[i+ii, j+jj] += cache_A[ii, kk] * B[k+kk, j+jj] +```
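The thrifty and double-buffering behaviors described above surface in the Python API as keyword arguments to `plan.cache`. The sketch below is illustrative only: the `Array` constructor, `schedule.split`, and the host-default `create_plan()` are assumptions not shown in this hunk, while the `index`, `thrifty`, and `double_buffer` keywords follow the `Plan.cache` reference later in this patch.

```python
import accera as acc

M, N, K = 1024, 1024, 1024
A = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(M, K))
B = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(K, N))
C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, element_type=acc.ScalarType.float32, shape=(M, N))

nest = acc.Nest(shape=(M, N, K))
i, j, k = nest.get_indices()

@nest.iteration_logic
def _():
    C[i, j] += A[i, k] * B[k, j]

schedule = nest.create_schedule()
ii = schedule.split(i, 32)   # tile sizes follow the pseudo-code above
jj = schedule.split(j, 64)
kk = schedule.split(k, 128)
schedule.reorder(i, j, k, ii, jj, kk)

plan = schedule.create_plan()

# Thrifty cache of A at the ii level: the copy is skipped whenever the active
# block is already contiguous and laid out the way the cache would be.
AA = plan.cache(A, index=ii, thrifty=True)

# Double buffering can be combined with caching of the other input; A and B
# are INPUT arrays, so the restriction called out above is satisfied.
BB = plan.cache(B, index=ii, double_buffer=True)
```

Whether a thrifty cache actually elides the copy is decided per active block; when the layouts differ, it behaves exactly like a regular cache.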
diff --git a/docs/Manual/07 Plans - Vectorization and Parallelization.md b/docs/Manual/07 Plans - Vectorization and Parallelization.md
index 7110ed90..1f1b1b7a 100644
--- a/docs/Manual/07 Plans - Vectorization and Parallelization.md
+++ b/docs/Manual/07 Plans - Vectorization and Parallelization.md
@@ -107,6 +107,27 @@ Additionally, Accera can perform vectorized load and store operations to/from ve

To vectorize dimension `i`, the number of active elements that corresponds to dimension `i` must exactly match the vector instruction width of the target processor. For example, if the target processor has vector instructions that operate on either 4 or 8 floating-point elements at once, then the number of active elements can either be 4 or 8. Additionally, those active elements must occupy adjacent memory locations (they cannot be spread out).

+## `tensorize`
+
+Some hardware also has specialized instructions for performing matrix multiplications. These instructions operate on certain matrix dimensions with specific data types. The tensorization instructions take tiles of the `A`, `B`, and `C` matrices and compute the `C = A * B + C` operation.
+
+The `tensorize` operation takes 3 indices:
+
+```python
+plan.tensorize(indices=(i,j,k))
+```
+
+Tensorization is limited and is only valid on loop structures of the form
+
+```python
+for i in range(M):
+    for k in range(N):
+        for j in range(K):
+            C[i, j] += A[i, k] * B[k, j]
+```
+
+where there is `MxNxK` tensorization hardware support for the `A`, `B`, and `C` element data types.
+
## Convenience syntax: `kernelize`

The `kernelize` instruction is a convenience syntax that does not provide any unique functionality. Specifically, `kernelize` is equivalent to a sequence of `unroll` instructions, followed by an optional `vectorize` instruction.
diff --git a/docs/Reference/classes/Array/deferred_layout.md b/docs/Reference/classes/Array/deferred_layout.md
index 1e8e1b6b..0f152eb5 100644
--- a/docs/Reference/classes/Array/deferred_layout.md
+++ b/docs/Reference/classes/Array/deferred_layout.md
@@ -3,14 +3,14 @@

# Accera v1.2.1 Reference

-## `accera.Array.deferred_layout(layout)`
+## `accera.Array.deferred_layout(cache)`
Specifies the layout for a `Array.Role.CONST` array based on a `Cache`. For more details, see [Deferred layout of constant arrays](<../../../Manual/08%20Deferred%20Layout%20of%20Constant%20Arrays.md>)

## Arguments

argument | description | type/default
--- | --- | ---
-`layout` | The layout to set. | `accera.Array.Layout`
+`cache` | The cache that defines the layout to set. | `accera.Cache`

## Examples
diff --git a/docs/Reference/classes/Array/sub_array.md b/docs/Reference/classes/Array/sub_array.md
new file mode 100644
index 00000000..df2ff85c
--- /dev/null
+++ b/docs/Reference/classes/Array/sub_array.md
@@ -0,0 +1,51 @@
+[//]: # (Project: Accera)
+[//]: # (Version: v1.2.1)
+
+# Accera v1.2.1 Reference
+
+## `accera.Array.sub_array(offsets, shape, strides)`
+Creates a sub-array of a specific shape from an array. The sub-array is created from elements at specified offsets and strides into the original array.
+
+## Arguments
+
+argument | description | type/default
+--- | --- | ---
+`offsets` | The offsets into the original array. | `Tuple[int]`
+`shape` | The size of the sub-array. | `Tuple[int]`
+`strides` | (Optional) The strides in the original array used to create the sub-array. | `Tuple[int]`
+
+## Examples
+
+Create a sub-array of size 2x3 from an array of size 5x5 at an offset of {1, 1} and a stride of {2, 1}:
+
+```python
+import numpy as np
+import accera as acc
+
+N = 5
+subArrayNumRows = 2
+subArrayNumCols = 3
+
+matrix = np.random.rand(N, N)
+Arr = acc.Array(role=acc.Array.Role.INPUT, data=matrix)
+
+# Zero out a sub-array of size [2, 3] such that the resulting array looks like this:
+# xxxxx
+# x000x
+# xxxxx
+# x000x
+# xxxxx
+
+nest = acc.Nest(shape=(subArrayNumRows, subArrayNumCols))
+i, j = nest.get_indices()
+
+@nest.iteration_logic
+def _():
+    SubArr = Arr.sub_array([1, 1], [subArrayNumRows, subArrayNumCols], [2, 1])
+    SubArr[i, j] = 0.0
+
+schedule = nest.create_schedule()
+```
+
+
+
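To see exactly which elements the `offsets`/`shape`/`strides` arguments select, the short NumPy sketch below reproduces the same indexing arithmetic (plain NumPy, not the Accera API): element `(i, j)` of the sub-array comes from `matrix[offsets[0] + i*strides[0], offsets[1] + j*strides[1]]`.

```python
import numpy as np

N = 5
matrix = np.arange(N * N, dtype=np.float32).reshape(N, N)

# offsets=[1, 1], shape=[2, 3], strides=[2, 1] selects rows 1 and 3, columns 1-3,
# matching the x000x pattern in the example above.
view = matrix[1 : 1 + 2 * 2 : 2, 1 : 1 + 3 * 1 : 1]
print(view.shape)  # (2, 3)
print(view)        # rows 1 and 3, columns 1, 2, 3 of `matrix`
```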
diff --git a/docs/Reference/classes/Plan/cache.md b/docs/Reference/classes/Plan/cache.md
index d56b700a..b8b8c17b 100644
--- a/docs/Reference/classes/Plan/cache.md
+++ b/docs/Reference/classes/Plan/cache.md
@@ -3,12 +3,12 @@

# Accera v1.2.1 Reference

-## `accera.Plan.cache(source[, index, trigger_index, layout, level, trigger_level, max_elements, thrifty, type])`
+## `accera.Plan.cache(source[, index, trigger_index, layout, level, trigger_level, max_elements, thrifty, location, double_buffer, double_buffer_location])`
Adds a caching strategy to a plan.

## Arguments

-argument | description | type/default
+argument | description | type
--- | --- | ---
`source` | The array or cache from which this cache is copied. | `Array` or `Cache`
`index` | The index used to determine the cache level. Specify one and only one of `index`, `level`, `max_elements`. | `Index`
@@ -17,8 +17,17 @@ argument | description | type/default
`level` | The key-slice level to cache (the number of wildcard dimensions in a key-slice). Specify one and only one of `index`, `level`, `max_elements`. | positive integer
`trigger_level` | The key-slice level to fill the cache at. `trigger_level` can't be smaller than `level`, and will default to `level` if not specified. Specify at most one of `trigger_index` or `trigger_level`. | positive integer
`max_elements` | The maximum elements to include in the cached region. Specify one and only one of `index`, `level`, `max_elements`. | positive integer
-`thrifty` | Use thrifty caching (copy data into a cache only if the cached data differs from the original active block). | True or False
-`location` | The type of memory used to store the cache. | `MemoryType`
+`thrifty` | Use thrifty caching (copy data into a cache only if the cached data differs from the original active block). | `bool`
+`location` | The type of memory used to store the cache. | `MemorySpace`
+`double_buffer` | Whether to make this cache a double-buffering cache. Only valid on `INPUT` and `CONST` arrays. | `bool`
+`double_buffer_location` | The memory space used for the temporary double-buffering array. Requires `double_buffer` to be set to `True`. Defaults to `AUTO`. | `MemorySpace` or `AUTO`
+
+
+`AUTO` selects the double-buffering location based on the following:
+`location` | `double_buffer` | `double_buffer_location` = `AUTO`
+--- | --- | ---
+`MemorySpace.SHARED` | `True` | `MemorySpace.PRIVATE`
+`!MemorySpace.SHARED` | `True` | Same value as `location`

## Returns
A `Cache` handle that represents the created cache.
@@ -54,7 +63,7 @@ AAA = plan.cache(AA, level=2)
```

__Not yet implemented:__ Create a cache of array `A` at index `i` in GPU shared memory:
```python
v100 = Target(Target.Model.NVIDIA_V100)
-AA = plan.cache(A, i, location=v100.MemoryType.SHARED)
+AA = plan.cache(A, i, location=v100.MemorySpace.SHARED)
```
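As an illustrative sketch of the `AUTO` table above (assuming an existing `schedule`, cache index `ii`, and array `A` as in the other examples), the two calls below should be equivalent for a shared-memory cache: the first relies on `AUTO` resolving to `MemorySpace.PRIVATE`, while the second spells the double-buffer location out explicitly.

```python
v100 = Target(Target.Model.NVIDIA_V100)
plan = schedule.create_plan(v100)   # assumes an existing schedule and index ii

# AUTO: the cache is in shared memory, so the temporary buffer goes to private memory.
AA = plan.cache(A, index=ii, double_buffer=True,
                location=v100.MemorySpace.SHARED)

# Explicit equivalent of the AUTO behavior above.
AA = plan.cache(A, index=ii, double_buffer=True,
                location=v100.MemorySpace.SHARED,
                double_buffer_location=v100.MemorySpace.PRIVATE)
```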
diff --git a/docs/Reference/classes/Plan/tensorize.md b/docs/Reference/classes/Plan/tensorize.md
new file mode 100644
index 00000000..b72deb69
--- /dev/null
+++ b/docs/Reference/classes/Plan/tensorize.md
@@ -0,0 +1,31 @@
+[//]: # (Project: Accera)
+[//]: # (Version: v1.2.1)
+
+# Accera v1.2.1 Reference
+
+## `accera.Plan.tensorize(indices)`
+Only available for targets that have native matrix multiplication instruction (tensor core) support. Marks the dimensions of the iteration-space for tensorization. Only perfectly nested loops of the following form can be tensorized:
+
+
+```python
+for i in range(M):
+    for k in range(N):
+        for j in range(K):
+            C[i, j] += A[i, k] * B[k, j]
+```
+
+## Arguments
+
+argument | description | type/default
+--- | --- | ---
+`indices` | The iteration space dimensions to tensorize. | tuple of `accera.Index`
+
+## Examples
+
+Mark the dimensions `ii`, `jj`, and `kk` for tensorization execution:
+
+```python
+plan.tensorize(indices=(ii,jj,kk))
+```
+
+
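For context, here is a rough end-to-end sketch of how `tensorize` slots into a plan. The tile shape `(16, 16, 4)`, the float32 element types, and the choice of an NVIDIA V100 target are illustrative assumptions; the valid tile shapes and data types depend on the target's tensor-core instructions.

```python
import accera as acc

M, N, K = 1024, 1024, 1024

A = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(M, K))
B = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(K, N))
C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, element_type=acc.ScalarType.float32, shape=(M, N))

nest = acc.Nest(shape=(M, N, K))
i, j, k = nest.get_indices()

@nest.iteration_logic
def _():
    C[i, j] += A[i, k] * B[k, j]

schedule = nest.create_schedule()

# Tile so the innermost loops form a small block, then order them i, k, j
# as required by the tensorized loop structure documented above.
ii, jj, kk = schedule.tile((i, j, k), (16, 16, 4))
schedule.reorder(i, j, k, ii, kk, jj)

v100 = acc.Target(acc.Target.Model.NVIDIA_V100)   # assumed to expose tensor cores
plan = schedule.create_plan(v100)
plan.tensorize(indices=(ii, jj, kk))
```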
diff --git a/docs/Reference/enumerations/ScalarType.md b/docs/Reference/enumerations/ScalarType.md index d34d8651..f6fb6fb2 100644 --- a/docs/Reference/enumerations/ScalarType.md +++ b/docs/Reference/enumerations/ScalarType.md @@ -7,6 +7,7 @@ type | description --- | --- `accera.ScalarType.bool` | boolean +`accera.ScalarType.float16` | 16-bit floating point number `accera.ScalarType.float32` | 32-bit floating point number `accera.ScalarType.float64` | 64-bit floating point number `accera.ScalarType.int8` | 8-bit signed integer diff --git a/docs/Tutorials/hello_matmul/mlir/2_LoopNestToValueFunc.mlir b/docs/Tutorials/hello_matmul/mlir/2_LoopNestToValueFunc.mlir index c556f50f..872f0280 100644 --- a/docs/Tutorials/hello_matmul/mlir/2_LoopNestToValueFunc.mlir +++ b/docs/Tutorials/hello_matmul/mlir/2_LoopNestToValueFunc.mlir @@ -15,7 +15,7 @@ module @hello_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:32:32 store %5, %arg2[%arg3, %arg4] : memref<128x256xf32, affine_map<(d0, d1) -> (d0 * 256 + d1)>> %6 = load %arg2[%arg3, %arg4] : memref<128x256xf32, affine_map<(d0, d1) -> (d0 * 256 + d1)>> store %6, %arg2[%arg3, %arg4] : memref<128x256xf32, affine_map<(d0, d1) -> (d0 * 256 + d1)>> - } {begin = 0 : i64, end = 4 : i64, index = #accln<"index{k_i,4}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 1, 1]} + } {begin = 0 : i64, end = 4 : i64, index = #accln<"index{k_i,4}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 1, 1]} } {begin = 0 : i64, end = 256 : i64, index = #accln<"index{k_o,3}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 1, 4]} } {begin = 0 : i64, end = 256 : i64, index = #accln<"index{j,1}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 1, 256]} } {begin = 0 : i64, end = 128 : i64, index = #accln<"index{i,0}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 256, 256]} diff --git a/docs/Tutorials/optimized_matmul/mlir/0_Initial.mlir b/docs/Tutorials/optimized_matmul/mlir/0_Initial.mlir index 32405677..9dff55f3 100644 --- a/docs/Tutorials/optimized_matmul/mlir/0_Initial.mlir +++ b/docs/Tutorials/optimized_matmul/mlir/0_Initial.mlir @@ -92,7 +92,7 @@ module @optimized_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 "accln.schedule"(%55, %57) ( { "accln.exec_plan"() {exec_target = 0 : i64} : () -> () loc(unknown) accln.terminator loc(unknown) - }) {domain = #xdomain0, kernels = [@scheduled__], loopattrs = [{rcxp_vectorizationInfo = #accxp<"vectorizationinfo{8,16,1}">, scheduledIndex = #accln<"index{j_i_i_i,16}">}], order = [#accln<"index{j_o,3}">, #accln<"index{k_o,5}">, #accln<"index{i_o,7}">, #accln<"index{j_i_o,13}">, #accln<"index{k_i_o,9}">, #accln<"index{i_i_o,11}">, #accln<"index{k_i_i,10}">, #accln<"index{i_i_i,12}">, #accln<"index{j_i_i_o,15}">, #accln<"index{j_i_i_i,16}">], parallel = [], unroll_and_jammed = {}, unrolled = [15 : index, 11 : index]} : (index, index) -> () loc(unknown) + }) {domain = #xdomain0, kernels = [@scheduled__], loopattrs = [{accxp_vectorizationInfo = #accxp<"vectorizationinfo{8,16,1}">, scheduledIndex = #accln<"index{j_i_i_i,16}">}], order = [#accln<"index{j_o,3}">, #accln<"index{k_o,5}">, #accln<"index{i_o,7}">, #accln<"index{j_i_o,13}">, #accln<"index{k_i_o,9}">, 
#accln<"index{i_i_o,11}">, #accln<"index{k_i_i,10}">, #accln<"index{i_i_i,12}">, #accln<"index{j_i_i_o,15}">, #accln<"index{j_i_i_i,16}">], parallel = [], unroll_and_jammed = {}, unrolled = [15 : index, 11 : index]} : (index, index) -> () loc(unknown) accln.terminator loc(unknown) }) {domain = #domain0, exec_target = 0 : i64, kernels = []} : () -> () loc(unknown) accv.return loc(unknown) diff --git a/docs/Tutorials/optimized_matmul/mlir/1_Canonicalizer.mlir b/docs/Tutorials/optimized_matmul/mlir/1_Canonicalizer.mlir index 08bf9b53..44024330 100644 --- a/docs/Tutorials/optimized_matmul/mlir/1_Canonicalizer.mlir +++ b/docs/Tutorials/optimized_matmul/mlir/1_Canonicalizer.mlir @@ -32,7 +32,7 @@ module @optimized_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 "accln.schedule"(%8, %10) ( { "accln.exec_plan"() {exec_target = 0 : i64} : () -> () accln.terminator - }) {domain = #accln<"xfdomain{dims: {{i,0}, {j,1}, {k,2}}, indices: {{{i,0} : {0:784:1} = {(d0, d1) -> (d0 + d1), {{i_o,7}, {i_i,8}}}}, {{j,1} : {0:512:1} = {(d0, d1) -> (d0 + d1), {{j_o,3}, {j_i,4}}}}, {{k,2} : {0:128:1} = {(d0, d1) -> (d0 + d1), {{k_o,5}, {k_i,6}}}}, {{j_o,3} : {0:512:256}}, {{j_i,4} : {0:256:1} = {(d0, d1) -> (d0 + d1), {{j_i_o,13}, {j_i_i,14}}}}, {{k_o,5} : {0:128:128}}, {{k_i,6} : {0:128:1} = {(d0, d1) -> (d0 + d1), {{k_i_o,9}, {k_i_i,10}}}}, {{i_o,7} : {0:784:1}}, {{i_i,8} : {0:1:1} = {(d0, d1) -> (d0 + d1), {{i_i_o,11}, {i_i_i,12}}}}, {{k_i_o,9} : {0:128:4}}, {{k_i_i,10} : {0:4:1}}, {{i_i_o,11} : {0:1:6}}, {{i_i_i,12} : {0:6:1}}, {{j_i_o,13} : {0:256:16}}, {{j_i_i,14} : {0:16:1} = {(d0, d1) -> (d0 + d1), {{j_i_i_o,15}, {j_i_i_i,16}}}}, {{j_i_i_o,15} : {0:16:8}}, {{j_i_i_i,16} : {0:8:1}}}}">, kernels = [@scheduled__], loopattrs = [{rcxp_vectorizationInfo = #accxp<"vectorizationinfo{8,16,1}">, scheduledIndex = #accln<"index{j_i_i_i,16}">}], order = [#accln<"index{j_o,3}">, #accln<"index{k_o,5}">, #accln<"index{i_o,7}">, #accln<"index{j_i_o,13}">, #accln<"index{k_i_o,9}">, #accln<"index{i_i_o,11}">, #accln<"index{k_i_i,10}">, #accln<"index{i_i_i,12}">, #accln<"index{j_i_i_o,15}">, #accln<"index{j_i_i_i,16}">], parallel = [], unroll_and_jammed = {}, unrolled = [15 : index, 11 : index]} : (index, index) -> () + }) {domain = #accln<"xfdomain{dims: {{i,0}, {j,1}, {k,2}}, indices: {{{i,0} : {0:784:1} = {(d0, d1) -> (d0 + d1), {{i_o,7}, {i_i,8}}}}, {{j,1} : {0:512:1} = {(d0, d1) -> (d0 + d1), {{j_o,3}, {j_i,4}}}}, {{k,2} : {0:128:1} = {(d0, d1) -> (d0 + d1), {{k_o,5}, {k_i,6}}}}, {{j_o,3} : {0:512:256}}, {{j_i,4} : {0:256:1} = {(d0, d1) -> (d0 + d1), {{j_i_o,13}, {j_i_i,14}}}}, {{k_o,5} : {0:128:128}}, {{k_i,6} : {0:128:1} = {(d0, d1) -> (d0 + d1), {{k_i_o,9}, {k_i_i,10}}}}, {{i_o,7} : {0:784:1}}, {{i_i,8} : {0:1:1} = {(d0, d1) -> (d0 + d1), {{i_i_o,11}, {i_i_i,12}}}}, {{k_i_o,9} : {0:128:4}}, {{k_i_i,10} : {0:4:1}}, {{i_i_o,11} : {0:1:6}}, {{i_i_i,12} : {0:6:1}}, {{j_i_o,13} : {0:256:16}}, {{j_i_i,14} : {0:16:1} = {(d0, d1) -> (d0 + d1), {{j_i_i_o,15}, {j_i_i_i,16}}}}, {{j_i_i_o,15} : {0:16:8}}, {{j_i_i_i,16} : {0:8:1}}}}">, kernels = [@scheduled__], loopattrs = [{accxp_vectorizationInfo = #accxp<"vectorizationinfo{8,16,1}">, scheduledIndex = #accln<"index{j_i_i_i,16}">}], order = [#accln<"index{j_o,3}">, #accln<"index{k_o,5}">, #accln<"index{i_o,7}">, #accln<"index{j_i_o,13}">, #accln<"index{k_i_o,9}">, #accln<"index{i_i_o,11}">, #accln<"index{k_i_i,10}">, #accln<"index{i_i_i,12}">, #accln<"index{j_i_i_o,15}">, #accln<"index{j_i_i_i,16}">], parallel = [], unroll_and_jammed = {}, unrolled = [15 : index, 11 
: index]} : (index, index) -> () accln.terminator }) {domain = #accln<"idomain{{i,0}={0:784:1}, {j,1}={0:512:1}, {k,2}={0:128:1}}">, exec_target = 0 : i64, kernels = []} : () -> () accv.return diff --git a/docs/Tutorials/optimized_matmul/mlir/2_LoopNestToValueFunc.mlir b/docs/Tutorials/optimized_matmul/mlir/2_LoopNestToValueFunc.mlir index 3566618d..d23d9108 100644 --- a/docs/Tutorials/optimized_matmul/mlir/2_LoopNestToValueFunc.mlir +++ b/docs/Tutorials/optimized_matmul/mlir/2_LoopNestToValueFunc.mlir @@ -31,7 +31,7 @@ module @optimized_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 %5 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2 * 8)>(%arg3, %arg6, %arg8) %6 = vector.transfer_read %arg1[%4, %5], %cst_0 {masked = [false]} : memref<128x512xf32, affine_map<(d0, d1) -> (d0 * 512 + d1)>>, vector<8xf32> store %6, %0[%arg7, %arg8] : memref<1x16xvector<8xf32>> - } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{i_1,24}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{i_0,23}">, #accln<"index{i_1,24}">], subdomainSize = [1, 1]} + } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{i_1,24}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{i_0,23}">, #accln<"index{i_1,24}">], subdomainSize = [1, 1]} } {begin = 0 : i64, end = 1 : i64, index = #accln<"index{i_0,23}">, subdomainIndexOrder = [#accln<"index{i_0,23}">, #accln<"index{i_1,24}">], subdomainSize = [1, 16]} accv.return }) {exec_target = 0 : i64, sym_name = "NestFunction_15", type = () -> ()} : () -> () @@ -40,7 +40,7 @@ module @optimized_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 affine.for %arg8 = 0 to 16 { %4 = load %0[%arg7, %arg8] : memref<1x16xvector<8xf32>> affine.store %4, %3[((%arg6 + %arg8 * 8) floordiv 16) mod 16, (%arg5 + %arg7) mod 128, (((%arg6 + %arg8 * 8) mod 16) floordiv 8) mod 2] : memref<16x128x2xvector<8xf32>> - } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{i_1,26}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{i_0,25}">, #accln<"index{i_1,26}">], subdomainSize = [1, 1]} + } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{i_1,26}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{i_0,25}">, #accln<"index{i_1,26}">], subdomainSize = [1, 1]} } {begin = 0 : i64, end = 1 : i64, index = #accln<"index{i_0,25}">, subdomainIndexOrder = [#accln<"index{i_0,25}">, #accln<"index{i_1,26}">], subdomainSize = [1, 16]} accv.return }) {exec_target = 0 : i64, sym_name = "NestFunction_14", type = () -> ()} : () -> () @@ -52,7 +52,7 @@ module @optimized_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 %5 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2 * 8)>(%arg3, %arg6, %arg8) %6 = vector.transfer_read %arg1[%4, %5], %cst_0 : memref<128x512xf32, affine_map<(d0, d1) -> (d0 * 512 + d1)>>, vector<8xf32> store %6, %0[%arg7, %arg8] : memref<1x16xvector<8xf32>> - } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{i_1,28}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{i_0,27}">, #accln<"index{i_1,28}">], subdomainSize = [1, 1]} + } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{i_1,28}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{i_0,27}">, #accln<"index{i_1,28}">], subdomainSize = [1, 1]} } {begin = 0 : i64, end = 1 : i64, index = #accln<"index{i_0,27}">, subdomainIndexOrder = [#accln<"index{i_0,27}">, #accln<"index{i_1,28}">], subdomainSize = [1, 16]} accv.return }) {exec_target = 0 : i64, sym_name = "NestFunction_13", type = () -> ()} : () -> () @@ -61,7 +61,7 @@ module @optimized_matmul 
attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 affine.for %arg8 = 0 to 16 { %4 = load %0[%arg7, %arg8] : memref<1x16xvector<8xf32>> affine.store %4, %3[((%arg6 + %arg8 * 8) floordiv 16) mod 16, (%arg5 + %arg7) mod 128, (((%arg6 + %arg8 * 8) mod 16) floordiv 8) mod 2] : memref<16x128x2xvector<8xf32>> - } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{i_1,30}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{i_0,29}">, #accln<"index{i_1,30}">], subdomainSize = [1, 1]} + } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{i_1,30}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{i_0,29}">, #accln<"index{i_1,30}">], subdomainSize = [1, 1]} } {begin = 0 : i64, end = 1 : i64, index = #accln<"index{i_0,29}">, subdomainIndexOrder = [#accln<"index{i_0,29}">, #accln<"index{i_1,30}">], subdomainSize = [1, 16]} accv.return }) {exec_target = 0 : i64, sym_name = "NestFunction_12", type = () -> ()} : () -> () @@ -238,10 +238,10 @@ module @optimized_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 %136 = vector.insertelement %90, %135[%c7_i64 : i64] : vector<8xf32> affine.store %136, %2[((%134 - %arg3) floordiv 16) mod 16, (%11 - %arg5) mod 6, (((%134 - %arg3) mod 16) floordiv 8) mod 2] : memref<16x6x2xvector<8xf32>> } {begin = 0 : i64, end = 8 : i64, index = #accln<"index{j_i_i_i,16}">, scheduledIndex = #accln<"index{j_i_i_i,16}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 1, 1]} - } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{j_i_i_o,15}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 8, 1]} + } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{j_i_i_o,15}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 8, 1]} } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_i,12}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 1]} } {begin = 0 : i64, end = 4 : i64, index = #accln<"index{k_i_i,10}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 1]} - } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_o,11}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 4]} + } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_o,11}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 4]} affine.for %arg8 = 0 to 1 step 6 { affine.for %arg9 = 0 to 4 { affine.for %arg10 = 0 to 1 { @@ -397,10 +397,10 @@ module @optimized_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 %136 = vector.insertelement %90, %135[%c7_i64 : i64] : vector<8xf32> affine.store %136, %2[((%134 - %arg3) floordiv 16) mod 16, (%11 - %arg5) mod 6, (((%134 - %arg3) mod 16) floordiv 8) mod 2] : memref<16x6x2xvector<8xf32>> } {begin = 0 : i64, end = 8 : i64, index = #accln<"index{j_i_i_i,16}">, scheduledIndex = #accln<"index{j_i_i_i,16}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 1, 1]} - } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{j_i_i_o,15}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 
8, 1]} + } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{j_i_i_o,15}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 8, 1]} } {begin = 0 : i64, end = 1 : i64, index = #accln<"index{i_i_i,12}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 16, 1]} } {begin = 0 : i64, end = 4 : i64, index = #accln<"index{k_i_i,10}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 16, 1]} - } {begin = 0 : i64, end = 1 : i64, index = #accln<"index{i_i_o,11}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 16, 4]} + } {begin = 0 : i64, end = 1 : i64, index = #accln<"index{i_i_o,11}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 16, 4]} } {begin = 0 : i64, end = 128 : i64, index = #accln<"index{k_i_o,9}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 16, 4]} } {begin = 0 : i64, end = 256 : i64, index = #accln<"index{j_i_o,13}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [1, 16, 128]} "accv.lambda"() ( { @@ -416,7 +416,7 @@ module @optimized_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 %7 = affine.load %2[((%arg7 + %arg9 * 8) floordiv 16) mod 16, (%arg6 + %arg8) mod 6, (((%arg7 + %arg9 * 8) mod 16) floordiv 8) mod 2] : memref<16x6x2xvector<8xf32>> %8 = addf %6, %7 : vector<8xf32> store %8, %1[%arg8, %arg9] : memref<1x16xvector<8xf32>> - } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{i_o,7}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{k_i,6}">, #accln<"index{i_o,7}">], subdomainSize = [1, 1]} + } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{i_o,7}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{k_i,6}">, #accln<"index{i_o,7}">], subdomainSize = [1, 1]} } {begin = 0 : i64, end = 1 : i64, index = #accln<"index{k_i,6}">, subdomainIndexOrder = [#accln<"index{k_i,6}">, #accln<"index{i_o,7}">], subdomainSize = [1, 16]} accv.return }) {exec_target = 0 : i64, sym_name = "NestFunction_11", type = () -> ()} : () -> () @@ -441,7 +441,7 @@ module @optimized_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 %7 = affine.load %2[((%arg7 + %arg9 * 8) floordiv 16) mod 16, (%arg6 + %arg8) mod 6, (((%arg7 + %arg9 * 8) mod 16) floordiv 8) mod 2] : memref<16x6x2xvector<8xf32>> %8 = addf %6, %7 : vector<8xf32> store %8, %1[%arg8, %arg9] : memref<1x16xvector<8xf32>> - } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{i_i_o,11}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{k_i_i,10}">, #accln<"index{i_i_o,11}">], subdomainSize = [1, 1]} + } {begin = 0 : i64, end = 16 : i64, index = #accln<"index{i_i_o,11}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{k_i_i,10}">, #accln<"index{i_i_o,11}">], subdomainSize = [1, 1]} } {begin = 0 : i64, end = 1 : i64, index = #accln<"index{k_i_i,10}">, subdomainIndexOrder = [#accln<"index{k_i_i,10}">, #accln<"index{i_i_o,11}">], subdomainSize = [1, 16]} accv.return }) {exec_target = 0 : i64, sym_name = "NestFunction_9", type = () -> ()} : () -> () diff --git a/docs/Tutorials/optimized_matmul/mlir/3_ValueFuncToTarget.mlir b/docs/Tutorials/optimized_matmul/mlir/3_ValueFuncToTarget.mlir index b26dc9f4..8f0e2338 100644 --- 
a/docs/Tutorials/optimized_matmul/mlir/3_ValueFuncToTarget.mlir +++ b/docs/Tutorials/optimized_matmul/mlir/3_ValueFuncToTarget.mlir @@ -613,7 +613,7 @@ module @optimized_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 affine.store %270, %2[((%268 - %arg3) floordiv 16) mod 16, (%145 - %arg4) mod 6, (((%268 - %arg3) mod 16) floordiv 8) mod 2] : memref<16x6x2xvector<8xf32>> } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_i,12}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 1]} } {begin = 0 : i64, end = 4 : i64, index = #accln<"index{k_i_i,10}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 1]} - } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_o,11}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 4]} + } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_o,11}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 4]} affine.for %arg7 = 0 to 4 { %4 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2)>(%arg4, %c0_7, %c0_8) %5 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2)>(%arg4, %c0_7, %c0_8) @@ -2833,7 +2833,7 @@ module @optimized_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 affine.store %268, %0[((%266 - %arg5) floordiv 16) mod 16, (%143 - %arg6) mod 6, (((%266 - %arg5) mod 16) floordiv 8) mod 2] : memref<16x6x2xvector<8xf32>> } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_i,12}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 1]} } {begin = 0 : i64, end = 4 : i64, index = #accln<"index{k_i_i,10}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 1]} - } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_o,11}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 4]} + } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_o,11}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 4]} affine.for %arg9 = 0 to 4 { %2 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2)>(%arg6, %c0_7, %c0_8) %3 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2)>(%arg6, %c0_7, %c0_8) diff --git a/docs/Tutorials/optimized_matmul/mlir/4_SymbolDCE.mlir b/docs/Tutorials/optimized_matmul/mlir/4_SymbolDCE.mlir index 07b3ccc0..e36f1a6a 100644 --- a/docs/Tutorials/optimized_matmul/mlir/4_SymbolDCE.mlir +++ b/docs/Tutorials/optimized_matmul/mlir/4_SymbolDCE.mlir @@ -613,7 +613,7 @@ module @optimized_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 affine.store %270, %2[((%268 - %arg3) floordiv 16) mod 16, (%145 - %arg4) mod 6, (((%268 - %arg3) mod 16) floordiv 8) mod 2] : memref<16x6x2xvector<8xf32>> } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_i,12}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 1]} } {begin = 0 : i64, end = 4 : i64, index = #accln<"index{k_i_i,10}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 1]} - } {begin = 0 : i64, end = 0 : i64, index = 
#accln<"index{i_i_o,11}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 4]} + } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_o,11}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 4]} affine.for %arg7 = 0 to 4 { %4 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2)>(%arg4, %c0_7, %c0_8) %5 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2)>(%arg4, %c0_7, %c0_8) diff --git a/docs/Tutorials/optimized_matmul/mlir/5_LinalgLowerToAffineLoops.mlir b/docs/Tutorials/optimized_matmul/mlir/5_LinalgLowerToAffineLoops.mlir index 9012ac2d..04929b29 100644 --- a/docs/Tutorials/optimized_matmul/mlir/5_LinalgLowerToAffineLoops.mlir +++ b/docs/Tutorials/optimized_matmul/mlir/5_LinalgLowerToAffineLoops.mlir @@ -514,7 +514,7 @@ module @optimized_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 affine.store %269, %2[((%267 - %arg3) floordiv 16) mod 16, (%144 - %arg4) mod 6, (((%267 - %arg3) mod 16) floordiv 8) mod 2] : memref<16x6x2xvector<8xf32>> } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_i,12}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 1]} } {begin = 0 : i64, end = 4 : i64, index = #accln<"index{k_i_i,10}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 1]} - } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_o,11}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 4]} + } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_o,11}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 4]} affine.for %arg7 = 0 to 4 { %4 = affine.apply affine_map<(d0, d1) -> (d0 + d1)>(%arg3, %arg5) %5 = affine.apply affine_map<(d0, d1) -> (d0 + d1)>(%arg6, %arg7) diff --git a/docs/Tutorials/optimized_matmul/mlir/6_SimplifyAffineStructures.mlir b/docs/Tutorials/optimized_matmul/mlir/6_SimplifyAffineStructures.mlir index f41d7ec6..56e88c48 100644 --- a/docs/Tutorials/optimized_matmul/mlir/6_SimplifyAffineStructures.mlir +++ b/docs/Tutorials/optimized_matmul/mlir/6_SimplifyAffineStructures.mlir @@ -514,7 +514,7 @@ module @optimized_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 affine.store %269, %2[((-%arg3 + %267) floordiv 16) mod 16, (-%arg4 + %144) mod 6, (((-%arg3 + %267) mod 16) floordiv 8) mod 2] : memref<16x6x2xvector<8xf32>> } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_i,12}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 1]} } {begin = 0 : i64, end = 4 : i64, index = #accln<"index{k_i_i,10}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 1]} - } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_o,11}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 4]} + } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_o,11}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 4]} affine.for %arg7 = 0 to 4 { %4 = affine.apply affine_map<(d0, d1) -> (d0 + 
d1)>(%arg3, %arg5) %5 = affine.apply affine_map<(d0, d1) -> (d0 + d1)>(%arg6, %arg7) diff --git a/docs/Tutorials/optimized_matmul/mlir/7_Canonicalizer.mlir b/docs/Tutorials/optimized_matmul/mlir/7_Canonicalizer.mlir index 6cf8c062..d77eefe0 100644 --- a/docs/Tutorials/optimized_matmul/mlir/7_Canonicalizer.mlir +++ b/docs/Tutorials/optimized_matmul/mlir/7_Canonicalizer.mlir @@ -456,7 +456,7 @@ module @optimized_matmul attributes {llvm.data_layout = "e-m:w-p270:32:32-p271:3 affine.store %211, %2[((%arg5 + 8) floordiv 16) mod 16, (%arg7 + %arg9) mod 6, %arg5 floordiv 8 - ((%arg5 + 8) floordiv 16) * 2 - ((%arg5 floordiv 8 - ((%arg5 + 8) floordiv 16) * 2 + 1) floordiv 2) * 2 + 1] : memref<16x6x2xvector<8xf32>> } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_i,12}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 1]} } {begin = 0 : i64, end = 4 : i64, index = #accln<"index{k_i_i,10}">, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 1]} - } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_o,11}">, rcv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 4]} + } {begin = 0 : i64, end = 0 : i64, index = #accln<"index{i_i_o,11}">, accv_unrolled, subdomainIndexOrder = [#accln<"index{i,0}">, #accln<"index{j,1}">, #accln<"index{k,2}">], subdomainSize = [0, 16, 4]} affine.for %arg7 = 0 to 4 { %4 = affine.apply affine_map<(d0, d1) -> (d0 + d1)>(%arg6, %arg7) %5 = affine.apply affine_map<(d0, d1) -> (d0 + d1)>(%arg6, %arg7) diff --git a/requirements.txt b/requirements.txt index 780f1d74..7d3ef666 100644 --- a/requirements.txt +++ b/requirements.txt @@ -11,4 +11,5 @@ mkdocs-material packaging plotly kaleido -hatlib \ No newline at end of file +hatlib +varname \ No newline at end of file diff --git a/setup.cfg b/setup.cfg index e5dd887b..c88611b9 100644 --- a/setup.cfg +++ b/setup.cfg @@ -31,6 +31,7 @@ setup_requires = setuptools>=31 setuptools_scm install_requires = + varname hatlib numpy pyyaml