Squashed commit of the following:

commit 8e4041a19d8577a4c14741b45b6ab733e5d53d74
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Wed Aug 10 00:22:43 2022 +0000

    Merged PR 2814: Parameterize batch_size in GPU benchmarks

    Parameterize batch_size in GPU benchmarks

commit eb6197a8f54555ab47be383f7f48b7efc9442041
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Mon Aug 8 05:51:35 2022 +0000

    Merged PR 2810: [release] [nfc] Bump docs version to 1.2.8, bump github actions to llvm 14.0.6

    Preparation for 1.2.8 release

commit 63c7a397210a753836a647f399e2798bba521939
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Mon Aug 8 05:03:36 2022 +0000

    Merged PR 2808: [ci] Add vcpkg caching for buddy builds, disable flaky parallelized tests

    * Enable vcpkg binary caching for CI pipelines that are using non custom agents. This reduces vcpkg install time from 2-3 minutes to ~30 seconds
    * ctest --parallel on macos can sometimes fail randomly. The tests will need to be updated to support running in parallel

    References: https://vcpkg.io/en/docs/users/binarycaching.html

    Note: an organization-wide Nuget feed must be created. Project-wide Nuget feeds will fail with access denied.

commit 37e207a0deb2c8c431a6c0e73787c726221a4f37
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Mon Aug 8 04:13:14 2022 +0000

    Merged PR 2804: [ci] Reduce runtimes of PR Buddy Builds

    * Remove redundant setup.py builds in pipelines with cmake builds
    * Build debug for Linux only (the fastest config)
    * Add pipeline caching for ccache, conan, and pip where applicable
    * Add parallel configs where applicable
    * Filter out some tests on windows due to slow runtimes. These should have coverage on Linux and macOS.

commit c8940050d9064e5c326644761ef2821b59e8e431
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Fri Aug 5 23:31:24 2022 +0000

    Merged PR 2807: Enable verification for CK baselines

    - Enable verification for CK baselines
    - increase timeout for cuda resnet
    - add functionality for extracting kernel code from cosmosdb

commit da114623db8518c9b47eae83eee1357f0bcd3565
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Fri Aug 5 22:35:43 2022 +0000

    Merged PR 2802: Fix barrier optimization pass

    This PR fixes a couple of barrier-related issues:
    - The barrier optimization pass wasn't keeping barriers that protected vector load/store ops
    - Multiple barriers were getting generated when hoisting barriers out of conditionals

    Related work items: #3732

commit 03171fec09146bdeacfdc2da68ff73202e30d534
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Thu Aug 4 19:08:27 2022 +0000

    Merged PR 2800: Add max_threads to parallelize and change default behavior

    - Add num_threads to parallelize
    - change default behavior to count the number of iterations of the given indices
    - Update documentation

commit 7ff3a90dd09f74c8699657b17909a575c59267fa
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Thu Aug 4 16:42:29 2022 +0000

    Merged PR 2801: Remove verification on cuda-fp32-big benchmark

    Remove verification on cuda-fp32-big benchmark

commit 5e6f6d93f7c62b2f965e12cb69d00b77a6c65a89
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Mon Aug 1 22:36:03 2022 +0000

    Merged PR 2798: LLVM 14.0.6 upgrade

    An incremental upgrade with minimal or no changes to MLIR

commit bf8faeaee154befc7c34e221fe54f9a1fd2799f3
Author: Kern Handa <kerha@microsoft.com>
Date:   Sat Jul 30 04:55:03 2022 +0000

    Merged PR 2796: Makes NestedPassAdaptor's pipeline consistent

    Makes NestedPassAdaptor's pipeline consistent

    This change makes it so NestedPassAdaptor creates a new pass manager
    every time a new pass is added. Prior to this change, if dumpPasses
    was false, the same nested pass manager would be used. If dumpPasses
    was true, a new nested pass manager would be created per call to
    addPass. This difference in behavior was also resulting in the
    lowering pipeline to be different, depending on the value of dumpPasses.

    For example, in the following code in AcceraPasses.cpp, all the
    passes that are added to `funcOpPM` run BEFORE `createConvertSCFToOpenMPPass`
    if `dumpPasses` was false.

    ```cpp
    auto funcOpPM = pmAdaptor.nestPassManager([&]() -> OpPassManager& { return pm.nest<v::ValueModuleOp>().nest<FuncOp>(); });

    funcOpPM.addPass(createConvertLinalgToAffineLoopsPass());
    funcOpPM.addPass(createSimplifyAffineStructuresPass());
    funcOpPM.addPass(createCanonicalizerPass());
    funcOpPM.addPass(createLoopInvariantCodeMotionPass());
    funcOpPM.addPass(createCSEPass());

    pmAdaptor.addPass(createConvertSCFToOpenMPPass());
    pmAdaptor.addPass(value::createValueToStdPass(options.enableProfile));
    funcOpPM.addPass(value::createBarrierOptPass(options.writeBarrierGraph.getValue(), options.barrierGraphFilename.getValue()));
    pmAdaptor.addPass(value::createRangeValueOptimizePass());
    pmAdaptor.addPass(createCanonicalizerPass());
    pmAdaptor.addPass(createCSEPass());
    ```

    Additionally, this change exposed the fact that the BarrierOpt pass is
    incorrectly erasing barriers, and so has been made into a no-op until
    this correctness issue has been fixed.

commit d97a5fd55712ce783f65dd948a9e9c152b1ff2d1
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Thu Jul 28 18:33:54 2022 +0000

    Merged PR 2795: [docs] Cleanup viz scripts, clarify reorder illustrations

    * Clarify in the labels while working on the animated version
    * Cleanup and rename .js files for (slightly) easier lookup

commit 4afe6b763c2097a9a09eebac4ed5f0f5f59f587d
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Thu Jul 28 08:06:22 2022 +0000

    Merged PR 2475: LLVM 14.0.0 upgrade

    Tag: llvmorg-14.0.0

    Notable changes:
    * std dialect ops are now moved to arith, math dialects
    * StrEnumAttribute is now replaced by simple enums. This affects things like gpu.dimension.x
    * [Issue] linalg.copy is removed, replaced by memref.copy, which introduces a runtime dependency on a `memrefCopy` C function for non-identity layout copies. This affects Array.sub_array in debug mode.
    * [Regression] OMP to LLVM lowering will crash in mlir-translate findAlloc due to a empty set of blocks being emitted. This only affects dynamic scheduling with collapsed loops.
    * Lots of renames
    * Upgraded macOS to macOS-12

    Related work items: #3646

commit de3bd0ffde5ebb4bf69cb0db0c46bc76fef37c4b
Author: Denny Sun <dennys@microsoft.com>
Date:   Thu Jul 28 01:02:23 2022 +0000

    Merged PR 2753: accera.Dimension and runtime-sized Arrays in the Python DSL

    With this change, Accera is able to generate the initial mlir for runtime sized Arrays. The ir lowering is not fully working due to some bug, which can be fixed in the later changes.

    ```
    M = Dim()
    N = Dim()
    K = Dim()

    A = Array(shape=(M, K), element_type=ScalarType.float32, role=Array.Role.INPUT)
    B = Array(shape=(K, N), element_type=ScalarType.float32, role=Array.Role.INPUT)
    C = Array(shape=(M, N), element_type=ScalarType.float32, role=Array.Role.INPUT_OUTPUT)

    nest = Nest((M, N, K))
    i, j, k = nest.get_indices()

    @nest.iteration_logic
    def _():
        C[i, j] += A[i, k] * B[k, j]

    package.add()
    package.build()
    ```

    ```
    module @test_runtimesizes attributes {llvm.data_layout = "... ..."} {
      accv.module "test_runtimesizes" {
        accv.func nested @runtimesizes_..._impl_...(%arg0: index loc(unknown), %arg1: index loc(unknown), %arg2: index loc(unknown), %arg3: memref<?x?xf32, #map> loc(unknown), %arg4: memref<?x?xf32, #map> loc(unknown), %arg5: memref<?x?xf32, #map> loc(unknown)) attributes {accv.output_verifiers = ["", "", "", "", "", "_debug_check_allclose_<accera.lang.Dim.Dim object at ...>_<accera.lang.Dim.Dim object at ...>_..."], exec_target = 0 : i64} {
          %0 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(#loc)
          %1 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(#loc)
          %2 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(#loc)
          "accln.nest"(%0, %1, %2) ( {
            %3 = accln.sym_index {name = "i"} #accln<"index{i,3}"> loc(#loc)
            %4 = accln.sym_index {name = "j"} #accln<"index{j,4}"> loc(#loc)
            %5 = accln.sym_index {name = "k"} #accln<"index{k,5}"> loc(#loc)
            "accln.kernel"() ( {
              %7 = "accv.slice"(%arg5, %3, %4) {sliceDimensions = [0, 1]} : (memref<?x?xf32, #map>, index, index) -> memref<f32> loc(#loc)
              ... ...
              accln.terminator loc(#loc)
            }) {sym_name = "_"} : () -> () loc(#loc)
            ... ...
            accln.terminator loc(#loc)
          }) {domain = #domain0, exec_target = 0 : i64, kernels = []} : (index, index, index) -> () loc(#loc)
          accv.return loc(#loc)
        } loc(#loc)
        accv.func @runtimesizes_...(%arg0: index loc(unknown), %arg1: index loc(unknown), %arg2: index lo...
    ```

commit 75553672d92f6e60638b4cb6169bda9712401eef
Author: JUBI TANEJA <jubitaneja@microsoft.com>
Date:   Wed Jul 27 23:34:03 2022 +0000

    Merged PR 2793: support sign extend op in canVectorize() function to improve generated MLIR

    While trying to optimize `int16` `MatMul` with vectorize transformation in DSL, we noticed an unrolled loop with load, binop, sexti, store instructions. There was no vector instruction emitted and it hinted us that sign extend instruction is not supported in `canVectorize` function and now with this op supported, we can emit some vector instructions in the MLIR.

commit 4fa740166b1c17359d40b540b7b7eb623caa167a
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Wed Jul 27 07:02:01 2022 +0000

    Merged PR 2790: Filter invalid kernels from GPU benchmarks

    - Filter invalid kernels from GPU benchmarks
    - Disable verification on cuda f16 benchmarks
    - Remove frequent cleanups

commit 6cff78412b04f15c3257bc2286a3805974a56012
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Tue Jul 26 03:27:14 2022 +0000

    Merged PR 2787: Remove MLIR flag from package format in benchmarks

    Remove MLIR flag from package format in benchmarks

commit 0e7b7ef930f177e6a70b88aad7982e9a36dd116f
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Mon Jul 25 23:10:43 2022 +0000

    Merged PR 2784: Merge Github changes to ADO

    Author: Lisa Ong <11318241+lisaong@users.noreply.github.com>
    Date:   Mon Jul 25 19:13:00 2022 +0800

    Update Building_on_Ubuntu.md

commit 474d7e4c6fd7dcd2f723193e69446da4c63f97ee
Author: Lisa Ong <11318241+lisaong@users.noreply.github.com>
Date:   Mon Jul 25 19:03:30 2022 +0800

    Github codespaces configuration (#48)

commit 0e8ffcd806bfc1671c89e599c2562592c4d06f21
Author: Anthony Shaw <anthony.p.shaw@gmail.com>
Date:   Mon Jul 25 15:34:18 2022 +1000

    Set license field in metadata of package (#46)

    * Set license field in meta
    * Update all setup.cfg files

commit 9a8ea90b22b02379072a98eacc8d5f49c1a28e69
Author: Lisa Ong <11318241+lisaong@users.noreply.github.com>
Date:   Mon Jul 25 10:24:26 2022 +0800

    Enable CIs from pull requests from forks

commit 8275363815c0e128ff477a8c7692ad44353db5aa
Author: Chuck Jacobs <cjacobs@microsoft.com>
Date:   Mon Jul 25 20:41:39 2022 +0000

    Merged PR 2776: Make fusing more efficient

    This PR refactors the code generation for schedules and makes it more efficient. This makes a big difference for complex schedules with constraints on the kernels (like the ones generated when fusing schedules).

    Here are some timings on a few tests (modified versions of Mason's example script) I ran:

    | test | main branch | PR branch |
    |----|----|----|
    | 3 fused schedules, tile first only | 18.8s | 5.8s |
    | 3 fused schedules, tile 1 & 2 | 190s | 6.2s |
    | 3 fused schedules, tile all 3 | ???? | 7.2s |

    Related work items: #3731

commit 2306afbb9425dcecced6a38590da2ae31937d23e
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Mon Jul 25 06:51:24 2022 +0000

    Merged PR 2781: Fix benchmark with MLIR format and add repro test

commit 6e72de99c9f1952111081dc320cc055eb09aabf6
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Sat Jul 23 04:26:14 2022 +0000

    Merged PR 2780: Type support for tensor ops in CUDA

    - Add support for FP32 input (TF32 compute)
    - Add support for bfloat16 input/FP32 output
    - Add support for integer types

    Related work items: #3709, #3710

commit 3bedc2c51c7b93865462df68b154db9c4bdda5ec
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Fri Jul 22 04:11:41 2022 +0000

    Merged PR 2779: Some assorted benchmark fixes

    - Build Accera in release mode
    - Shuffle gemm sizes to run small sizes first
    - Increase tolerance to account for floating point drift for large k-split

commit cb010de71df47c75c48ae3a8a749e03fc606e24f
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Thu Jul 21 19:43:11 2022 +0000

    Merged PR 2774: Add input caching tests for CUDA, enable tests in PR pipelines

    Add input caching tests in CUDA

    Related work items: #3725

commit 85c09542106b4e685ba4141d9cb34b823d0d02b7
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Wed Jul 20 23:55:04 2022 +0000

    Merged PR 2677: Unify rocm/cuda tensor ops lowering under accv dialect

    - remove gpu dialect lowering (CUDA)
    - add accv dialect lowering for CUDA
    - rocm and cuda lowering use the same semantics

    Related work items: #3728

commit 282d66743b4d5d828998198d94003ea93228165c
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Jul 19 02:14:58 2022 +0000

    Merged PR 2764: [doc] Rename acc.Dim to acc.Dimension and add create_dimensions()

    * Rename `acc.Dim` to `acc.Dimension`, `acc.Dim.Role` to `acc.Dimension.Role`
    * Add the simplified `acc.create_dimensions()` construction pattern
    * Kept the `acc.Dimension` constructor for advanced use cases involving generator patterns

    Related work items: #3720

commit e8a0a7475acc2b117b41b97238b2ecb11060cbd6
Author: Ritwik Das <ritdas@microsoft.com>
Date:   Thu Jul 14 03:27:17 2022 +0000

    Merged PR 2752: Add nargs to input args in benchmark tool

    add nargs to input args in benchmark tool

commit 2c5083a721b4cbef11d1782b43d34d67b76caa0e
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Wed Jul 13 05:07:40 2022 +0000

    Merged PR 2680: [doc] Manual and Reference doc updates for Runtime Array DSL

    Proposed DSL changes for supporting runtime array sizes:

    * Adds a new dimension type that serves as a placeholder for runtime dimension sizes for `Array` and `Nest`. Supports both input and output dimensions
    * Adds output-only Arrays
    * Add the Scalar type
    * Example kernels demonstrating different aspects:
      * Gather: basic features
      * Range: scalar function arguments
      * ReduceMean: fusion

    Related work items: #3720

commit dbdbbb94c98f787782e8e4a6171a3af067f91e58
Author: Denny Sun <dennys@microsoft.com>
Date:   Wed Jul 13 01:45:07 2022 +0000

    Merged PR 2683: Support conditionals in Logic Function

    Before this change, there is no way to emit conditionals in logic function.

    With this change, the user is able to write the following logic function:

    ```
    def if_func():
        T[i, j] = A[i, j] + B[i, j]
        C[i, j] += T[i, j]**2.

    def elseif_func():
        T[i, j] = A[i, j] - B[i, j]
        C[i, j] += T[i, j]**2.

    def else_func():
        C[i, j] = A[i, j] + B[i, j]

    @nest.iteration_logic
    def _():
        _If(j<100, if_func).ElseIf(i>100, elseif_func).Else(else_func)
    ```

    Related work items: #3706
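
The ordering difference described in PR 2796 above can be reproduced with a toy model. This is a sketch only, not Accera or MLIR code: `PassManager`, `ReusingNestedAdaptor`, `FreshNestedAdaptor`, and the short pass labels are hypothetical stand-ins chosen for illustration, and the model assumes a nested pass manager runs at the position where it was created in its parent.

```python
class PassManager:
    """Toy pass manager: an ordered list of pass names and nested managers."""

    def __init__(self):
        self.entries = []

    def add_pass(self, name):
        self.entries.append(name)

    def nest(self):
        # A nested manager occupies one slot at the point where it is created.
        nested = PassManager()
        self.entries.append(nested)
        return nested

    def flatten(self):
        """Return pass names in the order they would run."""
        order = []
        for entry in self.entries:
            if isinstance(entry, PassManager):
                order.extend(entry.flatten())
            else:
                order.append(entry)
        return order


class ReusingNestedAdaptor:
    """Models the old dumpPasses == false path: one nested manager, reused."""

    def __init__(self, pm):
        self._nested = pm.nest()

    def add_pass(self, name):
        self._nested.add_pass(name)


class FreshNestedAdaptor:
    """Models the behavior after the change: a new nested manager per pass."""

    def __init__(self, pm):
        self._pm = pm

    def add_pass(self, name):
        self._pm.nest().add_pass(name)


def build(adaptor_cls):
    # Interleave nested and top-level passes, as the AcceraPasses.cpp snippet does.
    pm = PassManager()
    func_pm = adaptor_cls(pm)
    func_pm.add_pass("linalg-to-affine")
    func_pm.add_pass("cse")
    pm.add_pass("scf-to-openmp")
    pm.add_pass("value-to-std")
    func_pm.add_pass("barrier-opt")
    pm.add_pass("canonicalize")
    return pm.flatten()


reused_order = build(ReusingNestedAdaptor)
fresh_order = build(FreshNestedAdaptor)
print("reused:", reused_order)
print("fresh: ", fresh_order)
```

In this model the reused adaptor runs `barrier-opt` before `scf-to-openmp`, even though it was added afterwards, while the per-pass adaptor preserves the order in which the passes were written.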
microsoft · Aug 10, 2022 · a9ab6bd · a9ab6bd
1 parent 8be492c
commit a9ab6bd
Showing 282 changed files with 6,926 additions and 5,446 deletions.
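
The pipeline diffs that follow replace `--verbose True` and `--check True` with bare `--verbose` and `--check`, pass `--input` files separated by spaces instead of commas, and add `--batch_size`. A minimal sketch of how flags with that shape could be declared, assuming the benchmark tool uses Python's `argparse` (the parser itself is not shown in this commit page); `build_parser`, the defaults, and the help strings are hypothetical:

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the flag shapes seen in the pipeline diffs.
    parser = argparse.ArgumentParser(description="benchmark CLI sketch")
    # nargs="+" accepts one or more space-separated values: --input a.csv b.csv
    parser.add_argument("--input", nargs="+", required=True, help="input CSV files")
    parser.add_argument("--type", choices=["h", "s"], default="s", help="data type selector")
    parser.add_argument("--batch_size", type=int, default=None, help="optional batch size")
    # store_true flags take no value: present means True, absent means False.
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--check", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--input", "gemm_big_A6000.csv", "gemm_big.csv",
     "--type", "h", "--batch_size", "1", "--verbose"]
)
print(args.input, args.type, args.batch_size, args.verbose, args.check)
```

With boolean switches declared this way, an invocation that omits `--check` leaves `args.check` false, which fits how some pipelines below drop the flag to skip verification.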
diff --git a/.azure/cuda/cuda-benchmark-baseline.yml b/.azure/cuda/cuda-benchmark-baseline.yml
@@ -54,8 +54,8 @@ jobs:
 
       - bash: |
           export PYTHONPATH=$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8
-          python gpu_benchmark_tool.py --type h --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --janitor True --verbose True --cublas $(Build.SourcesDirectory)/build/temp.linux-x86_64-3.8/tools/benchmarkers/cublas/cublas_gemm --input gemm_rectangle_A6000.csv,gemm_square.csv,gemm_bert_assorted.csv
-          python gpu_benchmark_tool.py --type s --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --janitor True --verbose True --cublas $(Build.SourcesDirectory)/build/temp.linux-x86_64-3.8/tools/benchmarkers/cublas/cublas_gemm --input gemm_rectangle_A6000.csv,gemm_square.csv,gemm_bert_assorted.csv,gemm_resnet_inception.csv
+          python gpu_benchmark_tool.py --type h --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --verbose --cublas $(Build.SourcesDirectory)/build/temp.linux-x86_64-3.8/tools/benchmarkers/cublas/cublas_gemm --input gemm_rectangle_A6000.csv gemm_square.csv gemm_bert_assorted.csv
+          python gpu_benchmark_tool.py --type s --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --verbose --cublas $(Build.SourcesDirectory)/build/temp.linux-x86_64-3.8/tools/benchmarkers/cublas/cublas_gemm --input gemm_rectangle_A6000.csv gemm_square.csv gemm_bert_assorted.csv gemm_resnet_inception.csv
         displayName: Run CUBLAS benchmarks
         workingDirectory: "$(Build.SourcesDirectory)/tools/benchmarkers"
         env:
@@ -71,8 +71,8 @@ jobs:
 
       - bash: |
           export PYTHONPATH=$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8
-          python gpu_benchmark_tool.py --type h --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --janitor True --verbose True --cutlass $(System.DefaultWorkingDirectory)/cutlass/build/tools/profiler/cutlass_profiler --input gemm_rectangle_A6000.csv,gemm_square.csv,gemm_bert_assorted.csv
-          python gpu_benchmark_tool.py --type s --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --janitor True --verbose True --cutlass $(System.DefaultWorkingDirectory)/cutlass/build/tools/profiler/cutlass_profiler --input gemm_rectangle_A6000.csv,gemm_square.csv,gemm_bert_assorted.csv,gemm_resnet_inception.csv
+          python gpu_benchmark_tool.py --type h --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --verbose --cutlass $(System.DefaultWorkingDirectory)/cutlass/build/tools/profiler/cutlass_profiler --input gemm_rectangle_A6000.csv gemm_square.csv gemm_bert_assorted.csv
+          python gpu_benchmark_tool.py --type s --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --verbose --cutlass $(System.DefaultWorkingDirectory)/cutlass/build/tools/profiler/cutlass_profiler --input gemm_rectangle_A6000.csv gemm_square.csv gemm_bert_assorted.csv gemm_resnet_inception.csv
         displayName: Run CUTLASS benchmarks
         workingDirectory: "$(Build.SourcesDirectory)/tools/benchmarkers"
         env:

diff --git a/.azure/cuda/cuda-benchmark-fp16-bert.yml b/.azure/cuda/cuda-benchmark-fp16-bert.yml
@@ -42,13 +42,13 @@ jobs:
         workingDirectory: "$(Build.SourcesDirectory)"
 
       - bash: |
-          python ./setup.py build -g -b build -t build bdist_wheel -d build/dist
+          python ./setup.py build -b build -t build bdist_wheel -d build/dist
         displayName: Python build
         workingDirectory: "$(Build.SourcesDirectory)"
 
       - bash: |
           export PYTHONPATH=$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8
-          python gpu_benchmark_tool.py --input gemm_bert_assorted.csv --type h --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --janitor True --verbose True --check True
+          python gpu_benchmark_tool.py --input gemm_bert_assorted.csv --type h --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE
         displayName: Run fp16 benchmarks BERT
         workingDirectory: "$(Build.SourcesDirectory)/tools/benchmarkers"
         env:

diff --git a/.azure/cuda/cuda-benchmark-fp16-big.yml b/.azure/cuda/cuda-benchmark-fp16-big.yml
@@ -42,13 +42,13 @@ jobs:
         workingDirectory: "$(Build.SourcesDirectory)"
 
       - bash: |
-          python ./setup.py build -g -b build -t build bdist_wheel -d build/dist
+          python ./setup.py build -b build -t build bdist_wheel -d build/dist
         displayName: Python build
         workingDirectory: "$(Build.SourcesDirectory)"
 
       - bash: |
           export PYTHONPATH=$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8
-          python gpu_benchmark_tool.py --input gemm_big_A6000.csv,gemm_big.csv --type h --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --janitor True --verbose True --check True
+          python gpu_benchmark_tool.py --input gemm_big_A6000.csv gemm_big.csv --type h --batch_size 1 --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE
         displayName: Run fp16 benchmarks BIG A6000
         workingDirectory: "$(Build.SourcesDirectory)/tools/benchmarkers"
         env:

diff --git a/.azure/cuda/cuda-benchmark-fp16.yml b/.azure/cuda/cuda-benchmark-fp16.yml
@@ -42,13 +42,13 @@ jobs:
         workingDirectory: "$(Build.SourcesDirectory)"
 
       - bash: |
-          python ./setup.py build -g -b build -t build bdist_wheel -d build/dist
+          python ./setup.py build -b build -t build bdist_wheel -d build/dist
         displayName: Python build
         workingDirectory: "$(Build.SourcesDirectory)"
 
       - bash: |
           export PYTHONPATH=$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8
-          python gpu_benchmark_tool.py --input gemm_small_A6000.csv,gemm_small.csv --type h --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --janitor True --verbose True --check True
+          python gpu_benchmark_tool.py --input gemm_small_A6000.csv gemm_small.csv --type h --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE
         displayName: Run fp16 benchmarks A6000
         workingDirectory: "$(Build.SourcesDirectory)/tools/benchmarkers"
         env:

diff --git a/.azure/cuda/cuda-benchmark-fp32-bert.yml b/.azure/cuda/cuda-benchmark-fp32-bert.yml
@@ -42,13 +42,13 @@ jobs:
         workingDirectory: "$(Build.SourcesDirectory)"
 
       - bash: |
-          python ./setup.py build -g -b build -t build bdist_wheel -d build/dist
+          python ./setup.py build -b build -t build bdist_wheel -d build/dist
         displayName: Python build
         workingDirectory: "$(Build.SourcesDirectory)"
 
       - bash: |
           export PYTHONPATH=$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8
-          python gpu_benchmark_tool.py --input gemm_bert_assorted.csv --type s --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --janitor True --verbose True --check True
+          python gpu_benchmark_tool.py --input gemm_bert_assorted.csv --type s --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --verbose --check
         displayName: Run fp32 benchmarks BERT
         workingDirectory: "$(Build.SourcesDirectory)/tools/benchmarkers"
         env:

diff --git a/.azure/cuda/cuda-benchmark-fp32-big.yml b/.azure/cuda/cuda-benchmark-fp32-big.yml
@@ -42,13 +42,13 @@ jobs:
         workingDirectory: "$(Build.SourcesDirectory)"
 
       - bash: |
-          python ./setup.py build -g -b build -t build bdist_wheel -d build/dist
+          python ./setup.py build -b build -t build bdist_wheel -d build/dist
         displayName: Python build
         workingDirectory: "$(Build.SourcesDirectory)"
 
       - bash: |
           export PYTHONPATH=$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8
-          python gpu_benchmark_tool.py --input gemm_big_A6000.csv,gemm_big.csv --type s --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --janitor True --verbose True --check True
+          python gpu_benchmark_tool.py --input gemm_big_A6000.csv gemm_big.csv --type s --batch_size 1 --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE
         displayName: Run fp32 benchmarks BIG A6000
         workingDirectory: "$(Build.SourcesDirectory)/tools/benchmarkers"
         env:

diff --git a/.azure/cuda/cuda-benchmark-fp32-resnet.yml b/.azure/cuda/cuda-benchmark-fp32-resnet.yml
@@ -9,7 +9,7 @@ trigger: none
 
 jobs:
   - job: "CUDA_Benchmarking_FP32_RESNET"
-    timeoutInMinutes: 1080
+    timeoutInMinutes: 2160
 
     pool:
       name: LinuxNVGPUPool
@@ -42,13 +42,13 @@ jobs:
         workingDirectory: "$(Build.SourcesDirectory)"
 
       - bash: |
-          python ./setup.py build -g -b build -t build bdist_wheel -d build/dist
+          python ./setup.py build -b build -t build bdist_wheel -d build/dist
         displayName: Python build
         workingDirectory: "$(Build.SourcesDirectory)"
 
       - bash: |
           export PYTHONPATH=$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8
-          python gpu_benchmark_tool.py --input gemm_resnet_inception.csv --type s --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --janitor True --verbose True --check True
+          python gpu_benchmark_tool.py --input gemm_resnet_inception.csv --type s --batch_size 1 --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --verbose --check
         displayName: Run fp32 benchmarks RESNET
         workingDirectory: "$(Build.SourcesDirectory)/tools/benchmarkers"
         env:

diff --git a/.azure/cuda/cuda-benchmark-fp32.yml b/.azure/cuda/cuda-benchmark-fp32.yml
@@ -42,13 +42,13 @@ jobs:
         workingDirectory: "$(Build.SourcesDirectory)"
 
       - bash: |
-          python ./setup.py build -g -b build -t build bdist_wheel -d build/dist
+          python ./setup.py build -b build -t build bdist_wheel -d build/dist
         displayName: Python build
         workingDirectory: "$(Build.SourcesDirectory)"
 
       - bash: |
           export PYTHONPATH=$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8
-          python gpu_benchmark_tool.py --input gemm_small_A6000.csv,gemm_small.csv --type s --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --janitor True --verbose True --check True
+          python gpu_benchmark_tool.py --input gemm_small_A6000.csv gemm_small.csv --type s --target 'NVidia RTX A6000' --branch $(Build.SourceBranch) --output $(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8/accera_benchmarks/results --upload official_build_container_DO_NOT_UPLOAD_HERE --verbose --check
         displayName: Run fp32 benchmarks A6000
         workingDirectory: "$(Build.SourcesDirectory)/tools/benchmarkers"
         env:

diff --git a/.azure/cuda/cuda-pr.yml b/.azure/cuda/cuda-pr.yml
@@ -62,14 +62,14 @@ steps:
       echo "CUDA_VISIBLE_DEVICES" ${CUDA_VISIBLE_DEVICES}
       export LLVM_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer-12
       python -m pip install bfloat16
-      python -m pytest -v --junitxml=test/test-mfma_tests.xml accera/test/mfma_tests.py
+      python -m pytest -s -v --junitxml=test/test-mfma_tests.xml accera/test/mfma_tests.py
     displayName: Run MFMA tests
     workingDirectory: "$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8"
 
   - bash: |
       export CUDA_VISIBLE_DEVICES=$(CUDA_VISIBLE_DEVICES)
       export LLVM_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer-12
-      python -m pytest -v --junitxml=test/test-smoke_tests.xml accera/test/smoke_tests.py -k "cuda"
+      python -m pytest -s -v --junitxml=test/test-smoke_tests.xml accera/test/smoke_tests.py
     displayName: Run CUDA smoke tests
     workingDirectory: "$(Build.SourcesDirectory)/build/lib.linux-x86_64-3.8"
 

diff --git a/.azure/linux-accera.yml b/.azure/linux-accera.yml
@@ -15,25 +15,40 @@ strategy:
     Python310:
       Python.Version: "3.10"
 
+variables:
+ - name: PARALLEL
+   value: 4 # 2 cores (https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/hosted?view=azure-devops&tabs=yaml#hardware)
+ - name: PIP_CACHE_DIR
+   value: $(Pipeline.Workspace)/.pip
+ - name: VCPKG_BINARY_SOURCES
+   value: "clear;nuget,$(VCPKG_NUGET_FEED),readwrite"
+
 steps:
+  - task: NuGetAuthenticate@0
+
   - task: UsePythonVersion@0
     inputs:
       versionSpec: $(Python.Version)
       addToPath: true
       architecture: "x64"
 
+  - task: Cache@2
+    inputs:
+      key: 'pip | "$(Agent.OS)" | $(Build.SourcesDirectory)/requirements.txt'
+      restoreKeys: |
+        pip | "$(Agent.OS)"
+      path: $(PIP_CACHE_DIR)
+    displayName: Cache pip
+
   - bash: |
       sudo apt-get install libunwind-dev ninja-build ccache python3-pip libvulkan-dev libomp-11-dev pkg-config -y
       sudo sysctl -w kernel.core_pattern="$(Build.SourcesDirectory)/build/core-%e-%s-%u-%g-%p-%t.dump"
       ulimit -c unlimited
       python -m pip install -U pip
       python -m pip install -r $(Build.SourcesDirectory)/requirements.txt
-      echo "mkdir $HOME/.ccache"
-      mkdir $HOME/.ccache
-      echo "ln -s $HOME/.ccache $(System.DefaultWorkingDirectory)/ccache"
-      ln -s $HOME/.ccache $(System.DefaultWorkingDirectory)/ccache
       conan remote add accera $(CONAN_REMOTE)
       conan user -p $(CONAN_PWD) -r accera $(CONAN_USERNAME)
+      echo "##vso[task.prependpath]/usr/lib/ccache"
     displayName: Install prereqs for Linux
     env:
       CONAN_PWD: $(CONAN_PWD)
@@ -53,35 +68,35 @@ steps:
 
   # Note: Code signing is not available for Linux distributions (outside of packages.microsoft.com)
   - task: PythonScript@0
-    displayName: python ./setup.py build bdist_wheel -d $(Build.SourcesDirectory)/build/dist
+    displayName: python ./setup.py build_ext -j $(PARALLEL) build bdist_wheel -d $(Build.SourcesDirectory)/build/dist
     inputs:
       scriptSource: "filePath"
       scriptPath: "$(Build.SourcesDirectory)/setup.py"
-      arguments: "build bdist_wheel -d $(Build.SourcesDirectory)/build/dist"
+      arguments: "build_ext -j $(PARALLEL) build bdist_wheel -d $(Build.SourcesDirectory)/build/dist"
       workingDirectory: "$(Build.SourcesDirectory)/"
 
   - task: PythonScript@0
-    displayName: compilers python ./setup.py build bdist_wheel -d $(Build.SourcesDirectory)/build/dist
+    displayName: compilers python ./setup.py build_ext -j $(PARALLEL) build bdist_wheel -d $(Build.SourcesDirectory)/build/dist
     inputs:
       scriptSource: "filePath"
       scriptPath: "$(Build.SourcesDirectory)/accera/python/compilers/setup.py"
-      arguments: "build bdist_wheel -d $(Build.SourcesDirectory)/build/dist"
+      arguments: "build_ext -j $(PARALLEL) build bdist_wheel -d $(Build.SourcesDirectory)/build/dist"
       workingDirectory: "$(Build.SourcesDirectory)/accera/python/compilers"
 
   - task: PythonScript@0
     displayName: gpu python ./setup.py build bdist_wheel -d $(Build.SourcesDirectory)/build/dist
     inputs:
       scriptSource: "filePath"
       scriptPath: "$(Build.SourcesDirectory)/accera/python/gpu/setup.py"
-      arguments: "build bdist_wheel -d $(Build.SourcesDirectory)/build/dist"
+      arguments: "build_ext -j $(PARALLEL) build bdist_wheel -d $(Build.SourcesDirectory)/build/dist"
       workingDirectory: "$(Build.SourcesDirectory)/accera/python/gpu"
 
   - task: PythonScript@0
     displayName: llvm python ./setup.py build bdist_wheel -d $(Build.SourcesDirectory)/build/dist
     inputs:
       scriptSource: "filePath"
       scriptPath: "$(Build.SourcesDirectory)/accera/python/llvm/setup.py"
-      arguments: "build bdist_wheel -d $(Build.SourcesDirectory)/build/dist"
+      arguments: "build_ext -j $(PARALLEL) build bdist_wheel -d $(Build.SourcesDirectory)/build/dist"
       workingDirectory: "$(Build.SourcesDirectory)/accera/python/llvm"
 
   - bash: |