v1.2.6
What's Changed
- Bump urllib3 from 1.25.8 to 1.26.5 in /tools/benchmarkers by @dependabot in #42
- [ci] Fix out of disk space errors for CI workflow by @lisaong in #43
- Bump bottle from 0.12.19 to 0.12.20 in /tools/viz by @dependabot in #44
-
Merged PR 2657: Add conversion pass from gpu ops to rocdl ops. [Ritwik Das]
- switch to gpu dialect for gpu index ops
- add conversion pass from gpu dialect to rocdl
-
Merged PR 2652: Add integer tensor ops support for AMD targets.
[Ritwik Das]
- int mfma ops
- tests
- static_cast in c++
Related work items: #3727
-
Merged PR 2650: [release] Docs version to 1.2.6, sync Github to ADO.
[Lisa Ong]
-
Merged PR 2624: Add more MMA shapes for CUDA. [Ritwik Das]
- 32x8x16
- 8x32x16
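A hedged sketch of how one of the new shapes might be selected during tensorization, continuing a tiled GEMM schedule/plan like the one sketched under PR 2607 below; plan.tensorize with an mma_shape argument is assumed here, and the enum spelling is an assumption rather than a quote from the PR:
```python
# Hedged sketch: selecting the newly added 32x8x16 CUDA MMA shape. The
# mma_shape argument and enum member name are assumptions; the inner splits
# (32, 8, 16) mirror the shape's M, N, K dimensions. `schedule`, `plan`,
# `ii`, `jj`, `kk` and `acc` come from the surrounding GEMM setup.
iii, jjj, kkk = schedule.split(ii, 32), schedule.split(jj, 8), schedule.split(kk, 16)
plan.tensorize(indices=(iii, jjj, kkk), mma_shape=acc.MMAShape.M32xN8xK16)
```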
-
Merged PR 2644: Enable CUDA benchmarks only for A6000. [Lisa Ong]
- Manually set the Target.Model user capability on agents running A6000
- Update benchmarking pipelines to demand A6000s
-
Merged PR 2634: Remove couple more big gemm sizes. [Ritwik Das]
-
Merged PR 2626: [refactor] Moving debug mode to its own lowering pass.
[Lisa Ong]
Move the emission of the debug mode wrapper function out of MLIREmitterContext into a lowering pass.
This makes it easier to expand debug mode in the future.
-
Merged PR 2633: Bump hatlib to 0.0.19 to unblock CUDA T4 devices.
[Lisa Ong]
-
Merged PR 2630: Add batched gemm support with tensorization. [Ritwik Das]
Related work items: #3677
-
Merged PR 2631: Add cosmosdb key env var and shuffle gemm sizes.
[Ritwik Das]
- Add env var for ACCOUNT_KEY
- shuffle gemm sizes from small to big
- remove correctness check from big inputs and fp16
-
Merged PR 2607: Infrastructure for plan.auto() to support a basic none cache heuristics approach. [JUBI TANEJA]
This is a basic approach to test parameterization of cache arguments, index and layout. The user only needs to specify the source they want to cache, and AutoPlanner's NoneCacheHeuristics algorithm will synthesize the remaining caching parameters from a set of possible values.
Overall idea at the DSL level (a fuller sketch follows this entry). Given the input:
schedule.reorder(i, j, k, ii, jj, kk)
plan.auto(accera.algorithms.NoneCacheHeuristics(source = B, index = j))
internally, auto() invokes cache and adds two functions, each with a unique value of layout:
plan.cache(source = B, index = j, layout = {FIRST_MAJOR, LAST_MAJOR})
Important changes in this PR:
- Add a new algorithms module in Accera
- Do not delay the resolution of delayed parameters to get a value; instead, parameters can now be set with a possible set of values, and this set can be passed between heuristics and the plan object. See Parameter.py.
- Parameters constructed by heuristics are termed "heuristic parameters". They are not available to external users of Accera, but are named separately in the implementation to differentiate them from user-defined "parameters".
Limitations/changes coming in subsequent PRs:
- Allow both user-defined parameters and heuristic parameters in AutoPlanner test cases. For now, the code only focuses on testing AutoPlanner without any user-defined parameters that one can create using the create_parameters API.
- Documentation of AutoPlanner (design goals, tutorial, API description, etc.) is coming in the next PR.
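A minimal end-to-end sketch assembled from the DSL calls quoted above; the surrounding setup (array shapes, split sizes, iteration logic) is illustrative and not taken from the PR:
```python
# Hedged sketch of the AutoPlanner flow described above. Only schedule.reorder,
# plan.auto and accera.algorithms.NoneCacheHeuristics are quoted from the PR;
# the GEMM setup around them is assumed for illustration.
import accera as acc
import accera.algorithms  # new module added by this PR

M, N, K = 256, 256, 256
A = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(M, K))
B = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(K, N))
C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, element_type=acc.ScalarType.float32, shape=(M, N))

nest = acc.Nest(shape=(M, N, K))
i, j, k = nest.get_indices()

@nest.iteration_logic
def _():
    C[i, j] += A[i, k] * B[k, j]

schedule = nest.create_schedule()
ii, jj, kk = schedule.split(i, 32), schedule.split(j, 32), schedule.split(k, 32)
schedule.reorder(i, j, k, ii, jj, kk)

plan = schedule.create_plan()
# AutoPlanner synthesizes the remaining cache arguments for B; internally this
# expands into plan.cache(...) calls, one per candidate layout.
plan.auto(acc.algorithms.NoneCacheHeuristics(source=B, index=j))
```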
-
Merged PR 2600: Refactor MFMA indexing calculations. [Mason Remy]
- Use the iteration space position when determining MFMA computation locations rather than computing the position from the thread id
- Construct the full subschedules for AMD MFMA ops so that the bound loop indices are ordered appropriately for the MFMA op being invoked
- Update unit tests accordingly. The schedule changes may need to be moved to an under-the-hood feature of tensorization
-
Merged PR 2627: Raise error for invalid block dimensions. [Ritwik Das]
Raise error for invalid block dimensions based on target info
Related work items: #3715
-
Merged PR 2625: [nfc] Block debug mode for unsupported GPU targets.
[Lisa Ong]
Debug mode is not yet supported for GPU targets:
- Fail early
- Update documentation
-
Merged PR 2622: Fix dependencies for benchmark tools. [Ritwik Das]
-
Merged PR 2604: Add bfloat16 support for tensor ops on rocm. [Ritwik Das]
Add bfloat16 support for tensor ops on CUDA and ROCm (a sketch follows this entry).
Related work items: #3713
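A hedged sketch of the user-facing side, assuming the ScalarType enum exposes a bfloat16 member as part of this support; shapes and the float32 accumulator are illustrative, not taken from the PR:
```python
# Hedged sketch: bfloat16 GEMM operands with a float32 accumulator. The
# ScalarType.bfloat16 spelling is an assumption; shapes are illustrative.
from accera import Array, ScalarType

M, N, K = 1024, 1024, 1024
A = Array(role=Array.Role.INPUT, element_type=ScalarType.bfloat16, shape=(M, K))
B = Array(role=Array.Role.INPUT, element_type=ScalarType.bfloat16, shape=(K, N))
C = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(M, N))
```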
-
Merged PR 2621: Merge changes from Github repo. [Lisa Ong]
commit 5b5f5ef
-
Merged PR 2620: Upgrade GPU self-hosted agents to g++-10. [Lisa Ong]
The stock g++-9 from Ubuntu 20.04 crashes when compiling pybind11 alongside mlir/Dialect/IR/Affine/AffineOp.h.
This change updates to g++-10 for the self-hosted images only, as this issue only affects images that we build for ROCm and CUDA.
Azure DevOps agents will continue to run on their pre-installed g++-9.
-
Merged PR 2619: Parameterize Plan.bind. [Denny Sun]
P0, P1, P2, P3, P4, P5 = create_parameters()
plan.bind(mapping={ P0: P3, P1: P4, P2: P5 })
package.add(
    plan,
    args=(A, B, C),
    parameters={
        P0: i,
        P1: j,
        P2: k,
        P3: v100.GridUnit.BLOCK_X,
        P4: v100.GridUnit.THREAD_X,
        P5: v100.GridUnit.THREAD_Y,
    },
    base_name=test_name)
Related work items: #3708
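For reference, once the parameter values above are substituted, the bind resolves to the equivalent direct mapping (inferred from the snippet above, not quoted from the PR):
```python
# Equivalent direct call after parameter substitution (inferred, not from the PR)
plan.bind(mapping={
    i: v100.GridUnit.BLOCK_X,
    j: v100.GridUnit.THREAD_X,
    k: v100.GridUnit.THREAD_Y,
})
```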
-
Merged PR 2599: Support parameterizing caches based on memory space.
[Mason Remy]
- Identifies bound indices that the cache should be parameterized on, rather than shaped by. E.g., for a private memory cache inserted at a GPU block level, the computed memory space will not be the full active block at that level, but the portion derived from loops that weren't bound to GPU thread dims (see the sketch after this entry).
- Adds some BoundProcessorOp utilities and shares some common binding code
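A hedged illustration of the user-facing side, continuing a GEMM schedule/plan like the one sketched under PR 2607 above: a per-thread (private) cache inserted at GPU block level. The location argument and the MemorySpace spelling are assumptions for illustration; only plan.cache(source=..., index=...) appears elsewhere in these notes:
```python
# Hedged sketch: a private-memory cache at GPU block level. The `location`
# argument and the MemorySpace spelling are assumptions; `schedule`, `plan`,
# `i`, `j`, `A` and `v100` come from the surrounding GEMM setup.
ii, jj = schedule.split(i, 16), schedule.split(j, 16)
plan.bind(mapping={i: v100.GridUnit.BLOCK_X, j: v100.GridUnit.BLOCK_Y,
                   ii: v100.GridUnit.THREAD_X, jj: v100.GridUnit.THREAD_Y})
# Because ii and jj are bound to thread dims, the cache is shaped only by the
# remaining (unbound) loops rather than by the full active block.
plan.cache(A, index=ii, location=v100.MemorySpace.PRIVATE)
```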
-
Merged PR 2618: Fix memory allocation bug during benchmark
verification. [Ritwik Das]
-
Merged PR 2617: [nfc] [doc] Fix typo and re-sync models table. [Lisa Ong]
-
Merged PR 2616: Formatting Python code a bit for better readability. [Denny Sun]
- Some functions have a long list of parameters; add line wraps
- Separate external imports from internal ones
-
Merged PR 2614: Remove redundant variable and cosmosdb fix. [Ritwik Das]
Cosmos DB error when upserting from multiple processes:
Process runner0:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/azp/_work/2/s/tools/benchmarkers/accera_gemm.py", line 633, in gemm_runner
cosmosdb.upsert_benchmark_results(resultRows, containerName, verboseLogs)
File "/azp/_work/2/s/tools/benchmarkers/cosmosdb.py", line 27, in upsert_benchmark_results
container = get_container(containerName, verboseLogs)
File "/azp/_work/2/s/tools/benchmarkers/cosmosdb.py", line 18, in get_container
container = db.create_container_if_not_exists(id=containerName, partition_key=PartitionKey(path='/partitionKey'))
File "/usr/local/lib/python3.8/dist-packages/azure/core/tracing/decorator.py", line 62, in wrapper_use_tracer
return func(*args, **kwargs) # type: ignore
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/database.py", line 287, in create_container_if_not_exists
container_proxy.read(
File "/usr/local/lib/python3.8/dist-packages/azure/core/tracing/decorator.py", line 62, in wrapper_use_tracer
return func(*args, **kwargs) # type: ignore
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/container.py", line 145, in read
self._properties = self.client_connection.ReadContainer(
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 469, in ReadContainer
return self.Read(path, "colls", collection_id, None, options, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 2162, in Read
result, self.last_response_headers = self.__Get(path, request_params, headers, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 2209, in __Get
return synchronized_request.SynchronizedRequest(
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_synchronized_request.py", line 210, in SynchronizedRequest
return _retry_utility.Execute(
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_retry_utility.py", line 73, in Execute
result = ExecuteFunction(function, global_endpoint_manager, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_retry_utility.py", line 130, in ExecuteFunction
return function(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_synchronized_request.py", line 158, in _Request
raise exceptions.CosmosHttpResponseError(message=data, response=response)
azure.cosmos.exceptions.CosmosHttpResponseError: Status code: 400
-
Merged PR 2613: Enable daily CUDA benchmarks. [Ritwik Das]
- Enable CUDA benchmarks
- some refactoring
-
Merged PR 2596: Updates to affine simplifications. [Mason Remy]
- Run simplifications on AffineApplyOps
- Detect and simplify some single-element-numerator cases for floordiv and mod
- Detect GPU constants such as grid dim size and block dim size, and incorporate those constants into affine maps for later simplification
- Detect GPU bound dimensions (block id and thread id) in affine ops, and incorporate those ranges into simplification passes (see the illustration after this entry)
Related work items: #3667
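A hedged illustration (not compiler output) of the kind of rewrite this enables: once the pass knows that threadIdx.x ranges over [0, 16) and that the block dimension is the constant 16, affine expressions built from the bound dimensions collapse:
```
(blockIdx.x * 16 + threadIdx.x) floordiv 16  ->  blockIdx.x
(blockIdx.x * 16 + threadIdx.x) mod 16       ->  threadIdx.x
```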
-
Merged PR 2594: Always resolve unrealized loopnest indices when
computing cache positions. [Mason Remy]
-
Merged PR 2574: Support binding multiple indices to a processor
handle. [Mason Remy]
- This creates a mapping of the processor handle to the index iterations based on the ordering of the indices in the tuple (see the sketch after this entry)
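A hedged sketch of what the user-facing call might look like, continuing a schedule/plan like the one sketched under PR 2607 above; the tuple-key spelling in the mapping is an assumption inferred from the description, not quoted from the PR:
```python
# Hedged sketch: binding two split indices (ii, jj) to one processor handle.
# Per the description, their iterations map to THREAD_X in the tuple's order
# (ii outer, jj inner). The tuple-key spelling is an assumption; `plan`, the
# indices and `v100` come from the surrounding setup.
plan.bind(mapping={
    i: v100.GridUnit.BLOCK_X,
    j: v100.GridUnit.BLOCK_Y,
    (ii, jj): v100.GridUnit.THREAD_X,
})
```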
-
Merged PR 2611: Fix issue when splitting indices by factors that don't
divide evenly. [Chuck Jacobs]
This PR fixes an issue when splitting by a factor that doesn't evenly divide the parent index's range. E.g., if i has a range of [0, 320), then ii = split(i, 128) would end up with ii having a range of 192 instead of 128. A sketch follows this entry.
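A minimal sketch of the case described above, using the nest/schedule calls quoted elsewhere in these notes; the 2-D shape and omitted loop body are illustrative only:
```python
# Illustrative only: the extent of i (320) is not a multiple of the split
# factor (128). After this fix, ii gets a range of 128 (with i covering the
# 64-element remainder as a boundary case) instead of an inflated range of 192.
import accera as acc

nest = acc.Nest(shape=(320, 256))  # loop body omitted for brevity
i, j = nest.get_indices()
schedule = nest.create_schedule()
ii = schedule.split(i, 128)
```
-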
Merged PR 2612: Add missing psutil dependency. [Ritwik Das]
- Add missing psutil dependency
- Remove private branch from benchmarks
-
Merged PR 2608: Caching fixes and benchmarking optimizations. [Ritwik Das]
- Explore k_split independently of the outer tile dims, allowing arbitrary k splits
- Fix for workPerThread < 1 (from Mason). This was exposed because the benchmark now explores k-splits of size 1, 2, 4, etc., which produces small active blocks for caching; when the work per thread drops below 1, the compiler crashes during package.build.
-
Merged PR 2610: Opportunistically add more targets used in CI machines
and update Model.md. [Lisa Ong]
- Renamed some fields to add units
- Added some Intel Xeon models as we encounter them
- Updated some cache sizes
-
Merged PR 2606: Parameterize Array.sub_array. [Denny Sun]
```python
P0, P1 = create_parameters()
arr = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(256, 256))
arr0 = arr.sub_array(offsets=(0, 0), shape=(P0, P1))
package.add(nest, args=(arr0, ), parameters={P0: 128, P1: 128})
```
Related work items: #3707
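For reference, with the parameter values shown (P0 = P1 = 128), arr0 resolves to the top-left 128x128 block of the 256x256 array, i.e. the equivalent of arr.sub_array(offsets=(0, 0), shape=(128, 128)).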
-
Merged PR 2609: [build] peg protobuf to 3.20.1 due to
incompatibilities with the latest version. [Lisa Ong]
Even though we peg to onnx==1.9.0, onnx requires protobuf >= 3.20.1, which pulls in an incompatible version of protobuf (4.x) unless protobuf is pegged explicitly.
-
Merged PR 2576: [doc] MFMA thread assignment visualizations for AMD.
[Lisa Ong]
Some helper visualizations for MFMA:
- 2x2x16
- 4x4x32
-
Merged PR 2601: [ci] CUDA pipeline and buddy build. [Lisa Ong]
- Container for CUDA self-hosted Azure devops agent
- Initial buddy build pipeline (similar to ROCm)
- Replaces references to Dockerhub with Azure Container Registry for compliance purposes
-
Merged PR 2603: Add CUDA pipeline host to known targets. [Lisa Ong]
Note that the reported CPU frequency is conflicting; I went with cpuinfo and dmesg.
References:
> python -m cpuinfo (key fields): AMD EPYC 7V12 64-Core Processor, x86_64, 128 logical CPUs; Hz Advertised/Actual ~3.30 GHz; L1d 2 MiB, L1i 2 MiB, L2 32 MiB
> dmesg | grep MHz (key lines): tsc: Detected 2450.083 MHz processor; tsc: Refined TSC clocksource calibration: 2449.961 MHz
> lscpu (key fields): AMD EPYC 7V12 64-Core Processor, x86_64, 128 CPUs (64 cores/socket, 2 threads/core, 1 socket); CPU max MHz 2450, min MHz 1500; L1d 2 MiB, L1i 2 MiB, L2 32 MiB
-
Merged PR 2602: Add rocwmma plumbing in tensorize. [Ritwik Das]
- Cannot use this flag until the ROCm 5.2 release, which natively supports rocWMMA.
Related work items: #3672
-
Merged PR 2570: Enhancements to the gpu benchmark tool. [Ritwik Das]
- Add multiprocess package builders and runners
- Support for running on different GPU devices
- Add clock speed determinism
- add composable_kernel benchmarks
- add cutlass benchmarks
- add cublas and rocblas benchmarks
- Add Cosmos DB result upload capability
Related work items: #3683, #3700, #3705, #3685
-
Merged PR 2598: Fix mfma enum name typo. [Mason Remy]
-
Merged PR 2595: [nfc] Renames smoke_test.py -> smoke_tests.py. [Kern
Handa]
New Contributors
- @dependabot made their first contribution in #42
Full Changelog: v1.2.5...v1.2.6