v1.2.6

@lisaong lisaong released this 17 Jun 06:43
· 36 commits to main since this release

What's Changed

  • Bump urllib3 from 1.25.8 to 1.26.5 in /tools/benchmarkers by @dependabot in #42
  • [ci] Fix out of disk space errors for CI workflow by @lisaong in #43
  • Bump bottle from 0.12.19 to 0.12.20 in /tools/viz by @dependabot in #44
  • Merged PR 2657: Add conversion pass from gpu ops to rocdl ops. [Ritwik Das]

    • switch to gpu dialect for gpu index ops
    • add conversion pass from gpu dialect to rocdl
  • Merged PR 2652: Add integer tensor ops support for AMD targets.
    [Ritwik Das]

    • int mfma ops
    • tests
    • static_cast in c++

    Related work items: #3727

  • Merged PR 2650: [release] Docs version to 1.2.6, sync Github to ADO.
    [Lisa Ong]

  • Merged PR 2624: Add more MMA shapes for CUDA. [Ritwik Das]

    Add more MMA shapes for CUDA

    • 32x8x16
    • 8x32x16
  • Merged PR 2644: Enable CUDA benchmarks only for A6000. [Lisa Ong]

    • Manually set the Target.Model user capability on agents running A6000
    • Update benchmarking pipelines to demand A6000s

    https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml#feedback

  • Merged PR 2634: Remove a couple more big gemm sizes. [Ritwik Das]

    Remove a couple more big gemm sizes

  • Merged PR 2626: [refactor] Moving debug mode to its own lowering pass.
    [Lisa Ong]

    Move the emitting of the debug mode wrapper function out of MLIREmitterContext into a lowering pass.

    This makes it easier to expand debug mode in the future.

  • Merged PR 2633: Bump hatlib to 0.0.19 to unblock CUDA T4 devices.
    [Lisa Ong]

    https://github.com/microsoft/hat/releases/tag/v0.0.19

  • Merged PR 2630: Add batched gemm support with tensorization. [Ritwik
    Das]

    Related work items: #3677

  • Merged PR 2631: Add cosmosdb key env var and shuffle gemm sizes.
    [Ritwik Das]

    • Add env var for ACCOUNT_KEY
    • shuffle gemm sizes from small to big
    • remove correctness check from big inputs and fp16
  • Merged PR 2607: Infrastructure for plan.auto() to support a basic none
    cache heuristics approach. [JUBI TANEJA]

    Infrastructure for plan.auto() to support a basic none cache heuristics approach

    This is a basic approach to test parameterization of the cache arguments, index and layout.
    The user only needs to specify the source they want to cache, and AutoPlanner's
    NoneCacheHeuristics algorithm will synthesize the remaining caching parameters
    from a set of possible values.

    Overall idea at the DSL level, given input:

        schedule.reorder(i, j, k, ii, jj, kk)
        plan.auto(accera.algorithms.NoneCacheHeuristics(source = B, index = j))

    Internally, auto() invokes cache and adds two functions, each with
    a unique value of layout:

        plan.cache(source = B, index = j, layout = {FIRST_MAJOR, LAST_MAJOR})

    Important change in this PR:

    • Add a new algorithms module in Accera
    • Do not delay resolution of delayed parameters to get a single value; instead,
      parameters can now be set with a set of possible values, and this set can be
      passed between heuristics and the plan object. Check: Parameter.py
    • Parameters constructed by heuristics are termed "heuristic parameters".
      They are not available to external users of Accera, but are named
      separately in the implementation to differentiate them from user-defined "parameters".

    Limitations/changes coming in subsequent PRs:

    • Allow both user-defined parameters and heuristic parameters in AutoPlanner test cases.
      For now, the code only focuses on testing AutoPlanner without any user-defined parameters
      that one can create using the API: create_parameters.
    • Documentation of AutoPlanner -- design goals, tutorial, API description, etc. is coming in the
      next PR.
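    The expansion step described above can be sketched as follows. This is an
    illustrative stand-in, not Accera's actual implementation: the helper name
    expand_cache_configs and its arguments are hypothetical, but it mirrors the
    idea that the user fixes source and index while the heuristic enumerates
    every remaining cache argument over a set of candidate values, yielding one
    concrete configuration (and hence one added function) per combination.

        from itertools import product

        # Hypothetical sketch of a "none cache" heuristic: expand the
        # unspecified cache arguments over all candidate values.
        def expand_cache_configs(source, index, candidates):
            keys = sorted(candidates)
            for values in product(*(candidates[k] for k in keys)):
                config = {"source": source, "index": index}
                config.update(dict(zip(keys, values)))
                yield config

        configs = list(expand_cache_configs(
            source="B", index="j",
            candidates={"layout": ["FIRST_MAJOR", "LAST_MAJOR"]}))
        # Two configurations, one per layout value

    With two layout candidates this produces two cache configurations, matching
    the two functions that plan.auto() adds in the example above.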
  • Merged PR 2600: Refactor MFMA indexing calculations. [Mason Remy]

    Refactor MFMA indexing calculations

    • Use the iteration space position when determining MFMA computation
      locations rather than computing the position from the thread id
    • Construct the full subschedules for AMD MFMA ops so that the bound
      loop indices are ordered appropriately for the MFMA op being invoked
    • Update unit tests accordingly. The schedule changes may need to be
      moved to an under-the-hood feature of tensorization
  • Merged PR 2627: Raise error for invalid block dimensions. [Ritwik Das]

    Raise error for invalid block dimensions based on target info

    Related work items: #3715

  • Merged PR 2625: [nfc] Block debug mode for unsupported GPU targets.
    [Lisa Ong]

    Debug mode is not yet supported for GPU targets

    • Fail early
    • Update documentation
  • Merged PR 2622: Fix dependencies for benchmark tools. [Ritwik Das]

    Fix dependencies for benchmark tools

  • Merged PR 2604: Add bfloat16 support for tensor ops on rocm. [Ritwik
    Das]

    Add bfloat16 support for tensor ops on cuda and rocm

    Related work items: #3713

  • Merged PR 2621: Merge changes from Github repo. [Lisa Ong]

    commit 5b5f5ef

  • Merged PR 2620: Upgrade GPU self-hosted agents to g++-10. [Lisa Ong]

    The stock g++-9 from Ubuntu 20.04 crashes when compiling pybind11 alongside mlir/Dialect/IR/Affine/AffineOp.h.

    This change updates to g++-10 for the self-hosted images only, as this issue only affects images that we build for ROCm and CUDA.

    Azure DevOps agents will continue to run on their pre-installed g++-9.

  • Merged PR 2619: Parameterize Plan.bind. [Denny Sun]

        P0, P1, P2, P3, P4, P5 = create_parameters()

        plan.bind(mapping={
            P0: P3,
            P1: P4,
            P2: P5
        })

        package.add(
            plan,
            args=(A, B, C),
            parameters={
                P0: i,
                P1: j,
                P2: k,
                P3: v100.GridUnit.BLOCK_X,
                P4: v100.GridUnit.THREAD_X,
                P5: v100.GridUnit.THREAD_Y,
            },
            base_name=test_name)

    Related work items: #3708

  • Merged PR 2599: Support parameterizing caches based on memory space.
    [Mason Remy]

    Support parameterizing caches based on memory space

    • Identifies bound indices that the cache should be parameterized on,
      rather than shaped by.
      e.g. for a private memory cache inserted at a gpu block level, the
      computed memory space will not be the full active block at that level,
      but the portion derived from loops that weren't bound to gpu thread
      dims.

    • Adds some BoundProcessorOp utilities and shares some common binding
      code

  • Merged PR 2618: Fix memory allocation bug during benchmark
    verification. [Ritwik Das]

    Fix memory allocation bug during benchmark verification

  • Merged PR 2617: [nfc] [doc] Fix typo and re-sync models table. [Lisa
    Ong]

  • Merged PR 2616: Formatting Python code a bit for better
    readability. [Denny Sun]

    1. Some functions have a long list of parameters; add line wrapping
    2. Separate external imports from internal ones
  • Merged PR 2614: Remove redundant variable and cosmosdb fix. [Ritwik
    Das]

    Cosmos DB error when upserting from multiple processes:

    Process runner0:
    Traceback (most recent call last):
      File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
      File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "/azp/_work/2/s/tools/benchmarkers/accera_gemm.py", line 633, in gemm_runner
        cosmosdb.upsert_benchmark_results(resultRows, containerName, verboseLogs)
      File "/azp/_work/2/s/tools/benchmarkers/cosmosdb.py", line 27, in upsert_benchmark_results
        container = get_container(containerName, verboseLogs)
      File "/azp/_work/2/s/tools/benchmarkers/cosmosdb.py", line 18, in get_container
        container = db.create_container_if_not_exists(id=containerName, partition_key=PartitionKey(path='/partitionKey'))
      File "/usr/local/lib/python3.8/dist-packages/azure/core/tracing/decorator.py", line 62, in wrapper_use_tracer
        return func(*args, **kwargs) # type: ignore
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/database.py", line 287, in create_container_if_not_exists
        container_proxy.read(
      File "/usr/local/lib/python3.8/dist-packages/azure/core/tracing/decorator.py", line 62, in wrapper_use_tracer
        return func(*args, **kwargs) # type: ignore
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/container.py", line 145, in read
        self._properties = self.client_connection.ReadContainer(
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 469, in ReadContainer
        return self.Read(path, "colls", collection_id, None, options, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 2162, in Read
        result, self.last_response_headers = self.__Get(path, request_params, headers, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 2209, in __Get
        return synchronized_request.SynchronizedRequest(
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_synchronized_request.py", line 210, in SynchronizedRequest
        return _retry_utility.Execute(
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_retry_utility.py", line 73, in Execute
        result = ExecuteFunction(function, global_endpoint_manager, *args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_retry_utility.py", line 130, in ExecuteFunction
        return function(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_synchronized_request.py", line 158, in _Request
        raise exceptions.CosmosHttpResponseError(message=data, response=response)
    azure.cosmos.exceptions.CosmosHttpResponseError: Status code: 400
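    One common mitigation for this kind of multi-process race (a sketch only;
    the PR's actual fix is not shown in these notes) is to retry the container
    lookup/creation with exponential backoff, treating the transient failure
    from concurrent create_container_if_not_exists calls as retryable. The
    helper below is generic pure Python; the flaky_get_container function is a
    hypothetical stand-in for the real Cosmos DB call:

        import time

        def retry_with_backoff(fn, retryable, attempts=5, base_delay=0.1):
            for attempt in range(attempts):
                try:
                    return fn()
                except Exception as e:
                    if not retryable(e) or attempt == attempts - 1:
                        raise  # non-retryable or out of attempts
                    time.sleep(base_delay * (2 ** attempt))  # exponential backoff

        # Stand-in for the Cosmos call: fails twice, then succeeds.
        calls = {"n": 0}
        def flaky_get_container():
            calls["n"] += 1
            if calls["n"] < 3:
                raise RuntimeError("Status code: 400")  # simulated transient error
            return "container"

        result = retry_with_backoff(flaky_get_container,
                                    retryable=lambda e: "400" in str(e))
        # result == "container" after two retried failures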

  • Merged PR 2613: Enable daily CUDA benchmarks. [Ritwik Das]

    • Enable CUDA benchmarks
    • some refactoring
  • Merged PR 2596: Updates to affine simplifications. [Mason Remy]

    Updates to affine simplifications

    • Run simplifications on AffineApplyOps
    • Detect and simplify some single-element-numerator cases for floordiv
      and mod
    • Detect GPU constants such as grid dim size and block dim size and
      incorporate those constants into affine maps for later simplification
    • Detect GPU bound dimensions block id and thread id in affine ops and
      incorporate those ranges into simplification passes

    Related work items: #3667
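    The range-based rewrites described above can be illustrated with a minimal
    sketch (not Accera's actual pass): once a value is known to be a GPU bound
    dimension with range [0, n), expressions like (x mod n) collapse to x and
    (x floordiv n) collapses to a constant.

        # Return "identity" when x mod divisor == x over the whole range.
        def simplify_mod(lo, hi_exclusive, divisor):
            if 0 <= lo and hi_exclusive <= divisor:
                return "identity"
            return "mod"

        # Return the constant quotient when the range lies within one
        # divisor-sized block, else None (not simplifiable to a constant).
        def simplify_floordiv(lo, hi_exclusive, divisor):
            q_lo, q_hi = lo // divisor, (hi_exclusive - 1) // divisor
            return q_lo if q_lo == q_hi else None

        # threadIdx.x with blockDim.x == 64:
        assert simplify_mod(0, 64, 64) == "identity"   # tid mod 64 -> tid
        assert simplify_floordiv(0, 64, 64) == 0       # tid floordiv 64 -> 0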

  • Merged PR 2594: Always resolve unrealized loopnest indices when
    computing cache positions. [Mason Remy]

    Always resolve unrealized loopnest indices when
    computing cache positions

  • Merged PR 2574: Support binding multiple indices to a processor
    handle. [Mason Remy]

    Support binding multiple indices to a processor handle

    • This creates a mapping of the processor handle to the index iterations
      based on the ordering of the indices in the tuple
  • Merged PR 2611: Fix issue when splitting indices by factors that don't
    divide evenly. [Chuck Jacobs]

    This PR fixes an issue when splitting by a factor that doesn't evenly divide the parent index's range. E.g., if i has a range of [0, 320), then ii = split(i, 128) would end up with ii having a range of 192 instead of 128.
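    The corrected behavior can be sketched as follows (a hypothetical helper,
    not the Accera API): splitting an extent of 320 by a factor of 128 should
    yield full-size main blocks plus one smaller clamped boundary block, so no
    inner range ever exceeds the split factor.

        def split_ranges(extent, factor):
            ranges = []
            start = 0
            while start < extent:
                size = min(factor, extent - start)  # boundary block is clamped
                ranges.append((start, size))
                start += factor
            return ranges

        blocks = split_ranges(320, 128)
        # [(0, 128), (128, 128), (256, 64)] -- every inner range is <= 128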

  • Merged PR 2612: Add missing psutil dependency. [Ritwik Das]

    • Add missing psutil dependency
    • Remove private branch from benchmarks
  • Merged PR 2608: Caching fixes and benchmarking optimizations. [Ritwik
    Das]

    • Explore k_split independently of the outer tile dims, allowing arbitrary k splits
    • Fix for workPerThread < 1 (from Mason). This was exposed because the benchmark
      now explores k-splits of size 1, 2, 4, etc., which produces small active blocks
      for caching; when the work per thread drops below 1, the compiler crashes during
      package.build.
  • Merged PR 2610: Opportunistically add more targets used in CI machines
    and update Model.md. [Lisa Ong]

    • Renamed some fields to add units
    • Added some Intel Xeon models as we encounter them
    • Updated some cache sizes
  • Merged PR 2606: Parameterize Array.sub_array. [Denny Sun]

        P0, P1 = create_parameters()
        arr = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(256, 256))
        arr0 = arr.sub_array(offsets=(0, 0), shape=(P0, P1))
        package.add(nest, args=(arr0, ), parameters={P0: 128, P1: 128})

    Related work items: #3707

  • Merged PR 2609: [build] peg protobuf to 3.20.1 due to
    incompatibilities with latest version. [Lisa Ong]

    Even though we peg to onnx==1.9.0, onnx declares protobuf >= 3.20.1, which pulls in an incompatible version of protobuf (4.x).

  • Merged PR 2576: [doc] MFMA thread assignment visualizations for AMD.
    [Lisa Ong]

    Some helper visualizations for MFMA:

    • 2x2x16
    • 4x4x32
  • Merged PR 2601: [ci] CUDA pipeline and buddy build. [Lisa Ong]

    • Container for CUDA self-hosted Azure devops agent
    • Initial buddy build pipeline (similar to ROCm)
    • Replaces references to Dockerhub with Azure Container Registry for compliance purposes
  • Merged PR 2603: Add CUDA pipeline host to known targets. [Lisa Ong]

    Note that the reported CPU frequencies conflict; I went with the cpuinfo and dmesg values.

    References:

    > python -m cpuinfo
    
    Python Version: 3.8.10.final.0 (64 bit)
    Cpuinfo Version: 8.0.0
    Vendor ID Raw: AuthenticAMD
    Hardware Raw:
    Brand Raw: AMD EPYC 7V12 64-Core Processor
    Hz Advertised Friendly: 3.3049 GHz
    Hz Actual Friendly: 3.3049 GHz
    Hz Advertised: (3304919000, 0)
    Hz Actual: (3304919000, 0)
    Arch: X86_64
    Bits: 64
    Count: 128
    Arch String Raw: x86_64
    L1 Data Cache Size: 2 MiB
    L1 Instruction Cache Size: 2 MiB
    L2 Cache Size: 32 MiB
    L2 Cache Line Size: 512
    L2 Cache Associativity: 6
    L3 Cache Size: 524288
    Stepping:
    Model: 49
    Family: 23
    Processor Type:
    Flags: 3dnowext, 3dnowprefetch, abm, adx, aes, aperfmperf, apic, arat, avic, avx, avx2, bmi1, bmi2, bpext, cat_l3, cdp_l3, clflush, clflushopt, clwb, clzero, cmov, cmp_legacy, constant_tsc, cpb, cpuid, cqm, cqm_llc, cqm_mbm_local, cqm_mbm_total, cqm_occup_llc, cr8_legacy, cx16, cx8, dbx, de, decodeassists, extapic, extd_apicid, f16c, flushbyasid, fma, fpu, fsgsbase, fxsr, fxsr_opt, ht, hw_pstate, ibpb, ibrs, ibs, irperf, lahf_lm, lbrv, lm, mba, mca, mce, misalignsse, mmx, mmxext, monitor, movbe, msr, mtrr, mwaitx, nonstop_tsc, nopl, npt, nrip_save, nx, osvw, osxsave, overflow_recov, pae, pat, pausefilter, pci_l2i, pclmulqdq, pdpe1gb, perfctr_core, perfctr_llc, perfctr_nb, pfthreshold, pge, pni, popcnt, pqe, pqm, pse, pse36, rdpid, rdrand, rdrnd, rdseed, rdt_a, rdtscp, rep_good, sep, sev, sha, sha_ni, skinit, smap, smca, sme, smep, ssbd, sse, sse2, sse4_1, sse4_2, sse4a, ssse3, stibp, succor, svm, svm_lock, syscall, tce, topoext, tsc, tsc_scale, umip, v_vmsave_vmload, vgif, vmcb_clean, vme, vmmcall, wbnoinvd, wdt, xgetbv1, xsave, xsavec, xsaveerptr, xsaveopt, xsaves
    
    > dmesg | grep MHz
    [    0.000000] tsc: Detected 2450.083 MHz processor
    [    7.731766] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
    [    8.979712] tsc: Refined TSC clocksource calibration: 2449.961 MHz
    
    > lscpu
    
    Architecture:                    x86_64
    CPU op-mode(s):                  32-bit, 64-bit
    Byte Order:                      Little Endian
    Address sizes:                   43 bits physical, 48 bits virtual
    CPU(s):                          128
    On-line CPU(s) list:             0-127
    Thread(s) per core:              2
    Core(s) per socket:              64
    Socket(s):                       1
    NUMA node(s):                    1
    Vendor ID:                       AuthenticAMD
    CPU family:                      23
    Model:                           49
    Model name:                      AMD EPYC 7V12 64-Core Processor
    Stepping:                        0
    Frequency boost:                 enabled
    CPU MHz:                         1497.558
    CPU max MHz:                     2450.0000
    CPU min MHz:                     1500.0000
    BogoMIPS:                        4900.16
    Virtualization:                  AMD-V
    L1d cache:                       2 MiB
    L1i cache:                       2 MiB
    L2 cache:                        32 MiB
    L3 cache:                        2...
    
  • Merged PR 2602: Add rocwmma plumbing in tensorize. [Ritwik Das]

    • Add rocwmma plumbing in tensorize
    • This flag cannot be used until the ROCm 5.2 release, which natively supports rocWMMA.

    Related work items: #3672

  • Merged PR 2570: Enhancements to the gpu benchmark tool. [Ritwik Das]

    • Add multiprocess package builders and runners
    • Support for running on different GPU devices
    • Add clock speed determinism
    • add composable_kernel benchmarks
    • add cutlass benchmarks
    • add cublas and rocblas benchmarks
    • Add Cosmos DB result upload capability

    Related work items: #3683, #3700, #3705, #3685

  • Merged PR 2598: Fix mfma enum name typo. [Mason Remy]

    Fix mfma enum name typo

  • Merged PR 2595: [nfc] Renames smoke_test.py -> smoke_tests.py. [Kern
    Handa]

    [nfc] Renames smoke_test.py -> smoke_tests.py

Full Changelog: v1.2.5...v1.2.6