v1.2.6

@lisaong lisaong released this 17 Jun 06:43
· 36 commits to main since this release

What's Changed

  • Bump urllib3 from 1.25.8 to 1.26.5 in /tools/benchmarkers by @dependabot in #42
  • [ci] Fix out of disk space errors for CI workflow by @lisaong in #43
  • Bump bottle from 0.12.19 to 0.12.20 in /tools/viz by @dependabot in #44
  • Merged PR 2657: Add conversion pass from gpu ops to rocdl ops. [Ritwik Das]

    • switch to gpu dialect for gpu index ops
    • add conversion pass from gpu dialect to rocdl
  • Merged PR 2652: Add integer tensor ops support for AMD targets.
    [Ritwik Das]

    • int mfma ops
    • tests
    • static_cast in c++

    Related work items: #3727

  • Merged PR 2650: [release] Docs version to 1.2.6, sync Github to ADO.
    [Lisa Ong]

  • Merged PR 2624: Add more MMA shapes for CUDA. [Ritwik Das]

    Add more MMA shapes for CUDA

    • 32x8x16
    • 8x32x16
  • Merged PR 2644: Enable CUDA benchmarks only for A6000. [Lisa Ong]

    • Manually set the Target.Model user capability on agents running A6000
    • Update benchmarking pipelines to demand A6000s

    https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml#feedback

  • Merged PR 2634: Remove a couple more big gemm sizes. [Ritwik Das]

    Remove a couple more big gemm sizes

  • Merged PR 2626: [refactor] Moving debug mode to its own lowering pass.
    [Lisa Ong]

    Move the emitting of the debug mode wrapper function out of MLIREmitterContext into a lowering pass.

    This makes it easier to expand debug mode in the future.

  • Merged PR 2633: Bump hatlib to 0.0.19 to unblock CUDA T4 devices.
    [Lisa Ong]

    https://github.com/microsoft/hat/releases/tag/v0.0.19

  • Merged PR 2630: Add batched gemm support with tensorization. [Ritwik
    Das]

    Related work items: #3677

  • Merged PR 2631: Add cosmosdb key env var and shuffle gemm sizes.
    [Ritwik Das]

    • Add env var for ACCOUNT_KEY
    • shuffle gemm sizes from small to big
    • remove correctness check from big inputs and fp16
  • Merged PR 2607: Infrastructure for plan.auto() to support a basic none
    cache heuristics approach. [JUBI TANEJA]

    Infrastructure for plan.auto() to support a basic none cache heuristics approach

    This is a basic approach to test parameterization of the cache arguments, index and layout.
    The user only needs to specify the source they want to cache, and AutoPlanner's
    NoneCacheHeuristics algorithm will synthesize the remaining caching parameters
    from a set of possible values.

    Overall idea at the DSL level, given input:

        schedule.reorder(i, j, k, ii, jj, kk)
        plan.auto(accera.algorithms.NoneCacheHeuristics(source = B, index = j))

    Internally, auto() invokes cache and adds two functions, each with
    a unique value of layout:

        plan.cache(source = B, index = j, layout = {FIRST_MAJOR, LAST_MAJOR})

    Important change in this PR:

    • Add a new algorithms module in Accera
    • Do not delay resolution of delayed parameters to get a single value; instead,
      parameters can now be set with a set of possible values, and this set can be
      passed between heuristics and the plan object. Check: Parameter.py
    • Parameters constructed by heuristics are termed "heuristic parameters".
      They are not available to external users of Accera, but are named
      separately in the implementation to differentiate them from user-defined "parameters".

    Limitations/changes coming in subsequent PRs:

    • Allow both user-defined parameters and heuristic parameters in AutoPlanner test cases.
      For now, the code only focuses on testing AutoPlanner without any user-defined parameters
      that one can create using the API: create_parameters.
    • Documentation of AutoPlanner -- design goals, tutorial, API description, etc. is coming in the
      next PR.
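    The expansion step described above can be sketched as follows. This is an
    illustrative stand-in, not Accera's actual implementation: the helper name
    expand_cache_configs and its arguments are hypothetical, but it mirrors the
    idea that the user fixes source and index while the heuristic enumerates
    every remaining cache argument over a set of candidate values, yielding one
    concrete configuration (and hence one added function) per combination.

        from itertools import product

        # Hypothetical sketch of a "none cache" heuristic: expand the
        # unspecified cache arguments over all candidate values.
        def expand_cache_configs(source, index, candidates):
            keys = sorted(candidates)
            for values in product(*(candidates[k] for k in keys)):
                config = {"source": source, "index": index}
                config.update(dict(zip(keys, values)))
                yield config

        configs = list(expand_cache_configs(
            source="B", index="j",
            candidates={"layout": ["FIRST_MAJOR", "LAST_MAJOR"]}))
        # Two configurations, one per layout value

    With two layout candidates this produces two cache configurations, matching
    the two functions that plan.auto() adds in the example above.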
  • Merged PR 2600: Refactor MFMA indexing calculations. [Mason Remy]

    Refactor MFMA indexing calculations

    • Use the iteration space position when determining MFMA computation
      locations rather than computing the position from the thread id
    • Construct the full subschedules for AMD MFMA ops so that the bound
      loop indices are ordered appropriately for the MFMA op being invoked
    • Update unit tests accordingly. The schedule changes may need to be
      moved to an under-the-hood feature of tensorization
  • Merged PR 2627: Raise error for invalid block dimensions. [Ritwik Das]

    Raise error for invalid block dimensions based on target info

    Related work items: #3715

  • Merged PR 2625: [nfc] Block debug mode for unsupported GPU targets.
    [Lisa Ong]

    Debug mode is not yet supported for GPU targets

    • Fail early
    • Update documentation
  • Merged PR 2622: Fix dependencies for benchmark tools. [Ritwik Das]

    Fix dependencies for benchmark tools

  • Merged PR 2604: Add bfloat16 support for tensor ops on rocm. [Ritwik
    Das]

    Add bfloat16 support for tensor ops on cuda and rocm

    Related work items: #3713

  • Merged PR 2621: Merge changes from Github repo. [Lisa Ong]

    commit 5b5f5ef

  • Merged PR 2620: Upgrade GPU self-hosted agents to g++-10. [Lisa Ong]

    The stock g++-9 from Ubuntu 20.04 crashes when compiling pybind11 alongside mlir/Dialect/IR/Affine/AffineOp.h.

    This change updates to g++-10 for the self-hosted images only, as this issue only affects images that we build for ROCm and CUDA.

    Azure DevOps agents will continue to run on their pre-installed g++-9.

  • Merged PR 2619: Parameterize Plan.bind. [Denny Sun]

        P0, P1, P2, P3, P4, P5 = create_parameters()

        plan.bind(mapping={
            P0: P3,
            P1: P4,
            P2: P5
        })

        package.add(
            plan,
            args=(A, B, C),
            parameters={
                P0: i,
                P1: j,
                P2: k,
                P3: v100.GridUnit.BLOCK_X,
                P4: v100.GridUnit.THREAD_X,
                P5: v100.GridUnit.THREAD_Y,
            },
            base_name=test_name)

    Related work items: #3708

  • Merged PR 2599: Support parameterizing caches based on memory space.
    [Mason Remy]

    Support parameterizing caches based on memory space

    • Identifies bound indices that the cache should be parameterized on,
      rather than shaped by.
      e.g. for a private memory cache inserted at a gpu block level, the
      computed memory space will not be the full active block at that level,
      but the portion derived from loops that weren't bound to gpu thread
      dims.

    • Adds some BoundProcessorOp utilities and shares some common binding
      code

  • Merged PR 2618: Fix memory allocation bug during benchmark
    verification. [Ritwik Das]

    Fix memory allocation bug during benchmark verification

  • Merged PR 2617: [nfc] [doc] Fix typo and re-sync models table. [Lisa
    Ong]

  • Merged PR 2616: Formatting Python code a bit for better
    readability. [Denny Sun]

    1. Some functions have a long list of parameters; add line wrapping
    2. Separate external imports from internal ones
  • Merged PR 2614: Remove redundant variable and cosmosdb fix. [Ritwik
    Das]

    Cosmos DB error when upserting from multiple processes:

    Process runner0:
    Traceback (most recent call last):
      File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
      File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "/azp/_work/2/s/tools/benchmarkers/accera_gemm.py", line 633, in gemm_runner
        cosmosdb.upsert_benchmark_results(resultRows, containerName, verboseLogs)
      File "/azp/_work/2/s/tools/benchmarkers/cosmosdb.py", line 27, in upsert_benchmark_results
        container = get_container(containerName, verboseLogs)
      File "/azp/_work/2/s/tools/benchmarkers/cosmosdb.py", line 18, in get_container
        container = db.create_container_if_not_exists(id=containerName, partition_key=PartitionKey(path='/partitionKey'))
      File "/usr/local/lib/python3.8/dist-packages/azure/core/tracing/decorator.py", line 62, in wrapper_use_tracer
        return func(*args, **kwargs) # type: ignore
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/database.py", line 287, in create_container_if_not_exists
        container_proxy.read(
      File "/usr/local/lib/python3.8/dist-packages/azure/core/tracing/decorator.py", line 62, in wrapper_use_tracer
        return func(*args, **kwargs) # type: ignore
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/container.py", line 145, in read
        self._properties = self.client_connection.ReadContainer(
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 469, in ReadContainer
        return self.Read(path, "colls", collection_id, None, options, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 2162, in Read
        result, self.last_response_headers = self.__Get(path, request_params, headers, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_cosmos_client_connection.py", line 2209, in __Get
        return synchronized_request.SynchronizedRequest(
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_synchronized_request.py", line 210, in SynchronizedRequest
        return _retry_utility.Execute(
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_retry_utility.py", line 73, in Execute
        result = ExecuteFunction(function, global_endpoint_manager, *args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_retry_utility.py", line 130, in ExecuteFunction
        return function(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/azure/cosmos/_synchronized_request.py", line 158, in _Request
        raise exceptions.CosmosHttpResponseError(message=data, response=response)
    azure.cosmos.exceptions.CosmosHttpResponseError: Status code: 400
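    One common mitigation for this kind of multi-process race (a sketch only;
    the PR's actual fix is not shown in these notes) is to retry the container
    lookup/creation with exponential backoff, treating the transient failure
    from concurrent create_container_if_not_exists calls as retryable. The
    helper below is generic pure Python; the flaky_get_container function is a
    hypothetical stand-in for the real Cosmos DB call:

        import time

        def retry_with_backoff(fn, retryable, attempts=5, base_delay=0.1):
            for attempt in range(attempts):
                try:
                    return fn()
                except Exception as e:
                    if not retryable(e) or attempt == attempts - 1:
                        raise  # non-retryable or out of attempts
                    time.sleep(base_delay * (2 ** attempt))  # exponential backoff

        # Stand-in for the Cosmos call: fails twice, then succeeds.
        calls = {"n": 0}
        def flaky_get_container():
            calls["n"] += 1
            if calls["n"] < 3:
                raise RuntimeError("Status code: 400")  # simulated transient error
            return "container"

        result = retry_with_backoff(flaky_get_container,
                                    retryable=lambda e: "400" in str(e))
        # result == "container" after two retried failures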

  • Merged PR 2613: Enable daily CUDA benchmarks. [Ritwik Das]

    • Enable CUDA benchmarks
    • some refactoring
  • Merged PR 2596: Updates to affine simplifications. [Mason Remy]

    Updates to affine simplifications

    • Run simplifications on AffineApplyOps
    • Detect and simplify some single-element-numerator cases for floordiv
      and mod
    • Detect GPU constants such as grid dim size and block dim size and
      incorporate those constants into affine maps for later simplification
    • Detect GPU bound dimensions block id and thread id in affine ops and
      incorporate those ranges into simplification passes

    Related work items: #3667
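    The range-based rewrites described above can be illustrated with a minimal
    sketch (not Accera's actual pass): once a value is known to be a GPU bound
    dimension with range [0, n), expressions like (x mod n) collapse to x and
    (x floordiv n) collapses to a constant.

        # Return "identity" when x mod divisor == x over the whole range.
        def simplify_mod(lo, hi_exclusive, divisor):
            if 0 <= lo and hi_exclusive <= divisor:
                return "identity"
            return "mod"

        # Return the constant quotient when the range lies within one
        # divisor-sized block, else None (not simplifiable to a constant).
        def simplify_floordiv(lo, hi_exclusive, divisor):
            q_lo, q_hi = lo // divisor, (hi_exclusive - 1) // divisor
            return q_lo if q_lo == q_hi else None

        # threadIdx.x with blockDim.x == 64:
        assert simplify_mod(0, 64, 64) == "identity"   # tid mod 64 -> tid
        assert simplify_floordiv(0, 64, 64) == 0       # tid floordiv 64 -> 0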

  • Merged PR 2594: Always resolve unrealized loopnest indices when
    computing cache positions. [Mason Remy]

    Always resolve unrealized loopnest indices when
    computing cache positions

  • Merged PR 2574: Support binding multiple indices to a processor
    handle. [Mason Remy]

    Support binding multiple indices to a processor handle

    • This creates a mapping of the processor handle to the index iterations
      based on the ordering of the indices in the tuple
  • Merged PR 2611: Fix issue when splitting indices by factors that don't
    divide evenly. [Chuck Jacobs]

    This PR fixes an issue when splitting by a factor that doesn't evenly divide the parent index's range. E.g., if i has a range of [0, 320), then ii = split(i, 128) would end up with ii having a range of 192 instead of 128.
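    The corrected behavior can be sketched as follows (a hypothetical helper,
    not the Accera API): splitting an extent of 320 by a factor of 128 should
    yield full-size main blocks plus one smaller clamped boundary block, so no
    inner range ever exceeds the split factor.

        def split_ranges(extent, factor):
            ranges = []
            start = 0
            while start < extent:
                size = min(factor, extent - start)  # boundary block is clamped
                ranges.append((start, size))
                start += factor
            return ranges

        blocks = split_ranges(320, 128)
        # [(0, 128), (128, 128), (256, 64)] -- every inner range is <= 128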

  • Merged PR 2612: Add missing psutil dependency. [Ritwik Das]

    • Add missing psutil dependency
    • Remove private branch from benchmarks
  • Merged PR 2608: Caching fixes and benchmarking optimizations. [Ritwik
    Das]

    • Explore k_split independently of the outer tile dims, allowing arbitrary k splits
    • Fix for workPerThread < 1 (from Mason). This was exposed because the benchmark
      now explores k-splits of size 1, 2, 4, etc., which produces small active blocks
      for caching; when the work per thread drops below 1, the compiler crashes during
      package.build.
  • Merged PR 2610: Opportunistically add more targets used in CI machines
    and update Model.md. [Lisa Ong]

    • Renamed some fields to add units
    • Added some Intel Xeon models as we encounter them
    • Updated some cache sizes
  • Merged PR 2606: Parameterize Array.sub_array. [Denny Sun]

        P0, P1 = create_parameters()
        arr = Array(role=Array.Role.INPUT_OUTPUT, element_type=ScalarType.float32, shape=(256, 256))
        arr0 = arr.sub_array(offsets=(0, 0), shape=(P0, P1))
        package.add(nest, args=(arr0, ), parameters={P0: 128, P1: 128})

    Related work items: #3707

  • Merged PR 2609: [build] peg protobuf to 3.20.1 due to
    incompatibilities with latest version. [Lisa Ong]

    Even though we peg to onnx==1.9.0, onnx declares protobuf >= 3.20.1, which pulls in an incompatible version of protobuf (4.x).

  • Merged PR 2576: [doc] MFMA thread assignment visualizations for AMD.
    [Lisa Ong]

    Some helper visualizations for MFMA:

    • 2x2x16
    • 4x4x32
  • Merged PR 2601: [ci] CUDA pipeline and buddy build. [Lisa Ong]

    • Container for CUDA self-hosted Azure devops agent
    • Initial buddy build pipeline (similar to ROCm)
    • Replaces references to Dockerhub with Azure Container Registry for compliance purposes
  • Merged PR 2603: Add CUDA pipeline host to known targets. [Lisa Ong]

    Note that the reported CPU frequencies conflict; I went with the cpuinfo and dmesg values.

    References:

    > python -m cpuinfo
    
    Python Version: 3.8.10.final.0 (64 bit)
    Cpuinfo Version: 8.0.0
    Vendor ID Raw: AuthenticAMD
    Hardware Raw:
    Brand Raw: AMD EPYC 7V12 64-Core Processor
    Hz Advertised Friendly: 3.3049 GHz
    Hz Actual Friendly: 3.3049 GHz
    Hz Advertised: (3304919000, 0)
    Hz Actual: (3304919000, 0)
    Arch: X86_64
    Bits: 64
    Count: 128
    Arch String Raw: x86_64
    L1 Data Cache Size: 2 MiB
    L1 Instruction Cache Size: 2 MiB
    L2 Cache Size: 32 MiB
    L2 Cache Line Size: 512
    L2 Cache Associativity: 6
    L3 Cache Size: 524288
    Stepping:
    Model: 49
    Family: 23
    Processor Type:
    Flags: 3dnowext, 3dnowprefetch, abm, adx, aes, aperfmperf, apic, arat, avic, avx, avx2, bmi1, bmi2, bpext, cat_l3, cdp_l3, clflush, clflushopt, clwb, clzero, cmov, cmp_legacy, constant_tsc, cpb, cpuid, cqm, cqm_llc, cqm_mbm_local, cqm_mbm_total, cqm_occup_llc, cr8_legacy, cx16, cx8, dbx, de, decodeassists, extapic, extd_apicid, f16c, flushbyasid, fma, fpu, fsgsbase, fxsr, fxsr_opt, ht, hw_pstate, ibpb, ibrs, ibs, irperf, lahf_lm, lbrv, lm, mba, mca, mce, misalignsse, mmx, mmxext, monitor, movbe, msr, mtrr, mwaitx, nonstop_tsc, nopl, npt, nrip_save, nx, osvw, osxsave, overflow_recov, pae, pat, pausefilter, pci_l2i, pclmulqdq, pdpe1gb, perfctr_core, perfctr_llc, perfctr_nb, pfthreshold, pge, pni, popcnt, pqe, pqm, pse, pse36, rdpid, rdrand, rdrnd, rdseed, rdt_a, rdtscp, rep_good, sep, sev, sha, sha_ni, skinit, smap, smca, sme, smep, ssbd, sse, sse2, sse4_1, sse4_2, sse4a, ssse3, stibp, succor, svm, svm_lock, syscall, tce, topoext, tsc, tsc_scale, umip, v_vmsave_vmload, vgif, vmcb_clean, vme, vmmcall, wbnoinvd, wdt, xgetbv1, xsave, xsavec, xsaveerptr, xsaveopt, xsaves
    
    > dmesg | grep MHz
    [    0.000000] tsc: Detected 2450.083 MHz processor
    [    7.731766] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
    [    8.979712] tsc: Refined TSC clocksource calibration: 2449.961 MHz
    
    > lscpu
    
    Architecture:                    x86_64
    CPU op-mode(s):                  32-bit, 64-bit
    Byte Order:                      Little Endian
    Address sizes:                   43 bits physical, 48 bits virtual
    CPU(s):                          128
    On-line CPU(s) list:             0-127
    Thread(s) per core:              2
    Core(s) per socket:              64
    Socket(s):                       1
    NUMA node(s):                    1
    Vendor ID:                       AuthenticAMD
    CPU family:                      23
    Model:                           49
    Model name:                      AMD EPYC 7V12 64-Core Processor
    Stepping:                        0
    Frequency boost:                 enabled
    CPU MHz:                         1497.558
    CPU max MHz:                     2450.0000
    CPU min MHz:                     1500.0000
    BogoMIPS:                        4900.16
    Virtualization:                  AMD-V
    L1d cache:                       2 MiB
    L1i cache:                       2 MiB
    L2 cache:                        32 MiB
    L3 cache:                        2...
    
  • Merged PR 2602: Add rocwmma plumbing in tensorize. [Ritwik Das]

    • Add rocwmma plumbing in tensorize
    • This flag cannot be used until the ROCm 5.2 release, which natively supports rocWMMA.

    Related work items: #3672

  • Merged PR 2570: Enhancements to the gpu benchmark tool. [Ritwik Das]

    • Add multiprocess package builders and runners
    • Support for running on different GPU devices
    • Add clock speed determinism
    • add composable_kernel benchmarks
    • add cutlass benchmarks
    • add cublas and rocblas benchmarks
    • Add Cosmos DB result upload capability

    Related work items: #3683, #3700, #3705, #3685

  • Merged PR 2598: Fix mfma enum name typo. [Mason Remy]

    Fix mfma enum name typo

  • Merged PR 2595: [nfc] Renames smoke_test.py -> smoke_tests.py. [Kern
    Handa]

    [nfc] Renames smoke_test.py -> smoke_tests.py

Full Changelog: v1.2.5...v1.2.6