Squashed commit of the following:

commit aa38f90099653e4662e053fd3ab08248d3612ef9
Author: Mason Remy <masonr@microsoft.com>
Date:   Fri Mar 10 06:13:20 2023 +0000

    Merged PR 3150: Change high precision fp to not perform contraction

    Change high precision fp to not perform contraction

    Also change value library FMA to use the math dialect FmaOp and
    vectorize to the vector dialect FMAOp

commit 859755f7bbf76fb6b0b92ed7a7dc6cf5c1615ba1
Author: Mason Remy <masonr@microsoft.com>
Date:   Thu Mar 9 19:17:58 2023 +0000

    Merged PR 3147: Fix vector cast with same bitwidth.

    Fix vector cast with same bitwidth.

    accv.cast vector<16xi8> to vector<16xui8> was erroneously lowering to
    cast vector<16xi8> to ui8

commit d6b3308d0f4a4dbda7a30d0695d7408dfd9d32b9
Author: Mason Remy <masonr@microsoft.com>
Date:   Thu Mar 9 18:26:57 2023 +0000

    Merged PR 3149: Improve 1-D horizontal sum reductions for 8xf32 and 8xi32

    Improve 1-D horizontal sum reductions for 8xf32 and 8xi32

commit cd030b123dac7b25b2da5127e04e4e28919bd9ed
Author: Kern Handa <kerha@microsoft.com>
Date:   Thu Mar 9 01:22:37 2023 +0000

    Merged PR 3148: Adds Package level FP precision override

commit dc86c7cd92530c5e0c36639b62b5a380c531648d
Author: Kern Handa <kerha@microsoft.com>
Date:   Wed Mar 8 22:02:25 2023 +0000

    Merged PR 3144: Removes fp precision as an option for Package.build

    The fp-contract option being used in `accc.py` was overriding the recent
    addition of the fp precision specification at the function level. Since
    there's now an equivalent default for each function, we shouldn't have
    need of the option to be specified to `llc` and `opt` during build time.

commit 91e77ebcd926fb238852c94477d7f0c26c8f9952
Author: Denny Sun <dennys@microsoft.com>
Date:   Wed Mar 8 11:48:06 2023 +0000

    Merged PR 3143: Add dsl test for profiling op

    1. add profiling enable flag to Package.build()
    2. add a dsl test

commit 33ffb2497e71040e0b27775e5b594cad0949cbe5
Author: Denny Sun <dennys@microsoft.com>
Date:   Wed Mar 8 09:27:24 2023 +0000

    Merged PR 3022: Assert the arg order in debug mode

    Dimension arg should precede array arg in the arg list for debug mode.

commit e3b216ac87e6d75a10c39584d7dfe25b3fc67647
Author: Denny Sun <dennys@microsoft.com>
Date:   Wed Mar 8 08:12:00 2023 +0000

    Merged PR 3137: expose profiling function to DSL

    expose profiling function to DSL

commit d2fcb1caf99c002a25b1ee8c28b3cd719fd6133a
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Tue Mar 7 12:09:08 2023 +0000

    Merged PR 3142: [Release] Tie accera-llvm versioning to LLVM version

    This change introduces a new versioning schema for accera-llvm that
    follows LLVM's versioning, while allowing for Accera versioned forks:

    `<llvm_major>.<llvm_minor>.<llvm_micro><accera_micro> = (N+).(N+).(N+)(N{2})`

    This overloads the micro version field due to constraints on Python
    versioning: https://peps.python.org/pep-0440/

    Examples:
    * Current LLVM fork is 14.0.6-2: `accera_llvm.14.0.602`, which means
      LLVM 14.0.6 + accera fork v2
    * If/when upgrading to LLVM 15.0.7: `accera_llvm.15.0.700`
    * Then when we rev the Accera fork to LLVM 15.0.7-1: `accera_llvm.15.0.701`

    Limitations:
    * We don't expect Accera's fork to span beyond 2-digit versions

    Alternatives:
    * Omit the 0 delimiters, if we think it is unlikely that Accera forks
      will rev micro versions beyond single-digit. Accera forks may rev more
      often if we don't update LLVM.
    * Use a dev version, e.g. accera_llvm.14.0.6.dev4. Downside is that this
      looks unofficial - devN is intended for developmental releases rather
      than official PyPI releases. That said, the whole Accera project is
      developmental :)

commit 79ef63b685e8a2221b3127801dec324f0613fa66
Author: Kern Handa <kerha@microsoft.com>
Date:   Tue Mar 7 09:45:06 2023 +0000

    Merged PR 3139: Allows setting precision of fp ops per function

    Allows setting precision of fp ops per function

commit 599742a82910cdc39b5e668f28774327f53a28c7
Author: Mason Remy <masonr@microsoft.com>
Date:   Mon Mar 6 21:31:09 2023 +0000

    Merged PR 3140: Fix bug with reinterpret casts of unrealized conversion casts.

    Fix bug with reinterpret casts of unrealized conversion casts.

    This happens when we do a heap alloc followed by a reinterpret cast, but
    it can come up in other scenarios too

commit 655044a3400a4b40aaf9423abc791763c079f410
Author: Lisa Ong <onglisa@microsoft.com>
Date:   Fri Mar 3 06:15:31 2023 +0000

    Merged PR 3135: [nfc] Add XeonE5 benchmark machine to targets, bump hatlib dependency

    Best guesses at cache sizes and cache lines from:
    https://en.wikichip.org/wiki/intel/xeon_e5/e5-2673_v4
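
The accera-llvm versioning schema from PR 3142 packs the Accera fork revision into the last two digits of the micro field. A minimal sketch of that encoding follows; the helper names `encode_accera_llvm_version` and `decode_accera_llvm_version` are hypothetical and are not part of the Accera release tooling.

```python
from typing import Tuple


def encode_accera_llvm_version(llvm_major: int, llvm_minor: int,
                               llvm_micro: int, accera_rev: int) -> str:
    """Compose <llvm_major>.<llvm_minor>.<llvm_micro><accera_micro>.

    Hypothetical helper: accera_micro is always written as two digits,
    which is why the fork revision is limited to 0..99.
    """
    if not 0 <= accera_rev <= 99:
        raise ValueError("the Accera fork revision must fit in two digits")
    return f"{llvm_major}.{llvm_minor}.{llvm_micro}{accera_rev:02d}"


def decode_accera_llvm_version(version: str) -> Tuple[int, int, int, int]:
    """Split a version string back into (major, minor, micro, accera_rev)."""
    major, minor, micro = version.split(".")
    if len(micro) < 3 or not micro.isdigit():
        raise ValueError(f"micro field {micro!r} does not carry a 2-digit fork revision")
    # The last two digits are the fork revision, the rest is LLVM's micro version.
    return int(major), int(minor), int(micro[:-2]), int(micro[-2:])


print(encode_accera_llvm_version(14, 0, 6, 2))   # 14.0.602
print(decode_accera_llvm_version("15.0.701"))    # (15, 0, 7, 1)
```

The three examples in the commit message (14.0.602, 15.0.700, 15.0.701) round-trip through these two functions.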
microsoft · Mar 10, 2023 · 5ebe6c7
1 parent 05f8c0d
commit 5ebe6c7
Showing 25 changed files with 553 additions and 91 deletions.
diff --git a/accera/acc-opt/test/ValueBinOpCastOp.mlir b/accera/acc-opt/test/ValueBinOpCastOp.mlir
@@ -37,7 +37,7 @@ module @test_bin_op_cast_op_folding_module {
     // CHECK-NEXT:    %0 = affine.load %arg0[0] : memref<1xf32>
     // CHECK-NEXT:    %1 = affine.load %arg1[0] : memref<1xi32>
     // CHECK-NEXT:    %2 = arith.sitofp %1 : i32 to f32
-    // CHECK-NEXT:    %3 = arith.mulf %0, %2 {RelaxedPrecision} : f32
+    // CHECK-NEXT:    %3 = arith.mulf %0, %2 : f32
     // CHECK-NEXT:    %4 = arith.fptosi %3 : f32 to i32
     // CHECK-NEXT:    affine.store %4, %arg2[0] : memref<1xi32>
     builtin.func @bin_op_cast_input_to_f32(%arg0: memref<1xf32>, %arg1: memref<1xi32>, %arg2: memref<1xi32>) {

diff --git a/accera/acc-opt/test/value_mlir_test.cpp b/accera/acc-opt/test/value_mlir_test.cpp
@@ -553,15 +553,15 @@ TEST_CASE("mlir_test13")
 // CHECK-NEXT: accv.module "test_emit_c_interface" {
 TEST_CASE("test_emit_c_interface")
 {
-    // CHECK-NEXT:    accv.func nested @external_func_decl_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.usages = [], exec_target = 0 : i64, llvm.emit_c_interface} {
+    // CHECK-NEXT:    accv.func nested @external_func_decl_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.usages = [], exec_target = 0 : i64, fastmath = #llvm.fastmath<fast>, llvm.emit_c_interface} {
     auto externDecl = DeclareFunction("external_func_decl")
                           .External(true)
                           .CWrapper(true)
                           // CHECK: return
                           // CHECK-NEXT:    }
                           .Define([] {});
 
-    // CHECK-NEXT:    accv.func nested @internal_func_decl_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.usages = [], exec_target = 0 : i64, llvm.emit_c_interface} {
+    // CHECK-NEXT:    accv.func nested @internal_func_decl_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.usages = [], exec_target = 0 : i64, fastmath = #llvm.fastmath<fast>, llvm.emit_c_interface} {
     DeclareFunction("internal_func_decl")
         .External(false)
         .CWrapper(true)
@@ -577,15 +577,15 @@ TEST_CASE("test_emit_c_interface")
 // CHECK-NEXT: accv.module "test_raw_pointer_api" {
 TEST_CASE("test_raw_pointer_api")
 {
-    // CHECK-NEXT:    accv.func nested @external_func_decl_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.emit_raw_pointer_api, accv.usages = [], exec_target = 0 : i64} {
+    // CHECK-NEXT:    accv.func nested @external_func_decl_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.emit_raw_pointer_api, accv.usages = [], exec_target = 0 : i64, fastmath = #llvm.fastmath<fast>} {
     auto externDecl = DeclareFunction("external_func_decl")
                           .External(true)
                           .RawPointerAPI(true)
                           // CHECK: return
                           // CHECK-NEXT:    }
                           .Define([] {});
 
-    // CHECK-NEXT:    accv.func nested @internal_func_decl_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.emit_raw_pointer_api, accv.usages = [], exec_target = 0 : i64} {
+    // CHECK-NEXT:    accv.func nested @internal_func_decl_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.emit_raw_pointer_api, accv.usages = [], exec_target = 0 : i64, fastmath = #llvm.fastmath<fast>} {
     DeclareFunction("internal_func_decl")
         .External(false)
         .RawPointerAPI(true)
@@ -601,15 +601,15 @@ TEST_CASE("test_raw_pointer_api")
 // CHECK-NEXT: accv.module "test_emit_header_decl" {
 TEST_CASE("test_emit_header_decl")
 {
-    // CHECK-NEXT:    accv.func nested @external_func_decl_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.emit_header_decl, accv.usages = [], exec_target = 0 : i64} {
+    // CHECK-NEXT:    accv.func nested @external_func_decl_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.emit_header_decl, accv.usages = [], exec_target = 0 : i64, fastmath = #llvm.fastmath<fast>} {
     auto externDecl = DeclareFunction("external_func_decl")
                           .External(true)
                           .HeaderDecl(true)
                           // CHECK: return
                           // CHECK-NEXT:    }
                           .Define([] {});
 
-    // CHECK-NEXT:    accv.func nested @internal_func_decl_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.emit_header_decl, accv.usages = [], exec_target = 0 : i64} {
+    // CHECK-NEXT:    accv.func nested @internal_func_decl_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.emit_header_decl, accv.usages = [], exec_target = 0 : i64, fastmath = #llvm.fastmath<fast>} {
     DeclareFunction("internal_func_decl")
         .External(false)
         .HeaderDecl(true)
@@ -625,13 +625,13 @@ TEST_CASE("test_emit_header_decl")
 // CHECK-NEXT: accv.module "test_function_tags" {
 TEST_CASE("test_function_tags")
 {
-    // CHECK-NEXT:    accv.func nested @no_func_tags_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.usages = [], exec_target = 0 : i64} {
+    // CHECK-NEXT:    accv.func nested @no_func_tags_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.usages = [], exec_target = 0 : i64, fastmath = #llvm.fastmath<fast>} {
     auto externDecl = DeclareFunction("no_func_tags")
                           // CHECK: return
                           // CHECK-NEXT:    }
                           .Define([] {});
 
-    // CHECK-NEXT:    accv.func nested @has_func_tags_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.function_tags = {tag_a, tag_b}, accv.usages = [], exec_target = 0 : i64} {
+    // CHECK-NEXT:    accv.func nested @has_func_tags_{{[0-9]+}}() attributes {accv.dyn_arg_size_refs = [], accv.function_tags = {tag_a, tag_b}, accv.usages = [], exec_target = 0 : i64, fastmath = #llvm.fastmath<fast>} {
     DeclareFunction("has_func_tags")
         .AddTag("tag_a")
         .AddTag("tag_b")

diff --git a/accera/accc/accc.py b/accera/accc/accc.py
@@ -100,9 +100,6 @@ def bstr(val):
 
 DEFAULT_MLIR_TRANSLATE_ARGS = ["--mlir-print-op-on-diagnostic", "--acc-to-llvmir"]
 
-DEFAULT_LOW_PRECISION_FLOAT_OPTS = ["-fp-contract=fast", "--enable-unsafe-fp-math"]
-DEFAULT_HIGH_PRECISION_FLOAT_OPTS = ["-fp-contract=on"]
-
 OPT_DISABLE_LOOP_UNROLLING_ARGS = ["--disable-loop-unrolling"]
 
 LLVM_KEEP_DEBUG_INFO_ARGS = ["--frame-pointer=all"]
@@ -128,20 +125,15 @@ def bstr(val):
 }
 
 DEFAULT_LLVM_TOOLING_OPTS = [
-    '--enable-no-infs-fp-math',
-    '--enable-no-nans-fp-math',
-    '--enable-no-signed-zeros-fp-math',
-    '--enable-no-trapping-fp-math'
 ]
 
 DEFAULT_OPT_ARGS = DEFAULT_LLVM_TOOLING_OPTS + []
 
 DEFAULT_LLC_ARGS = DEFAULT_LLVM_TOOLING_OPTS + ["-relocation-model=pic"]
 
 class Options(Flag):
-    NONE = auto() # (enable auto unroll | low precision float | no debug info)
+    NONE = auto() # (enable auto unroll | no debug info)
     DISABLE_AUTO_UNROLL = auto()
-    HIGH_PRECISION_FLOATING_POINT_OPS = auto()
     KEEP_DEBUG_INFO = auto()
 
 def _get_common_debug_info_options_args(options: Options):
@@ -150,27 +142,19 @@ def _get_common_debug_info_options_args(options: Options):
     else:
         return []
 
-def _get_common_fp_options_args(options: Options):
-    if options & Options.HIGH_PRECISION_FLOATING_POINT_OPS:
-        return DEFAULT_HIGH_PRECISION_FLOAT_OPTS
-    else:
-        return DEFAULT_LOW_PRECISION_FLOAT_OPTS
-
 def _get_options_opt_args(options: Options):
     args = []
 
     if options & Options.DISABLE_AUTO_UNROLL:
         args += OPT_DISABLE_LOOP_UNROLLING_ARGS
 
-    args += _get_common_fp_options_args(options)
     args += _get_common_debug_info_options_args(options)
 
     return args
 
 def _get_options_llc_args(options: Options):
     args = []
 
-    args += _get_common_fp_options_args(options)
     args += _get_common_debug_info_options_args(options)
 
     return args

diff --git a/accera/python/accera/Debug.py b/accera/python/accera/Debug.py
@@ -12,6 +12,22 @@
 from ._lang_python._lang import Dimension
 
 
+def check_args_order(func: Function):
+    try:
+        for arg in func.requested_args:
+            if isinstance(arg, Array):
+                for dim in arg.shape:
+                    if isinstance(dim, Dimension):
+                        assert func.requested_args.index(dim) < func.requested_args.index(arg)
+    except Exception as e:
+        if isinstance(e, AssertionError):
+            assert False, "Dimension arguments need to precede the array argument in Debug mode"
+        else:
+            # Swallow the exception in this function when the array's dimension is absent from the arg list,
+            # let this function only focus on the arg order check.
+            return         
+
+
 def get_args_to_debug(func: Function) -> List[Array]:
     """Gets the arguments of interest to debugging
     For example, INPUT_OUTPUT Arrays
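
The ordering rule that `check_args_order` asserts can be tried out in isolation. The sketch below uses stand-in `Dim` and `Arr` classes (hypothetical, not the real Accera `Dimension` and `Array`) and returns a boolean instead of asserting. Like the function in the diff, it ignores dimensions that are absent from the argument list.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Dim:
    """Stand-in for a runtime dimension argument (hypothetical)."""
    name: str


@dataclass(frozen=True)
class Arr:
    """Stand-in for an array argument whose shape may mention Dim objects."""
    name: str
    shape: Tuple[object, ...]


def dims_precede_arrays(args: Sequence[object]) -> bool:
    """Return True if every Dim used in an array's shape appears before that array."""
    for position, arg in enumerate(args):
        if not isinstance(arg, Arr):
            continue
        for dim in arg.shape:
            # Static sizes (plain ints) and dimensions missing from the
            # argument list are skipped; only the relative order is checked.
            if isinstance(dim, Dim) and dim in args:
                if args.index(dim) > position:
                    return False
    return True


M, N = Dim("M"), Dim("N")
A = Arr("A", (M, N))
print(dims_precede_arrays([M, N, A]))  # True
print(dims_precede_arrays([A, M, N]))  # False
```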

diff --git a/accera/python/accera/Package.py b/accera/python/accera/Package.py
@@ -479,7 +479,7 @@ def _add_functions_to_module(self, module, fail_on_error=False):
                 del self._fns[name]
 
     def _add_debug_utilities(self, tolerance):
-        from .Debug import get_args_to_debug, add_debugging_functions
+        from .Debug import get_args_to_debug, add_debugging_functions, check_args_order
 
         # add_check_all_close will modify the self._fns dictionary (because
         # it is adding debug functions), to avoid this, we first gather information
@@ -576,10 +576,13 @@ def _make_accc_options(self, options: _Options):
         accc_opts = accc.Options.NONE
         if options & Package._Options.DISABLE_AUTO_UNROLL:
             accc_opts |= accc.Options.DISABLE_AUTO_UNROLL
-        if options & Package._Options.HIGH_PRECISION_FLOATING_POINT_OPS:
-            accc_opts |= accc.Options.HIGH_PRECISION_FLOATING_POINT_OPS
         return accc_opts
 
+    def _apply_options_to_funcs(self, options: _Options):
+        if options & Package._Options.HIGH_PRECISION_FLOATING_POINT_OPS:
+            for f in self._fns.values():
+                if f.high_precision_fp is None:
+                    f.high_precision_fp = True
 
     def build(
         self,
@@ -590,6 +593,7 @@ def build(
         tolerance: float = 1e-5,
         output_dir: str = None,
         fail_on_error: bool = False,
+        profile: bool = False,
         _opts: _Options = _Options.NONE,
         _quiet=True,
     ):
@@ -653,6 +657,7 @@ def build(
 
         # Create the package module
         package_module = _lang_python._Module(name=name, options=compiler_options)
+        self._apply_options_to_funcs(_opts)
         self._add_functions_to_module(package_module, fail_on_error)
 
         # Emit the package module
@@ -705,6 +710,7 @@ def build(
 
         proj.generate_and_emit(
             build_config=mode.value,
+            profile=profile,
             system_target=target_device.device_name,
             runtime=target.runtime.name,
             dump_all_passes=dump_ir,
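
The package-level override in `_apply_options_to_funcs` only fills in functions that left `high_precision_fp` unset, so an explicit per-function choice is kept. A minimal sketch of that behavior, using a stand-in `Fn` dataclass (hypothetical) rather than the real `Function` class:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Fn:
    """Stand-in for a function; None means no per-function precision choice."""
    name: str
    high_precision_fp: Optional[bool] = None


def apply_package_high_precision(fns: Dict[str, Fn], package_high_precision: bool) -> None:
    """Mirror the override in the diff: only unset functions pick up the package flag."""
    if package_high_precision:
        for f in fns.values():
            if f.high_precision_fp is None:
                f.high_precision_fp = True


def effective_high_precision(f: Fn) -> bool:
    """Mirror `bool(self.high_precision_fp)` in `_emit`: an unset value becomes False."""
    return bool(f.high_precision_fp)


fns = {
    "unset": Fn("unset"),
    "fast": Fn("fast", high_precision_fp=False),
    "precise": Fn("precise", high_precision_fp=True),
}
apply_package_high_precision(fns, package_high_precision=True)
print({name: effective_high_precision(f) for name, f in fns.items()})
# {'unset': True, 'fast': False, 'precise': True}
```

Without the package flag, an unset function ends up with `False` at emit time, which corresponds to the relaxed-precision default.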

diff --git a/accera/python/accera/Targets.py b/accera/python/accera/Targets.py
@@ -462,6 +462,10 @@ class Architecture(Enum):
     ["Intel E5-1680 v3",  "Haswell", "Xeon E5", 3.2, 3.8, 8, 16, [48, 256, 20 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"],
     ["Intel E5-2620 v3",  "Haswell", "Xeon E5", 2.4, 3.2, 6, 12, [48, 256, 15 * 1024], [64, 64, 64], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"],
 
+    # Intel Broadwell
+    # ref: https://en.wikichip.org/wiki/intel/xeon_e5/e5-2673_v4
+    ["Intel E5-2673 v4",  "Broadwell", "Xeon E5", 2.3, 2.6, 20, 40, [20, 20, 20], [32, 256, 2.5*1024], 32, 16, ["SSE4.1", "SSE4.2", "AVX2"], "X86_64", "OPENMP"],
+
     # AMD Zen
     # ref: https://en.wikipedia.org/wiki/Zen_(first_generation)
     # ref: https://en.wikichip.org/wiki/amd/microarchitectures/zen

diff --git a/accera/python/accera/lang/Function.py b/accera/python/accera/lang/Function.py
@@ -74,6 +74,7 @@ class Function:
     definition: Callable = None
     no_inline: bool = False # no_inline == True means that this function cannot be inlined into other functions
     no_inline_into: bool = False # no_inline_into == True means that this function cannot have other functions inlined into it
+    high_precision_fp: bool = None # high_precision_fp == True means that precision will not be sacrificed for performance
     auxiliary: dict = field(default_factory=dict)
     target: Target = Target.HOST
     output_verifiers: list = field(default_factory=list)
@@ -102,6 +103,7 @@ def _emit(self):
 
         self._native_fn.inlinable(not self.no_inline)
         self._native_fn.inlinable_into(not self.no_inline_into)
+        self._native_fn.high_precision_fp(bool(self.high_precision_fp))
 
         sig = signature(self.definition)