diff --git a/docs/build/eps.md b/docs/build/eps.md index 6a76ce0fcbfd7..40bf99be46bff 100644 --- a/docs/build/eps.md +++ b/docs/build/eps.md @@ -396,75 +396,24 @@ The DirectML execution provider supports building for both x64 and x86 architect --- -## ARM Compute Library +## Arm Compute Library See more information on the ACL Execution Provider [here](../execution-providers/community-maintained/ACL-ExecutionProvider.md). -### Prerequisites -{: .no_toc } - -* Supported backend: i.MX8QM Armv8 CPUs -* Supported BSP: i.MX8QM BSP - * Install i.MX8QM BSP: `source fsl-imx-xwayland-glibc-x86_64-fsl-image-qt5-aarch64-toolchain-4*.sh` -* Set up the build environment -``` -source /opt/fsl-imx-xwayland/4.*/environment-setup-aarch64-poky-linux -alias cmake="/usr/bin/cmake -DCMAKE_TOOLCHAIN_FILE=$OECORE_NATIVE_SYSROOT/usr/share/cmake/OEToolchainConfig.cmake" -``` -* See [Build ARM](inferencing.md#arm) below for information on building for ARM devices - ### Build Instructions {: .no_toc } -1. Configure ONNX Runtime with ACL support: -``` -cmake ../onnxruntime-arm-upstream/cmake -DONNX_CUSTOM_PROTOC_EXECUTABLE=/usr/bin/protoc -Donnxruntime_RUN_ONNX_TESTS=OFF -Donnxruntime_GENERATE_TEST_REPORTS=ON -Donnxruntime_DEV_MODE=ON -DPYTHON_EXECUTABLE=/usr/bin/python3 -Donnxruntime_USE_CUDA=OFF -Donnxruntime_USE_NSYNC=OFF -Donnxruntime_CUDNN_HOME= -Donnxruntime_USE_JEMALLOC=OFF -Donnxruntime_ENABLE_PYTHON=OFF -Donnxruntime_BUILD_CSHARP=OFF -Donnxruntime_BUILD_SHARED_LIB=ON -Donnxruntime_USE_EIGEN_FOR_BLAS=ON -Donnxruntime_USE_OPENBLAS=OFF -Donnxruntime_USE_ACL=ON -Donnxruntime_USE_DNNL=OFF -Donnxruntime_USE_MKLML=OFF -Donnxruntime_USE_OPENMP=ON -Donnxruntime_USE_TVM=OFF -Donnxruntime_USE_LLVM=OFF -Donnxruntime_ENABLE_MICROSOFT_INTERNAL=OFF -Donnxruntime_USE_BRAINSLICE=OFF -Donnxruntime_USE_EIGEN_THREADPOOL=OFF -Donnxruntime_BUILD_UNIT_TESTS=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo -``` -The ```-Donnxruntime_USE_ACL=ON``` option will use, by default, the 19.05 version of the Arm Compute Library. To set the right version you can use: -```-Donnxruntime_USE_ACL_1902=ON```, ```-Donnxruntime_USE_ACL_1905=ON```, ```-Donnxruntime_USE_ACL_1908=ON``` or ```-Donnxruntime_USE_ACL_2002=ON```; - -To use a library outside the normal environment you can set a custom path by using ```-Donnxruntime_ACL_HOME``` and ```-Donnxruntime_ACL_LIBS``` tags that defines the path to the ComputeLibrary directory and the build directory respectively. +You must first build Arm Compute Library 24.07 for your platform as described in the [documentation](https://github.com/ARM-software/ComputeLibrary). +See [here](inferencing.md#arm) for information on building for Arm®-based devices. -```-Donnxruntime_ACL_HOME=/path/to/ComputeLibrary```, ```-Donnxruntime_ACL_LIBS=/path/to/build``` +Add the following options to `build.sh` to enable the ACL Execution Provider: - -2. Build ONNX Runtime library, test and performance application: -``` -make -j 6 -``` - -3. Deploy ONNX runtime on the i.MX 8QM board ``` -libonnxruntime.so.0.5.0 -onnxruntime_perf_test -onnxruntime_test_all +--use_acl --acl_home=/path/to/ComputeLibrary --acl_libs=/path/to/ComputeLibrary/build ``` -### Native Build Instructions -{: .no_toc } - -*Validated on Jetson Nano and Jetson Xavier* - -1. 
Build ACL Library (skip if already built) - - ```bash - cd ~ - git clone -b v20.02 https://github.com/Arm-software/ComputeLibrary.git - cd ComputeLibrary - sudo apt-get install -y scons g++-arm-linux-gnueabihf - scons -j8 arch=arm64-v8a Werror=1 debug=0 asserts=0 neon=1 opencl=1 examples=1 build=native - ``` - -1. Cmake is needed to build ONNX Runtime. Because the minimum required version is 3.13, - it is necessary to build CMake from source. Download Unix/Linux sources from https://cmake.org/download/ - and follow https://cmake.org/install/ to build from source. Version 3.17.5 and 3.18.4 have been tested on Jetson. - -1. Build onnxruntime with --use_acl flag with one of the supported ACL version flags. (ACL_1902 | ACL_1905 | ACL_1908 | ACL_2002) - ---- - -## ArmNN +## Arm NN -See more information on the ArmNN Execution Provider [here](../execution-providers/community-maintained/ArmNN-ExecutionProvider.md). +See more information on the Arm NN Execution Provider [here](../execution-providers/community-maintained/ArmNN-ExecutionProvider.md). ### Prerequisites {: .no_toc } @@ -480,7 +429,7 @@ source /opt/fsl-imx-xwayland/4.*/environment-setup-aarch64-poky-linux alias cmake="/usr/bin/cmake -DCMAKE_TOOLCHAIN_FILE=$OECORE_NATIVE_SYSROOT/usr/share/cmake/OEToolchainConfig.cmake" ``` -* See [Build ARM](inferencing.md#arm) below for information on building for ARM devices +* See [here](inferencing.md#arm) for information on building for Arm-based devices ### Build Instructions {: .no_toc } @@ -490,20 +439,20 @@ alias cmake="/usr/bin/cmake -DCMAKE_TOOLCHAIN_FILE=$OECORE_NATIVE_SYSROOT/usr/sh ./build.sh --use_armnn ``` -The Relu operator is set by default to use the CPU execution provider for better performance. To use the ArmNN implementation build with --armnn_relu flag +The Relu operator is set by default to use the CPU execution provider for better performance. To use the Arm NN implementation build with --armnn_relu flag ```bash ./build.sh --use_armnn --armnn_relu ``` -The Batch Normalization operator is set by default to use the CPU execution provider. To use the ArmNN implementation build with --armnn_bn flag +The Batch Normalization operator is set by default to use the CPU execution provider. To use the Arm NN implementation build with --armnn_bn flag ```bash ./build.sh --use_armnn --armnn_bn ``` -To use a library outside the normal environment you can set a custom path by providing the --armnn_home and --armnn_libs parameters to define the path to the ArmNN home directory and build directory respectively. -The ARM Compute Library home directory and build directory must also be available, and can be specified if needed using --acl_home and --acl_libs respectively. +To use a library outside the normal environment you can set a custom path by providing the --armnn_home and --armnn_libs parameters to define the path to the Arm NN home directory and build directory respectively. +The Arm Compute Library home directory and build directory must also be available, and can be specified if needed using --acl_home and --acl_libs respectively. 
 ```bash
 ./build.sh --use_armnn --armnn_home /path/to/armnn --armnn_libs /path/to/armnn/build --acl_home /path/to/ComputeLibrary --acl_libs /path/to/acl/build
@@ -519,7 +468,7 @@ See more information on the RKNPU Execution Provider [here](../execution-provide

 * Supported platform: RK1808 Linux

-* See [Build ARM](inferencing.md#arm) below for information on building for ARM devices
+* See [here](inferencing.md#arm) for information on building for Arm-based devices

 * Use gcc-linaro-6.3.1-2017.05-x86_64_aarch64-linux-gnu instead of gcc-linaro-6.3.1-2017.05-x86_64_arm-linux-gnueabihf, and modify CMAKE_CXX_COMPILER & CMAKE_C_COMPILER in tool.cmake:

 ```
diff --git a/docs/build/inferencing.md b/docs/build/inferencing.md
index d76381a11743d..125623ef28399 100644
--- a/docs/build/inferencing.md
+++ b/docs/build/inferencing.md
@@ -88,7 +88,8 @@ If you would like to use [Xcode](https://developer.apple.com/xcode/) to build th

 Without this flag, the cmake build generator will be Unix makefile by default.

-Today, Mac computers are either Intel-Based or Apple silicon(aka. ARM) based. By default, ONNX Runtime's build script only generate bits for the CPU ARCH that the build machine has. If you want to do cross-compiling: generate ARM binaries on a Intel-Based Mac computer, or generate x86 binaries on a Mac ARM computer, you can set the "CMAKE_OSX_ARCHITECTURES" cmake variable. For example:
+Today, Mac computers are either Intel-based or Apple silicon-based. By default, ONNX Runtime's build script only generates binaries for the CPU architecture of the build machine. If you want to cross-compile (generate arm64 binaries on an Intel-based Mac computer, or x86 binaries on a Mac
+system with Apple silicon), you can set the "CMAKE_OSX_ARCHITECTURES" cmake variable. For example:

 Build for Intel CPUs:
 ```bash
@@ -367,21 +368,21 @@ ORT_DEBUG_NODE_IO_DUMP_DATA_TO_FILES=1
 ```

-### ARM
+### Arm

-There are a few options for building ONNX Runtime for ARM.
+There are a few options for building ONNX Runtime for Arm®-based devices.

-First, you may do it on a real ARM device, or on a x86_64 device with an emulator(like qemu), or on a x86_64 device with a docker container with an emulator(you can run an ARM container on a x86_64 PC). Then the build instructions are essentially the same as the instructions for Linux x86_64. However, it wouldn't work if your the CPU you are targeting is not 64-bit since the build process needs more than 2GB memory.
+First, you may build on a real Arm-based device, on an x86_64 device with an emulator (such as QEMU), or on an x86_64 device with a Docker container that uses an emulator (you can run an Arm-based container on an x86_64 PC). The build instructions are then essentially the same as for Linux x86_64. However, this won't work if the CPU you are targeting is not 64-bit, since the build process needs more than 2GB of memory.

-* [Cross compiling for ARM with simulation (Linux/Windows)](#cross-compiling-for-arm-with-simulation-linuxwindows) - **Recommended**; Easy, slow, ARM64 only(no support for ARM32)
+* [Cross compiling for Arm-based devices with simulation (Linux/Windows)](#cross-compiling-for-arm-based-devices-with-simulation-linuxwindows) - **Recommended**; Easy, slow, ARM64 only (no support for ARM32)
 * [Cross compiling on Linux](#cross-compiling-on-linux) - Difficult, fast
 * [Cross compiling on Windows](#cross-compiling-on-windows)

-#### Cross compiling for ARM with simulation (Linux/Windows)
+#### Cross compiling for Arm-based devices with simulation (Linux/Windows)

 *EASY, SLOW, RECOMMENDED*

-This method relies on qemu user mode emulation. It allows you to compile using a desktop or cloud VM through instruction level simulation. You'll run the build on x86 CPU and translate every ARM instruction to x86. This is much faster than compiling natively on a low-end ARM device. The resulting ONNX Runtime Python wheel (.whl) file is then deployed to an ARM device where it can be invoked in Python 3 scripts. The build process can take hours, and may run of memory if the target CPU is 32-bit.
+This method relies on qemu user-mode emulation. It allows you to compile using a desktop or cloud VM through instruction-level simulation. You'll run the build on an x86 CPU and translate every Arm architecture instruction to x86. This is potentially much faster than compiling natively on a low-end device. The resulting ONNX Runtime Python wheel (.whl) file is then deployed to an Arm-based device where it can be invoked in Python 3 scripts. The build process can take hours, and may run out of memory if the target CPU is 32-bit.

 #### Cross compiling on Linux

@@ -420,12 +421,12 @@ This option is very fast and allows the package to be built in minutes, but is c
    You must also know what kind of flags your target hardware need, which can differ greatly. For example, if you just get the normal ARMv7 compiler and use it for Raspberry Pi V1 directly, it won't work because Raspberry Pi only has ARMv6. Generally every hardware vendor will provide a toolchain; check how that one was built.

-   A target env is identifed by:
+   A target env is identified by:

    * Arch: x86_32, x86_64, armv6,armv7,arvm7l,aarch64,...
    * OS: bare-metal or linux.
    * Libc: gnu libc/ulibc/musl/...
-   * ABI: ARM has mutilple ABIs like eabi, eabihf...
+   * ABI: Arm has multiple ABIs like eabi, eabihf...

    You can get all these information from the previous output, please be sure they are all correct.

@@ -584,8 +585,8 @@ This option is very fast and allows the package to be built in minutes, but is c

 **Using Visual C++ compilers**

-1. Download and install Visual C++ compilers and libraries for ARM(64).
-   If you have Visual Studio installed, please use the Visual Studio Installer (look under the section `Individual components` after choosing to `modify` Visual Studio) to download and install the corresponding ARM(64) compilers and libraries.
+1. Download and install Visual C++ compilers and libraries for Arm(64).
+   If you have Visual Studio installed, please use the Visual Studio Installer (look under the section `Individual components` after choosing to `modify` Visual Studio) to download and install the corresponding Arm(64) compilers and libraries.

 2. Use `.\build.bat` and specify `--arm` or `--arm64` as the build option to start building.
Preferably use `Developer Command Prompt for VS` or make sure all the installed cross-compilers are findable from the command prompt being used to build using the PATH environmant variable. diff --git a/docs/execution-providers/Vitis-AI-ExecutionProvider.md b/docs/execution-providers/Vitis-AI-ExecutionProvider.md index 655b563bcaff4..6e95434e2b7c5 100644 --- a/docs/execution-providers/Vitis-AI-ExecutionProvider.md +++ b/docs/execution-providers/Vitis-AI-ExecutionProvider.md @@ -27,9 +27,9 @@ The following table lists AMD targets that are supported by the Vitis AI ONNX Ru | **Architecture** | **Family** | **Supported Targets** | **Supported OS** | |---------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------| | AMD64 | Ryzen AI | AMD Ryzen 7040U, 7040HS | Windows | -| ARM64 Cortex-A53 | Zynq UltraScale+ MPSoC | ZCU102, ZCU104, KV260 | Linux | -| ARM64 Cortex-A72 | Versal AI Core / Premium | VCK190 | Linux | -| ARM64 Cortex-A72 | Versal AI Edge | VEK280 | Linux | +| Arm® Cortex®-A53 | Zynq UltraScale+ MPSoC | ZCU102, ZCU104, KV260 | Linux | +| Arm® Cortex®-A72 | Versal AI Core / Premium | VCK190 | Linux | +| Arm® Cortex®-A72 | Versal AI Edge | VEK280 | Linux | AMD Adaptable SoC developers can also leverage the Vitis AI ONNX Runtime Execution Provider to support custom (chip-down) designs. diff --git a/docs/execution-providers/Xnnpack-ExecutionProvider.md b/docs/execution-providers/Xnnpack-ExecutionProvider.md index c1900aa841860..f58929a0d6c1a 100644 --- a/docs/execution-providers/Xnnpack-ExecutionProvider.md +++ b/docs/execution-providers/Xnnpack-ExecutionProvider.md @@ -8,7 +8,7 @@ nav_order: 9 # XNNPACK Execution Provider -Accelerate ONNX models on Android/iOS devices and WebAssembly with ONNX Runtime and the XNNPACK execution provider. [XNNPACK](https://github.com/google/XNNPACK) is a highly optimized library of floating-point neural network inference operators for ARM, WebAssembly, and x86 platforms. +Accelerate ONNX models on Android/iOS devices and WebAssembly with ONNX Runtime and the XNNPACK execution provider. [XNNPACK](https://github.com/google/XNNPACK) is a highly optimized library of floating-point neural network inference operators for Arm®-based, WebAssembly, and x86 platforms. ## Contents {: .no_toc } diff --git a/docs/execution-providers/community-maintained/ACL-ExecutionProvider.md b/docs/execution-providers/community-maintained/ACL-ExecutionProvider.md index f894dcc86f1a1..02a0edf4e743d 100644 --- a/docs/execution-providers/community-maintained/ACL-ExecutionProvider.md +++ b/docs/execution-providers/community-maintained/ACL-ExecutionProvider.md @@ -10,14 +10,7 @@ redirect_from: /docs/reference/execution-providers/ACL-ExecutionProvider # ACL Execution Provider {: .no_toc } -The integration of ACL as an execution provider (EP) into ONNX Runtime accelerates performance of ONNX model workloads across Armv8 cores. [Arm Compute Library](https://github.com/ARM-software/ComputeLibrary){:target="_blank"} is an open source inference engine maintained by Arm and Linaro companies. - - -## Contents -{: .no_toc } - -* TOC placeholder -{:toc} +The ACL Execution Provider enables accelerated performance on Arm®-based CPUs through [Arm Compute Library](https://github.com/ARM-software/ComputeLibrary){:target="_blank"}. 
## Build @@ -30,10 +23,44 @@ For build instructions, please see the [build page](../../build/eps.md#arm-compu ``` Ort::Env env = Ort::Env{ORT_LOGGING_LEVEL_ERROR, "Default"}; Ort::SessionOptions sf; -bool enable_cpu_mem_arena = true; -Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_ACL(sf, enable_cpu_mem_arena)); +bool enable_fast_math = true; +Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_ACL(sf, enable_fast_math)); ``` The C API details are [here](../../get-started/with-c.html). +### Python +{: .no_toc } + +``` +import onnxruntime + +providers = [("ACLExecutionProvider", {"enable_fast_math": "true"})] +sess = onnxruntime.InferenceSession("model.onnx", providers=providers) +``` + ## Performance Tuning -When/if using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest){:target="_blank"}, use the flag -e acl +Arm Compute Library has a fast math mode that can increase performance with some potential decrease in accuracy for MatMul and Conv operators. It is disabled by default. + +When using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest){:target="_blank"}, use the flag `-e acl` to enable the ACL Execution Provider. You can additionally use `-i 'enable_fast_math|true'` to enable fast math. + +Arm Compute Library uses the ONNX Runtime intra-operator thread pool when running via the execution provider. You can control the size of this thread pool using the `-x` option. + +## Supported Operators + +|Operator|Supported types| +|---|---| +|AveragePool|float| +|BatchNormalization|float| +|Concat|float| +|Conv|float, float16| +|FusedConv|float| +|FusedMatMul|float, float16| +|Gemm|float| +|GlobalAveragePool|float| +|GlobalMaxPool|float| +|MatMul|float, float16| +|MatMulIntegerToFloat|uint8, int8, uint8+int8| +|MaxPool|float| +|NhwcConv|float| +|Relu|float| +|QLinearConv|uint8, int8, uint8+int8| diff --git a/docs/execution-providers/community-maintained/ArmNN-ExecutionProvider.md b/docs/execution-providers/community-maintained/ArmNN-ExecutionProvider.md index 57d07af02bc3a..e38a0a75ef92d 100644 --- a/docs/execution-providers/community-maintained/ArmNN-ExecutionProvider.md +++ b/docs/execution-providers/community-maintained/ArmNN-ExecutionProvider.md @@ -7,7 +7,7 @@ nav_order: 2 redirect_from: /docs/reference/execution-providers/ArmNN-ExecutionProvider --- -# ArmNN Execution Provider +# Arm NN Execution Provider {: .no_toc} ## Contents @@ -16,14 +16,14 @@ redirect_from: /docs/reference/execution-providers/ArmNN-ExecutionProvider * TOC placeholder {:toc} -Accelerate performance of ONNX model workloads across Armv8 cores with the ArmNN execution provider. [ArmNN](https://github.com/ARM-software/armnn) is an open source inference engine maintained by Arm and Linaro companies. +Accelerate performance of ONNX model workloads across Arm®-based devices with the Arm NN execution provider. [Arm NN](https://github.com/ARM-software/armnn) is an open source inference engine maintained by Arm and Linaro companies. ## Build -For build instructions, please see the [BUILD page](../../build/eps.md#armnn). +For build instructions, please see the [BUILD page](../../build/eps.md#arm-nn). ## Usage ### C/C++ -To use ArmNN as execution provider for inferencing, please register it as below. +To use Arm NN as execution provider for inferencing, please register it as below. 
``` Ort::Env env = Ort::Env{ORT_LOGGING_LEVEL_ERROR, "Default"}; Ort::SessionOptions so; diff --git a/docs/execution-providers/index.md b/docs/execution-providers/index.md index 1e2c13abcf67f..52687f6f48d2c 100644 --- a/docs/execution-providers/index.md +++ b/docs/execution-providers/index.md @@ -24,9 +24,9 @@ ONNX Runtime supports many different execution providers today. Some of the EPs |CPU|GPU|IoT/Edge/Mobile|Other| ---|---|---|--- |Default CPU|[NVIDIA CUDA](../execution-providers/CUDA-ExecutionProvider.md)|[Intel OpenVINO](../execution-providers/OpenVINO-ExecutionProvider.md)|[Rockchip NPU](../execution-providers/community-maintained/RKNPU-ExecutionProvider.md) (*preview*)| -|[Intel DNNL](../execution-providers/oneDNN-ExecutionProvider.md)|[NVIDIA TensorRT](../execution-providers/TensorRT-ExecutionProvider.md)|[ARM Compute Library](../execution-providers/community-maintained/ACL-ExecutionProvider.md) (*preview*)|[Xilinx Vitis-AI](../execution-providers/Vitis-AI-ExecutionProvider.md) (*preview*)| +|[Intel DNNL](../execution-providers/oneDNN-ExecutionProvider.md)|[NVIDIA TensorRT](../execution-providers/TensorRT-ExecutionProvider.md)|[Arm Compute Library](../execution-providers/community-maintained/ACL-ExecutionProvider.md) (*preview*)|[Xilinx Vitis-AI](../execution-providers/Vitis-AI-ExecutionProvider.md) (*preview*)| |[TVM](../execution-providers/community-maintained/TVM-ExecutionProvider.md) (*preview*)|[DirectML](../execution-providers/DirectML-ExecutionProvider.md)|[Android Neural Networks API](../execution-providers/NNAPI-ExecutionProvider.md)|[Huawei CANN](../execution-providers/community-maintained/CANN-ExecutionProvider.md) (*preview*)| -|[Intel OpenVINO](../execution-providers/OpenVINO-ExecutionProvider.md)|[AMD MIGraphX](../execution-providers/MIGraphX-ExecutionProvider.md)|[ARM-NN](../execution-providers/community-maintained/ArmNN-ExecutionProvider.md) (*preview*)|[AZURE](../execution-providers/Azure-ExecutionProvider.md) (*preview*)| +|[Intel OpenVINO](../execution-providers/OpenVINO-ExecutionProvider.md)|[AMD MIGraphX](../execution-providers/MIGraphX-ExecutionProvider.md)|[Arm NN](../execution-providers/community-maintained/ArmNN-ExecutionProvider.md) (*preview*)|[AZURE](../execution-providers/Azure-ExecutionProvider.md) (*preview*)| |[XNNPACK](../execution-providers/Xnnpack-ExecutionProvider.md)|[Intel OpenVINO](../execution-providers/OpenVINO-ExecutionProvider.md)|[CoreML](../execution-providers/CoreML-ExecutionProvider.md) (*preview*)| ||[AMD ROCm](../execution-providers/ROCm-ExecutionProvider.md)|[TVM](../execution-providers/community-maintained/TVM-ExecutionProvider.md) (*preview*)| ||[TVM](../execution-providers/community-maintained/TVM-ExecutionProvider.md) (*preview*)|[Qualcomm QNN](../execution-providers/QNN-ExecutionProvider.md)| diff --git a/docs/get-started/with-python.md b/docs/get-started/with-python.md index ba7ba27baa2d6..7ff3d1048c58d 100644 --- a/docs/get-started/with-python.md +++ b/docs/get-started/with-python.md @@ -22,7 +22,7 @@ There are two Python packages for ONNX Runtime. Only one of these packages shoul ### Install ONNX Runtime CPU -Use the CPU package if you are running on Arm CPUs and/or macOS. +Use the CPU package if you are running on Arm®-based CPUs and/or macOS. 
```bash pip install onnxruntime diff --git a/docs/performance/model-optimizations/quantization.md b/docs/performance/model-optimizations/quantization.md index 961cef10c6972..ae49e591d94ca 100644 --- a/docs/performance/model-optimizations/quantization.md +++ b/docs/performance/model-optimizations/quantization.md @@ -202,7 +202,7 @@ ONNX Runtime quantization on GPU only supports S8S8. On x86-64 machines with AVX2 and AVX512 extensions, ONNX Runtime uses the VPMADDUBSW instruction for U8S8 for performance. This instruction might suffer from saturation issues: it can happen that the output does not fit into a 16-bit integer and has to be clamped (saturated) to fit. Generally, this is not a big issue for the final result. However, if you do encounter a large accuracy drop, it may be caused by saturation. In this case, you can either try [reduce_range](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/quantize.py) or the U8U8 format which doesn't have saturation issues. -There is no such issue on other CPU architectures (x64 with VNNI and ARM). +There is no such issue on other CPU architectures (x64 with VNNI and Arm®). ### List of Supported Quantized Ops {: .no_toc} @@ -290,7 +290,7 @@ For AWQ and GTPQ quantization usage, please refer to [Gen-AI model builder](http The performance improvement depends on your model and hardware. The performance gain from quantization has two aspects: compute and memory. Old hardware has none or few of the instructions needed to perform efficient inference in int8. And quantization has overhead (from quantizing and dequantizing), so it is not rare to get worse performance on old devices. -x86-64 with VNNI, GPU with Tensor Core int8 support and ARM with dot-product instructions can get better performance in general. +x86-64 with VNNI, GPU with Tensor Core int8 support and Arm®-based processors with dot-product instructions can get better performance in general. ### Which quantization method should I choose, dynamic or static? {: .no_toc} diff --git a/src/routes/getting-started/table.svelte b/src/routes/getting-started/table.svelte index b47ec1fa21fc6..e3cdd46ccc2e1 100644 --- a/src/routes/getting-started/table.svelte +++ b/src/routes/getting-started/table.svelte @@ -20,7 +20,7 @@ 'QNN', 'Tensor RT', 'ACL (Preview)', - 'ArmNN (Preview)', + 'Arm NN (Preview)', 'Azure (Preview)', 'CANN (Preview)', 'Rockchip NPU (Preview)',
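The quantization changes above mention trying `reduce_range` or the U8U8 format when U8S8 saturation causes an accuracy drop on AVX2/AVX512 machines. Below is a minimal illustrative sketch of those two options with the `onnxruntime.quantization` API; the model path, input name, tensor shape, and calibration data are placeholder assumptions, not part of the documented changes.

```python
# Illustrative sketch only: working around U8S8 saturation, either by
# reducing the weight range or by switching to U8U8. The model path,
# input name, and calibration samples below are placeholder assumptions.
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static


class SampleDataReader(CalibrationDataReader):
    """Feeds a handful of representative inputs for calibration."""

    def __init__(self, samples):
        self._iter = iter(samples)

    def get_next(self):
        # Return a dict of input name -> numpy array, or None when exhausted.
        return next(self._iter, None)


# Placeholder calibration data; use real representative inputs in practice.
samples = [{"input": np.zeros((1, 3, 224, 224), dtype=np.float32)}]

# Option 1: stay with U8S8 but quantize weights with a reduced (7-bit) range.
quantize_static(
    "model.onnx",
    "model_u8s8_reduced.onnx",
    SampleDataReader(samples),
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
    reduce_range=True,
)

# Option 2: switch to U8U8, which avoids the VPMADDUBSW saturation path.
quantize_static(
    "model.onnx",
    "model_u8u8.onnx",
    SampleDataReader(samples),
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QUInt8,
)
```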