
Ort ovep 1.17 npu #333

Closed
wants to merge 226 commits into from

Conversation

preetha-intel

Description

Add support for the NPU plugin in OpenVINO.

sspintel and others added 30 commits November 3, 2023 00:25
Add an option to disable the refiner and only run the base model.
### Description
Support uniforms in Slice op



### Motivation and Context
Improve performance.
Free memory for cuDNN/cuBLAS instances at TRT EP destruction.
microsoft#18466
…soft#18442)

### Description
Allow empty shapes for inputs/outputs and do not validate them in
InferenceSession::ValidateInputsOutputs().

### Motivation and Context
microsoft#17301 disallowed empty
shapes.
However, many models depend on them as a way to pass shapes of different
ranks.
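
As an illustration (not taken from the PR), a model whose graph input is declared without an explicit shape can be fed tensors of different ranks at run time; a minimal sketch using the ONNX helper API and onnxruntime:

```python
# A minimal sketch (not from the PR): build a tiny Identity model whose graph
# input is declared without an explicit shape, then feed tensors of different
# ranks at run time.
import numpy as np
import onnxruntime as ort
from onnx import TensorProto, helper

inp = helper.make_tensor_value_info("x", TensorProto.FLOAT, None)  # no dims declared
out = helper.make_tensor_value_info("y", TensorProto.FLOAT, None)
node = helper.make_node("Identity", ["x"], ["y"])
graph = helper.make_graph([node], "g", [inp], [out])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
model.ir_version = 8  # keep compatible with older runtimes

sess = ort.InferenceSession(model.SerializeToString())
for x in (np.float32(1.0), np.zeros((3,), np.float32), np.zeros((2, 2), np.float32)):
    print(sess.run(None, {"x": np.asarray(x)})[0].shape)  # (), (3,), (2, 2)
```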
…soft#18357)

### Description
Update usability checker and related infrastructure to support checking
models > 2GB.
- Add the ability to set a flag to keep initializers as external data
  - we optimize the model as part of the checking, so we need to write out a new copy
- Handle an issue with ONNX shape inferencing silently failing
  - use an API that supports large models but requires writing the model to a new file
  - automate cleanup of that copy of the model

### Motivation and Context
Allow analysis of LLMs to determine gaps for mobile usage.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
…antized (microsoft#18469)

### Description
QNN cannot run MatMul on v68 if both inputs are dynamic and uint16-quantized. Make it run by inserting a Convert op to convert one input to int8.
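
Conceptually, the inserted Convert requantizes one input; a rough numpy illustration (not the QNN implementation) with made-up quantization parameters:

```python
# Rough numpy illustration (not the QNN implementation) of what the inserted
# Convert does: requantize a uint16-quantized tensor to int8, with made-up
# scales and zero points.
import numpy as np

def requantize_u16_to_s8(q_u16, scale_u16, zp_u16, scale_s8, zp_s8):
    real = (q_u16.astype(np.int32) - zp_u16) * scale_u16   # dequantize to float
    q_s8 = np.round(real / scale_s8) + zp_s8                # requantize onto the int8 grid
    return np.clip(q_s8, -128, 127).astype(np.int8)

q = np.array([0, 32768, 65535], dtype=np.uint16)
print(requantize_u16_to_s8(q, scale_u16=2e-4, zp_u16=32768, scale_s8=0.05, zp_s8=0))
```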
### Description
Implement a preliminary version of local (sliding window) attention.
It is currently only supported by Flash Attention (sm >= 80, Linux) and currently
only supports sliding attention with a large cached KV.



### Motivation and Context
This change enables running Mistral and other models that use sliding
window attention.
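
The masking idea can be shown with a toy numpy sketch (not the Flash Attention kernel; `window` is a hypothetical parameter name):

```python
# A toy numpy sketch of the sliding-window masking idea: each query position
# attends only to the last `window` key positions.
import numpy as np

def sliding_window_mask(seq_len, window):
    idx = np.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]          # no attending to future positions
    local = idx[:, None] - idx[None, :] < window   # keep only the last `window` keys
    return causal & local

scores = np.random.randn(6, 6).astype(np.float32)
masked = np.where(sliding_window_mask(6, window=3), scores, -np.inf)
probs = np.exp(masked - masked.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)         # rows sum to 1; only local positions contribute
```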
…osoft#18449)

It's possible that a subgraph of the "If" control flow op has no nodes.
The TRT EP should consider this kind of subgraph fully supported by TRT.

The Faster R-CNN model mentioned in this issue
microsoft#17434 is such a case.
…microsoft#18477)

### Description
Always run emsdk_env.sh before build.py, even when ccache is disabled

This is a follow-up to microsoft#18434, which didn't handle the case where
ccache was disabled.
### Description

Change the RotaryEmbedding op implementation to add support for 4D input
tensors with shape [batch, num_heads, seq_len, head_size].

### Motivation and Context
The current RotaryEmbedding op only supports 3D input tensors with shape
[batch, seq_len, hidden_size].

For the LLaMA-v2 model, when using FusionRotaryEmbeddings to fuse only the
RotaryEmbeddings op, there is a transpose operation for query and key, so the
input tensor of RotaryEmbeddings becomes 4D [batch, num_heads, seq_len, head_size].

The current RotaryEmbeddings implementation cannot handle this scenario, so it
needs to support 4D input tensors.
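
A minimal numpy sketch of rotary embeddings applied to a 4D tensor; the real contrib op additionally takes position ids and cos/sin caches and has interleaving options that this ignores:

```python
# A minimal numpy sketch of rotary embeddings on a 4D tensor
# [batch, num_heads, seq_len, head_size]; for shape illustration only.
import numpy as np

def rotary_4d(x, base=10000.0):
    b, h, s, d = x.shape
    half = d // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.arange(s)[:, None] * inv_freq[None, :]   # [seq_len, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(2, 8, 16, 64).astype(np.float32)     # [batch, num_heads, seq_len, head_size]
print(rotary_4d(q).shape)                                 # (2, 8, 16, 64)
```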
### Description
This is a narrow implementation of Attention/MultiHeadAttention as it
does not support:
a. inputs 5-7 for MHA
b. packed QKV/KV
c. past/present
d. attention mask

But it works well for Stable Diffusion and can be extended later. It
reduces VRAM usage as it combines many ops into a few.
I've updated the demo at https://islamov.ai/stable-diffusion-webgpu/ and it
takes ~13 s for one image with 20 steps on an RTX 3090 Ti and about 25 s on an
M1 Pro.
VRAM usage is about 8 GB if you don't use img2img.

Going to focus on SDXL now.

---------

Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Similar to microsoft#17852


### Motivation and Context
To avoid downloading NDK
…7726)

### Description
This PR addresses microsoft#17652.
The deprecated `MLMultiArray.dataPointer` is replaced with
`.getBytesWithHandler`, as suggested by the docs.
For now, I am only checking that the output `MLMultiArray` is
contiguous, returning unsupported operation when that is not the case.
I think this is already better than what we have right now, so we can
block unsafe calls to `.dataPointer` (if any..).

I would be happy to implement the handling of the non-contiguous case
(replacing `memcpy` for such cases) as suggested by @edgchen1, but I am
not sure how to reproduce that case to add a corresponding unit-test.
Would we have to define a custom `MLCustomLayer` to get a non-contiguous
output from a model..?

### Motivation and Context
Fix microsoft#17652.

---------

Co-authored-by: nicolo-lucchesi <nicolo.lucchesi@hexagon.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
… utils (microsoft#18333)

### Description
Motivation for this PR is code cleanup.

1. Remove all deprecated python code related to orttrainer, old
checkpoint, related tests and utils
2. Cleanup orttraining_pybind_state.cc to remove all deprecated
bindings.
### Description
[js] update a few packages

- update semver
- update reference of onnx_proto to local folder in order to upgrade
protobufjs@7.2.4

Resolve AB#18513
### Description
It causes our "NPM Packaging Pipeline" to fail.


### Motivation and Context
…icrosoft#18478)

Move the data member in LiteOpFunc to its parent to avoid possible memory
leaks.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
Add conformer-transducer model type to optimizer. This PR adds pattern
matches for attention shown below:
Unfused attention:

![ct_unfused](https://github.com/microsoft/onnxruntime/assets/111780983/46c71ed8-67e0-4607-85b1-bcadba5a2956)

Fused attention:

![ct_fused](https://github.com/microsoft/onnxruntime/assets/111780983/fbb91c96-0d4b-4f0b-8674-1ae3b9b9a92e)
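
A hypothetical usage sketch of the transformers optimizer with the new model type; the string "conformer", the file names, and the head/hidden sizes below are assumptions, not taken from this PR:

```python
# Hypothetical usage sketch; check the optimizer documentation for the exact
# registered model_type name and your model's num_heads/hidden_size.
from onnxruntime.transformers import optimizer

opt_model = optimizer.optimize_model(
    "conformer_transducer.onnx",   # hypothetical input path
    model_type="conformer",        # assumed name under which the new patterns are registered
    num_heads=8,
    hidden_size=512,
)
opt_model.save_model_to_file("conformer_transducer_opt.onnx")
```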
Recent PyTorch breaks the DORT CI, and [a patch](pytorch/pytorch#113697) has
been merged into PyTorch main. In order to update DORT's CI, we made a dummy
change in this PR.
microsoft#18484)

### Description
Add bfloat16 support for `MatMulBnb4` contrib op. This is useful for
QLoRA fine-tuning.
- On GPUs with SM80+ (A100, etc), it uses the native cuda bfloat16
dtype, `nv_bfloat16`. On other GPUs, it uses the onnxruntime `BFloat16`
type which uses float for compute.
- I have validated the op in a llama2-7b training scenario. The losses
match pytorch training and the training throughput is better.
- Cannot add a bfloat16 case in the op unit test since casting BFloat16
to and from float multiple times during the test causes the required
tolerances to be unachievable.

The custom autograd function exporter in onnxruntime-training is updated
to support the latest version of bitsandbytes. They changed how the
`quant_state` is stored.

### Motivation and Context
Enable QLoRA fine-tuning with bfloat16.
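
The tolerance issue can be seen with a small numpy sketch of a BFloat16 round trip (a simplification: it truncates instead of rounding, but the precision loss is the point):

```python
# A small numpy sketch of why BFloat16 <-> float casts loosen test tolerances:
# bfloat16 keeps only the top 16 bits of a float32 (this sketch truncates,
# real conversion rounds), so a value can move by up to ~0.4% per cast.
import numpy as np

def to_bfloat16(x):
    bits = np.asarray(x, dtype=np.float32).copy().view(np.uint32)
    bits &= np.uint32(0xFFFF0000)            # drop the low 16 mantissa bits
    return bits.view(np.float32)

x = np.array([1.2345678], dtype=np.float32)
bf = to_bfloat16(x)
print(x[0], bf[0], abs(bf[0] - x[0]) / x[0])  # original, bf16 value, relative error
```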
### Description
Optimize the eslint config to:
- set parserOptions.project to `true` to allow @typescript-eslint/parser
to find the nearest tsconfig.json file to each source file. This helps
to avoid parsing extra files and may help with:
   - reducing the possibility of seeing OOM or stack overflow with "npm run
lint"
   - faster processing
- enforce the rule "no-underscore-dangle" with a list of exceptions.
…args (microsoft#18462)

### Description
Truncate trailing non-existing arguments. Make sure we do not skip
non-existing arguments in the middle, because shape inference relies on their
proper position.
This also affects the argument positions in the edges, which must be properly
rebuilt each time an If node branch is inlined.
Make sure that when we rename defs in subgraphs, the new renamed defs are
created in those subgraphs instead of pointing to outer-scope defs.
Add a unit test.

### Motivation and Context
This is a follow up for
microsoft#18105
Currently, the non-trailing arguments are simply ignored and the edges
are created
with potentially incorrect positions.
RandySheriffH and others added 23 commits December 15, 2023 14:57
Use iterator to refer to the set.

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
…oft#18833)

### Description
Build function bodies according to the imported global opset.
The same applies to querying ONNX functions.

### Motivation and Context
This addresses issues:
microsoft#18781
microsoft#16438
…on-the-fly (microsoft#18847)

### Description
Change Nuget packaging pipeline's build TRT job to download CUDA SDK
on-the-fly, so that we do not need to put a CUDA SDK in the build
machine's image.
### Description
<!-- Describe your changes. -->
Update deprecated TRT APIs:
1.
[setMaxWorkspaceSize](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_builder_config.html#a8209999988ab480c60c8a905dfd2654d)(max_workspace_size_) → setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE,
max_workspace_size_)
2.
[kENABLE_TACTIC_HEURISTIC](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#abdc74c40fe7a0c3d05d2caeccfbc29c1a1215692ad24465e4d9e37a8a7fce1a38) → superseded
by TRT builder optimization level 2

Perf & warning log comparison:

TRT EP options | User will see corresponding warning logs: | Average inference time cost (FRCNN on A100)
-- | -- | --
trt_build_heuristics_enable\|true | [TensorRT EP] trt_build_heuristics_enable is deprecated on TRT 8.6 onwards. Please set builder optimization level as 2 to enable builder heuristics. | ~300ms
trt_build_heuristics_enable\|true, trt_builder_optimization_level\|2 | [TensorRT EP] Builder heuristics are enabled automatically by builder optimization level 2. trt_build_heuristics_enable is deprecated on TRT 8.6 onwards. | ~275ms
trt_builder_optimization_level\|2 | | ~275ms




### Motivation and Context
Prepare for upcoming TRT 10
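
For reference, a hedged sketch of the replacement settings via the TensorRT Python API (the ORT TRT EP does this through the C++ API; the workspace size below is arbitrary):

```python
# Sketch of the non-deprecated settings using the TensorRT Python API.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# replaces the deprecated setMaxWorkspaceSize / max_workspace_size
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)  # 2 GiB

# replaces the deprecated kENABLE_TACTIC_HEURISTIC flag (TRT >= 8.6)
config.builder_optimization_level = 2
```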
### Description
The casing of Podfile is incorrect in the plugin. This causes issues
when building iOS on case-sensitive systems such as Linux.

### Motivation and Context
iOS cannot be built on case-sensitive systems otherwise.
### Description
Fixes a failure in the ortmodule nightly pipeline. 



### Motivation and Context
### Description
Improve MLAS to support high-performance x64 INT4 kernels



### Motivation and Context
1. improve LLM inference performance on Intel CPUs.
2. support more 4bit quantization types: nf4, fp4
3. support dynamic block size: block size aligned with the kernel's tiling
size (e.g., 4 for the VNNI kernel), or per-channel on the N dimension
4. support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni,
amx_bf16, amx_int8, avx512_fp16
5. support MatMulNBits' data format

### Tasks
- [x] support block_size: 32, 128, -1(per channel)
- [x] get weight pack size without memory allocation
- [x] use ort's thread pool for parallelism
- [x] support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8
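
For illustration, a toy numpy sketch of symmetric 4-bit block quantization with block_size 32 along K; it only shows the per-block scale idea, not the MLAS kernels:

```python
# Toy numpy sketch of symmetric int4 block quantization (block_size=32 along K).
import numpy as np

def quantize_q4_sym(w, block_size=32):
    k, n = w.shape
    blocks = w.reshape(k // block_size, block_size, n)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0        # symmetric int4 range [-7, 7]
    q = np.clip(np.round(blocks / scale), -7, 7).astype(np.int8)
    return q.reshape(k, n), scale.squeeze(1)                        # scales: [k // block_size, n]

w = np.random.randn(4096, 4096).astype(np.float32)
q, scales = quantize_q4_sym(w)
w_hat = (q.reshape(-1, 32, 4096) * scales[:, None, :]).reshape(4096, 4096)
print(np.abs(w - w_hat).max())                                      # quantization error
```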

### Benchmark
Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+ 56 cores

Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 47613 | 47401 | 12970
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time | 6347792 | 6317562 | 109
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time | 11814014 | 11757847 | 59
Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 50222 | 50031 | 13759
Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time | 2038222 | 2028743 | 341
Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time | 3792832 | 3774485 | 191
Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 58717 | 58501 | 11467
Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time | 1360846 | 1354598 | 543
Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time | 2564232 | 2551365 | 266
Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 57929 | 57694 | 12047
Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5495330 | 5465810 | 126
Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 10676240 | 10617817 | 66
Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 68305 | 68047 | 10026
Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5504862 | 5476215 | 126
Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 11758623 | 11697337 | 66
Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 67713 | 67451 | 10298
Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5508325 | 5480237 | 126
Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 10738528 | 10681656 | 64
Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 60708 | 60486 | 11321
Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5523784 | 5495736 | 126
Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 10829633 | 10772161 | 67


Reference:

Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time | 53088 | 52911 | 13364
Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time | 6268981 | 6230335 | 110
Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time | 11701237 | 11632339 | 59

Win11+12900K 8 cores:
Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time | 215976 | 211295 | 2884
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time | 60960590 | 60937500 | 10
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time | 1.18E+08 | 1.19E+08 | 5
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time | 470377 | 453059 | 1414
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time | 1.54E+08 | 1.53E+08 | 5
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time | 3.18E+08 | 3.13E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time | 569072 | 559398 | 1229
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time | 1.54E+08 | 1.52E+08 | 4
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time | 3.22E+08 | 3.28E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time | 1486055 | 1473325 | 403
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time | 4.14E+08 | 4.14E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time | 8.88E+08 | 8.59E+08 | 1

---------

Signed-off-by: Mengni Wang <mengni.wang@intel.com>
Co-authored-by: Mengni Wang <mengni.wang@intel.com>
### Description
DFT is updated in opset 20; implement it in ORT.



### Motivation and Context
This is for the ORT 1.17.0 release.

Fixes microsoft#17723
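
For reference, a naive numpy sketch of the core transform a DFT kernel computes along one axis (the opset-20 op also covers inverse/onesided modes and complex inputs, which this ignores):

```python
# Naive numpy reference for a forward DFT along one axis.
import numpy as np

def naive_dft(x, axis=-1):
    x = np.moveaxis(x, axis, -1)
    n = x.shape[-1]
    k = np.arange(n)
    twiddle = np.exp(-2j * np.pi * np.outer(k, k) / n)   # [n, n] DFT matrix
    return np.moveaxis(x @ twiddle, -1, axis)

x = np.random.randn(2, 8).astype(np.float32)
print(np.allclose(naive_dft(x), np.fft.fft(x, axis=-1)))  # True up to fp tolerance
```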

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
Check whether the min/max inputs are provided and use default values if not provided.


### Motivation and Context
### Description
Support manually disposing a session in onnxruntime-node.

feature request: microsoft#16796
### Description
1. Add a CodeSign validation task before the binaries are published, to
make sure all DLL files are signed.
2. Auto-trigger the CUDA 12 pipeline's publishing job.
### Description
Add LeakyRelu to the list as support was added a while ago. 


### Motivation and Context
@preetha-intel preetha-intel deleted the branch openvino-ep-5.2 January 30, 2024 08:10