Enable serializing prepacked weights into data file #22256

Open
wants to merge 13 commits into main
Conversation

frank-dong-ms
Contributor

Description

Part of #21448.
This change is intended to save CPU memory during model load for inference. It adds the session option save_prepacked_constant_initializers. With save_prepacked_constant_initializers turned on (a usage sketch follows the list below):

  1. Optimize the model with an inference session; prepacked external initializers are saved into the data file.
  2. Load the optimized model and the external data file containing the prepacked initializers; no prepacking is needed.
  3. Run inference with the optimized model and data file.
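
A minimal sketch of this workflow with the ONNX Runtime C++ API, assuming the new option is exposed as a session config entry named session.save_prepacked_constant_initializers (that key name, the file names, and the size threshold below are illustrative assumptions, not taken from this PR; the external-initializer config keys are the existing ones for writing large initializers to an external file):

#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "prepack_demo");

  // Step 1: optimize the model once and write prepacked initializers into the data file.
  Ort::SessionOptions opts;
  // Assumed key for the option added in this PR.
  opts.AddConfigEntry("session.save_prepacked_constant_initializers", "1");
  // Existing config keys for saving large initializers to an external data file.
  opts.AddConfigEntry("session.optimized_model_external_initializers_file_name",
                      "model_optimized.onnx.data");
  opts.AddConfigEntry("session.optimized_model_external_initializers_min_size_in_bytes", "1024");
  opts.SetOptimizedModelFilePath(ORT_TSTR("model_optimized.onnx"));
  Ort::Session optimize_session(env, ORT_TSTR("model.onnx"), opts);

  // Steps 2-3: later runs load the optimized model; the already-prepacked weights are
  // mmap-ed from the data file, so no prepacking (and no extra heap copy) happens.
  Ort::SessionOptions run_opts;
  Ort::Session session(env, ORT_TSTR("model_optimized.onnx"), run_opts);
  // ... run inference with session.Run(...) as usual ...
  return 0;
}

On subsequent loads, the prepacked initializers come straight from the mmap-ed data file instead of being repacked onto the heap, which is where the memory saving comes from.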

Tested with the Phi-3-mini-instruct-onnx model.

With ORT 1.12.0: [memory usage screenshot]

With this change: [memory usage screenshot]

Peak memory usage dropped from 5.438 GB to 2.726 GB.
This change takes advantage of the fact that ORT loads external initializers with mmap on CPU. Prepacking uses extra heap memory; skipping the prepack step saves that memory (roughly the same size as the external initializers).

Next step:
Update all CPU kernels that implement the PrePack method and test them properly. This will be done in the next PR.

Motivation and Context

import os

import numpy as np
import onnx

Check notice

Code scanning / CodeQL

Module is imported with 'import' and 'import from' (Note, test)

Module 'onnx' is imported with both 'import' and 'import from'.
Module 'onnxruntime.test.onnx' is imported with both 'import' and 'import from'.
void MatMulNBits<T1>::ConvertPrepackWeightIntoTensor(const onnxruntime::Tensor& tensor) {
  if (!packed_tensor_) {
    std::vector<int64_t> weights_dims = {static_cast<int64_t>((packed_b_size_ - 1) / tensor.DataType()->Size()) + 1};
    packed_tensor_ = new Tensor(tensor.DataType(),

Check warning

Code scanning / PREfast

Avoid calling new and delete explicitly, use std::make_unique instead (r.11). Warning
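
A minimal sketch of the change PREfast is suggesting for the excerpt above, assuming packed_tensor_ is switched from a raw pointer to a std::unique_ptr<Tensor> member and that the constructor argument elided by the annotation is the tensor shape built from weights_dims (both assumptions, not confirmed by the excerpt):

#include <memory>

// Assumed member declaration elsewhere in the class:
//   std::unique_ptr<Tensor> packed_tensor_;

template <typename T1>
void MatMulNBits<T1>::ConvertPrepackWeightIntoTensor(const onnxruntime::Tensor& tensor) {
  if (!packed_tensor_) {
    std::vector<int64_t> weights_dims = {
        static_cast<int64_t>((packed_b_size_ - 1) / tensor.DataType()->Size()) + 1};
    // std::make_unique replaces the raw new flagged by PREfast (r.11) and makes
    // ownership of the wrapping Tensor explicit; packed_b_ still owns the data buffer.
    packed_tensor_ = std::make_unique<Tensor>(tensor.DataType(),
                                              TensorShape(weights_dims),
                                              packed_b_.get(),
                                              OrtMemoryInfo(CPU, OrtAllocatorType::OrtDeviceAllocator));
  }
}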
                                packed_b_.get(),
                                OrtMemoryInfo(CPU, OrtAllocatorType::OrtDeviceAllocator));
  } else {
    packed_tensor_ = new Tensor(packed_tensor_->DataType(),

Check warning

Code scanning / PREfast

Avoid calling new and delete explicitly, use std::make_unique instead (r.11). Warning