Enable serializing prepacked weights into data file #22256

Open
wants to merge 13 commits into main
Conversation

frank-dong-ms
Contributor

Description

Part of #21448.
This change is intended to save CPU memory during model load for inference. It adds the session option save_prepacked_constant_initializers. With save_prepacked_constant_initializers turned on (a usage sketch follows the list below):

  1. Optimize the model with an inference session; prepacked external initializers are saved into the data file.
  2. Load the optimized model and the external data file containing the prepacked initializers; no prepacking is needed.
  3. Run inference with the optimized model and data file.
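
A minimal sketch of this workflow with the ONNX Runtime C++ API, assuming the new option is exposed as a session config entry named session.save_prepacked_constant_initializers (that key name, the file names, and the size threshold below are illustrative assumptions, not taken from this PR; the external-initializer config keys are the existing ones for writing large initializers to an external file):

#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "prepack_demo");

  // Step 1: optimize the model once and write prepacked initializers into the data file.
  Ort::SessionOptions opts;
  // Assumed key for the option added in this PR.
  opts.AddConfigEntry("session.save_prepacked_constant_initializers", "1");
  // Existing config keys for saving large initializers to an external data file.
  opts.AddConfigEntry("session.optimized_model_external_initializers_file_name",
                      "model_optimized.onnx.data");
  opts.AddConfigEntry("session.optimized_model_external_initializers_min_size_in_bytes", "1024");
  opts.SetOptimizedModelFilePath(ORT_TSTR("model_optimized.onnx"));
  Ort::Session optimize_session(env, ORT_TSTR("model.onnx"), opts);

  // Steps 2-3: later runs load the optimized model; the already-prepacked weights are
  // mmap-ed from the data file, so no prepacking (and no extra heap copy) happens.
  Ort::SessionOptions run_opts;
  Ort::Session session(env, ORT_TSTR("model_optimized.onnx"), run_opts);
  // ... run inference with session.Run(...) as usual ...
  return 0;
}

On subsequent loads, the prepacked initializers come straight from the mmap-ed data file instead of being repacked onto the heap, which is where the memory saving comes from.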

Tested with the Phi-3-mini-instruct-onnx model.

With ORT 1.12.0: [memory usage screenshot]

With this change: [memory usage screenshot]

Peak memory usage dropped from 5.438 GB to 2.726 GB.
This change takes advantage of the fact that ORT loads external initializers with mmap on CPU. Prepacking uses extra heap memory; skipping the prepack step saves that memory (roughly the same size as the external initializers).

Next step:
Update all CPU kernels that implement the PrePack method and test them properly. This will be done in the next PR.

Motivation and Context

import os

import numpy as np
import onnx

Check notice

Code scanning / CodeQL

Module is imported with 'import' and 'import from' (Note, test)

Module 'onnx' is imported with both 'import' and 'import from'.
Module 'onnxruntime.test.onnx' is imported with both 'import' and 'import from'.
void MatMulNBits<T1>::ConvertPrepackWeightIntoTensor(const onnxruntime::Tensor& tensor) {
  if (!packed_tensor_) {
    std::vector<int64_t> weights_dims = {static_cast<int64_t>((packed_b_size_ - 1) / tensor.DataType()->Size()) + 1};
    packed_tensor_ = new Tensor(tensor.DataType(),

Check warning

Code scanning / PREfast

Avoid calling new and delete explicitly, use std::make_unique instead (r.11). Warning
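
A minimal sketch of the change PREfast is suggesting for the excerpt above, assuming packed_tensor_ is switched from a raw pointer to a std::unique_ptr<Tensor> member and that the constructor argument elided by the annotation is the tensor shape built from weights_dims (both assumptions, not confirmed by the excerpt):

#include <memory>

// Assumed member declaration elsewhere in the class:
//   std::unique_ptr<Tensor> packed_tensor_;

template <typename T1>
void MatMulNBits<T1>::ConvertPrepackWeightIntoTensor(const onnxruntime::Tensor& tensor) {
  if (!packed_tensor_) {
    std::vector<int64_t> weights_dims = {
        static_cast<int64_t>((packed_b_size_ - 1) / tensor.DataType()->Size()) + 1};
    // std::make_unique replaces the raw new flagged by PREfast (r.11) and makes
    // ownership of the wrapping Tensor explicit; packed_b_ still owns the data buffer.
    packed_tensor_ = std::make_unique<Tensor>(tensor.DataType(),
                                              TensorShape(weights_dims),
                                              packed_b_.get(),
                                              OrtMemoryInfo(CPU, OrtAllocatorType::OrtDeviceAllocator));
  }
}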
                                packed_b_.get(),
                                OrtMemoryInfo(CPU, OrtAllocatorType::OrtDeviceAllocator));
  } else {
    packed_tensor_ = new Tensor(packed_tensor_->DataType(),

Check warning

Code scanning / PREfast

Avoid calling new and delete explicitly, use std::make_unique instead (r.11). Warning