enable serialize prepacked weights into data file #22256
base: main
Conversation
void MatMulNBits<T1>::ConvertPrepackWeightIntoTensor(const onnxruntime::Tensor& tensor) {
  if (!packed_tensor_) {
    std::vector<int64_t> weights_dims = {static_cast<int64_t>((packed_b_size_ - 1) / tensor.DataType()->Size()) + 1};
    packed_tensor_ = new Tensor(tensor.DataType(),
Check warning
Code scanning / PREfast
Avoid calling new and delete explicitly, use std::make_unique instead (r.11). Warning
      packed_b_.get(),
      OrtMemoryInfo(CPU, OrtAllocatorType::OrtDeviceAllocator));
  } else {
    packed_tensor_ = new Tensor(packed_tensor_->DataType(),
Check warning
Code scanning / PREfast
Avoid calling new and delete explicitly, use std::make_unique instead (r.11). Warning
Description
part of #21448
This change is intended to save CPU memory during model load for inference.
Added a session option save_prepacked_constant_initializers; with save_prepacked_constant_initializers turned on:
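As a sketch of how such a session option could be set from Python — the config key string below is an assumption inferred from the option name in this PR, not confirmed by it; check the merged session-option constants for the canonical key:

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Hypothetical key derived from the option name save_prepacked_constant_initializers.
so.add_session_config_entry("session.save_prepacked_constant_initializers", "1")
sess = ort.InferenceSession("model.onnx", sess_options=so)
```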
Tested with model Phi-3-mini-instruct-onnx, comparing ORT 1.12.0 to this change: peak memory usage dropped from 5.438 GB to 2.726 GB.
This change takes advantage of ORT loading external initializers with mmap on CPU. Prepacking uses extra memory on the heap, so omitting the prepack process saves that memory (roughly the same size as the external initializers).
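To illustrate why mmap-backed loading avoids heap copies, here is a minimal standard-library Python sketch (file contents and sizes are illustrative, not from the PR): the mapped pages are faulted in on demand and backed by the page cache rather than duplicated onto the process heap.

```python
import mmap
import os
import tempfile

# Write a small stand-in "external data" file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\x01" * 4096)

# Map the file read-only instead of read()-ing it into a heap buffer.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)  # zero-copy view into the mapping
    first = view[0]
    view.release()
    mm.close()
os.remove(path)
print(first)  # → 1
```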
Next step:
Update all CPU kernels that implement the PrePack method and test them properly. Will do in the next PR.
Motivation and Context