[CPU] Whisper ReadValue optimize #26130

Open

xipingyan wants to merge 47 commits into master from xp/whisper_readvalue_optimize

Commits (47)
2916414
Add profiler for CPU plugin.
xipingyan Aug 10, 2023
451c76d
Mark ReadValue's inputs and corresponding Assign.
xipingyan Aug 20, 2024
137beee
Only mark: ReadValue->Assign pairs.
xipingyan Aug 20, 2024
737fe5c
Optimize pattern match.
xipingyan Aug 21, 2024
6b05005
transformation test pass.
xipingyan Sep 3, 2024
58d9f6f
Test pass.
xipingyan Sep 6, 2024
d54dc25
Fix error: one param links to multiple ReadValueWithSubgraphNode
xipingyan Sep 6, 2024
a533d73
Add submodel infer to MemoryInput::runDynamic
xipingyan Sep 6, 2024
f7339e3
Debug code
xipingyan Sep 6, 2024
e142a06
Merge remote-tracking branch 'origin/master' into xp/whisper_readvalu…
xipingyan Sep 6, 2024
4a2dba0
fix merge error
xipingyan Sep 6, 2024
bf7e493
Dynamic shape test pass
xipingyan Sep 6, 2024
d90144e
test whisper pass
xipingyan Sep 6, 2024
5a98e7b
Disable debug log to test performance. Got expected result:
xipingyan Sep 9, 2024
577721d
Add test.
xipingyan Sep 9, 2024
c23062e
Remove stateName in ov::Node
xipingyan Sep 10, 2024
592919b
Add env: ENABLE_RV for comparison test.
xipingyan Sep 10, 2024
f134307
Merge branch 'master' into xp/whisper_readvalue_optimize
xipingyan Sep 10, 2024
258c3c8
Merge branch 'xp/whisper_readvalue_optimize' of https://github.com/xi…
xipingyan Sep 10, 2024
5c771b0
rm debug log
xipingyan Sep 10, 2024
17b8ce3
[CPU] Introduce SubModel op and Composite node
EgorDuplensky Jul 4, 2024
3a6b83a
Apply review comments
EgorDuplensky Sep 6, 2024
f57851d
Merge branch 'egor/introduce_subgraph_node' into xp/whisper_readvalue…
xipingyan Sep 10, 2024
ca7dde8
Integrate SubModel, WIP.
xipingyan Sep 10, 2024
df7bb70
Remove not used variable.
xipingyan Sep 11, 2024
aa33812
Merge remote-tracking branch 'origin/master' into xp/whisper_readvalu…
xipingyan Sep 12, 2024
62772a9
Remove USE_SUBMODEL
xipingyan Sep 12, 2024
cebed6f
Fix search gap.
xipingyan Sep 13, 2024
d0f7986
Tmp version, test integrate new interface of Graph, Init and Activate.
xipingyan Sep 13, 2024
47f436c
After calling new interface of graph, fix memory inPlace issue.
xipingyan Sep 16, 2024
dbb9a3e
remove tmp code
xipingyan Sep 16, 2024
2ed69c0
"remove debug log"
xipingyan Sep 16, 2024
44c8fb2
Remove profiling code.
xipingyan Sep 18, 2024
87664d4
Merge branch 'master' into xp/whisper_readvalue_optimize
xipingyan Sep 18, 2024
153b4b8
Revert unchanged code.
xipingyan Sep 18, 2024
e188276
Simplify codes.
xipingyan Sep 18, 2024
2833602
Add check whether subGraph can be called.
xipingyan Sep 18, 2024
0a6f13f
Fix test failure: ReadValue has no input.
xipingyan Sep 18, 2024
7ae318f
Remove get_body, m_subgraph, and update haveSubgraph.
xipingyan Sep 19, 2024
79b8272
Replace set_body with base class set_function
xipingyan Sep 19, 2024
9c0989d
remove debug env
xipingyan Sep 20, 2024
3eaea83
1: Add check memoryNode null
xipingyan Sep 25, 2024
d3bf35c
1: Remove reset_prime_mem setter
xipingyan Sep 25, 2024
266cf38
Remove getSupportedDescriptors in cpp.
xipingyan Sep 25, 2024
e94e67f
1. Removed const MemoryPtr& prime_mem() const
xipingyan Sep 26, 2024
e5402f8
1: Remove redefine supportedPrimitiveDescriptors
xipingyan Sep 26, 2024
74147b7
Merge branch 'master' into xp/whisper_readvalue_optimize
xipingyan Sep 27, 2024
1 change: 1 addition & 0 deletions src/plugins/intel_cpu/src/cpu_types.cpp
@@ -139,6 +139,7 @@ static const TypeToNameMap& get_type_to_name_tbl() {
{"Loop", Type::TensorIterator},
{"ReadValue", Type::MemoryInput}, // for construction from name ctor, arbitrary name is used
{"Assign", Type::MemoryOutput}, // for construction from layer ctor
{"ReadValueWithSubgraph", Type::MemoryInput},
{"Convert", Type::Convert},
{"NV12toRGB", Type::ColorConvert},
{"NV12toBGR", Type::ColorConvert},
2 changes: 2 additions & 0 deletions src/plugins/intel_cpu/src/extension.cpp
@@ -22,6 +22,7 @@
#include "transformations/cpu_opset/common/op/power_static.hpp"
#include "transformations/cpu_opset/common/op/sdpa.hpp"
#include "transformations/cpu_opset/common/op/swish_cpu.hpp"
#include "transformations/cpu_opset/common/op/read_value_with_subgraph.hpp"
#include "transformations/cpu_opset/x64/op/interaction.hpp"
#include "transformations/cpu_opset/x64/op/mha.hpp"
#include "transformations/cpu_opset/x64/op/llm_mlp.hpp"
@@ -76,6 +77,7 @@ class TypeRelaxedExtension : public ov::OpExtension<ov::op::TypeRelaxed<Op>> {
OP_EXTENSION(ov::intel_cpu::CausalMaskPreprocessNode) \
OP_EXTENSION(ov::intel_cpu::SwishNode) \
OP_EXTENSION(ov::intel_cpu::NgramNode) \
OP_EXTENSION(ov::intel_cpu::ReadValueWithSubgraph) \
OP_EXTENSION(ov::op::internal::GatherCompressed) \
OP_EXTENSION(ov::op::internal::NonMaxSuppressionIEInternal) \
OP_EXTENSION(ov::op::internal::MulticlassNmsIEInternal) \
7 changes: 7 additions & 0 deletions src/plugins/intel_cpu/src/graph.cpp
@@ -940,6 +940,13 @@ void Graph::Allocate(const std::vector<size_t>& syncNodesInds) {
one_of(edge->getChild()->getType(), Type::Output, Type::MemoryOutput) &&
edge->inPlace(Edge::LOOK_DOWN)) {
edge->getChild()->resolveInPlaceEdges(Edge::LOOK_DOWN);
} else if (one_of(edge->getParent()->getType(), Type::MemoryInput)) {
auto memInp = std::dynamic_pointer_cast<node::MemoryInput>(edge->getParent());
if (memInp && memInp->haveSubgraph()) {
// Since ReadValueWithSubgraph is an intermediate node, add this branch so that
// ProxyMemoryBlock can be used to share memory
edge->getParent()->resolveInPlaceEdges(Edge::LOOK_UP);
Contributor Author (xipingyan):
Hi @maxnick,
For example:
MemoryInput(ReadValueWithSubgraph)->compute nodes
MemoryInput(ReadValueWithSubgraph)->MemoryOutputStub
MemoryInput(ReadValueWithSubgraph)->Stateful
I have to add this branch to call MemoryInput::resolveInPlaceEdges; it lets Stateful, MemoryOutputStub, and the compute nodes share memory, which is reasonable.

Without this call, these nodes' outputs would be in-place, but MemoryOutputStub would allocate its own memory, so those outputs could end up as empty pointers. Just to clarify.

Contributor (maxnick):

The problem is in the MemoryInput::selectOptimalPrimitiveDescriptor implementation you developed for the subgraph scenario. There you:

1. Lose the inPlace port tag when copying the output port config from the internal subgraph. Therefore the output edge's in-place look-up property is lost and resolveInPlaceEdges(Edge::LOOK_UP) isn't called for such a node.
2. Enforce the output port memory descriptor to mimic the internal subgraph's output node memory descriptor, which doesn't make sense, since data are read from the internal subgraph quite rarely (only per state reset). It makes more sense to stick with the external memory representation (the default behavior of the MemoryInput node), as the data will be read from the state via the proxy memory manager far more often.

Thus I would recommend the following:

1. Don't redefine selectOptimalPrimitiveDescriptor, as it's more important to preserve the external graph memory representation and avoid extra reorders at runtime.
2. Provide the MemoryInput node's output memory descriptors to the subgraph output config, so that the subgraph does all the necessary memory layout alignment for us.

Contributor Author (xipingyan):

1. Removed the redefined supportedPrimitiveDescriptors.
2. The internal subgraph provides an InputConfig interface for configuring inputs, but there is no corresponding OutputConfig setting interface.
@maxnick

}
}
}
}
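To make the memory-sharing argument in the thread above concrete, here is a minimal, self-contained C++ sketch of the idea behind a proxy memory block. The types below (PlainMemoryBlock, ProxyMemoryBlock) are illustrative stand-ins, not the plugin's actual ProxyMemoryBlock API: every in-place consumer resolves to the same proxy, so a state redirect is visible to all of them and no consumer is left holding an empty pointer.

```cpp
#include <cstddef>
#include <cstdio>
#include <memory>
#include <vector>

// Stand-in for a real memory block: just owns a flat buffer.
struct PlainMemoryBlock {
    std::vector<float> data;
    explicit PlainMemoryBlock(std::size_t n) : data(n, 0.f) {}
};

// Stand-in for a proxy memory block: consumers keep the proxy, so
// redirecting the underlying storage is visible to all of them at once.
struct ProxyMemoryBlock {
    std::shared_ptr<PlainMemoryBlock> target;
    void redirect(std::shared_ptr<PlainMemoryBlock> t) { target = std::move(t); }
    float* ptr() { return target ? target->data.data() : nullptr; }
};

int main() {
    auto proxy = std::make_shared<ProxyMemoryBlock>();

    // Both the "compute nodes" edge and the "MemoryOutputStub" edge resolve
    // in-place through the same proxy (what resolveInPlaceEdges achieves here).
    auto computeEdge = proxy;
    auto stubEdge = proxy;

    proxy->redirect(std::make_shared<PlainMemoryBlock>(4));
    proxy->ptr()[0] = 42.f;

    // Every consumer sees the same storage; no one holds an empty pointer.
    std::printf("compute sees %.0f, stub sees %.0f\n",
                computeEdge->ptr()[0], stubEdge->ptr()[0]);

    // State reset: redirect to fresh storage; all consumers follow.
    proxy->redirect(std::make_shared<PlainMemoryBlock>(4));
    std::printf("after reset: %.0f\n", computeEdge->ptr()[0]);
    return 0;
}
```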
88 changes: 88 additions & 0 deletions src/plugins/intel_cpu/src/graph_optimizer.cpp
@@ -183,6 +183,10 @@ void GraphOptimizer::ApplyCommonGraphOptimizations(Graph &graph) {
MatchSdpaKvCache(graph);
graph.RemoveDroppedNodes();

OV_ITT_SCOPE_NEXT(FIRST_INFERENCE, taskChain, "ReplaceMemoryOutputWithMemoryOutputStub");
ReplaceMemoryOutputWithMemoryOutputStub(graph);
graph.RemoveDroppedNodes();

OV_ITT_SCOPE_NEXT(FIRST_INFERENCE, taskChain, "RemoveDroppedEdges");
graph.RemoveDroppedEdges();
}
@@ -3064,6 +3068,90 @@ void GraphOptimizer::RemoveConvertMemoryOutput(Graph &graph) {
}
}

void GraphOptimizer::ReplaceMemoryOutputWithMemoryOutputStub(Graph& graph) {
auto& graphNodes = graph.GetNodes();

auto isSuitableMemInput = [](const NodePtr& node) -> bool {
if (Type::MemoryInput != node->getType()) {
return false;
}
auto memInput = std::dynamic_pointer_cast<node::MemoryInput>(node);
if (memInput) {
return memInput->haveSubgraph();
}
Comment on lines +3078 to +3081
Contributor (maxnick):

To my understanding, in all cases where an Assign is directly attached to the ReadValue node, we should replace it with a stub, since the Assign node is practically useless. A ReadValue->Assign pair means the state values aren't really changed by the Assign node, so it looks like it can always be safely replaced with a stub.

Contributor Author (xipingyan):

Yes, that matches my initial understanding.
But it also depends on another factor: whether the memory is a VariableStateDoubleBuffer or a VariableStateSingleBuffer (which I introduced).
For VariableStateSingleBuffer it is OK; I can replace the ReadValue->Assign pair with a stub.
For VariableStateDoubleBuffer it is not safe.
Whether VariableStateSingleBuffer or VariableStateDoubleBuffer is used still depends on if (haveSubgraph()) in my code.
@maxnick

Contributor (maxnick):

To my understanding, when we have a direct ReadValue->Assign pair, we should most definitely use a single buffer, as nothing new will be written to the state during the assign stage. Maybe we can check for a MemoryOutput child in the MemoryInput node to select the appropriate state type (i.e., single buffer or double buffer); a sketch of that check follows this diff.


return false;
};

for (size_t i = 0; i < graphNodes.size(); i++) {
auto node = graphNodes[i];
if (!isSuitableMemInput(node)) {
continue;
}

CPU_GRAPH_OPTIMIZER_SCOPE(ReplaceMemoryOutputWithMemoryOutputStub);

auto memoryNode = std::dynamic_pointer_cast<node::MemoryNode>(node);
if (nullptr == memoryNode) {
continue;
}

// Find sibling MemoryOutput
std::shared_ptr<MemoryOutput> memOutput = nullptr;
bool isReplaced = false;
for (auto&& edge : node->getChildEdgesAtPort(0)) {
auto child = edge->getChild();
if (Type::MemoryOutput == child->getType()) {
memOutput = std::dynamic_pointer_cast<MemoryOutput>(child);
if (memOutput && memOutput->getId() == memoryNode->getId()) {
break;
}

auto memOutputStub = std::dynamic_pointer_cast<MemoryOutputStub>(child);
if (memOutputStub && memOutputStub->getId() == memoryNode->getId()) {
isReplaced = true;
break;
}
}
}

if (isReplaced) {
continue;
}
if (memOutput == nullptr) {
OPENVINO_THROW("Can't find ", node->getName(), " corresponding sibling node.");
}

auto memInputNode = std::dynamic_pointer_cast<node::MemoryInputBase>(node);
OPENVINO_ASSERT(memInputNode, "MemoryInput node ", node->getName(), " has unexpected dynamic type");

ov::optional<Shape> input_shape;
ov::optional<ov::element::Type> input_prc;

if (!node->getParentEdges().empty()) {
input_shape = ov::optional<Shape>(node->getInputShapeAtPort(0));
input_prc = ov::optional<ov::element::Type>(node->getOriginalInputPrecisionAtPort(0));
}

// Capture reference to the original mem output before graph transformations
auto& memOutputBase = memInputNode->getOutputNode();

// Create a stub memory output
auto memOutputStub = std::make_shared<MemoryOutputStub>(memOutputBase.getId(),
memOutputBase.getName() + "_MemoryOutputStub",
memOutputBase.getTypeStr(),
memOutputBase.getInputShapeAtPort(0),
memOutputBase.getOriginalInputPrecisionAtPort(0),
graph.getGraphContext());

auto memOutputEdge = memOutputBase.getParentEdgeAt(0);
const auto inputNum = memOutputEdge->getInputNum();
graph.RemoveEdge(memOutputEdge);
graph.CreateEdge(node, memOutputStub, inputNum, 0);
graph.AddNode(memOutputStub);
}
}

void GraphOptimizer::MatchSdpaKvCache(Graph &graph) {
auto& graphNodes = graph.GetNodes();

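Below is a hedged sketch of the state-type selection suggested in the review thread above: choose a single buffer only when a MemoryOutput child is directly attached. The helper chooseStateKind and the simplified Node structure are hypothetical stand-ins, not the plugin's real MemoryInput/MemoryOutput/Edge machinery.

```cpp
#include <memory>
#include <vector>

// Simplified stand-ins for the plugin's node types (hypothetical).
enum class Type { MemoryInput, MemoryOutput, Other };

struct Node {
    Type type;
    std::vector<std::shared_ptr<Node>> children;  // children on output port 0
};

enum class StateKind { SingleBuffer, DoubleBuffer };

// A direct ReadValue->Assign pair (a MemoryOutput child right under the
// MemoryInput) never changes the state between read and assign, so a single
// buffer suffices; otherwise fall back to the safe double-buffer scheme.
StateKind chooseStateKind(const Node& memInput) {
    for (const auto& child : memInput.children) {
        if (child->type == Type::MemoryOutput) {
            return StateKind::SingleBuffer;
        }
    }
    return StateKind::DoubleBuffer;
}

int main() {
    Node assign{Type::MemoryOutput, {}};
    Node readValue{Type::MemoryInput, {std::make_shared<Node>(assign)}};
    return chooseStateKind(readValue) == StateKind::SingleBuffer ? 0 : 1;
}
```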
1 change: 1 addition & 0 deletions src/plugins/intel_cpu/src/graph_optimizer.h
@@ -52,6 +52,7 @@ class GraphOptimizer {
void RemoveMemoryInputConvert(Graph &graph);
void RemoveConvertMemoryOutput(Graph &graph);
void MatchSdpaKvCache(Graph &graph);
void ReplaceMemoryOutputWithMemoryOutputStub(Graph &graph);

bool canBeInplaced(const NodePtr& parentNode, const NodePtr& childNode);
// Method checks that after the sequential execution of Transpose and Reorder nodes,
39 changes: 39 additions & 0 deletions src/plugins/intel_cpu/src/memory_state.cpp
@@ -156,6 +156,45 @@ MemoryPtr VariableStateDoubleBuffer::internal_state_mem() const {
return prime_mem();
}

VariableStateSingleBuffer::VariableStateSingleBuffer(const std::string& name,
const MemoryPtr& external_buffer,
const MemoryDescPtr& external_desc)
: VariableStateBase(name, external_desc) {
OPENVINO_ASSERT(external_buffer);
m_internal_mem = external_buffer;
m_internal_desc = m_internal_mem->getDescPtr();
auto&& shape = m_internal_desc->getShape();

if (shape.isStatic()) {
m_internal_mem->nullify();
} else {
// if the original desc has a dynamic shape, create an empty tensor
auto new_desc = to_static(m_internal_desc);
m_internal_mem->redefineDesc(new_desc);
}
}
MemoryPtr VariableStateSingleBuffer::input_mem() {
return m_internal_mem;
}
MemoryPtr VariableStateSingleBuffer::output_mem() {
return m_internal_mem;
}
MemoryDescPtr VariableStateSingleBuffer::internal_desc() const {
return m_internal_desc;
}

void VariableStateSingleBuffer::reset_impl() {
auto new_desc = to_static(m_internal_desc);
if (m_internal_mem) {
m_internal_mem->redefineDesc(new_desc);
m_internal_mem->nullify();
}
}

MemoryPtr VariableStateSingleBuffer::internal_state_mem() const {
return m_internal_mem;
}

VariableStateKVcache::VariableStateKVcache(
const std::string& name,
const MemoryDescPtr& external_desc,
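For context on the single- vs double-buffer trade-off discussed above, here is a standalone, illustrative sketch (assumed semantics, not the actual VariableStateDoubleBuffer/VariableStateSingleBuffer implementation): double buffering lets a producer stage the next state without disturbing readers of the current one, while a single buffer aliases read and write storage, which is safe only when the state is never rewritten between read and assign — the direct ReadValue->Assign case.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative double-buffer state: read from the active buffer,
// write the next value into the inactive one, flip on commit.
struct DoubleBufferState {
    std::array<std::vector<float>, 2> buf;
    std::size_t active = 0;
    std::vector<float>& read() { return buf[active]; }
    std::vector<float>& write() { return buf[active ^ 1]; }
    void commit() { active ^= 1; }
};

// Illustrative single-buffer state: read and write alias the same storage.
// Safe only when nothing new is written between read and assign.
struct SingleBufferState {
    std::vector<float> buf;
    std::vector<float>& read() { return buf; }
    std::vector<float>& write() { return buf; }
    void commit() {}  // nothing to flip
};

int main() {
    DoubleBufferState db;
    db.buf[0] = {1.f};
    db.buf[1] = {0.f};
    db.write()[0] = 2.f;  // producer stages the next state...
    std::printf("before commit: %.0f\n", db.read()[0]);  // ...reader still sees 1
    db.commit();
    std::printf("after commit:  %.0f\n", db.read()[0]);  // now sees 2

    SingleBufferState sb;
    sb.buf = {1.f};
    // A direct ReadValue->Assign pair writes the value back unchanged,
    // so aliasing read and write storage is harmless.
    sb.write()[0] = sb.read()[0];
    sb.commit();
    return 0;
}
```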
21 changes: 21 additions & 0 deletions src/plugins/intel_cpu/src/memory_state.h
@@ -94,6 +94,27 @@ class VariableStateDoubleBuffer : public VariableStateBase {
size_t buffer_num = 0;
};

class VariableStateSingleBuffer : public VariableStateBase {
public:
VariableStateSingleBuffer(const std::string& name,
const MemoryPtr& external_buffer,
const MemoryDescPtr& external_desc);

MemoryPtr input_mem() override;
MemoryPtr output_mem() override;
MemoryDescPtr internal_desc() const override;

private:
void reset_impl() override;
void commit_impl() override {};

MemoryPtr internal_state_mem() const override;

private:
MemoryDescPtr m_internal_desc;  // mem desc required by the graph's internal tensor
MemoryPtr m_internal_mem;
};

class VariableStateKVcache : public VariableStateBase {
public:
VariableStateKVcache(const std::string& name,
4 changes: 3 additions & 1 deletion src/plugins/intel_cpu/src/nodes/input.cpp
@@ -8,6 +8,7 @@
#include "nodes/node_config.h"
#include "openvino/core/parallel.hpp"
#include "shape_inference/shape_inference_pass_through.hpp"
#include "transformations/cpu_opset/common/op/read_value_with_subgraph.hpp"

using namespace dnnl;
using namespace dnnl::impl::cpu::x64;
@@ -223,7 +224,8 @@ Input::Input(const std::shared_ptr<ov::Node>& op, const GraphContext::CPtr conte
op::v0::Constant::get_type_info_static(),
op::v0::Result::get_type_info_static(),
op::v3::ReadValue::get_type_info_static(),
op::v6::ReadValue::get_type_info_static()))
op::v6::ReadValue::get_type_info_static(),
ov::intel_cpu::ReadValueWithSubgraph::get_type_info_static()))
OPENVINO_THROW_NOT_IMPLEMENTED("CPU Input node doesn't support ngraph operation ",
op->get_type_name(),
" with name ",