-
Notifications
You must be signed in to change notification settings - Fork 225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unity device mapping algorithm #1459
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add compiler-tests
to CI and proj
?
Also, overall this code could use a bit of simplifying and readability-improving. For a lot of the cases where you're unwrapping types consider just defining equivalent functions for those types, which would help simplify things. Overall, more small functions with better names would probably help over the current large functions which can be a bit hard to actually read through
Reviewed 7 of 15 files at r1, 8 of 8 files at r2, all commit messages.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @wmdi)
lib/compiler/include/compiler/graph_optimize_result.struct.toml
line 7 at r2 (raw file):
includes = [ "compiler/machine_mapping.h" ]
Suggestion:
includes = [
"compiler/machine_mapping.dtg.h",
"pcg/parallel_computation_graph/parallel_computation_graph.dtg.h",
]
lib/compiler/include/compiler/graph_optimize_state.h
line 13 at r2 (raw file):
float runtime; bool operator==(GraphOptimizeState const &other) const;
GraphOptimizeState::operator==
should probably have tests
lib/compiler/include/compiler/unity_algorithm.h
line 13 at r2 (raw file):
namespace FlexFlow { struct OptimizerConfig {
Move over to dtgen
so it's printable, equality-checkable, hashable, etc.?
lib/compiler/src/graph_optimize_state.cc
line 6 at r2 (raw file):
bool GraphOptimizeState::operator==(GraphOptimizeState const &other) const { auto layers1 = topological_ordering(graph_optimize_result.pcg);
I'm not sure topological_ordering
is the best thing to rely on here. This seems like a graph-level comparison function, so it might be better to figure out what the semantics are and then it can get added to graph
so you only have to use it here?
lib/compiler/src/graph_optimize_state.cc
line 30 at r2 (raw file):
return !(*this == other); }
Where is operator<
? I think it was declared?
lib/compiler/src/machine_mapping.cc
line 159 at r2 (raw file):
cached_subgraph_costs(cached_subgraph_costs) {} ParallelComputationGraph pcg;
Why move the PCG from OptimalCostFunctor
to MachineMappingSearcher
?
lib/compiler/src/machine_mapping.cc
line 166 at r2 (raw file):
OptimalCostCache &cached_subgraph_costs; struct OptimalCostFunctor {
Why do we have this separate OptimalCostFunctor
?
lib/compiler/src/machine_mapping.cc
line 219 at r2 (raw file):
std::unordered_set<DataflowOutput> split_outputs; for (auto const &[value, _] :
Might be simpler with transform
? Also the assert
isn't really necessary, get
will already throw if you perform an invalid access
lib/compiler/src/machine_mapping.cc
line 287 at r2 (raw file):
optimal_result, OptimalCostResult::parallel_combine( std::visit(OptimalCostFunctor(
Why not just pull out a separate function that does std::visit(OptimalCostFunctor(...))
rather than repeat it each time?
lib/compiler/src/machine_mapping.cc
line 320 at r2 (raw file):
enumerate_machine_views({node}, resource)) { MachineMapping mv_map{{{node, node_machine_views.at(node)}}}; std::unordered_map<OpenDataflowValue, MachineView> machine_views =
Might be cleaner using merge_maps
and generate_map
?
lib/compiler/src/machine_mapping.cc
line 334 at r2 (raw file):
std::vector<std::unordered_map<Node, MachineView>> enumerate_machine_views(std::unordered_set<Node> const &nodes,
Not entirely clear what these do? Can you rename them to something a bit more descriptive?
lib/compiler/test/src/test_optimal_cost.cc
line 9 at r2 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("optimal_cost_0") {
Suggestion:
TEST_CASE("optimal_cost does not crash on minimal inputs") {
lib/substitutions/include/substitutions/sub_parallel_computation_graph.h
line 25 at r2 (raw file):
SubParallelComputationGraph sub_pcg_from_partial_pcg(ParallelComputationGraph const &,
Is this just get_subgraph
for PCGs
?
lib/substitutions/src/substitutions/sub_parallel_computation_graph.cc
line 61 at r2 (raw file):
std::unordered_set<Node> const &nodes) { auto as_open = view_as_labelled_open_dataflow_graph(pcg.raw_graph); OpenDataflowSubgraphResult subgraph_result = get_subgraph(as_open, nodes);
Probably better to just add get_subgraph
for LabelledOpenDataflowGraph
and then use it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @lockshaw)
lib/compiler/include/compiler/graph_optimize_result.struct.toml
line 7 at r2 (raw file):
includes = [ "compiler/machine_mapping.h" ]
Done.
lib/compiler/include/compiler/graph_optimize_state.h
line 13 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
GraphOptimizeState::operator==
should probably have tests
Done.
lib/compiler/include/compiler/unity_algorithm.h
line 13 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Move over to
dtgen
so it's printable, equality-checkable, hashable, etc.?
Done.
lib/compiler/src/graph_optimize_state.cc
line 6 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I'm not sure
topological_ordering
is the best thing to rely on here. This seems like a graph-level comparison function, so it might be better to figure out what the semantics are and then it can get added tograph
so you only have to use it here?
I think I require homomorphism assuming inputs are ordered.
lib/compiler/src/machine_mapping.cc
line 159 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why move the PCG from
OptimalCostFunctor
toMachineMappingSearcher
?
This is to avoid transferring variables that are not changed during DP, i.e., they are 'constants' to the DP, instead of states.
lib/compiler/src/machine_mapping.cc
line 166 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why do we have this separate
OptimalCostFunctor
?
To separate DP states with DP constants (environments).
lib/compiler/src/machine_mapping.cc
line 219 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be simpler with
transform
? Also theassert
isn't really necessary,get
will already throw if you perform an invalid access
I do not find methods to get the set of left/right values of a bidict
lib/compiler/src/machine_mapping.cc
line 287 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not just pull out a separate function that does
std::visit(OptimalCostFunctor(...))
rather than repeat it each time?
Done.
lib/compiler/src/machine_mapping.cc
line 320 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be cleaner using
merge_maps
andgenerate_map
?
We do not have the utils required for bidict
lib/compiler/src/machine_mapping.cc
line 334 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Not entirely clear what these do? Can you rename them to something a bit more descriptive?
Done.
lib/compiler/test/src/test_optimal_cost.cc
line 9 at r2 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("optimal_cost_0") {
Done.
lib/substitutions/include/substitutions/sub_parallel_computation_graph.h
line 25 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Is this just
get_subgraph
forPCGs
?
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 9 files at r3, all commit messages.
Reviewable status: 15 of 17 files reviewed, 9 unresolved discussions (waiting on @wmdi)
lib/compiler/include/compiler/optimal_cost_state.struct.toml
line 20 at r3 (raw file):
"utils/hash/unordered_map.h", "utils/fmt/unordered_map.h", ]
Suggestion:
includes = [
"utils/graph/serial_parallel/serial_parallel_decomposition.dtg.h",
"pcg/machine_specification.dtg.h",
"pcg/machine_view.dtg.h",
"utils/graph/open_dataflow_graph/open_dataflow_edge.dtg.h",
"utils/graph/open_dataflow_graph/open_dataflow_value.dtg.h",
]
src_includes = [
"utils/hash/unordered_map.h",
"utils/fmt/unordered_map.h",
]
lib/compiler/src/graph_optimize_state.cc
line 6 at r2 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
I think I require homomorphism assuming inputs are ordered.
Using a hack like this for now is fine, but eventually it would be good to move over to a more principled algorithm implemented in utils/graph
that doesn't make correctness tradeoffs and more explicitly uses the properties of DataflowGraph
that you're relying on for efficiency. Until that's implemented let's add a comment in the code that clarifies that this is a hack and links to an issue with the plan for moving this over to something more correct
lib/compiler/src/machine_mapping.cc
line 159 at r2 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
This is to avoid transferring variables that are not changed during DP, i.e., they are 'constants' to the DP, instead of states.
As discussed in the Wednesday meeting, it should be possible to remove OptimalCostFunctor
, especially as we're now on C++17 and can use overload
There seem to be two potential designs (either (1) making optimal_cost
a method on MachineMappingSearcher
, or (2) making MachineMappingSearcher
just a wrapper for the global state (probably renamed to something like DPContext
) and passing that around between top-level functions as a mutable reference, i.e., essentially doing the passing of this
manually)
lib/compiler/src/machine_mapping.cc
line 219 at r2 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
I do not find methods to get the set of left/right values of a
bidict
See keys
in utils/containers/keys.h
lib/compiler/src/machine_mapping.cc
line 320 at r2 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
We do not have the utils required for
bidict
Not exactly sure the details of simplifying this, but making it more clear what this is doing would be nice (maybe using functions from utils/containers
or pulling out some pieces into separate named lambdas)
lib/compiler/test/src/test_graph_optimize_state.cc
line 8 at r3 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("graph_optimize_state:equality") {
Suggestion:
TEST_CASE("GraphOptimizeState::operator==") {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 9 files at r3.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @wmdi)
lib/compiler/src/graph_optimize_state.cc
line 57 at r3 (raw file):
namespace std { size_t hash<::FlexFlow::GraphOptimizeState>::operator()(
See https://reviewable.io/reviews/flexflow/FlexFlow/1459#-O3kIAym5Yc4ECMISkBW. Eventually it might be good to use a proper graph hash like https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.graph_hashing.weisfeiler_lehman_graph_hash.html#networkx.algorithms.graph_hashing.weisfeiler_lehman_graph_hash
lib/compiler/test/src/test_graph_optimize_state.cc
line 44 at r3 (raw file):
"dense1"); ParallelComputationGraph pcg = builder.pcg;
Suggestion:
ParallelComputationGraph pcg1 = [&] {
ParallelComputationGraphBuilder builder;
ParallelTensorShape input_shape =
ParallelTensorShape{ParallelTensorDims{
FFOrdered<ShardParallelDim>{
ShardParallelDim{32, 2},
ShardParallelDim{16, 1},
},
ReplicaParallelDimSet{
SumDegree{1},
DiscardCopyDegree{1},
},
},
DataType::FLOAT};
parallel_tensor_guid_t input0 =
builder.create_input_tensor(input_shape);
parallel_tensor_guid_t dense0 = builder.dense(input0,
8);
parallel_tensor_guid_t dense1 = builder.dense(dense0,
4);
return builder.pcg
}();
lib/compiler/test/src/test_graph_optimize_state.cc
line 67 at r3 (raw file):
"dense0"); ParallelComputationGraph pcg_ = builder.pcg;
Suggestion:
ParallelComputationGraph pcg2 = [&] {
ParallelComputationGraphBuilder builder;
parallel_tensor_guid_t input0 =
builder.create_input_tensor(input_shape);
parallel_tensor_guid_t dense0 = builder.dense(input0,
8);
return builder.pcg;
}();
lib/compiler/test/src/test_graph_optimize_state.cc
line 69 at r3 (raw file):
ParallelComputationGraph pcg_ = builder.pcg; CHECK(GraphOptimizeState(GraphOptimizeResult(pcg, empty_machine_mapping), 0) !=
This doesn't really sufficiently test GraphOptimizeState::operator==
--from this I'm not sure (1) what role the machine mappings and the cost play in equality, and (2) whether any graphs compare equal at all, or whether all of them just return not equal. A few more test/sub-cases are needed
lib/compiler/test/src/test_graph_optimize_state.cc
line 70 at r3 (raw file):
CHECK(GraphOptimizeState(GraphOptimizeResult(pcg, empty_machine_mapping), 0) != GraphOptimizeState(GraphOptimizeResult(pcg_, empty_machine_mapping), 0));
Suggestion:
GraphOptimizeState state1 = GraphOptimizeState{
GraphOptimizeResult{pcg, empty_machine_mapping},
0,
};
GraphOptimzieState state2 = GraphOptimizeState{
GraphOptimizeResult{pcg, empty_machine_mapping},
0,
};
CHECK(state1 != state2);
lib/substitutions/include/substitutions/sub_parallel_computation_graph.h
line 25 at r2 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
Yes
In that case let's rename it to get_pcg_subgraph
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 41 files reviewed, 13 unresolved discussions (waiting on @lockshaw)
lib/compiler/src/graph_optimize_state.cc
line 6 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Using a hack like this for now is fine, but eventually it would be good to move over to a more principled algorithm implemented in
utils/graph
that doesn't make correctness tradeoffs and more explicitly uses the properties ofDataflowGraph
that you're relying on for efficiency. Until that's implemented let's add a comment in the code that clarifies that this is a hack and links to an issue with the plan for moving this over to something more correct
Done.
lib/compiler/src/machine_mapping.cc
line 159 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
As discussed in the Wednesday meeting, it should be possible to remove
OptimalCostFunctor
, especially as we're now on C++17 and can useoverload
There seem to be two potential designs (either (1) making
optimal_cost
a method onMachineMappingSearcher
, or (2) makingMachineMappingSearcher
just a wrapper for the global state (probably renamed to something likeDPContext
) and passing that around between top-level functions as a mutable reference, i.e., essentially doing the passing ofthis
manually)
Done.
lib/compiler/src/machine_mapping.cc
line 219 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
See
keys
inutils/containers/keys.h
Done.
lib/compiler/src/machine_mapping.cc
line 320 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Not exactly sure the details of simplifying this, but making it more clear what this is doing would be nice (maybe using functions from
utils/containers
or pulling out some pieces into separate named lambdas)
Done.
lib/compiler/test/src/test_graph_optimize_state.cc
line 8 at r3 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("graph_optimize_state:equality") {
Done.
lib/compiler/test/src/test_graph_optimize_state.cc
line 44 at r3 (raw file):
"dense1"); ParallelComputationGraph pcg = builder.pcg;
Done.
lib/compiler/test/src/test_graph_optimize_state.cc
line 67 at r3 (raw file):
"dense0"); ParallelComputationGraph pcg_ = builder.pcg;
Done.
lib/substitutions/include/substitutions/sub_parallel_computation_graph.h
line 25 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
In that case let's rename it to
get_pcg_subgraph
Done.
lib/compiler/include/compiler/optimal_cost_state.struct.toml
line 20 at r3 (raw file):
"utils/hash/unordered_map.h", "utils/fmt/unordered_map.h", ]
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 41 files reviewed, 13 unresolved discussions (waiting on @lockshaw)
lib/compiler/src/graph_optimize_state.cc
line 57 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
See https://reviewable.io/reviews/flexflow/FlexFlow/1459#-O3kIAym5Yc4ECMISkBW. Eventually it might be good to use a proper graph hash like https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.graph_hashing.weisfeiler_lehman_graph_hash.html#networkx.algorithms.graph_hashing.weisfeiler_lehman_graph_hash
Done.
lib/compiler/test/src/test_graph_optimize_state.cc
line 69 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This doesn't really sufficiently test
GraphOptimizeState::operator==
--from this I'm not sure (1) what role the machine mappings and the cost play in equality, and (2) whether any graphs compare equal at all, or whether all of them just return not equal. A few more test/sub-cases are needed
This will be fixed in another PR.
lib/compiler/test/src/test_graph_optimize_state.cc
line 70 at r3 (raw file):
CHECK(GraphOptimizeState(GraphOptimizeResult(pcg, empty_machine_mapping), 0) != GraphOptimizeState(GraphOptimizeResult(pcg_, empty_machine_mapping), 0));
This will be fixed in another PR.
lib/substitutions/src/substitutions/sub_parallel_computation_graph.cc
line 61 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Probably better to just add
get_subgraph
forLabelledOpenDataflowGraph
and then use it here
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 27 of 37 files at r4, 6 of 7 files at r5, 1 of 2 files at r6, all commit messages.
Reviewable status: 38 of 41 files reviewed, 31 unresolved discussions (waiting on @wmdi)
lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h
line 6 at r6 (raw file):
#include "machine_mapping.h" #include "machine_mapping_cache.h" #include "machine_mapping_context.dtg.h"
Suggestion:
#include "compiler/machine_mapping/machine_mapping.h"
#include "compiler/machine_mapping/machine_mapping_cache.h"
#include "compiler/machine_mapping/machine_mapping_context.dtg.h"
lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h
line 16 at r6 (raw file):
ParallelComputationGraph const &pcg, std::function<std::unordered_set<MachineView>( ParallelLayerAttrs const &, MachineSpecification const &)> const
This isn't going to allow access to the input and output tensor shapes of the operator, which are probably necessary?
Code quote:
ParallelLayerAttrs const &, M
lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h
line 23 at r6 (raw file):
MachineMappingResult get_optimal_machine_mapping_internal(MachineMappingContext &context,
What is this overload for? Why does it not take a SerialParallelDecomposition
?
lib/compiler/include/compiler/machine_mapping/machine_mapping.h
line 4 at r6 (raw file):
#define _FLEXFLOW_COMPILER_MACHINE_MAPPING_H #include "machine_mapping.dtg.h"
Suggestion:
#include "compiler/machine_mapping/machine_mapping.dtg.h"
lib/compiler/include/compiler/machine_mapping/machine_mapping.h
line 8 at r6 (raw file):
namespace FlexFlow { MachineMapping combine(MachineMapping const &, MachineMapping const &);
Suggestion:
MachineMapping combine_disjoint_mappings(MachineMapping const &, MachineMapping const &);
lib/compiler/include/compiler/machine_mapping/machine_mapping_cache.h
line 10 at r6 (raw file):
namespace FlexFlow { class MachineMappingCache {
Does it make sense to have a separate class/object for this, or this not just the std::unordered_map
API?
lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml
line 7 at r5 (raw file):
includes = [ "machine_mapping.dtg.h",
Suggestion:
"compiler/machine_mapping/machine_mapping.dtg.h",
lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml
line 29 at r6 (raw file):
[[fields]] name = "cached_subgraph_results" type = "::FlexFlow::MachineMappingCache"
What about moving MachineMappingCache
out of this class so MachineMappingContext
can be passed as const &
and MachineMappingCache
alone gets passed as mutable &
?
lib/compiler/include/compiler/machine_mapping/machine_mapping_result.h
line 14 at r6 (raw file):
MachineMappingResult get_infinity_machine_mapping_result(); void minimize_runtime(MachineMappingResult &m1, MachineMappingResult const &m2);
Why this vs just calling min
or something?
Code quote:
void minimize_runtime(MachineMappingResult &m1, MachineMappingResult const &m2);
lib/compiler/include/compiler/machine_mapping/machine_mapping_result.struct.toml
line 9 at r5 (raw file):
includes = [ "machine_mapping.dtg.h",
Suggestion:
"compiler/machine_mapping/machine_mapping.dtg.h",
lib/compiler/src/unity_algorithm.cc
line 39 at r6 (raw file):
MachineMappingCache cached_subgraph_costs; DeduplicatedPriorityQueue<GraphOptimizeState> candidates;
Would it be cleaner here to, rather than having a custom GraphOptimizeState
type whose hash and operator==
ignore the runtime cost, instead have this be a DeduplicatedPriorityQueue<SomeWrapperAroundAPCG, CustomComparator>
where the CustomComparator
is something like below, where MyCustomType
is analagous to SomeWrapperAroundAPCG
, and I suppose instead of a std::unordered_map<MyCustomType, float>
you'd have a MachineMappingCache
Code snippet:
#include <map>
#include <queue>
#include <unordered_map>
#include <stdlib.h>
#include <iostream>
int guid_ctr = 0;
struct MyCustomType {
MyCustomType()
: guid(guid_ctr++) {}
bool operator==(MyCustomType const &other) const {
return this->guid == other.guid;
}
bool operator!=(MyCustomType const &other) const {
return this->guid != other.guid;
}
int guid;
};
namespace std {
template <>
struct hash<MyCustomType> {
size_t operator()(MyCustomType const &x) const {
return x.guid;
}
};
}
float get_cost(MyCustomType const &) {
return rand();
}
int main() {
std::unordered_map<MyCustomType, float> costs;
auto cached_get_cost = [&](MyCustomType const &x) -> float {
if (costs.find(x) == costs.end()) {
costs.insert({x, get_cost(x)});
}
return costs.at(x);
};
auto my_custom_comparator = [&](MyCustomType const &lhs, MyCustomType const &rhs) -> bool {
return cached_get_cost(lhs) < cached_get_cost(rhs);
};
std::priority_queue<MyCustomType, std::vector<MyCustomType>, decltype(my_custom_comparator)> q{my_custom_comparator};
for (int i = 0; i < 10; i++) {
q.push(MyCustomType{});
}
while (!q.empty()) {
MyCustomType curr = q.top();
q.pop();
std::cout << cached_get_cost(curr) << std::endl;
}
}
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc
line 81 at r6 (raw file):
return enumeration; }
Move to a separate allowed_machine_mappings
file, this file is already getting rather complicated/large
Code quote:
std::vector<std::unordered_map<Node, MachineView>>
allowed_machine_mappings(MachineMappingContext const &context,
std::unordered_set<Node> const &nodes,
MachineSpecification const &resource) {
if (nodes.empty()) {
return {{}};
}
Node node = get_first(nodes);
std::vector<std::unordered_map<Node, MachineView>> partial_enumeration =
allowed_machine_mappings(context, set_minus(nodes, {node}), resource);
std::unordered_set<MachineView> allowed_machine_views_for_node =
context.allowed_machine_views(context.pcg.raw_graph.at(node), resource);
std::vector<std::unordered_map<Node, MachineView>> enumeration;
for (MachineView const &mv : allowed_machine_views_for_node) {
for (std::unordered_map<Node, MachineView> const &partial :
partial_enumeration) {
enumeration.push_back(merge_maps(
partial, std::unordered_map<Node, MachineView>{{node, mv}}));
}
}
return enumeration;
}
std::vector<std::unordered_map<DataflowOutput, MachineView>>
allowed_machine_mappings(MachineMappingContext const &context,
std::unordered_set<DataflowOutput> const &values,
MachineSpecification const &resource) {
std::unordered_set<Node> nodes;
for (DataflowOutput const &v : values) {
nodes.insert(v.node);
}
std::vector<std::unordered_map<Node, MachineView>> node_enumeration =
allowed_machine_mappings(context, nodes, resource);
std::vector<std::unordered_map<DataflowOutput, MachineView>> enumeration;
for (std::unordered_map<Node, MachineView> _node_enumeration :
node_enumeration) {
std::unordered_map<DataflowOutput, MachineView> _emumeration;
for (DataflowOutput const &v : values) {
_emumeration.emplace(v, _node_enumeration.at(v.node));
}
enumeration.push_back(_emumeration);
}
return enumeration;
}
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc
line 187 at r6 (raw file):
GraphSplit graph_split = get_graph_split(decompn1, decompn2); OpenDataflowSubgraphResult subgraph_res1 = get_subgraph(
Is there really a point to the subgraphs at this point, or can we just rely on the SP decomposition for this information? I guess it's that the graph cost is cached, and if you cached it for the SP decomposition then you wouldn't be agnostic to different locations of the same graph?
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc
line 192 at r6 (raw file):
sub_pcg_from_full_pcg(context.pcg).raw_graph, graph_split.second); std::unordered_set<DataflowOutput> split_outputs = transform(
FYI you can use parallel_tensor_guid_t
for this once #1471 is merged
Code quote:
DataflowOutput
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc
line 199 at r6 (raw file):
&split_machine_views : allowed_machine_mappings(context, split_outputs, resource)) { std::unordered_map<OpenDataflowValue, MachineView> fixed_machine_views1 =
FYI you can use open_parallel_tensor_guid_t
for this once #1471 is merged
Code quote:
OpenDataflowValue
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc
line 206 at r6 (raw file):
get_open_dataflow_values(subgraph_res2.graph)); for (auto const &[split_value, split_input] :
Suggestion:
for (auto const &[full_graph_value, subgraph_input] :
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc
line 283 at r6 (raw file):
assert(contains( context.allowed_machine_views(context.pcg.raw_graph.at(node), resource), fixed_machine_views.at(any_output)));
Simplify
Code quote:
assert(contains(
context.allowed_machine_views(context.pcg.raw_graph.at(node), resource),
fixed_machine_views.at(any_output)));
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc
line 302 at r6 (raw file):
return OpenDataflowValue(o); }), [&](OpenDataflowValue const &o) { return mv; });
Simplify, probably separate into two expressions
Code quote:
std::unordered_map<OpenDataflowValue, MachineView> output_mv_map =
generate_map(transform(get_outputs(context.pcg.raw_graph, node),
[](DataflowOutput const &o) {
return OpenDataflowValue(o);
}),
[&](OpenDataflowValue const &o) { return mv; });
lib/compiler/src/compiler/machine_mapping/machine_mapping_cache.cc
line 10 at r6 (raw file):
if (contains_key(cache, state)) { MachineMappingResult result = cache.at(state); return std::make_optional(result);
Suggestion:
return result;
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc
line 100 at r6 (raw file):
}(); ParallelComputationGraph pcg_non_sp = [&] {
These should be in the SUBCASE
s as they're not shared between them
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc
line 146 at r6 (raw file):
CHECK(bool(result.runtime > 0)); // TODO(@Mengdi Wu): fill it with actual cost // CHECK(result.runtime == xx);
Rather than running it, getting a cost, and then just hardcoding that into the file, I'd focus on generating testcases (such as restricting the allowed machine views to only a few, controlling the cost function, etc.) where you can actually set analytically derived values
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc
line 171 at r6 (raw file):
} SUBCASE("PCG is not sp-izable due to multiple inputs") {
You'd want separate testcases for this too that actually check the graph manipulation, not just get_optimal_machine_mapping
-level tests
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc
line 0 at r6 (raw file):
Rename machine_mapping.cc
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc
line 18 at r6 (raw file):
MachineMapping result = combine(machine_mapping_0, machine_mapping_1); CHECK(result == combined); }
Suggestion:
TEST_CASE("combine(MachineMapping, MachineMapping)") {
MachineView machine_view_0 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(1));
MachineView machine_view_1 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(2));
MachineMapping machine_mapping_0 = MachineMapping{{
{Node(0), machine_view_0},
}};
MachineMapping machine_mapping_1 = MachineMapping{{
{Node(1), machine_view_1},
}};
MachineMapping correct = MachineMapping{{
{Node(0), machine_view_0},
{Node(1), machine_view_1},
}};
MachineMapping result = combine(machine_mapping_0, machine_mapping_1);
CHECK(result == correct);
}
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc
line 20 at r6 (raw file):
} TEST_CASE("nodes_are_disjoint") {
Suggestion:
TEST_CASE("nodes_are_disjoint(MachineMapping, MachineMapping)") {
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc
line 28 at r6 (raw file):
{{Node(0), machine_view_0}, {Node(1), machine_view_1}}); CHECK(nodes_are_disjoint(machine_mapping_0, machine_mapping_1)); CHECK_FALSE(nodes_are_disjoint(machine_mapping_0, combined));
Suggestion:
MachineMapping machine_mapping_0({{Node(0), machine_view_0}});
SUBCASE("nodes are disjoint") {
MachineMapping machine_mapping_1({{Node(1), machine_view_1}});
bool correct = true;
bool result = nodes_are_disjoint(machine_mapping_0, machine_mapping_1);
CHECK(result == correct);
}
SUBCASE("nodes are not disjoint") {
MachineMapping machine_mapping_1(
{{Node(0), machine_view_0}, {Node(1), machine_view_1}});
bool correct = false;
bool result = nodes_are_disjoint(machine_mapping_0, machine_mapping_1);
CHECK(result == correct);
}
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc
line 11 at r6 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("machine_mapping_cache") {
Suggestion:
TEST_CASE("MachineMappingCache") {
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_result.cc
line 77 at r6 (raw file):
CHECK(s1.runtime < inf.runtime); CHECK(s2.runtime < inf.runtime); }
Delete
Code quote:
TEST_CASE("get_infinity_machine_mapping_result") {
MachineView machine_view_0 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(1));
MachineView machine_view_1 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(2));
MachineMapping machine_mapping_empty(
std::unordered_map<Node, MachineView>{});
MachineMapping machine_mapping_0({{Node(0), machine_view_0}});
MachineMapping machine_mapping_1({{Node(1), machine_view_1}});
MachineMapping combined(
{{Node(0), machine_view_0}, {Node(1), machine_view_1}});
MachineMappingResult s0(0, machine_mapping_empty);
MachineMappingResult s1(1, machine_mapping_0);
MachineMappingResult s2(2, machine_mapping_1);
MachineMappingResult inf = get_infinity_machine_mapping_result();
CHECK(s0.runtime < inf.runtime);
CHECK(s1.runtime < inf.runtime);
CHECK(s2.runtime < inf.runtime);
}
lib/substitutions/include/substitutions/sub_parallel_computation_graph.h
line 25 at r6 (raw file):
SubParallelComputationGraph get_pcg_subgraph(ParallelComputationGraph const &, std::unordered_set<Node> const &);
Suggestion:
std::unordered_set<parallel_layer_guid_t> const &);
lib/compiler/include/compiler/machine_mapping/machine_mapping.struct.toml
line 17 at r6 (raw file):
"utils/hash/unordered_map.h", "utils/fmt/unordered_map.h", ]
Suggestion:
includes = [
"utils/graph/node/node.dtg.h",
"pcg/machine_view.dtg.h",
]
src_includes = [
"utils/hash/unordered_map.h",
"utils/fmt/unordered_map.h",
]
lib/compiler/include/compiler/machine_mapping/machine_mapping.struct.toml
line 21 at r6 (raw file):
[[fields]] name = "machine_views" type = "std::unordered_map<::FlexFlow::Node, ::FlexFlow::MachineView>"
Suggestion:
type = "std::unordered_map<::FlexFlow::parallel_layer_guid_t, ::FlexFlow::MachineView>"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 38 of 41 files reviewed, 31 unresolved discussions (waiting on @lockshaw)
lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h
line 6 at r6 (raw file):
#include "machine_mapping.h" #include "machine_mapping_cache.h" #include "machine_mapping_context.dtg.h"
Done.
lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h
line 16 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This isn't going to allow access to the input and output tensor shapes of the operator, which are probably necessary?
I see. Will fix it when merging the machine view PR.
lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h
line 23 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What is this overload for? Why does it not take a
SerialParallelDecomposition
?
This is to further split the logics into multiple pieces. get_optimal_machine_mapping
wraps global variables as MachineMappingContext
and calls this function, and this function performs sp decomposition (and solves the un-sp-izable issue in a later version) and calls other overloads.
lib/compiler/include/compiler/machine_mapping/machine_mapping.h
line 4 at r6 (raw file):
#define _FLEXFLOW_COMPILER_MACHINE_MAPPING_H #include "machine_mapping.dtg.h"
Done.
lib/compiler/include/compiler/machine_mapping/machine_mapping.h
line 8 at r6 (raw file):
namespace FlexFlow { MachineMapping combine(MachineMapping const &, MachineMapping const &);
Done.
lib/compiler/include/compiler/machine_mapping/machine_mapping_cache.h
line 10 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Does it make sense to have a separate class/object for this, or this not just the
std::unordered_map
API?
I think it makes sense to wrap up the save and load logic as they handle some existence cases.
lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml
line 7 at r5 (raw file):
includes = [ "machine_mapping.dtg.h",
Done.
lib/compiler/include/compiler/machine_mapping/machine_mapping_result.h
line 14 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why this vs just calling
min
or something?
This function only compares MachineMappingResult
s according to the runtime. It does not make sense to have an operator<
for MachineMappingResult
that only compares runtime, which is required by min
.
lib/compiler/include/compiler/machine_mapping/machine_mapping_result.struct.toml
line 9 at r5 (raw file):
includes = [ "machine_mapping.dtg.h",
Done.
lib/compiler/src/compiler/machine_mapping/machine_mapping_cache.cc
line 10 at r6 (raw file):
if (contains_key(cache, state)) { MachineMappingResult result = cache.at(state); return std::make_optional(result);
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc
line 18 at r6 (raw file):
MachineMapping result = combine(machine_mapping_0, machine_mapping_1); CHECK(result == combined); }
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc
line 20 at r6 (raw file):
} TEST_CASE("nodes_are_disjoint") {
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc
line 28 at r6 (raw file):
{{Node(0), machine_view_0}, {Node(1), machine_view_1}}); CHECK(nodes_are_disjoint(machine_mapping_0, machine_mapping_1)); CHECK_FALSE(nodes_are_disjoint(machine_mapping_0, combined));
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc
line at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Rename
machine_mapping.cc
What does this mean?
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc
line 11 at r6 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("machine_mapping_cache") {
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_result.cc
line 77 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/compiler/include/compiler/machine_mapping/machine_mapping.struct.toml
line 17 at r6 (raw file):
"utils/hash/unordered_map.h", "utils/fmt/unordered_map.h", ]
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 38 of 41 files reviewed, 31 unresolved discussions (waiting on @lockshaw)
lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml
line 29 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What about moving
MachineMappingCache
out of this class soMachineMappingContext
can be passed asconst &
andMachineMappingCache
alone gets passed as mutable&
?
I think MachineMappingCache
should be part of the context, and it is fine to have a mutable context.
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc
line 81 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Move to a separate
allowed_machine_mappings
file, this file is already getting rather complicated/large
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 38 of 41 files reviewed, 31 unresolved discussions (waiting on @lockshaw)
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc
line 206 at r6 (raw file):
get_open_dataflow_values(subgraph_res2.graph)); for (auto const &[split_value, split_input] :
Done.
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc
line 283 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Simplify
Done.
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc
line 302 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Simplify, probably separate into two expressions
Done.
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc
line 100 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
These should be in the
SUBCASE
s as they're not shared between them
Done.
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc
line 146 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Rather than running it, getting a cost, and then just hardcoding that into the file, I'd focus on generating testcases (such as restricting the allowed machine views to only a few, controlling the cost function, etc.) where you can actually set analytically derived values
In terms of actual cost, I mean the cost model we have for a pcg. Even for analytical results, we need a precise cost model.
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc
line 171 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
You'd want separate testcases for this too that actually check the graph manipulation, not just
get_optimal_machine_mapping
-level tests
Done.
lib/substitutions/include/substitutions/sub_parallel_computation_graph.h
line 25 at r6 (raw file):
SubParallelComputationGraph get_pcg_subgraph(ParallelComputationGraph const &, std::unordered_set<Node> const &);
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 37 files at r4, 7 of 17 files at r7, all commit messages.
Reviewable status: 33 of 43 files reviewed, 24 unresolved discussions (waiting on @wmdi)
lib/compiler/include/compiler/machine_mapping/allowed_machine_mappings.h
line 12 at r7 (raw file):
allowed_machine_mappings(MachineMappingContext const &context, std::unordered_set<Node> const &nodes, MachineSpecification const &resource);
Suggestion:
std::vector<std::unordered_map<parallel_layer_guid_t, MachineView>>
allowed_machine_mappings(MachineMappingContext const &context,
std::unordered_set<parallel_layer_guid_t> const &nodes,
MachineSpecification const &resource);
lib/compiler/include/compiler/machine_mapping/allowed_machine_mappings.h
line 16 at r7 (raw file):
std::vector<std::unordered_map<DataflowOutput, MachineView>> allowed_machine_mappings(MachineMappingContext const &context, std::unordered_set<DataflowOutput> const &values,
Suggestion:
std::unordered_set<parallel_tensor_guid_t> const &values,
lib/compiler/test/src/test_graph_optimize_state.cc
line 70 at r3 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
This will be fixed in another PR.
Isn't this a trivial fix?
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc
line 146 at r6 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
In terms of actual cost, I mean the cost model we have for a pcg. Even for analytical results, we need a precise cost model.
In a way yes, but you control the cost model: you can add a custom implementation of CostEstimator
that returns whatever numbers you think would be useful for testing. We absolutely do want to (at some point) test the whole compiler+costestimator+machineviews+everything stack, but even if all of that was present we would also want to have tests that just that compiler is correct in isolation, which is what most of these tests should be--provide an arbitrary tiny set of allowed machine views (say, two of them), create a tiny PCG (with, say, 2 operators in parallel), and write a CostEstimator
subclass that gives you back some chosen constants (1 for each of the possible node-machineview assignments). Then solve for what the resulting value of get_optimal_machine_mappings
should be, run the function, and check that the returned value is the same as what you solved analytically.
At this point there shouldn't be anything stopping you from testing pretty much the whole Unity DP in isolation, and once everything is merged we'll add a couple end-to-end tests just to make sure there weren't any unexpected issues at the boundaries between components
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc
line 12 at r7 (raw file):
auto allowed_machine_views1 = [&](ParallelLayerAttrs const &, MachineSpecification const &) { // TODO(@Mengdi Wu): Replace it with actual allowed machine views when
Ideally don't do this--these tests are supposed to confirm that get_optimal_machine_mapping
is correct, not that get_allowed_machine_views
is correct. If you call get_allowed_machine_views
in this test and the test fails, you don't know whether it is because get_optimal_machine_mapping
is wrong or if it s because get_allowed_machine_views
is wrong
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc
line at r6 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
What does this mean?
Rename from test_machine_mapping.cc
to machine_mapping.cc
--the current convention is to not have the test_
prefix, as they're already under the test
directory
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc
line 25 at r7 (raw file):
combine_disjoint_mappings(machine_mapping_0, machine_mapping_1); CHECK(result == correct); }
Add a SUBCASE
to test the failure case
Suggestion:
TEST_CASE("combine_disjoint_mappings(MachineMapping, MachineMappping)") {
MachineView machine_view_0 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(1));
MachineView machine_view_1 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(2));
SUBCASE("input machine mappings are disjoint") {
MachineMapping machine_mapping_0 = MachineMapping({
{Node(0), machine_view_0},
});
MachineMapping machine_mapping_1 = MachineMapping({
{Node(1), machine_view_1},
});
MachineMapping correct = MachineMapping({
{Node(0), machine_view_0},
{Node(1), machine_view_1},
});
MachineMapping result =
combine_disjoint_mappings(machine_mapping_0, machine_mapping_1);
CHECK(result == correct);
}
SUBCASE("input machine mappings are not disjoint") {
...
}
}
lib/substitutions/include/substitutions/sub_parallel_computation_graph.h
line 25 at r6 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
Why?
parallel_layer_guid_t
is a wrapper for Node
that we use for representing Node
s in a ParallelComputationGraph
. It primarily improves readability and helps prevent accidentally passing the wrong node to things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 10 of 17 files at r7.
Reviewable status: all files reviewed, 28 unresolved discussions (waiting on @wmdi)
lib/compiler/include/compiler/machine_mapping/allowed_machine_mappings.h
line 10 at r7 (raw file):
std::vector<std::unordered_map<Node, MachineView>> allowed_machine_mappings(MachineMappingContext const &context,
The type signatures of these methods are a bit odd? At least it's not 100% clear for me what they're doing based on the names and input types, especially since there are two overloads? Can you either clarify the names a bit and/or add doxygen docstrings giving a brief description of what they do?
lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h
line 23 at r6 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
This is to further split the logics into multiple pieces.
get_optimal_machine_mapping
wraps global variables asMachineMappingContext
and calls this function, and this function performs sp decomposition (and solves the un-sp-izable issue in a later version) and calls other overloads.
I think it's probably better to think of the SP decomposition as coming from outside of the optimal machine mapping algorithm, as with Pietro's stuff it at least seems more like creating an SP-ization of a given PCG is a separate step than just part of optimal machine mapping algorithm, but this isn't a bit deal and can always get refactored later. That said, it seems a bit odd to have this function, which conceptually seems to do entirely different things from the other get_optimal_machine_mapping_internal
overloads, have the same name as them, so if you do prefer to use this structure it should at least be renamed to something clearer--ideally I should be able to make a pretty accurate educated guess of what a function does from its name and arguments, and I don't think that's the case here.
lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml
line 29 at r6 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
I think
MachineMappingCache
should be part of the context, and it is fine to have a mutable context.
Is there a significant downside to moving it out? It would make the code a bit more readable as it would make it more obvious what gets mutated and what doesn't
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc
line 9 at r7 (raw file):
namespace FlexFlow { std::vector<std::unordered_map<Node, MachineView>>
Might be cleaner to define a machine_mapping.struct.toml
wrapper type around std::unordered_map<parallel_layer_guid_t, MachineView>
so this turns into the more readable std::vector<MachineMapping>
? Also why is this a std::vector
and not a std::unordered_set
/std::set
? Should there be duplicates or does order matter?
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc
line 14 at r7 (raw file):
MachineSpecification const &resource) { if (nodes.empty()) { return {{}};
Why not an empty vector as a return value? Does an empty machine mapping really have a meaning?
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc
line 29 at r7 (raw file):
} } return enumeration;
Suggestion:
Node curr_node = get_first(nodes);
std::unordered_set<Node> other_nodes = set_minus(nodes, {node});
std::vector<std::unordered_map<Node, MachineView>> other_node_mappings_from_recursion =
allowed_machine_mappings(context, other_nodes, resource);
std::unordered_set<MachineView> allowed_machine_views_for_curr_node =
context.allowed_machine_views(context.pcg.raw_graph.at(node), resource);
std::vector<std::unordered_map<Node, MachineView>> result;
for (MachineView const &for_curr_node : allowed_machine_views_for_curr_node) {
for (std::unordered_map<Node, MachineView> const &for_other_nodes :
other_node_mappings_from_recursion) {
MachineMapping merged = merge_disjoint_machine_mappings(
MachineView{{
{curr_node, for_cudd_node},
}},
for_other_nodes,
});
enumeration.push_back(merged);
}
}
return result;
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc
line 39 at r7 (raw file):
for (DataflowOutput const &v : values) { nodes.insert(v.node); }
Suggestion:
std::unordered_set<Node> nodes transform(values, [](DataflowOutput const &v) { return v.node; });
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc
line 47 at r7 (raw file):
for (std::unordered_map<Node, MachineView> _node_enumeration : node_enumeration) { std::unordered_map<DataflowOutput, MachineView> _emumeration;
Don't use leading underscore, if you need to distinguish it from enumeration
it's probably better to come up with a clearer name as enumeration
is a bit unclear (what is being enumerated? does this have to do with enumerate
from utils/containers/enumerate.h
?)
Code quote:
_emumeration;
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc
line 257 at r7 (raw file):
std::unordered_map<OpenDataflowValue, MachineView> output_mv_map = generate_map(outputs_of_node, [&](OpenDataflowValue const &o) { return mv; });
Suggestion:
[&](OpenDataflowValue const &) { return mv; });
lib/compiler/test/src/test_graph_optimize_state.cc
line 69 at r3 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
This will be fixed in another PR.
In that case can you create an issue for that and link it here?
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc
line 171 at r6 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
Done.
I just see a comment added? Are we planning to add the multiple-input handling in this PR or in a separate one?
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc
line 12 at r7 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("MachineMappingCache") { ParallelComputationGraph pcg = [&] {
Why create a PCG here? It looks like the tests don't actually need it--they just need an SPDecomposition object?
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc
line 58 at r7 (raw file):
MachineSpecification machine_spec(1, 1, 1, 1, 1); MachineMappingState state0(subgraph0, machine_spec, {});
Prefer curly-brace initialization for consistency with the rest of the FF codebase
Suggestion:
MachineMappingState state0 = MachineMappingState{
subgraph0, machine_spec, {}};
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc
line 72 at r7 (raw file):
cache.save(state0, result0); CHECK(cache.load(state0).value() == result0);
Break into SUBCASE
s with titles that explain what you're testing--right now it's just a sea of checks which makes this test rather difficult to understand
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc
line 81 at r7 (raw file):
CHECK(cache.load(state2) == std::nullopt); cache.save(state2, result2);
Include failure cases--what happens if I save a key twice?
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_result.cc
line 21 at r7 (raw file):
MachineMappingResult s2(2, machine_mapping_1); MachineMappingResult result0 = sequential_combine(s0, s1);
Each of these checks/function evaluations should have a SUBCASE
with a title explaining what it's testing/why it's there. Right now it's just a bunch of checks, which makes it difficult to understand quickly what high-level properties you're trying to verify with each. If you can't come up with a concise property that's being verified, the test might need some restructuring/redesigning
(and same with the parallel_combine
testcase below, and the minimize_runtime
testcase below that)
For an example of what I mean, see https://github.com/flexflow/FlexFlow/blob/add-candle-uno-pcg/lib/op-attrs/test/src/op-attrs/ops/concat.cc#L175-L312--each of the failure cases and success cases has a separate SUBCASE
with a title indicating what is being checked, or for a simpler example, https://github.com/flexflow/FlexFlow/blob/add-candle-uno-pcg/lib/utils/test/src/utils/containers/try_merge_nondisjoint_unordered_maps.cc#L9-L75
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_result.cc
line 23 at r7 (raw file):
MachineMappingResult result0 = sequential_combine(s0, s1); CHECK(result0.runtime == 1); CHECK(result0.machine_mapping == machine_mapping_0);
Suggestion:
MachineMappingResult result0 = sequential_combine(s0, s1);
MachineMappingResult correct = MachineMappingResult{machine_mapping_0, 1};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 28 unresolved discussions (waiting on @lockshaw)
lib/compiler/include/compiler/machine_mapping/allowed_machine_mappings.h
line 12 at r7 (raw file):
allowed_machine_mappings(MachineMappingContext const &context, std::unordered_set<Node> const &nodes, MachineSpecification const &resource);
Done.
lib/compiler/include/compiler/machine_mapping/allowed_machine_mappings.h
line 16 at r7 (raw file):
std::vector<std::unordered_map<DataflowOutput, MachineView>> allowed_machine_mappings(MachineMappingContext const &context, std::unordered_set<DataflowOutput> const &values,
Done.
lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h
line 23 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I think it's probably better to think of the SP decomposition as coming from outside of the optimal machine mapping algorithm, as with Pietro's stuff it at least seems more like creating an SP-ization of a given PCG is a separate step than just part of optimal machine mapping algorithm, but this isn't a bit deal and can always get refactored later. That said, it seems a bit odd to have this function, which conceptually seems to do entirely different things from the other
get_optimal_machine_mapping_internal
overloads, have the same name as them, so if you do prefer to use this structure it should at least be renamed to something clearer--ideally I should be able to make a pretty accurate educated guess of what a function does from its name and arguments, and I don't think that's the case here.
I see. But I don't think it is a problem to have this signature. First, it is only used internally, so it has little effect on other parts. And it actually make the internal implementation more convenient. Second, it actually has the similar functionality as other functions that have the same name but just different format of the inputs. We should name functions according to their functionality instead of the detailed implementation.
lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml
line 29 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Is there a significant downside to moving it out? It would make the code a bit more readable as it would make it more obvious what gets mutated and what doesn't
I feel like we should put global variables in MachineMappingContext
as possible and only explicitly pass the variables that belong to DP states. I think it is more clear this way.
lib/compiler/src/unity_algorithm.cc
line 39 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would it be cleaner here to, rather than having a custom
GraphOptimizeState
type whose hash andoperator==
ignore the runtime cost, instead have this be aDeduplicatedPriorityQueue<SomeWrapperAroundAPCG, CustomComparator>
where theCustomComparator
is something like below, whereMyCustomType
is analagous toSomeWrapperAroundAPCG
, and I suppose instead of astd::unordered_map<MyCustomType, float>
you'd have aMachineMappingCache
Sounds a good idea and we will leave all the MCMC search part to another PR.
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc
line 9 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be cleaner to define a
machine_mapping.struct.toml
wrapper type aroundstd::unordered_map<parallel_layer_guid_t, MachineView>
so this turns into the more readablestd::vector<MachineMapping>
? Also why is this astd::vector
and not astd::unordered_set
/std::set
? Should there be duplicates or does order matter?
It is a vector because no deduplication is required here.
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc
line 14 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not an empty vector as a return value? Does an empty machine mapping really have a meaning?
It means the only machine mapping for an empty set of input nodes.
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc
line 29 at r7 (raw file):
} } return enumeration;
Done.
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc
line 39 at r7 (raw file):
for (DataflowOutput const &v : values) { nodes.insert(v.node); }
I don't think we should use transform since it is not a one-to-one mapping but a many-to-one mapping. We should not expect transform to handle many-to-one mapping not matter whether it actually does.
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc
line 47 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Don't use leading underscore, if you need to distinguish it from
enumeration
it's probably better to come up with a clearer name asenumeration
is a bit unclear (what is being enumerated? does this have to do withenumerate
fromutils/containers/enumerate.h
?)
Done.
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc
line 187 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Is there really a point to the subgraphs at this point, or can we just rely on the SP decomposition for this information? I guess it's that the graph cost is cached, and if you cached it for the SP decomposition then you wouldn't be agnostic to different locations of the same graph?
Done.
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc
line 257 at r7 (raw file):
std::unordered_map<OpenDataflowValue, MachineView> output_mv_map = generate_map(outputs_of_node, [&](OpenDataflowValue const &o) { return mv; });
Done.
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc
line 146 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
In a way yes, but you control the cost model: you can add a custom implementation of
CostEstimator
that returns whatever numbers you think would be useful for testing. We absolutely do want to (at some point) test the whole compiler+costestimator+machineviews+everything stack, but even if all of that was present we would also want to have tests that just that compiler is correct in isolation, which is what most of these tests should be--provide an arbitrary tiny set of allowed machine views (say, two of them), create a tiny PCG (with, say, 2 operators in parallel), and write aCostEstimator
subclass that gives you back some chosen constants (1 for each of the possible node-machineview assignments). Then solve for what the resulting value ofget_optimal_machine_mappings
should be, run the function, and check that the returned value is the same as what you solved analytically.At this point there shouldn't be anything stopping you from testing pretty much the whole Unity DP in isolation, and once everything is merged we'll add a couple end-to-end tests just to make sure there weren't any unexpected issues at the boundaries between components
Done.
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc
line 171 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I just see a comment added? Are we planning to add the multiple-input handling in this PR or in a separate one?
In a separate one.
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc
line 12 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Ideally don't do this--these tests are supposed to confirm that
get_optimal_machine_mapping
is correct, not thatget_allowed_machine_views
is correct. If you callget_allowed_machine_views
in this test and the test fails, you don't know whether it is becauseget_optimal_machine_mapping
is wrong or if it s becauseget_allowed_machine_views
is wrong
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc
line at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Rename from
test_machine_mapping.cc
tomachine_mapping.cc
--the current convention is to not have thetest_
prefix, as they're already under thetest
directory
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc
line 25 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Add a
SUBCASE
to test the failure case
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc
line 12 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why create a PCG here? It looks like the tests don't actually need it--they just need an SPDecomposition object?
Yes, this PCG is only used to create the SPD
lib/substitutions/include/substitutions/sub_parallel_computation_graph.h
line 25 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
parallel_layer_guid_t
is a wrapper forNode
that we use for representingNode
s in aParallelComputationGraph
. It primarily improves readability and helps prevent accidentally passing the wrong node to things
Done.
lib/compiler/include/compiler/machine_mapping/machine_mapping.struct.toml
line 21 at r6 (raw file):
[[fields]] name = "machine_views" type = "std::unordered_map<::FlexFlow::Node, ::FlexFlow::MachineView>"
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 3 files at r9, 106 of 150 files at r10, 117 of 145 files at r11, 51 of 60 files at r12, 23 of 23 files at r13, all commit messages.
Reviewable status: all files reviewed, 52 unresolved discussions (waiting on @wmdi)
lib/compiler/include/compiler/machine_mapping/include_unconstrained.struct.toml
line 2 at r11 (raw file):
namespace = "FlexFlow" name = "IncludeUnconstrained"
maybe enum instead? (just an idea)
lib/compiler/include/compiler/machine_mapping/machine_mapping_result.h
line 9 at r12 (raw file):
namespace FlexFlow { [[nodiscard]] MachineMappingResult infeasible_machine_mapping_result();
any strong reason for [[nodiscard]]
?
lib/compiler/include/compiler/machine_mapping/machine_mapping_problem_tree/mm_problem_tree_parallel_split_label.struct.toml
line 2 at r11 (raw file):
namespace = "FlexFlow" name = "MMProblemTreeParallelSplitLabel"
why is it empty?
lib/compiler/src/compiler/machine_mapping/get_machine_resource_splits.cc
line 27 at r12 (raw file):
result.insert(std::make_pair(sub_resource2, sub_resource1)); }
note that we are only partitioning across GPUs here.
Currently, I consider how MachineSpecification
fuses GPU and CPU nodes to be a bit awkard (e.g. it doesn't even have a separate intra and inter node bandwitdh per parameters), I think it should instead have as attributes num_nodes, num_devices_per_node, device_type, inter_node_bandwidth, and intra_node_bandwidth, since this way we can ignore the CPU stuff a lot more and only use it when needed.
Also with the i*=2
in this part of the code feels a bit arbitrary?
lib/compiler/src/compiler/machine_mapping/get_tensor_set_movement_across_split.cc
line 4 at r13 (raw file):
#include "compiler/machine_mapping/abstracted_tensor_set_movement/abstracted_tensor_set_movement.h" #include "compiler/machine_mapping/abstracted_tensor_set_movement/get_abstracted_tensor_set_movement_across_split.h" #include "compiler/machine_mapping/transitive_reduced_pcg.h"
some of this imports seem superfluous
lib/compiler/src/compiler/machine_mapping/machine_mapping_result.cc
line 26 at r13 (raw file):
for (MachineMappingResult const &candidate : candidates) { result = minimize_runtime(result, candidate);
optionally foldl
, but it's honestly pretty clear as it is
lib/compiler/src/compiler/machine_mapping/machine_mapping_result.cc
line 51 at r13 (raw file):
ParallelLayerGuidObliviousMachineMapping mapping = [&] { if (parallel_split_transformation.has_value()
I think this should be if (parallel_split_transformation.has_value())
then evaluate whether it should be LthenR
or RthenL
, here if the split_transformation has no value it calls it with LthenR
lib/compiler/src/compiler/machine_mapping/machine_mapping_result.cc
line 97 at r13 (raw file):
MachineMappingResult minimize_runtime(MachineMappingResult const &maybe_m1, MachineMappingResult const &maybe_m2) { FeasibleMachineMappingResult m1 = ({
This pattern to get the feasible results seems to appear fairly often here, maybe pacakge it into a separate function?
lib/compiler/src/compiler/machine_mapping/machine_mapping_result.cc
line 111 at r13 (raw file):
}); if (m2.runtime < m1.runtime) {
ternary optionally
lib/compiler/test/src/compiler/machine_mapping/machine_mapping_problem_tree/get_machine_mapping_problem_tree.cc
line 63 at r13 (raw file):
UnmappedOpCostEstimateKey input_key = make_input_key(input_shape); PCGBinarySPDecomposition sp_decomposition = \
remove
Code quote:
\
lib/local-execution/src/ops/element_binary.h
line 1 at r13 (raw file):
#ifndef _FLEXFLOW_ELEMENT_BINARY_H
change for all src/ops/
to also include the path? not a big deal
Code quote:
_FLEXFLOW_ELEMENT_BINARY_H
lib/local-execution/src/ops/layer_norm.h
line 1 at r13 (raw file):
#ifndef _FLEXFLOW_RUNTIME_SRC_OPS_LAYER_NORM_H
LOCAL_EXECUTION
Code quote:
RUNTIME
lib/utils/include/utils/full_binary_tree/find_paths_to_leaf.h
line 16 at r13 (raw file):
template <typename ParentLabel, typename LeafLabel> std::unordered_set<BinaryTreePath> find_paths_to_leaf(FullBinaryTree<ParentLabel, LeafLabel> const &tree,
super clean implementation
lib/utils/include/utils/graph/digraph/algorithms/get_edges_from_subgraph_to_subgraph.h
line 8 at r13 (raw file):
std::unordered_set<DirectedEdge> get_edges_from_subgraph_to_subgraph(DiGraphView const &, std::unordered_set<Node> const &,
add variable names for clarity
lib/utils/include/utils/graph/series_parallel/binary_sp_decomposition_tree/generic_binary_sp_decomposition_tree/make.h
line 7 at r13 (raw file):
namespace FlexFlow {
maybe a couple of using SomeShorterName = GenericBinarySPDecompositionTree<SeriesLabel, ParallelLabel, LeafLabel>
and same for the others?
I know it's not really standard for the codebase, but I think in this specific case it could help readability quite a bit.
lib/utils/include/utils/graph/series_parallel/binary_sp_decomposition_tree/leaf_only_binary_sp_decomposition_tree/leaf_only_binary_parallel_split_label.struct.toml
line 2 at r13 (raw file):
namespace = "FlexFlow" name = "LeafOnlyBinaryParallelSplitLabel"
why empty?
lib/utils/include/utils/graph/series_parallel/binary_sp_decomposition_tree/leaf_only_binary_sp_decomposition_tree/leaf_only_binary_series_split_label.struct.toml
line 11 at r13 (raw file):
"rapidcheck", ]
why empty?
lib/utils/src/utils/graph/dataflow_graph/algorithms/transitive_reduced_dataflow_graph/transitive_reduced_dataflow_graph.cc
line 7 at r13 (raw file):
TransitiveReducedDataflowGraphView get_dataflow_graph_transitive_reduction(DataflowGraphView const &g) { DiGraphView as_digraph = g;
use an explicit to_digraph
to cast it (just so this won't break in case we decide to modify the coercion rules)
lib/utils/test/src/utils/containers/unordered_map_from_pairs.cc
line 42 at r13 (raw file):
{1, "b"}, };
I think we should throw an error in this case
lib/compiler/src/compiler/machine_mapping/get_allowed_machine_views_list.cc
line 44 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be more readable as something like this? We'd need to add a
transform_tuples
function or something like it that takes in a container ofstd::tuple<A, B, ...>
and a lambda matchingstd::function<T(A const &, B const &, ...)>
and combines the destructuring andtransform
operations, but I'm sure I or @Marsella8 would be happy to add this tocontainers
as I can think of quite a few places where it would be useful. Not a high priority and the current code is fine as is, just something I was thinking of
yeah seems like a good function to add, let me know if it is needed
lib/compiler/src/compiler/machine_mapping/machine_mapping_constraints.cc
line 75 at r13 (raw file):
} else { if (current_machine_view.value() != machine_view) { throw mk_runtime_error(fmt::format("with_additional_layer_machine_views received machine view assignment for layer {} "
with_additional_constraints
Code quote:
with_additional_layer_machine_views
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 15 files at r1, 1 of 9 files at r3, 9 of 37 files at r4, 1 of 17 files at r7, 4 of 21 files at r8, 1 of 3 files at r9, 103 of 150 files at r10, 84 of 145 files at r11, 40 of 60 files at r12, 18 of 23 files at r13, 83 of 83 files at r14, all commit messages.
Reviewable status: all files reviewed, 52 unresolved discussions (waiting on @lockshaw)
lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml
line 29 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I agree up to the point of grouping mutable and non-mutable global variables into the same type as then the mutation properties become unclear (how do I know that your function isn't also mutating the
pcg
?), especially since only one of the members ever mutable. This is one case where I have a strong preference that we splitMachineMappingCache
out so that the function signatures remain clear about what is safe from mutation
Done.
Description of changes:
This PR recovers the device mapping algorithm (the DP-based algorithm) used in Unity.
Related Issues:
Linked Issues:
Issues closed by this PR:
This change is