diff --git a/docs/src/SUMMARY.md b/docs/src/SUMMARY.md
index 6e692add..2108794a 100644
--- a/docs/src/SUMMARY.md
+++ b/docs/src/SUMMARY.md
@@ -6,14 +6,11 @@
 - [Plan Representation](./plan_repr.md)
 - [Rule Engine](./rule_engine.md)
 - [Cost Model](./cost_model.md)
-
----
-
-- [(WIP) Properties](./properties.md)
+- [Properties](./properties.md)

 # Integration

-- [(WIP) Apache Arrow Datafusion](./datafusion.md)
+- [Apache Arrow Datafusion](./datafusion.md)

 # Adaptive Optimization

diff --git a/docs/src/datafusion.md b/docs/src/datafusion.md
index 93565bba..8d0529c3 100644
--- a/docs/src/datafusion.md
+++ b/docs/src/datafusion.md
@@ -1 +1,100 @@
 # Integration with Datafusion
+
+optd is currently used as a physical optimizer for Apache Arrow Datafusion. To interact with Datafusion, you can start the Datafusion CLI with one of the following commands.
+
+```bash
+cargo run --bin datafusion-optd-cli
+cargo run --bin datafusion-optd-cli -- -f tpch/test.sql # run TPC-H queries
+```
+
+optd is designed as a flexible optimizer framework that can be used in any database system. The core of optd is in `optd-core`, which contains the Cascades optimizer implementation and the definitions of the key structures in the optimization process. Users can implement the interfaces and use optd in their own database systems by using the `optd-core` crate.
+
+The optd Datafusion representation, defined in the `optd-datafusion-repr` crate, contains Datafusion plan nodes, SQL expressions, optimizer rules, properties, and cost models.
+
+The `optd-datafusion-bridge` crate contains the code that converts Datafusion logical plans into the optd Datafusion representation and converts the optd Datafusion representation back into Datafusion physical plans. It implements the `QueryPlanner` trait so that it can be easily integrated into Datafusion.
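
The round trip performed by the bridge can be illustrated with a minimal sketch. It does not use Datafusion or optd: `ToyLogicalPlan`, `ToyOptdNode`, `ToyPhysicalPlan`, `into_optd`, `optimize`, `from_optd`, and `plan_query` are hypothetical stand-ins, and the `optimize` step is a placeholder that maps each logical operator to one physical operator instead of running a Cascades search. The real bridge exposes this flow through Datafusion's `QueryPlanner` trait.

```rust
/// Hypothetical stand-in for a Datafusion logical plan.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyLogicalPlan {
    Scan(String),
    Join(Box<ToyLogicalPlan>, Box<ToyLogicalPlan>),
}

/// Hypothetical stand-in for the optd representation: one node type
/// that holds both logical and physical operators.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyOptdNode {
    LogicalScan(String),
    LogicalJoin(Box<ToyOptdNode>, Box<ToyOptdNode>),
    PhysicalScan(String),
    PhysicalHashJoin(Box<ToyOptdNode>, Box<ToyOptdNode>),
}

/// Hypothetical stand-in for a Datafusion physical plan.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyPhysicalPlan {
    TableScan(String),
    HashJoin(Box<ToyPhysicalPlan>, Box<ToyPhysicalPlan>),
}

/// Step 1 (bridge): logical plan -> optd representation.
pub fn into_optd(plan: &ToyLogicalPlan) -> ToyOptdNode {
    match plan {
        ToyLogicalPlan::Scan(t) => ToyOptdNode::LogicalScan(t.clone()),
        ToyLogicalPlan::Join(l, r) => {
            ToyOptdNode::LogicalJoin(Box::new(into_optd(l)), Box::new(into_optd(r)))
        }
    }
}

/// Step 2 (optimizer placeholder): picks one physical operator per
/// logical operator. A real optimizer would explore alternatives
/// with rules and a cost model.
pub fn optimize(node: &ToyOptdNode) -> ToyOptdNode {
    match node {
        ToyOptdNode::LogicalScan(t) | ToyOptdNode::PhysicalScan(t) => {
            ToyOptdNode::PhysicalScan(t.clone())
        }
        ToyOptdNode::LogicalJoin(l, r) | ToyOptdNode::PhysicalHashJoin(l, r) => {
            ToyOptdNode::PhysicalHashJoin(Box::new(optimize(l)), Box::new(optimize(r)))
        }
    }
}

/// Step 3 (bridge): optd physical nodes -> physical plan.
/// Logical nodes are rejected because they cannot be executed.
pub fn from_optd(node: &ToyOptdNode) -> Result<ToyPhysicalPlan, String> {
    match node {
        ToyOptdNode::PhysicalScan(t) => Ok(ToyPhysicalPlan::TableScan(t.clone())),
        ToyOptdNode::PhysicalHashJoin(l, r) => Ok(ToyPhysicalPlan::HashJoin(
            Box::new(from_optd(l)?),
            Box::new(from_optd(r)?),
        )),
        other => Err(format!("not a physical node: {:?}", other)),
    }
}

/// The whole pipeline, in the role a query planner would play.
pub fn plan_query(plan: &ToyLogicalPlan) -> Result<ToyPhysicalPlan, String> {
    from_optd(&optimize(&into_optd(plan)))
}
```

Calling `plan_query` on a `ToyLogicalPlan::Join` of two scans yields a `ToyPhysicalPlan::HashJoin` of two table scans, while calling `from_optd` directly on an unoptimized node returns an error.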
+
+![integration with Datafusion](./optd-cascades/optd-datafusion-overview.svg)
+
+## Plan Nodes
+
+This is an incomplete list of the Datafusion plan nodes that we have implemented in the system, along with their representations.
+
+```
+Join(type) left:PlanNode right:PlanNode cond:Expr
+Projection expr_list:ExprList
+Agg child:PlanNode expr_list:ExprList groups:ExprList
+Scan table:String
+ExprList ...children:Expr
+Sort child:PlanNode sort_exprs:ExprList <- requiring SortExprs
+... and others
+```
+
+Note that only `ExprList` or `List` can have a variable number of children. All other plan nodes have a fixed number of children. Projections and aggregations, for which users need to provide a list of expressions, have a `List` node as a direct child.
+
+Developers can use the `define_plan_node` macro to add new plan nodes to `optd-datafusion-repr`.
+
+```rust
+#[derive(Clone, Debug)]
+pub struct LogicalJoin(pub PlanNode);
+
+define_plan_node!(
+    LogicalJoin : PlanNode,
+    Join, [
+        { 0, left: PlanNode },
+        { 1, right: PlanNode }
+    ], [
+        { 2, cond: Expr }
+    ], { join_type: JoinType }
+);
+```
+
+Developers will also need to add the plan node type to the `OptRelNodeTyp` enum, implement `is_plan_node` and `is_expression` for it, and implement the explain format in `explain`.
+
+## Expressions
+
+SQL expressions are also a kind of `RelNode`. The representation includes binary expressions, function calls, and other expression types.
+
+Notably, we convert all column references into column indexes in the Datafusion bridge. For example, if Datafusion yields a logical plan of:
+
+```
+LogicalJoin { a = b }
+  Scan t1 [a, v1, v2]
+  Scan t2 [b, v3, v4]
+```
+
+It will be converted to:
+
+```
+LogicalJoin { #0 = #3 }
+  Scan t1
+  Scan t2
+```
+
+in the optd representation. Indexes count across the concatenated outputs of the children, so `a` is `#0` and `b`, the first column of `t2`, is `#3`.
+
+## Explain
+
+We use risinglightdb's `pretty-xmlish` crate to implement a custom explain format for Datafusion plan nodes.
+
+```rust
+PhysicalProjection { exprs: [ #0 ] }
+└── PhysicalHashJoin { join_type: Inner, left_keys: [ #0 ], right_keys: [ #0 ] }
+    ├── PhysicalProjection { exprs: [ #0 ] }
+    │   └── PhysicalScan { table: t1 }
+    └── PhysicalProjection { exprs: [ #0 ] }
+        └── PhysicalScan { table: t2 }
+```
+
+This differs from the default Lisp-style representation of `RelNode`.
+
+## Rules
+
+Currently, we have a few rules that move filters and projections up and down through joins. We also have join associativity and join commutativity rules to reorder joins.
+
+## Properties
+
+We have the `Schema` property, which the optimizer rules use to determine the number of columns of each plan node so that column reference expressions can be rewritten correctly.
+
+## Cost Model
+
+We have a simple cost model that computes I/O cost and compute cost based on the number of rows of the child plan nodes.
diff --git a/docs/src/optd-cascades/optd-cascades-1.svg b/docs/src/optd-cascades/optd-cascades-1.svg
index 2b3c947f..805cd39d 100644
--- a/docs/src/optd-cascades/optd-cascades-1.svg
+++ b/docs/src/optd-cascades/optd-cascades-1.svg
@@ -1,8 +1,8 @@
[SVG markup omitted; title: optd-cascades-1, Layer 1]
diff --git a/docs/src/optd-cascades/optd-cascades-2.svg b/docs/src/optd-cascades/optd-cascades-2.svg
index 2340ef6f..d0e71c60 100644
--- a/docs/src/optd-cascades/optd-cascades-2.svg
+++ b/docs/src/optd-cascades/optd-cascades-2.svg
@@ -1,8 +1,8 @@
[SVG markup omitted; title: optd-cascades-2, Layer 1]
diff --git a/docs/src/optd-cascades/optd-cascades-3.svg b/docs/src/optd-cascades/optd-cascades-3.svg
index 8efe7de5..5625037f 100644
--- a/docs/src/optd-cascades/optd-cascades-3.svg
+++ b/docs/src/optd-cascades/optd-cascades-3.svg
@@ -1,6 +1,6 @@
[SVG markup omitted]
@@ -13,7 +13,7 @@
[SVG markup omitted; title: optd-cascades-3, Layer 1]
diff --git a/docs/src/optd-cascades/optd-cascades-4.svg b/docs/src/optd-cascades/optd-cascades-4.svg
index ca615e59..6717cbaf 100644
--- a/docs/src/optd-cascades/optd-cascades-4.svg
+++ b/docs/src/optd-cascades/optd-cascades-4.svg
@@ -1,8 +1,8 @@
[SVG markup omitted; title: optd-cascades-4, Layer 1]
diff --git a/docs/src/optd-cascades/optd-datafusion-overview.svg b/docs/src/optd-cascades/optd-datafusion-overview.svg
new file mode 100644
index 00000000..9a7a4acc
--- /dev/null
+++ b/docs/src/optd-cascades/optd-datafusion-overview.svg
@@ -0,0 +1,136 @@
[SVG markup omitted; title: optd-datafusion-overview, Layer 1; text labels in file order: Apache Arrow Datafusion; optd; Parser, Planner; SQL; Logical Optimizer; Logical Plan; Datafusion Bridge; Logical Plan; optd Optimizer; Logical Plan (optd repr); Datafusion Bridge; Physical Plan (optd repr); Execution; Physical Plan; logical + physical rules]
diff --git a/docs/src/optd-cascades/optd-plan-repr-1.svg b/docs/src/optd-cascades/optd-plan-repr-1.svg
index 689ebfc2..b5c72d9e 100644
--- a/docs/src/optd-cascades/optd-plan-repr-1.svg
+++ b/docs/src/optd-cascades/optd-plan-repr-1.svg
@@ -1,8 +1,8 @@
[SVG markup omitted; title: optd-plan-repr-1, Layer 1]
diff --git a/docs/src/optd-cascades/optd-plan-repr-2.svg b/docs/src/optd-cascades/optd-plan-repr-2.svg
index 2bc57563..edb987fa 100644
--- a/docs/src/optd-cascades/optd-plan-repr-2.svg
+++ b/docs/src/optd-cascades/optd-plan-repr-2.svg
@@ -1,8 +1,8 @@
[SVG markup omitted; title: optd-plan-repr-2, Layer 1]
diff --git a/docs/src/optd-cascades/optd-rule-1.svg b/docs/src/optd-cascades/optd-rule-1.svg
index 4910c7fa..aec238d5 100644
--- a/docs/src/optd-cascades/optd-rule-1.svg
+++ b/docs/src/optd-cascades/optd-rule-1.svg
@@ -1,8 +1,8 @@
[SVG markup omitted; title: optd-rule-1, Layer 1]
diff --git a/docs/src/optd-cascades/optd-rule-2.svg b/docs/src/optd-cascades/optd-rule-2.svg
index 8023a804..39bf9166 100644
--- a/docs/src/optd-cascades/optd-rule-2.svg
+++ b/docs/src/optd-cascades/optd-rule-2.svg
@@ -1,6 +1,6 @@
[SVG markup omitted]
@@ -8,7 +8,7 @@
[SVG markup omitted; title: optd-rule-2, Layer 1]
diff --git a/docs/src/plan_repr.md b/docs/src/plan_repr.md
index 8e31ef5e..a7d41c16 100644
--- a/docs/src/plan_repr.md
+++ b/docs/src/plan_repr.md
@@ -5,7 +5,7 @@ optd uses serialized representation of expressions internally. This makes it sup
 *the optd representation -- one universal structure for all things*

 ```rust
-pub struct RelNode {
+pub struct RelNode {
     pub typ: Typ,
     pub children: Vec,
     pub data: Option,
@@ -36,8 +36,8 @@ impl Join {

 ```rust
 pub struct Join {
-    pub left: RelNode,
-    pub right: RelNode,
+    pub left: RelNode,
+    pub right: RelNode,
     pub cond: Expression,
     pub join_type: JoinType,
 }
diff --git a/docs/src/properties.md b/docs/src/properties.md
index d03ffb6a..ba646d4e 100644
--- a/docs/src/properties.md
+++ b/docs/src/properties.md
@@ -1 +1,54 @@
 # Properties
+
+In optd, properties are defined by implementing the `PropertyBuilder` trait in `optd-core/src/property.rs`. Properties are inferred automatically when plan nodes are added to the memo table. When initializing an optimizer instance, developers need to provide a vector of the properties that the optimizer will compute throughout the optimization process.
+
+## Define a Property
+
+Currently, optd only supports logical properties; it cannot yet optimize a query plan with required physical properties. An example of a property definition is the Datafusion representation's plan node schema, in `optd-datafusion-repr/src/properties/schema.rs`.
+
+
+```rust
+impl PropertyBuilder for SchemaPropertyBuilder {
+    type Prop = Schema;
+
+    fn derive(
+        &self,
+        typ: OptRelNodeTyp,
+        data: Option,
+        children: &[&Self::Prop],
+    ) -> Self::Prop {
+        match typ {
+            OptRelNodeTyp::Scan => {
+                let name = data.unwrap().as_str().to_string();
+                self.catalog.get(&name)
+            }
+            // ...
+```
+
+The schema property builder implements the `derive` function, which takes the plan node type, the plan node data, and the children's properties, and infers the property of the current plan node. The schema property is stored as a vector of data types in the `Schema` structure. In optd, each property is type-erased and stored in a `Box` along with each `RelNode` group in the memo table. Developers do not need to handle the type erasure themselves; they work with typed APIs.
+
+## Use a Property
+
+When initializing an optimizer instance, developers need to provide a vector of the property builders to be computed. A property can then be retrieved using its index in the vector and the property builder type. For example, some optimizer rules need to know the number of columns of a plan node before rewriting an expression.
+
+The current Datafusion optd optimizer is initialized with:
+
+```rust
+CascadesOptimizer::new_with_prop(
+    rules,
+    Box::new(cost_model),
+    vec![Box::new(SchemaPropertyBuilder::new(catalog))],
+    // ..
+),
+```
+
+Therefore, developers can use index 0 and `SchemaPropertyBuilder` to retrieve the schema of a plan node after adding the node to the optimizer memo table.
+
+```rust
+impl PlanNode {
+    pub fn schema(&self, optimizer: CascadesOptimizer) -> Schema {
+        let group_id = optimizer.resolve_group_id(self.0.clone());
+        optimizer.get_property_by_group::<SchemaPropertyBuilder>(group_id, 0 /* property ID */)
+    }
+}
+```
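
The type erasure and typed retrieval described above can be illustrated with a small self-contained sketch. It is not optd's implementation: `ToyPropertyBuilder`, `ToySchema`, `ToySchemaBuilder`, `ToyMemoGroup`, `add_property`, and `get_property` are hypothetical names, and the toy `derive` takes fewer arguments than the real `PropertyBuilder::derive`. It only shows how a property could be stored as `Box<dyn Any>` and read back through a typed API using the builder type and an index.

```rust
use std::any::Any;

/// Hypothetical, simplified stand-in for a property builder.
/// The associated type lets callers get a typed value back.
pub trait ToyPropertyBuilder {
    type Prop: Any + Clone;

    /// Derive this node's property from its children's properties.
    fn derive(&self, children: &[&Self::Prop]) -> Self::Prop;
}

/// Toy schema property: only tracks the number of columns.
#[derive(Clone, Debug, PartialEq)]
pub struct ToySchema {
    pub num_columns: usize,
}

/// Toy builder: a leaf gets `leaf_columns` columns; any other node
/// concatenates its children's columns (as a join would).
pub struct ToySchemaBuilder {
    pub leaf_columns: usize,
}

impl ToyPropertyBuilder for ToySchemaBuilder {
    type Prop = ToySchema;

    fn derive(&self, children: &[&ToySchema]) -> ToySchema {
        if children.is_empty() {
            ToySchema { num_columns: self.leaf_columns }
        } else {
            ToySchema { num_columns: children.iter().map(|c| c.num_columns).sum() }
        }
    }
}

/// Toy memo group: properties are stored type-erased, one slot per builder.
#[derive(Default)]
pub struct ToyMemoGroup {
    properties: Vec<Box<dyn Any>>,
}

impl ToyMemoGroup {
    /// Erase the concrete type when storing; returns the property index.
    pub fn add_property<B: ToyPropertyBuilder>(&mut self, prop: B::Prop) -> usize {
        self.properties.push(Box::new(prop));
        self.properties.len() - 1
    }

    /// Typed accessor: the builder type `B` tells us which concrete type
    /// to downcast to. Returns `None` if the index is out of range or the
    /// stored value has a different type.
    pub fn get_property<B: ToyPropertyBuilder>(&self, idx: usize) -> Option<B::Prop> {
        self.properties.get(idx)?.downcast_ref::<B::Prop>().cloned()
    }
}
```

A caller would write `group.get_property::<ToySchemaBuilder>(0)` and receive an `Option<ToySchema>`, so the `Box<dyn Any>` never appears outside the memo group. Returning `Option` rather than panicking on a type mismatch is a choice made for this sketch only.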