-
Notifications
You must be signed in to change notification settings - Fork 2
5. Extension examples
We already implemented two example extended syntax in DBSim, SimSelectionSyntax
(in dbsim/extensions/extended_syntax/sim_select_syntax.py
) and SpatialSyntax
(in dbsim/extensions/extended_syntax/spatial_syntax.py
). Here we use the first one to demonstrate how to develop an extended syntax and register it step by step. We will only show the critical code snippets here. Please refer to the source code file for runnable code.
This is an extended syntax that introduces new data type Vector
and new predicate operator to
for computing distance between two vectors. An example query with this syntax:
SELECT musical.title, musical.year
FROM
(SELECT *
FROM
(SELECT * FROM animation, musical WHERE animation.mid = musical.mid)
WHERE
animation.embedding to [1,2,3,4] < 10
)
WHERE musical.year > 1960
where animation.embedding
is a Vector
type column and [1,2,3,4]
is a Vector
type literal.
The first step is to implement the classes of new operators. Their base classes are from ast.py
.
class SimSelectionOp(SelectionOp):
__slots__ = ('relation','bool_op','schema', 'cost_factor', 'num_input_rows')
def __init__(self, relation, bool_op, schema=None, cost_factor=DEFAULT_COST_FACTOR, num_input_rows=-1):
super().__init__(relation, bool_op, schema, cost_factor, num_input_rows)
For SELECT clause, a SimSelectionOp
will be generated when the query satisfies SimSelectionSyntax, otherwise the clause will be parsed to a standard SelectionOp
.
class ToOp(BinaryOp):
"""lhs to rhs"""
__slots__ = ('lhs', 'rhs', 'cost_factor')
def __init__(self, lhs, rhs, cost_factor: float = DEFAULT_COST_FACTOR * VecDim):
self.cost_factor = cost_factor
self.lhs = lhs
self.rhs = rhs
def __str__(self):
return "{} TO {}".format(str(self.lhs), str(self.rhs))
The new keyword to
will be parsed into a predicate operator ToOp
, which should be a subclass of the BinaryOp
as it requires both left hand and right hand operands.
class Vector(Const):
__slots__ = ('const',)
def __init__(self, vector: Vec):
ERROR_IF_NOT_INSTANCE_OF(vector, Vec, "Vector can only be initialized with {} ({} received)".format(getClassNameOfClass(Vec), type(vector)))
self.const = vector
self.cost_factor = TINY_COST_FACTOR * vector.size
def __str__(self):
return str(self.const)
def equal(self, other: 'Vector', ignore_schema: bool = True, match_loadop_and_relation: bool = False) -> bool:
return np.array_equal(self.const, other.const)
The new constant operator Vector
is used to represent a Vector
literal like [1,2,3,4]
in the example query above.
Step 2: defining the class of new syntax and implementing the expression level interfaces (i.e., the expected class attributes)
class SimSelectionSyntax(ExtendedSyntax):
"""
Example #1:
simselect employee_id
from employees
where employees.vector to [1.0,2.0,1.2,5.1] < 1.5
Example #2:
simselect employees.vector to [1.0,2.0,1.2,5.1]
from employees
"""
simselect_keyword = 'simselect'
extended_mul_level_ops = {'to': ToOp}
_extended_symbols_: str = '[]'
_extended_clause_keywords_ = {SQLClause.SELECT: simselect_keyword}
_extended_data_types_: typing.Dict[typing.Type[Const], str] = {Vector: 'VECTOR'}
_extended_data_types_converter_ = convert_values_to_vectors
_extended_predicate_parsers_ = ({
PredExprLevel.MUL: parse_multiplicative_exp,
PredExprLevel.VALUE: parse_value_exp
}, False)
_extended_predicate_op_executors_: \
typing.Dict[typing.Type[SimpleOp], typing.Callable[[typing.Any], typing.Callable]] = \
{ToOp: get_to_op_executor()}
_extended_relation_op_schema_: \
typing.Dict[typing.Type[SuperRelationalOp], typing.Callable[[SuperRelationalOp, 'DataSet'], 'Schema']] = \
{SimSelectionOp: schema_interpreter.schema_from_relation}
_extended_relation_op_executors_: \
typing.Dict[typing.Type[SuperRelationalOp], typing.Callable[[typing.Any], typing.Callable]] = \
{SimSelectionOp: main_compiler.selection_op}
The class SimSelectionSyntax
is inherited from ExtendedSyntax
and implements the class attributes expected by ExtendedSyntax
as mentioned above.
-
_extended_symbols_: str = '[]'
The
[
and]
are not standard SQL symbols, so we add them via_extended_symbols_
. Note that_extended_symbols_
only supports adding single-character symbols, i.e., the string will be split into single characters and added into DBSim kernel. -
_extended_clause_keywords_ = {SQLClause.SELECT: simselect_keyword}
_extended_clause_keywords_
accepts a dictionary to add new SQL clause keywords, like heresimselect
will be added such that DBSim can parse both ofselect
andsimselect
statements. -
_extended_data_types_: typing.Dict[typing.Type[Const], str] = {Vector: 'VECTOR'}
_extended_data_types_
accepts a dictionary including "datatype_class -> datatype_name" to add the new data types to DBSim, where the datatype_class is the extended constant operator and datatype_name is its name string. The name string will then be registered infield.FieldType
(defined indbsim/field.py
). -
_extended_data_types_converter_ = convert_values_to_vectors def convert_values_to_vectors(df_column): num_detect_samples = 5 if df_column.dtype == object and all(value.startswith('[') and value.endswith(']') for value in df_column[:num_detect_samples]): return df_column.apply(lambda value: toVector(value)), field.FieldType.VECTOR return df_column, None
_extended_data_types_converter_
accepts a function object which inputs a Pandas DataFrame column (i.e., a Pandas Series) and converts the values in this column into specific custom data type. Currently this function is only used when loading data adapter from a csv/tsv file. So if you do not need to load your data from files, you can ignore this class attribute. -
_extended_predicate_parsers_ = ({ PredExprLevel.MUL: parse_multiplicative_exp, PredExprLevel.VALUE: parse_value_exp }, False)`
_extended_predicate_parsers_
accepts a tuple (dict(expression_level -> expression_parser), whether_to_block_exceptions)-
expression_level
is the priority level of a predicate operator (also called "operator precedence"). Basically, DBSim categories the computing priorities of predicate operators into several levels fromPredExprLevel.OR
(the lowest priority) toPredExprLevel.VALUE
(the highest level). In this example,Vector
is a constant operator atPredExprLevel.VALUE
level whileToOp
is with the same priority as multiplicative operators ("*" and "/"), i.e.,PredExprLevel.MUL
level. SeePredExprLevel
indbsim/extensions/extended_syntax/registry_utils.py
for details. -
expression_parser
is the parsing function for the new operators in the corresponding expression_level. It should raise aParsingFailure
exception if the current query is out of its scope, to let DBSim try the next parser on the same expression_level.-
For example: the
parse_value_exp
is defined as follows. It only tries to parse a Vector literal. When the current query tokens are not recognized as a Vector literal, it raisesParsingFailure
.def parse_value_exp(tokens: TokenList) -> Expr: token = tokens[0] vector = [] if token == '[': # parse it as a vector literal tokens.pop(0) if len(tokens) == 0: raise SQLSyntaxError("missing closing ']'") token = tokens.pop(0) cur_element = "" while token != ']': if not token.isdigit() and token not in (',', '.'): raise SQLSyntaxError("invalid symbol found in vector.") if token.isdigit() or token == '.': cur_element += token elif token == ',': try: cur_element = float(cur_element) except: raise SQLSyntaxError("invalid vector element: '{}'".format(cur_element)) vector.append(cur_element) cur_element = "" if len(tokens) == 0: raise SQLSyntaxError("missing closing ']'") token = tokens.pop(0) if len(cur_element) > 0: vector.append(float(cur_element)) return Vector(np.array(vector)) else: raise ParsingFailure
-
-
whether_to_block_exceptions
is a boolean value indicating whether to raise exceptions or mute them when they are raised by the expression_parsers. True to mute, False to raise them and interrupt the program. Note that theParsingFailure
will always be muted in both cases.
-
-
_extended_predicate_op_executors_: \ typing.Dict[typing.Type[SimpleOp], typing.Callable[[typing.Any], typing.Callable]] = \ {ToOp: get_to_op_executor()} def getVecDistance(v1: Vec, v2: Vec, metric: Metric = Metric.EUC) -> float: ERROR_IF_NOT_EQ( v1.shape, v2.shape, "could not compute the distance between vectors with different shapes {} and {}"\ .format(v1.shape, v2.shape) ) if metric == Metric.EUC: return euclidean(v1, v2) if metric == Metric.COS_DIST: return cosine(v1, v2) if metric == Metric.COS_SIM: return 1 - cosine(v1, v2) if metric == Metric.DOT: return np.dot(v1, v2) def get_to_op_executor(): from ...compilers import local as main_compiler return partial(main_compiler.binary_op, getVecDistance)
_extended_predicate_op_executors_
accepts a dictionary of "predicate_operator_class -> executor", where theexecutor
is a function that inputs two (if the operator is binary) or one (if not binary) operands and outputs the results. Normally users can directly use or partially borrow the utilities fromdbsim/compilers/local.py
to implementexecutor
. We recommend users to look into that file for more details. -
_extended_relation_op_schema_: \ typing.Dict[typing.Type[SuperRelationalOp], typing.Callable[[SuperRelationalOp, 'DataSet'], 'Schema']] = \ {SimSelectionOp: schema_interpreter.schema_from_relation}`
_extended_relation_op_schema_
accepts a dictionary of "relational_operator_class -> schema_resolving_function", where theschema_resolving_function
is the function to compute the output relation schema for the relational operator based on its input relation schema. Normally users can directly use or partially borrow the utilities fromdbsim/schema_interpreter.py
to implementschema_resolving_function
. We recommend users to look into that file for more details. -
from ...compilers import local as main_compiler _extended_relation_op_executors_: \ typing.Dict[typing.Type[SuperRelationalOp], typing.Callable[[typing.Any], typing.Callable]] = \ {SimSelectionOp: main_compiler.selection_op}
This is similar to
_extended_predicate_op_executors_
, except each dictionary key is an extended relational operator instead of predicate operator.
The parsing principles of SQL clauses by SimSelectionSyntax
is as follows:
- if keyword
to
occurs in SELECT or WHERE clauses, parse the query bySimSelectionSyntax
- or if keyword
simselect
occurs in SELECT clause, parse it bySimSelectionSyntax
- when
simselect
occurs, there must be at least oneto
keyword in either SELECT or WHERE clauses, otherwise the query is invalid.
Based on the principles, the SQL-clause level interfaces are implemented as follows: (note that all these interfaces should be implemented as instance methods as discussed above)
-
Necessary flags:
def __init__(self): self.simselect_detected: bool = False self.to_keyword_detected_in_select_clause: bool = False self.I_am_triggered: bool = False
-
self.simselect_detected
: A flag for whether the query is a standardselect
query or asimselect
query. This is used byparse_simselect_where
to determine whether to throw an error when keyword 'to' is met. Specifically, when 'simselect' is detected, there must be at least one 'to' existing in either the SELECT cluase or the WHERE clause of the current query, otherwise a syntax error should be thrown. -
self.to_keyword_detected_in_select_clause
: A flag for whether 'to' keyword is found in the SELECT clause of the current query. If it is found, aSimSelectionOp
instead ofSelectionOp
will be generated, whethersimselect
is used or not. If it is not found in SELECT clause, it will be checked in WHERE clause later, and if found there, aSimSelectionOp
will be generated. Only when there is noto
and nosimselect
keywords detected in both of SELECT and WHERE clauses, the query will be parsed to a standard selection (SelectionOp
). -
self.I_am_triggered
: A flag to telltrigger_simselect_where
iftrigger_simselect
returns True. If yes,trigger_simselect_where
will always return True. Otherwisetrigger_simselect_where
will do its own keywords checking and return True or False.
-
-
trigger_function and parsing_function for SELECT clause
def trigger_simselect(self, tokens: TokenList) -> bool: try: index_of_from_clause = tokens.index('from') # 'from' exists in tokens, there may be nested queries. # In such case only needs to check the tokens before 'from', # and the nested queries will be checked later # during the recursive parsing for the inner select statements. self.I_am_triggered = \ tokens[0] == self.simselect_keyword \ or (tokens[0] == 'select' and 'to' in tokens[:index_of_from_clause]) return self.I_am_triggered except ValueError: # no 'from' in tokens, i.e., there are no nested queries self.I_am_triggered = \ tokens[0] == self.simselect_keyword \ or (tokens[0] == 'select' and 'to' in tokens) return self.I_am_triggered except Exception as e: raise e def parse_simselect(self, tokens: TokenList) -> List[Expr]: # resets the flags to avoid the influence from previous queries self.simselect_detected = False self.to_keyword_detected_in_select_clause = False if tokens[0] not in ('select', self.simselect_keyword): raise SQLSyntaxError keyword = tokens.pop(0) if keyword == self.simselect_keyword: # 'simselect' detected self.simselect_detected = True for token in tokens: if token in ('from', 'where'): # the current SELECT clause ends at 'from' or 'where' or the tail of tokens break if token == 'to': # 'to' detected in the current SELECT clause self.to_keyword_detected_in_select_clause = True # Whether the extended keywords 'simselect' and 'to' are found or not, # simply does the same thing as standard select, # i.e., using select_core_exp to parse the columns to be selected select_cols = toolbox.select_core_exp(tokens) return select_cols
-
trigger_function and parsing_function for WHERE clause
def trigger_simselect_where(self, tokens: TokenList) -> bool: return self.I_am_triggered or (tokens[:1] == ['where'] and 'to' in tokens) def parse_simselect_where(self, tokens: TokenList, relation: Expr) -> Expr: if tokens[:1] == ['where']: tokens.pop(0) # validate the syntax requirements on the occurrance of 'simselect' and 'to' to_keyword_detected_in_where_clause = False for token in tokens: if token in ('select', 'from', 'where', self.simselect_keyword): # If any one among the 'select', 'from', 'where' and 'simselect' keywords is met, # it means the current WHERE clause is part of a nested query, # and the current WHERE clause reaches at most the token right before the met keyword, # i.e., the met keyword can be used as a rough boundary of the current WHERE clause # for the purpose of checking occurrance of 'to'. break if token == 'to': # 'to' detected in the current WHERE clause to_keyword_detected_in_where_clause = True if self.simselect_detected: ERROR_IF_FALSE( self.to_keyword_detected_in_select_clause or to_keyword_detected_in_where_clause, "simselect requires at least one 'to' expression, zero found.", ExtendedSyntaxError ) if to_keyword_detected_in_where_clause or self.to_keyword_detected_in_select_clause or self.simselect_detected: # if 'to' keyword is found in WHERE clause or in SELECT clause, # or the keyword 'simselect' instead of 'select' is used in SELECT clause, # calls extended_where_core_expr here to generate a SimSelectionOp relation: SimSelectionOp = self.extended_where_core_expr(tokens, relation) else: # otherwise, calls the standard where_core_expr to generate a standard SelectionOp relation: SelectionOp = toolbox.where_core_expr(tokens, relation) else: # The current query has no WHERE clause, # but we still need to validate the syntax requirements # on the occurrance of 'simselect' and 'to'. if self.simselect_detected: ERROR_IF_FALSE( self.to_keyword_detected_in_select_clause, "simselect requires at least one 'to' expression, zero found.", ExtendedSyntaxError ) return relation def extended_where_core_expr(self, tokens: TokenList, relation: Expr) -> SimSelectionOp: return SimSelectionOp(relation, toolbox.or_exp(tokens))
registry: Registry = OrderedDict({
"simselect": RegEntry(
syntax=SimSelectionSyntax,
clause_parsers=OrderedDict({
SQLClause.SELECT: ("trigger_simselect", "parse_simselect"),
SQLClause.WHERE: ("trigger_simselect_where", "parse_simselect_where")
}),
entry_points=[
SimSelectionSyntax.addExtendedSymbolsAndKeywords,
SimSelectionSyntax.addExtendedPredicateParsers,
SimSelectionSyntax.addExtendedDataTypes,
SimSelectionSyntax.addExtendedPredicateOps,
SimSelectionSyntax.addExtendedRelationOps
]
),
})
The final step is to fill a RegEntry
of SimSelectionSyntax
into registry
.
Note:
-
The methods listed in
entry_points
are those "mounting methods" inherited fromExtendedSyntax
.SimSelectionSyntax
should never overwrite them. -
The SQL-clause level and expression level interfaces are independent from each other. You can flexibly implement and register either, both, or neither of them. The non-implemented or non-registered interfaces will be replaced by the native standard implementation automatically inside DBSim. Very flexible.
-
In
dbsim/extensions/extended_syntax/
, we also implemented another syntax calledSpatialSyntax
includingPoint
andCircle
data types andinside
operator to check if aPoint
is inside aCircle
. You can make the two syntax both in effect by registeringSpatialSyntax
also inregistry
. We include some example queries for mixture of the two extended syntax indbsim/tests/test_extended_syntax.py
. But we do not recommend it, even though DBSim provides such a feature to mix up multiple extended syntax. It is not easy to make sure the mixed syntax have no conflict with each other. If those syntax are complex, hidden bugs are very likely to exist due to unrealized conflicts. So we recommend users to include everything in one single extended syntax, which is the best way to develop and debug. -
You can try queries with the mixed syntax (
SimSelectionSyntax
+SpatialSyntax
+standard syntax) by running the unit testtest_ExtendedSyntax_mixture()
indbsim/tests/test_extended_syntax.py
. But again, we do not recommend syntax mixture like that.