Skip to content

5. Extension examples

wyfunique edited this page Apr 22, 2022 · 3 revisions

Step by step example

We already implemented two example extended syntax in DBSim, SimSelectionSyntax (in dbsim/extensions/extended_syntax/sim_select_syntax.py) and SpatialSyntax (in dbsim/extensions/extended_syntax/spatial_syntax.py). Here we use the first one to demonstrate how to develop an extended syntax and register it step by step. We will only show the critical code snippets here. Please refer to the source code file for runnable code.

About SimSelectionSyntax

This is an extended syntax that introduces new data type Vector and new predicate operator to for computing distance between two vectors. An example query with this syntax:

SELECT musical.title, musical.year
    FROM 
      (SELECT * 
        FROM 
          (SELECT * FROM animation, musical WHERE animation.mid = musical.mid) 
        WHERE 
          animation.embedding to [1,2,3,4] < 10
      )
    WHERE musical.year > 1960

where animation.embedding is a Vector type column and [1,2,3,4] is a Vector type literal.

Step 1: defining the classes of new operators

The first step is to implement the classes of new operators. Their base classes are from ast.py.

class SimSelectionOp(SelectionOp):
  __slots__ = ('relation','bool_op','schema', 'cost_factor', 'num_input_rows')
  def __init__(self, relation, bool_op, schema=None, cost_factor=DEFAULT_COST_FACTOR, num_input_rows=-1):
    super().__init__(relation, bool_op, schema, cost_factor, num_input_rows)

For SELECT clause, a SimSelectionOp will be generated when the query satisfies SimSelectionSyntax, otherwise the clause will be parsed to a standard SelectionOp.

class ToOp(BinaryOp):
  """lhs to rhs"""
  __slots__ = ('lhs', 'rhs', 'cost_factor')
  def __init__(self, lhs, rhs, cost_factor: float = DEFAULT_COST_FACTOR * VecDim):
    self.cost_factor = cost_factor
    self.lhs = lhs
    self.rhs = rhs
  def __str__(self):
    return "{} TO {}".format(str(self.lhs), str(self.rhs))

The new keyword to will be parsed into a predicate operator ToOp , which should be a subclass of the BinaryOp as it requires both left hand and right hand operands.

class Vector(Const):
  __slots__ = ('const',)

  def __init__(self, vector: Vec):
    ERROR_IF_NOT_INSTANCE_OF(vector, Vec, "Vector can only be initialized with {} ({} received)".format(getClassNameOfClass(Vec), type(vector)))
    self.const = vector
    self.cost_factor = TINY_COST_FACTOR * vector.size
  def __str__(self):
    return str(self.const)
  def equal(self, other: 'Vector', ignore_schema: bool = True, match_loadop_and_relation: bool = False) -> bool:
    return np.array_equal(self.const, other.const)

The new constant operator Vectoris used to represent a Vector literal like [1,2,3,4] in the example query above.

Step 2: defining the class of new syntax and implementing the expression level interfaces (i.e., the expected class attributes)

class SimSelectionSyntax(ExtendedSyntax):
  """
  Example #1:

    simselect employee_id
    from employees
    where employees.vector to [1.0,2.0,1.2,5.1] < 1.5
  
  Example #2:

    simselect employees.vector to [1.0,2.0,1.2,5.1]
    from employees
  """
  simselect_keyword = 'simselect'
  extended_mul_level_ops = {'to': ToOp}

  _extended_symbols_: str = '[]'
  _extended_clause_keywords_ = {SQLClause.SELECT: simselect_keyword}
  _extended_data_types_: typing.Dict[typing.Type[Const], str] = {Vector: 'VECTOR'}
  _extended_data_types_converter_ = convert_values_to_vectors
  _extended_predicate_parsers_ = ({
    PredExprLevel.MUL: parse_multiplicative_exp,
    PredExprLevel.VALUE: parse_value_exp
  }, False)
  _extended_predicate_op_executors_: \
      typing.Dict[typing.Type[SimpleOp], typing.Callable[[typing.Any], typing.Callable]] = \
        {ToOp: get_to_op_executor()}
  _extended_relation_op_schema_: \
      typing.Dict[typing.Type[SuperRelationalOp], typing.Callable[[SuperRelationalOp, 'DataSet'], 'Schema']] = \
        {SimSelectionOp: schema_interpreter.schema_from_relation}
  _extended_relation_op_executors_: \
      typing.Dict[typing.Type[SuperRelationalOp], typing.Callable[[typing.Any], typing.Callable]] = \
        {SimSelectionOp: main_compiler.selection_op}

The class SimSelectionSyntax is inherited from ExtendedSyntax and implements the class attributes expected by ExtendedSyntax as mentioned above.

  1. _extended_symbols_: str = '[]'

    The [ and ] are not standard SQL symbols, so we add them via _extended_symbols_. Note that _extended_symbols_ only supports adding single-character symbols, i.e., the string will be split into single characters and added into DBSim kernel.

  2. _extended_clause_keywords_ = {SQLClause.SELECT: simselect_keyword}

    _extended_clause_keywords_ accepts a dictionary to add new SQL clause keywords, like here simselect will be added such that DBSim can parse both of select and simselect statements.

  3. _extended_data_types_: typing.Dict[typing.Type[Const], str] = {Vector: 'VECTOR'}

    _extended_data_types_ accepts a dictionary including "datatype_class -> datatype_name" to add the new data types to DBSim, where the datatype_class is the extended constant operator and datatype_name is its name string. The name string will then be registered in field.FieldType (defined in dbsim/field.py).

  4. _extended_data_types_converter_ = convert_values_to_vectors
    
    def convert_values_to_vectors(df_column):
          num_detect_samples = 5
          if df_column.dtype == object and all(value.startswith('[') and value.endswith(']') for value in df_column[:num_detect_samples]):
            return df_column.apply(lambda value: toVector(value)), field.FieldType.VECTOR 
          return df_column, None

    _extended_data_types_converter_ accepts a function object which inputs a Pandas DataFrame column (i.e., a Pandas Series) and converts the values in this column into specific custom data type. Currently this function is only used when loading data adapter from a csv/tsv file. So if you do not need to load your data from files, you can ignore this class attribute.

  5. _extended_predicate_parsers_ = ({
        PredExprLevel.MUL: parse_multiplicative_exp,
        PredExprLevel.VALUE: parse_value_exp
      }, False)`

    _extended_predicate_parsers_ accepts a tuple (dict(expression_level -> expression_parser), whether_to_block_exceptions)

    • expression_level is the priority level of a predicate operator (also called "operator precedence"). Basically, DBSim categories the computing priorities of predicate operators into several levels from PredExprLevel.OR(the lowest priority) to PredExprLevel.VALUE(the highest level). In this example, Vector is a constant operator at PredExprLevel.VALUE level while ToOp is with the same priority as multiplicative operators ("*" and "/"), i.e., PredExprLevel.MUL level. See PredExprLevel in dbsim/extensions/extended_syntax/registry_utils.py for details.

    • expression_parser is the parsing function for the new operators in the corresponding expression_level. It should raise a ParsingFailure exception if the current query is out of its scope, to let DBSim try the next parser on the same expression_level.

      • For example: the parse_value_exp is defined as follows. It only tries to parse a Vector literal. When the current query tokens are not recognized as a Vector literal, it raises ParsingFailure.

        def parse_value_exp(tokens: TokenList) -> Expr:
          token = tokens[0]
          vector = []
          if token == '[':
            # parse it as a vector literal
            tokens.pop(0)
            if len(tokens) == 0:
              raise SQLSyntaxError("missing closing ']'")
            token = tokens.pop(0)
            cur_element = ""
            while token != ']':
              if not token.isdigit() and token not in (',', '.'):
                raise SQLSyntaxError("invalid symbol found in vector.")
              if token.isdigit() or token == '.':
                cur_element += token
              elif token == ',':
                try:
                  cur_element = float(cur_element)
                except:
                  raise SQLSyntaxError("invalid vector element: '{}'".format(cur_element))
                vector.append(cur_element)
                cur_element = ""
              if len(tokens) == 0:
                raise SQLSyntaxError("missing closing ']'")
              token = tokens.pop(0)
            if len(cur_element) > 0:
              vector.append(float(cur_element))
            return Vector(np.array(vector))
          else:
            raise ParsingFailure
    • whether_to_block_exceptions is a boolean value indicating whether to raise exceptions or mute them when they are raised by the expression_parsers. True to mute, False to raise them and interrupt the program. Note that the ParsingFailure will always be muted in both cases.

  6. _extended_predicate_op_executors_: \
          typing.Dict[typing.Type[SimpleOp], typing.Callable[[typing.Any], typing.Callable]] = \
            {ToOp: get_to_op_executor()}
    
    def getVecDistance(v1: Vec, v2: Vec, metric: Metric = Metric.EUC) -> float:
      ERROR_IF_NOT_EQ(
        v1.shape, v2.shape, 
        "could not compute the distance between vectors with different shapes {} and {}"\
          .format(v1.shape, v2.shape)
      )
      if metric == Metric.EUC:
        return euclidean(v1, v2)
      if metric == Metric.COS_DIST:
        return cosine(v1, v2)
      if metric == Metric.COS_SIM:
        return 1 - cosine(v1, v2)
      if metric == Metric.DOT:
        return np.dot(v1, v2)
    
    def get_to_op_executor():
      from ...compilers import local as main_compiler
      return partial(main_compiler.binary_op, getVecDistance)

    _extended_predicate_op_executors_ accepts a dictionary of "predicate_operator_class -> executor", where the executor is a function that inputs two (if the operator is binary) or one (if not binary) operands and outputs the results. Normally users can directly use or partially borrow the utilities from dbsim/compilers/local.py to implement executor. We recommend users to look into that file for more details.

  7. _extended_relation_op_schema_: \
          typing.Dict[typing.Type[SuperRelationalOp], typing.Callable[[SuperRelationalOp, &#39;DataSet&#39;], &#39;Schema&#39;]] = \
            {SimSelectionOp: schema_interpreter.schema_from_relation}`

    _extended_relation_op_schema_ accepts a dictionary of "relational_operator_class -> schema_resolving_function", where the schema_resolving_function is the function to compute the output relation schema for the relational operator based on its input relation schema. Normally users can directly use or partially borrow the utilities from dbsim/schema_interpreter.py to implement schema_resolving_function. We recommend users to look into that file for more details.

  8. from ...compilers import local as main_compiler
    _extended_relation_op_executors_: \
          typing.Dict[typing.Type[SuperRelationalOp], typing.Callable[[typing.Any], typing.Callable]] = \
            {SimSelectionOp: main_compiler.selection_op}

    This is similar to _extended_predicate_op_executors_, except each dictionary key is an extended relational operator instead of predicate operator.

Step 3: implementing the SQL-clause level interfaces

The parsing principles of SQL clauses by SimSelectionSyntax is as follows:

  • if keyword to occurs in SELECT or WHERE clauses, parse the query by SimSelectionSyntax
  • or if keyword simselect occurs in SELECT clause, parse it by SimSelectionSyntax
  • when simselect occurs, there must be at least one to keyword in either SELECT or WHERE clauses, otherwise the query is invalid.

Based on the principles, the SQL-clause level interfaces are implemented as follows: (note that all these interfaces should be implemented as instance methods as discussed above)

  1. Necessary flags:

    def __init__(self):
        self.simselect_detected: bool = False
        self.to_keyword_detected_in_select_clause: bool = False
        self.I_am_triggered: bool = False
    • self.simselect_detected: A flag for whether the query is a standard select query or a simselect query. This is used by parse_simselect_where to determine whether to throw an error when keyword 'to' is met. Specifically, when 'simselect' is detected, there must be at least one 'to' existing in either the SELECT cluase or the WHERE clause of the current query, otherwise a syntax error should be thrown.
    • self.to_keyword_detected_in_select_clause: A flag for whether 'to' keyword is found in the SELECT clause of the current query. If it is found, a SimSelectionOp instead of SelectionOp will be generated, whether simselect is used or not. If it is not found in SELECT clause, it will be checked in WHERE clause later, and if found there, a SimSelectionOp will be generated. Only when there is no to and no simselect keywords detected in both of SELECT and WHERE clauses, the query will be parsed to a standard selection (SelectionOp).
    • self.I_am_triggered: A flag to tell trigger_simselect_where if trigger_simselect returns True. If yes, trigger_simselect_where will always return True. Otherwise trigger_simselect_where will do its own keywords checking and return True or False.
  2. trigger_function and parsing_function for SELECT clause

    def trigger_simselect(self, tokens: TokenList) -> bool:
        try:
          index_of_from_clause = tokens.index('from')
          # 'from' exists in tokens, there may be nested queries.
          # In such case only needs to check the tokens before 'from', 
          #   and the nested queries will be checked later 
          #   during the recursive parsing for the inner select statements.
          self.I_am_triggered = \
            tokens[0] == self.simselect_keyword \
            or (tokens[0] == 'select' and 'to' in tokens[:index_of_from_clause])
          return self.I_am_triggered
        except ValueError:
          # no 'from' in tokens, i.e., there are no nested queries
          self.I_am_triggered = \
            tokens[0] == self.simselect_keyword \
            or (tokens[0] == 'select' and 'to' in tokens)
          return self.I_am_triggered
        except Exception as e:
          raise e
    
      def parse_simselect(self, tokens: TokenList) -> List[Expr]:
        # resets the flags to avoid the influence from previous queries 
        self.simselect_detected = False
        self.to_keyword_detected_in_select_clause = False
    
        if tokens[0] not in ('select', self.simselect_keyword):
          raise SQLSyntaxError
        keyword = tokens.pop(0)
        if keyword == self.simselect_keyword:
          # 'simselect' detected
          self.simselect_detected = True
        for token in tokens:
          if token in ('from', 'where'):
            # the current SELECT clause ends at 'from' or 'where' or the tail of tokens
            break
          if token == 'to':
            # 'to' detected in the current SELECT clause
            self.to_keyword_detected_in_select_clause = True
        
        # Whether the extended keywords 'simselect' and 'to' are found or not, 
        #   simply does the same thing as standard select, 
        #   i.e., using select_core_exp to parse the columns to be selected  
        select_cols = toolbox.select_core_exp(tokens) 
        return select_cols 
  3. trigger_function and parsing_function for WHERE clause

    def trigger_simselect_where(self, tokens: TokenList) -> bool:
        return self.I_am_triggered or (tokens[:1] == ['where'] and 'to' in tokens)
    
      def parse_simselect_where(self, tokens: TokenList, relation: Expr) -> Expr:
        if tokens[:1] == ['where']:
          tokens.pop(0)
          # validate the syntax requirements on the occurrance of 'simselect' and 'to'
          to_keyword_detected_in_where_clause = False
          for token in tokens:
            if token in ('select', 'from', 'where', self.simselect_keyword):
              # If any one among the 'select', 'from', 'where' and 'simselect' keywords is met, 
              #   it means the current WHERE clause is part of a nested query, 
              #   and the current WHERE clause reaches at most the token right before the met keyword,
              #   i.e., the met keyword can be used as a rough boundary of the current WHERE clause
              #   for the purpose of checking occurrance of 'to'.
              break
            if token == 'to':
              # 'to' detected in the current WHERE clause
              to_keyword_detected_in_where_clause = True
          if self.simselect_detected:
            ERROR_IF_FALSE(
              self.to_keyword_detected_in_select_clause or to_keyword_detected_in_where_clause, 
              "simselect requires at least one 'to' expression, zero found.", 
              ExtendedSyntaxError
            )
          if to_keyword_detected_in_where_clause or self.to_keyword_detected_in_select_clause or self.simselect_detected: 
            # if 'to' keyword is found in WHERE clause or in SELECT clause, 
            #   or the keyword 'simselect' instead of 'select' is used in SELECT clause, 
            #   calls extended_where_core_expr here to generate a SimSelectionOp
            relation: SimSelectionOp = self.extended_where_core_expr(tokens, relation)
          else:
            # otherwise, calls the standard where_core_expr to generate a standard SelectionOp
            relation: SelectionOp = toolbox.where_core_expr(tokens, relation)
        else:
          # The current query has no WHERE clause,  
          #   but we still need to validate the syntax requirements 
          #   on the occurrance of 'simselect' and 'to'.
          if self.simselect_detected:
            ERROR_IF_FALSE(
              self.to_keyword_detected_in_select_clause,
              "simselect requires at least one 'to' expression, zero found.", 
              ExtendedSyntaxError
            )
        return relation
    
      def extended_where_core_expr(self, tokens: TokenList, relation: Expr) -> SimSelectionOp:
        return SimSelectionOp(relation, toolbox.or_exp(tokens))

Step 4: registering the new syntax into registry

registry: Registry = OrderedDict({
  "simselect": RegEntry( 
                  syntax=SimSelectionSyntax, 
                  clause_parsers=OrderedDict({
                    SQLClause.SELECT: ("trigger_simselect", "parse_simselect"), 
                    SQLClause.WHERE: ("trigger_simselect_where", "parse_simselect_where")
                  }), 
                  entry_points=[
                    SimSelectionSyntax.addExtendedSymbolsAndKeywords,
                    SimSelectionSyntax.addExtendedPredicateParsers,
                    SimSelectionSyntax.addExtendedDataTypes,
                    SimSelectionSyntax.addExtendedPredicateOps,
                    SimSelectionSyntax.addExtendedRelationOps
                  ]
                ),
})

The final step is to fill a RegEntry of SimSelectionSyntax into registry.

Note:

  • The methods listed in entry_points are those "mounting methods" inherited from ExtendedSyntax. SimSelectionSyntax should never overwrite them.

  • The SQL-clause level and expression level interfaces are independent from each other. You can flexibly implement and register either, both, or neither of them. The non-implemented or non-registered interfaces will be replaced by the native standard implementation automatically inside DBSim. Very flexible.

  • In dbsim/extensions/extended_syntax/, we also implemented another syntax called SpatialSyntax including Point and Circle data types and inside operator to check if a Point is inside a Circle. You can make the two syntax both in effect by registering SpatialSyntax also in registry. We include some example queries for mixture of the two extended syntax in dbsim/tests/test_extended_syntax.py. But we do not recommend it, even though DBSim provides such a feature to mix up multiple extended syntax. It is not easy to make sure the mixed syntax have no conflict with each other. If those syntax are complex, hidden bugs are very likely to exist due to unrealized conflicts. So we recommend users to include everything in one single extended syntax, which is the best way to develop and debug.

  • You can try queries with the mixed syntax (SimSelectionSyntax+SpatialSyntax+standard syntax) by running the unit test test_ExtendedSyntax_mixture() in dbsim/tests/test_extended_syntax.py. But again, we do not recommend syntax mixture like that.