Prototyping what it takes to build a DataFrame with expressions

gordonwatts/dataframe_expressions

dataframe_expressions

Simple accumulating of expressions for dataframe operations

Expression Samples

You start with a top level data frame:

from dataframe_expressions import DataFrame
d = DataFrame()

Now you can mask it with simple operations:

d1 = d[d.x > 10]

The operators <, >, <=, >=, ==, and != are all supported. You can also combine logical expressions, but watch the operator precedence: & binds more tightly than the comparison operators, so each comparison needs its own parentheses:

d1 = d[(d.x > 10) & (d.x < 20)]
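
The precedence issue can be seen with nothing but the standard ast module. This is a small sketch, independent of dataframe_expressions, that parses the mask with and without parentheses and compares the resulting trees:

```python
import ast

# Without parentheses, & binds tighter than the comparisons, so Python
# reads this as the chained comparison  d.x > (10 & d.x) < 20.
unparenthesized = ast.parse("d.x > 10 & d.x < 20", mode="eval").body

# With parentheses, the two comparisons are combined by a single & node.
parenthesized = ast.parse("(d.x > 10) & (d.x < 20)", mode="eval").body

print(type(unparenthesized).__name__)  # Compare
print(type(parenthesized).__name__)    # BinOp
```

Only the parenthesized form produces the "comparison & comparison" shape that a mask is meant to have.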

Of course, chaining is also allowed:

d1 = d[d.x > 10]
d2 = d1[d1.x < 20]

And d2 will be identical to d1 of the previous example. You can also reverse the order of the comparison, for example:

d1 = d[10 < d.x]

The system will actually render the mask expression as d.x > 10 (as per math and Python rules).
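
The reversal falls out of Python's reflected comparison rules rather than anything special in this library. A minimal sketch, using a hypothetical Recorder class as a stand-in for a dataframe column (it is not part of dataframe_expressions):

```python
class Recorder:
    """Hypothetical stand-in for a dataframe column; records comparisons as text."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        return f"{self.name} > {other}"

    def __lt__(self, other):
        return f"{self.name} < {other}"


x = Recorder("d.x")

# int.__lt__(10, x) returns NotImplemented, so Python falls back to the
# reflected method x.__gt__(10).
reversed_form = 10 < x
direct_form = x > 10
print(reversed_form)
```

Both spellings end up in the same method, which is why the recorded expression always has the column on the left.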

The four basic binary math operators work as well:

d1 = d.x/1000.0

They also work as expected if reversed, in case you were worried about that (e.g. 1000.0/d.x).

Extension functions are supported:

d1 = d.x.count()

And, much the same way, numpy functions are supported:

import numpy as np
d1 = np.sin(d.x)

Internally, this is rendered as sin(d.x). The numpy functions are translated directly into calls like this - it is up to whatever backend you have to actually implement them. For the complete list of numpy functions, see the numpy math page.

Some Python built-in functions are supported as well:

d1 = abs(d.x)

Finally, other numpy functions - array_functions - are also translated. For example:

h = np.histogram(d.x, bins=50, range=(-0.5,10))

creates a DataFrame which makes a call to the np_histogram function. A backend can then implement that function.

One of the most useful extra expressions in a functional language is the if-then-else expression. In Python this is written a if a > b else b. Unfortunately, because the interpreter evaluates the test immediately as a plain true/false value, we can't use this directly with DataFrames. Instead, we can use the three-argument np.where function: np.where(<test>, <test-true-result>, <test-false-result>). The nice thing about dataframe_expressions is that the true and false results are not calculated unless they are needed (unlike true numpy). See the numpy.where documentation for further details. Support, of course, is dependent on the backend.
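
To make the limitation concrete, here is a toy sketch. The Expr class and the where helper are hypothetical stand-ins (they are neither numpy nor this library); they only show why a conditional expression forces an immediate answer while a function call can be recorded for later:

```python
class Expr:
    """Hypothetical stand-in for a lazily evaluated dataframe expression."""

    def __init__(self, text):
        self.text = text

    def __gt__(self, other):
        return Expr(f"({self.text} > {other.text})")

    def __bool__(self):
        # A conditional expression needs an immediate True/False answer,
        # which a deferred expression cannot give.
        raise TypeError("deferred expression has no truth value yet")


def where(test, if_true, if_false):
    """Toy three-argument where(): records the choice instead of making it."""
    return Expr(f"where({test.text}, {if_true.text}, {if_false.text})")


a, b = Expr("d.a"), Expr("d.b")

try:
    result = a if a > b else b  # Python calls bool(a > b) right here
    failed = False
except TypeError:
    failed = True

deferred = where(a > b, a, b)
print(failed, deferred.text)
```

The function-call form keeps all three pieces as expressions, so a backend can decide later which branch to compute.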

Lambda functions and captured variables

It is possible to use lambdas that capture variables, allowing combinations of objects. For example:

d.jets.map(lambda j: d.eles.map(lambda e: j.DeltaR(e)))

This would produce a stream of DataFrames, one for each jet with each electron. It is up to the backend how a function like map is used (and of course DeltaR). Further, the backend must drive the parsing because the arguments can be arbitrary, so dataframe_expressions can't figure out the meaning on its own. The function map here, for example, has no special meaning in this library.

Backend Functions

Sometimes the backend defines functions that are directly callable, for example DeltaR, which might take several parameters. With some hints, these are encoded as direct function calls in the final AST:

from dataframe_expressions import user_func

@user_func
def calc_it(pt1: float) -> float:
    # Only the name and signature are used; the backend supplies the implementation.
    assert False, 'Should never be called'

calced = calc_it(d.jets.pt)

In this case, calced would be expected to be a column of jet pt's that were all put together.
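
One way to picture the encoding is a decorator that throws away the function body and records a call node instead. The sketch below is only an illustration: toy_user_func is a hypothetical name, it is not how user_func is implemented, and it needs Python 3.9+ for ast.unparse.

```python
import ast
from typing import Callable


def toy_user_func(func: Callable) -> Callable:
    """Hypothetical decorator: record an ast.Call instead of running the body."""

    def build_call(*args: ast.AST) -> ast.Call:
        # The decorated body never runs; only the function name is kept.
        return ast.Call(
            func=ast.Name(id=func.__name__, ctx=ast.Load()),
            args=list(args),
            keywords=[],
        )

    return build_call


@toy_user_func
def calc_it(pt1: float) -> float:
    raise NotImplementedError("the backend supplies the implementation")


# Stand-in for the expression d.jets.pt
jet_pt = ast.parse("d.jets.pt", mode="eval").body
call_node = calc_it(jet_pt)
print(ast.unparse(call_node))
```

A backend walking the tree would then see an ordinary ast.Call whose function name it can map to its own implementation.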

Filter Functions

If a filter gets to be too complex (the code between a [ and a ]), then it might be simpler to put it in a separate function.

def good_jet(j):
    # The mask expression must be returned, otherwise the filter receives None.
    return (j.pt > 30) & (abs(j.eta) < 2.4)

good_jets_pt = df.jets[good_jet].pt
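
A toy model of how a callable inside the brackets could be handled: the Collection class below is a hypothetical stand-in that records expressions as text. It is a sketch of the idea, not the library's implementation, and it leaves out abs for brevity.

```python
class Collection:
    """Hypothetical stand-in that records expressions as text."""

    def __init__(self, text):
        self.text = text

    def __getattr__(self, name):
        # Only reached for names that are not real attributes.
        if name.startswith("__"):
            raise AttributeError(name)
        return Collection(f"{self.text}.{name}")

    def __gt__(self, other):
        return Collection(f"({self.text} > {other})")

    def __lt__(self, other):
        return Collection(f"({self.text} < {other})")

    def __and__(self, other):
        return Collection(f"({self.text} & {other.text})")

    def __getitem__(self, selector):
        # A callable selector is applied to the collection being filtered;
        # anything else is taken to be a ready-made mask expression.
        mask = selector(self) if callable(selector) else selector
        return Collection(f"{self.text}[{mask.text}]")


def good_jet(j):
    return (j.pt > 30) & (j.eta < 2.4)


df = Collection("df")
good_jets_pt = df.jets[good_jet].pt
print(good_jets_pt.text)
```

The function receives the same object that precedes the brackets, so its body reads exactly like the inline mask would.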

Adding computed expressions to the Data Model

There are two ways to define new columns in the data model. In both cases the idea is that a new computation expression can replace the old one. The first method looks more pandas like, and the second one looks more like a regular expression substitution. The second method is quite general, powerful, and thus quite likely to take your foot off. Not sure it will survive the prototype.

Adding a new computed expression column

This is the most common way to add a new expression to the data model: one provides a lambda function that is computed during rendering by dataframe_expressions:

df.jets['ptgev'] = lambda j: j.pt / 1000.0

By default the argument is everything that precedes the brackets - in this case df.jets. All the rules about capturing variables apply here, so it is possible to add a set of tracks near the jet, for example, using this (as long as it is implemented by the backend). For example:

def near(tks, j):
    return tks[tks.map(lambda t: DeltaR(t, j) < 0.4)]

df.jets['tracks'] = lambda j: near(df.tracks, j)

# This will now get you the number of tracks near each jet:
df.jets.tracks.Count()

The above assumes a lot of backend implementation: DeltaR, map, Count, along with the detector data model that has jets and tracks, but hopefully gives one an idea of the power available.

Replacing the contents of a column

It is possible to graft one part of the data model into another part of the data model, when necessary. It can be done with the above lambda expression as well, but this is a shortcut:

df.jets['mcs'] = df.mcs[df.mcs.pdgId == 11]

how_many_mcs = df.jets.mcs.Count()

Though that would have the same number for every jet.

Because of the way rendering works, the following also does what you expect:

df.jets['ptgev'] = df.jets.pt/1000.0

jetpt_in_gev = df.jets.ptgev

This is because in the current dataframe_expressions model, every single appearance of a common expression, like df.jets, corresponds to the same set of jets. In short, implied iterators are common here. In this prototype it isn't obvious that this should be the case.

All of this will work even through a filter, as you might expect:

df.jets['ptgev'] = df.jets.pt / 1000.0

jetpt_in_gev = df.jets[df.jets.ptgev > 30].ptgev

The prototype implementation is particularly fragile - but that is due to poor design rather than a technical limitation.

You can also refer to a leaf using a simple syntax. For example, df.jets["ptgev"] and df.jets.ptgev are the same on the right hand side of an expression. df.xxx and df["xxx"] are equivalent in all circumstances.

Adding to the data model using objects

Another way to do this is to build an object. For example, let's say you want to make it easy to do 3-vector operations. You might write something like this:

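import numpy as np

class vec(DataFrame):
    def __init__(self, df: DataFrame):
        DataFrame.__init__(self, df)

    # Writing `return self.x` inside the property x would call the property
    # again and recurse forever. This sketch assumes DataFrame resolves
    # unknown attributes through __getattr__ and calls it directly instead.
    @property
    def x(self) -> DataFrame:
        return self.__getattr__('x')

    @property
    def y(self) -> DataFrame:
        return self.__getattr__('y')

    @property
    def z(self) -> DataFrame:
        return self.__getattr__('z')

    @property
    def xy(self) -> DataFrame:
        return np.sqrt(self.x*self.x + self.y*self.y)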

Now you can write v.xy and you have the L_xy distance from the origin. It is also possible to implement vector operations. This library doesn't help you with that, but it isn't difficult.

You can add the class decorator exclusive_class if you only want the supplied properties to be available (so v.zz would cause an error).

The extra work to support this is almost trivial - see test cases, even one with vector addition, in the file test_object.py for further examples.

Adding to the data model using an Alias

This is a simple feature which allows you to invent shorthand for more complex expressions, which makes them easier to use. Further, the backend never knows about these shorthand definitions - they are just substituted in on the fly as the DAG is built. For example, in the ATLAS experiment, to access jet pT in GeV I always need to divide by 1000. So:

define_alias('', 'pt', lambda o: o.pt / 1000.0)

Now if one enters d.jets.pt, the backend will see it as if one had typed d.jets.pt/1000.0. The same can be done for collections. For example:

define_alias('.', 'eles', lambda e: e.Electrons("Electrons"))

And when one enters d.eles.pt the backend will see d.Electrons("Electrons").pt / 1000.0.

The aliases can reference each other (though no recursion is allowed), so fairly complex expressions can be built up. This library's alias resolution is quite simple (it is a prototype). Matching is possible. For example, if the first argument is a ., then only references directly off the dataframe are translated. This feature could be used to define a personality module for an analysis for an experiment.
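
The substitution idea can be sketched with the standard ast module alone. Everything here is hypothetical (the ALIASES table, the AliasExpander class, the "o" placeholder convention) and much cruder than define_alias; it only shows an attribute being swapped for a longer expression while the tree is processed. It needs Python 3.9+ for ast.unparse.

```python
import ast

# Hypothetical alias table: attribute name -> replacement template, where
# the name "o" stands for whatever the attribute was accessed on.
ALIASES = {"pt": "o.pt / 1000.0"}


class _Substitute(ast.NodeTransformer):
    """Replace the placeholder name "o" with a concrete sub-expression."""

    def __init__(self, target: ast.AST):
        self.target = target

    def visit_Name(self, node: ast.Name) -> ast.AST:
        return self.target if node.id == "o" else node


class AliasExpander(ast.NodeTransformer):
    """Swap aliased attributes for their expanded expressions."""

    def visit_Attribute(self, node: ast.Attribute) -> ast.AST:
        self.generic_visit(node)  # expand the inner part of the chain first
        template = ALIASES.get(node.attr)
        if template is None:
            return node
        replacement = ast.parse(template, mode="eval").body
        # The returned node is not visited again, so the o.pt inside the
        # template is left alone and the alias cannot recurse.
        return _Substitute(node.value).visit(replacement)


expanded = AliasExpander().visit(ast.parse("d.jets.pt", mode="eval"))
print(ast.unparse(expanded))
```

Because the rewrite happens on the tree, whatever consumes the result never sees the alias name, only the expanded expression.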

Usage with a backend

While the above shows you what the library can track, it says nothing about how you use it. The following steps are necessary.

  1. Subclass dataframe_expressions.DataFrame in your library to create a "source" dataframe. For example, it could refer to a file, or a network endpoint that supplies data. Make sure you initialize the DataFrame subclass by calling its __init__ method; there is no need to pass any arguments. For this discussion, let's call this MyDF.

  2. Users build expressions as you would expect: df = MyDF(...), and df1 = df.jets[df.jets.pt > 10].

  3. Users trigger rendering of the expression in your library in some way that makes sense, get_data(df1) for example, where you must supply the get_data method.

  4. When you get control with the DataFrame expression the user wants rendered, you can now do the following to render it:

from dataframe_expressions import render
expression, context = render(df1)

expression is an ast.AST that describes what is being looked at. If the expression is df.jets.pt then the ast is a chain of python ast.Attribute nodes, and the bottom one will be a special ast_DataFrame object that contains a member DataFrame which points to your original sub-classed MyDF. You can tell it is the special DataFrame because it will have no children.

If there are filters, there is another special ast object you need to be able to process: ast_Filter. For example, df[df.met > 50].jets.pt will have an expression starting with two ast.Attribute nodes for the jets.pt attributes, followed by an ast_Filter node. The ast_Filter object has one expression, filter, which points to an expression that is the filter. It should evaluate to true or false. The second member points to the DataFrame it is filtering - in this case MyDF. Whenever there is a repeated phrase, like df in df[df.met > 50].jets.pt or df.jets in df.jets[df.jets.count() == 2], the repeats will point to the same ast_DataFrame object - so you can use that when walking the tree to recognize common sub-expressions.

There is one last trick: lambda functions. dataframe_expressions can't evaluate the lambda functions without knowing more about the user's intent, so evaluating them must be triggered by your library. The lambda functions are represented by an ast_Callable object. When you do encounter them, you can render them into the same ast.AST-like form by calling render_callable and passing the context along with the ast_Callable and any arguments to pass to the lambda.

To see how this works, see packages like hep_tables and hl_tables.
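
As a rough picture of what walking such a tree looks like, the sketch below unpacks a chain of ast.Attribute nodes. It uses a tree parsed from text, so the bottom node is a plain ast.Name; in a tree returned by render the bottom node would be the ast_DataFrame object described above. The helper name attribute_chain is made up for this example.

```python
import ast
from typing import List, Tuple


def attribute_chain(node: ast.AST) -> Tuple[ast.AST, List[str]]:
    """Peel ast.Attribute nodes off the top; return the root and the names."""
    names: List[str] = []
    while isinstance(node, ast.Attribute):
        names.append(node.attr)
        node = node.value
    # Names were collected outermost first, so reverse them into read order.
    return node, list(reversed(names))


# Stand-in for a rendered expression: here the root is an ast.Name.
expression = ast.parse("df.jets.pt", mode="eval").body
root, path = attribute_chain(expression)
print(type(root).__name__, path)
```

A real backend would more likely use an ast.NodeVisitor with handlers for the special node types, but the order in which the nodes appear is the same: outermost attribute first, data source last.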

Helpers

The dumps function will dump a dataframe to a string. For the most part, the string will be correct python (lambda functions and other function routines are the only exception). This is useful for including in error text or in logging in libraries that make use of this library.

The dataframe_expressions library makes use of the python logging library to dump expressions it is asked to render at the debug level. If you want to turn on just messages from this library the following code will dump debug level messages to stdout:

import logging
import sys
ch = logging.StreamHandler(sys.stdout)  # the default stream is stderr
logging.getLogger('dataframe_expressions').setLevel(logging.DEBUG)
logging.getLogger('dataframe_expressions').addHandler(ch)

Technology Choices

Not sure these are the right thing, but...

  • Using the python ast module to record expressions. Mostly because it is already complete and there are nice visitor objects that make walking it easy. The downside is that python does change the ast every few versions.

  • An attribute on DataFrame refers to some data. A method call, however, does not refer to data. So, you can say d.pt to get at the pt, but if you said d.pt() that would be "bad". The reason for this is so that we can add functions that do things in a fluent way. For example, d.jets.count() to count the number of jets. Or d.jets[d.jets.pt > 100].count() or similar. Really, the back end can interpret this, but the front-end semantics sort-of make this assumption.

Architecture Questions

This isn't an exhaustive list. Just a list of some choices I had to make to get this off the ground.

  • Should there be a Column and Dataset?

    • Yes - turns out we have rediscovered why there is a Mask and a column distinction in numpy. So the Column object is really a Mask object. This is bad naming, but hopefully for this prototype that won't make much of a difference. So we should definitely think a bit about why a Mask has to be treated differently from a DataFrame - it isn't intuitively obvious until you get into the code.
    • No - since things can return "bool" values and we don't know it because we have no type system, they are identical to a column, except we assume they are a df: df[df.hasProdVtx & df.hasDecayVtx], for example.
    • We should get rid of the concept of a parent, dynamic, and replace it with ast_DataFrame - we have it in here already - so why not just stick to that rather than having both it and p.
  • Should we allow for "&" and "|" as logical operators, redefining what they mean in python? numpy defines several logical operators which should translate, but those aren't implemented yet.

  • I currently have a parent as "p" in the expression, but then we have a dataframe ast and column ast - which makes it not needed. Why not just convert to using the same thing to refer to a df in an ast?

    • Internally, the "parent" dataframe is represented as p - which means nothing can ever have a p object on it or all hell is likely to break loose. A very good argument for not doing it this way.
  • For typing I do not know how to forward declare so I can use Column and DataFrame inside my method definitions. Static type checkers should pick this up for now by simple logic.

  • Using BitAnd and BitOr for and and or - but should I use the logical and and or here to make it clear in the AST what we are talking about?

  • What does d1[d[d.x > 0].jets.pt > 20].pt mean? Is this where we are hitting the limit of things? I'd say it means nothing and should create an error. Something like d1[(d[d.x > 0].jets.pt > 20).count()].pt works, however. Actually even the above - what does that mean? Isn't the right way to do that d1[(d[d.x > 0].jets[d.jets.pt>0].count())] or similar? Ugh. Ok - the thing to do for now is be strict, and we can add things which make life easier later.

  • Sometimes functions are defined in places they make no sense. For example, the abs (or any numpy function) is defined always, even if your DataFrame represents a collection of jets. A reason to have columns and collections as different objects to help the user, and help editors guess possibilities.

  • There should be no concept of parent in a DataFrame. The expression should be everything, and point to any referenced objects. This will be especially true if multiple root DataFrames are ever to be used.

  • Is it important to define new columns using the '=' sign? e.g. df.jets.ptgev = df.jets.pt/1000.0?

  • The rule that every expression that is the same implies the same implied iterator. That means the current code can't do 2 jets, for example. There are several ways to "fix" this, however, the biggest question: is this reasonable?

  • The ability to have an exclusive_object is implemented at runtime - perhaps we can come up with a scheme where we just define objects and they "fit" in correctly? Thus editors, etc., would be able to tag this as a problem.
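
For the BitAnd/BitOr question in the list above, it may help to see what the two spellings look like in a python AST. This is a plain standard-library sketch, nothing specific to dataframe_expressions:

```python
import ast

bitwise = ast.parse("a & b", mode="eval").body
logical = ast.parse("a and b", mode="eval").body

# a & b  is a binary operator node and can be overloaded with __and__.
print(type(bitwise).__name__, type(bitwise.op).__name__)

# a and b  is a BoolOp node; python evaluates it through truth values, so a
# class cannot intercept it while an expression is being built.
print(type(logical).__name__, type(logical.op).__name__)
```

So an expression captured through operator overloading can only ever arrive as BitAnd or BitOr. Recording it as a logical and/or would be a translation step done on purpose when the AST is built.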
