Simple accumulating of expressions for dataframe operations
You start with a top level data frame:
from dataframe_expressions import DataFrame
d = DataFrame()
Now you can mask it with simple operations:
d1 = d[d.x > 10]
The operators <, >, <=, >=, ==,
and !=
are all supported. You can also combine logical expressions, though watch for operator precedence:
d1 = d[(d.x > 10) & (d.x < 20)]
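The precedence warning can be checked with nothing but the standard `ast` module. The sketch below only parses source text (it does not use `dataframe_expressions` at all) and shows that, without parentheses, `&` binds more tightly than the comparisons:

```python
import ast

# Without parentheses, `&` binds tighter than the comparisons, so Python reads
# this as the chained comparison  d.x > (10 & d.x) < 20.
unparenthesized = ast.parse("d.x > 10 & d.x < 20", mode="eval").body

# With parentheses, the two comparisons are built first and then combined.
parenthesized = ast.parse("(d.x > 10) & (d.x < 20)", mode="eval").body

print(type(unparenthesized).__name__)  # Compare
print(type(parenthesized).__name__)    # BinOp
```

So the parentheses around each comparison are required, not a matter of style.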
Of course, chaining is also allowed:
d1 = d[d.x > 10]
d2 = d1[d1.x < 20]
And d2
will be identical to d1 of the last example. You can also reverse the order, for example:
d1 = d[10 < d.x]
The system will actually render the mask expression as d.x > 10
(as per math and python rules).
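The flip from `10 < d.x` to `d.x > 10` falls out of Python's reflected comparison rules. Here is a minimal sketch with a toy `Col` class (a hypothetical stand-in, not the library's `DataFrame`) that records comparisons as text:

```python
class Col:
    """Toy stand-in for a dataframe column; records comparisons as text."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        return f"{self.name} > {other!r}"

    def __lt__(self, other):
        return f"{self.name} < {other!r}"


x = Col("d.x")
forward = x > 10       # calls Col.__gt__ directly
flipped = 10 < x       # int.__lt__ gives up, so Python calls Col.__gt__(x, 10)

print(forward)  # d.x > 10
print(flipped)  # d.x > 10
```

Because `int` does not know how to compare itself with `Col`, Python falls back to the reflected method on the right-hand operand, which is why both spellings record the same thing.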
The four basic binary math operators work as well:
d1 = d.x/1000.0
They also work as expected if reversed, in case you were worried about that (e.g. 1000.0/d.x
).
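As an illustration of how both `d.x/1000.0` and `1000.0/d.x` can be recorded, here is a small sketch that builds python `ast` nodes from operator overloads. The `Expr` class and `_as_node` helper are made up for this example and are not the library's internals; `ast.unparse` needs Python 3.9 or later.

```python
import ast


def _as_node(value):
    """Turn an Expr or a plain python constant into an ast node."""
    return value.node if isinstance(value, Expr) else ast.Constant(value=value)


class Expr:
    """Toy expression recorder: every operation returns a new ast tree."""

    def __init__(self, node):
        self.node = node

    def __truediv__(self, other):
        # Expr / something
        return Expr(ast.BinOp(left=self.node, op=ast.Div(), right=_as_node(other)))

    def __rtruediv__(self, other):
        # something / Expr (python calls this when the left side is not an Expr)
        return Expr(ast.BinOp(left=_as_node(other), op=ast.Div(), right=self.node))

    def __str__(self):
        return ast.unparse(self.node)


d = ast.Name(id="d", ctx=ast.Load())
x = Expr(ast.Attribute(value=d, attr="x", ctx=ast.Load()))

print(x / 1000.0)   # d.x / 1000.0
print(1000.0 / x)   # 1000.0 / d.x
```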
Extension functions are supported:
d1 = d.x.count()
And, much the same way, numpy
functions are supported:
import numpy as np
d1 = np.sin(d.x)
as well as some python functions:
d1 = abs(d.x)
Internally, the np.sin call above is rendered as sin(d.x)
. The numpy
functions are translated directly into calls like this - it is up to whatever backend you have to actually implement them. For the complete list of numpy
functions, see the numpy
math page.
Finally, other numpy
functions - array_functions
- are also translated. For example:
h = np.histogram(d.x, bins=50, range=(-0.5,10))
creates a DataFrame
which makes a call to the np_histogram
function. A backend can then implement that function.
One of the most useful extra expressions in a functional language is the if-then-else
expression. In python this is a if a > b else b
. Unfortunately, due to the way the python interpreter works, we can't use this directly with DataFrame
s. Instead, we can use the np.where
3-argument function. np.where(<test>, <test-true-result>, <test-false-result>)
- and the nice thing about dataframe_expressions
is that the true and false results are not calculated unless they are needed (unlike true numpy
). See the numpy.where
documentation for further details. Support, of course, is dependent on the backend.
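The reason the plain conditional expression cannot be recorded is that python asks the test for its truth value immediately, before any backend has seen the expression. A minimal sketch, using a toy `Mask` class and a hypothetical `where` helper (neither is part of `dataframe_expressions`):

```python
class Mask:
    """Toy stand-in for a recorded, not yet evaluated, boolean expression."""

    def __init__(self, text):
        self.text = text

    def __bool__(self):
        # `x if mask else y` lands here: there is no data yet, so no answer.
        raise TypeError("a recorded expression has no truth value yet")


def where(test, if_true, if_false):
    """Record all three pieces instead of picking a branch right now."""
    return ("where", test.text, if_true, if_false)


m = Mask("d.x > 10")

try:
    result = "d.x" if m else "d.y"
    error = None
except TypeError as e:
    error = str(e)

recorded = where(m, "d.x", "d.y")
print(error)
print(recorded)
```

A function call with three arguments keeps all the pieces around, so the decision can be deferred to the backend.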
It is possible to use lambdas that capture variables, allowing combinations of objects. For example:
d.jets.map(lambda j: d.eles.map(lambda e: j.DeltaR(e)))
Would produce a stream of DataFrame
's for each jet with each electron. It is up to the backend how a function like map
is used (and of course DeltaR
). Further, the backend must run the parsing as arguments can be arbitrary, so dataframe_expressions
can't figure out the meaning on its own. The function map
here, for example, has no special meaning in this library.
Sometimes the backend defines some functions which are directly callable. For example, DeltaR
which might take several parameters. With some hints, these are encoded as direct function calls in the final ast
:
from dataframe_expressions import user_func
@user_func
def calc_it(pt1: float) -> float:
    assert False, 'Should never be called'
calced = calc_it(d.jets.pt)
In this case, calced
would be expected to be a column of jet pt
's that were all put together.
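To give a feel for why the decorated body is never run, here is a sketch of a decorator that records a call instead of executing the function. The name `record_call` is invented for this example, and the arguments are passed as source strings to keep it short; the real `user_func` works on `DataFrame` objects and may be implemented quite differently.

```python
import ast
import functools


def record_call(func):
    """Hypothetical decorator: record a call to `func` rather than running it."""

    @functools.wraps(func)
    def wrapper(*args):
        # Each argument is a snippet of source text in this simplified sketch.
        arg_nodes = [ast.parse(a, mode="eval").body for a in args]
        call = ast.Call(
            func=ast.Name(id=func.__name__, ctx=ast.Load()),
            args=arg_nodes,
            keywords=[],
        )
        return ast.unparse(call)

    return wrapper


@record_call
def calc_it(pt1: float) -> float:
    raise AssertionError("Should never be called")


calced = calc_it("d.jets.pt")
print(calced)  # calc_it(d.jets.pt)
```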
If a filter gets to be too complex (the code between a [
and a ]
), then it might be simpler to put it in a separate function.
def good_jet(j):
    # The filter function must return the mask expression
    return (j.pt > 30) & (abs(j.eta) < 2.4)
good_jets_pt = df.jets[good_jet].pt
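One way a filter function can be accepted inside the brackets is to have `__getitem__` call it with the sequence being filtered. The toy `Seq` class below is an assumption made for illustration and records everything as text; it is not how the library stores expressions.

```python
class Seq:
    """Toy sequence that records attribute access, comparisons and filters."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, attr):
        return Seq(f"{self.name}.{attr}")

    def __gt__(self, other):
        return Seq(f"({self.name} > {other!r})")

    def __lt__(self, other):
        return Seq(f"({self.name} < {other!r})")

    def __and__(self, other):
        return Seq(f"{self.name} & {other.name}")

    def __abs__(self):
        return Seq(f"abs({self.name})")

    def __getitem__(self, mask_or_func):
        # A callable is handed the sequence itself and must return the mask.
        mask = mask_or_func(self) if callable(mask_or_func) else mask_or_func
        return Seq(f"{self.name}[{mask.name}]")


def good_jet(j):
    return (j.pt > 30) & (abs(j.eta) < 2.4)


df = Seq("df")
good_jets_pt = df.jets[good_jet].pt
print(good_jets_pt.name)
# df.jets[(df.jets.pt > 30) & (abs(df.jets.eta) < 2.4)].pt
```

If `good_jet` forgot its `return`, the mask would be `None` and the lookup of `mask.name` would fail, which is an easy mistake to make with this style.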
There are two ways to define new columns in the data model. In both cases the idea is that a new computation expression can replace the old one. The first method looks more pandas
like, and the second one looks more like a regular expression substitution. The second method is quite general, powerful, and thus quite likely to take your foot off. Not sure it will survive the prototype.
This is the most common way to add a new expression to the data model: one provides a lambda function that is computed during rendering by dataframe_expressions
:
df.jets['ptgev'] = lambda j: j.pt / 1000.0
By default the argument is everything that precedes the brackets - in this case df.jets
. All the rules about capturing variables apply here, so it is possible to add a set of tracks near the jet, for example, using this (as long as it is implemented by the backend). For example:
def near(tks, j):
    return tks[tks.map(lambda t: DeltaR(t, j) < 0.4)]
df.jets['tracks'] = lambda j: near(df.tracks, j)
# This will now get you the number of tracks near each jet:
df.jets.tracks.Count()
The above assumes a lot of backend implementation: DeltaR
, map
, Count
, along with the detector data model that has jets and tracks, but hopefully gives one an idea of the power available.
It is possible to graft one part of the data model into another part of the data model, when necessary. It can be done with the above lambda expression as well, but this is a shortcut:
df.jets['mcs'] = df.mcs[df.mcs.pdgId == 11]
how_many_mcs = df.jets.mcs.Count()
Though that would have the same number for every jet.
Because of the way rendering works, the following also does what you expect:
df.jets['ptgev'] = df.jets.pt/1000.0
jetpt_in_gev = df.jets.ptgev
This is because in the current dataframe_expressions
model, every single appearance of a common expression, like df.jets
corresponds to the same set of jets. In short, implied iterators are common here. In this prototype it isn't obvious this should be here.
All of this will work even through a filter, as you might expect:
df.jets['ptgev'] = df.jets.pt / 1000.0
jetpt_in_gev = df.jets[df.jets.ptgev > 30].ptgev
The prototype implementation is particularly fragile - but that is due to poor design rather than a technical limitation.
You can also refer to a leaf using a simple syntax. For example, df.jets["ptgev"]
and df.jets.ptgev
are the same on the right hand side of an expression. df.xxx
and df["xxx"]
are equivalent in all circumstances.
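Another way to do this is to build an object. For example, let's say you want to make it easy to do 3-vector operations. You might write something like this:
class vec(DataFrame):
    def __init__(self, df: DataFrame):
        DataFrame.__init__(self, df)

    # Returning self.x from inside the x property would call the property
    # again and recurse forever, so go through the dynamic leaf lookup instead
    # (this assumes DataFrame implements __getattr__ for leaf access).
    @property
    def x(self) -> DataFrame:
        return self.__getattr__('x')

    @property
    def y(self) -> DataFrame:
        return self.__getattr__('y')

    @property
    def z(self) -> DataFrame:
        return self.__getattr__('z')

    @property
    def xy(self) -> DataFrame:
        import numpy as np
        return np.sqrt(self.x*self.x + self.y*self.y)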
Now you can write v.xy
and you have the L_xy
distance from the origin. It is also possible to implement vector operations. This library doesn't help you with that, but it isn't difficult.
You can add the class decorator exclusive_class
if you only want the supplied properties to be available (so v.zz
would cause an error).
The extra work to support this is almost trivial - see test cases, even one with vector addition, in the file test_object.py
for further examples.
This is a simple feature which allows you to invent shorthand for more complex expressions, which makes them easier to use. Further, the backend never knows about these shorthand scripts - they are just substituted in on the fly as the DAG is built. For example, in the ATLAS experiment, to access jet pT in GeV I always need to divide by 1000. So:
define_alias('', 'pt', lambda o: o.pt / 1000.0)
Now if one enters df.jets.pt
, the backend will see it as if one had typed df.jets.pt/1000.0
. The same can be done for collections. For example:
define_alias('.', 'eles', lambda e: e.Electrons("Electrons"))
And when one enters df.eles.pt
the backend will see df.Electrons("Electrons").pt / 1000.0
.
The aliases can reference each other (though no recursion is allowed), so fairly complex expressions can be built up. This library's alias resolution is quite simple (it is a prototype). Matching is possible. For example, if the first argument is a .
, then only references directly off the dataframe are translated. This feature could be used to define a personality module for an analysis for an experiment.
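The substitution idea can be sketched in a few lines. Everything here (`define_alias_sketch`, the `Node` class, recording expressions as text) is a simplification invented for illustration and does not reflect how `define_alias` is actually implemented or how it matches its first argument.

```python
_aliases = {}


def define_alias_sketch(name, func):
    """Hypothetical, simplified cousin of define_alias: one alias per leaf name."""
    _aliases[name] = func


class Node:
    """Toy expression that records attribute access and division as text."""

    def __init__(self, text, expand=True):
        self._text = text
        self._expand = expand

    def __getattr__(self, name):
        if self._expand and name in _aliases:
            # Hand the alias a non-expanding view, so that `o.pt` inside the
            # alias refers to the raw leaf and does not trigger the alias again.
            result = _aliases[name](Node(self._text, expand=False))
            return Node(result._text)
        return Node(f"{self._text}.{name}", self._expand)

    def __truediv__(self, other):
        return Node(f"{self._text} / {other!r}", self._expand)

    def __str__(self):
        return self._text


define_alias_sketch("pt", lambda o: o.pt / 1000.0)

df = Node("df")
seen_by_backend = str(df.jets.pt)
print(seen_by_backend)  # df.jets.pt / 1000.0
```

The user types the short form, and only the expanded text ever reaches the part of the code that plays the role of the backend.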
While the above shows you what the library can track, it says nothing about how you use it. The following steps are necessary.
- Subclass dataframe_expressions.DataFrame within your library to create a "source" dataframe. For example, it could refer to a file, or a network endpoint that supplies data. Make sure you initialize the DataFrame sub class by calling its __init__ method; there is no need to pass any arguments. For this discussion let's call this MyDF.
- Users build expressions as you would expect: df = MyDF(...), and df1 = df.jets[df.jets.pt > 10].
- Users trigger rendering of the expression in your library in some way that makes sense, get_data(df1) for example, where you must supply the get_data method.
- When you get control with the DataFrame expression the user wants rendered, you can now do the following to render it:
from dataframe_expressions import render
expression, context = render(df1)
expression
is an ast.AST
that describes what is being looked at. If the expression is df.jets.pt
then the ast is a chain of python ast.Attribute
nodes, and the bottom one will be a special ast_DataFrame
object that contains a member DataFrame
which points to your original sub-classed MyDF
. You can tell it is the special DataFrame
because it will have no children.
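Walking such a chain of `ast.Attribute` nodes is short with the standard library. In the sketch below the expression comes from `ast.parse`, so the bottom node is an ordinary `ast.Name`; in a real backend it would be the special dataframe node described above. The helper name `attribute_chain` is made up for this example.

```python
import ast


def attribute_chain(node):
    """Peel off ast.Attribute nodes; return the bottom node and the leaf names."""
    names = []
    while isinstance(node, ast.Attribute):
        names.append(node.attr)
        node = node.value
    # Names were collected from the outside in, so reverse them.
    return node, list(reversed(names))


expr = ast.parse("df.jets.pt", mode="eval").body
root, chain = attribute_chain(expr)

print(type(root).__name__)  # Name
print(chain)                # ['jets', 'pt']
```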
If there are filters, there is another special ast object you need to be able to process: ast_Filter
. For example, df[df.met > 50].jets.pt
, will have expression starting with two ast.Attribute
nodes for the jets.pt
attributes, followed by an ast_Filter
node. The ast_Filter
object has one expression, filter
, which points to an expression that is the filter. It should evaluate to true or false. The second member points to the DataFrame
it is filtering - in this case MyDF
. Whenever there is a repeated phrase, like df
in df[df.met > 50].jets.pt
or df.jets
in df.jets[df.jets.count() == 2]
, they will point to the same ast_DataFrame
object - so you can use that when walking the tree to recognize common sub-expressions.
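Because a repeated phrase is the very same node object, a backend can spot it by identity. The sketch below builds a tree by hand with standard `ast` nodes, using `ast.Name` as a stand-in for the dataframe node and `ast.Subscript` as a stand-in for the filter node; both substitutions are assumptions made to keep the example free of library imports.

```python
import ast

# One root object, referenced twice: once as the thing being filtered and
# once inside the filter expression itself.
root = ast.Name(id="df", ctx=ast.Load())
met = ast.Attribute(value=root, attr="met", ctx=ast.Load())
test = ast.Compare(left=met, ops=[ast.Gt()], comparators=[ast.Constant(value=50)])
filtered = ast.Subscript(value=root, slice=test, ctx=ast.Load())
jets = ast.Attribute(value=filtered, attr="jets", ctx=ast.Load())
expr = ast.Attribute(value=jets, attr="pt", ctx=ast.Load())

# Count how often each Name object is reached while walking the tree.
visits = {}
for node in ast.walk(expr):
    if isinstance(node, ast.Name):
        visits[id(node)] = visits.get(id(node), 0) + 1

print(ast.unparse(expr))   # df[df.met > 50].jets.pt
print(visits[id(root)])    # 2
```

A visit count above one (or a cache keyed on the node's identity) tells the backend that two parts of the expression refer to the same underlying sequence.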
There is one last trick: lambda
functions. dataframe_expressions
can't evaluate the lambda functions without knowing more about the user's intent: so evaluating them must be triggered by your library. The lambda functions are represented by an ast_Callable
object. When you do encounter them, you can render them into the same ast.AST
like form by calling render_callable
and passing the context along with the ast_Callable
and any arguments to pass to the lambda
.
To see how this works, see packages like hep_tables
and hl_tables
.
The dumps
function will dump a dataframe to a string. For the most part, the string will be correct python (lambda functions and other function routines are the only exception). This is useful for including in error text or in logging in libraries that make use of this library.
The dataframe_expressions
library makes use of the python logging
library to dump expressions it is asked to render at the debug level. If you want to turn on just messages from this library the following code will dump debug level messages to stdout:
import logging
ch = logging.StreamHandler()
logging.getLogger('dataframe_expressions').setLevel(logging.DEBUG)
logging.getLogger('dataframe_expressions').addHandler(ch)
Not sure these are the right thing, but...
- Using the python ast module to record expressions. Mostly because it is already complete and there are nice visitor objects that make walking it easy. The downside is that python does change the ast every few versions.
- An attribute on DataFrame refers to some data. A method call, however, does not refer to data. So, you can say d.pt to get at the pt, but if you said d.pt() that would be "bad". The reason for this is so that we can add functions that do things in a fluent way. For example, d.jets.count() to count the number of jets. Or d.jets[d.jets.pt > 100].count() or similar. Really, the back end can interpret this, but the front-end semantics sort-of make this assumption.
This isn't an exhaustive list. Just a list of some choices I had to make to get this off the ground.
- Should there be a Column and Dataset?
  - Yes - turns out we have rediscovered why there is a Mask and a column distinction in numpy. So the Column object is really a Mask object. This is bad naming, but hopefully for this prototype that won't make much of a difference. So we should definitely think a bit about why a Mask has to be treated differently from a DataFrame - it isn't intuitively obvious until you get into the code.
  - No - since things can return "bool" values and we don't know it because we have no type system, they are identical to a column, except we assume they are a df: df[df.hasProdVtx & df.hasDecayVtx], for example.
  - We should get rid of the concept of a parent, dynamic, and replace it with ast_DataFrame - we have it in here already - so why not just stick to that rather than having both it and p.
- Should we allow for "&" and "|" as logical operators, redefining what they mean in python? numpy defines several logical operators which should translate, but those aren't implemented yet.
- I currently have a parent as "p" in the expression, but then we have a dataframe ast and column ast - which makes it not needed. Why not just convert to using the same thing to refer to a df in an ast?
  - Internally, the "parent" dataframe is represented as p - which means nothing can ever have a p object on it or all hell is likely to break loose. A very good argument for not doing it this way.
- For typing I do not know how to forward declare so I can use Column and DataFrame inside my method definitions. Static type checkers should pick this up for now by simple logic.
- Using BitAnd and BitOr for and and or - but should I use the logical and and or here to make it clear in the AST what we are talking about?
- What does d1[d[d.x > 0].jets.pt > 20].pt mean? Is this where we are hitting the limit of things? I'd say it means nothing and should create an error. Something like d1[(d[d.x > 0].jets.pt > 20).count()].pt works, however. Actually even the above - what does that mean? Isn't the right way to do that d1[(d[d.x > 0].jets[d.jets.pt>0].count())] or similar? Ugh. Ok - the thing to do for now is be strict, and we can add things which make life easier later.
- Sometimes functions are defined in places they make no sense. For example, abs (or any numpy function) is always defined, even if your DataFrame represents a collection of jets. A reason to have columns and collections as different objects to help the user, and help editors guess possibilities.
- There should be no concept of parent in a DataFrame. The expression should be everything, and point to any referenced objects. This will be especially true if multiple root DataFrames are ever to be used.
- Is it important to define new columns using the '=' sign? e.g. df.jets.ptgev = df.jets.pt/1000.0?
- The rule that every expression that is the same implies the same implied iterator. That means the current code can't do 2 jets, for example. There are several ways to "fix" this; however, the biggest question: is this reasonable?
- The ability to have an exclusive_object is implemented at runtime - perhaps we can come up with a scheme where we just define objects and they "fit" in correctly? Thus editors, etc., would be able to tag this as a problem.