Use custom resolver for query and eval with nested frames.

Verify preflighting of nested expressions using AST visitation. Remove logic for splitting queries by string. Now the evaluation is handled by a nested column resolver, and the mixed-mode expressions are preflighted by examining the parsed abstract syntax tree for the query expression.
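
As a rough illustration of the AST-based preflight described above (not the code from this commit), the sketch below parses a query expression with Python's standard `ast` module and records which references point into a nested column and which point at base columns. The names `NestedReferenceVisitor` and `preflight` are hypothetical, and the sketch assumes nested fields are written as `nested.t`.

```python
import ast


class NestedReferenceVisitor(ast.NodeVisitor):
    """Collect column references from a parsed query expression.

    Hypothetical helper for illustration only; not the nested-pandas API.
    """

    def __init__(self, nested_columns):
        self.nested_columns = set(nested_columns)
        self.nested_refs = set()  # dotted references such as "nested.t"
        self.base_refs = set()    # plain names such as "a"

    def visit_Attribute(self, node):
        # `nested.t` parses as Attribute(value=Name("nested"), attr="t").
        if isinstance(node.value, ast.Name) and node.value.id in self.nested_columns:
            self.nested_refs.add(f"{node.value.id}.{node.attr}")
        else:
            # Not a known nested column: keep walking the subtree.
            self.generic_visit(node)

    def visit_Name(self, node):
        # Any bare name reached here is treated as a base-column reference.
        self.base_refs.add(node.id)


def preflight(expr, nested_columns):
    """Return (nested_refs, base_refs) for a query expression string."""
    tree = ast.parse(expr, mode="eval")
    visitor = NestedReferenceVisitor(nested_columns)
    visitor.visit(tree)
    return visitor.nested_refs, visitor.base_refs


nested_refs, base_refs = preflight("a > 0.5 and nested.t < 10", ["nested"])
is_mixed = bool(nested_refs) and bool(base_refs)
```

Walking the parsed tree avoids the fragility of splitting the query string by hand, since operators, parentheses, and string literals are already resolved by the parser. In this sketch an expression counts as mixed-mode when both sets are non-empty.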
lincc-frameworks · Oct 9, 2024 · 78796eb
1 parent 832848a

Showing 10 changed files with 320 additions and 93 deletions.
diff --git a/.gitignore b/.gitignore
@@ -133,6 +133,9 @@ dmypy.json
 # vscode
 .vscode/
 
+# PyCharm
+.idea/
+
 # dask
 dask-worker-space/
 

diff --git a/docs/tutorials/data_loading_notebook.ipynb b/docs/tutorials/data_loading_notebook.ipynb
@@ -11,7 +11,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "With a valid Python environment, nested-pandas and it's dependencies are easy to install using the `pip` package manager. The following command can be used to install it:"
+    "With a valid Python environment, nested-pandas and its dependencies are easy to install using the `pip` package manager. The following command can be used to install it:"
    ]
   },
   {
@@ -47,7 +47,7 @@
     "\n",
     "We can use the `NestedFrame` constructor to create our base frame from a dictionary of our columns.\n",
     "\n",
-    "We can then create an addtional pandas dataframes and pack them into our `NestedFrame` with `NestedFrame.add_nested`"
+    "We can then create an addtional pandas dataframes and pack them into our `NestedFrame` with `NestedFrame.add_nested`."
    ]
   },
   {
@@ -97,7 +97,7 @@
     "# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.\n",
     "# You can of course remove this and use your own directory and real files on your system.\n",
     "with tempfile.TemporaryDirectory() as temp_path:\n",
-    "    # Generates parquet files with random data within our temporary directorye.\n",
+    "    # Generates parquet files with random data within our temporary directory.\n",
     "    generate_parquet_file(10, {\"nested1\": 100, \"nested2\": 10}, temp_path, file_per_layer=True)\n",
     "\n",
     "    # Read each individual parquet file into its own dataframe.\n",
@@ -148,7 +148,7 @@
    "source": [
     "So inspect `nf`, a `NestedFrame` we created from our call to `read_parquet` with the `to_pack` argument, we're able to pack nested parquet files according to the shared index values with the index in `base.parquet`.\n",
     "\n",
-    "The resulting `NestedFrame` having the same number of rows as `base.parquet` and with `nested1.parquet` and `nested2.parquet` packed into the 'nested1' and 'nested2' columns respectively."
+    "The resulting `NestedFrame` having the same number of rows as `base.parquet` and with `nested1.parquet` and `nested2.parquet` packed into the `nested1` and `nested2` columns respectively."
    ]
   },
   {
@@ -164,7 +164,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Since we loaded each individual parquet file into its own dataframe, we can also verify that using `read_parquet` with the `to_pack` argument is equivalent to the following method of packing the dataframes directly with `NestedFrame.add_nested`"
+    "Since we loaded each individual parquet file into its own dataframe, we can also verify that using `read_parquet` with the `to_pack` argument is equivalent to the following method of packing the dataframes directly with `NestedFrame.add_nested`."
    ]
   },
   {
@@ -189,11 +189,11 @@
    "source": [
     "# Saving NestedFrames to Parquet Files\n",
     "\n",
-    "Additionally we can save an existing `NestedFrame` as one of more parquet files using `NestedFrame.to_parquet``\n",
+    "Additionally we can save an existing `NestedFrame` as one of more parquet files using `NestedFrame.to_parquet`.\n",
     "\n",
     "When `by_layer=True` we save each individual layer of the NestedFrame into its own parquet file in a specified output directory.\n",
     "\n",
-    "The base layer will be outputted to \"base.parquet\", and each nested layer will be written to a file based on its column name. So the nested layer in column `nested1` will be written to \"nested1.parquet\"."
+    "The base layer will be outputted to `base.parquet`, and each nested layer will be written to a file based on its column name. So the nested layer in column `nested1` will be written to `nested1.parquet`."
    ]
   },
   {
@@ -233,7 +233,7 @@
    "source": [
     "We also support saving a `NestedFrame` as a single parquet file where the packed layers are still packed in their respective columns.\n",
     "\n",
-    "Here we provide `NestedFrame.to_parquet` with the desired path of the *single* output file (rather than the path of a directory to store *multiple* output files) and use `per_layer=False'\n",
+    "Here we provide `NestedFrame.to_parquet` with the desired path of the *single* output file (rather than the path of a directory to store *multiple* output files) and use `per_layer=False`.\n",
     "\n",
     "Our `read_parquet` function can load a `NestedFrame` saved in this single file parquet without requiring any additional arguments. "
    ]

diff --git a/docs/tutorials/data_manipulation.ipynb b/docs/tutorials/data_manipulation.ipynb
@@ -49,7 +49,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "First, we can directly fetch a column from our nested column (aptly called \"nested\"). For example, below we can fetch the time column, \"t\", by specifying `\"nested.t\"` as the column to retrieve. This returns a \"flat\" view of the nested t column, where all rows from all dataframes are present in one dataframe."
+    "First, we can directly fetch a column from our nested column (aptly called \"nested\"). For example, below we can fetch the time column, \"t\", by specifying `\"nested.t\"` as the column to retrieve. This returns a \"flat\" view of the nested `t` column, where all rows from all dataframes are present in one dataframe."
    ]
   },
   {
@@ -170,7 +170,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This is functionally equivalent to using `add_nested`"
+    "This is functionally equivalent to using `add_nested`:"
    ]
   },
   {

diff --git a/docs/tutorials/low_level.ipynb b/docs/tutorials/low_level.ipynb
@@ -8,7 +8,7 @@
     "# Lower-level interface for performance and flexibility\n",
     "## Reveal the hidden power of nested Series\n",
     "\n",
-    "This section is for users looking to optimize the performance, both computationally and in memory-usage, of their workflows. This section also details a broader suite of data representations usable within `nested-pandas`.\n",
+    "This section is for users looking to optimize both the compute and memory performance of their workflows. This section also details a broader suite of data representations usable within `nested-pandas`.\n",
     "It shows how to deal with individual nested columns: add, remove, and modify data using both \"flat-array\" and \"list-array\" representations.\n",
     "It also demonstrates how to convert nested Series to and from different data types, like `pd.ArrowDtype`d Series, flat dataframes, list-array dataframes, and collections of nested elements."
    ]
@@ -36,7 +36,7 @@
    "source": [
     "## Generate some data and get a Series of `NestedDtype` type\n",
     "\n",
-    "We are going to use built-in data generator to get a `NestedFrame` with a \"nested\" column being a `Series` of `NestedDtype` type.\n",
+    "We are going to use the built-in data generator to get a `NestedFrame` with a \"nested\" column being a `Series` of `NestedDtype` type.\n",
     "This column would represent [light curves](https://en.wikipedia.org/wiki/Light_curve) of some astronomical objects. "
    ]
   },
@@ -94,7 +94,7 @@
    "id": "33d8caacf0bf042e",
    "metadata": {},
    "source": [
-    "You can also get a list of fields with `.fields` attribute"
+    "You can also get a list of fields with `.fields` attribute:"
    ]
   },
   {
@@ -130,7 +130,7 @@
    "id": "7167f5a9c947d96f",
    "metadata": {},
    "source": [
-    "You can also get a subset of nested columns as a new nested Series"
+    "You can also get a subset of nested columns as a new nested Series:"
    ]
   },
   {
@@ -479,7 +479,7 @@
    "source": [
     "#### pd.Series from an array\n",
     "\n",
-    "Construction with `pyarrow` struct arrays is the cheapest way to create a nested Series. It is very semilliar to initialisation of a `pd.Series` of `pd.ArrowDtype` type."
+    "Construction with `pyarrow` struct arrays is the cheapest way to create a nested Series. It is very similar to the initialization of a `pd.Series` of `pd.ArrowDtype` type."
    ]
   },
   {
@@ -611,21 +611,21 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {
     "name": "ipython",
-    "version": 2
+    "version": 3
    },
    "file_extension": ".py",
    "mimetype": "text/x-python",
    "name": "python",
    "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython2",
-   "version": "2.7.6"
+   "pygments_lexer": "ipython3",
+   "version": "3.12.6"
   }
  },
  "nbformat": 4,

diff --git a/docs/tutorials/nested_spectra.ipynb b/docs/tutorials/nested_spectra.ipynb
@@ -79,7 +79,7 @@
     "flux = np.array([])\n",
     "err = np.array([])\n",
     "index = np.array([])\n",
-    "# Loop over each spectrum, adding it's data to the arrays\n",
+    "# Loop over each spectrum, adding its data to the arrays\n",
     "for i, hdu in enumerate(sp):\n",
     "    wave = np.append(wave, 10 ** hdu[\"COADD\"].data.loglam)  # * u.angstrom\n",
     "    flux = np.append(flux, hdu[\"COADD\"].data.flux * 1e-17)  # * u.erg/u.second/u.centimeter**2/u.angstrom\n",
@@ -115,7 +115,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "And we can see that each object now has the \"coadd_spectrum\" nested column with the full spectrum available."
+    "And we can see that each object now has the `coadd_spectrum` nested column with the full spectrum available."
    ]
   },
   {
@@ -161,7 +161,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "name": "python3"
   },
@@ -175,9 +175,9 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.11"
+   "version": "3.12.6"
   }
  },
  "nbformat": 4,
- "nbformat_minor": 2
+ "nbformat_minor": 4
 }