Add Notebook for Loading Data to NestedPandas (#85)
* Add Notebook for Loading Data to NestedPandas

* Clear notebook output

* Run pre-commit hooks

* Address review comments
wilsonbb authored May 17, 2024
1 parent 3dea29f commit 33a3e6f
docs/tutorials/data_loading_notebook.ipynb (289 additions, 0 deletions)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading Data into Nested-Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With a valid Python environment, nested-pandas and it's dependencies are easy to install using the `pip` package manager. The following command can be used to install it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# % pip install nested-pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nested_pandas.datasets import generate_parquet_file\n",
"from nested_pandas import NestedFrame\n",
"from nested_pandas import read_parquet\n",
"\n",
"import os\n",
"import pandas as pd\n",
"import tempfile"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading Data from Dictionaries\n",
"Nested-Pandas is tailored towards efficient analysis of nested datasets, and supports loading data from multiple sources.\n",
"\n",
"We can use the `NestedFrame` constructor to create our base frame from a dictionary of our columns.\n",
"\n",
"We can then create an addtional pandas dataframes and pack them into our `NestedFrame` with `NestedFrame.add_nested`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nf = NestedFrame(data={\"a\": [1, 2, 3], \"b\": [2, 4, 6]}, index=[0, 1, 2])\n",
"\n",
"nested = pd.DataFrame(\n",
" data={\"c\": [0, 2, 4, 1, 4, 3, 1, 4, 1], \"d\": [5, 4, 7, 5, 3, 1, 9, 3, 4]},\n",
" index=[0, 0, 0, 1, 1, 1, 2, 2, 2],\n",
")\n",
"\n",
"nf = nf.add_nested(nested, \"nested\")\n",
"nf"
]
},
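{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build intuition for what `add_nested` does, the following cell is a plain-pandas sketch, not the nested-pandas API: \"packing\" can be pictured as grouping the flat table's rows by index into per-row lists. This illustrates the concept only, not how nested-pandas represents data internally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A plain-pandas sketch of \"packing\": group the flat table's rows by index\n",
"# into per-row lists. Illustration only; nested-pandas uses a more efficient\n",
"# columnar representation under the hood.\n",
"nested.groupby(level=0).agg(list)"
]
},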
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading Data from Parquet Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For larger datasets, we support loading data from parquet files.\n",
"\n",
"In the following cell, we generate a series of temporary parquet files with random data, and ingest them with the `read_parquet` method.\n",
"\n",
"First we load each file individually as its own data frame to be inspected. Then we use `read_parquet` to create the `NestedFrame` `nf`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"base_df, nested1, nested2 = None, None, None\n",
"nf = None\n",
"\n",
"# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.\n",
"# You can of course remove this and use your own directory and real files on your system.\n",
"with tempfile.TemporaryDirectory() as temp_path:\n",
" # Generates parquet files with random data within our temporary directorye.\n",
" generate_parquet_file(10, {\"nested1\": 100, \"nested2\": 10}, temp_path, file_per_layer=True)\n",
"\n",
" # Read each individual parquet file into its own dataframe.\n",
" base_df = read_parquet(os.path.join(temp_path, \"base.parquet\"))\n",
" nested1 = read_parquet(os.path.join(temp_path, \"nested1.parquet\"))\n",
" nested2 = read_parquet(os.path.join(temp_path, \"nested2.parquet\"))\n",
"\n",
" # Create a single NestedFrame packing multiple parquet files.\n",
" nf = read_parquet(\n",
" data=os.path.join(temp_path, \"base.parquet\"),\n",
" to_pack={\n",
" \"nested1\": os.path.join(temp_path, \"nested1.parquet\"),\n",
" \"nested2\": os.path.join(temp_path, \"nested2.parquet\"),\n",
" },\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When examining the individual tables for each of our parquet files we can see that:\n",
"\n",
"a) they all have different dimensions\n",
"b) they have shared indices"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print the dimensions of all of our underlying tables\n",
"print(\"Our base table 'base.parquet' has shape:\", base_df.shape)\n",
"print(\"Our first nested table table 'nested1.parquet' has shape:\", nested1.shape)\n",
"print(\"Our second nested table table 'nested2.parquet' has shape:\", nested2.shape)\n",
"\n",
"# Print the unique indices in each table:\n",
"print(\"The unique indices in our base table are:\", base_df.index.values)\n",
"print(\"The unique indices in our first nested table are:\", nested1.index.unique())\n",
"print(\"The unique indices in our second nested table are:\", nested2.index.unique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So inspect `nf`, a `NestedFrame` we created from our call to `read_parquet` with the `to_pack` argument, we're able to pack nested parquet files according to the shared index values with the index in `base.parquet`.\n",
"\n",
"The resulting `NestedFrame` having the same number of rows as `base.parquet` and with `nested1.parquet` and `nested2.parquet` packed into the 'nested1' and 'nested2' columns respectively."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we loaded each individual parquet file into its own dataframe, we can also verify that using `read_parquet` with the `to_pack` argument is equivalent to the following method of packing the dataframes directly with `NestedFrame.add_nested`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Packing Together Existing Dataframes Into a NestedFrame"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"NestedFrame(base_df).add_nested(nested1, \"nested1\").add_nested(nested2, \"nested2\")"
]
},
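{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also check this equivalence directly. The cell below is a sketch that assumes `NestedFrame`, as a pandas `DataFrame` subclass, supports pandas' `equals` comparison; if the two routes produce differing dtypes, a column-by-column comparison may be needed instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compare the NestedFrame loaded via read_parquet(..., to_pack=...) against\n",
"# one packed manually with add_nested. Assumes pandas' equals() applies;\n",
"# this may return False if dtypes differ between the two routes.\n",
"packed_directly = NestedFrame(base_df).add_nested(nested1, \"nested1\").add_nested(nested2, \"nested2\")\n",
"nf.equals(packed_directly)"
]
},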
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Saving NestedFrames to Parquet Files\n",
"\n",
"Additionally we can save an existing `NestedFrame` as one of more parquet files using `NestedFrame.to_parquet``\n",
"\n",
"When `by_layer=True` we save each individual layer of the NestedFrame into its own parquet file in a specified output directory.\n",
"\n",
"The base layer will be outputted to \"base.parquet\", and each nested layer will be written to a file based on its column name. So the nested layer in column `nested1` will be written to \"nested1.parquet\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"restored_nf = None\n",
"\n",
"# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.\n",
"# You can of course remove this and use your own directory and real files on your system.\n",
"with tempfile.TemporaryDirectory() as temp_path:\n",
" nf.to_parquet(\n",
" temp_path, # The directory to save our output parquet files.\n",
" by_layer=True, # Save each layer of the NestedFrame to its own parquet file.\n",
" )\n",
"\n",
" # List the files in temp_path to ensure they were saved correctly.\n",
" print(\"The NestedFrame was saved to the following parquet files :\", os.listdir(temp_path))\n",
"\n",
" # Read the NestedFrame back in from our saved parquet files.\n",
" restored_nf = read_parquet(\n",
" data=os.path.join(temp_path, \"base.parquet\"),\n",
" to_pack={\n",
" \"nested1\": os.path.join(temp_path, \"nested1.parquet\"),\n",
" \"nested2\": os.path.join(temp_path, \"nested2.parquet\"),\n",
" },\n",
" )\n",
"\n",
"restored_nf # our dataframe is restored from our saved parquet files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also support saving a `NestedFrame` as a single parquet file where the packed layers are still packed in their respective columns.\n",
"\n",
"Here we provide `NestedFrame.to_parquet` with the desired path of the *single* output file (rather than the path of a directory to store *multiple* output files) and use `per_layer=False'\n",
"\n",
"Our `read_parquet` function can load a `NestedFrame` saved in this single file parquet without requiring any additional arguments. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"restored_nf_single_file = None\n",
"\n",
"# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.\n",
"# You can of course remove this and use your own directory and real files on your system.\n",
"with tempfile.TemporaryDirectory() as temp_path:\n",
" output_path = os.path.join(temp_path, \"output.parquet\")\n",
" nf.to_parquet(\n",
" output_path, # The filename to save our NestedFrame to.\n",
" by_layer=False, # Save the entire NestedFrame to a single parquet file.\n",
" )\n",
"\n",
" # List the files within our temp_path to ensure that we only saved a single parquet file.\n",
" print(\"The NestedFrame was saved to the following parquet files :\", os.listdir(temp_path))\n",
"\n",
" # Read the NestedFrame back in from our saved single parquet file.\n",
" restored_nf_single_file = read_parquet(output_path)\n",
"\n",
"restored_nf_single_file # our dataframe is restored from a single saved parquet file"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
