Add Notebook for Loading Data to NestedPandas (#85)
* Add Notebook for Loading Data to NestedPandas

* Clear notebook output

* Run pre-commit hooks

* Address review comments
wilsonbb authored May 17, 2024
1 parent 3dea29f commit 33a3e6f
docs/tutorials/data_loading_notebook.ipynb (289 additions, 0 deletions)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading Data into Nested-Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With a valid Python environment, nested-pandas and it's dependencies are easy to install using the `pip` package manager. The following command can be used to install it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# % pip install nested-pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nested_pandas.datasets import generate_parquet_file\n",
"from nested_pandas import NestedFrame\n",
"from nested_pandas import read_parquet\n",
"\n",
"import os\n",
"import pandas as pd\n",
"import tempfile"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading Data from Dictionaries\n",
"Nested-Pandas is tailored towards efficient analysis of nested datasets, and supports loading data from multiple sources.\n",
"\n",
"We can use the `NestedFrame` constructor to create our base frame from a dictionary of our columns.\n",
"\n",
"We can then create an addtional pandas dataframes and pack them into our `NestedFrame` with `NestedFrame.add_nested`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nf = NestedFrame(data={\"a\": [1, 2, 3], \"b\": [2, 4, 6]}, index=[0, 1, 2])\n",
"\n",
"nested = pd.DataFrame(\n",
" data={\"c\": [0, 2, 4, 1, 4, 3, 1, 4, 1], \"d\": [5, 4, 7, 5, 3, 1, 9, 3, 4]},\n",
" index=[0, 0, 0, 1, 1, 1, 2, 2, 2],\n",
")\n",
"\n",
"nf = nf.add_nested(nested, \"nested\")\n",
"nf"
]
},
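{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build intuition for what `add_nested` does, the following cell is a plain-pandas sketch, not the nested-pandas API: \"packing\" can be pictured as grouping the flat table's rows by index into per-row lists. This illustrates the concept only, not how nested-pandas represents data internally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A plain-pandas sketch of \"packing\": group the flat table's rows by index\n",
"# into per-row lists. Illustration only; nested-pandas uses a more efficient\n",
"# columnar representation under the hood.\n",
"nested.groupby(level=0).agg(list)"
]
},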
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading Data from Parquet Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For larger datasets, we support loading data from parquet files.\n",
"\n",
"In the following cell, we generate a series of temporary parquet files with random data, and ingest them with the `read_parquet` method.\n",
"\n",
"First we load each file individually as its own data frame to be inspected. Then we use `read_parquet` to create the `NestedFrame` `nf`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"base_df, nested1, nested2 = None, None, None\n",
"nf = None\n",
"\n",
"# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.\n",
"# You can of course remove this and use your own directory and real files on your system.\n",
"with tempfile.TemporaryDirectory() as temp_path:\n",
" # Generates parquet files with random data within our temporary directorye.\n",
" generate_parquet_file(10, {\"nested1\": 100, \"nested2\": 10}, temp_path, file_per_layer=True)\n",
"\n",
" # Read each individual parquet file into its own dataframe.\n",
" base_df = read_parquet(os.path.join(temp_path, \"base.parquet\"))\n",
" nested1 = read_parquet(os.path.join(temp_path, \"nested1.parquet\"))\n",
" nested2 = read_parquet(os.path.join(temp_path, \"nested2.parquet\"))\n",
"\n",
" # Create a single NestedFrame packing multiple parquet files.\n",
" nf = read_parquet(\n",
" data=os.path.join(temp_path, \"base.parquet\"),\n",
" to_pack={\n",
" \"nested1\": os.path.join(temp_path, \"nested1.parquet\"),\n",
" \"nested2\": os.path.join(temp_path, \"nested2.parquet\"),\n",
" },\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When examining the individual tables for each of our parquet files we can see that:\n",
"\n",
"a) they all have different dimensions\n",
"b) they have shared indices"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print the dimensions of all of our underlying tables\n",
"print(\"Our base table 'base.parquet' has shape:\", base_df.shape)\n",
"print(\"Our first nested table table 'nested1.parquet' has shape:\", nested1.shape)\n",
"print(\"Our second nested table table 'nested2.parquet' has shape:\", nested2.shape)\n",
"\n",
"# Print the unique indices in each table:\n",
"print(\"The unique indices in our base table are:\", base_df.index.values)\n",
"print(\"The unique indices in our first nested table are:\", nested1.index.unique())\n",
"print(\"The unique indices in our second nested table are:\", nested2.index.unique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So inspect `nf`, a `NestedFrame` we created from our call to `read_parquet` with the `to_pack` argument, we're able to pack nested parquet files according to the shared index values with the index in `base.parquet`.\n",
"\n",
"The resulting `NestedFrame` having the same number of rows as `base.parquet` and with `nested1.parquet` and `nested2.parquet` packed into the 'nested1' and 'nested2' columns respectively."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we loaded each individual parquet file into its own dataframe, we can also verify that using `read_parquet` with the `to_pack` argument is equivalent to the following method of packing the dataframes directly with `NestedFrame.add_nested`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Packing Together Existing Dataframes Into a NestedFrame"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"NestedFrame(base_df).add_nested(nested1, \"nested1\").add_nested(nested2, \"nested2\")"
]
},
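{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also check this equivalence directly. The cell below is a sketch that assumes `NestedFrame`, as a pandas `DataFrame` subclass, supports pandas' `equals` comparison; if the two routes produce differing dtypes, a column-by-column comparison may be needed instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compare the NestedFrame loaded via read_parquet(..., to_pack=...) against\n",
"# one packed manually with add_nested. Assumes pandas' equals() applies;\n",
"# this may return False if dtypes differ between the two routes.\n",
"packed_directly = NestedFrame(base_df).add_nested(nested1, \"nested1\").add_nested(nested2, \"nested2\")\n",
"nf.equals(packed_directly)"
]
},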
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Saving NestedFrames to Parquet Files\n",
"\n",
"Additionally we can save an existing `NestedFrame` as one of more parquet files using `NestedFrame.to_parquet``\n",
"\n",
"When `by_layer=True` we save each individual layer of the NestedFrame into its own parquet file in a specified output directory.\n",
"\n",
"The base layer will be outputted to \"base.parquet\", and each nested layer will be written to a file based on its column name. So the nested layer in column `nested1` will be written to \"nested1.parquet\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"restored_nf = None\n",
"\n",
"# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.\n",
"# You can of course remove this and use your own directory and real files on your system.\n",
"with tempfile.TemporaryDirectory() as temp_path:\n",
" nf.to_parquet(\n",
" temp_path, # The directory to save our output parquet files.\n",
" by_layer=True, # Save each layer of the NestedFrame to its own parquet file.\n",
" )\n",
"\n",
" # List the files in temp_path to ensure they were saved correctly.\n",
" print(\"The NestedFrame was saved to the following parquet files :\", os.listdir(temp_path))\n",
"\n",
" # Read the NestedFrame back in from our saved parquet files.\n",
" restored_nf = read_parquet(\n",
" data=os.path.join(temp_path, \"base.parquet\"),\n",
" to_pack={\n",
" \"nested1\": os.path.join(temp_path, \"nested1.parquet\"),\n",
" \"nested2\": os.path.join(temp_path, \"nested2.parquet\"),\n",
" },\n",
" )\n",
"\n",
"restored_nf # our dataframe is restored from our saved parquet files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also support saving a `NestedFrame` as a single parquet file where the packed layers are still packed in their respective columns.\n",
"\n",
"Here we provide `NestedFrame.to_parquet` with the desired path of the *single* output file (rather than the path of a directory to store *multiple* output files) and use `per_layer=False'\n",
"\n",
"Our `read_parquet` function can load a `NestedFrame` saved in this single file parquet without requiring any additional arguments. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"restored_nf_single_file = None\n",
"\n",
"# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.\n",
"# You can of course remove this and use your own directory and real files on your system.\n",
"with tempfile.TemporaryDirectory() as temp_path:\n",
" output_path = os.path.join(temp_path, \"output.parquet\")\n",
" nf.to_parquet(\n",
" output_path, # The filename to save our NestedFrame to.\n",
" by_layer=False, # Save the entire NestedFrame to a single parquet file.\n",
" )\n",
"\n",
" # List the files within our temp_path to ensure that we only saved a single parquet file.\n",
" print(\"The NestedFrame was saved to the following parquet files :\", os.listdir(temp_path))\n",
"\n",
" # Read the NestedFrame back in from our saved single parquet file.\n",
" restored_nf_single_file = read_parquet(output_path)\n",
"\n",
"restored_nf_single_file # our dataframe is restored from a single saved parquet file"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
