Commit 13aa10c

Merge commit '151c08f0dd62228271a6242eb4164a0eccaf5569' into streamer
# Conflicts:
#	.pre-commit-config.yaml
#	docs/walkthrough/one-dim.ipynb
#	sharrow/aster.py
#	sharrow/dataset.py
#	sharrow/flows.py
#	sharrow/nested_logit.py
#	sharrow/relationships.py
#	sharrow/shared_memory.py
#	sharrow/tests/test_relationships.py
jpn-- committed Dec 5, 2023
2 parents ce6bfea + 151c08f commit 13aa10c
Showing 31 changed files with 2,418 additions and 307 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/run-tests.yml
@@ -46,12 +46,12 @@ jobs:
       run: |
         conda info -a
         conda list
-    - name: Lint with flake8
+    - name: Lint with Ruff
       run: |
         # stop the build if there are Python syntax errors or undefined names
-        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
-        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
-        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
+        ruff check . --select=E9,F63,F7,F82 --statistics
+        # exit-zero treats all errors as warnings.
+        ruff check . --exit-zero --statistics
     - name: Test with pytest
       run: |
         python -m pytest
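The workflow change above swaps the flake8 gate for Ruff while keeping the same two-pass structure: a hard failure on syntax errors and undefined names, then an informational pass that never fails the build. As a hedged sketch (not part of the commit), the same two passes can be reproduced locally from Python, assuming the `ruff` executable is installed in the active environment:

```python
# Sketch: reproduce the two CI lint passes locally. Assumes `ruff` is on PATH
# (e.g. installed with `pip install ruff`); the commands mirror the diff above.
import subprocess
import sys

# Pass 1: fail only on syntax errors and undefined names (E9, F63, F7, F82).
gate = subprocess.run(
    ["ruff", "check", ".", "--select=E9,F63,F7,F82", "--statistics"]
)

# Pass 2: report remaining findings as warnings only; --exit-zero never fails.
subprocess.run(["ruff", "check", ".", "--exit-zero", "--statistics"])

sys.exit(gate.returncode)
```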
14 changes: 10 additions & 4 deletions .pre-commit-config.yaml
@@ -1,7 +1,7 @@
 repos:

 - repo: https://github.com/pre-commit/pre-commit-hooks
-  rev: v4.3.0
+  rev: v4.4.0
   hooks:
   - id: check-yaml
   - id: end-of-file-fixer
Expand All @@ -13,18 +13,24 @@ repos:
hooks:
- id: nbstripout

- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: v0.0.274
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]

- repo: https://github.com/pycqa/isort
rev: 5.10.1
rev: 5.12.0
hooks:
- id: isort
args: ["--profile", "black", "--filter-files"]

- repo: https://github.com/psf/black
rev: 22.10.0
rev: 23.3.0
hooks:
- id: black

- repo: https://github.com/PyCQA/flake8
rev: 5.0.4
rev: 4.0.1
hooks:
- id: flake8
8 changes: 4 additions & 4 deletions docs/_script/hide_test_cells.py
@@ -11,10 +11,10 @@

 # Text to look for in adding tags
 text_search_dict = {
-    "# TEST": "remove_cell",  # Remove the whole cell
-    "# HIDDEN": "remove_cell",  # Remove the whole cell
-    "# NO CODE": "remove_input",  # Remove only the input
-    "# HIDE CODE": "hide_input",  # Hide the input w/ a button to show
+    "# TEST": "remove-cell",  # Remove the whole cell
+    "# HIDDEN": "remove-cell",  # Remove the whole cell
+    "# NO CODE": "remove-input",  # Remove only the input
+    "# HIDE CODE": "hide-input",  # Hide the input w/ a button to show
 }

 # Search through each notebook and look for th text, add a tag if necessary
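The underscore-to-hyphen change above brings the tags in line with the hyphenated names that Jupyter Book and related notebook tooling recognize (`remove-cell`, `remove-input`, `hide-input`). For orientation only, here is a hedged sketch of the kind of tagging such a script performs, written against `nbformat`; the function name and the example path are hypothetical, and this is not the repository's exact implementation:

```python
# Illustrative sketch, not the repository's implementation: add a tag to any
# notebook cell whose source contains one of the marker comments above.
import nbformat

TEXT_SEARCH = {
    "# TEST": "remove-cell",
    "# HIDDEN": "remove-cell",
    "# NO CODE": "remove-input",
    "# HIDE CODE": "hide-input",
}


def tag_notebook(path: str) -> None:
    """Read a notebook, tag matching cells, and write it back in place."""
    nb = nbformat.read(path, as_version=4)
    for cell in nb.cells:
        for marker, tag in TEXT_SEARCH.items():
            if marker in cell.source:
                tags = cell.metadata.setdefault("tags", [])
                if tag not in tags:
                    tags.append(tag)
    nbformat.write(nb, path)


# Hypothetical usage:
# tag_notebook("docs/walkthrough/encoding.ipynb")
```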
207 changes: 199 additions & 8 deletions docs/walkthrough/encoding.ipynb
@@ -16,7 +16,7 @@
"id": "f17c8818",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
Expand All @@ -30,7 +30,9 @@
"cell_type": "code",
"execution_count": null,
"id": "d4e7246c",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import numpy as np\n",
@@ -217,7 +219,7 @@
"id": "2ad591bb",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -296,7 +298,7 @@
"id": "bd549b5e",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -377,7 +379,7 @@
"id": "7d74e53e",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -477,7 +479,7 @@
"id": "a016d30f",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -525,7 +527,7 @@
"id": "28afb335",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -690,6 +692,195 @@
" _name='WLK_LOC_WLK_FAR'\n",
").to_series() == [0,152,474]).all()"
]
},
{
"cell_type": "markdown",
"id": "cb219dc3-dd66-44cd-a7c5-2a1da4bc1467",
"metadata": {
"tags": []
},
"source": [
"# Pandas Categorical Dtype\n",
"\n",
"Dictionary encoding is very similar to the approach used for the pandas Categorical dtype, and\n",
"can be used to achieve some of the efficiencies of categorical data, even though xarray lacks\n",
"a formal native categorical data representation. Sharrow's `construct` function for creating\n",
"Dataset objects will automatically use dictionary encoding for \"category\" data. \n",
"\n",
"To demonstrate, we'll load some household data and create a categorical data column."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b765919-69a4-4fb0-b805-9d3b5fed7897",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh = sh.example_data.get_households()\n",
"hh[\"income_grp\"] = pd.cut(hh.income, bins=[-np.inf,30000,60000,np.inf], labels=['Low', \"Mid\", \"High\"])\n",
"hh = hh[[\"income\",\"income_grp\"]]\n",
"hh.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "312faa0b-13cf-4649-9835-7a53b5e81a0b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh.info()"
]
},
{
"cell_type": "markdown",
"id": "c51a88d2-02b1-4502-9f4b-271fbb126699",
"metadata": {},
"source": [
"We'll then create a Dataset using construct."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd1c2fd5-59c6-48cb-bd6e-d2f9dde2aa36",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh_dataset = sh.dataset.construct(hh[[\"income\",\"income_grp\"]])\n",
"hh_dataset"
]
},
{
"cell_type": "markdown",
"id": "033b3629-a16b-47a4-bb18-10af9c7c4f07",
"metadata": {},
"source": [
"Note that the \"income\" variable remains an integer as expected, but the \"income_grp\" variable, \n",
"which had been a \"category\" dtype in pandas, is now stored as an `int8`, giving the \n",
"category _index_ of each element (it would be an `int16` or larger if needed, but that's\n",
"not necessary with only 3 categories). The information about the labels for the categories is \n",
"retained not in the data itself but in the `digital_encoding`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "369442af-1c69-41eb-b530-ea398d6eac7a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh_dataset[\"income_grp\"].digital_encoding"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58db6505-1c90-475e-8d91-0e2e89ec0f0e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# TESTING\n",
"assert hh_dataset[\"income_grp\"].dtype == \"int8\"\n",
"assert hh_dataset[\"income_grp\"].digital_encoding.keys() == {'dictionary', 'ordered'}\n",
"assert all(hh_dataset[\"income_grp\"].digital_encoding['dictionary'] == np.array(['Low', 'Mid', 'High'], dtype='<U4'))\n",
"assert hh_dataset[\"income_grp\"].digital_encoding['ordered'] is True"
]
},
{
"cell_type": "markdown",
"id": "38f8a6c9-4bca-4e73-82b0-e0996814565a",
"metadata": {},
"source": [
"If you try to make the return trip to a pandas DataFrame using the regular \n",
"`xarray.Dataset.to_pandas()` method, the details of the categorical nature\n",
"of this variable are lost, and only the int8 index is available."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad2f8677-b02c-4d28-892f-3272804bf714",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh_dataset.to_pandas()"
]
},
{
"cell_type": "markdown",
"id": "95f26992-241f-4982-aa56-f1055a35f969",
"metadata": {},
"source": [
"But, if you use the `single_dim` accessor on the dataset provided by sharrow,\n",
"the categorical dtype is restored correctly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "43b94637-c025-4578-9e97-4b6484cb231e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh_dataset.single_dim.to_pandas()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1b86a4b-c19f-4ae9-8a8b-395f53209bc6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# TESTING\n",
"pd.testing.assert_frame_equal(\n",
" hh_dataset.single_dim.to_pandas(),\n",
" hh\n",
")"
]
},
{
"cell_type": "markdown",
"id": "5e66c294-747d-414c-8e03-b8b551a0e2a9",
"metadata": {},
"source": [
"Note that this automatic handling of categorical data only applies when constructing\n",
"or deconstructing a dataset with a single dimension (i.e. the `index` is not a MultiIndex).\n",
"Multidimensional datasets use the normal xarray processing, which will dump string\n",
"categoricals back into python objects, which is bad news for high performance applications."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2208108a-d051-4941-ad72-b89a05169d81",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"sh.dataset.construct(\n",
" hh[[\"income\",\"income_grp\"]].reset_index().set_index([\"HHID\", \"income\"])\n",
")"
]
}
],
"metadata": {
Expand All @@ -708,7 +899,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.10"
"version": "3.10.9"
},
"toc": {
"base_numbering": 1,
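The cells added above document how sharrow's `construct` function dictionary-encodes pandas categorical columns and how the `single_dim` accessor restores them. As a condensed, hedged recap, assembled only from the calls shown in those cells and assuming sharrow's bundled example data is available, the round trip looks roughly like this:

```python
# Condensed recap of the round trip shown in the new encoding.ipynb cells.
import numpy as np
import pandas as pd
import sharrow as sh

hh = sh.example_data.get_households()
hh["income_grp"] = pd.cut(
    hh.income, bins=[-np.inf, 30000, 60000, np.inf], labels=["Low", "Mid", "High"]
)
hh = hh[["income", "income_grp"]]

# construct() stores the category codes as int8 and keeps the labels in the
# variable's digital_encoding ({'dictionary': ..., 'ordered': True}).
hh_dataset = sh.dataset.construct(hh)
print(hh_dataset["income_grp"].dtype)
print(hh_dataset["income_grp"].digital_encoding)

# Plain to_pandas() returns only the integer codes; the single_dim accessor
# restores the pandas categorical dtype on the way back out.
print(hh_dataset.to_pandas()["income_grp"].head())
print(hh_dataset.single_dim.to_pandas()["income_grp"].dtype)  # category
```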
6 changes: 5 additions & 1 deletion docs/walkthrough/one-dim.ipynb
@@ -939,7 +939,11 @@
"cell_type": "code",
"execution_count": null,
"id": "fb9f17cc",
"metadata": {},
"metadata": {
"tags": [
"raises-exception"
]
},
"outputs": [],
"source": [
"nest_tree"
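The `raises-exception` tag added above marks a cell that is expected to error, so notebook execution tools such as nbclient (and the Jupyter Book toolchain built on it) let that cell raise without aborting the run. A minimal sketch of executing the notebook under that convention, assuming `nbclient` and `nbformat` are installed and the path matches your checkout:

```python
# Sketch: execute the walkthrough notebook; the cell tagged "raises-exception"
# is allowed to raise without failing the whole execution.
import nbformat
from nbclient import NotebookClient

nb = nbformat.read("docs/walkthrough/one-dim.ipynb", as_version=4)
NotebookClient(nb).execute()
```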