Update to use pandas 2.x #838

Draft · wants to merge 18 commits into base: main

jpn--
Member

@jpn-- jpn-- commented Mar 25, 2024

Addresses #794.

The update from pandas 1.x to 2.x introduces a number of small but material changes that affect ActivitySim:

  • DataFrame Index objects are all one class with different datatypes, instead of being different classes (e.g. there is no more Int64Index class).
  • The read_csv function now by default interprets "None" as a missing value (i.e. NaN) instead of keeping it as the string "None".
  • The groupby operation, when applied to categorical data, now sorts the categories in the result unless told not to (resulting in different order of rows in outputs for some operations).
  • A simple df.join() also potentially sorts the resulting rows differently unless an explicit sort argument is given.
  • Index objects no longer have the is_monotonic attribute; use is_monotonic_increasing instead.
  • The handling of dtypes appears to have improved in some instances: some operations that used to promote dtypes no longer do so (e.g. variables that are originally int16 used to become int64 after some operations, and now they stay int16).
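
Some of the changes listed above can be handled with code that behaves the same under pandas 1.x and 2.x. The following is a minimal sketch, not code from this PR: the helper names `is_integer_index`, `read_csv_keep_none`, and `is_sorted_index` are hypothetical, and the choice of `na_values` is an assumption made for illustration.

```python
import io

import pandas as pd


def is_integer_index(index: pd.Index) -> bool:
    """Check the index dtype instead of its class (Int64Index is gone in 2.x)."""
    return pd.api.types.is_integer_dtype(index.dtype)


def read_csv_keep_none(source) -> pd.DataFrame:
    """Read a CSV while keeping the literal text "None" as a string.

    Assumption: only empty fields and "NA" should count as missing here.
    """
    return pd.read_csv(source, keep_default_na=False, na_values=["", "NA"])


def is_sorted_index(index: pd.Index) -> bool:
    """Use is_monotonic_increasing, which exists in both pandas 1.x and 2.x."""
    return bool(index.is_monotonic_increasing)


# Hypothetical two-row input: one literal "None" and one missing value.
df = read_csv_keep_none(io.StringIO("mode,count\nNone,1\nNA,2\n"))
```

Listing the missing-value markers explicitly keeps the parsing independent of whichever defaults the installed pandas version ships with.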

@jpn-- jpn-- requested a review from i-am-sijia March 26, 2024 00:21
@jpn-- jpn-- marked this pull request as draft April 3, 2024 23:05
Member Author

jpn-- commented Apr 3, 2024

While I've made these updates and all the regular CI tests pass (i.e. the results look correct), I have discovered the change to pandas 2.x incurs a significant runtime penalty when running without sharrow.

non-sharrow test timings for pandas 1.x:

58.60s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp
53.71s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
53.66s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
53.23s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode

non-sharrow test timings for pandas 2.x:

148.50s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
148.14s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
147.83s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode
140.09s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp
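
To put the two listings side by side, the slowdown per test can be computed directly from the reported durations. This is only arithmetic on the numbers above; the dictionary names are illustrative.

```python
# Durations in seconds, copied from the two pytest timing listings above.
pandas_1x = {
    "test_mtc_mp": 58.60,
    "test_mtc": 53.71,
    "test_mtc_chunkless": 53.66,
    "test_mtc_recode": 53.23,
}
pandas_2x = {
    "test_mtc": 148.50,
    "test_mtc_chunkless": 148.14,
    "test_mtc_recode": 147.83,
    "test_mtc_mp": 140.09,
}

# Ratio of pandas 2.x runtime to pandas 1.x runtime for each test.
slowdown = {name: pandas_2x[name] / pandas_1x[name] for name in pandas_1x}

for name, ratio in sorted(slowdown.items()):
    print(f"{name:20s} {ratio:.2f}x")
```

On these numbers the three single-process tests take roughly 2.8 times as long under pandas 2.x, and the multiprocess test roughly 2.4 times as long.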

It will require some research to figure out why this is happening, and whether it can be solved relatively easily... or at all. Initial profiling suggests the problem is in pandas.core.internals.managers.BlockManager.get_dtypes, which is getting called from df.eval, but we almost certainly do not want to mess around with pandas internals.
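
One way to check where the time goes inside df.eval is the standard-library cProfile module. The following is a minimal sketch on a synthetic frame, not the profiling actually run for this PR; `profile_eval`, the column names, and the expression are all hypothetical.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def profile_eval(df: pd.DataFrame, expr: str, repeats: int = 100, top: int = 15) -> str:
    """Profile repeated df.eval calls and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeats):
        df.eval(expr)
    profiler.disable()

    # Render the report into a string so it can be saved or compared later.
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(top)
    return report.getvalue()


# Synthetic wide frame; real ActivitySim tables will differ in shape and dtypes.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.random((1_000, 50)), columns=[f"c{i}" for i in range(50)])
report = profile_eval(frame, "c0 + c1 * c2", repeats=20)
```

Running the same script in a pandas 1.x and a pandas 2.x environment and comparing the two reports should show whether get_dtypes accounts for a larger share of the cumulative time in 2.x.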

@@ -236,6 +236,8 @@ def vehicle_allocation(
logger.info("Running for occupancy = %d", occup)
# setting occup for access in spec expressions
locals_dict.update({"occup": occup})
if model_settings.sharrow_skip:
locals_dict["disable_sharrow"] = True
Contributor

I may be misremembering, but why would we opt out of sharrow for vehicle allocation?

t = pa.Table.from_pandas(df, preserve_index=True, columns=columns)
except (pa.ArrowTypeError, pa.ArrowInvalid):
# if there are object columns, try to convert them to categories
df = df.copy()
Contributor

I saw your latest comment about the significantly longer run time with this PR. I noticed you are calling copy() here, which makes a deep copy by default. I wonder if this contributes to the run time?

Member Author

I don't think this is causing the problem. This code only executes in the write_tables step at the end of the model run.

@jpn-- jpn-- self-assigned this Apr 18, 2024