Update to use pandas 2.x #838

jpn-- · 2024-03-25T03:18:04Z

Addresses #794.

The update from pandas 1.x to 2.x introduces a number of small but material changes that affect ActivitySim:

DataFrame Index objects are all one class with different datatypes, instead of being different classes (e.g. there is no more Int64Index class).
The read_csv function now by default interprets the text "None" as a missing value (i.e. NaN) instead of reading it as the literal string "None".
The groupby operation, when applied to categorical data, now sorts the categories in the result unless told not to (resulting in different order of rows in outputs for some operations).
A simple df.join() also potentially sorts the resulting rows differently unless an explicit sort argument is given.
Index objects can no longer be checked with is_monotonic; use is_monotonic_increasing instead.
The handling of dtypes appears to have improved in some instances: some operations that used to promote dtypes no longer do (e.g. variables that are originally int16 used to become int64 after some operations, and now they don't).
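A minimal sketch of how several of the changes listed above can be handled so that the same code behaves the same way on pandas 1.x and 2.x. The column names and data are made up for illustration, and this is not the code from this PR; it only uses public pandas calls (`pd.api.types.is_integer_dtype`, `Index.is_monotonic_increasing`, `read_csv` with `keep_default_na`/`na_values`, `groupby` with `sort`/`observed`, and `DataFrame.join` with `sort`).

```python
import io

import pandas as pd

# 1. Integer indexes: test the dtype, not the class. pandas 2.x has no
#    Int64Index, but a dtype check works on both 1.x and 2.x.
idx = pd.Index([3, 1, 2])
is_int_index = pd.api.types.is_integer_dtype(idx.dtype)

# 2. Monotonicity: is_monotonic_increasing exists on both major versions.
is_sorted = idx.is_monotonic_increasing

# 3. read_csv: spell out the missing-value markers so the literal text
#    "None" stays a string regardless of the installed pandas defaults.
#    Here only an empty field is treated as missing.
csv_text = "person_id,tour_mode\n1,None\n2,WALK\n3,\n"
modes = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)

# 4. groupby on a categorical: pass sort= and observed= explicitly rather
#    than relying on version-specific defaults.
trips = pd.DataFrame(
    {
        "tour_mode": pd.Categorical(
            ["WALK", "BIKE", "WALK"], categories=["BIKE", "WALK", "DRIVE"]
        ),
        "n": [1, 2, 3],
    }
)
totals = trips.groupby("tour_mode", sort=False, observed=True)["n"].sum()

# 5. join: pass sort=True when downstream code expects rows ordered by
#    the join key.
left = pd.DataFrame({"a": [10, 20]}, index=[2, 1])
right = pd.DataFrame({"b": [30, 40]}, index=[3, 1])
joined = left.join(right, how="outer", sort=True)

print(is_int_index, is_sorted)
print(modes)
print(totals)
print(joined)
```

Spelling out these arguments trades a little verbosity for outputs that do not depend on which pandas defaults happen to be installed.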

jpn-- · 2024-04-03T23:14:28Z

While I've made these updates and all the regular CI tests pass (i.e. the results look correct), I have discovered that the change to pandas 2.x incurs a significant runtime penalty when running without sharrow: the tests below take roughly 2.4 to 2.8 times as long.

non-sharrow test timings for pandas 1.x:

58.60s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp
53.71s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
53.66s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
53.23s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode

non-sharrow test timings for pandas 2.x:

148.50s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
148.14s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
147.83s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode
140.09s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp
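As a quick check on the size of the penalty, the per-test slowdown can be computed directly from the timings above. This is only arithmetic on the reported numbers, not a new measurement.

```python
# Wall-clock timings in seconds, copied from the pytest output above.
pandas_1x = {
    "test_mtc_mp": 58.60,
    "test_mtc": 53.71,
    "test_mtc_chunkless": 53.66,
    "test_mtc_recode": 53.23,
}
pandas_2x = {
    "test_mtc_mp": 140.09,
    "test_mtc": 148.50,
    "test_mtc_chunkless": 148.14,
    "test_mtc_recode": 147.83,
}

# Ratio of pandas 2.x runtime to pandas 1.x runtime for each test.
slowdown = {name: pandas_2x[name] / pandas_1x[name] for name in pandas_1x}

for name, ratio in sorted(slowdown.items()):
    print(f"{name}: {ratio:.2f}x")
```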

It will require some research to figure out why this is happening, and whether it can be solved relatively easily... or at all. Initial profiling suggests the problem is in pandas.core.internals.managers.BlockManager.get_dtypes, which is getting called from df.eval, but we almost certainly do not want to mess around with pandas internals.
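For anyone who wants to reproduce this kind of profile outside a full model run, here is a rough sketch using the standard-library `cProfile` on repeated `df.eval` calls. The frame and the expression are placeholders, not ActivitySim's, and whether `get_dtypes` shows up near the top will depend on the installed pandas version.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Placeholder data; a real investigation would use a model's actual tables.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(1_000), "b": rng.random(1_000)})

# Profile many small eval calls, since per-call overhead is the suspect.
profiler = cProfile.Profile()
profiler.enable()
for _ in range(50):
    result = df.eval("a + b")
profiler.disable()

# Collect the functions with the largest cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(15)
report = buffer.getvalue()
print(report)
```

Running the same script under a pandas 1.x and a pandas 2.x environment and comparing the two reports is one way to see where the extra time goes without touching pandas internals.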

i-am-sijia · 2024-04-03T22:56:50Z

activitysim/abm/models/vehicle_allocation.py

@@ -236,6 +236,8 @@ def vehicle_allocation(
        logger.info("Running for occupancy = %d", occup)
        # setting occup for access in spec expressions
        locals_dict.update({"occup": occup})
+        if model_settings.sharrow_skip:
+            locals_dict["disable_sharrow"] = True


My memory may be fuzzy: why are we allowing vehicle allocation to opt out of sharrow?

i-am-sijia · 2024-04-03T23:30:27Z

activitysim/core/workflow/state.py

+                t = pa.Table.from_pandas(df, preserve_index=True, columns=columns)
+            except (pa.ArrowTypeError, pa.ArrowInvalid):
+                # if there are object columns, try to convert them to categories
+                df = df.copy()


I saw your latest comment about significantly longer run time with this PR. I noticed you are calling copy() here. In pandas 2.0 copy() defaults to a deep copy. I wonder if this contributed to the run time?

jpn-- · 2024-04-03

I don't think this is causing the problem. This code only executes in the write_tables step at the end of the model run.

jpn-- added 13 commits March 18, 2024 18:11

updates for pandas 2.2

7b850ca

pytables 3.9

002604d

input checker message failbacks

8819b8c

fix veh type categoricals

be5c024

restore original pandas read_csv NaNs

98bc2e4

is_monotonic_increasing

5beffda

fix disagg acc sorting

9b67fec

drop unused indexes

234a420

update pipeline ref

58003ed

temporarily disable sharrow in vehicle alloc

012e92e

fix dtype problem

c6975a4

ensure MAX index does not overflow

2a899e5

sort on join to preserve index ordering from old pandas

a752ea4

jpn-- requested a review from i-am-sijia March 26, 2024 00:21

jpn-- added 5 commits March 25, 2024 19:24

local compute test simplifies debugging

543b19a

Merge branch 'main' into depend-pandas-2

8ed8fb9

more robust conversion to pyarrow

50c9f6d

Merge branch 'main' into depend-pandas-2

a393dbd

Merge branch 'main' into depend-pandas-2

c06d737

# Conflicts: # conda-environments/activitysim-dev.yml # conda-environments/github-actions-tests.yml

jpn-- marked this pull request as draft April 3, 2024 23:05

i-am-sijia reviewed Apr 3, 2024

jpn-- self-assigned this Apr 18, 2024
