docs(python): Minor tweak in code example in section Coming from Pandas #11764

Merged 5 commits on Oct 17, 2023
25 changes: 11 additions & 14 deletions docs/user-guide/migration/pandas.md
@@ -147,19 +147,20 @@ called `hundredXValue` where the `value` column is multiplied by 100.
In `Pandas` this would be:

```python
-df["tenXValue"] = df["value"] * 10
-df["hundredXValue"] = df["value"] * 100
+df.assign(
+    tenXValue=lambda df_: df_.value * 10,
+    hundredXValue=lambda df_: df_.value * 100
Comment on lines +151 to +152
Collaborator:

I think the polars example should probably also use this syntax then, i.e.

df.with_columns(
    tenXValue=pl.col("value") * 10,
    hundredXValue=pl.col("value") * 100,
)

Contributor Author:

I'm not entirely certain, but it appears that using .alias is the recommended approach for renaming columns.

Collaborator:

Contributor Author:

Considering this is a brief migration guide to help newcomers from Pandas, I agree with your perspective.

+)
```

These column assignments are executed sequentially.
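A minimal sketch of what "sequentially" means for the `assign` form, using a made-up three-row frame (the data and the choice to derive `hundredXValue` from `tenXValue` are illustrative assumptions, not part of the guide). Because keyword arguments are evaluated in order, a later lambda can read a column created by an earlier one:

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"value": [1, 2, 3]})

out = df.assign(
    tenXValue=lambda df_: df_.value * 10,
    # This lambda sees the tenXValue column created just above.
    hundredXValue=lambda df_: df_.tenXValue * 10,
)

print(out)
```

Note that `assign` returns a new frame and leaves `df` itself unchanged.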

-In `Polars` we add columns to `df` using the `.with_columns` method and name them with
-the `.alias` method:
+In `Polars` we add columns to `df` using the `.with_columns` method:

```python
df.with_columns(
-    (pl.col("value") * 10).alias("tenXValue"),
-    (pl.col("value") * 100).alias("hundredXValue"),
+    tenXValue=pl.col("value") * 10,
+    hundredXValue=pl.col("value") * 100,
)
```

@@ -174,7 +175,7 @@ the values in column `a` based on a condition. When the value in column `c` is e
In `Pandas` this would be:

```python
-df.loc[df["c"] == 2, "a"] = df.loc[df["c"] == 2, "b"]
+df.assign(a=lambda df_: df_.a.where(df_.c != 2, df_.b))
```

while in `Polars` this would be:
@@ -187,21 +188,17 @@ df.with_columns(
)
```

-The `Polars` way is pure in that the original `DataFrame` is not modified. The `mask` is
-also not computed twice as in `Pandas` (you could prevent this in `Pandas`, but that
-would require setting a temporary variable).
-
-Additionally `Polars` can compute every branch of an `if -> then -> otherwise` in
+`Polars` can compute every branch of an `if -> then -> otherwise` in
parallel. This is valuable, when the branches get more expensive to compute.

#### Filtering

We want to filter the dataframe `df` with housing data based on some criteria.

-In `Pandas` you filter the dataframe by passing Boolean expressions to the `loc` method:
+In `Pandas` you filter the dataframe by passing Boolean expressions to the `query` method:

```python
-df.loc[(df['sqft_living'] > 2500) & (df['price'] < 300000)]
+df.query('m2_living > 2500 and price < 300000')
```

while in `Polars` you call the `filter` method: