-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Mutate function #1229
[WIP] Mutate function #1229
Conversation
🚀 Deployed on https://deploy-preview-1229--pyjanitor.netlify.app |
Codecov Report
@@ Coverage Diff @@
## dev #1229 +/- ##
===========================================
+ Coverage 82.51% 97.82% +15.30%
===========================================
Files 78 79 +1
Lines 3770 3814 +44
===========================================
+ Hits 3111 3731 +620
+ Misses 659 83 -576 |
9ceedd0
to
8c96eb5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good to go with the changes here, @samukweku. Thank you for taking the time and effort to put in this PR for the mutate
function!
Can I check, if we go down this path with the mutate
function, is there anything stopping us from replicating the dplyr
API here? When studying the API design implemented in this PR, I noticed that it deviated from dplyr substantially, so I wanted to make sure this difference was intentional.
There are some comments that I hope we can address. Once you've addressed them, I am happy to merge!
By the way, for the future, would you be okay keeping the PR compact? There were a lot of lines changed that weren't related to the mutate
function that also got snuck in. I found it a bit tedious to juggle mentally. The hardest part for me was deciding whether a line change was substantially related to mutate
or not. I think I could have been more efficient with the review if all changes were strictly related to mutate
only here.
The computation should return a 1-D array like object | ||
that is the same length as `df` or a scalar | ||
that can be broadcasted to the same length as `df`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samukweku I had to do a double-take on this sentence. I think it may be improved by indicating how this knowledge affects the user's use of the mutate
API. You may want to add in a "for example..." to help here. What do you think?
|
||
@pytest.mark.functions | ||
def test_empty_args(dataframe): | ||
"""Test output if args is empty""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Test output if args is empty""" | |
"""Ensure that dataframe is not mutated if args is empty""" |
that is the same length as `df` or a scalar | ||
that can be broadcasted to the same length as `df`. | ||
|
||
The argument should be of the form `(columns, func, names_glue)`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The argument should be of the form `(columns, func, names_glue)`; | |
The argument should be of the form `(columns, mutation_func, names_glue)`; |
I changed the name here b/c I thought it would be clearer. However, this is mostly cosmetic (IMO), and it does break the consistency in how we use func
to denote a function, so it's okay to not accept this suggestion.
[`select_columns`][janitor.functions.select.select_columns] | ||
syntax for flexibility. | ||
The function `func` should be a string | ||
(which is dispatched to `pd.Series.transform`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If func
should be a string, I think it may be worth referencing a source of truth for what strings are valid. For example, I wouldn't be able to pass in "means" but "mean" is valid.
janitor/functions/collapse_levels.py
Outdated
@@ -72,6 +72,8 @@ class max_speed type | |||
if not isinstance(df.columns, pd.MultiIndex): | |||
return df | |||
|
|||
df = df[:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samukweku what does this syntax do for us here? Should it be df = df.copy()
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are mutating the columns in this case, so a slice suffices. Pandas creates a new index altogether, which does not affect the original dataframe.
my apologies on the line changes ... I'll def keep the PR more compact in future iterations ... will def review my git skills again |
…1220) * names_pattern supports named_groups * improve tests * fix format in example * add another example for named groups * add details about names_pattern and named groups * add support for dictionary for names_pattern * changelog * changelog * Update CHANGELOG.md * add test for _ as placeholder for .value * change logic for replacing _ with .value * format example output * Version added * Update pivot.py * update docs based on feedback Co-authored-by: Eric Ma <ericmjl@users.noreply.github.com>
* minor fix for drop_constant_columns * minor fix for get_dupes * minor fix for collapse_levels, primarily for speed * fix test fails * fix test fails * vectorise collapse_levels some more for performance sake * allow for mutation * leave collapse_levels as is * Update test_collapse_levels.py * Update test_collapse_levels.py * Update test_collapse_levels.py * restor collapse_levels to before * shortcut if all entries are strings in a list in a select call * use get_indexer_for for lists that contain only strings in select * make more robust by checking on scalar, instead of just strings * improve comments * rebase * more edits * remove extra check * shortcut for * * exclude api/utils from mkdocs * exclude api/utils from mkdocs * simplify move * avoid mutation in collapse_levels * make move more robust with select syntax * docs * fix docstring * replicate fill_empty in impute - reduce duplication * add tests * fix doctest * fix docstrings * defer copy in pivot_wider to pd.pivot * fix np.bool8 deprecation * simplify dtype column selection * fix warning msg output for change_type * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * rebase * expose _select_index * add parameters * use get_index_labels where possible * add test for multiple columns * make column selection more robust for sequences * add test for set/dict selection * add test for move - both source and target are lists * exclude utils from docs * fix test fails --------- Co-authored-by: samuel.oranyeli <samuel.oranyeli@grow.inc> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
ad9c014
to
199631e
Compare
PR Description
Please describe the changes proposed in the pull request:
transform_column
andtransform_columns
functionsAt its core, it is nothing more than an elaborate for loop. All the hard work is passed on to Pandas. user should only use this if
pd.assign
is not enoughExamples:
This PR resolves #1226 and #1178.
PR Checklist
Please ensure that you have done the following:
<your_username>
:dev
, but rather from<your_username>
:<feature-branch_name>
.AUTHORS.md
.CHANGELOG.md
under the latest version header (i.e. the one that is "on deck") describing the contribution.Automatic checks
There will be automatic checks run on the PR. These include:
Relevant Reviewers
Please tag maintainers to review.