Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve docs for data_to_wide #506

Merged
merged 30 commits into from
May 31, 2024
Merged

Improve docs for data_to_wide #506

merged 30 commits into from
May 31, 2024

Conversation

strengejacke
Copy link
Member

@strengejacke strengejacke commented May 20, 2024

Background: https://x.com/stephenjwild/status/1792294171924967920
And I recently wanted to use data_to_wide() and couldn't get it work at first, took some time to figure out the logic behind reshaping to wide again. Especially, the documentation of id_cols was confusing.

Copy link
Member

@etiennebacher etiennebacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, those are nice docs improvements. However I disagree on the renaming of id_cols as by. I don't think it makes things any clearer but instead leads to confusion since those two functions are clearly based on tidyr::pivot_*() functions so departing from those only for one argument is strange.


Regarding the tweet, I just have access to the first one since I don't have an X account, so maybe there are more explanations, but to me those pivot_*() functions made pivoting/unpivoting so much easier to understand than reshape2 or data.table approaches.

R/data_read.R Show resolved Hide resolved
R/data_to_wide.R Outdated Show resolved Hide resolved
R/data_to_long.R Outdated Show resolved Hide resolved
R/data_to_long.R Outdated Show resolved Hide resolved
R/data_to_long.R Show resolved Hide resolved
R/data_to_wide.R Show resolved Hide resolved
R/data_to_wide.R Outdated Show resolved Hide resolved
R/data_to_wide.R Outdated Show resolved Hide resolved
R/data_to_wide.R Outdated Show resolved Hide resolved
R/data_to_wide.R Show resolved Hide resolved
@strengejacke
Copy link
Member Author

strengejacke commented May 21, 2024

but to me those pivot_*() functions made pivoting/unpivoting so much easier to understand

True, but I think by even makes it even clearer. by refers across our easystats-packages to a variable/column that indicates "grouping". Reshaping to wide format commonly, but not in general, identifies "repeated measurements" by an "ID", or group. And id_cols suggests plural, although often only one column is selected (that was also one of the criticism on Twitter). Thus, id_cols is not even the best choice from a grammar perspective, too ;-)

@etiennebacher
Copy link
Member

By @strengejacke in a review comment:

I personally prefer by, both because it's in line with renaming the other arguments across functions into by. Especially, since by refers to a "grouping" variable, and imho, by makes clearer that this specified variables is used for gathering those rows ("groups") that are spread as new columns.

However, since we already allow select (easystats-name) and cols (tidyverse-name), we could probably use both argument names here, too?

@etiennebacher
Copy link
Member

However, since we already allow select (easystats-name) and cols (tidyverse-name), we could probably use both argument names here, too?

I think we should deprecate and then remove one of the two. Those pivot functions already have a lot of args so I suggest we avoid adding bloat with aliases.

Thus, id_cols is not even the best choice from a grammar perspective, too ;-)

I don't think there is a best choice for this, one could use id_col but it would suggest this must be a single column while it's not always the case. Just for comparison with other packages:

  • base::reshape(): idvar
  • data.table::melt(): id.vars
  • pandas and polars: index

Considering a by argument would be quite unique (doesn't mean that other packages always make things right, but it's quite indicative).

Curious about what others think about renaming id_cols to by @easystats/core-team

@strengejacke
Copy link
Member Author

We slice and reshape by the levels of that variable - it's still my personal favorite, and we use this argument name now aggressively consistent in our packages. I don't mind deviating from other packages, especially since users are free to use those other packages if they prefer that syntax.

I just realize that when you teach R to beginners, it's really good if you can repeat yourself: "use select for..." or "use by for...". Easier to understand and remember for newbies

@strengejacke
Copy link
Member Author

@DominiqueMakowski
Copy link
Member

DominiqueMakowski commented May 30, 2024

I think I agree with @etiennebacher; id_cols' purpose was to eventually specify an index, but i don't think it is conceptually different from what by is expected to do

Sorry @strengejacke 😅 - but I'm not convinced it should be replaced or made an alias

@bwiernik
Copy link
Contributor

I don't quite understand the use of by as an alias for id_cols. Could you talk about the similarity there?

@strengejacke
Copy link
Member Author

strengejacke commented May 30, 2024

but i don't think it is conceptually different from what by is expected to do

Yes, that's why I thought we could use by, because we use it in conceptually similar ways across other functions now.

But I'm ok with sticking to id_cols, though I don't see the need to use that name. The id_cols refers to a variable, whose "levels" is used for grouping/stratification. In every other instance we now use by, not here though.

I think "re-thinking" reshaping data in this way makes it rather easy to use, especially since it would be in line with our other functions.

@strengejacke
Copy link
Member Author

Btw, for data_to_long(), we have select for selecting columns, but also its alias cols for compatibility with pivot_longer()

@bwiernik
Copy link
Contributor

I think the conceptual similarity of by to id_cols is very tenuous. I suggest we drop it

@strengejacke
Copy link
Member Author

Ok, if nobody likes my suggestion (😢), let's stick to id_cols then.

Copy link
Member

@etiennebacher etiennebacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, the failures look unrelated to this PR.

RIP by in data_to_wide()

@etiennebacher etiennebacher merged commit a7d3c80 into main May 31, 2024
22 of 29 checks passed
@etiennebacher etiennebacher deleted the improve_docs branch May 31, 2024 11:46
@strengejacke
Copy link
Member Author

by:

exterminador-do

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants