Step_integer() documentation and use with unseen data #1316
Comments
Hello @nhward 👋 Thanks for bringing up this issue. I will respond to the 3 points I see:

clarity in documentation
And yes, I agree, the documentation is a little unclear as to what is actually done.

input types
I was surprised to see that this method works with numeric input. I expected (and we are only testing) that it works for character and factor input, which makes the most sense. I kinda want to deprecate the use of this step with non-character/factor input, but don't have a clear idea of how to do that right now.

validity
This step implements what is commonly called integer encoding (another ref). It is a well-defined method for dealing with categorical variables, although the performance is often not the best.
Great reply. I am familiar with label encoding, but I did not immediately see this as label encoding until you pointed it out, as I was wearing a numeric variate hat. Personally, I avoid label encoding as it makes observation-distance calculations less meaningful for nominal variables (for methods that depend on distance calculations).

Documentation
I noticed that a Google search for "R Recipe step label encoding" does not return a reference to […] The wording employed in the documentation of […]

Data types
Ordinal should be handled with […] Numeric data should only be allowed in […] Alternatively (and most preferably), numeric data could throw errors if a new numeric value is introduced in unseen data, as per […]

I seem to be introducing more problems than solutions. I will leave it to you to assess the value of these suggestions.
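The distance concern can be illustrated with a small sketch. This is a language-neutral illustration in Python, not code from the recipes package; the colour categories and the helper `label_encode` are hypothetical and chosen only for this example. After label encoding, some pairs of nominal categories appear farther apart than others, even though the categories have no such ordering.

```python
def label_encode(values):
    """Assign the codes 1..k to the distinct values, in sorted order."""
    levels = sorted(set(values))
    return {v: i + 1 for i, v in enumerate(levels)}


# Hypothetical nominal variable with three unordered categories.
codes = label_encode(["red", "green", "blue"])

# Absolute differences between the encoded categories.
d_blue_green = abs(codes["blue"] - codes["green"])
d_blue_red = abs(codes["blue"] - codes["red"])

print(codes)                      # {'blue': 1, 'green': 2, 'red': 3}
print(d_blue_green, d_blue_red)   # 1 2
```

A distance-based method would treat blue as twice as far from red as from green, purely as a side effect of alphabetical ordering.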
The step_integer documentation states:
Naively, I thought that meant each value would be replaced by its integer truncation, as I was looking for some data-type conversion steps. I made this mistake because I did not read (more correctly, had forgotten) the Details section, which explains things fully. A (much) better description would be:
Once the true nature of the recipe step is made clear, users can ask themselves whether unseen observations can ever be truly passed through this step sensibly.
The code below shows that observations that are not part of the 50 training cases are given the integer value of zero.
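As a language-neutral illustration of the behaviour described here (a Python sketch, not the recipes implementation and not the snippet originally posted), the mapping can be modelled as follows. The helper names `fit_integer_map` and `encode` and the sample values are hypothetical. The `zero_based` flag only mimics the option discussed in this issue, under the assumption that it starts codes at zero and gives unseen values one more than the largest code.

```python
def fit_integer_map(train_values, zero_based=False):
    """Map each distinct training value to its rank in sorted order."""
    start = 0 if zero_based else 1
    levels = sorted(set(train_values))
    return {value: start + i for i, value in enumerate(levels)}


def encode(mapping, values, zero_based=False):
    """Encode values; anything unseen in training gets a fallback code."""
    # Assumed fallback: 0 by default, or one past the largest code
    # when zero_based is set.
    unseen = len(mapping) if zero_based else 0
    return [mapping.get(v, unseen) for v in values]


train = [2.5, 7.1, 4.0]   # stand-in for the training cases
new = [4.0, 5.3, 9.9]     # 5.3 and 9.9 never appeared in training

default_map = fit_integer_map(train)
print(encode(default_map, new))                 # [2, 0, 0]

zero_map = fit_integer_map(train, zero_based=True)
print(encode(zero_map, new, zero_based=True))   # [1, 3, 3]
```

In this sketch, 5.3 lies between two training values yet receives the same code as 9.9, and that code is either below every training code or above all of them.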
Unless I misunderstand something, this recipe step is fundamentally flawed as a step that can process unseen data. The zero_based parameter does not address this problem either. If the goal is to replace the variable with its rank order, then new observations can never be processed sensibly (as things stand), since neither 0 nor max+1 is a sensible value for a new observation.

Perhaps the description should read:
I hope I am not being a moaner by raising this. I use recipe() all the time and appreciate the work of others. I do fear this recipe step will cause more harm than good, as it is so easy to misuse.