Support Categorical Values directly #45

Open
schlichtanders opened this issue Feb 28, 2024 · 2 comments

Comments

@schlichtanders

Motivation and description

In data science, CategoricalArrays.CategoricalValue, CategoricalArrays.CategoricalVector, and similar types appear often (RDatasets loads DataFrames with columns of that type by default).

It would be great if onehotbatch could be applied to them directly.

I am new to this package and still figuring out how to transform such a categorical value or vector into a one-hot vector or matrix. It is quite possible that I missed something obvious.

Possible Implementation

No response

@mcabbott
Member

Attempting to construct the minimal object:

julia> using CategoricalArrays, OneHotArrays

julia> cv = CategoricalArrays.CategoricalValue('b', CategoricalArray('a':'z'))
CategoricalValue{Char, UInt32} 'b'

julia> dump(cv)
CategoricalValue{Char, UInt32}
  pool: CategoricalPool{Char, UInt32, CategoricalValue{Char, UInt32}}
    levels: Array{Char}((26,))
      1: Char 'a'
      2: Char 'b'
      3: Char 'c'
      4: Char 'd'
      5: Char 'e'
      ...
      22: Char 'v'
      23: Char 'w'
      24: Char 'x'
      25: Char 'y'
      26: Char 'z'
    invindex: Dict{Char, UInt32}
      slots: Memory{UInt8}
        length: Int64 64
        ptr: Ptr{Nothing} @0x0000000160607020
    ...

julia> cv.pool.levels
26-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
...

julia> Int(cv.ref), length(cv.pool.levels)
(2, 26)

julia> OneHotArrays.onehot(cv::CategoricalValue) = OneHotVector(cv.ref, length(cv.pool.levels))

julia> onehot(cv)
26-element OneHotVector(::UInt32) with eltype Bool:
 ⋅
 1
 ⋅
 ⋅
 ⋅
 ⋅
...

julia> dump(onehot(cv))
OneHotVector{UInt32}
  indices: UInt32 0x00000002
  nlabels: Int64 26

Are these two integers all that's required, or are there more complicated examples?
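
For comparison, the same single-value conversion can probably be written without reaching into the cv.pool and cv.ref fields, using the exported accessors levelcode and levels from CategoricalArrays. This is only a sketch: categorical_onehot is a hypothetical helper name, not part of either package, and the example builds its value by indexing a categorical array rather than calling the CategoricalValue constructor.

```julia
using CategoricalArrays, OneHotArrays

# `categorical_onehot` is a hypothetical name, not an API of either package.
# levelcode(cv) is the 1-based position of cv among its levels;
# levels(cv) lists every level in the pool cv belongs to.
categorical_onehot(cv::CategoricalValue) =
    OneHotVector(UInt32(levelcode(cv)), length(levels(cv)))

ca = categorical(collect('a':'z'))
cv = ca[2]                     # the CategoricalValue wrapping 'b'
v = categorical_onehot(cv)     # 26-element one-hot vector, hot at index 2
```

The two integers still carry everything here; the accessors only avoid depending on the internal field layout.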

@schlichtanders
Author

I think this is all, but I am not an expert on CategoricalArrays.
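
For the vector case asked about originally, a workaround that seems to work with the existing onehotbatch is sketched below. It assumes a categorical vector with no missing entries (arrays that allow missing would need extra handling that this sketch does not attempt), and the variable names are purely illustrative.

```julia
using CategoricalArrays, OneHotArrays

ca = categorical(["b", "a", "c", "a"])   # levels sort to "a", "b", "c"

# Integer route: level codes matched against the range of valid codes.
m = onehotbatch(levelcode.(ca), 1:length(levels(ca)))

# Label route: unwrap to plain Strings and match against the unwrapped levels.
labels = unwrap.(levels(ca))
m2 = onehotbatch(unwrap.(ca), labels)

# onecold maps each one-hot column back to its plain label.
decoded = onecold(m, labels)
```

Both routes should give the same 3×4 one-hot matrix; the integer route skips the label lookup, while the label route keeps the label set explicit.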
