
Implement "voxelized" tokenizer #49

Open · kjappelbaum opened this issue Apr 19, 2024 · 4 comments
@kjappelbaum
Contributor

No description provided.

@kjappelbaum
Contributor Author

in Meta's paper (/cc @smiret-intel):

[Screenshot: excerpt from Meta's paper]

@kjappelbaum
Contributor Author

kjappelbaum commented Apr 19, 2024

For building the tokenizer we could take two routes:

  • Observe the extremes of the coordinates and add every value between them to the vocab
    • This interacts with the chosen resolution
    • This is what the paper from the Aspuru-Guzik group did
    • This might be easier with fractional coordinates, since the extremes are then known to be 0 and 1
  • Fit on some observed data and only put the observed coordinates in the vocab

The second approach will limit generalizability; the first will give a very large vocab (a rough sketch of both routes follows below). Are there any other things we should consider, @smiret-intel, @n0w0f?
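
A minimal sketch of the two routes, assuming fractional coordinates (so the extremes are 0 and 1) and a fixed resolution; the function names and the `resolution` parameter are illustrative, not an existing implementation:

```python
# Sketch of both vocab-building routes for a voxelized coordinate tokenizer.
# Everything here is an assumption for illustration: fractional coordinates
# in [0, 1), resolution 0.01, two-decimal string tokens.


def build_grid_vocab(resolution: float = 0.01) -> list[str]:
    """Route 1: enumerate every coordinate between the known extremes.

    With fractional coordinates the extremes are 0 and 1 a priori, so the
    vocab carries exactly 1 / resolution coordinate tokens per axis, i.e.
    the vocab size is tied directly to the chosen resolution.
    """
    n_bins = round(1 / resolution)
    return [f"{i * resolution:.2f}" for i in range(n_bins)]


def build_fitted_vocab(coords: list[float], resolution: float = 0.01) -> list[str]:
    """Route 2: keep only the (rounded) coordinates actually observed.

    Much smaller vocab, but any coordinate not seen during fitting is
    out-of-vocabulary at inference time, which limits generalizability.
    """
    return sorted({f"{round(c / resolution) * resolution:.2f}" for c in coords})


print(len(build_grid_vocab()))                      # 100 tokens per axis
print(build_fitted_vocab([0.0, 0.25, 0.5, 0.251]))  # ['0.00', '0.25', '0.50']
```

The trade-off shows up directly: the grid route always pays 1 / resolution tokens per axis, while the fitted route only covers whatever coordinates the training structures happen to contain.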

@n0w0f
Contributor

n0w0f commented Apr 23, 2024

I am looking at the Regression Transformer tokenizer implementation in this branch (a sketch of the scheme, as I understand it, is below).

Pros:

  • Smaller vocab
  • No resolution issues

Cons:

  • Requires pretraining? (Numbers are not treated the same way here as they appear in the pretraining corpora of large models.)
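
For concreteness, roughly how I understand the Regression Transformer number encoding: every digit becomes one token that also carries its decimal place. The token format follows the RT paper; the helper below is a hypothetical sketch, not the implementation from the branch:

```python
# Illustrative Regression Transformer-style tokenization of a float:
# one token per digit, annotated with its decimal place, plus a '.' token.
# The vocab stays small (10 digits x number of places) and no coordinate
# grid or resolution has to be chosen.


def tokenize_float(value: float, precision: int = 3) -> list[str]:
    """Turn e.g. 0.254 into ['_0_0_', '_._', '_2_-1_', '_5_-2_', '_4_-3_']."""
    integer_part, fractional_part = f"{value:.{precision}f}".split(".")
    tokens = [f"_{d}_{len(integer_part) - 1 - i}_" for i, d in enumerate(integer_part)]
    tokens.append("_._")
    tokens += [f"_{d}_{-(i + 1)}_" for i, d in enumerate(fractional_part)]
    return tokens


print(tokenize_float(0.254))
# ['_0_0_', '_._', '_2_-1_', '_5_-2_', '_4_-3_']
```

This is where the pros above come from: every coordinate decomposes into the same small set of digit/place tokens, so there is no vocabulary blow-up and no resolution to pick.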
