
Implement "voxelized" tokenizer #49

Open · kjappelbaum opened this issue Apr 19, 2024 · 4 comments
@kjappelbaum
Contributor

No description provided.

@kjappelbaum
Contributor Author

in Meta's paper (/cc @smiret-intel):

[Screenshot: excerpt from Meta's paper]

@kjappelbaum
Contributor Author

kjappelbaum commented Apr 19, 2024

For building the tokenizer we could take two routes:

  • Observe the extremes of the coordinates and add every value between them to the vocab
    • This interacts with the chosen resolution
    • This is what the paper from the Aspuru-Guzik group did
    • This might be easier with fractional coordinates, since the extremes are then known to be 0 and 1
  • Fit on some observed data and only put the observed coordinates in the vocab

The second approach will limit generalizability; the first will give a very large vocab (a rough sketch of both routes follows below). Are there any other things we should consider, @smiret-intel, @n0w0f?
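
A minimal sketch of the two routes, assuming fractional coordinates (so the extremes are 0 and 1) and a fixed resolution; the function names and the `resolution` parameter are illustrative, not an existing implementation:

```python
# Sketch of both vocab-building routes for a voxelized coordinate tokenizer.
# Everything here is an assumption for illustration: fractional coordinates
# in [0, 1), resolution 0.01, two-decimal string tokens.


def build_grid_vocab(resolution: float = 0.01) -> list[str]:
    """Route 1: enumerate every coordinate between the known extremes.

    With fractional coordinates the extremes are 0 and 1 a priori, so the
    vocab carries exactly 1 / resolution coordinate tokens per axis, i.e.
    the vocab size is tied directly to the chosen resolution.
    """
    n_bins = round(1 / resolution)
    return [f"{i * resolution:.2f}" for i in range(n_bins)]


def build_fitted_vocab(coords: list[float], resolution: float = 0.01) -> list[str]:
    """Route 2: keep only the (rounded) coordinates actually observed.

    Much smaller vocab, but any coordinate not seen during fitting is
    out-of-vocabulary at inference time, which limits generalizability.
    """
    return sorted({f"{round(c / resolution) * resolution:.2f}" for c in coords})


print(len(build_grid_vocab()))                      # 100 tokens per axis
print(build_fitted_vocab([0.0, 0.25, 0.5, 0.251]))  # ['0.00', '0.25', '0.50']
```

The trade-off shows up directly: the grid route always pays 1 / resolution tokens per axis, while the fitted route only covers whatever coordinates the training structures happen to contain.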

@n0w0f
Contributor

n0w0f commented Apr 23, 2024

I am looking at the Regression Transformer tokenizer implementation in this branch (a sketch of the scheme, as I understand it, is below).

Pros:

  • Smaller vocab
  • No resolution issues

Cons:

  • Requires pretraining? (Numbers are not treated the same way here as they appear in the pretraining corpora of large models.)
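
For concreteness, roughly how I understand the Regression Transformer number encoding: every digit becomes one token that also carries its decimal place. The token format follows the RT paper; the helper below is a hypothetical sketch, not the implementation from the branch:

```python
# Illustrative Regression Transformer-style tokenization of a float:
# one token per digit, annotated with its decimal place, plus a '.' token.
# The vocab stays small (10 digits x number of places) and no coordinate
# grid or resolution has to be chosen.


def tokenize_float(value: float, precision: int = 3) -> list[str]:
    """Turn e.g. 0.254 into ['_0_0_', '_._', '_2_-1_', '_5_-2_', '_4_-3_']."""
    integer_part, fractional_part = f"{value:.{precision}f}".split(".")
    tokens = [f"_{d}_{len(integer_part) - 1 - i}_" for i, d in enumerate(integer_part)]
    tokens.append("_._")
    tokens += [f"_{d}_{-(i + 1)}_" for i, d in enumerate(fractional_part)]
    return tokens


print(tokenize_float(0.254))
# ['_0_0_', '_._', '_2_-1_', '_5_-2_', '_4_-3_']
```

This is where the pros above come from: every coordinate decomposes into the same small set of digit/place tokens, so there is no vocabulary blow-up and no resolution to pick.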
