Bigger word lists #6
I replaced the bloom filter with a delta-compressed word list, which removes the errors in the lookup and easily fits the full word list: #8. See this Twitter thread for context: https://twitter.com/ZetaTwo/status/1490393006549180419. The compressed list now uses about 19k of memory, which corresponds to about a 70% compression rate, or equivalently about 1.4 bytes per word.
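For context, here is a minimal sketch of how delta compression of a sorted word list can work (my own illustration; the actual format in #8 may differ): each entry stores how many leading characters it shares with the previous word, followed by only the differing suffix.

```c
#include <stdio.h>
#include <string.h>

/* Generic delta compression of a sorted list of 5-letter words:
 * each entry is one header byte (shared-prefix length with the
 * previous word) followed by the non-shared suffix characters.
 * Illustrative only, not necessarily the exact format of #8. */
static unsigned encode(const char words[][6], unsigned n, unsigned char *out)
{
    unsigned pos = 0;
    char prev[6] = "";
    for (unsigned i = 0; i < n; i++) {
        unsigned shared = 0;
        while (shared < 5 && words[i][shared] == prev[shared])
            shared++;
        out[pos++] = (unsigned char)shared;           /* prefix length */
        memcpy(&out[pos], &words[i][shared], 5 - shared);
        pos += 5 - shared;
        memcpy(prev, words[i], 6);
    }
    return pos;  /* total compressed size in bytes */
}

int main(void)
{
    const char words[][6] = { "ABACK", "ABASE", "ABATE", "ABBEY" };
    unsigned char buf[64];
    /* "ABACK" shares 0 chars, "ABASE" and "ABATE" share 3 ("ABA"),
     * "ABBEY" shares 2 ("AB"): 6 + 3 + 3 + 4 = 16 bytes vs. 20 raw */
    printf("%u bytes for 4 words\n", encode(words, 4, buf));
    return 0;
}
```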
Very nice! I've tried to use banking myself and compress by generating and traversing a tree of letters (commit), combining the letter and node depth into the same byte. It is fairly slow when it has to traverse the whole tree, but I ended up with 12972 words over 26992 bytes (~2 bytes per word). Thought I'd add my solution to the thread, but @ZetaTwo's solution looks most promising 😃
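For illustration, a rough sketch of that idea (my reconstruction, not the code in the linked commit): flatten the trie in pre-order, pack the node depth into the top 3 bits of each byte and the letter into the bottom 5, and look a word up by scanning for a matching child at each depth.

```c
#include <stdint.h>

/* Each node byte: (depth << 5) | (letter - 'A').
 * Nodes are stored in pre-order, so the children of a node at depth d
 * are the depth d+1 nodes that appear before the next node at depth <= d.
 * Fixed-length words assumed, so any full-depth path is a word. */
#define NODE_DEPTH(b)  ((b) >> 5)
#define NODE_LETTER(b) ('A' + ((b) & 0x1F))

/* Tiny example trie holding "CAT" and "CAR":
 * C(depth 0) -> A(1) -> T(2), R(2) */
static const uint8_t trie[] = {
    (0 << 5) | ('C' - 'A'),
    (1 << 5) | ('A' - 'A'),
    (2 << 5) | ('T' - 'A'),
    (2 << 5) | ('R' - 'A'),
};

static int contains(const uint8_t *t, uint16_t n, const char *word, uint8_t len)
{
    uint16_t i = 0;
    for (uint8_t d = 0; d < len; d++) {
        /* scan forward for a node at depth d matching word[d] */
        while (i < n && !(NODE_DEPTH(t[i]) == d && NODE_LETTER(t[i]) == word[d])) {
            if (NODE_DEPTH(t[i]) < d)
                return 0;           /* left the matched subtree */
            i++;
        }
        if (i == n)
            return 0;
        i++;                        /* descend: children follow in pre-order */
    }
    return 1;
}
```

This is where the "fairly slow" part comes from: in the worst case a lookup scans a large part of the array, since there are no child pointers to skip subtrees.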
I used another approach based on tom7's "Portmantout" work, encoding the whole word list (*) as one large string represented with 5-bit letters, plus an index with offsets. Since there are ~13500 words (i.e. offsets), we need at least 14 bits to keep track of them all. The index is therefore just a structure of 14-bit numbers that each represent a legal word. The 5-bit encoded portmantout of the word list clocks in at 8210 bytes, while the index weighs 3910 bytes, yielding a total of 12120 bytes. Since this is just encoding (and not actual compression), I'd argue this could run on a Gameboy. Lookups are going to be another challenge; I think they could be implemented with a Word -> Offset hash map.

(*) sans 81 words which didn't fit in the portmantout structure. That's a FIXME
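A sketch of the bit-unpacking this would need (the names and exact bit layout are my own assumptions, MSB-first packing): fetch the i-th 14-bit offset from the packed index, then read five 5-bit letters from the portmantout string.

```c
#include <stdint.h>

/* Assumed layout: 'index' is tightly packed 14-bit word offsets,
 * 'text' is tightly packed 5-bit letters (0 = 'A' .. 25 = 'Z'),
 * both MSB-first. Offsets count letters, not bytes. Pad each array
 * with one extra byte so the fixed-width reads below never overrun. */

static uint16_t get_offset(const uint8_t *index, uint16_t i)
{
    uint32_t bit  = (uint32_t)i * 14;
    uint32_t byte = bit >> 3;
    /* read 3 bytes so all 14 bits are covered at any alignment */
    uint32_t v = ((uint32_t)index[byte] << 16)
               | ((uint32_t)index[byte + 1] << 8)
               |  (uint32_t)index[byte + 2];
    return (uint16_t)((v >> (10 - (bit & 7))) & 0x3FFF);
}

static char get_letter(const uint8_t *text, uint32_t pos)
{
    uint32_t bit  = pos * 5;
    uint32_t byte = bit >> 3;
    uint16_t v = (uint16_t)((text[byte] << 8) | text[byte + 1]);
    return (char)('A' + ((v >> (11 - (bit & 7))) & 0x1F));
}

/* decode word i into out (6 chars incl. terminator) */
static void get_word(const uint8_t *index, const uint8_t *text,
                     uint16_t i, char out[6])
{
    uint16_t off = get_offset(index, i);
    for (uint8_t j = 0; j < 5; j++)
        out[j] = get_letter(text, off + j);
    out[5] = '\0';
}
```

The sizes are consistent: 8210 bytes of 5-bit letters hold 13136 letter positions, which fits comfortably in a 14-bit offset (max 16383).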
I recently made my own port, not knowing about this one, which has save data, uses the MBC3 real-time clock, and, relevant to this feature request, has the full word lists. |
You could look into banking and compression.
Also, don't use strings in your word list: strings end with `\0`, so each word takes up 6B and 16% of the space is basically wasted.
MBC5 can have 512 banks of 16KiB each (at least one is for your code), of which GBDK-2020 can address 255 with its functions. 256 banks are still 4MiB.
So you can fit > 1.67 million 5-char words in your list, or > 835 thousand 5-char words when you are limited to GBDK-2020's capabilities. But you also have to manage and find them.
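The arithmetic behind those numbers, spelled out (assuming 5 bytes per word and no terminators):

```c
/* Capacity math for the figures above: 16 KiB MBC5 banks,
 * 5 bytes per word, no terminator. */
#define BANK_SIZE      16384UL          /* one MBC5 ROM bank         */
#define WORDS_PER_BANK (BANK_SIZE / 5)  /* 3276 words, 4 bytes spare */

/* 512 banks (8 MiB): 512 * 3276 = 1,677,312 words (> 1.67 million) */
/* 256 banks (4 MiB): 256 * 3276 =   838,656 words (> 835 thousand) */
```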
TL;DR: just use `0x00`-`0x19` for A-Z (26 characters), `0x1A`-`0x7F` for digram encoding, and bit `0x80` to flag the end of the word.
Another trick would be to not use ASCII: it has 128 characters, but you actually just need a-z (26 characters). You could for example use a different encoding where you can fit two of the 15 most frequent characters in one byte (maybe easier to implement than digram encoding). The most frequent characters seem to be:
A first nibble of 0 escapes to a second table: the low nibble then selects one of the other 11 characters, `<empty>` (padding) if it is F, or `<null>` if the whole byte is `0x00`.

First nibble (the 15 most frequent characters):
1: E, 2: T, 3: A, 4: O, 5: N, 6: I, 7: H, 8: S, 9: R, A: L, B: D, C: U, D: C, E: M, F: W

Escape bytes (the other 11 characters):
0x00: `<null>`, 0x01: Y, 0x02: F, 0x03: G, 0x04: P, 0x05: B, 0x06: V, 0x07: K, 0x08: J, 0x09: X, 0x0A: Q, 0x0B: Z
This has to end with `0x00` again since this coding has a variable length, so we need some form of delimiter again. The pro of this is that you can use the C string functions. As the examples show, escape bytes stay byte-aligned, so a lone frequent letter pads its low nibble with 0.

“WORLD” is `0xF4, 0x9A, 0xB0, 0x00`
“RUGBY” is `0x9C, 0x03, 0x05, 0x01, 0x00`
“FRANK” is `0x02, 0x93, 0x50, 0x07, 0x00`
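A minimal decoder sketch for this nibble scheme, assuming the layout implied by the examples (escape bytes `0x0X` are byte-aligned, and a low nibble of 0 after a frequent letter is padding):

```c
#include <stdint.h>

/* Frequent letters, indexed by a nibble 1..F (index 0 unused) */
static const char FREQ[] = " ETAONIHSRLDUCMW";
/* Rare letters, indexed by the low nibble of an 0x0X escape byte */
static const char RARE[] = "\0YFGPBVKJXQZ";

/* Decode one packed word into out; returns its length.
 * 0x00 terminates; high nibble 0 escapes to the RARE table;
 * a low nibble of 0 after a frequent letter is padding. */
static uint8_t decode(const uint8_t *in, char *out)
{
    uint8_t len = 0;
    for (; *in != 0x00; in++) {
        uint8_t hi = *in >> 4, lo = *in & 0x0F;
        if (hi == 0) {
            if (lo >= 0x01 && lo <= 0x0B)   /* 0x0F = <empty>, skipped */
                out[len++] = RARE[lo];
        } else {
            out[len++] = FREQ[hi];
            if (lo != 0)                    /* 0 = padding nibble */
                out[len++] = FREQ[lo];
        }
    }
    out[len] = '\0';
    return len;
}
```

Tracing the examples above: `0xF4, 0x9A, 0xB0, 0x00` gives W,O / R,L / D,pad / end = “WORLD”, and `0x02, 0x93, 0x50, 0x07, 0x00` gives F / R,A / N,pad / K / end = “FRANK”.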
Digram encoding is likely better, but that also needs some form of delimiter. Custom encoding plus some form of compression is also possible.
Plain digram encoding might always gain. With 2B references a word could still take 3B to 5B, because you would just use the plain string in the worst case, so that's probably not worth it.
Another encoding is to use one bit (`&0x80`) for marking the end and just replace parts of ASCII you don't need with the most common digraphs/digrams (er, ti, ar, ou, or, le, th as 0x01 through 0x07):

“OTHER” is `0x4F, 0x07, 0x81`
“MOUTH” is `0x4D, 0x04, 0x87`
“ORDER” is `0x05, 0x44, 0x81`
“UNCLE” is `0x55, 0x4E, 0x43, 0x86`
“VOICE” is `0x56, 0x4F, 0x49, 0x43, 0xC5`
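A minimal decoder sketch for this digram variant, assuming plain uppercase ASCII for single letters as in the examples:

```c
#include <stdint.h>

/* Digrams replacing the otherwise unused ASCII codes 0x01..0x07 */
static const char *const DIGRAM[8] =
    { "", "ER", "TI", "AR", "OU", "OR", "LE", "TH" };

/* Decode one word: bit 7 (&0x80) marks the last code of the word,
 * codes 0x01..0x07 expand to a digram, anything else is a plain
 * ASCII letter. Returns the decoded length. */
static uint8_t decode(const uint8_t *in, char *out)
{
    uint8_t len = 0, last;
    do {
        uint8_t c = *in++;
        last = c & 0x80;            /* end-of-word flag */
        c &= 0x7F;
        if (c >= 0x01 && c <= 0x07) {
            out[len++] = DIGRAM[c][0];
            out[len++] = DIGRAM[c][1];
        } else {
            out[len++] = (char)c;   /* plain ASCII letter */
        }
    } while (!last);
    out[len] = '\0';
    return len;
}
```

Checking against the examples: `0x4F, 0x07, 0x81` decodes as 'O' + "TH" + "ER"(end) = “OTHER”, and `0x56, 0x4F, 0x49, 0x43, 0xC5` as V,O,I,C plus 0xC5 = 'E' with the end bit set = “VOICE”.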