Bigger word lists #6

Open
basxto opened this issue Feb 7, 2022 · 4 comments

Comments

basxto commented Feb 7, 2022

You could look into banking and compression.
Also don’t use strings in your word list; strings end with \0, so each word takes up 6B and 16% of the space is basically wasted.

MBC5 can have 512 banks (at least one is for your code) of 16KiB each, of which GBDK-2020 can address 255 with its functions. 256 banks are still 4MiB.

So you can fit over 1.67 million 5-char words in your list, but you also have to manage and find them. Or over 835 thousand 5-char words when you are limited to GBDK-2020’s capabilities.
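
For reference, the arithmetic behind those numbers: 512 banks × 16KiB = 8MiB = 8,388,608 bytes, and 8,388,608 ÷ 5 bytes per word ≈ 1.67 million words; 256 banks give 4MiB ÷ 5 bytes ≈ 838 thousand words.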

TL;DR: just use 0x00-0x19 for A-Z, 0x1A-0x7F for digram codes, and 0x80 to flag the end of a word.

Another trick would be to not use ASCII: it has 128 characters, but you actually only need a-z (26 characters). You could for example use a different encoding where you can fit two of the 15 most frequent characters in one byte (maybe easier to implement than digram encoding). The most frequent characters seem to be:

  • 0: selects one of the other 11 characters via the second nibble of the byte; as the second nibble it means <empty> (padding), and the whole byte 0x00 means <null>
  • 1: E
  • 2: T
  • 3: A
  • 4: O
  • 5: N
  • 6: I
  • 7: H
  • 8: S
  • 9: R
  • A: L
  • B: D
  • C: U
  • D: C
  • E: M
  • F: W

  • 00: <null> (end of word)
  • 01: Y
  • 02: F
  • 03: G
  • 04: P
  • 05: B
  • 06: V
  • 07: K
  • 08: J
  • 09: X
  • 0A: Q
  • 0B: Z

Words still have to end with 0x00 since this encoding has a variable length, so we need some form of delimiter again. The upside is that you can still use the C string functions.

“WORLD” is 0xF4, 0x9A, 0xB0, 0x00
“RUGBY” is 0x9C, 0x03, 0x05, 0x01, 0x00
“FRANK” is 0x02, 0x93, 0x50, 0x07, 0x00
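
A minimal C sketch of decoding this nibble scheme, assuming the tables above and that escape pairs always occupy a whole byte, as in the examples (names are illustrative):

```c
#include <stdint.h>

/* Nibble-to-letter tables for the frequency encoding above.
   Nibble 0 is the escape / padding / terminator value. */
static const char HIGH_FREQ[] = " ETAONIHSRLDUCMW"; /* nibbles 0x1-0xF */
static const char LOW_FREQ[]  = " YFGPBVKJXQZ";     /* escape pairs 0x01-0x0B */

/* Decode one 0x00-terminated, nibble-packed word into out (6+ bytes). */
static uint8_t decode_word(const uint8_t *in, char *out)
{
    uint8_t len = 0;
    while (*in != 0x00) {
        uint8_t hi = *in >> 4;
        uint8_t lo = *in & 0x0F;
        if (hi == 0) {
            out[len++] = LOW_FREQ[lo];      /* 0x0X: one of the 11 rarer letters */
        } else {
            out[len++] = HIGH_FREQ[hi];
            if (lo != 0)                    /* a low nibble of 0 is just padding */
                out[len++] = HIGH_FREQ[lo];
        }
        ++in;
    }
    out[len] = '\0';
    return len;
}
```

Fed 0xF4, 0x9A, 0xB0, 0x00 this produces “WORLD”, matching the first example.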

Digram encoding is likely better, but it also needs some form of delimiter. A custom encoding plus some form of compression is also possible.
Digram encoding on its own should always gain something: since each digram code references a 2-byte letter pair, a word takes 3B to 5B, because in the worst case you would just use the plain string. So combining it with additional compression is probably not worth it.

Another encoding is to use one bit (&0x80) for marking the end of a word and just replacing parts of ASCII you don’t need with the most common digraphs/digrams:
(er, ti, ar, ou, or, le, th as 01, 02, 03, 04, 05, 06, 07)
“OTHER” is 0x4F, 0x07, 0x81
“MOUTH” is 0x4D, 0x04, 0x87
“ORDER” is 0x05, 0x44, 0x81
“UNCLE” is 0x55, 0x4E, 0x43, 0x86
“VOICE” is 0x56, 0x4F, 0x49, 0x43, 0xC5
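
A matching C decode sketch for this end-bit + digram variant, using the digram table above (function and table names are just illustrative):

```c
#include <stdint.h>

/* Digram codes 0x01-0x07 reuse ASCII control codes the word list never needs;
   bit 0x80 marks the last byte of a word. */
static const char *const DIGRAMS[8] = {
    "", "ER", "TI", "AR", "OU", "OR", "LE", "TH"
};

static uint8_t decode_word(const uint8_t *in, char *out)
{
    uint8_t len = 0;
    uint8_t c, last = 0;
    while (!last) {
        c = *in++;
        last = c & 0x80;                /* high bit set => final byte */
        c &= 0x7F;
        if (c >= 0x01 && c <= 0x07) {   /* digram code expands to two letters */
            out[len++] = DIGRAMS[c][0];
            out[len++] = DIGRAMS[c][1];
        } else {
            out[len++] = (char)c;       /* plain ASCII letter */
        }
    }
    out[len] = '\0';
    return len;
}
```

0x4F, 0x07, 0x81 decodes to “OTHER”, as in the first example.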

ZetaTwo commented Feb 7, 2022

I replaced the bloom filter with a delta-compressed word list, which removes the errors in the lookup and easily fits the full word list: #8. See this Twitter thread for context: https://twitter.com/ZetaTwo/status/1490393006549180419

The compressed list now uses about 19k of memory, which corresponds to about a 70% compression rate, or equivalently about 1.4 bytes per word.
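
For context on the general idea (the exact format in #8 differs and is more compact at ~1.4 bytes per word), delta compression of a sorted word list is often done as front coding: store how many leading letters each word shares with the previous one, plus the new suffix. A rough sketch:

```c
#include <stdint.h>

/* Front-coding sketch (a hypothetical layout, not necessarily the format of #8):
   each entry is one byte giving how many leading letters the word shares with
   the previous word, followed by the remaining letters. */
static void walk_list(const uint8_t *in, uint16_t n_words,
                      void (*emit)(const char *word))
{
    char prev[6] = "";                   /* 5-letter word + terminator */
    for (uint16_t i = 0; i < n_words; ++i) {
        uint8_t shared = *in++;          /* 0..5 letters kept from the previous word */
        for (uint8_t pos = shared; pos < 5; ++pos)
            prev[pos] = (char)*in++;     /* overwrite the differing tail */
        prev[5] = '\0';
        emit(prev);                      /* prev now holds the current word */
    }
}
```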

M4R7iNP commented Feb 7, 2022

Very nice! I've tried to use banking myself and compress by generating and traversing a tree of letters (commit), combining the letter and node depth into the same byte. It is fairly slow when it has to traverse the whole tree, but I ended up with 12972 words in 26992 bytes (~2 bytes per word). Thought I'd add my solution to the thread, but @ZetaTwo's solution looks the most promising 😃
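
Under a guessed layout (letter in the low 5 bits, node depth in the top 3 bits, nodes stored in pre-order; the linked commit may pack things differently), a membership check over such a flattened tree could look roughly like this:

```c
#include <stdint.h>
#include <stdbool.h>

/* Check whether a lowercase 5-letter word is in a pre-order flattened letter
   tree where each node byte holds the depth (bits 7-5) and letter (bits 4-0). */
static bool word_in_tree(const uint8_t *tree, uint16_t n_nodes, const char *word)
{
    uint8_t matched = 0;                         /* letters matched so far */
    for (uint16_t i = 0; i < n_nodes; ++i) {
        uint8_t depth  = tree[i] >> 5;
        uint8_t letter = tree[i] & 0x1F;
        if (depth > matched)
            continue;                            /* inside a branch we already rejected */
        if (depth < matched)
            return false;                        /* left the subtree that matched our prefix */
        if (letter == (uint8_t)(word[depth] - 'a') && ++matched == 5)
            return true;                         /* all five letters matched */
        /* same depth but wrong letter: keep scanning the siblings */
    }
    return false;
}
```

A miss can end up scanning every node, which matches the slow full-tree traversal mentioned above.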

jchillerup commented Feb 8, 2022

I used another approach based on tom7's "Portmantout" work, encoding the whole word list (*) as one large string represented with 5-bit letters + an index with offsets.

Since there are ~13500 words (i.e. offsets), we need at least 14 bits to keep track of them all. The index is therefore just a structure of 14-bit numbers that each represent a legal word.

The 5-bit-encoded portmantout of the word list clocks in at 8210 bytes, while the index weighs 3910 bytes, yielding a total of 12120 bytes. Since this is just encoding (and not actual compression), I'd argue this could run on a Game Boy.
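
A minimal sketch of how such a structure could be read back, assuming MSB-first bit packing and offsets counted in letters of the portmantout string (buffer names and layout details are assumptions, not the actual implementation):

```c
#include <stdint.h>

/* Read `nbits` bits (MSB-first) starting at bit position `bitpos`. */
static uint16_t read_bits(const uint8_t *buf, uint32_t bitpos, uint8_t nbits)
{
    uint16_t val = 0;
    for (uint8_t i = 0; i < nbits; ++i, ++bitpos)
        val = (uint16_t)((val << 1) | ((buf[bitpos >> 3] >> (7 - (bitpos & 7))) & 1));
    return val;
}

/* Fetch word number `idx`: its 14-bit offset points into the portmantout
   string, and five 5-bit letters (0 = 'A' ... 25 = 'Z') are unpacked from
   there. `index_bits` and `text_bits` are hypothetical names for the blobs. */
static void get_word(const uint8_t *index_bits, const uint8_t *text_bits,
                     uint16_t idx, char *out)
{
    uint32_t offset = read_bits(index_bits, (uint32_t)idx * 14, 14);
    for (uint8_t i = 0; i < 5; ++i)
        out[i] = (char)('A' + read_bits(text_bits, (offset + i) * 5, 5));
    out[5] = '\0';
}
```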

Lookups are going to be another challenge. I think that could be implemented with a Word -> Offset hash map.

(*) sans 81 words which didn't fit in the portmantout structure. That's a FIXME

SNBeast commented Mar 28, 2022

I recently made my own port, not knowing about this one, which has save data, uses the MBC3 real-time clock, and, relevant to this feature request, has the full word lists.
The SRAM presence, RTC presence, and expanded ROM were achieved using these compiler arguments; for the word lists, I split them by letter (and the daily list separately) to make smaller chunks that can be auto-packed and searched faster. The delay is pretty minor.
