- First, run `python random_token_combos.py`. This generates `random_pairs_lower.txt`, which lists all words that fulfill the following criteria (see the sketch below):
  - 7 letters long
  - 2 subword tokens long (using the tokenizer that both GPT-3.5 and GPT-4 use; the word needs to be 2 tokens long whether or not it follows a space)
  - The first subword token is 3 letters long and the second is 4 letters long (again, these lengths need to hold whether or not the word follows a space).
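
  A minimal sketch of those checks (the actual `random_token_combos.py` may differ in its details; the function name here is illustrative), assuming the `cl100k_base` encoding from `tiktoken`, which is the tokenizer shared by GPT-3.5 and GPT-4:

  ```python
  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")

  def meets_criteria(word: str) -> bool:
      """True if `word` satisfies the 7-letter, 2-token, 3+4-letter criteria."""
      if len(word) != 7:
          return False
      # The word must be exactly 2 tokens both with and without a leading space.
      for prefix in ("", " "):
          tokens = enc.encode(prefix + word)
          if len(tokens) != 2:
              return False
          first, second = (enc.decode([t]) for t in tokens)
          # Ignore the leading space when measuring the first piece's length.
          if len(first.lstrip(" ")) != 3 or len(second) != 4:
              return False
      return True
  ```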
- Then, sort these words by the probability assigned to them by GPT-2 by running `python gpt2_prob_sevenletter.py`. This generates `random_pairs_lower_scored.txt`, which lists each word along with a log probability. The log probability is computed as the log probability that GPT-2 assigns to the sentence `The word is "WORD"`, minus the log probability that it assigns to `The word is "`; thus, this yields the log probability assigned to just the word and the following quotation mark in the context of `The word is "`. The closing quotation mark is included because it serves to indicate the end of the word (see the sketch below).
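
  A minimal sketch of that scoring rule (the actual `gpt2_prob_sevenletter.py` may differ, e.g. in batching, checkpoint choice, or exact prompt formatting; the function names here are illustrative), using the Hugging Face `transformers` GPT-2 model:

  ```python
  import torch
  from transformers import GPT2LMHeadModel, GPT2TokenizerFast

  tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
  model = GPT2LMHeadModel.from_pretrained("gpt2")
  model.eval()

  def text_logprob(text: str) -> float:
      """Sum of GPT-2's log probabilities for every token after the first."""
      ids = tokenizer(text, return_tensors="pt").input_ids
      with torch.no_grad():
          logits = model(ids).logits
      log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
      targets = ids[0, 1:]
      return log_probs[torch.arange(targets.size(0)), targets].sum().item()

  def word_score(word: str) -> float:
      # Log probability of the full sentence minus that of the prefix,
      # as described above.
      return text_logprob(f'The word is "{word}"') - text_logprob('The word is "')
  ```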
- Then, bin the words by running `python select_words.py` to create `words_5bins.txt`.
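
  The exact binning rule used by `select_words.py` is not spelled out above; as an illustration only, one way to split the scored words into five equal-sized bins ordered by log probability (the function name and the assumed one-`word log_prob`-pair-per-line file format are hypothetical):

  ```python
  def bin_words(scored_path: str = "random_pairs_lower_scored.txt", n_bins: int = 5):
      # Assumes one "word log_prob" pair per non-empty line in the scored file.
      with open(scored_path) as f:
          scored = [(word, float(lp))
                    for word, lp in (line.split() for line in f if line.strip())]
      scored.sort(key=lambda pair: pair[1])  # lowest to highest log probability
      size = len(scored) // n_bins
      # The last bin absorbs any remainder.
      return [scored[i * size:(i + 1) * size] if i < n_bins - 1 else scored[i * size:]
              for i in range(n_bins)]
  ```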
- The final list of words can be found in `bin1_prob.txt`, `bin2_prob.txt`, `bin3_prob.txt`, `bin4_prob.txt`, and `bin5_prob.txt`.