The following link points to the GitHub repository of the self-assigned final project in the course Language Analytics (F23.147201U021.A). The repository provides all code necessary to reproduce the results of the project.
https://github.com/rikkeuldbaek/DadJokeGenerator_LA_FP
For this final project I will be generating dad jokes using two different methods of text generation: a simple RNN (recurrent neural network) and a pretrained GPT-2 model. Both models will be fine-tuned/trained on a dad-jokes dataset in order to generate text in the same style. To make the generated dad jokes from the two models comparable, I have chosen to generate two kinds of jokes from prefixes. A prefix fixes the start phrase of a joke, so that the jokes share a similar one-liner structure. The first kind of joke uses the prefix "What do you call", while the second kind uses the prefix "Why did the". The quality of the generated dad jokes from both models is evaluated in the results section.
The dad-jokes dataset is a web-scraped collection of signature one-liner dad jokes. It consists of 743 dad jokes with a mean length of 75 words. The data is available both within this repository and via Kaggle; please see resources for further information.
For this project I have used the transformer model GPT-2, which has been pretrained on a very large corpus of English text. The language model comes in different sizes (GPT-2 Small, GPT-2 Medium, GPT-2 Large, and GPT-2 XL); due to computational limitations the smallest model is used (GPT-2 Small, with 124 million parameters). GPT-2 is trained to predict the next word in a sentence, making it ideal for the purpose of this final project. The GPT-2 model is available via Hugging Face; please see resources for further information.
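The fine-tuning arguments used later in this README (`--steps`, `--restore_from`, `--temperature`) resemble the interface of the `gpt-2-simple` library; the following is a minimal, illustrative sketch of that fine-tune-then-sample flow under this assumption. The function name, dataset path, and default values are taken from this README's argument tables, not from the repository's actual scripts:

```python
# Hypothetical sketch of the GPT-2 fine-tuning/generation flow.
# Assumes the gpt-2-simple library; names and paths are illustrative.

DEFAULTS = {
    "steps": 300,             # fine-tuning steps (--steps)
    "restore_from": "fresh",  # start from the pretrained checkpoint (--restore_from)
    "gen_text_len": 75,       # length of generated text (--gen_text_len)
    "temperature": 1.0,       # sampling temperature (--temperature)
}

def finetune_and_generate(dataset_path, prefix, **overrides):
    """Fine-tune GPT-2 Small on the dad-joke corpus, then sample jokes."""
    import gpt_2_simple as gpt2  # heavy dependency, imported lazily

    opts = {**DEFAULTS, **overrides}
    sess = gpt2.start_tf_sess()
    # Fine-tune the 124M ("Small") checkpoint on the joke corpus.
    gpt2.finetune(sess, dataset=dataset_path, model_name="124M",
                  steps=opts["steps"], restore_from=opts["restore_from"])
    # Sample jokes that all start with the chosen prefix.
    return gpt2.generate(sess, prefix=prefix, length=opts["gen_text_len"],
                         temperature=opts["temperature"],
                         nsamples=15, return_as_list=True)
```

A call such as `finetune_and_generate("data/dad-a-base.csv", "What do you call")` would then return a list of generated jokes, one per sample.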
Similarly, I will be using a recurrent neural network (RNN) to generate dad jokes. RNNs are a type of neural network in which the connections between nodes form cycles, allowing them to account for the sequential and temporal aspects of language. However, simple RNNs have a hard time dealing with the long-distance dependencies found in natural language: words from long ago have less impact than recent words (also known as the vanishing gradient problem). To overcome this problem, I have introduced a Gated Recurrent Unit (GRU) layer with 350 units in my RNN model. GRUs are an improved version of simple RNNs, as their gating mechanism avoids phasing out information over time, allowing them to retain relevant information from long ago. The RNN model of this project is constructed using tools from Keras/TensorFlow; please see resources for further information.
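A GRU-based next-word model of this kind can be sketched in Keras as below. Only the 350-unit GRU comes from the description above; the embedding size, optimizer, and exact layer stack are assumptions for illustration, not the repository's actual architecture:

```python
def build_rnn(vocab_size, gru_units=350):
    """Sketch of a next-word model: Embedding -> GRU(350) -> softmax."""
    from tensorflow.keras.models import Sequential   # heavy import, deferred
    from tensorflow.keras.layers import Embedding, GRU, Dense

    model = Sequential([
        # Map word indices to dense vectors (100 dimensions is an assumption).
        Embedding(vocab_size, 100),
        # The gated recurrent layer mitigates the vanishing gradient problem.
        GRU(gru_units),
        # Probability distribution over the vocabulary for the next word.
        Dense(vocab_size, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```

Training then pairs each word-sequence prefix with its true next word, and generation repeatedly samples or picks the most probable next word.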
The scripts require a certain folder structure; the table below presents the required folders along with their descriptions and contents.
Folder name | Description | Content |
---|---|---|
`src` | dad joke generator scripts and data preprocessing script | `gpt2_joke_generator_what.py`, `gpt2_joke_generator_why.py`, `RNN_joke_generator_what.py`, `RNN_joke_generator_why.py`, `data.py` |
`data` | `.csv` file of dad jokes | `dad-a-base.csv` |
`out` | saved `.txt` files of generated dad jokes | `dad_jokes_gpt2_what.txt`, `dad_jokes_gpt2_why.txt`, `dad_jokes_RNN_what.txt`, `dad_jokes_RNN_why.txt` |
The `data.py` script located in `src` only preprocesses the data for the RNN. The `gpt2_joke_generator_what.py`, `gpt2_joke_generator_why.py`, `RNN_joke_generator_what.py`, and `RNN_joke_generator_why.py` scripts located in `src` generate dad jokes, which are saved in the `out` folder.
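A common way to preprocess a small corpus for next-word prediction (which is plausibly what `data.py` does for the RNN; the actual implementation may differ) is to turn each joke into a set of growing n-gram prefixes, where the last word of each prefix serves as the prediction target. A self-contained sketch:

```python
def make_ngram_sequences(jokes):
    """Turn each joke into growing word-index prefixes for next-word training.

    Returns (sequences, vocab), where each sequence [w1..wk] pairs the
    context [w1..wk-1] with the target word wk.
    """
    # Build a word -> index vocabulary over the whole corpus (index 0 reserved,
    # e.g. for padding).
    vocab = {}
    for joke in jokes:
        for word in joke.lower().split():
            vocab.setdefault(word, len(vocab) + 1)

    sequences = []
    for joke in jokes:
        ids = [vocab[w] for w in joke.lower().split()]
        # Every prefix of length >= 2 becomes one training example.
        for k in range(2, len(ids) + 1):
            sequences.append(ids[:k])
    return sequences, vocab

seqs, vocab = make_ngram_sequences(["Why did the coffee fail"])
# Produces 4 prefixes: [why, did], [why, did, the], ..., [why, ..., fail]
```

Before training, such sequences would typically be padded to a common length so they can be batched for the Keras model.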
To be able to run the code, please make sure to have Bash and Python 3 installed on your device. The code has been written and tested with Python 3.9.2 on a Linux operating system. To run the provided code for this assignment, please follow the instructions below.
1) Clone the repository
```bash
git clone https://github.com/rikkeuldbaek/DadJokeGenerator_LA_FP
```
2) Setup
Set up the virtual environment (`LA_fp_env`) and install the required packages.
```bash
bash setup.sh
```
To run the GPT-2 and RNN models, please run the following command in the terminal after setting up. Please note that the scripts take quite some time to run, since two types of jokes are generated from each model.
```bash
bash run.sh
```
The GPT-2 and RNN models have the default arguments stated in the tables below. These arguments can be modified and adjusted in the `run.sh` script; if no modifications are made, the default parameters are used. For help, write `--help` after the commands below instead of an argument.
```bash
# GPT-2 model
python src/gpt2_joke_generator_what.py  # add arguments here or --help
python src/gpt2_joke_generator_why.py   # add arguments here or --help

# RNN model
python src/RNN_joke_generator_what.py   # add arguments here or --help
python src/RNN_joke_generator_why.py    # add arguments here or --help
```
The following arguments are available for the `data.py` script. These can be adjusted if the user wants to run the models on new data with a different folder structure.
Argument | Type | Default |
---|---|---|
--folder | string | "data" |
--file | string | "dad-a-base.csv" |
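The table above maps onto a straightforward `argparse` setup. A sketch of how `data.py` might parse these arguments (the function name is hypothetical; the actual script may be structured differently):

```python
import argparse

def parse_data_args(argv=None):
    """Parse the data-location arguments listed in the table above."""
    parser = argparse.ArgumentParser(description="Preprocess the dad-joke data.")
    parser.add_argument("--folder", type=str, default="data",
                        help="folder containing the dataset")
    parser.add_argument("--file", type=str, default="dad-a-base.csv",
                        help="name of the dataset file")
    return parser.parse_args(argv)

args = parse_data_args([])  # no overrides -> the defaults from the table
```

Passing e.g. `--folder newdata --file jokes.csv` on the command line would then redirect the script to a different dataset location.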
The following arguments are available for the `gpt2_joke_generator_what.py` and `gpt2_joke_generator_why.py` scripts. Note that the `--prefix_what` argument is reserved for the `gpt2_joke_generator_what.py` script, and the `--prefix_why` argument is reserved for the `gpt2_joke_generator_why.py` script.
Argument | Type | Default |
---|---|---|
--path_to_data | string | os.path.join(os.getcwd(), "data") |
--file | string | "dad-a-base.csv" |
--steps | integer | 300 |
--restore_from | string | "fresh" |
--prefix_what | string | "What do you call" |
--prefix_why | string | "Why did the" |
--gen_text_len | integer | 75 |
--temperature | float | 1.0 |
The following arguments are available for the `RNN_joke_generator_what.py` and `RNN_joke_generator_why.py` scripts. Note that the `--prefix_what` argument is reserved for the `RNN_joke_generator_what.py` script, and the `--prefix_why` argument is reserved for the `RNN_joke_generator_why.py` script.
Argument | Type | Default |
---|---|---|
--n_epochs | integer | 30 |
--batch_size | integer | 50 |
--verbose | integer | 1 |
--prefix_what | string | "What do you call" |
--prefix_why | string | "Why did the" |
--n_next_words | integer | 12 |
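Generation with such a model typically extends the prefix one word at a time for `--n_next_words` steps, feeding the growing sequence back in and taking a predicted next word each time. A minimal sketch of that loop with a stand-in `predict_next` function (in the real scripts this would wrap predictions from the trained GRU model):

```python
def generate_joke(prefix, predict_next, n_next_words=12):
    """Greedily extend `prefix` by n_next_words predicted words.

    `predict_next` maps the current list of words to the next word; here it
    is a stand-in for the trained model's most-probable-word prediction.
    """
    words = prefix.split()
    for _ in range(n_next_words):
        words.append(predict_next(words))
    return " ".join(words)

# Stand-in predictor that always returns a fixed word, just to show the loop.
joke = generate_joke("Why did the", lambda ws: "coffee", n_next_words=3)
# -> "Why did the coffee coffee coffee"
```

With a real model, `predict_next` would encode the word list as padded indices, call `model.predict`, and map the argmax index back to a vocabulary word.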
The `data.py` script is automatically called when running the RNN model scripts through `run.sh`; thus, the arguments for `data.py` must be passed to the RNN model scripts inside the `run.sh` bash file:
```bash
# RNN model
python src/RNN_joke_generator_what.py --arguments_for_model --arguments_for_data
python src/RNN_joke_generator_why.py --arguments_for_model --arguments_for_data
```
The results of the GPT-2 model look fairly decent. The model was specified to generate 15 different jokes of each kind ("What do you call" and "Why did the"), so multiple jokes could be evaluated. The model seems to pick up on the signature one-liner dad-joke style, with a question-and-answer sentence structure. However, the majority of the generated jokes lack semantic quality or humor:
What do you call someone with no nose? Someone without a nose?
What do you call a gorilla by his fierce little teeth? A gummy beast!
Why did the barrel go to the bathroom? He was going to need a space bar.
Why did the spider say stay still? So soft it wondered if it was silk or silkect.
Although the majority of the GPT-2 generated dad jokes are pretty poor, some of the jokes are actually funny. However, these jokes are not always original. Using a simple Google search to check whether the generated dad jokes are already present online, I found that most of them are not original, i.e., they already exist somewhere on the internet. This is most likely because GPT-2 is pretrained on web data and is therefore likely to rely on its pre-existing knowledge of available dad jokes. The `temperature` argument, which controls the randomness of the generated text, could be tweaked to counteract this; however, this is a trade-off between originality and semantic quality of the sentences. For instance, the GPT-2 model generated the following unoriginal jokes (pre-existing dad jokes found online):
What do you call a pig that knows karate? A pork chop!
Why did the barber win the race? He took a short cut.
Nonetheless, the fine-tuned GPT-2 model did create one new original "What do you call" dad joke. A Google search did not turn up a similar joke, hence I consider the following a new original dad joke:
What do you call a troublesome dog? A Terrier-izer.
Please navigate to the `out` folder to inspect all GPT-2 generated dad jokes.
The results of the RNN model are quite poor. The model only generates one joke of each kind ("What do you call" and "Why did the"), hence only two generated jokes are evaluated. From the generated jokes below it is obvious that they lack grammatical and semantic quality.
What Do You Call A Cow With A Trampoline A Milk Shake A Fish A Rooster
Why Did The Coffee Fail The Driving Test It Never Came To A Full Stop
The poor performance is not surprising, since the model has been trained on a very small text corpus of 743 relatively short sentences (dad jokes). There is simply too little data, and this is clearly reflected in the quality of the generated dad jokes. Moreover, both dad jokes above closely resemble pre-existing jokes found in the dad-jokes dataset: "What do you call a cow on a trampoline? A milkshake." and "Why did the sentence fail the driving test? It never came to a full stop". Hence the RNN model almost replicates pre-existing jokes from the training data, making it a pretty poor model for generating new text. Please navigate to the `out` folder to inspect the RNN generated dad jokes.
Overall, the fine-tuned GPT-2 model outperforms the fine-tuned RNN when generating dad jokes. This makes good sense, as the GPT-2 model is pretrained on a very large English corpus and has thus learned an inner representation of the English language beforehand. Conversely, the simple RNN is only trained on a very small English corpus (the dad-jokes dataset) without any pre-existing knowledge of the English language. After all, one could argue that it is not entirely fair to compare the performance of a fine-tuned GPT-2 model and a fine-tuned RNN, as these models are very different in nature. One should also keep in mind that humor is subjective, and the generated dad jokes in this final project are evaluated by the author; hence this evaluation of the results bears the mark of the author's subjectivity.