The Futurama Corpus

This repo contains a corpus of dialogue spoken on the television show Futurama. The corpus itself is in data/futurama_scripts.txt, and the futurama_parser.py module is included to allow easy access to the dialogue by character. The corpus was created with the futurama_scrape.py script, which scrapes the Futurama scripts available here; these cover 7 seasons of the show and 4 movies. Unfortunately, the website moderators got a little lazy with the formatting of the last season, so many of those scripts could not be parsed.

Below is a breakdown of the corpus by major character:

Fry: 34,805 words
Bender: 30,333 words
Leela: 28,993 words
Farnsworth: 16,936 words
Hermes: 8,095 words
Zoidberg: 7,201 words
Amy: 6,582 words

The included futurama_generator.py script shows one possible application of this corpus. It randomly generates dialogue for the character Fry using a simple 5-gram model with some rule filtering. My hope is that others may find some fun with this corpus for future NLP projects.
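For readers unfamiliar with n-gram generation, here is a minimal sketch of the general idea. It is not the code in futurama_generator.py: the function names, the `<s>`/`</s>` padding tokens, the whitespace tokenization, and the two filter rules (a minimum length, and no verbatim copies of training lines) are all assumptions made for illustration. It uses only the standard library rather than NLTK.

```python
import random
from collections import defaultdict


def build_ngram_model(lines, n=5):
    """Map each (n-1)-word context to the list of words seen right after it."""
    model = defaultdict(list)
    for line in lines:
        # Pad each line so that its start and end are part of the model.
        tokens = ["<s>"] * (n - 1) + line.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            model[context].append(tokens[i + n - 1])
    return model


def generate_line(model, n=5, max_words=30, rng=random):
    """Sample one line by repeatedly drawing a next word for the current context."""
    context = ("<s>",) * (n - 1)
    words = []
    while len(words) < max_words:
        # Duplicates in the list make frequent continuations more likely.
        next_word = rng.choice(model[context])
        if next_word == "</s>":
            break
        words.append(next_word)
        context = context[1:] + (next_word,)
    return " ".join(words)


def generate_filtered(model, training_lines, n=5, min_words=3,
                      attempts=100, rng=random):
    """Hypothetical rule filter: reject short lines and verbatim training lines.

    Returns None if no candidate passes within the given number of attempts.
    """
    seen = {" ".join(line.split()) for line in training_lines}
    for _ in range(attempts):
        candidate = generate_line(model, n=n, rng=rng)
        if len(candidate.split()) >= min_words and candidate not in seen:
            return candidate
    return None


if __name__ == "__main__":
    # Toy input for demonstration only; these lines are not from the corpus.
    toy_lines = [
        "i am going to the store today",
        "i am going to the moon tomorrow",
        "we are going to the store tomorrow",
    ]
    # A smaller n is used here because three lines are far too few for 5-grams.
    toy_model = build_ngram_model(toy_lines, n=3)
    print(generate_filtered(toy_model, toy_lines, n=3))
```

With a 4-word context, a model trained on a small corpus will mostly reproduce its training lines, which is one reason a filter of some kind is useful. Lowering n gives more varied but less coherent output.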

Requirements:

Python 3
NLTK
Beautiful Soup

acalabrigo/futurama-corpus
