I downloaded all of the Seinfeld scripts from seinology.com and wrote code to parse them and load them into a SQLite database.
Feel free to message me if you want the DB file.
mkdir scripts
python download.py scripts
- Fix any issues in the data (See CHANGES MADE TO DATA)
./run.sh seinfeld.db scripts
sqlite> .schema episode
CREATE TABLE episode(
id INTEGER PRIMARY KEY,
season_number INTEGER NOT NULL,
episode_number INTEGER NOT NULL,
title TEXT,
the_date TEXT,
writer TEXT,
director TEXT,
UNIQUE(season_number, episode_number)
);
sqlite> select * from episode limit 3;
id  season_number  episode_number  title                the_date          writer                       director
1   1              0               Good News, Bad News  July 5, 1989      Larry David, Jerry Seinfeld  Art Wolff
2   2              5               The Apartment        April 4, 1991     Peter Mehlman                Tom Cherones
3   6              16              The Beard            February 9, 1995  Carol Leifer                 Andy Ackerman
sqlite> .schema utterance
CREATE TABLE utterance(
id INTEGER PRIMARY KEY,
episode_id INTEGER NOT NULL,
utterance_number INTEGER NOT NULL,
speaker TEXT NOT NULL,
text TEXT NOT NULL,
UNIQUE(episode_id, utterance_number),
FOREIGN KEY(episode_id) REFERENCES episode(id)
);
sqlite> select * from utterance limit 3;
id  episode_id  utterance_number  speaker  text
1   1           1                 JERRY    (pointing at George's shirt) See, to me, that button is in the worst possible spot. The second button literally makes or breaks the shirt, look at it. It's too high! It's in no-man's-land. You look like you live with your mother.
2   1           2                 GEORGE   Are you through?
3   1           3                 JERRY    You do of course try on, when you buy?
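
The two tables are linked through `utterance.episode_id`, so pulling the dialogue for one episode is a single join. Below is a minimal sketch using Python's built-in `sqlite3` module. The function name `episode_lines` is hypothetical and is not part of the scripts in this repo; it only assumes the schema shown above.

```python
import sqlite3


def episode_lines(conn, season_number, episode_number):
    """Return (utterance_number, speaker, text) rows for one episode, in order."""
    return conn.execute(
        """
        SELECT u.utterance_number, u.speaker, u.text
        FROM utterance AS u
        JOIN episode AS e ON e.id = u.episode_id
        WHERE e.season_number = ? AND e.episode_number = ?
        ORDER BY u.utterance_number
        """,
        (season_number, episode_number),
    ).fetchall()


# Hypothetical usage against the generated database file:
#   conn = sqlite3.connect("seinfeld.db")
#   for number, speaker, text in episode_lines(conn, 1, 0):
#       print(number, speaker, text)
```

Looking episodes up by `(season_number, episode_number)` relies on the `UNIQUE` constraint in the `episode` schema, so the join matches at most one episode.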
- Script transcribers sometimes describe how a line is spoken or what's going on in a scene as parentheticals preceding lines. I'd like to remove these and I think it may be as easy as looking for a pair of parentheses at the beginning of a line.
- A lot of the character names are used inconsistently.
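
The parenthetical cleanup mentioned above could be sketched with a regular expression anchored at the start of the line. This is only an illustration and has not been run against the full data set. The names `LEADING_PARENS` and `strip_leading_parenthetical` are hypothetical, and the pattern assumes the parentheses are not nested.

```python
import re

# One or more "(...)" groups at the very start of the text, together with
# any whitespace around them. Nested parentheses are NOT handled.
LEADING_PARENS = re.compile(r"^\s*(?:\([^()]*\)\s*)+")


def strip_leading_parenthetical(text):
    """Remove stage directions that precede the spoken part of a line."""
    return LEADING_PARENS.sub("", text)


# Hypothetical usage:
#   strip_leading_parenthetical("(pointing at George's shirt) See, to me, ...")
```

A line that consists only of a parenthetical comes back as an empty string, so those rows would need to be dropped or handled separately. Parentheticals in the middle of a line are left alone.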
In the file '01.shtml' look for "pc: 101, season 1, episode 1 (Pilot)" and change "episode 1" to "episode 0".
#### Characters with the most lines
SELECT speaker, count(*) as count
FROM utterance
GROUP BY speaker
ORDER BY count DESC
LIMIT 20;
speaker count
JERRY 14645
GEORGE 9613
ELAINE 7967
KRAMER 6656
NEWMAN 625
MORTY 502
HELEN 470
FRANK 429
SUSAN 382
ESTELLE 273
MAN 207
PETERMAN 199
WOMAN 199
PUDDY 163
LEO 145
JACK 124
STEINBRENNER 122
MICKEY 118
BANIA 102
ROSS 102
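
The same query can be run from Python with the built-in `sqlite3` module. This is a minimal sketch: the function name `top_speakers` is hypothetical, and the secondary sort on `speaker` is my own addition so that ties (such as BANIA and ROSS above) come back in a stable order.

```python
import sqlite3


def top_speakers(conn, limit=20):
    """Return (speaker, line_count) pairs, most lines first."""
    return conn.execute(
        """
        SELECT speaker, count(*) AS line_count
        FROM utterance
        GROUP BY speaker
        ORDER BY line_count DESC, speaker
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


# Hypothetical usage against the generated database file:
#   conn = sqlite3.connect("seinfeld.db")
#   for speaker, line_count in top_speakers(conn):
#       print(speaker, line_count)
```

Because speaker names are used inconsistently, these counts split a character's lines across every spelling of their name. Raising `limit` is a quick way to spot the variants.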