
Final project for NSS Data Science cohort. Uses NLP against the Spoon River Anthology to determine relationships and groups amongst the characters.


taylorperkins/Spoon-River-Anthology


Spoon River Anthology

"The Hill"

Where are Elmer, Herman, Bert, Tom and Charley,
The weak of will, the strong of arm, the clown, the boozer, the fighter?
All, all are sleeping on the hill.
One passed in a fever,
One was burned in a mine,
One was killed in a brawl,
One died in a jail,
One fell from a bridge toiling for children and wife—
All, all are sleeping, sleeping, sleeping on the hill.
Where are Ella, Kate, Mag, Lizzie and Edith,
The tender heart, the simple soul, the loud, the proud, the happy one?
—
All, all are sleeping on the hill.
One died in shameful child-birth,
One of a thwarted love,
One at the hands of a brute in a brothel,
One of a broken pride, in the search for heart’s desire;
One after life in far-away London and Paris
Was brought to her little space by Ella and Kate and Mag
—
All, all are sleeping, sleeping, sleeping on the hill.
Where are Uncle Isaac and Aunt Emily,
And old Towny Kincaid and Sevigne Houghton,
And Major Walker who had talked
With venerable men of the revolution?
—
All, all are sleeping on the hill.
They brought them dead sons from the war,
And daughters whom life had crushed,
And their children fatherless, crying
—
All, all are sleeping, sleeping, sleeping on the hill.
Where is Old Fiddler Jones
Who played with life all his ninety years,
Braving the sleet with bared breast,
Drinking, rioting, thinking neither of wife nor kin,
Nor gold, nor love, nor heaven?
Lo! he babbles of the fish-frys of long ago,
Of the horse-races of long ago at Clary’s Grove,
Of what Abe Lincoln said
One time at Springfield.

Overview

Spoon River Anthology is a book of poems by Edgar Lee Masters, written in the early 1900s. Every poem in the book is told by a fictional character (some based on people Edgar knew) who has died in the town of Spoon River. Because every character has passed away, they talk about any number of things: love, loss, regret, guilt, joy, what the afterlife is like, or how people are currently tending to their gravestones. It is an incredible book, with so much spirit and emotion.

For this particular project, I hope to do two things.

  1. Uncover the relationships in the book. Since the characters are from the same town, many of them knew each other. It is an interesting project to uncover those relationships and view the town at a high level.
  2. Cluster the characters based on sentiment. An easy question to ask is: which characters are talking about the same things? Is there a way to determine similarities? I think so, and I would love to solve this using NLP and a graph library such as NetworkX.

My Thought Process, Order to Completion

  1. I need the data! This dataset is pretty easy to obtain overall. I programmatically went through the contents of the book found here and scraped out what I needed. You can find the code for my process here.
  2. Now that I have the text for the poems, I need to clean them a bit. There are newlines at the beginning of the files, and sometimes in the middle of the poems. Every 5th line in a poem ends with a number representing its line number. In addition, every poem has either its first word or its first two words completely capitalized. This makes it pretty difficult to correctly identify names or other important words, so those words need to be normalized to match every other line in the poems. These steps are relatively simple, and you can find the code for that process here.
  3. Model on
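The cleaning described in step 2 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `clean_poem` and `normalize_leading_caps` are hypothetical, and the sketch assumes the line number is separated from the verse by whitespace and that only the first line of a poem carries the capitalized opening words.

```python
import re


def normalize_leading_caps(line):
    """Undo the all-caps styling on the first one or two words of a line."""
    words = line.split(" ")
    for i, word in enumerate(words[:2]):
        letters = re.sub(r"[^A-Za-z]", "", word)
        if len(letters) > 1 and letters.isupper():
            # First word keeps a leading capital; a second word is lowercased.
            words[i] = word.capitalize() if i == 0 else word.lower()
        elif len(letters) != 1:
            # Ordinary mixed-case word: the styled prefix has ended.
            break
        # Single letters such as "I" are left alone and the scan continues.
    return " ".join(words)


def clean_poem(raw):
    """Strip blank lines, trailing line numbers, and the all-caps opening."""
    lines = [ln.strip() for ln in raw.splitlines()]
    lines = [ln for ln in lines if ln]                    # drop blank lines
    lines = [re.sub(r"\s+\d+$", "", ln) for ln in lines]  # drop trailing numbers
    if lines:
        lines[0] = normalize_leading_caps(lines[0])
    return "\n".join(lines)
```

Two limitations worth keeping in mind: a verse that legitimately ends in a number would lose it, and a proper name in second position (all caps in the source) would be lowercased, so a real pipeline would probably need extra checks for both cases.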

Determining Similarities

I want a simple and understandable approach/algorithm for determining whether two poems are similar. From there, I can break it down further and get more granular. With that said, this is my first approach.

Each poem is a document, since that is what we will be comparing, which makes the book the corpus. After vectorizing the documents, I will run TF-IDF over all of them. TF-IDF assigns a weight to every word in a document, considering both how often the word appears in that document and how many documents in the corpus contain it. This makes it easier to compare one poem against another using a numeric score.

During this process, I will also be converting each unique word to a numeric representation and storing the results as a dictionary for further use. This is better for both memory and speed when processing my results.
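To make the weighting and the word-to-id mapping concrete, here is a small dependency-free sketch. It is not Gensim's implementation: it uses one common TF-IDF variant (term frequency times the natural log of N / document frequency), and the names `build_vocab` and `tfidf` are hypothetical. Gensim's defaults may differ in log base and normalization.

```python
import math
from collections import Counter


def build_vocab(docs):
    """Map each unique word to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab


def tfidf(docs, vocab):
    """Return one {word_id: weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document it appears in.
        doc_freq.update({vocab[word] for word in doc})

    weighted = []
    for doc in docs:
        counts = Counter(vocab[word] for word in doc)
        weighted.append({
            word_id: (count / len(doc)) * math.log(n_docs / doc_freq[word_id])
            for word_id, count in counts.items()
        })
    return weighted
```

With this variant, a word that appears in every poem gets a weight of zero, which is the behavior that lets rarer, more distinctive words dominate the comparison.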

In this scenario, the TF-IDF will not be enough to compare the documents. In addition to TF-IDF, I will also use synonyms to help determine weight.

Given two poems, P1 and P2, and after normalizing their TF-IDF scores, the rules are as follows:

  • For any word in P1 that is also found in P2, we sum the TF-IDF scores of all instances from both poems.
  • For any word in P1 that has a synonym that is also a synonym to a different word in P2, we sum all TF-IDF scores for those instances and divide by 2.
  • For words without a direct match, and no synonym mapping to a different word, their score does not contribute to the overall similarity score.
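The three rules above can be sketched as a single scoring function. This is one possible reading, under stated assumptions: each poem is a dict of word to normalized TF-IDF weight, `synonyms` is a hypothetical precomputed lookup of word to synonym set (in the project this would come from WordNet), words that already match directly are skipped in the synonym pass, and every synonym-linked pair is counted once.

```python
def similarity(p1, p2, synonyms):
    """Score two poems given {word: weight} dicts and a {word: set} synonym map."""
    score = 0.0

    # Rule 1: words present in both poems contribute both of their weights.
    shared = p1.keys() & p2.keys()
    for word in shared:
        score += p1[word] + p2[word]

    # Rule 2: different words that share a synonym contribute half their sum.
    for w1 in p1.keys() - shared:
        for w2 in p2.keys() - shared:
            if synonyms.get(w1, set()) & synonyms.get(w2, set()):
                score += (p1[w1] + p2[w2]) / 2

    # Rule 3: everything else contributes nothing.
    return score
```

For example, if both poems contain "grave" (weights 0.5 and 0.4) and "sorrow" (0.3) in one poem shares the synonym "sadness" with "grief" (0.6) in the other, the score is 0.9 from the direct match plus 0.45 from the synonym match.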

Technologies To Use

  1. Gensim. Really killer NLP library. I will be using this for a few things during this process: converting strings to vectors, storing those vectors in a reusable dictionary, and building the TF-IDF model.
  2. NLTK. I will be using this package for its WordNet module, which has a fantastic synonym set to map between words.
  3. NetworkX. Python graph library used for, well, graphs. This will help find the open triangle relationships I need for synonym matching between poems.
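As an illustration of what an "open triangle" means here: if a word from one poem and a different word from another poem both connect to the same synonym, but not to each other, the three nodes form a path of length two. The sketch below finds those paths with a plain adjacency dict standing in for a NetworkX graph, so it has no dependencies. The function name `open_triangles` and the edge-list input are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def open_triangles(edges):
    """Return (a, center, b) triples where a-center and center-b are edges but a-b is not."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    adjacency = dict(adjacency)  # freeze: no accidental key creation below

    found = set()
    for center, neighbors in adjacency.items():
        # Sorting gives each triple a single canonical orientation.
        for a, b in combinations(sorted(neighbors), 2):
            if b not in adjacency[a]:
                found.add((a, center, b))
    return found
```

Given edges `("sorrow", "sadness")` and `("grief", "sadness")`, the only open triangle is `("grief", "sadness", "sorrow")`, which is the kind of link the synonym rule relies on.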
