GitHub - Data-Science-for-Linguists-2020/Smoking-Gun-Classification: This repository acts as my term project for LING1340: Data Science for Linguists. In this project, I'll be working on building a classifier that can sort through emails to flag useful information for investigators.

Smoking Gun Classification

About:
This repository is my term project for LING1340: Data Science for Linguists. In this project, I'm building a classifier that sorts through emails and flags useful information for investigators. The data set I'm training the classifier on is a set of released emails from around 150 users--mostly senior employees--at the former American energy company Enron. These emails are useful for many kinds of natural language processing, but the fact that they come from a notoriously corrupt company allows for the domain-specific training that I'm looking for.

Why?:
I was inspired to take this on when I was looking through a catalogue of large data sets. The interesting thing, though, is that most of these emails aren't what you would expect given the lore that surrounds insider trading or cooking the books. They're mostly like, "my kid wants a football for Christmas." But I, like most people, want the juicy stuff! So I want to make a classifier that can effectively separate the signal from the noise.

Data Specifications:

version: May, 2015
size: 1.7GB, nearly 3,500 folders, over 500,000 emails
history: The dataset was originally published by the Federal Energy Regulatory Commission (FERC) during its investigation of Enron. The data was then worked on by the CALO (A Cognitive Assistant that Learns and Organizes) project, based in the AIC (Artificial Intelligence Center), which is part of SRI (Stanford Research Institute). The data was also purchased and reworked by researchers at MIT. I found and accessed the data through former Carnegie Mellon professor William W. Cohen's website, linked below.
organization:
- relative path: maildir/lastname-firstinitial (specific to user)/email directory (specific to user)
  - maildir/arora-h/all_documents
  - maildir/arora-h/deleted_items
  - maildir/arora-h/discussion_threads
  - ...
- other notes: per William Cohen
  'The dataset here does not include attachments, and some messages have been deleted "as part of a redaction effort due to requests from affected employees". Invalid email addresses were converted to something of the form user@enron.com whenever possible (i.e., recipient is specified in some parse-able format like "Doe, John" or "Mary K. Smith") and to no_address@enron.com when no recipient was specified.'
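
The directory layout above can be walked with nothing but the Python standard library. The snippet below is a minimal sketch, not the code used in the notebooks: the function name `iter_messages` and its parameters are hypothetical, and it assumes every file under a user's folder is a single raw email that the `email` module can parse.

```python
import email
from pathlib import Path


def iter_messages(maildir, user=None, folder=None):
    """Yield (user, folder, message) for each email file under maildir.

    Hypothetical helper: `user` (e.g. "arora-h") and `folder`
    (e.g. "all_documents") are optional filters on the layout
    maildir/<lastname-firstinitial>/<email directory>/.
    """
    root = Path(maildir)
    if user:
        user_dirs = [root / user]
    else:
        user_dirs = sorted(p for p in root.iterdir() if p.is_dir())

    for user_dir in user_dirs:
        if not user_dir.is_dir():
            continue  # requested user is not in this copy of the corpus
        if folder:
            folder_dirs = [user_dir / folder]
        else:
            folder_dirs = sorted(p for p in user_dir.iterdir() if p.is_dir())

        for folder_dir in folder_dirs:
            if not folder_dir.is_dir():
                continue  # this user has no folder with that name
            # rglob also picks up emails stored in nested subfolders
            for path in sorted(folder_dir.rglob("*")):
                if path.is_file():
                    msg = email.message_from_bytes(path.read_bytes())
                    yield user_dir.name, folder_dir.name, msg
```

For example, `for user, folder, msg in iter_messages("maildir", user="arora-h"): print(folder, msg["Subject"])` would print the folder and subject line of each of that user's emails, assuming the corpus has been unpacked into a local `maildir` directory.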

Resources:

to read more about the Enron scandal: https://en.wikipedia.org/wiki/Enron#2001_Accounting_scandals
to access the data set: http://www.cs.cmu.edu/~enron/

Directory

Data Sample: This contains one of the folders from the Enron Corpus.
Notebooks: My code for the project. Follow along chronologically!
Pictures: Pictures from various points in the project. Most are summarized in my final report or notebooks.
Project_Details: Contains my presentation, progress reports, and license.
Others: Beyond those, there are some pickles and my guestbook.
