A textual analysis of the travel patterns of the U.S. president
and secretary of state using NMF, PCA, t-SNE, and TF-IDF. I found this dataset
online while preparing a final project for my data science class and
thought the data would be interesting to look at on their own. I also
wanted to practice some of my data analysis skills :)
I scraped the raw dataset from the U.S. Office of the Historian website using
BeautifulSoup and then cleaned it; the cleaned version is the tidy dataset (data/travel_tidy.csv). I primarily
used scikit-learn to analyze the data and seaborn and plotly for visualizations.
A project report with interpretations of the analysis can be found here.
.
├── LICENSE
├── README.md
├── data
│ ├── README.md # Data dictionaries
│ ├── travel_processed.csv # A tidy dataset with a processed textual component
│ ├── travel_raw.csv # A raw scraped dataset
│ └── travel_tidy.csv # A tidy dataset with an unprocessed textual component
├── external
│ ├── LICENSE # License for ctfidf.py
│ ├── README.md # Credit for ctfidf.py
│ └── ctfidf.py # A class for calculating class-based TF-IDF
├── models
│ ├── kmeans.joblib # K-Means saved model
│ ├── nmf.joblib # NMF saved model
│ ├── pca.joblib # PCA saved model
│ ├── tfidf_features.joblib # TF-IDF matrix
│ ├── tfidf_vectorizer.joblib # TF-IDF scikit object
│ └── tsne.joblib # t-SNE saved model
├── notebooks
│ ├── visualize_clusters.ipynb # Visualizing PCA and t-SNE embeddings
│ ├── visualize_exp.ipynb # Exploratory analyses and visualizations
│ └── visualize_tfidf.ipynb # Visualizing TF-IDF results
└── src
  ├── data
  │ ├── preprocess.py # Script for cleaning the textual portion of the data
  │ ├── scrape.py # Script for scraping the raw data
  │ └── tidy.py # Script for cleaning up the dates and separating locales
  └── models
    └── fit_models.py # Script for fitting the models
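
The files under models/ have a .joblib extension, so they can presumably be reloaded with joblib. Below is a small sketch of that save and reload round trip; it uses a throwaway vectorizer and a temporary directory rather than the actual files, and the example sentences are made up. In practice you would point joblib.load at something like models/tfidf_vectorizer.joblib, ideally with the same scikit-learn version that was used to save it.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a throwaway vectorizer on hypothetical text (not the real data).
docs = ["attended economic summit", "state visit and official meetings"]
vectorizer = TfidfVectorizer().fit(docs)

# Save the fitted object to disk and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tfidf_vectorizer.joblib"
    joblib.dump(vectorizer, str(path))
    restored = joblib.load(str(path))

# The restored object can transform new text without refitting.
new_features = restored.transform(["economic summit visit"])
print(new_features.shape)
```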