Welcome! This repository contains scripts that train and apply a machine learning model to classify political advertisements based on their content and determine which political party (Democratic, Republican, or Other) the ads belong to.
This repo is part of the Cross-platform Election Advertising Transparency Initiative (CREATIVE). CREATIVE is an academic research project that has the goal of providing the public with analysis tools for more transparency of political ads across online platforms. In particular, CREATIVE provides cross-platform integration and standardization of political ads collected from Google and Facebook. CREATIVE is a joint project of the Wesleyan Media Project (WMP) and the privacy-tech-lab at Wesleyan University.
To analyze the different dimensions of political ad transparency we have developed an analysis pipeline. The scripts in this repo are part of the Data Classification step in our pipeline.
We provide four party classifier models, each trained with a different algorithm. We recommend (and implement in our scripts) the Random Forest model, which achieved the highest accuracy in our training. Nonetheless, we give you all four models should you wish to use them. The party classifier is trained on a dataset of ads that have already been labeled with the party each ad belongs to.
In addition to the classifier in this repo, we also provide an ad-level party classifier. Unlike the ad-level classifier, which makes a prediction for each individual ad, the classifier in this repo works at the entity level: it analyzes all ads associated with a particular entity (pd_id) collectively.
If you need clear and specific predictions of party affiliation, this classifier is usually preferable to the ad-level classifier: it operates under the assumption that all ads associated with a single pd_id belong to the same party, which leads to more consistent and potentially more accurate predictions when ads are viewed collectively rather than individually.
The scripts in this repo are written in Python. Make sure you have Python installed and set up before continuing; to do so, you can follow the Beginner's Guide to Python. The Python scripts in this repo use Jupyter Notebook as an interface, an interactive environment for Python development. You can install Jupyter Notebook by following the instructions on the Jupyter Notebook website.
To start setting up the repo and run the scripts, first clone this repo to your local directory:
git clone https://github.com/Wesleyan-Media-Project/party_classifier_pdid.git
Then, ensure you have the following software packages installed for Python:
scikit-learn
pandas
numpy
joblib
You can install the required packages by running the following command:
pip install pandas numpy scikit-learn joblib
After installing the required packages, you can run the scripts in the following order:
- 01_create_training_data.ipynb
- 02_training.ipynb
- 03_google2022_inference.ipynb, 03_inference_140m.ipynb, or 03_inference_fb2022.ipynb
To run the above notebook files (ending with .ipynb), you can open the Jupyter Notebook interface by typing the following in your terminal:
jupyter notebook
After you open the Jupyter Notebook interface, you can navigate to the folder where you have cloned the repo and open the script you want to run.
Then, click on the first code cell to select it.
Run each cell sequentially by clicking the Run button or pressing Shift + Enter.
If you want to use the trained model we provide, you only need to run the inference scripts, since the model files are already present in the /models folder.
Note: If you do not want to train a model from scratch, you can use the trained model we provide here, and skip to 2.2.
Please note that to access the data stored on Figshare, you will need to fill out a brief form, after which you will immediately be granted data access.
To run our scripts, you need to have a trained classifier. The script 01_create_training_data.ipynb prepares a training dataset by first reading the ad data fb_2020_140m_adid_var1.csv.gz, which contains the metadata for each ad, and merging it with the WMP entity file wmp_fb_entities_v090622.csv, which contains the party affiliation of each entity that publishes ads on Facebook, based on the pd_id column. This allows the script to associate each ad with a party affiliation. You will need to download fb_2020_140m_adid_var1.csv.gz and make sure that its location on your machine matches the path used in 01_create_training_data.ipynb.
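As a rough illustration of this first step, the merge could look like the sketch below. The file paths and the party column name (party_all) are assumptions for illustration; only pd_id and the file names come from this README.

```python
import pandas as pd

# Ad metadata for the Facebook 2020 1.4m ads (adjust the path to wherever you
# downloaded the file)
ads = pd.read_csv("fb_2020_140m_adid_var1.csv.gz")

# WMP entity file with the party affiliation of each advertiser entity
entities = pd.read_csv("wmp_fb_entities_v090622.csv")

# Attach each entity's party label to its ads via the shared pd_id column.
# "party_all" is a hypothetical name for the party column in the entity file.
labeled_ads = ads.merge(entities[["pd_id", "party_all"]], on="pd_id", how="inner")
```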
Second, the script checks each page ID (page_id) and ensures that all of its associated ads have consistent party affiliations. If a page ID has ads with conflicting party affiliations, the script marks that page ID as non-usable.
Third, the script splits page names into train and test sets with a 70/30 split, so that the pd_ids of a usable page end up in either the train set or the test set, but never both.
Finally, the script prepares the text for training, testing, and inference by filtering out the rows with non-usable page IDs and saving the resulting data frame as a compressed CSV file: 140m_with_page_id_based_training_data.csv.gz. This file contains the ad data, page IDs, party affiliations, and the train-test split information for the usable page IDs.
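Continuing from the hypothetical labeled_ads frame in the sketch above, the consistency check, page-level split, and final export could be sketched as follows. The logic mirrors the description in this section, but the page_id, page_name, and split column names are assumptions, not the exact code of 01_create_training_data.ipynb.

```python
import numpy as np

# A page ID is usable only if all of its ads agree on a single party
parties_per_page = labeled_ads.groupby("page_id")["party_all"].nunique()
usable_pages = parties_per_page[parties_per_page == 1].index
usable = labeled_ads[labeled_ads["page_id"].isin(usable_pages)].copy()

# Split unique page names 70/30 so that every pd_id of a usable page lands
# entirely in either train or test, never both
rng = np.random.default_rng(42)
pages = usable["page_name"].unique()
rng.shuffle(pages)
train_pages = set(pages[: int(0.7 * len(pages))])
usable["split"] = np.where(usable["page_name"].isin(train_pages), "train", "test")

# Save the filtered, labeled, split-annotated data for the training script
usable.to_csv("140m_with_page_id_based_training_data.csv.gz",
              index=False, compression="gzip")
```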
Once our training data is ready, the script 02_training.ipynb loads this training data and trains the party classifier. During the training process, the script trains the following machine learning models and picks the best one (in our case, Random Forest) based on their classification reports:
- Multinomial Naive Bayes (MultinomialNB)
- Logistic Regression
- Support Vector Machine (SVM)
- Random Forest
All models are saved in the /models folder. The best model, which is later used to run inference on new data, is party_clf_pdid_rf.joblib.
The best-performing model is the Random Forest classifier with the following hyperparameters, which we found using grid search cross-validation (see the sketch after the list):
- 'clf__max_depth': 25
- 'clf__max_features': 0.1
- 'clf__n_estimators': 500
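The clf__ prefixes indicate that the estimator is a step named clf inside a scikit-learn Pipeline. A minimal sketch of how such a pipeline and grid search could be set up is shown below; the TF-IDF step, the text and party_all column names, and the grid values beyond those listed above are assumptions, not the exact code of 02_training.ipynb.

```python
import pandas as pd
from joblib import dump
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Load the training rows produced by 01_create_training_data.ipynb
data = pd.read_csv("140m_with_page_id_based_training_data.csv.gz")
train = data[data["split"] == "train"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),       # turn ad text into features (assumed step)
    ("clf", RandomForestClassifier()),  # the "clf" step the hyperparameters refer to
])

param_grid = {
    "clf__max_depth": [10, 25, 50],
    "clf__max_features": [0.1, 0.3],
    "clf__n_estimators": [100, 500],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(train["text"], train["party_all"])  # hypothetical column names

dump(search.best_estimator_, "models/party_clf_pdid_rf.joblib")
```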
Here is the model performance on the held-out test set:
|              | precision | recall | f1-score | support |
| ------------ | --------- | ------ | -------- | ------- |
| DEM          | 0.843     | 0.941  | 0.889    | 491     |
| OTHER        | 1.000     | 0.091  | 0.167    | 44      |
| REP          | 0.887     | 0.851  | 0.869    | 424     |
| accuracy     |           |        | 0.862    | 959     |
| macro avg    | 0.910     | 0.628  | 0.642    | 959     |
| weighted avg | 0.870     | 0.862  | 0.847    | 959     |
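A report in this format can be generated for the held-out test rows with scikit-learn's classification_report; the sketch below reuses the hypothetical column names from the earlier sketches.

```python
import pandas as pd
from joblib import load
from sklearn.metrics import classification_report

# Held-out test rows from the training data file
data = pd.read_csv("140m_with_page_id_based_training_data.csv.gz")
test = data[data["split"] == "test"]

# Evaluate the saved Random Forest model on the test split
clf = load("models/party_clf_pdid_rf.joblib")
print(classification_report(test["party_all"], clf.predict(test["text"]), digits=3))
```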
Please note that to access the data stored on Figshare, you will need to fill out a brief form, after which you will immediately be granted data access.
After the training, the scripts 03_google2022_inference.ipynb, 03_inference_140m.ipynb, and 03_inference_fb2022.ipynb are used to apply the trained model to different datasets. Their outputs are saved to party_all_clf_google_2022_advertiser_id.csv, party_all_clf_pdid_fb_2020_140m.csv, and party_all_clf_pdid_fb_2022.csv, respectively. Here are the input files you need for each of these inference scripts (a minimal inference sketch follows the list):
- For Facebook 2020: fb_2020_140m_adid_text_clean.csv.gz and fb_2020_140m_adid_var1.csv.gz. Note that these files will be made available when they are ready.
- For Google 2022: g2022_adid_01062021_11082022_text.csv.gz
- For Facebook 2022: fb_2022_adid_text.csv.gz and fb_2022_adid_var1.csv.gz
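Because the classifier works at the pd_id level, inference roughly amounts to pooling the text of all ads that belong to one entity and predicting once per entity. Below is a hedged sketch against the Facebook 2022 inputs; the ad_id, ad_creative_body, and party_all_clf column names are assumptions for illustration.

```python
import pandas as pd
from joblib import load

text = pd.read_csv("fb_2022_adid_text.csv.gz")
var = pd.read_csv("fb_2022_adid_var1.csv.gz")

# Map each ad to its entity, then pool all ad text belonging to one pd_id
ads = text.merge(var[["ad_id", "pd_id"]], on="ad_id")
per_entity = (ads.groupby("pd_id")["ad_creative_body"]
                 .apply(lambda s: " ".join(s.dropna()))
                 .reset_index())

# Predict one party label per entity with the trained Random Forest model
clf = load("models/party_clf_pdid_rf.joblib")
per_entity["party_all_clf"] = clf.predict(per_entity["ad_creative_body"])
per_entity[["pd_id", "party_all_clf"]].to_csv("party_all_clf_pdid_fb_2022.csv", index=False)
```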
Note: If you would like to use a model other than Random Forest, you can simply change the model file loaded in the inference scripts. For instance, if you want to use the SVM model, replace the following line in the inference scripts:
mnb_clf = load('models/party_clf_pdid_rf.joblib')
with this:
mnb_clf = load('models/party_clf_pdid_svm.joblib')
We would like to thank our supporters!
This material is based upon work supported by the National Science Foundation under Grant Numbers 2235006, 2235007, and 2235008.
The Cross-Platform Election Advertising Transparency Initiative (CREATIVE) is a joint infrastructure project of the Wesleyan Media Project and privacy-tech-lab at Wesleyan University in Connecticut.