This repo contains a collection of Python scripts for preprocessing text data, training classifiers on it and predicting classes. Three scripts are involved in this process, all located in the src folder: vectorizer.py, which preprocesses the text data and extracts features using tf-idf, and two classification scripts, logistic_regression.py and neural_network.py, which use logistic regression and an MLP classifier, respectively, for binary classification and then save a classification report for analysis or other use.
The text data used for this project is the Fake News Dataset. It consists of 10556 news articles, each containing a title, article text and label. All articles are either real or fake news, and the task is to predict the true label of articles based on the article text.
The vectorizer.py script is run first to extract features. The script follows these steps:
- Import dependencies
- Load data located in the in folder
- Initialize vectorizer
- Vectorize data using tf-idf
- Save the vectorizer in the models folder
- Save the preprocessed data in the in folder
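The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of vectorizer.py: it assumes scikit-learn's TfidfVectorizer and train_test_split, uses a tiny made-up corpus in place of fake_or_real_news.csv, and writes to a temporary folder. The file name tfidf_vectorizer.pkl and the use of pickle are hypothetical. The vectorizer settings mirror the defaults documented in this README.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up stand-in corpus; the real script loads a .csv from the "in" folder.
texts = [
    "the senate passed the budget bill",
    "aliens endorse the budget bill",
    "the senate votes on the health bill",
    "celebrity clone runs for senate",
    "court rules on the health bill",
    "miracle cure hidden by the court",
]
labels = ["REAL", "FAKE", "REAL", "FAKE", "REAL", "FAKE"]

# Hold out part of the data for testing (the README default is 0.2;
# 0.33 is used here only because the toy corpus is so small).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42
)

# Initialize the vectorizer with the documented default settings.
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),
    lowercase=True,
    max_df=0.95,
    min_df=0.05,
    max_features=500,
)

# Fit on the training split only, then reuse the fitted vocabulary on the test split.
X_train_feats = vectorizer.fit_transform(X_train)
X_test_feats = vectorizer.transform(X_test)

# Save the fitted vectorizer (hypothetical file name, temporary folder).
out_dir = Path(tempfile.mkdtemp())
with open(out_dir / "tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

print(X_train_feats.shape, X_test_feats.shape)
```

Fitting on the training split alone keeps test-set vocabulary and document frequencies from leaking into the features.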
The logistic_regression.py and neural_network.py scripts follow these steps:
- Import dependencies
- Load preprocessed data from the vectorizer.py script
- Initialize the model (logistic regression or MLP classifier)
- Fit the model to the data
- Use the model to predict labels
- Save the model to the models folder
- Save a classification report to the out folder
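As a rough sketch of the classification steps above, the example below fits both model types and writes a classification report for each. It is an assumption-laden illustration rather than the repo's code: it assumes scikit-learn's LogisticRegression, MLPClassifier and classification_report, substitutes synthetic features from make_classification for the tf-idf features, and writes to a temporary folder with hypothetical file names. Saving the fitted models is left out for brevity. The MLP settings follow the documented defaults (one layer of 5 nodes, 1000 iterations).

```python
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the tf-idf features produced by vectorizer.py.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One entry per classification script.
models = {
    "logistic_regression": LogisticRegression(),
    "neural_network": MLPClassifier(
        hidden_layer_sizes=(5,), max_iter=1000, random_state=42
    ),
}

out_dir = Path(tempfile.mkdtemp())  # stands in for the "out" folder
reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)        # fit the model to the data
    y_pred = model.predict(X_test)     # predict labels for the test split
    reports[name] = classification_report(
        y_test, y_pred, target_names=["FAKE", "REAL"]
    )
    # Hypothetical report file name.
    (out_dir / f"{name}_report.txt").write_text(reports[name])
    print(f"Classification report for {name}\n{reports[name]}")
```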
The code is tested on Python 3.11.2. Furthermore, if your OS is not UNIX-based, a bash-compatible terminal (such as Git for Windows) is required for running the shell scripts.
The repo is set up to work with Windows (the WIN_ files) as well as macOS and Linux (the MACL_ files).
git clone https://github.com/alekswael/text_classification_with_MLP_LR
cd text_classification_with_MLP_LR
NOTE: Depending on your OS, run either WIN_setup.sh or MACL_setup.sh.
The setup script does the following:
- Creates a virtual environment for the project
- Activates the virtual environment
- Installs the correct versions of the packages required
- Deactivates the virtual environment
bash WIN_setup.sh
NOTE: Depending on your OS, run either the WIN_run_*.sh or the MACL_run_*.sh scripts.
NOTE: Make sure to run the vectorizer script first!
The script does the following:
- Activates the virtual environment
- Runs vectorizer.py located in the src folder
- Deactivates the virtual environment
bash WIN_run_vectorizer.sh
Then, you can run either the *nn.sh or *lr.sh script.
Each script does the following:
- Activates the virtual environment
- Runs logistic_regression.py or neural_network.py located in the src folder
- Deactivates the virtual environment
bash WIN_run_lr.sh
bash WIN_run_nn.sh
Some model parameters can be set through the argparse module. However, this requires either running the Python script separately or altering the corresponding *_run_*.sh file to include the arguments. The Python scripts are located in the src folder. Make sure to activate the virtual environment before running a Python script directly.
vectorizer.py [-h] [--data DATA] [--out OUT] [--text TEXT] [--label LABEL] [--test_size TEST_SIZE] [--ngram_range NGRAM_RANGE] [--lowercase LOWERCASE] [--max_df MAX_DF] [--min_df MIN_DF] [--max_features MAX_FEATURES]
options:
-h, --help show this help message and exit
--data DATA Name of data file, should be a .csv (default: fake_or_real_news.csv)
--out OUT Folder where vectorizer is saved (default: models)
--text TEXT Name of text (X) column in data file (default: text)
--label LABEL Name of label (y) column in data file (default: label)
--test_size TEST_SIZE
Size of test split, int between 0-1. (default: 0.2)
--ngram_range NGRAM_RANGE
Ngram range for vectorizer, two digits seperated by a comma. NB: NO SPACES ALLOWED (default: 1,2)
--lowercase LOWERCASE
If the data should be transformed to lowercase (default: True)
--max_df MAX_DF Specify max_df parameter (default: 0.95)
--min_df MIN_DF Specify min_df parameter (default: 0.05)
--max_features MAX_FEATURES
Specify max number of features for vectorizer (default: 500)
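To illustrate how a comma-separated value such as the --ngram_range default of 1,2 can become a tuple, here is a minimal argparse sketch. The helper parse_ngram_range is hypothetical and only a subset of the options is reproduced; the actual parsing in vectorizer.py may differ. The no-spaces rule is consistent with this approach, since an unquoted space would make the shell split the value into two separate arguments.

```python
import argparse


def parse_ngram_range(value: str) -> tuple[int, int]:
    """Turn a string such as "1,2" into the tuple (1, 2)."""
    low, high = (int(part) for part in value.split(","))
    return low, high


# Only three of the documented options are reproduced here.
parser = argparse.ArgumentParser(prog="vectorizer.py")
parser.add_argument("--ngram_range", type=parse_ngram_range, default=(1, 2))
parser.add_argument("--test_size", type=float, default=0.2)
parser.add_argument("--max_features", type=int, default=500)

# Equivalent to: python vectorizer.py --ngram_range 1,3 --test_size 0.25
args = parser.parse_args(["--ngram_range", "1,3", "--test_size", "0.25"])
print(args.ngram_range, args.test_size, args.max_features)
# (1, 3) 0.25 500
```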
neural_network.py [-h] [--nodes_layers NODES_LAYERS] [--max_iter MAX_ITER]
options:
-h, --help show this help message and exit
--nodes_layers NODES_LAYERS
Number of nodes per layer. Default is one layer of 5 nodes. For one layer, do not include comma at end! NB: NO SPACES ALLOWED (default: 5)
--max_iter MAX_ITER Number of iterations. Default is 1000. (default: 1000)
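The --nodes_layers format can be understood with a small sketch. The helper parse_nodes_layers below is hypothetical and assumes the value is split on commas and converted to integers; under that assumption, a trailing comma leaves an empty string that cannot be converted, which would explain the warning in the help text.

```python
def parse_nodes_layers(value: str) -> tuple[int, ...]:
    """Turn "5" into (5,) and "10,5" into (10, 5): one entry per hidden layer."""
    return tuple(int(part) for part in value.split(","))


print(parse_nodes_layers("5"))     # one hidden layer of 5 nodes
print(parse_nodes_layers("10,5"))  # two hidden layers of 10 and 5 nodes

# A trailing comma produces an empty piece, and int("") raises ValueError.
try:
    parse_nodes_layers("5,")
except ValueError as err:
    print(f"invalid value: {err}")
```

A tuple of this shape is what an MLP implementation such as scikit-learn's MLPClassifier accepts as hidden_layer_sizes, though whether the script passes it on this way is an assumption.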
This repository has the following structure:
│ .gitignore
│ MACL_run_lr.sh
│ MACL_run_nn.sh
│ MACL_run_vectorizer.sh
│ MACL_setup.sh
│ README.md
│ requirements.txt
│ WIN_run_lr.sh
│ WIN_run_nn.sh
│ WIN_run_vectorizer.sh
│ WIN_setup.sh
│
├───.github
│ .keep
│
├───in
│ fake_or_real_news.csv
│
├───models
│ .gitkeep
│
├───out
│ .gitkeep
│
├───src
│ .gitkeep
│ logistic_regression.py
│ neural_network.py
│ vectorizer.py
│
└───__pycache__
As the classification reports below show, the two methods perform very similarly, both reaching 89% accuracy. In such a case, it makes sense to pick the less compute-intensive model for the task.
Classification report for logistic regression
              precision    recall  f1-score   support
        FAKE       0.89      0.88      0.88       628
        REAL       0.88      0.90      0.89       639
    accuracy                           0.89      1267
   macro avg       0.89      0.89      0.89      1267
weighted avg       0.89      0.89      0.89      1267
Classification report for neural network
              precision    recall  f1-score   support
        FAKE       0.91      0.87      0.89       628
        REAL       0.88      0.91      0.90       639
    accuracy                           0.89      1267
   macro avg       0.89      0.89      0.89      1267
weighted avg       0.89      0.89      0.89      1267
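As a quick sanity check on the two reports, overall accuracy equals the support-weighted average of the per-class recall values. The sketch below recomputes it from the rounded figures above, so the results are approximate; the variable names are illustrative only.

```python
# Support and per-class recall copied from the two reports (rounded values).
support = {"FAKE": 628, "REAL": 639}
recall = {
    "logistic regression": {"FAKE": 0.88, "REAL": 0.90},
    "neural network": {"FAKE": 0.87, "REAL": 0.91},
}

total = sum(support.values())  # 1267 test articles

# Accuracy = correctly classified articles / all articles,
# where correct articles per class = recall * support.
accuracy = {
    name: sum(r[label] * support[label] for label in support) / total
    for name, r in recall.items()
}

for name, acc in accuracy.items():
    print(f"{name}: {acc:.2f}")
# logistic regression: 0.89
# neural network: 0.89
```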