Text classification benchmarks

Author: Aleksander Moeslund Wael

About the project

This repo contains a collection of Python scripts for preprocessing text data, training classifiers and predicting class labels. Three scripts are involved, all located in the src folder. vectorizer.py preprocesses the text data and extracts features using tf-idf. The two classification scripts, logistic_regression.py and neural_network.py, perform binary classification with logistic regression and an MLP classifier respectively, and each saves a classification report for analysis or other use.

Data

The text data used for this project is the Fake News Dataset. It consists of 10556 news articles, each containing a title, article text and label. All articles are either real or fake news, and the task is to predict the true label of articles based on the article text.

Pipeline

The vectorizer.py script is run first to extract features. The script follows these steps:

Import dependencies
Load data located in the in folder
Initialize vectorizer
Vectorize data using tf-idf
Save the vectorizer in the models folder
Save the preprocessed data in the in folder
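
To make the "Vectorize data using tf-idf" step concrete, here is a minimal plain-Python sketch of tf-idf weighting. It is an illustration only, not the code in vectorizer.py: the function name tfidf, the whitespace tokenizer and the unsmoothed tf * log(N / df) formula are all assumptions, and library vectorizers typically add smoothing, normalization and n-gram handling on top.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df).

    Hypothetical helper for illustration; not taken from vectorizer.py.
    """
    # Lowercase and split on whitespace (a deliberately naive tokenizer).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Term frequency (share of the document) times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the senate passed the budget",
    "the aliens passed the senate",
    "the budget shocked everyone",
]
weights = tfidf(docs)

# "the" occurs in every document, so it carries no weight;
# "aliens" occurs in only one document, so it outweighs "senate".
print(weights[0]["the"], weights[1]["aliens"] > weights[1]["senate"])
```

Terms that appear in nearly every article contribute little to the features, while terms concentrated in a few articles are weighted up.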

The logistic_regression.py and neural_network.py scripts follow these steps:

Import dependencies
Load preprocessed data from the vectorizer.py script
Initialize the model (logistic regression or MLP classifier)
Fit the model to the data
Use model to predict labels
Save the model to the models folder
Save a classification report to the out folder

Requirements

The code is tested on Python 3.11.2. Furthermore, if your OS is not UNIX-based, a bash-compatible terminal (such as Git for Windows) is required for running the shell scripts.

Usage

The repo is set up to work with Windows (the WIN_ files) and with macOS and Linux (the MACL_ files).

1. Clone repository to desired directory

git clone https://github.com/alekswael/text_classification_with_MLP_LR
cd text_classification_with_MLP_LR

2. Run setup script

NOTE: Depending on your OS, run either WIN_setup.sh or MACL_setup.sh.

The setup script does the following:

Creates a virtual environment for the project
Activates the virtual environment
Installs the correct versions of the packages required
Deactivates the virtual environment

bash WIN_setup.sh

3. Run pipeline

NOTE: Depending on your OS, run either the WIN_ or the MACL_ version of each run script.

NOTE: Make sure to run the vectorizer script first!

The vectorizer run script does the following:

Activates the virtual environment
Runs vectorizer.py located in the src folder
Deactivates the virtual environment

bash WIN_run_vectorizer.sh

Then, you can run either the *nn.sh or *lr.sh script.

Each script does the following:

Activates the virtual environment
Runs logistic_regression.py or neural_network.py located in the src folder
Deactivates the virtual environment

bash WIN_run_lr.sh
bash WIN_run_nn.sh

Note on model tweaks

Some model parameters can be set from the command line through the argparse module. However, this requires either running the Python script separately or altering the corresponding run script (e.g. WIN_run_nn.sh) to include the arguments. The Python scripts are located in the src folder. Make sure to activate the virtual environment before running a Python script directly.

vectorizer.py [-h] [--data DATA] [--out OUT] [--text TEXT] [--label LABEL] [--test_size TEST_SIZE] [--ngram_range NGRAM_RANGE] [--lowercase LOWERCASE] [--max_df MAX_DF] [--min_df MIN_DF] [--max_features MAX_FEATURES]

options:
  -h, --help            show this help message and exit
  --data DATA           Name of data file, should be a .csv (default: fake_or_real_news.csv)
  --out OUT             Folder where vectorizer is saved (default: models)
  --text TEXT           Name of text (X) column in data file (default: text)
  --label LABEL         Name of label (y) column in data file (default: label)
  --test_size TEST_SIZE
                        Size of test split, int between 0-1. (default: 0.2)
  --ngram_range NGRAM_RANGE
                        Ngram range for vectorizer, two digits seperated by a comma. NB: NO SPACES ALLOWED (default: 1,2)
  --lowercase LOWERCASE
                        If the data should be transformed to lowercase (default: True)
  --max_df MAX_DF       Specify max_df parameter (default: 0.95)
  --min_df MIN_DF       Specify min_df parameter (default: 0.05)
  --max_features MAX_FEATURES
                        Specify max number of features for vectorizer (default: 500)
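
As an illustration of how a comma-separated option such as --ngram_range can be handled, here is a minimal argparse sketch. The converter name parse_ngram_range is hypothetical and the actual parsing in vectorizer.py may differ; only the option names and defaults are taken from the help text above. It also shows why spaces are not allowed: the shell would split "1, 2" into two separate arguments before argparse ever sees them.

```python
import argparse


def parse_ngram_range(value):
    """Turn a string such as '1,2' into the tuple (1, 2). Hypothetical helper."""
    parts = value.split(",")
    if len(parts) != 2:
        raise argparse.ArgumentTypeError(
            "expected two integers separated by a comma, e.g. 1,2"
        )
    try:
        return (int(parts[0]), int(parts[1]))
    except ValueError:
        raise argparse.ArgumentTypeError(
            "both parts of the n-gram range must be integers"
        ) from None


parser = argparse.ArgumentParser(prog="vectorizer.py")
# Only two of the documented options are reproduced here.
parser.add_argument("--ngram_range", type=parse_ngram_range, default=(1, 2))
parser.add_argument("--max_features", type=int, default=500)

# Equivalent to: python src/vectorizer.py --ngram_range 1,3 --max_features 1000
args = parser.parse_args(["--ngram_range", "1,3", "--max_features", "1000"])
print(args.ngram_range, args.max_features)
```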

neural_network.py [-h] [--nodes_layers NODES_LAYERS] [--max_iter MAX_ITER]

options:
  -h, --help            show this help message and exit
  --nodes_layers NODES_LAYERS
                        Number of nodes per layer. Default is one layer of 5 nodes. For one layer, do not include comma at end! NB: NO SPACES ALLOWED (default: 5)
  --max_iter MAX_ITER   Number of iterations. Default is 1000. (default: 1000)
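
The --nodes_layers format can be read as one integer per hidden layer. The sketch below shows one way such a string could be converted into a tuple of layer sizes, and why a trailing comma causes trouble. The function name parse_nodes_layers is hypothetical, and whether neural_network.py parses the value this way is an assumption.

```python
def parse_nodes_layers(value):
    """Turn '5' into (5,) and '10,5' into (10, 5). Hypothetical helper."""
    return tuple(int(part) for part in value.split(","))


one_layer = parse_nodes_layers("5")      # a single hidden layer of 5 nodes
two_layers = parse_nodes_layers("10,5")  # two hidden layers: 10 nodes, then 5

# A trailing comma leaves an empty string after the split,
# which int() cannot convert.
try:
    parse_nodes_layers("5,")
    trailing_comma_ok = True
except ValueError:
    trailing_comma_ok = False

print(one_layer, two_layers, trailing_comma_ok)
```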

Repository structure

This repository has the following structure:

│   .gitignore
│   MACL_run_lr.sh
│   MACL_run_nn.sh
│   MACL_run_vectorizer.sh
│   MACL_setup.sh
│   README.md
│   requirements.txt
│   WIN_run_lr.sh
│   WIN_run_nn.sh
│   WIN_run_vectorizer.sh
│   WIN_setup.sh
│   
├───.github
│       .keep
│       
├───in
│       fake_or_real_news.csv
│       
├───models
│       .gitkeep
│       
├───out
│       .gitkeep
│
├───src
│       .gitkeep
│       logistic_regression.py
│       neural_network.py
│       vectorizer.py
│
└───__pycache__

Report on findings

As the classification reports below show, the two methods perform very similarly, both reaching 89% accuracy on the 1267 test articles. When performance is this close, it makes sense to pick the less compute-intensive model for the task.

Classification report for logistic regression

              precision    recall  f1-score   support

        FAKE       0.89      0.88      0.88       628
        REAL       0.88      0.90      0.89       639

    accuracy                           0.89      1267
   macro avg       0.89      0.89      0.89      1267
weighted avg       0.89      0.89      0.89      1267

Classification report for neural network

              precision    recall  f1-score   support

        FAKE       0.91      0.87      0.89       628
        REAL       0.88      0.91      0.90       639

    accuracy                           0.89      1267
   macro avg       0.89      0.89      0.89      1267
weighted avg       0.89      0.89      0.89      1267
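
As a sanity check on the two reports, overall accuracy equals the support-weighted average of the per-class recall values. The sketch below recomputes it from the numbers printed above. The function name accuracy_from_recall is hypothetical, and because the recall values are already rounded to two decimals, the result is only approximate.

```python
def accuracy_from_recall(recall_by_class, support_by_class):
    """Accuracy as the support-weighted mean of per-class recall."""
    total = sum(support_by_class.values())
    # recall * support approximates the number of correctly classified items.
    correct = sum(
        recall_by_class[label] * support_by_class[label]
        for label in support_by_class
    )
    return correct / total


support = {"FAKE": 628, "REAL": 639}

lr_accuracy = accuracy_from_recall({"FAKE": 0.88, "REAL": 0.90}, support)
nn_accuracy = accuracy_from_recall({"FAKE": 0.87, "REAL": 0.91}, support)

print(f"{lr_accuracy:.2f} {nn_accuracy:.2f}")  # prints "0.89 0.89"
```

Both models land on the same rounded accuracy even though their per-class recall differs slightly.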
