This repository contains the solution of my team for the Facebook Hateful Memes Competition hosted on DrivenData.
Competition paper: https://arxiv.org/abs/2005.04790?fbclid=IwAR3hdA_-nPAM7DdZ5oBW-B48NYD-pA2aRkxDm43ljfwLrRXNkF7re_bzBaQ
DrivenData: https://www.drivendata.org/competitions/70/hateful-memes-phase-2/page/266/
Facebook AI: https://ai.facebook.com/blog/hateful-memes-challenge-and-data-set/
Our system achieves:
- Cross-Validation score over 10-fold: 0.882 AUC and 0.813 acc.
- Test unseen score: 0.788 AUC and 0.745 acc.
Leaderboard position: 7th (and 2nd by accuracy metric)
Team name: MemeLords
Team members:
- C.S Bahusruth
- Victor Callejas Fuentes
The aim of this competition is to create an algorithm that identifies multimodal hate speech in memes.
The dataset is constructed in a way that the algorithm must be multimodal to excel. The dataset consists of pairs of text-images.
The competition is a binary classification task: hateful or not hateful.
The metric to optimize is the Area Under the ROC Curve.
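For reference, the competition metric can be computed with scikit-learn's roc_auc_score; the labels and probabilities below are toy values:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]           # ground-truth labels (0 = not hateful, 1 = hateful)
y_prob = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the hateful class
print(roc_auc_score(y_true, y_prob))  # 0.75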
Our full methodology and findings will be released in a paper soon (maybe).
Python 3.7.6
- (Recommended) Create a virtual environment:
python -m virtualenv env
Activate the virtual environment:
source env/bin/activate
- Install dependencies
pip install -r requirements.txt
- Download raw data
Go to the competition page, download and extract raw data into:
data/raw/
- You can generate the processed data by the methods described below or download it from here:
and extract it into:
data/
You can train the models or download our pretrained ones provided here:
- ensemble (154 MB)
- level-0 (22 GB)
- VQA Multimodal checkpoints (2 GB)
and extract into:
artifacts/
.
├── artifacts # Model checkpoints with logs and predictions
├── data
├── src # Source files
├── notebooks # Source files in Jupyter notebook format
├── submissions # Submissions to competition leaderboard
├── LICENSE
├── README.md
└── requirements.txt # Frozen pip dependencies
.
├── ensemble # Meta-classifier model checkpoints
├── level-0 # Level-0 models
| ├── DistilBert
| | ├── model # Model and optimizer checkpoint for each fold
| | ├── preds # Test and validation predictions for each fold
| | └── logs... # Logs and results for each fold
| └── Roberta...
└── MM # Multimodal pre-trained checkpoints from VQA repository
.
├── BUTD_features # Image features extracted from Faster R-CNN
├── external # External data
├── folds # K-Folds created
├── interim # Partially processed data
├── processed # Processed data
└── raw # Raw competition data
.
├── config # Run and model configurations
├── data # Data preprocessing and data loaders
├── models # Model definitions
├── utils # Training and model auxiliary files
└── train.py # Training entry point
We used a cross-validation strategy in order to choose our best level-0 models and parameters.
The folds can be generated with the script provided at:
notebooks/folds.ipynb
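If you prefer a script, below is a minimal sketch of how the folds could be generated, assuming the raw train.jsonl with a binary label column and a hypothetical data/folds/folds.csv output; the notebook above remains the reference:

from pathlib import Path

import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.read_json("data/raw/train.jsonl", lines=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for fold, (_, val_idx) in enumerate(skf.split(df, df["label"])):
    df.loc[val_idx, "fold"] = fold  # mark the validation fold of each example

Path("data/folds").mkdir(parents=True, exist_ok=True)
df.to_csv("data/folds/folds.csv", index=False)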
Once the best parameters are chosen, we perform a final training on all data [train + dev]. Stochastic Weight Averaging (SWA) is key to avoid overfitting.
This behavior is chosen in the run configuration file.
The cross-validation strategy is as follows:
Level 1
- Divide train data into K-folds
- Train some models
- For each model, we save the validation and test features whenever the validation AUC improves (see the sketch after this list)
Level 2
- The saved validation features from each model become the training data; we use the same folds as in level 1
- For each fold, we train a meta-classifier
Level 3
- Final predictions are the median of the probabilities of each meta-classifier
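Below is a minimal, hypothetical sketch of the level-1 bookkeeping (saving the validation and test outputs only when the validation AUC improves); the per-epoch outputs are random stand-ins for what a level-0 model would produce:

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=200)  # stand-in validation labels

best_auc, best_outputs = 0.0, None
for epoch in range(5):
    # stand-ins for the model's validation/test probabilities and features at this epoch
    val_probs = rng.random(200)
    val_feats, test_feats = rng.normal(size=(200, 8)), rng.normal(size=(1000, 8))
    auc = roc_auc_score(y_val, val_probs)
    if auc > best_auc:  # checkpoint the features only on AUC improvement
        best_auc = auc
        best_outputs = {"val_feats": val_feats, "test_feats": test_feats}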
Captions: we use them as they are provided in the dataset.
Image features: extracted with the Bottom-up Attention repository, which uses a Faster R-CNN trained on the Visual Genome dataset. We ran this on Google Colab: notebook
Objects and attributes: the image features converted to words using the Visual Genome dictionary
Web entities: we extract these with the script:
cd src
python utils/web_entities.py
Topic: from the web entities, we retrieve internet knowledge about them using the DuckDuckGo API
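Below is a hypothetical sketch of the topic lookup, assuming the DuckDuckGo Instant Answer API and its AbstractText/RelatedTopics fields; the actual query logic in our script may differ:

import requests

def get_topic(web_entity: str) -> str:
    resp = requests.get(
        "https://api.duckduckgo.com/",
        params={"q": web_entity, "format": "json", "no_html": 1},
        timeout=10,
    )
    data = resp.json()
    if data.get("AbstractText"):
        return data["AbstractText"]
    # fall back to the first related topic when no abstract is available
    related = data.get("RelatedTopics", [])
    return related[0].get("Text", "") if related else ""

print(get_topic("Bethany Hamilton"))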
All of these are combined to generate inputs for the models in these notebooks:
notebooks/prepro.ipynb
notebooks/generate_vision.ipynb
We developed two types of base models: one that directly uses features from a pre-trained Faster R-CNN, as UNITER, VisualBERT, etc. do (Pure Multimodal), and another where we remove the linear projection from the Faster R-CNN feature space to the transformer embedding space by directly using the Visual Genome objects and attributes dictionary (Multimodal Text).
These are models like UNITER, LXMERT, VisualBERT...
We use the Transformers-VQA repository.
The best results are achieved with UNITER.
For these models, you need to provide text and image features.
Text: combinations of the generated text data. For example:
- captions
- captions + web entities
Image features: features extracted from a Faster R-CNN; see the Data section.
These are models derived from BERT.
The input of these models is just text: combinations of the generated text data (see the sketch after this list). For example:
- captions + frcnn objects + frcnn attributes + web entities + topic...
- captions + frcnn objects
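Below is a hypothetical sketch of how such a text input could be assembled; the field names and the [SEP] separator are assumptions, and the real preprocessing lives in notebooks/prepro.ipynb:

def build_text_input(example: dict) -> str:
    parts = [
        example["text"],                        # meme caption
        " ".join(example.get("objects", [])),   # Faster R-CNN objects as words
        " ".join(example.get("attributes", [])),
        example.get("web_entities", ""),
        example.get("topic", ""),
    ]
    return " [SEP] ".join(p for p in parts if p)

sample = {"text": "look at this", "objects": ["hair", "face"], "topic": "Shark attack victims"}
print(build_text_input(sample))  # look at this [SEP] hair face [SEP] Shark attack victims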
We used the transformers library to try different transformer models; we achieved the best results with ERNIE, DistilBERT and DistilRoBERTa.
The meta classifier consists of a simple dense linear layer over the features extracted from the level-0 models.
In the case of K-fold, there will be K meta-classifiers and the final probability will be a simple median over their probabilities, this helps predictions to be robust.
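A hypothetical PyTorch sketch of the meta-classifier and the level-3 median; the feature dimension, number of folds and tensors are placeholders:

import numpy as np
import torch
import torch.nn as nn

class MetaClassifier(nn.Module):
    # a single dense linear layer over the concatenated level-0 features
    def __init__(self, in_dim: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x)).squeeze(-1)  # hateful probability

# level 3: the final probability is the median over the K per-fold meta-classifiers
metas = [MetaClassifier(in_dim=5) for _ in range(10)]
x_test = torch.randn(100, 5)
with torch.no_grad():
    final_probs = np.median(np.stack([m(x_test).numpy() for m in metas]), axis=0)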
The best cross-validation and test unseen scores are achieved by stacking these 5 models:
- Pure Multimodal Uniter (captions and image features)
- Pure Multimodal Uniter (extended captions and image features)
- Multimodal Text Ernie (extended captions)
- Multimodal Text Distilbert (extended captions)
- Multimodal Text Distilroberta (extended captions)
You can use these models directly for inference or later in an ensemble.
- Modify configuration files
  - Modify cfg.py
  - Modify model configuration files BertConfig.py and MMConfig.py
- Execute training
  cd src  # must be run inside this directory
  python train.py
Logs will be output in the terminal and saved in the artifacts folder for further inspection
Run notebook
notebooks/Stacking.ipynb
Here we expose some learnings and tips.
Using Visual Genome labels (words) instead of features (2048 dimensional vector) allows us to reduce the model complexity.
These labels are already in the same embedding space as captions, so you do not depend on the projection of the features.
There is a loss of information in doing this, but it helps with overfitting.
Some of the examples in this dataset are very difficult to classify with current approaches, because further context is needed to make a good classification.
For example, for the image with id = 16395:
- Faster R-CNN trained on Visual Genome: chin long hair face nose eyebrow hair hairstyle facial expression blond
- Web entities: Bethany Hamilton
- Topic: Shark attack victims
With only the features extracted from the Faster R-CNN, the model cannot make good predictions on some examples due to lack of context.
Most of the time the Faster R-CNN features are the best and the web entities and topics fail, but the combination of them works best.
These web entities are constructed by searching for similar images on the internet.
We assert the necessity of a new architecture that is able to retain information scraped from internet knowledge.
SWA allows us to maintain the best results across multiple epochs, which lets us perform the final training (without validation) more confidently.
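A minimal sketch of how SWA can be applied with PyTorch's built-in torch.optim.swa_utils; the model, data and schedule below are placeholders, not our actual configuration:

import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
swa_model = AveragedModel(model)
swa_scheduler = SWALR(optimizer, swa_lr=1e-3)
loader = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(8)]  # dummy data

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        optimizer.step()
    if epoch >= 5:  # start averaging weights late in training
        swa_model.update_parameters(model)
        swa_scheduler.step()

update_bn(loader, swa_model)  # refresh BatchNorm statistics (no-op for this toy model)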
Different models have different tokenizers and pretraining methods, so each of them can extract information that the others cannot, and vice versa; the combination of them therefore achieves the best results.
FP16 training should be the norm and not the exception. It allows us to achieve the same results as FP32 training, but twice as fast.
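A minimal sketch of mixed-precision training with torch.cuda.amp (requires a CUDA GPU); the model and data are placeholders:

import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

device = "cuda"
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()

for step in range(10):
    x = torch.randn(8, 10, device=device)
    y = torch.randn(8, 1, device=device)
    optimizer.zero_grad()
    with autocast():  # forward pass and loss in FP16 where safe
        loss = nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()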
We tried multiple classification heads for the transformers (MLPs, convolutional layers...) but always got the best results with just one linear dense layer; this could be because of overfitting.
Models trained on Visual Genome work better for our task than those trained on COCO.
- Image object relationship detection via scene graphs
  The current SOTA multimodal model for image-text classification is ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. It is like the UNITER model we use, but it also takes into account the relationships between the objects, in addition to the objects and attributes.
  ERNIE-ViL is only available for the PaddlePaddle framework, but a similar model could be created from UNITER or our Multimodal Text models, using scene graph reconstruction from Unbiased Scene Graph Generation from Biased Training.
- Counterfactual training
  As proposed in Counterfactual VQA: A Cause-Effect Look at Language Bias.
- Multimodal blending loss
  As proposed in What Makes Training Multi-modal Classification Networks Hard?
- Policy-based learning
  During the competition, we found that hate speech depends on how it is defined: usually it is not just about the hate sentiment expressed, but about that sentiment being directed at a protected entity, which varies with the definition. Therefore, a preprocessing step that detects the protected entities in the text and image could help the model improve performance and reduce overfitting.
  We propose text and object detection based on a protected-entities database, adding the detected entities as additional tokens for the transformer.
Our work has been based on these papers and we recommend them to learn more about Multimodal Classification problems.
- ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
- What Makes Training Multi-Modal Classification Networks Hard?
- Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
- Exploring Deep Multimodal Fusion of Text and Photo for Hate Speech Classification
- Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
- Image–text sentiment analysis via deep multimodal attentive fusion
- Supervised Multimodal Bitransformers for Classifying Images and Text
MIT