This task was the subject of the Deep Learning exam of 02/06/23 for the Master's degree course in Artificial Intelligence at the University of Bologna.
The purpose of this project is to take as input a sequence of words corresponding to a random permutation of a given English sentence and to reconstruct the original sentence.
In this work, I propose a Transformer model for this reordering task, which reconstructs sentences from their shuffled versions while preserving the meaning and grammaticality of the originals. I compare the performance of the proposed Transformer model with alternative methods based on Seq2Seq models or positional vectors.
The dataset was taken from Hugging Face's Wikipedia dataset ("20220301.simple"). The processed dataset consists of ~120k sentences of at most 30 words each. The vocabulary size is 10k words (the most frequent in the dataset).
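As an illustration only, here is a minimal loading/preprocessing sketch; the dataset name and the 30-word limit come from the description above, while the sentence splitting and shuffling details are assumptions:

```python
import random
from datasets import load_dataset  # Hugging Face `datasets` library

MAX_WORDS = 30

# Simple-English Wikipedia dump referenced above
wiki = load_dataset("wikipedia", "20220301.simple", split="train")

def make_pairs(text, max_words=MAX_WORDS):
    """Yield (shuffled, original) sentence pairs.

    Splitting on '.' is a simplification; the real preprocessing
    pipeline is not detailed in this report.
    """
    for sentence in text.split("."):
        words = sentence.lower().split()
        if 0 < len(words) <= max_words:
            shuffled = words[:]
            random.shuffle(shuffled)
            yield " ".join(shuffled), " ".join(words)

# Build (shuffled, original) pairs from the first 1000 articles as a demo
pairs = [p for article in wiki.select(range(1000)) for p in make_pairs(article["text"])]
```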
Given the source (original) string s and the prediction p, the quality of the results is measured according to the following metric:
- find the longest common substring w of s and p
- compute
$\frac{|w|}{\max(|s|,|p|)}$
If the match is exact, the score is 1.
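For reference, here is a minimal sketch of this metric, assuming w is the character-level longest common substring (using difflib is my choice here, not necessarily what the original evaluation script used):

```python
from difflib import SequenceMatcher

def score(s: str, p: str) -> float:
    """|w| / max(|s|, |p|), where w is the longest common substring of s and p."""
    matcher = SequenceMatcher(None, s, p, autojunk=False)
    w = matcher.find_longest_match(0, len(s), 0, len(p))
    return w.size / max(len(s), len(p))

# Reproduces the example score reported in the results section (~0.5349)
print(score("on august 6 1928 in pittsburgh pennsylvania",
            "pittsburgh pennsylvania on august 6 1928"))
```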
- No pretrained model can be used.
- The neural network model should have less than 20M parameters.
For this task, a Transformer model was implemented, following as closely as possible the implementation described in the original paper "Attention Is All You Need" (Vaswani et al., 2017).
Here are some of the most important hyperparameters and their meaning:
- embedding_dim -> 256 (the dimensionality of the model's hidden states and embeddings)
- num_heads -> 10 (the number of attention heads used in the multi-head attention mechanism)
- num_layers -> 5 (the number of layers in the encoder and decoder stacks)
- intermediate_dim -> 1640 (the number of units in the feed-forward sublayer of the encoder and decoder layers)
- dropout_rate -> 0.1 (the dropout rate used in the encoder and decoder layers)
- batch_size -> 512 (the number of samples processed in each training batch)
- optimizer -> Adam, with the same parameters as in the paper (the optimizer used to train the model)
- learning_rate -> custom schedule taken from the original paper (see the sketch below)
- warmup_steps -> 1000 (the number of warmup steps used in the learning rate schedule)
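Below is a minimal sketch of that schedule and optimizer setup, assuming a TensorFlow/Keras implementation; the class name PaperSchedule is illustrative:

```python
import tensorflow as tf

class PaperSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5),
    as in "Attention Is All You Need"."""

    def __init__(self, d_model=256, warmup_steps=1000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = float(warmup_steps)

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)                     # decay branch
        arg2 = step * (self.warmup_steps ** -1.5)      # linear warmup branch
        return tf.math.rsqrt(self.d_model) * tf.minimum(arg1, arg2)

# Adam with the beta/epsilon values used in the paper
optimizer = tf.keras.optimizers.Adam(
    learning_rate=PaperSchedule(), beta_1=0.9, beta_2=0.98, epsilon=1e-9
)
```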
Total trainable parameters: 19,990,034
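For illustration, here is a minimal sketch of how these hyperparameters could be assembled into an encoder-decoder Transformer using KerasNLP building blocks; the actual implementation may differ in details such as masking, sequence length, and weight tying, so the parameter count of this sketch is not guaranteed to match the figure above:

```python
import keras_nlp
from tensorflow import keras

VOCAB_SIZE = 10_000
MAX_LEN = 32  # 30 words plus special tokens (assumption)
EMB, HEADS, LAYERS, FF, DROP = 256, 10, 5, 1640, 0.1

# Encoder: reads the shuffled sentence
enc_in = keras.Input(shape=(None,), dtype="int32", name="shuffled")
x = keras_nlp.layers.TokenAndPositionEmbedding(VOCAB_SIZE, MAX_LEN, EMB)(enc_in)
for _ in range(LAYERS):
    x = keras_nlp.layers.TransformerEncoder(FF, HEADS, dropout=DROP)(x)

# Decoder: autoregressively generates the original (unshuffled) sentence
dec_in = keras.Input(shape=(None,), dtype="int32", name="target")
y = keras_nlp.layers.TokenAndPositionEmbedding(VOCAB_SIZE, MAX_LEN, EMB)(dec_in)
for _ in range(LAYERS):
    y = keras_nlp.layers.TransformerDecoder(FF, HEADS, dropout=DROP)(y, x)
out = keras.layers.Dense(VOCAB_SIZE, activation="softmax")(y)

model = keras.Model([enc_in, dec_in], out)
model.compile(optimizer="adam",  # or the scheduled Adam from the sketch above
              loss="sparse_categorical_crossentropy")
model.summary()
```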
After running autoregressive inference on 1000 random test-set datapoints, the average score was 0.4856412631262165.
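For illustration, here is a minimal sketch of a greedy autoregressive decoding loop, assuming the two-input Keras model sketched above and hypothetical start/end token ids (the report does not specify the special tokens used):

```python
import numpy as np

def greedy_decode(model, shuffled_ids, start_id, end_id, max_len=32):
    """Feed back the tokens predicted so far and pick the most likely next token."""
    decoded = [start_id]
    for _ in range(max_len - 1):
        probs = model.predict(
            [np.array([shuffled_ids]), np.array([decoded])], verbose=0
        )
        next_id = int(np.argmax(probs[0, -1]))  # most probable token at the last position
        decoded.append(next_id)
        if next_id == end_id:
            break
    return decoded[1:]  # drop the start token
```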
Note: the shuffled input sentence is omitted; it is simply a random permutation of the words of the test sentence below.
- test sentence: "on august 6 1928 in pittsburgh pennsylvania"
- prediction: "pittsburgh pennsylvania on august 6 1928"
- score: 0.5348837209302325