Learning Personality
This project was carried out during my bachelor internship.
In collaboration with the MAD laboratory (DiSCo) of the University of Milano-Bicocca and a group of researchers from the university's psychology department, the goal of the work is to define a procedure that automatically extracts the personality of the "target" object a given natural-language text refers to, expressed in terms of the Big Five personality traits (OCEAN).
Different representation spaces are explored, starting from a bag-of-words representation and moving up to a word embedding built with the skip-gram version of Tomas Mikolov's word2vec algorithm. Three different types of artificial neural networks have been used.
The dataset for this task can be downloaded from https://www.yelp.com/dataset/challenge
This thesis project is highly experimental in nature and aims to present a detailed analysis of the topic, since at present there are no major studies that address the problem of learning personality traits from natural-language text.
A series of Python scripts has been created to automate preprocessing, feature extraction and model training and to make them repeatable.
The first model implemented is a feed-forward fully-connected NN.
The second model uses a class of distributional algorithms based on a neural network that learns word contexts in an unsupervised way. The word embedding generated in this way is used as input for a convolutional NN.
The third model transforms the regression problem into a binary multi-label classification problem, in which the output for each personality dimension is either 0 or 1.
- In the input_pipeline, preprocess_dataset, TFRecord_dataset, load_ocean files:
- The JSON file is parsed, extracting only the review texts (converted to lower case).
- The dataset is split 80-20 into a training set and a test set: in total there are 4974000 sentences in the training set and 1243000 in the test set.
- Sentences are generated by splitting on punctuation.
- Stopwords are removed from sentences.
- Sentences that do not contain any of the OCEAN adjectives are deleted.
- Three zip archives, containing the full dataset, the training set and the test set, are saved to file. A sketch of these preprocessing steps is given below.
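Below is a minimal sketch of these preprocessing steps in plain Python. The stopword list, the adjective lexicon and the function names are illustrative placeholders, not the ones used by the actual scripts, and it assumes the Yelp review file contains one JSON object per line.

```python
# Illustrative sketch of the preprocessing described above; stopwords and
# adjective lexicon are placeholders, not the real lists used in the project.
import json
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "it"}   # placeholder list
OCEAN_ADJECTIVES = {"friendly", "creative", "anxious"}                # placeholder lexicon


def parse_reviews(json_path):
    """Read the review file (assumed one JSON object per line), keeping only the text."""
    with open(json_path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)["text"]


def review_to_sentences(review_text):
    """Lower-case a review, split it on punctuation and filter the sentences."""
    sentences = []
    for raw in re.split(r"[.!?;:,]+", review_text.lower()):      # split on punctuation
        tokens = [t for t in raw.split() if t not in STOPWORDS]  # remove stopwords
        # keep only sentences containing at least one adjective of interest
        if tokens and OCEAN_ADJECTIVES.intersection(tokens):
            sentences.append(" ".join(tokens))
    return sentences
```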
- In the dictionary, voc, remove_adj files:
- A .txt file is generated containing one word per line, for all the sentences in order.
- The file is then sorted, a counter is kept for each word so that duplicates collapse into a single entry, and the result is sorted again by frequency.
- The adjectives of the OCEAN dataset that appear in the dictionary are removed.
- A new compact file is generated containing only the first n most frequent words; all other words will later be mapped to the 'UNK' token (see the sketch below).
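A compact sketch of the same dictionary construction, using an in-memory Counter instead of the sorted text files the scripts work on; the cut-off n = 60000 matches the value used later.

```python
# Sketch of the dictionary construction; the real scripts sort and deduplicate
# a text file, but the outcome is the same list of the n most frequent words.
from collections import Counter


def build_vocabulary(sentences, ocean_adjectives, n=60000):
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())          # one counter entry per distinct word
    for adj in ocean_adjectives:
        counts.pop(adj, None)                    # drop the OCEAN adjectives
    # keep only the n most frequent words; everything else will map to 'UNK'
    return [word for word, _ in counts.most_common(n)]
```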
- In the extract_features, model_input, training files:
- A lookup table containing the 60000 most frequent words is created. Each word is indexed with a unique integer value (corresponding to its line number); words not included in the first 60000 most common words are marked with "-1".
- A reverse lookup table is created that allows a word to be retrieved from its unique identifier. Unknown words, identified by '-1', are replaced with the 'UNK' token.
- The bag-of-words vector is generated and the corresponding OCEAN vector is associated with it.
- A basic model with n fully-connected layers is built. The ReLU non-linear activation function is applied to each of them, and batch normalization is performed after each layer.
- Training can be run for n epochs. The chosen optimizer is Adagrad with a learning rate of 0.001. The loss function is the mean squared error (MSE), and the root mean squared error (RMSE) is also used as a metric. A sketch of this model is shown after the list.
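A sketch of this first model, assuming a tf.keras implementation: the number of layers and the hidden size are illustrative values, while the optimizer, learning rate, loss and metric are the ones listed above.

```python
# Hypothetical tf.keras version of model 1 (bag-of-words -> fully-connected NN);
# N_LAYERS and HIDDEN_UNITS are illustrative, not the project's actual values.
import tensorflow as tf

VOCAB_SIZE = 60000    # bag-of-words dimensionality
N_LAYERS = 3          # "n" fully-connected layers (illustrative)
HIDDEN_UNITS = 512    # illustrative width


def build_bow_model():
    inputs = tf.keras.Input(shape=(VOCAB_SIZE,))              # bag-of-words vector
    x = inputs
    for _ in range(N_LAYERS):
        x = tf.keras.layers.Dense(HIDDEN_UNITS, activation="relu")(x)
        x = tf.keras.layers.BatchNormalization()(x)            # batch norm after each layer
    outputs = tf.keras.layers.Dense(5)(x)                      # one regression output per OCEAN trait
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.001),
        loss="mse",                                            # mean squared error
        metrics=[tf.keras.metrics.RootMeanSquaredError()],     # RMSE as additional metric
    )
    return model
```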
- In the mikolov_features, mikolov_embedding_model, mikolov_embedding_training files:
- The same procedure is performed to build the dictionary of the 60000 most frequent words.
- The features for building the embedding are generated by forming a dataset that couples each word with its context: the word on the left and the word on the right of the target are considered as context.
- The size of the embedding and the number of negative labels used for sampling can be configured.
- The network is trained with Stochastic Gradient Descent (SGD). The pair generation is sketched below.
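A plain-Python sketch of how the (target, context) pairs with negative sampling can be generated; the function name and the default number of negative labels are assumptions, not taken from the mikolov_* scripts.

```python
import random


def skipgram_pairs(word_ids, vocab_size, num_negative=5):
    """Couple each word id with its left and right neighbour, plus random negative pairs."""
    pairs, labels = [], []
    for i, target in enumerate(word_ids):
        for j in (i - 1, i + 1):                              # one word of context per side
            if 0 <= j < len(word_ids):
                pairs.append((target, word_ids[j]))
                labels.append(1)                              # true (target, context) pair
                for _ in range(num_negative):
                    pairs.append((target, random.randrange(vocab_size)))
                    labels.append(0)                          # negatively sampled pair
    return pairs, labels
```

These pairs are then fed to the embedding network, which is optimized with SGD as stated above.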
- In the mikolov_model, mikolov_training files:
- A model with n convolutional layers and a final fully-connected layer is built. The ReLU non-linear activation function is applied to each of them, and batch normalization is performed after each layer. In addition, a pooling layer follows the first convolutional layer.
- Training can be run for n epochs. The chosen optimizer is Adagrad with a learning rate of 0.005. The loss function is the mean squared error (MSE), and the root mean squared error (RMSE) is used as a metric. A sketch of this model follows the list.
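A sketch of model 2, again under the assumption of a tf.keras implementation; the sentence length, number of filters and kernel size are illustrative, while the pooling placement, batch normalization, Adagrad learning rate and MSE loss follow the description above.

```python
# Hypothetical tf.keras version of model 2 (word embedding -> convolutional NN).
import tensorflow as tf

SENTENCE_LENGTH = 30   # illustrative fixed sentence length
EMBEDDING_SIZE = 128   # illustrative embedding dimensionality
N_CONV_LAYERS = 3      # "n" convolutional layers (illustrative)


def build_conv_model():
    inputs = tf.keras.Input(shape=(SENTENCE_LENGTH, EMBEDDING_SIZE))
    x = inputs
    for i in range(N_CONV_LAYERS):
        x = tf.keras.layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(x)
        x = tf.keras.layers.BatchNormalization()(x)            # batch norm after each layer
        if i == 0:
            x = tf.keras.layers.MaxPooling1D(pool_size=2)(x)   # pooling after the first layer
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(5)(x)                      # final fully-connected layer
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.005),
        loss="mse",
        metrics=[tf.keras.metrics.RootMeanSquaredError()],
    )
    return model
```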
- In the mikolov_features file:
- The same procedure as for model 2 is carried out, but the embedding is built by forming a dataset that couples each adjective of interest with its context: the two words on the right and the two words on the left of the target are considered as context (see the sketch below).
- In the mikolov_multiclass_binary_model, mikolov_multiclass_binary_training files:
- The procedure for extracting the embedding is the same as for the two previous models.
- A basic model with n layers is built. The ReLU non-linear activation function is applied to each of them, and batch normalization is performed after each layer.
- The model is similar to the previous one, with the difference that the objective function used is a softmax cross entropy. Furthermore, accuracy is used as a metric for each personality trait, and the confusion matrices are plotted with TensorBoard. A sketch is given below.
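A sketch of model 3, assuming tf.keras and hypothetical layer sizes: each of the five traits gets its own two-way softmax trained with a (sparse) softmax cross entropy, and per-trait accuracy is tracked; the learning rate is assumed to be the same as in model 2.

```python
# Hypothetical tf.keras version of model 3 (binary multi-label classification).
import tensorflow as tf

SENTENCE_LENGTH = 30   # illustrative
EMBEDDING_SIZE = 128   # illustrative
N_LAYERS = 3           # "n" layers (illustrative)
HIDDEN_UNITS = 256     # illustrative


def build_multilabel_model():
    inputs = tf.keras.Input(shape=(SENTENCE_LENGTH, EMBEDDING_SIZE))
    x = tf.keras.layers.Flatten()(inputs)
    for _ in range(N_LAYERS):
        x = tf.keras.layers.Dense(HIDDEN_UNITS, activation="relu")(x)
        x = tf.keras.layers.BatchNormalization()(x)            # batch norm after each layer
    x = tf.keras.layers.Dense(5 * 2)(x)                        # two logits per OCEAN trait
    x = tf.keras.layers.Reshape((5, 2))(x)
    outputs = tf.keras.layers.Softmax(axis=-1)(x)              # independent softmax per trait
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.005),   # assumed, as in model 2
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),         # softmax cross entropy
        metrics=["sparse_categorical_accuracy"],                      # accuracy per trait
    )
    return model
```

The labels for this model are vectors of five 0/1 values, one per personality trait.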