Minimally Supervised Semi-Supervised Learning
In semi-supervised learning, the central issue is how to choose training samples that capture the distribution of each class without extensive manual supervision. Since unlabelled data is cheap and easily available, it can be used to enhance the weak models built from minimal training data. The key challenge, therefore, is to choose a few data points from a large set such that a model trained on them performs comparably to a fully supervised model. In this work, we use three recent techniques:
- A self-attention-based RNN to project text documents into semantic vectors.
- Topology-adaptive hyper-cubes to detect the class boundaries.
- A determinantal point process (DPP) to select diverse hyper-cubes, and then representative samples from them for training.
The main objective of using a self-attention-based recurrent neural network is to transform a natural language sentence into a semantic vector that represents the embedding of the sentence. This part of the project is based on the paper "A Structured Self-Attentive Sentence Embedding", published in ICLR 2017.
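As a minimal sketch of that mechanism (assuming PyTorch; the class name is hypothetical, and the sizes `d_a = 350` and `r = 30` follow the paper rather than this repository, which produces 300-dimensional vectors):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuredSelfAttention(nn.Module):
    """Sketch of the embedding model from the ICLR 2017 paper (illustrative sizes)."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=150, d_a=350, r=30):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.W_s1 = nn.Linear(2 * hidden_dim, d_a, bias=False)
        self.W_s2 = nn.Linear(d_a, r, bias=False)

    def forward(self, tokens):                         # tokens: (batch, n)
        H, _ = self.lstm(self.embed(tokens))           # H: (batch, n, 2u)
        A = F.softmax(self.W_s2(torch.tanh(self.W_s1(H))), dim=1)  # attention over tokens
        M = torch.bmm(A.transpose(1, 2), H)            # M = A^T H: (batch, r, 2u)
        return M.flatten(1)                            # flattened sentence embedding
```

The flattened matrix M = AH is the sentence embedding; a final linear layer (not shown) would project it down to the 300 dimensions used below.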
The `Self Attention LSTM` folder contains the code for this step. To run it, use a Python 3 environment with the necessary libraries installed. Run `python classification.py`; the embeddings obtained for the train and test data are stored in the corresponding folder in CSV format. (P.S. Refer to Structured-Self-Attention for more details regarding the implementation.)
Hyperparameters:
- Length of semantic vector = 300
- Number of epochs = 30
The initial implementation uses a decision-tree-based classification algorithm to classify the semantic vectors into the various categories. Hyper-cubes are then generated over the features (here, 300) so that they cover the vector space.
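One plausible reading, assuming each hyper-cube is the axis-aligned box of a decision-tree leaf (the helper below is hypothetical; the actual construction in `Hypercubes.py` may differ):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def leaf_hypercubes(X, y, max_depth=15):
    """Fit a tree and return (lo, hi, majority_class) bounds for each leaf."""
    t = DecisionTreeClassifier(max_depth=max_depth).fit(X, y).tree_
    cubes = []

    def recurse(node, lo, hi):
        if t.children_left[node] == -1:            # leaf: record its box
            cubes.append((lo.copy(), hi.copy(), int(np.argmax(t.value[node]))))
            return
        f, thr = t.feature[node], t.threshold[node]
        old = hi[f]; hi[f] = thr                   # left child: feature f <= thr
        recurse(t.children_left[node], lo, hi); hi[f] = old
        old = lo[f]; lo[f] = thr                   # right child: feature f > thr
        recurse(t.children_right[node], lo, hi); lo[f] = old

    d = X.shape[1]
    recurse(0, np.full(d, -np.inf), np.full(d, np.inf))
    return cubes
```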
The `Hypercubes` folder contains the code for this step. Run it in a Python 2 environment using `python Hypercubes.py`; the output is saved in `out.txt` in text format. Run `python clean.py` to convert the results to CSV format; they are stored in `HyperCool.csv`. The number of hyper-cubes generated is 1173.
Hyperparameters:
- Depth of Decision Tree = 15
A determinantal point process (DPP) chooses a set of diverse hypercubes from those obtained in the previous step. It takes the input CSV and algorithmically selects the subset of hypercubes that maximises the corresponding principal minor of the similarity kernel (read *Determinantal Point Processes for Machine Learning* for more information). The code for this step is in the `Hypercubes` folder, and the indices of the selected hypercubes are stored in `index_dpp_hypercubes.npy`. The algorithm chooses 150 diverse hypercubes from the set. (P.S. The algorithm needs to be fed the number of hypercubes to choose.)
Hyperparameters:
- Number of hypercubes to be chosen = 150
- Weight (`FACTOR`) given to hypercubes of different categories = 50
- Similarity metric between two hypercubes = inverse Euclidean distance
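A minimal sketch of this selection step, assuming hypercubes are compared via their centres with an inverse-Euclidean similarity and picked greedily to maximise the principal minor; `centers` is a hypothetical input, and the way `FACTOR` enters the kernel is not spelled out above, so it is omitted here:

```python
import numpy as np
from scipy.spatial.distance import cdist

def greedy_dpp(L, k):
    """Greedily pick k items, each step adding the item that most increases
    the (log-)determinant of the kernel's principal minor L[S, S]."""
    def logdet(S):
        sign, ld = np.linalg.slogdet(L[np.ix_(S, S)])
        return ld if sign > 0 else -np.inf
    selected = []
    for _ in range(k):
        rest = [i for i in range(L.shape[0]) if i not in selected]
        selected.append(max(rest, key=lambda i: logdet(selected + [i])))
    return selected

# Usage with hypothetical hypercube centres (one row per cube):
centers = np.random.rand(1173, 300)
L = 1.0 / (1.0 + cdist(centers, centers))  # inverse-Euclidean similarity
idx = greedy_dpp(L, 150)                   # indices of 150 diverse hypercubes
```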
To extract the representative samples present in the selected hypercubes, run `python extractPoints.py` in the `extractPoints` folder. Labelled data points are stored in `train1.csv` and their labels in `train1labels.csv`; unlabelled data points are stored in `otherpoints.csv`. The number of representative points for the 150 hypercubes is 924, so there are 924 labelled data points and 8058 unlabelled data points.
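Assuming a representative point is simply an embedding that falls inside one of the selected hyper-cubes (the actual extraction in `extractPoints.py` may apply further filtering), the split can be sketched as:

```python
import numpy as np

def split_points(X, y, cubes):
    """X: (n, 300) embeddings; y: labels; cubes: list of (lo, hi) bound arrays.
    Points inside any selected cube form the labelled set; the rest stay unlabelled."""
    inside = np.zeros(len(X), dtype=bool)
    for lo, hi in cubes:
        inside |= np.all((X >= lo) & (X <= hi), axis=1)
    return (X[inside], y[inside]), X[~inside]
```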
Run a semi-supervised learning algorithm on the obtained data points using `python ssl.py` in the `SSL` folder. Train accuracy and test accuracy are reported.
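The exact algorithm in `ssl.py` is not described here; as one hedged illustration of the semi-supervised step, scikit-learn's label spreading could consume the same files (the `-1` marker and the `knn` kernel are conventions of that library, not necessarily of this repository):

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

X_lab = np.loadtxt('train1.csv', delimiter=',')
y_lab = np.loadtxt('train1labels.csv', delimiter=',').astype(int)
X_unl = np.loadtxt('otherpoints.csv', delimiter=',')

# scikit-learn marks unlabelled points with -1
X = np.vstack([X_lab, X_unl])
y = np.concatenate([y_lab, -np.ones(len(X_unl), dtype=int)])

model = LabelSpreading(kernel='knn', n_neighbors=7).fit(X, y)
print('Train accuracy:', model.score(X_lab, y_lab))
```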
Hyperparameters:
- Batch size for unsupervised learning = 100
Read `MinSSL.pdf` for a better understanding.