This repository contains the research work and background material that led to the development of PaleoRec - A Sequential Recommendation System for Annotating Paleoclimate Data.
The initial implementation was a conditional-probability-based solution trained on the sequential data. Because these probabilities are combined over and over along a sequence and the model consists of nothing but probability values, this method did not provide an effective solution given the imbalance in the data. Another approach was a Random Forest classifier. A Random Forest is an ensemble of decision trees, each trained on part of the data, that uses simple voting to reach a decision. In our case we were dealing with 2000 labels, and an inherent sequence was a prerequisite for the task. The code for both methods is in the background work folder under the respective sub-folders. The background work folder also contains the code used to procure seed data for the various fields in the recommendation system. We used SPARQL to query the Linked Earth ontology, which provided data for Archive Type, Proxy Observation Type and Inferred Variable Type.
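As a rough illustration of the conditional-probability idea described above, the sketch below estimates P(next item | previous item) from counted transitions and ranks candidates by that probability. It is not the code from the background work folder; the function names `fit_transitions` and `recommend_next` and the toy sequences are hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences):
    """Count how often each item directly follows each other item."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def recommend_next(counts, prev, k=3):
    """Return up to k (item, P(item | prev)) pairs, most probable first."""
    followers = counts.get(prev)
    if not followers:
        return []  # nothing was ever observed after `prev`
    total = sum(followers.values())
    return [(item, n / total) for item, n in followers.most_common(k)]


# Hypothetical toy data: archive type -> proxy observation -> inferred variable.
sequences = [
    ["Coral", "d18O", "Temperature"],
    ["Coral", "d18O", "Salinity"],
    ["Coral", "Sr/Ca", "Temperature"],
    ["Wood", "Ring width", "Temperature"],
]

counts = fit_transitions(sequences)
print(recommend_next(counts, "Coral"))
```

With heavily imbalanced data, a frequent follower dominates these estimates and rare labels almost never surface, which is consistent with the limitation noted above.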
During the development of PaleoRec, several design decisions and implementations were tried to get the best possible result. We evaluated the work using two common evaluation metrics for recommendation systems: Hit Ratio (a measure of whether the ground truth item is present in the recommendation list generated by the model) and Mean Reciprocal Rank (a measure of where in the recommendation list the ground truth item appears; the higher the position, the better the recommended list). The current implementation of PaleoRec uses a well-known RNN, the Long Short-Term Memory (LSTM) network, for training and prediction. Another RNN, the Gated Recurrent Unit (GRU), is a popular choice for sequential recommendation systems. A comparison of the evaluation metrics for the two can be seen in the Binder notebook available through the widget below.
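The two metrics can be sketched in a few lines of Python. This is a minimal illustration under the definitions given above, not the evaluation code used in the notebook; the function names, the cutoff parameter `k`, and the example lists are assumptions made for this sketch.

```python
def hit_ratio_at_k(recommendations, ground_truth, k=10):
    """Fraction of cases whose ground truth item appears in the top-k list."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1 for recs, truth in zip(recommendations, ground_truth)
        if truth in recs[:k]
    )
    return hits / len(ground_truth)


def mean_reciprocal_rank(recommendations, ground_truth, k=10):
    """Average of 1/rank of the ground truth item; 0 when it is not in the top k."""
    if not ground_truth:
        return 0.0
    total = 0.0
    for recs, truth in zip(recommendations, ground_truth):
        top_k = recs[:k]
        if truth in top_k:
            total += 1.0 / (top_k.index(truth) + 1)  # ranks start at 1
    return total / len(ground_truth)


# Hypothetical recommendation lists for three test cases.
recs = [
    ["d18O", "Sr/Ca", "Mg/Ca"],   # ground truth at rank 1
    ["Sr/Ca", "d18O", "Mg/Ca"],   # ground truth at rank 2
    ["Mg/Ca", "Sr/Ca", "TEX86"],  # ground truth missing
]
truth = ["d18O", "d18O", "d18O"]

print(hit_ratio_at_k(recs, truth, k=3))
print(mean_reciprocal_rank(recs, truth, k=3))
```

Hit Ratio only rewards presence in the list, while Mean Reciprocal Rank also rewards placing the correct item near the top, so the two are usually reported together.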
Most recommendation systems base recommendations on user characteristics and past choices; however, we consider the user to be anonymous in our recommendation system. Almost all paleoclimate datasets list an author or investigator who is the primary worker on that data. Adding author information to the system did not show major benefits and caused more data sparsity. Cleaning the author data also bulked up the code, making results at least 40% slower. Check out the comparison of evaluation metrics for LSTM with the author as the user and LSTM without author information through the Binder widget available below.
Steps for using the Binder notebook widget:
- Launch the Binder.
- Click on Research_paper_graphs.ipynb.
- In the menu bar at the top, click on Cell -> Run All.