GitHub - paleopresto/recommender: Repository containing the code for the PaleoRec paper

Recommendation System for Linked Paleodata

This repository contains the research work and background material that led to the development of PaleoRec: A Sequential Recommendation System for Annotating Paleoclimate Data.

The initial implementation of the problem was a Conditional Probabilities based solution trained on the sequential data. Since the probabilities are combined over and over again along the sequence and the model consists of nothing but probability values, this method did not provide an effective solution given the imbalance in the data. Another solution was a Random Forest classifier. A Random Forest is an ensemble of decision trees, each trained on part of the data, that uses simple voting to reach a decision. In our case we were dealing with 2000 labels, and an inherent sequence was a prerequisite to the task. The code for both methods is listed in the background work folder under the respective sub-folders. The background work folder also contains the code required to procure seed data for the various fields in the recommendation system. We used SPARQL to query the Linked Earth ontology, which provided data for Archive Type, Proxy Observation Type and Inferred Variable Type.
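To make the conditional-probability idea concrete, here is a minimal sketch that estimates P(next item | previous item) by counting transitions in annotation chains. It is not the code from the background work folder; the function names (`fit_transitions`, `recommend_next`) and the example chains are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences):
    """Count how often each item directly follows each preceding item."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def recommend_next(counts, prev, k=3):
    """Return up to k (item, P(item | prev)) pairs, most probable first."""
    followers = counts.get(prev)
    if not followers:
        return []  # nothing was ever observed after `prev`
    total = sum(followers.values())
    return [(item, n / total) for item, n in followers.most_common(k)]


# Hypothetical chains: archive type -> proxy observation type -> inferred variable type
chains = [
    ["Coral", "d18O", "Temperature"],
    ["Coral", "d18O", "Salinity"],
    ["Coral", "Sr/Ca", "Temperature"],
    ["Wood", "Ring width", "Temperature"],
]
model = fit_transitions(chains)
print(recommend_next(model, "Coral"))
```

A model like this holds only frequency ratios, so frequent labels tend to dominate the recommendations, which is one way the data imbalance mentioned above can show up.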

During the development of PaleoRec, several design decisions and implementations were tried to get the best possible result. We evaluated the work using 2 commonly known metrics used to gauge Recommendation Systems: Hit Ratio (a measure of whether the ground truth item is present in the recommendation list generated by the model) and Mean Reciprocal Rank (a measure of where in the recommendation list the ground truth item appears; the higher the position, the better the recommendation list). The current implementation of PaleoRec uses a commonly known RNN, the Long Short Term Memory (LSTM), for training and prediction. Another RNN, the Gated Recurrent Unit (GRU), is a popular method for training Sequential Recommendation Systems. A comparison of the evaluation metrics can be seen in the binder notebook available through the widget below.
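The two metrics described above can be sketched as follows. This is a minimal illustration under common definitions of Hit Ratio and Mean Reciprocal Rank at a cutoff k, not the evaluation code used for PaleoRec; the function names and the toy data are assumptions.

```python
def hit_ratio_at_k(recommendations, ground_truth, k):
    """Fraction of cases whose ground truth item appears in the top-k list."""
    hits = sum(
        1 for recs, truth in zip(recommendations, ground_truth) if truth in recs[:k]
    )
    return hits / len(ground_truth)


def mean_reciprocal_rank_at_k(recommendations, ground_truth, k):
    """Average of 1/rank of the ground truth item; 0 when it is not in the top k."""
    total = 0.0
    for recs, truth in zip(recommendations, ground_truth):
        top = recs[:k]
        if truth in top:
            total += 1.0 / (top.index(truth) + 1)  # ranks start at 1
    return total / len(ground_truth)


# Toy data: one ranked recommendation list per test case, and the true item for each.
recommended = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truth = ["a", "a", "d"]

print(hit_ratio_at_k(recommended, truth, k=3))
print(mean_reciprocal_rank_at_k(recommended, truth, k=3))
```

In the toy data the true item is ranked first in one case, second in another and missing in the third, so Hit Ratio counts two hits out of three while Mean Reciprocal Rank also rewards the first case more than the second.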

Most Recommendation Systems base recommendations on user characteristics and past choices; however, we consider the user to be anonymous in our recommendation system. Almost all paleoclimate datasets have an author or investigator who is the primary worker on that data. Adding author information to the system did not show major benefits, while causing more data sparsity. Cleaning the author data also bulked up the code, slowing results by at least 40%. Check out the comparison of evaluation metrics for LSTM with Author as the user and LSTM without Author information through the binder widget available below.

Steps for using the binder notebook widget:

Launch the binder.
Click on Research_paper_graphs.ipynb.
In the menu bar at the top, click on Cell -> Run All.

Folders and files:

background_material
paleorec
.gitignore.txt
LICENSE
README.md
Research_paper_graphs.ipynb
environment.yml
gru_metrics_data
lstm_author_metrics_data
lstm_metrics_data
