An NLP approach to predicting keywords in arXiv manuscripts using Keras.

JNoel71/KeywordPredictor

ArXiv Keyword Predictor

Overview

This Python repo builds a neural network that attempts to predict the keywords of an arXiv manuscript from the words in its title and abstract.

How it Works

The Data

The data comes from a Kaggle dataset, available at https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts. The dataset contains only three columns, which store the title, abstract, and keywords associated with each manuscript.

Data Cleaning/Tokenization

The data was tokenized using the Keras Tokenizer. Titles and abstracts were tokenized so that only the 50 most common words were used. A list of filler words was also removed, so that simple words like 'a', 'the', and 'where' were not included in tokenization. Keywords were tokenized as well, keeping only the 4 most common keywords.
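As a rough illustration of the frequency-capped tokenization described above, here is a minimal pure-Python sketch. It is not the repo's code, which uses the Keras Tokenizer. The names `build_vocab`, `encode`, and `STOPWORDS`, the filler-word list, and the sample texts are all hypothetical.

```python
from collections import Counter

# Hypothetical filler-word list; the repo's actual list is not shown in this README.
STOPWORDS = {"a", "the", "where", "of", "for", "in"}

def build_vocab(texts, num_words):
    """Map the `num_words` most frequent non-filler words to indices starting at 1."""
    counts = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word not in STOPWORDS
    )
    # Index 0 is left unused, following the common convention of reserving it for padding.
    return {
        word: index
        for index, (word, _) in enumerate(counts.most_common(num_words), start=1)
    }

def encode(text, vocab):
    """Convert a text to a list of indices, dropping words outside the vocabulary."""
    return [vocab[word] for word in text.lower().split() if word in vocab]

texts = [
    "a survey of deep learning for graphs",
    "deep learning in the wild",
    "graphs where learning helps",
]
vocab = build_vocab(texts, num_words=3)
print(vocab)
print(encode("the learning survey", vocab))
```

The same idea applies to the keywords: count how often each keyword occurs, keep the most frequent ones, and ignore the rest.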

Neural Network

The neural network takes the tokenized titles and abstracts as two separate input streams, then joins them together to produce a single output. Because the dataset is imbalanced, custom loss functions were used to ensure the loss was properly weighted between classes. An image showing the layout of this network can be seen below.

[Image: layout of the neural network]
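The custom loss functions themselves are not shown in this README. As one common way to weight loss between imbalanced classes, the sketch below implements a class-weighted binary cross-entropy in plain Python. The weighting scheme (ratio of negatives to positives per class), the function names, and the sample labels are assumptions for illustration, not the repo's method.

```python
import math

def positive_weights(labels):
    """Per-class weight = (#negatives / #positives), so rare classes count more.

    `labels` is a list of multi-hot rows, e.g. [[1, 0], [1, 1]].
    """
    n_samples = len(labels)
    n_classes = len(labels[0])
    weights = []
    for c in range(n_classes):
        positives = sum(row[c] for row in labels)
        # Fall back to a neutral weight for classes with no positive examples.
        weights.append((n_samples - positives) / positives if positives else 1.0)
    return weights

def weighted_bce(y_true, y_pred, pos_weights, eps=1e-7):
    """Mean binary cross-entropy over classes, with the positive term scaled per class."""
    total = 0.0
    for t, p, w in zip(y_true, y_pred, pos_weights):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(w * t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Made-up labels: class 0 is common (3 of 4 samples), class 1 is rare (1 of 4).
labels = [[1, 0], [1, 0], [1, 1], [0, 0]]
weights = positive_weights(labels)
print(weights)

# Missing the rare class is penalized more than missing the common one.
print(weighted_bce([1, 1], [0.9, 0.1], weights))
print(weighted_bce([1, 1], [0.1, 0.9], weights))
```

With all weights set to 1.0 this reduces to ordinary binary cross-entropy, so the weights only change how much each class's positive examples contribute.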

Performance

Below you can find the classification report output by the program on an 80-20 train-test split.

| | Precision | Recall | F1-Score |
|---|---|---|---|
| Micro Avg | 0.80 | 0.73 | 0.76 |
| Macro Avg | 0.56 | 0.51 | 0.53 |
| Weighted Avg | 0.79 | 0.73 | 0.76 |
| Samples Avg | 0.83 | 0.79 | 0.78 |
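The gap between the micro and macro averages above is typical of an imbalanced dataset: micro averaging pools every label decision, so common classes dominate, while macro averaging gives each class equal weight. The sketch below shows the difference in plain Python. The function names and the data are made up for illustration and do not reproduce the reported numbers.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 where undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def class_counts(y_true, y_pred):
    """Per-class (tp, fp, fn) counts for multi-hot label rows."""
    counts = []
    for c in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 0)
        counts.append((tp, fp, fn))
    return counts

def micro_average(counts):
    """Pool the counts over all classes, then score once."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return prf(tp, fp, fn)

def macro_average(counts):
    """Score each class separately, then take the unweighted mean."""
    scores = [prf(*c) for c in counts]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

# Made-up data: class 0 is common and predicted well, class 1 is rare and missed.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
counts = class_counts(y_true, y_pred)
print("micro:", micro_average(counts))
print("macro:", macro_average(counts))
```

In this toy case the single missed rare label barely moves the micro average but halves the macro average, which is the same pattern seen in the table.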

Other Links

If you are interested in NLP, these are some of the articles that helped me.
