Capability and skill mapping using transformer-based/contextual text embeddings.
On a high level, the Textstellar
pipeline broadly consists of three modules:
-
Semantic Ranking:
- Given a definition for "X" (a list of reference sentences capturing X), which could be a theme, challenge, or a concept, we find the top-K related items based on their semantic similarity. The reference sentences could be either handcrafted, or GPT-3 prompted
- We apply this to finding relevant research outcomes (and researchers) that are most salient for a given excercise
-
Topic Clustering:
- Perform unsupervised clustering for topic discovery
- Using Topic Coherence to automatically select the optimal cluster size etc.
-
Visualization:
- Plot highest matching outputs and their corresponding authors
- Generate a 2D "night sky" visualization of topics
- Clone this repo and get started with
textstellar.ipynb
notebook to use on your own dataset - Preferably run on a GPU (recommended to use Google Colab)
- Replace all system path(s) as needed
Most importantly, explore the low-dimensional semantic space—at your leisure.
Click here for a live demo.