By combining the Natural Language Toolkit (NLTK) package, the Levenshtein distance algorithm, and an ad hoc algorithm, this script can:
- Given a list of scientific article titles, extract promising candidate keywords from them;
- Select the best keywords by looking at their relative frequency, and use them to create a thematic network of scientific publications.
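
The two steps above can be sketched roughly as follows. This is a toy illustration, not the script's actual code: it swaps NLTK for a regex tokenizer and a tiny hand-written stopword list so that it runs with the standard library only. All function names, the `min_share`/`max_share` thresholds, and the rule "link two titles when they share a kept keyword" are assumptions made for the example. The pairwise comparisons are quadratic, so this version would not scale the way the real script is meant to.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stopword list; the real script relies on NLTK for this step.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def levenshtein(a, b):
    """Edit distance via dynamic programming, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def candidate_keywords(title):
    """Lowercase the title and keep alphabetic words that are not stopwords."""
    words = re.findall(r"[a-z]+", title.lower())
    return {w for w in words if w not in STOPWORDS and len(w) > 2}


def merge_variants(counts, max_distance=1):
    """Map each keyword to the first canonical keyword within max_distance edits.

    Keywords are visited from most to least frequent (ties broken
    alphabetically), so frequent spellings become the canonical forms.
    """
    canonical = {}
    forms = []
    for word in sorted(counts, key=lambda w: (-counts[w], w)):
        match = next((f for f in forms if levenshtein(word, f) <= max_distance), None)
        if match is None:
            forms.append(word)
        canonical[word] = match or word
    return canonical


def build_network(titles, min_share=0.2, max_share=0.8):
    """Return (kept keywords, edges), where edges link titles sharing a kept keyword."""
    per_title = [candidate_keywords(t) for t in titles]
    counts = Counter(w for kws in per_title for w in kws)
    canonical = merge_variants(counts)
    per_title = [{canonical[w] for w in kws} for kws in per_title]

    # Relative frequency: share of titles that contain the keyword.
    doc_freq = Counter(w for kws in per_title for w in kws)
    n = len(titles)
    kept = {w for w, c in doc_freq.items() if min_share <= c / n <= max_share}

    edges = {}
    for i, j in combinations(range(n), 2):
        shared = per_title[i] & per_title[j] & kept
        if shared:
            edges[(i, j)] = shared
    return kept, edges
```

For example, calling `build_network(titles, min_share=0.5)` on a handful of titles returns the keywords that appear in at least half of them, plus a dictionary keyed by pairs of title indices whose values are the keywords those two titles share.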
The script is designed to scale to tens of millions of article titles and millions of keywords. A few optimizations to the algorithm will be added in the coming weeks.
This is still a beta project. You can find a visualization of a graph constructed using this algorithm here. Thanks to Anvaka for the excellent visualization engine!