This repository contains the source code, data, and API used in our recent paper: *Fast Top-k Area Topics Extraction*.
The following dependencies are required and must be installed separately:
- Python 3 (used to run our programs)
- Aria2 (used to speed up downloading Wikipedia dumps)
- WikiExtractor (used to extract plain text from Wikipedia dumps)
Then run `git clone https://github.com/thuzhf/FastKATE.git` to download this repository to your computer (with the same name). For convenience, please put WikiExtractor and FastKATE under the same parent directory; we denote this parent directory as `<PARENT>` in the following steps.
Since our model utilizes Wikipedia dumps, we need to download these data first. We use the Wikipedia dump of timestamp 20170901 as our example in the following steps. Available timestamps can be found here.
- Run `cd <PARENT>` to enter the parent directory of FastKATE.
- Run `python3 -m FastKATE.src.wiki_downloader 20170901 ./wikidata/ all` to download all possibly needed Wikipedia data of timestamp 20170901 into the directory `./wikidata/`. For quick help, run `python3 -m FastKATE.src.wiki_downloader -h`.
- Decompress all downloaded Wikipedia dumps into `./wikidata/` with the same names (without suffixes such as `.gz` and `.bz2`).
- Run `python3 wikiextractor/WikiExtractor.py -o ./wikidata/preprocessed/ -b 64M --no-templates ./wikidata/enwiki-20170901-pages-articles-multistream.xml` to preprocess the downloaded wikidata. For quick help, run `python3 wikiextractor/WikiExtractor.py -h`.
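The decompression step above can also be scripted. A minimal Python sketch (the helper name and directory handling are our own, not part of FastKATE) that writes each dump to a file of the same name with the `.gz`/`.bz2` suffix stripped, as described above:

```python
import bz2
import gzip
import shutil
from pathlib import Path

# Map compressed-dump suffixes to the matching streaming opener.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}

def decompress_dumps(data_dir: str) -> list:
    """Decompress every .gz/.bz2 file in data_dir; return the new file names."""
    written = []
    for path in sorted(Path(data_dir).iterdir()):
        opener = OPENERS.get(path.suffix)
        if opener is None:
            continue  # not a compressed dump; leave it alone
        target = path.with_suffix("")  # strip the trailing .gz / .bz2
        with opener(path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream, so large dumps fit in memory
        written.append(target.name)
    return written
```

This streams each archive with `shutil.copyfileobj`, so even multi-gigabyte dumps are decompressed without loading them fully into memory.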
- Run `cd <PARENT>` to enter the parent directory of FastKATE.
- Run `python3 -m FastKATE.src.topic_embeddings 20170901 ./wikidata/` to extract candidate topics (in the form of phrases) from the Wikipedia dump and generate a vector representation for each topic. For quick help, run `python3 -m FastKATE.src.topic_embeddings -h`.
- A pretrained topic embeddings model (trained on the Wikipedia dump of timestamp 20161201 and used in our paper) can be downloaded here (it consists of 3 files; download all 3 and put them in the same folder if you want to use the pretrained model).
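For intuition, ranking candidate topics against an area reduces to similarity between their vectors. A toy sketch using cosine similarity over made-up two-dimensional vectors (real topic embeddings come from the step above and are much higher-dimensional; the function names here are our own):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_topics(area_vec, topic_vecs, k):
    """Return the k topics most similar to the area vector, best first."""
    scored = sorted(topic_vecs.items(),
                    key=lambda kv: cosine(area_vec, kv[1]),
                    reverse=True)
    return [(name, round(cosine(area_vec, vec), 4)) for name, vec in scored[:k]]

# Toy embeddings, purely illustrative:
area = [1.0, 0.0]
topics = {"machine_learning": [0.9, 0.1],
          "cooking": [0.0, 1.0],
          "data_mining": [0.7, 0.3]}
```

Here `rank_topics(area, topics, 2)` puts `machine_learning` ahead of `data_mining`, while `cooking` (orthogonal to the area vector) is dropped.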
- Our code can easily be modified to train topic embeddings on datasets other than Wikipedia. If you want to do this, please refer to the source code for details.
- Run `cd <PARENT>` to enter the parent directory of FastKATE.
- Run `python3 -m FastKATE.src.taxonomy 20170901 ./wikidata/` to extract the category structure from Wikipedia. For quick help, run `python3 -m FastKATE.src.taxonomy -h`.
- A file containing the extracted category structure (used in our paper) can be downloaded here.
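The extracted category structure is essentially a directed graph from categories to subcategories. As an illustration only (the on-disk format of the real file may differ), a sketch that walks such a taxonomy breadth-first from a root area, guarding against the cycles that exist in Wikipedia's category graph:

```python
from collections import deque

def collect_descendants(taxonomy, root, max_depth=2):
    """Breadth-first walk: all subcategories of root within max_depth levels.

    taxonomy is a dict mapping each category to a list of its subcategories.
    """
    seen = {root}
    order = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand past the depth limit
        for child in taxonomy.get(node, []):
            if child not in seen:  # skip cycles / shared subcategories
                seen.add(child)
                order.append(child)
                queue.append((child, depth + 1))
    return order

# Toy taxonomy with a deliberate cycle back to the root:
toy = {"artificial_intelligence": ["machine_learning", "natural_language_processing"],
       "machine_learning": ["deep_learning"],
       "deep_learning": ["artificial_intelligence"]}
```

The depth limit matters in practice: without it, Wikipedia's category graph drifts into unrelated areas within a few hops.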
- Run `cd <PARENT>` to enter the parent directory of FastKATE.
- Run `python3 -m FastKATE.src.api ./wikidata/` to run the extraction algorithm and set up the API. For quick help, run `python3 -m FastKATE.src.api -h`. A currently running API can be visited here (its results differ slightly from the original paper because we have integrated MAG and ACM CCS data to further improve them).
- The inputs of the API are:
    - area: the area name; should be lowercase, with spaces replaced by `_`.
    - k: the number of topics to extract; should be a positive integer.
- The output of the API is a dict in JSON format, which consists of:
    - area: the same as the input.
    - result: the top-k extracted topics of the given area, each accompanied by its relevance to the area and ranked by it in descending order.
    - time: consumed time (in seconds).
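A client for this API can be sketched as follows. The base URL is a placeholder (not the real endpoint) and the helper names are our own; only the query construction and response fields follow the input/output description above:

```python
import json
from urllib.parse import urlencode

# Placeholder address; substitute wherever FastKATE.src.api is actually serving.
BASE_URL = "http://localhost:8000/topics"

def build_query_url(area: str, k: int) -> str:
    """Build a request URL, lowercasing the area and replacing spaces with '_'."""
    params = {"area": area.lower().replace(" ", "_"), "k": k}
    return BASE_URL + "?" + urlencode(params)

def parse_response(body: str):
    """Unpack the JSON response into (area, ranked topics, elapsed seconds)."""
    data = json.loads(body)
    return data["area"], data["result"], data["time"]
```

For example, `build_query_url("Machine Learning", 10)` yields a URL ending in `area=machine_learning&k=10`, and `parse_response` returns the `area`, `result`, and `time` fields described above.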