This repository contains the source code, data, and API used in our recent paper: *Fast Top-k Area Topics Extraction*.
The following dependencies are required and must be installed separately:
- Python 3 (used to run our programs)
- Aria2 (used to speed up downloading Wikipedia dumps)
- WikiExtractor (used to extract plain text from Wikipedia dumps)
Then run `git clone https://github.com/thuzhf/FastKATE.git` to download this repository to your computer (with the same name). For convenience, please put WikiExtractor and FastKATE under the same parent directory; we denote this parent directory as `<PARENT>` in the following steps.
Since our model utilizes Wikipedia dumps, we need to download these data first. We use the Wikipedia dump of timestamp 20170901 as our example in the following steps. Available timestamps can be found here.
- Run `cd <PARENT>` to enter the parent directory of FastKATE.
- Run `python3 -m FastKATE.src.wiki_downloader 20170901 ./wikidata/ all` to download all possibly needed Wikipedia data of timestamp 20170901 into the directory `./wikidata/`. For quick help, run `python3 -m FastKATE.src.wiki_downloader -h`.
- Decompress all downloaded Wikipedia dumps into `./wikidata/` with the same names (without suffixes such as `.gz` and `.bz2`).
- Run `python3 wikiextractor/WikiExtractor.py -o ./wikidata/preprocessed/ -b 64M --no-templates ./wikidata/enwiki-20170901-pages-articles-multistream.xml` to preprocess the downloaded wikidata. For quick help, run `python3 wikiextractor/WikiExtractor.py -h`.
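The decompression step above can also be scripted. A minimal Python sketch (the helper name and directory handling are our own, not part of FastKATE) that writes each dump to a file of the same name with the `.gz`/`.bz2` suffix stripped, as described above:

```python
import bz2
import gzip
import shutil
from pathlib import Path

# Map compressed-dump suffixes to the matching streaming opener.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}

def decompress_dumps(data_dir: str) -> list:
    """Decompress every .gz/.bz2 file in data_dir; return the new file names."""
    written = []
    for path in sorted(Path(data_dir).iterdir()):
        opener = OPENERS.get(path.suffix)
        if opener is None:
            continue  # not a compressed dump; leave it alone
        target = path.with_suffix("")  # strip the trailing .gz / .bz2
        with opener(path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream, so large dumps fit in memory
        written.append(target.name)
    return written
```

This streams each archive with `shutil.copyfileobj`, so even multi-gigabyte dumps are decompressed without loading them fully into memory.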
- Run `cd <PARENT>` to enter the parent directory of FastKATE.
- Run `python3 -m FastKATE.src.topic_embeddings 20170901 ./wikidata/` to extract candidate topics (in the form of phrases) from the Wikipedia dump and generate a vector representation for each topic. For quick help, run `python3 -m FastKATE.src.topic_embeddings -h`.
- A pretrained topic embeddings model (trained on the Wikipedia dump of timestamp 20161201 and used in our paper) can be downloaded here (it consists of 3 files; download all 3 and put them in the same folder if you want to use the pretrained model).
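For intuition, ranking candidate topics against an area reduces to similarity between their vectors. A toy sketch using cosine similarity over made-up two-dimensional vectors (real topic embeddings come from the step above and are much higher-dimensional; the function names here are our own):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_topics(area_vec, topic_vecs, k):
    """Return the k topics most similar to the area vector, best first."""
    scored = sorted(topic_vecs.items(),
                    key=lambda kv: cosine(area_vec, kv[1]),
                    reverse=True)
    return [(name, round(cosine(area_vec, vec), 4)) for name, vec in scored[:k]]

# Toy embeddings, purely illustrative:
area = [1.0, 0.0]
topics = {"machine_learning": [0.9, 0.1],
          "cooking": [0.0, 1.0],
          "data_mining": [0.7, 0.3]}
```

Here `rank_topics(area, topics, 2)` puts `machine_learning` ahead of `data_mining`, while `cooking` (orthogonal to the area vector) is dropped.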
- Our code can easily be modified to train topic embeddings on datasets other than Wikipedia. If you want to do this, please refer to the source code for details.
- Run `cd <PARENT>` to enter the parent directory of FastKATE.
- Run `python3 -m FastKATE.src.taxonomy 20170901 ./wikidata/` to extract the category structure from Wikipedia. For quick help, run `python3 -m FastKATE.src.taxonomy -h`.
- A file containing the extracted category structure (used in our paper) can be downloaded here.
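The extracted category structure is essentially a directed graph from categories to subcategories. As an illustration only (the on-disk format of the real file may differ), a sketch that walks such a taxonomy breadth-first from a root area, guarding against the cycles that exist in Wikipedia's category graph:

```python
from collections import deque

def collect_descendants(taxonomy, root, max_depth=2):
    """Breadth-first walk: all subcategories of root within max_depth levels.

    taxonomy is a dict mapping each category to a list of its subcategories.
    """
    seen = {root}
    order = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand past the depth limit
        for child in taxonomy.get(node, []):
            if child not in seen:  # skip cycles / shared subcategories
                seen.add(child)
                order.append(child)
                queue.append((child, depth + 1))
    return order

# Toy taxonomy with a deliberate cycle back to the root:
toy = {"artificial_intelligence": ["machine_learning", "natural_language_processing"],
       "machine_learning": ["deep_learning"],
       "deep_learning": ["artificial_intelligence"]}
```

The depth limit matters in practice: without it, Wikipedia's category graph drifts into unrelated areas within a few hops.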
- Run `cd <PARENT>` to enter the parent directory of FastKATE.
- Run `python3 -m FastKATE.src.api ./wikidata/` to run the extraction algorithm and set up the API. For quick help, run `python3 -m FastKATE.src.api -h`. A currently running API can be visited here (its results differ slightly from the original paper because we have integrated MAG and ACM CCS data to further improve them).
- The inputs of the API are:
    - area: the area name; should be lowercase, with spaces replaced by `_`.
    - k: the number of topics to extract; should be a positive integer.
- The output of the API is a dict in JSON format, which consists of:
    - area: the same as the input.
    - result: the top-k extracted topics of the given area, each accompanied by its relevance to the area and ranked by it in descending order.
    - time: consumed time (in seconds).
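A client for this API can be sketched as follows. The base URL is a placeholder (not the real endpoint) and the helper names are our own; only the query construction and response fields follow the input/output description above:

```python
import json
from urllib.parse import urlencode

# Placeholder address; substitute wherever FastKATE.src.api is actually serving.
BASE_URL = "http://localhost:8000/topics"

def build_query_url(area: str, k: int) -> str:
    """Build a request URL, lowercasing the area and replacing spaces with '_'."""
    params = {"area": area.lower().replace(" ", "_"), "k": k}
    return BASE_URL + "?" + urlencode(params)

def parse_response(body: str):
    """Unpack the JSON response into (area, ranked topics, elapsed seconds)."""
    data = json.loads(body)
    return data["area"], data["result"], data["time"]
```

For example, `build_query_url("Machine Learning", 10)` yields a URL ending in `area=machine_learning&k=10`, and `parse_response` returns the `area`, `result`, and `time` fields described above.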