git clone https://github.com/attardi/wikiextractor.git
wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
python -m wikiextractor.WikiExtractor -o <output_folder> -l --no_templates --processes 8 <path_to_bz2_file>
find ./ -name 'wiki*' | xargs grep -o -P "(<a href=\").*?\">" | sort -u > <outputfile>
This outputs the entities linked from the Wikipedia articles as a sorted list of unique strings.
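For reference, a minimal Python sketch equivalent to the find/grep pipeline above; the extracted folder name and output file name are placeholders, and it assumes WikiExtractor was run with -l so links appear as <a href="Target"> anchors:

import re
from pathlib import Path

# Collect every unique <a href="..."> anchor found in the extracted wiki* files.
link_pattern = re.compile(r'<a href=".*?">')
entities = set()
for wiki_file in Path("extracted").rglob("wiki*"):
    with open(wiki_file, encoding="utf-8") as f:
        for line in f:
            entities.update(link_pattern.findall(line))

# Write the sorted, de-duplicated list, one link per line.
with open("linked_entities.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(entities)) + "\n")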
pip install SPARQLWrapper
python query_dbpedia.py -i <inputfile> -o <outputfile> -c <completedfile>
This script uses the output file from step 1.4 as its input file. It writes as output (a querying sketch follows this list):
- An output file with the query response for each successfully queried entity;
- A completed file listing the successfully queried entities, so that if the process needs to be restarted the already fetched entities can be skipped;
- An error log with the entities that failed to be queried.
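A rough sketch of how one entity can be queried with SPARQLWrapper is shown below; the public DBpedia endpoint and the rdf:type query are standard, but the entity-to-URI conversion is illustrative and not necessarily the script's exact implementation:

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

def query_entity(title):
    # Build the DBpedia resource URI from a Wikipedia page title and ask for its rdf:type values.
    resource = "http://dbpedia.org/resource/" + title.replace(" ", "_")
    sparql.setQuery(
        "SELECT ?type WHERE { <%s> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type }" % resource
    )
    return sparql.query().convert()

# Entities already listed in the completed file can be skipped before querying,
# which is what makes restarting an interrupted run cheap.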
python class_extract_dbpedia.py -i <inputfile> -o <outputfile> -b <blanksfile>
This script uses the output file from step 2.2 as its input file. It outputs two files (a sketch of the class extraction follows this list):
- An output file with the entities and their classes according to DBpedia;
- A blanks file with all entities that did not have any type associated with them, along with their SPARQL responses.
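The extraction itself amounts to reading each stored SPARQL response and keeping only the DBpedia ontology classes; a small sketch, assuming the responses are in SPARQLWrapper's JSON format with the variable named ?type as in the sketch above:

def extract_classes(response):
    # Pull all rdf:type values out of the JSON response and keep only DBpedia ontology classes.
    types = [binding["type"]["value"] for binding in response["results"]["bindings"]]
    return [t for t in types if t.startswith("http://dbpedia.org/ontology/")]

# Entities for which this returns an empty list would go to the blanks file,
# together with their raw responses, as described above.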
python process_wiki_files.py -i <inputfile> -o <outputfolder> --unerpath <unerpath> --wikipath <wikipath>
This script uses the output file from step 2.3 as its input file. You can also specify the path to the UNER mapping and the path to the Wikipedia dumps; if you leave them empty, they default to the wiki and uner folders. You can also specify the output folder for all produced output; it defaults to process_wiki_files_output.
This generates as output:
- A folder with a subfolder for each Wikipedia partition. Each subfolder has a txt file with the annotations and a pkl file with a pickled annotation data structure (see the inspection sketch after this list).
- A file called Entities_Statistics with basic statistical information about the generated corpus (number of sentences and number of tokens).
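If you want to inspect the pickled annotation structure of a partition, a small helper like the following works; the folder name assumes the default output folder, and the shape of the pickled object depends on the script:

import pickle
from pathlib import Path

for pkl_file in Path("process_wiki_files_output").rglob("*.pkl"):
    with open(pkl_file, "rb") as f:
        annotations = pickle.load(f)
    print(pkl_file, type(annotations))
    break  # look at the first partition only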
python punct_alignment_BIO.py
This script uses the output from step 2.4 as its input. If the name of that output folder is different from process_wiki_files_output, change the name in this script. It verifies the tokenization, corrects some punctuation problems, and formats the annotations according to the BIO format.
It generates as output:
- A folder (process_wiki_files_output_BIO) with a subfolder for each Wikipedia partition. Each subfolder has a txt file with the annotations, with the corrections applied and in the BIO format.
- A concatenated file (inside the output folder), corpus_total_BIO, with the whole annotated text (to be used in the next steps for statistical analysis).
This output can then be used to train a tagging model for that language. You can either use the concatenated file or select files from the subfolders (a short sketch of the BIO convention follows).
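For clarity, the BIO convention applied here tags the first token of an entity as B-<class>, the following tokens as I-<class>, and everything else as O. A minimal sketch, where the class names and token spans are made up for illustration:

def to_bio(tokens, spans):
    # tokens: list of strings; spans: list of (start, end, uner_class) with token indices, end exclusive.
    tags = ["O"] * len(tokens)
    for start, end, uner_class in spans:
        tags[start] = "B-" + uner_class
        for i in range(start + 1, end):
            tags[i] = "I-" + uner_class
    return list(zip(tokens, tags))

print(to_bio(["Lisbon", "is", "in", "Portugal"], [(0, 1, "LOC"), (3, 4, "LOC")]))
# [('Lisbon', 'B-LOC'), ('is', 'O'), ('in', 'O'), ('Portugal', 'B-LOC')]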
python statistics_BIO.py
This script uses the concatenated output file from step 3.1 (corpus_total_BIO) as input.
It generates a statistics_BIO file with the following information (a counting sketch follows this list):
- Number of tokens
- Number of tokens with "O" tag
- Number of tokens that belong to entities (tagged "B" or "I")
- Number of entities in the corpus
- List of UNER classes and number of occurrences in the corpus.
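These counts can be reproduced from the concatenated file with a few lines of Python, assuming each non-empty line holds a token and its tag separated by a tab (adjust the separator if your corpus uses something else):

from collections import Counter

tokens = o_tokens = entity_tokens = entities = 0
class_counts = Counter()
with open("corpus_total_BIO", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # empty lines separate sentences
        _, tag = line.rsplit("\t", 1)
        tokens += 1
        if tag == "O":
            o_tokens += 1
        else:
            entity_tokens += 1
            if tag.startswith("B-"):
                entities += 1
                class_counts[tag[2:]] += 1
print(tokens, o_tokens, entity_tokens, entities, class_counts)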
python statistics_BIO_advanced.py
This script uses the concatenated output file from step 3.1 (corpus_total_BIO) as input.
It generates a statistics_advanced_BIO file with the following information (a sketch follows this list):
- Number of unique entities in the concatenated corpus
- List of the unique entities and the corresponding UNER tag
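Under the same token-tab-tag assumption as above, the unique entities and their tags can be collected by joining each B-/I- run back into a surface string:

unique_entities = {}
current_tokens, current_tag = [], None
with open("corpus_total_BIO", encoding="utf-8") as f:
    lines = [line.strip() for line in f] + [""]  # sentinel flushes the last open entity
for line in lines:
    token, tag = line.rsplit("\t", 1) if line else ("", "O")
    if tag.startswith("I-") and current_tokens:
        current_tokens.append(token)  # continue the current entity
        continue
    if current_tokens:
        unique_entities[" ".join(current_tokens)] = current_tag  # close the previous entity
    current_tokens, current_tag = ([token], tag[2:]) if tag.startswith("B-") else ([], None)
print("unique entities:", len(unique_entities))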