# Cleansing Wikipedia Categories using Centrality
## by Paolo Boldi and Corrado Monti

We propose a novel general technique aimed at pruning and cleansing the Wikipedia category hierarchy, with a tunable level of aggregation. Our approach is endogenous: it does not use any information coming from Wikipedia articles, but is based solely on the user-generated (noisy) Wikipedia category folksonomy itself. We show how the proposed technique can help reduce the level of noise in the hierarchy, and discuss how alternative centrality measures can impact the result differently.

For more information, see [the paper, presented at Wiki Workshop 2016 (WWW2016 companion), in Montreal](http://dl.acm.org/ft_gateway.cfm?id=2891111&ftid=1707848).
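As a toy illustration of the idea (not the exact measure used in the paper, which compares several centrality notions), one can score categories in a small parent-to-subcategory graph with a PageRank-style centrality and keep only the top-ranked ones. The graph, damping factor, and cutoff below are invented for the example:

```python
def pagerank(edges, damping=0.85, iters=50):
    """edges maps each category to the list of subcategories it contains."""
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for src in nodes:
            targets = edges.get(src, [])
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
            else:
                # Dangling category: spread its rank uniformly.
                for v in nodes:
                    new[v] += damping * rank[src] / n
        rank = new
    return rank

# A tiny invented fragment of the category hierarchy.
edges = {
    "Science": ["Physics", "Biology"],
    "Physics": ["Quantum_mechanics"],
    "Biology": ["Genetics"],
    "1990s_births": [],  # an administrative category nothing points to
}
ranks = pagerank(edges)
kept = sorted(ranks, key=ranks.get, reverse=True)[:3]
```

Categories that sit on well-connected paths of the hierarchy accumulate rank, while isolated administrative categories like the invented `1990s_births` fall below the cutoff.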

# Provided dataset

* `page2cat.tsv.gz` is a gzipped TSV file with the mapping from Wikipedia pages to cleansed categories, ordered from the most important to the least important.
* `ranked-categories.tsv.gz` is a gzipped TSV file with every Wikipedia category and our importance score.

We also provide the first lines of these files, to show what they look like after unzipping.
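A minimal sketch of loading `page2cat.tsv.gz` into a dictionary. The column layout is an assumption based on the description above (page title first, then its cleansed categories from most to least important):

```python
import gzip

def read_page2cat(path):
    """Map each page title to its list of categories, best first.

    Assumes one tab-separated record per line: title, then categories.
    """
    mapping = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if fields and fields[0]:
                mapping[fields[0]] = fields[1:]
    return mapping
```

Checking the provided file heads against this assumed layout before relying on it is advisable.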

If you use the dataset or the code, please cite:

Boldi, Paolo, and Corrado Monti. "Cleansing Wikipedia Categories using Centrality." Proceedings of the 25th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, 2016.

BibTeX:

    @inproceedings{boldi2016cleansing,
      title={Cleansing wikipedia categories using centrality},
      author={Boldi, Paolo and Monti, Corrado},
      booktitle={Proceedings of the 25th International Conference Companion on World Wide Web},
      pages={969--974},
      year={2016},
      organization={International World Wide Web Conferences Steering Committee}
    }

PLEASE NOTE: *Experiments described in the paper were run on a 2014 snapshot called `enwiki-20140203-pages-articles.xml.bz2`, while – to provide an updated version – this dataset refers to `enwiki-20160407-pages-articles.xml.bz2`.*

# How to run the code

Set up the environment
----------------------

In order to compile the code, you'll need Java 8, Ant, and Ivy. To install them (e.g. inside a clean [Vagrant](http://vagrantup.com/) box with `ubuntu/trusty64`), you can use these lines:

    sudo apt-get --yes update
    sudo apt-get install -y software-properties-common python-software-properties
    echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | sudo /usr/bin/debconf-set-selections
    sudo add-apt-repository ppa:webupd8team/java -y
    sudo apt-get update
    sudo apt-get --yes install oracle-java8-installer
    sudo apt-get --yes install oracle-java8-set-default
    sudo apt-get --yes install ant ivy
    sudo ln -s -T /usr/share/java/ivy.jar /usr/share/ant/lib/ivy.jar

Compile the code
----------------

If the environment is set up properly, you can install git and download this repo with

    sudo apt-get install git
    git clone https://github.com/corradomonti/wikipedia-categories.git

and then go to the directory `java`. There, run:

* `ant ivy-setupjars` to download dependencies;
* `ant` to compile;
* `. setcp.sh` to include the produced jar in the Java classpath.

Now you are ready to run `run.sh`.

A second file in this commit defines the properties for the Ant build:
    version=1.0

    build.sysclasspath=ignore

    jar.base=/usr/share/java
    javadoc.base=/usr/share/javadoc

    dist=dist
    src=src
    test=test
    slow=slow
    reports=reports
    coverage=coverage
    checkstyle=checkstyle
    docs=docs
    build=build
    instrumented=instr

    j2se.apiurl=http://download.oracle.com/javase/6/docs/api/
    fastutil.apiurl=http://fastutil.dsi.unimi.it/docs/
    jsap.apiurl=http://www.martiansoftware.com/jsap/doc/javadoc/
    junit.apiurl=http://junit.sourceforge.net/javadoc_40/
    log4j.apiurl=http://logging.apache.org/log4j/1.2/apidocs/
    slf4j.apiurl=http://www.slf4j.org/apidocs/
    webgraph.apiurl=http://webgraph.dsi.unimi.it/docs/
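Ant reads these lines as Java-style `key=value` properties. A minimal sketch of parsing such a file, ignoring the escapes, `:` separators, and line continuations that the full Java properties format also supports:

```python
def parse_properties(text):
    """Parse simple key=value properties, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # keep only well-formed key=value lines
            props[key.strip()] = value.strip()
    return props
```

For example, `parse_properties("version=1.0")` yields `{"version": "1.0"}`.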