Cleansing Wikipedia Categories using Centrality
Former-commit-id: d47ba8e
corradomonti committed Apr 21, 2016
1 parent b391672 commit 888e065
Showing 17 changed files with 2,430 additions and 2 deletions.
70 changes: 68 additions & 2 deletions README.md
@@ -1,2 +1,68 @@
# wikipedia-categories
Cleansing Wikipedia Categories using Centrality
# Cleansing Wikipedia Categories using Centrality
## by Paolo Boldi and Corrado Monti

We propose a novel, general technique for pruning and cleansing the Wikipedia category hierarchy, with a tunable level of aggregation. Our approach is endogenous: it does not use any information from Wikipedia articles, and is based solely on the user-generated (noisy) Wikipedia category folksonomy itself. We show how the proposed techniques can help reduce the level of noise in the hierarchy, and discuss how alternative centrality measures affect the result.

For more information, see [the paper, presented at WWW2016 (companion), Wiki Workshop 2016, in Montreal](http://dl.acm.org/ft_gateway.cfm?id=2891111&ftid=1707848).

# Provided dataset

* `page2cat.tsv.gz` is a gzipped TSV file with the mapping from Wikipedia pages to cleansed categories, from the most important to the least important.
* `ranked-categories.tsv.gz` is a gzipped TSV file with every Wikipedia category and our importance score.

We also provide the head of each of these files, to show what they look like once unzipped.
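
To look inside the full files without unpacking them on disk, something along these lines may help. This is only a sketch: `peek_tsv` and `count_fields` are hypothetical helper names, not part of this repository, and they assume nothing beyond what is stated above, namely that the files are gzip-compressed, tab-separated text. The exact column layout is best checked against the provided head files.

```shell
# Hypothetical helper (not part of this repository): print the first N rows
# of a gzipped TSV file by streaming it, without writing an unpacked copy.
peek_tsv() {
    # $1 = path to the .tsv.gz file, $2 = number of rows (default 5)
    gzip -dc "$1" | head -n "${2:-5}"
}

# Hypothetical helper: count the tab-separated fields in the first row.
count_fields() {
    gzip -dc "$1" | head -n 1 | awk -F '\t' '{ print NF }'
}
```

For example, `peek_tsv ranked-categories.tsv.gz 10` would show the first ten rows of the ranking, and `count_fields page2cat.tsv.gz` would report how many fields the first page row carries.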

If you use the dataset or the code, please cite:
Boldi, Paolo, and Corrado Monti. "Cleansing wikipedia categories using centrality." Proceedings of the 25th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, 2016.

BibTeX:

    @inproceedings{boldi2016cleansing,
      title={Cleansing wikipedia categories using centrality},
      author={Boldi, Paolo and Monti, Corrado},
      booktitle={Proceedings of the 25th International Conference Companion on World Wide Web},
      pages={969--974},
      year={2016},
      organization={International World Wide Web Conferences Steering Committee}
    }


PLEASE NOTE: *Experiments described in the paper were run on a 2014 snapshot called
`enwiki-20140203-pages-articles.xml.bz2`, while – to provide an updated version –
this dataset refers to `enwiki-20160407-pages-articles.xml.bz2`.*

# How to run the code

Set up the environment
----------------------

To compile the code, you need Java 8, Ant and Ivy. To install
them (e.g. inside a clean [Vagrant](http://vagrantup.com/) box with
`ubuntu/trusty64`), you can run the following commands:

    sudo apt-get --yes update
    sudo apt-get install -y software-properties-common python-software-properties
    echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | sudo /usr/bin/debconf-set-selections
    sudo add-apt-repository ppa:webupd8team/java -y
    sudo apt-get update
    sudo apt-get --yes install oracle-java8-installer
    sudo apt-get --yes install oracle-java8-set-default
    sudo apt-get --yes install ant ivy
    sudo ln -s -T /usr/share/java/ivy.jar /usr/share/ant/lib/ivy.jar
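
Before moving on, it can be worth confirming that the tools actually ended up on the `PATH`. The following is a minimal sketch: `check_tools` is a hypothetical helper, not part of this repository, and it relies only on the standard `command -v` shell builtin.

```shell
# Hypothetical helper (not part of this repository): report which of the
# given commands can be found on PATH.
# Returns 0 when every command is found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_tools java ant` covers the two executables. Ivy is used by Ant as a library rather than invoked directly, so a plain `ls -l /usr/share/ant/lib/ivy.jar` is enough to check the symlink created by the last command above.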


Compile the code
----------------------

Once the environment is set up, install git and clone this repository with

    sudo apt-get install git
    git clone https://github.com/corradomonti/wikipedia-categories.git

and then go to the directory `java`. There, run:

* `ant ivy-setupjars` to download dependencies
* `ant` to compile
* `. setcp.sh` to add the produced jar to the Java classpath.

Now you are ready to run `run.sh`.
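
The three build steps above can also be chained so that a failure stops the sequence early. This is a sketch only: `build_wikipedia_categories` is a hypothetical wrapper, not part of this repository, and it assumes it is called with the path of the cloned repository (defaulting to the current directory).

```shell
# Hypothetical wrapper (not part of this repository): run the build steps
# in order, stopping at the first one that fails.
# Note: it changes into the java directory and sources setcp.sh in the
# current shell, so the caller ends up there with the classpath set.
build_wikipedia_categories() {
    repo_root="${1:-.}"               # path to the cloned repository
    cd "$repo_root/java" || return 1
    ant ivy-setupjars || return 1     # download dependencies
    ant || return 1                   # compile
    . ./setcp.sh || return 1          # add the produced jar to the classpath
}
```

After `build_wikipedia_categories wikipedia-categories` returns successfully, the shell is in the `java` directory and `run.sh` can be started from there.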
25 changes: 25 additions & 0 deletions java/build.properties
@@ -0,0 +1,25 @@
version=1.0

build.sysclasspath=ignore

jar.base=/usr/share/java
javadoc.base=/usr/share/javadoc

dist=dist
src=src
test=test
slow=slow
reports=reports
coverage=coverage
checkstyle=checkstyle
docs=docs
build=build
instrumented=instr

j2se.apiurl=http://download.oracle.com/javase/6/docs/api/
fastutil.apiurl=http://fastutil.dsi.unimi.it/docs/
jsap.apiurl=http://www.martiansoftware.com/jsap/doc/javadoc/
junit.apiurl=http://junit.sourceforge.net/javadoc_40/
log4j.apiurl=http://logging.apache.org/log4j/1.2/apidocs/
slf4j.apiurl=http://www.slf4j.org/apidocs/
webgraph.apiurl=http://webgraph.dsi.unimi.it/docs/