Tools for analyzing Wikipedia content (especially links) using Python and MapReduce.
Below is a sample session in which we download and process Wikipedia dumps using MRwiki tools. Run the following commands in the MRwiki directory.
mkdir tmp
cd tmp
../s1-download-dumps.sh
for f in * ; do gzip -d "$f" ; done
cd ..
At this point the necessary Wikipedia dump files are downloaded and unpacked. It is time to copy them to an HDFS filesystem. Note: we unpacked the dumps because gzip files are not splittable, so Hadoop cannot read a gzipped file with more than one mapper. If you really care about space, recompress the dump files with bzip2, which is splittable (a sketch follows below). Warning: do not rename the dump files except by adding a suffix, since some scripts depend on information that is encoded in the filenames.
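If you do recompress, a minimal sketch of that step, run from the MRwiki directory, could look like this (bzip2 replaces each file with $f.bz2, i.e. it only adds a suffix, which the warning above permits):
cd tmp
for f in * ; do bzip2 "$f" ; done
cd ..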
export WIKI=hdfs:///user/$USER/wiki
hadoop fs -mkdir $WIKI
hadoop fs -mkdir $WIKI/d1
hadoop fs -put tmp/* $WIKI/d1
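Optionally, confirm that the upload succeeded by listing the target directory:
hadoop fs -ls $WIKI/d1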
Great! The SQL dumps are now in HDFS. Time to parse them and sort them by type. Note: we will use 100 reducers. If you are lucky enough to have a larger cluster at your disposal, go ahead and increase this number. However, bear in mind that allocating more than a thousand or so reducers will probably be a waste of resources.
export COMMONOPTS='-r hadoop --jobconf mapred.reduce.tasks=100 --no-output'
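For illustration only, an mrjob invocation with these options generally takes the form below; mr-parse.py is a hypothetical placeholder name, not necessarily an actual MRwiki script:
python mr-parse.py $COMMONOPTS --output-dir $WIKI/parsed $WIKI/d1   # mr-parse.py is a placeholder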
…