The processing of Wikipedia dump files uses Hadoop 3.* (latest tested version is 3.3.1, from 2022-02-10), which must be downloaded and installed according to your environment. The processing should still work with Hadoop 2. In your Hadoop installation, some config files under etc/hadoop/ might need to be adjusted for the task. We provide our configuration files below; the indicated values can of course be adapted to take advantage of more CPU and memory on a more modern machine.
Please note that running Hadoop is not straightforward and often involves fixing various setup and connection problems (even on a very standard and basic Linux installation). The problems depend on the environment (e.g. Linux distribution) and OS version. See below for common failures and help.
For each language, this processing step generates 12 csv files:
articleParents.csv label.csv pageLinkOut.csv
categoryParents.csv page.csv redirectSourcesByTarget.csv
childArticles.csv pageLabel.csv redirectTargetsBySource.csv
childCategories.csv pageLinkIn.csv stats.csv
Processing times on an Intel Core i7-4790K CPU 4.00GHz (Haswell), 32GB memory, 4 cores / 8 threads, SSD, in pseudo-distributed mode:
- English Wikipedia XML dump: around 7h30
- French and German Wikipedia XML dumps: around 2h30
- other languages: between 30mn and 1h30
Note: at the present time, the process works only in pseudo-distributed mode (the LMDB cache DBs are located under the local /tmp/). To run at cluster level, we would need to locate the LMDB cache databases on HDFS and use the Hadoop distributed cache to access them. However, as the runtime on a single machine is very reasonable, we have not further generalized the process.
We give here the Hadoop 3.* config files with YARN that we use to successfully process the Wikidata and Wikipedia dumps. You can adapt them according to the capacity of your server.
etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_PREFIX=/home/lopez/tools/hadoop/hadoop-3.3.1
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/home/lopez/tools/hadoop/hadoop-3.3.1/etc/hadoop"}
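Note that in Hadoop 3 the HADOOP_PREFIX variable has been deprecated in favour of HADOOP_HOME (keeping the former only triggers a warning at startup). Since the mapred-site.xml below references ${HADOOP_HOME}, you may also want to set it explicitly, adapted to your installation path:
export HADOOP_HOME=/home/lopez/tools/hadoop/hadoop-3.3.1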
etc/hadoop/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/lopez/tools/hadoop/hadoop-3.3.1/mydata/hdfs/namenode/</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/lopez/tools/hadoop/hadoop-3.3.1/mydata/hdfs/datanode/</value>
</property>
etc/hadoop/yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>13312</value> <!-- typical total memory available on a machine having 16GB RAM, in pseudo-distributed mode -->
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>12288</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value> <!-- indicate the number of cores available on the machine, in pseudo-distributed mode -->
</property>
If a job is scheduled but never executed, it might be due to a lack of disk space on the namenode or datanode. This can be avoided by adding the following to yarn-site.xml:
<property>
<name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
<value>0.0</value>
</property>
<property>
<name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
<value>100.0</value>
</property>
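If jobs remain stuck in the scheduled (ACCEPTED) state, it can also help to check the actual disk usage and the node health reported by YARN, for instance:
df -h
hadoop-3.3.1/bin/yarn node -list -all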
etc/hadoop/mapred-site.xml
<configuration>
<!--property>
<name>mapred.child.java.opts</name>
<value>-Xmx1024m</value>
</property--> <!-- in case more memory is needed in the main hadoop job, which should normally not be necessary -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>3072</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>8192</value> <!-- single reduce job ! -->
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx8192m</value>
</property>
<property>
<name>mapreduce.job.ubertask.enable</name>
<value>true</value>
</property>
<property>
<name>mapreduce.tasktracker.map.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapreduce.tasktracker.reduce.tasks.maximum</name>
<value>1</value>
</property> <!-- single reduce job ! -->
<property>
<name>mapred.reduce.slowstart.completed.maps</name>
<value>1</value>
</property>
<property>
<name>mapred.task.timeout</name>
<value>1800000</value>
</property> <!-- timeout of 30 minutes, safer for building the largest LMDB caches -->
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>
etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
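Once HDFS is running (see the next section), you can verify that the NameNode actually answers on this address, for instance from the command line or via the NameNode web UI (port 9870 by default in Hadoop 3):
hadoop-3.3.1/bin/hdfs dfs -ls hdfs://localhost:9000/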
Note: you will probably encounter some issues when starting Hadoop; see below for help.
- Prepare the namenode:
hadoop-3.3.1/bin/hdfs namenode -format
- start the HDFS nodes and copy the files to be processed on the HDFS:
hadoop-3.3.1/sbin/start-dfs.sh
hadoop-3.3.1/bin/hdfs dfs -mkdir /user
hadoop-3.3.1/bin/hdfs dfs -mkdir /user/lopez
hadoop-3.3.1/bin/hdfs dfs -put ~/grisp/nerd-data/data/languages.xml /user/lopez/
hadoop-3.3.1/bin/hdfs dfs -put /mnt/data/wikipedia/latest/en/enwiki-latest-pages-articles-multistream.xml /user/lopez/
hadoop-3.3.1/bin/hdfs dfs -put /mnt/data/wikipedia/latest/fr/frwiki-latest-pages-articles-multistream.xml /user/lopez/
hadoop-3.3.1/bin/hdfs dfs -put /mnt/data/wikipedia/latest/de/dewiki-latest-pages-articles-multistream.xml /user/lopez/
...
hadoop-3.3.1/bin/hdfs dfs -mkdir /user/lopez/output
hadoop-3.3.1/bin/hdfs dfs -mkdir /user/lopez/working
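As a sanity check before launching the jobs, you can list what has been copied to HDFS:
hadoop-3.3.1/bin/hdfs dfs -ls /user/lopez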
Note that the **wiki-latest-pages-articles.xml file must be passed uncompressed to Hadoop. While the bzip2 compression format is normally supported automatically by Hadoop as an input format (because it is a splittable compression format), it is currently not working with the Wikipedia dump file.
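The Wikipedia dumps are distributed as bzip2 archives, so they have to be decompressed first, for instance with bzip2 (the -k option keeps the original archive):
bzip2 -dk /mnt/data/wikipedia/latest/en/enwiki-latest-pages-articles-multistream.xml.bz2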
Starting Hadoop commonly fails for various reasons; we try to cover here the most common ones:
- password-less authentication is not configured on localhost:
Starting namenodes on [localhost]
localhost: user@localhost: Permission denied (publickey,password).
This can be solved by adding the ssh key of the machine to itself:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
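If no SSH key pair exists yet on the machine, a minimal sketch of the full setup (assuming an RSA key and the default file locations):
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost   # should now log in without a password prompt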
- JAVA_HOME not found:
Starting namenodes on [localhost]
localhost: ERROR: JAVA_HOME is not set and could not be found.
Even if JAVA_HOME is correctly set for the user, in the .bashrc or profile, for some unknown reason Hadoop might fail to find and use it. To fix the issue, JAVA_HOME needs to be set in the hadoop/hadoop-env.sh config file, as indicated in the previous section on configuration files:
export JAVA_HOME=/usr/lib/jvm/<jdk folder>
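If you are unsure where the JDK is installed, the following usually prints the directory to use (a sketch, assuming java is on the PATH):
readlink -f "$(which java)" | sed 's:/bin/java::'   # directory containing bin/java, usable as JAVA_HOME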
- Connection refused:
Call From **hostname**/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Check if you already have a process listening on port 9000:
sudo netstat -tnlp | grep :9000
If yes, you need to stop the indicated process - this is usually a zombie YARN process that failed to stop.
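A possible way to identify and stop it (replace <PID> with the process id reported by netstat or lsof):
sudo lsof -i :9000
sudo kill <PID>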
- Connection to the ResourceManager fails:
When starting some Hadoop job, you might see:
2022-12-21 16:10:03,762 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2022-12-21 16:10:04,810 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
First, just to be sure, check that you have started YARN:
sbin/start-yarn.sh
To solve this problem, you usually need to indicate in the etc/hadoop/yarn-site.xml configuration file:
<property>
<name>yarn.resourcemanager.address</name>
<value>127.0.0.1:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>127.0.0.1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>127.0.0.1:8031</value>
</property>
Under ~/grisp/nerd-data:
mvn clean package
will create the job jar under ./target/com.scienceminer.grisp.nerd-data-0.0.6-job.jar, to be used below.
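A quick check that the build produced the expected artifact:
ls -lh ~/grisp/nerd-data/target/com.scienceminer.grisp.nerd-data-0.0.6-job.jar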
- start YARN:
hadoop-3.3.1/sbin/start-yarn.sh
- English (path in HDFS, except the jar):
hadoop-3.3.1/bin/hadoop jar ~/grisp/nerd-data/target/com.scienceminer.grisp.nerd-data-0.0.6-job.jar /user/lopez/enwiki-latest-pages-articles-multistream.xml /user/lopez/languages.xml en /user/lopez/working /user/lopez/output
After a few hours, when the job is done, get the csv files for the English language:
hadoop-3.3.1/bin/hdfs dfs -get /user/lopez/output/* /mnt/data/wikipedia/latest/en/
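As mentioned above, 12 csv files are expected per language, which can be quickly verified (assuming all files land directly in the target directory):
ls /mnt/data/wikipedia/latest/en/*.csv | wc -l   # should print 12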
- French:
hadoop-3.3.1/bin/hadoop jar ~/grisp/nerd-data/target/com.scienceminer.grisp.nerd-data-0.0.6-job.jar /user/lopez/frwiki-latest-pages-articles-multistream.xml /user/lopez/languages.xml fr /user/lopez/working /user/lopez/output
Getting the csv files for French:
hadoop-3.3.1/bin/hdfs dfs -get /user/lopez/output/* /mnt/data/wikipedia/latest/fr/
- German:
hadoop-3.3.1/bin/hadoop jar ~/grisp/nerd-data/target/com.scienceminer.grisp.nerd-data-0.0.6-job.jar /user/lopez/dewiki-latest-pages-articles-multistream.xml /user/lopez/languages.xml de /user/lopez/working /user/lopez/output
Getting the csv files for German:
hadoop-3.3.1/bin/hdfs dfs -get /user/lopez/output/* /mnt/data/wikipedia/latest/de/
- Italian:
hadoop-3.3.1/bin/hadoop jar ~/grisp/nerd-data/target/com.scienceminer.grisp.nerd-data-0.0.6-job.jar /user/lopez/itwiki-latest-pages-articles-multistream.xml /user/lopez/languages.xml it /user/lopez/working /user/lopez/output
Getting the csv files for Italian:
hadoop-3.3.1/bin/hdfs dfs -get /user/lopez/output/* /mnt/data/wikipedia/latest/it/
- Spanish:
hadoop-3.3.1/bin/hadoop jar ~/grisp/nerd-data/target/com.scienceminer.grisp.nerd-data-0.0.6-job.jar /user/lopez/eswiki-latest-pages-articles-multistream.xml /user/lopez/languages.xml es /user/lopez/working /user/lopez/output
Getting the csv files for Spanish:
hadoop-3.3.1/bin/hdfs dfs -get /user/lopez/output/* /mnt/data/wikipedia/latest/es/
And so on for other supported languages.
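Since the per-language invocations only differ by the dump file, the language code and the target directory, they can also be scripted; a minimal sketch mirroring the commands above:
for lang in en fr de it es ; do
  hadoop-3.3.1/bin/hadoop jar ~/grisp/nerd-data/target/com.scienceminer.grisp.nerd-data-0.0.6-job.jar \
    /user/lopez/${lang}wiki-latest-pages-articles-multistream.xml /user/lopez/languages.xml ${lang} \
    /user/lopez/working /user/lopez/output
  hadoop-3.3.1/bin/hdfs dfs -get /user/lopez/output/* /mnt/data/wikipedia/latest/${lang}/
done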
Finally, you can clean up, stop HDFS and stop YARN:
hadoop-3.3.1/bin/hdfs dfs -rm /user/lopez/*wiki-latest-pages-articles-multistream.xml
hadoop-3.3.1/bin/hdfs dfs -rm -r /user/lopez/output/*
hadoop-3.3.1/bin/hdfs dfs -rm -r /user/lopez/working/*
hadoop-3.3.1/sbin/stop-dfs.sh
hadoop-3.3.1/sbin/stop-yarn.sh
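To make sure everything has shut down, jps (shipped with the JDK) should no longer list the NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager processes:
jps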