Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
Source: Wikipedia
One of the outcomes of the Big Data Europe project was a repository of Docker images for each of the building-block big data tools, one of them being Apache Hadoop. These images continue to be maintained.
Hadoop depends on Java 8.
Rather than install this old version of Java, use the Docker Hadoop project to get started:
git clone https://github.com/big-data-europe/docker-hadoop.git
cd docker-hadoop
Create a local data directory that will be mounted into the NameNode container, so that we can put data files in it and later copy them into HDFS:
mkdir data
In the volumes key of the namenode service in docker-compose.yml, add this local path, relative to the Compose file:
- ./data:/home/data
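After this change, the volumes section of the namenode service should look something like the sketch below (the existing named-volume entry may differ slightly between versions of the repository):

  namenode:
    volumes:
      - hadoop_namenode:/hadoop/dfs/name   # existing entry from the repository's compose file
      - ./data:/home/data                  # new: local data directory mounted into the container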
Expose the port of the DataNode web UI.
In the ports key of the datanode service in docker-compose.yml, add this port mapping (a combined sketch of the datanode service follows the hostname step below):
- 9864:9864
Give the DataNode a hostname that we can resolve from the host machine, so that we can follow the redirects to the DataNode that the NameNode returns when retrieving files through the WebHDFS REST API.
In the datanode service in docker-compose.yml, add the following:
hostname: datanode
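Taken together with the port mapping above, the datanode service should now look roughly like this (keys not shown are unchanged from the repository's compose file):

  datanode:
    hostname: datanode      # new: fixed hostname so WebHDFS redirects resolve
    ports:
      - 9864:9864           # new: expose the DataNode web UI / WebHDFS port
    volumes:
      - hadoop_datanode:/hadoop/dfs/data   # existing entry from the repository's compose file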
Add a hosts entry for the DataNode to /etc/hosts (Linux) or C:\Windows\System32\drivers\etc\hosts (Windows):
127.0.0.1 datanode
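A quick way to confirm the entry resolves (on Windows use ping -n 1 instead):
ping -c 1 datanode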
Deploy the HDFS cluster:
docker-compose up -d
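Check that the containers have started and, after a short wait, report as healthy:
docker-compose ps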
Browse to the web UIs, where you can explore the status, logs and the file system: the NameNode at http://localhost:9870 and the DataNode at http://localhost:9864.
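If you prefer the command line, this prints just the final HTTP status code returned by the NameNode web UI (200 once it is up):
curl -s -o /dev/null -L -w "%{http_code}\n" http://localhost:9870/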
Add some data to the local data directory, ready to copy into HDFS later (the single quotes keep the embedded double quotes in the file):
echo -n '1,"test"' > data/test.csv
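Check the file content looks as expected:
cat data/test.csv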
Check the local data directory is mounted:
docker-compose exec namenode ls -alh /home/data
List the HDFS root directory:
docker-compose exec namenode hdfs dfs -ls -R /
Make a HDFS directory to add data to:
docker-compose exec namenode hdfs dfs -mkdir -p /db/data
Add data to HDFS:
docker-compose exec namenode hdfs dfs -copyFromLocal /home/data /db
List the HDFS directory:
docker-compose exec namenode hdfs dfs -ls -R /db/data
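You can also read the file back from inside the NameNode container to confirm it arrived intact:
docker-compose exec namenode hdfs dfs -cat /db/data/test.csv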
Use the REST API to get the status of a file:
curl -i http://localhost:9870/webhdfs/v1/db/data/test.csv?op=GETFILESTATUS
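The directory can be listed over WebHDFS as well, using the LISTSTATUS operation:
curl -i http://localhost:9870/webhdfs/v1/db/data?op=LISTSTATUS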
Use the REST API to get a file:
curl -i -L http://localhost:9870/webhdfs/v1/db/data/test.csv?op=OPEN
returns:
HTTP/1.1 307 Temporary Redirect
Date: Mon, 25 Jan 2021 17:20:49 GMT
Cache-Control: no-cache
Expires: Mon, 25 Jan 2021 17:20:49 GMT
Date: Mon, 25 Jan 2021 17:20:49 GMT
Pragma: no-cache
X-Content-Type-Options: nosniff
X-FRAME-OPTIONS: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Location: http://datanode:9864/webhdfs/v1/db/data/test.csv?op=OPEN&namenoderpcaddress=namenode:9000&offset=0
Content-Type: application/octet-stream
Content-Length: 0
HTTP/1.1 200 OK
Access-Control-Allow-Methods: GET
Access-Control-Allow-Origin: *
Content-Type: application/octet-stream
Connection: close
Content-Length: 11
1,"test"
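Writing works the same way in reverse: the NameNode replies with a redirect to a DataNode and the file content is sent there. Below is a sketch of the documented two-step CREATE flow, using a hypothetical upload.csv in the local data directory; depending on HDFS permission settings you may also need a user.name query parameter.

# Step 1: ask the NameNode where to write; no file data is sent yet.
# The 307 response carries a Location header pointing at the DataNode.
curl -i -X PUT "http://localhost:9870/webhdfs/v1/db/data/upload.csv?op=CREATE"
# Step 2: PUT the file to the Location URL returned by step 1.
curl -i -X PUT -T data/upload.csv "<Location header from step 1>"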
Delete the HDFS directory:
docker-compose exec namenode hdfs dfs -rm -r /db/data
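When you are finished, tear the cluster down (add -v if you also want to remove the named volumes that hold the HDFS data):
docker-compose down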