Docker container for Apache Hadoop
Requirements: Docker Compose >= 1.9.0

On a local setup:
docker-compose up -d
On Rancher:
- An NFS server and the Rancher NFS service need to be configured in the cluster. The NFS volumes `hdfs-namenode`, `hdfs-snamenode` and `hadoop-conf` need to be created via the Rancher WebUI.
- Add the host labels `hdfs-name=true` & `hdfs-api=true` to any of your hosts.
- Create a new HDFS stack `hdfs` via the Rancher WebUI and deploy `docker-compose.rancher.yml`.
To upload files (local or remote), simply use the API service provided in this project. The following endpoints are available:

- `/api/v1/upload/files` (POST)
- `/api/v1/upload/remote` (POST)
- `/api/v1/upload/jobs` (POST)
Once all containers are up and running, execute the following curl command:
curl -u hdfs:password \
-F 'file=@data/hamlet.txt' \
-X POST http://localhost:3000/api/v1/upload/files?hdfspath=/demo/hamlet.txt
{"message":"File(s) uploaded to HDFS. See: hdfs:///demo/hamlet.txt"}
- HDFS WebUI is running on port `50070`
- HDFS API is running on port `3000`
For a single file:
curl -u hdfs:password \
-F 'file=@your_file.png' \
-X POST http://localhost:3000/api/v1/upload/files?hdfspath=/path/to/your/files
For multiple files:
curl -u hdfs:password \
-F 'file=@your_file1.png' \
-F 'file=@your_file2.png' \
-X POST http://localhost:3000/api/v1/upload/files?hdfspath=/path/to/your/files
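If a larger number of files has to go up in one request, the `-F` fields can be assembled in a small shell loop. This is only a sketch: the directory and target path are illustrative, and it assumes the endpoint accepts an arbitrary number of `file` fields, as the examples above suggest:

# Build one multipart request out of every PNG in ./images (illustrative directory)
args=()
for f in images/*.png; do
  args+=(-F "file=@$f")
done
curl -u hdfs:password "${args[@]}" \
-X POST "http://localhost:3000/api/v1/upload/files?hdfspath=/path/to/your/files"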
For a single job:
curl -u hdfs:password \
-F 'job=@your_jar.jar' \
-X POST http://localhost:3000/api/v1/upload/jobs?hdfspath=/path/to/your/jobs
For multiple jobs:
curl -u hdfs:password \
-F 'job=@your_jar1.jar' \
-F 'job=@your_jar2.jar' \
-X POST http://localhost:3000/api/v1/upload/jobs?hdfspath=/path/to/your/jobs
For a remote file:
curl -u hdfs:password \
-d '{"url": "your_url", "hdfspath": "/path/to/your/files"}' \
-H "Content-Type: application/json" \
-X POST http://localhost:3000/api/v1/upload/remote
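For example, to let the service fetch a publicly reachable file and store it in HDFS (the URL is purely illustrative):

# The API downloads the remote file itself and writes it to the given HDFS path
curl -u hdfs:password \
-d '{"url": "https://example.org/data/hamlet.txt", "hdfspath": "/demo/hamlet_remote.txt"}' \
-H "Content-Type: application/json" \
-X POST http://localhost:3000/api/v1/upload/remote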
The API will forward requests to the underlying HDFS WebAPI for file listing, description and download. In analogy to the upload functions, the `hdfspath` parameter is used and passed on for interaction with the files stored in the Hadoop cluster. The client needs to know beforehand how the data is organized on the backend, e.g. data is stored in a folder "data", results are under "results", etc. The user / client is in control of organizing the filesystem structure on their own.
The following API is defined for lookup and download:
- `/api/v1/download?hdfspath=<path>` (GET)
- `/api/v1/data/list?hdfspath=<path>` (GET)
- `/api/v1/data/describe?hdfspath=<path>` (GET)
In general, the data can be downloaded by passing a valid HDFS path to the respective file. The response contains in each case a single file, delivered as an attachment (content disposition). Errors are returned as `application/json` with the status in the HTTP response and a message.
curl -u hdfs:password \
http://localhost:3000/api/v1/download?hdfspath=/hdfspath/to/your/file.tif
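To write the download to a local file instead of stdout, add curl's `-o` option (the paths are the ones from the example above):

curl -u hdfs:password -o file.tif "http://localhost:3000/api/v1/download?hdfspath=/hdfspath/to/your/file.tif"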
There are two additional API endpoints. `describe` uses a readdir (read directory) call that lists information about a directory or a file as a JSON object. In principle this is equal to the object sent by the Hadoop WebAPI.
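A call could look like the following (the path placeholder is illustrative); the JSON below is an example of what such a request returns:

curl -u hdfs:password "http://localhost:3000/api/v1/data/describe?hdfspath=/path/to/your/files"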
[
{
"accessTime": 1517482423444,
"blockSize": 134217728,
"childrenNum": 0,
"fileId": 16389,
"group": "supergroup",
"length": 65958,
"modificationTime": 1517472398925,
"owner": "webuser",
"pathSuffix": "utm_1.tif",
"permission": "755",
"replication": 3,
"storagePolicy": 0,
"type": "FILE"
},
{
"accessTime": 1517498610919,
"blockSize": 134217728,
"childrenNum": 0,
"fileId": 16391,
"group": "supergroup",
"length": 65958,
"modificationTime": 1517472398934,
"owner": "webuser",
"pathSuffix": "utm_2.tif",
"permission": "755",
"replication": 3,
"storagePolicy": 0,
"type": "FILE"
}
]
`list`, on the other hand, will only return downloadable files as a plain-text list.
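For example, a request like the following (the `/test` path is illustrative and reused in the GDAL example below):

curl -u hdfs:password "http://localhost:3000/api/v1/data/list?hdfspath=/test"

returns something like: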
http://localhost:3000/api/v1/download?hdfspath=/test/test_1.tif
http://localhost:3000/api/v1/download?hdfspath=/test/test_2.tif
http://localhost:3000/api/v1/download?hdfspath=/test/test_3.tif
http://localhost:3000/api/v1/download?hdfspath=/test/test_4.tif
The latter function is intended to be used in combination with GDAL on the client. `gdalbuildvrt` is a handy tool to create a remote representation of a tiled data set. It has a parameter (`-input_file_list`) to import the tiles from a text file. At this point the `list` operation comes into play, which provides exactly this.
Imagine the following scenario, where the processing result is a multi-band, tiled raster data set. Then one can create a VRT like the following:
curl -u hdfs:password http://localhost:3000/api/v1/data/list?hdfspath=/test > tiles.txt
gdalbuildvrt --config GDAL_PROXY_AUTH=hdfs:password -input_file_list tiles.txt test.vrt
Note: depending on the URLs or files stated in the list, each entry will be opened by GDAL to extract metadata information. This means every file listed will be downloaded temporarily, but not stored on the client's computer. This is currently a drawback.
With the VRT file the user can download a subset using `gdal_translate` with the `GDAL_PROXY_AUTH` configuration. Here only the affected tiles are downloaded, i.e. tiles/URLs that are not within a specified BBOX are skipped.
gdal_translate --config GDAL_PROXY_AUTH=hdfs:password -of GTiff test.vrt test.tif
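To actually cut out a subset, so that only the intersecting tiles are fetched, a bounding box can be passed to `gdal_translate` via `-projwin`; the coordinates are placeholders for your area of interest:

gdal_translate --config GDAL_PROXY_AUTH=hdfs:password -projwin <ulx> <uly> <lrx> <lry> -of GTiff test.vrt subset.tif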
See `gdal_translate` for more information.