Friday, February 14, 2014

Hadoop + Volumes on Fedora, Using Docker

In just a few simple steps, you can have an instance of Hadoop up and running in a Docker container. This was tested with Hadoop 2.2 on Docker 0.7.6 on a Fedora 20 host. This article also takes a minute to explain how to use volumes. You may also want to view this in-depth blog post by Michael Crosby.

Building the Hadoop Container -

Grab the Dockerfile from either of these locations:


1. Get the version of Docker

# docker version

To build:

# docker build -rm -t <username>/hadoop-single . |& tee hadoop_build.log

# docker images

2. To run (a launch script is provided in the repo):

# ./

# docker ps

Make sure to replace <username> in the script before you run it.

3. To confirm ports are available:

# docker ps
de33e77a7d73 <username>/hadoop-single:latest supervisord -n 6 seconds ago Up 4 seconds>10020/tcp,>13562/tcp,>19888/tcp,>50010/tcp,>50020/tcp,>50030/tcp,>50060/tcp,>50070/tcp,>50075/tcp,>50090/tcp,>50100/tcp,>50105/tcp,>50470/tcp,>50475/tcp,>58261/tcp,>60010/tcp,>8020/tcp,>8030/tcp,>8031/tcp,>8032/tcp,>8033/tcp,>8040/tcp,>8042/tcp,>8080/tcp,>8088/tcp,>8090/tcp,>8480/tcp,>8485/tcp,>9000/tcp, 9001/tcp hungry_pasteur

4. To access services and confirm functionality, open a local browser or use curl to hit the following URLs:

# curl http://localhost:50075          # Datanode
# curl http://localhost:50090          # Secondary namenode
# curl http://localhost:8088/cluster   # Resource manager
# curl http://localhost:8042/node      # Yarn node manager
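The checks above can be wrapped in a small loop. This is only a sketch: it assumes the container's ports are mapped to the same port numbers on the host, as the curl commands above imply, and the check_services name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: probe each Hadoop web UI from step 4 and report its status.
# Returns 0 if every endpoint answers, 1 if at least one does not.
check_services() {
    host="${1:-localhost}"
    rc=0
    # Each line is "<port[/path]> <label>", taken from the URLs listed in step 4.
    while read -r target label; do
        if curl -s -f -o /dev/null --max-time 5 "http://${host}:${target}"; then
            echo "OK   ${label} (${target})"
        else
            echo "FAIL ${label} (${target})"
            rc=1
        fi
    done <<EOF
50075 Datanode
50090 Secondary namenode
8088/cluster Resource manager
8042/node Yarn node manager
EOF
    return "$rc"
}
```

Source the file and call check_services with no arguments to probe localhost, or pass another hostname as the first argument.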

5. To run a job against the Hadoop container, install the following packages on your host:


6. Make sure the core-site.xml file has the proper network settings (the hdfs:// URI).

# grep hdfs /etc/hadoop/core-site.xml
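If you want just the URI rather than the whole matching line, something like the following may help. It is a rough sketch that assumes the URI sits inside a single-line <value> element; the hdfs_uri name is hypothetical, and the default path is the one used in the grep above.

```shell
#!/bin/sh
# Hypothetical helper: print the first hdfs:// URI found in a core-site.xml file.
# Assumes the URI is written on one line as <value>hdfs://...</value>.
hdfs_uri() {
    conf="${1:-/etc/hadoop/core-site.xml}"
    sed -n 's|.*<value>\(hdfs://[^<]*\)</value>.*|\1|p' "$conf" | head -n 1
}
```

Running hdfs_uri with no arguments reads /etc/hadoop/core-site.xml. The host and port it prints should be reachable from where you run the job, for example one of the ports published by the container.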

8. Run a job. Below is one example; more examples can be found on the Fedora page.

# hadoop jar /usr/share/java/hadoop/hadoop-mapreduce-examples.jar pi 10 100000

Total committed heap usage (bytes)=1956642816
Shuffle Errors
File Input Format Counters
Bytes Read=1180
File Output Format Counters
Bytes Written=97
Job Finished in 63.565 seconds


Useful commands for checking out Hadoop inside the container:

# hdfs dfsadmin -report
# mapred queue -list 


Exploring Volumes -

Now that the container has been built, take a look at a few ways volumes can be implemented. In short, there are three volume setups that you will see:

1. Everything is internal to the container. Nice for proof of concept, testing, very portable, etc.
2. Dynamically created volumes that are set up when you launch the image. These are portable as well but allow you to store your data external to the container. These dynamically created directories have an ID that changes each time you launch the container. You can find the path with a quick docker inspect.
3. Static volumes that are mounted from the host inside the container. These are not as portable, meaning the directory needs to be available on any host that you launch the container on. However, this does give you a known path.

To use the second scenario, there are two ways you can do it:

1. Set up the volume from within the Dockerfile, like:

VOLUME ["/var/cache/hadoop-hdfs/root/dfs/datanode"]

2. Use the -v parameter when launching a container, like:

# docker run -v /var/cache/hadoop-hdfs/root/dfs/datanode/ -d -p 8020:8020 -p 50105:50105 <username>/hadoop-single-volume

To use the third scenario, here's what you need to do:

1. Use the -v parameter when launching a container, like:

# docker run -v /data/hadoop:/var/cache/hadoop-hdfs/root/dfs/datanode -d -p 8020:8020 -p 50105:50105 <username>/hadoop-single-volume

Here, /data/hadoop is the directory on the host and /var/cache/hadoop-hdfs/root/dfs/datanode is the directory inside the container.

In both cases, a volume will be created. Have a look (you will need the jq package for this). For the dynamically created volume:

# docker inspect <Container ID> | jq -r '.[0].Volumes'
"/var/cache/hadoop-hdfs/root/dfs/datanode": "/var/lib/docker/vfs/dir/89fa90163da28ad75f82ca32047c5e6cd1111ae395db0e6ac5b18977db2343db"

That ID will change every time the container is launched.

For the static volume:

# docker inspect <Container ID> | jq -r '.[0].Volumes'
"/var/cache/hadoop-hdfs/root/dfs/datanode": "/data/hadoop"
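To tell the two cases apart in a script, you can look at the host side of the mapping. The sketch below assumes dynamically created volumes live under /var/lib/docker/vfs/dir, as in the inspect output shown for this Docker version; other versions or storage setups may use a different location. The volume_kind name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: classify the host path of a volume mapping.
# Assumption: dynamic volumes are created under /var/lib/docker/vfs/dir.
volume_kind() {
    case "$1" in
        /var/lib/docker/vfs/dir/*) echo "dynamic" ;;   # created by Docker at launch
        /*)                        echo "static" ;;    # a host directory you chose
        *)                         echo "unknown" ;;   # empty or not an absolute path
    esac
}
```

For example, volume_kind /data/hadoop prints static, while a path under /var/lib/docker/vfs/dir prints dynamic.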

After you mount the /data/hadoop volume, you can run a job that creates some files:

# hadoop jar /usr/share/java/hadoop/hadoop-mapreduce-examples.jar teragen 100 gendata

Then check the directory on the host for the files:

# tree /data/hadoop/current/
/data/hadoop/current/
├── BP-1198205689-
│   ├── current
│   │   ├── finalized
│   │   │   ├── blk_1073741830
│   │   │   ├── blk_1073741830_1006.meta
│   │   │   ├── blk_1073741831
│   │   │   ├── blk_1073741831_1007.meta
│   │   │   ├── blk_1073741834
│   │   │   ├── blk_1073741834_1010.meta
│   │   │   ├── blk_1073741835
│   │   │   └── blk_1073741835_1011.meta
│   │   ├── rbw
│   │   └── VERSION
│   ├── dncp_block_verification.log.curr
│   ├── dncp_block_verification.log.prev
│   └── tmp

5 directories, 12 files
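If tree is not installed, or you only want a quick number, you can count the block files directly. This is a minimal sketch; the count_blocks name is hypothetical, and the default path is the host directory used in the example above.

```shell
#!/bin/sh
# Hypothetical helper: count HDFS block files under a datanode directory,
# skipping the matching .meta checksum files.
count_blocks() {
    dir="${1:-/data/hadoop/current}"
    find "$dir" -type f -name 'blk_*' ! -name '*.meta' | wc -l | tr -d ' '
}
```

For the listing above, count_blocks should report 4, one for each blk_ file that is not a .meta file.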

And that's it. Test it out.
