mushkevych: December 2013

It literally takes seconds to start a Docker container with pseudo-distributed Hadoop cluster. Most of the credits go, of course, to the docker script below and a bunch of configuration files... but let's not outmanoeuvre ourselves and start slowly :)

In short: Docker is a lightweight container that allows you to run your process(es) in a complete isolation from the rest of the system. Almost like a Virtual Machine but faster and lighter.

In this post we will review the Docker skeleton to build, init and run Hadoop (HDFS, Yarn, HBase, Pig) in a Pseudo-Distributed mode. Let's start with the project's filesystem tree structure:

Here, we have three main categories:

Hadoop configuration files (found in hadoop, hbase, zookeeper, pig folders)
Docker scripts: Dockerfile, local_env.sh, build.sh and run.sh
Hadoop util scripts (found in etc, root_scripts directories and build.sh, run.sh scripts)

Hadoop configuration files and util scripts could be copied from my Github [1]. Tiny docker-helper scripts are as follows:

Now, with the foreplay complete, let's see the Dockerfile itself:

This docker instance is based on Ubuntu 12.04 (Precise Pangolin) and covers all required components: Oracle JDK, Hadoop+Ecosystem, basic system utils. Installation instructions are as follows:

Pre-configure local environment:
$> ./local_env.sh
Build the container (it will take a minute or two):
$> ./build.sh
Run the container:
$> ./run.sh
Once in the container - emulate login (and hence - reads env variables):
#> su -
HDFS Initialization (once only):
#> ./hdfs_format.sh
#> ./hadoop_pseudo_start.sh
#> ./hdfs_init.sh
Restart the cluster to finalize initialization:
#> ./hadoop_pseudo_stop.sh
#> ./clear_hadoop_logs.sh
#> ./hadoop_pseudo_start.sh
Enjoy your cluster:
#> hdfs dfs -ls -R /
#> hbase shell
status 'simple'
#> pig

By default, container's filesystem state is reset at each run. In other words - all your data is gone the moment you exit the container. Natural solution to this issue is move HDFS mount point and few other folders outside of the container:

Host OS Filesystem	Container Filesystem	Description
/var/hstation/dfs	/dfs	Folder hosts HDFS filesystem
/var/hstation/workspace	/workspace	Folder to exchange data to/from container
/var/hstation/logs	/logs	Contains Hadoop/HBase/Zookeeper/Pig logs

We are also exposing HTTP ports, that allow us to connect to the Hadoop processes inside the container:

Exposed Container Ports	Description
http://CONTAINER_IP:8088/cluster	Resource Manager
http://CONTAINER_IP:19888/jobhistory	Job History
http://CONTAINER_IP:50070	HDFS Name Node
http://CONTAINER_IP:60010	HBase Master
http://CONTAINER_IP:8042/node	Yarn Node Manager

In the table above, CONTAINER_IP is found by running following command in your container:
#> domainname -i

To sum things up, container build time will take about 10 minutes and another 2-3 minutes to start and init the container for the first time. From that moment on - it's literally seconds before your Hadoop sandbox is ready to crunch the data.

Cheers!

[1] https://github.com/mushkevych/configurations/tree/master/CDH4.pseudo-distributed

mushkevych

Wednesday, December 04, 2013

Docker for Hadoop... or how to build, init and run a Hadoop Cluster in minutes