Wednesday, December 04, 2013

Docker for Hadoop... or how to build, init and run a Hadoop Cluster in minutes

It literally takes seconds to start a Docker container with pseudo-distributed Hadoop cluster. Most of the credits go, of course, to the docker script below and a bunch of configuration files... but let's not outmanoeuvre ourselves and start slowly :)

In short: Docker is a lightweight container that allows you to run your process(es) in a complete isolation from the rest of the system. Almost like a Virtual Machine but faster and lighter.

In this post we will review the Docker skeleton to build, init and run Hadoop (HDFS, Yarn, HBase, Pig) in a Pseudo-Distributed mode. Let's start with the project's filesystem tree structure:


Here, we have three main categories:
  • Hadoop configuration files (found in hadoop, hbase, zookeeper, pig folders)
  • Docker scripts: Dockerfile, local_env.sh, build.sh and run.sh
  • Hadoop util scripts (found in etc, root_scripts directories and build.sh, run.sh scripts)
Hadoop configuration files and util scripts could be copied from my Github [1]. Tiny docker-helper scripts are as follows:

Now, with the foreplay complete, let's see the Dockerfile itself:


This docker instance is based on Ubuntu 12.04 (Precise Pangolin) and covers all required components: Oracle JDK, Hadoop+Ecosystem, basic system utils. Installation instructions are as follows:
  1. Pre-configure local environment:
    $> ./local_env.sh
     
  2. Build the container (it will take a minute or two):
    $> ./build.sh
     
  3. Run the container:
    $> ./run.sh
  4. Once in the container - emulate login (and hence - reads env variables):
    #> su -
  5. HDFS Initialization (once only):
    #> ./hdfs_format.sh
    #> ./hadoop_pseudo_start.sh
    #> ./hdfs_init.sh
     
  6. Restart the cluster to finalize initialization:
    #> ./hadoop_pseudo_stop.sh
    #> ./clear_hadoop_logs.sh
    #> ./hadoop_pseudo_start.sh
  7. Enjoy your cluster:
    #> hdfs dfs -ls -R /
    #> hbase shell
            status 'simple'
    #> pig
By default, container's filesystem state is reset at each run. In other words - all your data is gone the moment you exit the container. Natural solution to this issue is move HDFS mount point and few other folders outside of the container:

Host OS FilesystemContainer FilesystemDescription
/var/hstation/dfs/dfsFolder hosts HDFS filesystem
/var/hstation/workspace/workspaceFolder to exchange data to/from container
/var/hstation/logs/logsContains Hadoop/HBase/Zookeeper/Pig logs

We are also exposing HTTP ports, that allow us to connect to the Hadoop processes inside the container:

Exposed Container PortsDescription
http://CONTAINER_IP:8088/clusterResource Manager
http://CONTAINER_IP:19888/jobhistoryJob History
http://CONTAINER_IP:50070HDFS Name Node
http://CONTAINER_IP:60010HBase Master
http://CONTAINER_IP:8042/nodeYarn Node Manager

In the table above, CONTAINER_IP is found by running following command in your container:
#> domainname -i


To sum things up, container build time will take about 10 minutes and another 2-3 minutes to start and init the container for the first time. From that moment on - it's literally seconds before your Hadoop sandbox is ready to crunch the data.

Cheers!

[1] https://github.com/mushkevych/configurations/tree/master/CDH4.pseudo-distributed