Wednesday, December 04, 2013

Docker for Hadoop... or how to build, init and run a Hadoop Cluster in minutes

It literally takes seconds to start a Docker container with a pseudo-distributed Hadoop cluster. Most of the credit goes, of course, to the Dockerfile below and a bunch of configuration files... but let's not get ahead of ourselves and start slowly :)

In short: Docker is a lightweight container engine that allows you to run your process(es) in complete isolation from the rest of the system. Almost like a virtual machine, but faster and lighter.
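
If you have never touched Docker before, the canonical one-liner below (not part of this project, just a stock Docker CLI example) drops you into an isolated Ubuntu shell; exit it and every change made inside is gone:

$> sudo docker run -i -t ubuntu:precise /bin/bash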

In this post we will review a Docker skeleton to build, init and run Hadoop (HDFS, YARN, HBase, Pig) in pseudo-distributed mode. Let's start with the project's filesystem tree structure:

├── etc
│   └── environment
├── hadoop
│   ├── core-site.xml
│   ├── hadoop-env.sh
│   ├── hadoop-metrics2.properties
│   ├── hadoop-metrics.properties
│   ├── hdfs-site.xml
│   ├── log4j.properties
│   ├── mapred-site.xml
│   ├── slaves
│   ├── ssl-client.xml.example
│   ├── ssl-server.xml.example
│   ├── yarn-env.sh
│   └── yarn-site.xml
├── hbase
│   ├── hadoop-metrics.properties
│   ├── hbase-env.sh
│   ├── hbase-policy.xml
│   ├── hbase-site.xml
│   ├── log4j.properties
│   └── regionservers
├── pig
│   ├── build.properties
│   ├── log4j.properties
│   └── pig.properties
├── root_scripts
│   ├── clear_hadoop_logs.sh
│   ├── hadoop_pseudo_start.sh
│   ├── hadoop_pseudo_stop.sh
│   ├── hdfs_format.sh
│   ├── hdfs_init.sh
│   └── set_env
├── zookeeper
│   ├── configuration.xsl
│   ├── log4j.properties
│   ├── zoo.cfg
│   └── zoo_sample.cfg
├── build.sh
├── Dockerfile
├── local_env.sh
└── run.sh

Here, we have three main categories:
  • Hadoop configuration files (found in the hadoop, hbase, zookeeper and pig folders)
  • Docker scripts: Dockerfile, local_env.sh, build.sh and run.sh
  • Hadoop util scripts (found in the etc and root_scripts directories, plus the build.sh and run.sh scripts)
Hadoop configuration files and util scripts can be copied from my GitHub [1]. The tiny Docker helper scripts are as follows:
local_env.sh:
#!/bin/bash
sudo sh -c "wget -qO- https://get.docker.io/gpg | apt-key add -"
sudo sh -c "echo deb http://get.docker.io/ubuntu docker main\
> /etc/apt/sources.list.d/docker.list"
sudo apt-get update
sudo apt-get install lxc-docker
sudo mkdir -p --mode=777 /var/hstation/dfs
sudo mkdir -p --mode=777 /var/hstation/workspace
sudo mkdir -p --mode=777 /var/hstation/logs

build.sh:
#!/bin/bash
sudo docker build -t bohdanm/cdh_4_5 .

run.sh:
#!/bin/bash
sudo docker run -v /var/hstation/dfs:/dfs -v /var/hstation/workspace:/workspace -v /var/hstation/logs:/hlogs -h hstation.vanlab.com -i -t bohdanm/cdh_4_5 /bin/bash -l
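
Two stock Docker commands that come in handy with this setup (a minimal sketch; the container ID below is whatever docker ps reports on your host):

$> sudo docker ps                    # list running containers and their IDs
$> sudo docker attach CONTAINER_ID   # re-attach to the running Hadoop container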

Now, with the foreplay complete, let's see the Dockerfile itself:

FROM ubuntu:precise
MAINTAINER Bohdan Mushkevych
# Installing Oracle JDK
RUN apt-get -y install python-software-properties ;\
add-apt-repository -y ppa:webupd8team/java ;\
apt-get update && apt-get -y upgrade ;\
echo oracle-java7-installer shared/accepted-oracle-license-v1-1 select true | /usr/bin/debconf-set-selections ;\
apt-get -y install oracle-java7-installer && apt-get clean ;\
update-alternatives --display java
# exports inside a RUN do not survive the build step - declare them as ENV instead
ENV JAVA_HOME /usr/lib/jvm/java-7-oracle
ENV HADOOP_LIBEXEC_DIR /usr/lib/hadoop/libexec
# Cloudera CDH4 APT key and DPKG repositories
RUN apt-get -y install curl ;\
curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | apt-key add - ;\
echo "deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib\ndeb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib" > /etc/apt/sources.list.d/cloudera.list
# Removing anything extra and installing pseudo distributed YARN-powered Hadoop
RUN apt-get -y remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-* ;\
apt-get update ; apt-get install -y hadoop-conf-pseudo
# Installing zookeeper
RUN apt-get install -y zookeeper-server
# Installing HBase
RUN apt-get install -y hbase ;\
apt-get install -y hbase-master ;\
apt-get install -y hbase-regionserver
# Installing Pig
RUN apt-get install -y pig
# Install command-line utils
RUN apt-get install -y iputils-ping ;\
apt-get install -y vim-tiny
# Copy configuration files
ADD ./etc/ /etc/
ADD ./root_scripts/ /root/
# Init environment
RUN cat /root/set_env >> /etc/profile
RUN unlink /etc/hadoop/conf
ADD ./hadoop/ /etc/hadoop/conf/
RUN unlink /etc/hbase/conf
ADD ./hbase/ /etc/hbase/conf/
RUN unlink /etc/zookeeper/conf
ADD ./zookeeper/ /etc/zookeeper/conf/
# Replace placeholders with the actual settings
RUN sed -i 's/$HOST_ADDRESS/hstation.vanlab.com/g' /etc/hadoop/conf/*
RUN sed -i 's/$HOST_ADDRESS/hstation.vanlab.com/g' /etc/hbase/conf/*
RUN sed -i 's/$HOST_ADDRESS/hstation.vanlab.com/g' /etc/zookeeper/conf/*
RUN sed -i 's/$FS_MOUNT_POINT/\/dfs/g' /etc/hadoop/conf/*
RUN sed -i 's/$FS_MOUNT_POINT/\/dfs/g' /etc/hbase/conf/*
RUN sed -i 's/$FS_MOUNT_POINT/\/dfs/g' /etc/zookeeper/conf/*
# make scripts runnable
RUN chmod +x /root/*.sh
# add user <zookeeper> to group <hadoop>
RUN usermod -a -G hadoop zookeeper
# Expose Hadoop+Eco ports
# HDFS
EXPOSE 8020 50070 50075 50090
# HBase
EXPOSE 60000 60010 60020 60030 8080
# Yarn
EXPOSE 8030 8031 8032 8033 8040 8041 8042 8088 10020 19888
CMD ["/usr/local/bin/circusd", "/etc/circusd.ini"]
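
To make the placeholder substitution above more concrete, here is a hypothetical core-site.xml fragment (the real configuration files live in the hadoop/ folder of the repo [1]) before and after the sed pass:

$> grep -A 1 'fs.defaultFS' hadoop/core-site.xml
  <name>fs.defaultFS</name>
  <value>hdfs://$HOST_ADDRESS:8020</value>
$> sed -i 's/$HOST_ADDRESS/hstation.vanlab.com/g' hadoop/core-site.xml
$> grep -A 1 'fs.defaultFS' hadoop/core-site.xml
  <name>fs.defaultFS</name>
  <value>hdfs://hstation.vanlab.com:8020</value>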

This Docker image is based on Ubuntu 12.04 (Precise Pangolin) and covers all the required components: Oracle JDK, Hadoop and its ecosystem, and basic system utilities. Installation instructions are as follows:
  1. Pre-configure local environment:
    $> ./local_env.sh
     
  2. Build the Docker image (the first build takes a while - about 10 minutes):
    $> ./build.sh
     
  3. Run the container:
    $> ./run.sh
  4. Once inside the container, emulate a login (so that the environment variables from /etc/profile are picked up):
    #> su -
  5. HDFS Initialization (once only):
    #> ./hdfs_format.sh
    #> ./hadoop_pseudo_start.sh
    #> ./hdfs_init.sh
     
  6. Restart the cluster to finalize initialization:
    #> ./hadoop_pseudo_stop.sh
    #> ./clear_hadoop_logs.sh
    #> ./hadoop_pseudo_start.sh
  7. Enjoy your cluster:
    #> hdfs dfs -ls -R /
    #> hbase shell
            status 'simple'
    #> pig
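
As a quick smoke test (a rough sketch, to be run inside the container with the cluster from step 6 up), push a tiny file into HDFS and word-count it with Pig:

#> echo "hello hadoop hello docker" | hdfs dfs -put - /tmp/greetings.txt
#> pig <<'EOF'
lines  = LOAD '/tmp/greetings.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
counts = FOREACH (GROUP words BY word) GENERATE group, COUNT(words);
DUMP counts;
EOF
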
By default, the container's filesystem state is reset at each run. In other words, all your data is gone the moment you exit the container. The natural solution to this issue is to move the HDFS mount point and a few other folders outside of the container (a quick persistence check follows the table below):

Host OS Filesystem        Container Filesystem    Description
/var/hstation/dfs         /dfs                    Hosts the HDFS filesystem
/var/hstation/workspace   /workspace              Folder to exchange data to/from the container
/var/hstation/logs        /hlogs                  Contains Hadoop/HBase/Zookeeper/Pig logs
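
A rough way to verify that the data indeed survives (assuming the mounts from run.sh above): put a file into HDFS, leave the container, start a fresh one and read the file back.

#> hdfs dfs -put /etc/hosts /persistence_check.txt
#> exit                      # leave the login shell...
#> exit                      # ...and the container; its filesystem is discarded
$> ./run.sh
#> su -
#> ./hadoop_pseudo_start.sh
#> hdfs dfs -cat /persistence_check.txt   # still there, served from /var/hstation/dfs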

We also expose HTTP ports that let us connect to the Hadoop web UIs running inside the container:

Exposed Container Port                  Description
http://CONTAINER_IP:8088/cluster        Resource Manager
http://CONTAINER_IP:19888/jobhistory    Job History
http://CONTAINER_IP:50070               HDFS NameNode
http://CONTAINER_IP:60010               HBase Master
http://CONTAINER_IP:8042/node           YARN Node Manager

In the table above, CONTAINER_IP is the container's IP address; it can be found by running the following command inside the container:
#> hostname -i
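
For example, assuming the address printed above, you can quickly check from the host OS that the HDFS NameNode web UI responds:

$> curl -s -I http://CONTAINER_IP:50070 | head -n 1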


To sum things up: building the image takes about 10 minutes, plus another 2-3 minutes to start and initialize the container for the first time. From that moment on, it takes literally seconds before your Hadoop sandbox is ready to crunch data.

Cheers!

[1] https://github.com/mushkevych/configurations/tree/master/CDH4.pseudo-distributed