mushkevych: Docker for Hadoop... or how to build, init and run a Hadoop Cluster in minutes

It literally takes seconds to start a Docker container with pseudo-distributed Hadoop cluster. Most of the credits go, of course, to the docker script below and a bunch of configuration files... but let's not outmanoeuvre ourselves and start slowly :)

In short: Docker is a lightweight container that allows you to run your process(es) in a complete isolation from the rest of the system. Almost like a Virtual Machine but faster and lighter.

In this post we will review the Docker skeleton to build, init and run Hadoop (HDFS, Yarn, HBase, Pig) in a Pseudo-Distributed mode. Let's start with the project's filesystem tree structure:

	├── etc
	│ └── environment
	├── hadoop
	│ ├── core-site.xml
	│ ├── hadoop-env.sh
	│ ├── hadoop-metrics2.properties
	│ ├── hadoop-metrics.properties
	│ ├── hdfs-site.xml
	│ ├── log4j.properties
	│ ├── mapred-site.xml
	│ ├── slaves
	│ ├── ssl-client.xml.example
	│ ├── ssl-server.xml.example
	│ ├── yarn-env.sh
	│ └── yarn-site.xml
	├── hbase
	│ ├── hadoop-metrics.properties
	│ ├── hbase-env.sh
	│ ├── hbase-policy.xml
	│ ├── hbase-site.xml
	│ ├── log4j.properties
	│ └── regionservers
	├── pig
	│ ├── build.properties
	│ ├── log4j.properties
	│ └── pig.properties
	├── root_scripts
	│ ├── clear_hadoop_logs.sh
	│ ├── hadoop_pseudo_start.sh
	│ ├── hadoop_pseudo_stop.sh
	│ ├── hdfs_format.sh
	│ ├── hdfs_init.sh
	│ └── set_env
	├── zookeeper
	│ ├── configuration.xsl
	│ ├── log4j.properties
	│ ├── zoo.cfg
	│ └── zoo_sample.cfg
	├── build.sh
	├── Dockerfile
	├── local_env.sh
	└── run.sh

view raw filesystem_structure hosted with ❤ by GitHub

Here, we have three main categories:

Hadoop configuration files (found in hadoop, hbase, zookeeper, pig folders)
Docker scripts: Dockerfile, local_env.sh, build.sh and run.sh
Hadoop util scripts (found in etc, root_scripts directories and build.sh, run.sh scripts)

Hadoop configuration files and util scripts could be copied from my Github [1]. Tiny docker-helper scripts are as follows:

	#!/bin/bash
	sudo sh -c "wget -qO- https://get.docker.io/gpg \| apt-key add -"
	sudo sh -c "echo deb http://get.docker.io/ubuntu docker main\
	> /etc/apt/sources.list.d/docker.list"
	sudo apt-get update
	sudo apt-get install lxc-docker

	sudo mkdir -p --mode=777 /var/hstation/dfs
	sudo mkdir -p --mode=777 /var/hstation/workspace
	sudo mkdir -p --mode=777 /var/hstation/logs

view raw local_env.sh hosted with ❤ by GitHub

	#!/bin/bash
	sudo docker build -t bohdanm/cdh_4_5 .

view raw build.sh hosted with ❤ by GitHub

	#!/bin/bash
	sudo docker run -v /var/hstation/dfs:/dfs -v /var/hstation/workspace:/workspace -v /var/hstation/logs:/hlogs -h hstation.vanlab.com -i -t bohdanm/cdh_4_5 /bin/bash -l

view raw run.sh hosted with ❤ by GitHub

Now, with the foreplay complete, let's see the Dockerfile itself:

	FROM ubuntu:precise
	MAINTAINER Bohdan Mushkevych

	# Installing Oracle JDK
	RUN apt-get -y install python-software-properties ;\
	add-apt-repository ppa:webupd8team/java ;\
	apt-get update && apt-get -y upgrade ;\
	echo oracle-java7-installer shared/accepted-oracle-license-v1-1 select true \| /usr/bin/debconf-set-selections ;\
	apt-get -y install oracle-java7-installer && apt-get clean ;\
	update-alternatives --display java ;\
	export JAVA_HOME=/usr/lib/jvm/java-7-oracle ;\
	export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec

	# Cloudera CDH4 APT key and DPKG repositories
	RUN apt-get -y install curl ;\
	curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key \| apt-key add - ;\
	echo "deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib\ndeb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib" > /etc/apt/sources.list.d/cloudera.list

	# Removing anything extra and installing pseudo distributed YARN-powered Hadoop
	RUN apt-get remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-* ;\
	apt-get update ; apt-get install -y hadoop-conf-pseudo

	# Installing zookeeper
	RUN apt-get install zookeeper-server ;\

	# Installing HBase
	RUN apt-get install -y hbase ;\
	apt-get install -y hbase-master ;\
	apt-get install -y hbase-regionserver

	# Installing Pig
	RUN apt-get install -y pig

	# Install command-line utils
	RUN apt-get install ping ;\
	apt-get install -y vim.tiny

	# Copy configuration files
	ADD ./etc/ /etc/
	ADD ./root_scripts/ /root/

	# Init environment
	RUN cat /root/set_env >> /etc/profile

	RUN unlink /etc/hadoop/conf
	ADD ./hadoop/ /etc/hadoop/conf/

	RUN unlink /etc/hbase/conf
	ADD ./hbase/ /etc/hbase/conf/

	RUN unlink /etc/zookeeper/conf
	ADD ./zookeeper/ /etc/zookeeper/conf/

	# Replace placeholders with the actual settings
	RUN sed -i 's/$HOST_ADDRESS/hstation.vanlab.com/g' /etc/hadoop/conf/*
	RUN sed -i 's/$HOST_ADDRESS/hstation.vanlab.com/g' /etc/hbase/conf/*
	RUN sed -i 's/$HOST_ADDRESS/hstation.vanlab.com/g' /etc/zookeeper/conf/*

	RUN sed -i 's/$FS_MOUNT_POINT/\/dfs/g' /etc/hadoop/conf/*
	RUN sed -i 's/$FS_MOUNT_POINT/\/dfs/g' /etc/hbase/conf/*
	RUN sed -i 's/$FS_MOUNT_POINT/\/dfs/g' /etc/zookeeper/conf/*

	# make scripts runnable
	RUN chmod +x /root/*.sh

	# add user <zookeeper> to group <hadoop>
	RUN usermod -a -G hadoop zookeeper

	# Expose Hadoop+Eco ports
	# HDFS
	EXPOSE 8020 50070 50075 50090

	# HBase
	EXPOSE 60000 60010 60020 60030 8080

	# Yarn
	EXPOSE 8030 8031 8032 8033 8040 8041 8042 8088 10020 19888

	CMD ["/usr/local/bin/circusd", "/etc/circusd.ini"]

view raw Dockerfile hosted with ❤ by GitHub

This docker instance is based on Ubuntu 12.04 (Precise Pangolin) and covers all required components: Oracle JDK, Hadoop+Ecosystem, basic system utils. Installation instructions are as follows:

Pre-configure local environment:
$> ./local_env.sh
Build the container (it will take a minute or two):
$> ./build.sh
Run the container:
$> ./run.sh
Once in the container - emulate login (and hence - reads env variables):
#> su -
HDFS Initialization (once only):
#> ./hdfs_format.sh
#> ./hadoop_pseudo_start.sh
#> ./hdfs_init.sh
Restart the cluster to finalize initialization:
#> ./hadoop_pseudo_stop.sh
#> ./clear_hadoop_logs.sh
#> ./hadoop_pseudo_start.sh
Enjoy your cluster:
#> hdfs dfs -ls -R /
#> hbase shell
status 'simple'
#> pig

By default, container's filesystem state is reset at each run. In other words - all your data is gone the moment you exit the container. Natural solution to this issue is move HDFS mount point and few other folders outside of the container:

Host OS Filesystem	Container Filesystem	Description
/var/hstation/dfs	/dfs	Folder hosts HDFS filesystem
/var/hstation/workspace	/workspace	Folder to exchange data to/from container
/var/hstation/logs	/logs	Contains Hadoop/HBase/Zookeeper/Pig logs

We are also exposing HTTP ports, that allow us to connect to the Hadoop processes inside the container:

Exposed Container Ports	Description
http://CONTAINER_IP:8088/cluster	Resource Manager
http://CONTAINER_IP:19888/jobhistory	Job History
http://CONTAINER_IP:50070	HDFS Name Node
http://CONTAINER_IP:60010	HBase Master
http://CONTAINER_IP:8042/node	Yarn Node Manager

In the table above, CONTAINER_IP is found by running following command in your container:
#> domainname -i

To sum things up, container build time will take about 10 minutes and another 2-3 minutes to start and init the container for the first time. From that moment on - it's literally seconds before your Hadoop sandbox is ready to crunch the data.

Cheers!

[1] https://github.com/mushkevych/configurations/tree/master/CDH4.pseudo-distributed

3 comments:

Anonymous said...: Hey there, I've been trying to set up the same docker hadoop setup that you describe here.

I downloaded the files, ran local_env, and ran build but ran into some errors.

It looks like apt-get install hadoop-conf-pseudo doesn't work?

If I try that command outside of the dockerfile I get an error "Unable to locate package hadoop-conf-pseudo"

Any ideas what I am missing here?; 11:18 AM
Bohdan Mushkevych said...: Hiho :)

I have made few tweaks and made it available at:

https://bitbucket.org/mushkevych/docker-cdh-pseudo-distributed-4_5

Cheers!; 8:24 PM
Vjeran Marcinko said...: Is Oozie available in your Hadoop setup?; 7:28 AM

Wednesday, December 04, 2013

Docker for Hadoop... or how to build, init and run a Hadoop Cluster in minutes

3 comments: