In short: Docker is a lightweight containerization tool that allows you to run your process(es) in complete isolation from the rest of the system — almost like a Virtual Machine, but faster and lighter.
In this post we will review a Docker skeleton that builds, initializes and runs Hadoop (HDFS, Yarn, HBase, Pig) in pseudo-distributed mode. Let's start with the project's filesystem tree structure:
├── etc
│   └── environment
├── hadoop
│   ├── core-site.xml
│   ├── hadoop-env.sh
│   ├── hadoop-metrics2.properties
│   ├── hadoop-metrics.properties
│   ├── hdfs-site.xml
│   ├── log4j.properties
│   ├── mapred-site.xml
│   ├── slaves
│   ├── ssl-client.xml.example
│   ├── ssl-server.xml.example
│   ├── yarn-env.sh
│   └── yarn-site.xml
├── hbase
│   ├── hadoop-metrics.properties
│   ├── hbase-env.sh
│   ├── hbase-policy.xml
│   ├── hbase-site.xml
│   ├── log4j.properties
│   └── regionservers
├── pig
│   ├── build.properties
│   ├── log4j.properties
│   └── pig.properties
├── root_scripts
│   ├── clear_hadoop_logs.sh
│   ├── hadoop_pseudo_start.sh
│   ├── hadoop_pseudo_stop.sh
│   ├── hdfs_format.sh
│   ├── hdfs_init.sh
│   └── set_env
├── zookeeper
│   ├── configuration.xsl
│   ├── log4j.properties
│   ├── zoo.cfg
│   └── zoo_sample.cfg
├── build.sh
├── Dockerfile
├── local_env.sh
└── run.sh
Here, we have three main categories:
- Hadoop configuration files (in the hadoop, hbase, zookeeper and pig folders)
- Docker scripts: Dockerfile, local_env.sh, build.sh and run.sh
- Hadoop utility scripts (in the etc and root_scripts directories)
#!/bin/bash
sudo sh -c "wget -qO- https://get.docker.io/gpg | apt-key add -"
sudo sh -c "echo deb http://get.docker.io/ubuntu docker main\
> /etc/apt/sources.list.d/docker.list"
sudo apt-get update
sudo apt-get install lxc-docker
sudo mkdir -p --mode=777 /var/hstation/dfs
sudo mkdir -p --mode=777 /var/hstation/workspace
sudo mkdir -p --mode=777 /var/hstation/logs
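The `--mode=777` flag makes the mount points world-writable, so Hadoop daemons inside the container (which run under their own UIDs) can write to them regardless of how UIDs map between host and container. A quick sketch of what the flag does, using an illustrative path under /tmp rather than /var:

```shell
# --mode sets the final directory's permissions explicitly,
# regardless of the current umask (unlike a plain mkdir).
mkdir -p --mode=777 /tmp/hstation_demo/dfs
stat -c '%a' /tmp/hstation_demo/dfs   # -> 777
```

Note that with `-p`, the mode applies only to the final (newly created) directory; intermediate directories get umask-based defaults.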
#!/bin/bash
sudo docker build -t bohdanm/cdh_4_5 .
#!/bin/bash
sudo docker run -v /var/hstation/dfs:/dfs -v /var/hstation/workspace:/workspace -v /var/hstation/logs:/hlogs -h hstation.vanlab.com -i -t bohdanm/cdh_4_5 /bin/bash -l
Now, with the preliminaries out of the way, let's see the Dockerfile itself:
FROM ubuntu:precise
MAINTAINER Bohdan Mushkevych

# Installing Oracle JDK
RUN apt-get -y install python-software-properties ;\
    add-apt-repository ppa:webupd8team/java ;\
    apt-get update && apt-get -y upgrade ;\
    echo oracle-java7-installer shared/accepted-oracle-license-v1-1 select true | /usr/bin/debconf-set-selections ;\
    apt-get -y install oracle-java7-installer && apt-get clean ;\
    update-alternatives --display java ;\
    export JAVA_HOME=/usr/lib/jvm/java-7-oracle ;\
    export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec

# Cloudera CDH4 APT key and DPKG repositories
RUN apt-get -y install curl ;\
    curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | apt-key add - ;\
    echo "deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib\ndeb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib" > /etc/apt/sources.list.d/cloudera.list

# Removing anything extra and installing pseudo-distributed YARN-powered Hadoop
RUN apt-get -y remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-* ;\
    apt-get update ; apt-get install -y hadoop-conf-pseudo

# Installing Zookeeper
RUN apt-get install -y zookeeper-server

# Installing HBase
RUN apt-get install -y hbase ;\
    apt-get install -y hbase-master ;\
    apt-get install -y hbase-regionserver

# Installing Pig
RUN apt-get install -y pig

# Installing command-line utils
RUN apt-get install -y iputils-ping ;\
    apt-get install -y vim.tiny

# Copy configuration files
ADD ./etc/ /etc/
ADD ./root_scripts/ /root/

# Init environment
RUN cat /root/set_env >> /etc/profile
RUN unlink /etc/hadoop/conf
ADD ./hadoop/ /etc/hadoop/conf/
RUN unlink /etc/hbase/conf
ADD ./hbase/ /etc/hbase/conf/
RUN unlink /etc/zookeeper/conf
ADD ./zookeeper/ /etc/zookeeper/conf/

# Replace placeholders with the actual settings
RUN sed -i 's/$HOST_ADDRESS/hstation.vanlab.com/g' /etc/hadoop/conf/*
RUN sed -i 's/$HOST_ADDRESS/hstation.vanlab.com/g' /etc/hbase/conf/*
RUN sed -i 's/$HOST_ADDRESS/hstation.vanlab.com/g' /etc/zookeeper/conf/*
RUN sed -i 's/$FS_MOUNT_POINT/\/dfs/g' /etc/hadoop/conf/*
RUN sed -i 's/$FS_MOUNT_POINT/\/dfs/g' /etc/hbase/conf/*
RUN sed -i 's/$FS_MOUNT_POINT/\/dfs/g' /etc/zookeeper/conf/*

# Make scripts runnable
RUN chmod +x /root/*.sh

# Add user <zookeeper> to group <hadoop>
RUN usermod -a -G hadoop zookeeper

# Expose Hadoop+Eco ports
# HDFS
EXPOSE 8020 50070 50075 50090
# HBase
EXPOSE 60000 60010 60020 60030 8080
# Yarn
EXPOSE 8030 8031 8032 8033 8040 8041 8042 8088 10020 19888

# circusd (a process watcher) is expected to be installed separately,
# with /etc/circusd.ini configured to start the Hadoop daemons
CMD ["/usr/local/bin/circusd", "/etc/circusd.ini"]
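The six `sed` invocations at the end rewrite `$HOST_ADDRESS` and `$FS_MOUNT_POINT` placeholders across all configuration files at build time. The expressions can be tried in isolation — the snippet below uses an illustrative config fragment and paths under /tmp, not files from the repo:

```shell
# Create a sample config containing both placeholders.
# Single-quoted heredoc prevents the shell from expanding $HOST_ADDRESS etc.
mkdir -p /tmp/conf_demo
cat > /tmp/conf_demo/core-site.xml <<'EOF'
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://$HOST_ADDRESS:8020</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>$FS_MOUNT_POINT/tmp</value>
</property>
EOF

# Same expressions as the Dockerfile: in a basic regular expression,
# '$' followed by more characters is literal, so '$HOST_ADDRESS' matches as text.
sed -i 's/$HOST_ADDRESS/hstation.vanlab.com/g' /tmp/conf_demo/*
sed -i 's/$FS_MOUNT_POINT/\/dfs/g' /tmp/conf_demo/*

grep '<value>' /tmp/conf_demo/core-site.xml
```

The single quotes around the sed program are what make this work: they keep the shell from expanding `$HOST_ADDRESS` before sed ever sees it.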
This Docker image is based on Ubuntu 12.04 (Precise Pangolin) and covers all required components: Oracle JDK, Hadoop+Ecosystem, and basic system utils. Installation instructions are as follows:
- Pre-configure the local environment:
  $> ./local_env.sh
- Build the container (it will take a minute or two):
  $> ./build.sh
- Run the container:
  $> ./run.sh
- Once in the container, emulate a login (so that the env variables are read):
  #> su -
- Initialize HDFS (once only):
  #> ./hdfs_format.sh
  #> ./hadoop_pseudo_start.sh
  #> ./hdfs_init.sh
- Restart the cluster to finalize the initialization:
  #> ./hadoop_pseudo_stop.sh
  #> ./clear_hadoop_logs.sh
  #> ./hadoop_pseudo_start.sh
- Enjoy your cluster:
  #> hdfs dfs -ls -R /
  #> hbase shell
  status 'simple'
  #> pig
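The `su -` step matters because the Dockerfile appends /root/set_env to /etc/profile, and /etc/profile is only sourced by login shells — a plain shell would not see the exported variables. A minimal sketch of the mechanism, using a stand-in profile file under /tmp (its contents are hypothetical, mirroring the exports from the Dockerfile):

```shell
# Stand-in for /etc/profile with set_env appended to it.
cat > /tmp/demo_profile <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
EOF

# A shell that sources the profile (as a login shell would) sees the variable:
bash -c 'source /tmp/demo_profile; echo "$JAVA_HOME"'
# -> /usr/lib/jvm/java-7-oracle
```

Inside the container, `su -` (with the dash) starts exactly such a login shell for root, which is why the Hadoop scripts find JAVA_HOME afterwards.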
Host OS Filesystem      | Container Filesystem | Description
/var/hstation/dfs       | /dfs                 | Hosts the HDFS filesystem
/var/hstation/workspace | /workspace           | Exchanges data to/from the container
/var/hstation/logs      | /hlogs               | Contains Hadoop/HBase/Zookeeper/Pig logs
We are also exposing HTTP ports that allow us to connect to the Hadoop processes inside the container:
Exposed Container Ports              | Description
http://CONTAINER_IP:8088/cluster     | Resource Manager
http://CONTAINER_IP:19888/jobhistory | Job History
http://CONTAINER_IP:50070            | HDFS Name Node
http://CONTAINER_IP:60010            | HBase Master
http://CONTAINER_IP:8042/node        | Yarn Node Manager
In the table above, CONTAINER_IP is found by running the following command inside your container:
#> hostname -i
To sum things up: the container build takes about 10 minutes, plus another 2-3 minutes to start and initialize the container for the first time. From that moment on, it's literally seconds before your Hadoop sandbox is ready to crunch the data.
Cheers!
[1] https://github.com/mushkevych/configurations/tree/master/CDH4.pseudo-distributed