In short: Docker is a lightweight container technology that lets you run your process(es) in complete isolation from the rest of the system. It is almost like a virtual machine, but faster and lighter.
In this post we will review the Docker skeleton to build, initialize and run Hadoop (HDFS, YARN, HBase, Pig) in pseudo-distributed mode. Let's start with the project's filesystem tree structure:
Here, we have three main categories:
- Hadoop configuration files (found in hadoop, hbase, zookeeper, pig folders)
- Docker scripts: Dockerfile, local_env.sh, build.sh and run.sh
- Hadoop utility scripts (found in the etc and root_scripts directories, plus the build.sh and run.sh scripts)
With the preliminaries out of the way, let's look at the Dockerfile itself:
This Docker image is based on Ubuntu 12.04 (Precise Pangolin) and covers all required components: Oracle JDK, Hadoop and its ecosystem, and basic system utilities. Installation instructions are as follows:
- Pre-configure the local environment:
  $> ./local_env.sh
- Build the container (it will take a minute or two):
  $> ./build.sh
- Run the container:
  $> ./run.sh
- Once in the container, emulate a login (so that the environment variables are read):
  #> su -
- HDFS initialization (once only):
  #> ./hdfs_format.sh
  #> ./hadoop_pseudo_start.sh
  #> ./hdfs_init.sh
- Restart the cluster to finalize initialization:
  #> ./hadoop_pseudo_stop.sh
  #> ./clear_hadoop_logs.sh
  #> ./hadoop_pseudo_start.sh
- Enjoy your cluster:
  #> hdfs dfs -ls -R /
  #> hbase shell
  status 'simple'
  #> pig
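The "once only" HDFS initialization step is easy to repeat by accident. One way to guard it is with a marker file. This is a minimal sketch, not part of the project's scripts: the `run_once` helper and the marker path in the example are hypothetical.

```shell
#!/bin/sh
# run_once MARKER COMMAND [ARGS...]
# Runs COMMAND only if the file MARKER does not exist yet.
# Hypothetical helper for illustration; not part of the project.
run_once() {
    marker="$1"
    shift
    if [ -e "$marker" ]; then
        echo "skipping: $marker already exists"
        return 0
    fi
    # Create the marker only when the command succeeds,
    # so a failed initialization can be retried.
    "$@" && touch "$marker"
}

# Example (inside the container, after `su -`); the marker path is an assumption:
#   run_once /dfs/.hdfs_formatted ./hdfs_format.sh
```

Keeping the marker under /dfs would tie it to the HDFS data itself, so wiping the host folder would also reset the guard.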
The container maps the following host folders:
| Host OS Filesystem | Container Filesystem | Description |
|---|---|---|
| /var/hstation/dfs | /dfs | Hosts the HDFS filesystem |
| /var/hstation/workspace | /workspace | Used to exchange data with the container |
| /var/hstation/logs | /logs | Contains Hadoop/HBase/Zookeeper/Pig logs |
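The folder mapping above corresponds to Docker's `-v host_path:container_path` option. Here is a minimal sketch of how such flags could be assembled. The `volume_flags` helper is hypothetical and is not taken from the project's run.sh.

```shell
#!/bin/sh
# volume_flags [HOST_BASE]
# Prints `docker run` volume flags for the three shared folders.
# HOST_BASE defaults to /var/hstation, as in the table above.
# Hypothetical helper for illustration; not the project's run.sh.
volume_flags() {
    host_base="${1:-/var/hstation}"
    flags=""
    for name in dfs workspace logs; do
        flags="$flags -v $host_base/$name:/$name"
    done
    # Strip the leading space left by the first iteration.
    printf '%s\n' "${flags# }"
}

# Example (IMAGE_NAME is a placeholder for whatever build.sh tagged):
#   docker run -i -t $(volume_flags) IMAGE_NAME /bin/bash
```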
We also expose HTTP ports that allow us to connect to the Hadoop processes inside the container:
| Exposed Container Ports | Description |
|---|---|
| http://CONTAINER_IP:8088/cluster | Resource Manager |
| http://CONTAINER_IP:19888/jobhistory | Job History |
| http://CONTAINER_IP:50070 | HDFS Name Node |
| http://CONTAINER_IP:60010 | HBase Master |
| http://CONTAINER_IP:8042/node | YARN Node Manager |
In the table above, CONTAINER_IP is found by running the following command in your container:
#> domainname -i
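Once the IP is known, the URLs from the table can be printed ready to paste into a browser. This is a small sketch using a hypothetical `hadoop_ui_urls` helper; the ports and paths are copied from the table, and everything else is an assumption.

```shell
#!/bin/sh
# hadoop_ui_urls CONTAINER_IP
# Prints the web UI URLs listed in the table above for the given IP.
# Hypothetical helper for illustration; not part of the project.
hadoop_ui_urls() {
    ip="$1"
    if [ -z "$ip" ]; then
        echo "usage: hadoop_ui_urls CONTAINER_IP" >&2
        return 1
    fi
    printf 'Resource Manager: http://%s:8088/cluster\n' "$ip"
    printf 'Job History: http://%s:19888/jobhistory\n' "$ip"
    printf 'HDFS Name Node: http://%s:50070\n' "$ip"
    printf 'HBase Master: http://%s:60010\n' "$ip"
    printf 'YARN Node Manager: http://%s:8042/node\n' "$ip"
}

# Example (the IP shown is only an illustration):
#   hadoop_ui_urls 172.17.0.2
```

From the host side, the IP can usually also be read with `docker inspect --format '{{ .NetworkSettings.IPAddress }}' CONTAINER_ID`, where CONTAINER_ID is the ID reported by `docker ps`.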
To sum things up, the container build takes about 10 minutes, plus another 2-3 minutes to start and initialize the container for the first time. From then on, it's literally seconds before your Hadoop sandbox is ready to crunch the data.
Cheers!
[1] https://github.com/mushkevych/configurations/tree/master/CDH4.pseudo-distributed