mushkevych: 2013

Wednesday, December 04, 2013

Docker for Hadoop... or how to build, init and run a Hadoop Cluster in minutes

It literally takes seconds to start a Docker container with a pseudo-distributed Hadoop cluster. Most of the credit goes, of course, to the docker script below and a bunch of configuration files... but let's not get ahead of ourselves and start slowly :)

In short: Docker provides lightweight containers that allow you to run your process(es) in complete isolation from the rest of the system. Almost like a Virtual Machine, but faster and lighter.

In this post we will review the Docker skeleton to build, init and run Hadoop (HDFS, Yarn, HBase, Pig) in a Pseudo-Distributed mode. Let's start with the project's filesystem tree structure:

Here, we have three main categories:

Hadoop configuration files (found in hadoop, hbase, zookeeper, pig folders)
Docker scripts: Dockerfile, local_env.sh, build.sh and run.sh
Hadoop util scripts (found in etc, root_scripts directories and build.sh, run.sh scripts)

Hadoop configuration files and util scripts can be copied from my GitHub [1]. Tiny docker-helper scripts are as follows:

Now, with the foreplay complete, let's see the Dockerfile itself:

This docker instance is based on Ubuntu 12.04 (Precise Pangolin) and covers all required components: Oracle JDK, Hadoop+Ecosystem, basic system utils. Installation instructions are as follows:

Pre-configure local environment:
$> ./local_env.sh
Build the container (it will take a minute or two):
$> ./build.sh
Run the container:
$> ./run.sh
Once in the container, emulate a login (and hence read the environment variables):
#> su -
HDFS Initialization (once only):
#> ./hdfs_format.sh
#> ./hadoop_pseudo_start.sh
#> ./hdfs_init.sh
Restart the cluster to finalize initialization:
#> ./hadoop_pseudo_stop.sh
#> ./clear_hadoop_logs.sh
#> ./hadoop_pseudo_start.sh
Enjoy your cluster:
#> hdfs dfs -ls -R /
#> hbase shell
status 'simple'
#> pig

By default, the container's filesystem state is reset at each run. In other words, all your data is gone the moment you exit the container. The natural solution to this issue is to move the HDFS mount point and a few other folders outside of the container:

Host OS Filesystem	Container Filesystem	Description
/var/hstation/dfs	/dfs	Folder hosts HDFS filesystem
/var/hstation/workspace	/workspace	Folder to exchange data to/from container
/var/hstation/logs	/logs	Contains Hadoop/HBase/Zookeeper/Pig logs
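
As a rough illustration of how such a mapping reaches Docker, the sketch below assembles a `docker run` command line with one `-v host:container` flag per row of the table. It is only a sketch: the image name `hstation` is a hypothetical placeholder, and the actual run.sh may pass different options.

```python
# Host-to-container folder mapping, copied from the table above.
VOLUMES = {
    "/var/hstation/dfs": "/dfs",
    "/var/hstation/workspace": "/workspace",
    "/var/hstation/logs": "/logs",
}


def docker_run_command(image, volumes=None):
    """Build the argument list for `docker run` with one -v flag per mapping."""
    volumes = VOLUMES if volumes is None else volumes
    args = ["docker", "run", "-i", "-t"]
    for host_dir, container_dir in volumes.items():
        # -v HOST_DIR:CONTAINER_DIR mounts a host folder inside the container
        args += ["-v", "{0}:{1}".format(host_dir, container_dir)]
    args += [image, "/bin/bash"]
    return args


if __name__ == "__main__":
    # "hstation" is an assumed image name, used only for illustration.
    print(" ".join(docker_run_command("hstation")))
```

Anything written under /dfs, /workspace or /logs inside the container then lands in the corresponding host folder and survives the container's exit.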

We also expose HTTP ports that allow us to connect to the Hadoop processes inside the container:

Exposed Container Ports	Description
http://CONTAINER_IP:8088/cluster	Resource Manager
http://CONTAINER_IP:19888/jobhistory	Job History
http://CONTAINER_IP:50070	HDFS Name Node
http://CONTAINER_IP:60010	HBase Master
http://CONTAINER_IP:8042/node	Yarn Node Manager

In the table above, CONTAINER_IP is found by running the following command in your container:
#> domainname -i

To sum things up, building the container takes about 10 minutes, plus another 2-3 minutes to start and initialize it for the first time. From that moment on, it's literally seconds before your Hadoop sandbox is ready to crunch the data.

Cheers!

[1] https://github.com/mushkevych/configurations/tree/master/CDH4.pseudo-distributed

Tuesday, August 20, 2013

Puppet+Hiera+Jenkins. Jenkins Integration

Continuation. By now we should have a working bundle of Puppet and Hiera, and in this post we will focus on Jenkins integration, as well as management of the apps configuration.

First, let's recall that roles contain a class {'configuration':} declaration. This tiny module allows us to keep a current copy of the apps configuration on the target nodes.
The module's file tree looks as follows:

The manifest file itself (init.pp) is as follows:

It can be summarized as a procedure that, based on the ${environment} name and the ${application} name, copies configuration files from /biz/puppet/hieradata/${environment}/${application} on the Puppet Master to /biz/configuration/${application} on the Puppet Agent node.

It is important that the destination path contains no reference to the ${environment}. This later allows Jenkins to perform a configuration update in a completely environment-agnostic way, with only the ${application} name being required.
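
The path mapping can be sketched in a few lines of Python. This is only an illustration of the naming scheme described above, not part of the Puppet module; the function name `configuration_paths` is made up for the example.

```python
import posixpath

HIERADATA_ROOT = "/biz/puppet/hieradata"    # source root on the Puppet Master
CONFIGURATION_ROOT = "/biz/configuration"   # destination root on the Puppet Agent


def configuration_paths(environment, application):
    """Return (source on master, destination on agent) for an application."""
    source = posixpath.join(HIERADATA_ROOT, environment, application)
    # The destination deliberately omits the environment name.
    destination = posixpath.join(CONFIGURATION_ROOT, application)
    return source, destination


if __name__ == "__main__":
    for env in ("ci", "qa"):
        print(env, configuration_paths(env, "HelloWorld"))
```

Whatever the environment, the destination stays /biz/configuration/HelloWorld, which is why a Jenkins job needs nothing but the application name.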

Consider the file tree under files folder. It is grouped by environment and application name. Should you like to add a new application, let's say HelloWorld to the CI env, you would have to:

Make sure that the folder /biz/puppet/hieradata/ci exists
Create new subfolder /biz/puppet/hieradata/ci/HelloWorld
Place HelloWorld configuration files into /biz/puppet/hieradata/ci/HelloWorld
Create/update Jenkins script to take the application configuration files from /biz/configuration/${application}
Create/update a role to reference the HelloWorld:

The flow can be illustrated as:

Fig 1: Deployment process

Last, but not least, let's describe the development process on a cluster governed by Puppet+Hiera+Jenkins:

Fig 2: Development process

Fig 2 illustrates that the framework significantly fast-tracks configuration changes. In summary, we have shown a configuration management framework targeted at small-to-medium sized projects, where you are comfortable with file-based configuration for your applications.

Cheers!

Puppet+Hiera+Jenkins. Custom Modules

Continuation. In this second post of the series we will focus on custom modules for the Puppet+Hiera+Jenkins framework. For the sake of brevity we will illustrate our efforts using the example of a single QA environment.

As defined in puppet.conf, the site.pp file can be found at /biz/puppet/hieradata/${environment}. For the QA environment it is /biz/puppet/hieradata/qa:

Here, manifest (aka site.pp) assigns a single role to every node in the cluster. Nodes are identified by the full domain name. Let's assume that:

Puppet Master lives at ip-10-0-8-10.sampup.com
ip-10-0-8-11.sampup.com and ip-10-0-8-12.sampup.com are Puppet Agents.

Roles are defined in /biz/puppet/modules/role module:

You might be asking what class {'configuration':} is about. This is where the actual application configuration is delivered to agent nodes, and we will address this module in the last post of the series.

Corresponding profiles are defined in /biz/puppet/modules/profile module:

Let's discuss Hiera configuration. There is a lot of reading available [1]; however, we can summarize it as follows:

Hiera data files are named after the node's fully qualified domain name
For instance: ip-10-0-8-12.sampup.com.json
Files contain key-value pairs
For instance: "ntp::restrict" : true
Each key consists of the module name and the property name, joined by ::
For instance, in "ntp::restrict" : true
- ntp is the module name
- restrict is the property name
- true is the value
Declared key-value pairs are applied when the parametrized Puppet classes are invoked
If Hiera finds no file matching the node's domain name, it looks in common.json next
If no relevant key-value pair is found, Puppet falls back to the module's default parameters.
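
The lookup order described above can be modelled in a few lines of Python. This is a simplified sketch of the idea, not Hiera's actual implementation: the three dictionaries stand in for the node's JSON file, common.json and the defaults declared in the Puppet class, and the function names are made up for the example.

```python
def split_key(key):
    """Split "ntp::restrict" into ("ntp", "restrict")."""
    # rpartition keeps nested class names such as "a::b::param" intact
    module, _, prop = key.rpartition("::")
    return module, prop


def lookup(key, node_data, common_data, module_defaults):
    """Return the first value found: node file, then common, then defaults."""
    for source in (node_data, common_data, module_defaults):
        if key in source:
            return source[key]
    raise KeyError(key)


if __name__ == "__main__":
    # All values below are hypothetical and only illustrate precedence.
    node = {"ntp::restrict": True}
    common = {"ntp::restrict": False, "ntp::autoupdate": True}
    defaults = {"ntp::restrict": False, "ntp::autoupdate": False, "ntp::enable": True}
    for key in ("ntp::restrict", "ntp::autoupdate", "ntp::enable"):
        print(split_key(key), "->", lookup(key, node, common, defaults))
```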

/biz/puppet/hieradata/qa/common.json

/biz/puppet/hieradata/qa/ip-10-0-8-12.sampup.com.json

Continue with Jenkins and application configuration.

[1] Hiera reading
http://docs.puppetlabs.com/hiera/1/puppet.html#hiera-lookup-functions

Puppet+Hiera+Jenkins. Concepts

Wouldn't it be great if we could transfer a large part of the application configuration from Development to Ops? For instance, a typical Java EE application holds tons of external references (such as DB URL, user, password, security certificates, etc). A typical configuration change often requires hours and distracts developers, build engineers and QA. Lots of waste for a trivial change.

In this series of three posts we will show a configuration management framework, built from Puppet, Hiera and Jenkins, that supports multiple environments:

In this post we will review concepts and Puppet's configuration files
Second post will be devoted to custom modules and interaction with Hiera
Final, third post will focus on Jenkins integration

Our target environment can be pictured as follows:

Fig 1: Deployment schema for two environments

We will build on top of the practices described in [1], and I strongly advise reading that article before proceeding further.

Conceptually, the workflow consists of the following steps:
1. Platform setup: during this step system interfaces and services of the nodes are configured (networking, file-system permissions, users and groups creations, rdbms, http servers, etc)

Fig 2: Puppet workflow

2. Application setup: during this step business applications are installed and configured on the nodes

Fig 3: Jenkins workflow

Our skeleton's file structure will look as follows:

Three folders form a top-level hierarchy:

/etc/puppet: once the puppet package is installed, this folder contains the standard Puppet modules
In addition, we will place two configuration files there:
- puppet.conf describing puppet search paths and config
- hiera.yaml describing Hiera search paths and config
/biz/puppet/modules contains all of the custom modules
/biz/puppet/hieradata holds Hiera data files grouped by environment

Let's review the Puppet configuration files from /etc/puppet

puppet.conf

This file should normally be split into two: one with the [main] and [master] sections, deployed on the Puppet Master node, and another with the [main] and [agent] sections, deployed to the Puppet Agent nodes.
puppet.conf provides a wide spectrum of settings [3], but here we define only the most critical ones, such as:

modulepath defines path to the custom modules
manifest defines where site.pp files are located
Note that in our case site.pp files are environment-specific. For instance site.pp for QA environment is stored in /biz/puppet/hieradata/qa/ folder
server identifies the Puppet Master domain name
environment defines environment of the Agent node. For instance: qa, ci, etc
Details on the environment setting can be found at [2]

hiera.yaml
Above, we declared that configuration files for particular nodes in the cluster will be in JSON format, grouped by environment and identified by the full domain name.
Note that the entry - common under :hierarchy: denotes common settings for the environment and in our case refers to the /biz/puppet/hieradata/${environment}/common.json file.

Continue with custom modules.

[1] Craig Dunn: Designing Puppet – Roles and Profiles
http://www.craigdunn.org/2012/05/239/

[2] Defining environment in Puppet/Hiera
http://docs.puppetlabs.com/guides/environment.html

[3] Puppet.conf in details
http://docs.puppetlabs.com/references/latest/configuration.html

Friday, July 26, 2013

OrderedSet for Python 2.7

While working on the Kosaraju algorithm for a Coursera course [1] homework, I wrote a simple and (hopefully) convenient OrderedSet for Python 2.7.

Feel free to use and share it:

And just in case you are looking for unit tests or simple usage examples, please refer to the gist below:
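
For readers who only want the general idea, here is a minimal sketch of an insertion-ordered set. It is an illustration built on collections.OrderedDict and the MutableSet base class, and is not necessarily how the OrderedSet referenced above is written.

```python
from collections import OrderedDict

try:
    from collections.abc import MutableSet  # Python 3
except ImportError:
    from collections import MutableSet      # Python 2.7


class OrderedSet(MutableSet):
    """A set that remembers the order in which items were first added."""

    def __init__(self, iterable=()):
        self._items = OrderedDict()
        for item in iterable:
            self.add(item)

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def add(self, item):
        # Re-adding an existing item keeps its original position.
        self._items.setdefault(item, None)

    def discard(self, item):
        self._items.pop(item, None)

    def __repr__(self):
        return "OrderedSet({0!r})".format(list(self._items))


if __name__ == "__main__":
    visited = OrderedSet([3, 1, 3, 2, 1])
    print(visited)                    # duplicates dropped, first-seen order kept
    print(visited | OrderedSet([5]))  # union comes from the MutableSet mixin
```

Only five methods are written by hand; operators such as |, & and - are inherited from MutableSet.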

Cheers!

[1] Algorithms: Design and Analysis, Part 1
https://class.coursera.org/algo-004/class/index

Saturday, May 18, 2013

launch.py

The good news is that you don't have to write any extra code to run your Python application in a separate process (so-called daemonization). All you need is the fully qualified name of the method/function to execute. Something like:

    some_package.some_class.SomeClass.start_me

    some_package.some_script.main_function

Everything else is handled by launch.py [1].
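
To give a feel for what resolving such a name involves, here is a minimal sketch based on importlib. It is an illustration under the assumption that the name is a dotted path of module, optional class and function; launch.py's own resolution logic may differ.

```python
import importlib


def resolve(fully_qualified_name):
    """Turn "package.module.Class.method" into the object it names."""
    parts = fully_qualified_name.split(".")
    # Try the longest prefix that imports as a module, then walk attributes.
    for i in range(len(parts), 0, -1):
        module_name = ".".join(parts[:i])
        try:
            obj = importlib.import_module(module_name)
        except ImportError:
            # Note: this also hides import errors raised inside the module.
            continue
        for attribute in parts[i:]:
            obj = getattr(obj, attribute)
        return obj
    raise ImportError("cannot resolve " + fully_qualified_name)


if __name__ == "__main__":
    dumps = resolve("json.dumps")  # a function in a module
    print(dumps({"answer": 42}))
    fromkeys = resolve("collections.OrderedDict.fromkeys")  # a method on a class
    print(list(fromkeys("ab")))
```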

launch.py is a set of friendly Python tools to install a Virtual Environment and to start and stop your application. In addition, it gives you a single interface to test and analyze your code, and provides configuration management.

In this post we will outline two features: installation and daemonization.

Installation. It's simple. launch.py will create a Virtual Environment for you and make sure that your application is executed within it. All you need to do is download all of the libraries required by your application and place them in the folder:

    launch.py/vendor

The installation order of the libraries is defined by the script:

    launch.py/scripts/install_virtualenv.sh

Once this step is complete, just run ./launch.py -i to install the Virtual Environment along with all of the required libraries.

launch.py is also there to help with the application start and stop. There are two modes to run your application:

daemonized
the application is started in a separate process
interactive
the application is executed in the same command-line terminal where you have called launch.py and uses shared stdin, stdout and stderr to interact with the user

To use this feature simply follow the guide:

Write the actual code
The following assumptions are in place:
- starter method or function has signature starter_name(*args)
- classes implement __init__(self, process_name)

PROCESS_CONTEXT = {
...
    'YOUR_PROCESS_NAME': _create_context_entry(
        process_name='YOUR_PROCESS_NAME',
        classname='workers.some_script.main',
        token='YOUR_PROCESS_NAME',
        time_qualifier=''),
...

Should you wish to instantiate a class instance and start its method - simply define the class name:

        classname='workers.some_module.YourPythonClass.starter_name'

./launch.py --run --app YOUR_PROCESS_NAME will start your class' or script's starter method/function in a separate process (daemonized mode)
Add --interactive to the command above to run it in interactive mode

In summary, launch.py is here to make your life easier. You can find details and read about many other features of the launch.py framework at its wiki [2].

Cheers!

[1] launch.py on GitHub
https://github.com/mushkevych/launch.py

[2] launch.py WIKI
https://github.com/mushkevych/launch.py/wiki

Friday, March 08, 2013

PMP Audit and Exam

If you are like me, then after 4 months of studying PMP Exam Prep in transit, you registered at pmi.org and applied for the PMP exam. And (if you are like me) you have seen the sacred "your application has been chosen for the audit". Feel free to be like me and give yourself 20 minutes to panic: I recall running into the bathroom and confessing to my wife that I had just wasted $630 and 4 months of transit reading. I do recall the sticky heat that covered me from head to toe and, as mentioned earlier, it took me 20 minutes to pull myself together.

Now, supposedly, you got yourself together and got ready to face the reality - you have 90 days to complete the audit. In fact it took me 40 days to notify my former bosses, gather all of the envelopes and send the application to the PMI. Never shall I forget Canada Post and 10 long days before my documents crossed the border and reached the recipient.

After PMI approval I spent another 2 months reading and preparing for the exam, and today I have finally passed it! My first impressions are as follows:

I find that fingerprinting, a metal detector and turning out your pockets are overkill for a certification exam
Despite the $630 price tag there was no coffee or even a drinking water dispenser (at least this is true for the certification center at Metrotown in Burnaby, BC)
As mentioned in hundreds of posts from around the world, PMI requires you to immerse yourself in their "world of PMI-ism", and your exam success is fully driven by the level of immersion

In summary:

I have used PMP Exam Prep [1] as my primary source:

PM Fastrack [2] turned out to be a very useful tool:

It took me 6 months to prepare for the exam and an additional 40 days to prepare the document package for the PMI Audit.

Don't forget to enrol in PMI before applying for the PMP. This will save you $25.

Remember to fuel yourself with coffee before entering the examination room, as you might have no other opportunity before the end of exam.

Cheers!

[1] PMP Exam Prep:
http://store.rmcproject.com/Detail.bok?no=392

[2] PM Fastrack:
http://store.rmcproject.com/Detail.bok?no=310

Friday, January 11, 2013

HBase: secondary index

As your HBase project moves forward you will likely face a request to search by criteria that are not part of the primary index and cannot be added to it. In other words, you will face the problem of fast and efficient search by a secondary index. For instance: select all eReaders in a specific price range. In this post, we will review an approach to constructing a secondary index.

As usual, we will work in the realm of Surus ORM [1] and Synergy Mapreduce Framework [2], and will start with the definition of a model. For illustration purposes we will use a simplified variant of the "product" class, which has lowest and highest prices and can only belong to one category. For instance:

ID	category	priceLowest	priceHighest	manufacturer
Sony eReader PRST2BC	E-READER	8900	12900	SONY

Instances will reside in a table product:

To satisfy our search requests, we would like to get the following structure:

ID	products
ID	Sony eReader PRST2BC	Kobo ...	...
E-READER	{ priceLowest: 8900, priceHighest: 12900, manufacturer: SONY }	{ ... }	{ ... }

Here, any search within a specified category would allow us to quickly filter products by price range or manufacturer.
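
The structure and the lookup it enables can be modelled in plain Python. This is only a sketch of the idea and uses neither HBase nor Surus: the grouping is done in-process, and the second product is a hypothetical row added so that the filter has something to exclude.

```python
from collections import defaultdict

PRODUCTS = [
    {"id": "Sony eReader PRST2BC", "category": "E-READER",
     "priceLowest": 8900, "priceHighest": 12900, "manufacturer": "SONY"},
    # Hypothetical row, present only to make the example non-trivial.
    {"id": "Hypothetical Reader X", "category": "E-READER",
     "priceLowest": 20000, "priceHighest": 25000, "manufacturer": "ACME"},
]

CRITERIA = ("priceLowest", "priceHighest", "manufacturer")


def build_index(products):
    """Group products by category: {category: {product id: criteria}}."""
    index = defaultdict(dict)
    for product in products:
        criteria = {name: product[name] for name in CRITERIA}
        index[product["category"]][product["id"]] = criteria
    return dict(index)


def search(index, category, low, high):
    """Ids in the category whose price range overlaps [low, high]."""
    row = index.get(category, {})  # a single row read, no table scan
    return [product_id for product_id, c in row.items()
            if c["priceLowest"] <= high and c["priceHighest"] >= low]


if __name__ == "__main__":
    index = build_index(PRODUCTS)
    print(search(index, "E-READER", 8000, 13000))
```

The point of the layout is that a query touches exactly one row, keyed by category, and filters only the criteria stored in it.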

To create an index as described above, we would need a new model to hold the filter criteria and a mapreduce job to periodically update it.
Secondary index model:

and its corresponding grouping table:

The mapreduce job implies that the Job Runner will use the product table as the source and the grouping table as the sink. The job's mapper:

and a reducer:

As an alternative to a secondary index you can use filtering. For instance, SingleColumnValueFilter:

However, the SingleColumnValueFilter approach does not scale to large tables and frequent searches. Stretching it too far will cause performance degradation across the cluster.

To sum it up, secondary indexes are not trivial, but neither are they the pinnacle of complexity. While designing them, one should think carefully about the filter criteria and the "long-term" perspective.

Hopefully this tutorial will be of help to you.
Cheers!

[1] Surus ORM
https://github.com/mushkevych/surus

[2] Synergy Mapreduce Framework
https://github.com/mushkevych/synergy-framework