Wednesday, December 04, 2013

Docker for Hadoop... or how to build, init and run a Hadoop Cluster in minutes

It literally takes seconds to start a Docker container with pseudo-distributed Hadoop cluster. Most of the credits go, of course, to the docker script below and a bunch of configuration files... but let's not outmanoeuvre ourselves and start slowly :)

In short: Docker is a lightweight container that allows you to run your process(es) in a complete isolation from the rest of the system. Almost like a Virtual Machine but faster and lighter.

In this post we will review the Docker skeleton to build, init and run Hadoop (HDFS, Yarn, HBase, Pig) in a Pseudo-Distributed mode. Let's start with the project's filesystem tree structure:


Here, we have three main categories:
  • Hadoop configuration files (found in hadoop, hbase, zookeeper, pig folders)
  • Docker scripts: Dockerfile, local_env.sh, build.sh and run.sh
  • Hadoop util scripts (found in etc, root_scripts directories and build.sh, run.sh scripts)
Hadoop configuration files and util scripts could be copied from my Github [1]. Tiny docker-helper scripts are as follows:

Now, with the foreplay complete, let's see the Dockerfile itself:


This docker instance is based on Ubuntu 12.04 (Precise Pangolin) and covers all required components: Oracle JDK, Hadoop+Ecosystem, basic system utils. Installation instructions are as follows:
  1. Pre-configure local environment:
    $> ./local_env.sh
     
  2. Build the container (it will take a minute or two):
    $> ./build.sh
     
  3. Run the container:
    $> ./run.sh
  4. Once in the container - emulate login (and hence - reads env variables):
    #> su -
  5. HDFS Initialization (once only):
    #> ./hdfs_format.sh
    #> ./hadoop_pseudo_start.sh
    #> ./hdfs_init.sh
     
  6. Restart the cluster to finalize initialization:
    #> ./hadoop_pseudo_stop.sh
    #> ./clear_hadoop_logs.sh
    #> ./hadoop_pseudo_start.sh
  7. Enjoy your cluster:
    #> hdfs dfs -ls -R /
    #> hbase shell
            status 'simple'
    #> pig
By default, container's filesystem state is reset at each run. In other words - all your data is gone the moment you exit the container. Natural solution to this issue is move HDFS mount point and few other folders outside of the container:

Host OS FilesystemContainer FilesystemDescription
/var/hstation/dfs/dfsFolder hosts HDFS filesystem
/var/hstation/workspace/workspaceFolder to exchange data to/from container
/var/hstation/logs/logsContains Hadoop/HBase/Zookeeper/Pig logs

We are also exposing HTTP ports, that allow us to connect to the Hadoop processes inside the container:

Exposed Container PortsDescription
http://CONTAINER_IP:8088/clusterResource Manager
http://CONTAINER_IP:19888/jobhistoryJob History
http://CONTAINER_IP:50070HDFS Name Node
http://CONTAINER_IP:60010HBase Master
http://CONTAINER_IP:8042/nodeYarn Node Manager

In the table above, CONTAINER_IP is found by running following command in your container:
#> domainname -i


To sum things up, container build time will take about 10 minutes and another 2-3 minutes to start and init the container for the first time. From that moment on - it's literally seconds before your Hadoop sandbox is ready to crunch the data.

Cheers!

[1] https://github.com/mushkevych/configurations/tree/master/CDH4.pseudo-distributed

Tuesday, August 20, 2013

Puppet+Hiera+Jenkins. Jenkins Integration

Continuation. By now we should have a working bundle of Puppet and Hiera, and in this post we will focus on Jenkins integration, as well as management of the apps configuration.

First, let's recall that roles contain class {'configuration':} declaration. This tiny module allows us to keep a current copy of the apps configuration on the target nodes.
Module's file tree looks as follow:


The manifest file itself (init.pp) is as follows:


It can be summarized as procedure that base on ${environment} name and ${application} name copies configuration files from /biz/puppet/hieradata/${environment}/${application} on Puppet Master to /biz/configuration/${application} on Puppet Agent node.

It is important that destination point has no reference of the ${environment}. This later allows Jenkins to perform configuration update in a complete agnostic way - with only ${application} name being required.

Consider the file tree under files folder. It is grouped by environment and application name. Should you like to add a new application, let's say HelloWorld to the CI env, you would have to:
  1. Make sure that folder /biz/puppet/hieradata/ci exist
  2. Create new subfolder /biz/puppet/hieradata/ci/HelloWorld
  3. Place HelloWorld configuration files into /biz/puppet/hieradata/ci/HelloWorld 
  4. Create/update Jenkins script to take the application configuration files from /biz/configuration/${application}
  5. Create/update a role to reference the HelloWorld:


The flow can be illustrated as:
Fig 1: Deployment process

Last, but not least - let's describe development process on Puppet+Hiera+Jenkins govern cluster:

Fig 2: Development process
Fig 2 illustrates that the framework significantly fast-tracks configuration changes. In summary, we have shown a configuration management framework targeted for small-to-medium sized project, where you are comfortable with file-based configuration for your applications.

Cheers!

Puppet+Hiera+Jenkins. Custom Modules

Continuation. In this second post of the series we will focus on custom modules for the Puppet+Hiera+Jenkins framework. For the sake of brevity we will illustrate our efforts at an example of a single QA environment.

As it is defined in puppet.conf, site.pp file could be found at /biz/puppet/hieradata/${environment}. In terms of QA environment it is /biz/puppet/hieradata/qa:


Here, manifest (aka site.pp) assigns a single role to every node in the cluster. Nodes are identified by the full domain name. Let's assume that:
  • Puppet Master lives at ip-10-0-8-10.sampup.com
  • ip-10-0-8-11.sampup.com and ip-10-0-8-12.sampup.com are Puppet Agents. 
Roles are defined in /biz/puppet/modules/role module:


You might be asking what class {'configuration':} is about? This where actual application configuration is delivered to agent nodes, and we will address this module in the last post of the series.

Corresponding profiles are defined in /biz/puppet/modules/profile module:


Lets discuss Hiera configuration. There is a lot of reading available [1], however we can summarize following:
  1. Hiera data files are named after node's fully qualified domain name
    For instance: ip-10-0-8-12.sampup.com.json
  2. Files contain key-values pairs
    For instance: "ntp::restrict" : true
  3. Each key comprise of the module name and the property name, joined by ::
    For instance, in "ntp::restrict" : true - ntp is the module name
    - restrict is the property name
    - true is the value
  4. Declared key-values are applied during invocation of the parametrized Puppet classes
  5. In case Hiera finds no filename matching the node's domain name, it will first look in common.json
    Puppet will resolve to the default module parameters, should it find no relevant key-values pairs.
/biz/puppet/hieradata/qa/common.json


/biz/puppet/hieradata/qa/ip-10-0-8-12.sampup.com.json


Continue with Jenkins and application configuration.

[1] Hiera reading
http://docs.puppetlabs.com/hiera/1/puppet.html#hiera-lookup-functions

Puppet+Hiera+Jenkins. Concepts

Wouldn't it be great if we could transfer large part of the application configuration from Development to Ops? For instance, typical Java EE application hold tons of external references (such as DB Url, User, Password, Security Certificates, etc). Typical configuration change often requires hours and distracts developers, build engineers and QA. Lots of waste for trivial change.

In this series of three posts we will show a configuration management framework, build of Puppet, Hiera and Jenkins that supports multiple environments:
  • In this post we will review concepts and Puppet's configuration files
  • Second post will be devoted to custom modules and interaction with Hiera
  • Final, third post will focus on Jenkins integration
 Our target environment could be pictured as following:
Fig 1: Deployment schema for two environments

We will build on top of practices [1], and I strongly advice to read that article before proceeding further.

Conceptually, the workflow consist of the following steps:
1. Platform setup: during this step system interfaces and services of the nodes are configured (networking, file-system permissions, users and groups creations, rdbms, http servers, etc)
Fig 2: Puppet workflow


2.  Application setup: during this step business applications are installed and configured on the nodes
Fig 3: Jenkins workflow

Our skeleton's file structure will look as follow:


Three folders form a top-level hierarchy:
  • /etc/puppet soon after puppet package is installed this folder contains standard puppet modules
    In addition, we will place there two configuration files:
    - puppet.conf describing puppet search paths and config
    - hiera.yaml describing Hiera search paths and config
  • /biz/puppet/modules contains all of the custom modules
  • /biz/puppet/hieradata holds Hiera data files grouped by environment
Lets review puppet configuration files from /etc/puppet

puppet.conf

This file should be normally divided into two: one with [main] and [master] sections deployed at Puppet Master node and another with [main] and [agent] sections deployed to Puppet Agent nodes.
puppet.conf provides wide spectrum of settings [3], but here we define only most critical ones, such as:
  • modulepath defines path to the custom modules
  • manifest defines where site.pp files are located
    Note that in our case site.pp files are environment-specific. For instance site.pp for QA environment is stored in /biz/puppet/hieradata/qa/ folder
  • server identifies the Puppet Master domain name
  • environment defines environment of the Agent node. For instance: qa, ci, etc
    Details on the environment setting could be found at [2]

hiera.yaml
Above, we declared that configuration files for particular nodes in the cluster will be in JSON format, grouped by environment and identified by the full domain name.
Note that the attribute -common under :hierarchy: denotes common settings for the environment and in our case refers to the /biz/puppet/hieradata/${environment}/common.json file.

Continue with custom modules.

[1] Craig Dunn: Designing Puppet – Roles and Profiles
http://www.craigdunn.org/2012/05/239/

[2] Defining environment in Puppet/Hiera
http://docs.puppetlabs.com/guides/environment.html

[3] Puppet.conf in details
http://docs.puppetlabs.com/references/latest/configuration.html

Friday, July 26, 2013

OrderedSet for Python 2.7

While working on the Kosaraju algorithm for Coursera course [1] homework, I wrote a simple and (hopefully) convenient OrderedSet for Python 2.7

Feel free to use and share it:

And just in case you are looking for unit tests or simple usage examples, please refer to the gist below:

Cheers!

[1] Algorithms: Design and Analysis, Part 1
https://class.coursera.org/algo-004/class/index

Saturday, May 18, 2013

launch.py

Good news is - you don't have to write any extra code to run your Python application in a separate process (so called daemonization). All you need is - fully specified name of the method/function to execute. Something like:

    some_package.some_class.SomeClass.start_me
or
    some_package.some_script.main_function

Everything else is handled by launch.py [1].

launch.py is a set of friendly Python tools to install Virtual Environment, start, stop your application. In addition it gives you a single interface to test and analyze your code; provides configuration management.

In this post we will outline two features: installation and daemonization.

Installation. It's simple. launch.py will create a Virtual Environment for you and make sure that your application is executed within it. What you need to do is to download all of the required libraries for your application and place them in folder:
    launch.py/vendor 

Order of the libraries installation is defined by script:
    launch.py/scripts/install_virtualenv.sh


Once this step is complete just run ./launch.py -i to install Virtual Environment along with all of required libraries.

launch.py is also there to help with the application start and stop. There are two modes to run your application:
  • daemonized
    the application is started in a separate process
  • interactive
    the application is executed in the same command-line terminal where you have called launch.py and uses shared stdin, stdout and stderr to interact with the user
To use this feature simply follow the guide:
  1. Write the actual code
    List of the following assumptions is in place:
    - starter method or function has signature starter_name(*args)- classes implement __init__(self, process_name)
  2. Register your fully specified function/method name in launch.py/system/process_context.py as follows:
    PROCESS_CONTEXT = {
    ...
        'YOUR_PROCESS_NAME': _create_context_entry(
            process_name='YOUR_PROCESS_NAME',
            classname='workers.some_script.main',
            token='YOUR_PROCESS_NAME',
            time_qualifier=''),
    ...
    

    Should you wish to instantiate a class instance and start its method - simply define the class name:

            classname='workers.some_module.YourPythonClass.starter_name'
    

  3. ./launch.py --run --app YOUR_PROCESS_NAME will start your class' or script's starter method/function in a separate process (daemonized mode) 
  4. Add --interactive to the command above and run it in an interactive mode
In summary - launch.py is here to make your life easier. You can find details and read on many other features of the launch.py framework at its wiki [2].

Cheers!

[1] launch.py on the Github
https://github.com/mushkevych/launch.py

[2] launch.py WIKI
https://github.com/mushkevych/launch.py/wiki

Friday, March 08, 2013

PMP Audit and Exam

If you are like me, then after 4 months of studying PMP Exam Prep in transit, you registered at pmi.org and applied for the PMP exam. And (if you are like me) you have seen the sacred "your application has been chosen for the audit". Feel free to be like me and give yourself 20 minutes to panic: I recall running into the bathroom and confessing to my wife that I have just wasted $630 and 4 months of transit reading. I do recall sticky heat that covered me from from head to toe and, as mentioned earlier, it took me 20 minutes to get myself together. 

Now, supposedly, you got yourself together and got ready to face the reality - you have 90 days to complete the audit. In fact it took me 40 days to notify my former bosses, gather all of the envelopes and send the application to the PMI. Never shall I forget Canada Post and 10 long days before my documents crossed the border and reached the recipient.

After PMI approval I have spent another 2 months of reading and preparing for the exam, and today I have finally passed it! My first impressions are as following:
  • I find that fingerprinting, metal detector and eversion of pockets is an overkill for a certification exam
  • Despite $630 tag there was no coffee or even potable water dispenser (at least this is true for the certification center at Metrotown in Burnaby, BC)
  • As mentioned in hundreds of posts from around the world - PMI requires you to immerse into their "world of PMI-ism", and your exam success is fully driven by the level of immersion
In summary:
  • I have used PMP Examp Prep [1] as my primary source:
  • PM Fastrack [2] turned out to be a very useful tool:
  • It took me 6 month to prepare myself for the exam and additional 40 days to prepare documents package for the PMI Audit.
  • Don't forget to enrol to PMI before applying for PMP. This will save you $25.
  • Remember to fuel yourself with coffee before entering the examination room, as you might have no other opportunity before the end of exam.
Cheers!

[1] PMP Exam Prep:
http://store.rmcproject.com/Detail.bok?no=392

[2] PM Fastrack:
http://store.rmcproject.com/Detail.bok?no=310

Friday, January 11, 2013

HBase: secondary index

As your HBase project moves forward you will likely face a request to search by criteria that is neither included into the primary index nor can be included into it. In other words you will face a problem of fast and efficient search by secondary index. For instance: select all eReaders in a specific price range. In this post, we will review an approach of constructing a secondary index.

As usually, we will work in realm of Surus ORM [1] and Synergy Mapreduce Framework [2], and will start with the definition of a model. For illustration purposes we will use simplified variant of "product" class, that has lowest and highest prices and can only belong to one category. For instance:

ID category priceLowest priceHighest manufacturer
Sony eReader PRST2BC E-READER 8900 12900 SONY


Instances will reside in a table product:

To satisfy our search requests, we would like to get a following structure:
ID products
Sony eReader PRST2BC Kobo ... ...
E-READER { priceLowest : 89000,
priceHighest: 12900,
manufacturer: SONY}
{ ... } { ... }
Here, any search within a specified category would allow us to quickly filter out products in a specific price range or manufacturer.

To create an index as described above, we would need a new model to hold filtration criterias and a mapreduce job to periodically update it.
Secondary index model:

and its corresponding grouping table:

Mapreduce job implies that Job Runner will use product table for source and grouping table for sink. Job's mapper:
and a reducer:
As an alternative to secondary index you can use filtering. For instance SingleColumnValueFilter:
However, SingleColumnValueFilter approach is insufficient for large tables and frequent searches. Stretching it too far will cause performance degradation across the cluster.

To sum it up, secondary indexes are not a trivial, but at the same time - not a paramount of complexity. While designing them, one should look carefully for the filtration criteria and "long-term" perspective.

Hopefully this tutorial would serve you with help.
Cheers!

[1] Surus ORM
https://github.com/mushkevych/surus

[2] Synergy Mapreduce Framework
https://github.com/mushkevych/synergy-framework