mushkevych: CSV import into HBase

This week was about extracting users' behavior patterns, grouped by account lifespan. However, before the processing itself could take place, I had to copy data from PostgreSQL to HBase... and it turned out to be a demanding problem.

The first question that comes to mind: why bother? Wasn't it done by our forefathers and known to us as Sqoop? True, but I have two excuses:

I wanted to pre-process data from PostgreSQL before placing it into HBase. With Sqoop, a set of additional map/reduce jobs would be required.
Input data was in multiple CSV dumps, so with straightforward Sqoop I would need a set of map/reduce jobs for each of them.

Trying to avoid as much additional work as possible, I first came up with the idea of an omnivorous REST interface (hey, it's only 3 GB). However, even with thread-pooling and connection-pooling, the most I squeezed out of it was about 130 commits per second (totalling 150-200 MB per day).
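To put that rate in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the figures quoted above; treating 1 GB as 1024 MB is my assumption.

```python
# Figures quoted in the post; binary units (1 GB = 1024 MB) are an assumption.
DATASET_MB = 3 * 1024
REST_MB_PER_DAY = (150, 200)  # observed range for the REST interface


def days_to_import(dataset_mb, mb_per_day):
    """Days needed to push dataset_mb through a pipe that moves mb_per_day."""
    return dataset_mb / mb_per_day


slowest = days_to_import(DATASET_MB, REST_MB_PER_DAY[0])
fastest = days_to_import(DATASET_MB, REST_MB_PER_DAY[1])
print(f"REST import of 3 GB: {fastest:.1f} to {slowest:.1f} days")
```

In other words, "only 3 GB" would have taken roughly two to three weeks over REST.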

At this point two things became apparent:

Even with only 3 GB, the import must be moved to the Hadoop cluster
It has to be either map/reduce or a direct HBase tunnel

And so it was born: CSVImporter. A server-worker design, based on Surus [1], with thread-pooling, connection-pooling and a write buffer. It also uses Super CSV [2].
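The general shape of such a design can be sketched in a few lines. This is not the CSVImporter code (that one is built on Surus); it is a minimal Python illustration under my own assumptions. InMemoryTable is a hypothetical stand-in for an HBase table, preprocess is a placeholder for whatever clean-up the data needs, and all names and parameters are invented for the example. Connection-pooling is left out.

```python
import csv
import io
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryTable:
    """Hypothetical stand-in for an HBase table; counts the batches it receives."""

    def __init__(self):
        self._lock = threading.Lock()
        self.rows = {}
        self.batches = 0

    def put_batch(self, batch):
        with self._lock:
            self.batches += 1
            for row_key, columns in batch:
                self.rows[row_key] = columns


class BufferedWriter:
    """Write buffer: collects rows and sends them to the table in batches."""

    def __init__(self, table, batch_size):
        self.table = table
        self.batch_size = batch_size
        self.buffer = []

    def put(self, row_key, columns):
        self.buffer.append((row_key, columns))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.table.put_batch(self.buffer)
            self.buffer = []


def preprocess(record):
    """Placeholder pre-processing: trim whitespace, use the 'id' column as row key."""
    cleaned = {key: value.strip() for key, value in record.items()}
    return cleaned.pop("id"), cleaned


def worker(records, table, batch_size):
    """Worker: pre-process one chunk of records and write it through its own buffer."""
    writer = BufferedWriter(table, batch_size)
    for record in records:
        writer.put(*preprocess(record))
    writer.flush()  # push whatever is left in the buffer
    return len(records)


def import_csv(text, table, workers=4, chunk_size=1000, batch_size=100):
    """Server: split the CSV into chunks and hand them to a pool of workers."""
    records = list(csv.DictReader(io.StringIO(text)))
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda chunk: worker(chunk, table, batch_size), chunks)
        return sum(counts)


if __name__ == "__main__":
    table = InMemoryTable()
    imported = import_csv("id,name,age\n1, Ann ,30\n2,Bob,41\n3,Cy,25\n", table,
                          workers=2, chunk_size=2, batch_size=2)
    print(imported, "rows imported in", table.batches, "batches")
```

The point of the write buffer is that the table sees a handful of batched writes instead of one round trip per row, which is where the per-commit REST approach lost its time.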

Performance: 403 MB per hour (the largest CSV dump, 2.8 GB, was imported within 7 hours 11 minutes).
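For the curious, the numbers can be cross-checked with a few lines of Python. Again, 1 GB = 1024 MB is my assumption, and since 2.8 GB is a rounded figure, the recomputed rate lands slightly below the quoted 403 MB per hour.

```python
# Figures quoted in the post; binary units (1 GB = 1024 MB) are an assumption.
dump_mb = 2.8 * 1024
hours = 7 + 11 / 60
recomputed_rate = dump_mb / hours  # MB per hour for the largest dump

quoted_rate = 403                  # MB per hour, as stated above
rest_mb_per_day = (150, 200)       # REST interface figures from earlier in the post

# How many times more data per day the importer moves compared to REST.
speedup_low = quoted_rate * 24 / rest_mb_per_day[1]
speedup_high = quoted_rate * 24 / rest_mb_per_day[0]

print(f"recomputed rate: {recomputed_rate:.0f} MB/hour")
print(f"speedup over REST: {speedup_low:.0f}x to {speedup_high:.0f}x")
```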

[1] Surus at GitHub
https://github.com/mushkevych/surus

[2] Super CSV
http://supercsv.sourceforge.net/codeExamples_general.html

mushkevych

Friday, March 23, 2012
