Friday, March 23, 2012

CSV import into HBase

This week was about extracting users' behavior patterns, grouped by account lifespan. However, before any processing could take place, I had to copy the data from PostgreSQL to HBase... and that turned out to be a surprisingly labour-intensive problem.

The first question that comes to mind is: why bother? Wasn't this already solved by our forefathers and handed down to us as Sqoop? True, but I have two excuses:
  • I wanted to pre-process the data from PostgreSQL before placing it into HBase. With Sqoop, a set of additional map/reduce jobs would be required.
  • The input data came as multiple CSV dumps, so a straightforward Sqoop import would need its own set of map/reduce jobs for every one of them.
Trying to avoid as much of the additional work as possible, I first came up with the idea of an omnivorous REST interface (hey, it's only 3 GB). However, even with thread-pooling and connection-pooling, the most I squeezed out of it was about 130 commits per second (totalling 150-200 MB per day).
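To see why that first attempt topped out so low, it helps to look at the write pattern: each incoming row ended up as a single Put committed on its own, so every row paid a full round trip to the region server, no matter how many threads were pushing. The sketch below reproduces just that pattern; it uses the current HBase client API rather than the 2012-era one, and the table name, column family and row layout are invented for illustration.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PerRowCommit {

    public static void main(String[] args) throws Exception {
        // placeholder rows standing in for whatever the REST endpoint receives
        List<String> incomingRows = List.of("user1,login", "user2,logout");

        Configuration conf = HBaseConfiguration.create();
        Connection connection = ConnectionFactory.createConnection(conf); // shared, thread-safe
        ExecutorService workers = Executors.newFixedThreadPool(16);       // thread pool

        for (String line : incomingRows) {
            workers.submit(() -> {
                String[] cols = line.split(",");
                // Table instances are not thread-safe, so each task takes its own
                try (Table table = connection.getTable(TableName.valueOf("user_actions"))) {
                    Put put = new Put(Bytes.toBytes(cols[0]));            // row key = first column
                    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("action"),
                            Bytes.toBytes(cols[1]));
                    table.put(put);                                       // one RPC per row
                }
                return null;
            });
        }

        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
        connection.close();
    }
}

More threads only hide the per-row latency up to a point; the commit-per-row round trip is what caps throughput around the numbers quoted above.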

At this point two things became apparent:
  • Even with only 3 GB, the import had to be moved to the Hadoop cluster.
  • It had to be either a map/reduce job or a direct HBase tunnel.
And so CSVImporter was born: a Server-Worker design based on Surus [1], with thread-pooling, connection-pooling and a write buffer. It also uses SuperCSV [2].
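The CSVImporter source itself isn't shown in this post, so the following is only a minimal sketch of the parse-and-buffer loop it describes: SuperCSV's CsvListReader feeding buffered Puts. It uses the current HBase client's BufferedMutator as a stand-in for the original write buffer, and the table name, column family and row-key layout are assumptions made up for the example.

import java.io.FileReader;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.supercsv.io.CsvListReader;
import org.supercsv.prefs.CsvPreference;

public class CsvImportSketch {

    public static void main(String[] args) throws Exception {
        String csvFile = args[0];                                   // one of the CSV dumps
        Configuration conf = HBaseConfiguration.create();

        // 8 MB client-side write buffer: Puts are accumulated and flushed in bulk
        BufferedMutatorParams params =
                new BufferedMutatorParams(TableName.valueOf("user_actions"))
                        .writeBufferSize(8 * 1024 * 1024);

        try (Connection connection = ConnectionFactory.createConnection(conf);
             BufferedMutator mutator = connection.getBufferedMutator(params);
             CsvListReader reader = new CsvListReader(new FileReader(csvFile),
                     CsvPreference.STANDARD_PREFERENCE)) {

            reader.getHeader(true);                                 // skip the header row
            List<String> row;
            while ((row = reader.read()) != null) {
                // pre-processing hook: transform the row here before it is written
                Put put = new Put(Bytes.toBytes(row.get(0)));       // row key = first CSV column
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("action"),
                        Bytes.toBytes(row.get(1)));
                mutator.mutate(put);                                // buffered, not an RPC per row
            }
            // any remaining buffered Puts are flushed when the mutator closes
        }
    }
}

In the actual tool, the Server-Worker split from Surus and the thread/connection pools wrap around a loop like this one, with each worker handling its own slice of the dumps; that plumbing is omitted here.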

Performance: 403 MB per hour (the largest CSV dump, 2.8 GB, was imported in 7 hours 11 minutes).
