The first question that comes to mind is: why bother? Wasn't this solved by our forefathers and known to us as sqoop? True, but I have two excuses:
- I wanted to pre-process the data coming from PostgreSQL before placing it into HBase. With sqoop, a set of additional map/reduce jobs would be required
- The input data was spread across multiple CSV dumps, so with straightforward sqoop I would need a set of map/reduce jobs for each of them
Trying to avoid as much additional work as possible, I first came up with the idea of an omnivorous REST interface (hey, it's only 3 GB). However, even with thread pooling and connection pooling, the most I squeezed out of it was about 130 commits per second (totaling 150-200 MB per day).
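The post doesn't show the REST prototype's code, but to illustrate where the ceiling comes from, here is a minimal sketch of a thread-pooled client pushing one row per HTTP round trip (the endpoint URL, pool size, and payload format are all assumptions, not the original implementation):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RestImportSketch {
    public static void main(String[] args) throws Exception {
        // thread pool: many rows in flight, but still one HTTP round trip per row
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (String line : Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            pool.submit(() -> {
                try {
                    // hypothetical gateway endpoint in front of HBase
                    HttpURLConnection c = (HttpURLConnection)
                            new URL("http://hbase-gateway:8080/rows").openConnection();
                    c.setRequestMethod("POST");
                    c.setDoOutput(true);
                    try (OutputStream out = c.getOutputStream()) {
                        out.write(line.getBytes(StandardCharsets.UTF_8));
                    }
                    c.getResponseCode();  // block until the server acknowledges the commit
                    c.disconnect();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}

Each row pays the full request/response latency, which is why throughput plateaus at a low commits-per-second number no matter how large the pools get.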
At this point two things became apparent:
- Even with only 3 GB, the import had to be moved to the Hadoop cluster
- It had to be either map/reduce or a direct HBase tunnel
And so CSVImporter was born: a Server-Worker design based on Surus [1], with thread pooling, connection pooling, and a write buffer. It also uses Super CSV [2].
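The Server-Worker wiring itself isn't shown in the post; below is a minimal single-threaded sketch of the core parse/pre-process/buffered-write path that each worker would run, using the Super CSV reader and the HBase 0.9x-era client's write buffer. The table name, column family, and choice of column 0 as the row key are assumptions for illustration:

import java.io.FileReader;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.supercsv.io.CsvListReader;
import org.supercsv.prefs.CsvPreference;

public class CsvToHBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "imported_csv");  // hypothetical table name
        table.setAutoFlush(false);                        // buffer Puts client-side
        table.setWriteBufferSize(12 * 1024 * 1024);       // 12 MB write buffer

        CsvListReader reader = new CsvListReader(
                new FileReader(args[0]), CsvPreference.STANDARD_PREFERENCE);
        try {
            reader.getHeader(true);                       // skip the CSV header row
            List<String> row;
            while ((row = reader.read()) != null) {
                // pre-processing hook: the step that would have cost
                // extra map/reduce jobs with sqoop goes here
                Put put = new Put(Bytes.toBytes(row.get(0)));  // column 0 as row key: an assumption
                put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"),
                        Bytes.toBytes(row.toString()));
                table.put(put);                           // buffered, not sent immediately
            }
        } finally {
            reader.close();
            table.flushCommits();                         // drain the write buffer
            table.close();
        }
    }
}

The write buffer is what closes the gap with the REST approach: instead of one network round trip per row, Puts accumulate client-side and ship to the region servers in bulk.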
Performance: 403 MB per hour (the largest 2.8 GB CSV dump was imported in 7 hours 11 minutes).
[1] Surus at GitHub
https://github.com/mushkevych/surus
[2] Super CSV
http://supercsv.sourceforge.net/codeExamples_general.html