Friday, May 18, 2012

Integer encoding for Hadoop

In short, integer encoding is mechanism to compact multiple integer values into single byte array. It is well described in literature [1] and in this post we will review exemplary implementation.

Unless one wants to re-implement encoding algorithms, we will reuse Varint class from mahout-core [2].
Simplest use-case for 2 integers looks like:

import org.apache.log4j.Logger;
import org.apache.mahout.math.Varint;
import java.io.*;
/**
* @author Bohdan Mushkevych
* Description: module presents tuple of two int values: alpha and beta
*/
class Tuple2I {
protected int alpha;
protected int beta;
public Tuple2I(int alpha, int beta) {
this.alpha = alpha;
this.beta = beta;
}
/**
* @return int presenting alpha
*/
public int getAlpha() {
return alpha;
}
/**
* @return int presenting beta
*/
public int getBeta() {
return beta;
}
}
/**
* @author Bohdan Mushkevych
* Description: module contains logic for VarInt integer encoding: conversion to and from the byte array
*/
public class Encoder {
protected ByteArrayOutputStream baos = new ByteArrayOutputStream();
protected DataOutputStream dos = new DataOutputStream(baos);
private Logger log = Logger.getRootLogger();
public Encoder() {
}
/**
* Method generates byte[] for varargs of integers
* @param args varargs of Integer type
* @return generated value
*/
protected byte[] getByteArray(int... args) {
byte[] result = null;
try {
baos.reset();
for (int i : args) {
Varint.writeSignedVarInt(i, dos);
}
result = baos.toByteArray();
} catch (IOException e) {
log.error("Exception on Integer encoding", e);
}
return result;
}
/**
* method decodes tuple from byte array
* @param value encoded tuple
* @return formed tuple instance
*/
public Tuple2I decode(byte[] value) {
ByteArrayInputStream bais = new ByteArrayInputStream(value);
DataInputStream dis = new DataInputStream(bais);
Tuple2I tuple = null;
try {
int alpha = Varint.readSignedVarInt(dis);
int beta = Varint.readSignedVarInt(dis);
tuple = new Tuple2I(alpha, beta);
} catch (IOException e) {
log.error("Exception on Integer decoding", e);
}
return tuple;
}
}
view raw Encoder.java hosted with ❤ by GitHub

Here, we declared structure of 2 integers - Tuple2I, and followed it by Encoder example that encodes and decodes integers to and from byte array.

For real-world usages of the Integer encoder, refer to Surus [3]. By wide adoption of integer encoding on 3 and 4 integer tuples, I was able to reduce Mapper output by 20-30%, and saved about 30 minutes of computation time.

[1] Data-Intensive Text Processing with MapReduce
http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421

[2] Typical Maven repository
http://mvnrepository.com/artifact/org.apache.mahout/mahout-core/

[3] Surus
https://github.com/mushkevych/surus

1 comment:

RP said...

I would move stream init code in Encoder down to constructor