Unless one wants to re-implement encoding algorithms, we will reuse Varint class from mahout-core [2].
Simplest use-case for 2 integers looks like:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import org.apache.log4j.Logger; | |
import org.apache.mahout.math.Varint; | |
import java.io.*; | |
/** | |
* @author Bohdan Mushkevych | |
* Description: module presents tuple of two int values: alpha and beta | |
*/ | |
class Tuple2I { | |
protected int alpha; | |
protected int beta; | |
public Tuple2I(int alpha, int beta) { | |
this.alpha = alpha; | |
this.beta = beta; | |
} | |
/** | |
* @return int presenting alpha | |
*/ | |
public int getAlpha() { | |
return alpha; | |
} | |
/** | |
* @return int presenting beta | |
*/ | |
public int getBeta() { | |
return beta; | |
} | |
} | |
/** | |
* @author Bohdan Mushkevych | |
* Description: module contains logic for VarInt integer encoding: conversion to and from the byte array | |
*/ | |
public class Encoder { | |
protected ByteArrayOutputStream baos = new ByteArrayOutputStream(); | |
protected DataOutputStream dos = new DataOutputStream(baos); | |
private Logger log = Logger.getRootLogger(); | |
public Encoder() { | |
} | |
/** | |
* Method generates byte[] for varargs of integers | |
* @param args varargs of Integer type | |
* @return generated value | |
*/ | |
protected byte[] getByteArray(int... args) { | |
byte[] result = null; | |
try { | |
baos.reset(); | |
for (int i : args) { | |
Varint.writeSignedVarInt(i, dos); | |
} | |
result = baos.toByteArray(); | |
} catch (IOException e) { | |
log.error("Exception on Integer encoding", e); | |
} | |
return result; | |
} | |
/** | |
* method decodes tuple from byte array | |
* @param value encoded tuple | |
* @return formed tuple instance | |
*/ | |
public Tuple2I decode(byte[] value) { | |
ByteArrayInputStream bais = new ByteArrayInputStream(value); | |
DataInputStream dis = new DataInputStream(bais); | |
Tuple2I tuple = null; | |
try { | |
int alpha = Varint.readSignedVarInt(dis); | |
int beta = Varint.readSignedVarInt(dis); | |
tuple = new Tuple2I(alpha, beta); | |
} catch (IOException e) { | |
log.error("Exception on Integer decoding", e); | |
} | |
return tuple; | |
} | |
} |
Here, we declared structure of 2 integers - Tuple2I, and followed it by Encoder example that encodes and decodes integers to and from byte array.
For real-world usages of the Integer encoder, refer to Surus [3]. By wide adoption of integer encoding on 3 and 4 integer tuples, I was able to reduce Mapper output by 20-30%, and saved about 30 minutes of computation time.
[1] Data-Intensive Text Processing with MapReduce
http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421
[2] Typical Maven repository
http://mvnrepository.com/artifact/org.apache.mahout/mahout-core/
[3] Surus
https://github.com/mushkevych/surus