Tuesday, March 27, 2012

Recovering from OutOfMemoryError in Hadoop MapReduce

An OutOfMemoryError in any environment indicates that something is very wrong. There is a good discussion on Stack Overflow [1] that can be shortened to "you can catch the error, but is it worth recovering?".

In terms of the Hadoop MapReduce framework, I am convinced that it is worth catching OutOfMemoryError and being able to recover from it. Let me illustrate:
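
A minimal sketch of such a mapper, written in plain Java without the Hadoop dependency so that it runs standalone. The class DummyMapper, its map method and the two counters are hypothetical stand-ins, not the original listing and not the Hadoop API. The oversized request relies on the assumption that a HotSpot JVM refuses to allocate an array of Integer.MAX_VALUE longs regardless of heap size.

```java
import java.util.Arrays;

// Hypothetical stand-in for a Hadoop Mapper; all names are illustrative only.
final class DummyMapper {
    long processed = 0;   // records handled successfully
    long skipped = 0;     // records dropped after an OutOfMemoryError

    // 'size' models how many long values a single record needs in memory.
    void map(String key, int size) {
        try {
            // Every large reference is declared inside the try block, so it is
            // out of scope (and collectable) once the catch block runs.
            long[] bigArray = new long[size];
            Arrays.fill(bigArray, 1L);
            processed++;
        } catch (OutOfMemoryError e) {
            // Skip the offending record and carry on with the next one.
            skipped++;
            System.err.println("Skipping record " + key + ": " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        DummyMapper mapper = new DummyMapper();
        mapper.map("normal-account", 1024);
        // Assumption: this request always fails with OutOfMemoryError on HotSpot.
        mapper.map("huge-account", Integer.MAX_VALUE);
        mapper.map("another-normal-account", 1024);
        System.out.println("processed=" + mapper.processed
                + " skipped=" + mapper.skipped);
    }
}
```

In a real job the counters would more likely be Hadoop counters, so that the number of skipped records shows up in the job report.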


Here we have a typical dummy mapper. For illustration, we introduced the variable bigArray as the culprit for the OutOfMemoryError.
Now, let's answer a few questions:

  • Is it abnormal when the system throws an OOME?
    Yes. However, when we are dealing with user-generated content, we cannot always protect ourselves from it.
    For example: in social networks, 99.99% of accounts have 10-20 MB of statistics data each, while 0.01% are fake/misused/flash-mobbed accounts that barely fit into 200-300 MB. Handling those is a challenge.
  • How to prevent the system from melting down?
    First of all, keep all declarations and all references inside the try block. The JVM runs the GC before it throws an OOME, and once control reaches the catch block, objects referenced only from inside the try block are unreachable and can be reclaimed. This way you can proceed.
  • Is it OK to skip 0.01% of problematic data?
    This is debatable; however, I am convinced that the answer is yes.
    As Tom White put it: "In an ideal world, your code would cope gracefully with all of these conditions. In practice, it is often expedient to ignore the offending records" [2].
To summarize the above, let me also present a screenshot from a real-world mapper. It shows that out of 274'022 records, 9 caused OutOfMemoryErrors, or about 0.0033%.

Figure 1: real-world mapper output



[1] Stack Overflow
http://stackoverflow.com/questions/2679330/catching-java-lang-outofmemoryerror

[2] Tom White: Hadoop: The Definitive Guide, second edition
Chapter 6, page 185 "Skipping Bad Records"
http://shop.oreilly.com/product/0636920010388.do

3 comments:

Serhiy Tsaryk said...

1. It is easier to avoid an OOME than to recover from it.
2. Sometimes it is cheaper to buy an additional 16 GB than to develop & test recovery code.

rpyaset said...

How can you be sure that by the time you get to the catch block the JVM is in such a state that anything can be done? E.g. run GC manually, etc.

You're making a good point about requirements that are driven by user data, but it still seems like you're fighting your system's resource threshold. Try to figure out the peak load and just double the resources to handle it. Not always possible, but with clouds and dynamic quotas it can be configured for less.

What amount of data, on average, are we talking about in your case?

Bohdan Mushkevych said...

Good questions. Let me provide answers:
- in cases with user-driven content among hundreds of thousands of accounts, you can expect a few that simply fall outside your predictions and expectations;
- it is not always cost-effective to pay for additional GB of RAM if you need them to process just 0.01% of (usually fake) data;
- in this particular case mappers were set to 512 MB and reducers to 768 MB; most of the processing fit well into 20-30 MB;
- in reality, the try{} block should be built with an understanding of JVM RAM allocation mechanics.
From [A] we learn that the JVM computes the required size before allocating an object.
Since Hadoop Table mappers and reducers are single-threaded, we can expect the failing memory allocation to happen inside the try{} block.
By keeping all references inside the try{} block and by allocating RAM in large chunks, we protect ourselves from the bad kind of OutOfMemory situation, in which the heap is completely full.

[A] OOME discussion on stackoverflow: http://stackoverflow.com/questions/9261705/what-happens-when-theres-insufficient-memory-to-throw-an-outofmemoryerror