Integration One: Handle Errors in your Batch Job… Like a Champ!

Mariano Gonzalez on Thursday, June 12, 2014

Handle Errors in your Batch Job… Like a Champ!

Fact: Batch
Jobs are tricky to handle when exceptions raise. The problem is the
huge amounts of data that these jobs are designed to take. If you’re
processing 1 million records you simply can’t log everything. Logs would
become huge and unreadable. Not to mention the performance toll it
would take. On the other hand, if you log too little then it’s
impossible to know what went wrong, and if 30 thousand records failed,
not knowing what’s wrong with them can be a royal pain. This is a
trade-off not simple to overcome.

We took the feedback we got from the early releases around this issue, and now that Mule 3.5.0 and the Batch Module finally went GA with the May 2014 release, small but significant improvements took place to help you with these errors. Let’s take a look at some simple scenarios.

Before we start

This post assumes you’re familiar with the new Batch module included in Mule 3.5. If that’s not the case, here’re some links you might want to read before continuing:

Records Failing in a Step

So, let’s take a simple scenario. I have a simple step and records
tend to fail there. This is the simplest manifestation of the trade-off
we talked about at the beginning of this post. Logging a whole lot of
unmanageable text or log so little that is barely useful. The strategy
we took here is to not log exceptions but to log stack traces. Yeap, you
read correctly. When we found an exception in a step we perform the
following algorithm:

Get the exception’s full stack trace
Strip that stack trace from any messages. Even if all records always
throw the same exception, those messages probably contain record
specific information. E.g: ‘user Walter White is not eligible because
he’s wanted by the DEA’. The user Jesse Pinkman would throw a slightly
different message, but bottom line is that the exception is actually the
same.
Verify if this stack trace was already logged in the current step.
If this is the first time that stack trace is found on the current step,
then we log it adding a nice message saying that we will not log it again for the current step.
Otherwise, if we already saw that stack trace in the given step, then
we simply ignore it but keep count of how many times that exception was
found

Let’s see an example of how the WantedByDEAException would be logged:

              
          
          

            This is the first record to show this exception on this step for this job instance. Subsequent records with the same failureswill not be logged for performance and log readability reasons:
 
********************************************************************************
Message               : null (org.mule.blog.WantedByDEAException). Message payload is of type: String
Code                  : MULE_ERROR--2
--------------------------------------------------------------------------------
Exception stack is:
1. null (org.mule.blog.WantedByDEAException)
  org.mule.blog.FraudValidator:11 (http://www.mulesoft.org/docs/site/current3/apidocs/org/mule/blog/WantedByDEAException.html)
2. null (org.mule.blog.WantedByDEAException). Message payload is of type: String (org.mule.api.MessagingException)
  org.mule.execution.ExceptionToMessagingExceptionExecutionInterceptor:32 (http://www.mulesoft.org/docs/site/current3/apidocs/org/mule/api/MessagingException.html)
--------------------------------------------------------------------------------
Root Exception stack trace:
org.mule.blog.WantedByDEAException
 at org.mule.blog.FraudValidator.process(FraudValidator.java:11)
 at org.mule.execution.ExceptionToMessagingExceptionExecutionInterceptor.execute(ExceptionToMessagingExceptionExecutionInterceptor.java:24)
 at org.mule.execution.MessageProcessorNotificationExecutionInterceptor.execute(MessageProcessorNotificationExecutionInterceptor.java:58)
    + 3 more (set debug level logging or '-Dmule.verbose.exceptions=true' for everything)
********************************************************************************

        
            view raw

            gistfile1.sv

            hosted with ❤ by GitHub

          

GOTCHA: This is done on a by step basis. If another step also throws the same exception, it WILL be logged once more for that step.

Exception Summary

The above is good. We get to see the exception without clogging the
logs. However, because I only see the exception once, I don’t really
know how many times it happened in total. For that we added an exception
summary which is logged when the on-complete phase is reached. This
summary shows for every exception type, how many times it happened in
each step. Suppose that now we have a job with two steps, and both are
capable of throwing BatchException. Then the exception summary would
look like this:

              
            **************************************************************************************************
*             - - + Exception Type + - -             *  - - + Step + - -   *  - - + Count + - -  *
**************************************************************************************************
* com.mulesoft.module.batch.exception.BatchException * Batch_Step          * 10                  *
* com.mulesoft.module.batch.exception.BatchException * Second_Batch_Step   * 9                   *
**************************************************************************************************

            view raw

            gistfile1.sv

            hosted with ❤ by GitHub

Here we can see that the first step failed 10 times and the second failed 9.

Getting more out of the logs

As this post starts saying, the motivation for all of this is the
trade-off between large chunks of data and logging efficiency, that’s
why all the items described above work with Mule’s default configuration
of only logging INFO level messages (if you’re not familiar with
logging levels, take a look at this) .

There’re special cases like debugging or when you’re motivation to
use batch is not the ability to handle large datasets but
recoverability, resilience, management, etc (see this post
for more examples). For those cases, you can get a whole lot of extra
and extremely verbose information by setting the batch category to DEBUG
level in your app’s log4j configuration like this:

              
            log4j.logger.com.mulesoft.module.batch=DEBUG

            view raw

            gistfile1.sv

            hosted with ❤ by GitHub

CAUTION: Use the DEBUG level with care. Definitively
not advised for production environments unless you’re absolutely sure
that your jobs will not handle significantly large data sets.

Error Handling MEL functions

The early access release of the Batch module already shipped with
some error handling MEL functions that can be used in the context of a
batch step. Let’s refresh that real quick:

#[isSuccessfulRecord()]: Boolean functions which returns true if the current record has not thrown exceptions in any prior step.
#[isFailedRecord()]: Boolean functions which returns true if the current record has thrown exceptions in any prior step.
#[failureExceptionForStep(String)]:
This function receives the name of a step as a String argument. If the
current record threw exception on that step, then it returns the actual
Exception object. Otherwise it returns null

Again, the above is just fine. However, we quickly learn that those
felt short in some cases. Let’s consider the following scenario:

A batch job that polls files containing contact information. Those contacts are transformer using DataMapper and fed into the job
The first step transforms them again and inserts them into Salesforce
The second step transforms again and inserts them into Google Contacts

So far so good. The challenging new requirement is to be able to
write into a JMS dead-letter queue per each record that has failed. To
keep it simple, let’s just say that the message will be the exception
itself. In case you haven’t noticed, there’s a gotcha in this
requirement: each record could have failed in both steps, which means
that the same record would translate into two JMS messages.

To support this and other similar scenarios, we added these
additional functions which also require to be executed in the context of
a batch step:

#[getStepExceptions()]:
Returns a java Map<String, Exception> in which the keys are the
name of a batch step in which the current record has failed and the
value is the exception itself. If the record hasn’t failed in any step,
this Map will be empty but will never be null. Also, the Map contains no
entries for steps in which the record hasn’t failed.
#[getFirstException()]:
Returns the Exception for the very first step in which the current
record has failed. If the record hasn’t failed in any step, then it
returns null.
#[getLastException()]:
Returns the Exception for the last step in which the current record has
failed. If the record hasn’t failed in any step, then it returns null.

So let’s take a look at how this application would look like:

As you see, a pretty standard Batch Job. Because this is an error
handling post, I won’t go into detail of the whole app and I’ll just
focus on the last step. Let’s take a closer look:

Since the goal of this step is to gather failures, it makes sense to
skip the ones who haven’t got any, thus we set the accept policy to
failures only.

The first thing we do is to use the set-payload processor and the
getStepExceptions() MEL function together so that the message payload is
set to the exceptions map.

Remember how we said before that a given record could have many
exceptions? That’s the nice thing about this map, it gives you access to
the exceptions found in all steps. Since our goal is to send the
exceptions through JMS and we don’t care about the steps, we can simply
use a foreach to iterate over the map’s values (the exceptions) and send
them through a JMS outbound endpoint:

And we’re done! We have happily aggregated exceptions!

Summary

The above features were introduced in the GA release of Mule 3.5.0.
We tried our best to help you find equilibrium between useful and
meaningful messages without generating undesired verbosity. Have better
ideas or suggestions? Please reach out!

Hope you liked this!

Integration One

Wikipedia

Thursday, 12 June 2014

Handle Errors in your Batch Job… Like a Champ!

Handle Errors in your Batch Job… Like a Champ!

Before we start

Records Failing in a Step

Exception Summary

Getting more out of the logs

Error Handling MEL functions

Summary