Jobs are tricky to handle when exceptions raise. The problem is the
huge amounts of data that these jobs are designed to take. If you’re
processing 1 million records you simply can’t log everything. Logs would
become huge and unreadable. Not to mention the performance toll it
would take. On the other hand, if you log too little then it’s
impossible to know what went wrong, and if 30 thousand records failed,
not knowing what’s wrong with them can be a royal pain. This is a
trade-off not simple to overcome.
We took the feedback we got from the early releases around this issue, and now that Mule 3.5.0 and the Batch Module finally went GA with the May 2014 release, small but significant improvements took place to help you with these errors. Let’s take a look at some simple scenarios.
Before we start
This post assumes you’re familiar with the new Batch module included in Mule 3.5. If that’s not the case, here’re some links you might want to read before continuing:Records Failing in a Step
So, let’s take a simple scenario. I have a simple step and recordstend to fail there. This is the simplest manifestation of the trade-off
we talked about at the beginning of this post. Logging a whole lot of
unmanageable text or log so little that is barely useful. The strategy
we took here is to not log exceptions but to log stack traces. Yeap, you
read correctly. When we found an exception in a step we perform the
following algorithm:
- Get the exception’s full stack trace
- Strip that stack trace from any messages. Even if all records always
throw the same exception, those messages probably contain record
specific information. E.g: ‘user Walter White is not eligible because
he’s wanted by the DEA’. The user Jesse Pinkman would throw a slightly
different message, but bottom line is that the exception is actually the
same. - Verify if this stack trace was already logged in the current step.
If this is the first time that stack trace is found on the current step,
then we log it adding a nice message saying that we will not log it again for the current step.
Otherwise, if we already saw that stack trace in the given step, then
we simply ignore it but keep count of how many times that exception was
found
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
Exception Summary
The above is good. We get to see the exception without clogging thelogs. However, because I only see the exception once, I don’t really
know how many times it happened in total. For that we added an exception
summary which is logged when the on-complete phase is reached. This
summary shows for every exception type, how many times it happened in
each step. Suppose that now we have a job with two steps, and both are
capable of throwing BatchException. Then the exception summary would
look like this:
1 2 3 4 5 6 |
|
Getting more out of the logs
As this post starts saying, the motivation for all of this is thetrade-off between large chunks of data and logging efficiency, that’s
why all the items described above work with Mule’s default configuration
of only logging INFO level messages (if you’re not familiar with
logging levels, take a look at this) .
There’re special cases like debugging or when you’re motivation to
use batch is not the ability to handle large datasets but
recoverability, resilience, management, etc (see this post
for more examples). For those cases, you can get a whole lot of extra
and extremely verbose information by setting the batch category to DEBUG
level in your app’s log4j configuration like this:
1 |
|
CAUTION: Use the DEBUG level with care. Definitively
not advised for production environments unless you’re absolutely sure
that your jobs will not handle significantly large data sets.
not advised for production environments unless you’re absolutely sure
that your jobs will not handle significantly large data sets.
Error Handling MEL functions
The early access release of the Batch module already shipped withsome error handling MEL functions that can be used in the context of a
batch step. Let’s refresh that real quick:
- #[isSuccessfulRecord()]: Boolean functions which returns true if the current record has not thrown exceptions in any prior step.
- #[isFailedRecord()]: Boolean functions which returns true if the current record has thrown exceptions in any prior step.
- #[failureExceptionForStep(String)]:
This function receives the name of a step as a String argument. If the
current record threw exception on that step, then it returns the actual
Exception object. Otherwise it returns null
felt short in some cases. Let’s consider the following scenario:
- A batch job that polls files containing contact information. Those contacts are transformer using DataMapper and fed into the job
- The first step transforms them again and inserts them into Salesforce
- The second step transforms again and inserts them into Google Contacts
write into a JMS dead-letter queue per each record that has failed. To
keep it simple, let’s just say that the message will be the exception
itself. In case you haven’t noticed, there’s a gotcha in this
requirement: each record could have failed in both steps, which means
that the same record would translate into two JMS messages.
To support this and other similar scenarios, we added these
additional functions which also require to be executed in the context of
a batch step:
- #[getStepExceptions()]:
Returns a java Map<String, Exception> in which the keys are the
name of a batch step in which the current record has failed and the
value is the exception itself. If the record hasn’t failed in any step,
this Map will be empty but will never be null. Also, the Map contains no
entries for steps in which the record hasn’t failed. - #[getFirstException()]:
Returns the Exception for the very first step in which the current
record has failed. If the record hasn’t failed in any step, then it
returns null. - #[getLastException()]:
Returns the Exception for the last step in which the current record has
failed. If the record hasn’t failed in any step, then it returns null.
As you see, a pretty standard Batch Job. Because this is an error
handling post, I won’t go into detail of the whole app and I’ll just
focus on the last step. Let’s take a closer look:
Since the goal of this step is to gather failures, it makes sense to
skip the ones who haven’t got any, thus we set the accept policy to
failures only.
The first thing we do is to use the set-payload processor and the
getStepExceptions() MEL function together so that the message payload is
set to the exceptions map.
Remember how we said before that a given record could have many
exceptions? That’s the nice thing about this map, it gives you access to
the exceptions found in all steps. Since our goal is to send the
exceptions through JMS and we don’t care about the steps, we can simply
use a foreach to iterate over the map’s values (the exceptions) and send
them through a JMS outbound endpoint:
And we’re done! We have happily aggregated exceptions!
Summary
The above features were introduced in the GA release of Mule 3.5.0.We tried our best to help you find equilibrium between useful and
meaningful messages without generating undesired verbosity. Have better
ideas or suggestions? Please reach out!
Hope you liked this!