How to Handle Job Failures
There’s a discussion on the beanstalkd mailing list right now about queue introspection and handling failures. My response got a little long, and it could be interesting to users of other queueing systems as well, so here’s a blog post instead.
When we first started using beanstalkd at Causes, some things in our worker development and deployment process took a while to iron out, but our strategy for handling job failures worked quite well right from the start. In hindsight, I’m happy about it. This is what we did.
The Basic Rule
Never clean up jobs by hand. If a failure happens once, it can happen again. Always write code to handle newly-discovered failure types automatically, then run the new code to do the cleanup.
Before you begin, note that your workers will be numerous, possibly even more so than your web front ends. I assume you have good logging infrastructure and analysis tools for your web front ends. Use the same infrastructure for the workers, too. Seeing all failures and performance data in one place will make your life easier.
Start by having your workers bury any failed jobs.
See what sorts of failures happen in production (by using the high-quality logging that you have to do anyway).
You will see some failures where the job can simply be deleted, others where it’s better to retry the job, and possibly some rare cases where you want to save the job to be inspected by a human (though this sort of hand-holding does not scale and should be avoided). It might also make sense to retry some jobs only a limited number of times before deleting them.
Add unit tests and update the code to deal with these known failure types appropriately (i.e. delete or retry the job), but continue to bury unanticipated failures. For retries, don’t bother with changing the priority, but do add a time delay with exponential backoff. Of course, you must also fix the business logic to recover from these failures or avoid them entirely whenever possible.
Redeploy your application.
When the new code is in production, kick all buried jobs. They will be handled correctly, and you won’t lose any jobs.
Now look at your worker logs again. This process will have removed a lot of noise from your production logs, and new failure types will float to the surface (though the total volume will of course be much smaller). So repeat.
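As a rough illustration of the failure handling described in the steps above, here is a minimal sketch in Python. It is not the code we ran at Causes. The Job interface, the two exception classes, and the retry limits are all hypothetical names and values chosen for the example; the delete, release, and bury methods simply mirror the beanstalkd commands of the same names, and the releases attribute stands in for the per-job release count that beanstalkd reports in its job statistics.

```python
import logging
from typing import Callable, Protocol

log = logging.getLogger("worker")


class Job(Protocol):
    """Hypothetical job handle; methods mirror beanstalkd's commands."""

    releases: int  # how many times this job has already been released

    def delete(self) -> None: ...
    def release(self, delay: int) -> None: ...
    def bury(self) -> None: ...


class DiscardableError(Exception):
    """Known failure type where retrying is pointless (hypothetical)."""


class RetryableError(Exception):
    """Known failure type that may succeed later (hypothetical)."""


# Illustrative values only; tune them for your own workload.
MAX_RETRIES = 5
BASE_DELAY = 2      # seconds
MAX_DELAY = 3600    # cap the backoff at one hour


def backoff_delay(releases: int) -> int:
    """Exponential backoff: 2, 4, 8, ... seconds, capped at MAX_DELAY."""
    return min(BASE_DELAY * 2 ** releases, MAX_DELAY)


def run_job(job: Job, handler: Callable[[Job], None]) -> str:
    """Run one job and dispose of it according to how it failed."""
    try:
        handler(job)
    except DiscardableError:
        job.delete()                    # known failure: safe to drop
        return "deleted"
    except RetryableError:
        if job.releases >= MAX_RETRIES:
            job.delete()                # retried enough; give up
            return "gave up"
        job.release(delay=backoff_delay(job.releases))
        return "retried"
    except Exception:
        # Unanticipated failure: log it and bury the job so that it can
        # be kicked once the code knows how to handle this failure type.
        log.exception("unanticipated failure, burying job")
        job.bury()
        return "buried"
    job.delete()                        # success
    return "done"
```

Each time a new failure type shows up in the logs, it moves out of the catch-all branch and into one of the known branches (or the business logic is fixed so it no longer occurs), which is what shrinks the set of buried jobs from one iteration to the next.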
After a couple of iterations, true failures will be very rare indeed. Your system will be running smoothly and it won’t need much attention.
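For the step where all buried jobs are kicked after a redeploy, a small script is usually enough. The sketch below assumes a hypothetical client object whose use, stats_tube, and kick methods correspond to the beanstalkd commands use, stats-tube, and kick; your client library may name them differently. It reads the buried count once and kicks only that many jobs, because beanstalkd's kick command falls back to kicking delayed jobs when no buried jobs remain, and that would cut short the backoff delays on retried jobs.

```python
from typing import Mapping, Protocol


class QueueClient(Protocol):
    """Hypothetical client; methods mirror beanstalkd's commands."""

    def use(self, tube: str) -> None: ...
    def stats_tube(self, tube: str) -> Mapping[str, int]: ...
    def kick(self, bound: int) -> int: ...


def kick_buried(client: QueueClient, tube: str, batch: int = 1000) -> int:
    """Kick the jobs currently buried in `tube`; return how many moved."""
    client.use(tube)  # kick applies to the currently used tube
    remaining = client.stats_tube(tube)["current-jobs-buried"]
    total = 0
    while remaining > 0:
        kicked = client.kick(min(remaining, batch))
        if kicked == 0:
            break  # someone else kicked them first; nothing left to do
        total += kicked
        remaining -= kicked
    return total
```

Taking the count once also means the loop ends even if some kicked jobs fail again and are reburied while the script is running. Those jobs will simply show up in the logs for the next iteration.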