
How to retry failed Kinesis events using AWS Lambda dead letter queues for SNS

AWS announced DLQ support for Lambda — but only for async event sources such as SNS and not poll-based Kinesis streams. Learn More!

Jun 08, 2023 • 7 Minute Read

In a previous post, I discussed the challenges with processing Kinesis events with Lambda. The blog shared some of my lessons learned and tips for partial failures, using dead letter queues, and avoiding hot streams.

With regard to failed records, I discussed that while Lambda support for Dead Letter Queues extends to asynchronous invocations such as SNS and S3, it does not cover poll-based invocations such as Kinesis and DynamoDB streams. This means developers still need to implement a retry strategy for Kinesis functions, potentially by taking advantage of the retry flow with SNS.

Since SNS already comes with DLQ support, you can simplify your setup by sending the failed events to an SNS topic instead. Lambda would then process each failed event a further 3 times before passing it off to the designated DLQ.

In this post, I'll provide some concrete examples to show how easily this can be accomplished. If you want to try it for yourself, the code for the examples in this blog is available on GitHub.

The Kinesis function

In our Kinesis function, we’ll need to keep track of events that we fail to process. At the end of the function, we can then publish them to SNS for further retries, and take advantage of the built-in “retry then Dead Letter Queue” flow SNS has with Lambda.
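To make that concrete, here is a minimal sketch of what the Kinesis handler could look like in TypeScript. The processRecord helper and the RETRY_TOPIC_ARN environment variable are illustrative assumptions; the full working code is in the GitHub repo.

```typescript
import { KinesisStreamEvent } from 'aws-lambda';
import { SNS } from 'aws-sdk';

const sns = new SNS();

// Stand-in for the real business logic; in the demo project a record created
// with ?fail=true is assumed to carry a flag that makes processing throw.
async function processRecord(event: { fail?: boolean }): Promise<void> {
  if (event.fail) {
    throw new Error('intentional failure for demo purposes');
  }
  // ... real processing goes here
}

export const handler = async (event: KinesisStreamEvent): Promise<void> => {
  const failed: object[] = [];

  for (const record of event.Records) {
    // Kinesis record data arrives base64-encoded
    const payload = JSON.parse(
      Buffer.from(record.kinesis.data, 'base64').toString('utf8'));

    try {
      await processRecord(payload);
    } catch (err) {
      console.error('failed to process record', err);
      failed.push(payload); // track the failure, but keep the batch moving
    }
  }

  // Hand the failures over to SNS so they get Lambda's built-in
  // "retry then DLQ" flow, out of band of the stream.
  for (const payload of failed) {
    await sns.publish({
      TopicArn: process.env.RETRY_TOPIC_ARN!, // assumed environment variable
      Message: JSON.stringify(payload),
    }).promise();
  }
};
```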

As mentioned in the previous post, it’s often more important to keep the system processing events in real-time than to ensure every event is processed successfully, but you do still want to give a failed event a few retries before throwing in the towel.

As part of the demo project, I also included an API endpoint for you to trigger events into Kinesis. Once deployed, if you hit the endpoint with the query string parameter ?fail=true, you can observe the Kinesis function failing to process the record due to an error.
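The endpoint itself could be as simple as the following sketch. The STREAM_NAME environment variable and the payload shape are assumptions made up for illustration; the ?fail=true flag is carried on the record so the Kinesis function knows to throw.

```typescript
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';
import { Kinesis } from 'aws-sdk';

const kinesis = new Kinesis();

export const handler = async (
  event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  // Carry the ?fail=true flag on the record so the Kinesis function
  // can simulate a processing error for it.
  const payload = {
    id: Date.now().toString(),
    fail: event.queryStringParameters?.fail === 'true',
  };

  await kinesis.putRecord({
    StreamName: process.env.STREAM_NAME!, // assumed environment variable
    PartitionKey: payload.id,
    Data: JSON.stringify(payload),
  }).promise();

  return { statusCode: 202, body: JSON.stringify(payload) };
};
```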

The event would then be published to SNS.

SNS publish

In the SNS function, we’ll extract the original event object that was published to Kinesis, and attempt to process it again using the same business logic as our Kinesis function.
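A sketch of that SNS handler, assuming the same hypothetical processRecord module is shared with the Kinesis function:

```typescript
import { SNSEvent } from 'aws-lambda';

// The same (hypothetical) business logic module used by the Kinesis function.
import { processRecord } from './lib/process-record';

export const handler = async (event: SNSEvent): Promise<void> => {
  for (const record of event.Records) {
    // The Kinesis function published the original event as the SNS message body.
    const payload = JSON.parse(record.Sns.Message);

    // Let any error propagate: a failed async invocation is what triggers
    // Lambda's retries and, eventually, the dead letter queue.
    await processRecord(payload);
  }
};
```

Note that the handler deliberately doesn't catch errors; a failed asynchronous invocation is what tells Lambda to retry and, once the retries are exhausted, to send the event to the DLQ.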

Out of the box, we get 3 attempts at processing the failed Kinesis event again with the SNS event source.

After all 3 attempts have been exhausted, there’s one more chance to save the data with the Dead Letter Queue (DLQ) support — which is only available for async event sources such as SNS.

As of version 1.22.0, the Serverless framework doesn't support adding an SQS queue as the Dead Letter Queue resource (via the onError attribute).

There is also the serverless-plugin-lambda-dead-letter plugin, but I never managed to get it to work. So, in the meantime, you might have to make do with setting the DLQ resource manually if you want to use SQS.
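If you go the manual route, one option is a small one-off script (or the console) that points the SNS-triggered function at an existing SQS queue via its DeadLetterConfig. The function name and queue ARN below are placeholders:

```typescript
import { Lambda } from 'aws-sdk';

const lambda = new Lambda();

// One-off script: point the SNS-triggered function's dead letter queue at an
// existing SQS queue. Both the function name and the queue ARN are placeholders.
async function setDlq(): Promise<void> {
  await lambda.updateFunctionConfiguration({
    FunctionName: 'my-service-dev-retry-sns',
    DeadLetterConfig: {
      TargetArn: 'arn:aws:sqs:us-east-1:123456789012:retry-dlq',
    },
  }).promise();
}

setDlq().catch(console.error);
```

The function's execution role also needs sqs:SendMessage permission on the queue, otherwise Lambda won't be able to deliver the failed events to it.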

After our Kinesis event has been retried 3 more times via SNS, we can see that the SNS message has been pushed to the designated SQS queue.

The SNS message also has some additional message attributes, although they’re probably not useful to you, especially with a 200 ErrorCode. I know, I’m confused by it too!

So there you go, a simple way to piggyback off the SNS error handling flow for Lambda to retry failed Kinesis events a few more times, whilst doing it out-of-band to allow the stream to go ahead and keep the system real-time.

Trading consistency for realtime-ness

One of the key benefits with using Kinesis is that the order of events for a given partition key is preserved. For example, for a given user ID, it means we can process events related to this user in the same order as they’re received by Kinesis. We can even enforce the ordering of events by setting the SequenceNumberForOrdering attribute in the PutRecord request.
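For example, here is a small sketch (the stream name and payload are made up) that chains two PutRecord calls for the same user, so the second record is guaranteed to be sequenced after the first:

```typescript
import { Kinesis } from 'aws-sdk';

const kinesis = new Kinesis();

// Write two events for the same user, forcing the second to be sequenced
// after the first by passing its sequence number as SequenceNumberForOrdering.
async function putOrdered(streamName: string, userId: string): Promise<void> {
  const first = await kinesis.putRecord({
    StreamName: streamName,
    PartitionKey: userId,
    Data: JSON.stringify({ userId, action: 'signed-up' }),
  }).promise();

  await kinesis.putRecord({
    StreamName: streamName,
    PartitionKey: userId,
    Data: JSON.stringify({ userId, action: 'placed-order' }),
    SequenceNumberForOrdering: first.SequenceNumber,
  }).promise();
}
```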

The approach I have outlined here sacrifices this consistency because it defers the retries to a background process; retried events can therefore be processed out of order with respect to other events for the same partition key.

This is a conscious tradeoff to allow the Kinesis processing to continue in the face of persistent failures to process some events, and preserve the overall realtime-ness of the system, as I stated in my previous post:

We found it was more important to keep the system flowing than to halt processing for any failed events, even if for a minute.

Another thing to keep in mind is that, unless you can design away poison messages (and of course, it’s totally doable), eventually the decision to ditch persistently failing events will be made for you when they are expired from the stream.

Consider the impact of SNS retries on downstream systems

As I discussed in the post about pub-sub and push-pull messaging patterns, another benefit of using Kinesis with Lambda is that it allows you to control the amount of load you pass onto your downstream systems.

In the event there is a sudden influx of failed events, the retries via SNS can overwhelm your downstream systems and in turn cause even more failed events and escalate a bad situation further.

In conclusion — keep it real

Because of the way Kinesis works, where concurrency and retries operate at the per-shard level, we are forced to choose between:

  1. keeping the overall system real time, and discarding persistently failing events
  2. ensuring all events are processed successfully

In a world where the retry-until-success behaviour can be narrowed down to a per-partition-key level, maybe we won't have to make this tradeoff.

As things stand, I think all real time event processing systems should choose option 1 — otherwise, you’re not really building a “real time” system.

If that is your decision, then the approach outlined here makes it easier for you to implement a retry-thrice-then-DLQ flow as it piggybacks Lambda’s built-in retry behaviour for SNS.

If you choose option 2 to ensure all events are processed successfully, then this approach is definitely not for you! Instead, you should focus on monitoring and alerting so that you discover these poison messages quickly and design your system to be able to handle them better.

If, as it turns out, your only defense against poison messages is to “discard them” — then you could have chosen option 1 in the first place.

