Yeah, this comes up very frequently in event-driven architectures. I think it can be a shared responsibility: some of the technology we use can offer some of these features. But maybe we should define the delivery semantics first so people understand. There are really three ways these are usually described. There's at-least-once delivery, meaning the message will be delivered at least once, but it could be delivered more than once, so you can see duplicates. There's exactly-once, meaning there's some control over ensuring that something happens exactly once. And then there's at-most-once, which is best effort: a message will be delivered no more than once, but it's not guaranteed to be delivered at all.
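To make the at-least-once case concrete, here's a toy sketch (not any particular service's API) of why it produces duplicates: if the producer's acknowledgment is lost, it has no safe option but to retry, even though the broker may already have stored the message.

```python
import random

class FlakyChannel:
    """Toy broker: always stores the message, but sometimes 'loses' the ack."""
    def __init__(self, ack_loss_rate=0.7, seed=42):
        self.messages = []
        self.rng = random.Random(seed)
        self.ack_loss_rate = ack_loss_rate

    def send(self, msg):
        self.messages.append(msg)                       # delivery succeeded...
        return self.rng.random() > self.ack_loss_rate   # ...but the ack may be lost

def send_at_least_once(channel, msg, max_retries=10):
    """Retry until an ack arrives -- the only safe policy without dedup state."""
    for _ in range(max_retries):
        if channel.send(msg):
            return
    raise RuntimeError("no ack received")

channel = FlakyChannel()
send_at_least_once(channel, "order-123")
print(len(channel.messages))  # more than one copy: duplicates from retried sends
```

The producer can't tell "message lost" apart from "ack lost", so retrying trades duplicates for availability.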
What you'll find most commonly inside of AWS, if you start looking at the services, and really with any technology, is something you should ask about or dig for: what delivery semantics are available? Typically, you'll find that it's at-least-once delivery. And the reason is that when we're building distributed systems, there's something called the CAP theorem: the idea that you have a choice between consistency, availability, and partition tolerance. There's actually a great blog post by Marc Brooker about how, in the classic framing, you could pick any two of the three, as if partition tolerance were optional. But in reality, partitions are a fact of life now, right? We don't work in isolated data centers or systems anymore; we're building distributed systems.
So, given that you know partitions will happen, what is your trade-off between consistency and availability? And there are lots of nuanced ways you can think about the choices here. But ultimately, when we're dealing with large distributed systems like SQS, for example, which can scale to 25 billion messages, we want to ensure we can tolerate partitions, in case a particular availability zone or some part of the distributed system fails, and we want to ensure that the message is available when you ask for it. Availability is the most important thing. Well, in that case, we have to trade off on consistency, and we say things are eventually consistent: in the fullness of time, the system will converge and you'll have consistency.
And because availability is the most important thing in the SQS standard queue, the trade-off is that duplicate message delivery is possible. So, with AWS services, you will mostly see things default to at-least-once delivery, because that's what we hear our customers want from us the most. Now, something like SQS offers a FIFO option, first in, first out, which also offers the ability to specify an idempotency key, called a message deduplication ID in the case of SQS, and it allows the SQS service to actually do that deduplication for you. So, that's beneficial to you.
It's always important with SQS to read the fine print. In the case of SQS FIFO, the deduplication actually only applies within a five-minute window, right? So you might assume that if message processing fails and you send the message again with the same message deduplication ID an hour later, SQS FIFO will still catch it as a duplicate. Well, actually, it won't: once the five-minute window has passed, the service treats it as a brand-new message. And it works that way for very good reasons, because the deduplication state it would have to keep beyond five minutes, at that huge scale, is prohibitive, right? So, again, back to the trade-off. So, there are features of different channels that offer these capabilities, but they usually come with their own caveats and things to understand. Another popular choice with customers is something like Kafka, for example.
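Here's a toy model of that five-minute window, just to make the behavior concrete. This is a simulation, not the real service; with the real thing you'd pass `MessageDeduplicationId` (and `MessageGroupId`) to the SQS `SendMessage` API, but the window behavior is the same.

```python
import time

DEDUP_WINDOW_SECONDS = 5 * 60  # SQS FIFO keeps dedup state for five minutes

class FifoQueueModel:
    """Toy model of SQS FIFO deduplication -- not the real service."""
    def __init__(self):
        self.delivered = []
        self._seen = {}  # dedup_id -> time the ID was last accepted

    def send(self, body, dedup_id, now=None):
        now = time.time() if now is None else now
        first_seen = self._seen.get(dedup_id)
        if first_seen is not None and now - first_seen < DEDUP_WINDOW_SECONDS:
            return False  # duplicate within the window: silently dropped
        self._seen[dedup_id] = now
        self.delivered.append(body)
        return True

q = FifoQueueModel()
q.send("order-123", dedup_id="abc", now=0)
q.send("order-123", dedup_id="abc", now=60)    # within 5 minutes: deduplicated
q.send("order-123", dedup_id="abc", now=3600)  # an hour later: delivered again!
print(q.delivered)  # ['order-123', 'order-123']
```

The retry an hour later sails straight through, which is exactly the fine print to watch for.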
And Kafka, under the covers, uses some really clever techniques to track things across its distributed system. Its idempotent producer, as I understand it, assigns each producer a producer ID and attaches per-partition sequence numbers to records, which lets the brokers detect and discard duplicates in a distributed system. So, there are channels that provide capabilities you can lean on. However, it's my opinion that when you're looking at larger distributed systems, unless all the components of the architecture are based on the same channel where you can depend on those types of features, ultimately, your message might flow through intermediaries in the overall architecture that may not support those features, or may support them with different behaviors.
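A rough sketch of the mechanism behind an idempotent producer: the broker remembers the highest sequence number it has appended per producer, and drops anything at or below it. The names here are illustrative, not Kafka's actual internals.

```python
class PartitionLog:
    """Toy broker partition that deduplicates by (producer_id, sequence)."""
    def __init__(self):
        self.records = []
        self._last_seq = {}  # producer_id -> highest sequence appended so far

    def append(self, producer_id, seq, value):
        if seq <= self._last_seq.get(producer_id, -1):
            return False  # a retry of an already-appended record: drop it
        self._last_seq[producer_id] = seq
        self.records.append(value)
        return True

log = PartitionLog()
log.append("producer-1", seq=0, value="a")
log.append("producer-1", seq=1, value="b")
log.append("producer-1", seq=1, value="b")  # duplicate retry, dropped
log.append("producer-2", seq=0, value="c")  # different producer, accepted
print(log.records)  # ['a', 'b', 'c']
```

Notice this only works while the message stays on that one channel; cross an intermediary that doesn't carry the producer ID and sequence, and the guarantee is gone.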
So, ultimately, I believe the best way to address this is actually to use more of the classic integration patterns, the ones covered in Gregor Hohpe's book, "Enterprise Integration Patterns," for example. When we take that approach, the advantage is that the solutions are channel-agnostic: we can put the information in the event or the message itself, right? It's important that the data is there, things like an idempotency key or a sequence ID that rides along with the message payload you want to process. The advantage is that, whatever channels or intermediaries the message flows across, it retains that data and its semantic meaning, which allows consumers to determine what is a duplicate and how to re-sequence out-of-order events.
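One way to carry that information channel-agnostically, sketched here with hypothetical field names, is an envelope holding an idempotency key and a sequence number alongside the payload; the subscriber then deduplicates and re-orders on its own.

```python
import json

def make_envelope(idempotency_key, sequence, payload):
    """Metadata rides along with the payload, so any channel preserves it."""
    return json.dumps({"idempotencyKey": idempotency_key,
                       "sequence": sequence,
                       "payload": payload})

class Subscriber:
    def __init__(self):
        self._seen = set()   # idempotency keys already handled
        self._buffer = {}    # out-of-order messages awaiting their turn
        self._next_seq = 0
        self.processed = []

    def receive(self, raw):
        msg = json.loads(raw)
        if msg["idempotencyKey"] in self._seen:
            return  # duplicate: already handled, skip it
        self._seen.add(msg["idempotencyKey"])
        self._buffer[msg["sequence"]] = msg["payload"]
        # release any contiguous run of in-order messages
        while self._next_seq in self._buffer:
            self.processed.append(self._buffer.pop(self._next_seq))
            self._next_seq += 1

sub = Subscriber()
sub.receive(make_envelope("k2", 1, "second"))  # out of order: buffered
sub.receive(make_envelope("k1", 0, "first"))   # releases both, in order
sub.receive(make_envelope("k1", 0, "first"))   # duplicate: ignored
print(sub.processed)  # ['first', 'second']
```

Because the keys live in the message body, any intermediary that forwards the payload intact preserves the guarantee.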
The trade-off is that you're not really getting any help from the channels to do it for you. So you need to build approaches, reusable approaches, hopefully, in your subscribers to reason about this data and do the deduplication. Now, fortunately, we've tried to help, at least in serverless technologies, with the development of something called Lambda Powertools. These are libraries you can include in your Lambda functions, available in TypeScript, Java, and Python for now, and they include implementations that solve things like idempotency for you, so you can plug them in and they'll handle it.
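The idea behind such an idempotency utility can be sketched in a few lines. To be clear, this is a simplified illustration of the pattern, not the actual Powertools API, which persists its state in a durable store such as DynamoDB rather than in process memory.

```python
import functools
import hashlib
import json

_results = {}  # in-memory stand-in for a durable store like DynamoDB

def idempotent(func):
    """Return the cached result when the same event is seen again."""
    @functools.wraps(func)
    def wrapper(event):
        key = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()
        if key in _results:
            return _results[key]  # duplicate invocation: side effects skipped
        result = func(event)
        _results[key] = result
        return result
    return wrapper

calls = []

@idempotent
def handler(event):
    calls.append(event["orderId"])  # imagine a payment charge here
    return {"status": "charged", "orderId": event["orderId"]}

handler({"orderId": "123"})
handler({"orderId": "123"})  # replayed delivery: no second charge
print(len(calls))  # 1
```

Hashing the whole event gives a deterministic key; a real utility would also handle in-flight invocations and expiry, which is exactly the bookkeeping the library takes off your hands.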
So, anyway, Rahul, it’s a long circuitous way of saying, like, there are features available but if you become overly reliant on those features of certain channels, you become coupled to that technology. And in the larger system or complex systems, you don’t always control all of the channels and proxies that messages flow over, so the safest bet is to include the information that allows subscribers to react to those events and reason about that information based on their business context and their requirements.