Yeah, this comes up very frequently in event-driven architectures. I think it can be a shared responsibility: some of the technology we use can offer some of these features. But maybe we should define the delivery semantics first so people understand. There are really three ways these are usually described. There's at-least-once delivery, meaning the message will be delivered at least once, but it could be delivered more than once, so you can see duplicates. There's exactly-once, meaning there's some control over ensuring that something happens exactly once. And then there's at-most-once, which is best effort: a message will be delivered no more than once, but it's not guaranteed to be delivered at all.
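To make the at-least-once case concrete, here's a toy sketch (not any particular service's API) of why it produces duplicates: if the producer's acknowledgment is lost, it has no safe option but to retry, even though the broker may already have stored the message.

```python
import random

class FlakyChannel:
    """Toy broker: always stores the message, but sometimes 'loses' the ack."""
    def __init__(self, ack_loss_rate=0.7, seed=42):
        self.messages = []
        self.rng = random.Random(seed)
        self.ack_loss_rate = ack_loss_rate

    def send(self, msg):
        self.messages.append(msg)                       # delivery succeeded...
        return self.rng.random() > self.ack_loss_rate   # ...but the ack may be lost

def send_at_least_once(channel, msg, max_retries=10):
    """Retry until an ack arrives -- the only safe policy without dedup state."""
    for _ in range(max_retries):
        if channel.send(msg):
            return
    raise RuntimeError("no ack received")

channel = FlakyChannel()
send_at_least_once(channel, "order-123")
print(len(channel.messages))  # more than one copy: duplicates from retried sends
```

The producer can't tell "message lost" apart from "ack lost", so retrying trades duplicates for availability.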
What you'll find most commonly inside of AWS, if you start looking at the services, and really with any technology, is something you should ask about or dig for: what delivery semantics are available? Typically, you'll find that it's at-least-once delivery. And the reason is that when we're building distributed systems, there's something called the CAP theorem: the idea that you have a choice between consistency, availability, and partition tolerance. There's actually a great blog post by Marc Brooker about how, in the classic framing, you could pick any two of the three, as if partition tolerance were optional. But in reality, partitions are a fact of life now, right? We don't work in isolated data centers or systems anymore; we're building distributed systems.
So, given that you know partitions will happen, what is your trade-off between consistency and availability? And there are lots of nuanced ways you can think about the choices here. But ultimately, when we're dealing with large distributed systems like SQS, for example, which can scale to 25 billion messages, we want to ensure we can tolerate partitions, in case a particular availability zone or some part of the distributed system fails, and we want to ensure that the message is available when you ask for it. Availability is the most important thing. Well, in that case, we have to trade off on consistency, and we say things are eventually consistent: in the fullness of time, the system will converge and you'll have consistency.
And because availability is the most important thing in the SQS standard queue, the trade-off is that duplicate message delivery is possible. So, with AWS services, you will mostly see things default to at-least-once delivery, because that's what we hear our customers want from us the most. Now, something like SQS offers a FIFO option, first in, first out, which also offers the ability to specify an idempotency key, called a message deduplication ID in the case of SQS, and it allows the SQS service to actually do that deduplication for you. So, that's beneficial to you.
It's always important with SQS to read the fine print. In the case of SQS FIFO, the deduplication actually only applies within a five-minute window, right? So you might assume that if message processing fails and you send the message again with the same message deduplication ID an hour later, SQS FIFO will still catch it as a duplicate. Well, actually, it won't: once the five-minute window has passed, the service treats it as a brand-new message. And it works that way for very good reasons, because the deduplication state it would have to keep beyond five minutes, at that huge scale, is prohibitive, right? So, again, back to the trade-off. So, there are features of different channels that offer these capabilities, but they usually come with their own caveats and things to understand. Another popular choice with customers is something like Kafka, for example.
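Here's a toy model of that five-minute window, just to make the behavior concrete. This is a simulation, not the real service; with the real thing you'd pass `MessageDeduplicationId` (and `MessageGroupId`) to the SQS `SendMessage` API, but the window behavior is the same.

```python
import time

DEDUP_WINDOW_SECONDS = 5 * 60  # SQS FIFO keeps dedup state for five minutes

class FifoQueueModel:
    """Toy model of SQS FIFO deduplication -- not the real service."""
    def __init__(self):
        self.delivered = []
        self._seen = {}  # dedup_id -> time the ID was last accepted

    def send(self, body, dedup_id, now=None):
        now = time.time() if now is None else now
        first_seen = self._seen.get(dedup_id)
        if first_seen is not None and now - first_seen < DEDUP_WINDOW_SECONDS:
            return False  # duplicate within the window: silently dropped
        self._seen[dedup_id] = now
        self.delivered.append(body)
        return True

q = FifoQueueModel()
q.send("order-123", dedup_id="abc", now=0)
q.send("order-123", dedup_id="abc", now=60)    # within 5 minutes: deduplicated
q.send("order-123", dedup_id="abc", now=3600)  # an hour later: delivered again!
print(q.delivered)  # ['order-123', 'order-123']
```

The retry an hour later sails straight through, which is exactly the fine print to watch for.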
And Kafka, under the covers, uses some really clever techniques to track things across its distributed system. Its idempotent producer, as I understand it, assigns each producer a producer ID and attaches per-partition sequence numbers to records, which lets the brokers detect and discard duplicates in a distributed system. So, there are channels that provide capabilities you can lean on. However, it's my opinion that when you're looking at larger distributed systems, unless all the components of the architecture are based on the same channel where you can depend on those types of features, ultimately, your message might flow through intermediaries in the overall architecture that may not support those features, or may support them with different behaviors.
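A rough sketch of the mechanism behind an idempotent producer: the broker remembers the highest sequence number it has appended per producer, and drops anything at or below it. The names here are illustrative, not Kafka's actual internals.

```python
class PartitionLog:
    """Toy broker partition that deduplicates by (producer_id, sequence)."""
    def __init__(self):
        self.records = []
        self._last_seq = {}  # producer_id -> highest sequence appended so far

    def append(self, producer_id, seq, value):
        if seq <= self._last_seq.get(producer_id, -1):
            return False  # a retry of an already-appended record: drop it
        self._last_seq[producer_id] = seq
        self.records.append(value)
        return True

log = PartitionLog()
log.append("producer-1", seq=0, value="a")
log.append("producer-1", seq=1, value="b")
log.append("producer-1", seq=1, value="b")  # duplicate retry, dropped
log.append("producer-2", seq=0, value="c")  # different producer, accepted
print(log.records)  # ['a', 'b', 'c']
```

Notice this only works while the message stays on that one channel; cross an intermediary that doesn't carry the producer ID and sequence, and the guarantee is gone.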
So, ultimately, I believe the best way to address this is actually to use more of the classic integration patterns, the ones covered in Gregor Hohpe's book, "Enterprise Integration Patterns," for example. When we take that approach, the advantage is that the solutions are channel-agnostic: we can put the information in the event or the message itself, right? It's important that the data is there, things like an idempotency key or a sequence ID that rides along with the message payload you want to process. The advantage is that, whatever channels or intermediaries the message flows across, it retains that data and its semantic meaning, which allows consumers to determine what is a duplicate and how to re-sequence out-of-order events.
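One way to carry that information channel-agnostically, sketched here with hypothetical field names, is an envelope holding an idempotency key and a sequence number alongside the payload; the subscriber then deduplicates and re-orders on its own.

```python
import json

def make_envelope(idempotency_key, sequence, payload):
    """Metadata rides along with the payload, so any channel preserves it."""
    return json.dumps({"idempotencyKey": idempotency_key,
                       "sequence": sequence,
                       "payload": payload})

class Subscriber:
    def __init__(self):
        self._seen = set()   # idempotency keys already handled
        self._buffer = {}    # out-of-order messages awaiting their turn
        self._next_seq = 0
        self.processed = []

    def receive(self, raw):
        msg = json.loads(raw)
        if msg["idempotencyKey"] in self._seen:
            return  # duplicate: already handled, skip it
        self._seen.add(msg["idempotencyKey"])
        self._buffer[msg["sequence"]] = msg["payload"]
        # release any contiguous run of in-order messages
        while self._next_seq in self._buffer:
            self.processed.append(self._buffer.pop(self._next_seq))
            self._next_seq += 1

sub = Subscriber()
sub.receive(make_envelope("k2", 1, "second"))  # out of order: buffered
sub.receive(make_envelope("k1", 0, "first"))   # releases both, in order
sub.receive(make_envelope("k1", 0, "first"))   # duplicate: ignored
print(sub.processed)  # ['first', 'second']
```

Because the keys live in the message body, any intermediary that forwards the payload intact preserves the guarantee.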
The trade-off is that you're not really getting any help from the channels to do it for you. So you need to build approaches, reusable approaches, hopefully, in your subscribers to reason about this data and do the deduplication. Now, fortunately, we've tried to help, at least in serverless technologies, with the development of something called Lambda Powertools. These are libraries you can include in your Lambda functions, available in TypeScript, Java, and Python for now, and they include implementations that solve things like idempotency for you, so you can plug them in and they'll handle it.
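The idea behind such an idempotency utility can be sketched in a few lines. To be clear, this is a simplified illustration of the pattern, not the actual Powertools API, which persists its state in a durable store such as DynamoDB rather than in process memory.

```python
import functools
import hashlib
import json

_results = {}  # in-memory stand-in for a durable store like DynamoDB

def idempotent(func):
    """Return the cached result when the same event is seen again."""
    @functools.wraps(func)
    def wrapper(event):
        key = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()
        if key in _results:
            return _results[key]  # duplicate invocation: side effects skipped
        result = func(event)
        _results[key] = result
        return result
    return wrapper

calls = []

@idempotent
def handler(event):
    calls.append(event["orderId"])  # imagine a payment charge here
    return {"status": "charged", "orderId": event["orderId"]}

handler({"orderId": "123"})
handler({"orderId": "123"})  # replayed delivery: no second charge
print(len(calls))  # 1
```

Hashing the whole event gives a deterministic key; a real utility would also handle in-flight invocations and expiry, which is exactly the bookkeeping the library takes off your hands.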
So, anyway, Rahul, it’s a long circuitous way of saying, like, there are features available but if you become overly reliant on those features of certain channels, you become coupled to that technology. And in the larger system or complex systems, you don’t always control all of the channels and proxies that messages flow over, so the safest bet is to include the information that allows subscribers to react to those events and reason about that information based on their business context and their requirements.