AWS Made Easy

Ask Us Anything: Episode 13

August 2, 2022
1 h 01 min

We begin by welcoming Jeff Barr, AWS Chief Evangelist and VP, to the show. We discuss his history with AWS and the interests that Rahul, Stephen, and Jeff share, including tech, Lego, space, and 3D printing.

Summary

AWS Snowcone in Space

In the first segment, we discuss Jeff’s recent blog post about how the AWS Snowcone team coordinated with the Axiom Space Ax-1 mission to put an AWS Snowcone into space.

Jeff’s blog post: https://aws.amazon.com/blogs/aws/how-we-sent-an-aws-snowcone-into-orbit/

The Snow devices are edge computing devices that can process and store data in places where reaching the cloud may not be possible, or where the volume of data makes physically shipping it more practical than sending it over a network. The Snow Family includes:

Snowmobile – 100 petabytes in a shipping container. This was the famous semi-truck on stage at re:Invent 2016.
Snowball – A 50 lb device designed for storing and moving large amounts of data
Snowcone – 4.5 lb
- Fun video of Jeff and Bill Vass doing the Snowcone launch

AWS has an aerospace site. See https://aws.amazon.com/government-education/aerospace-and-satellite/

AWS Prime Day

Jeff’s blog post: https://aws.amazon.com/blogs/aws/amazon-prime-day-2022-aws-for-the-win/

In this segment, we talk about Amazon’s annual “Prime Day” and how AWS prepares to handle the barrage of traffic. We discuss how the different teams solve problems quickly and make sure that their individual components are built to handle the load. Jeff mentions that AWS has an “Infrastructure Event Management” (IEM) program for customers who have large, discrete events on AWS. See: https://aws.amazon.com/premiumsupport/programs/iem/

Graviton

In this segment, we discuss the AWS Graviton processor and other custom AWS chips such as Nitro, Trainium, and Inferentia. See https://go.aws/3zQDeTg for a list of current Graviton announcements. We also discuss AWS Silicon Innovation Day, which took place the day after this episode.

See: https://pages.awscloud.com/GLBL-Silicon-Innovation-Day-2022-reg-event.html

Transcript

Stephen

Hello, and welcome to “AWS Made Easy” episode number 13. I’m your host, Stephen Barr, here with our co-host, Rahul Subramaniam.
Rahul

Hi, everyone.
Stephen

How you doing today, Rahul?
Rahul

Great, I can’t believe we are at 13 episodes now. Every single time we do another episode, I feel grateful that we’re able to do this every week. So great going. I can’t wait to get to episode number 20.
Stephen

Absolutely.
Rahul

How was your weekend?
Stephen

[crosstalk 00:00:42] another digit at 100. Oh, weekend was really, really good. Pretty relaxing. It’s been really hot this week. So, lots of swimming and that sort of thing. Not hot by your standards but hot by ours.
Rahul

Yeah, it’s all relative. Over here, I think given the way the pandemic has been, we suddenly find ourselves, as schools have opened up, managing our kids’ hectic social lives. This weekend, we had three different birthday parties to go to, and chaperoning the kids from one part of the city to another was how we spent the entire weekend. And yeah, the other exciting thing is, I am helping run a new robotics program at my kids’ school. So that’s been really exciting. So, lots of new things that we are trying to create some exciting projects around. So yeah, we’ll have updates over the next few weeks on how that’s going as well.
Stephen

That’s exciting. Are you using Mindstorms, or custom build or Raspberry Pi, or Arduino? What platform are you using?
Rahul

We are starting with the Arduinos but then moving on to the ESP8266s. So, they have some Wi-Fi. The end goal is that for the school, they’ll build these little boxes, which have all the sensors from temperature sensors to humidity sensors to NFC, you know, sensors that’ll allow students to have their cards and come in and do an auto attendance, then tap into the cameras that are in the classroom to be able to, you know, figure out how students are engaging and who’s present, not present and so on. So yeah, lots of really interesting things in the works. We just have to see how this pans out. This is for middle school, by the way, grades six, seven, and eight. So yeah, quite interesting. Going back to teaching after almost 20 years, so should be fun.
Stephen

I’m really excited to hear how it turns out and to hear how it’s received and how the kids progress through that. So yeah, let’s revisit that at some future point.
Rahul

Absolutely.
Stephen

Well, today we have a really special guest. I’d like to give a little introduction, even though he doesn’t really need much of one. Today, it’s my pleasure to introduce Jeff Barr. Jeff is the AWS Chief Evangelist and VP. So, he’s been at Amazon for more than 20 years. And he started working with them when they were mostly a company working out of the old Pac Med building. And he’s been a web services evangelist since 2003. And in that time, he’s basically defined the phrase, technical evangelist. And on a personal note, I’m proud to say he’s my dad. So, without further ado, let’s bring in Jeff.

All right.
Rahul

Hey Jeff, welcome to the show.
Jeff

Happy to be here. This is fun. I love that printing effect. That was really cool.
Stephen

Oh, cool. I’m really glad.
Rahul

I think one of the three or four things that we all share in common in this group is 3D printing, and of course Lego. And I’m surprised you don’t have your Lego background, Jeff.
Jeff

I’m doing virtual right now. I’ve got a green screen. So, I’m doing green screen today. So, I can do some good effects a little bit later.
Rahul

Yeah. So yeah, so we’re really excited to have you with us today. And yeah, we’re gonna have hopefully a very engaging discussion. We have so many questions for you. So can’t wait to get started.
Jeff

All right. I hope I have just as many answers as questions.
Stephen

It’s funny. In kind of going over your introductory post and kind of preparing the introduction, I was…Oops, let me share the right screen. This is my memory of you taking me to your office. Not even that whole building, but the top two-thirds was Amazon. And all of Amazon.
Jeff

Yeah, those were really old days. That’s a very good picture of that building. Because they started to be a little bit more particular about pictures after a while, but that brings back some really good memories of being on the different floors of that building. And I started low and actually just somehow worked my way up to upper floors and, like, that upper spire, there must have been room for, like, less than 20 people on some of those uppermost floors.
Rahul

Oh, wow.
Stephen

It was a pretty cool, classic art deco architecture. It was pretty inspiring for me to see that as a teenager when you were, you know, just starting there.
Jeff

Well, you never know when you start something where it’s gonna lead you.
Rahul

Very, very true.
Stephen

Well, I think we’re gonna make a giant leap then, and talk about one of your most recent blog posts, the AWS Snowcone in space. We’ll introduce that segment. And then let’s have a good chat about that.

All right.
Jeff

All right. I do wanna go into space.
Rahul

Yeah, so this actually was really, really exciting to hear that, you know, you guys sent the Snowcone to space. We’d love to hear the backstory. Like, what went into the exercise? Number one, how did it all start? Like, you know, sending something to space isn’t the first thing that comes to mind when you have a service or you have an Edge use case. How did it all come to being?
Jeff

So, one of the interesting challenges of being in space is that the bandwidth up and down is actually very, very limited. And there was a question about…they wanted to do some image processing as close to the images as possible. And they said, “Okay, we’ve got all these pictures, we need to identify certain things in the pictures. We don’t think we have the ability to bring the high-res pictures down to Earth and do the image processing and recognition and so forth on Earth because of this limited bandwidth.” And because we’ve been focusing on Edge computing for a while, and getting the processing out closer to where the action is, the Snowcone seemed like an ideal solution. And there’s an entire qualification process you need to work through of okay, here’s the device. Can it withstand the rigors of launch?

Will it work properly in microgravity? Is it safe? Is it going to keep the ISS safe? Is it gonna keep the astronauts safe? So quite an interesting process.
Stephen

We saw…oh, go ahead.
Rahul

I was just gonna ask how long before the launch did that process actually get started.
Jeff

So, my understanding is that it was accelerated, that it was somewhere on the order of about six months. And that by the standards of space travel, that was considered very, very quick. And partly that was because we had some great partners working with Axiom and working with NASA. But then also because of the fact that when we build these devices, we make them incredibly rugged to start with. You can find several different videos online of people doing their absolute best to destroy the various Snow devices. We actually sent one to Adam Savage and said, “Please have at it and destroy this, please.” And he tried some different machinery and explosives and it’s like no, it’s pretty tough.
Rahul

I’ve watched a whole bunch of those videos, I remember, of trying to destroy it, but it just doesn’t. Yeah, this is one of them.
Stephen

This is the intro video you did with Bill Vass, and I liked his ending shot with the puppies. I think he had one near his water fountain, if I remember correctly. Yeah. Here we go. That was clever.
Rahul

So funny thing. I think this was back in 2015, or maybe early 2016. This was before…this is one of my first EBCs that I actually did at AWS and this was when you guys didn’t have the EBC Center. So, it was literally just walking into the AWS offices and meeting with the folks who are over there. And I think we were supposed to meet with somebody. I forget who. And that person was not available and was trying to get from another building over to the building where we were. And so, we were literally just walking down the corridor, you know, trying to grab a coffee or whatever. And we ran into Bill Vass. And so, Bill invited us and then we started chatting. And we were telling him about how hard it was for us to move a whole bunch of our data from an acquisition that was recent at that time.

And Bill was like, “Hey, you know what? We have this new thing called a Snowball. And we’d love for you to try it.” And I think we got one of the very first Snowballs shipped to us, as part of the, you know, experimental, let’s try and move data centers into AWS very, very quickly. I think it took us, like, three attempts to get it done, you know, debugging and figuring out the solution. But it was awesome. It was such a gamechanger. You know, when you have data centers that have tons of workloads that are running over there to be able to just suck it all up, put it onto this, you know, Snowball machine, and just have it show up in your AWS account. That is absolutely remarkable.
Jeff

It’s worked out really well. And one thing that’s really interesting to me is that this is an evolution of something that we had started quite a bit earlier in the history of AWS, where when customers needed to do data migration to the cloud, we said, “Literally, send us your hard drives,” and we had some specs as to what kind of drives they could send and how they’d prepare the data. And there were some really interesting challenges with that, as you can imagine, and then evolving from that into an actual, like, we’ll build a device, we’ll send you a device that is really good at actually pulling the, you know…it’s got the right level of storage, it’s rugged, it’s got the right level of connectivity so that it doesn’t take you actually months or years to get all the data onto it.
Rahul

Yep, absolutely. These are absolutely fantastic machines on the Edge. So, you know, just moving on to the next question. How are people using these devices on the Edge other than backing up stuff and moving them to AWS?
Jeff

So, the newest Snowball Edge devices, they have a substantial amount of compute power onboard. So, the Snowball Edges, you can get them with GPUs. So, you can do various kinds of image processing, you can do machine learning inferencing. And so, what I like to think of them in general, beyond kinda the mainstream use case of let’s get data from somewhere out there into AWS, the other one is basically almost like a filtering system that says, “We’ve got this massive amount of raw data we are generating, collecting at the Edge, but we don’t necessarily need all of it.” And so, we either need to pick the small set of that, that we would like to take, or we need to sum it up, aggregate it, preprocess it in some way, and then pass it up to the cloud. So, I kind of think of it as this filtering, aggregation, gathering, buffering device.
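
The filter-and-aggregate pattern Jeff describes can be sketched in a few lines of Python. This is only an illustration, not how any Snow device actually works: the `summarize` function, the sample readings, and the threshold are all hypothetical. The idea is that the edge device keeps a compact summary plus the few readings worth a closer look, and only that goes up to the cloud.

```python
from statistics import mean

def summarize(readings, threshold):
    """Reduce raw readings to a small summary plus the outliers worth keeping."""
    outliers = [r for r in readings if r > threshold]
    return {
        "count": len(readings),
        "mean": mean(readings),
        "max": max(readings),
        "outliers": outliers,
    }

# Hypothetical raw temperature readings collected at the edge.
raw = [20.1, 20.3, 19.8, 35.2, 20.0, 20.2]

# Only this summary, not the raw data, would be passed up to the cloud.
summary = summarize(raw, threshold=30.0)
print(summary)
```

With millions of readings instead of six, the summary stays about the same size, which is the point of doing the reduction before the data crosses a constrained link.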
Rahul

Got it. Any other interesting stories around any of these devices?
Jeff

So, I’ve had several of them in my house for various photoshoots. And they are somewhat unique looking. And I actually did have the…I had a Snowball Edge, which is a pretty heavy device. And I took one and I needed to return it to Amazon after I’d done my photoshoot. I actually took a picture of my dog, Luna, with it. She was very happy to be in the photo. And I carried it down the hill to the local UPS store. And I was expecting the guy behind the counter…I was expecting I’d have to explain what this was and why it was safe to ship and everything. And he’s like, “Oh my gosh, another one of these cool things.” So, it wasn’t as amazing or unique as I was expecting, which is, like…I’m thinking, “Here I am in this residential neighborhood. Like, which ones of my neighbors are doing this massive amount of data transport to the cloud?”

It’s a little bit interesting to think that that kind of activity is somehow taking place within a little radius of my neighborhood.
Rahul

With the assumption that there aren’t any data centers around there, it sounds scary that people are running those kinda workloads from their homes.
Jeff

You really just never know anymore. I mean, well, here we are. I’m in the lowest level of the house and Stephen’s upstairs and we’re running two concurrent video streams out of our house, which is, like, apparently no big deal these days.
Rahul

True. Very true. Okay.
Stephen

Here’s a question. How do you model the trade-off then between moving the Edge and then investing in the infrastructure to get to the cloud? So, you could either say, “Okay, I’m going to get one of these devices and put it here or I can invest in more bandwidth where I need it.” So how do people think about that trade-off?
Jeff

So, what I hear from our customers is, there are times when they’re in the process of shutting down a data center as part of their cloud migration. And sometimes it’s the case that they’ve let this data center get somewhat out of date. And when they started doing the math and say, “Well, if we were to bring up the connectivity of the data center to the point where we could actually push all this data through the connection…” And they think, “Okay, how much is that gonna cost? How long will it take to do this?” Versus the fact that we’re trying to be done with this data center versus trying to invest more in it. I’ve heard of situations where they do the math, and they say, “It’s on the order of years to get all this data out of our data center.” Because they’ve got poor connectivity or they’ve got just colossal amounts of data.

Or sometimes the internal aggregate network bandwidth in these data centers isn’t wonderful. And they say, “We’ve had this case before of needing to get to all the data. And, like, we’re not gonna rebuild the data center just so we can shut it down.”
Stephen

I was watching the “Snowmobile Talk” and they said that at 10 gigabytes per second, or 10 gigabits per second, it would take 26 years to fill up a Snowmobile. So, they’re not gonna be able to do that in any kind of reasonable timeframe that would matter.
Jeff

Yeah, orders of magnitude can multiply out really quickly into, like…that’s actually impossible.
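
The arithmetic behind this kind of estimate is easy to check. The sketch below assumes decimal units (1 PB = 10^15 bytes), a link that stays fully saturated the whole time, and no protocol overhead, none of which holds in practice; the function name is made up for the example. Under those assumptions, 100 PB works out to roughly 2.5 years at 10 gigabits per second, and an exabyte (ten Snowmobiles' worth) to roughly 25 years, so the 26-year figure seems to line up with exabyte scale.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
PETABYTE = 10**15   # decimal petabyte, in bytes
GBPS = 10**9        # one gigabit per second, in bits per second

def transfer_years(num_bytes, bits_per_second):
    """Years needed to push num_bytes through a fully saturated link."""
    return num_bytes * 8 / bits_per_second / SECONDS_PER_YEAR

# One Snowmobile (100 PB) over a 10 gigabit-per-second link.
snowmobile_10gbit = transfer_years(100 * PETABYTE, 10 * GBPS)

# One exabyte (1,000 PB) over the same link.
exabyte_10gbit = transfer_years(1000 * PETABYTE, 10 * GBPS)

# One Snowmobile at 10 gigabytes per second (80 gigabits per second).
snowmobile_10gbyte = transfer_years(100 * PETABYTE, 80 * GBPS)

print(f"{snowmobile_10gbit:.2f} years, {exabyte_10gbit:.2f} years, "
      f"{snowmobile_10gbyte:.2f} years")
```

The gap between gigabits and gigabytes is a factor of eight, which is itself close to an order of magnitude.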
Rahul

True. I think, for us historically…I mean, since we started using Snowballs, we’ve never had a chance to use a Snowmobile yet. I don’t think we’ve ever moved [crosstalk 00:17:01] but Snowballs for sure. I think for us, it was always the trade-off when it came…or whenever network bandwidth or latency started becoming the bottleneck. I remember a time when just getting a Direct Connect link setup would take a minimum of 60 days, just working through the vendors, you know, making sure that all those connections could be set up in whatever data center you were in. Like, in the early days of Direct Connect, I think it would take almost 90 days, and then slowly, it came down to 60, then we were attempting to get to 45 days.

But, you know, it would take a minimum of that much time just to get a connection set up within a bandwidth to start moving…even start moving the data, you know, let alone get the migrations done. And we were trying to get most of our migrations done in a period of 90 days.
Jeff

Yeah, all of those physical activities involved in getting the connections and all of the dealing with telecoms, they’re certainly trying to be innovative and to move quickly but there’s still just a lot of process and a lot of steps in those pipelines to make something happen. So, send us the hardware. We’ll copy it in, copy it out. Just seems to work a whole lot better.
Rahul

Absolutely. So, there’s that. And then there was the other latency aspect of it, which is sometimes just things are so far away that just speed of light latency becomes a bit of an issue. And you want stuff on the Edge.
Jeff

Yeah, there are protocols that are supposed to address that, that keep a lot more data flowing across the pipe. So, you don’t have to have individual acknowledgments for each packet. But still getting those things set up and getting them to run at scale, you might need a lot of compute power to actually keep the connection busy. And it’s specialized expertise that, like…why bother to develop that specialized expertise in a situation where you’re saying, “We’re now moving in the direction of not needing that expertise and that equipment.” Why invest when you can just use the hardware that we can provide you?
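
Why individual acknowledgments hurt on a long link can be shown with two small formulas. The numbers here are assumptions chosen for illustration (a 10 gigabit-per-second link, a 100 ms round trip, 1,500-byte packets), and the sketch ignores transmission time and loss. If the sender waits for an acknowledgment after every packet, throughput is capped at one packet per round trip, about 120 kilobits per second in this case. To keep the pipe full, the sender needs about 125 MB in flight, which is the bandwidth-delay product.

```python
def bandwidth_delay_product_bytes(bits_per_second, rtt_seconds):
    """Bytes that must be in flight to keep the link fully busy."""
    return bits_per_second * rtt_seconds / 8

def stop_and_wait_bps(packet_bytes, rtt_seconds):
    """Throughput if the sender waits one round trip per packet."""
    return packet_bytes * 8 / rtt_seconds

LINK_BPS = 10 * 10**9   # assumed 10 gigabit-per-second link
RTT = 0.1               # assumed 100 ms round-trip time
PACKET = 1500           # assumed packet size in bytes

in_flight = bandwidth_delay_product_bytes(LINK_BPS, RTT)
naive_rate = stop_and_wait_bps(PACKET, RTT)

print(f"in flight needed: {in_flight / 1e6:.0f} MB")
print(f"one-packet-per-round-trip rate: {naive_rate / 1e3:.0f} kbit/s")
```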
Rahul

It’s not core business anyway, and it’s never going to be so why invest in it?
Jeff

Exactly.
Stephen

And speaking of hardware, is there going to be at some point a Snow device that has a Graviton inside of it? Is that already the case, or do we know?
Jeff

I actually never…believe it or not, I know very little about where we’re going with these. And the process when we are getting ready to launch something new is that generally about a month ahead of time, the teams will create an internal ticket, they’ll attach the Amazon document called the PR FAQ. That’s the press release and the frequently asked questions. They’ll give us a quick briefing. They’ll give us product access. But even though the teams generally have at least a year to 18 months of a very concrete roadmap of what they’re doing, I do my best not to look at it, mostly because it helps to separate in my head what’s real versus what we’re working on. And they also reprioritize a lot.

So, when a team launches the first iteration of some brand-new big service, they will often have a very long roadmap of these are things that we want to do. But as soon as they launch, the immediate customer feedback that comes back will start to point them in other directions. So, you’ll see that some companies will say, “Okay, here we are, and it’s 2022. And here’s what we’re doing in ’23, ’24, ’25, and so forth.” We have some good ideas of generally where we’d like to go, but we’d much rather hear where our customers need to go, and just let that drive our roadmap. So, the short answer is, I have no idea what we’re building next. But we’ll keep trying to fulfill customer use cases.
Rahul

But safe to say that none of these devices currently have Graviton processors built into them?
Jeff

Not as far as I know.
Rahul

Okay. Got it.
Stephen

All right. Well, do you at least know the name of the next one?
Jeff

I don’t.
Stephen

I can suggest either Snowplow or Snowshoe.
Jeff

Okay. I heard someone once asking for something like the size of an aircraft carrier for, like, those truly, like, Earth-shattering levels of data migration, but that’s an awful lot of [inaudible 00:21:28]
Rahul

Who has those needs? I wonder who ever has those kind of needs.
Jeff

Yeah.
Stephen

Well, anything else on the space front, before we move on from the topic? What was it like [crosstalk 00:21:42]
Jeff

Well, I found space really exciting, because it’s, like…I’ve been a space fan since I was, like, maybe five or six years old. And I remember learning about, like, the Mercury, and the Gemini and then the Apollo launches and following those, and we just passed the, you know, the anniversary of the first Moon landing. And to be this close to doing that, and to be able to…just even to read about it and know that my colleagues are getting to participate in this, it’s really, really fascinating. And, to me, the more interesting thing, in talking to a lot of my colleagues that are doing this, is the connection between science and science fiction. There’s actually this really cool circularity where science fiction provides this amazing vision of where the future could go.

And then people are inspired by that and say, “Well, if someone can dream that up, then we can probably actually invent this.” And it’s a much more circular relationship than an outsider might guess. Like, how important the speculation and the science fiction that’s often informed by possibilities and by science and by physics to say, “This is where we could go,” and someone’s like, “Oh, well, if we could go there, well, let’s try to do that.”
Stephen

Well, one of my favorite physical books is your old copy of “2001: A Space Odyssey” with your 10-year-old handwriting in the cover that I’m looking forward to reading to my kids.
Jeff

Awesome.
Rahul

There’s actually a very interesting documentary that I saw a few years ago. It’s called “How Star Trek Changed the World”. I think it’s hosted by William Shatner. And it talks about how everything from the mobile phone to, you know, the sliding doors were created by people who got really inspired by what “Star Trek” had shown, and went ahead and built those things. Like, the flip phone and all of those things kind of came from…you know, the communicator from “Star Trek” was the inspiration for the flip phone. So, it’s pretty amazing how much science fiction kind of inspires us to go do stuff.
Jeff

Totally agree.
Stephen

All right, well, let’s end the segment and I’m gonna end this segment with a little montage I put together of the Ax-1.
Woman 1

Two, one, zero. Ignition. Stand by.
Man 1

Right, deploy. SpaceX Dragon launched.
Woman 2

…zero G and if you look at the right-hand side…
Man 2

…these two meters.
Man 3

One meter to go…Tom Marsh from making his way down.
Stephen

Pretty exciting that there was a Snowcone back in the storage compartment there.
Jeff

I’m jealous of all those astronauts.
Stephen

Well, who knows? Maybe you’ll get to take a ride in one of the later iterations.
Rahul

Do you get a discount program on the Blue Origin flights that are coming up?
Jeff

Not that anyone’s told me about yet. I volunteered to go along as a blogger, but so far, no response.
Rahul

There should be some perks of being, you know, one of the early ones at Amazon.
Jeff

I would hope. I wanna feel those G-forces, I would love that, just crushing G-force. I have to imagine that’s just a feeling that’s pretty hard to replicate anywhere else.
Rahul

Yeah, I can quite imagine.
Stephen

Okay, let’s take a 30-second break. When we come back, we’re gonna be talking about Prime Day.
Announcer

Public cloud costs going up and your AWS bill growing without the right cost controls? CloudFix saves you 10% to 20% on your AWS bill by focusing on AWS recommended fixes that are 100% safe with zero downtime and zero degradation in performance. The best part? With your approval, CloudFix finds and implements AWS fixes to help you run more efficiently. Visit cloudfix.com for a free savings assessment.
Stephen

All right, so tell us about Prime Day.
Jeff

Wow. Prime Day is always exciting. And what I love about it is behind the scenes, the teams that keep all the Amazon infrastructure running, they spend months getting ready for this to make sure that all the different pieces of infrastructure are gonna perform flawlessly. And I remember in the earliest time that Amazon…this was before AWS and this was one of the motivators for AWS is that there was an actual spreadsheet that was passed around from team to team that you would basically say, “Here’s your team, here’s the servers that you have. How many do you need for the next holiday scaling season, and when do you need them?” And each team would fill that in and pass it along. And there’d be some negotiation of timing and of exact allocation. And that was such an obstacle to scaling.

And it was a big, complicated thing that you had to do every year. And of course, it’s really hard to plan how much resources you need. And if you’re innovating and you’ve got some new services, you never know how popular that part of your site’s gonna be and how much you need in the way of resources. So, getting to this point where it’s AWS powered, and there’s great things like auto-scaling and serverless. It doesn’t make it, like, effortless, but it means that it’s a different kind of effort where a lot of the work that goes into it is more in the way of projecting and planning and making sure that each part works well and making sure that you understand the scaling characteristics. And so that when that happens, it’s a matter of just watching the metrics, making sure everything’s heading in the right direction.

If you see something not in the right direction, like, what can I adjust before it goes from…it’s purely noticed internally to something that customers will notice. So, it seems like every year we get better and better at this.
Rahul

Yeah, the numbers are actually absolutely staggering. And every year, you guys beat them by quite a margin. And what I find baffling, in fact, I struggle with it even internally is just being able to predict the demand, both on the minimum requirements side and setting a cap because you kind of don’t want…like, there’s a fine line between letting something, you know, run away, because there’s some error somewhere and, you know, you’ve suddenly got 100 or 1,000 instances of something running and suddenly a bill is kind of super bloated versus actually looking at something that is pure demand and reacting to it responsibly in an elastic manner. How do the teams go about doing that? Is there any light you can shed on that exercise that happens at Amazon, on being able to do that estimation?
Jeff

Yeah. So, one thing that I know that the teams like to do is that they generally like to establish expected ranges for a lot of the metrics and say, “Well, in general, we know that this particular metric will be between low bounds and high bounds,” and they’ll set alarms on either end. And one of the things that I know that we used to talk a lot…I don’t know if this is still the case, I’m guessing it still is, is that in a distributed system, it generally doesn’t fail all at once. It will tend to start slowing down and instead of 100% of the things working, 99.99% work and then 99.98 and so forth. And if you can detect those tiny, like, failures and retries and the situations where maybe a queue is getting slightly longer or a buffer is getting full, those are things that you can alert on very early and take protective action or preventive action before it becomes customer visible.

And the ultimate metric that we watch on the retail side is the number of checkouts per second. So that’s an awesome metric. Because that means everything else has to work. The website has to be functional. It has to be able to display content to the users, it has to be able to accept things being put into the shopping cart, the checkout process has to work. Probably a thousand other things, I have no idea how many, actually have to be there to get to the point where you’ve actually successfully done the checkout and your order’s now in the system, and we’re getting ready to do the fulfillment. So many pieces have to work that watching that number turns out to be a great early indicator that something else might not be right.
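
The expected-range idea can be written down as a tiny check. This is a hedged sketch, not Amazon's tooling: the metric name, the bounds, and the request counts are invented for the example. It shows how a success rate that has slipped from 99.99% to 99.98% can trip an alarm long before anything looks broken from the outside.

```python
def success_rate(successes, total):
    """Percentage of requests that succeeded."""
    return 100.0 * successes / total

def check_bounds(name, value, low, high):
    """Return an alarm message if value is outside [low, high], else None."""
    if value < low:
        return f"ALARM {name}: {value} is below {low}"
    if value > high:
        return f"ALARM {name}: {value} is above {high}"
    return None

# Hypothetical sample: 200 failures out of one million requests.
rate = success_rate(999_800, 1_000_000)

# Hypothetical expected range for this metric.
alarm = check_bounds("checkout_success_pct", rate, low=99.99, high=100.0)
print(alarm)
```

The same check with a high bound works for things like queue length or buffer fill, where growth rather than shrinkage is the early warning.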
Rahul

Got it.
Jeff

Now, we have a couple of, I think, almost unique advantages that help us. We’ve been doing this for a really, really long time. So, we have expected values on upsides and downsides of where these metrics should be. So, it’s very easy to say, “Well, where was this yesterday at this time? Or during the last Prime Day peak, where are we comparatively?” I think having that long history is really helpful. What was my second point? Oh, the second one, I think is more important. We’re running at such a scale that the numbers really make a difference. I think this is a challenge if you’re not at world scale, and your numbers are not millions, billions, trillions. We even have things I think we measure in quadrillions, sometimes.

When things are that big, the math works out and trends become really visible. If you’re dealing with ones and tens and hundreds, you probably have to have a slightly different approach, because the statistics aren’t as accurate or as helpful to you.
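
One way to see why scale helps is to look at how noisy a measured failure rate is at different volumes. The sketch below assumes requests fail independently with a fixed probability (a simple binomial model, which real systems only approximate), and the function name and numbers are made up for illustration. It computes the standard error of the observed rate relative to the rate itself.

```python
import math

def relative_noise(failure_rate, num_requests):
    """Standard error of the observed failure rate, as a fraction of the rate."""
    std_error = math.sqrt(failure_rate * (1 - failure_rate) / num_requests)
    return std_error / failure_rate

RATE = 0.0001  # assumed true failure rate of 0.01%

small_scale = relative_noise(RATE, 10**4)    # ten thousand requests
world_scale = relative_noise(RATE, 10**10)   # ten billion requests

print(f"10^4 requests:  +/- {small_scale:.0%} of the rate")
print(f"10^10 requests: +/- {world_scale:.2%} of the rate")
```

Under this model, at ten thousand requests the noise is about as large as the rate being measured, so a shift from 0.01% to 0.02% is hard to tell from chance. At ten billion requests the noise is around a tenth of a percent of the rate, and the same shift stands out clearly.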
Rahul

True.
Stephen

We’re looking at your blog post, looking at the SQS. So, this is 70 million transactions on SQS per second. SES was 33,000 emails per second. Let’s see, 152 petabytes to EBS. It’s a huge number. It’s 5,300 Aurora database instances for 288 billion transactions. So yeah, by the law of large numbers, every statistical phenomenon that’s remotely present in the data will appear.
Jeff

Exactly.
Rahul

How do teams organize themselves to tackle…I mean, the Amazon system is, you know, presumably a very, very complex system with…I mean, just given the number of databases and stuff that are being used, the number of queues that are probably getting used over here, this is a very, very vast and complex system. One of the challenges of a complex system like that is that you’re likely to get overwhelmed by the number of metrics that you’re looking at at the lower level. It’s great to have that one metric that gives you an early indicator, like you said, the number of checkouts. But when you get into the next level, when something does go wrong, there is a very high likelihood that you could get overwhelmed with just the number of alerts and alarms going off all over the place with that many services that are interlinked, or, you know, which have a strong cohesion of sorts, trying to get everything working.

So how do the teams organize themselves in terms of tackling a problem when there is a problem?
Jeff

So, my understanding…and I haven’t been operational for a really long time, but each team owns a particular service and we always call them a two-pizza team, because we try to keep them small enough that two pizzas of some arbitrary size with arbitrarily hungry staff on there would be happy with. But the idea is that each of these small teams own some very particular important microservice, essentially, that helps the site to succeed. And so, it’s up to that team to figure out what that service should be, build it, figure out what metrics to watch, and to track and to alarm on all those metrics. And to make sure that they know who their customers are and who their dependencies are, all the customers that are counting on them.

So instead of one…like, you used to see these pictures of, like, these, like ISPs and telecoms with these gigantic rooms and big screens and the central operations center. I don’t think we have one of those because that kind of assumes this central level of control. Versus we’re saying, “If each of these individual teams does their part, then we’re much closer to success than having some central team that somehow has to monitor everything.” Now, clearly, there are situations when something does go wrong, and multiple teams are involved. And when that happens, a ticket is cut. Usually, it’s actually an automatically cut ticket, because some metric went outside the bounds. And it says, “Well, this metric went too high, too low for five successive checks, let’s cut a ticket.”
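
The "five successive checks" rule can be sketched as a small state machine. This is a minimal illustration, not Amazon's internal tooling: the class name `ConsecutiveBreachAlarm`, the bounds, and the sample readings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveBreachAlarm:
    """Signals a ticket once a metric is out of bounds N checks in a row."""

    low: float
    high: float
    required_breaches: int = 5
    _streak: int = 0  # consecutive out-of-bounds checks seen so far

    def observe(self, value: float) -> bool:
        """Record one check; return True when a ticket should be cut."""
        if self.low <= value <= self.high:
            self._streak = 0  # a healthy check resets the streak
            return False
        self._streak += 1
        # Fire exactly once, on the check that completes the streak.
        return self._streak == self.required_breaches


alarm = ConsecutiveBreachAlarm(low=100.0, high=500.0)
readings = [300, 620, 640, 480, 610, 650, 700, 720, 690, 710]
tickets = [i for i, r in enumerate(readings) if alarm.observe(r)]
print(tickets)  # [8]: the fifth consecutive breach is at index 8
```

Requiring a streak rather than a single bad reading trades a little detection delay for fewer false tickets; the healthy reading at index 3 resets the count, so only the later sustained breach fires.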

And then the idea is each of the teams that might have some involvement in there, they have to actually log into a shared phone call. And from there, they start very methodically working through and saying, “Does it look like this? Does it look like that?” And again, because it’s a distributed system, it’s sometimes in the connection between the systems rather than…is any particular thing failing? Well, not one thing is failing. But when you connect these two things together, they’re operating in this interesting way that we’ve done our best to predict and forecast. But sometimes there’s still surprising dynamic behavior. But things that you keep identifying and getting better and better at over time to understand.
Rahul

That sounds pretty interesting, but at the same time, scary, to have that distributed setup, and still have everything work. It’s gotta be a really well-oiled machine to be able to [crosstalk 00:36:22]
Jeff

It’s well oiled. One thing that I think is in our favor is that we try really hard to learn from every event and that there’s just Amazon…not tradition. It’s deeper than a tradition. It’s a best practice and required is that any time something goes wrong, the first thing we do, of course, is get things back on track and restore service for our customers. But then someone is appointed the owner of a document called a COE, a Correction of Errors. And the idea is that you go back into all of the log files, and you say, “Okay, what went wrong?” And sometimes you’ve put together almost a millisecond-by-millisecond recounting of this happened, and then this happened. And then this queue got a little bit too high. And so, we tried to scale and then this other thing happened.

And then you basically, without pointing any fingers at people…you’re always [inaudible 00:37:16] You’re saying, “Well, this system didn’t work and this alarm didn’t fire.” You say, “First, what went wrong? And what do we need to do to make sure that this particular thing can never, ever happen again?” Sometimes that is involved with just simply adding more alarms. You might say, “Well, in this particular situation, we didn’t even know that this was a thing to monitor. But let’s make sure we monitor it the next time around. And we’ll just add some more alarms and alerts.” Or it might be, “Well, we discovered this new behavior, and we need to do some more engineering work to get around it.” But at any point after one of these situations, remedying that technical issue, it becomes the highest priority for the team.

We’ve had a few very high profile, but fortunately very rare, visible events over the years. At the point where we figure out why did that happen, making sure it can’t happen again becomes the very highest priority for the team where they will say, “We’re gonna push everything else on our roadmap. We’ll just push that a bit into the future. We’re gonna make sure that that first one…” I’m not saying it’s acceptable, but it happened. But doing it a second time, that’s for sure not something we would allow. Now, the iteration of this process over years and years and years means that we keep building more resiliency into the system, which is super important. But the other cool thing is that these COEs are numbered.

And when you start talking to our principal engineers…and a principal engineer is someone with decades of experience, and is basically, like, the equivalent of a director or a VP, but they still get to build and architect systems. The more notorious and the COEs where you can learn the best lessons…they’ve got numbers that you listen to these principal engineers and they’re like, “Oh, do you remember 253? Oh, yeah, that wasn’t as bad as 155. Yeah, but we learned a great lesson from 157, which we now tell everybody to use that.” And it’s fascinating to see how much we’ve learned over the years. And these are hard-won lessons that we don’t forget, and that each one of these is gonna help us to do something better and better over time.
Rahul

That sounds like more than a best practice. It seems very ingrained in the Amazon culture.
Jeff

It’s fascinating to read them. Even my best days of actually building stuff with code are long, long, long past but I can still read these things and understand them and say, “We’ve got really sharp people building and running these.” And another aspect of it is that once a team builds something, they also have to run it. We’re not kind of, like, this old school model where there was this team of engineers who built things and in the engineers’ heads, everything was perfect. And then they’d hand it off to these poor folks over in operations, who had to deal with whatever stuff was thrown their way and try to actually keep it running. And of course, there was plenty of finger pointing and issues with…operational issues didn’t always get back to devs.

So, the way we like to run things is the same people that built it, they’re the ones who are actually gonna be on call and to make sure that when something breaks, they’re the ones who have to dive deep and figure out what went wrong. I still remember my first day at Amazon and I had lunch with my manager and he hands me a pager. I don’t think we use pagers anymore and probably the audience doesn’t even know what pagers are. Hands me a pager and says, “You are the secondary on call in two weeks.” And the implication of that is, “Oh, my gosh. I’m gonna be responsible for keeping the system running. I’d better actually understand how all this stuff works.”

And, you know, at the point when you go from being secondary to primary, and during your…not quite a shift, but when you’re the on-call, your pager goes off, you’re supposed to log in and within minutes, know the system in depth to say, “Okay, well, this is the problem. We need to actually take progress toward remedying this.” I was happy to not have the pager anymore, by the way.
Stephen

Just thinking about…you know, I’m still thinking about that 70 million messages a second. So, for every 13 seconds, your backlog grows by a billion. You really wanna think quickly there.
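
The backlog arithmetic can be checked with a short sketch. It assumes the worst case where consumers stop entirely (drain rate of zero); the function name and the drain-rate parameter are illustrative additions.

```python
def seconds_to_backlog(target_messages: float,
                       arrival_rate: float,
                       drain_rate: float = 0.0) -> float:
    """Seconds until the backlog reaches target_messages at constant rates."""
    net_rate = arrival_rate - drain_rate
    if net_rate <= 0:
        raise ValueError("backlog only grows when arrivals exceed the drain rate")
    return target_messages / net_rate


# 70 million messages per second arriving, nothing being consumed.
print(round(seconds_to_backlog(1e9, 70e6), 1))  # 14.3
```

At 70 million messages per second, 13 seconds accumulates about 910 million messages, and the billion mark arrives a little after 14 seconds, so the spoken figure is in the right range.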
Jeff

The numbers are staggering.
Rahul

That brings you to another question around the elasticity. So, this is now more on the AWS side, less on the Amazon side. I remember a time when the data centers were still growing, you know, coming up where capacity planning, the capacity planning team had a big role to play. There was one time when I was just trying to figure out how many Spot instances I could get and I tried to launch 100 x1.32xlarges and I got a call from the team saying, “What the hell are you trying to do? Let us know at least, you know, a month in advance so we can plan out this experiment that you’re running or whatever.”

How much of a dent does this Prime Day workload make on the AWS side in terms of the capacity planning exercise and so on? Is it like, “Okay, it’s big enough that you know, capacity planning has to get involved,” or is the AWS size or the scale, large enough that it doesn’t really matter, there’s enough elasticity built in?
Jeff

So, I’m not privy to that information. But I think we’re now at your latter point where we’ve got so many customers with so many different use cases. Some of which are on just different utilization cycles. And some of our customers are using Spot instances. And so, if we were to need a lot more capacity, then that can presumably come from Spot instances. But what I’ve heard and I don’t know this directly is that we are probably not the biggest consumer of compute capacity around.
Rahul

Got it.
Jeff

And also, this is a spike versus someone stepping up and from zero saying, “I now need this much brand-new capacity.” We’ve got all this awesome historical data that’s gonna help us guide through, guide us through [crosstalk 00:43:46]
Stephen

And some very good predictive tools to use that and some very smart statisticians.
Jeff

Indeed, and a lot of that we’ve also made available to our customers. So, we do have predictive scaling that uses historical data. I know we actually…we just recently launched something new that in addition to doing the predictive scaling, it publishes some metrics that are kinda like…it’s kinda tattling on itself to say, “And this is how well or how not well I did at predicting your need for predictive scaling.” Like, that it’s too much, too little and so forth.
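
To make "how well or how not well I did at predicting" concrete, here is a toy sketch of forecasting from history and then scoring the forecast. It is not the AWS predictive scaling algorithm and does not use its metric names: the seasonal-naive forecast, the function names, and the sample numbers are all assumptions chosen for illustration.

```python
def seasonal_naive_forecast(history: list[float], season: int) -> list[float]:
    """Forecast the next `season` points by repeating the last full season."""
    if season <= 0 or len(history) < season:
        raise ValueError("need at least one full season of history")
    return history[-season:]


def forecast_report(forecast: list[float], actual: list[float]) -> dict[str, float]:
    """Summarize how far the forecast was above or below real demand."""
    if len(forecast) != len(actual):
        raise ValueError("forecast and actual must be the same length")
    over = sum(max(f - a, 0.0) for f, a in zip(forecast, actual))
    under = sum(max(a - f, 0.0) for f, a in zip(forecast, actual))
    # Mean absolute percentage error; assumes every actual value is non-zero.
    mape = 100.0 * sum(abs(f - a) / a for f, a in zip(forecast, actual)) / len(actual)
    return {"over_provisioned": over, "under_provisioned": under, "mape_percent": mape}


history = [10, 20, 40, 20, 12, 22, 44, 18]   # two "days" of four samples each
forecast = seasonal_naive_forecast(history, season=4)
actual = [12, 25, 40, 20]                    # what demand turned out to be
print(forecast)
print(forecast_report(forecast, actual))
```

Splitting the error into over- and under-provisioned capacity matters because the two have different costs: one wastes money, the other risks the service falling behind.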
Stephen

Well, we want to transition to the Graviton segment really soon. But just one last follow-up question. What did everyone get for Prime Day?
Rahul

I went ahead and ordered a whole bunch of PLA spools [inaudible 00:44:35] I had, like, I think, 4 different colors earlier. I have expanded that to now nearly 10. So yeah, I bought a whole bunch of PLA spools.
Jeff

I was much more restrained. I only bought 1 because I think I’ve got about 20 colors here right now.
Rahul

I’ve seen your shelf.
Jeff

Yeah, I’ve got a lot.
Rahul

He’s got a lot of spools.
Jeff

I’m in a phase of life where I’m trying to own less things, not more. There’s always a lot of stuff I’d like to have. But I’m like, “Do I really need that or not?” And kinda adding complexity only reluctantly.
Rahul

I can understand that.
Stephen

I did get a Fire TV for a dashboard and calendar. It’s off to the side here. We’re gonna do a behind-the-scenes episode in a couple of weeks. So, I’ll show that.
Rahul

That’s awesome.
Stephen

All right. Let’s transition over to the Graviton. All right, here we go. This is some very high-tech stuff. So, I made a very high-tech transition.

All right, there we go. So, looking at the what’s new homepage for Graviton, there’s a lot of different announcements here. We’ve got CodeBuild, added Arm support, we’ve got the new C7g. Graviton 3s are available in more and more regions. Graviton played a big part in the Aurora databases, mostly backed by Gravitons for Prime Day. So, how’s Graviton doing? What’s the 10,000-foot view of Graviton?
Jeff

You know, our customers seem to really, really enjoy Graviton because it’s awesome at…it’s got great compute performance, it’s really good for their scale out workloads. We’re innovating really quickly. One thing I love about the whole idea of Graviton is that because we have good insights into our customer workloads, we can use those to very directly figure out, “Okay, what are the most important features that we need to put into the next generation?” And, you know, do we need to add more memory bandwidth, do we need to add more concurrency in terms of more vCPUs? We’ve got customers doing machine learning inferencing. They’re doing encryption, decryption. And this doesn’t mean that in any way we’re looking at the workloads. It means we’re looking at very, very low-level system utilization.

But being able to do that and saying, “How can we keep on understanding that, feed that into the design, get the design built and available to customers?” Getting that cycle to run very, very quickly, to me is the awesome thing that we’re able to do with Graviton.
Stephen

It’s pretty incredible that we’re, like you said, we’re already on the third iteration of them, plus the Inferentia and the Trainium, the different variants there. It’s incredible that physical process, that’s one of the most complicated manufacturing processes in existence can be iterated upon that quickly.
Jeff

I really agree. And if I could really kinda plug something we’re doing really quickly. Tomorrow, we’re running this online event called The Silicon Innovation Day. And all of these principal engineers that I was just talking about earlier, all of those that are doing the actual detail, hardware, and system design, we’re gonna get to hear directly from all those folks. And in fact, I’m doing a fireside chat tomorrow with Anthony, who’s one of the folks behind the Nitro system.
Stephen

[crosstalk 00:48:19]
Rahul

There’s just so many questions, you know, we have about the Graviton processor, and, you know, the nitty-gritty details inside. Like, for me…by the way, as you were talking about the Graviton process, I think that one of…looking at all these announcements, one of the things that AWS has been doing really well, at least from the customer side of things is taking all the managed services, and just swapping them out with Graviton processor, or at least offering Graviton as an option for all of those services, all of those managed services, which makes it really simple for anyone to switch over. Because even the migration itself is just taken care of. Apart from those, how easy or difficult has it been for customers to switch over to Graviton?
Jeff

It really depends on the particular workloads. If customers are using basically, like, interpreted languages or scripting languages, it’s generally very, very easy. Sometimes if you’re using something like Python, you’ve gotta make sure that your library’s available for the Arm architecture. If it’s a compiled language, of course, you need to make sure that your whole toolchain is able to compile for that architecture. But by and large, I’m hearing that that’s pretty easy for customers where we’ve sponsored some different, like, migration days and migration events and helping customers to not just understand the benefits, but to actually go ahead and do those migrations pretty quickly.
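
For the Python case, a quick pre-migration check might look like the sketch below, run on the Arm instance itself. It uses only the standard library; the helper names and the package list are hypothetical, and a successful import lookup only shows that a package is installed in the current environment, not that every code path works.

```python
import importlib.util
import platform
from typing import Optional

# Names commonly reported by platform.machine() on 64-bit Arm systems.
ARM_MACHINE_NAMES = {"aarch64", "arm64"}


def running_on_arm64(machine: Optional[str] = None) -> bool:
    """True if the given (or current) machine name is a 64-bit Arm name."""
    name = machine if machine is not None else platform.machine()
    return name.lower() in ARM_MACHINE_NAMES


def missing_packages(packages: list[str]) -> list[str]:
    """Return the top-level package names that cannot be found for import."""
    return [p for p in packages if importlib.util.find_spec(p) is None]


# Hypothetical dependency list; replace with the application's own.
required = ["json", "sqlite3"]
print("arm64 host:", running_on_arm64())
print("missing:", missing_packages(required))
```

Anything reported as missing on the Arm host is a candidate for the situation Jeff describes, where a library has not yet been built for the Arm architecture.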

And I have to say I’ve never heard a complaint. I’ve seen awesome benchmarks. Our customers…it’s actually easier for our customers to share benchmarks than it is for us because the customers own the most interesting, compelling real-world use cases. And the customers can say, “Under our real-world use case, this is what we measured.” And the customers can be fully objective, they don’t have any particular slants. And they’ve got no reason to point things one way or the other. Any time we publish a benchmark, everybody can look at it with, “Are you…is this legit or is this…are you slanting this a tiny bit?” And the reality is, we’re very, very careful when we publish numbers to make sure that there’s reproducibility all the way back to the bare metal of any of these numbers.

When you see the numbers in my Prime Day post, those weren’t just, like, we looked at a dashboard and snapshot it and said, “Yeah, it looks like 37%.” We have to capture the data. We have to run the math. We generally have a couple different teams look at it to verify it. It has to go all the way through an approval process to say, “Are we 100% confident that these are the real numbers?” And that means that we don’t always wanna work that process for each and every number that we have to share. The customers can be smaller, and they don’t have to go through as much of a… it’s not a bureaucracy. It’s just to make sure that we’re being as honest and fair as we possibly can be. And accurate. Most of the benchmarks are great to watch.
Rahul

No, absolutely. We actually recently did a Java benchmark for a bunch of our workloads. And yeah, I mean, the price performance, and just overall performance of those…of the Graviton 3s, the C7gs are, you know, absolutely mind-boggling, compared to what we were getting on the x86s.
Jeff

It’s hard to believe we’re now in third generation after just a few years.
Rahul

Yeah, I think 2018 was when Graviton was launched. Graviton2, I think, came 2019, if I’m not mistaken. So, yeah, it’s been three generations in just, what, four years which is pretty remarkable.
Jeff

One of the fun things I’m looking at to learn some more about tomorrow is there’s this fascinating idea with…especially with chip design. Generally, applies to a lot of technology but chip design especially is that each generation of chips gets to actually design its successors. And so, the better you can run any kind of simulations and the better your electronic design automation software works, the faster you can do that, the faster you can iterate. So, there’s this…it’s almost recursion in reverse, where you’re recursing back into time and saying, “Well, these chips were designed by generation N-1, and N-1 was designed by N-2. And at some point, we had no chips. So, we did things by hand or we had things built out of transistors, and those transistors were ultimately replacing things built from vacuum tubes.”

And it’s fascinating to think of that dependency of how whatever you do depends on what came before and who came before.
Stephen

I think I’ve heard several science fiction films start with this premise of the chips make themselves faster. And I just posted an article into the chat about…funny enough, you mentioned Arm is saying how they’re using the Gravitons. They’ve shifted their EDA loads onto Graviton. So, Arm using Gravitons to design better Arms and that’s just a self-reinforcing positive feedback loop.
Jeff

It actually is. And this is not just theoretical, and not just at the design stage either. A few years ago, one of the coolest places I’ve gotten to visit in my life was I got to visit a semiconductor fab. And it was actually jointly owned by Intel and Micron, if I remember. And it’s so interesting and so complicated. There’s all these robots and all these machines but if you look inside any of the machines…like, there’s one company that makes a lot of the fab machines, that actually [crosstalk 00:53:59] Say again.
Rahul

ASML. ASML runs most of those, I think.
Jeff

And if you carefully peer through the vents into these machines, of course, it’s just more chips. And so, there’s this fascinating cycle of how those chips are now building their successors. And now Stephen, coming back to your science fiction point. The other thing is, any time you read the story of like, “Oh, and then we rebuilt civilization,” it’s actually really hard to do that, you know. It’s taken us many thousands of years to get where we are. And to think that, “Oh, yeah, we just rebuilt everything from the plans or whatever.” Not gonna happen.
Stephen

It would be really, really hard to bootstrap that from nothing, from raw materials.
Jeff

Exactly.
Rahul

One question that I had which, you know, we’re now starting to see a lot of enterprise customers ask is around energy and energy consumption and, you know, the greenness of their workloads, so to speak. So, I started looking at this some time ago. The thing that I’ve always struggled with is that there isn’t a good standard or a measure or a metric for it. Historically, we always used TDPs when it came to Intel-based machines to understand what kind of power can…like, TDPs were like a proxy for what kind of power consumption they had. You know, the larger the processor, the greater the processor speeds, the higher their TDPs and, you know, that was your proxy for how much energy got consumed.

AMD for a while kind of followed that and then suddenly, AMD decided on a completely different way of measuring TDP. So that kinda went out of the window. And now, both Intel and AMD do completely random stuff, depending on the chip generation that they’re talking about. So that metric is again, completely useless to predict the energy consumption. How does AWS get its energy…I know that there’ve been conversations about, you know…like, even James Hamilton, when he talked about the Graviton and the energy consumption of it, he talks about a 60% better power consumption, or lower power consumption. How does AWS measure that? Like, what are the metrics that are looked at when you’re declaring something like energy consumption?
Jeff

Oh, wow. So, this could be an entire segment all by itself. It’s something that’s incredibly important to us and to our customers. We do realize that when we’re building these incredibly large-scale data centers, that the outside perception is, wow, that’s just like burning tons of really precious energy and it must generate…throw off lots of carbon and a huge carbon footprint. We’re doing everything possible to switch to totally renewable energy. We’re always looking for ways to add efficiency. One thing we launched last year is something called the Customer Carbon Footprint Calculator. I think that’s the official name of it. And that attempts to give a very accurate representation to our customers of the carbon footprint of their AWS usage. We continue to add details. And again, this is something you have to get right.

We can’t just make this fuzzy, imaginary number. We try to account for…yeah, there we go, Customer Carbon Footprint Tool. We try to account for all of the different types of consumption. I think there’s three different levels. And I think we currently account for the first two that are, like, the direct and the indirect. There’s a third one that has a lot more accounting, where you need to take into account different aspects of your supply chain as well. I know that the general thinking is that cloud computing can be and generally is more efficient than running your individual data centers because we’re building new data centers all the time. For most companies, a new data center is…how often you build one of those? Every 10 years, 20 years? We’re building new regions and new AZs all the time. So always learning and trying to make those better and better.
Rahul

Got it. But specifically around the Gravitons, you know, like, the fact that you now have three sockets on a board and you’re able to get one Nitro to kinda manage the three sockets. You talk about 192 cores being managed over that…yeah, there are tons of efficiencies that automatically come from that kind of, you know, architecture, and you’re fitting it all in 1U, if I’m not mistaken. So, this entire motherboard kind of slots into 1U. But are there any specific metrics or numbers or, you know, the mechanisms by which you measure the processor efficiency or energy efficiency of those?
Jeff

Hang on just one sec there. All right. I’m sure we have those internally. I don’t have anything to share with you.
Rahul

Okay, great.
Stephen

It would be interesting as some counterfactual of this is what it would’ve taken to accomplish the same thing on a Xeon or something like that. But that’s very involved to calculate.
Rahul

I can imagine. Yeah, I think we’re almost coming up on time. And Jeff, I know you’re incredibly busy. But thank you so much for coming over. And we have so many other things to talk about as well. We really hope you can come and join us another time. We love [crosstalk 00:59:40]
Jeff

Absolutely. For sure. Well, you both know where to find me. So, I’m more than happy to wanna do that again.
Stephen

Thank you again. This was really, really fun. And is there any final things we should mention?
Jeff

You know, one thing I didn’t mention is that if you are doing something similar to your own Prime Day, we have an entire team called IEM, Infrastructure Event Management, and they’re gonna help you with architecture and scaling guidance. And they will virtually sit next to you during your event and watch the metrics and, you know, point out anything that’s not looking right and make sure that you’re gonna have a successful event.
Rahul

Great. And if I’m not mistaken, this is tied to enterprise support and seems like a very, very invaluable asset to have if you’re having any big event.
Jeff

Agreed.
Rahul

Awesome. Thank you so much, Jeff. And thanks for taking the time to talk to us today.
Jeff

Any time.
Stephen

Thank you. All right, we’ll see you next time.
Announcer

Is your AWS public cloud bill growing? While most solutions focus on visibility, CloudFix saves you 10% to 20% on your AWS bill by finding and implementing AWS recommended fixes that are 100% safe. There’s zero downtime and zero degradation in performance. We’ve helped organizations save millions of dollars across tens of thousands of AWS instances. Interested in seeing how much money you can…