AWS Made Easy

Ask Us Anything: Episode 7

For today’s episode, Rahul and Stephen reviewed a few of the most recent announcements from AWS.


Transcript

  1. Stephen

    Hello, everyone, and welcome to “AWS Made Easy.” This is “Ask Us Anything” and we’re doing a “What’s New Review” segment this week. How are you doing, Rahul? How was your weekend?

  2. Rahul

    Doing very well, had a pretty amazing weekend. We went up to the mountains, Mount Baker. And for us who come from the subcontinent where snow is, you know, one of those things you rarely ever get to see or witness, I think it was awesome. The kids had an amazing time. And, yeah, it was a fun, relaxed weekend with a bunch of hikes.

  3. Stephen

    Oh, fantastic. We spent some time over at Snoqualmie Pass. And so, I think my two little ones, you know, born and raised in Australia, I think they have a similar experience with snow, basically meaning non-existent.

  4. Rahul

    I have some nightmare memories from Snoqualmie. I have a YouTube video that’s floating around somewhere of me trying to ski for the first time in my life and falling off or dismounting ungracefully from the ski lift and I hope no one finds that one.

  5. Stephen

    It’s a bit of an art to do that. Actually, this would be an embarrassing story. So, if my sister Bianca is watching, I’m really sorry. We were both learning how to ski; I was about 13 and she must have been about 6 or 7. And we were both shuffling up to the chairlift and somehow I got separated from her and she got on the chairlift by herself and I was in the chair behind her. And she screamed the entire way up the mountain and not just a little…like it was a full-on terror, panic. I thought she was gonna cause an avalanche, that poor kid. For what it’s worth, she did become the best skier in the family.

  6. Rahul

    That’s good to hear. Awesome. I’m actually looking forward to the conversation today; there are a bunch of very interesting announcements that came up. And, yeah, looking forward to the discussions from the audience. Please feel free to ask questions. All yours, Stephen.

  7. Stephen

    All right, let’s jump into it. All right. So this is, “Split into train and test datasets in a few clicks with Amazon SageMaker Data Wrangler.” Now, this is really…okay, general availability of splitting data into train and test splits with SageMaker Data Wrangler. So, for a bit of background, and I used to be a practicing data scientist: when you’re building machine learning models, you want to split your data into training and testing and then validation data. And the idea is, if you train a model and then test it on the exact same data it got trained with, it’s going to look very accurate. But what you’re really interested in is predicting things outside of your training data.

    So, for example, if you’re buying a house, you might feed it a bunch of things like square footage and postcode or zip code, and I don’t know, does it have a view, things like that, all these different things that they call features. And then you put in a bunch of data, which has, say, a price, and then you train it. And then at the end of the day, you have a model and you say, “Okay, here’s some other houses, see if you can predict the price.” And then you hide that price from the model and you see, “Okay, how good are the predictions?”

    And so, there’s a bit of nuance to that because sometimes there’s timing in the data and you want to make sure there’s none of what they call target leakage, or features that could accidentally expose the variable of interest. So, it’s really neat that with SageMaker Data Wrangler, you could actually…you can simplify this step. I thought it was interesting. They’re really addressing some core data science needs here. Data preparation, workflow, selection, cleansing, exploration from a visual interface, and I think that’s really neat.
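For anyone who wants to try the idea Stephen describes outside of the Data Wrangler UI, here is a minimal sketch of a train/test split in Python using scikit-learn. The file name and column names are hypothetical, purely for illustration.

```python
# Minimal train/test split sketch (hypothetical housing dataset).
import pandas as pd
from sklearn.model_selection import train_test_split

houses = pd.read_csv("houses.csv")      # e.g. square_footage, zip_code, has_view, price
X = houses.drop(columns=["price"])      # the features
y = houses["price"]                     # the target we hide from the model at test time

# Hold out 20% of rows for testing; fixing random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```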

  8. Rahul

    Yeah, so I learned one other thing, which is even though machine learning and, you know, this stuff sounds very fancy and sounds, you know, very exciting and bleeding edge, about 90% of the time that is spent by data scientists or by people who are building these machine learning models is actually spent in grunt work. And that grunt work is taking loads and loads and loads of data, you know, cleaning it all up, taking out all the outliers, trying to figure out what features might influence or not influence, run a bunch of experiments to look at the standard deviation of your data, look for biases in the data.

    It is literally…I don’t know, this is absolutely apt, this particular image you pulled up. It is 90% of the grunt work. And everyone thinks, “Oh, I’m gonna build a nice machine learning model and I’m gonna, you know, change the world.” But all these machine learning models are all about the data and the data, more often than not, is incredibly messy and dirty. And just to get it in shape so that the machine learning models can do something reasonable with them, just takes an inordinate amount of effort, an inordinate amount of time in the grand scheme of things. And Data Wrangler actually really helps you with tackling a bunch of that grunt work.

  9. Stephen

    There’s another one that’s also appropriate. You’re absolutely right. It’s funny, I had this vision when I was studying, “Okay, I’m going to be at a whiteboard, I’m going to be proving some theorem about the limits of this algorithm converging to some asymptotic result.” It’s like, “No, no, no, I’m joining CSVs.” I’ll do one more…

  10. Rahul

    Yeah.

  11. Stephen

    I’ll do one more that I found that was very…so this is, “What is a data scientist?” This is what I thought I’d be doing. Under the “What I actually do,” “SELECT spending FROM db.users.” So, I think that is the secret of all data science work because, like you said, you’re spending an enormous amount of time on prep, cleansing, joining, validating. And so, this is pretty neat, you can see Amazon is really addressing the real pain point here.

  12. Rahul

    Yep. And the other thing that, you know, all the machine learning teams…I mean, machine learning is now such a massive org within AWS. But one thing that you see all of these teams kind of trying to do is make sure that all the various steps that you have in the pipeline are being made super easy. If you haven’t already, take a look at, you know, a lot of the functions that the Data Wrangler tool has, there are over 300 functions that you can use out of the box that help you transform the data in a very, very simple, easy manner, create new features or vectors that you can use. They are trying to make it really easy for people to get through this painful data processing stage and data cleansing stage.

    And yes, I think what you can expect to see across all the different SageMaker services (SageMaker is now a portfolio of maybe 25-30 different services) is a bunch of simplification of how you use them. There are a bunch of, you know, autopilot-like mechanisms where the tool runs, it finds a bunch of patterns for you, and then you get to make simple choices. If you haven’t looked at it…I think this all started back in 2019 when SageMaker Autopilot first launched, what we did…or what was done as part of Autopilot was you could actually run a whole bunch of different models, you know, just completely automatically without you having to figure out how to do, you know, all the fine-tuning of all these models and do hyperparameter tuning and all of that stuff. You could just let the model run.

    And the output of it, I thought was pretty amazing because the output told you for each of the models that were candidates, what each one cost in terms of time, what each one cost in terms of money, what is the cost of running all the compute behind it to create the inferences. And third is the latency requirements. So, depending on what your trade-offs are for your problems, you can decide between the accuracy, the cost, and the latency. You can pick the model that fits your use case the best. And most people don’t realize it, they always think in terms of accuracy. But accuracy isn’t always the only parameter you care about, you do care about the latency and the cost in a bunch of cases. Like there are cases where you want your model to return back an answer within, you know, 100 milliseconds because that’s the critical decision you’re making. You don’t need the accuracy to be 99% or the confidence score to be up in the 90s. Eighty percent is good enough, but you just need that super-fast decision-making.
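As a rough illustration of the Autopilot workflow Rahul describes, here is a hedged boto3 sketch of launching a job and listing its candidate models. The bucket, role ARN, and target column are placeholders, and a real setup involves more options (job config, objective metric, and so on).

```python
# Hedged sketch: launch a SageMaker Autopilot job and list its candidate models.
import boto3

sm = boto3.client("sagemaker")

sm.create_auto_ml_job(
    AutoMLJobName="house-price-autopilot",
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                        "S3Uri": "s3://my-bucket/houses/train/"}},
        "TargetAttributeName": "price",   # the column Autopilot should learn to predict
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/houses/autopilot-output/"},
    ProblemType="Regression",
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerExecutionRole",
)

# Once the job finishes, compare candidates on the trade-offs Rahul mentions
# (objective metric vs. the cost and latency of hosting each model).
for c in sm.list_candidates_for_auto_ml_job(AutoMLJobName="house-price-autopilot")["Candidates"]:
    print(c["CandidateName"], c.get("FinalAutoMLJobObjectiveMetric"))
```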

  13. Stephen

    Yeah, that’s a really interesting point that you get the…well, as with most principles in nature, you have diminishing returns and that’s true with machine learning models, right? At some point, you don’t need to be 10 times more expensive and have an extra 50 layers in your network to get that extra couple of percent. And sometimes you do and you can afford that, but a lot of times you don’t, you just need, you know, what’s good enough right this second.

  14. Rahul

    Exactly.

  15. Stephen

    Cool. All right. Well, here is…anything else we want to say on the split data into train and test sets?

  16. Rahul

    No, I just love Data Wrangler. You guys should try it, use it, give us feedback, we’d love to know what your experience with Data Wrangler has been. From my old world of not having tools like Data Wrangler, this has been a pretty big game-changer, it simplifies so many things.

  17. Stephen

    Well, and even for auditability. Now, I remember a long time ago, I think this was back in my academic days, I would see these professors pass back and forth, “Okay, finaldraft.latex,” because we were writing our papers in LaTeX. “No, this is the actual finaldraft.latex,” “Actually, we revised it.” And they would pass all these files back and forth on Dropbox and this is…I was trying to introduce things like version control and whatnot. But I remember a particular professor said, “Wait a minute, all the numbers in my table have changed, what happened?” Because we were just getting to the time where we had some automatic tables being generated but we didn’t have all the versioning mechanisms in place.

    And I know where AWS is going with SageMaker Data Wrangler and just in general, they want every model to be auditable to know, “Okay, given the model, where did the data come from to train it?” And for the higher-level features, how are they constructed with the lower-level data to really audit the whole thing back from the data to the trained model to the predictions. And especially as these things become more and more a part of our lives, we’re gonna want those paper trails both for legal reasons and for intellectual reasons. I think it’s a neat thing that they’re enabling, they’re doing all the grunt work to be able to actually use this stuff in important and regulated environments.

  18. Rahul

    The versioning is actually a very, very good point to bring up. Because I think as you do a lot of this work, you realize that it’s all about experimentation. You run hundreds of these experiments, you don’t know what outcome you’re gonna get, and sometimes you have to roll back, trace back to the previous step, do something different. And it’s all these crazy paths you take down, a spider web of, you know, experiments, that being able to version it and go back to a particular version, being able to trace, you know, why certain changes…you know, why you’re going down a particular path, is incredibly valuable. And back to what you said, I have had an infinite number of final-final-final versions of my dataset where, you know, we’ve just literally replicated and copied that dataset over and over and over again and then completely lost track of, you know, where we came from during those experiments. So, yeah, an incredibly valuable tool to have.

  19. Stephen

    As an example of how easy that can be, I remember I was doing a consulting project for a university and they wanted to predict student performance partway through the semester so they can figure out which students could maybe benefit from some extra outreach, some extra tutorials. And one feature we tried to use was the date that they registered for the class prior to the start of the class. And the idea that…you know, the narrative you start building in your head, “Okay, students who want to plan ahead more and they’re more responsible, they’re gonna register a couple of weeks in advance, and maybe a student who registered the day before they started is not that responsible.” But then it turns out that that variable was extremely predictive when it had negative numbers. How in the world does it have negative numbers?

    And it turns out that the admissions office for a certain group of students was retroactively enrolling them after they had already passed and they were in this funny program where they were kind of guaranteed a pass and so they get retroactive enrollments. And so, this feature looks like it was way more predictive than it actually was. But it took us a while to actually…it actually took us marching down to the admissions office and saying, “What’s going on here? This makes no sense,” and for someone to explain the data generating process. But like you said, we’d run so many different experiments at that time, just trying to keep all that straight plus, you know, 100 other stories just like it, you really need some good tools. And at the time, we didn’t have them but I’m glad that we do now.

  20. Rahul

    Yeah, I agree.

  21. Stephen

    Well, let’s do another SageMaker one. And I actually put this URL into the chat so that people can follow along and I’ll put the next one in there as well. So, this is the SageMaker Data Wrangler, and now, “SageMaker Canvas accelerates onboarding with new interactive product tours and sample datasets.” So, this is what you had mentioned earlier.

  22. Rahul

    Yeah, back to the simplicity that the SageMaker team is bringing to the product. They’re making it easier and easier with very few clicks. This is almost like a WYSIWYG version, you know, if you remember Visual Basic from back in the day, you know, the WYSIWYG editors and stuff like that. They’re gonna bring WYSIWYG, of sorts, to SageMaker Canvas. And it is a pretty interesting tool where if you just want to play around with some datasets that you already have, just get going without having to set up a bunch of, you know, super big, heavy-duty pipelines, SageMaker Canvas is pretty neat in, you know, getting you started: understanding the basic concepts, getting some basic models out, starting with ML. And as you build more complex datasets, as you need to, you know, kind of bring in data from different data sources, build a data lake, do all of that stuff, you can advance to more complicated scenarios, but SageMaker Canvas is actually a great way to get started.

  23. Stephen

    That’s really neat. And even just the idea of having a good sample dataset…because there’s one…you know, what we used to do is just grab random things off of Kaggle. But Kaggle datasets vary in quality, sometimes they’re good and sometimes they’re bad, and they’re not too easy because the whole point is the competition. Actually, this reminds me of another story. At one point, I was doing data science for a research hospital and I got a grad student intern. And, you know, the first question…she’s coming, again, from an academic background, she said, “Oh, can you hand me the data set on a thumb drive and I’ll have a look?” This is a hospital, there’s no, you know, hospital.csv that has everything you want.

    That was my daily pain for the six months before she got there: “I’m gonna join this data from this silo and this data,” and then plus, it’s a hospital so you have to deal with PHI and all that stuff. So, being able to have…and like they said, they have medical datasets like readmission for diabetic patients. I mean, readmission is an enormously costly area, and to be able to predict this…I mean, this is a huge area in most of medicine, it’s all about reducing readmission because that’s a huge expense. And so, being able to have a data set that you can play with and kind of get up to speed without worrying about those other issues, it’s really a neat idea.

  24. Rahul

    Yeah, and actually having a bunch of these predetermined or pre-available datasets for you to get started is incredibly valuable because like we were talking earlier, all machine learning models are about data and you can spend so much time just grappling…and I think that’s why they call it Data Wrangler. The amount of time that you can spend just grappling with the data and giving up because it’s so damn hard is…I mean, it just makes sense to have some of these datasets be available so you can get started with, you know, some of the machine learning stuff, get started with something, and then sometimes you can even figure out what kind of data you want to have from the existing datasets. Like, you don’t have to learn and experiment with modeling from scratch.

    If you know a certain kind of data set works well with certain kinds of features or vectors, then you’ll know that wherever you pull data from for your use case, you want to have a similar vector set, at least make that your starting point. It may not always work but it’s a reasonably good starting point. And then from that point on, you can run the models again, train it, figure out whether the accuracy or the confidence scores, you know, continue to remain good, if not, then you can start looking for what’s different. And I think that approach is a lot easier than trying to literally reinvent or invent your machine learning model from scratch.

  25. Stephen

    And this gives you the flexibility…well, it takes care of everything else for you so you can focus on the stuff that’s really important. And just like we talked about businesses should focus on their core competency, data scientists should focus on their core competency, which is not designing APIs and CloudFormation templates. It’s picking good features that are proxies for the…that should have some predictive power. I was just looking at this paragraph here, it even comes with CloudFormation templates. I was thinking back to a project I did where I was doing a…it was another medical prediction problem and most of my time was spent fiddling with CloudFormation.

    And the actual problem itself wasn’t that complicated but a big bulk of…and again, it was a consulting project, it was all billable, but still, it was…I think we could have spent a lot more time on the algorithms if we didn’t spend as much time doing CloudFormation from scratch. And again, at the time, I didn’t know it very well, so I was kind of learning as I went, and having a template ready to go, that would have been really helpful. And as you can see on LinkedIn, there’s a difference in value between a data scientist who can put together a toy problem in a Jupyter Notebook versus one who can actually put something into production. Now, that doesn’t have to be like a Netflix-level deployment, but just putting something into production that someone else can use and it seems like this is really empowering that kind of data science.

  26. Rahul

    I completely agree. By the way, I’m not a big fan of CloudFormation in general because, you know, for anything non-trivial that you want to build, CloudFormation just, you know, makes your life hell and I’m just being very blunt about it. It helps to have next-gen tools like CDK that do the CloudFormation under the hood for you but that’s kind of why it’s neat that it addresses both of these for everyone so you can get started without any roadblocks. It would be interesting to look at the script to see how easy it is to modify and kind of tweak it to your own needs, whether it be the instance types and sizes, or where it gets deployed or, you know, what kind of permission schemes you would need to give the cluster or the setup for your data lake and, you know, wherever your data resides. So, that would be an interesting exercise, which I’m probably going to do right after this live stream. But like we were talking earlier, it’s great that you have something to get started with. I think the barriers to get started are being reduced every day at AWS, and the fact that you have a template to get started with is awesome.

  27. Stephen

    Yeah, definitely. I agree with what you said on…I would rather do CDK than CloudFormation but I’d rather have a template than a blank file.

  28. Rahul

    Yeah, I wouldn’t know how to get started with CloudFormation from scratch anymore. Like, I would probably give up.

  29. Stephen

    We have a question from a LinkedIn user, it’s a bit tangential to our normal focus, should we have a look at it?

  30. Rahul

    Yeah, let’s go ahead.

  31. Stephen

    So it says, “How can someone start a career in ML but can’t find a chance to start?”

  32. Rahul

    Yeah, go ahead.

  33. Stephen

    No, you go ahead.

  34. Rahul

    I mean, these days, the demand for ML specialists or people with some ML experience is so great. One of the ways that a lot of organizations are recruiting for these people is via a lot of competitions that you see online, Kaggle is a great example of it. AWS has their own…you know, everything from DeepRacer competitions to just a bunch of other machine learning contests that you’ll find online. So, I’d say start with a lot of these competitions, start participating in them, because they make datasets available to you and then it boils down to honing your skills and how you do feature development for those datasets and how you build out those models. You can actually be on a pretty accelerated curve using that mechanism. Competitions would be my advice, go start participating in a bunch of these online contests and start building your skills that way because everyone that I know of is recruiting out of those competitions. You show up on the list there, you will get hired.

  35. Stephen

    And I’ll add a few things. In addition to competitions, just do things that build up your public profile, right? So, you can use AWS and start a blog, have a public GitHub repository, have a Twitter feed, just build up a public profile. And not everything you do has to be perfect, so you can just say, “Hey, I’m just exploring,” but just try it out and talk out loud about your learning. Now that hopefully COVID is tapering off, I got a lot of value out of meetups. So, depending on where you are, look at meetups. And there’s data science in different areas, I went to a few data science and healthcare meetups. And actually, look, if you have a local university…so, often, there’s meetups that are sponsored by tech companies but then universities also have meetups that, although they’re advertised to students, are open to anybody.

    And so, if you’re not an official university student, just go show up and introduce yourself, shake hands. A long time ago, I was interested in a company’s operations and the machine learning and I just found the person on LinkedIn and wrote them and said, “Hey, can I be an intern for a couple of weeks for free?” And if you have the bandwidth to do that, that’s an option…I mean, there are lots of ways. I think you’ve got incredible tools at your disposal, AWS has a pretty good free tier, just start putting your work out there and experimenting. And like I said, now that there’s these datasets that are clean and public and ready to go, have a look at that and, yeah, do some analysis, post it in public, and then also just invest a little bit of time in data visualization because that’s part of that whole package. And so, being able to present your results in a compelling way is also useful.

  36. Rahul

    I think I completely agree with that. One of the things that AWS has done in general is kind of democratized all of these otherwise very niche, you know, high-tech kind of operations. Like, machine learning 10 years ago was not really in the realm of, you know, a student in a dorm or, you know, anyone wanting to get into machine learning. Today, with datasets available as an S3 dump that costs literally nothing per GB, and compute that’s available by the hour, the only thing that you really need to invest in is the time and the effort, and the rest of the resources actually become incredibly cheap to get started with.

    So, if you can spend the time, it’s actually really easy to learn fast, there’s tons of resources out there. I would also say reach out to some of the AWS heroes or the AWS community builders around machine learning. Reach out to us, you know, we have a bunch of AWS community builders within our own organization, we’d love to, you know, put you in touch with them, and give you a helping hand in getting started. But, yeah, there’s an amazing community of, you know, machine learning specialists who are more than happy to share their experience, to share their learnings, and help you get started. So, yeah, reach out to us and we’d love to, you know, connect you to all the right people.

  37. Stephen

    You couldn’t get a better answer than that. Well, let’s take a quick 30-second break and then we’ll come back and we have more great results to discuss.

  38. Woman

    Public cloud costs going up and your AWS bill growing without the right cost controls? CloudFix saves you 10% to 20% on your AWS bill by focusing on AWS recommended fixes that are 100% safe with zero downtime and zero degradation in performance. The best part? With your approval, CloudFix finds and implements AWS fixes to help you run more efficiently. Visit cloudfix.com for a free savings assessment.

  39. Stephen

    So, I think this one is pretty fun and it’s a great application of machine learning, so a fun segue. “Amazon Transcribe now supports automatic language identification for multi-lingual audio.” So, this is interesting. So, we use Transcribe, I want to get our pipeline perfectly ironed out to use Transcribe for this very show. But Transcribe is interesting because you just give it a video or an mp3 file, and it will try and extract the contents into text. Now, this is interesting. If your audio recording contains more than one language, you can enable multi-language identification and this identifies all languages spoken in the audio file and creates a transcript for each identified language.

    And that means if speakers change the language mid-conversation or if each participant is speaking a different language, your transcription output detects and transcribes each language correctly. Until now, Transcribe would only detect the dominant language. So, thinking of examples, so I’m very lucky my grandmother who’s visiting from the East Coast, she speaks mostly…her English is fine, but she’ll default to Spanish. And so, if I asked her a question in English, she’s going to answer in Spanish and vice versa. I’m sure, Rahul, you’ve seen the same thing with maybe all the people in your life, they default to one language and they can switch seamlessly.

  40. Rahul

    Yeah. In fact, I think outside of the U.S. in general, in most places around the world, people are multilingual. People usually know at least two, if not three languages. Where I come from in India, I think on an average, folks speak maybe three languages. In my family, I have folks who can speak up to seven languages. So, it’s very common to have people in multilingual setups. For our organization in particular, so more on the work side of things, we have folks from 130 different countries.

    And it’s not just the dominant language that matters, but it’s also the accents that matter when you’re talking, you know, when your native language is something else other than English, being able to transcribe all of that knowledge and all of those conversations becomes quite a challenge. Like, there is…I mean, I know AWS has a few models, like, you know, Indian English or they had these hybrid ones where…we do end up using a lot of our Hindi or, you know, other regional languages as part of our English conversation. So, the fact that you can now automatically detect all the different languages because you can have more than one speaker in a conversation and be able to detect all of those words that people use from different languages, that’s pretty neat.

  41. Stephen

    So, I tried this out and I want to show…you know, when I first saw this announcement, there was one thing that I thought of immediately. Now, this is…I’m gonna play this bit of media to set the scene. Now, growing up as a kid, I watched a lot of “I Love Lucy” reruns and this is one of I think the funniest scenes ever shot on television. Okay, so I’m gonna add this to the screen, and then we’re gonna see how Transcribe handles this.

  42. Policeman 1

    [foreign language 00:32:40-00:32:48]

  43. Policeman 2

    [foreign language 00:32:49]

  44. Man 1

    [foreign language 00:32:53-00:32:59]

  45. Ricky

    [foreign language 00:33:00]

  46. Lucy

    What? What? What?

  47. Ricky

    Very good idea?

  48. Lucy

    What is?

  49. Ricky

    Oh, this guy can only speak French. See, now, this other cop here, he speaks French and German. And this fellow, he speaks German and Spanish.

  50. Lucy

    Oh.

  51. Ricky

    So, he’s going to ask the questions and we’ll translate them to you.

  52. Lucy

    Oh, good.

  53. Ricky

    [foreign language 00:33:21]

  54. Man 1

    [foreign language 00:33:22]

  55. Policeman 2

    [foreign language 00:33:23]

  56. Policeman 1

    [foreign language 00:33:24]

  57. Policeman 2

    [foreign language 00:33:26]

  58. Man 1

    [foreign language 00:33:28]

  59. Ricky

    Where did you get the money?

  60. Lucy

    I got it from a man on the street but I didn’t know it was counterfeit.

  61. Ricky

    [foreign language 00:33:38]

  62. Man 1

    [foreign language 00:33:41]

  63. Policeman 2

    [foreign language 00:33:46]

  64. Policeman 1

    Huh.

  65. Policeman 2

    Huh.

  66. Man 1

    Huh.

  67. Ricky

    Huh.

  68. Lucy

    It’s true.

  69. Ricky

    [foreign language 00:33:57]

  70. Man 1

    [foreign language 00:33:58]

  71. Policeman 2

    [foreign language 00:34:00]

  72. Policeman 1

    [foreign language 00:34:01-00:34:09]

  73. Policeman 2

    [foreign language 00:34:10-00:34:16]

  74. Man 1

    [foreign language 00:34:17-00:34:23]

  75. Ricky

    I want you to know something that the penalty for counterfeiting is imprisonment, life imprisonment and hard labor.

  76. Lucy

    Ugh.

  77. Ricky

    Ugh. No, no, no.

  78. Stephen

    All right, all right. So, I decided just for a bit of fun to take that video and put it into Transcribe and see what happens. So, just to show you how easy this is, all you really have to do, you create a transcript and…oh, it’s Amazon. Sorry, let me go back. You go to Amazon Transcribe…oh, what’s happening here? Oh, did I get signed out? No, I didn’t. Okay, so you create a job, I called it “I Love Lucy,” and you can do this, “Multiple languages identification,” right there. So, that’s the feature that we’re just talking about. And so, in this case, you want English U.S. and you can select…what else is there? There is French and German and Spanish.

    Okay, so then what do you do? You specify where the video file is in S3, you tell it to put it into a bucket, it can even give you subtitles. So, I wanted to show you what happens out of that. So, this is what you end up getting. You get this really cool JSON file and look at the distribution at the end. It’s figured out, here we go. It found U.S. English had 108 seconds in that clip, Spanish was 82-83 seconds, French was 57. I found it didn’t quite pick up the German because the German was always sandwiched between the English and the French. But I’ll show you what…so, what I did, this will actually generate a subtitle for you. So, if I enable the subtitles, let’s see what we get. Here we go.

  79. Man 1

    [foreign language 00:36:37]

  80. Policeman

    [foreign language 00:36:39]

  81. Stephen

    So, it got the French.

  82. Lucy

    I am innocent.

  83. Ricky

    [foreign language 00:36:51]

  84. Stephen

    The Spanish.

  85. Man 1

    [foreign language 00:36:54]

  86. Policeman 2

    [foreign language 00:36:56]

  87. Policeman 1

    [foreign language 00:36:59]

  88. Stephen

    Now, it got its French.

  89. Policeman 2

    [foreign language 00:37:03]

  90. Man 1

    [foreign language 00:37:07]

  91. Stephen

    Back to Spanish.

  92. Ricky

    Good, if you pay the rest of the bill, you can go.

  93. Lucy

    Hooray.

  94. Ricky

    [foreign language 00:37:14]

  95. Man 1

    [foreign language 00:37:15]

  96. Policeman 2

    [foreign language 00:37:16]

  97. Stephen

    All right. So, pretty good. For a conversation that switches from English to Spanish to French to Spanish to English, it was surprisingly good, I thought.
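For reference, the console steps Stephen walked through earlier map to a single API call. Here is a hedged boto3 sketch of the same multi-language transcription job; the job name, bucket, and file key are placeholders based on the demo.

```python
# Hedged sketch: a Transcribe job with multi-language identification and subtitles.
import boto3

transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="i-love-lucy",
    Media={"MediaFileUri": "s3://my-bucket/i-love-lucy.mp4"},
    IdentifyMultipleLanguages=True,                        # the new feature being discussed
    LanguageOptions=["en-US", "es-ES", "fr-FR", "de-DE"],  # candidate languages to consider
    OutputBucketName="my-transcripts-bucket",
    Subtitles={"Formats": ["srt"]},                        # also generate subtitle files
)
```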

  98. Rahul

    Yeah, actually, the natural language detection and the transcription are becoming more refined with every iteration. And, for us, one of the big initiatives that we’re doing internally is we have folks in 130 different countries who all collaborate across our group and bring in all of their perspectives into problem-solving, which is the thing that I love the most about my job, working with all these amazing people. The thing that we’ve always lacked is, you know, we do hundreds of meetings in a day across all the different teams, but then we don’t have a mechanism of capturing that knowledge, that essence of those conversations.

    So, one of the big pipelines that we built behind the scenes is whenever we do a Chime or a Zoom call, we stream all of that data, we run it through transcription, and then we save it as documents that then move into Kendra for indexing. And then you can ask a natural language question like, “Is there a conversation where they were talking about Product X, you know, and something else happened, you know, during the conversation?” Or, you know, “Were there these three specific people in a meeting and they talked about Product X?” Or if you want to remember a meeting, you’re like, “I wonder if I can pull up the transcripts for a meeting where I was there, Uznek [SP] was there, and then we had discussed this particular concept or this particular requirement from the customer.”

    Those are the kinds of very, very interesting use cases around organization knowledge bases that Transcribe really enables you to kind of go after now. And Kendra, again, is another one where, to be able to take all of this transcript data…again, this is the NLP space, to be able to draw semantics out of it, to understand what it means at least in some sense, to be able to then answer natural language-based queries, is incredibly powerful. And like I’ve always said, AWS services, you should not think of them as just one service by itself. You should think of it as a whole set of building blocks that are made available to you and then it’s all about how you stitch these together to solve your particular problem. So, Transcribe for me is just one piece of a larger puzzle. It would be very interesting to see how folks are using Transcribe in different places.
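As a loose sketch of the transcript-to-search pipeline Rahul describes, here is what the Kendra side might look like with boto3. The index ID, document ID, and transcript text are placeholders, and the real pipeline (streaming the call, running Transcribe, formatting documents) has many more moving parts.

```python
# Loose sketch: index a meeting transcript in Kendra, then ask a natural-language question.
import boto3

kendra = boto3.client("kendra")
INDEX_ID = "your-kendra-index-id"   # placeholder

# Index one transcribed meeting as a plain-text document.
kendra.batch_put_document(
    IndexId=INDEX_ID,
    Documents=[{
        "Id": "meeting-product-x-weekly-sync",
        "Title": "Weekly sync: Product X",
        "Blob": b"...transcript text produced by Amazon Transcribe...",
        "ContentType": "PLAIN_TEXT",
    }],
)

# Later, search across all indexed meetings with a natural-language query.
result = kendra.query(IndexId=INDEX_ID,
                      QueryText="Which meeting discussed the customer requirement for Product X?")
for item in result["ResultItems"]:
    print(item["DocumentTitle"]["Text"], "-", item["DocumentExcerpt"]["Text"])
```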

  99. Stephen

    Yeah, and from Transcribe, then you want to go to Translate and then you can have everything in one common language base that can then be searched, or even just keep it in its native language and then search it, you know, translate upon search; there’s a lot of different ways to think about that. So, all right, anything else on this segment? I know it was a bit long-winded but I think it’s pretty incredible, I’m kind of blown away that this even exists, right? And that the Hitchhiker’s Babel Fish is not that far away.

  100. Rahul

    True. Okay.

  101. Stephen

    All right.

  102. Rahul

    Let’s go to the next one.

  103. Stephen

    Sounds good. All right, this is attribute-based access control. Now, actually, I saw this tweet the other day, “I see your RBAC and I raise you ABAC.” So, RBAC, role-based access control, and now we have ABAC, attribute-based access control. Rahul, why don’t you summarize this?

  104. Rahul

    Yeah, so role-based access control, you know, is a more user-centric view of the world from an access control standpoint, like you basically define a role for a person and then that person gets, you know, those kinds of permissions. And this is very central to how IAM manages permission schemes today in AWS. It is all about what user you are, what roles you are part of, and what permissions do the roles have. Attribute-based access control is actually very interesting in the sense that you can now have tags for your resources, so you can carve out…or let me step back. AWS has their way of seeing the world, AWS’s way of seeing the world is around resources they have and, you know, things at that level.

    However, your business and the way you think about a particular pool of resources or a particular setup may be completely different. For example, for some of our teams, they look at a project as their entire worldview of…you know, their entire worldview is around the project and the project has budgetary implications, the project has staffing implications, the project has implications around security and all other aspects of it. But the language that everyone speaks is the project, it could be a business unit for someone else, it could be something else.

    But what the attribute-based access control allows you to do is it allows you to set tags for all of these resources or for a set of resources and users and whatever else. And you can then start defining permission schemes based on that rather than setting the permission scheme based on a role. A person might span across multiple different, you know, organizational units or BUs or projects, but you can define what permissions that they have at a time based on the tags that you’ve defined on the resources. So, I think it just gives everyone a lot more flexibility and a lot more ability to kind of move other levers from their perspective rather than stick to the one way everyone sees the world and that’s what’s incredible about ABAC.
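To make the tag-matching idea concrete, here is a hedged sketch of an ABAC-style IAM policy created with boto3: access is allowed only when the caller’s “project” principal tag matches the resource’s “project” tag. The tag key, actions, and policy name are illustrative choices, not a prescribed convention.

```python
# Hedged sketch: an ABAC policy that matches a principal tag to a resource tag.
import json
import boto3

abac_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["ec2:StartInstances", "ec2:StopInstances"],
        "Resource": "*",
        "Condition": {
            # Only allow when the caller's project tag equals the resource's project tag.
            "StringEquals": {"aws:ResourceTag/project": "${aws:PrincipalTag/project}"}
        },
    }],
}

iam = boto3.client("iam")
iam.create_policy(PolicyName="project-abac-example",
                  PolicyDocument=json.dumps(abac_policy))
```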

  105. Stephen

    This reminds me a lot of the cost and tagging API, right? So, Amazon has one way of seeing costs but your way might be different, so you can look at that with tags. And similar to roles, Amazon sees things through an IAM lens, but here’s a way of imposing, like you said, a person-centric or maybe a person can wear different hats at different times and depending on, you know, the philosophical hat they’re wearing, they can have a certain set of attributes applied to them. I think it’s a really neat way of having this flexibility, where you can just impose your own view of things that are really important, both in access control and like we said last week with Steve Brain, the cost allocations. It’s really neat, this flexibility, you don’t have to use just their way of seeing the universe, you can kind of have your own running in parallel.

  106. Rahul

    Exactly. And I think you’ll start seeing ABAC kind of become ubiquitous across AWS services. There are a bunch of standard patterns that AWS, you know, works on and you’ll start seeing this become a theme across the board. There was a time when VPC endpoints were the thing, you know, where everything was public initially, and then everything had to be brought into VPCs. Serverless is another big trend that’s currently ongoing, like every team is trying to turn their services into serverless because this makes it easy for customers. There was a time when IAM had just been launched back in 2010, I think, where teams were starting to adopt IAM and that was a thing. I think ABAC will become one of those things where, you know, leveraging tags to create your permission schemes is actually a very neat way of organizing things. So, I think we’ll see a lot more of this going forward.

  107. Stephen

    Well, I don’t want to sound negative, but if I interacted with IAM less in the future, I wouldn’t be too sad about it.

  108. Rahul

    Absolutely. We’ve all had our IAM horror stories and nightmares. So, yeah, absolutely. I’m not sure that the complexity is going to completely go away. I think it just aligns a little better, so you have to kind of contort your worldview a little less. So, you can directly say, “This project is a permission scheme,” and get done with it. Right now, you have to figure out, “Oh, this project, I need a separate set of roles,” and then this is how it’s going to…you know, there’s a bunch of mappings that need to kind of go in place, this might actually make it easier. But I don’t think all the complexity will go away.

  109. Stephen

    No, I think it’ll be there but this will be a higher-level abstraction, so if you don’t need to deal with it immediately, or if maybe some central manager can manage the IAM roles and just assign attributes, that might be pleasant.

  110. Rahul

    Possibly. So, we’ll learn more as more services implement ABAC and see how that pans out. But for now, definitely a welcome change.

  111. Stephen

    Well, let’s do another quick 30-second break, and then when we come back, we’re going to talk about the R6id instances.

  112. Rahul

    Yeah.

  113. Woman

    Is your AWS bill going up? CloudFix makes AWS cost savings easy and helps you with your cloud hygiene. Think of CloudFix as the Norton Utilities for AWS. AWS recommends hundreds of new fixes each year to help you run more efficiently and save money, but it’s difficult to track all these recommendations and tedious to implement them. Stay on top of AWS recommended savings opportunities and continuously save 25% off your AWS bill. Visit cloudfix.com for a free savings assessment.

  114. Stephen

    All right, so we’re gonna be talking about the R6id. These are some…now I’ll put my screen back on, these are some heavy-duty instances. So, R6id, I’ll put the link in the chat. Here we go. This is the introduction of the R6id: 3.5 gigahertz, Xeon Scalable, 7.6 terabytes of local NVMe-based SSD, 15% better price performance, 58% higher terabytes of storage per vCPU, 34% lower cost per terabyte of memory…let’s see, total memory encryption, ideal for memory-intensive workloads. And you get up to 50 gigabits per second of networking speed and 40 gigabits per second of bandwidth to EBS. Looking at the instances…let’s see. There we go. Looking at the numbers themselves, R6id, so we’re getting to this level, the 16xlarge or the 24 or the 32. Look at that, this is a terabyte of RAM. So, this is 1,024 gigabytes of RAM, a full terabyte, plus four of these NVMes and a whole lot of this…and again, the units are now gigabits per second.

  115. Rahul

    Yeah, I think these are the kinds of instances you want to use for large data processing databases. So, there are a lot of databases where you have tons of stored procs, you have tons of, you know, local processing you need to do, whether that be for data analytics, like you’re building OLAP cubes in your instance. For things like that…or you’re running a Redshift cluster. For those kinds of operations where you need tons of memory, the ratio of memory to CPU is needed to be really high because you want to load up as much data in memory as possible.

    You need a really good CPU because you’re processing a ton of that data, whether it’s building, you know, cubes and dimensions and then being able to access them, or literally just running tons of stored procs because those are processing your data, you know, right then and there within the tables that you have in your DB. You know, in some cases, for example, if you’re running massive Elasticsearch clusters, these instances might be pretty interesting to run Elasticsearch on. I actually see, you know, these instances being used for that. I was surprised to see another Intel variant come up this soon, and with these price improvements, I was beginning to write off Intel, but it looks like they’re coming back pretty aggressively with some of these.

    I don’t know how much further they can push these in…sorry, push on the performance, 3.5 gigahertz across all cores is pretty impressive. I don’t know how far you can push that. There was a recent…I was watching a little review recently, someone’s been able to take one of these processors and, you know, use liquid nitrogen, cooled it to a point where they’re able to, you know, eke out about six or a little over six gigahertz out of those processors but again, it’s not consistent. You know, it’s the peak of the performance boundary that you can push these cores to. Yeah, I don’t know how far you can push this. You know, by looking at the TDPs for these kinds of processors, you’re basically building a heat engine at that point.

  116. Stephen

    I can imagine standing next to this rack is gonna be pretty loud.

  117. Rahul

    Yeah, you could be contributing pretty significantly to global warming with a processor, you know, running kind of that way. But, yeah, I see the need for workloads like that and these numbers look pretty impressive, by the way, just to be clear.

  118. Stephen

    For $10 an hour for a terabyte of RAM and 128 vCPUs, all that SSD space, and all that bandwidth, that’s pretty reasonable.

  119. Rahul

    Yeah. I think the only ones that might actually beat this one…could you pull up the x1e.32xlarge? Both the x1.32xlarge and the x1e.32xlarge, those are also 128-vCPU instances with two and four terabytes of RAM.

  120. Stephen

    The X1e 32 is 2.6 times the price, the same number of vCPUs…

  121. Rahul

    Yeah, but it has four terabytes of RAM.

  122. Stephen

    A lot less SSD…or half the SSD.

  123. Rahul

    Yeah, half the SSD, but four times the RAM.

  124. Stephen

    Okay, so for the absolute most RAM you could possibly get, it’s this, the x1e.32xlarge?

  125. Rahul

    Correct. And what about the x1.32xlarge? Not the x1e, I mean, but can you look up the x1.32xlarge?

  126. Stephen

    Okay, so this is for a more comparable price, 128 vCPUs, twice the RAM, half the SSD, and the network performance isn’t given a number, it’s just listed as high.

  127. Rahul

    Yeah, I think this was about 14 Gbps, if I’m not mistaken, on the X1. X1 did move to the Nitro some time ago, or maybe that would be X1e. I need to go look this one up again, it’s been a while. But, yeah, I mean, these are the kind of use cases. You know, like the X1s are custom made for SAP HANA kind of workloads where you have a lot of, you know, in-memory processing that you want to do. You want tons and tons of memory, you want CPUs to kind of be able to process all of that data quickly and these are perfect use cases for, you know, workloads like that. And there isn’t much difference between the 10 bucks an hour on the new R6id and the x1.32xlarge. So, it’d be interesting to see where you make the trade-off. Does it say what the CPU type is for the x1.32xlarge?

  128. Stephen

    It’s not saying here but let’s see, x1.32xlarge. X1 instances are memory-optimized Xeon E7s, it had some [crosstalk 00:55:01].

  129. Rahul

    Yeah, these are about two generations old, they are Xeons, E7s. And E7s have been out for actually over four years now, if I’m not mistaken, four or five years now. So, these are much older processors, so they probably pull…does it say how many gigahertz they pull on a per-core basis?

  130. Stephen

    It is not saying that. Well, I mean, we can look up that CPU, the E7. It’s not saying it, let’s see. 2.3.

  131. Rahul

    Yeah, so these are 2.3 and the new R6ids are pulling 3.5.

  132. Stephen

    Wow, it says here 22-nanometer. Basically, gigantic at this point. Q2 2015.

  133. Rahul

    Yeah, these are old, these are very old.

  134. Stephen

    The new iPhones are on, what, five-nanometer?

  135. Rahul

    Correct. I think these processors are…these Xeons, the Ice Lake processors that R6id is running on are the latest gen. 3.5 gigahertz is very impressive, so you’ll get way better CPU performance on these instances over even the x1.32xl. Unless your requirement really was insane memory…

  136. Stephen

    It’s interesting to tease apart and that’s where we’re really showing that vCPUs are an approximation, right? Because they both say 128 vCPUs, and does that mean that it takes more Haswell cores to make a vCPU than it does one of these newer ones?

  137. Rahul

    I think they just…so, you basically get 64 cores in the underlying…

  138. Stephen

    Hyperthreaded.

  139. Rahul

    Hyperthreaded cores. So, you get 64 hyperthreaded cores in the underlying processor. So, I think that’s what you get, and in one case, you’ll get 64 2.52 gigahertz…or 2.3, I forget how much that was. But you get 64 of those and in the R6ids, you’ll get 64 of the 3.5 gigahertz cores.

  140. Stephen

    So, in terms of 128 vCPUs, these 128 are going to be a lot faster and a lot more instruction throughput but…

  141. Rahul

    Yeah, it looks like 50% faster.

  142. Stephen

    Yeah. Well, I think we have time for one short one, which is related, it’s the R5n. So, this announces extra availability, the R5n coming to Cape Town and Europe (Milan).

  143. Rahul

    Yep. So, one of the things where…these are interesting use cases where you have tons of data sitting in a data lake and you need to run massive MapReduce kind of operations where the data is the same but you’re kind of operating on it in little chunks and pieces, you have a lot of network transfer or data transfer that happens in these kinds of workloads. And for that, the instance type you want to pick…so the jobs themselves might be tiny, the jobs might be small, but the data transfer of the output of the job might be actually pretty large. In all those cases, you really need amazing network connectivity where the performance that you get over the network almost feels like it’s local disk performance. And when you have use cases like that, using these Nitro-based instances is incredibly valuable. That’s where the N series of instances comes in. And I think it just goes to show that…

  144. Stephen

    And these are the R5n specs.

  145. Rahul

    Yeah, here’s the new R5ns. I mean, just from a network performance standpoint, it’s mind-blowing that you could get 100 Gbps, you know, connectivity to whatever data you have. You couldn’t build your own data center with that kind of connectivity, you know, and sustain that throughput.

  146. Stephen

    And you need somewhere to put that data that can take it in that quickly.

  147. Rahul

    Exactly. I mean, that’s just an insane engineering challenge and AWS has done an amazing job of doing that. Again, back to the point. AWS is doing a lot of the thinking for us about what is the most optimized resource, service, or pattern to use for different kinds of workloads. Every time they come up with something, I have to think about, “Where would I use this?” We are so used to using general-purpose compute and general-purpose resources because the overhead of thinking through and creating absolutely custom stuff is just too high.

    But when you’re building with building blocks, like what AWS provides, it just completely makes sense. You can then have the freedom of picking just the right resource, just the right service that does that one thing for you optimally. So, it moves you from building, you know, these generalist products and generalist, you know, kind of services that you can bake internally to building an absolutely kick-ass, world-class product or service based on very pointed services. And that’s what’s remarkable about the set of services that AWS has.

  148. Stephen

    Yeah, it’s amazing, like the number of tools…I guess the number of different pieces that are different shapes is getting better and better and better. And like I said, it’s easy to always reach for that…or probably now, the new default instance will be the C7g, right? Kind of reach for that by default. But then now, you say, “Okay, well, when do I need this throughput?” Or, “Am I doing this huge thing in memory? Wait, there’s special tools for that.” I don’t have to deal with the constraints of my default instance, I can reach for a specialized instance. You know, I’m definitely not the first person that runs into this issue, and in fact, I’m probably not the 1000th person in the U.S. that thought this through.

  149. Rahul

    Exactly.

  150. Stephen

    All right. Well, it looks like, again, we’re about halfway through the list of articles to get through but we’re 100% of the way through our minutes. I guess that’s a good sign. AWS has a lot going on and it’s really fun to talk about.

  151. Rahul

    Absolutely. For the audience, please feel free to ask questions, post them offline, we’ll definitely bring them up in our next episode, and look forward to seeing you all soon.

  152. Stephen

    All right. Well, thanks, everyone. Have a nice afternoon or evening or morning wherever you are, and we’ll see you next time.

  153. Woman

    Is your AWS public cloud bill growing? While most solutions focus on visibility, CloudFix saves you 10% to 20% on your AWS bill by finding and implementing AWS-recommended fixes that are 100% safe. There’s zero downtime and zero degradation in performance. We’ve helped organizations save millions of dollars across tens of thousands of AWS instances. Interested in seeing how much money you can save? Visit cloudfix.com to schedule a free assessment and see your yearly savings.