AWS S3: Top 5 performance and cost questions

Question 1: How long do you need to keep old data?

Use case	How long do you need to keep old data?
Data lakes	Machine learning / artificial intelligence models are generally better calibrated with larger datasets, but for some models older datasets stop being useful because the underlying patterns have changed. Identify the datasets that lose their value after a certain number of years, mark them with an S3 object tag or a common prefix, and set up an S3 lifecycle rule that expires (deletes) the objects matching that tag or prefix once they reach that age.
Backup and restore / Data archives	You rarely need to access old backups or archives, though compliance may require keeping them for a certain number of years. Set up an S3 lifecycle rule to expire (delete) all backups / archives older than the required number of years. If you use incremental backups, make sure that all files of an incremental backup series expire together, since later increments depend on the earlier files in the series.
Running cloud-native applications	Expire (delete) the logs of cloud-native applications after a certain number of months.
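
The expiration rules in the table above could be expressed roughly as follows. This is a minimal sketch, not a definitive setup: it only builds a dictionary in the shape that boto3's put_bucket_lifecycle_configuration accepts. The helper expiration_rule, the bucket name, the prefixes, the tag, and all retention periods are hypothetical placeholders.

```python
def expiration_rule(rule_id, days, prefix=None, tag=None):
    """Build one S3 lifecycle rule that expires (deletes) objects after `days` days.

    Exactly one of `prefix` (str) or `tag` (a (key, value) tuple) selects the objects.
    """
    if (prefix is None) == (tag is None):
        raise ValueError("pass exactly one of prefix or tag")
    if prefix is not None:
        rule_filter = {"Prefix": prefix}
    else:
        rule_filter = {"Tag": {"Key": tag[0], "Value": tag[1]}}
    return {
        "ID": rule_id,
        "Filter": rule_filter,
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


# Hypothetical retention periods; replace them with your own requirements.
lifecycle_configuration = {
    "Rules": [
        expiration_rule("expire-stale-datasets", 5 * 365, tag=("retention", "5y")),
        expiration_rule("expire-old-backups", 7 * 365, prefix="backups/"),
        expiration_rule("expire-app-logs", 90, prefix="logs/"),
    ]
}

# Applying it requires boto3 and AWS credentials (the bucket name is a placeholder):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=lifecycle_configuration
# )
```

Note that put_bucket_lifecycle_configuration replaces the bucket's entire lifecycle configuration, so all rules for a bucket need to be passed in one call.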

Question 2: What data should you store in AWS S3?

Use case	What data should you store in AWS S3?
Data lakes	Whenever possible, use compressed binary files rather than CSV files; we recommend compressed Parquet files. Make sure you do not store duplicate files.
Backup and restore	Whenever possible, use incremental backups. Use compression and do not back up duplicate files.
Data archives	Whenever possible, use compression and do not archive duplicate files.
Running cloud-native applications	Store the application's static files in S3 and set up Amazon CloudFront to serve them.
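
One way to follow the "compress and do not store duplicates" advice before uploading is sketched below. It uses only the Python standard library: duplicates are detected by SHA-256 content hash and the remaining files are gzip-compressed into a staging directory. All function names are hypothetical, gzip stands in for whatever compression you prefer (it is not a replacement for Parquet in a data lake), and the sketch assumes file names are unique across the input paths.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unique_files(paths):
    """Keep only the first file seen for each distinct content hash."""
    seen = set()
    unique = []
    for path in paths:
        digest = file_digest(path)
        if digest not in seen:
            seen.add(digest)
            unique.append(Path(path))
    return unique


def gzip_file(source, target_dir):
    """Write a gzip-compressed copy of `source` into `target_dir`; return its path."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Assumes file names are unique; same-named files would overwrite each other.
    target = target_dir / (Path(source).name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def prepare_for_upload(paths, staging_dir):
    """Deduplicate by content, then compress each remaining file."""
    return [gzip_file(path, staging_dir) for path in unique_files(paths)]
```

The files returned by prepare_for_upload are the ones you would then upload to S3, for example with boto3's upload_file.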

Question 3: What S3 storage class should you use?

Use case	What S3 storage class should you use?
Data lakes	Data lakes usually have unpredictable access patterns. Use S3 Intelligent-Tiering to optimize storage costs.
Backup and restore / Data archives	The choice depends on how quickly you need the files restored. In general, the longer the retrieval time, the lower the storage price. Use, in order of increasing retrieval time: Glacier Instant Retrieval (milliseconds), Glacier Flexible Retrieval (minutes to hours), or Glacier Deep Archive (hours).
Running cloud-native applications	For production files, use S3 Intelligent-Tiering. For logs, start in S3 Standard and let a lifecycle rule transition them to Glacier Flexible Retrieval after 7 days, or to S3 One Zone-Infrequent Access after 30 days, which is the earliest a lifecycle rule can move objects from S3 Standard into an Infrequent Access class.
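
The log transition described above might look like this as a lifecycle rule. This is a sketch under stated assumptions: the helper log_transition_rule and the logs/ prefix are hypothetical, and the dictionary follows the rule shape accepted by boto3's put_bucket_lifecycle_configuration. The helper also guards against one S3 constraint: lifecycle rules cannot move objects into S3 Standard-IA or S3 One Zone-IA before they have spent 30 days in S3 Standard, whereas a transition to Glacier Flexible Retrieval after 7 days is allowed.

```python
# Storage class identifiers as used by the S3 API.
GLACIER_FLEXIBLE_RETRIEVAL = "GLACIER"
ONE_ZONE_IA = "ONEZONE_IA"
STANDARD_IA = "STANDARD_IA"

# S3 does not transition objects into the Infrequent Access classes
# before they have spent 30 days in S3 Standard.
MIN_DAYS_BEFORE_IA = 30


def log_transition_rule(prefix, days, storage_class):
    """Build a lifecycle rule moving objects under `prefix` after `days` days."""
    if storage_class in (ONE_ZONE_IA, STANDARD_IA) and days < MIN_DAYS_BEFORE_IA:
        raise ValueError(
            f"{storage_class} transitions need at least {MIN_DAYS_BEFORE_IA} days"
        )
    return {
        "ID": f"transition-{prefix.strip('/')}-to-{storage_class.lower()}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": days, "StorageClass": storage_class}],
    }


# Logs stay in S3 Standard for 7 days, then move to Glacier Flexible Retrieval.
rule = log_transition_rule("logs/", 7, GLACIER_FLEXIBLE_RETRIEVAL)
```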

Question 4: What AWS region should you use for the S3 bucket?

Use case	What AWS region should you use for the S3 bucket?
Data lakes	Use the AWS region where the majority of the applications reside. For global use, consider S3 Multi-Region Access Points.
Backup and restore / Data archives	Use the AWS region of the source application to minimize data transfer costs. Also compare storage prices, which vary by region, and consider the region with the lowest storage costs.
Running cloud-native applications	Use the AWS region where the majority of the applications reside. For global use, consider S3 Multi-Region Access Points.

Question 5: Do you need to enable object versioning?

Use case	Do you need to enable object versioning?
Data lakes	Yes. However, use a lifecycle rule to limit the number of versions kept of every file and to transition older versions first to S3 Standard-Infrequent Access and then to a Glacier storage class.
Backup and restore	No.
Data archives	Yes. Transition older versions first to S3 Standard-Infrequent Access and then to a Glacier storage class.
Running cloud-native applications	No.
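
For the use cases that enable versioning, the handling of older (noncurrent) versions could be sketched as the lifecycle rule below. The helper noncurrent_version_rule, the day counts, and the number of versions to keep are hypothetical; the keys follow the rule shape accepted by boto3's put_bucket_lifecycle_configuration.

```python
def noncurrent_version_rule(
    versions_to_keep=3, days_to_ia=30, days_to_glacier=90, days_to_expire=365
):
    """Build a lifecycle rule that tiers and then expires noncurrent versions."""
    return {
        "ID": "manage-noncurrent-versions",
        "Filter": {"Prefix": ""},  # an empty prefix applies the rule to all objects
        "Status": "Enabled",
        # Move older versions to Standard-IA first, then to Glacier Flexible Retrieval.
        "NoncurrentVersionTransitions": [
            {"NoncurrentDays": days_to_ia, "StorageClass": "STANDARD_IA"},
            {"NoncurrentDays": days_to_glacier, "StorageClass": "GLACIER"},
        ],
        # Delete a noncurrent version once it has been noncurrent for
        # `days_to_expire` days and at least `versions_to_keep` newer
        # noncurrent versions exist.
        "NoncurrentVersionExpiration": {
            "NoncurrentDays": days_to_expire,
            "NewerNoncurrentVersions": versions_to_keep,
        },
    }


versioning_lifecycle = {"Rules": [noncurrent_version_rule()]}

# Enabling versioning and applying the rule requires boto3 and AWS credentials
# (the bucket name is a placeholder):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_versioning(
#     Bucket="example-bucket", VersioningConfiguration={"Status": "Enabled"}
# )
# s3.put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=versioning_lifecycle
# )
```

For data archives, where the table does not ask for a version limit, the NoncurrentVersionExpiration entry can simply be left out.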