Amazon S3 is a managed object storage service designed to store a virtually unlimited amount of data.
Unlike file storage, AWS S3:
- does not allow appending additional data to an object – each object must be rewritten in full
- does not provide a hierarchical (tree) structure – data in an S3 bucket lives in a flat namespace with a single level of hierarchy, though directories can be emulated with key prefixes.
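As an illustration of prefix-based "directories", a minimal stdlib sketch that groups flat S3-style keys by their first prefix segment (the key names are made up for the example):

```python
# Group flat object keys by their first "directory" prefix segment.
# S3 itself stores these in a flat namespace; the "/" is just a
# character in the key that consoles and SDKs treat as a delimiter.
from collections import defaultdict

keys = [
    "logs/2024/app.log",
    "logs/2025/app.log",
    "images/cat.png",
]

by_prefix = defaultdict(list)
for key in keys:
    prefix, _, rest = key.partition("/")
    by_prefix[prefix].append(rest)

print(dict(by_prefix))
# {'logs': ['2024/app.log', '2025/app.log'], 'images': ['cat.png']}
```

This mirrors what the S3 `ListObjectsV2` API does when you pass `Prefix` and `Delimiter="/"`: the hierarchy is a client-side convention, not a property of the storage.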
AWS S3 is typically used for the following use cases:
- Data lakes
- Backup and restore
- Data archives
- Running cloud-native applications
To make the most of the AWS S3 service, consider the following questions.
Question 1: How long do you need to keep old data?
| Use case | How long do you need to keep old data? |
|---|---|
| Data lakes | While machine learning / artificial intelligence models are generally better calibrated with larger data sets, some older data sets are no longer useful because the underlying patterns have changed. Identify the datasets that stop being useful after a certain number of years, then either tag them in S3 or set up an S3 lifecycle rule to expire (delete) the datasets, identified by a prefix or tag, after that number of years. |
| Backup and restore / Data archives | You rarely need to access old backups or archives, though compliance may require keeping them for a certain number of years. Set up an S3 lifecycle rule to expire (delete) all backups / archives older than the required number of years. If you use incremental backups, ensure you expire all files of an incremental backup series together. |
| Running cloud-native applications | The logs of cloud-native applications should expire (be deleted) after a certain number of months. |
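An expiration rule like the ones above can be expressed as the payload that boto3's `put_bucket_lifecycle_configuration` expects. This sketch only builds the dictionary – the prefix, rule ID, and retention period are hypothetical:

```python
# Build an S3 lifecycle configuration that expires (deletes) objects
# under a given prefix after a fixed number of years.
# Applying it would look like:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake", LifecycleConfiguration=config)

def expire_after_years(prefix: str, years: int) -> dict:
    return {
        "Rules": [
            {
                "ID": f"expire-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Lifecycle rules count in days, so convert years.
                "Expiration": {"Days": years * 365},
            }
        ]
    }

config = expire_after_years("stale-datasets/", 3)
print(config["Rules"][0]["Expiration"])  # {'Days': 1095}
```

A rule filtered by tag instead of prefix would replace `Filter` with `{"Tag": {"Key": ..., "Value": ...}}`.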
Question 2: What data should you store in AWS S3?
| Use case | What data should you store in AWS S3? |
|---|---|
| Data lakes | Use compressed binary files rather than CSV files whenever possible – we recommend compressed Parquet files. Ensure you do not store duplicate files. |
| Backup and restore | Whenever possible, use incremental backups. Use compression and do not back up duplicate files. |
| Data archives | Whenever possible, use compression and do not archive duplicate files. |
| Running cloud-native applications | Set up Amazon CloudFront to serve static files from S3. |
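The compression advice above is easy to quantify. A minimal stdlib sketch (no AWS calls; the sample data is made up) compares the raw and gzip-compressed size of a repetitive CSV payload:

```python
import gzip

# Repetitive CSV-like data compresses very well; columnar formats
# such as Parquet typically do even better on real tables.
rows = "\n".join("2024-01-01,sensor-7,21.5" for _ in range(10_000))
raw = rows.encode("utf-8")
compressed = gzip.compress(raw)

print(len(raw), len(compressed))
assert len(compressed) < len(raw) // 10  # more than 10x smaller here
```

Since S3 bills per GB-month, a compression ratio like this translates directly into storage cost savings.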
Question 3: What S3 storage class should you use?
| Use case | What S3 storage classes should you use? |
|---|---|
| Data lakes | Data lakes usually have unpredictable access patterns. Use S3 Intelligent-Tiering to optimize storage costs. |
| Backup and restore / Data archives | The choice depends on how quickly you need the files restored. In general, the longer the recovery time, the lower the price: S3 Glacier Instant Retrieval (milliseconds), S3 Glacier Flexible Retrieval (minutes to hours), and S3 Glacier Deep Archive (up to 12 hours). |
| Running cloud-native applications | For production files, use S3 Intelligent-Tiering. For logs, use S3 Standard for up to 7 days, then let a lifecycle rule transition the logs to S3 One Zone-Infrequent Access or S3 Glacier Flexible Retrieval. |
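The log retention policy in the last row can likewise be sketched as a lifecycle rule dictionary. The prefix and day counts are illustrative; note that S3 enforces a 30-day minimum before transitions into the Infrequent Access classes, so this sketch tiers down to Glacier Flexible Retrieval at day 7 instead:

```python
# Lifecycle rule: keep logs in S3 Standard for 7 days, then move them
# to a cheaper storage class. Objects land in S3 Standard by default,
# so only the transition needs to be declared.
log_rule = {
    "ID": "tier-down-logs",
    "Filter": {"Prefix": "logs/"},
    "Status": "Enabled",
    "Transitions": [
        # ONEZONE_IA / STANDARD_IA transitions require Days >= 30;
        # GLACIER (Flexible Retrieval) has no such minimum.
        {"Days": 7, "StorageClass": "GLACIER"},
    ],
    "Expiration": {"Days": 90},  # delete logs after ~3 months
}

print(log_rule["Transitions"][0])
```

Wrapped in `{"Rules": [log_rule]}`, this is the shape `put_bucket_lifecycle_configuration` accepts.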
Question 4: What AWS region should you use for the S3 bucket?
| Use case | What AWS region should you use for the S3 bucket? |
|---|---|
| Data lakes | Use the AWS region where the majority of the applications reside. For global use, consider S3 Multi-Region Access Points. |
| Backup and restore / Data archives | Use the AWS region of the source application to minimize data transfer costs. In addition, consider the AWS region with the lowest storage costs. |
| Running cloud-native applications | Use the AWS region where the majority of the applications reside. For global use, consider S3 Multi-Region Access Points. |
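As a sketch, pinning a bucket to a chosen region with boto3 only takes a `LocationConstraint`. This builds the `create_bucket` parameters without calling AWS; the bucket name and region are hypothetical:

```python
# Parameters for s3_client.create_bucket(**params). Buckets outside
# us-east-1 must name their region via CreateBucketConfiguration.

def create_bucket_params(name: str, region: str) -> dict:
    params = {"Bucket": name}
    if region != "us-east-1":  # us-east-1 rejects a LocationConstraint
        params["CreateBucketConfiguration"] = {
            "LocationConstraint": region
        }
    return params

print(create_bucket_params("backups-archive", "eu-central-1"))
```

The `us-east-1` special case is an S3 API quirk worth encoding once in a helper rather than remembering at every call site.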
Question 5: Do you need to enable object versioning?
| Use case | Do you need to enable object versioning? |
|---|---|
| Data lakes | Yes; however, use a lifecycle rule to limit the number of versions you keep of each file, and transition older versions first to S3 Standard-Infrequent Access and then to a Glacier storage class. |
| Backup and restore | No. |
| Data archives | Yes; transition older versions first to S3 Standard-Infrequent Access and then to a Glacier storage class. |
| Running cloud-native applications | No. |