Amazon Web Services (AWS) Archiving

via S3 + Glacier with Bucket Lifecycle Rules

Since Amazon’s Glacier can be a good bit more difficult to work with directly than S3, unless you have trivially small needs or nontrivially deep pockets I recommend a hybrid solution: S3 + Glacier.

I prefer to use S3 via the awscli Python module for the initial ‘push the data to AWS’ step, as it’s darn easy: aws s3 sync somedir s3://somebucket/. With Lifecycle Rules applied to the S3 bucket, the data can then be automatically (and, if you like, quickly) pushed to either S3 Infrequent Access or Glacier. I don’t do many restores, and when I do they aren’t on critical timelines, so I’m focusing on Glacier for this article.

How Much Will This Cost?

S3 pricing is fairly straightforward: https://aws.amazon.com/s3/pricing/. While it does vary by region, it’s a tiered monthly price (less $/GB as size goes up), plus very minor transaction charges. Easy to understand.

Glacier charges are a bit less intuitive, but the ‘non-intuitive’ bits are a very small part of your bill, unless you have a ridiculous number of objects.

So for example, as of today (2017-02-10), in my most-used region, the base cost of ‘cheap’ S3 (S3 Infrequent Access) is $12.80/TB/month vs. $4.096/TB/month for Glacier, a storage savings of 68%. For a 10 TB archive, that’s roughly $128/month vs. about $41/month. Depending on how many ‘objects’ are in your storage, though, you’ll also pay some small per-object fees on top of that.

The Step-By-Step

So let’s put this all together into a practical example, to see how simple it is.

  1. Install the AWS CLI (http://docs.aws.amazon.com/cli/latest/userguide/installing.html)
  2. Create an S3 bucket (http://docs.aws.amazon.com/cli/latest/reference/s3/index.html)
  3. Turn on Bucket Versioning (optional) (http://docs.aws.amazon.com/cli/latest/reference/s3api/put-bucket-versioning.html)
  4. Set your bucket Lifecycle Rules (http://docs.aws.amazon.com/cli/latest/reference/s3api/put-bucket-lifecycle-configuration.html)
  5. Sync your backup directory to your S3 bucket (http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html)

Install the AWS CLI

Your OS may come with an older version of the AWS CLI that will not work correctly with the latest AWS CLI Command Reference (good stuff, by the way). So use a virtualenv (really) to get the latest version of the AWS CLI, which is updated fairly often. That said, this document was created with AWS CLI v1.11.46 (aws --version), so YMMV.
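A minimal sketch of the install (the virtualenv path is just an example; adjust to taste):

    # create and activate a virtualenv, then install the latest awscli
    virtualenv ~/awscli-env
    source ~/awscli-env/bin/activate
    pip install --upgrade awscli
    aws --version    # confirm, e.g. aws-cli/1.11.46 ...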

Create an S3 Bucket

Note: you will need to use your own, unique bucket name. I already ‘own’ wharton-research-Archive.
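A minimal sketch (your-archive-bucket is a placeholder; substitute your own name):

    # 'mb' makes the bucket; bucket names are globally unique across all of S3
    aws s3 mb s3://your-archive-bucket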

Turn on Bucket Versioning (optional)

If you would like to do ‘incremental’ backups (retaining all old versions of your files), you can turn on the Versioning feature in your S3 bucket. Note: this could get expensive. Unfortunately, at this time AWS S3 doesn’t have the ability to set a number of versions to keep, so all versions of a file will be kept. Of course you could delete older objects, but that begins to turn ‘this is simple’ into ‘this is getting tough’. If you want to explore, it isn’t too hard to 1) do your sync, then 2) delete any versions of an object beyond a certain count. I like versioning, so:
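Enabling it is one s3api call (a sketch; the bucket name is a placeholder):

    # turn on versioning for the bucket
    aws s3api put-bucket-versioning \
        --bucket your-archive-bucket \
        --versioning-configuration Status=Enabled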

Set Bucket Lifecycle Rules

Create a JSON file with your Bucket Lifecycle JSON code:
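Something like this (a sketch; the rule ID is arbitrary, and the empty "Prefix" plus ‘0 days’ GLACIER transition match what’s discussed below):

    {
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Prefix": "",
                "Status": "Enabled",
                "Transitions": [
                    {
                        "Days": 0,
                        "StorageClass": "GLACIER"
                    }
                ]
            }
        ]
    }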

For ‘help’ with creating the above JSON, I actually set it first via the AWS Console and then did a ‘get’ and saved that to a file. Now push the Lifecycle JSON to your bucket:
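Assuming you saved it as lifecycle.json (the filename and bucket name are placeholders):

    # apply the lifecycle configuration to the bucket
    aws s3api put-bucket-lifecycle-configuration \
        --bucket your-archive-bucket \
        --lifecycle-configuration file://lifecycle.json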

Verify that the Lifecycle has been applied to your bucket:
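For example:

    # print the lifecycle configuration currently on the bucket
    aws s3api get-bucket-lifecycle-configuration --bucket your-archive-bucket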

Note that the ‘Prefix’ is an empty string … you could also add a prefix to ONLY send a sub-directory of your S3 bucket to Glacier, like "glacier/", which would move only files within the “glacier/” sub-directory to Glacier; anything else would remain in S3.

Sync Backup Directory

Finally! You’re ready to run a backup! AWS CLI s3 sync is extremely easy to use, and fairly powerful. By default, it copies only files that have changed, which makes for nice incremental backups, and it recurses if you specify a directory, like your /archive directory. I use --storage-class REDUCED_REDUNDANCY since the files will only be in this class for a day, and we might as well save some money.
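A sketch of the sync (the local path and bucket name are placeholders):

    # sync /archive to the bucket; only new or changed files are uploaded
    aws s3 sync /archive s3://your-archive-bucket/archive \
        --storage-class REDUCED_REDUNDANCY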

Throw that in a crontab (or script run out of cron), and you’re ready to roll!
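For example, a hypothetical nightly entry (paths are placeholders; note the full path to the virtualenv’s aws):

    # run the sync every night at 2:30 AM
    30 2 * * * /home/you/awscli-env/bin/aws s3 sync /archive s3://your-archive-bucket/archive --storage-class REDUCED_REDUNDANCY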

Verify the files are in S3:
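For example:

    # list the objects under the archive/ prefix
    aws s3 ls s3://your-archive-bucket/archive/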

Look at the file details:
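A sketch using head-object (the key name is a hypothetical example):

    # show the object's metadata, including its StorageClass
    aws s3api head-object \
        --bucket your-archive-bucket \
        --key archive/somefile.tgz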

Note the "StorageClass": "REDUCED_REDUNDANCY". The files didn’t immediately archive to Glacier, even though our Lifecycle Rule says ‘0 days’. I haven’t found documentation on exactly how often things move, but it seems to take less than a day. The next day, when I checked the file status again, it had changed to "StorageClass": "GLACIER".

Success!

Restoring Files

Restoring from Glacier is not as ‘immediate’ as just syncing files down from S3: objects in the GLACIER storage class must first be restored back into S3 before they can be downloaded.

But it’s not hard, either. You just need to send a ‘restore-object‘ command, with a few details, like so:
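A sketch (bucket and key are placeholders; Days controls how long the restored copy remains available in S3):

    # request a temporary restore of the object back into S3
    aws s3api restore-object \
        --bucket your-archive-bucket \
        --key archive/somefile.tgz \
        --restore-request Days=7

Once the restore completes, you can pull the file down normally (e.g. with aws s3 cp).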

This is similar to files moving from S3 to Glacier … it takes about a day. You might do some testing of your own; YMMV.

Saving Money vs Incrementals

One of the nicest features of the above solution (aws s3 sync of an entire directory, with bucket Versioning enabled) is that it provides you with unlimited revisions of the files: granular ‘incrementals’. For example, if you have a file with three older revisions:
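You can list them with s3api list-object-versions (a sketch, with placeholder names):

    # list all versions of an object under the given prefix
    aws s3api list-object-versions \
        --bucket your-archive-bucket \
        --prefix archive/somefile.tgz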

You can restore the previous version by adding the --version-id option to your restore request:
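A sketch; the version ID comes from the list-object-versions output above:

    # restore a specific older version of the object from Glacier
    aws s3api restore-object \
        --bucket your-archive-bucket \
        --key archive/somefile.tgz \
        --version-id <version-id-from-list-object-versions> \
        --restore-request Days=7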

But if incrementals are unnecessary, you can likely save some cash (potentially a good bit) by zipping the data before syncing it to S3 for archiving in Glacier: fewer, smaller objects means lower storage and per-object costs.
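For instance, a hypothetical tar-then-upload variant (file and bucket names are placeholders):

    # compress the archive directory into a single dated object, then upload it
    tar czf archive-$(date +%F).tgz /archive
    aws s3 cp archive-$(date +%F).tgz s3://your-archive-bucket/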

With two decades of experience supporting research and more than a decade at The Wharton School, Hugh enjoys the challenges and rewards of working with world-class researchers doing Amazing Things with research computing. Robust and scalable computational solutions (both on premise and in The Cloud), custom research programming solutions (clever ideas, simple code), and holistic, results-focused approaches to projects are the places where Hugh lives these days. On weekends you're likely to find him running through the woods with a topo map and compass, orienteering.