Amazon Web Services (AWS) Archiving

via S3 + Glacier with Bucket Lifecycle Rules

Since Amazon’s Glacier can be a good bit more difficult to work with directly than S3, unless you have trivially small needs or nontrivially deep pockets I recommend a hybrid solution: S3 + Glacier.

I prefer to use S3 via the awscli Python module for the initial ‘push the data to AWS’ step, as it’s darn easy: aws s3 sync somedir s3://somebucket/. With Lifecycle Rules applied to the S3 bucket, the data can then be automatically (and, if you like, quickly) pushed to either S3 Infrequent Access or Glacier. I don’t do many restores, and when I do they aren’t on critical timelines, so I’m focusing on Glacier for this article.

How Much Will This Cost?

S3 pricing is fairly straightforward: https://aws.amazon.com/s3/pricing/. While it does vary by region, it’s a tiered monthly price (less $/GB as size goes up), plus very minor transaction charges. Easy to understand.

Glacier charges are a bit less intuitive, but the ‘non-intuitive’ bits are a very small part of your bill, unless you have a ridiculous number of objects.

So for example, as of today (2017-02-10), in my most-used region, the base cost of ‘cheap’ S3 (S3 Infrequent Access) is $12.80/TB/month vs. $4.096/TB/month for Glacier, a storage savings of 68%. For a 10 TB archive, that’s roughly $128/month vs. about $41/month. Depending on how many ‘objects’ are in your storage, though, you’ll also pay some small per-object fees on top of that.

The Step-By-Step

So let’s put this all together into a practical example, to see how simple it is.

  1. Install the AWS CLI (http://docs.aws.amazon.com/cli/latest/userguide/installing.html)
  2. Create an S3 bucket (http://docs.aws.amazon.com/cli/latest/reference/s3/index.html)
  3. Turn on Bucket Versioning (optional) (http://docs.aws.amazon.com/cli/latest/reference/s3api/put-bucket-versioning.html)
  4. Set your bucket Lifecycle Rules (http://docs.aws.amazon.com/cli/latest/reference/s3api/put-bucket-lifecycle-configuration.html)
  5. Sync your backup directory to your S3 bucket (http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html)

Install the AWS CLI

Your OS may come with an older version of the AWS CLI that will not work correctly with the latest AWS CLI Command Reference (good stuff, by the way). So use a virtualenv (really) to get the latest version of the AWS CLI, which is updated fairly often. That said, this document was created with AWS CLI v1.11.46 (aws --version), so YMMV.
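A minimal sketch of the install (the virtualenv path is just an example; adjust to taste):

    # create and activate a virtualenv, then install the latest awscli
    virtualenv ~/awscli-env
    source ~/awscli-env/bin/activate
    pip install --upgrade awscli
    aws --version    # confirm, e.g. aws-cli/1.11.46 ...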

Create an S3 Bucket

Note: you will need to use your own, unique bucket name. I already ‘own’ wharton-research-Archive.
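A minimal sketch (your-archive-bucket is a placeholder; substitute your own name):

    # 'mb' makes the bucket; bucket names are globally unique across all of S3
    aws s3 mb s3://your-archive-bucket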

Turn on Bucket Versioning (optional)

If you would like to do ‘incremental’ backups (retaining all old versions of your files), you can turn on the Versioning feature in your S3 bucket. Note: this could get expensive. Unfortunately, at this time AWS S3 doesn’t have the ability to set a number of versions to keep, so all versions of a file will be kept. Of course you could delete older objects, but that begins to turn ‘this is simple’ into ‘this is getting tough’. If you want to explore, it isn’t too hard to 1) do your sync, then 2) delete any versions of an object beyond a certain count. I like versioning, so:
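Enabling it is one s3api call (a sketch; the bucket name is a placeholder):

    # turn on versioning for the bucket
    aws s3api put-bucket-versioning \
        --bucket your-archive-bucket \
        --versioning-configuration Status=Enabled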

Set Bucket Lifecycle Rules

Create a JSON file with your Bucket Lifecycle JSON code:
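Something like this (a sketch; the rule ID is arbitrary, and the empty "Prefix" plus ‘0 days’ GLACIER transition match what’s discussed below):

    {
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Prefix": "",
                "Status": "Enabled",
                "Transitions": [
                    {
                        "Days": 0,
                        "StorageClass": "GLACIER"
                    }
                ]
            }
        ]
    }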

For ‘help’ with creating the above JSON, I actually set it first via the AWS Console and then did a ‘get’ and saved that to a file. Now push the Lifecycle JSON to your bucket:
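Assuming you saved it as lifecycle.json (the filename and bucket name are placeholders):

    # apply the lifecycle configuration to the bucket
    aws s3api put-bucket-lifecycle-configuration \
        --bucket your-archive-bucket \
        --lifecycle-configuration file://lifecycle.json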

Verify that the Lifecycle has been applied to your bucket:
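For example:

    # print the lifecycle configuration currently on the bucket
    aws s3api get-bucket-lifecycle-configuration --bucket your-archive-bucket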

Note that the ‘Prefix’ is an empty string … you could also add a prefix to ONLY send a sub-directory of your S3 bucket to Glacier, like "glacier/", which would move only files within the “glacier/” sub-directory to Glacier; anything else would remain in S3.

Sync Backup Directory

Finally! You’re ready to run a backup! AWS CLI s3 sync is extremely easy to use, and fairly powerful. By default, it copies only files that have changed, which makes for nice incremental backups, and it recurses if you specify a directory, like your /archive directory. I use --storage-class REDUCED_REDUNDANCY since the files will only be in this class for a day, and we might as well save some money.
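A sketch of the sync (the local path and bucket name are placeholders):

    # sync /archive to the bucket; only new or changed files are uploaded
    aws s3 sync /archive s3://your-archive-bucket/archive \
        --storage-class REDUCED_REDUNDANCY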

Throw that in a crontab (or script run out of cron), and you’re ready to roll!
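For example, a hypothetical nightly entry (paths are placeholders; note the full path to the virtualenv’s aws):

    # run the sync every night at 2:30 AM
    30 2 * * * /home/you/awscli-env/bin/aws s3 sync /archive s3://your-archive-bucket/archive --storage-class REDUCED_REDUNDANCY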

Verify the files are in S3:
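For example:

    # list the objects under the archive/ prefix
    aws s3 ls s3://your-archive-bucket/archive/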

Look at the file details:
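A sketch using head-object (the key name is a hypothetical example):

    # show the object's metadata, including its StorageClass
    aws s3api head-object \
        --bucket your-archive-bucket \
        --key archive/somefile.tgz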

Note the "StorageClass": "REDUCED_REDUNDANCY". The files didn’t immediately archive to Glacier, even though our Lifecycle Rule says ‘0 days’. I haven’t found documentation on exactly how often things move, but it seems to take less than a day. The next day, when I checked the file status again, it had changed to "StorageClass": "GLACIER".

Success!

Restoring Files

Restoring from Glacier is not as ‘immediate’ as just syncing files down from S3: objects in the GLACIER storage class must first be restored back into S3 before they can be downloaded.

But it’s not hard, either. You just need to send a ‘restore-object‘ command, with a few details, like so:
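A sketch (bucket and key are placeholders; Days controls how long the restored copy remains available in S3):

    # request a temporary restore of the object back into S3
    aws s3api restore-object \
        --bucket your-archive-bucket \
        --key archive/somefile.tgz \
        --restore-request Days=7

Once the restore completes, you can pull the file down normally (e.g. with aws s3 cp).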

This is similar to files moving from S3 to Glacier … it takes about a day. You might do some testing of your own; YMMV.

Saving Money vs Incrementals

One of the nicest features of the above solution (aws s3 sync of an entire directory, with bucket Versioning enabled) is that it provides you with unlimited revisions of the files: granular ‘incrementals’. For example, if you have a file with three older revisions:
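You can list them with s3api list-object-versions (a sketch, with placeholder names):

    # list all versions of an object under the given prefix
    aws s3api list-object-versions \
        --bucket your-archive-bucket \
        --prefix archive/somefile.tgz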

You can restore the previous version by adding the --version-id option to your restore request:
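A sketch; the version ID comes from the list-object-versions output above:

    # restore a specific older version of the object from Glacier
    aws s3api restore-object \
        --bucket your-archive-bucket \
        --key archive/somefile.tgz \
        --version-id <version-id-from-list-object-versions> \
        --restore-request Days=7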

But if incrementals are unnecessary, you can likely save some cash (potentially a good bit) by zipping the data before syncing it to S3 for archiving in Glacier: fewer, smaller objects means lower storage and per-object costs.
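For instance, a hypothetical tar-then-upload variant (file and bucket names are placeholders):

    # compress the archive directory into a single dated object, then upload it
    tar czf archive-$(date +%F).tgz /archive
    aws s3 cp archive-$(date +%F).tgz s3://your-archive-bucket/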

With two decades of experience supporting research and more than a decade at The Wharton School, Hugh enjoys the challenges and rewards of working with world-class researchers doing Amazing Things with research computing. Robust and scalable computational solutions (both on premise and in The Cloud), custom research programming solutions (clever ideas, simple code), and holistic, results-focused approaches to projects are the places where Hugh lives these days. On weekends you're likely to find him running through the woods with a topo map and compass, orienteering.