2013-06-08

Amazon Glacier in Python with Boto (Part One: Uploading)

Note: Amazon Glacier works quite differently from everyday uploading and downloading. For example, an archive you upload won't show up in the vault inventory immediately, and you will wait hours before you can retrieve a file. There are a few new technical concepts to learn.

I recently found a very affordable cloud-based storage for my research data - Amazon Glacier. It costs only 1 cent (yes, $0.01) per GB per month, so my 2TB of data comes to $20/month. Besides the cost, I don't see other competitors, including Rackspace Cloud Files or Google Storage, offering a storage-only service - one exception is DreamHost's DreamObjects at 7 cents (3 or 4 cents during promotions) per GB per month. I really don't want to rent their cloud machines just to store my files, because I would also be charged for CPU time.

Getting started with Amazon Glacier was not easy for me, a person without much cloud computing experience. It was the first time I needed to write programs against an API to upload and download files. Luckily, after Googling around for a while, I am having happy hours with Boto, the Python interface to Amazon cloud services, including Glacier. Boto's documentation for Glacier is at http://docs.pythonboto.org/en/latest/ref/glacier.html

Creating vaults

I didn't create my vaults with Boto; I used the web interface in the AWS console instead.
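
If you prefer to stay in Python, Boto can create vaults too. Below is a minimal sketch using Boto's high-level Layer2 interface; the vault name my-research-data is just an example I made up, and creating a vault that already exists is harmless.

from boto.glacier.layer2 import Layer2

access_key_id = "...your_aws_access_key_id..."
secret_key = "...your_aws_secret_key..."

# Layer2 is Boto's high-level interface to Glacier.
l = Layer2(aws_access_key_id=access_key_id, aws_secret_access_key=secret_key)

# create_vault() returns a Vault object we can use right away.
vault = l.create_vault("my-research-data")  # example name, use your own
print(vault.name)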

Uploading archives


Solution 1: Using APIs

(modified from [1])

from boto.glacier.layer1 import Layer1
from boto.glacier.concurrent import ConcurrentUploader
import sys
import os.path

from time import gmtime, strftime

access_key_id = "...your_aws_access_key_id..."
secret_key = "...your_aws_secret_key..."
target_vault_name = "...your_vault_name..."

fname = sys.argv[1]  # the file to be uploaded into the vault as an archive. 
fdes = sys.argv[2]   # a description you give to the file
 
if not os.path.isfile(fname):
    print("Can't find the file to upload!")
    sys.exit(-1)
 
glacier_layer1 = Layer1(aws_access_key_id=access_key_id, aws_secret_access_key=secret_key)
 
uploader = ConcurrentUploader(glacier_layer1, target_vault_name, part_size=128*1024*1024, num_threads=4)  # 128 MB parts, 4 threads

print("Begin at "+strftime("%Y-%m-%d %H:%M:%S", gmtime()))  # Time in Greenwich time.

print("uploading... "+fname+", "+fdes)
 
archive_id = uploader.upload(fname, fdes)
 
print("Success! archive id: '%s'"%(archive_id))

print("Finish at "+strftime("%Y-%m-%d %H:%M:%S", gmtime())) # Time in Greenwich time.

Please note that ConcurrentUploader chops the file into multiple parts and uploads them with multiple threads. Threads and parts are easy to confuse, so let me explain them.
  • Threads and parts are independent of each other.
  • Threads: how many parallel threads upload parts at the same time. The default is 10. More threads only shorten the upload time; pick a number that matches your bandwidth.
  • Parts: the file is chopped into pieces of part_size bytes each (the size must be 1 MB multiplied by a power of 2 - check the Amazon Glacier documentation for details). The pieces are sent to the Glacier server individually and assembled back into one archive after all of them are received. Chopping does not speed up the upload by itself, but smaller pieces mean less data to retransmit if the transfer is interrupted. In other words, use smaller pieces if your network is unstable; see the sketch after this list for picking a size.
  • Also note that Amazon charges by the number of upload requests you make to their servers, and each part is sent as its own request, so you don't want to chop the file into too many small pieces. (Threads should not matter here, since they don't change the number of requests.)
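
Picking part_size by hand is easy to get wrong: the size must be 1 MB times a power of 2, and Glacier allows at most 10,000 parts per upload. Here is a minimal sketch of how one could compute a valid part size for a given file; the function name pick_part_size is mine, not Boto's.

import os.path

MB = 1024 * 1024

def pick_part_size(path, start_mb=4, max_parts=10000):
    """Return the smallest power-of-two part size (in bytes), starting
    from start_mb MB, that keeps the file within max_parts parts."""
    size = os.path.getsize(path)
    part_mb = start_mb
    while part_mb * MB * max_parts < size:
        part_mb *= 2  # stays at 1 MB times a power of 2
    return part_mb * MB

# e.g. ConcurrentUploader(glacier_layer1, target_vault_name,
#                         part_size=pick_part_size(fname), num_threads=4)
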
IMPORTANT: Write down the archive ID that is returned. You will need it later to retrieve or delete your archive. If you lose it, you will have to request the vault inventory first, which itself takes hours.
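
Since I don't trust myself to write things down, one simple habit is to append every archive ID to a local log right after the upload succeeds. A minimal sketch (the file name glacier_archives.log is my own choice), to be placed at the end of the upload script above:

with open("glacier_archives.log", "a") as log:
    log.write("%s\t%s\t%s\n" % (target_vault_name, fname, archive_id))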

Solution 2: Using glacier script

Boto itself comes with a glacier script under bin that allows uploading files. The usage is very simple:
glacier upload <vault> <files> --access_key <key> --secret_key <key> --region <region>
You can set environment variables to avoid entering the access key, secret key and region every time. For details, simply run glacier without any parameters and the help info will be printed.
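
For example, with the standard AWS credential environment variables set (Boto should pick these up automatically; my-vault and backup.tar.gz are placeholders):

export AWS_ACCESS_KEY_ID="...your_aws_access_key_id..."
export AWS_SECRET_ACCESS_KEY="...your_aws_secret_key..."
glacier upload my-vault backup.tar.gz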

Deleting an archive

There are two ways to do it: one through Boto's Layer1 API and the other through Layer2.

Using Layer1:
from boto.glacier.layer1 import Layer1
 
access_key_id = "...your_aws_access_key_id..."
secret_key = "...your_aws_secret_key..."
vault_name = "...your_vault_name..."
archive_id = "...the_archive_id_you_wrote_down..."
 
glacier_layer1 = Layer1(aws_access_key_id=access_key_id, aws_secret_access_key=secret_key)
 
glacier_layer1.delete_archive(vault_name, archive_id)

Using Layer2 (from [2]):
from boto.glacier.layer2 import Layer2
access_key_id = "...your_aws_access_key_id..."
secret_key = "...your_aws_secret_key..."
vault_name = "...your_vault_name..."
archive_id = "...the_archive_id_you_wrote_down..."
 
l = Layer2(aws_access_key_id=access_key_id, aws_secret_access_key=secret_key)
 
v = l.get_vault(vault_name)
 
v.delete_archive(archive_id)

References:

2 comments:

Andreas said...

You forgot the region parameter to Layer1(). It will default to us-east-1.

Forrest Bao said...

You are correct. Since I connect to the default data center, I don't set it.