Snowball

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway. [Andy Tanenbaum, 1989]

As someone who did a lot of computing before The Cloud or Dropbox was a thing, I have a little box of hard drives tucked away in my living room. A bunch of these drives will be paperweights by now, the ball bearings frozen up or the platters otherwise unreadable, but I would happily pay for the salvageable data to be thrown up on Amazon for posterity and my own nostalgia. I tried trickle-copying the data over our Sonic DSL connection, but things were happening at a geologic time scale. Enter Snowball, Amazon’s big data transfer service. You sign up and the service piggy-backs on your usual Amazon Web Services (AWS) billing & credentials. Then they ship you a physical computer, a 50-pound honking plastic thing that arrives on your doorstep via two-day UPS:

AWS Snowball Enclosure
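(As an aside, the ordering step can also be scripted against the Snowball API rather than clicked through. Here is a rough sketch using boto3; the bucket, IAM role, and shipping address are placeholders I made up for illustration, and the capacity and shipping options may differ from what you are actually offered.)

```python
# Rough sketch of ordering a Snowball import job with boto3.
# The bucket ARN, role ARN, and address below are placeholders.
import boto3

snowball = boto3.client("snowball", region_name="us-west-2")

# The shipping address has to be registered first; this returns an AddressId.
address = snowball.create_address(
    Address={
        "Name": "Example Person",
        "Street1": "123 Example St",
        "City": "San Francisco",
        "StateOrProvince": "CA",
        "PostalCode": "94110",
        "Country": "US",
        "PhoneNumber": "555-555-5555",
    }
)

# Create the import job itself: data copied onto the appliance will land in
# the named S3 bucket once the box is shipped back.
job = snowball.create_job(
    JobType="IMPORT",
    Resources={"S3Resources": [{"BucketArn": "arn:aws:s3:::my-old-hard-drives"}]},
    AddressId=address["AddressId"],
    RoleARN="arn:aws:iam::123456789012:role/snowball-import-role",
    SnowballCapacityPreference="T50",
    ShippingOption="SECOND_DAY",
    Description="Import of salvaged home hard drives",
)
print(job["JobId"])
```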

The first thing I noticed was a cleverly-embedded Kindle that serves as both shipping label and user interface:

AWS Snowball, Shipping Label

The plastic enclosure itself opens DeLorean-style to reveal a handful of spooled cables:

AWS Snowball, Cable Spool

You plug the Snowball into your normal 120V AC mains power, and boot the thing:

AWS Snowball, Bootup


Next you install some AWS software on another machine on your network, and then use that software to copy data over the network to the Snowball itself:

AWS Snowball, CLI Client
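Roughly, the copy step looks like the sketch below: fetch the job's manifest and unlock code from the Snowball API, unlock the appliance over the local network, then copy files onto it. The job ID, appliance IP, and paths are placeholders, and the "snowball start" / "snowball cp" command names follow the legacy Snowball client and may differ by version:

```python
# Sketch of the copy step. JOB_ID and the appliance IP are placeholders; the
# client command names are from the legacy "snowball" CLI and may differ.
import subprocess
import urllib.request

import boto3

JOB_ID = "JID-example-0000-0000-0000-000000000000"
SNOWBALL_IP = "192.168.1.99"

snowball_api = boto3.client("snowball", region_name="us-west-2")

# The manifest file and unlock code together authenticate you to the appliance.
manifest_uri = snowball_api.get_job_manifest(JobId=JOB_ID)["ManifestURI"]
unlock_code = snowball_api.get_job_unlock_code(JobId=JOB_ID)["UnlockCode"]
urllib.request.urlretrieve(manifest_uri, "manifest.bin")

# Unlock the appliance over the local network...
subprocess.run(
    ["snowball", "start", "-i", SNOWBALL_IP, "-m", "manifest.bin", "-u", unlock_code],
    check=True,
)

# ...then copy data onto it. The destination is an s3:// path that maps to the
# bucket chosen when the job was created.
subprocess.run(
    ["snowball", "cp", "--recursive", "/mnt/old-drive", "s3://my-old-hard-drives/"],
    check=True,
)
```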

Tucked away inside is a serious amount of disk storage, 50 terabytes in the case of the Snowball I tried. The device itself is an intimidating “engineering sample,” whatever that means:

AWS Snowball, Engineering Sample

This is where I noted the first serious snag in my plans: the Snowball relies upon your own (home) network for data transfer, which puts a bandwidth bottleneck at your router. My suddenly-beleaguered Netgear thing was tapped out within moments, and installing Linux on the router (DD-WRT) would not have gotten me more than a 2x speedup.

Also, the Snowball client runs on another machine on your network, which is not much of a limitation in an institutional setting. However, I was copying data from an external hard drive sitting in a SATA/IDE-to-USB 3.0 adapter thing, which put another bottleneck and layer of complexity at the USB port.

Why not just interface my external hard drives directly to the Snowball? Or maybe even install the hard drives as temporary internal disks within the enclosure? No such luck: the enclosure is almost hermetically sealed (“rugged enough to withstand a 6 G jolt”) and exposes only Cat 5 and fiber network ports.

Here is me telling the Snowball via its command-line client that it is ready to be returned to AWS in Oregon:

AWS Snowball, Return Label
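Once the box ships back, the import's progress can be polled through the same Snowball API; a minimal sketch, assuming a placeholder job ID:

```python
# Minimal sketch: poll the import job's state after shipping the Snowball back.
import boto3

snowball_api = boto3.client("snowball", region_name="us-west-2")

meta = snowball_api.describe_job(
    JobId="JID-example-0000-0000-0000-000000000000"  # placeholder
)["JobMetadata"]

# The state moves through values like "InTransitToAWS" and "WithAWS" before
# landing on "Complete" once the data has been ingested into S3.
print(meta["JobState"])
```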

So! I found the Snowball to be a relatively sophisticated and honest response to the reality that storage sizes keep growing faster than Internet bandwidth. However, it is not a good solution for those of us wanting to upload a bunch of rotting hard drives to The Cloud. Amazon had a legacy service (AWS Import/Export Disk) that accepted shipped disk drives directly, but I believe it has gone away. On the other hand, I expect Snowball to be a very efficient and slick solution for most organizations. But for the guy sitting on some dusty hard disks, it did not get the ball rolling.

Dreaming of the Cloud

So far cloud 2011 is just client-server 1997 with new jargon.

As a modeler who manages a serious EC2 cluster, someone who has handed thousands of dollars to Amazon over the last few years, I remain frustrated at what the industry has settled on as the main unit of value. Root access on a Linux virtual machine does an admirable job of isolating my applications from other users, but it is a poor way to prioritize economically. We need a smarter metaphor for distributing a long-running job across a bunch of machines and making sure we pay for what we use. I don’t so much care about having a fleet of machines ready to handle a spike in web traffic. Instead, I want to be able to swipe my credit card to ramp up a job that would usually take a week, so it finishes in a couple of hours.

(If you are a Moore’s Law optimist who thinks glacial, CPU-bound code is a thing of the past, you might be surprised to hear that one of my models has been training on an EC2 m1.large instance for the last 14 hours, and is just over halfway finished… Think render farms and statistical NLP, not Photoshop filters.)

My dream cloud interface is not about booting virtual machines and monitoring jobs, but about spending money so my job finishes sooner. The cloud should let me launch some code and get it chugging along in the background. Then later, I would like to spend a certain amount of money and let reverse-auction magic decide how much more CPU & RAM that money buys. This should feel like bidding on Google AdWords. So where I might use the Unix command “nice” to prioritize a job, I could call “expensiveNice” on a PID to get that job more CPU or RAM. Virtual machines are hip this week, but applications & jobs are still the more natural way to think about computing tasks.
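To make the idea concrete, here is a purely hypothetical sketch of what an “expensiveNice” call might look like; no such API exists, and the reverse-auction pricing and scaling calls are invented for illustration:

```python
# Purely hypothetical sketch of "expensiveNice": there is no such API.
# The Allocation type, the reverse-auction call, and the scaling call are
# all invented for illustration.
from dataclasses import dataclass


@dataclass
class Allocation:
    cpus: int
    ram_gb: int


class HypotheticalCloud:
    """Stand-in for a provider that sells extra capacity by reverse auction."""

    def bid_for_capacity(self, dollars: float) -> Allocation:
        # How much capacity does the money buy at the current spot price?
        # (Numbers are made up for the sketch.)
        spot_price_per_cpu_hour = 0.10
        cpus = int(dollars / spot_price_per_cpu_hour)
        return Allocation(cpus=cpus, ram_gb=cpus * 4)

    def scale_job(self, pid: int, allocation: Allocation) -> None:
        # A real provider would live-migrate or re-shard the job here.
        print(f"Giving PID {pid} {allocation.cpus} more CPUs and "
              f"{allocation.ram_gb} GB more RAM")


def expensive_nice(pid: int, dollars: float, cloud: HypotheticalCloud) -> None:
    """Like `nice`, but the priority boost is paid for."""
    cloud.scale_job(pid, cloud.bid_for_capacity(dollars))


# Spend $20 to speed up the long-running model training in process 4242.
expensive_nice(4242, 20.0, HypotheticalCloud())
```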

This sort of flexibility might require cloud applications to distribute themselves across a variable number of CPUs. So perhaps the cloud provider insists that applications be multi-threaded. Or Amazon could offer “expensiveNice” for applications written in a side-effect-free language like Haskell, so GHC can take care of the CPU distribution.