Understanding Windows Azure Blob Storage (A Quick Overview)


Over the past two years, Microsoft has made significant strides in creating a developer-friendly experience for using Windows Azure. From a rather disjointed (and to me, poorly understood) beginning, Windows Azure has grown into a fascinating playground for those of us wanting to explore aspects of “Infrastructure-as-a-Service” (IaaS).

At the time of this writing, a lot of attention has been focused on such sexiness as deploying websites to Azure from integrated source control, and integration of Azure with Microsoft’s WebMatrix platform. However, cloud storage is one of the core elements of Azure’s IaaS offerings, and will likely play a role in any other Azure services you use.

Azure Blob Storage, like Amazon S3, offers a handy (and cheap) way to persist content and make it available across the web. For example, there was a minor kerfuffle last year when GitHub decided to eliminate the “Downloads” feature of their project hosting platform. Services such as Amazon S3 or Azure Blob Storage present an alternative location to host project binary files and/or other resources for easy linking and download.

Or, if you’re like me, you might just find it fun (and informative) to experiment with modeling up your own cloud-based storage client for whatever purpose. On the whole, storage such as provided by Amazon or Azure is incredibly cheap, and it is possible to store a boatload of files for pennies each month (literally).

What’s a Blob?
Azure Blob Storage Model – Overview
Block Blob Specifics
Blob and Container Naming Requirements
Blob URLs, Supported Languages, and API Access

What Can it Do for Me?

According to the Azure website, Azure Blob Storage is “. . . a service for storing large amounts of unstructured data that can be accessed from anywhere in the world via HTTP or HTTPS.” From the same site, common uses for Azure Blob Storage might include (but are definitely not limited to):

Serving images or documents directly to a browser
Storing files for distributed access
Streaming video and audio
Performing secure backup and disaster recovery
Storing data for analysis by an on-premises or Windows Azure-hosted service

While the Azure team appears to place the focus on “large amounts of unstructured data,” blob storage is equally useful for storing data of any size. As previously mentioned, it is possible to create a basic “cloud drive” type application using Azure storage, or Amazon S3 (I believe Dropbox uses Amazon as its backend).

What’s a Blob?

In the most general sense, the term “blob” is commonly understood to mean “Binary Large OBject.” Many of us are familiar with this term from its usage in database-land, where “blob data” might be data stored in our database which does not conform to an established data type as defined by the database. Such data are usually (if the database supports it) persisted as plain binary data (image files come to mind as an example).

Most of the major players in the “cloud” storage space have extended this notion to various generic storage implementations which allow client data of any sort to be uploaded/persisted on the vendor storage server in binary format. Amazon S3 implements a model in which binary data (“Objects”) are persisted in “Buckets.” Windows Azure persists binary data (“Blobs”) in “Containers.”

Of course, there is more to it than that, and that’s what I am discussing in this post.

The Azure Blob Storage Model: Overview

An Azure Storage Account will consist of one or more Containers, which are created and named by the user to hold Blobs. All blobs must be located in a container. In general (and at the time of this writing), an Azure user can have up to five separate storage accounts.

An individual storage account may contain an unlimited number of containers, and an individual container may hold an unlimited number of blobs. However, the total size of all containers in a storage account may not exceed 100 TB. Windows Azure defines two distinct types of blob: Block Blobs and Page Blobs. For the moment, I am going to focus on Block Blobs.

Block Blobs

According to the Azure team, the most common use-cases for blob storage will involve Block Blobs. Block blobs represent binary data that has been segmented into one or more blocks to enable ease of transmission over a network, and sensible management of large data files. The blocks which make up a blob may be of different sizes, up to 4 MB each. Each block within a blob is identified by a Block ID, and may optionally also include an MD5 hash of the blob content. The maximum size for a block blob is 200 GB, and a blob can consist of up to 50,000 individual blocks.
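To make those limits concrete, here is a minimal sketch that works out how many blocks a file of a given size would need. The function name and constants are my own for illustration (they are not part of any Azure SDK), the limits are simply the ones quoted above, and I am assuming 1 MB means 1024 × 1024 bytes.

```python
# Limits as quoted in the text above; they may differ in current Azure releases.
MAX_BLOCK_SIZE = 4 * 1024 * 1024   # 4 MB per block (assuming 1 MB = 1024 * 1024 bytes)
MAX_BLOCKS_PER_BLOB = 50_000       # maximum number of blocks in one block blob


def blocks_needed(file_size: int, block_size: int = MAX_BLOCK_SIZE) -> int:
    """Return how many blocks a file of file_size bytes would occupy."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    if not 0 < block_size <= MAX_BLOCK_SIZE:
        raise ValueError("block_size must be between 1 byte and 4 MB")
    # Integer ceiling division: a partial final block still counts as a block.
    count = (file_size + block_size - 1) // block_size
    if count > MAX_BLOCKS_PER_BLOB:
        raise ValueError("file needs more blocks than a single blob allows")
    return count


# A 10 MB file split into 4 MB blocks needs three blocks (4 + 4 + 2).
print(blocks_needed(10 * 1024 * 1024))
```

Note that 50,000 blocks of 4 MB each works out to roughly the 200 GB ceiling mentioned above, so the two limits are really the same limit seen from two sides.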

The idea here is that a large file may be broken up into blocks, which then may be uploaded or downloaded separately, in any sequence (and importantly, asynchronously) and then re-associated with each other, in the proper sequence. The blocks which make up a blob are associated with that specific blob through a list enumerating all the blocks within the blob.

When blocks are uploaded, they are associated with the designated blob, but do not formally become part of the blob until the blocks are Committed, by supplying the list of blocks. Each block identified in the list by its Block ID is formally made a part of the blob. Blocks which are uploaded but not formally committed remain uncommitted, and are discarded after one week. Uncommitted blocks may be committed at any point prior to being discarded. When a client attempts to write a block for a blob that does not exist, a new blob will be created.
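The upload-then-commit idea is easier to see in a toy model. The class below is an in-memory stand-in I made up for illustration; it is not the Azure SDK, the method names `stage_block` and `commit` are hypothetical, and it ignores details such as the one-week expiry of uncommitted blocks.

```python
from typing import Dict, List


class BlockBlobSketch:
    """In-memory stand-in for a single block blob (hypothetical, not the Azure SDK)."""

    def __init__(self) -> None:
        self.uncommitted: Dict[str, bytes] = {}  # uploaded blocks, keyed by Block ID
        self.committed: Dict[str, bytes] = {}    # data for blocks that are part of the blob
        self.block_list: List[str] = []          # committed Block IDs, in blob order

    def stage_block(self, block_id: str, data: bytes) -> None:
        # Uploading a block only stages it; the blob's content does not change yet.
        self.uncommitted[block_id] = data

    def commit(self, block_ids: List[str]) -> None:
        # Committing supplies the ordered list of Block IDs that make up the blob.
        new_blocks: Dict[str, bytes] = {}
        for block_id in block_ids:
            if block_id in self.uncommitted:
                new_blocks[block_id] = self.uncommitted[block_id]
            elif block_id in self.committed:
                new_blocks[block_id] = self.committed[block_id]
            else:
                raise KeyError(f"unknown block: {block_id}")
        for block_id in block_ids:
            self.uncommitted.pop(block_id, None)
        self.committed = new_blocks
        self.block_list = list(block_ids)

    def content(self) -> bytes:
        # The blob is the committed blocks joined in block-list order.
        return b"".join(self.committed[block_id] for block_id in self.block_list)


blob = BlockBlobSketch()
blob.stage_block("block-2", b"world")    # blocks can arrive in any order
blob.stage_block("block-1", b"hello ")
print(blob.content())                    # b'' (nothing is committed yet)
blob.commit(["block-1", "block-2"])      # the list fixes the final sequence
print(blob.content())                    # b'hello world'
```

The point of the sketch is that the order in which blocks arrive does not matter; only the order of the IDs in the committed list does.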

Blob data can be modified at the block level. That is, individual blocks can be added to an existing blob, existing blocks can be overwritten or replaced, and specific blocks within a blob can be deleted.

Individual blocks are identified by a Block ID property, which is represented as a string. All Block IDs within a given blob must be the same length. The MS Azure site indicates that most blob storage clients use base-64 encoding to create ID strings of equal length. IDs must be unique within each blob, but do not need to be unique between different blobs. In other words, blocks in two different blobs may have the same Block ID.
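One simple way to get equal-length IDs is to zero-pad a block counter and base-64 encode it. This is only a sketch of that idea; the helper name `make_block_id` and the six-digit width are my own choices, not something prescribed by Azure.

```python
import base64


def make_block_id(index: int, width: int = 6) -> str:
    """Build a Block ID by zero-padding the block index and base-64 encoding it."""
    if index < 0 or index >= 10 ** width:
        raise ValueError("index does not fit in the chosen width")
    # Zero-padding first guarantees every encoded ID has the same length.
    raw = str(index).zfill(width).encode("ascii")
    return base64.b64encode(raw).decode("ascii")


print(make_block_id(0))        # MDAwMDAw
print(make_block_id(49999))    # same length as the ID above
```

A width of six digits comfortably covers the 50,000-block limit, and because the padding happens before encoding, every ID comes out as eight characters.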

Block blob storage is designed to facilitate efficient network file handling and asynchronous file transfer. While the above may seem a little complex, the Azure team has created APIs for a number of popular development languages which make much of this easier to deal with for common use cases.

Blob and Container Naming Requirements

Container Names must begin with a letter or a number, can contain only letters, numbers, and the dash (-) symbol, and letters must be all lowercase. Container names must be a minimum of 3 and not more than 63 characters in length.

Blob Names can contain any characters, but special characters and characters reserved for use in URLs must be escaped. Blob names can include upper and lower-case letters, and cannot exceed 1024 characters in length.

Container names and blob names are combined to form part of the URL by which the blob is accessed, so names for each must conform to DNS naming standards when combined.
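If you are building your own client, it can be handy to check names before sending them over the wire. The sketch below encodes only the rules stated in this section; the function names are hypothetical, and the real service may enforce additional rules not covered here.

```python
import re

# First character is a lowercase letter or digit, followed by 2 to 62 more
# lowercase letters, digits, or dashes (3 to 63 characters in total).
_CONTAINER_NAME = re.compile(r"[a-z0-9][a-z0-9-]{2,62}")


def is_valid_container_name(name: str) -> bool:
    """Check a container name against the rules described above."""
    return _CONTAINER_NAME.fullmatch(name) is not None


def is_valid_blob_name(name: str) -> bool:
    """Check only the length rule; escaping is handled when the URL is built."""
    return 1 <= len(name) <= 1024


print(is_valid_container_name("project-downloads"))   # True
print(is_valid_container_name("Project-Downloads"))   # False (uppercase letters)
print(is_valid_container_name("ab"))                  # False (too short)
```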

Blob URLs and API Access

Blobs are accessible through URLs in the following format:

http://<your-storage-account-name>.blob.core.windows.net/<container-name>/<blob-name>
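As a quick illustration, a blob URL can be assembled with nothing more than the Python standard library. This is a sketch under a few assumptions: the storage account name forms the first label of the host name (account.blob.core.windows.net), the helper name `blob_url` is my own, and I default to HTTPS although plain HTTP follows the same pattern.

```python
from urllib.parse import quote


def blob_url(account: str, container: str, blob_name: str, scheme: str = "https") -> str:
    """Assemble the URL for a blob, escaping characters that are reserved in URLs."""
    # quote() leaves "/" alone by default, so "folder-like" blob names keep their slashes.
    escaped = quote(blob_name)
    return f"{scheme}://{account}.blob.core.windows.net/{container}/{escaped}"


print(blob_url("myaccount", "images", "my photo.png"))
# https://myaccount.blob.core.windows.net/images/my%20photo.png
```

The account, container, and blob names in the usage line are placeholders, not real resources.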

The Azure team has created a number of SDKs/APIs for programmatically working with blob storage. As of this writing, the following languages/platforms are supported with SDKs/APIs which expose various Azure services, including Blob Storage:

.NET SDK / .NET API Reference
Java SDK (Windows | Mac | Linux) / Java API Reference
PHP SDK
Node.js SDK (on GitHub)
Ruby SDK (Windows | Mac | Linux)
Python SDK

In addition, Azure defines a REST API for all storage services, usable by any application which can send/receive HTTP or HTTPS requests/responses. Some of the SDKs above rely on the REST API for programmatic access.