
Modeling a Directory Structure on Azure Blob Storage


Windows Azure has matured nicely over the past few years into a very developer-friendly “Infrastructure-as-a-Service” platform. While many of the recent public announcements have focused on Azure Websites, deployment from source control, and the new general availability of Azure Virtual Machines, one of the core features of the Azure platform remains storage.

Specifically, Azure Blob Storage.

If you are not familiar with Azure blob storage, visit the Azure site, or check out an overview of what it’s all about.

Blobs are Great and All That, but What About Files and Folders?

So far as Azure itself is concerned, a blob represents one or more blocks of binary data. Much like data on your local hard drive, the notion of directories, files, and the tree-like hierarchical model so familiar to most users is imposed upon the stored data as an abstraction. In the case of your local hard drive, this model is implemented by the operating system.

Within the Windows Azure storage model, there is no OS to impose such structure. Organization and interpretation of the data structure is left up to the client. It is in this way that we are able to store any type of data on the blob storage platform, and similarly why blob storage is easily consumed by multiple languages and usable from any platform – the structure of the data is platform agnostic.

It’s All in the Name

The blob storage model and associated APIs establish addressing and naming conventions which serve to identify individual blobs within storage containers. We address blobs via URLs according to the following format:

http://<your-storage-account-name>.blob.core.windows.net/<container-name>/<blob-name>

In the above, the URL is composed of your Azure storage account name, the .blob.core.windows.net domain, followed by the name of the container in which the blob is located, and finally the name of the blob itself. However, as you recall from our overview, blob names can contain almost any character, including forward slashes. For example, one could create a blob named Documents/Photos/Graduation Pic.jpg which, assuming the following details, would be addressed as follows:

Storage Account: mystorageaccount

Container Name: mycontainer

Blob Name: Documents/Photos/Graduation Pic.jpg

The blob name above implies that within the container mycontainer, there is a directory named Documents, containing a subdirectory named Photos, in which a file named Graduation Pic.jpg is located.

Access this blob using:

http://mystorageaccount.blob.core.windows.net/mycontainer/Documents/Photos/Graduation%20Pic.jpg
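To see this addressing in action, here is a minimal sketch (assuming the .NET storage client library used later in this post, and the hypothetical account and container names above) which composes the same blob reference entirely on the client and prints its address:

using System;
using Microsoft.WindowsAzure.Storage.Blob;

class BlobUriExample
{
    static void Main()
    {
        // No credentials are needed simply to compose a blob reference and inspect
        // its address (hypothetical account and container names from above):
        var client = new CloudBlobClient(new Uri("http://mystorageaccount.blob.core.windows.net"));
        var container = client.GetContainerReference("mycontainer");
        var blob = container.GetBlockBlobReference("Documents/Photos/Graduation Pic.jpg");

        // Prints the full blob address, mirroring the format shown above
        // (the space in the name appears percent-encoded):
        Console.WriteLine(blob.Uri.AbsoluteUri);
    }
}

Note that no blob, container, or directory is created by the code above; we are only building an address from the account, container, and blob names.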

In the Azure storage account view, a virtual hierarchical directory structure might look like this:

[Image: azure-storage-filenames]

In the image above, note the names of the blobs in the "Name" column. As far as Azure is concerned, these are simply the blob names. However, much like your operating system, the various SDK implementations afford functionality which can optionally parse these names into a directory structure.

In the above, for example, the blobs File 1.txt, File 2.txt, and File 3.txt can be thought of as being at the root level of the storage container, while the blobs named Johns Files/File 1.txt, Johns Files/File 2.txt, and Johns Files/File 3.txt can be thought of as being contained in a directory named Johns Files/. Similarly, we can see that there is an implied subdirectory within the Johns Files/ directory named Music/, again containing three files.
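To make the convention concrete, here is a minimal, SDK-free sketch (using a hypothetical blob name based on the listing above) of how a client might split a delimited blob name back into its implied directory path and file name:

using System;

class BlobNameParsing
{
    static void Main()
    {
        // A blob name as Azure stores it: a flat string that happens to contain slashes.
        // (Hypothetical name based on the example listing above.)
        string blobName = "Johns Files/Music/File 1.txt";

        // Everything up to and including the last delimiter is the implied "directory":
        int lastSlash = blobName.LastIndexOf('/');
        string impliedDirectory = lastSlash >= 0 ? blobName.Substring(0, lastSlash + 1) : "";
        string fileName = blobName.Substring(lastSlash + 1);

        Console.WriteLine("Implied directory: " + impliedDirectory);  // Johns Files/Music/
        Console.WriteLine("File name:         " + fileName);          // File 1.txt
    }
}

As we will see shortly, the SDK performs essentially this kind of interpretation for us.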

If we scroll to the right in the Azure browser view above, we see that the same structure is mirrored in the URL for each file, except the file name now conforms to legal URL naming standards:

[Image: azure-storage-urls]

While Azure storage itself does not recognize the notion of a directory as implied by the delimited blob name, the Azure SDK for your chosen language does, to a certain extent, and assists us in parsing an implied directory structure out of the blob address and name. In this post we will look at the .NET API; however, similar implementations are available for Java, Ruby, Python, and other common platforms.

API Access to the Storage Account and Containers

First we will look at some of the basics involved in accessing an Azure storage account and the containers hosted therein. Using the Azure .NET API, we can create a simple console application which presents the required credentials to the Azure storage account, and then retrieves an IEnumerable<CloudBlobContainer> representing the blob storage containers as follows (note that the various other SDKs and APIs define similar functionality for the appropriate platform/language). To do this, we use the ListContainers() method of the CloudBlobClient class:

// NOTE: these examples assume the Azure storage client library for .NET
// (the Microsoft.WindowsAzure.Storage and Microsoft.WindowsAzure.Storage.Blob namespaces).
// in a real project, one might want to implement some security here:
const string ACCOUNT_NAME = "xiv";
const string ACCOUNT_KEY = "myAccountKey";
const string AZURE_CONNECTION_STRING =
    "DefaultEndpointsProtocol=https;AccountName={0};AccountKey={1}";
static void Main(string[] args)
{
    var storageAccount = CloudStorageAccount.Parse(string.Format(AZURE_CONNECTION_STRING, 
        ACCOUNT_NAME, ACCOUNT_KEY));
    var client = storageAccount.CreateCloudBlobClient();
    foreach (var container in client.ListContainers())
    {
        Console.WriteLine("Container: " + container.Name);
    }
    // This pauses execution so the console output can be viewed:
    Console.Read();
}

In the above, the code simply iterates over the enumerable and prints the name of each container to the console output. However, the CloudBlobContainer class exposes a variety of useful methods designed to facilitate blob data access. For the purpose of this post, we are primarily interested in the ListBlobs() method.

The ListBlobs() Method and the IListBlobItem Interface

The ListBlobs() method returns an instance of IEnumerable<IListBlobItem> which is essentially a list of the blobs within the container. However, the enumerable also includes representations of the implied directory structure.
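As a side note, depending on the version of the client library, ListBlobs() also accepts an optional useFlatBlobListing argument which suppresses the directory abstraction entirely. The following sketch (a rough illustration only, reusing the placeholder account details from the earlier example) contrasts the two listing modes:

using System;
using System.Linq;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class ListingModes
{
    static void Main()
    {
        // Placeholder credentials - substitute a real account name and key:
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=<your-account-key>");
        var container = account.CreateCloudBlobClient().GetContainerReference("mycontainer");

        // Hierarchical listing (the default): top-level blobs, plus CloudBlobDirectory
        // items standing in for the implied first-level "directories":
        foreach (var item in container.ListBlobs())
        {
            Console.WriteLine(item.GetType().Name + "  " + item.Uri);
        }

        // Flat listing: every blob in the container, each with its full slash-delimited
        // name, and no directory abstraction at all:
        foreach (var blob in container.ListBlobs(null, useFlatBlobListing: true).OfType<CloudBlockBlob>())
        {
            Console.WriteLine(blob.Name);
        }
    }
}

For the remainder of this post we will stick with the default, hierarchical behavior, since the implied directory structure is exactly what we are after.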

Items returned in the enumerable will be one of two types, both of which implement the interface IListBlobItem. Actual blob data ("files," so to speak) will be returned as instances of CloudBlockBlob. Directories, which are essentially abstractions created by the API based upon the structure implied by blob names delimited with forward slashes, are returned as instances of CloudBlobDirectory.

The instances of IListBlobItem returned in the enumerable can be cast to either CloudBlobDirectory or CloudBlockBlob. From there, we can access properties of each in order to print representations to the console window. For example, if we want to focus on directories only, we could add a method as follows, which iterates over the enumerable passed as an argument, checks the underlying type of each IListBlobItem, and if the underlying type is CloudBlobDirectory, prints the directory prefix to the console window. We then employ a little recursion and do the same for each CloudBlobDirectory instance:

Define the printCloudDirectories Method:
static void printCloudDirectories(IEnumerable<IListBlobItem> blobList)
{
    foreach (var blobitem in blobList)
    {
        if (blobitem is CloudBlobDirectory)
        {
            var directory = blobitem as CloudBlobDirectory;
            Console.WriteLine(directory.Prefix);
            printCloudDirectories(directory.ListBlobs());
        }
    }
}

Now we can add a line to our Main method as defined in the previous section, and after printing the name of each container, we can pass the output of the ListBlobs() method to our new printCloudDirectories method:

Call printCloudDirectories from Main:
static void Main(string[] args)
{
    var storageAccount = CloudStorageAccount.Parse(string.Format(AZURE_CONNECTION_STRING, 
        ACCOUNT_NAME, ACCOUNT_KEY));
    var client = storageAccount.CreateCloudBlobClient();
    foreach (var container in client.ListContainers())
    {
        Console.WriteLine("Container: " + container.Name);
        // ADD A CALL TO printCloudDirectories:
        Program.printCloudDirectories(container.ListBlobs());
    }
    // This pauses execution so the console output can be viewed:
    Console.Read();
}

The output from our little application, if run against the storage account example above, at this point looks like this:

[Image: output-printCloudDirectories-method]

A Simple Console Demonstration

Let’s take this a few steps further, and create an application which iterates over the containers of a storage account, and writes a text representation of the implied directory structure to the console output.

In our Main method, we will authenticate as before by presenting our credentials. We will then get a reference to an instance of CloudBlobClient. From here, though, we will make a few minor modifications. We define a method named writeAzureDirectoriesToConsole which accepts an argument of IEnumerable<CloudBlobContainer>. Similar to our previous method, this iterates over the enumerable and writes the container name to the console output. From here, however, the output of the ListBlobs() method is passed as an argument to a method named getContainerDirectories, which returns a string representation of the file structure we are after.

Unlike our previous example, the getContainerDirectories method employs a LINQ query to parse directories and blob files separately, and also adds some rudimentary indentation to the output for readability. The indented formatting also suggests the tree-like structure we associate with a hierarchical data structure:

// NOTE: the LINQ queries below also require a using directive for System.Linq.
// in a real project, one might want to implement some security here:
const string ACCOUNT_NAME = "xiv";
const string ACCOUNT_KEY = "MyAccountKey";
const string AZURE_CONNECTION_STRING =
    "DefaultEndpointsProtocol=https;AccountName={0};AccountKey={1}";
static void Main(string[] args)
{
    var storageAccount = 
        CloudStorageAccount.Parse(string.Format(AZURE_CONNECTION_STRING, ACCOUNT_NAME, 
            ACCOUNT_KEY));
    var client = storageAccount.CreateCloudBlobClient();
    writeAzureDirectoriesToConsole(client.ListContainers());
    // This pauses execution so the console output can be viewed:
    Console.Read();
}
static void writeAzureDirectoriesToConsole(IEnumerable<CloudBlobContainer> containers)
{
    foreach (var container in containers)
    {
        string indent = "";
        Console.WriteLine("Container: " + container.Name);
        // Pass the IEnumerable to the recursive function to get "subdirectories":
        Console.WriteLine(getContainerDirectories(container.ListBlobs(), indent));
    }
}
static string getContainerDirectories(IEnumerable<IListBlobItem> blobList, string indent)
{
    // Indent each item in the output for the current subdirectory:
    indent = indent + "  ";
    StringBuilder sb = new StringBuilder("");
    // First list all the actual FILES within the current blob list. No recursion needed:
    foreach (var item in blobList.Where(blobItem => blobItem is CloudBlockBlob))
    {
        var blobFile = item as CloudBlockBlob;
        sb.AppendLine(indent + blobFile.Name);
    }
    // List all additional subdirectories in the current directory, and call recursively:
    foreach (var item in blobList.Where(blobItem => blobItem is CloudBlobDirectory))
    {
        var directory = item as CloudBlobDirectory;
        sb.AppendLine(indent + directory.Prefix.ToUpper());
        // Call this method recursively to retrieve subdirectories within the current:
        sb.AppendLine(getContainerDirectories(directory.ListBlobs(), indent));
    }
    return sb.ToString();
}

As we can see, once we have successfully authenticated using our storage account name and associated account key, we create an instance of CloudBlobClient using the CreateCloudBlobClient() method of our storageAccount instance. From there, we pass an IEnumerable<CloudBlobContainer> (from the ListContainers() method of our client object) to our hacked-together method writeAzureDirectoriesToConsole.

The writeAzureDirectoriesToConsole method simply iterates over the containers passed in, writes the name of each container to the console, and passes the output of each container's ListBlobs() method to the slightly more complex getContainerDirectories method, along with an empty string as an initial indentation level. The indentation will help visually display our directory structure in the console by indenting subdirectories and their contents.

The getContainerDirectories method adds two spaces to the indent parameter passed in, then uses a LINQ query to return an IEnumerable<IListBlobItem> for which the instances of IListBlobItem can all be cast to type CloudBlockBlob. Each instance is so cast, and then written to the StringBuilder instance for console output after pre-pending the indent.

Next, we retrieve an IEnumerable<IListBlobItem> for which each instance can be cast to type CloudBlobDirectory. As we have seen, the "directories" obtained this way do not represent actual blobs, but rather an API interpretation of the blob names. Again, each "directory" prefix is pre-pended with the indentation, and then appended to the StringBuilder for output to the console. At this point, however, the getContainerDirectories method is called recursively and passed the IEnumerable<IListBlobItem> returned from the current directory's ListBlobs() method, and the result is also appended to the StringBuilder for eventual output.

If we run our example against the cloud storage account pictured above, the console output looks something like this:

[Image: console-output-simple-messy]

Simplify the Structure for Display:

We can go a step better here by trimming off the redundant sub-directory paths before we display our tree structure. In the following, I added some lines of code to the getContainerDirectories method which do just that:

static string getContainerDirectories(IEnumerable<IListBlobItem> blobList, string indent)
{
    // Indent each item in the output for the current subdirectory:
    indent = indent + "  ";
    StringBuilder sb = new StringBuilder("");
    // First list all the actual files within the current blob list. No recursion needed:
    foreach (var item in blobList.Where(blobItem => blobItem is CloudBlockBlob))
    {
        var blobFile = item as CloudBlockBlob;
        // If the current blob item is at the root of a container, there
        // will be no parent directory. Use the file name as-is:
        string outputFileName = blobFile.Name;
        // Otherwise, remove the parent directory prefix from the displayed name:
        if (blobFile.Parent != null)
        {
            outputFileName = blobFile.Name.Replace(blobFile.Parent.Prefix, "");
        }
        // append to the output:
        sb.AppendLine(indent + outputFileName);
    }
    // List all additional subdirectories in the current directory, and call recursively:
    foreach (var item in blobList.Where(blobItem => blobItem is CloudBlobDirectory))
    {
        var directory = item as CloudBlobDirectory;
        string directoryName = directory.Prefix;
        if(directory.Parent != null)
        {
            directoryName = directoryName.Replace(directory.Parent.Prefix, "");
        }
        sb.AppendLine(indent + directoryName.ToUpper());
        // Call this method recursively to retrieve subdirectories within the current:
        sb.AppendLine(getContainerDirectories(directory.ListBlobs(), indent));
    }
    return sb.ToString();
}

In the code above, we have simply added some logic to replace the parent directory prefix in the displayed names of subdirectories, using the Parent property of both CloudBlobDirectory and CloudBlockBlob. In both cases, the Parent property returns the CloudBlobDirectory which is the parent of the current item. If there is no parent (in other words, if the current item is at the container root level), then the Parent property will return null, so we need to test for this.
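As a quick illustration of that relationship, here is a small sketch (assuming placeholder account details and a hypothetical blob name matching the example structure shown earlier) of the Parent property and the Prefix it exposes:

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class ParentPropertyExample
{
    static void Main()
    {
        // Placeholder credentials - substitute a real account name and key:
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=<your-account-key>");
        var container = account.CreateCloudBlobClient().GetContainerReference("mycontainer");

        // A hypothetical blob name matching the example structure shown earlier:
        CloudBlockBlob blob = container.GetBlockBlobReference("Johns Files/Music/File 1.txt");

        // Parent is the implied directory containing the blob; its Prefix is the full
        // delimited path from the container root, including the trailing slash:
        CloudBlobDirectory parent = blob.Parent;
        Console.WriteLine(parent.Prefix);                         // Johns Files/Music/
        Console.WriteLine(blob.Name.Replace(parent.Prefix, ""));  // File 1.txt
    }
}

Stripping the parent Prefix from the blob name is exactly what the Replace() calls in the modified getContainerDirectories method are doing.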

After making the above modifications, when we run our code the console output is a little more readable:

[Image: console-output-simple-cleaner]

But . . . What Good Is It?

So far, we have taken a very cursory look at the manner in which Windows Azure allows us to impose a directory structure on our blob storage. We can see how we might expand upon these and other aspects of the blob storage API to create a fairly advanced file management application (for example, a desktop or web client) to facilitate management of data, and/or create a stand-alone "cloud drive" type service. As an example, the screenshot below shows an early version of a desktop client I am putting together for an upcoming post; note that it is currently pointed at the same Azure storage account as our console example.

Example of a More Developed Use Case:

[Image: azure-storage-native-client-example]

The various APIs/SDKs offer a host of properties and methods allowing us to take advantage of Azure Blob Storage features. With the ever-growing cloud-based nature of our daily computer usage, and the nearly ubiquitous presence of the internet in our daily lives, the ability for our applications to integrate with scalable, high-availability storage is not going away soon. Services such as Windows Azure, Amazon S3, and other API-based cloud storage services will become more and more important in our application architectures.

I will be exploring this topic further in the near future. I hope to shortly have a post discussing implementation of a very basic desktop client application similar to that pictured above. Additionally, I hope to have a similar example of a web-based client which implements at least some basic but effective authorization.
