My previous blog post covered the need for handling state across devices/users and introduced the different Windows Azure storage options. In this post, I want to discuss the approach to data architecture in more detail.
Why not just use SQL databases?
While SQL databases provide many of the functionality known from a RDBMS they come with a higher price point and pretty hard size limitations (150GB as of March 2013). This makes them great for solutions with a predictable amount of data and scenarios which benefit from RDMBS capabilities such as Transact-SQL support. Another benefit might be the reuse of your client libraries because tabular data stream (TDS) being the communication protocol for both SQL Server and SQL databases.
However most services will have the need to store and query an increasing amount of data which pushes a single database at its scale up limitations. Since cloud computing is based on scale out we’re soon confronted with the challenge to partition our data across multiple storage nodes or different storage technologies (such as Tables, Blobs, Hadoop on Azure, SQL databases, …).
While traditional reasons for partioning where predominately about horizontal partitioning (e.g. sharding) the cloud provides new reasons for data partitioning such as cost optimization through the usage of different storage technologies or the ability to only temporarily store data (e.g. when running a Monte Carlo simulation on a Hadoop cluster on Windows Azure).
In horizontal partitioning, we spread all data across similar nodes to achieve massive scale out of data and load. In such a scenario, all queries within a partition are fast and simple while querying data cross-partitions becomes expensive. An example of horizontal partitioning is the distribution of an order table according to the customer which placed the order. In this example we partition the order table using the customer as the partition key. This would make it very efficient for retrieving orders that belong to a specific customer but very ineffective to retrieve information that involves cross customer queries such us “What are the customers that ordered product xyz”.
In vertical partitioning, we spread data across dis-similar nodes to take advantage of different storage capabilities within a logical dataset. By doing so, we can leverage more expensive indexed storage for frequently queried data but store large data entities in cheaper storage (such as blob and tables). For instance, we could store all order information in a SQL database except the order documents, which we store as pdf in blob storage. The downside of this approach is that retrieving a whole row requires more than just one query.
In hybrid partitioning we take advantage of horizontal and vertical partitioning within the same logical dataset. For instance leverage horizontal partitioning across multiple similar SQL databases (sharding) but use blob storage to store the order documents.
To take advantage of cheap cloud storage we must partition our data.
In all partitioned scenarios it is cheap to query data within a partition but expensive to query it across multiple partitions or storage types. However since storage is fairly cheap and available in unlimited capacity, it is a very common approach to aggressively duplicate data to ensure every query includes a partition key. By doing so, we optimize the service for data retrieval. For example, if we have an order table which is partitioned by customers, it is expensive to retrieve a list of customers which ordered product xyz. This is because we can’t provide the query with a partition key. One way to address this problem is to create a second table which duplicates the data but uses product as the partition key. We basically optimize our service for data retrieval and not for data inserts. Which is a fundamental change for many of us used to SQL databases.