Big data solutions: It’s time to start thinking outside the box
By Kathi Kellenberger on February 25, 2016
When it comes to big data solutions, it's time to start thinking outside the box. And that's because there's no perfect solution for each and every situation; different circumstances and needs require different techniques. Teaching T-SQL classes means that I often get questions about this. Most of the time I simply tell people that the "best technique" really depends. First off, it depends on having all of the necessary information to make that kind of decision, and, again, there’s usually no single right answer that will work for every situation.
There are no absolutes
Of course, there are a few obvious answers, like running DBCC DROPCLEANBUFFERS on your production SQL Server (don’t do it!). But the reality is that there are few absolute rules when it comes to SQL Server. I tell my students that every technique they know in T-SQL is a tool in their arsenal. Even the dreaded cursors have their strengths and should be used when it makes sense.
Since 2002, I have focused my career on SQL Server, and I’m usually called in to help customers with existing solutions hosted on SQL Server. SQL Server has been my world, and I’ve found that there are enough customers who need help with it that I don’t need to spend much time outside of this world. Unfortunately, I have had to learn that, despite how I feel about SQL Server, it is also just another tool in the toolbox.
Recently, I was asked to help a customer with a 1.2 TB database. The customer receives files from a vendor on a quarterly basis and imports the data from the files into SQL Server. Then the customer performs some aggregations on the data for analysis. They called on me for advice on how to load data more efficiently and make the calculations faster. Since the data was already living in SQL Server, I didn’t really think about moving it anywhere.
My first thought was that the large tables in the database needed to be partitioned. SQL Server partitioning allows you to easily move data into and out of a partitioned table based on a partition key. This looked like a very easy problem to solve except for one big issue: The server was not running Enterprise Edition, which is required for the partition feature. Upgrading to Enterprise Edition is an expensive proposition.
We then took a look at the process that performed the calculations. Processing 8000 rows took three minutes. After helping with indexes and some code optimizations, we were able to process 8000 rows in just eight seconds—not a bad improvement.
Finding other solutions
Months later, during a trip to the customer site for other work, they were excited to tell me about their new NoSQL solution hosted in Azure. The new solution allows them to process 20,000 rows in less than a second. I was anxious to learn more. It turned out that, for a minimal cost, they could use Azure Tables to store the data, rather than within a traditional database. Since this customer is only doing analytics, not transactional processing, this works.
It’s easy to copy data to Azure Tables. You don’t have to worry about locking or blocking. Azure Tables can be partitioned. The customer figured out the best way to partition the data, ending up with four million partitions. To process the data, they’re using a C# program. The process can be multi-threaded over all the partitions, and that’s why it’s so fast. It reminds me of the old COBOL processing where calculations are performed over files in code, but Azure Tables are much, much faster.
The customer has ideas for additional functionality using these techniques. By using multiple storage accounts in Azure, they can segregate different processes so that they all run fast with no interference between them. When hosted in SQL Server, there are limits based on the hardware in place. Those limits are eliminated with Big Data solutions.
This is an experience I wanted to share, because I learned quite a lot from it. For one, I need to start thinking outside the SQL Server box. If you’ve been stuck inside a box of your own, it’s time to make some room for other solutions. I’m looking forward to experimenting with the many solutions available, especially those found on Azure like Machine Learning and HDInsight. It’s certainly an exciting time to be in tech!