The Big Data components of Azure let you build solutions that can process billions of events, using technologies you already know. In this course, we build a real-world Big Data solution in two phases, starting with just .NET technologies and then adding Hadoop tools.
How do you make sense of Big Data? When you’re receiving 100 million events per hour and you need to save them all permanently, but also process key metrics to show real-time dashboards, what technologies and platforms can you use? This course answers those questions using Microsoft Azure, .NET, and Hadoop technologies: Event Hubs, Cloud Services, Web Apps, Blob Storage, SQL Azure, and HDInsight. We build a real solution that can process ten billion events every month, store them for permanent access, and distill key streams of data into powerful real-time visualizations.
Ingesting Data into Event Hubs

Hey, how are you doing? I'm Elton, and this is Ingesting Data into Event Hubs, the next module in Real World Big Data in Azure. At the core of a big data solution is a message queue which can receive huge numbers of events and make them available for processes to read. In Azure, we have Event Hubs, which can scale to receive millions of messages per second, store them for up to seven days, and make them available for multiple processes to read. We can plug in different components to receive messages, and they all get a copy of every event, which means there's really no limit to what we can do with our platform. In this module, we're going to look at how Event Hubs works and why it's a key part of the solution. We'll create an event hub and see how it can be configured. Some parts of the setup are fixed once it's created, so you need to understand what you can change and when. We'll look at why we're using our own API rather than having the client send to Event Hubs directly, and then we'll plug the REST API from module one into the event hub and start capturing our events. We'll keep our focus on running efficiently and reliably, and we'll see some of the practical limits of Event Hubs which you need to know about to get the best from the platform. Then we'll look at how we can test our solution outside of Azure, so we can build up a suite of end-to-end tests that give us confidence in the whole stack. And lastly, we'll deploy what we have so far to the cloud, push some events through, and see how our JSON logging framework gives us good traceability.
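The module above mentions two ideas worth making concrete: events are serialized as JSON, and an event hub spreads them across a fixed number of partitions, with a partition key keeping related events together. The course's own code is .NET, so this is just an illustrative Python sketch of the idea; the partition count, event fields, and hash function are assumptions (Event Hubs uses its own internal hash), not the course's actual implementation.

```python
import hashlib
import json

# Hypothetical partition count; in Event Hubs this is fixed when the hub is created.
NUM_PARTITIONS = 16

def partition_for(device_id: str, partitions: int = NUM_PARTITIONS) -> int:
    """Map a partition key to a partition with a stable hash, so every event
    from the same device lands on the same partition. Event Hubs does this
    internally; the exact hash it uses is different."""
    digest = hashlib.md5(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % partitions

def make_event(device_id: str, metric: str, value: float) -> str:
    """Serialize an event as a JSON string; field names here are illustrative."""
    return json.dumps({"deviceId": device_id, "metric": metric, "value": value})

# The same partition key always routes to the same partition:
assert partition_for("device-042") == partition_for("device-042")
```

The point of the stable hash is ordering: because one device's events all go to one partition, a reader processing that partition sees them in the order they arrived.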
Storing Event Data for Batch Queries

Hey, how are you doing? I'm Elton, and this is Storing Event Data for Batch Queries, the next module in Real World Big Data in Azure. In big data solutions, it's typically the real-time output that people want to see: the colorful dashboards that are constantly updating. They look great, and they distill huge amounts of data into manageable chunks of really useful information. We'll be building that stuff soon, but the other output of big data is deep storage. It doesn't look flashy, but if anything, it's even more important. Deep storage means storing every event permanently in its raw form so that we can query it later. This is a real enabler for the business. When you put your big data solution live, often people don't know exactly what output they want to see. But later, when someone asks, "Can you tell me how many of these?" or "When is it people are doing this?", that's when deep storage comes in. As long as you've been storing all those events, you can build a query to answer new questions, going back over all the historical data. We're going to use Azure Blob Storage for our deep storage component. Blob Storage is Hadoop-compatible, so it's easy to query our events. It's reliable and highly available, and it's cheap. We're just storing events on disks somewhere in a data center; we're not using any compute resource for the actual storage, and only minimal resources to pull events from the hub and store them as blobs. So if we're handling billions of messages every week and storing terabytes of data every month, our Azure bill is still going to be comparatively small. And we're going to have all the data we need to answer any type of question, from finding one single event to correlating and aggregating across billions of events.
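One practical detail behind "Hadoop-compatible" deep storage is the blob naming scheme: if blob names encode the event time as a path, batch jobs over the container can prune by date instead of scanning everything. The sketch below is a hypothetical naming convention in Python (the course's code is .NET, and its exact scheme may differ), showing the kind of time-partitioned path you might generate for each batch of events pulled from the hub.

```python
from datetime import datetime, timezone

def blob_path(event_time: datetime, partition: int, sequence: int) -> str:
    """Build a time-partitioned blob name such as
    'events/2015/06/21/14/partition-03_000123.json'.
    A batch query for one day only needs to read blobs under that day's
    prefix. The prefix layout and zero-padding here are illustrative,
    not the course's exact convention."""
    return (
        f"events/{event_time:%Y/%m/%d/%H}/"
        f"partition-{partition:02d}_{sequence:06d}.json"
    )

path = blob_path(datetime(2015, 6, 21, 14, 30, tzinfo=timezone.utc), 3, 123)
# 'events/2015/06/21/14/partition-03_000123.json'
```

Zero-padding the partition and sequence numbers keeps lexicographic blob listings in chronological order, which matters because blob storage lists by name.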
Using HBase for Storage

Hey, how are you doing? I'm Elton, and this is Using HBase for Storage, the final module in Real World Big Data in Azure. We're going to learn about HBase, which is a NoSQL database specifically designed for storing massive quantities of data in a flexible and highly performant way. It's a tool which is perfect for real-time analytics of big data, and the Azure implementation in HDInsight is fully featured, albeit a couple of versions behind the official Apache release. HBase is an open-source implementation of Google's Bigtable database, which was designed for indexing web searches, and scales to run reliably across thousands of nodes and store thousands of terabytes of data securely. HBase really deserves a course of its own, but we'll cover enough here to show you where it's a good fit, how to integrate it with the other technologies we've seen in the course, and the key factors to consider when you're designing your tables in HBase, which needs a very different approach from modeling data in SQL and other NoSQL databases. In this module, we'll use HBase as an alternative data store for real-time analytics. Because we can store much more data than we could in SQL, but still access it very quickly, we can support a much finer level of detail in our analytics and visualize more meaningful data in our dashboards. We'll populate an HBase table containing device error logs using Storm, and then show the data with a new dashboard in Dashing.net. We'll see how to set up HBase in Azure, how to read and write from it using the .NET SDK, and how to think about structuring your data for fast access. As this is the last module in the course, we'll end with a summary of what we've learned and the solutions that we've built.
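The "very different approach" to table design mostly comes down to the row key, because HBase stores rows sorted lexicographically by key and has no secondary indexes. A common pattern for time-series data like device error logs is to lead with the entity id and follow with a reversed, zero-padded timestamp, so each device's rows form one contiguous, newest-first range. This Python sketch illustrates that pattern under assumed key fields; it is not the course's actual .NET key design.

```python
# Largest 13-digit millisecond timestamp we'll subtract from; 13 digits
# covers epoch-millisecond values until roughly the year 2286.
MAX_TS = 9_999_999_999_999

def row_key(device_id: str, ts_millis: int) -> str:
    """Compose an HBase row key of '<deviceId>|<reversed timestamp>'.
    Leading with the device id groups one device's errors into a single
    contiguous key range; the reversed, zero-padded timestamp makes newer
    events sort first, so a scan for 'the latest N errors for a device'
    reads only the first N rows of that range."""
    reversed_ts = MAX_TS - ts_millis
    return f"{device_id}|{reversed_ts:013d}"

# Newer events sort before older ones for the same device:
newer = row_key("device-042", 1_435_000_000_000)
older = row_key("device-042", 1_434_000_000_000)
assert newer < older
```

The same reasoning explains what to avoid: a key that starts with a plain timestamp would send all current writes to one region server (a hotspot), while leading with the device id spreads the write load across the key space.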