When it comes to large-scale data analysis, NoSQL databases may pose unique challenges - especially with reporting and aggregation. In this course, you will learn how to go beyond simple queries against collections. We will cover key techniques and strategies for digging up and aggregating data in a Big Data world using MongoDB.
Getting Started with the Aggregation Framework
The aggregation framework is a very powerful tool. To you, the user, it all boils down to running the aggregate command on a collection. What does the aggregation framework do that is different from the find command? For one, it has mechanisms that allow you to transform source fields in the original documents. This is very important, especially since your documents can be very complex. Your documents may contain items in arrays or subdocuments. Your fields might be strings and you might only need to address part of them. Your field might be a date and you may want to process only the month part. The aggregation framework supports many operators, allowing you to do just that and so much more. Secondly, the aggregation framework inspects documents and acts as a collector. When you just query for documents, find looks at each document on its own and either outputs it or outputs nothing. But in the aggregation framework, documents or fields are inspected and some memory of what is collected must exist. This is easily illustrated when you think about the sum operator. To sum numbers you need a running sum, an accumulator. Each input field is added to the running sum. So the sum operator has storage that outlives the scope of a single document being inspected. The find command simply can't do this kind of stuff.
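To make the running-sum idea concrete, here is a minimal sketch in Python. The sample documents and the field names (item, amount, date) are made up for illustration. The pipeline is written as plain dictionaries, the form a driver such as PyMongo accepts in collection.aggregate(pipeline), and the small helper function only imitates in plain Python what the $group stage accumulates; it is not how the server implements it.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample documents standing in for a "sales" collection.
sales = [
    {"item": "shirt", "amount": 10, "date": datetime(2014, 1, 5)},
    {"item": "hat", "amount": 20, "date": datetime(2014, 1, 20)},
    {"item": "shirt", "amount": 5, "date": datetime(2014, 2, 3)},
]

# Pipeline as you would hand it to collection.aggregate(pipeline).
# $month pulls one part out of a date field; $sum keeps a running total.
pipeline = [
    {"$group": {"_id": {"$month": "$date"}, "total": {"$sum": "$amount"}}},
]


def total_by_month(docs):
    """Plain-Python imitation of what the $group stage above accumulates."""
    totals = defaultdict(int)  # storage that outlives any single document
    for doc in docs:
        totals[doc["date"].month] += doc["amount"]
    return dict(totals)


print(total_by_month(sales))  # {1: 30, 2: 5}
```

Notice that the totals dictionary lives outside the loop body. That is exactly the kind of memory a per-document query has no place for.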
The $group Pipeline Operator
If you are going to run reports, you are going to use the group pipeline operator. Using group is a skill you'll find yourself using day in and day out if you run reports of any kind. Why is that? Aggregation is about collecting items together into groups. There. You see. I said group. Group lets us accumulate things that have something in common. Some similarity. We create groupings around those commonalities. Group things by color, group things by date, and so forth. The group operator lets us combine the data in each similarity group in many useful ways. How many shirts were the same color? What's the average snowfall per day? In this module we'll study the group pipeline operator and get familiar with its pivotal role in the aggregation framework. So without further ado, I'm going to jump right in and show you the group operator in action.
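The two questions above could be written as $group stages roughly like this. It is a sketch only: the field names (color, day, snowfall) and the run_report helper are assumptions made for illustration, and the collection argument is expected to be a PyMongo-style collection object.

```python
# How many shirts were the same color?
# Hypothetical field name: "color". Each document adds 1 to its group's count.
shirts_per_color = [
    {"$group": {"_id": "$color", "count": {"$sum": 1}}},
]

# What's the average snowfall per day?
# Hypothetical field names: "day" and "snowfall".
average_snowfall_per_day = [
    {"$group": {"_id": "$day", "avgSnowfall": {"$avg": "$snowfall"}}},
]


def run_report(collection, pipeline):
    """Run a pipeline on a PyMongo-style collection and return the result documents."""
    return list(collection.aggregate(pipeline))
```

In both pipelines the _id expression names the commonality, and every other field in the stage is an accumulator computed over the documents that share it.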
Document Selection
Until now, we've processed every document in the collection. In the real world it's more likely you're looking for only a subset of the data, some time window, a particular query, or some other matching criteria. Let's look at how to use aggregate and select particular documents to be processed or returned.
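As a rough sketch of selecting a time window, a $match stage can sit in front of the grouping so that only matching documents are processed. The field names (date, item, amount) and the chosen dates are assumptions for illustration.

```python
from datetime import datetime

# Hypothetical time window: January 2014, end date exclusive.
start = datetime(2014, 1, 1)
end = datetime(2014, 2, 1)

january_totals = [
    # Select only documents whose "date" falls inside the window.
    {"$match": {"date": {"$gte": start, "$lt": end}}},
    # Then total the "amount" field per "item" for the selected documents.
    {"$group": {"_id": "$item", "total": {"$sum": "$amount"}}},
]
```

The $match stage uses the same query syntax you would pass to find, so existing query criteria carry over as they are.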
Shaping Documents
Your operational data store is typically optimized for your real-time application. This is a healthy focus to keep when designing your document structure, but as a result, documents can become quite large and complex. When it comes time to dig out data, it would be nice to have your documents flatter, more uniform, and trimmed of extra information that is not relevant to the aggregation. Let's break it down to some reshape goals and their motivation. Flatten, for documents that contain arrays of aggregation-relevant data or nested subdocuments with specific fields that need to be handled. Uniform, for collections that contain documents of differing types, field arrangements, or field names. Aggregating across them requires that a uniform field be synthesized. Trim, to reduce the load on the server. Extracting and processing only the minimum data from each field reduces resource usage and speeds up processing.
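Here is one way the three reshape goals might look in a pipeline. This is a sketch under assumptions: the documents are imagined to be orders with an items array, and the customer name is imagined to be stored under either customerName or customer_name depending on the document.

```python
reshape_orders = [
    # Flatten: emit one document per element of the "items" array.
    {"$unwind": "$items"},
    {"$project": {
        # Trim: drop _id and keep only the fields the report needs.
        "_id": 0,
        # Uniform: synthesize a single "customer" field from two
        # hypothetical spellings; $ifNull falls back to the second
        # expression when the first is null or missing.
        "customer": {"$ifNull": ["$customerName", "$customer_name"]},
        # Flatten: lift nested fields to the top level.
        "sku": "$items.sku",
        "quantity": "$items.qty",
    }},
]
```

Stages placed after these two see small, flat documents with the same field names, whatever the source documents looked like.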
Other Operators
You have seen a bunch of stages and operators in action. Before we move on, I'd like to very briefly scan the most common operator functions. Keep in mind, MongoDB is under active development and evolves pretty quickly, so this list is going to expand over time. If you badly need something, well, you can also contribute to the open source project, or maybe code it up yourself.
Performance (Aggregate)
Performance is a major concern when you run your aggregations. Applications typically touch a few documents here and there, but reports typically run over most or all documents in a collection. This is not just a concern for you, the consumer of the aggregation. It can adversely affect your running software. It is well worth taking the time to analyze what your aggregation can do to your runtime system as well as what you can do to reduce the footprint of your aggregations.
Map/Reduce
The aggregation framework is great and very powerful, but sometimes you need a little more oomph. Well, wait, oomph is a bit vague. Let's break it down. You might want to do some heavy processing on each document in a collection. You might need fractions or functions or operators that are not currently available in the aggregation framework. You might need to save your intermediary or final result. You might bump into the aggregation framework's limits.
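To give a feel for the model before going further, here is a plain-Python sketch of the map, emit, and reduce flow. It only illustrates the idea and is not MongoDB's API: in MongoDB the map and reduce functions are written in JavaScript and run on the server. The map_reduce helper, the sample orders, and the field names are all made up for illustration.

```python
from collections import defaultdict


def map_reduce(docs, map_fn, reduce_fn):
    """Illustrative model: map each document, group emitted values by key, reduce."""
    emitted = defaultdict(list)
    for doc in docs:
        # map_fn may emit any number of (key, value) pairs per document.
        for key, value in map_fn(doc):
            emitted[key].append(value)
    # reduce_fn folds all values emitted for one key into a single result.
    return {key: reduce_fn(key, values) for key, values in emitted.items()}


def map_order(doc):
    # Arbitrary per-document processing can happen here before emitting.
    yield doc["customer"], doc["amount"]


def reduce_total(key, values):
    return sum(values)


# Hypothetical sample documents.
orders = [
    {"customer": "ann", "amount": 10},
    {"customer": "bob", "amount": 7},
    {"customer": "ann", "amount": 5},
]

totals = map_reduce(orders, map_order, reduce_total)
print(totals)  # {'ann': 15, 'bob': 7}
```

Because map and reduce are ordinary functions, they can contain whatever logic you need, which is where the extra flexibility over the built-in pipeline operators comes from.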
Map/Reduce - Digging Deeper
We did some basic Map/Reduce. Now let's go deeper.