Author avatar

Chris Behrens

Grouping and Aggregating Data with LINQ

Chris Behrens

  • Nov 29, 2018
  • 7 Min read
  • 90,834 Views
  • Nov 29, 2018
  • 7 Min read
  • 90,834 Views
C#
LINQ

The Problem This Solves

It's helpful to be able to aggregate data according to certain criteria. For example, if we had a thousand orders yesterday, it might be very useful to query the database and know whether those thousand orders were from a thousand different products or two:

1SELECT COUNT(Product) as TOTAL, Product FROM Orders GROUP BY Product
tsql

This gives us results like this:

1TOTAL	Product
2
3---------------------
4983	Lead to Gold Transmuter
517	Selfie Stick

The group by statement gives us this same capability in LINQ – by grouping our objects, we can perform aggregate functions like count, sum, max and min on our sets, and create meaningful information from our data.

Our Scenario

We've been tasked with organizing the employee data from the Selecting and Transforming Data with LINQ, a set of employees, by employee location. In our employee data, each employee object has a group code that looks like this:

AA123

In this code, the first two letters indicate a location code, followed by three numbers indicating the office number. The first step in creating any query is to refine the fields we want to return – if the quality of our question is poor, we're unlikely to end up with a high-quality answer. The fields we want to return are:

  1. Location – the location in question
  2. Total – the total number of employees per location

Each result will have an int total, a number, and a string Location- the two-letter code that represents where the employees are. To make this happen, we're going to return an anonymous type. Anonymous types allow us to return a type that we define at query-time and is generated by the compiler.

A fluent LINQ query which returns EVERYTHING looks like this:

1from e in employees select employee
csharp

To return our anonymous type, we'll modify the query to look like this:

1from e in employees select new {Location, Total};
csharp

The compiler gives us two errors with this:

1_The name 'Location' does not exist in the current context_
2
3_The name 'Total' does not exist in the current context_

This makes perfect sense – we haven't defined those elements yet. To make this work, we need to group the content together. To do this, we use the group by statement:

1from e in employees
2group e by e.GroupCode.ToUpper().Substring(0, 2) into g
3select new { Location, Total };
csharp

Here, we're taking the first two characters of the GroupCode property of the Employee object and grouping the set by that value into a group set we've designated as g.** But we've still got our same two compiler errors – the system doesn't know what Location or Total mean yet.

At this point, the contents of g look like this:

AA (List of Employee Objects with GroupCode that starts with AA)

ZZ (Employee Object with GroupCode that starts with ZZ)

The AA and ZZ are clearly the values of the Location field we created .

1e.GroupCode.ToUpper().Substring(0, 2)
csharp

The lists contain the Employees associated with each Location. The key that we create for grouping, in this case the first two characters of the group code, is made available in a Key property on the g object:

1var groups = from e in employees
2group e by e.GroupCode.ToUpper().Substring(0, 2) into g
3select new { Location = g.Key, Total };
csharp

One compiler error disappears now that we've defined where the location is coming from. Next, we'll define Total as the count of objects associated with that key in group g:

1from e in employees
2group e by e.GroupCode.ToUpper().Substring(0, 2) into g
3select new { Location = g.Key, Total = g.Count() };
csharp

Count() here is a method, not a property. We're asking for the size of the employee list and that is our total. This gives us the following result:

AA 2

ZZ 1

Which is exactly what we wanted.

Having By Statements in LINQ

Let's say that you produce this beautiful report for your boss and she says, "Ah…you did give me exactly what I asked for. But what I actually wanted was a report of those groups that have duplicates. If they only have one employee, I don't want them included in the results."

If you're familiar with T-SQL you're already saying, "We need a HAVING clause". HAVING statements are like WHERE clauses for aggregate values. We would say something like this in T-SQL:

1SELECT COUNT(Location) as TOTAL, Location FROM Employees GROUP BY Location HAVING COUNT(Location) > 1
tsql

So you might give this a try in LINQ:

1var groups = 
2from e in employees
3group e by e.GroupCode.ToUpper().Substring(0, 2) into g
4having g.Count() > 1
5select new { Location = g.Key, Total = g.Count() };
csharp

No dice, the compiler doesn't recognize the term _having _in this context, so there are a number of errors that result from this. The good news is that, in T-SQL, there is the separation of filtering for values with WHERE from the filtering of aggregates with HAVING. There is no such separation in LINQ, it's all just WHERE:

1var groups = 
2from e in employees
3group e by e.GroupCode.ToUpper().Substring(0, 2) into g
4where g.Count() > 1
5select new { Location = g.Key, Total = g.Count() };
csharp

All we had to do was change "having" to "where" and it works. This query will filter out results which only have a single employee associated with the location and return a single result with the same format as before:

AA 2

Other Aggregations

GroupCode is a string, so there's not a lot that we can do with it. If we were working with a number instead, Salary, for example, we could perform numeric aggregations like SUM:

1from e in employees
2group e by e.GroupCode.ToUpper().Substring(0, 2) into g
3where g.Count() > 1
4select new { Location = g.Key, Total = g.Count(), TotalSalary = g.Sum(x => x.Salary)};
csharp

This would tell us exactly how much we were paying all the employees in a given location. We could add additional metrics like this:

1from e in employees
2group e by e.GroupCode.ToUpper().Substring(0, 2) into g
3where g.Count() > 1
4select new { Location = g.Key, Total = g.Count(),
5    TotalSalary = g.Sum(x => x.Salary),
6    AverageSalary = g.Average(x => x.Salary),
7    CheapestEmployee = g.Min(x => x.Salary),
8    MostExpensiveEmployee = g.Max(x => x.Salary)
9};
csharp

That gives a single result – remember that this query only returns locations with duplicates – that looks like this:

Location = "AA"

Total = 2

TotalSalary = 250

AverageSalary = 125

CheapestEmployee = 100

MostExpensiveEmployee = 150