Skip to content

Contact sales

By filling out this form and clicking submit, you acknowledge our privacy policy.

Grouping and Aggregating Data with LINQ

By grouping our objects, we can perform aggregate functions like count, sum, max and min on our sets, and create meaningful information from our data.

Nov 29, 2018 • 7 Minute Read

The Problem This Solves

It's helpful to be able to aggregate data according to certain criteria. For example, if we had a thousand orders yesterday, it might be very useful to query the database and know whether those thousand orders were from a thousand different products or two:

      SELECT COUNT(Product) as TOTAL, Product FROM Orders GROUP BY Product
    

This gives us results like this:

      TOTAL	Product

---------------------
983	Lead to Gold Transmuter
17	Selfie Stick
    

The group by statement gives us this same capability in LINQ – by grouping our objects, we can perform aggregate functions like count, sum, max and min on our sets, and create meaningful information from our data.

Our Scenario

We've been tasked with organizing the employee data from the Selecting and Transforming Data with LINQ, a set of employees, by employee location. In our employee data, each employee object has a group code that looks like this:

AA123

In this code, the first two letters indicate a location code, followed by three numbers indicating the office number. The first step in creating any query is to refine the fields we want to return – if the quality of our question is poor, we're unlikely to end up with a high-quality answer. The fields we want to return are:

  1. Location – the location in question
  2. Total – the total number of employees per location

Each result will have an int total, a number, and a string Location- the two-letter code that represents where the employees are. To make this happen, we're going to return an anonymous type. Anonymous types allow us to return a type that we define at query-time and is generated by the compiler.

A fluent LINQ query which returns EVERYTHING looks like this:

      from e in employees select employee
    

To return our anonymous type, we'll modify the query to look like this:

      from e in employees select new {Location, Total};
    

The compiler gives us two errors with this:

      _The name 'Location' does not exist in the current context_

_The name 'Total' does not exist in the current context_
    

This makes perfect sense – we haven't defined those elements yet. To make this work, we need to group the content together. To do this, we use the group by statement:

      from e in employees
group e by e.GroupCode.ToUpper().Substring(0, 2) into g
select new { Location, Total };
    

Here, we're taking the first two characters of the GroupCode property of the Employee object and grouping the set by that value into a group set we've designated as g.** But we've still got our same two compiler errors – the system doesn't know what Location or Total mean yet.

At this point, the contents of g look like this:

AA (List of Employee Objects with GroupCode that starts with AA)

ZZ (Employee Object with GroupCode that starts with ZZ)

The AA and ZZ are clearly the values of the Location field we created .

      e.GroupCode.ToUpper().Substring(0, 2)
    

The lists contain the Employees associated with each Location. The key that we create for grouping, in this case the first two characters of the group code, is made available in a Key property on the g object:

      var groups = from e in employees
group e by e.GroupCode.ToUpper().Substring(0, 2) into g
select new { Location = g.Key, Total };
    

One compiler error disappears now that we've defined where the location is coming from. Next, we'll define Total as the count of objects associated with that key in group g:

      from e in employees
group e by e.GroupCode.ToUpper().Substring(0, 2) into g
select new { Location = g.Key, Total = g.Count() };
    

Count() here is a method, not a property. We're asking for the size of the employee list and that is our total. This gives us the following result:

AA 2

ZZ 1

Which is exactly what we wanted.

Having By Statements in LINQ

Let's say that you produce this beautiful report for your boss and she says, "Ah…you did give me exactly what I asked for. But what I actually wanted was a report of those groups that have duplicates. If they only have one employee, I don't want them included in the results."

If you're familiar with T-SQL you're already saying, "We need a HAVING clause". HAVING statements are like WHERE clauses for aggregate values. We would say something like this in T-SQL:

      SELECT COUNT(Location) as TOTAL, Location FROM Employees GROUP BY Location HAVING COUNT(Location) > 1
    

So you might give this a try in LINQ:

      var groups = 
from e in employees
group e by e.GroupCode.ToUpper().Substring(0, 2) into g
having g.Count() > 1
select new { Location = g.Key, Total = g.Count() };
    

No dice, the compiler doesn't recognize the term _having _in this context, so there are a number of errors that result from this. The good news is that, in T-SQL, there is the separation of filtering for values with WHERE from the filtering of aggregates with HAVING. There is no such separation in LINQ, it's all just WHERE:

      var groups = 
from e in employees
group e by e.GroupCode.ToUpper().Substring(0, 2) into g
where g.Count() > 1
select new { Location = g.Key, Total = g.Count() };
    

All we had to do was change "having" to "where" and it works. This query will filter out results which only have a single employee associated with the location and return a single result with the same format as before:

AA 2

Other Aggregations

GroupCode is a string, so there's not a lot that we can do with it. If we were working with a number instead, Salary, for example, we could perform numeric aggregations like SUM:

      from e in employees
group e by e.GroupCode.ToUpper().Substring(0, 2) into g
where g.Count() > 1
select new { Location = g.Key, Total = g.Count(), TotalSalary = g.Sum(x => x.Salary)};
    

This would tell us exactly how much we were paying all the employees in a given location. We could add additional metrics like this:

      from e in employees
group e by e.GroupCode.ToUpper().Substring(0, 2) into g
where g.Count() > 1
select new { Location = g.Key, Total = g.Count(),
    TotalSalary = g.Sum(x => x.Salary),
    AverageSalary = g.Average(x => x.Salary),
    CheapestEmployee = g.Min(x => x.Salary),
    MostExpensiveEmployee = g.Max(x => x.Salary)
};
    

That gives a single result – remember that this query only returns locations with duplicates – that looks like this:

Location = "AA"

Total = 2

TotalSalary = 250

AverageSalary = 125

CheapestEmployee = 100

MostExpensiveEmployee = 150