InfoQ

News

LINQ Grouping Techniques

Posted by Jonathan Allen on Jan 07, 2008

Community
.NET
Topics
Data Access
Tags
LINQ

For the most part LINQ works very much like SQL. Sources, joins, selects, and where clauses are all pretty standard fare. The Group/By/Into clause is where this breaks down. Unlike SQL, which always returns a rectangular dataset, LINQ is capable of returning hierarchical data.

An example of this would be grouping customers by country and city. In SQL this would require either manually grouping on the client-side or performing 1 + N + (N*M) queries: one to get the countries, one for the cities in each country, and one for each customer list.

This can all be done with a single LINQ query by using a series of sub queries. However, the query gets increasing complex. Mitsu demonstrates:

var q = from c in db.Customers group c by c.Country into g select new { g.Key, Count = g.Count(), SubGroups = from c in g group c by c.City into g2 select g2};

Demonstrating the flexibility the LINQ framework, Mitsu reduces it down this single line:

var result = customers.GroupByMany(c => c.Country, c => c.City);

Mitsu did this in a general fashion suitable for any LINQ query. You can see the source code and an explanation of how it works on his blog.

err... by Frans Bouma Posted Jan 7, 2008 8:09 AM
Re: err... by Frans Bouma Posted Jan 7, 2008 8:17 AM
Re: err... by Francois Ward Posted Jan 7, 2008 5:42 PM
Re: err... by Jonathan Allen Posted Jan 7, 2008 9:25 PM
Re: err... by Jonathan Allen Posted Jan 7, 2008 9:23 PM
Re: err... by Frans Bouma Posted Jan 8, 2008 2:47 AM
Re: err... by Jonathan Allen Posted Jan 7, 2008 9:20 PM
Re: err... by Frans Bouma Posted Jan 8, 2008 2:45 AM
  1. Back to top

    err...

    Jan 7, 2008 8:09 AM by Frans Bouma

    It can be done in 1 query?
    select count(customerid) as numcustomer, country, city
    from customers
    group by country, city
    order by country asc, numcustomer desc

    Then all you need is a reader which reads it in, one row at a time. You have to traverse the list just once.

    Having it projected for you in a hierarchy is great of course, but I find it misleading you'd need 1+N+(N*M) queries, which is simply nonsense.

    The downside of this particular linq construct is that the line between what's executed in the DB and what's executed in memory is blurred. Can you tell which group by is ran in the DB and which one in memory? IMHO it's better to keep things separated, so a developer doesn't fall into the trap where s/he thinks the query runs on the DB but actually it's ran in memory...

  2. Back to top

    Re: err...

    Jan 7, 2008 8:17 AM by Frans Bouma

    Heh, actually, I ran the Linq query above using Linq to Sql and I get... I myriad of queries! Not just 1, but a lot.

    This is obvious because Linq to Sql can't merge hierarchical resultsets in memory, it does that by postponing queries for hierarchical data till the parent is evaluated (the customer) and then fetches for that row the child rows.

    So, although the idea is OK, don't be fooled by the MS marketing machine: unless the end result is 1 SQL query as I described above OR 2 (one for parents and 1 for children which are merged at runtime), you WILL get a lot of queries, and this will bring down performance tremendously.

  3. Back to top

    Re: err...

    Jan 7, 2008 5:42 PM by Francois Ward

    Indeed, even the old dataset methods had better ways of doing it than LINQ to SQL, if what you say is true (I didn't play much with LINQ to SQL, but I thought it could actually handle this particular scenario correctly... I'm probably mistaken).

    Seriously though, LINQ is getting to be pretty disapointing... LINQ to SQL is a joke (though it is nice to have around for simpler applications, being built in all). Using LINQs on collections is much more viable, but even that is inefficient as hell, and anywhere where you need performance, you're better off doing things with the extension methods on your own...

    I guess LINQ to SQL is like everything else in .NET, an entry level tool, in the same way the basic toolbox controls for winforms and asp.net are just entry level stuff, and when you get serious you roll up your owns or use third party controls like Component Art, Telerik, DevExpress... Same deal now. Use LINQ to SQL for prototypes or entry level projects, and use something like LLBLGEN when you get serious.

    At least making a LINQ provider doesn't look TOO hard.

  4. Back to top

    Re: err...

    Jan 7, 2008 9:20 PM by Jonathan Allen

    It can be done in 1 query?
    select count(customerid) as numcustomer, country, city
    from customers
    group by country, city
    order by country asc, numcustomer desc


    That won't return all of the customer records, a goal of the above query.

    Also, you still have to group all the cities in each country.

    The downside of this particular linq construct is that the line between what's executed in the DB and what's executed in memory is blurred. Can you tell which group by is ran in the DB and which one in memory? IMHO it's better to keep things separated, so a developer doesn't fall into the trap where s/he thinks the query runs on the DB but actually it's ran in memory...


    I do believe that it is going to be a major performance problem down the road. As you pointed out, it can be hard to separate database queries from in-memory queries.

    Fortunately it is quite easy to figure out what LINQ to SQL is doing. You just need to pipe the DataContext.Log property to a stream like Console.Out.

    If you really want to make sure the query is run in-memory, you can call .ToList or .ToArray. The first query can still be in the database, but at least you know everything against the List or Array will be local.

  5. Back to top

    Re: err...

    Jan 7, 2008 9:23 PM by Jonathan Allen

    I think the correct way to handle that would be to run a single query that gets back all the information you need, and turn it into a List. Then run all your grouping queries against that in-memory list.

    That said, doing it the lazy, one-query way is going to be really, really tempting.

  6. Back to top

    Re: err...

    Jan 7, 2008 9:25 PM by Jonathan Allen

    There are ways to improve in-memory queries. For example, there is "i4o" for indexing in-memory objects. www.infoq.com/news/2007/04/i40-intro

  7. Back to top

    Re: err...

    Jan 8, 2008 2:45 AM by Frans Bouma

    That won't return all of the customer records, a goal of the above query.

    No, the initial linq query also doesn't fetch the customer rows. This is logical, as 'c' isn't in scope after the 'into', also because if you would want to fetch all the customer data, you'd have a groupby query which had to group on all customer fields.

    Also, you still have to group all the cities in each country.

    No :) You can use a loop which traverses the results in 1 go and builds the same hierarchy using an O(n) algorithm. The count per city is already there, so all you have to do is add them per country, which can be done when you loop over the rows. You just have to compare previous with current row if it is a new country, but that's it.

  8. Back to top

    Re: err...

    Jan 8, 2008 2:47 AM by Frans Bouma

    Doing everything in memory is one way to do it, but it's still worse than O(n) because of the nature how the linq query has to be constructed: the nested grouping for the city is in the projection, so you can't execute that one AS WELL when traversing the list for grouping the city. If you want to do that internally in linq to objects, you've to analyze the whole query, which looks easy but is actually quite complex.

Educational Content

Brian Marick on 4 Challenges and 5 Guiding Values of Agile Software Development

Brian Marick takes us through a quick tour of the most important values and challenges to adopting Agile successfully (they aren't the typical challenges and values we hear in the community).

Are You a Software Architect?

The line between development and architecture is tricky. Does it exist at all? Is an ivory tower actually needed? There's a balance in the middle, but how do you move from developer to architect?

Agile – A Way of Life and Pragmatic Use of Authority

The word 'authority' sometimes produces an allergic response in hard-line agilists. Freedom and authority – both are bad if misused and both are good if used in right spirit for a noble cause.

Getting Started with Grails, Second Edition

"Getting Started with Grails" brings you up to speed on this modern web framework. Companies as varied as LinkedIn, Wired, and Taco Bell are all using Grails. Are you ready to get started as well?

Using ITIL V3 as a Foundation for SOA Governance

Those familiar with only ITIL V2 often scoff at the thought that ITIL could serve as a governance framework for SOA. With ITIL V3, the focus of the framework shifted towards service-orientation.

Adrian Colyer on AspectJ, tc Server and dm Server

SpringSource CTO Adrian Colyer discusses AspectJ, SpringSource's dm Server and tc Server products, OSGi and Scrum.

Adam Wiggins on Heroku

Heroku's Adam Wiggins talks about Rails, Background Jobs, Add-Ons, Ruby, and how Heroku manages to work around Ruby's inefficiencies using Erlang and other languages.

SOA as an Architectural Pattern: Best Practices in Software Architecture

For Grady Booch the foundation of a good architecture is patterns, SOA being just one of many patterns. In this Second Life presentation, Booch attempts to bring more clarity on what architecture is.