InfoQ

News

LINQ Grouping Techniques

Posted by Jonathan Allen on Jan 07, 2008 06:44 AM

Community
.NET
Topics
Data Access
Tags
LINQ

For the most part LINQ works very much like SQL. Sources, joins, selects, and where clauses are all pretty standard fare. The Group/By/Into clause is where this breaks down. Unlike SQL, which always returns a rectangular dataset, LINQ is capable of returning hierarchical data.

An example of this would be grouping customers by country and city. In SQL this would require either manually grouping on the client-side or performing 1 + N + (N*M) queries: one to get the countries, one for the cities in each country, and one for each customer list.

This can all be done with a single LINQ query by using a series of sub queries. However, the query gets increasing complex. Mitsu demonstrates:

var q = from c in db.Customers group c by c.Country into g select new { g.Key, Count = g.Count(), SubGroups = from c in g group c by c.City into g2 select g2};

Demonstrating the flexibility the LINQ framework, Mitsu reduces it down this single line:

var result = customers.GroupByMany(c => c.Country, c => c.City);

Mitsu did this in a general fashion suitable for any LINQ query. You can see the source code and an explanation of how it works on his blog.

err... by Frans Bouma Posted Jan 7, 2008 8:09 AM
Re: err... by Frans Bouma Posted Jan 7, 2008 8:17 AM
Re: err... by Francois Ward Posted Jan 7, 2008 5:42 PM
Re: err... by Jonathan Allen Posted Jan 7, 2008 9:25 PM
Re: err... by Jonathan Allen Posted Jan 7, 2008 9:23 PM
Re: err... by Frans Bouma Posted Jan 8, 2008 2:47 AM
Re: err... by Jonathan Allen Posted Jan 7, 2008 9:20 PM
Re: err... by Frans Bouma Posted Jan 8, 2008 2:45 AM
  1. Back to top

    err...

    Jan 7, 2008 8:09 AM by Frans Bouma

    It can be done in 1 query? select count(customerid) as numcustomer, country, city from customers group by country, city order by country asc, numcustomer desc Then all you need is a reader which reads it in, one row at a time. You have to traverse the list just once. Having it projected for you in a hierarchy is great of course, but I find it misleading you'd need 1+N+(N*M) queries, which is simply nonsense. The downside of this particular linq construct is that the line between what's executed in the DB and what's executed in memory is blurred. Can you tell which group by is ran in the DB and which one in memory? IMHO it's better to keep things separated, so a developer doesn't fall into the trap where s/he thinks the query runs on the DB but actually it's ran in memory...

  2. Back to top

    Re: err...

    Jan 7, 2008 8:17 AM by Frans Bouma

    Heh, actually, I ran the Linq query above using Linq to Sql and I get... I myriad of queries! Not just 1, but a lot. This is obvious because Linq to Sql can't merge hierarchical resultsets in memory, it does that by postponing queries for hierarchical data till the parent is evaluated (the customer) and then fetches for that row the child rows. So, although the idea is OK, don't be fooled by the MS marketing machine: unless the end result is 1 SQL query as I described above OR 2 (one for parents and 1 for children which are merged at runtime), you WILL get a lot of queries, and this will bring down performance tremendously.

  3. Back to top

    Re: err...

    Jan 7, 2008 5:42 PM by Francois Ward

    Indeed, even the old dataset methods had better ways of doing it than LINQ to SQL, if what you say is true (I didn't play much with LINQ to SQL, but I thought it could actually handle this particular scenario correctly... I'm probably mistaken). Seriously though, LINQ is getting to be pretty disapointing... LINQ to SQL is a joke (though it is nice to have around for simpler applications, being built in all). Using LINQs on collections is much more viable, but even that is inefficient as hell, and anywhere where you need performance, you're better off doing things with the extension methods on your own... I guess LINQ to SQL is like everything else in .NET, an entry level tool, in the same way the basic toolbox controls for winforms and asp.net are just entry level stuff, and when you get serious you roll up your owns or use third party controls like Component Art, Telerik, DevExpress... Same deal now. Use LINQ to SQL for prototypes or entry level projects, and use something like LLBLGEN when you get serious. At least making a LINQ provider doesn't look TOO hard.

  4. Back to top

    Re: err...

    Jan 7, 2008 9:20 PM by Jonathan Allen

    It can be done in 1 query? select count(customerid) as numcustomer, country, city from customers group by country, city order by country asc, numcustomer desc
    That won't return all of the customer records, a goal of the above query. Also, you still have to group all the cities in each country.
    The downside of this particular linq construct is that the line between what's executed in the DB and what's executed in memory is blurred. Can you tell which group by is ran in the DB and which one in memory? IMHO it's better to keep things separated, so a developer doesn't fall into the trap where s/he thinks the query runs on the DB but actually it's ran in memory...
    I do believe that it is going to be a major performance problem down the road. As you pointed out, it can be hard to separate database queries from in-memory queries. Fortunately it is quite easy to figure out what LINQ to SQL is doing. You just need to pipe the DataContext.Log property to a stream like Console.Out. If you really want to make sure the query is run in-memory, you can call .ToList or .ToArray. The first query can still be in the database, but at least you know everything against the List or Array will be local.

  5. Back to top

    Re: err...

    Jan 7, 2008 9:23 PM by Jonathan Allen

    I think the correct way to handle that would be to run a single query that gets back all the information you need, and turn it into a List. Then run all your grouping queries against that in-memory list. That said, doing it the lazy, one-query way is going to be really, really tempting.

  6. Back to top

    Re: err...

    Jan 7, 2008 9:25 PM by Jonathan Allen

    There are ways to improve in-memory queries. For example, there is "i4o" for indexing in-memory objects. http://www.infoq.com/news/2007/04/i40-intro

  7. Back to top

    Re: err...

    Jan 8, 2008 2:45 AM by Frans Bouma

    That won't return all of the customer records, a goal of the above query.
    No, the initial linq query also doesn't fetch the customer rows. This is logical, as 'c' isn't in scope after the 'into', also because if you would want to fetch all the customer data, you'd have a groupby query which had to group on all customer fields.
    Also, you still have to group all the cities in each country.
    No :) You can use a loop which traverses the results in 1 go and builds the same hierarchy using an O(n) algorithm. The count per city is already there, so all you have to do is add them per country, which can be done when you loop over the rows. You just have to compare previous with current row if it is a new country, but that's it.

  8. Back to top

    Re: err...

    Jan 8, 2008 2:47 AM by Frans Bouma

    Doing everything in memory is one way to do it, but it's still worse than O(n) because of the nature how the linq query has to be constructed: the nested grouping for the city is in the projection, so you can't execute that one AS WELL when traversing the list for grouping the city. If you want to do that internally in linq to objects, you've to analyze the whole query, which looks easy but is actually quite complex.

Educational Content

Bindings, Platforms, and Innovation

This presentation focuses on the Internet and separating myth from fact, history from the future, and the mundane from the imaginative. Bob Frankston presents a vision of what could and should be.

Orchestrating Long Running Activities with JBoss / JBPM

This article explores the use of JBoss and jBPM to implement design solutions that effectively address the issue of orchestrating long running activities.

Neo4j - The Benefits of Graph Databases

This presentation covers the use of graph databases as an optimal solution for data that is difficult to fit in static tables, rapidly evolving data or data that has a lot of optional attributes.

Realistic about Risk: Software development with Real Options

This session introduces Real Options and shows how it can help in running your project. Real Options is a decision-making process that can be used to manage risk.

Communication Flexibility Using Bindings

This article discusses the use of bindings on services and references (including the instance of non-configured bindings) as the means to implement SCA communications in a Web and SOA environment.

Writing DSLs in Groovy

After a short introduction to DSLs, Scott Davis plays with the keyboard showing how to approach the creation of a DSL by typing working snippets of Groovy code that get executed.

Scaling Agile with C/ALM (Collaborative Application Lifecycle Management)

IBM Rational and InfoQ present, Scaling Agile with C/ALM, an eBook showing organizations how to become “finely tuned software delivery machines” by enabling team integration and scaling.

Concurrent Programming with Microsoft F#

Amanda Laucher presents a real life enterprise application written in F#. She shows actual code snippets, explaining design decisions and suggesting how to use some of the F# constructs.