BT

LINQ Grouping Techniques

| by Jonathan Allen Follow 595 Followers on Jan 07, 2008. Estimated reading time: 1 minute |

For the most part LINQ works very much like SQL. Sources, joins, selects, and where clauses are all pretty standard fare. The Group/By/Into clause is where this breaks down. Unlike SQL, which always returns a rectangular dataset, LINQ is capable of returning hierarchical data.

An example of this would be grouping customers by country and city. In SQL this would require either manually grouping on the client-side or performing 1 + N + (N*M) queries: one to get the countries, one for the cities in each country, and one for each customer list.

This can all be done with a single LINQ query by using a series of sub queries. However, the query gets increasing complex. Mitsu demonstrates:

var q = from c in db.Customers group c by c.Country into g select new { g.Key, Count = g.Count(), SubGroups = from c in g group c by c.City into g2 select g2};

Demonstrating the flexibility the LINQ framework, Mitsu reduces it down this single line:

var result = customers.GroupByMany(c => c.Country, c => c.City);

Mitsu did this in a general fashion suitable for any LINQ query. You can see the source code and an explanation of how it works on his blog.

Rate this Article

Adoption Stage
Style

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

err... by Frans Bouma

It can be done in 1 query?
select count(customerid) as numcustomer, country, city
from customers
group by country, city
order by country asc, numcustomer desc

Then all you need is a reader which reads it in, one row at a time. You have to traverse the list just once.

Having it projected for you in a hierarchy is great of course, but I find it misleading you'd need 1+N+(N*M) queries, which is simply nonsense.

The downside of this particular linq construct is that the line between what's executed in the DB and what's executed in memory is blurred. Can you tell which group by is ran in the DB and which one in memory? IMHO it's better to keep things separated, so a developer doesn't fall into the trap where s/he thinks the query runs on the DB but actually it's ran in memory...

Re: err... by Frans Bouma

Heh, actually, I ran the Linq query above using Linq to Sql and I get... I myriad of queries! Not just 1, but a lot.

This is obvious because Linq to Sql can't merge hierarchical resultsets in memory, it does that by postponing queries for hierarchical data till the parent is evaluated (the customer) and then fetches for that row the child rows.

So, although the idea is OK, don't be fooled by the MS marketing machine: unless the end result is 1 SQL query as I described above OR 2 (one for parents and 1 for children which are merged at runtime), you WILL get a lot of queries, and this will bring down performance tremendously.

Re: err... by Francois Ward

Indeed, even the old dataset methods had better ways of doing it than LINQ to SQL, if what you say is true (I didn't play much with LINQ to SQL, but I thought it could actually handle this particular scenario correctly... I'm probably mistaken).

Seriously though, LINQ is getting to be pretty disapointing... LINQ to SQL is a joke (though it is nice to have around for simpler applications, being built in all). Using LINQs on collections is much more viable, but even that is inefficient as hell, and anywhere where you need performance, you're better off doing things with the extension methods on your own...

I guess LINQ to SQL is like everything else in .NET, an entry level tool, in the same way the basic toolbox controls for winforms and asp.net are just entry level stuff, and when you get serious you roll up your owns or use third party controls like Component Art, Telerik, DevExpress... Same deal now. Use LINQ to SQL for prototypes or entry level projects, and use something like LLBLGEN when you get serious.

At least making a LINQ provider doesn't look TOO hard.

Re: err... by Jonathan Allen

It can be done in 1 query?
select count(customerid) as numcustomer, country, city
from customers
group by country, city
order by country asc, numcustomer desc


That won't return all of the customer records, a goal of the above query.

Also, you still have to group all the cities in each country.

The downside of this particular linq construct is that the line between what's executed in the DB and what's executed in memory is blurred. Can you tell which group by is ran in the DB and which one in memory? IMHO it's better to keep things separated, so a developer doesn't fall into the trap where s/he thinks the query runs on the DB but actually it's ran in memory...


I do believe that it is going to be a major performance problem down the road. As you pointed out, it can be hard to separate database queries from in-memory queries.

Fortunately it is quite easy to figure out what LINQ to SQL is doing. You just need to pipe the DataContext.Log property to a stream like Console.Out.

If you really want to make sure the query is run in-memory, you can call .ToList or .ToArray. The first query can still be in the database, but at least you know everything against the List or Array will be local.

Re: err... by Jonathan Allen

I think the correct way to handle that would be to run a single query that gets back all the information you need, and turn it into a List. Then run all your grouping queries against that in-memory list.

That said, doing it the lazy, one-query way is going to be really, really tempting.

Re: err... by Jonathan Allen

There are ways to improve in-memory queries. For example, there is "i4o" for indexing in-memory objects. www.infoq.com/news/2007/04/i40-intro

Re: err... by Frans Bouma

That won't return all of the customer records, a goal of the above query.

No, the initial linq query also doesn't fetch the customer rows. This is logical, as 'c' isn't in scope after the 'into', also because if you would want to fetch all the customer data, you'd have a groupby query which had to group on all customer fields.

Also, you still have to group all the cities in each country.

No :) You can use a loop which traverses the results in 1 go and builds the same hierarchy using an O(n) algorithm. The count per city is already there, so all you have to do is add them per country, which can be done when you loop over the rows. You just have to compare previous with current row if it is a new country, but that's it.

Re: err... by Frans Bouma

Doing everything in memory is one way to do it, but it's still worse than O(n) because of the nature how the linq query has to be constructed: the nested grouping for the city is in the projection, so you can't execute that one AS WELL when traversing the list for grouping the city. If you want to do that internally in linq to objects, you've to analyze the whole query, which looks easy but is actually quite complex.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

8 Discuss

Login to InfoQ to interact with what matters most to you.


Recover your password...

Follow

Follow your favorite topics and editors

Quick overview of most important highlights in the industry and on the site.

Like

More signal, less noise

Build your own feed by choosing topics you want to read about and editors you want to hear from.

Notifications

Stay up-to-date

Set up your notifications and don't miss out on content that matters to you

BT