Bio: Erik Meijer is the creator of LINQ at Microsoft, where he works together with the Microsoft Visual C# and Microsoft Visual Basic language design teams on data integration in programming languages. Erik is one of the designers of the standard functional programming language Haskell98 and, more recently, the Cω language.
I used to be a professor before I came to Microsoft, and I felt like a vegetarian butcher: I was teaching software engineering and designing programming languages, but I had no clue about the real world. Then I decided to jump in at the deep end and join Microsoft to see if I could help real programmers instead of only talking about it.
The idea of LINQ is to solve the "impedance mismatch" between the different data models. Look at the three prevalent data models: objects, XML and relational data. Programmers have to struggle with that daily when they write programs: querying relational data, doing some computations, then exposing the result as XML, and then vice versa. This is a problem that I wanted to attack when I moved from academia to industry. There are several ways people traditionally try to attack this problem, and one is by saying: "Let's take one of these data models and make it the uber model". One popular view is XML: "Oh well, if we can make everything look like XML, then all the problems are solved". Or if we can make everything look like objects, then everything is solved.
I believe that's a dead end because there will be more data models in the future; it doesn't scale. Imagine that tomorrow a completely new data model comes out, and you don't know if you can still map that to XML or objects. The secret is to look at what mathematics has to offer to solve this problem, and then try to translate this into something that normal programmers can understand. The way I explain it is that it's like designing a car. If you drive a car, you shouldn't need to know anything about how the engine works, it should just go smoothly; but if you are designing the car you have to know everything about the physics of the engine and spark plugs and things like that. It's the same when you're designing this framework. We're using this theory from mathematics and asking: "Can we abstract over any data model?" Once we can make it work for any data model, we have solved the impedance mismatch, because then you are not dependent on any one of these data models anymore.
What we do is look at the commonality between the different data models, and the commonality between all these data models is that you can do certain operations. For example, they all consist of collections of things. You can filter these things, throwing away certain elements; you can apply a transformation to each element (in SQL that's what the SELECT clause does); and you can iterate over several collections by taking a Cartesian product. And now you can look at these operations, and you see that you can do them on XML, on relations and on objects.
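The three operations named above can be sketched over any iterable collection. This is a minimal illustration in Python rather than C# or VB; the function names `where`, `select` and `cross` are chosen for this sketch and are not LINQ's actual API.

```python
from itertools import product


def where(items, predicate):
    """Filter: keep only the elements that satisfy the predicate."""
    return (x for x in items if predicate(x))


def select(items, transform):
    """Transform: apply a function to each element."""
    return (transform(x) for x in items)


def cross(left, right):
    """Cartesian product: every pairing of an element from each collection."""
    return product(left, right)


numbers = [1, 2, 3, 4]
letters = "ab"

evens = list(where(numbers, lambda n: n % 2 == 0))
squares = list(select(numbers, lambda n: n * n))
pairs = list(cross(evens, letters))
```

Nothing in these functions depends on what the elements are or where they come from, which is the point of the passage: the same three operations make sense for any collection-shaped data model.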
What we have done is identify these sets of operations, which mathematicians call monads; we call them standard query operators. We have a list of about 25 standard operators that you can apply to any data model. And then in C# and Visual Basic we have added special query syntax that the compiler translates into these standard query operators. You can now uniformly query against any data model, as long as the data model exposes itself using these standard operators. Instead of looking at the differences we look at the correspondences, and then there is no impedance mismatch. I don't know if you know The Matrix: "There is no spoon". There is no impedance mismatch if you look at it in the right way.
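To illustrate the idea of query syntax being translated into operator calls, here is a small Python sketch. A Python comprehension stands in for the C#/VB query syntax, and `select_many` is a hypothetical helper written for this example (it plays the role that mathematicians call "bind"); neither is the real compiler translation.

```python
def select_many(items, collection_selector):
    """Run collection_selector on each element and concatenate the results."""
    for item in items:
        yield from collection_selector(item)


# Each customer is a (name, ordered items) pair; the data is made up.
customers = [("Ann", ["book", "pen"]), ("Bob", []), ("Cy", ["lamp"])]

# Query-style syntax: readable, declarative.
sugar = [
    (name, item)
    for name, orders in customers
    for item in orders
    if item != "pen"
]

# The same query spelled out as an explicit operator call with a lambda.
desugared = list(
    select_many(
        customers,
        lambda c: ((c[0], item) for item in c[1] if item != "pen"),
    )
)
```

Both forms produce the same pairs; the second is the kind of plain operator expression that a data source only has to implement once in order to be queryable.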
When you write a query in the language, the compiler translates it into a giant expression over the standard operators. There is a particular implementation of these operators for XML, for relational data and for objects; if I'm querying against XML, for example, the XML implementation is used. One of the most interesting ones is the one that goes against relational data, and the thing there is that the data is not in memory but on a remote server. Traditionally what people do is open a connection to the database, pass the query to the connection as a string, and then get back some result set. That's an example of a really crufty, low-level implementation. What we can do now is implement these standard query operators on top of the database connection, so you can write your query in this high-level language and it gets translated into these query operators. The special thing here is that the query operators are represented as expression trees.
The representation of the query is data, and now the implementation of these operators takes that representation, the expression tree, translates it into the SQL string that you would write by hand, sends it to the server, gets back the result, and then surfaces the relational data as objects. There's a lot of magic in there; it's a very nice approach to treat expression trees as data. There's a really deep philosophical connection with the idea of quoting. In natural language you quote when you say: he says "Hi", or something like that. But our DNA also works with quoting: when our DNA is copied, that's a quoting mechanism. In logic there's a lot of quoting going on. This is a very deep philosophical concept that we are using to do querying. Maybe that's a little bit geeky, but it's very interesting that there is this very fundamental concept that we are using here to make ordinary programmers' lives easy.
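The "query as data" idea can be sketched with a toy expression tree and a translator that walks it to produce a parameterized SQL string. This is an illustration in Python under simplifying assumptions; the node classes `Column`, `Constant` and `Binary` and the functions `to_sql` and `build_query` are hypothetical and far simpler than real expression trees or a real provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Constant:
    value: object


@dataclass(frozen=True)
class Binary:
    op: str        # e.g. "=", ">", "AND"
    left: object
    right: object


def to_sql(node, params):
    """Walk the tree and emit SQL text, collecting constants as parameters."""
    if isinstance(node, Column):
        return node.name
    if isinstance(node, Constant):
        params.append(node.value)
        return "?"
    if isinstance(node, Binary):
        left = to_sql(node.left, params)
        right = to_sql(node.right, params)
        return f"({left} {node.op} {right})"
    raise TypeError(f"unsupported node: {node!r}")


def build_query(table, predicate):
    """Turn a predicate tree into a SELECT statement plus its parameters."""
    params = []
    where_clause = to_sql(predicate, params)
    return f"SELECT * FROM {table} WHERE {where_clause}", params


# The query is an ordinary data structure, not a string.
predicate = Binary(
    "AND",
    Binary("=", Column("city"), Constant("London")),
    Binary(">", Column("age"), Constant(30)),
)

sql, params = build_query("customers", predicate)
```

Because the predicate is a structure rather than text, the translator (or any other tool) can inspect, rewrite or optimize it before anything is sent to a server.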
First of all there is a mental gap that you have to cross. You're programming in C# or VB, and when you're querying you suddenly have to change context and think in terms of a different language, a different type system, a different data model; that's a conceptual gap. But also the queries are expressed as strings, so suddenly you don't get anything like IntelliSense or syntax highlighting; you are all the way down at the lowest-level representation of a program, namely a string. As a programmer you want program structure that your environment knows about, but you hide it there in the string, and then in the API there's no context at all. With LINQ your queries are also type checked, and so on. Plus, the other thing is that when you have your query as a structure, there's much more opportunity for the environment or the API to look at it, optimize it and do things with it. Whereas if it's in a string, that is too closed to do anything useful; it's too low level.
XQuery is just like SQL. If you look at XQuery and SQL, those are two examples where you have a data model and a query language that are tightly coupled. Again, if I had to query against XML and had to pass my query as an XQuery string, I would have to do a mental switch and think about a completely different type system, a different language, a different syntax. Conceptually you're doing very much the same thing: you're filtering, you're iterating over these things. Instead of having the data model and the query language tightly coupled, we have broken them apart and said: "Let's see what the commonalities are. Can I do the same on XML as I do over relational data? And can I define these standard query operators on top of XML?" In some sense there's no need for XQuery if you have LINQ, because all the things you can do in XQuery are expressible in your normal programming language; you don't have to hide them inside a string.
That depends. If I'm querying against XML, the results are collections of XML nodes. If you are querying against SQL, the situation is more interesting, because under the covers you are querying against tables, and what you want to get back in many cases are objects. LINQ itself allows you to query against a raw table, because a raw table is a collection of rows, and you can get back a collection of rows. What programmers often want is to see more structure in their relational data. If you have customers and orders, you know that there is a relationship between customers and orders. That's where there is often some way of mapping or bridging the gap between the raw tables, where there's an implicit relationship via the foreign key and primary key, and the objects, where there is a typed relationship. There are several mechanisms for doing that, so that you can expose these tables as objects that have relationships and more structure than just raw tables.
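The gap between raw rows linked by keys and objects linked by references can be sketched as follows. This is a hand-written Python illustration with made-up data; the `Customer` and `Order` classes and the `materialize` function are hypothetical and do not reflect how any LINQ provider actually materializes objects.

```python
from dataclasses import dataclass, field

# Raw tables: the relationship is implicit in the customer_id foreign key.
customer_rows = [(1, "Ann"), (2, "Bob")]                         # (customer_id, name)
order_rows = [(10, 1, "book"), (11, 1, "pen"), (12, 2, "lamp")]  # (order_id, customer_id, item)


@dataclass
class Order:
    order_id: int
    item: str


@dataclass
class Customer:
    customer_id: int
    name: str
    orders: list = field(default_factory=list)  # explicit, navigable relationship


def materialize(customer_rows, order_rows):
    """Turn key-linked rows into objects that reference each other."""
    customers = {cid: Customer(cid, name) for cid, name in customer_rows}
    for order_id, customer_id, item in order_rows:
        customers[customer_id].orders.append(Order(order_id, item))
    return list(customers.values())


customers = materialize(customer_rows, order_rows)
items_by_name = {c.name: [o.item for o in c.orders] for c in customers}
```

After materialization, a query can navigate `customer.orders` directly instead of joining on key columns.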
The interesting thing about LINQ is that it looks like one technology, but it's really built out of several parts that by themselves are already useful and interesting. One thing we have in LINQ is something called "type inference". You can declare a variable without writing down its type, and the compiler will infer it for you. The reason we have that is that the result of a query can be arbitrarily complicated, and you don't want to force programmers to write down its type. Before generics, types were always simple: you had a list, and whether it held customers, lists of orders or plain objects, its type was always just a list. Now, with generics, these types have become much more structured, and writing them down is painful, so this is where type inference comes in.
Another thing we have in LINQ is "lambda expressions". These are building blocks; people in Ruby and other languages have similar concepts in closures and blocks. These are the little building blocks that are used to create the expression trees, or to call the standard sequence operators we have talked about. Then there is the notion of "extension methods", where you can add new methods to a given type, which is also a problem that is solved here. These little things are by themselves very useful. Another one is "anonymous types". They correspond to the idea of rows in the database: if you want to project just the name and the age of a given customer, you don't want to create a new named type, a class that has a name field and an age field; you just want to create a pair that has a name and an age. Anonymous types allow you to do that, and they make programming a lot easier. All these things then fit together into LINQ as a whole.
I don't expect that a lot of developers will define their own extension methods; it's the same as with generics in some sense. There are now collection classes that are generic, and people use them without writing generic classes themselves, and I think it will be the same with extension methods. Probably there will be people who write libraries and frameworks that use them, and then people who consume them. For example, the way we use them is as follows: currently you can't have implementations on interfaces, but one of the types on which we define the standard sequence operators is the IEnumerable interface. This is done with extension methods: we have extension methods that implement all these standard operators on IEnumerable. Another example is a sealed type, say you want to extend string. Currently there is no way to do that. With extension methods you can give the illusion that you can add new methods to string. You have to be careful, because the extension method is an illusion. It's a compile-time thing: the compiler gives you the idea that there are new methods on the type, but at runtime we don't change the representation of the type.
If you have a database with lots of tables, you often want a higher-level representation of those tables in terms of objects, where the structure, the associations and the relationships that are encoded in the database are made visible to the users. There are several ways of doing that. In one approach you take classes and put custom attributes on them, which gives metadata that tells the tools how this class maps to a table, how this property maps to a relationship, and so on. Or you can have an external model that describes these relationships, and from there you can generate your classes or your database, or you can even take the metadata from your database and generate your types and some external schema. There are a lot of possibilities, and I expect that there are many of these frameworks around already and that there will be many more.
11. Let's take your typical projects: you created the schema; you created the object model, now you actually have to map one to the other. Where does LINQ fit in and not fit in for this particular problem?
This is a very typical situation: you have defined both, or only your database schema, or only your objects that you want to store, so where do you put this mapping information? There are several places where you can put it. You can put it as custom attributes (or, in Java terms, annotations) on your classes, or you can put it in an external mapping description, an XML file or some other description.
LINQ is completely independent of this, because LINQ is all about the querying mechanism and the standard sequence operators. If you want an implementation of the query operators that goes against the database, that implementation has to know about the mapping, because it has to take your query, which is defined in terms of objects, and translate it into SQL that accesses your tables. Your specific implementation of the query operators will have to access this mapping somehow, but the LINQ mechanism as a whole is completely independent of it, and you can plug in whatever mapping you use. LINQ itself is not tied to any particular form of mapping, whether it's on the classes, external or whatever.
If you talk about mapping you have already seen that there are two approaches there. There's the LINQ to SQL, which tries to address the simpler and direct scenarios, then there's the EDM that tries to address the very complicated enterprise scenarios, where you have many tables with complex mapping scenarios. Those are two manifestations of the querying databases using LINQ.
Then there's querying XML, where in addition to the standard query operators we have a set of new XML APIs to construct XML and to query into the XML to get elements, child elements, descendants, parents and so on, to navigate the XML object model.
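The combination of constructing XML and then querying its elements can be sketched with Python's standard `xml.etree.ElementTree` module. This shows the general idea only; it is not the LINQ to XML API, and the element names and data are made up.

```python
import xml.etree.ElementTree as ET

# Construct a small XML tree in code.
root = ET.Element("customers")
for name, city in [("Ann", "London"), ("Bob", "Paris"), ("Cy", "London")]:
    customer = ET.SubElement(root, "customer", city=city)
    ET.SubElement(customer, "name").text = name

# Query it: iterate over matching descendants, filter on an attribute,
# then project the text of a child element.
london_names = [
    customer.findtext("name")
    for customer in root.iter("customer")
    if customer.get("city") == "London"
]
```

The query is the same filter-and-project shape used for in-memory objects; only the navigation calls (`iter`, `get`, `findtext`) are XML specific.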
Then there is the LINQ over objects, where you are querying normal in-memory collections, such as arrays or trees. At this point the three main manifestations of LINQ are: LINQ to Objects, LINQ to XML and then several variations of LINQ against relational databases. I think especially in the latter space there will be lots of interesting developments, because as I said, we are really using the power of LINQ where we are using the queries as data, which is an exciting and novel thing instead of the SQL strings.
DLINQ is an instance of LINQ over relational data that tries to address the simpler mapping scenarios. In DLINQ there are several ways to do this. You can point at your database and get your classes from the metadata in the database, you can put custom attributes on your classes and define your mapping like that, or you can use an external mapping file. DLINQ, like many other object-relational mappers, has this notion of a context, which is the bridge between the object world and the database world; it does the change tracking and holds the database context, the transaction context and so on. This context will also take your expression trees, translate them to SQL and then materialize your objects. DLINQ is one example of using LINQ to query against relational data.
If you have your own implementation of the standard query operators, users can just write queries in VB or C# that will then run against your data source. There's a whole spectrum there, because you have to take these expression trees, which are basically full C# or VB expressions, including method calls, and translate them into SQL. You want some help from a framework on how to write these things. We are working on provider frameworks to make that easy, but that would be a completely separate topic. The essence of LINQ is that we make it inherently pluggable, because these query operators are not dependent on a certain data source. As long as you can expose these operations on your data source, people can query it. That's the deep elegance of LINQ: the fact that it works for any data type, for any data source, for any container type. There's no limitation. If your data behaves as data, you can make it LINQ enabled.
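The pluggability described here can be sketched as follows: one query function runs unchanged against two different sources, because both expose the same operators. This is a Python illustration; the classes `ListSource` and `CsvSource`, their `where` and `select` methods, and the sample data are hypothetical and not part of any LINQ provider framework.

```python
import csv
import io


class ListSource:
    """In-memory source exposing the two operators used below."""

    def __init__(self, items):
        self._items = list(items)

    def where(self, predicate):
        return ListSource(x for x in self._items if predicate(x))

    def select(self, transform):
        return ListSource(transform(x) for x in self._items)

    def __iter__(self):
        return iter(self._items)


class CsvSource:
    """CSV-backed source exposing the same operators over parsed rows."""

    def __init__(self, text):
        self._text = text

    def _rows(self):
        return csv.DictReader(io.StringIO(self._text))

    def where(self, predicate):
        return ListSource(row for row in self._rows() if predicate(row))

    def select(self, transform):
        return ListSource(transform(row) for row in self._rows())

    def __iter__(self):
        return iter(self._rows())


def adult_names(source):
    """One query, written once, for any source that has where and select."""
    adults = source.where(lambda person: int(person["age"]) >= 18)
    return list(adults.select(lambda person: person["name"]))


people = [{"name": "Ann", "age": 34}, {"name": "Bob", "age": 12}]
csv_text = "name,age\nAnn,34\nBob,12\n"

from_list = adult_names(ListSource(people))
from_csv = adult_names(CsvSource(csv_text))
```

The query function never asks what kind of source it was given; adding a third source only requires implementing the same two methods.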
Yes, the EDM is designed for situations where you have really complicated schemas and giant databases, where your mapping scenarios are much more complex, where you want to do de-normalization, and where you want to map different tables to a single type or you want to have one table mapped to different types, or you have complicated implementations of inheritance. These are situations where you have really complicated enterprise applications, legacy applications, where simple direct one to one mapping doesn't fit.
We are more or less in the end game of shipping LINQ. The design is mostly done, so now the developers, program managers and QA teams are working hard to get everything to production quality. I've been looking at what we can do with this beautiful infrastructure in the next phase. This is a really good foundation for the next step, where we want to make web programming easy. LINQ is the first step, and what I'm doing now is working with my colleague Brian Beckman in an incubation team where we're looking at how we can use the ideas of LINQ and move them to the next level. The way we phrase this is that what VB did for Windows programming we want to do for web programming, and we want to take the LINQ technology as the basis for that. How can we democratize the web using LINQ? That is the next step.
I have several books that are on my mind. One is "The Long Tail"; that's not necessarily a technical computer science book, but I can recommend it to anybody. Another book that I'm currently reading is called "The Change Function". This is about why some technologies fail and some technologies succeed. But my favorite computer science book is called "LEAN programming", by Mary and Tom Poppendieck. There's a lot of buzz today about Agile development and extreme programming, but to me a lot of that feels anecdotal and very dogmatic, "You have to do it this way", and not very scientific. Whereas this LEAN programming, which is based on LEAN production and the Toyota methods, is finally something where I feel that this is right; it has some deep resonance with me and it's beautiful. Another technical book is by Andreas Zeller, and it's called "Why Programs Fail". This is about finding bugs in programs, debugging and so on. This is another field where I think there is a lot of progress to be made, and Andreas has done some wonderful things there that he describes in his book. Some of them are mind blowing and you think "This is completely impossible", but he just does it. So "LEAN programming" and "Why Programs Fail" are books that I can recommend to anybody.