Markus Völter on DSLs, Modeling, MPS, Mbeddr


1. We are here at the Code Generation Conference 2012 in Cambridge. I’m sitting here with Markus Voelter. Markus, who are you?

I’m an independent consultant, coach, researcher, trainer, developer - whatever you want to call it. I’ve sold most of my time to a company called itemis in Germany, so I’m kind of associated with them, and I mostly spend my time these days with domain specific languages, code generation, models, and then, of course, associated with that is the whole area of architecture, product lines - this kind of space.


2. Let’s jump right into it with DSLs. At this conference you talked about something called mbeddr that’s built on something called MPS. What is the quick elevator pitch, the introduction, for these two things?

I’ll start with MPS. So, MPS is a language workbench; a language workbench (Martin Fowler defined the term in 2004, I think) is a tool that allows you to efficiently work with languages - you can build languages, you can extend languages, you can then build programs using combinations of these various languages; MPS is one of these tools.

What’s special about MPS, and a few others, although most tools are different, is that it is a projectional editor; so you don’t do parsing. You have text, you have tables; in the future, we’ll have graphical notations, but you don’t do any parsing, and this allows you to mix notations, you can embed tables and text, and it allows you to extend languages and compose languages; so, it’s a very good environment to work with families of related languages.

Now mbeddr is an approach for embedded software development where we combine C with domain specific extensions and formal verification; so instead of just coding C, you can take C and extend C with domain specific extensions such as state machines, components, interfaces, physical units - all kinds of things. We already have all those I mentioned, but you can build your own, and using these ingredients you’re more productive and you get more reliable, better quality embedded software; and as a consequence of the fact that you don’t express things at low levels of abstraction in C but rather at reasonable abstractions, you can then do formal analysis through model checking or SAT solving much more easily, and that is what we exploit.

Mbeddr is built on MPS; so mbeddr exploits MPS’s ability to flexibly extend, merge and compose languages. It’s all open source, by the way; MPS is open source as well as mbeddr. MPS is done by JetBrains, mbeddr is done by a bunch of people in a German government-funded research project, and both are independent but both are open source.


3. You mentioned you extend C with specific language features to make modeling easier, is that true?

I’m hung up a bit about the term ‘modeling’ because if you do that, I mean, a state machine is typically considered a model. C code is typically considered a program; now, if you mix the two, the distinction between program and model goes away and Anneke Kleppe coined the term ‘mogram’ or 'prodel' which I actually prefer and so the distinction between model and program goes away. Sometimes what you want is you want to find the appropriate level of abstraction for whatever you want to express and appropriate is defined by the purpose of what you want to do with the program - if you want to analyze it in a certain way, you need certain abstractions to make the analysis simple; if you want to generate code maybe for different target platforms, then you mustn’t make any assumptions about the target platform in your model or program.

So the goal of mbeddr is, in some sense, the following: the embedded development community can be very roughly separated into two groups - there are the bit twiddlers, the people who optimize every bit of performance at a very low level, they don’t want to do any abstraction because that’s all costly and dangerous; and then there are the picture drawers who basically use some high level modeling tool, draw a bunch of pictures, press some buttons and don’t really care about the code; there isn’t a lot in the middle and that’s where we want to be.

We want to start with C and then add incrementally and modularly domain specific extensions, some of them pre-existing, some of them specific to your particular domain.


4. So when you say ‘pre-existing’, you mean people have already in the past created their own extensions to C?

It’s part of the project; as part of the open source endeavor of building this; we have implemented all of C in MPS as well as state machines, interfaces, components, physical units, requirements traceability, you can attach requirement traces to any arbitrary program element and trace the requirements as well as support for product line variability, presence conditions connected to feature models.

Currently, depending on when this airs, but at the point when we record this, the requirements, the product lines and all of C are open sourced, and then in the next two or three months we’ll open source all of the rest; we only want to make it public once we have a certain amount of documentation.


5. So the way MPS works is, you say, it doesn’t contain any parsers; that’s not quite true, is it, because you have to write your code. I think you gave a demo where you wrote C code; who parses that?

Nobody, that’s the trick; so I had a long discussion with a parser person; in the end it comes down to how you define parser and grammar. I define parser, and I think most people would agree, as somebody or something which takes a bunch of text characters as input and builds a data structure from that. Strictly speaking, many parsers can be separated into a tokenizer and the actual parser, and there are parsers without tokenizers, but basically it goes from text to an object structure; and MPS does not do that. So, for example, let’s say you have a local variable declaration: 'int i = 2'; let us assume you have that, it’s already there. Now you want to change that to 'int i = 2 * 3'; so the '2 * 3' is, of course, a little subtree with the '*' at the root, the '2' at the left and the '3' at the right of the argument slots.

So if you want to change the '2' to '2 * 3', what you have to do technically is you have to remove the '2', put the '*' in, put the '2' at the left child slot of the '*' and then put the cursor on the right slot so you can put in the '3'; and what you saw yesterday or whenever it was, when I just typed '2 * 3', that’s the reason why MPS is actually usable although it does not parse.
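The tree rewrite described above can be sketched in a few lines. This is only an illustration in Python, not how MPS is implemented; the node classes and the `press_star_right_of` handler are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Num:
    value: int


@dataclass
class Mul:
    left: "Node"
    right: Optional["Node"] = None  # None marks an empty slot awaiting input


Node = Union[Num, Mul]


@dataclass
class VarDecl:
    type_name: str
    name: str
    init: Node


def press_star_right_of(decl: VarDecl) -> Mul:
    """Hypothetical editor event: '*' typed to the right of the initializer.

    No text is parsed. The existing expression becomes the left child of a
    new Mul node, and the right slot is left empty for the next keystroke.
    """
    new_node = Mul(left=decl.init, right=None)
    decl.init = new_node
    return new_node


def project(node) -> str:
    """Project the tree to a textual notation."""
    if node is None:
        return "<?>"  # empty slot
    if isinstance(node, Num):
        return str(node.value)
    if isinstance(node, Mul):
        return f"{project(node.left)} * {project(node.right)}"
    if isinstance(node, VarDecl):
        return f"{node.type_name} {node.name} = {project(node.init)}"
    raise TypeError(f"unknown node: {node!r}")


decl = VarDecl("int", "i", Num(2))
star = press_star_right_of(decl)   # user types '*'
star.right = Num(3)                # user types '3' into the empty slot
print(project(decl))
```

The text on screen is always produced from the tree, never the other way round, which is the point of the approach.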

Projectional editors have had a bad reputation because you literally had to remove the '2', put in the '*' and then put the '2' back, which is extremely annoying. In MPS you just put the cursor on the right side of the '2', type the '*' for the multiplication, and then it automatically does the refactoring. But there is never any parsing; there is basically an event handler, it's like the MVC pattern, that says: if you press the '*' on the right side of an expression, then a tree refactoring happens that rearranges the existing tree. So there is never ever any parsing, although you can type code kind of linearly, almost like in a parsing environment. In the beginning it’s strange because you feel you can’t type anymore, because there are some things that aren’t quite as in regular text.

But the cool thing is this: if you do grammars, there are different grammar classes, LL, LR, LL*, whatever, and the problem is that if you combine two grammars, both of which are in one of the subclasses, it’s quite likely that the resulting grammar is not in that subclass; and since a specific parser always supports a specific subclass, it’s not parseable.

So there are two solutions to that dilemma: one is you use generalized parsers that can parse the whole set of Context Free Grammars, they don’t have that subclass problem, and they can do quite a bit of language composition; or you get rid of the whole parsing step in the first place. And so that’s what MPS does. There’s never technically any ambiguity, and that’s why, at least from a syntactic perspective (semantics is a different question), you can always combine any arbitrary languages.


6. So the 'projectional' in projectional editor means it projects from the AST or from the tree to text?

Yes, to a notation; that may be text, that may be tables, that may be graphical. You may have heard about the Intentional Software guys? They use the same approach, or, maybe I should say, MPS uses the same as Intentional because Charles [Simonyi] kind of invented it, I guess; and they have a system that isn’t easily accessible, I mean there's a whole commercial story, but they already support graphical notations as well.

MPS isn’t quite there yet, that will be there in 2013, but that’s the point: if you project anyway, it doesn’t matter much whether you project a bunch of text symbols, or a table or graphics; it’s the same kind of approach and that’s why it’s so nicely flexible.
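To make the "same tree, different notations" idea concrete, here is a small Python sketch. It assumes a made-up model (a list of state machine transitions) and two made-up projection functions; none of these names come from MPS.

```python
# One underlying model: (source state, event, target state) triples.
TRANSITIONS = [
    ("Idle", "start", "Running"),
    ("Running", "stop", "Idle"),
]


def project_as_text(transitions):
    """Project the model as lines of text."""
    return "\n".join(
        f"in {src} on {event} goto {dst}" for src, event, dst in transitions
    )


def project_as_table(transitions):
    """Project the very same model as an aligned table."""
    rows = [("state", "event", "next"), *transitions]
    widths = [max(len(row[col]) for row in rows) for col in range(3)]
    return "\n".join(
        " | ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )


print(project_as_text(TRANSITIONS))
print()
print(project_as_table(TRANSITIONS))
```

Neither function reads the other's output; both only read the model, so adding a third notation would not affect the first two.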


7. An important thing that MPS does, I guess, is to have a nice simulation of text input which behind the scenes fiddles with the tree.

Yes, that is what makes it usable. I mean, people still complain in the first two or three hours, rightly so; I mean, it is a bit strange in the beginning, but considering what this approach allows you to do, it is, in my opinion, absolutely worth spending a day feeling like an idiot because you think you can't type. Of course, other people disagree and don’t use it.


8. There’s of course, the question then, can I import existing code into MPS since there is no parser?

Yes, that is a question that has to be answered carefully, because assuming you represent a language like C or Java in MPS, then this language has to be defined to be parseable; so there are existing parsers, and what you can do is you can integrate these parsers, and as a side effect of the parsing, you build the MPS tree. This can happen in two situations: one is the so-called stub mechanism where you take, for example, in C a header file or in Java a bunch of Java classes, point MPS at this directory, and it reads all of them and builds their respective models in MPS so you can at least call into these things. It only reads the signatures because that’s enough to be able to call into it.

The other place where you need this text-to-tree parsing is when you paste something, when you copy from text and paste into the tool. Both of those are supported in MPS for Java; MPS comes with Java as a default and it has these two features.

In our C world, we are not yet quite there but we will be there probably by the time this airs. This only works for languages that are parseable. If you define a language in MPS that is not parseable, because it uses tables or because you use reserved words as identifiers, which is not a problem in MPS but which you mostly can't parse later (it depends on the parser), then you cannot do this anymore.

But the point is you referred basically to existing legacy code; by definition, that is written in a language that’s parseable, so in that case it always works. But your own programs are not stored as text but as a tree, XML basically.

And to answer the question you’re going to ask next, there is, of course, an integration with the usual version control systems, so when you do a diff and merge, you do this on the projected syntax, not on the XML level, because otherwise it would be completely useless; nobody is going to do the XMI-ish kind of merge, you'd shoot yourself.


9. Because every time you change something it would be a big change to an XMI or XML file which we have to push?

We work with git basically; so MPS writes XML files, and we use normal git commands to check them in and so on, update them, and then if we get a merge conflict, the files are tagged or somehow merged or whatever git does to mark something as a conflict. MPS reads this and I can go through the conflicting files; I see the diff in the projection, no matter whether it’s graphic or text or tabular, and I can do the usual ‘move this way, that way’ to fix the merge.


10. And so since you move from syntax to semantics you can add various features things like state machines and others so, somebody has to make those do something, how do you do that? So you have to basically combine the existing semantics with the new semantics?

Yes, well and there’s one case that is in some sense, simple; so let’s take a step back. What is a domain specific language? There are two ways of defining the domain for which you build your language; one way is somebody tells you, "My company builds ‘x’ and these ‘x’s have these characteristics" and so you build a DSL for that; so you do domain analysis, you try to understand what the ‘x’ really is; what characteristics it has and you try to come up with a language, you iterate forever because people never are good at explaining what they do and so that’s kind of the top-down approach.

There’s also that bottom-up approach where you take a bunch of existing programs written for a given domain, say, automotive software, embedded C stuff; and then you look at the code and identify recurring patterns and idioms; and then a DSL is simply a language which provides direct syntactic abstractions for these typical patterns and idioms.

In that latter case, it’s trivially simple to think about what a generator would do because it would simply recreate those same idioms and patterns from which we have previously extracted new language concepts.

So that’s what’s typically called reduction or assimilation; so you map an extended concept, let’s say, a state machine embedded into C; you map that to switch case stuff or a cross referencing array data structure - whatever you do for state machines; and this case is simple; sometimes in addition to reducing something in place you may also generate something else in some configuration file or something.
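As a rough illustration of such a reduction, the following Python sketch turns a tiny state machine description into the kind of switch/case C code one might have written by hand. It is not the mbeddr generator; the function name, the input format and the shape of the generated code are all assumptions made for this example.

```python
def reduce_state_machine(name, states, events, transitions):
    """Reduce a state machine to plain C text (enums plus a switch-based step function).

    transitions is a list of (source_state, event, target_state) triples.
    """
    out = [
        f"typedef enum {{ {', '.join(states)} }} {name}_state;",
        f"typedef enum {{ {', '.join(events)} }} {name}_event;",
        "",
        f"{name}_state {name}_step({name}_state s, {name}_event e) {{",
        "  switch (s) {",
    ]
    for state in states:
        out.append(f"    case {state}:")
        for src, event, dst in transitions:
            if src == state:
                out.append(f"      if (e == {event}) return {dst};")
        out.append("      break;")
    out += [
        "  }",
        "  return s;  /* no matching transition: stay in the current state */",
        "}",
    ]
    return "\n".join(out)


c_code = reduce_state_machine(
    "light",
    states=["OFF", "ON"],
    events=["TOGGLE"],
    transitions=[("OFF", "TOGGLE", "ON"), ("ON", "TOGGLE", "OFF")],
)
print(c_code)
```

The generator simply recreates the idiom the abstraction was extracted from, which is why this bottom-up case is conceptually easy.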

Usually if you do incremental language extension by starting with, let’s say, general purpose language and incrementally adding domain specific abstractions, there isn’t any question usually how you generate it because you know that already because that’s what you abstract from, so it’s not a big deal; it may be a lot of work to write a generator but it’s not conceptually a big deal.

Of course, this is in some sense the ideal case. If you start top-down, then part of the domain analysis is to understand what would be a good implementation of the stuff; how you can make it efficient; what kind of optimization we have to put into the generator - that can be a lot of work depending on the amount of optimizations.

In general, I guess, one could say, unless you have to do significant optimizations, building generators is not a big problem; it’s something you do in days, usually, for typical sized extensions. Of course, if you do a lot of sophisticated optimizations, this can extend to any length of time.


11. So moving away from MPS, how would people without MPS compose their DSLs if they wish to do so?

The problem is that, so this discussion is extremely hard to do without any relationship to any tool; so I have to put in a plug. I’m actually writing a book on DSLs which will consist, among other things, of a section on DSL design; DSL design means it discusses design aspects independent of implementation aspects in any given tool; implementation will be in a separate section.

And as part of this design section, we have identified four ways of composing languages, be they domain specific or not, it doesn’t matter in this context. The first one is referencing - so you have two independent models; as you can see from the shapes, this is written with language ‘A’ and this is written with language ‘B’, and the only connection they have is that they kind of have name cross references, and you may have an IDE that checks those for consistency - that’s standard, boring, we’ve done that all the time.

So the characteristic is that there is no syntactic composition, they stay in their own files but that language may have references to objects here so there’s a dependency.

The next way of composing languages is extension; in this case, there is also the dependency but there is also syntactic integration; so you kind of inline the state machine directly into your C code, or you maybe inline C expressions into your state machine; so you have both syntactic composition and dependencies.

And, of course, I guess it’s clear how you continue: you have reuse which is ‘no syntactic composition and no dependencies’ so you have two completely independent languages and you still want to use them together; basically you do this with an adaptor language in the middle which references both but you have no syntactic integration.

And then the fourth is embedding, where you have, again, two independent languages and no dependencies, and you still want to embed them; like you have Java, you have SQL, you want to embed SQL in Java. You do that again with an adaptor, but now, again, you have to solve the syntactic integration problem.
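The four composition styles just described differ along two yes/no dimensions: whether one language depends on the other, and whether the languages are syntactically integrated. A small Python sketch of that classification follows; the type and function names are made up for illustration.

```python
from typing import NamedTuple


class Traits(NamedTuple):
    dependency: bool             # does one language depend on the other?
    syntactic_integration: bool  # are the notations mixed in one program?


# The four styles as characterized in the interview.
COMPOSITION_STYLES = {
    "referencing": Traits(dependency=True, syntactic_integration=False),
    "extension": Traits(dependency=True, syntactic_integration=True),
    "reuse": Traits(dependency=False, syntactic_integration=False),
    "embedding": Traits(dependency=False, syntactic_integration=True),
}


def classify(dependency: bool, syntactic_integration: bool) -> str:
    """Return the name of the composition style with the given traits."""
    wanted = Traits(dependency, syntactic_integration)
    by_traits = {traits: name for name, traits in COMPOSITION_STYLES.items()}
    return by_traits[wanted]


# SQL embedded in Java: independent languages, mixed syntax.
print(classify(dependency=False, syntactic_integration=True))
```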

So referencing is possible with any DSL tool, with any language workbench easily. Extension is possible, looking at Xtext which is the most well known language workbench, I guess, or DSL tool; so you can do, you can extend one language; so you have one linearized extension inheritance hierarchy, if you will, you can do that; there are a couple of details but this works.

Reuse, you can do that with Xtext as well, because you need one inheritance and one reference, and that works. What you cannot do with Xtext is embedding, because you would have to extend from two base languages; you simply can’t do it unless you somehow manually build everything up. I’ve seen a demo where they had integrated HTML into Java, but the HTML editor was completely hand coded; they didn’t use any Xtext stuff, and of course, that’s cheating.

There are other language workbenches, Spoofax, Rascal, they are parser-based but they’re based on GLR parsers; so they’re much better at composing these things.

In general, I would recommend that people go to, that’s the homepage of the language workbench challenge where I think, at this point, it should be like 15 different tools are compared by all these tools implementing the same challenge and the 2011 challenge explicitly addressed some of these things; 2012 challenge is a bit different - so that’s probably a good way to learn about the stuff.


12. On the topic of DSL; DSLs and models are quite related, I suppose, is that right?

A model... I have two ways of defining model: one is extremely pragmatic and technical; so if you work with Eclipse EMF, then everything that’s represented in EMF is a model, it doesn’t matter what abstraction level it is. If you take Java code in Eclipse, by default it is not represented in EMF, it has its own Java JDT tree, so it’s not a model; but if you were to express the same program in an EMF representation of Java, then it would be a model.

So this is a very technical definition; as long as I can work on it with my modeling infrastructure, it’s a model. The more conceptual definition of a model is, of course, that it’s an abstraction and simplification of reality; you leave certain things away, you add abstractions, and to do that you need a suitable language, and that’s where domain specific languages come in. A DSL is nothing else than a language that expresses something with a set of abstractions that are useful for whatever you want to do with that something.

So you can express the same thing with different abstractions; different models and you probably need different languages for that; and a DSL simply says, or, the idea of DSL simply says, "I don’t believe that you can represent every model with suitable abstractions with a general purpose language such as UML".

The argument is that, if you want things to be analyzable and easily generatable, you want to define a custom language, so that’s where the DS comes in.


13. Which brings us to the question, so if I’m programming, am I modeling; or, if I’m modeling, am I programming or am I doing both?

I couldn’t care less; actually I have a set of slides, a talk, which I started with: "I don’t want to do modeling", and of course, there’s more on the slide but people picked out that and cited me this way, so what I really want to do is I want to express behavior and structure at a suitable level of abstraction with suitable notation formally enough so I can process it. Ideally, declarative but if that’s not possible, I can also write imperative code; so I think once you have a tool like a modern workbench without using the name MPS, it doesn’t matter; you can always build an abstraction that’s suitable and it doesn’t matter if you call this a model or a program, it’s really completely irrelevant.

I have to make one caveat to that: there are two different general flavors of modeling; there is prescriptive and descriptive. Descriptive modeling is basically you have something and you represent it some way, typically to discuss with other people, to understand it - that is not what I’m talking about; I talk about prescriptive or executable modeling. I assume that the models are formal enough to transform them into something else which I can execute, and if we presume that, then the discussion whether something is a model or program is completely irrelevant. Maybe in some context you define these terms, you say that, "everything I express with a general purpose language, I call that program; everything I express with a DSL, I call that model." Fine, you can do this.

Other people say, "Everything that’s expressed textually is a program and real modelers draw pictures"; so in that case, you kind of imply the definition of model. But useful models are not necessarily about the notation, although notation is important; the most important thing is the right abstraction level, and so in that sense, it’s arbitrary. And it’s always a good idea, if you’re at a modeling conference and you want to have something to do in the evening at the bar, to just go and ask, "So you’re modeling? What is a model?" And the evening is, depending on your viewpoint, either ruined or saved because the discussion won’t end.


14. You’ll probably get a dozen different answers, I guess.

Sure, and it really depends on what your purpose is and realistically it also depends quite a bit on what your background is; so if you happen to sell graphical modeling tools, then you probably emphasize the term ‘modeling’; if you sell or open source something like MPS where you can do anything on any level you can extend any language, there’s no point in the distinction; so you probably tend not to make the distinction. So it’s a lot also based on just where people come from; what their background is; what their interests are.


16. I might be kidding. But, if I want to parse my Java code, I can’t, because there aren’t really any easily accessible parsers for Java, or for C, for instance.

But there are. I mean, for all of these languages, there are existing parsers and you just grab them, so that’s not a problem.


17. Usually my experience is that these parsers are always insufficient or based on outdated grammars and so on.

What I can say very specifically is that we are currently working on parsing C code for the reasons we talked about before; we’ve integrated the parser from the Eclipse C development tools and it works.

Of course, domain specific languages tend to be simpler and less sophisticated and therefore easier to parse and also for humans, for your brain parser because they address the limited domain and so typically they’re simpler. There are exceptions, SQL, which you could argue is a DSL is not simpler than C, it’s quite different but not simpler; but many DSLs are relatively small and; therefore it’s easier to tackle them technically but also humanly.


18. So finally, I think one topic you’re interested in is verification of models. What’s the state of that?

First, let me prefix that: I’m really not at all an expert on that, I mean, I can talk to some extent authoritatively about the other stuff but not about this verification stuff. I’m really a newbie to this area; however, a year ago, I would have said, it’s useless, it doesn’t scale, nobody understands it, practical people don’t use it; this has changed.

So I’ve worked with a guy called Daniel Ratiu, and he knows about this stuff, and he’s part of this research project where we do the mbeddr stuff, and we’ve integrated model checking into our environment directly, with integrated SAT solving. The way we do the model checking, for example, is that, from a nicely written state machine in the C code, we generate the ‘weird’ input to the model checking tool, then run the model checking tool, and then re-interpret the results in the context of the higher level state machine abstraction; by the way, other people had done that for years. The point is that if you do it this way, then this kind of formal verification can be made accessible to kind of normal people, and the algorithms for doing the model checking stuff have improved in the last ten years to the point where they scale to realistic sizes, not arbitrarily, but to realistic sizes beyond toy examples.

And so it is absolutely feasible to create a state machine, press a button, and then get a proof that it works correctly, for whatever definition of correctly; I mean, you have to specify what you mean by correctly, but you can literally do this automatically. And of course, that’s much better than testing, because in testing you always have the coverage problem and you have to think about all the possible things that can go wrong in order to test them - that’s not the case for these formal verification methods.
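To give a feel for why this beats testing on coverage, here is a toy explicit-state check in Python. It is a stand-in for illustration only: it does not use the model checker or SAT solver integrated into mbeddr, and the door/motor state machine, its event names and the `make_step` helper are invented for this sketch. The checker visits every reachable state and either reports that the property always holds or returns the event sequence that breaks it.

```python
from collections import deque


def check_invariant(initial, events, step, invariant):
    """Exhaustively explore reachable states (breadth-first).

    Returns None if the invariant holds in every reachable state,
    otherwise the shortest event trace leading to a violating state.
    """
    seen = {initial}
    queue = deque([(initial, [])])
    while queue:
        state, trace = queue.popleft()
        if not invariant(state):
            return trace
        for event in events:
            nxt = step(state, event)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, trace + [event]))
    return None


def make_step(guard_start):
    """Build the transition function of a hypothetical door/motor controller.

    A state is a pair (door_open, moving).
    """
    def step(state, event):
        door_open, moving = state
        if event == "open" and not moving:
            return (True, moving)
        if event == "close":
            return (False, moving)
        if event == "start" and (not guard_start or not door_open):
            return (door_open, True)
        if event == "stop":
            return (door_open, False)
        return state  # event ignored in this state
    return step


EVENTS = ["open", "close", "start", "stop"]


def safe(state):
    """Property to prove: never moving while the door is open."""
    door_open, moving = state
    return not (door_open and moving)


print(check_invariant((False, False), EVENTS, make_step(guard_start=False), safe))
print(check_invariant((False, False), EVENTS, make_step(guard_start=True), safe))
```

With the guard missing, the check returns a counterexample trace; with the guard in place it returns `None`, meaning no reachable state violates the property. Nobody had to think of the failing scenario in advance.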

And so I would say that this whole topic, formal verification, goes together with DSLs very well because, if, instead of trying to analyze low level C code which is very hard because of a big state space, a lot of low level details, if you have an ability to easily extend your language, and thereby raise the abstraction level to a degree where the analysis becomes simpler, then you win.

And so this adding abstractions and then doing the analysis on that level is really cool. We’re planning to do things like, if you have two state machines, like a protocol state machine for a provided interface, and maybe a protocol state machine that defines a client’s behavior, we can check them for compatibility and check that pre- and post-conditions and behavior sequencing are always correct for all cases.

I don’t know how to do this; my colleague does. And this is extremely cool stuff if you kind of go towards safety-critical systems; so I would really encourage people to take a look at the stuff. It’s not simple, in the same sense that building DSLs is not something that you, like, learn in a week; you need experience. The same thing is true for the formal verification stuff, but it’s worth taking a look.


19. We will all take a look and just to wrap up, anything else you have to plug?

Markus: The book.

Warner: What’s the name of the book?

Markus: That’s a good question, well, it’s probably going to be called DSL Engineering or something, or DSL Engineering with Language Workbenches; it’s going to be either free or very cheap; it’s going to be self-published, so maybe it’s a 3.99 Kindle book or maybe it’s a free pdf, I don’t know yet. It will have three parts: the first will be on DSL design; the second will be on DSL implementation illustrated with MPS, Spoofax and Xtext; and the third part will be discussions about how DSLs relate to different aspects of software engineering, like requirements, architecture, implementation and stuff like that; details at


20. And mbeddr is also available online I think?

Yes, and because we’re cool, we have avoided a bunch of vowels; so you get that at .

The point is, even if you don’t do embedded development, it’s really an extremely cool and interesting case of using and composing the DSLs; so it’s just from that and we have a bunch of nice papers and descriptions; so if you’re interested in DSLs, it certainly should spawn discussions and inspire people.

Jun 07, 2012