
Developing a Complex External DSL


Apr 13, 2009 36 min read


The use of a domain-specific language, or DSL, is becoming a realistic and even necessary solution for software developers on all sorts of software development projects. You've heard about DSLs, and you may know that DSLs are divided into two styles, internal and external. But what is an internal DSL and what is an external DSL? When would you decide to use one or the other? And, primarily, how would you go about developing a complex external DSL? This article answers these questions, with a focus on developing a complex external DSL.

Defining Domain-Specific Language

A domain-specific language (DSL) is a computer language that is developed to specialize in addressing the needs of a given problem domain. The domain itself could be many things. It could be specific to an industry, such as insurance, education, aerospace, medicine, etc., or to a technology or methodology, such as JEE, .NET, database, services, messaging, architecture, or domain-driven design.

The reason I would develop a DSL is to make dealing with a set of challenges in a domain I am working in more elegant and easier to deal with. That's because the language I create will be just what I need to address my unique set of challenges, and no more than that. And of course, if I provide my language for others to use it may have to broaden a bit to address what they need, but still nothing more. The effort has the goal of making it feel more natural to use the DSL than to use a general purpose programming language or some other non-targeted tool.

An important distinction for this article is between internal DSLs and external DSLs. Each of these is a style of DSL, and it is important to understand what style is appropriate for a specific problem domain. It is out of the scope of this article to delve deeply into what defines a DSL in general and what makes an internal and external DSL. Martin Fowler and others have led such thought experiments and I suggest that you read their work for more detail on the subject. However, I do provide some basic context setting here.

Internal DSL

The language that is developed could be very closely tied to and implemented on top of the primary general purpose programming language in use in your project, such as Java, C#, or Ruby.

The Rails framework has been called a Ruby-based DSL that manages web applications written in Ruby. One of the reasons that Rails is called a DSL is because some of the Ruby language features are used to make programming against Rails seem different from programming against the general purpose Ruby language. When thought of as a language, Rails being created on top of Ruby as its foundation had a big head start in becoming a language in its own right.

I am not sure if Dave Thomas (PragDave) views Rails as a whole as a DSL, but he notes that several features of Rails are supported by different DSLs. The example he presents is Active Record Declarations as a DSL. By using some simple jargon specific to domain model entity associations, Rails developers allow the DSL to manage all sorts of complex infrastructure and operations behind the scenes while themselves focusing on high-level entity association concepts.

Whether its creators or huge consumer base view Rails as a whole as a DSL, or only some features within Rails (Active Record Declarations), what I have been discussing here is an internal DSL. This DSL style has been labeled internal because, again, it is closely tied to and implemented on top of a primary programming language, but employs techniques to make it seem like it comprises a specialized language in its own right.

One of the key defining characteristics of the API of a framework or library fitting the bill of internal DSL, according to Martin Fowler and Eric Evans, is that it has a fluent interface. It basically means that you can stitch together short object expressions to form longer expressions that read more naturally.

I have been using and designing fluent APIs off and on for a while. For example, much of my early experience with fluent APIs was in Smalltalk. There are two ways to develop and use fluent interfaces in Smalltalk. First you can make the answer (return value) from one object message expression (method invocation) become the receiver of the next message:

   1 + 2 + 4

Here the number (object) 1 receives the + message with the number (object) 2 as its argument and the resulting number 3 (implicit) itself becomes the receiver of the next + message with the argument of number 4. (For clarity, number literals are not primitives in Smalltalk; they are first-class objects.) Of course this seems natural to all of us, so much so that its roots aren't even in programming. Well, that's the point. I could have accomplished the same thing like this:

  result := 1 add: 2.
  result := result add: 4.

But that's not natural or fluent to most. At first glance the fluent expression says it's clearly 7; not so the second example. And since this technique is not limited to a numeric math domain, the fluent nature of Smalltalk programming makes it easy to deal with the domain-specific aspects of many different domains. You are welcome to look into Smalltalk cascades, which is a second language facility that supports fluent interfaces. I demonstrated the first approach since it is supported by modern object-oriented languages (but in some cases not in the same way with number literals).
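The same first approach can be sketched in Java, where number literals are primitives and cannot receive messages. The `Amount` class and its `of`, `plus`, and `value` methods below are hypothetical names introduced only for illustration; this is a minimal sketch of method chaining, not an API from any library.

```java
// Hypothetical value type, used only to illustrate method chaining.
final class Amount {
    private final int value;

    private Amount(int value) {
        this.value = value;
    }

    static Amount of(int value) {
        return new Amount(value);
    }

    // Returning a new Amount lets the result become the receiver of the next call.
    Amount plus(int other) {
        return new Amount(value + other);
    }

    int value() {
        return value;
    }
}

class FluentDemo {
    // Fluent style: one expression that reads left to right.
    static int fluentSum() {
        return Amount.of(1).plus(2).plus(4).value();
    }

    // Step-by-step style: the same result with more ceremony.
    static int stepwiseSum() {
        Amount result = Amount.of(1);
        result = result.plus(2);
        result = result.plus(4);
        return result.value();
    }
}
```

Both methods compute the same sum; the difference is only in how naturally the expression reads.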

The significance here is that the fluency around a given API is designed in to make dealing with the given problem domain more elegant and efficient, even natural to experts within the domain. That's what makes it a DSL.

Of course we are each free to think of things like this in a manner that makes most sense to us. Whether or not we personally choose to think of a given API as a DSL is our choice. But it is important to recognize that Martin Fowler and others see what I have described above as internal DSLs. They tend to have a lot of influence on what the rest of the industry thinks and does. It's a nomenclature that is likely to stick around and come up in discussions for a while.

In my experience we tend to design and develop internal DSLs when we need a technical API for ourselves and fellow developers to use. It helps when the general purpose programming language on which we base the internal DSL has a rich set of facilities that support this style. Clearly some languages such as Smalltalk and Ruby make this easier. Some languages such as Java and C# make it less easy. Targeting a fluent API and other internal DSL facilities at capable software developers cuts complexity and development time. However, if we are trying to simplify work for and add power to non-programmer domain experts, internal DSLs are not an option.

External DSL

Frankly I never thought of using, designing, and developing fluent APIs as work in domain-specific languages. So I have to admit that because of a long history with fluent APIs, it is difficult for me to shoehorn the concept into the internal DSL category. But I'm learning. On the other hand, when I first read about DSLs I immediately associated the definition with my work of creating a number of “little languages.” Because of this I'd suggest that if you had trouble swallowing the definition of internal DSL above, this flavor will taste much better.

Defining an external DSL is much easier than defining an internal DSL. It is analogous to creating a general purpose programming language that is either compiled or interpreted. The language has a formal grammar; that is to say, the language is constrained to allow only well-defined keywords and expression types. The source code of a program written in the language is kept in one or more files that have a text format, or might be more of a tabular or even graphical format. In the textual DSL case you'd create the source file using a text editor or a full-fledged IDE. You then compile the source code and run it as part of a resulting program, or otherwise run the source code directly through an interpreter.

The main difference between a general purpose language source and an external DSL source is that when compiled the DSL is generally not output as a directly executable program (but it could be). Generally the external DSL will be translated into a resource that is compatible with the core application's operational environment, or it may be translated into source code of the same primary general purpose programming language used by and built as part of the core application.

An example of an external DSL that is translated into a resource that is consumed by the application is the object-relational mapping files used by Hibernate and NHibernate. Another example is presented by Jay Fields of ThoughtWorks on “Business Natural Languages.” You can imagine an external DSL that contains metadata that your application needs to, say, validate user input. You would read the metadata, transform it into an efficient internal format, and use it at runtime.
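The validation-metadata idea might look like the following sketch. The one-rule-per-line syntax (`age between 0 and 150`) and the class names `RangeRule` and `ValidationRules` are assumptions invented for this illustration; they do not come from Hibernate or any other tool mentioned here. The text is read once, turned into a lookup-friendly internal form, and then consulted at runtime.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Internal form of one rule in the hypothetical syntax: <field> between <min> and <max>
final class RangeRule {
    final int min;
    final int max;

    RangeRule(int min, int max) {
        this.min = min;
        this.max = max;
    }

    boolean accepts(int value) {
        return value >= min && value <= max;
    }
}

final class ValidationRules {
    private static final Pattern LINE =
        Pattern.compile("\\s*(\\w+)\\s+between\\s+(-?\\d+)\\s+and\\s+(-?\\d+)\\s*");

    private final Map<String, RangeRule> rulesByField = new HashMap<>();

    // Read the DSL text once and keep it in an efficient internal format.
    static ValidationRules parse(String source) {
        ValidationRules rules = new ValidationRules();
        for (String line : source.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                throw new IllegalArgumentException("Unrecognized rule: " + line);
            }
            rules.rulesByField.put(
                m.group(1),
                new RangeRule(Integer.parseInt(m.group(2)), Integer.parseInt(m.group(3))));
        }
        return rules;
    }

    // Fields without a rule are accepted as-is.
    boolean isValid(String field, int value) {
        RangeRule rule = rulesByField.get(field);
        return rule == null || rule.accepts(value);
    }
}
```

A language this small can get by with a regular expression per line; the rest of this article is concerned with languages that cannot.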

Examples of external DSLs that get translated into the source code of a target application are the languages developed using the MetaCase MetaEdit+ Workbench and the JetBrains Meta Programming System (MPS). Another example is Markus Völter's work on “Architecture as Language.” In this example Markus is able to define a software architecture, validate it, and generate code from textual architecture descriptions.

An external DSL may be used to directly support the design and development effort and thus be used by software developers. In that case it could be used to generate code that consumes a fluent API developed as an internal DSL. An external DSL is also appropriate for use by non-programmer domain experts, if properly designed and targeted for that class of user.

In many cases an external DSL is best when accompanied with language support, such as tooling. When only a small group of software developers, including the DSL's authors, are its users, you may only need a simple text editor. But when the DSL is distributed outside its team of creators, or when it is targeted at non-programmer domain experts, a syntax highlighting and code assist editor is essential to the success of the DSL. Other tooling may also be appropriate or necessary.

Language Complexity

I consider a complex external DSL one that:

Is not easy to parse. A comma delimited file of text records is relatively easy to parse. A language such as Java is not relatively easy to parse. A complex external DSL is somewhere in between the two, and probably leans closer to the Java language than a CSV in terms of complexity. (You may still want to develop a formal grammar for a CSV anyway, but my point is it is not essential and possibly quicker not to.)
Requires a complex internal representation once parsed. The complex representation is a tree or object graph that contains the source artifact expressions in a convenient and optimized state. This representation supports validation, as well as interpretation, generative tooling, or both.
Facilitates the generation of one, several, or even many complex target artifacts from a single source file or multiple source files. This of course applies only if the language is used by generative tooling, whether in addition to or instead of being interpreted. If you only decorate the target artifact with a few minor details not provided by the source DSL, I'd question why the target source language was not used in the first place.

With the above context you may be wondering how you'd actually implement a complex external DSL. I describe that in the next section.

Design and Development

Designing and developing any complex language is a big challenge. Even if you have a good idea of what you want out of a language, working out the details of a complex language can bend your mind. It seems like just about the time you think you've got all the language features identified and syntax worked out, your future user base (including yourself) thinks of something new. This happens even more often with a DSL than when designing a general purpose language.

Of course, like anything useful in our field, a language will undergo enhancement after enhancement over time. One thing we need to do is make sure that we can support those enhancements over time. The language design, therefore, needs to embrace change. Further, a good language design will make developing the language far easier.

I mention a wide range of language syntaxes in this section, including graphical and textual. However, I must limit my scope to just one syntax type in order to keep this article to a reasonable length. Thus, I have chosen to focus on textual DSLs. A textual DSL is easily understood, and is likely applicable even when a graphical DSL is in use (see why below).

Syntax Design For Now and Later

Fortunately most language designers don't wake up in the morning and say “I'm going to create a language today, and I wonder what it will be.” If we are considering the development of a language we have a good idea why. This is important because if we really don't have a clear vision for the language our resulting design will be just as weak. So an important first step in language design is to know a lot about what you want your language to do.

Knowing what you want your language to do doesn't necessarily mean you know what its syntax will be. A language's syntax is not only key to its usability, but can also influence its ability to adapt to enhancements. Of course a language's syntax must be appropriate to its audience. If your language's users are technically inclined, even programmers, the language syntax you choose will be different from the syntax you choose for non-programmer domain experts.

In discussing syntax I am not limiting my context to textual DSLs, even though that is my focus for the remainder of the article. It is possible that your language will have two syntaxes: one that the language user interfaces with and the one that gets saved in a file, and gets parsed and translated. In such a case you will likely develop the user-centric syntax as a graphical user interface (such as a table with rows and columns) or even a glyph/shape based language (similar to Visio diagrams) where the user draws the syntax instead of writing it. In the background, at the file level, your language can be as technical as necessary. We tend to think of graphical syntaxes as models, but models are not limited to graphics just as complex DSLs are not limited to the textual variety. As you will soon see, at some point you will start to think of your language in terms of a model no matter what its syntax is like.

If the technical syntax of your complex DSL is not flexible enough to support future enhancements you face the danger of limited or zero backward compatibility. That is, you may create a new language feature that invalidates sources that existed before the enhancements. I believe that this danger is greater when supporting dual syntaxes. Of course it is possible to provide a source file (model) upgrade transformation tool. However, even file syntax upgrade utilities have limited benefits depending on the nature and complexities of the enhancements. They also are an unwelcome necessity to the user and spell resistance to upgrades.

I can provide a few suggestions for choosing an appropriate and extensible syntax, which are in order of relevance as your design efforts advance from early to later stages:

Study other languages. Think about why languages such as Java and C# are designed the way they are and are so successful. For certain, Java and C# syntaxes were designed for acceptance by communities of developers that existed before these newer languages. But there are other reasons. Block-based scoping languages are inherently extensible because new blocks of various types may be added adjacent to existing block types and also nested within them. This does not mean that you should reinvent such languages. It means that you can reuse certain aspects of successful languages in your own. Make sure you think about other languages such as Ruby, Smalltalk, Perl, and Python. What do you like and dislike about them? If you could change and blend any set of languages, what would you do? Can you reuse aspects of an existing successful language to improve your language ideas?
Experiment with various syntax ideas using agile techniques. How does it feel to write using each syntax? What do others think of the experimental syntaxes? Can the syntaxes be defined as a formal grammar? What would it be like to support tooling for your favorite syntax?
Identify as many language features as possible. Experiment with various syntaxes (as suggested in the previous point) that support the top 70-80% of those features and withhold the rest. Once you think you have a winning syntax consider what it would require to add the remaining 20-30% of withheld features. Is the first version of the language brittle or is it extensible? Besides withholding and later adding features, purposely implement some of the syntax incorrectly. Consider what is involved in correcting the mistakes and ask yourself what would happen if you had to continue to support both the wrong syntax and the new corrected and enhanced syntax. Is there something about the syntax that makes dealing with these issues better or worse? Is there anything you could do to the syntax to make the issues easier to deal with?
Present your language before communities of potential users. What do they think of the syntax? Is it intimidating to even smart people? Ask people you want to use your language for their honest opinion.
Beta test your language. It is easier to get users to accept high impact changes to a language at version 0.9 than at or above version 1.0. Prepare your beta testers for the possibility that their feedback will result in obsoleting their work of creating source artifacts.

Of course if you are developing a graphical DSL and it will never be authored outside the graphical environment, it is appropriate and perhaps best to use XML. However, I would never suggest that an XML schema be used as a source syntax for any directly edited textual DSL. Just imagine your language users authoring Ant (or NAnt) programs and I think you'll get my drift. As Martin Fowler stated: “XML makes it easy to parse, although not as easily readable as a custom format might be. People do write plug-ins for IDEs to help manipulate the XML files for those who find that angle brackets hurt the eyes.” If you are designing a directly edited textual DSL today, avoid XML-based syntax like the plague.

As a final reemphasizing statement about DSL syntax, a complex textual DSL needs to be definable as a formal grammar using a BNF (or EBNF). If your language cannot be expressed as a formal grammar then it is going to be very difficult or impossible to parse. More on parsing and BNF is provided below.

Designing the Language Metamodel

Think of source code written in a language's grammar (syntax) as a model of concepts that you are describing. The concepts you are describing could be data, structure, and behavior, which are typical concepts in computerlandia. From the language designer's point of view, the descriptions of these concepts are a model, not just source code. Thus, when you parse a source model and put its representational contents into objects, the objects are called a metamodel.

Even if a language source artifact is loaded into an abstract syntax tree (AST), an AST is a metamodel of sorts. An AST is a metamodel of the parts of a source syntax pertinent to describing its abstract structure, albeit closely tied to the syntax. In any event, I'd suggest that many complex textual DSLs should not be loaded into an AST but rather into a richer metamodel. I favor making the metamodel a graph (as necessary) that is much like the Model layer of the Model-View-Controller pattern; a Domain Model. In this case, however, the graph is not a model but a metamodel of the source model. (Note that Martin Fowler uses the name Semantic Model for what I here call Metamodel. He also defines this concept as an object model that is a Domain Model.)

Although this topic follows that of syntax design, it does not mean that the language's metamodel cannot be considered prior to a finalized syntax. The fact is, your language's metamodel is as important to the internal workings of your DSL as your syntax is to its external acceptance and future enhancements. The metamodel's design can begin even before the language syntax is conceived because the metamodel is not (or should not be) strongly tied to a syntax.

To illustrate this, recently James Gosling called Java's formal grammar (syntax) “a fraud” (video at approximately 27:00 and also 60:00) because the language's original design didn't call for a C-like syntax. Nonetheless, Java internally still had interfaces, classes, methods, fields, threads, primitives, and ran on byte codes. If there had been no effort to attract C/C++ programmers to Java by use of a familiar syntax, Java could have ended up looking much different to us than it does today. However, you can be sure that Java's metamodel didn't have to change (or at least not drastically) in order to facilitate a C-like syntax (perhaps for concepts such as pre- and post-increments, unless those were already supported by a non-C syntax). That's because the underlying metamodel defined the language's concepts in an abstract way that could be mapped from multiple syntaxes. This feature is what makes the Java VM a great host for scripting languages such as Groovy and JRuby.

When thinking about a metamodel remember that it is an object model that holds meta information about the source model. Thus, any of the concepts of your language should be expressed richly in the metamodel. Let's take a familiar example, a metamodel of an object-oriented language. The classes in the object-oriented language metamodel would include MetaClass, MetaField, and MetaMethod.

The class MetaClass would contain metadata about the class in the source model that it represents. For instance, if some source code defined a class named EmailAddress then you'd have a MetaClass instance that has a name attribute/field set to the string “EmailAddress”. Class MetaClass would also contain a collection of MetaField instances and another collection of MetaMethod instances. If the model source class EmailAddress had a field declared with the name address then the MetaClass would contain at least one element in its fields collection, an instance of class MetaField that has its name attribute/field set to the string “address”. Further, each MetaField instance would have a reference to the MetaClass type of that model field. Thus, the metamodel forms a graph.

You would use meta classes analogous to this example, but for the particular model representation expressed in your DSL. I suggest considering a meta class hierarchy that starts with the abstract base class MetaObject. The MetaObject would be used to provide default state and behavior for all meta subclasses. For example, perhaps many meta objects your language supports have a name. In that case class MetaObject would contain the name attribute/field and accessors on behalf of all subclasses. After you have defined a useful MetaObject you can start designing the full class hierarchy of your metamodel. Of course over time as your metamodel grows you would refactor common state and behavior into class MetaObject.
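A minimal Java sketch of such a hierarchy follows. The class names MetaObject, MetaClass, MetaField, and MetaMethod come from the discussion above; the constructors, accessors, and `add` methods are assumptions made for illustration, and method parameters and bodies are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Base class holding state shared by all meta objects (here, just a name).
abstract class MetaObject {
    private final String name;

    protected MetaObject(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// Describes one field of a source class; its type refers to another MetaClass.
final class MetaField extends MetaObject {
    private final MetaClass type;

    MetaField(String name, MetaClass type) {
        super(name);
        this.type = type;
    }

    MetaClass getType() {
        return type;
    }
}

// Describes one method of a source class (parameters and body omitted).
final class MetaMethod extends MetaObject {
    MetaMethod(String name) {
        super(name);
    }
}

// Describes one class in the source model.
final class MetaClass extends MetaObject {
    private final List<MetaField> fields = new ArrayList<>();
    private final List<MetaMethod> methods = new ArrayList<>();

    MetaClass(String name) {
        super(name);
    }

    void addField(MetaField field) {
        fields.add(field);
    }

    void addMethod(MetaMethod method) {
        methods.add(method);
    }

    List<MetaField> getFields() {
        return fields;
    }

    List<MetaMethod> getMethods() {
        return methods;
    }
}
```

Because each MetaField refers to the MetaClass of its type, building the EmailAddress example with an address field of type String already links two MetaClass instances, which is how the graph arises.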

To take this approach one step further, if you are familiar with Eric Evans' “Domain-Driven Design” patterns (DDD), you could apply those patterns to your metamodel. I use this approach for my DomainMETHOD tool, which is a DSL that facilitates DDD and generates a working application domain layer. So I get the best of both worlds: I design and develop my tool using DDD, and I support the design and generation of working DDD-based domain layers. My tool's design is complete with Entities, Value Objects, Aggregates, Repositories, and more. I store metamodel object Aggregates that are read from model source in my MetaObjectRepository. I use a ProjectRepository to find project configurations and custom generation directives, and to create, find, and store generated target artifacts, which are eventually persisted to output files. I also have a repository that I use for finding and for managing the state of source templates, aptly named TemplateRepository.

Using DDD can be a practical and powerful means of conceptualizing, designing, and implementing your metamodel. This might not always work out depending on the DSL characteristics, but if it can you should consider using DDD.

Relationships Between Metamodels and Target Models

In this article I have mentioned three potential target uses of an external DSL. I want to restate these here in the context of metamodels to show interesting relationships between metamodels and target artifacts or models.

Here are the three targets I have mentioned: the DSL source model could be parsed and translated into source code to become part of your application; the DSL source model could be parsed and interpreted in your application's runtime environment; the DSL source model could be parsed and translated into another form of data for your application to consume at runtime.

The first use of the metamodel, to achieve target source code artifact output, is a straightforward albeit usually more complex one. Essentially you are transforming one or more source models into a metamodel and then into a set of target models (or source artifacts).

The last two variations, interpreted and translated data format, have similarities because an interpreted model and any other form of translated data are just varying types of metamodels. In essence taking your metamodel built by your parser and turning it into another data format supported by your application could be thought of and even be executed as a model-to-model transformation. But the point is, it can remain a metamodel. This notwithstanding, if you persist the data format metamodel rather than keep it only in memory, a final persistent transformation as part of a generative process will need to occur.

There may be other differences between these last two variations since an interpreted model may be more behavioral or a mixture of behavior with state and state transitions. However, if the interpretation goal is simple in nature you will see real similarities between the two.

Metamodel Generation

It is possible to get your metamodel for free. The openArchitectureWare Xtext tool (now part of the Eclipse Modeling Framework) creates your metamodel automatically as an artifact of its tooling. All you do is define your formal grammar using Xtext, a step you would likely take anyway with a different parser-generator tool, and then generate artifacts from it. Your metamodel is generated along with a parser. When your resulting parser parses the DSL source model the parser instantiates corresponding parts of the metamodel, making it ready for validation, interpretation, and/or code generation. That's pretty handy.

Defining the Parser

I purposely named this subsection “Defining the Parser” rather than “Developing the Parser.” Since a complex external DSL has a complex syntax that is not easy to parse, I believe that attempting to manually design and code a language parser is an exercise in futility. Many experienced in language parsing agree. There are very few developers who are qualified to create a custom language lexical analyzer and parser. Even for those who could do so, unless raw performance is essential, they would probably choose to use a tool that allows them to define, rather than hand code, a parser. It is simply easier, faster, more efficient, and less error prone to do so.

Most parser-generator tools support the description of a language's formal grammar (its syntax) using the Backus-Naur Form (BNF) or an extended variation (EBNF). The BNF provides a formal way to define a formal syntax, and an EBNF makes it even easier to do so:

INTEGER_LITERAL : ('-')? ('0'..'9')+ ;

As can be seen above, an EBNF improves BNF by supporting optional and repeating specifications. In the case of INTEGER_LITERAL, the specification states that it may or may not have a minus sign and will be made up of one or more digits between 0 and 9. If you know regular expressions you will have a big head start here. While EBNF is not the same as regular expressions, they look like next of kin.
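To make the kinship with regular expressions concrete, the INTEGER_LITERAL rule can be restated as a Java regular expression. The class name `IntegerLiteral` is hypothetical, and this sketch only checks whether a whole string matches; it is not how a generated lexer works internally.

```java
import java.util.regex.Pattern;

final class IntegerLiteral {
    // Regex counterpart of the EBNF rule: ('-')? ('0'..'9')+
    private static final Pattern PATTERN = Pattern.compile("-?[0-9]+");

    // True only when the entire text is an optional minus sign followed by digits.
    static boolean matches(String text) {
        return PATTERN.matcher(text).matches();
    }
}
```

The optional group `('-')?` becomes `-?`, and the repeated range `('0'..'9')+` becomes `[0-9]+`.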

A typical language grammar for a general purpose language may be defined at its highest level as follows:

prog : expr+ ;

This simply states that the program (prog) is made up of one or more expressions (expr). Of course you are not done yet. You must define what an expression (expr) is, which is where the fun begins:

expr : ... ;

Further decomposing a language's syntax from the highest abstraction down to its lowest detail has a learning curve. Nonetheless, this approach is much easier and much faster than hand coding a buggy lexer and parser. Just run the parser-generator tool against your properly defined language formal grammar definition and in a second or two you have a working lexical analyzer and parser to use. Even better, once you've gotten the hang of the EBNF that your tool supports your productivity will soar.

There are several open source and otherwise low cost parser-generator tools available. I like the open source tool ANTLR (pronounced “antler”) for several reasons. ANTLR supports the generation of parsers in several target languages. ANTLR version 3 has very sophisticated source stream look-ahead techniques that save you hours (or days!) of frustration in resolving grammatical ambiguities. Grammatical ambiguities occur when various parts of your language clash and confuse the parser. This is a complicated topic in itself and I can't provide further information now. But believe me, if you haven't experienced this problem you don't want to, and if you have experienced it you just want to download ANTLR 3 and be done with it. ANTLR supports parameter passing to element definitions, simple and complex return values, and many more professional features. By the way, ANTLR's EBNF is a great example of a complex external DSL.

Most modern parser-generators allow you to associate custom code with language element definitions. This boils down to inserting your custom code at the points where the parser evaluates the source stream as matching an element condition. In DSL terms Martin Fowler names this the Foreign Code pattern since the custom code is foreign to the EBNF language. Since your custom code gets inserted into the parser it allows you to, among other things, instantiate your metamodel on matching events. Thinking back to the object-oriented language example above, here's how it works (in an incomplete yet simplified fragment):

classDef[MetaObjectRepository repo]
  : 'class' (ident = identifier) '{' cb = classBody '}'
  {
    MetaClass metaclass = new MetaClass($ident.text);
    // ...
    repo.addClass(metaclass);
  }
  ;

The classDef element is matched by the generated parser when the lexical token stream has the following sequence:

The text string “class”
An identifier element (defined elsewhere)
The text string “{”
A classBody element (defined elsewhere)
The text string “}”

In the generated parser code (Java in this case) the matcher for classDef has the custom code between the (unquoted) curly braces inserted. When a match occurs the custom code is executed. This allows the new MetaClass to be instantiated with its class name set to the value of the text string in the ident variable. The ident variable was assigned upon matching of the identifier element (the same goes for the assignment of cb with the match of classBody). Notice that the MetaObjectRepository instance repo is passed as an argument to the classDef element, and is available for the custom code to use. Once the new MetaClass is instantiated and fully constituted (code not shown) it is added to the repository.

There is no magic here. The key to understanding how this all fits together is in the fact that each EBNF element is generated as a Java (or other parser target language) method. Thus, there is a method generated for classDef, for identifier, for classBody, and for all other elements defined in the grammar.
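To make the element-to-method mapping concrete, here is a hand-written sketch of the shape such a parser takes. It is not ANTLR's actual generated output. The whitespace tokenizer, the SketchParser class, and the minimal MetaClass and MetaObjectRepository types are stand-ins assumed purely for illustration, and classBody is reduced to skipping tokens.

```java
import java.util.ArrayList;
import java.util.List;

class ClassDefSketch {
    public static void main(String[] args) {
        MetaObjectRepository repo = new MetaObjectRepository();
        new SketchParser("class Customer { }").classDef(repo);
        System.out.println(repo.classes.get(0).name);
    }
}

// Hypothetical stand-ins for the metamodel types named in the grammar fragment.
class MetaClass {
    final String name;
    MetaClass(String name) { this.name = name; }
}

class MetaObjectRepository {
    final List<MetaClass> classes = new ArrayList<>();
    void addClass(MetaClass metaClass) { classes.add(metaClass); }
}

// Hand-written sketch: one method per grammar element.
class SketchParser {
    private final String[] tokens;
    private int pos = 0;

    SketchParser(String source) {
        // Naive tokenizer (assumption): tokens are separated by whitespace.
        String trimmed = source.trim();
        this.tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // classDef : 'class' identifier '{' classBody '}'
    void classDef(MetaObjectRepository repo) {
        match("class");
        String ident = identifier();
        match("{");
        classBody();
        match("}");
        // The custom ("foreign") code runs here, once the element has matched.
        repo.addClass(new MetaClass(ident));
    }

    String identifier() {
        String token = next();
        if (!token.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalStateException("Expected identifier but found: " + token);
        }
        return token;
    }

    void classBody() {
        // Simplification: skip everything up to the closing brace.
        while (pos < tokens.length && !tokens[pos].equals("}")) {
            pos++;
        }
    }

    private void match(String expected) {
        String token = next();
        if (!token.equals(expected)) {
            throw new IllegalStateException("Expected '" + expected + "' but found: " + token);
        }
    }

    private String next() {
        if (pos >= tokens.length) {
            throw new IllegalStateException("Unexpected end of input");
        }
        return tokens[pos++];
    }
}
```

The repository travels as an ordinary method argument, which is why passing parameters to an element definition feels natural once you picture each element as a method.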

Parsing One or Multiple Models

You will need to determine if one or many DSL model sources will be parsed into a single metamodel instance. For example, if the DSL sources represent an abstraction of target application sources, it is very likely that DSL source authors will produce many such sources. In that case your parser will need to know how to look for multiple sources.

This requirement defines a simple directory-file crawler. The DSL tooling might accept a project base path and crawl the directory structure matching all files with a file extension defined by your language specification. As each properly named file is encountered and matched it is read, parsed, and the results placed into a single metamodel instance. Associations between sources represented in the metamodel thus form a graph. Once the single metamodel instance has been constituted from all DSL source models you are ready to take one of the three common actions: translate and generate, interpret, or transform into a different kind of model such as a data format suitable for your application.
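A crawler of this kind can be sketched with the standard java.nio.file API. The Metamodel and SourceParser types below are hypothetical placeholders for your own metamodel and generated parser, and the ".dsl" extension used in the demo is only an assumed example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SourceCrawler {
    public static void main(String[] args) throws IOException {
        Path base = Paths.get(args.length > 0 ? args[0] : ".");
        // The lambda stands in for a real parser and simply records each source text.
        Metamodel metamodel = loadAll(base, ".dsl", (text, model) -> model.parsedSources.add(text));
        System.out.println("Parsed sources: " + metamodel.parsedSources.size());
    }

    // Collects every regular file under basePath whose name ends with the extension.
    static List<Path> findSources(Path basePath, String extension) throws IOException {
        try (Stream<Path> paths = Files.walk(basePath)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(extension))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    // Reads and parses every matching source into one shared metamodel instance.
    static Metamodel loadAll(Path basePath, String extension, SourceParser parser) throws IOException {
        Metamodel metamodel = new Metamodel();
        for (Path source : findSources(basePath, extension)) {
            String text = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
            parser.parseInto(text, metamodel);
        }
        return metamodel;
    }
}

// Hypothetical placeholder for the single metamodel instance.
class Metamodel {
    final List<String> parsedSources = new ArrayList<>();
}

// Hypothetical hook where the generated parser would be invoked.
interface SourceParser {
    void parseInto(String sourceText, Metamodel metamodel);
}
```

Sorting the paths is a design choice rather than a requirement: it makes the parse order repeatable from run to run, which helps when diagnosing cross-source references.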

If your DSL and tooling support only a single DSL source model for constituting your metamodel, your parser will be a bit simpler. Just design your tooling to accept a single model source, parse it, build your metamodel, and then take one of the three actions noted above.

If you are generating code or some form of data from your DSL, the next subsection will be of interest.

Generating Code

In discussing code generation, note that the principles here also apply to data generation. It's just that data generation is usually somewhat, and sometimes much, simpler than code generation, so I address code generation head on. The good news is that if all you need to do is generate some kind of data format from your DSL, you will probably be able to use the simpler strategies I discuss here.

At first it may seem that code generation from a metamodel is a straightforward exercise. Indeed, depending on the simplicity of your DSL, your metamodel, and the target artifacts, it may be just that. In my experience, however, a complex external DSL means that code generation from its corresponding metamodel is rarely simple. I will progress from common code generation strategies to the more powerful ones, outlining the competing forces with each approach.

Direct From Model

One approach to code generation is to produce the target artifact output directly while parsing, without creating a metamodel at all. This technique replaces all or most of the custom code presented above, which was used to constitute your metamodel, with code that outputs directly to target artifacts. If you can use this strategy then why not? Certainly if you don't have to go to the trouble of designing a metamodel and then writing the custom parser code snippets to constitute it, you are well on your way to completing your overarching task and moving on.

If you believe that your DSL will never need a metamodel then I suggest considering this approach first. My only caution is, if your DSL source is an abstraction so close to your target artifact output, is it possible that you should be coding directly to your target artifacts in the first place? You will have to answer this question, and it is possible that the answer is “no,” which is good for you.

Are you targeting multiple artifacts from a single source? Are you targeting multiple artifacts from multiple sources? If so it is very likely impractical to use a direct-from-model generation approach.

Walking the Metamodel

After your metamodel has been fully constituted from DSL model source, you will have to “walk the metamodel” to determine what target artifacts (source code, configuration files, etc.) must be generated. This involves starting at a main aggregate root or iterating across many main aggregate root objects, and then navigating down through them looking for significant metadata and exercising needed meta domain behavior.

The primary problem with this approach is that when you reach some point in the metamodel you may not (and likely will not) have all the context you need to properly generate a given target artifact. The necessary metadata tends to be spread widely throughout the metamodel.

If you have a separate specialty code generator for each target artifact type, you may find that you have to walk the model multiple times to find all metadata you need to generate each target artifact. You will end up developing multiple complex navigators for each artifact type. Or you will design your metamodel as a more complex graph so you can associate dependent metadata for all necessary context. Using either of these approaches to solve navigation needs can be difficult to achieve.

Eventing Metamodel With Region-Aware Artifacts

A very practical solution to the complexities of walking the metamodel multiple times is to design an eventing metamodel. This can be done with a simple Publish-Subscribe pattern. Design a single metamodel walker that fires events when it encounters significant parts of the metamodel. Subscribers are delivered the events as they occur, and respond to the events by generating code just in time. This pattern has proved very valuable in my DSL code generation efforts.
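A minimal sketch of such an eventing walker follows. The event names (classStarted, propertyFound, classEnded), the two-class metamodel, and the C# generator subscriber are all assumptions made for illustration. A real metamodel would publish many more kinds of events.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class EventingDemo {
    public static void main(String[] args) {
        MetaClass customer = new MetaClass("Customer");
        customer.properties.add(new MetaProperty("string", "Address"));

        MetamodelWalker walker = new MetamodelWalker();
        CSharpClassGenerator generator = new CSharpClassGenerator();
        walker.subscribe(generator);
        walker.walk(Collections.singletonList(customer));

        System.out.print(generator.output);
    }
}

// Hypothetical metamodel types, assumed for illustration only.
class MetaProperty {
    final String type;
    final String name;
    MetaProperty(String type, String name) { this.type = type; this.name = name; }
}

class MetaClass {
    final String name;
    final List<MetaProperty> properties = new ArrayList<>();
    MetaClass(String name) { this.name = name; }
}

// Subscribers implement this interface to receive walker events.
interface MetamodelListener {
    void classStarted(MetaClass metaClass);
    void propertyFound(MetaClass owner, MetaProperty property);
    void classEnded(MetaClass metaClass);
}

// The single walker: it navigates the metamodel once and publishes events.
class MetamodelWalker {
    private final List<MetamodelListener> listeners = new ArrayList<>();

    void subscribe(MetamodelListener listener) { listeners.add(listener); }

    void walk(List<MetaClass> classes) {
        for (MetaClass metaClass : classes) {
            for (MetamodelListener l : listeners) l.classStarted(metaClass);
            for (MetaProperty property : metaClass.properties) {
                for (MetamodelListener l : listeners) l.propertyFound(metaClass, property);
            }
            for (MetamodelListener l : listeners) l.classEnded(metaClass);
        }
    }
}

// One subscriber per target artifact type; this one emits a C# class.
class CSharpClassGenerator implements MetamodelListener {
    final StringBuilder output = new StringBuilder();

    public void classStarted(MetaClass metaClass) {
        output.append("public class ").append(metaClass.name).append("\n{\n");
    }

    public void propertyFound(MetaClass owner, MetaProperty property) {
        output.append("    public ").append(property.type).append(" ")
              .append(property.name).append(" { get; set; }\n");
    }

    public void classEnded(MetaClass metaClass) {
        output.append("}\n");
    }
}
```

Adding a second target artifact type then means adding another subscriber, not another navigator.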

The eventing metamodel may be incomplete without support for artifacts with region awareness and workspaces. Source code artifacts of all sorts have regions.

As events of interest are delivered to each code generator listener, inject newly generated source code into the appropriate regions of the artifact. The regions can be managed by name and/or index. Nested region artifact management is even more powerful, because you can keep tacking on generative output as needed and where appropriate.

But what if your specific generator receives an event, and once again it doesn't have enough context with the current metamodel metadata? A knee-jerk reaction is to immediately walk the model to find what you need right now. But with a bit of patience and an artifact workspace you can wait for the context you need and generate code more elegantly. When an event delivers incomplete metadata context, just store the partial context in a well-defined (unique) workspace area of the artifact. When subsequent events occur, build on the saved workspace metadata. Finally, when the necessary contexts have all been delivered and the source segment is completely built, grab the source segment out of its unique workspace area, remove the workspace area, and save the completed source segment into the appropriate artifact region.

Once the target artifact is completed it can be saved to the appropriate output file.

You'll need to design an artifact persister that knows how to collapse nested artifact regions into correct artifact files.
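One possible shape for named, nested regions, workspaces, and a collapsing step is sketched below. The Region and Artifact classes, the region names, and the workspace key format are all hypothetical, chosen only to illustrate the idea. Here, collapsing a region emits its own text followed by its nested regions in the order they were first declared.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class RegionDemo {
    public static void main(String[] args) {
        Artifact artifact = new Artifact();
        // Declaring the regions up front fixes their order in the output.
        Region header = artifact.root.region("header");
        Region fields = artifact.root.region("fields");
        Region footer = artifact.root.region("footer");

        // Events may arrive in any order; each writes to its own region.
        footer.append("}\n");
        header.append("public class Customer\n{\n");

        // Partial context is parked in a workspace until it is complete.
        artifact.workspace("Customer.Address").append("  private string ");
        artifact.workspace("Customer.Address").append("_address;\n");
        artifact.commit("Customer.Address", fields);

        System.out.print(artifact.collapse());
    }
}

// A named region that holds text and any number of nested regions.
class Region {
    private final StringBuilder text = new StringBuilder();
    private final Map<String, Region> children = new LinkedHashMap<>();

    // Returns the nested region with this name, creating it on first use.
    Region region(String name) {
        return children.computeIfAbsent(name, n -> new Region());
    }

    void append(String generated) { text.append(generated); }

    // Own text first, then nested regions in declaration order.
    String collapse() {
        StringBuilder out = new StringBuilder(text);
        for (Region child : children.values()) {
            out.append(child.collapse());
        }
        return out.toString();
    }
}

class Artifact {
    final Region root = new Region();
    private final Map<String, StringBuilder> workspaces = new HashMap<>();

    // Returns the workspace for this key, creating it on first use.
    StringBuilder workspace(String key) {
        return workspaces.computeIfAbsent(key, k -> new StringBuilder());
    }

    boolean hasWorkspace(String key) { return workspaces.containsKey(key); }

    // Moves a finished segment into its region and removes the workspace.
    void commit(String key, Region target) {
        StringBuilder segment = workspaces.remove(key);
        if (segment == null) {
            throw new IllegalStateException("No workspace named: " + key);
        }
        target.append(segment.toString());
    }

    // What a persister would write to the output file.
    String collapse() { return root.collapse(); }
}
```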

Code Templates

Unless the source code you are generating is very simple, you will benefit from the use of code templates and a template engine. Consider the alternative to the use of code templates. As you reach the point of a particular source code generation event, you need to create the following C# property:

private string _address;
public string Address
{
  get { return this._address; }
  set { this._address = value; }
}

You could take a common approach and create a C# source snippet text string as follows:

string propertyDef =
  "private " + propertyType + " " + hiddenPropertyName + ";\n" +
  "public " + propertyType + " " + propertyName + "\n" +
  "{" + "\n" +
  indent + "get { return this." + hiddenPropertyName + "; }\n" +
  indent + "set { this." + hiddenPropertyName + " = value; }\n" +
  "}" + "\n";

Honestly, as I wrote this example I was confused a few different times as to where I was in the generative stream. I know how to create a C# property, but stitching these generative fragments together makes for confusion and several round trips before the generated code is correct. In fact I am not now certain that the above fragment is correct. Nonetheless, this example is one of the simplest pieces of code you will need to generate for a C# class.

Next consider the use of a code template and template engine. First of all, here is a template for creating a C# property:

property(propertyType, propertyName, hiddenPropertyName) ::= <<
private $propertyType$ $hiddenPropertyName$;
public $propertyType$ $propertyName$
{
  get { return this.$hiddenPropertyName$; }
  set { this.$hiddenPropertyName$ = value; }
}
>>

The template is much clearer. First note that the template definition has a name, property, and takes a set of parameters: propertyType, propertyName, and hiddenPropertyName. So it looks familiar, like a function or method. The body of the template (everything between the << and >> tokens) looks a lot like the C# source you would write if you were creating the property manually, with a few minor differences. The template body is parameterized. By marking parameterized values with surrounding $ characters we allow the template engine to find $propertyType$ and the other placeholders and replace them with the passed-in values.

To use the template above you would do the following:

StringTemplate template = getTemplate("property");
template.setAttribute("propertyType", "string");
template.setAttribute("propertyName", "Address");
template.setAttribute("hiddenPropertyName", "_address");
String code = template.toString();

In a real code generator the parameter values set as template attributes would not be hard-coded. They would come from the metamodel, so you could reuse the above code to generate any number of C# properties as specified by a given DSL source model.
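To show what the search-and-replace step amounts to, and how the attribute values can be derived, not typed in, here is a deliberately tiny stand-in for a template engine. It is not StringTemplate and supports none of its richer features. The TinyTemplate class and its naming rule for the hidden field (an underscore plus the property name with a lowercased first letter) are assumptions made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TinyTemplate {
    public static void main(String[] args) {
        System.out.print(property("string", "Address"));
        System.out.print(property("int", "Age"));
    }

    // Matches $name$ placeholders.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$(\\w+)\\$");

    static final String PROPERTY_TEMPLATE =
        "private $propertyType$ $hiddenPropertyName$;\n" +
        "public $propertyType$ $propertyName$\n" +
        "{\n" +
        "  get { return this.$hiddenPropertyName$; }\n" +
        "  set { this.$hiddenPropertyName$ = value; }\n" +
        "}\n";

    // Replaces every $name$ placeholder with the matching attribute value.
    static String render(String template, Map<String, String> attributes) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = attributes.get(matcher.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No value for attribute: " + matcher.group(1));
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    // Derives the attribute values from a property's type and name.
    static String property(String type, String name) {
        Map<String, String> attributes = new HashMap<>();
        attributes.put("propertyType", type);
        attributes.put("propertyName", name);
        // Assumed naming rule: "Address" becomes "_address".
        attributes.put("hiddenPropertyName",
            "_" + Character.toLowerCase(name.charAt(0)) + name.substring(1));
        return render(PROPERTY_TEMPLATE, attributes);
    }
}
```

In a generator, a call like property(type, name) would be driven by each property found in the metamodel.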

I suggest that the template engine you use should support not only parameterized values, but also conditionals, repeating expressions (collections), and auto-indentation. The template example above is the syntax and API of ANTLR's StringTemplate subproject, which supports a rich set of template features.

Clearly you should use templates and a template engine if you do even simplistic code generation.

Conclusion

I have provided a high-level overview of what DSLs are in general and, a bit more specifically, what internal and external DSLs are. I have also covered the main challenges and patterns involved in developing a complex external DSL. This provides a brief but firm foundation for doing meaningful DSL development. Obtaining the proper tools to define and generate parsers and metamodels will help you make rapid progress, but there is no tool that will replace the thought and design that goes into your language's formal grammar, metamodel, and code generation.

If you have not yet ventured into the development of a complex external DSL I hope this information has brought you a few steps closer to doing so. I welcome your feedback and would be glad to discuss with you details on this topic.

About the Author

Vaughn Vernon is an independent consultant with more than 26 years of experience as a software developer, architect, and designer. He has conceptualized and developed several software development tools, including DomainMETHOD, a DSL supporting rapid design and generation of domain models based on the patterns of domain-driven design. Vaughn has authored numerous articles and patterns, and has spoken at a variety of technical conferences. You can reference more of his work at www.shiftmethod.com and contact him at vvernon at shiftmethod dot com.
