Google SoC Series: ANTLR v3 Ruby Parser

XRuby is an effort to create a Ruby to Java bytecode compiler. A compiler, of course, needs a way to parse the input language, and so the XRuby team created their own Ruby parser using the popular ANTLR parser generator. Parser generators take the grammar of a language and generate code to parse it. Using ANTLR meant that the grammar and parser had to be created from scratch, which is different from JRuby's approach. Ruby's parser uses another parser generator called YACC, and JRuby chose to reuse this grammar and use a Java port of Yacc to generate it's parser.

When asked about whether Ruby is a difficult language to parse, Wang Haofei, who works on this Google SoC project and the XRuby team, explains:

Yes it is. There are lots of ambiguities in the language. For example, "<<" can be either left shift operator or start of heredoc. To distinguish these two the lexer has to maintain state (context dependent): http://seclib.blogspot.com/2005/11/distinguish-leftshift-and-heredoc.html

Others likes ID/function ambiguity, expression substitution inside string, heredoc etc are all very difficult to deal with.

When it comes to a difficult task like this, it's always good to ask just how far along it really is. Wang Huofei:

Since the first public release it can parse the entire ruby standard libraries and and Ruby on Rails (have not try the latest one recently): http://seclib.blogspot.com/2006/02/first-release-of-rubyfront.html

Xue has fixed a few bugs since then, but generally it is very stable. We will write and run more tests during the SOC project and it may help us to uncover some unknown issues.

Xue Yong Zhi is the mentor for this Google SoC project and also works on XRuby.

One big part of the Google SoC project seems to be porting the existing parser to ANTLR's new version 3. Wang Huofei:

1. ANTLR v3 is a rewrite of v2 and has significantly enhanced parsing strength via LL(*) parsing, v2 is much weaker (limited LL(k)) and it forced me to add some hacks to workaround some problems. While ANTLR based parser is easier to maintain than others, migration to v3 will help us to make it better and cleaner.
2. There will be a ruby backend for ANTLR v3, so that we may be able to have ruby parser in ruby.
3. ANTLR v3 's performance is much better.

The second item in this list brings up an a very interesting point. Ruby lacks a Ruby parser written in Ruby. This is an issue for writing tools that handle Ruby code. Code analyzers, refactoring tools and automated refactorings, formatters and more, are difficult, if not impossible, to write in Ruby, because there is no way to parse Ruby source with Ruby code. There are workarounds, like Ryan Davis' ParseTree which uses the parser of Ruby's interpreter (via a native extension) to get at the Abstract Syntax Tree (AST) for Ruby source. An AST is a tree representation of source code, and is necessary for tools that need to know about the structure of the code. Yet, even ParseTree is not a complete solution, since it's current versions don't give the source locations of individual nodes. Obviously, a refactoring algorithm that, for instance, wants to rename an identifier in a Ruby source file needs to know where the identifier actually is.

This issue has become ever more obvious in the past year, due to the arrival of various Ruby IDEs. These ship with code analyzers (to warn about potential errors in the code) and the Eclipse based RDT is the first IDE to feature extensive refactoring support for Ruby. Other features are support for popular Ruby based files, such as Rake files, Ruby's equivalent of make or Ant. The problem: these tools are all built with Java (or other languages) and Java IDEs all use JRuby's parser.

This means that the functionality of these tools is locked in those languages, and even worse, often tied to particular IDEs. For instance, the logic for RDT's refactoring support is not available for, say, Ruby in Steel, an IDE built on Visual Studio. This is different than, for instance, in the Java space, where parsers are available. Tools such as PMD or Findbugs are naturally written in Java and thus available wherever Java runs and, more importantly, can be extended with Java code.

Because the Google SoC description of this project isn't 100% clear on the Ruby based parser, Wang Huofei clarifies the project plan:

It depends on how well we are doing. We are willing to do that even if it may not fit in SOC schedule.

Good news.

One necessity for making code tools is the AST, to analyze source. ParseTree, as already mentioned, offers a format to represent Ruby source code. Tools, based on ParseTree, already exist, such as Ruby2Ruby which can turn an AST back into Ruby source code; useful if a tool wants to modify an AST and then output it as Ruby source. Rubinius, a project to implement a Ruby VM in Ruby, also uses ParseTree output to compile Ruby to the Rubinius bytecode, which is then interpreted. When asked about the output of the parser, Wang Haofei explains:

ANTLR has its own builtin AST support, and it is very handy if you want to serialize to a string or change to other struture. It looks similar to ParseTree's output. In XRuby we turned the AST into a DOM-like structure and use visitor pattern to generate java bytecode.

While ParseTree output doesn't seem to be planned, it's entirely possible to translate the ANTLR generated AST into the ParseTree format. A similar approach is already used by JParseTree, a port of ParseTree that works on JRuby, now a part of JRuby Extras which provides JRuby ports of popular Ruby libraries.

One way to watch the progress of XRuby and it's parser project is the XRuby team blog.

InfoQ Software Architects' Newsletter

Write for InfoQ

Rate this Article

This content is in the Dynamic Languages topic

Related Topics:

Related Editorial

Related Sponsors

Popular across InfoQ

The InfoQ Newsletter