ruby_parser 1.0: a Ruby Parser written in Ruby

Ryan Davis announced ruby_parser, a parser for Ruby source code written in Ruby. The parser was written using Ruby yACC (RACC), a parser generator that is bundled with the Ruby standard library:

ruby_parser (RP) is a ruby parser written in pure ruby (utilizing racc--which does by default use a C extension). RP's output is the same as ParseTree's output: s-expressions using ruby's arrays and base types.

The library is easy enough to use:

RubyParser.new.parse "1+1"

which returns

s(:call, s(:lit, 1), :+, s(:array, s(:lit, 1)))
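
To give a feel for how a tool might consume this kind of output, here is a minimal sketch of a tiny evaluator for the tree above. It uses plain nested Ruby arrays as a stand-in for the parser's nodes, so it runs without ruby_parser installed. The `s` helper and the `evaluate` method are hypothetical names written for this illustration and are not part of ruby_parser's API.

```ruby
# Stand-in for the s(...) notation shown above: builds a plain nested Array.
# Assumption: the real output can be treated like nested arrays.
def s(*parts)
  parts
end

# Evaluate only the node types that appear in the "1+1" example.
def evaluate(node)
  type, *rest = node
  case type
  when :lit
    # A literal node carries its value directly.
    rest.first
  when :array
    # An argument list: evaluate each child node.
    rest.map { |child| evaluate(child) }
  when :call
    # receiver, method name, argument list
    receiver, method_name, args = rest
    evaluate(receiver).public_send(method_name, *evaluate(args))
  else
    raise ArgumentError, "unsupported node type: #{type.inspect}"
  end
end

tree = s(:call, s(:lit, 1), :+, s(:array, s(:lit, 1)))
puts evaluate(tree)
```

Because the tree is built from ordinary arrays and symbols, walking it takes nothing more than destructuring and a `case` statement.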

A Ruby parser written in pure Ruby has long been missing from the Ruby world. To clarify the term "pure Ruby" in this context: it means that the parser's code

consists solely of Ruby source files
does not add any native extensions or other C code (e.g. with RubyInline) that would require a C compiler to be present on the user's system

These properties are crucial to ensure that the code runs across all Ruby runtimes. An implementation that required a C-based native extension would be unusable on Ruby implementations that don't support them, e.g. JRuby, XRuby, or the .NET-based runtimes IronRuby and Ruby.NET. Even if they supported native extensions (something under consideration for JRuby), native extensions cause deployment problems, because a shared library or DLL of the extension would have to be shipped for every conceivable OS/CPU combination (otherwise some users would not be able to use it). RubyInline, also a project by Ryan Davis, helps with this a bit by automatically compiling the inline C code, but this still requires a C compiler to be present on the target system, something that's not guaranteed, particularly on Windows systems.

For much of Ruby's history, the lack of a pure Ruby parser mattered little, since the Abstract Syntax Tree (AST) of some Ruby code could be obtained with utilities such as ParseTree. However, since the explosion of alternative Ruby runtimes, Ruby parsers have been reimplemented several times: twice in Java (JRuby and XRuby) and once in C# (Ruby.NET wrote the parser also used by IronRuby). All of these provide different ASTs and different ways of getting at them.

This has caused some issues for Ruby source tools. E.g. the Ruby Refactoring tools, now part of the Eclipse-based Aptana/RDT, are tied to both Java and the JRuby AST and are not usable from other Ruby implementations. With similar tools now being (re)written for other Java-based Ruby IDEs, this means that a large number of code quality and code manipulation tools are now locked into Java and JRuby. Not just that: the logic of these tools is written in Java and not Ruby, which makes them less approachable for Ruby developers.

The pure Ruby parser now offers a chance to change that: a Ruby IDE or other tool can now get at a Ruby AST without getting locked in. E.g. a Java-based IDE can keep a JRuby instance around and run ruby_parser in it. For that to work, the current version still needs to add correct source locations to the generated output, i.e. every AST node needs to know the offsets where its source code representation starts and ends. This is crucial for source tools: pure structural information is useful, but if the tool doesn't know where a node is, it can't modify it in the file.
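
The point about source locations can be illustrated with a minimal sketch. It assumes a hypothetical node type that carries start and end offsets. ruby_parser 1.0 does not provide these, and `LocatedNode` and `replace_node` are names invented for this example.

```ruby
# Hypothetical node carrying source offsets (not something ruby_parser 1.0 emits).
# start_offset is inclusive, end_offset is exclusive.
LocatedNode = Struct.new(:type, :value, :start_offset, :end_offset)

# Replace only the text a node covers, leaving the rest of the source untouched.
def replace_node(source, node, new_text)
  source[0...node.start_offset] + new_text + source[node.end_offset..-1]
end

source = "total = 1 + 1"
# The second literal `1` occupies offsets 12...13 in the source string.
second_literal = LocatedNode.new(:lit, 1, 12, 13)
puts replace_node(source, second_literal, "2")
```

Without the offsets, a tool would know that the expression contains two literals but would have no reliable way to tell which characters in the file to rewrite.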

Another client for ruby_parser is Rubinius, a Ruby VM written in (mostly) Ruby, which takes its Ruby parser from MRI. Using ruby_parser will allow it to remove yet another piece of C code. To answer the chicken-and-egg question "How can a Ruby VM work if its parser is Ruby code?": during the build process of the Rubinius VM, the Ruby source code of ruby_parser can be compiled into Rubinius bytecode. When Rubinius starts, it loads the ruby_parser bytecode file, something that doesn't need the parser, and now has a running Ruby parser.

There's still a lot of work left for ruby_parser, as can be seen by a few of the issues mentioned in the release notes:

Known Issue: Speed sucks currently. 5500 tests currently run in 21 min.

Known Issue: Code is waaay ugly. Port of a port. Not my fault. Will fix RSN.

Known Issue: I don't currently support newline nodes.

Known Issue: Totally awesome.

Known Issue: dasgn_curr decls can be out of order from ParseTree's.

TODO: Add comment nodes.
