State of Unicode and Ruby Compatibility for JRuby 1.0

by Werner Schuster on Apr 10, 2007

The true nature of Strings in JRuby has been a difficult topic in the past. Ruby strings are byte arrays, whereas Java has full Unicode support for Strings and represents them internally as UTF-16. Problems soon appeared as subtle differences between code running on Ruby and on JRuby, as Charles O. Nutter explains:

But the APIs did not conform to what Ruby applications expected, frequently returning 16 bit values for individual characters and reporting incorrect byte lengths for strings that couldn't encode into all 8-bit characters. It was broken, as far as Ruby code was concerned.

He goes on to describe the solution that is likely to be used in JRuby 1.0:

  • Ruby strings are byte[] and conform to Ruby string semantics
  • Java strings passing into Ruby code will be encoded as UTF-8, with the implication that you should expect to be working with UTF-8 byte[] in the receiving code
  • Ruby strings passing out of Ruby into Java libraries will be assumed to be UTF-8, and the resulting string on the Java side of the call will reflect that assumption
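
The boundary rules above can be illustrated with plain Java. This is only a minimal sketch of the idea, not JRuby's actual implementation: the class name `Utf8Boundary` and its two helper methods are hypothetical, and the only real APIs used are `String.getBytes` and the `String(byte[], Charset)` constructor from the JDK. It shows why lengths differ between the two views of the same text.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of JRuby.
class Utf8Boundary {
    // Java -> Ruby direction: the Java String is encoded as UTF-8 bytes.
    static byte[] toRubyBytes(String javaString) {
        return javaString.getBytes(StandardCharsets.UTF_8);
    }

    // Ruby -> Java direction: the bytes are assumed to be UTF-8.
    static String toJavaString(byte[] rubyBytes) {
        return new String(rubyBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String cafe = "caf\u00e9"; // "cafe" with an acute accent on the e

        // Java counts UTF-16 code units, Ruby-style byte[] counts bytes.
        System.out.println(cafe.length());             // 4
        System.out.println(toRubyBytes(cafe).length);  // 5

        // A round trip through UTF-8 preserves the text.
        System.out.println(toJavaString(toRubyBytes(cafe)).equals(cafe)); // true
    }
}
```

The accented character is one UTF-16 code unit but two UTF-8 bytes, which is the kind of length mismatch the quote above refers to.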

Fixing the string encoding issue is just one of the many small, unglamorous steps needed to make JRuby as compatible with Ruby as possible.

A related issue is support for Ruby regular expressions in JRuby. The easy solution, used for a long time, was to hand Ruby regular expressions to java.util.regex, the regular expression library shipped with Java. However, bug reports about subtle differences in behavior kept coming in, and other concerns also made it clear that a better solution was required. java.util.regex had known performance problems, and performance could only get worse with the decision to represent Ruby strings internally as byte arrays: java.util.regex doesn't work with byte arrays, so Ruby strings would have to be converted before it could work with them.
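
To make the conversion cost concrete, here is a small sketch of what matching a byte-array string with java.util.regex involves. The class `ByteRegexSketch` and its methods are hypothetical and are not taken from JRuby; the sketch assumes the bytes are UTF-8. Because `Pattern.matcher` only accepts a `CharSequence`, the bytes must first be decoded, and the match position comes back as a UTF-16 index that has to be translated into a byte offset again.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not JRuby code.
class ByteRegexSketch {
    // Returns the UTF-16 index of the first match, or -1 if there is none.
    static int charOffsetOfFirstMatch(byte[] utf8Bytes, String regex) {
        // Extra step: decode the whole byte[] before the regex engine can see it.
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        Matcher m = Pattern.compile(regex).matcher(decoded);
        return m.find() ? m.start() : -1;
    }

    // Returns the byte offset of the first match, or -1 if there is none.
    static int byteOffsetOfFirstMatch(byte[] utf8Bytes, String regex) {
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        Matcher m = Pattern.compile(regex).matcher(decoded);
        if (!m.find()) {
            return -1;
        }
        // Extra step: re-encode the prefix to turn the char index into a byte offset.
        return decoded.substring(0, m.start())
                      .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "hello world" with accented e and o, stored as UTF-8 bytes.
        byte[] text = "h\u00e9llo w\u00f6rld".getBytes(StandardCharsets.UTF_8);
        System.out.println(charOffsetOfFirstMatch(text, "w")); // 6
        System.out.println(byteOffsetOfFirstMatch(text, "w")); // 7
    }
}
```

Both the decode and the offset translation touch the whole string or its prefix on every match, which is overhead an engine operating directly on byte arrays would not need.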

So Ola Bini, a JRuby core team member, decided to bite the bullet, detail a solution, and start working on it. After an interim solution using JRegex, he is now working on REJ. Ola Bini describes it:

REJ is a project I've started, which will be a direct port of the MRI 1.8.6 regular expression engine. The important thing about this is that the semantics for JRuby will match MRI very closely. We will be able to match UTF-8, SJIS and EUC regular expressions, and we are able to have the same quirks as MRI, even though people shouldn't depend on such quirks.

By May 2007, these and more changes will ensure that JRuby 1.0 comes as close to Ruby as possible.
