State of Unicode and Ruby Compatibility for JRuby 1.0

by Werner Schuster on Apr 10, 2007

The true nature of Strings in JRuby has been a difficult topic in the past. Ruby uses byte arrays, while Java has full Unicode support for Strings and represents them internally as UTF-16. The mismatch soon surfaced as subtle differences between code running on Ruby and on JRuby, as Charles O. Nutter explains:

But the APIs did not conform to what Ruby applications expected, frequently returning 16 bit values for individual characters and reporting incorrect byte lengths for strings that couldn't encode into all 8-bit characters. It was broken, as far as Ruby code was concerned.
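The mismatch Nutter describes can be made concrete with a short sketch. It assumes a modern Ruby (1.9 or later), where the byte view that Ruby 1.8's String#length reported directly is available through String#bytes; the helper names are hypothetical and only serve the illustration.

```ruby
# Hypothetical helpers, for illustration only.

# Byte view: what Ruby 1.8 code expects (String#length counted bytes there).
def utf8_bytes(str)
  str.encode("UTF-8").bytes
end

# 16-bit view: the UTF-16 code units a Java String holds internally.
def utf16_units(str)
  str.encode("UTF-16BE").unpack("n*")
end

word = "caf\u00e9"      # "café"
p utf8_bytes(word)      # [99, 97, 102, 195, 169]: five bytes, the é takes two
p utf16_units(word)     # [99, 97, 102, 233]: four units, the é is one 16-bit value
```

Ruby code written against the byte view expects a length of 5 and no element above 255, while a string backed by UTF-16 yields a length of 4 and the value 233 for the last character.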

He goes on to describe the solution that is likely to be used in JRuby 1.0:

  • Ruby strings are byte[] and conform to Ruby string semantics
  • Java strings passing into Ruby code will be encoded as UTF-8, with the implication that you should expect to be working with UTF-8 byte[] in the receiving code
  • Ruby strings passing out of Ruby into Java libraries will be assumed to be UTF-8, and the resulting string on the Java side of the call will reflect that assumption
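The boundary rules in the list above can be sketched roughly as follows. This is not JRuby's actual conversion code: `into_ruby` and `out_of_ruby` are hypothetical helpers written in modern Ruby, and a Java string is stood in for by its UTF-16 big-endian bytes.

```ruby
# Hypothetical helpers sketching the boundary rules; not JRuby's actual code.

# Java -> Ruby: UTF-16 code units are re-encoded as UTF-8 and handed over
# as plain bytes.
def into_ruby(utf16be_bytes)
  utf16be_bytes.dup.force_encoding("UTF-16BE").encode("UTF-8").b
end

# Ruby -> Java: the bytes are assumed to be UTF-8, whether or not they are.
def out_of_ruby(ruby_bytes)
  ruby_bytes.dup.force_encoding("UTF-8")
end

from_java = "\x00\xE9".b                     # U+00E9 as one UTF-16 code unit
ruby_side = into_ruby(from_java)
p ruby_side.bytes                            # [195, 169]
p out_of_ruby(ruby_side).valid_encoding?     # true
p out_of_ruby("\xE9".b).valid_encoding?      # false: a Latin-1 byte, not UTF-8
```

The last line shows the practical consequence of the third rule: Ruby code that holds bytes in some other encoding has to convert them to UTF-8 itself before passing them to a Java library.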

Fixing the string encoding issue is just one of the many small, unglamorous steps needed to reach the goal of being as compatible with Ruby as possible.

A related issue is support for Ruby regular expressions on JRuby. The easy solution, used for a long time, was to hand Ruby regular expressions to java.util.regex, the regular expression library shipped with Java. However, bug reports about subtle differences in behavior kept coming in, and other concerns also made it clear that a better solution was required. java.util.regex had known performance problems, and performance could only get worse with the decision to represent Ruby strings internally as byte arrays: java.util.regex doesn't work with byte arrays, so Ruby strings would have to be converted before it could work with them.
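The conversion is more than a copy, because an engine that works on decoded characters reports match positions in characters, while byte-array strings count positions in bytes. A minimal sketch in modern Ruby, where the same text can be viewed both ways (the helper names are hypothetical):

```ruby
# Hypothetical helpers: offset of the first match, in two different units.

# Character view, comparable to matching on a decoded string.
def char_offset(str, pattern)
  m = str.match(pattern)
  m && m.begin(0)
end

# Byte view, comparable to matching directly on the byte array.
def byte_offset(str, pattern)
  m = str.b.match(pattern)
  m && m.begin(0)
end

text = "\u00e9t\u00e9 2007"     # "été 2007"
p char_offset(text, /\d+/)      # 4
p byte_offset(text, /\d+/)      # 6
```

Each é occupies two bytes in UTF-8, so the two offsets drift apart as soon as the text contains non-ASCII characters, and results from a character-based engine would have to be translated back.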

So Ola Bini, a JRuby core team member, decided to bite the bullet, detail a solution and start working on it. After an interim solution using JRegex, he is now working on REJ. Here is Ola Bini's description:

REJ is a project I've started, which will be a direct port of the MRI 1.8.6 regular expression engine. The important thing about this is that the semantics for JRuby will match MRI very closely. We will be able to match UTF-8, SJIS and EUC regular expressions, and we are able to have the same quirks as MRI, even though people shouldn't depend on such quirks.
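Why matching has to know about encodings such as SJIS can be shown with a well-known case. The sketch below does not use REJ; it assumes a modern Ruby, whose regular expression engine is encoding-aware, and the helper names are hypothetical.

```ruby
# Hypothetical helpers. "\u8868" is the kanji 表, encoded in Shift_JIS as the
# two bytes 0x95 0x5C, and 0x5C on its own is the ASCII backslash.

def sjis(str)
  str.encode("Shift_JIS")
end

# Byte-naive search: treats every byte as a character.
def naive_backslash?(str)
  !(str.b =~ /\\/).nil?
end

# Encoding-aware search: steps through the string one character at a time.
def aware_backslash?(str)
  !(str =~ /\\/).nil?
end

kanji = sjis("\u8868")
p kanji.bytes                # [149, 92]
p naive_backslash?(kanji)    # true, a false positive on the trailing byte
p aware_backslash?(kanji)    # false
```

An engine that ignores the encoding finds a backslash inside the kanji, which is the kind of behavior difference a port of the MRI engine is meant to avoid.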

By May 2007, these and more changes will ensure that JRuby 1.0 comes as close to Ruby as possible.

InfoQ.com and all content copyright © 2006-2014 C4Media Inc.