State of Unicode and Ruby Compatibility for JRuby 1.0
The true nature of Strings in JRuby has been a difficult topic in the past. Ruby represents strings as byte arrays, whereas Java has full Unicode support for Strings and represents them internally as UTF-16. Problems soon appeared as subtle differences between code running on Ruby and on JRuby, as Charles O. Nutter explains:
But the APIs did not conform to what Ruby applications expected, frequently returning 16 bit values for individual characters and reporting incorrect byte lengths for strings that couldn't encode into all 8-bit characters. It was broken, as far as Ruby code was concerned.
He goes on to describe the solution that is likely to be used in JRuby 1.0:
- Ruby strings are byte[] and conform to Ruby string semantics
- Java strings passing into Ruby code will be encoded as UTF-8, with the implication that you should expect to be working with UTF-8 byte[] in the receiving code
- Ruby strings passing out of Ruby into Java libraries will be assumed to be UTF-8, and the resulting string on the Java side of the call will reflect that assumption
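The boundary rules above can be sketched in plain Java. This is only an illustration of the encoding assumption, not JRuby's actual conversion code: the class and method names (`Utf8Boundary`, `toRubyBytes`, `toJavaString`) are hypothetical, and only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Boundary {
    // Java -> Ruby direction: the UTF-16 String is encoded as UTF-8 bytes.
    // (Hypothetical helper, not a JRuby API.)
    static byte[] toRubyBytes(String javaString) {
        return javaString.getBytes(StandardCharsets.UTF_8);
    }

    // Ruby -> Java direction: the bytes are assumed to be UTF-8.
    // (Hypothetical helper, not a JRuby API.)
    static String toJavaString(byte[] rubyBytes) {
        return new String(rubyBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo"; // "héllo"
        byte[] bytes = toRubyBytes(s);

        System.out.println(s.length());   // 5 UTF-16 code units
        System.out.println(bytes.length); // 6 bytes: é needs two bytes in UTF-8
        System.out.println(toJavaString(bytes).equals(s)); // true
    }
}
```

The differing lengths show why Ruby code that counts bytes saw unexpected results when strings were backed by Java's UTF-16 representation.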
Fixing the string encoding issue is just one of the many small, unglamorous steps needed to make JRuby as compatible with Ruby as possible.
A related issue is support for Ruby regular expressions on JRuby. The easy solution, used for a long time, was simply to use java.util.regex, the regular expression library shipped with Java, to handle Ruby regular expressions. However, bug reports about subtle differences in behavior kept coming in, and other concerns also made it clear that a better solution was required. The performance problems of java.util.regex were known, and performance could only get worse with the decision to use byte arrays to represent Ruby strings internally (java.util.regex doesn't work with byte arrays and would require the Ruby strings to be converted before it could work with them).
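A minimal sketch of that conversion cost, assuming the Ruby string arrives as UTF-8 bytes: the class and method names here (`RegexOverBytes`, `charIndexOf`, `byteOffsetOf`) are made up for illustration and are not taken from JRuby. Because `Pattern.matcher` accepts a `CharSequence`, the bytes have to be decoded on every match, and the resulting match positions are character indexes rather than the byte offsets that Ruby code expects.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexOverBytes {
    // Returns the char index of the first match, or -1 if there is none.
    static int charIndexOf(byte[] rubyBytes, String regex) {
        // java.util.regex cannot scan a byte[] directly, so decode first.
        String decoded = new String(rubyBytes, StandardCharsets.UTF_8);
        Matcher m = Pattern.compile(regex).matcher(decoded);
        return m.find() ? m.start() : -1;
    }

    // Translates the char index back into a byte offset by re-encoding
    // the prefix, which is yet more work per match.
    static int byteOffsetOf(byte[] rubyBytes, String regex) {
        int charIndex = charIndexOf(rubyBytes, regex);
        if (charIndex < 0) {
            return -1;
        }
        String decoded = new String(rubyBytes, StandardCharsets.UTF_8);
        return decoded.substring(0, charIndex)
                      .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(charIndexOf(bytes, "llo"));  // 2
        System.out.println(byteOffsetOf(bytes, "llo")); // 3
    }
}
```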
So Ola Bini, a JRuby core team member, decided to bite the bullet, detail a solution, and start work on it. After an interim solution using JRegex, he is now working on REJ, which he describes as follows:
REJ is a project I've started, which will be a direct port of the MRI 1.8.6 regular expression engine. The important thing about this is that the semantics for JRuby will match MRI very closely. We will be able to match UTF-8, SJIS and EUC regular expressions, and we are able to have the same quirks as MRI, even though people shouldn't depend on such quirks.
By May 2007, these and more changes will ensure that JRuby 1.0 comes as close to Ruby as possible.