InfoQ

State of Unicode and Ruby Compatibility for JRuby 1.0

Posted by Werner Schuster on Apr 10, 2007 12:00 PM

Community: Java, Ruby
Topics: JRuby, Internationalization
Tags: Releases

The true nature of Strings in JRuby has been a difficult topic in the past. Ruby represents strings as byte arrays, whereas Java has full Unicode support for Strings and represents them internally as UTF-16. Problems soon appeared as subtle differences in the behavior of code running on Ruby and on JRuby, as Charles O. Nutter explains:

But the APIs did not conform to what Ruby applications expected, frequently returning 16 bit values for individual characters and reporting incorrect byte lengths for strings that couldn't encode into all 8-bit characters. It was broken, as far as Ruby code was concerned.

He goes on to describe the solution that is likely to be used in JRuby 1.0:

  • Ruby strings are byte[] and conform to Ruby string semantics
  • Java strings passing into Ruby code will be encoded as UTF-8, with the implication that you should expect to be working with UTF-8 byte[] in the receiving code
  • Ruby strings passing out of Ruby into Java libraries will be assumed to be UTF-8, and the resulting string on the Java side of the call will reflect that assumption
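The two boundary rules above can be illustrated in plain Java. This is only a minimal sketch of the encoding step, not JRuby's actual conversion code: the class and method names (`StringBoundarySketch`, `toRubyBytes`, `toJavaString`) are hypothetical, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

public class StringBoundarySketch {
    // Java -> Ruby direction: encode the Java String as UTF-8 bytes.
    // (Hypothetical helper, not a JRuby API.)
    static byte[] toRubyBytes(String javaString) {
        return javaString.getBytes(StandardCharsets.UTF_8);
    }

    // Ruby -> Java direction: assume the bytes are UTF-8 and decode them.
    // (Hypothetical helper, not a JRuby API.)
    static String toJavaString(byte[] rubyBytes) {
        return new String(rubyBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String java = "caf\u00e9";       // "café"
        byte[] ruby = toRubyBytes(java);

        System.out.println(java.length()); // 4 UTF-16 code units
        System.out.println(ruby.length);   // 5 bytes, since é needs two bytes in UTF-8
        System.out.println(toJavaString(ruby).equals(java)); // true
    }
}
```

The differing lengths show why receiving code has to know which representation it holds: the same text is four units long on the Java side and five bytes long on the Ruby side.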

Fixing the string encoding issue is just one of the many small, unglamorous steps needed to make JRuby as compatible with Ruby as possible.

A related issue is support for Ruby regular expressions on JRuby. The easy solution, used for a long time, was simply to handle Ruby regular expressions with java.util.regex, the regular expression library shipped with Java. However, bug reports about subtle differences in behavior kept coming in, and other concerns also made it clear that a better solution was required. The performance problems of java.util.regex were known, and performance could only get worse with the decision to use byte arrays to represent Ruby strings internally: java.util.regex doesn't work with byte arrays, so Ruby strings would have to be converted before it could work with them.
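To make the conversion cost concrete, here is a rough sketch of what matching a byte-array string with java.util.regex involves. It is an illustration under stated assumptions, not JRuby's code: the class and method names are hypothetical, and the input is assumed to be valid UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ByteRegexSketch {
    // java.util.regex matches CharSequences, so the bytes must be decoded
    // into a new String first. Returns the char offset of the first match,
    // or -1 if there is none. (Hypothetical helper.)
    static int firstMatchCharOffset(byte[] utf8, String regex) {
        String decoded = new String(utf8, StandardCharsets.UTF_8); // extra copy per call
        Matcher m = Pattern.compile(regex).matcher(decoded);
        return m.find() ? m.start() : -1;
    }

    // The match offset is counted in chars; byte-oriented code needs it in
    // bytes, so the prefix is re-encoded to measure it. (Hypothetical helper.)
    static int charToByteOffset(byte[] utf8, int charOffset) {
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        return decoded.substring(0, charOffset)
                      .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        byte[] input = "\u00e9=1".getBytes(StandardCharsets.UTF_8); // "é=1"
        int charOffset = firstMatchCharOffset(input, "=");
        System.out.println(charOffset);                          // 1
        System.out.println(charToByteOffset(input, charOffset)); // 2
    }
}
```

Every match pays for a decode, and every reported position has to be translated back into bytes, which is the kind of overhead an engine that operates on byte arrays directly avoids.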

So Ola Bini, a member of the JRuby core team, decided to bite the bullet, outline a solution and start working on it. After an interim solution based on JRegex, he is now working on REJ. Here is Ola Bini's description:

REJ is a project I've started, which will be a direct port of the MRI 1.8.6 regular expression engine. The important thing about this is that the semantics for JRuby will match MRI very closely. We will be able to match UTF-8, SJIS and EUC regular expressions, and we are able to have the same quirks as MRI, even though people shouldn't depend on such quirks.

By May 2007, these and other changes are expected to bring JRuby 1.0 as close to Ruby as possible.

1 comment

Ruby, Unicode
Posted by Sean Sullivan on Apr 11, 2007 10:45 AM

Tim Bray's blog discusses Ruby and Unicode:
http://www.tbray.org/ongoing/When/200x/2006/10/22/Unicode-and-Ruby
http://www.tbray.org/talks/rubyconf2006.pdf
