InfoQ

State of Unicode and Ruby Compatibility for JRuby 1.0

Posted by Werner Schuster on Apr 10, 2007

The true nature of Strings in JRuby has been a difficult topic in the past. Ruby represents strings as byte arrays, whereas Java has full Unicode support for Strings and represents them internally as UTF-16. This mismatch soon surfaced as subtle differences between code running on Ruby and code running on JRuby, as Charles O. Nutter explains:

But the APIs did not conform to what Ruby applications expected, frequently returning 16 bit values for individual characters and reporting incorrect byte lengths for strings that couldn't encode into all 8-bit characters. It was broken, as far as Ruby code was concerned.

He goes on to describe the solution that is likely to be used in JRuby 1.0:

  • Ruby strings are byte[] and conform to Ruby string semantics
  • Java strings passing into Ruby code will be encoded as UTF-8, with the implication that you should expect to be working with UTF-8 byte[] in the receiving code
  • Ruby strings passing out of Ruby into Java libraries will be assumed to be UTF-8, and the resulting string on the Java side of the call will reflect that assumption
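The conversion rules above can be illustrated with a minimal Java sketch. This is not JRuby's actual code: the class and method names (StringBoundarySketch, toRubyBytes, toJavaString) are hypothetical, and the sketch only assumes that the boundary uses plain UTF-8 encoding and decoding as the list describes. It also shows the kind of length mismatch mentioned in the quote: a Java String counts UTF-16 code units, while a Ruby-style byte[] counts bytes.

```java
import java.nio.charset.StandardCharsets;

public class StringBoundarySketch {
    // Java -> Ruby (hypothetical helper): encode the UTF-16 Java string as UTF-8 bytes.
    public static byte[] toRubyBytes(String javaString) {
        return javaString.getBytes(StandardCharsets.UTF_8);
    }

    // Ruby -> Java (hypothetical helper): assume the bytes are UTF-8 and decode them.
    public static String toJavaString(byte[] rubyBytes) {
        return new String(rubyBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // "café": the last character is outside ASCII
        byte[] bytes = toRubyBytes(s);

        System.out.println(s.length());     // 4 (UTF-16 code units)
        System.out.println(bytes.length);   // 5 (UTF-8 bytes, "é" takes two)
        System.out.println(toJavaString(bytes).equals(s)); // true (round trip is lossless)
    }
}
```

The round trip is only lossless when the Ruby-side bytes really are valid UTF-8, which is exactly the assumption stated in the third rule.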

Fixing the string encoding issue is just one of the many small, unglamorous steps needed to make JRuby as compatible with Ruby as possible.

A related issue is support for Ruby regular expressions on JRuby. The easy solution, used for a long time, was to handle Ruby regular expressions with java.util.regex, the regular expression library shipped with Java. However, bug reports about subtle differences in behavior kept coming in, and other concerns also made it clear that a better solution was required. The performance problems of java.util.regex were already known, and performance could only get worse with the decision to represent Ruby strings internally as byte arrays: java.util.regex does not work with byte arrays, so every Ruby string would have to be converted before it could be matched.
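The conversion cost can be sketched in Java as follows. This is an illustration under stated assumptions, not JRuby code: the class RegexOverBytesSketch and the method matchByteOffset are hypothetical, and the bytes are assumed to be UTF-8. Because java.util.regex matches against a CharSequence, the byte[] has to be decoded into a new String first, and the match position comes back as a char index that must be translated again if the caller works in byte offsets.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexOverBytesSketch {
    // Hypothetical helper: returns the byte offset of the first match, or -1 if none.
    public static int matchByteOffset(byte[] rubyBytes, String regex) {
        // java.util.regex needs a CharSequence, so the bytes are copied and decoded.
        String decoded = new String(rubyBytes, StandardCharsets.UTF_8);
        Matcher m = Pattern.compile(regex).matcher(decoded);
        if (!m.find()) {
            return -1;
        }
        // m.start() is a char index; re-encode the prefix to get a byte offset.
        return decoded.substring(0, m.start())
                      .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        byte[] bytes = "caf\u00e9 au lait".getBytes(StandardCharsets.UTF_8);
        // Char index of "au" is 5, but "é" occupies two bytes, so the byte offset is 6.
        System.out.println(matchByteOffset(bytes, "au")); // 6
    }
}
```

Every match pays for a decode of the whole input plus an extra encode of the prefix, which is the overhead a regular expression engine operating directly on byte arrays avoids.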

So Ola Bini, a JRuby core team member, decided to bite the bullet, outline a solution and start working on it. After an interim solution based on JRegex, he is now working on REJ. Here is Ola Bini's description:

REJ is a project I've started, which will be a direct port of the MRI 1.8.6 regular expression engine. The important thing about this is that the semantics for JRuby will match MRI very closely. We will be able to match UTF-8, SJIS and EUC regular expressions, and we are able to have the same quirks as MRI, even though people shouldn't depend on such quirks.

By May 2007, these and more changes will ensure that JRuby 1.0 comes as close to Ruby as possible.
