InfoQ

News

Multibyte for Rails: A Unicode Solution for Rails?

Posted by Obie Fernandez on Sep 25, 2006 06:52 PM

Community: Ruby
Topics: Ruby on Rails, Internationalization
Tags: i18n, Unicode, Rails Plugins

The issue of proper Unicode support for Ruby on Rails continues to generate lots of discussion and development activity.

The Multibyte for Rails project was started by Julian 'Julik' Tarkhanov with the 'unicode_hacks' plugin. Manfred Stienstra, Jan Behrens and Thijs van der Vossen later joined the development team.

ActiveSupport::Multibyte extends the standard Ruby string with a chars proxy method. This proxy allows you to use a multibyte encoded string as a sequence of characters (or Unicode codepoints) instead of bytes. On chars you can call multibyte-safe implementations of all standard Ruby string methods. The proxy also provides methods for Unicode normalization, composition and decomposition and preliminary support for working with grapheme clusters.
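The byte-versus-character distinction described above can be illustrated with a minimal sketch. It deliberately uses only core Ruby string methods (assuming Ruby 2.5 or later), not the plugin's own chars proxy, and the constant and method names are hypothetical, chosen for illustration only.

```ruby
# Core Ruby only (assumes Ruby 2.5+); this is not the plugin's own API.
COMPOSED   = "caf\u00E9"   # "é" stored as one codepoint, U+00E9
DECOMPOSED = "cafe\u0301"  # "e" plus a combining acute accent, U+0301

# Counts the same text three ways: bytes, codepoints, grapheme clusters.
def text_stats(str)
  {
    bytes:      str.bytesize,
    codepoints: str.codepoints.length,
    graphemes:  str.grapheme_clusters.length
  }
end

# Byte-oriented slicing can cut a multibyte character in half.
def byte_prefix(str, n)
  str.byteslice(0, n)
end

# Character-oriented slicing always stops on a character boundary.
def char_prefix(str, n)
  str[0, n]
end

puts text_stats(COMPOSED).inspect
puts text_stats(DECOMPOSED).inspect
puts byte_prefix(COMPOSED, 4).valid_encoding?
puts char_prefix(COMPOSED, 4).valid_encoding?
puts DECOMPOSED.unicode_normalize(:nfc) == COMPOSED
```

Both strings display as the same four-character word, yet they differ in byte and codepoint counts; only the grapheme cluster count agrees. Cutting the composed form after four bytes leaves an invalid UTF-8 sequence, while the character-based slice stays valid, and normalizing the decomposed form to NFC makes the two strings compare equal.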

Three months ago Julian submitted a test implementation of his ActiveSupport::Multibyte string extension patch to the Rails core team for inclusion in Ruby on Rails proper. According to a recent thread started by project team member and rails-core mailing list regular Manfred Stienstra, the Multibyte team has been working continually on improvements to the extension since then:

The code has been completely refactored to be more transparent and easier to understand. There is now a single optional accelerated backend and all multibyte-safe operations have a pure Ruby implementation. Test structure and coverage has also been greatly improved.

For anyone interested in trying out ActiveSupport::Multibyte, it is available as a plugin and can be converted to a patch using the included 'create_patch' rake task.

Why won't DHH and the core team just go ahead and patch Rails the way that folks such as Julian and his Multibyte team propose?

First of all, there is a performance penalty incurred by multibyte string operations. According to the project FAQ:

Multibyte safe operations through a proxy are obviously slower than single-byte operations directly on the string. The proxy introduces two levels of indirection and multibyte safe operations are more complex than single-byte operations and therefore slower.

A quick benchmark shows that for example a multibyte safe slice operation through the proxy is on average 50 times slower than a single-byte slice operation. Even though this makes the performance impact seem severe, remember that most of the string operations do not need to be multibyte safe. For a typical Rails application you're unlikely to even notice a performance penalty, but you have the satisfaction of knowing that you'll never break your user's text ever again.
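For readers who want to measure this kind of gap themselves, here is a rough timing sketch. It compares core Ruby's byte-indexed and character-indexed slicing on a UTF-8 string; it does not exercise the plugin's proxy, so it should not be expected to reproduce the FAQ's figure. The helper names and the sample text are assumptions made for illustration.

```ruby
# Times a block with the monotonic clock and returns elapsed seconds.
def time_it(iterations)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Compares a byte-indexed slice with a character-indexed slice
# taken from the middle of the same string.
def compare_slices(text, iterations = 100_000)
  byte_middle = text.bytesize / 2
  char_middle = text.length / 2
  {
    byte_slice: time_it(iterations) { text.byteslice(byte_middle, 10) },
    char_slice: time_it(iterations) { text[char_middle, 10] }
  }
end

# Hypothetical sample: three CJK characters and a space, repeated.
SAMPLE = ("\u65E5\u672C\u8A9E " * 1_000).freeze

compare_slices(SAMPLE).each do |name, seconds|
  puts format("%-10s %.4f s", name, seconds)
end
```

Absolute timings depend heavily on the Ruby version, the string contents, and the machine, so the ratio between the two numbers matters more than either value.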

Then there's the problem of UTF-8 and Unicode being terribly unpopular in Japan and China because of the Han Unification issue. On the other hand, as Sam Ruby pointed out in a brief comment to the rails-core mailing list:

Java and C# seem to do OK in Japan. I would also imagine that ASCII wouldn't be very popular in Japan. :-)

In the end, it seems that it will take more time before any single solution to the question of internationalization is adopted by Rails core. Copenhagen blogger Casper Fabricius sheds light on the situation based on recent comments by DHH at his user group:

...shouldn’t 5 or more plugins for internationalization indicate quite clearly that the Rails community craves unified support implemented in the core?

No, DHH answered, from the Core Team’s point of view, this means that people want to support and implement internationalization in a lot of different ways, and that there is no universal solution that will make everybody happy. Even inside the Core Team people can’t agree how it should be done. Although, DHH added, I can’t rule out that the day 37signals needs internationalization is the day that Rails gets it.

3 comments

Unicode support should be part of the language not a library by Faui Gerzigerk Posted Sep 26, 2006 2:54 PM
Re: Unicode support should be part of the language not a library by Christian Romney Posted Sep 26, 2006 10:26 PM
Re: Unicode support should be part of the language not a library by Devin Ben-Hur Posted Sep 27, 2006 1:02 AM
  1. I really don't think it makes sense to put Unicode support into Rails. Why is this not part of Ruby itself? And if it has to be a library, it must be implemented in C, and it must be fast. The argument that you actually don't need it very often and thus it can be 50 times slower is just an excuse for not doing it properly. I mean we're talking about operations like getting the length of a string, searching in a string, extracting substrings, doing regular expressions, etc. These are very frequent operations. And we have all the character conversion stuff going on as well. Every time I read in something from an XML file/message or from a database or from an HTTP request, some conversion may be required. This has to be very fast indeed!

  2. Speed would be nice. Multibyte text would be nicer. I wouldn't let speed get in the way of actually having the feature, especially as Ruby is already slow compared to many other languages yet is perfectly fast enough for all of my needs. The day it gets faster, well, that's just a bonus.

  3. Why is this not part of Ruby itself?
    If you really want to know, it's mostly because the ideographic Asian languages (Japanese, Chinese, etc.) were given the shaft when Unicode was developed by a bunch of arrogant westerners. Ruby's inventor is Japanese and Japan is still its main locus of use and development. Matz doesn't particularly like or need Unicode, so he hasn't previously built support into the language. However, he doesn't really desire to cause other people pain, so he is actively working on improved Unicode support for Ruby 1.9/2.0. You can read his ongoing thoughts and conversations. Try expressing your concerns via ruby-talk, which is much more useful than complaining in a forum the language designer/implementor is unlikely to ever see.
