InfoQ

The Unicode Debate Rekindled

Posted by Obie Fernandez on Jun 17, 2006

Sections: Development, Architecture & Design
Topics: Ruby, Programming
Tags: JRuby, Rails, Unicode

The long-running debate over Unicode in Ruby sprang back to life this week with a thread on the ruby-talk mailing list asking whether there is a plan to add Unicode support to the language anytime soon. The seemingly simple question ballooned into a huge discussion that included Matz and members of the Rails and JRuby teams.

Unicode is an industry standard designed to support computer encoding of text and symbols from all of the writing systems of the world. Unicode characters can be encoded (as bytes) using any of several schemes termed Unicode Transformation Formats (UTF). Does Ruby support Unicode? It depends on who you ask, hence the perennial debate about how to improve Unicode support in future versions of the language.

The question of Unicode support is important for Charles Nutter, one of the leads of the JRuby project. Since Java supports Unicode, and quite well at that, it is particularly embarrassing for JRuby, which runs on the JVM, not to support it. Charles asks the blogosphere, "What should that support look like?"

As for Rails, DHH has taken the position that handling Unicode properly is the responsibility of Rails application developers, and that Rails should not create a separate set of string manipulation methods. That hasn't stopped people from asking for Rails-based solutions, or proposing them. Julian Tarkhanov has probably done the most to help ease the pain of Unicode in Rails. His proxy solution adds a character array property to Ruby's String objects that handles Unicode characters properly.
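The proxy idea can be illustrated with a minimal sketch. This is not the code of the Rails plugin described above; the class name `CodePointProxy` and the accessor `u_chars` are hypothetical names chosen for illustration, and the sketch assumes the string holds valid UTF-8 bytes. It only relies on `String#unpack` and `Array#pack` with the `U` directive.

```ruby
# Hypothetical proxy, for illustration only: it views a UTF-8 string
# as an array of code points instead of an array of bytes.
class CodePointProxy
  def initialize(string)
    # 'U*' decodes the UTF-8 bytes into an array of integer code points.
    @code_points = string.unpack('U*')
  end

  # Number of code points, not number of bytes.
  def length
    @code_points.length
  end

  # Returns the character at the given index as a UTF-8 string, or nil.
  def [](index)
    point = @code_points[index]
    point && [point].pack('U')
  end

  # Reverses by code point, so multi-byte characters stay intact.
  def reverse
    @code_points.reverse.pack('U*')
  end
end

class String
  # Hypothetical accessor name; assumed not to clash with existing methods.
  def u_chars
    CodePointProxy.new(self)
  end
end

word = "caf\xC3\xA9"            # "café" written as raw UTF-8 bytes
puts word.u_chars.length        # 4
puts word.u_chars.reverse       # "éfac"
```

The ordinary String methods are left untouched; callers opt in to character-aware behavior only when they go through the proxy.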

Unicode specifies the encoding of the characters in strings, rather than the glyphs (renderings) of those characters, which has led to controversy over the usefulness of Unicode for the Chinese, Japanese and Korean languages. The process of Han unification, in which characters shared by all three languages were mapped to a single common set, was controversial, with most of the opposition coming from Japan. Opponents of Han unification state that it steamrolls over thousands of years of cultural tradition, misses many of the subtleties that are among the most important features of these languages, and renders serious literature and academic research in these languages impossible. Since Ruby is from Japan, it should not be surprising that support for Unicode in Ruby has been controversial too.

All Ruby programs are written (encoded) in 7-bit ASCII, Kanji (EUC-JP or Shift_JIS) or UTF-8. If a code set other than ASCII is used, a global option named $KCODE must be set appropriately. The option is used by Ruby's string manipulation methods, because internally Ruby keeps string data as a stream of bytes and by default treats each byte as one character. UTF-8 and other Unicode encodings use multiple bytes for many characters. In contrast, Java and other modern languages have a relatively easier time handling different character encodings and multi-byte characters because their String objects are arrays of characters rather than raw bytes.
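The gap between bytes and characters can be seen with a short sketch. The helper names `byte_count` and `code_point_count` are hypothetical, and the example assumes the string contains valid UTF-8. It uses only `String#unpack`, so it does not depend on how a particular Ruby version defines `String#length`.

```ruby
# Counts the raw bytes in the string ('C*' unpacks unsigned 8-bit values).
def byte_count(string)
  string.unpack('C*').length
end

# Counts Unicode code points ('U*' decodes the bytes as UTF-8).
def code_point_count(string)
  string.unpack('U*').length
end

word = "caf\xC3\xA9"              # "café": the final character takes two bytes

puts byte_count(word)             # 5
puts code_point_count(word)       # 4
p word.unpack('C*')               # [99, 97, 102, 195, 169]
p word.unpack('U*')               # [99, 97, 102, 233]
```

A method that works byte by byte would see five units here and could split the last character in half, which is the kind of problem the debate is about.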

For further reading, see the Fingertips writeup and the RedHanded article.


    already a possible solution?

    by Alex Popescu

    It looks like Rob Leslie has already proposed a solution on the ruby-talk ML.

    ./alex
    --
    .w( the_mindstorm )p.
