InfoQ

Character Encodings and M17N Explained

Posted by Mirko Stocker on May 10, 2009

Sections: Development, Architecture & Design
Topics: Internationalization, Ruby
Tags: Documentation, Ruby1.9

James Edward Gray II recently finished a series of ten posts on character encoding in Ruby called Understanding M17n.

Ruby 1.9 introduced many changes with regard to character support and works well with different and mixed encodings. Many projects need this, in particular open source projects developed by people all over the world.

Gray starts with the basics: what Unicode is and how it is encoded. Next come some Ruby 1.8-specific posts, for example on encoding conversion with iconv and on how Ruby 1.8 handles Unicode. After that comes a comprehensive treatment of Ruby 1.9's String and of how Ruby 1.9 differs from most other languages:

It's common to pick one versatile encoding, likely a Unicode encoding, and work with all data in that one format. Ruby 1.9 goes a different way. Instead of favoring one encoding, Ruby 1.9 makes it possible to work with data in over 80 encodings.
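To make the quoted idea concrete, here is a minimal sketch of how a String in Ruby 1.9 and later carries its own encoding. The helper name encoding_report is made up for this illustration and does not come from the series, and the sample word is arbitrary.

```ruby
# Hypothetical helper (not from the series): summarize a string's encoding state.
def encoding_report(str)
  {
    encoding: str.encoding,        # the Encoding object attached to the string
    chars:    str.length,          # length counted in characters
    bytes:    str.bytesize,        # length counted in bytes
    valid:    str.valid_encoding?  # do the bytes form valid data in that encoding?
  }
end

utf8      = "R\u00E9sum\u00E9"                   # \u escapes always yield a UTF-8 string
latin1    = utf8.encode("ISO-8859-1")            # transcode: same characters, different bytes
relabeled = latin1.dup.force_encoding("UTF-8")   # relabel only: bytes untouched, now invalid

p encoding_report(utf8)
p encoding_report(latin1)
p encoding_report(relabeled)
```

Here encode rewrites the bytes so the characters survive, while force_encoding only changes the label attached to the same bytes, which is why the last report flags the data as invalid.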

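One thing that is new in Ruby 1.9's m17n is the set of three default encodings (the source encoding, the default external encoding, and the default internal encoding), whereas Ruby 1.8 had just a single global variable, $KCODE. But why do we need them? Consider the following scenario: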

I may want to write my code in UTF-8, but some Japanese programmer may want to write his code in Shift JIS. Ruby should support that and, in fact, 1.9 does. Let's complicate things a bit more though: imagine that I bundle up that UTF-8 code I wrote in a gem and the Japanese programmer later uses it to help with his Shift JIS code. How do we make that work seamlessly?
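As a rough illustration of the three defaults (not the answer given in the series), the sketch below inspects them and then reads Shift JIS data as UTF-8 by stating both encodings on the IO object. The helper names default_encodings and read_as_utf8, and the temp-file prefix, are assumptions made for this example.

```ruby
require "tempfile"

# Hypothetical helper: collect the three defaults in one place.
def default_encodings
  {
    source:   __ENCODING__,               # encoding of string literals in this file
    external: Encoding.default_external,  # encoding assumed for data read from IO
    internal: Encoding.default_internal   # if set, IO data is transcoded to it on read (nil means no transcoding)
  }
end

# Hypothetical helper: read a file written in `external` and return UTF-8 text.
# The "r:EXTERNAL:INTERNAL" mode string overrides the defaults for this one stream.
def read_as_utf8(path, external)
  File.open(path, "r:#{external}:UTF-8") { |f| f.read }
end

p default_encodings

Tempfile.create("sjis_demo") do |f|
  f.binmode                                              # write raw bytes, no transcoding
  f.write("\u65E5\u672C\u8A9E".encode("Shift_JIS"))      # sample text stored as Shift JIS
  f.flush
  text = read_as_utf8(f.path, "Shift_JIS")
  p text.encoding
  p text == "\u65E5\u672C\u8A9E"
end
```

Because the source encoding is tracked per file and the IO encodings can be set per stream, code written in one encoding can consume data or libraries written in another without changing a process-wide setting.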

Read on in the post on Ruby 1.9's Three Default Encodings. The last article covers miscellaneous topics, for example working with binary data and regular expression encodings.

Reading through all ten posts should prepare you for Ruby 1.9's very powerful m17n capabilities and teach you various tricks, even if you plan to stick with 1.8 for the moment. And if you still haven't had enough of Unicode, you might want to read Joel Spolsky's legendary The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).


    Another article on unicode

    by Jelmer Kuperus

    Some shameless self-promotion on my part. A while back I wrote a very extensive (though Java-centric) blog post on this subject. It might be worth checking out for those not quite getting it yet.

    jelmer.jteam.nl/2007/08/12/on-character-set-enc...
