Character Encodings and M17N Explained

James Edward Gray II recently finished a series of ten posts on character encoding in Ruby called Understanding M17n (m17n is shorthand for multilingualization).

Ruby 1.9 introduced many changes with regard to character support and has strong support for working with different and mixed encodings. Many projects need this, in particular open source projects developed by people all over the world.

He starts with the basics: what Unicode is and how it is encoded. Several Ruby 1.8-specific posts follow, for example on encoding conversion with iconv and on how Ruby 1.8 handles Unicode. After that comes a comprehensive treatment of Ruby 1.9's String and of how Ruby 1.9 differs from most other languages:

It's common to pick one versatile encoding, likely a Unicode encoding, and work with all data in that one format. Ruby 1.9 goes a different way. Instead of favoring one encoding, Ruby 1.9 makes it possible to work with data in over 80 encodings.
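As a rough illustration of what "working with data in many encodings" means in practice, the sketch below shows that every Ruby 1.9+ String carries its own encoding. The sample word and the variable names are illustrative and not taken from the series. The literal uses \u escapes so that it is UTF-8 regardless of the source file's encoding.

```ruby
# A literal written with \u escapes is always UTF-8.
utf8 = "r\u00E9sum\u00E9"

# String#encode transcodes: same characters, different bytes.
latin1 = utf8.encode("ISO-8859-1")

# Character count stays the same, byte count differs between encodings.
utf8_info   = [utf8.encoding.name, utf8.length, utf8.bytesize]
latin1_info = [latin1.encoding.name, latin1.length, latin1.bytesize]

# String#force_encoding only relabels the existing bytes; it converts nothing.
# Latin-1 bytes relabelled as UTF-8 are not a valid UTF-8 sequence here.
relabelled = latin1.dup.force_encoding("UTF-8")
relabelled_valid = relabelled.valid_encoding?

puts utf8_info.inspect
puts latin1_info.inspect
puts relabelled_valid
```

The distinction between encode (convert the bytes) and force_encoding (change only the label) is the one most likely to cause trouble when mixing data from different sources.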

One thing that is new in Ruby 1.9's m17n is the set of three default encodings, whereas Ruby 1.8 had just a single global variable ($KCODE). But why do we need them? Consider the following scenario:

I may want to write my code in UTF-8, but some Japanese programmer may want to write his code in Shift JIS. Ruby should support that and, in fact, 1.9 does. Let's complicate things a bit more though: imagine that I bundle up that UTF-8 code I wrote in a gem and the Japanese programmer later uses it to help with his Shift JIS code. How do we make that work seamlessly?

Read on in the post on Ruby 1.9's Three Default Encodings. The last article covers miscellaneous topics, for example working with binary data and regular expression encodings.

Reading through all ten posts should prepare you for Ruby 1.9's very powerful m17n capabilities and provide you with various tricks, even if you plan to stick with 1.8 for the moment. And if you want more on Unicode and have not read it yet, Joel Spolsky's legendary The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is worth your time.
