Character Encodings and M17N Explained

by Mirko Stocker on May 10, 2009

James Edward Gray II recently finished a series of ten posts on character encoding in Ruby called Understanding M17n.

Ruby 1.9 introduced many changes with regard to character support and makes it much easier to work with different and mixed encodings. Many projects need this, in particular open source projects developed by people all over the world.

He starts with the basics: what Unicode is and how it is encoded. Some Ruby 1.8-specific posts follow, for example on encoding conversion with iconv and on how Ruby 1.8 handles Unicode. After that comes a comprehensive treatise on Ruby 1.9's String and on how Ruby 1.9 differs from most other languages:

It's common to pick one versatile encoding, likely a Unicode encoding, and work with all data in that one format. Ruby 1.9 goes a different way. Instead of favoring one encoding, Ruby 1.9 makes it possible to work with data in over 80 encodings.
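To make the idea concrete, here is a minimal sketch (not taken from the series) of how a Ruby 1.9+ String carries its own encoding. The method name `describe_encodings` and the sample text are made up for illustration; `String#encode`, `String#force_encoding`, `String#valid_encoding?` and `String#bytesize` are the standard String methods.

```ruby
# Hypothetical helper: shows one piece of text in two encodings.
def describe_encodings
  utf8      = "R\u00E9sum\u00E9"                 # \u escapes always yield a UTF-8 string
  latin1    = utf8.encode("ISO-8859-1")          # transcode: same characters, different bytes
  relabeled = latin1.dup.force_encoding("UTF-8") # same bytes, only the encoding label changes

  {
    utf8_encoding:   utf8.encoding,                   # the encoding travels with the string
    utf8_bytes:      utf8.bytesize,                   # each accented letter takes two bytes in UTF-8
    latin1_bytes:    latin1.bytesize,                 # one byte per character in ISO-8859-1
    same_text:       latin1.encode("UTF-8") == utf8,  # transcoding back restores the text
    relabeled_valid: relabeled.valid_encoding?        # Latin-1 bytes are not valid UTF-8 here
  }
end

p describe_encodings
```

The distinction to keep in mind is that `encode` converts the bytes, while `force_encoding` only changes how the existing bytes are interpreted.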

One new feature of Ruby 1.9's m17n is the set of three default encodings, where Ruby 1.8 had just a single global variable. But why do we need them? Consider the following scenario:

I may want to write my code in UTF-8, but some Japanese programmer may want to write his code in Shift JIS. Ruby should support that and, in fact, 1.9 does. Let's complicate things a bit more though: imagine that I bundle up that UTF-8 code I wrote in a gem and the Japanese programmer later uses it to help with his Shift JIS code. How do we make that work seamlessly?
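As a rough sketch of the three defaults, assuming they are the source encoding, `Encoding.default_external` and `Encoding.default_internal`: the snippet below inspects them, then reads Shift JIS bytes from a file and has Ruby transcode them to UTF-8 on the way in. The method names and the temporary file name are hypothetical; the `"r:Shift_JIS:UTF-8"` mode string follows Ruby's external:internal convention for IO.

```ruby
require "tmpdir"

# Hypothetical helper: reports the three encodings Ruby uses by default.
def default_encodings
  {
    source:   __ENCODING__,              # encoding of this source file's literals
    external: Encoding.default_external, # assumed encoding of data read from IO
    internal: Encoding.default_internal  # transcoding target for IO data, nil if unset
  }
end

# Hypothetical helper: stores text as Shift JIS bytes, then reads it back as UTF-8.
def read_shift_jis_as_utf8(text)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "sjis.txt")
    File.binwrite(path, text.encode("Shift_JIS")) # raw Shift JIS bytes on disk
    File.read(path, mode: "r:Shift_JIS:UTF-8")    # external Shift_JIS, internal UTF-8
  end
end

p default_encodings
p read_shift_jis_as_utf8("\u65E5\u672C\u8A9E").encoding
```

Setting the external and internal encodings per file like this keeps the choice local, so library code does not have to change the process-wide defaults.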

Read on in the post on Ruby 1.9's Three Default Encodings. The last article covers miscellaneous topics, for example working with binary data and regular expression encodings.
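The two miscellaneous topics can be illustrated with a small sketch; it is not code from the posts. The helper names `png_signature?` and `safe_match` are invented for this example, and `String#b` requires Ruby 2.0 or later.

```ruby
# The eight signature bytes of a PNG file, tagged as binary (ASCII-8BIT).
PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b

# Hypothetical helper: treats the input as raw bytes, whatever its label says.
def png_signature?(data)
  data.b.start_with?(PNG_SIGNATURE)
end

# Hypothetical helper: reports whether a pattern can be applied to a string.
def safe_match(pattern, text)
  pattern.match(text) ? :match : :no_match
rescue Encoding::CompatibilityError
  :incompatible # the regexp's encoding and the string's encoding cannot be mixed
end

utf8   = "caf\u00E9"
latin1 = utf8.encode("ISO-8859-1")

p png_signature?(PNG_SIGNATURE + "rest of file".b)
p safe_match(/\u00E9/, utf8)   # UTF-8 regexp against a UTF-8 string
p safe_match(/\u00E9/, latin1) # UTF-8 regexp against a Latin-1 string
p safe_match(/caf/, latin1)    # ASCII-only regexp works with any ASCII-compatible string
```

A regular expression containing non-ASCII characters has a fixed encoding, which is why the third call cannot succeed without transcoding the string first.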

Reading through all ten posts should prepare you for Ruby 1.9's very powerful m17n capabilities and provide you with various tricks, even if you plan to stick with 1.8 for the moment. And if you still haven't had enough of Unicode, you might want to read Joel Spolsky's legendary The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Another article on Unicode by Jelmer Kuperus

Some shameless self-promotion on my part. A while back I wrote a very extensive (though Java-centric) blog post on this subject. It might be worth checking out for those not quite getting it yet.

jelmer.jteam.nl/2007/08/12/on-character-set-enc...
