
Character Encodings and M17N Explained

by Mirko Stocker on May 10, 2009

James Edward Gray II recently finished a series of ten posts on character encoding in Ruby called Understanding M17n (m17n is shorthand for multilingualization).

Ruby 1.9 introduced many changes to character handling and has strong support for working with different and mixed encodings. Many projects need this, in particular open source projects developed by people all over the world.

He starts with the basics: what Unicode is and how it is encoded. Some Ruby 1.8-specific posts follow, for example on encoding conversion with iconv and on how Ruby 1.8 handles Unicode. After that comes a comprehensive treatise on Ruby 1.9's String and how Ruby 1.9 differs from most other languages:

It's common to pick one versatile encoding, likely a Unicode encoding, and work with all data in that one format. Ruby 1.9 goes a different way. Instead of favoring one encoding, Ruby 1.9 makes it possible to work with data in over 80 encodings.
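To make the quoted point concrete, here is a minimal sketch of a string carrying its own encoding in Ruby 1.9 and later. The helper name `encoding_report` and the sample word are invented for this illustration and do not come from the series; `String#encode`, `#encoding`, `#bytesize`, and `#length` are standard methods.

```ruby
# Hypothetical helper (not from the series): transcodes a UTF-8 string to
# ISO-8859-1 and reports how the two copies differ.
def encoding_report(utf8)
  latin1 = utf8.encode("ISO-8859-1") # new bytes, same characters
  {
    utf8_encoding:   utf8.encoding,
    latin1_encoding: latin1.encoding,
    utf8_bytes:      utf8.bytesize,
    latin1_bytes:    latin1.bytesize,
    same_length:     utf8.length == latin1.length,
    round_trips:     latin1.encode("UTF-8") == utf8
  }
end

# "\u00E9" is e-acute; \u escapes always produce a UTF-8 string,
# so this sample does not depend on the source file's encoding.
puts encoding_report("r\u00E9sum\u00E9").inspect
```

The character count is the same in both copies, while the byte count differs because UTF-8 needs two bytes for each accented letter and ISO-8859-1 needs one.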

New in Ruby 1.9's m17n support are three default encodings (the source encoding, the default external encoding, and the default internal encoding), whereas Ruby 1.8 had just a single global variable, $KCODE. But why do we need them? Consider the following scenario:

I may want to write my code in UTF-8, but some Japanese programmer may want to write his code in Shift JIS. Ruby should support that and, in fact, 1.9 does. Let's complicate things a bit more though: imagine that I bundle up that UTF-8 code I wrote in a gem and the Japanese programmer later uses it to help with his Shift JIS code. How do we make that work seamlessly?
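As a rough sketch of the pieces involved, the code below inspects the three defaults and reads Shift JIS data into a UTF-8 program. The method names and the temporary file are assumptions made for this example. In Ruby 1.9 each source file can declare its own source encoding with a magic comment such as `# encoding: Shift_JIS` on its first line, which is what lets files in different encodings coexist; the sample here avoids depending on that by using `\u` escapes.

```ruby
require "tmpdir"

# Returns the three defaults as seen from this file.
def default_encodings
  {
    source:   __ENCODING__,              # encoding of this source file
    external: Encoding.default_external, # assumed for data read from IO
    internal: Encoding.default_internal  # transcoding target; nil if unset
  }
end

# Hypothetical helper: writes text as Shift_JIS bytes, then reads it back
# and asks Ruby to transcode to UTF-8 on the way in.
def read_shift_jis_as_utf8(text)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "sample.txt")
    File.binwrite(path, text.encode("Shift_JIS"))
    # "r:external:internal" means: the file holds Shift_JIS,
    # hand it to the program as UTF-8.
    File.open(path, "r:Shift_JIS:UTF-8") { |io| io.read }
  end
end

puts default_encodings.inspect
japanese = "\u65E5\u672C\u8A9E" # the word "Japanese" written in kanji
puts read_shift_jis_as_utf8(japanese) == japanese
```

Passing the encodings explicitly in the open mode overrides the process-wide defaults for that one file, which is usually safer in library code than changing `Encoding.default_external` for everyone.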

Read on in the post on Ruby 1.9's Three Default Encodings. The last article covers miscellaneous topics, for example working with binary data and regular expression encodings.
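For a feel of the binary data and regular expression topics, here is a small sketch that is not taken from the series. The helpers `relabel` and `jpeg_header?` are hypothetical names; `String#force_encoding`, `String#b`, `String#valid_encoding?`, `Regexp#encoding`, and the `/n` regexp flag are standard (`String#b` and `Regexp#match?` arrived in Ruby versions later than 1.9).

```ruby
# Same bytes, different label: force_encoding never changes the bytes,
# unlike String#encode, which transcodes them.
def relabel(str, encoding)
  str.dup.force_encoding(encoding)
end

# Hypothetical check for the JPEG start-of-image marker (FF D8), using a
# binary (/n) regular expression against a binary-tagged copy of the data.
def jpeg_header?(data)
  /\A\xFF\xD8/n.match?(data.b)
end

utf8   = "caf\u00E9"
binary = relabel(utf8, Encoding::BINARY)

puts utf8.length                               # counts characters
puts binary.length                             # counts bytes
puts relabel("\xE9", "UTF-8").valid_encoding?  # a lone E9 byte is not valid UTF-8
puts(/caf\u00E9/.encoding)                     # a regexp literal has an encoding too
puts jpeg_header?("\xFF\xD8\xFF\xE0rest")
```

Tagging data as binary tells Ruby to treat it as raw bytes, so byte-oriented patterns can match without the string having to be valid in any character encoding.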

Reading through all ten posts should prepare you for Ruby 1.9's very powerful m17n capabilities and teach you various tricks, even if you plan to stick with 1.8 for the moment. And if you still haven't had enough of Unicode, read Joel Spolsky's legendary The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), if you haven't done so already.


Another article on unicode by Jelmer Kuperus

Some shameless self-promotion on my part: a while back I wrote a very extensive (though Java-centric) blog post on this subject. It might be worth checking out for those not quite getting it yet.

jelmer.jteam.nl/2007/08/12/on-character-set-enc...
