JS � Character Encodings

Summary

Anna Henningsen gives an overview over what character encodings are, what the JavaScript language provides to interact with them, and how to avoid the most common mistakes in Node.js and the Web.

Bio

Anna Henningsen is a Node.js core developer at NearForm Research.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Henningsen: I'm Anna. I work at NearForm, in a department called NearForm Research. For me, specifically, that means that I get to work on Node.js core full time. @addaleax, that's my Twitter and GitHub handle.

Let's talk about character encodings. This is a screenshot from a Travis CI run that I ran a while back. It's not that long ago, it was earlier this year. This bug has been around for a long time, because I gave a talk about this topic in March 2017, and they had that bug there already. I don't know who of you can tell why there are these replacement characters just randomly in the middle of the text. If you can't, that's good, because hopefully you'll learn it right now.

What is a character encoding? In the end, when people do things with computers, they tend to work in text forms, whether that's programs, or whether that is some other input that they give to the computer. It's usually text. Text is conceptually a list of characters. That's what separates it from random images on paper. The idea with character encoding is that the computers definitely would prefer numbers. We take these characters, we assign them numbers, integers. Then we figure out a way to transcribe those integers into a list of bytes. That whole process of going from text to a list of bytes is known as encoding. The reverse step is known as decoding. For example, this would be the standard ASCII approach to this. Hello is a five letter word, we will split that into five separate characters. Each of these characters is being assigned a number. At least in this case, we take these numbers and say, each of these numbers corresponds to 1 byte in the final output. Once you start working with more characters, that system breaks down because when you say each character is 1 byte, then you're stuck with 256 characters. That doesn't work for more complicated use cases like Chinese characters. What are we going to do about that?
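As a minimal sketch of the encoding and decoding steps described above, the following JavaScript maps each character of "Hello" to its number and then to one byte per character. The variable names are illustrative only, and `TextEncoder` actually produces UTF-8, which happens to be identical to ASCII for these five characters.

```javascript
// Each character of an ASCII-only string maps to one number and one byte.
const text = 'Hello';

// Step 1: characters to numbers.
const codes = Array.from(text, (ch) => ch.codePointAt(0));

// Step 2: numbers to bytes (UTF-8, identical to ASCII for this input).
const bytes = new TextEncoder().encode(text);

console.log(codes);             // [ 72, 101, 108, 108, 111 ]
console.log(Array.from(bytes)); // [ 72, 101, 108, 108, 111 ]

// Decoding is the reverse step: bytes back to text.
const decoded = new TextDecoder().decode(bytes);
console.log(decoded);           // Hello
```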

ASCII

The simplest way to do this is ASCII. At least historically, it's the most important of the first character encodings that came into existence. The idea is, we take 128 characters, not all of which are printable characters that you could see on paper, and we assign each of them a number. These are the decimal and hexadecimal values that we give them. We say each of these values will just be encoded as a single byte in the final output. I will use hexadecimal representations a lot. It doesn't really matter because the exact values don't matter. ASCII is a 7-bit character encoding. It covers most use cases that appear in the English language and in languages that use a similar alphabet, which aren't all that many. That's pretty much it. There's not a lot else that you can do with it, which is frustrating when you do want to support other languages. There are other character encodings that historically were competing with ASCII. For example, there's EBCDIC, which is basically only used on IBM mainframes these days. I was thinking for April Fool's this year, I might want to open a PR against Node that adds support for that character encoding, because, again, it's only used on IBM mainframes. It hasn't really been in use since the '70s. Reality is stranger than fiction: we actually do have an EBCDIC encoder in the Node source tree now, because we actually support one of these weird IBM systems, or are starting to support them.

ISO 8859-*, Windows Code Pages

The first step towards making ASCII work for other languages is to extend it. That is the idea behind a lot of the character encodings that came next, in particular the ISO 8859-* family and the Windows code pages. ASCII is 7-bit, which means we have another 128 byte values available. We're just going to create a lot of character encodings that each cover some specific languages. For example, there's Latin-1, which stands for ISO 8859-1, but rolls off the tongue a bit more nicely. That is for Western languages: you can write Spanish or French with it, languages like that, German too. Then there are other encodings for other languages, like the Cyrillic variant in that standard. There's also the Cyrillic Windows code page. These are not necessarily compatible. Here is an example where the same characters appear in two different character encodings for the same language, both for Cyrillic languages. If you encode something as one of them and decode it as the other, it will come out as garbage. That was also not a great situation overall. And this doesn't really cover, for example, the Chinese character use case.

GBK

One example of a character encoding that was used for Chinese characters is GBK. The idea is, 128 extra characters, that's not enough. If there's a byte that's not ASCII, with the upper bit set, then the next byte will also count towards this character. It's either a 1-byte ASCII character or a 2-byte Chinese character. That gives you about 30,000 more characters, which is still not enough to cover all of the Chinese characters that exist, but it's practical enough. Basically, the situation that one ended up with was that there are hundreds of character encodings out there, which is not an exaggeration, actually. Let's work on something that works for everybody. This xkcd comic is obviously somewhat sarcastic, but for character encodings that actually worked.

Unicode: Multiple Encodings

What we ended up with is Unicode, which is not an encoding. It is a character set. Take this word, which I chose because it has a non-ASCII character in it: to each of its characters we assign a number, and that is Unicode. That should ideally cover all the use cases that were previously covered by other encodings. Then we specify an actual encoding, and there are multiple of those, which defines how to translate these integers into byte sequences: there's UTF-8, UTF-16, UTF-32, and quite a few others. UTF-8 and UTF-16 are the most important ones, and definitely the most important ones for JavaScript.

Unicode Characters

The way that Unicode characters are usually spelled out is U+ and then four hex digits, or sometimes five if the character doesn't fit into the four-hex-digit range. That is how you specify which character you're talking about; it does not specify how it is encoded. The numbering is compatible with Latin-1. The first 256 Unicode characters are the 256 Latin-1 characters. That's actually important for JavaScript too. The maximum number that one can have is larger than 1 million, so we have a little more than 1 million characters in total available for Unicode. Hopefully, that's enough for the future, we'll see. Right now there's no issue with that. It also includes emoji, which is something that the Unicode standard is famous for these days. Every new revision of the Unicode standard comes with new emoji. It has its own replacement character, which is something that previous character encodings didn't necessarily feature. That is a special character that can be used when something cannot be decoded successfully. Unicode is also the character set that we talk about when we use character escapes in HTML or in JavaScript. Those always refer to some Unicode code point.

UTF-8

The most common character encoding that is used with Unicode is UTF-8. It's a variable-length encoding. The higher the character number is, the longer the byte sequence is in which it is encoded. In particular, it's ASCII compatible. The ASCII characters in UTF-8 are encoded exactly as they are in ASCII, which is very nice. There's another nice property of this scheme, and you don't really have to worry about how the actual bits are encoded for it: if there is something broken, some invalid byte in there when you decode it, that won't break decoding of the rest of the string.
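To illustrate the variable-length property, here is a small sketch that counts the UTF-8 bytes for a few characters. The sample characters are chosen for illustration and are not taken from the talk's slides.

```javascript
// Higher code points need longer UTF-8 byte sequences.
const samples = ['A', '\u00fc', '\u20ac', '\u{1F921}']; // A, ü, €, clown face
const encoder = new TextEncoder();

const utf8Lengths = samples.map((ch) => encoder.encode(ch).length);
console.log(utf8Lengths); // [ 1, 2, 3, 4 ]

// ASCII compatibility: the byte for 'A' is the same as in ASCII.
const asciiByte = encoder.encode('A')[0];
console.log(asciiByte.toString(16)); // 41
```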

For example, if we encode a string using Latin-1, this middle string, and then decode it as UTF-8, then this fc is not a valid byte in UTF-8, never ever. What we can do is replace it with this replacement character instead of having it gobble up all the rest of the string.
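The mismatch can be reproduced in Node.js with a short sketch. The input word "Müll" is an assumption made for illustration (its ü encodes to the byte fc in Latin-1); the slide's actual string is not preserved in the transcript.

```javascript
// Encode with Latin-1, then (wrongly) decode as UTF-8.
const latin1Bytes = Buffer.from('M\u00fcll', 'latin1'); // "Müll"
console.log(latin1Bytes.toString('hex')); // 4dfc6c6c

// 0xfc is never valid in UTF-8, so it becomes U+FFFD (the replacement
// character), but the rest of the string still decodes correctly.
const misdecoded = latin1Bytes.toString('utf8');
console.log(misdecoded === 'M\ufffdll'); // true
```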

UTF-16

Then there's UTF-16, which uses 2-byte code units, so about 65,000 characters can be encoded in a single 2-byte unit. Ones that do not fit into that range are split into a pair of code units. Because it uses 2-byte units, there are two different variants. I don't know if you are familiar with that. There are, generally, little endian machines and big endian machines. The little endian ones put the low byte first and then the higher value byte; big endian is the reverse situation. Most modern processors use little endian. That's, for example, why Node.js only supports that variant. One thing that is sometimes used in this case is a special character, which is called the byte order mark, which is sometimes prepended to a UTF-16 string so that you can figure out, "This is little endian," or, "This is big endian," by looking at the first bytes of a string.
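A rough sketch of what this looks like in Node.js follows. Node.js only offers the little endian variant (`'utf16le'`), so the big endian bytes here are produced by swapping byte pairs with `swap16()`; the sample strings are illustrative.

```javascript
// U+FEFF is the byte order mark; prepend it manually to the string.
const BOM = '\ufeff';

// Little endian: low byte first.
const le = Buffer.from(BOM + 'Hi', 'utf16le');
console.log(le.toString('hex')); // fffe48006900

// Big endian: same code units, bytes swapped within each 2-byte unit.
const be = Buffer.from(le).swap16(); // Buffer.from(buffer) copies first
console.log(be.toString('hex')); // feff00480069

// A character outside the 2-byte range becomes a pair of code units.
const clownBytes = Buffer.from('\u{1F921}', 'utf16le');
console.log(clownBytes.toString('hex')); // 3ed821dd
```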

When Richard and I were talking earlier before this talk, he said that JavaScript still uses this 16-bit encoding scheme. Sometimes I hear people say that JavaScript uses UTF-16, and that's not quite correct. The spec doesn't tell you how the characters are actually encoded in the engine. It doesn't tell you anything about that. It could be UTF-16. Generally, JavaScript engines are very clever about this, so you don't really need to worry about it. We do use character codes from 0 to 65,535. We do split code points that don't fit into that range into two separate units. For example, if we use emoji, which generally don't fit into that basic set of characters, then we will split them into two separate ones. JavaScript uses a string representation that is very much based on UTF-16. What the actual representation looks like, the spec doesn't say anything about.

For V8, I know this because I work a lot with V8, and SpiderMonkey, Firefox's JavaScript engine, seems to do the same thing: they have a representation that works for Latin-1-only strings. If your string contains only ASCII or Western Latin characters, then the engine will probably try to use that representation instead to save space, to save memory. Don't overthink it. There are tons of different string representations in modern JavaScript engines.

Converting Back and Forth in JS

The way that you use these character encodings in JavaScript: the very basic thing you want to do is convert between a string and a list of bytes, which in JavaScript is usually a Uint8Array, or in Node.js a Buffer, which is also a fancy Uint8Array. In Node.js, you can use Buffer.from to encode a string into a buffer, and then use buf.toString to do the reverse transformation. The browser has the TextEncoder and TextDecoder APIs that allow you to do that: you can create instances of these classes, TextDecoder and TextEncoder, and call their encode and decode methods. They also perform that conversion between strings and Uint8Arrays. Modern Node.js versions support these too. Node 12 has these as globals; for Node 10 you need to require them manually from the util module. TextDecoder supports a wide range of encodings, which makes sense because browsers have to support a number of legacy encodings anyway, because they have to read legacy websites in those encodings. For encoding, they only support UTF-8. That's very much pushing people towards using UTF-8 everywhere, which is generally a good thing.
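Both conversion paths can be sketched side by side. This assumes a Node.js version where `TextEncoder` and `TextDecoder` are globals; the sample string is illustrative.

```javascript
const original = 'Gr\u00fc\u00dfe'; // "Grüße"

// Node.js: Buffer.from encodes, buf.toString decodes.
const buf = Buffer.from(original, 'utf8');
const fromBuf = buf.toString('utf8');

// Web-style API: TextEncoder always produces UTF-8.
const u8 = new TextEncoder().encode(original);
const fromU8 = new TextDecoder('utf-8').decode(u8);

console.log(buf.length, u8.length); // 7 7 (ü and ß take 2 bytes each)
console.log(fromBuf === original, fromU8 === original); // true true

// A Buffer is a Uint8Array, so TextDecoder accepts it directly.
const fromBufViaDecoder = new TextDecoder('utf-8').decode(buf);
console.log(fromBufViaDecoder === original); // true
```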

Dealing with Decoding Errors

If you ever wonder about what happens when a text decoder encounters something invalid: usually it replaces it with a replacement character, like for this 0xff byte, which is never valid in UTF-8. You can also pass a flag to it that says, please throw an exception if there is something wrong in that input that cannot be decoded. I don't generally know why you would want to do that, or at least not if you actually want to decode a string. You would probably only use this for testing whether a buffer contains valid UTF-8 or not. Because if you want to decode a string, you can usually just live with a replacement character showing up. It's not an ideal situation, and something went wrong somewhere. You still end up with something that mostly works, and that is your goal.
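The two behaviours can be compared with a short sketch. The helper name `isValidUtf8` is hypothetical and only illustrates the "validation" use case mentioned above; the `fatal` option is part of the standard `TextDecoder` API.

```javascript
const invalid = Uint8Array.of(0x4d, 0xff, 0x6c); // 'M', an invalid byte, 'l'

// Default mode: invalid input becomes U+FFFD.
const lenient = new TextDecoder('utf-8').decode(invalid);
console.log(lenient === 'M\ufffdl'); // true

// Hypothetical helper: fatal mode throws a TypeError on invalid input,
// which makes it usable as a UTF-8 validity check.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

console.log(isValidUtf8(invalid));                          // false
console.log(isValidUtf8(new TextEncoder().encode('M\u00fcll'))); // true
```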

One of the reasons I picked this particular example is that this problem pops up in our Node.js issue tracker on GitHub from time to time. It is something that even Node core contributors tend to get wrong from time to time when writing tests for a test suite. What is wrong with this code example? This is the Node.js code example. You could come up with something similar with TextDecoder in the browser. The first thing that I'm just noticing right now is that data should have been let instead of const, but let's overlook that. There is a string, and whenever standard input gets some chunk of data, some buffer, we append it to that string. Once the standard input is all read, we write it out again. UNIX actually has a command like this that's called sponge. It sponges up all the data and then passes it out at the end when the input is done.

What is wrong with this part? The first hint that I can give you is that when you concatenate a string and an object, and a buffer or Uint8Array is an object, it will call toString on that object. In this case, it always calls it on the buffer, so we call .toString on each individual buffer here. What is the case in which this can go wrong? What can happen is that we don't control the size or the boundaries of the buffer chunks that we read from standard input. For example, when the input is Müll, what can happen in the worst case is that it gets split right in the middle of the ü character. For example, we could read from the operating system first 2 bytes, then 3 bytes. When we call toString on these individually, it won't work, because each of these contains parts of a valid character but not an entire one. We will end up with M, replacement, replacement, ll, which is what happens here. What Travis CI actually does internally, probably, is that it reads data from the terminal that it created, and it converts it to a string. It does that for each chunk that it reads. Because these chunks are usually rather large, I would guess a couple of kilobytes at least, it doesn't happen for every single character in the output. It doesn't happen when you happen to not hit the middle of a character. It works 99% of the time, but sometimes it fails. That's what's happening here.
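The failure can be reproduced without standard input. In this sketch the chunk boundaries are simulated by hand, assuming the input word is "Müll" (with a 2-byte ü) split after the first 2 bytes; the variable names are illustrative.

```javascript
// UTF-8 bytes of "Müll": 4d c3 bc 6c 6c, split in the middle of the ü.
const chunks = [
  Buffer.from([0x4d, 0xc3]),       // 'M' plus the first half of 'ü'
  Buffer.from([0xbc, 0x6c, 0x6c]), // second half of 'ü' plus 'll'
];

// Buggy pattern: string + Buffer implicitly calls chunk.toString(),
// which decodes every chunk on its own.
let data = '';
for (const chunk of chunks) {
  data += chunk;
}
console.log(data === 'M\ufffd\ufffdll'); // true: two replacement characters

// Decoding all bytes at once gives the expected result.
const whole = Buffer.concat(chunks).toString('utf8');
console.log(whole); // Müll
```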

The way to fix this in Node.js is very easy, actually, because streams have a setEncoding method that basically tells the stream, "I don't want to read buffers, I want to read strings, and I want to read them using this encoding that you use to decode the buffers." It will take care of decoding automatically. It will take care of cases like the one where partial characters might be read.
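A minimal sketch of the fix follows. It uses a `PassThrough` stream as a stand-in for `process.stdin` so that the chunk boundaries can be controlled; the helper name `collectAsString` is hypothetical. With real standard input, the equivalent call would be `process.stdin.setEncoding('utf8')`.

```javascript
const { PassThrough } = require('stream');

// Hypothetical helper: feeds the chunks through a stream that has an
// encoding set, and resolves with everything that was read as a string.
function collectAsString(chunks) {
  return new Promise((resolve, reject) => {
    const input = new PassThrough(); // stand-in for process.stdin
    input.setEncoding('utf8');       // 'data' events now carry strings

    let data = '';
    input.on('data', (chunk) => { data += chunk; });
    input.on('end', () => resolve(data));
    input.on('error', reject);

    for (const chunk of chunks) input.write(chunk);
    input.end();
  });
}

// "Müll" split in the middle of the ü still decodes correctly.
const splitInput = [Buffer.from([0x4d, 0xc3]), Buffer.from([0xbc, 0x6c, 0x6c])];
collectAsString(splitInput).then((text) => console.log(text)); // Müll
```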

Under The Hood: Streaming Decoders

The way that it works internally is that Node.js has the StringDecoder class, which is a very lightweight, transform-stream-like thing. You can write buffers to it, and it will try to decode them. When it encounters partial characters at the end of a buffer, it will wait for the next buffer before it continues decoding. The TextDecoder API in the browser actually has a similar feature, an option which is called stream. Again, streaming decoding. It will do exactly the same thing. If there's a partial character at the end of the input, it will wait for that character to be finished before continuing with decoding. Actually, I think that it's a funny pattern that there are some Node.js APIs that were first introduced there and that later became browser APIs in a very different format. For example, we had buffers, which were later introduced as Uint8Arrays in the browser, or in the language generally. Same with StringDecoder and TextDecoder, same with streams and [inaudible 00:22:40] streams, and so on.
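Both streaming decoders can be sketched on the same split input. The chunk boundaries are simulated by hand, assuming the word "Müll" split in the middle of its 2-byte ü.

```javascript
const { StringDecoder } = require('string_decoder');

const parts = [Buffer.from([0x4d, 0xc3]), Buffer.from([0xbc, 0x6c, 0x6c])];

// Node.js StringDecoder: holds back the incomplete character until the
// next write() completes it.
const stringDecoder = new StringDecoder('utf8');
let viaStringDecoder = '';
for (const part of parts) viaStringDecoder += stringDecoder.write(part);
viaStringDecoder += stringDecoder.end(); // flush anything left over

// TextDecoder: the { stream: true } option gives the same behaviour.
const textDecoder = new TextDecoder('utf-8');
let viaTextDecoder = '';
for (const part of parts) viaTextDecoder += textDecoder.decode(part, { stream: true });
viaTextDecoder += textDecoder.decode(); // final call without stream flushes

console.log(viaStringDecoder, viaTextDecoder); // Müll Müll
```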

Surrogates in JS

JavaScript uses the same pattern as UTF-16 for dealing with characters that don't fit into that basic 65,000 character range. For example, when I take a clown emoji, that is actually two separate units in JavaScript, and the string length will be reported as two units in that string. How do we get the actual number of characters? How do we work with this fact that every entry in a string is not necessarily a single character?
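A short sketch of what this looks like for the clown emoji (U+1F921), written with an escape so the example does not depend on the source file's encoding:

```javascript
const clown = '\u{1F921}'; // clown face emoji

// One visible character, but two UTF-16 code units.
console.log(clown.length); // 2

// The two halves are a high and a low surrogate.
const units = [clown.charCodeAt(0), clown.charCodeAt(1)].map((u) => u.toString(16));
console.log(units); // [ 'd83e', 'dd21' ]
```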

Option 1: Strings Are Iterables

There's an easy option for doing this. Strings in JavaScript are iterables, which means you can expand them into arrays, and you can iterate over them with for...of. That will do the right thing. For example, when you expand a string like that, it will give you an array where each individual entry is also an individual character. Even though that last one is actually made of two character codes. If you want the number of characters, in the sense of Unicode code points within a single string, you can use for...of, to iterate over that. It's not maybe the prettiest thing in the world. It is definitely more accurate than string.length for this case.
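As a sketch of this option, the following counts code points by iterating over the string. The helper name `countCodePoints` is hypothetical, and the sample string is illustrative.

```javascript
const sample = 'abc\u{1F921}'; // three letters plus a clown emoji

// Spreading a string iterates by code point, not by code unit.
const characters = [...sample];
console.log(sample.length);     // 5 (code units)
console.log(characters.length); // 4 (code points)

// Hypothetical helper: the same count using for...of, without
// building an intermediate array.
function countCodePoints(str) {
  let count = 0;
  for (const _ch of str) count++;
  return count;
}
console.log(countCodePoints(sample)); // 4
```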

Option 2: Manual Work

JavaScript has some APIs to do manual extraction of code points from a string. A long time ago, all we had was charCodeAt on strings and String.fromCharCode. It did the right thing in the sense that it gave you the content of a string at a particular index. It didn't give you the code point in the case that it didn't fit into that 65,000-character range. Some newer APIs were introduced, which are called codePointAt and String.fromCodePoint, which basically deal with this the way you would expect them to. If a character doesn't fit into that range, they will decode the entire character, and tell you the actual Unicode code point, the actual Unicode number associated with that character.
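The difference between the older and newer APIs can be sketched as follows, again using the clown emoji (U+1F921) as an illustrative character:

```javascript
const emoji = '\u{1F921}'; // clown face

// Older API: returns a single 16-bit code unit (half of the character).
const firstUnit = emoji.charCodeAt(0).toString(16);
console.log(firstUnit); // d83e

// Newer API: decodes the surrogate pair into the full code point.
const codePoint = emoji.codePointAt(0).toString(16);
console.log(codePoint); // 1f921

// The reverse direction works both ways, but fromCharCode needs
// both surrogate halves spelled out.
console.log(String.fromCodePoint(0x1f921) === emoji);        // true
console.log(String.fromCharCode(0xd83e, 0xdd21) === emoji);  // true
```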

This one is also an issue that sometimes pops up, not that often. It's one of these things where, if you don't know why this is happening, then you don't know, and if you know why it is happening, then you know. You're like, "I never thought of that," or at least I was. We have two regular expressions: one checks whether an E occurs in a string two to four times, the other whether the cat emoji occurs two to four times. It works in the case of the E and it doesn't work in the case of the cat, which looks surprising, to me at least. The cat emoji is two separate units. The way that JavaScript interprets that pattern is that it says the first half of the cat once, and then two to four times the second half of the cat, which then doesn't match that two-cat example.

There's an easy solution. JavaScript regular expressions have received this Unicode flag that you can set on them, which is just a small u, and it basically makes this work correctly. It's generally something that you want to set on regular expressions. There's no reason to ever not do this. When I first heard that JavaScript regular expressions had Unicode support, I was really excited to see if they had character class support, or something like that. I think they didn't at first, but now they actually do. This is not supported in all browsers yet. Apparently, sadly, Firefox, which I personally use, doesn't support it, but it's going to come at some point. You can use this \p escape to match characters based on some metadata that exists about them. There's a ton of different Unicode classes of characters. For example, this one, emoji presentation. You can use this to actually figure out very reliably which characters in a string are emoji.
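The regular expression behaviour described above can be sketched like this. The patterns are reconstructed from the description rather than copied from the slides, and the cat face (U+1F431) is written as an escape.

```javascript
const cat = '\u{1F431}'; // cat face emoji, two UTF-16 code units

// Without the u flag, {2,4} only applies to the second half of the cat.
const withoutU = new RegExp(`^${cat}{2,4}$`);
// With the u flag, the quantifier applies to the whole code point.
const withU = new RegExp(`^${cat}{2,4}$`, 'u');

console.log(/^e{2,4}$/.test('ee'));  // true
console.log(withoutU.test(cat + cat)); // false
console.log(withU.test(cat + cat));    // true

// Unicode property escapes (require the u flag) match by metadata.
const emojiPresentation = /\p{Emoji_Presentation}/u;
const emojiInText = [...`I ${cat} JS`].filter((ch) => emojiPresentation.test(ch));
console.log(emojiInText.length); // 1
```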

One of the rules that Unicode has when it introduces characters is that if a character exists in a previous character encoding, and it is separate from other characters there, then it is also going to be separate in Unicode. I don't know if you remember, for example, from the slide where I showed the two Cyrillic character encodings: there's a Cyrillic lowercase a and the Latin lowercase a, which look exactly the same. Nowadays, if we didn't have previous character encodings that separated these two, we would probably not have separate code points for them in Unicode. But there's legacy code out there in the world, so they are separate in Unicode. This is not what this is about. This is about the fact that Unicode characters can be composed in different ways. This is what you get when you try to expand these into separate code points.

For example, with the French name André, you get A-N-D-R, and those are equal in both of these strings. In one of them, you then get two separate code points, and in the other a single one. The reason for this is that one of them is an E plus a combining accent character, which Unicode features. You can have characters that you can just put on other characters, or below, or whatever. I think that's pretty much how you get Cyrillic script. You can also just have the e with an accent as a character on its own. I don't know if you know how Korean works? Basically, each of these two characters, you can think of them as being composed of three separate letters, not written in a single row. This is the Korean word Hangul, which is the script name, the name of the alphabet, basically. You could write it either as Han plus gul in Unicode, or as H-A-N-G-U-L, basically.

Unicode Normalization

There are ways to deal with this in JavaScript. The way to do that is String.prototype.normalize. You can call that on any string. You can basically do the two things that you want to do with them. Either you can make each combined character be as combined as possible, like the E with the accent as a single one, or the han as a single one too. Or you can call it with a parameter that's NFD instead of NFC, and that's saying, please decompose these as much as possible: E with an accent as two separate ones, or han as three separate ones. You may want to use this when comparing strings, because what you end up with otherwise is characters that look the same, and that behave the same, and that should be equivalent, but that don't compare equal in the sense of JavaScript equality.
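A minimal sketch of NFC and NFD, using the name André and the Korean word 한글 (Hangul) as in the examples above, written with escapes so both spellings are unambiguous:

```javascript
const composed = 'Andr\u00e9';    // é as a single code point
const decomposed = 'Andre\u0301'; // e followed by a combining acute accent

// They render the same but are different strings.
console.log(composed === decomposed);            // false
console.log(composed.length, decomposed.length); // 5 6

// Normalizing both sides to the same form makes them comparable.
console.log(decomposed.normalize('NFC') === composed); // true
console.log(composed.normalize('NFD') === decomposed); // true

// Hangul syllables decompose into their individual letters.
const hangul = '\ud55c\uae00'; // 한글
console.log(hangul.normalize('NFC').length); // 2
console.log(hangul.normalize('NFD').length); // 6
```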

Going a bit further, there are two more parameters that String.prototype.normalize can take. Another thing that people do, and you might have mostly noticed it from fancy Twitter handles or Twitter names, is that you can use characters that are variants of base characters. For example, this word HELLO: the first time it's written on the slide, it is with mathematical bold letters, the way that you would use them in math script. You may want to be able to figure out, what are the base characters that correspond to this? For this, you have the parameter modes NFKC and NFKD. They first convert each character to its base variant, and then do the same things as on the slide before, so either combine them or decompose them. This is something that, first of all, you may want to apply to search parameters, because conceptually, these strings have the same semantic meaning. For example, if you were writing a profanity filter for usernames or something, you wouldn't want somebody to circumvent that by using fancy letters instead of their standard equivalents.
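The compatibility forms can be sketched with the mathematical bold spelling of HELLO. The helper `matchesBlockedWord` is hypothetical and only illustrates the filter use case mentioned above.

```javascript
// "HELLO" in mathematical bold capitals (U+1D407, U+1D404, U+1D40B, U+1D40B, U+1D40E).
const fancy = '\u{1D407}\u{1D404}\u{1D40B}\u{1D40B}\u{1D40E}';

console.log(fancy === 'HELLO');                   // false
console.log(fancy.normalize('NFC') === 'HELLO');  // false: NFC keeps the variants
console.log(fancy.normalize('NFKC') === 'HELLO'); // true: NFKC maps to base letters

// Hypothetical helper: compare after NFKC normalization and lowercasing.
function matchesBlockedWord(input, blocked) {
  return input.normalize('NFKC').toLowerCase() === blocked.toLowerCase();
}
console.log(matchesBlockedWord(fancy, 'hello')); // true
```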

What Does Str.length Actually Tell Us?

Str.length doesn't give us the number of characters, so what does it actually give us? Characters can be composed of multiple code points, so it doesn't give us that. They can be split into two separate units if they don't fit into the basic set of 65,000 characters. It also doesn't give us the string width. If you remember that Hangul example, the string length was six because there were six different code points, even though it would only have a string width of four, because each East Asian character would typically have a width of two, as opposed to Latin characters, which are usually half as wide as they are tall. Still, it's definitely not accurate to say that it has a width of six. The only thing that it really gives you is half the byte length when you're encoding it as UTF-16. If you ever use the length of the string, make sure that you're using it for the right thing, because it's easy to use it for something that is accurate 99% of the time, but fails sometimes.
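The one guarantee mentioned above, that the length is half the UTF-16 byte length, can be checked with a small sketch. The sample strings are illustrative: the decomposed spelling of 한글 and the clown emoji.

```javascript
const decomposedHangul = '\ud55c\uae00'.normalize('NFD'); // 한글, decomposed
const clownFace = '\u{1F921}';

for (const str of [decomposedHangul, clownFace, 'Hello']) {
  const utf16Bytes = Buffer.byteLength(str, 'utf16le');
  // length counts UTF-16 code units, so this is always exactly 2x.
  console.log(str.length, utf16Bytes, utf16Bytes === 2 * str.length);
}
// 6 12 true   (displays as two wide characters)
// 2 4 true    (displays as one character)
// 5 10 true
```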

It doesn't give you the string width. In browsers you can work around that, and you will have to account for the font anyway. For CLI applications, that's a bit different. For example, Node.js has this built-in console.table tool, which does the same thing as it does in browsers. It automatically figures out that the emoji that I use here is wider than the other strings in that table. It figures out that the third column should be a bit larger than the second one. Basically, the only way to do this properly is to not do it yourself. You can use an npm module that does this, it's called string-width. It gives you the right thing. Str.length wouldn't. It would give you a very unpleasant break in that right part of the table.

Demo

I have these characters copied here because I don't know how to type them on my keyboard. I'm going to start Node, which is in this case the master version but it could also be Node 13. It has the same issue there. I just typed that str.length, and it works. The display is obviously broken in some way, and I don't quite know what is going on there. It's hard to tell them apart when they look the same. If you use the other variant, where they're composed, then it will do the right thing. It won't have this display black with the autocomplete.

If you're interested, and if this talk is really giving you the energy of, "I want to work on internationalization stuff and encoding issues," feel free to reach out to me and tell me if you want to look into this. I think it might be that we want to always call .normalize('NFC') before trying to compute the width of the string. I don't know if that's what's actually happening.

About The Binary Node.js Encoding

Node.js also has support for an encoding that you can pass that is called binary. I'm going to tell you why you should never use it. In the early days of JavaScript, we did not have Uint8Array. We did not necessarily have Node.js buffers in a browser environment, and we still wanted to work with binary data in some way. There are two approaches that you could take with that. Either you were like, I'm just going to use arrays of numbers between 0 and 255, and that is my binary data type in that case, or you were using strings. The idea was that you could use or abuse the code points 0 through 255 to represent the bytes 0 through 255, which gives you these kinds of strings where the characters are mostly garbage and do not have any meaning. Please don't do that anymore. Uint8Array or buffers, those are the solutions that we have for this problem.

If you think about it, the first 256 Unicode code points are exactly the ones from Latin-1, from ISO 8859-1. That is what the actual encoding name is. If you want to use Latin-1, then be explicit about that fact and don't just call it binary, because the name is really misleading. This is something that pops up on our issue trackers a lot, because people think that when they encode something with the encoding binary, it will take the text and convert it to binary. Usually, what you want to do for encoding text into binary data is use UTF-8. Binary is not a good name for that, because all character encodings do that: they all convert text to binary data. Because of the way that Latin-1 is implemented in Node.js, if you pass in characters that don't fit into Latin-1, it will just truncate off the first half of that, the higher 8 bits of that character.

What you can actually end up with, and this is a problem that I have seen in the wild, is that you can have two Unicode strings which map to the same byte sequence when encoded with this binary or Latin-1 encoding. That's not good, for example, because when you want to hash data, or hash a string, you would usually want to hash its entire contents and make sure that it's unique. If two strings are mapped to the same bytes before they are hashed, then you will get two strings that map to the same hash if you use the binary encoding. Really, not something you want to do. I don't know how many of you use Python a lot in your daily lives. This is one of the things that Python 2 didn't really get right and that everybody tried to fix with Python 3. If you use Python a lot, you already know this. The original Python 2 string type was equivalent to these binary strings, and that really didn't work out the way that anybody wanted it to. I think the only use case that currently still remains for these is the atob and btoa methods in browsers that do Base64 conversion, because they do work with binary strings. That is because they simply predate Uint8Array. There's no other reason for that. You can basically consider them legacy APIs at this point.
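The collision can be sketched with Node.js's `crypto` module. The two sample characters are illustrative: "A" (U+0041) and "Ł" (U+0141) differ only in the upper 8 bits, which the Latin-1 (or "binary") encoding drops.

```javascript
const crypto = require('crypto');

const plain = 'A';       // U+0041
const other = '\u0141';  // Ł, U+0141: same lower 8 bits as 'A'

// Both strings encode to the single byte 0x41 under latin1.
const plainBytes = Buffer.from(plain, 'latin1');
const otherBytes = Buffer.from(other, 'latin1');
console.log(plainBytes.equals(otherBytes)); // true

// Hashing with the latin1 input encoding therefore collides...
const hash = (str, enc) => crypto.createHash('sha256').update(str, enc).digest('hex');
console.log(hash(plain, 'latin1') === hash(other, 'latin1')); // true

// ...while UTF-8 keeps the two strings distinct.
console.log(hash(plain, 'utf8') === hash(other, 'utf8')); // false
```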

As for the character encodings that Node.js itself supports: first, there's ASCII, and I really don't know why we do that. Because the way that ASCII works in Node.js, when you encode with it, it will do the same thing as encoding using Latin-1, but it will cut off the highest bit for Latin-1 characters that are not ASCII, which simply gives you another character that is ASCII but that has no relation to the original one. It basically does the same thing as encoding using Latin-1, but with some extra steps. That's definitely something in Node.js that we regret and that we're stuck with for the rest of eternity. We have UTF-8 and UTF-16 little endian, because most modern processors are little endian based, so that's a bit easier to use there, and also because V8 already gives us a way to deal with UTF-16 and UTF-8. We can already use V8 facilities. We don't have to implement the encoding or decoding mechanisms ourselves in Node.js core. There is an alias for UTF-16, which is called UCS-2. That is the legacy name for UTF-16 in a form that doesn't support characters outside of the basic 65,000 range at all, which is basically why you always want to use full UTF-16 at this point. Node.js does support big endian machines, but on those it actually does manually reverse the byte order when it encodes or decodes using UTF-16.

We have Latin-1, which is equivalent to binary, which you should never use. We have Base64 and hex, and this can be a bit confusing because those are not like the others. For character encodings, the general thing you want to do is encode text as bytes. String to bytes is encoding and the reverse is decoding. For Base64 and hex, it's the reverse situation. Those are binary-to-text encodings, which means that string to bytes is actually decoding and the reverse is encoding. That should usually not be an issue. It can be a bit confusing, because it means that Buffer.from and buf.toString can each either encode or decode, depending on what encoding parameter you use.
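The reversed direction for hex and Base64 can be sketched as follows; the sample text is illustrative.

```javascript
// Character encoding: string -> bytes is "encoding".
const helloBytes = Buffer.from('Hello', 'utf8');

// Binary-to-text encodings: bytes -> string is "encoding" here,
// so toString() is the encoding step.
const asHex = helloBytes.toString('hex');
const asBase64 = helloBytes.toString('base64');
console.log(asHex);    // 48656c6c6f
console.log(asBase64); // SGVsbG8=

// And Buffer.from() with 'hex' or 'base64' is the decoding step.
const fromHex = Buffer.from(asHex, 'hex').toString('utf8');
const fromBase64 = Buffer.from(asBase64, 'base64').toString('utf8');
console.log(fromHex, fromBase64); // Hello Hello
```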

Everybody Uses UTF-8 Now

In the end, everybody uses UTF-8 now anyway, but there are still some issues that you can run into even if you don't worry about this all too much. Sometimes, obviously, legacy code and legacy websites exist. Sometimes people don't know that they don't use UTF-8. This is sometimes misuse of the binary encoding, or sometimes they don't think about it. Sometimes they just notice that buf.toString gives them a string, and they don't care that it's decoded using UTF-8. The Node.js file system API supports buffers for the pathnames, which might be interesting because you would usually expect that for a pathname, you pass in a string. This was actually added because our company had a client that did mix multiple encodings in its directory pathnames. You would have a directory whose name was encoded using one encoding, and another directory inside that whose name was encoded using a different encoding. Again, legacy code. That was actually motivated by a real use case: you can figure out how to encode the path into bytes yourself, and then submit that buffer to the Node file system API.

In the end, Node.js interacts with the operating system purely based on bytes too; there's no way to pass in strings or anything that conceptually consists of characters. When it talks to the operating system, that's always in bytes, whether that's for file paths, or for writing data to a network socket or something. There are never any characters involved. There's always an encoding step in between. Also, I don't know how many of you use the native C or C++ Windows APIs; they are big fans of UTF-16. They made the wrong bet on that before UTF-8 was really popular, I think. Most Windows methods support one ASCII mode, and one UTF-16 mode. Even when you use UTF-8, things can still go wrong. Even the QCon website didn't accept the talk title at first. I had to manually edit the replacement character in the middle.
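As a small editorial illustration of that encoding step, the sketch below compares how many bytes the same string becomes under different encodings, using Buffer.byteLength. The sample message is an assumption chosen for the example.

```javascript
const message = 'Grüße, 世界';

// What JavaScript sees: UTF-16 code units.
const codeUnits = message.length;

// What the operating system would receive depends on the encoding step.
const utf8Size = Buffer.byteLength(message, 'utf8');
const utf16Size = Buffer.byteLength(message, 'utf16le');

console.log({ codeUnits, utf8Size, utf16Size });

// The bytes themselves, as they would be written to a file or socket.
console.log(Buffer.from(message, 'utf8'));
```

This is why sizes such as a Content-Length value have to be computed from the encoded bytes and not from the string length.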

Why are we all using UTF-8 anyway? There are a number of reasons why it's nice to use UTF-8, but one big reason is that it's backwards compatible with ASCII. You can write code for ASCII and it will work 99% of the time with UTF-8 characters, which is really your goal in the end. You don't want to have to rewrite basically all software to get to a state where you have a nice encoding. I truly believe that if we really all had made the switch, for example, from ASCII to UTF-16 in our applications, we would have a lot fewer issues with encodings because it would have been a clean break. Any time you write new code that uses UTF-16, you would have to think about how this affects your application. The world as it is, is a big fan of backwards compatibility, where you can write code the way you're used to and not have to think about this.
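The backwards compatibility can be sketched as follows (an editorial example, not from the talk). The helper `splitOnByte` is a hypothetical stand-in for ASCII-era code that only looks at single bytes. It keeps working on UTF-8 input because every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so an ASCII byte such as the comma never appears inside another character.

```javascript
// Hypothetical ASCII-era helper: split a buffer on a single byte value.
function splitOnByte(buf, byte) {
  const parts = [];
  let start = 0;
  let index;
  while ((index = buf.indexOf(byte, start)) !== -1) {
    parts.push(buf.subarray(start, index));
    start = index + 1;
  }
  parts.push(buf.subarray(start));
  return parts;
}

// Pure ASCII text produces identical bytes under both encodings.
const sameBytes = Buffer.from('Hello', 'ascii').equals(Buffer.from('Hello', 'utf8'));

// The byte-oriented split still finds the right field boundaries in UTF-8 data.
const csvBytes = Buffer.from('naïve,café,日本語', 'utf8');
const fields = splitOnByte(csvBytes, 0x2c).map((part) => part.toString('utf8'));

console.log(sameBytes, fields);
```

The remaining 1% is code that assumes one byte per character, for example when truncating at a fixed byte offset, which can cut a multi-byte sequence in half.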

Resources

One thing that I used a lot during this talk is the unicode command. You can do pretty cool stuff with that. You can give it a string and it will tell you the Unicode code points, what it looks like when you encode it in some ways using UTF-8 or UTF-16, what categories it's in, all that stuff. You can also do the reverse thing. You can get all the cat faces by grepping for it basically, using the cat face facial expression. I didn't know there were that many. It's a super handy tool if you're dealing with Unicode characters in general. There's also iconv, which is a character encoding conversion tool. I can say, 'Hi,' I can pipe it in there. This won't work because my terminal is configured to use UTF-8, as everybody's terminal should be. For example, we can hexdump it, then you'll see that it does prepend this very special character, ff fe, to make clear that this is the byte order mark, this is the byte order that we're going to use. In this case, it picked little endian. This is the actual string when encoded as UTF-16. Then there's a new line afterwards. Iconv is great. There are also Node bindings for it if you really need to deal with grit character encodings. I have a list of MDN pages that are pretty useful, but you would be able to Google them yourselves if you want to learn about something. There's also the Node.js Buffer API docs. I did go through them a couple of times. I definitely picked up a lot of things that should be explained a lot better.
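The bytes described in that demo can be reproduced in Node.js. This is an editorial sketch that builds the byte order mark by hand instead of calling iconv; the variable names are assumptions.

```javascript
// ff fe is the byte order mark U+FEFF written in little endian order.
const bom = Buffer.from([0xff, 0xfe]);

// 'Hi' plus a newline, encoded as UTF-16 little endian, with the BOM in front.
const withBom = Buffer.concat([bom, Buffer.from('Hi\n', 'utf16le')]);
const hexDump = withBom.toString('hex');
console.log(hexDump);

// buffer.toString keeps the BOM as a U+FEFF character at the start ...
const rawDecoded = withBom.toString('utf16le');
console.log(rawDecoded.length);

// ... while TextDecoder strips it by default (ignoreBOM is false).
const cleanDecoded = new TextDecoder('utf-16le').decode(withBom);
console.log(JSON.stringify(cleanDecoded));
```

Which of the two decoders fits depends on whether the application wants to see the BOM or have it consumed as a signal.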

Questions and Answers

Participant 1: Let's say I'm about to create a site where people from all over the world can add comments and do this in their own language, and I will ensure that no question marks appear on the screen. Should I use UTF-32 then, or what should I use?

Henningsen: At the points where you need to deal with encodings, you should always use UTF-8. UTF-8 does support the full range of Unicode characters. It supports exactly the same ones as UTF-32 or UTF-16. The issues that pop up are basically the question mark issues that we place in [inaudible 00:50:39]. They only pop up when you mix different encodings. Because everybody is basically standardizing on UTF-8 right now, you will want to use that for everything.

Participant 1: If I store the comment in 16, then I could certainly interpret [inaudible 00:50:56].

Henningsen: If you mix those, if you decode with one and encode with another, that's not going to work out.
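What mixing encodings looks like in practice can be sketched as follows (an editorial example with made-up variable names): encoding with one encoding and decoding with another produces either garbled characters or the replacement character.

```javascript
const original = 'ü'; // U+00FC

// Encoded as UTF-8 (bytes c3 bc) but decoded as Latin-1: two wrong characters.
const mojibake = Buffer.from(original, 'utf8').toString('latin1');

// Encoded as Latin-1 (byte fc) but decoded as UTF-8: the byte is not valid
// UTF-8, so the decoder substitutes the replacement character U+FFFD.
const replaced = Buffer.from(original, 'latin1').toString('utf8');

// Same encoding on both sides: the text survives.
const roundTrip = Buffer.from(original, 'utf8').toString('utf8');

console.log({ mojibake, replaced, roundTrip });
```

Note that the second case is not reversible: once the replacement character has been substituted, the original byte is gone.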

Recorded at:

Jun 25, 2020

Anna Henningsen
