Mathias Bynens: JavaScript ♥ Unicode
This presentation explains the various ways in which JavaScript relies on Unicode, what the consequences are for JavaScript developers, and how ECMAScript 6 will make our lives a bit easier in this regard.
First off, the basics of Unicode are explained. Once that’s out of the way, I’ll talk a little bit about different character encodings, only to determine the character encoding that JavaScript uses internally.
Then we’ll explore the various consequences of JavaScript exposing “characters” according to UCS-2/UTF-16, and I’ll explain why it can be problematic.
Finally, I’ll present robust ECMAScript 5-compatible workarounds to the issues encountered, and explain how ECMAScript 6 will make it easier to support full Unicode in JavaScript strings and regular expressions.
Transcript
>> I'm going to talk about three of my favorite things: Unicode, JavaScript, and gangster rap. I work for Opera Software in Developer Relations, and I like to collaborate on open source projects. One example of such a project is Punycode.js, which I wrote a couple of years ago, and nowadays it's part of Node.js itself. It ships with Node, so if you want to play around with it, you can just require it and start using it. You don't have to `npm install` anything. The Punycode algorithm is what browsers and other software that deals with URLs use to convert internationalized domain names like these into pure ASCII forms, so they can still be used at the DNS level. Now, I wanted to write a JavaScript implementation of that. And I always thought it would be really simple, because the algorithm is well-defined in an RFC. I figured I would port that algorithm to JavaScript and that would be it. But while I was working on the actual implementation I ran into lots of new things, things I didn't know about JavaScript, and that's what this talk will be about, basically.
But first, does anyone know who this guy is? Yeah, of course, everyone knows it's Jay-Z. He's a famous American rapper. But a little-known fact about him is that he decided to write a JavaScript program once, and instead he ended up writing a song about his frustration with the language. You may have heard of it before; it's called "99 Problems." Now, don't get me wrong, I love JavaScript, even though I'm making fun of it here. Just like with any other programming language, if you don't know the entire language by heart, and really, who does, then you're going to be surprised. That's just what life as a programmer is like. That's why we have phpwtf.org, which neatly lists all of the weird quirks in PHP, and similarly wtfjs.com for JavaScript. Everything on these websites can be explained if you look at the specification, the documentation, or the implementation.
But even then, if a particular behavior of a language is confusing to a lot of people, that is a bit of a problem. And I think the way JavaScript handles Unicode is, well, very surprising, to say the least. But before we get into that, let's talk about Unicode itself, just to make sure we're all on the same page. I'm going to tell you the absolute minimum about Unicode that you need in order to work with strings in JavaScript correctly, and nothing else.
It's easiest to think of Unicode as a database that maps any symbol you can think of to a unique number, its code point, and a unique canonical name. That way it's easy to refer to any specific symbol by using its unique name or code point; you don't have to use the symbol itself to talk about the symbol. For example, Unicode maps the Latin capital letter A to U+0041. This is a hexadecimal number, usually written with at least four hexadecimal digits like this. Another example is the Latin small letter a, a different letter altogether, so it gets its own code point and its own canonical name. There's a lot of stuff in Unicode, and each symbol gets its own code point. There's a lot of weird stuff too: there's a snowman symbol in Unicode, and I'm not sure why you would need it. And there's everyone's favorite character, the pile of poo; that's the canonical name for the symbol. You can see it's U+1F4A9. You may be wondering at this point how many code points there are, and what the highest possible code point value in Unicode is. Well, the possible code point values range from 0 to 10FFFF, which is over one million possible symbols. To keep things organized, Unicode divides this range of code points into 17 planes that consist of about 65,000 code points each.
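As a rough illustration of the numbers above, the sketch below formats a code point in the usual U+ notation and works out which of the 17 planes it falls in. The helper names `formatCodePoint` and `planeOf` are made up for this example; they are not part of any library mentioned in the talk.

```javascript
// Hypothetical helper: formats a code point as U+XXXX, padded to
// at least four hexadecimal digits.
function formatCodePoint(codePoint) {
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

// Hypothetical helper: returns the plane number (0 to 16) of a code point.
// Each plane holds 0x10000 (65,536) code points, so the plane is simply
// the code point divided by 0x10000, rounded down.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a valid Unicode code point: ' + codePoint);
  }
  return codePoint >> 16;
}

console.log(formatCodePoint(0x41), planeOf(0x41));       // U+0041 0
console.log(formatCodePoint(0x1F4A9), planeOf(0x1F4A9)); // U+1F4A9 1
console.log(planeOf(0x10FFFF));                          // 16
```

Plane 0 is the first plane; anything in planes 1 through 16 needs more than four hexadecimal digits to write down.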
The first of these planes is called the Basic Multilingual Plane, or BMP, and it's the most important one because it contains all the most commonly used symbols. Most of the time you don't need code points outside of the BMP to write a text in English, Spanish, or German. Just like any other plane, it contains about 65,000 symbols. There's a name for the other planes too: they're called supplementary planes, or astral planes, and they total up to about one million other code points, the vast majority of the Unicode code point space. Now, astral code points are easy to recognize, because whenever you need more than four hexadecimal digits to represent a code point, it means it's an astral code point.
So, now that we have a basic understanding of Unicode, let's see how it applies to JavaScript strings. There are hexadecimal escapes, which start with backslash x followed by two hexadecimal digits. Now that you know a bit about Unicode, you know they refer to Unicode code points. This is another way to represent the string ABC in all caps, and this is the way to represent the string abc in lowercase. This is useful if you have weird characters that are maybe hard to type on your keyboard, or if you want to avoid encoding issues: if you save the file with a different encoding, you can use escape sequences to represent these characters. This is useful, but we're still limited to two hexadecimal digits, which means these escapes can only be used for code points up to U+00FF. There's another type of escape, called a Unicode escape, which starts with backslash followed by u and four hexadecimal digits. We can write the same escapes with four digits, but also ones like 2661, which is the code point for the white heart symbol. This makes it possible to escape a lot more code points, all of those in the BMP, actually. But what about all the other planes? What about those astral code points? We need more than four hexadecimal digits for them, so how do we escape them? What about the pile of poo and other equally important astral symbols? Of course we can actually escape those, but it's kind of, sort of complicated.
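The two escape types just described look like this in code. This is only a small sketch; the variable names are chosen for this example.

```javascript
// Hexadecimal escapes: backslash x plus exactly two hex digits (up to U+00FF).
const upperHex = '\x41\x42\x43';
const lowerHex = '\x61\x62\x63';

// Unicode escapes: backslash u plus exactly four hex digits (the whole BMP).
const upperUnicode = '\u0041\u0042\u0043';
const whiteHeart = '\u2661';

console.log(upperHex === 'ABC');     // true
console.log(lowerHex === 'abc');     // true
console.log(upperUnicode === 'ABC'); // true
console.log(whiteHeart.length);      // 1
console.log(whiteHeart.charCodeAt(0).toString(16)); // 2661
```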
In ES6 it will, in fact, be really easy, because you'll have these things called Unicode code point escapes: backslash u followed by braces containing up to six hexadecimal digits, which is enough to represent any Unicode code point. You can simply escape any Unicode symbol based on its code point; it couldn't be easier. If you need something that works today in ES5, the unfortunate solution is to use surrogate pairs. Each escape represents the code point of a surrogate half, and only when the two halves are combined do they form a single symbol. The surrogate code points don't look anything like the original code point. Now, there are formulas that you can use to calculate the surrogates based on a given astral code point. Here's a JavaScript implementation. You don't have to learn it by heart; just know that it exists if you want to deal with astral symbols in JavaScript strings. Anyway, the whole concept of using a single escape to represent BMP symbols and two separate escapes to represent astral symbols is a bit confusing, and it has a lot of annoying consequences throughout the language.
For example, say you want to count the number of symbols in a string. My first thought would be to use the string's length property. In these examples, indeed, it reflects the number of characters in the string. If you escape the characters in the string and count the number of escape sequences, we see there's only one escape sequence in each of these strings. However, let's try some other similar-looking characters. This is the mathematical bold capital letter A, and the mathematical bold capital letter B. They're different code points, and in this case they're astral code points. Because of that, they have a length of two in JavaScript rather than one, even though there's only a single symbol there.
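The surrogate formulas mentioned above can be sketched roughly as follows. This is not the code from the slides; `toSurrogates` and `fromSurrogates` are hypothetical names, and the arithmetic is the standard UTF-16 mapping for code points above U+FFFF.

```javascript
// Hypothetical helper: splits an astral code point (U+10000 to U+10FFFF)
// into its high and low surrogate halves.
function toSurrogates(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: the reverse mapping, from a surrogate pair
// back to the astral code point.
function fromSurrogates(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const [high, low] = toSurrogates(0x1F4A9); // pile of poo
console.log(high.toString(16), low.toString(16));      // d83d dca9
console.log(fromSurrogates(high, low).toString(16));   // 1f4a9

// The ES6 code point escape and the ES5 surrogate pair are the same string.
console.log('\u{1F4A9}' === '\uD83D\uDCA9'); // true

// Astral symbols report a length of two.
console.log('\u{1D400}'.length); // 2 (mathematical bold capital A)
```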
What JavaScript does is it kind of exposes these surrogate halves as if they were separate characters, and this is really confusing, because we as human beings think in terms of Unicode characters, or even graphemes. There's a very obvious joke to be made here, but instead I'm going to show you a real-world example where this can actually break things. This is Twitter a while ago. It allows 140 characters per tweet, and their back end doesn't mind what kind of symbol it is, astral or not; they all count as one. Now, at some point they had a bug in their front end: the JavaScript simply read out the string length without accounting for surrogate pairs, so the counter would decrease by steps of two with each astral symbol. It wasn't possible to tweet more than 70 piles of poo at a time. It was absolutely terrible. Another example is Countable.js, which counts the number of words, letters, and paragraphs and displays the numbers elsewhere on the page. I entered a pile of poo and it counted it as two characters instead of one.
These are just some examples, and this is an honest mistake to make, because the behavior is surprising. If you're writing a JavaScript library that involves strings in any way whatsoever, you have to make it work with all symbols, not just those in the BMP. Sooner or later one of your users is going to enter one of those emoji or other astral symbols. The easiest way to verify is to test it with astral symbols: throw some piles of poo in there and see what happens. This is what it felt like to enter that pile of poo; I just kind of knew that something would likely go wrong, you know. So, yeah. That's not good. Getting back to our question: how can we accurately count the number of symbols in a JavaScript string, or the number of code points? One thing you could use is a third-party library like Punycode.js.
Its ucs2.decode method takes a string and returns an array of code points, one for each symbol. You then get the length of the resulting array rather than of the string directly, and this gives a more accurate result. No matter what you're doing, you have to account for surrogate halves in JavaScript when you're dealing with strings like that. You don't have a choice; you have to do this. There's a simpler solution for this in ES6: you can use Array.from, pass in the string, and that returns an array of strings, one for each symbol. This uses the ES6 string iterator, and the iterator deals with whole code points whenever possible, so you don't have to deal with the surrogates yourself anymore. This is not the most efficient solution to this problem; you could use a regular expression if you need to count the characters more efficiently, but we'll get to that later. Now, I've made a tool that takes any string as input and shows the escape sequences for the symbols within the string, so it's kind of like a hex dump, which is sometimes more useful. With this tool, it's really easy to tell which Unicode characters the string contains exactly, even if the symbols are non-printable or just whitespace.
Now, depending on your use case, it's not that simple. We know how to count code points correctly, but if we're being really pedantic, counting the symbols in a string is different. Take these two strings: they look exactly the same, but they don't have the same length. Visually there's no way to tell that they're different. What's going on? I took the two strings and copied them into the tool I just showed you, and this is what I got. The first string uses the escape \xF1 for the n character with the tilde, and the second one uses a plain n followed by \u0303, the code point for the combining tilde. Combining marks just get applied to the previous symbol. We don't care about that: if we want to count the number of symbols in the string, we expect six. So how can we make that happen?
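The Array.from approach and the two look-alike strings described above can be sketched as follows. The function name `countSymbols` is chosen for this example; it counts code points, which is why the string with the combining tilde still comes out one too high.

```javascript
// Counts code points rather than UTF-16 code units, using the ES6 string
// iterator that Array.from relies on.
function countSymbols(string) {
  return Array.from(string).length;
}

const poo = '\uD83D\uDCA9';        // pile of poo as a surrogate pair
console.log(poo.length);           // 2
console.log(countSymbols(poo));    // 1

const composed = 'ma\xF1ana';      // n with tilde as a single code point
const decomposed = 'man\u0303ana'; // plain n followed by a combining tilde
console.log(composed === decomposed);  // false
console.log(countSymbols(composed));   // 6
console.log(countSymbols(decomposed)); // 7
```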
We can use Unicode normalization, another new feature in ES6. It ships today in Chrome and Opera; if you want to use it in other browsers, use a polyfill called unorm. Before you account for these astral symbols, you call normalize on the string and pass in the normalization form you would like to use; this effectively gets rid of the look-alike symbols. Now, is this solution perfect? Well, it kind of depends on whether you need to support stuff like this, in which case I count nine different symbols. This is basically a string with lots of combining marks that get applied to the previous character; that's why it looks weird. I count nine, but our pedantic count is 16. If you need to support strings like that, it's best to use a regular expression that removes the combining marks and then gets the length just like we did before. We'll get to that later.
Another example is reversing a string in JavaScript. It seems very simple: if you Google for it, you'll end up with a solution that looks like this. Take the string, split it into an array, reverse the array, and put it back together into a string. And indeed this reverses the string abc correctly. Let's try mañana: here we're reversing again, and the reverse function seems to work as expected. But if we try the other mañana, where the n and the tilde are two separate characters, we see the tilde gets applied to the letter a instead of the n. Also, if we try it on an astral symbol like the pile of poo, which we know consists of two surrogate halves, the result is completely unusable: the surrogate halves are in the wrong order. This is data loss. Reversing a string becomes really tricky if you need full Unicode support in JavaScript. Lucky for us, a brilliant computer scientist called Missy Elliott came up with a brilliant algorithm that accounts for this issue. It goes: I put my thing down, flip it, and reverse it.
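A minimal sketch of both points above, assuming an engine that supports String.prototype.normalize. The names `countSymbolsNormalized` and `naiveReverse` are made up for this example, and NFC is just one of the normalization forms you could pass in. The reverse function is deliberately the naive one, to show where it goes wrong.

```javascript
// Normalizes to NFC first, so that n + combining tilde collapses into the
// single precomposed code point before counting.
function countSymbolsNormalized(string) {
  return Array.from(string.normalize('NFC')).length;
}

console.log(countSymbolsNormalized('ma\xF1ana'));    // 6
console.log(countSymbolsNormalized('man\u0303ana')); // 6

// The commonly suggested reverse: split into code units, reverse, join.
function naiveReverse(string) {
  return string.split('').reverse().join('');
}

console.log(naiveReverse('abc')); // cba

// The combining tilde ends up after an a instead of after the n.
console.log(naiveReverse('man\u0303ana') === 'ana\u0303nam'); // true

// The surrogate halves of an astral symbol end up in the wrong order.
console.log(naiveReverse('\uD83D\uDCA9') === '\uDCA9\uD83D'); // true
```

Normalization only helps where a precomposed form exists; strings stacked with many combining marks still need the mark-stripping approach the speaker mentions.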
Indeed, by swapping the position of any combining marks with the symbol they belong to, as well as reversing any surrogate pairs before further processing the string, all the issues are avoided successfully. Thank you, Missy. Here's a JavaScript library for this; it's called Esrever. If you need to reverse a string in JavaScript, feel free to use it.
These were some examples, and this behavior really affects all the string methods too. It's everywhere. For example, String.fromCharCode allows you to create a string based on a Unicode code point, but it only works for the BMP range. If you use it with an astral code point, you'll get a different result than you'd expect. To work around that, you have to implement those formulas that I showed you earlier, or you could use a third-party library, but that means introducing a dependency just to create a string out of a number, which really shouldn't be necessary. Luckily, in ES6 there's a solution for this in the form of a new method, String.fromCodePoint; this works properly for all code points, including those of astral symbols. Similarly, if you use charAt to retrieve the first symbol in the string containing the pile of poo character, you'll only get the first surrogate half instead of the whole symbol. There is no solution for this in ES6, but there is a proposal to add String.prototype.at, which would do the same thing as charAt except it would deal with full symbols instead of surrogate halves whenever possible. If you use charCodeAt to retrieve the code point of the first symbol in the string, you'll get the code point of the first surrogate half rather than that of the whole pile of poo character. Now, in ES6 there's a new method called String.prototype.codePointAt, which deals with whole symbols instead of surrogate halves whenever possible. The links at the bottom point to polyfills you can start using today; it works in every browser that way. Now, another example: say you want to loop over every symbol in a string and do something with each symbol. You'd have to write a lot of boilerplate code just to account for surrogate pairs, to consider them as a single character.
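The differences between the older and newer string methods mentioned above can be seen in a short sketch like this one. It only uses methods that are part of ES6; String.prototype.at as proposed in the talk is left out because it is described there as a proposal only.

```javascript
// fromCharCode keeps only the lowest 16 bits, so 0x1F4A9 becomes U+F4A9.
const truncated = String.fromCharCode(0x1F4A9);
console.log(truncated === '\uF4A9'); // true

// fromCodePoint produces the full surrogate pair.
const poo = String.fromCodePoint(0x1F4A9);
console.log(poo === '\uD83D\uDCA9'); // true

// charAt and charCodeAt only see the first surrogate half.
console.log(poo.charAt(0) === '\uD83D');       // true
console.log(poo.charCodeAt(0).toString(16));   // d83d

// codePointAt returns the code point of the whole symbol.
console.log(poo.codePointAt(0).toString(16));  // 1f4a9
```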
Now, in ES6 this will be much, much easier, because you can just use for...of. This uses the new string iterator, and it deals with whole symbols instead of surrogate halves automatically. Now, this behavior affects pretty much all string methods, so be really careful when you're using them. It's really unfortunate, but as 2Pac would say, that's just the way it is; some things will never change.
Another similar problem lies with regular expressions. The dot operator in regular expressions only matches a single character, and because JavaScript exposes surrogate halves as separate characters, it won't ever match an astral symbol. I would expect an answer of true here, but it's actually false. What regular expression can we use to match any Unicode symbol? Any ideas? Well, we can't use the dot operator, as we've demonstrated, and it wouldn't match line breaks either. We can fix that by doing something like this, which matches any character, but it still wouldn't match astral symbols, because as far as JavaScript is concerned that is two characters. So what is the regular expression to match any Unicode code point? Like this, apparently. Yeah... So the first part of this regular expression matches any BMP symbol that is not part of a surrogate pair, and the second part matches a surrogate pair, in other words, an astral symbol. This is not the type of regular expression you would enjoy writing by hand, let alone maintaining, so I use a JavaScript library to generate these for me. It's called Regenerate. Here I create a new instance, I add the range of all Unicode code points, and then I call toString, which produces a string that can be used as a regular expression literal. It's supposed to be used as part of a build script. Here's another example, a bit more advanced. Start with all Unicode code points, remove a range of symbols based on their string values, then remove a single symbol, and then call toString on that to turn that set into a regular expression. And apparently that looks like this.
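To make the points above concrete, here is a small sketch of for...of iteration and of the regular expression problem. The pattern `anySymbol` is a simplified, hand-written version of the two-part idea the speaker describes (a BMP symbol that is not a surrogate, or a surrogate pair); it is not the output of Regenerate, and unlike a complete pattern it does not match lone surrogates.

```javascript
// for...of walks the string symbol by symbol, not code unit by code unit.
const symbols = [];
for (const symbol of 'a\u{1F4A9}b') {
  symbols.push(symbol);
}
console.log(symbols.length); // 3

const poo = '\u{1F4A9}';

// The dot matches one code unit, so it cannot match an astral symbol.
console.log(/^.$/.test(poo));      // false

// [\s\S] also matches line breaks, but still only one code unit.
console.log(/^[\s\S]$/.test('\n')); // true
console.log(/^[\s\S]$/.test(poo));  // false

// Simplified sketch: any non-surrogate BMP symbol, or a surrogate pair.
const anySymbol = /^(?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])$/;
console.log(anySymbol.test('a'));  // true
console.log(anySymbol.test('\n')); // true
console.log(anySymbol.test(poo));  // true
```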
So, once again, this is not a regular expression you would enjoy writing by hand or that you'd want to maintain. Imagine having to add just one code point to that set and rewriting that whole thing; it would probably take a very long time. It's much easier to change the five lines of code right above the regular expression, store that in a build script, and run the build script again after you modify it. Another example: say you need a regular expression that matches all Greek symbols in the Unicode standard. There are npm modules that contain the code points for each category, script, block, or property. If you combine these with Regenerate, it's easy to build complex Unicode-aware regular expressions. So I'm just requiring all the Greek symbols in Unicode 6.3.0 as an array, then passing that to Regenerate and turning that into a regular expression. You may have noticed I'm using the data for Unicode 6.3.0; just a few weeks ago Unicode 7 was released, so this is out of date. I just need to change a single line of code. Changing that regular expression manually would have taken a very long time; now all I have to do is change the Unicode version number and run the script again, and just like that an up-to-date regular expression comes out.
Getting back to the dot operator: ES6 introduces a new flag for regular expressions, the u flag. It stands for Unicode, and with it the dot matches whole astral symbols too, so it kind of makes regular expressions work the way you think they would work. Here's another example. In character classes, it's possible to use ranges, as I'm sure you know. You'd probably expect a-c to match a, b, and c. So similarly, you might expect this regular expression to match the pile of poo, the flexed biceps, and the dizzy symbol, right? That just makes sense, because that's the order of the code points. It doesn't work that way. If you run this code, it will throw an error on the very first line; it will say the regular expression is invalid because the range is out of order. So what's going on there?
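The u flag behavior and the out-of-order range described above can be reproduced with a sketch like the following. The range is built with the RegExp constructor rather than a literal, so that the error can be caught at run time instead of stopping the script; the variable names are chosen for this example.

```javascript
const poo = '\u{1F4A9}';

// Without the u flag the dot sees two code units; with it, one symbol.
console.log(/^.$/.test(poo));  // false
console.log(/^.$/u.test(poo)); // true

// A range from the pile of poo (U+1F4A9) to the dizzy symbol (U+1F4AB).
const rangeSource = '[\u{1F4A9}-\u{1F4AB}]';

// Without the u flag, constructing the pattern throws.
let errorName = null;
try {
  new RegExp(rangeSource);
} catch (error) {
  errorName = error.name;
}
console.log(errorName); // SyntaxError

// With the u flag, the same source is a range over whole code points,
// so it matches the flexed biceps (U+1F4AA) in between.
const emojiRange = new RegExp('^' + rangeSource + '$', 'u');
console.log(emojiRange.test('\u{1F4AA}')); // true
```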
Well, as far as JavaScript is concerned, this is what that regular expression looks like. So it's not trying to create a range between the two emoji; it's trying to create a range between the second surrogate half of the one character and the first surrogate half of the other character, which is not the right range, and because that range is not in the right order, it throws an error. This is really easy to fix: you can add the u flag, and that magically makes it work as you would expect. If you need something that works today in ES5, you're stuck with using a tool like Regenerate to build the regular expressions for you. But nowadays there's a better solution. I just released this; it's called regexpu, a compiler for regular expressions that use the u flag. It will look through your code for regular expressions with the u flag and rewrite them into equivalent ones that still work in ES5. Here's a demo of that. This is live translation in the browser, and you can use it in Node too. I've submitted a patch to Traceur, so you can start using this today: write regular expressions the ES6 way and compile them down to code that works today, without worrying about it.
So in summary, I think it's fair to say that JavaScript has a Unicode problem. Rather than complaining about it like Jay-Z did, we should probably be doing this instead: we should just deal with it and work around these issues, because it's honestly not that hard. All you have to do is add some piles of poo to your unit tests. There are a lot of talented JavaScript developers here; if we would all file a poo request for any library that we're using or have written, I think we can collectively discover lots of bugs and help get them fixed. It's not that developers don't want to fix these issues; I think the problem is they don't know about them, because it's such surprising behavior. Thank you.