Course: JavaScript

JavaScript Strings - Unicode

Chapter 19 48 mins

Learning outcomes:

  1. What is Unicode and UTF-16
  2. The charCodeAt() and codePointAt() string methods
  3. The String.fromCharCode() and String.fromCodePoint() static methods
  4. How the length property works
  5. Unicode escape sequences
  6. Lexicographic comparison
  7. The string @@iterator() method

Introduction

In the previous chapter, we covered the basics of strings in JavaScript. Now, it's high time to dig a little deeper and figure out exactly how strings work under the hood in JavaScript.

In this chapter, we'll cover some more advanced material. Starting off with the character encoding format used by JavaScript, we'll dive into many technical details, such as exactly what the length property returns, how to determine a particular character's associated code unit, and what a string really is.

We'll also see things such as Unicode escape sequences and lexicographic comparison of strings, before ending with how to indirectly use the string @@iterator() method to work with strings without having to worry about variable-length characters.

This chapter holds a lot of significance in this unit on strings, as it'll make you much more confident with string processing and with JavaScript in general. You'll also be able to better appreciate the way strings work in other programming languages.

Without any further ado, let's get going.

Unicode and UTF-16

We already know that a string is a sequence of characters.

But what exactly is a 'character' here? Let's explore it.

Everything that happens inside the processor of a computer is calculation over mere numbers. String processing, image processing, audio processing, video processing — literally everything revolves around numbers.

In the case of characters of a string, each of them is associated with a particular number as well. This association is given by the character set in use.

A character set can be thought of as a table containing character names along with their associated numbers.

These numbers are sometimes also referred to as character codes. Strings, which are sequences of characters, are, in other words, sequences of these numbers.

Now there are numerous character sets known and even currently used across computers, each with its own legacy. ASCII, one of the most popular character sets to date, goes back to the early days of computing.

In ASCII, the lowercase 'a' from the English alphabet is associated with the number 97. The uppercase 'A' is associated with the number 65. Similarly, lowercase 'b' is 98, while uppercase 'B' is 66. The space character is 32.

As stated before, ASCII isn't the only character set known or used across computers. There are tons of them. For instance, another common character set used primarily on Windows is Windows-1252.

Accommodating multiple languages and symbols typically leads to multiple character sets. That's why we have so many of them. However, this is merely a legacy of the past when there was no single universal character set to turn to.

Then, fortunately, came in Unicode.

Unicode is a universal character set that contains characters and symbols from a humongous variety of languages. It covers 149,186 graphically-representable characters as of version 15.0 (year 2022) in addition to other control characters.

Clearly, it's impossible to show the complete mapping of this many characters on this page, so we limit ourselves to the first few.

Here's a list of Unicode's first 128 character mappings, excluding the first 32 control characters at codes 0 to 31 (line feeds, carriage returns, tabs, and so on):

Character name          Glyph  Code  Code (hex)  Character name          Glyph  Code  Code (hex)
Space                          32    0x20        Exclamation Mark        !      33    0x21
Quotation Mark          "      34    0x22        Number Sign             #      35    0x23
Dollar Sign             $      36    0x24        Percent Sign            %      37    0x25
Ampersand               &      38    0x26        Apostrophe              '      39    0x27
Left Parenthesis        (      40    0x28        Right Parenthesis       )      41    0x29
Asterisk                *      42    0x2A        Plus Sign               +      43    0x2B
Comma                   ,      44    0x2C        Hyphen-Minus            -      45    0x2D
Full Stop               .      46    0x2E        Solidus                 /      47    0x2F
Digit Zero              0      48    0x30        Digit One               1      49    0x31
Digit Two               2      50    0x32        Digit Three             3      51    0x33
Digit Four              4      52    0x34        Digit Five              5      53    0x35
Digit Six               6      54    0x36        Digit Seven             7      55    0x37
Digit Eight             8      56    0x38        Digit Nine              9      57    0x39
Colon                   :      58    0x3A        Semicolon               ;      59    0x3B
Less-Than Sign          <      60    0x3C        Equals Sign             =      61    0x3D
Greater-Than Sign       >      62    0x3E        Question Mark           ?      63    0x3F
Commercial At           @      64    0x40        Latin Capital Letter A  A      65    0x41
Latin Capital Letter B  B      66    0x42        Latin Capital Letter C  C      67    0x43
Latin Capital Letter D  D      68    0x44        Latin Capital Letter E  E      69    0x45
Latin Capital Letter F  F      70    0x46        Latin Capital Letter G  G      71    0x47
Latin Capital Letter H  H      72    0x48        Latin Capital Letter I  I      73    0x49
Latin Capital Letter J  J      74    0x4A        Latin Capital Letter K  K      75    0x4B
Latin Capital Letter L  L      76    0x4C        Latin Capital Letter M  M      77    0x4D
Latin Capital Letter N  N      78    0x4E        Latin Capital Letter O  O      79    0x4F
Latin Capital Letter P  P      80    0x50        Latin Capital Letter Q  Q      81    0x51
Latin Capital Letter R  R      82    0x52        Latin Capital Letter S  S      83    0x53
Latin Capital Letter T  T      84    0x54        Latin Capital Letter U  U      85    0x55
Latin Capital Letter V  V      86    0x56        Latin Capital Letter W  W      87    0x57
Latin Capital Letter X  X      88    0x58        Latin Capital Letter Y  Y      89    0x59
Latin Capital Letter Z  Z      90    0x5A        Left Square Bracket     [      91    0x5B
Reverse Solidus         \      92    0x5C        Right Square Bracket    ]      93    0x5D
Circumflex Accent       ^      94    0x5E        Low Line                _      95    0x5F
Grave Accent            `      96    0x60        Latin Small Letter A    a      97    0x61
Latin Small Letter B    b      98    0x62        Latin Small Letter C    c      99    0x63
Latin Small Letter D    d      100   0x64        Latin Small Letter E    e      101   0x65
Latin Small Letter F    f      102   0x66        Latin Small Letter G    g      103   0x67
Latin Small Letter H    h      104   0x68        Latin Small Letter I    i      105   0x69
Latin Small Letter J    j      106   0x6A        Latin Small Letter K    k      107   0x6B
Latin Small Letter L    l      108   0x6C        Latin Small Letter M    m      109   0x6D
Latin Small Letter N    n      110   0x6E        Latin Small Letter O    o      111   0x6F
Latin Small Letter P    p      112   0x70        Latin Small Letter Q    q      113   0x71
Latin Small Letter R    r      114   0x72        Latin Small Letter S    s      115   0x73
Latin Small Letter T    t      116   0x74        Latin Small Letter U    u      117   0x75
Latin Small Letter V    v      118   0x76        Latin Small Letter W    w      119   0x77
Latin Small Letter X    x      120   0x78        Latin Small Letter Y    y      121   0x79
Latin Small Letter Z    z      122   0x7A        Left Curly Bracket      {      123   0x7B
Vertical Line           |      124   0x7C        Right Curly Bracket     }      125   0x7D
Tilde                   ~      126   0x7E        Delete                         127   0x7F
Unicode table starting at code 32 and ending at code 127.

The number associated with every character in Unicode is often referred to as its code point.

To convert back and forth from characters to numbers (which is encoding a character) and numbers to characters (which is decoding a character), we ought to use some kind of encoding-decoding mechanism, also known as a character encoding scheme, or a character encoding format.

A character encoding scheme is a system of encoding and decoding characters on a computer.

One such scheme is Unicode Transformation Format - 32, commonly known as UTF-32. It uses 32 bits to represent every single character in the Unicode character set.

However, obviously, this has a big downside — wastage of memory. Needless to say, that's the reason why it's not very mainstream.

A better and more popular alternative is UTF-16.

In the UTF-16 encoding scheme, each character occupies at least 16 bits. A block of 16 bits is referred to as one code unit.

Undoubtedly, not all characters in Unicode can be accommodated in 16 bits; we have to increase the storage to accommodate them. That's why UTF-16 is variable-length, i.e. some characters span one code unit (16 bits), while others span two code units (32 bits).

Those that span two code units are given as a surrogate pair — a high surrogate value (the first code unit) followed by a low surrogate value (the second code unit) — in the Unicode terminology.

The thing is that certain code units (i.e. blocks of 16 bits) in UTF-16 are reserved as having special meanings — some are high surrogates while some are low surrogates. When a high surrogate is encountered by a UTF-16 decoder, it right away knows that the next code unit ought to be read as well in order to determine the current character.

Surrogates are needed in UTF-16 only because it is a variable-length encoding scheme. There has to be some way for the decoder to distinguish between a code unit that represents one single character and a code unit that has to be combined with the next code unit to represent one single character. Surrogates help it make this distinction.

Now exactly how these high and low surrogate values are melded together to obtain one single code point is purely an implementation detail of UTF-16, out of the scope of this chapter. To dig deeper into it, you can refer to FAQ - UTF-8, UTF-16, UTF-32 & BOM - Unicode.org.
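For the curious, the arithmetic itself is quite short. The following is only a rough sketch and not something JavaScript asks us to write ourselves: the function name combineSurrogates is made up for illustration, and it assumes that it is given a valid high surrogate (0xD800 to 0xDBFF) followed by a valid low surrogate (0xDC00 to 0xDFFF).

```javascript
// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
// Assumes `high` is in 0xD800-0xDBFF and `low` is in 0xDC00-0xDFFF.
function combineSurrogates(high, low) {
   // Each surrogate carries 10 bits of payload; together the pair
   // encodes (code point - 0x10000) as a 20-bit number.
   return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

console.log(combineSurrogates(0xD83D, 0xDE42).toString(16)); // 1f642
```

The subtraction of 0xD800 and 0xDC00 strips away the reserved surrogate prefixes, and the final addition of 0x10000 accounts for the fact that surrogate pairs are only used for code points beyond the 16-bit range.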

Coming all the way back to JavaScript, it uses the UTF-16 encoding scheme by default for its string type.

Hence, Unicode is the character set used. And in that way, each character in a given string in JavaScript occupies either 16 bits or 32 bits, depending on the character.

So from the perspective of JavaScript,

A string is a sequence of unsigned 16-bit integers.

In other words, a JavaScript string is a sequence of Unicode code units (which are unsigned 16-bit integers). This is an extremely crucial definition to keep in mind when working with strings.

The UTF-16 decoder (converting numbers to characters) reads a given string as an array of 16-bit integers and processes each code unit to determine the corresponding character to output.

As we learnt in the discussion above, not all code units represent characters on their own; some have to be linked with the following code unit to produce the final character. Do you remember what the first and last code units of such characters are called?

So finally, now we understand the meaning of a 'character' in JavaScript, i.e. it's an unsigned integer spanning 16 bits or 32 bits, based on the UTF-16 format.

Simple.

Now, let's look into the charCodeAt() method available on strings that helps us see these unsigned integers.

The charCodeAt() method

Thanks to the charCodeAt() string method, we can inspect the code unit (i.e. the number) associated with a particular character in a string.

string.charCodeAt([index])

Just pass in the index of the character whose code you wish to inspect, as the first index argument, and the method will return back an integer corresponding to that very character based on the UTF-16 format.

In case index is omitted, it defaults to 0.

Consider the code below:

var str = 'abc';

console.log(str.charCodeAt(0));
console.log(str.charCodeAt(1));
console.log(str.charCodeAt(2));
97 98 99

If we see in the Unicode table above, the character a has the decimal code 97, followed by b with the code 98, followed by c with 99. That's exactly what's returned by the successive str.charCodeAt() calls above.

If we pass in an index to charCodeAt() that doesn't exist in the given string, the method returns NaN.

This can be seen as follows:

var str = 'abc';

console.log(str.charCodeAt(3));
NaN

Clearly, there is no fourth character in 'abc', so str.charCodeAt(3) evaluates to NaN.

Now, let's try a slightly more involved example. Consider the following code:

var str = '🙂';

console.log(str.charCodeAt(0));
console.log(str.charCodeAt(1));

What do you think would be logged here? Or let's put it this way: what do you think would be logged by the second console.log() call (line 4 of the code)? Some integer or NaN?

Let's see it:

55357 56898

Now this is a surprise. We can only see one character in the string above, hence str.charCodeAt(1) should've returned NaN. However, that's NOT the case.

What exactly is going on here?

Well, it's only a matter of properly defining what charCodeAt() does.

At the start of this section, we said that charCodeAt() returns the code of a particular character in a given string. This is not that precise of a definition.

A much better one would be to say that charCodeAt(i) returns the ith code unit in the given string.

As we just learnt in the previous section, JavaScript treats a string as a sequence of code units (16-bit blocks). Regardless of whether a string uses two code units to denote one character or two code units to denote two separate characters, it consists of two units and hence we can individually access both of them.

Coming back to the code above, the 🙂 character is comprised of two code units that together represent a surrogate pair. The first code unit is the high surrogate with the value 55,357 while the second one is the low surrogate with the value 56,898. Together, these two code units form the 🙂 character in UTF-16.

This code unit–oriented, instead of character-oriented, treatment of strings doesn't just happen in charCodeAt() — it's essentially the very design of JavaScript.

For example, if we access the first or the second character of the string '🙂' using bracket notation, we get the same behavior:

var str = '🙂'
undefined
str[0]
'\uD83D'
str[1]
'\uDE42'

The string '\uD83D' returned by str[0] is a Unicode escape sequence (denoted by \u). It's simply used to denote a character in a string with a particular code, in hexadecimal. We'll explore it later on in this chapter.

The hexadecimal number D83D represents the decimal number 55,357 while DE42 represents 56,898, resembling the values emitted by the charCodeAt(0) and charCodeAt(1) calls above, respectively.
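We can confirm this correspondence in code as well. Below is a small sketch, where toHexCodeUnits is a made-up helper name (not a built-in method). It walks over every code unit of a string using charCodeAt() and formats each one in hexadecimal using the toString(16) number method.

```javascript
// Hypothetical helper: list every code unit of a string in hexadecimal.
function toHexCodeUnits(str) {
   var units = [];
   for (var i = 0; i < str.length; i++) {
      // charCodeAt(i) gives the i-th code unit; toString(16) formats it in hex.
      units.push(str.charCodeAt(i).toString(16).toUpperCase());
   }
   return units;
}

console.log(toHexCodeUnits('🙂'));  // ['D83D', 'DE42']
console.log(toHexCodeUnits('abc')); // ['61', '62', '63']
```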

If we know that the given string has a character with a code point beyond the 16-bit range (i.e. 65,535), represented using 2 code units, then we must use the codePointAt() method to retrieve the entire code point.

The codePointAt() method

The codePointAt() method is very similar to charCodeAt() in that it also allows us to inspect the codes of given characters in a string.

Here's its syntax:

string.codePointAt([index])

index is the index of the code unit to inspect in the given string. If omitted, it defaults to 0.

However, there is a slight, but very important, difference between codePointAt() and charCodeAt().

That is, if the inspected code unit represents a high surrogate value, codePointAt() combines the following low surrogate code unit along with the high surrogate one (the one inspected) to produce the complete code point.

Let's see this in action. Consider the following code:

var str = '🙂';

console.log(str.codePointAt(0));
console.log(str.codePointAt(1));
128578 56898

Notice the first log here. Since the character '🙂' consists of two code units, and in str.codePointAt(0) we inspect the first code unit which is a high surrogate, the method combines the next code unit (whose value is 56,898) along with the high surrogate to produce 128,578 (the code point of '🙂' in Unicode).

But also notice the second log.

If the inspected code unit doesn't represent a high surrogate, codePointAt() returns the unit as it is. Hence, for the string '🙂', both str.codePointAt(1) and str.charCodeAt(1) return the same thing.

Apart from this, if the inspected code unit doesn't exist in the given string, where charCodeAt() returns NaN, codePointAt() returns undefined.

This can be seen as follows:

var str = '🙂';

console.log(str.codePointAt(2));
undefined

The character '🙂' is made up of two code units, so there is no third code unit. Hence, codePointAt(2) returns undefined.
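Putting these rules together, here's one possible way to collect all the code points of a string using codePointAt(). It's only a sketch: codePointsOf is a hypothetical helper name, and the idea of stepping ahead by two code units whenever the returned code point is beyond 0xFFFF is our own choice of approach here.

```javascript
// Hypothetical helper: collect the code points of a string using codePointAt().
function codePointsOf(str) {
   var points = [];
   var i = 0;
   while (i < str.length) {
      var point = str.codePointAt(i);
      points.push(point);
      // A code point above 0xFFFF occupied two code units, so skip both.
      i += point > 0xFFFF ? 2 : 1;
   }
   return points;
}

console.log(codePointsOf('Smile 🙂'));
// [83, 109, 105, 108, 101, 32, 128578]
```

Without the extra step, the loop would land on the low surrogate at the next index and report it as a separate value, just like str.codePointAt(1) did above.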

The length property

Do you remember the length property of strings? It returns the number of 'characters' in a given string, also referred to as the length of the string.

Well, technically speaking, it's not quite precise to say that length is the number of 'characters' in a given string.

Precisely speaking:

The length property of a string returns the total number of code units in the given string.

Hence, the string '🙂' doesn't have a length of 1; rather, it has a length of 2, as '🙂' is comprised of two code units.

Consider the following snippet:

'abc'.length
3
'🙂'.length
2
'Smile 🙂'.length
8
'🙂🙁'.length
4
  • The string 'abc' is comprised of three code units (with three characters), hence the length 3.
  • '🙂' is comprised of two code units (with one character), hence the length 2.
  • 'Smile 🙂' has 8 code units (with 7 characters), hence the length 8.
  • '🙂🙁' has 4 code units (with 2 characters), hence the length 4.

Notice how the number of 'characters' above is NOT always the same as the return value of length (i.e. the number of code units).

If we want to obtain the total number of characters (not code units) in a given string, we have to simply go over the entire string, checking each individual code unit to determine if it represents a high surrogate value (and should therefore be combined with the next code unit to produce a single character).

A custom home-made function can do this as well, but fortunately JavaScript provides ready-made facilities to do so. We'll come to that in a while.
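Just to make the idea concrete, a home-made version could look something like the following. This is only a sketch under a couple of assumptions: countCharacters is a made-up name, and in addition to checking for a high surrogate, it also checks that the next code unit really is a low surrogate before treating the two as one character.

```javascript
// Hypothetical helper: count characters by skipping the low surrogate
// that follows each high surrogate.
function countCharacters(str) {
   var count = 0;
   for (var i = 0; i < str.length; i++) {
      var unit = str.charCodeAt(i);
      var isHigh = unit >= 0xD800 && unit <= 0xDBFF;
      if (isHigh && i + 1 < str.length) {
         var next = str.charCodeAt(i + 1);
         if (next >= 0xDC00 && next <= 0xDFFF) {
            i++; // the pair forms one character, so skip its second half
         }
      }
      count++;
   }
   return count;
}

console.log(countCharacters('abc'));      // 3
console.log(countCharacters('Smile 🙂')); // 7
console.log(countCharacters('🙂🙁'));     // 2
```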

To recap the main point of this section: length does NOT return the total number of characters in a string; instead it returns the total number of code units in the string, as a string is simply a sequence of code units.

Unicode escape sequences

In the discussion above, we saw a couple of escape sequences beginning with \u. Such sequences are commonly known as Unicode escape sequences.

The name follows from the fact that these sequences represent a particular Unicode code point or code unit. In fact, the 'u' in \u stands for 'Unicode'.

Unicode escape sequences are useful when we have the hexadecimal code of a particular character and want to showcase it.

This is obviously not the only way to go from the character's code to its visual representation — we can also use String.fromCharCode() and String.fromCodePoint(), given that the code is already in decimal representation, or converted to decimal representation from hexadecimal.
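As a brief illustration of these two static methods, consider the snippet below. String.fromCharCode() works with code units, whereas String.fromCodePoint() works with complete code points. The values used are the same ones we saw earlier for 'a' and '🙂'.

```javascript
// String.fromCharCode() builds a string from one or more code units.
console.log(String.fromCharCode(97));             // a
console.log(String.fromCharCode(0xD83D, 0xDE42)); // 🙂

// String.fromCodePoint() accepts complete code points.
console.log(String.fromCodePoint(0x1F642));       // 🙂

// Passing a code point above 0xFFFF to fromCharCode() truncates it to 16 bits.
console.log(String.fromCharCode(0x1F642) === '\uF642'); // true
```

The last statement shows why String.fromCodePoint() is the safer choice when a code might lie beyond the range of a single code unit.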

Now there are two variations of Unicode escape sequences:

  • One is used to denote a code point within the range of a single code unit.
  • The other is used to denote any arbitrary code point.

The general form of the first variation is as follows:

\uXXXX

A hexadecimal number follows \u, spanning exactly 4 digits (each denoted as X). The maximum can obviously be \uFFFF, which is the largest a single code unit can get.

In this form, it's mandatory to use four hexadecimal digits in the escape sequence. Any characters that follow are parsed as literal characters.

On the other hand, the general form of a sequence used to denote any arbitrary code point is as follows:

\u{X}
\u{XX}
...
\u{XXXXXX}

Once again, \u is followed by a hexadecimal number, but this time it's enclosed within a pair of curly braces ({}). Plus, the hexadecimal number can go beyond 4 digits, so long as its value doesn't exceed 10FFFF (a 6-digit number).

Purpose of curly braces ({}) here

The curly braces ({}) are introduced to remove ambiguity from the escape sequence.

For instance, imagine if we didn't have to use the braces. Given this situation, consider the string '\u1030A'. Now does this denote a single character with the code point 1030A, or a character with the code point 1030 followed by the literal character 'A'?

Ambiguous, right? This is what {} solves. With '\u{1030A}', we are completely sure that the string denotes a single character with the code point 1030A.

Anyways, let's take a look at some quick examples.

As a quick reference: '\n' has the hex code A, which can be written with leading zeroes in many ways, such as 0A, 00A, 000A and so on; ' ' (space) has the hex code 20; 'a' has the hex code 61; and 🙂 has the hex code 1F642.

Consider the snippet below where we represent these characters using Unicode escape sequences of the first variant (\uXXXX):

'\u000A'
'\n'
'\u0020'
' '
'\u0061'
'a'

The '🙂' character can't be represented in one go using this variant. That's simply because this emoticon consists of two code units, and \uXXXX is only capable of denoting code units.

If we have to represent '🙂' using \uXXXX exclusively, we ought to use two such escape sequences, corresponding to the two code units of the character, as shown below:

'\uD83D\uDE42'
'🙂'

Recall from the discussion near the start of this chapter that '🙂' is comprised of the following two code units: 55,357 (D83D in hex) and 56,898 (DE42 in hex). Accordingly, we use '\uD83D\uDE42' to represent these code units which together produce '🙂'.

Now, let's shift to the second variant (the one with curly braces):

'\u{A}'
'\n'
'\u{00A}'
'\n'
'\u{00000A}'
'\n'
'\u{20}'
' '
'\u{61}'
'a'
'\u{1F642}'
'🙂'

The last statement here is worth noting. We directly embed the code point associated with '🙂' (in Unicode) — that is, 1F642 in hex — inside the curly braces ({}). There's no need to break the code point down into individual code units, as we did previously.
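To convince ourselves that both variants produce exactly the same string, we can compare them directly. The variable names below are arbitrary and used only for this illustration.

```javascript
var viaCodeUnits = '\uD83D\uDE42'; // two \uXXXX escapes, one per code unit
var viaCodePoint = '\u{1F642}';    // one \u{...} escape for the whole code point

console.log(viaCodeUnits === viaCodePoint);            // true
console.log(viaCodePoint.length);                      // 2
console.log(viaCodePoint.codePointAt(0).toString(16)); // 1f642
```

Regardless of which escape sequence we use in the source code, the resulting string holds the same two code units.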

Note that whatever code point we have, we can add leading zeroes to it, as they don't change the value of the number. What matters is that the value itself stays within the range of Unicode.

Adding zeroes after the hexadecimal number can lead to the wrong character. Worse yet, exceeding the maximum range of the Unicode character set, which is the hexadecimal number 10FFFF, simply leads to an error.

Both of these cases are demonstrated below.

First, for adding zeroes at the end of the hexadecimal code:

'\u{61}'
'a'
'\u{0061}'
'a'
'\u{6100}'
'愀'

In the last statement, instead of adding zeroes before the hexadecimal number 61, we add them after it. The result is that the hexadecimal number gets changed and so we get the wrong character output.

And second, for exceeding the maximum range:

'\u{61}'
'a'
'\u{610000}'
Uncaught SyntaxError: Undefined Unicode code-point

Clearly, the number 610000 is beyond the maximum value of 10FFFF, so JavaScript throws an error.

Lexicographic comparison

We saw the relational operators <, <=, >, >=, === and !== being used on strings in the chapter JavaScript Operators. Now, it's time to dig deeper into how all of these work.

Each of the binary operators shown above performs what's called a lexicographic comparison of the given strings. This is more or less like the way a dictionary sorts words alphabetically.

In this comparison, one of the strings might be 'behind' the other, 'ahead' of the other, or 'identical' to the other.

But how exactly this comparison happens is the really interesting part.

Let's understand it.

How does lexicographic comparison happen?

Iteration is performed over both the given strings simultaneously, comparing corresponding code units with one another.

If at any point a mismatch is found, the string with the code unit whose value is greater than the corresponding code unit in the other string is termed as lexicographically larger than the other string; accordingly, the other string is termed as lexicographically smaller.

However, if no mismatch is found until either one of the strings has been exhausted by the iterations (i.e. its end reached), then at that point there are two possibilities:

  • The length of both the strings is the same. In that case, the strings are exactly identical to one another.
  • The length of both the strings is NOT the same. In that case, the string with a larger length is lexicographically larger than the other one; accordingly, the other string is the lexicographically smaller one.

And that's it.
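The steps above can be sketched in code as follows. Keep in mind that this is merely an illustration of the idea and not how JavaScript engines necessarily implement it internally; the function compareStrings and its return values (-1, 0 and 1) are our own invention.

```javascript
// Hypothetical helper mirroring the steps described above.
// Returns -1 if a is smaller, 1 if a is larger, and 0 if both are identical.
function compareStrings(a, b) {
   var shorter = Math.min(a.length, b.length);
   for (var i = 0; i < shorter; i++) {
      var unitA = a.charCodeAt(i);
      var unitB = b.charCodeAt(i);
      if (unitA !== unitB) {
         return unitA < unitB ? -1 : 1; // the first mismatch decides
      }
   }
   // No mismatch was found: the longer string (if any) is the larger one.
   if (a.length === b.length) return 0;
   return a.length < b.length ? -1 : 1;
}

console.log(compareStrings('a', 'b'));    // -1
console.log(compareStrings('a', 'A'));    // 1
console.log(compareStrings('ab', 'abc')); // -1
console.log(compareStrings('ab', 'ab'));  // 0
```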

Coming back to the operators, given two string variables a and b:

  • a < b says that 'a is lexicographically smaller than b'.
  • a > b says that 'a is lexicographically larger than b'.
  • a === b says that 'a is identical to b'.

The other relational operators, <=, >= and !==, can be derived from these three simple operators.

Anyways, time to consider some very quick examples:

First, with <:

'a' < 'b'
true
'a' < 'a'
false
'a' < 'A'
false
'ab' < 'abc'
true

In the third statement, since the code unit of 'a' (97 in decimal) is larger than that of 'A' (65 in decimal), 'a' < 'A' yields false.

In the last statement, since there is no mismatch in the first two code units of 'ab' and 'abc', the comparison comes down to the length of the strings, and since 'ab' has a smaller length, it's indeed lexicographically smaller than 'abc'. Hence, 'ab' < 'abc' yields true.

Now, with >:

'a' > 'b'
false
'a' > 'a'
false
'a' > 'A'
true
'ab' > 'abc'
false

And finally, with ===:

'a' === 'a'
true
'a' === 'A'
false
'ab' === 'abc'
false

With these extremely simple examples, let's now see whether you could determine the return values of some relational expressions — slightly complicated ones — on your own.

What does '🙂' < 'abcd' return?

  • true
  • false
Clearly, there is a mismatch between 'a' and the first code unit in '🙂'. Moreover, since the first code unit of '🙂' is larger in value, '🙂' is lexicographically larger than 'abcd'. Hence, '🙂' < 'abcd' yields false.

What does '🙂' === '\uD83D\uDE43' return?

Inspect the code units of 🙂 to determine the answer.

  • true
  • false
The first code unit of '🙂' is \uD83D. This is the same as the first code unit in '\uD83D\uDE43'. Hence, we move to the second one. This code unit in '🙂' is \uDE42, whereas the corresponding code unit in '\uD83D\uDE43' is \uDE43. Since this is a mismatch, the given strings are not identical to one another, and so the given expression is false.

What does '\uD83D\uDE90' > '\u{1F642}' return?

  • true
  • false
The string '\u{1F642}' can also be expressed as '\uD83D\uDE42'. Since there is a mismatch in the second code unit, and \uDE90 is the larger one, the string '\uD83D\uDE90' is lexicographically larger than '\u{1F642}'. Hence, the given expression returns true.

The string iterator

With ECMAScript 6, the idea of iterators was introduced in JavaScript. At the core, they're nothing more than a set of rules to implement in order to perform iteration over given sequences. But overall, they're a slightly technical and complicated matter.

Hence, we won't go over them in this unit; in fact, not even in this course. To learn more about iterators, you can refer to JavaScript Iterators — Introduction from our Advanced JavaScript course.

Anyways,

A string iterator is merely a way to iterate over all the characters in a string.

Note the emphasis on 'character' here — the iterator distinguishes between 'characters', NOT code units. And that's a great thing.

To perform iteration over a string using the string iterator, there are a couple of ways:

  • Use the @@iterator string method (@@iterator means that we are referring to the string method whose key is stored in Symbol.iterator).
  • Use the for...of loop

We'll go with the former, but not calling the method directly. Rather, we'll use the spread operator (...) which converts a given sequence — a string in this case — into an array.

Consider the following code:

var str = 'abc';
var arr = [...str];

console.log(arr);
['a', 'b', 'c']

The most important part here is [...str]. The spread operator (...) invokes the @@iterator() method internally on str and dumps all the individual characters of the string into the array.

What's remarkable about this method, as stated before, is that it distinguishes between individual 'characters'.

This means that we don't need to worry about high surrogate and low surrogate code units — the method will do all the heavy-lifting on its own.

As an illustration, consider the following:

var str = 'Smile 🙂';

console.log(str.length);
console.log([...str].length);
8 7

str.length returns the total number of code units in str, which are 8 in total (6 for 'Smile ' and 2 for '🙂'). However, [...str].length returns the number of elements in the array [...str] which altogether has 7 elements.

Let's inspect the [...str] array here to see what it actually contains:

[...str]
['S', 'm', 'i', 'l', 'e', ' ', '🙂']

As you can see, [...str] contains all the individual characters of str; most importantly, the '🙂' character is denoted as a single character and not as two separate characters in the array.

Isn't this string method amazing?

Another good thing about this approach is that we can also utilize it to access a particular character of the given string.

For instance, given the same string str shown above, we can access its first and last characters as follows:

var chars = [...str]
undefined
chars[0]
'S'
chars[chars.length - 1]
'🙂'
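The for...of loop, mentioned earlier as the second way of using the string iterator, behaves in the same manner. Here's a short sketch of it; the variable names ch and collected are arbitrary.

```javascript
var str = 'Smile 🙂';
var collected = [];

// for...of uses the string iterator, so each step yields a whole character.
for (var ch of str) {
   collected.push(ch);
}

console.log(collected.length);                // 7
console.log(collected[collected.length - 1]); // 🙂
```

Just like with [...str], the '🙂' character arrives in one piece, instead of being split into its two code units.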

And that's it for this chapter.

"I created Codeguage to save you from falling into the same learning conundrums that I fell into."

— Bilal Adnan, Founder of Codeguage