Introduction
In the previous chapter, we covered the basics of strings in JavaScript. Now it's high time to dig a little deeper and figure out exactly how strings work under the hood in JavaScript.
In this chapter, we'll cover a bit of advanced material. Starting off with the character encoding format used by JavaScript, we'll dive into many technical details, such as exactly what the length
property returns, how to determine a particular character's associated code unit, and what a string really is.
We'll also see things such as Unicode escape sequences and lexicographic comparison of strings, before ending with how to indirectly use the string @@iterator()
method to work with strings without having to worry about variable-length characters.
This chapter holds immense significance in this unit on strings, as it'll make you much more confident with string processing, and with the JavaScript language in general. Not only that, but you'll also be able to better appreciate the way strings work in other programming languages.
Without any further ado, let's get going.
Unicode and UTF-16
We already know that a string is a sequence of characters.
But what exactly is a 'character' here? Let's explore it.
Everything that happens inside the processor of a computer is ultimately computation over mere numbers. String processing, image processing, audio processing, video processing: literally everything revolves around numbers.
In the case of characters of a string, each of them is associated with a particular number as well. This association is given by the character set in use.
These numbers are sometimes also referred to as character codes. Strings, being sequences of characters, are, in other words, sequences of these numbers.
Now, there are numerous character sets known and even currently used across computers, each with its own legacy. ASCII, clearly one of the most popular character sets known to date, goes back to the nascent days of computing.
In ASCII, the lowercase 'a' from the English alphabet is associated with the number 97. The uppercase 'A' is associated with the number 65. Similarly, lowercase 'b' is 98, while uppercase 'B' is 66. The space character is 32.
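We can verify these mappings right inside JavaScript, using the charCodeAt() string method that we'll formally meet later in this chapter (the values coincide because, as we'll soon see, the character set JavaScript uses mirrors ASCII in its first 128 mappings):

'a'.charCodeAt(0) // 97
'A'.charCodeAt(0) // 65
'b'.charCodeAt(0) // 98
' '.charCodeAt(0) // 32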
As stated before, ASCII isn't the only character set known or used across computers. There are tons of them. For instance, another common character set used primarily on Windows is Windows-1252.
Accommodating different languages and symbols historically led to different character sets; that's why we have so many of them. However, this multiplicity is merely a legacy of a past in which there was no single, universal character set to turn to.
Then, fortunately, came Unicode: a universal character set that aims to assign a unique number to every character of every writing system, and whose first 128 mappings coincide with those of ASCII.
Clearly, it's impossible to show the complete mapping of these many characters on this page, so we'll limit ourselves to the first handful of characters.
Here's a list of Unicode's first 128 character mappings, excluding the starting 32 control characters (codes 0 to 31: line feeds, carriage returns, tabs, and so on):
Character name | Glyph | Code | Code (hex) | Character name | Glyph | Code | Code (hex) |
---|---|---|---|---|---|---|---|
Space | | 32 | 0x20 | Exclamation Mark | ! | 33 | 0x21 |
Quotation Mark | " | 34 | 0x22 | Number Sign | # | 35 | 0x23 |
Dollar Sign | $ | 36 | 0x24 | Percent Sign | % | 37 | 0x25 |
Ampersand | & | 38 | 0x26 | Apostrophe | ' | 39 | 0x27 |
Left Parenthesis | ( | 40 | 0x28 | Right Parenthesis | ) | 41 | 0x29 |
Asterisk | * | 42 | 0x2A | Plus Sign | + | 43 | 0x2B |
Comma | , | 44 | 0x2C | Hyphen-Minus | - | 45 | 0x2D |
Full Stop | . | 46 | 0x2E | Solidus | / | 47 | 0x2F |
Digit Zero | 0 | 48 | 0x30 | Digit One | 1 | 49 | 0x31 |
Digit Two | 2 | 50 | 0x32 | Digit Three | 3 | 51 | 0x33 |
Digit Four | 4 | 52 | 0x34 | Digit Five | 5 | 53 | 0x35 |
Digit Six | 6 | 54 | 0x36 | Digit Seven | 7 | 55 | 0x37 |
Digit Eight | 8 | 56 | 0x38 | Digit Nine | 9 | 57 | 0x39 |
Colon | : | 58 | 0x3A | Semicolon | ; | 59 | 0x3B |
Less-Than Sign | < | 60 | 0x3C | Equals Sign | = | 61 | 0x3D |
Greater-Than Sign | > | 62 | 0x3E | Question Mark | ? | 63 | 0x3F |
Commercial At | @ | 64 | 0x40 | Latin Capital Letter A | A | 65 | 0x41 |
Latin Capital Letter B | B | 66 | 0x42 | Latin Capital Letter C | C | 67 | 0x43 |
Latin Capital Letter D | D | 68 | 0x44 | Latin Capital Letter E | E | 69 | 0x45 |
Latin Capital Letter F | F | 70 | 0x46 | Latin Capital Letter G | G | 71 | 0x47 |
Latin Capital Letter H | H | 72 | 0x48 | Latin Capital Letter I | I | 73 | 0x49 |
Latin Capital Letter J | J | 74 | 0x4A | Latin Capital Letter K | K | 75 | 0x4B |
Latin Capital Letter L | L | 76 | 0x4C | Latin Capital Letter M | M | 77 | 0x4D |
Latin Capital Letter N | N | 78 | 0x4E | Latin Capital Letter O | O | 79 | 0x4F |
Latin Capital Letter P | P | 80 | 0x50 | Latin Capital Letter Q | Q | 81 | 0x51 |
Latin Capital Letter R | R | 82 | 0x52 | Latin Capital Letter S | S | 83 | 0x53 |
Latin Capital Letter T | T | 84 | 0x54 | Latin Capital Letter U | U | 85 | 0x55 |
Latin Capital Letter V | V | 86 | 0x56 | Latin Capital Letter W | W | 87 | 0x57 |
Latin Capital Letter X | X | 88 | 0x58 | Latin Capital Letter Y | Y | 89 | 0x59 |
Latin Capital Letter Z | Z | 90 | 0x5A | Left Square Bracket | [ | 91 | 0x5B |
Reverse Solidus | \ | 92 | 0x5C | Right Square Bracket | ] | 93 | 0x5D |
Circumflex Accent | ^ | 94 | 0x5E | Low Line | _ | 95 | 0x5F |
Grave Accent | ` | 96 | 0x60 | Latin Small Letter A | a | 97 | 0x61 |
Latin Small Letter B | b | 98 | 0x62 | Latin Small Letter C | c | 99 | 0x63 |
Latin Small Letter D | d | 100 | 0x64 | Latin Small Letter E | e | 101 | 0x65 |
Latin Small Letter F | f | 102 | 0x66 | Latin Small Letter G | g | 103 | 0x67 |
Latin Small Letter H | h | 104 | 0x68 | Latin Small Letter I | i | 105 | 0x69 |
Latin Small Letter J | j | 106 | 0x6A | Latin Small Letter K | k | 107 | 0x6B |
Latin Small Letter L | l | 108 | 0x6C | Latin Small Letter M | m | 109 | 0x6D |
Latin Small Letter N | n | 110 | 0x6E | Latin Small Letter O | o | 111 | 0x6F |
Latin Small Letter P | p | 112 | 0x70 | Latin Small Letter Q | q | 113 | 0x71 |
Latin Small Letter R | r | 114 | 0x72 | Latin Small Letter S | s | 115 | 0x73 |
Latin Small Letter T | t | 116 | 0x74 | Latin Small Letter U | u | 117 | 0x75 |
Latin Small Letter V | v | 118 | 0x76 | Latin Small Letter W | w | 119 | 0x77 |
Latin Small Letter X | x | 120 | 0x78 | Latin Small Letter Y | y | 121 | 0x79 |
Latin Small Letter Z | z | 122 | 0x7A | Left Curly Bracket | { | 123 | 0x7B |
Vertical Line | \| | 124 | 0x7C | Right Curly Bracket | } | 125 | 0x7D |
Tilde | ~ | 126 | 0x7E | Delete | | 127 | 0x7F |
The number associated with every character in Unicode is often referred to as its code point.
To convert characters to numbers (that is, to encode them) and numbers back to characters (to decode them), we need some kind of encoding-decoding mechanism, also known as a character encoding scheme, or a character encoding format.
One such scheme is Unicode Transformation Format - 32, commonly known as UTF-32. It uses 32 bits to represent every single character in the Unicode character set.
However, obviously, this has a big downside — wastage of memory. Needless to say, that's the reason why it's not very mainstream.
A better and more popular alternative is UTF-16.
In the UTF-16 encoding scheme, each character occupies at least 16 bits. A block of 16 bits is referred to as one code unit.
Undoubtedly, not all characters in Unicode can be accommodated in 16 bits; for some, we have to increase the storage. That's where some mathematics comes into the game. UTF-16 is variable-length, i.e. some characters span one code unit (16 bits) while others span two code units (32 bits).
Those that span two code units are represented by what Unicode terminology calls a surrogate pair: a high surrogate value (the first code unit) followed by a low surrogate value (the second code unit).
The thing is that certain code units (i.e. blocks of 16 bits) in UTF-16 are reserved as having special meanings — some are high surrogates while some are low surrogates. When a high surrogate is encountered by a UTF-16 decoder, it right away knows that the next code unit ought to be read as well in order to determine the current character.
Now exactly how these high and low surrogate values are melded together to obtain one single code point is purely an implementation detail of UTF-16, out of the scope of this chapter. To dig deeper into it, you can refer to FAQ - UTF-8, UTF-16, UTF-32 & BOM - Unicode.org.
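That said, for the curious, here's a rough sketch of that math in JavaScript. The function name toSurrogatePair is purely illustrative; the constants 0x10000, 0xD800 and 0xDC00 come from the UTF-16 specification:

// Split a code point beyond 0xFFFF into its two UTF-16 code units.
function toSurrogatePair(codePoint) {
   var offset = codePoint - 0x10000;
   var highSurrogate = 0xD800 + (offset >> 10); // top 10 bits of the offset
   var lowSurrogate = 0xDC00 + (offset & 0x3FF); // bottom 10 bits of the offset
   return [highSurrogate, lowSurrogate];
}

toSurrogatePair(0x1F642); // [55357, 56898]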
Coming all the way back to JavaScript, it uses the UTF-16 encoding scheme by default for its string type.
Hence, Unicode is the character set used. And in that way, each character in a given string in JavaScript occupies either 16 bits or 32 bits, depending on the character.
So, from the perspective of JavaScript, a string is a sequence of UTF-16 code units.
In other words, a JavaScript string is a sequence of Unicode code units (which are unsigned 16-bit integers). This is an extremely crucial definition to keep in mind when working with strings.
A UTF-16 decoder (which converts numbers back to characters) reads a given string as an array of 16-bit integers and processes each code unit to determine the corresponding character that ought to be output.
So now we finally understand the meaning of a 'character' in JavaScript: it's represented by one or two unsigned 16-bit integers, as per the UTF-16 format.
Simple.
Now, let's look into the charCodeAt()
method available on strings that helps us see these unsigned integers.
The charCodeAt()
method
Thanks to the charCodeAt()
string method, we can inspect the code unit (i.e. the number) associated with a particular character in a string.
string.charCodeAt([index])
Just pass in the index of the character whose code you wish to inspect as the index
argument, and the method will return an integer corresponding to that very character, based on the UTF-16 format.
In case index
is omitted, it defaults to 0
.
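For example, here's a quick sketch of this defaulting behavior, assuming the string 'abc':

'abc'.charCodeAt() // 97, same as 'abc'.charCodeAt(0)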
Consider the code below:
var str = 'abc';

console.log(str.charCodeAt(0)); // 97
console.log(str.charCodeAt(1)); // 98
console.log(str.charCodeAt(2)); // 99
As the Unicode table above shows, the character a
has the decimal code 97, followed by b
with the code 98, followed by c
with 99. That's exactly what's returned by the successive str.charCodeAt()
calls above.
If we pass in an index to charCodeAt()
that doesn't exist in the given string, the method returns NaN
.
This can be seen as follows:
var str = 'abc';

console.log(str.charCodeAt(3)); // NaN
Clearly, there is no fourth character in 'abc'
, so str.charCodeAt(3)
evaluates to NaN
.
Now, let's try a slightly more involved example. Consider the following code:
var str = '🙂';
console.log(str.charCodeAt(0));
console.log(str.charCodeAt(1));
What do you think would be logged here? Or let's put it this way: what do you think the second log would output? Some integer or NaN
?
Let's see it:
55357
56898
Now this is a surprise. We can only see one character in the string above, hence str.charCodeAt(1)
should've returned NaN
. However, that's NOT the case.
What exactly is going on here?
Well, it's only a matter of properly defining what charCodeAt()
does.
At the start of this section, we said that charCodeAt()
returns the code of a particular character in a given string. That definition isn't quite precise.
A much better one would be to say that charCodeAt(i)
returns the i
th code unit in the given string.
As we just learnt in the previous section, JavaScript treats a string as a sequence of code units (16-bit blocks). Regardless of whether a string's two code units denote one character or two separate characters, the string is comprised of two units, and hence we can access both of them individually.
Coming back to the code above, the 🙂 character is comprised of two code units that together represent a surrogate pair. The first code unit is the high surrogate with the value 55,357 while the second one is the low surrogate with the value 56,898. Together, these two code units form the 🙂 character in UTF-16.
This code unit–oriented, instead of character-oriented, treatment of strings doesn't just happen in charCodeAt()
— it's essentially the very design of JavaScript.
For example, if we access the first or the second character of the string '🙂'
using bracket notation, we get the same behavior:
var str = '🙂'

str[0] // '\uD83D'
str[1] // '\uDE42'
The string '\uD83D'
returned by str[0]
is a Unicode escape sequence (denoted by \u
). It's simply used to denote a character in a string with a particular code, in hexadecimal. We'll explore it later on in this chapter.
The hexadecimal number D83D
represents the decimal number 55,357 while DE42
represents 56,898, matching the values returned by the charCodeAt(0)
and charCodeAt(1)
calls above, respectively.
If we know that the given string has a character with a code point beyond the 16-bit range (i.e. 65,535), represented using 2 code units, then we must use the codePointAt()
method to retrieve the entire code point.
The codePointAt()
method
The codePointAt()
method is very similar to charCodeAt()
in that it also allows us to inspect the codes of given characters in a string.
Here's its syntax:
string.codePointAt([index])
index
is the index of the code unit to inspect in the given string. If omitted, it defaults to 0
.
However, there is a slight, but very important, difference between codePointAt()
and charCodeAt()
.
That is, if the inspected code unit represents a high surrogate value, codePointAt()
combines the following low surrogate code unit along with the high surrogate one (the one inspected) to produce the complete code point.
Let's see this in action. Consider the following code:
var str = '🙂';

console.log(str.codePointAt(0)); // 128578
console.log(str.codePointAt(1)); // 56898
Notice the first log here. Since the character '🙂' is comprised of two code units, and in str.codePointAt(0)
we inspect the first code unit which is a high surrogate, the method combines the next code unit (whose value is 56,898) along with the high surrogate to produce 128,578 (the code point of '🙂' in Unicode).
But also notice the second log.
If the inspected code unit doesn't represent a high surrogate, codePointAt()
returns the unit as it is. Likewise, for the string '🙂'
, both str.codePointAt(1)
and str.charCodeAt(1)
return the same thing.
Apart from this, if the inspected code unit doesn't exist in the given string, where charCodeAt()
returns NaN
, codePointAt()
returns undefined
.
This can be seen as follows:
var str = '🙂';

console.log(str.codePointAt(2)); // undefined
The character '🙂' is made up of two code units, so there is no third code unit. Hence, codePointAt(2)
returns undefined
.
The length
property
Do you remember the length
property of strings? It returns the number of 'characters' in a given string, also referred to as the length of the string.
Well, technically speaking, it's not quite precise to say that length
is the number of 'characters' in a given string.
Precisely speaking, the length string property returns the total number of code units in the given string.
Hence, the string '🙂' doesn't have a length of 1; rather, it has a length of 2, as '🙂' is comprised of two code units.
Consider the following snippet:
'abc'.length // 3
'🙂'.length // 2
'Smile 🙂'.length // 8
'🙂🙁'.length // 4
- The string 'abc' is comprised of three code units (and three characters), hence the length 3.
- '🙂' is comprised of two code units (but one character), hence the length 2.
- 'Smile 🙂' has 8 code units (but 7 characters), hence the length 8.
- '🙂🙁' has 4 code units (but 2 characters), hence the length 4.
Notice how the number of 'characters' above is NOT always the same as the return value of length
(i.e. the number of code units).
If we want to obtain the total number of characters (not code units) in a given string, we have to simply go over the entire string, checking each individual code unit to determine if it represents a high surrogate value (and should therefore be combined with the next code unit to produce a single character).
A custom, home-made function can do this, as sketched below, but fortunately JavaScript also provides ready-made facilities for it. We'll come to those in a while.
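Just to give an idea, here's a minimal sketch of such a function. The name countCharacters is purely illustrative; the range 0xD800 to 0xDBFF is the standard range of high surrogate code units:

// Count characters (not code units) in a string.
function countCharacters(str) {
   var count = 0;
   for (var i = 0; i < str.length; i++) {
      var codeUnit = str.charCodeAt(i);
      // A high surrogate means the next code unit belongs to the same character.
      if (codeUnit >= 0xD800 && codeUnit <= 0xDBFF) {
         i++;
      }
      count++;
   }
   return count;
}

countCharacters('Smile 🙂'); // 7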
To recap the main point of this section: length
does NOT return the total number of characters in a string; instead it returns the total number of code units in the string, as a string is simply a sequence of code units.
Unicode escape sequences
In the discussion above, we saw a couple of escape sequences beginning with \u
. Such sequences are commonly known as Unicode escape sequences.
The name follows from the fact that these sequences represent a particular Unicode code point or code unit. In fact, the 'u' in \u
stands for 'Unicode'.
Unicode escape sequences are useful when we have the hexadecimal code of a particular character and want to denote that character in a string. (We could instead use String.fromCharCode() or String.fromCodePoint(), given that the code is already in decimal representation, or has been converted to decimal from hexadecimal.)
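For instance, here's a quick sketch (recall that 97 is the decimal code of 'a', and 128,578 is the decimal code point of '🙂'):

String.fromCharCode(97) // 'a'
String.fromCodePoint(128578) // '🙂'

Now, there are two variations of Unicode escape sequences: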
- One is used to denote a code point within the range of a single code unit.
- The other is used to denote any arbitrary code point.
The general form of the first variation is as follows:
\uXXXX
A hexadecimal number follows \u
, spanning exactly 4 digits (each denoted as X
). The maximum can obviously be \uFFFF
, which is the largest a single code unit can get.
On the other hand, the general form of a sequence used to denote any arbitrary code point is as follows:
\u{X}
\u{XX}
...
\u{XXXXXX}
Once again, \u
is followed by a hexadecimal number, but this time it's enclosed within a pair of curly braces ({}
). Plus, the hexadecimal number can go beyond 4 digits, to a max of 6 digits.
Purpose of curly braces ({}
) here
The curly braces ({}
) are introduced to remove ambiguity from the escape sequence.
For instance, imagine if we didn't have the braces. Given this situation, consider the string '\u1030A'
. Now does this denote a single character with the code point 1030A, or a character with the code point 1030 followed by the literal character 'A'?
Ambiguous, right? This is what {}
solves. With '\u{1030A}'
, we are completely sure that the string denotes a single character with the code point 1030A.
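We can witness this difference using the codePointAt() method covered earlier (1030 in hex is 4,144 in decimal, while 1030A in hex is 66,314):

'\u1030A'.codePointAt(0) // 4144, i.e. the code point 0x1030, followed by 'A'
'\u{1030A}'.codePointAt(0) // 66314, i.e. the single code point 0x1030A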
Anyways, let's take a look at some quick examples.
As a quick reference: '\n' has the hex code A, which can equivalently be written as 0A, 00A, 000A and so on; ' ' (space) has the hex code 20; 'a' has the hex code 61; and '🙂' has the hex code 1F642.
Consider the snippet below where we represent these characters using Unicode escape sequences of the first variant (\uXXXX
):
'\u000A' // '\n'
'\u0020' // ' '
'\u0061' // 'a'
The '🙂' character can't be represented in one go using this variant. That's simply because this emoticon consists of two code units, and \uXXXX
is only capable of denoting code units.
If we have to represent '🙂' using \uXXXX
exclusively, we ought to use two such escape sequences, corresponding to the two code units of the character, as shown below:
'\uD83D\uDE42' // '🙂'
Recall from the discussion near the start of this chapter that '🙂' is comprised of the following two code units: 55,357 (D83D
in hex) and 56,898 (DE42
in hex). Likewise, we use '\uD83D\uDE42'
to represent these code units which together produce '🙂'.
Now, let's shift to the second variant (the one with curly braces):
'\u{A}' // '\n'
'\u{00A}' // '\n'
'\u{00000A}' // '\n'
'\u{20}' // ' '
'\u{61}' // 'a'
'\u{1F642}' // '🙂'
The last statement here is worth noting. We directly embed the code point associated with '🙂' (in Unicode) — that is, 1F642
in hex — inside the curly braces ({}
). There's no need to break the code point down into individual code units, as we did previously.
Note that whatever code point we have, we can prepend as many leading zeroes as we want, so long as the sequence doesn't exceed 6 digits.
Adding zeroes after the hexadecimal number can lead to the wrong character. Worse yet, exceeding the maximum range of the Unicode character set, which is the hexadecimal number 10FFFF
, simply leads to an error.
Both of these cases are demonstrated below.
First, for adding zeroes at the end of the hexadecimal code:
'\u{61}' // 'a'
'\u{0061}' // 'a'
'\u{6100}' // '\u6100', NOT 'a'
In the last statement, instead of adding zeroes before the hexadecimal number 61
, we add them after it. The result is that the hexadecimal number itself changes, so we get the wrong character as output.
And second, for exceeding the maximum range:
'\u{61}' // 'a'
'\u{610000}' // SyntaxError
Clearly, the number 610000
is beyond the maximum value of 10FFFF
, so JavaScript throws an error.
Lexicographic comparison
We saw the relational operators <
, <=
, >
, >=
, ===
and !==
being used on strings in the chapter JavaScript Operators. Now, it's time to dig deeper into how these all work.
Each of the binary operators shown above performs what's called a lexicographic comparison of the given strings. This is more or less the way a dictionary sorts words alphabetically.
In this comparison, one of the strings might be 'behind' the other, 'ahead' of the other, or 'identical' to the other.
But how exactly does this comparison happen, that's the real interesting part.
Let's understand it.
How does lexicographic comparison happen?
Iteration is performed over both the given strings simultaneously, comparing corresponding code units with one another.
If at any point a mismatch is found, the string with the code unit whose value is greater than the corresponding code unit in the other string is termed as lexicographically larger than the other string; accordingly, the other string is termed as lexicographically smaller.
However, if no mismatch is found until either one of the strings has been exhausted by the iterations (i.e. its end reached), then at that point there are two possibilities:
- The lengths of both the strings are the same. In that case, the strings are exactly identical to one another.
- The lengths of the strings are NOT the same. In that case, the string with the larger length is lexicographically larger than the other one; accordingly, the other string is the lexicographically smaller one.
And that's it.
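To make this concrete, here's a minimal sketch of the algorithm in JavaScript. The function name compareStrings is purely illustrative; it returns -1, 0 or 1 in the spirit of typical comparator functions:

function compareStrings(a, b) {
   var minLength = Math.min(a.length, b.length);

   // Compare corresponding code units until a mismatch is found.
   for (var i = 0; i < minLength; i++) {
      if (a.charCodeAt(i) !== b.charCodeAt(i)) {
         return a.charCodeAt(i) < b.charCodeAt(i) ? -1 : 1;
      }
   }

   // No mismatch; the comparison comes down to the lengths.
   if (a.length === b.length) return 0;
   return a.length < b.length ? -1 : 1;
}

compareStrings('ab', 'abc') // -1, i.e. 'ab' is lexicographically smaller
compareStrings('a', 'A') // 1, since 97 > 65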
Coming back to the operators, given two string variables a
and b
:
- a < b says that 'a is lexicographically smaller than b'.
- a > b says that 'a is lexicographically larger than b'.
- a === b says that 'a is identical to b'.
The other relational operators, <=
, >=
and !==
, can be derived from these three simple operators.
Anyways, time to consider some very quick examples:
First, with <
:
'a' < 'b' // true
'a' < 'a' // false
'a' < 'A' // false
'ab' < 'abc' // true
In the third statement, since the code unit of 'a'
(97 in decimal) is larger than that of 'A'
(65 in decimal), 'a' < 'A'
yields false
.
In the last statement, since there is no mismatch in the first two code units of 'ab'
and 'abc'
, the comparison comes down to the length of the strings, and since 'ab'
has a smaller length, it's indeed lexicographically smaller than 'abc'
. Hence, 'ab' < 'abc'
yields true
.
Now, with >
:
'a' > 'b' // false
'a' > 'a' // false
'a' > 'A' // true
'ab' > 'abc' // false
And finally, with ===
:
'a' === 'a' // true
'a' === 'A' // false
'ab' === 'abc' // false
With these extremely simple examples, let's now see whether you could determine the return values of some relational expressions — slightly complicated ones — on your own.
What does '🙂' < 'abcd'
return?
true
false
A mismatch occurs right away between the first code unit of 'abcd', i.e. the one for 'a', and the first code unit in '🙂'. Moreover, since the code unit of '🙂' is larger in value, '🙂' is lexicographically larger than 'abcd'. Hence, '🙂' < 'abcd' yields false.
What does '🙂' === '\uD83D\uDE43'
return?
Inspect the code units of 🙂 to determine the answer.
true
false
The first code unit in '🙂' is \uD83D. This is the same as the first code unit in '\uD83D\uDE43'. Hence, we move on to the second one. This code unit in '🙂' is \uDE42; however, the same code unit in '\uD83D\uDE43' is \uDE43. Since this is a mismatch, the given strings are not identical to one another, and so the given expression is false.
What does '\uD83D\uDE90' > '\u{1F642}'
return?
true
false
'\u{1F642}' can also be expressed as '\uD83D\uDE42'. Since there is a mismatch in the second code unit, and \uDE90 is the larger one, the string '\uD83D\uDE90' is lexicographically larger than '\u{1F642}'. Hence, the given expression returns true.
The string iterator
With ECMAScript 6, iterators were introduced into JavaScript. At their core, they're nothing more than a set of rules to implement in order to perform iteration over given sequences. But overall, they're a slightly technical and complicated matter.
So we won't go over them in this unit; in fact, not even in this course. To learn more about iterators, you can refer to JavaScript Iterators — Introduction from our Advanced JavaScript course.
Anyways, the important thing to know for now is that strings come with a built-in iterator that iterates over their individual 'characters'.
Note the emphasis on 'character' here — the iterator distinguishes between 'characters', NOT code units. And that's a great thing.
To perform iteration over a string using the string iterator, there are a couple of ways:
- Use the @@iterator string method (@@iterator means that we are referring to the string method whose key is stored in Symbol.iterator).
- Use the for...of loop.
We'll go with the former, though without calling the method directly. Rather, we'll use the spread operator (...
) which converts a given sequence — a string in this case — into an array.
Consider the following code:
var str = 'abc';
var arr = [...str];
console.log(arr);
['a', 'b', 'c']
The most important part here is [...str]
. The spread operator (...
) invokes the @@iterator()
method internally on str
and dumps all the individual characters of the string into the array.
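In case you're curious, we could even drive this iterator by hand; here's a quick sketch (Symbol.iterator is the standard key under which the @@iterator method lives):

var it = '🙂!'[Symbol.iterator]();

it.next() // { value: '🙂', done: false }
it.next() // { value: '!', done: false }
it.next() // { value: undefined, done: true }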
What's remarkable about this method, as stated before, is that it distinguishes between individual 'characters'.
This means that we don't need to worry about high surrogate and low surrogate code units — the method will do all the heavy-lifting on its own.
As an illustration, consider the following:
var str = 'Smile 🙂';

console.log(str.length); // 8
console.log([...str].length); // 7
str.length
returns the total number of code units in str
, which are 8 in total (6 for 'Smile '
and 2 for '🙂'
). However, [...str].length
returns the number of elements in the array [...str]
which altogether has 7 elements.
Let's inspect the [...str]
array here to see what it actually contains:
['S', 'm', 'i', 'l', 'e', ' ', '🙂']
As you can see, [...str]
contains all the individual characters of str
; most importantly, the '🙂' character is denoted as a single character and not as two separate characters in the array.
Isn't this string method amazing?
The good thing about this approach is that we can also utilize it to access a particular character of a given string.
For instance, given the same string str
shown above, we can access its first and last characters as follows:
var chars = [...str]

chars[0] // 'S'
chars[chars.length - 1] // '🙂'
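And since the for...of loop, mentioned earlier, drives this very same string iterator under the hood, it too works character-by-character:

var str = 'Smile 🙂';

for (var char of str) {
   console.log(char);
}
// Logs 'S', 'm', 'i', 'l', 'e', ' ' and '🙂', i.e. seven iterations, not eight.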
And that's it for this chapter.