RegExp Character Sets

Chapter 5 21 mins

Learning outcomes:

  1. What are character sets
  2. Using character sets
  3. Negated character sets

Introduction

How can you select all the words within a string that follow the pattern 'b', then a vowel, and finally the letter 't'? In other words, how can you match all the words 'bat', 'bet', 'bit', 'bot', and 'but'?

Luckily, you might be able to see a pattern in this case - the character between the 'b' and the 't' changes, while the 'b' and the 't' themselves stay the same. What we need is a character set.

So far we've explored a handful of concepts including flags and quantifiers in our regular expressions journey. Fortunately this exploration is not over as of yet!

In this chapter we shall take a look at another powerful concept in regexp known as character sets, which let a single position in the string match any one of several characters. Let's begin!

What is a character set?

In simple terms:

A character set, given by the square brackets [ ], lists multiple characters, any one of which can match a single character in the string.

A character set essentially provides a bunch of characters to match against a single character in a given test string.

For example, if you want to match the characters 'a' or 'b' you can simply write /[ab]/. The set will match either an 'a' or a 'b' in a given test string.

The order of characters here doesn't matter - we could've also used the set [ba]. The reason is that, internally, the regexp engine matches a character in the string against each character in the set until a match is found.

Similarly to match the characters 'a', 'b', 'c' or 'd', you can construct the character set [abcd]. Once again remember that this set will resolve down to a single character - in this case either an 'a', 'b', 'c' or a 'd'.

Often developers make the mistake of thinking that a character set denotes a word, but this is not the case. In itself, a character set is meant to denote only a single character.

What this means is that [How] won't match the word 'How', but instead a single character that is either 'H', 'o' or 'w'.

Emphasising it once again, remember that character sets, as the name suggests, are meant to denote characters, NOT a word!
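As a quick check of this point, here is a minimal JavaScript sketch comparing the set [How] with the plain sequence How. The sample string and variable names are made up for illustration only.

```javascript
// Hypothetical sample string, chosen only for illustration.
const howSample = "How are you now?";

// The set [How] matches one character at a time: an 'H', an 'o' or a 'w'.
const howSetMatches = howSample.match(/[How]/g);
console.log(howSetMatches); // ["H", "o", "w", "o", "o", "w"]

// Without the brackets, the pattern matches the three characters in sequence.
const howWordMatches = howSample.match(/How/g);
console.log(howWordMatches); // ["How"]
```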

Write an expression, using a character set, to check whether a given test string contains a '0' or '1'.

The expression is /[01]/ (or equivalently /[10]/). We have to check for the two characters '0' or '1'; hence we form the set [01]. It will match either a '0' or '1'.

Construct an expression using a set to match all substrings, inside a test string, that start with a 'b', followed by a vowel, and end with a 't'.

In other words, you shall match all the substrings 'bat', 'bet', 'bit', 'bot' and 'but'.

The expression is /b[aeiou]t/g.
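To see this expression at work, the following JavaScript sketch runs it over a made-up sentence (the sentence and variable names are assumptions for illustration). Note that 'boat' is not matched, because the set stands for exactly one character between the 'b' and the 't'.

```javascript
// Hypothetical test sentence containing all five target words plus 'boat'.
const batText = "The bat bit the bot, but I bet it was a boat.";

// 'b', then exactly one vowel from the set, then 't'.
const bVowelT = batText.match(/b[aeiou]t/g);
console.log(bVowelT); // ["bat", "bit", "bot", "but", "bet"]
```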

Speciality changes inside a set!

One thing worth mentioning over here is that not all characters, considered special inside a regular expression, are also considered special inside a character set.

For example, ? + * | . are all considered special inside a regular expression, but are NOT special inside a set. This means that they can be given without a backslash.

Likewise, the set [.] isn't equivalent to the wildcard . character. The pattern /./ will match any single character, however the set [.] will match only the period symbol '.', literally.

Listing all non-special characters inside a set will otherwise be quite tedious so we'll approach it the other way - listing all special characters.

What is considered special inside a set includes:

  1. The caret ^ only if it appears at the start of the set. This is the negation character and likewise denotes a negation set as we shall discover below.
  2. The hyphen - if it appears between two characters. It denotes a range inside a set as we shall see below.
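  3. The backslash \ which escapes the character after it and can also be used to include character classes inside a set. We'll see what character classes are in the next chapter.
  4. The closing square bracket ] which ends the set, and so must be escaped to be matched literally.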

To include any of these special characters literally in the positions described above, such as a hyphen between two characters, we ought to use the backslash \ to escape them.

The - is considered special only if it appears between two characters in a set like [a-z]. This is a range and will match any character from a to z. We shall explore ranges later in this chapter.

However if this is not the case and it instead appears on the ends of the set, like in [a-] and [-a] then it won't denote a range, but instead the literal character -. Hence [a-] will match an 'a' or '-'.
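The rules above can be sketched in JavaScript as follows. This is only an illustrative check on made-up strings; the variable names are not from the chapter.

```javascript
// Outside a set the dot is a wildcard; inside a set it is a literal period.
const dotAny = "a.b".match(/./g);
const dotLiteral = "a.b".match(/[.]/g);
console.log(dotAny);     // ["a", ".", "b"]
console.log(dotLiteral); // ["."]

// Quantifier symbols need no backslash inside a set.
const plainSymbols = "1+1=2? yes*".match(/[?+*]/g);
console.log(plainSymbols); // ["+", "?", "*"]

// A hyphen between two characters is a range; at the end, or escaped, it is literal.
const hyphenRange = "abc-".match(/[a-c]/g);
const hyphenAtEnd = "abc-".match(/[a-]/g);
const hyphenEscaped = "abc-".match(/[a\-c]/g);
console.log(hyphenRange);   // ["a", "b", "c"]
console.log(hyphenAtEnd);   // ["a", "-"]
console.log(hyphenEscaped); // ["a", "c", "-"]
```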

Quantification of a set

A character set on its own will always resolve down to a single character. A quantifier, however, can quantify it to match any of the specified characters a given number of times.

For instance /[ae]+/ will match 'a', 'e', 'ae', 'ea', 'ee', 'aa', 'aea' and so on. The regexp engine will match each character in the test string with the characters in the set and as soon as a match is found, will proceed to match against the next character, depending on the quantifier used.

The expression /[ae]+/ will thus match the first sequence of the characters a and e (in any order) of length 1 or greater.
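A small JavaScript sketch of a quantified set, using a made-up string (both the string and the variable names are assumptions for illustration). Without the g flag only the first run is returned; with it, every run is.

```javascript
// Hypothetical sample string.
const vowelSample = "tea area eel";

// Without the g flag, match() returns only the first run of a's and e's.
const firstRun = vowelSample.match(/[ae]+/)[0];
console.log(firstRun); // "ea"

// With the g flag, every run of length 1 or greater is returned.
const allRuns = vowelSample.match(/[ae]+/g);
console.log(allRuns); // ["ea", "a", "ea", "ee"]
```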

Given the string "This is eaaxhusje for two days eeeaeygeeaaaa aeEa", highlight all the substrings within it that the expression /[ae]+/g will match.

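The expression /[ae]+/g will match all sequences (due to the global flag) of the characters a and/or e that are of length 1 or greater (due to the + quantifier).

Below is an illustration of what will be matched:

"This is eaaxhusje for two days eeeaeygeeaaaa aeEa"

The matches, in order, are 'eaa', 'e', 'a', 'eeeae', 'eeaaaa', 'ae' and 'a'. The uppercase 'E' is not matched because the set contains only lowercase characters.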

Given the string "I love caaaattts", match the substring "aaaattt" using a character set.

The expression to solve this question is /[at]+/. Start by noticing which characters appear in the substring "aaaattt". They are a and t, hence our set becomes [at] and the final expression becomes /[at]+/.

Don't mistakenly write /[at+]/ - this set will match an a, a t or a +! A set is given by the [] square brackets, and hence any quantifier must come after these brackets.
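The difference between the two placements of + can be checked with a short JavaScript sketch (variable names are illustrative):

```javascript
const catText = "I love caaaattts";

// Quantifier after the set: one or more a's and t's in a row.
const runMatch = catText.match(/[at]+/)[0];
console.log(runMatch); // "aaaattt"

// Quantifier inside the set: '+' becomes just another member, so only
// a single character is matched.
const singleMatch = catText.match(/[at+]/)[0];
console.log(singleMatch); // "a"
```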

Consider the string " -------- Hello World -----". Write an expression to remove the sequence of hyphens and spaces at the start and end of this string so that it becomes "Hello World".

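Notice that the start and the end of the string both contain only hyphens and spaces. Hence we can form the set [- ] and then use the + quantifier to match an entire run of these characters.

On its own, /[- ]+/g matches every such run, including the single space between 'Hello' and 'World', so removing all of its matches would leave "HelloWorld". To remove only the runs at the start and end, we anchor the set to the two ends of the string with ^ and $ and join the two alternatives with |. The final expression will therefore become /^[- ]+|[- ]+$/g.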

Notice the fact that because the - character here doesn't appear between two characters inside the set, it won't be considered special.
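A hedged JavaScript sketch of this exercise follows. It assumes String.prototype.replace with an empty replacement, and it uses the anchors ^ and $ with the alternation |, because the unanchored set also matches the space between the two words. Variable names are illustrative.

```javascript
const padded = " -------- Hello World -----";

// Unanchored: every run of hyphens and spaces is removed, including the
// single space between the words.
const strippedEverywhere = padded.replace(/[- ]+/g, "");
console.log(strippedEverywhere); // "HelloWorld"

// Anchored: only the runs at the very start (^) and very end ($) are removed.
const strippedEnds = padded.replace(/^[- ]+|[- ]+$/g, "");
console.log(strippedEnds); // "Hello World"
```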

Construct an expression to match all three-digit numbers in a given test string, with digits in the range 3-6.

For example in the string "981356" you should select the substring '356' since it contains (exactly) three digits and each of them is in the range 3-6.

Similarly in "366565" the matches should be '366' and '565'. Unacceptable matches in general include "256", as it contains '2' - a digit out of range; "35", as it contains only two digits; and so on.

The acceptable digits are 3-6, hence the set becomes [3456].

Furthermore since we need only numbers of length 3 we will use the custom quantifier {3}. Lastly because we need to find all such matches we will use the flag g.

Ultimately the expression becomes /[3456]{3}/g.
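The two examples from this exercise can be verified with a short JavaScript sketch (variable names are illustrative). When nothing matches, match() returns null.

```javascript
const threeDigits = /[3456]{3}/g;

const fromFirst = "981356".match(threeDigits);
const fromSecond = "366565".match(threeDigits);
console.log(fromFirst);  // ["356"]
console.log(fromSecond); // ["366", "565"]

// Out-of-range digit, or too few digits: no match at all.
const outOfRange = "256".match(threeDigits);
const tooShort = "35".match(threeDigits);
console.log(outOfRange, tooShort); // null null
```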

Write an expression to match all sequences of spaces and newline characters in a given test string.

As is the spotlight of this chapter, a character set will do the job.

Since the characters we need to match are space and newline characters the set will become [\n ]. Furthermore we will need the + quantifier to repeat this set one or more times and thus match a whole such sequence. Lastly we'll need the g global flag to match all such sequences.

Hence the final expression becomes /[\n ]+/g.
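One possible use of this expression, sketched in JavaScript, is collapsing each run of spaces and newlines into a single space. The sample string and the choice of replacement are assumptions for illustration.

```javascript
// Hypothetical input mixing double spaces and newlines.
const spaced = "one  two\n\nthree \n four";

// Each run of spaces and/or newlines is one match.
const whitespaceRuns = spaced.match(/[\n ]+/g);
console.log(whitespaceRuns); // ["  ", "\n\n", " \n "]

// Replacing every run with a single space normalises the text.
const collapsed = spaced.replace(/[\n ]+/g, " ");
console.log(collapsed); // "one two three four"
```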

Range of characters

There are many instances when we want to include a range of characters inside a character set.

For example in the occasion of matching one of the characters 'a', 'b', 'c' or 'd' we can either use the set [abcd] (as we saw above); or noticing the fact that this is a range of characters i.e a to d, we can reduce it down to [a-d].

A hyphen - is used to denote a range of characters when it comes between two characters, inside a set.

Going with the same example above, we can use the set [a-d], instead of [abcd] to match either a, b, c or d.

In the same way the set [a-z] will match any character between a and z; [A-Z] will match any character between A and Z. Digits can also be given as in the set [0-9], which will match any digit between 0 and 9.

Don't get into the thinking that to match the numbers between 100-200, you can write the set [100-200]!

First of all, it contradicts the idea that a set matches a single character - here we expect it to match three. Secondly, the hyphen operates only on its adjacent characters in a set, which in this case are 0 and 2. The set [100-200] therefore consists of 1, 0, the range 0-2, 0 and 0, which is equivalent to [0-2].
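The following JavaScript sketch shows what [100-200] actually matches. The second expression is only one possible way to match the numbers 100 to 200 and is not from the chapter: it relies on the alternation operator |, and it would also match inside longer digit strings such as "1000".

```javascript
// [100-200] is just the set of the single characters 0, 1 and 2.
const misleadingSet = "150 or 250? 199!".match(/[100-200]/g);
console.log(misleadingSet); // ["1", "0", "2", "0", "1"]

// A possible alternative (an assumption, not the chapter's method):
// '1' followed by two digits, or the literal '200'.
const hundredToTwoHundred = "99 100 150 200 201".match(/1[0-9]{2}|200/g);
console.log(hundredToTwoHundred); // ["100", "150", "200"]
```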

The idea of a range in a set is quite powerful and efficient. Instead of writing sets like [abcdefghijklmnop] we can simply define these in terms of ranges i.e [a-p] and get our job done in the span of seconds.

What's even more interesting is that we can define multiple ranges in one set.

For example the set [a-z0-9] matches any digit or lowercase alphabetic character. As you can see here, we've used two hyphens with characters on each of their sides to denote two ranges. Amazing isn't it?

Construct an expression to match all numbers in a given string.

For example, in the string "10 plus 20 is 30" the matches shall be '10', '20' and '30'.

Numbers have digits; hence the set will be [0-9]. Furthermore, numbers are sequences of digits; which means that the set needs to be quantified by +. Lastly, to match all such occurrences, we need the g flag.

All together, we get the expression /[0-9]+/g.

Construct an expression to match all sequences S in a given string, where S is any digit or lowercase alphabet occurring one or more times.

For example, in the string "0a8ejuds ABDaloe5-l99" the matches shall be '0a8ejuds', 'aloe5' and 'l99'.

To match any digit or lowercase alphabet we use the set [0-9a-z]. This set needs to be quantified by the + quantifier to match a sequence S. To end with, in order to match all such sequences we use the g flag.

All together, the expression becomes /[0-9a-z]+/g.

It doesn't matter if we use the set [a-z0-9] instead of [0-9a-z]. Both are exactly the same!
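Both range exercises above can be checked with a short JavaScript sketch (variable names are illustrative). It also confirms that the order of the ranges inside the set makes no difference.

```javascript
// Single range: runs of digits.
const numberMatches = "10 plus 20 is 30".match(/[0-9]+/g);
console.log(numberMatches); // ["10", "20", "30"]

// Two ranges in one set: runs of digits and lowercase letters.
const mixedSample = "0a8ejuds ABDaloe5-l99";
const digitsFirst = mixedSample.match(/[0-9a-z]+/g);
const lettersFirst = mixedSample.match(/[a-z0-9]+/g);
console.log(digitsFirst);  // ["0a8ejuds", "aloe5", "l99"]
console.log(lettersFirst); // ["0a8ejuds", "aloe5", "l99"]
```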

Thing to note!

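One thing to note when working with ranges is that the starting character must not come after the ending character.

Ranges are ordered by character code, and every uppercase letter has a lower code than every lowercase letter. This means that a range like [a-Z] is rejected by the regexp engine with a syntax error: the starting value a (code 97) comes after the ending value Z (code 90). The reverse, [A-z], is accepted, but it also matches the symbols that sit between Z and a, such as [ and _, which is rarely what we want.

In contrast, [a-z] and [A-Z] are both well-formed ranges - lower a to lower z, and upper A to upper Z respectively. To match letters of either case, combine them as [a-zA-Z].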
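In JavaScript, range endpoints are compared by character code, so [a-Z] fails because a (97) comes after Z (90). The sketch below builds the pattern with the RegExp constructor so that the error can be caught at run time; a literal /[a-Z]/ would stop the whole script from parsing. Variable names are illustrative.

```javascript
// An out-of-order range is a syntax error.
let rangeError = null;
try {
  new RegExp("[a-Z]");
} catch (err) {
  rangeError = err;
}
console.log(rangeError instanceof SyntaxError); // true

// [A-z] is accepted, but also covers the symbols between 'Z' and 'a'.
const wideRange = "aZ_9".match(/[A-z]/g);
console.log(wideRange); // ["a", "Z", "_"]

// Two separate ranges match letters of either case and nothing else.
const lettersOnly = "aZ_9".match(/[a-zA-Z]/g);
console.log(lettersOnly); // ["a", "Z"]
```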

Negated sets

Say you want to write an expression to match all sequences of characters in a string that don't include 'a' in them. If you have begun to construct the expression by writing all possible characters on your keyboard, then you've already taken the wrong way!

What we need is a negated set...

Where a normal character set matches the characters listed within it,

A negated/complemented character set matches the characters NOT listed within it. It's denoted by [^ ].

Whatever is listed inside a complemented set is excluded from the match; every other character is accepted instead.

For example, [^a] denotes a negated set that matches all characters that are NOT 'a'.

Although this is just a fairly simple example, the ways in which you can construct a normal set also apply to negated sets i.e negated sets can also have ranges.

Inside a character set, the character ^ is only considered special if it appears right after the first square bracket i.e [^. If it, however, appears on any other place for example [a^] then it won't account for a negation set and will literally match ^. Hence [a^] will match an a or a ^.
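The position-dependent meaning of ^ can be sketched in JavaScript as follows, on made-up strings with illustrative variable names:

```javascript
const caretSample = "a^b";

// ^ right after the opening bracket negates the set: anything except 'a'.
const notA = caretSample.match(/[^a]/g);
console.log(notA); // ["^", "b"]

// ^ anywhere else in the set is a literal caret: 'a' or '^'.
const aOrCaret = caretSample.match(/[a^]/g);
console.log(aOrCaret); // ["a", "^"]

// Escaping ^ at the start also makes it literal.
const escapedCaret = caretSample.match(/[\^a]/g);
console.log(escapedCaret); // ["a", "^"]

// Negated sets can contain ranges too: anything except a digit.
const nonDigitChars = "a1b2".match(/[^0-9]/g);
console.log(nonDigitChars); // ["a", "b"]
```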

Construct an expression to match all sequences of characters, in a given test string, that are NOT uppercase alphabets (from A to Z) using a character set.

Solving this problem is fairly easy - a negated set will do the job.

Since we have to match all characters that are not uppercase alphabets the set becomes [^A-Z]. To match such a sequence we will need the + quantifier. Lastly to match all such sequences we will need the g flag.

Altogether our expression becomes /[^A-Z]+/g.

Construct an expression to match all sequences of characters, in a given test string, that do NOT include any digits at all.

Since we need to match everything except for digits, we'll need the negated set [^0-9]. The expression will thus become /[^0-9]+/g.
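The two negated-set exercises above can be tried out with the following JavaScript sketch. The sample strings and variable names are assumptions for illustration.

```javascript
// Runs of characters that are not uppercase letters.
const nonUppercaseRuns = "Hello World, OK?".match(/[^A-Z]+/g);
console.log(nonUppercaseRuns); // ["ello ", "orld, ", "?"]

// Runs of characters that are not digits.
const nonDigitRuns = "room 101, floor 7".match(/[^0-9]+/g);
console.log(nonDigitRuns); // ["room ", ", floor "]
```

Note that spaces and punctuation are part of the matches, since a negated set accepts every character it does not list.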