## Introduction

How can you select all the words within a string that follow the pattern **'b', then a vowel, and finally the letter 't'**? In other words, how can you match all of the words *'bat'*, *'bet'*, *'bit'*, *'bot'*, and *'but'*?

Luckily, you might be able to see a pattern in this case: the character between the *'b'* and the *'t'* changes while those two stay the same. What we need is a **character set**.

So far in our regular expressions journey we've explored a handful of concepts, including flags and quantifiers. Fortunately, this exploration is not over yet!

In this chapter we shall take a look at another powerful concept in regexp known as character sets, which provide a group of characters to match a single character against. Let's begin!

## What is a character set?

In simple terms, a character set, written as `[ ]`, denotes **multiple characters** to match a **single character** against.

A character set essentially provides a *bunch* of characters to match against a single character in a given test string.

For example, if you want to match the character *'a'* or *'b'*, you can simply write `/[ab]/`. The set will match either an `'a'` or a `'b'` in a given test string.

The order of characters here doesn't matter; we could've also used the set `[ba]`. The reason is that, internally, the regexp engine matches a character in the string against each character in the set until a match is found.

Similarly, to match the characters *'a', 'b', 'c' or 'd'*, you can construct the character set `[abcd]`. Once again, remember that this set will resolve down to a single character, in this case either an `'a'`, `'b'`, `'c'` or a `'d'`.

Often developers make the mistake of thinking that a character set denotes a word, but this is not the case. In itself, a character set is meant to denote only a **single character**.

What this simply means is that `[How]` won't match the word `'How'`, but instead any one of the characters `'H'`, `'o'` and `'w'`.

A character set matches *characters*, NOT a *word!*
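As a quick check of the points above, here is a minimal JavaScript sketch. The variable names and sample strings are illustrative and not part of the original text. It shows that a set resolves to a single character, that the order inside the set is irrelevant, and that `[How]` matches individual characters rather than the word.

```javascript
// A set matches exactly one character out of those listed inside it.
const setAB = /[ab]/;
const setBA = /[ba]/; // same set, different order

const firstAB = "xbay".match(setAB)[0]; // first 'a' or 'b' in the string
const firstBA = "xbay".match(setBA)[0];

// [How] matches a single 'H', 'o' or 'w', never the whole word "How".
const howMatches = "How".match(/[How]/g);

console.log(firstAB, firstBA); // b b
console.log(howMatches);       // [ 'H', 'o', 'w' ]
```

Each element of `howMatches` is one character long, which is the whole point: the set never consumes more than one character per match.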

Write an expression, using a **character set**, to check whether a given test string contains a `'0'` or `'1'`.

The expression is **`/[01]/`** (or equivalently `/[10]/`). We have to check for the two characters `'0'` or `'1'`; hence we form the set `[01]`. It will match either a `'0'` or `'1'`.

Construct an expression **using a set** to match all substrings, inside a test string, that start with a `'b'`, followed by a *vowel*, and end with a `'t'`.

In other words, you shall match all the substrings `'bat'`, `'bet'`, `'bit'`, `'bot'` and `'but'`.

The expression is `/b[aeiou]t/g`.
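The two exercises above can be tried out with a short sketch like the following. The sample strings and variable names are made up for illustration.

```javascript
// Exercise 1: does the string contain a '0' or a '1'?
const hasBit = /[01]/;
const checkA = hasBit.test("route 41"); // contains '1'
const checkB = hasBit.test("route 66"); // contains neither '0' nor '1'

// Exercise 2: every 'b' + vowel + 't' substring (g flag: all matches).
const sampleText = "The bat bit a bot, but bet on a boot.";
const bvtMatches = sampleText.match(/b[aeiou]t/g);

console.log(checkA, checkB); // true false
console.log(bvtMatches);     // [ 'bat', 'bit', 'bot', 'but', 'bet' ]
```

Note that `'boot'` is not matched: the set stands for exactly one vowel, and `'boot'` has two between the `'b'` and the `'t'`.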

### *Speciality* changes inside a set!

One thing worth mentioning over here is that **not all** characters, considered special inside a regular expression, are also considered special inside a character set.

For example, `? + * : | .` are all considered special inside a regular expression, but are NOT special inside a set. This means that they can be given **without a backslash**.

Likewise, the set `[.]` isn't equivalent to the wildcard `.` character. The pattern `/./` will match any single character (except line terminators, by default), whereas the set `[.]` will match only the period symbol `'.'`, literally.
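The difference between the wildcard and the set can be seen in a small sketch. The sample string and names are illustrative assumptions.

```javascript
const dotSample = "a.b?c";

const anyChar = dotSample.match(/./g);      // wildcard: every character
const onlyDot = dotSample.match(/[.]/g);    // set: the literal period only
const onlyMark = dotSample.match(/[?+*]/g); // '?', '+', '*' need no backslash inside a set

console.log(anyChar);  // [ 'a', '.', 'b', '?', 'c' ]
console.log(onlyDot);  // [ '.' ]
console.log(onlyMark); // [ '?' ]
```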

Listing all non-special characters inside a set will otherwise be quite tedious so we'll approach it the other way - *listing all special characters*.

What is considered special inside a set includes:

- The caret `^`, only if it appears at the start of the set. This is the *negation character* and denotes a negated set, as we shall discover below.
- The hyphen `-`, if it appears between two characters. It denotes a range inside a set, as we shall see below.
- The backslash `\`, which can be used to include character classes inside a set. We'll see what character classes are in the next chapter.

To include any of these special characters in the given locations, like a hyphen between two characters, we ought to use the backslash `\` to escape them.
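A minimal sketch of escaping inside a set follows. The sample string and variable names are assumptions chosen for illustration. The string contains the six characters `a`, `-`, `c`, `^`, `b` and `\`.

```javascript
const escSample = "a-c^b\\"; // the characters: a - c ^ b \

const rangeAC = escSample.match(/[a-c]/g);    // unescaped hyphen: range a to c
const literalHy = escSample.match(/[a\-c]/g); // escaped hyphen: 'a', '-' or 'c'
const literalCa = escSample.match(/[\^a]/g);  // escaped caret at the start: '^' or 'a'
const literalBs = escSample.match(/[\\]/g);   // escaped backslash: a literal '\'

console.log(rangeAC);   // [ 'a', 'c', 'b' ]
console.log(literalHy); // [ 'a', '-', 'c' ]
console.log(literalCa); // [ 'a', '^' ]
console.log(literalBs); // [ '\\' ]
```

The first two patterns differ only by a backslash, yet one matches `'b'` and the other matches `'-'`.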

The hyphen `-` is considered *special* only if it appears between two characters in a set, like `[a-z]`. This is a range and will match any character from a to z. We shall explore ranges later in this chapter.

However, if this is not the case and it instead appears at either end of the set, like in `[a-]` and `[-a]`, then it won't denote a range but instead the literal character `-`. Hence `[a-]` will match an `'a'` or `'-'`.

## Quantification of a set

A character set on its own will always resolve down to a single character. A quantifier, however, can quantify it to match **any** of the specified characters a given number of times.

For instance, **`/[ae]+/`** will match `'a'`, `'e'`, `'ae'`, `'ea'`, `'ee'`, `'aa'`, `'aea'` and so on. The regexp engine will match each character in the test string with the characters in the set and, as soon as a match is found, will proceed to match against the next character, depending on the quantifier used.

The expression `/[ae]+/` will thus match the first sequence of the characters `a` and `e` (in any order) of length 1 or greater.

Given the string `"This is eaaxhusje for two days eeeaeygeeaaaa aeEa"`, highlight all the substrings within it that the expression `/[ae]+/g` will match.

The expression `/[ae]+/g` will match all sequences (due to the global flag) of the characters `a` and/or `e` that are of length 1 or greater (due to the `+` quantifier).

Below is an illustration of what will be matched:

"This is **eaa**xhusj**e** for two d**a**ys **eeeae**yg**eeaaaa** **ae**E**a**"
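The illustration can be reproduced with a short sketch. The variable names are illustrative. Without the `g` flag only the first sequence is returned, while with it every sequence is collected.

```javascript
const aeText = "This is eaaxhusje for two days eeeaeygeeaaaa aeEa";

const aeFirst = aeText.match(/[ae]+/)[0]; // no g flag: first sequence only
const aeAll = aeText.match(/[ae]+/g);     // g flag: every sequence

console.log(aeFirst); // eaa
console.log(aeAll);   // [ 'eaa', 'e', 'a', 'eeeae', 'eeaaaa', 'ae', 'a' ]
```

The uppercase `'E'` near the end splits the last run in two, since the set lists only the lowercase characters.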

Given the string "I love c**aaaattt**s", match the substring `"aaaattt"` using a character set.

The expression to solve this question is `/[at]+/`.

Start by noticing which characters appear in the substring `"aaaattt"`. They are `a` and `t`, hence our set becomes `[at]` and the final expression becomes `/[at]+/`.

Be careful not to write `/[at+]/`: this set will match an `a`, a `t` or a `+`! A set is given by the `[]` square brackets, and hence any quantifier **must come after** these brackets.
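To see the effect of where the quantifier is placed, consider this small sketch with illustrative variable names:

```javascript
const loveText = "I love caaaattts";

const quantOutside = loveText.match(/[at]+/)[0]; // '+' quantifies the set
const quantInside = loveText.match(/[at+]/)[0];  // '+' is just another member of the set
const plusLiteral = "a+t".match(/[at+]/g);       // the set matches a literal '+'

console.log(quantOutside); // aaaattt
console.log(quantInside);  // a
console.log(plusLiteral);  // [ 'a', '+', 't' ]
```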

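Consider the string `" -------- Hello World -----"`. Write an expression to remove the sequence of hyphens and spaces at the start and end of this string so that it becomes `"Hello World"`.

If you look closely, the start and end both consist only of hyphen and space characters. Hence we can form the set `[- ]` and then use the `+` quantifier to match an entire such sequence.

The final expression will therefore become `/[- ]+/g`.

*Notice that because the `-` character here doesn't appear between two characters inside the set, it won't be considered special.*

Note that this expression also matches the single space between `Hello` and `World`, so removing every match would yield `"HelloWorld"`. To strip only the leading and trailing sequences, the set has to be tied to the start and end of the string with the anchors `^` and `$`, for instance as `/^[- ]+|[- ]+$/g`.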
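The following sketch applies the expression to the string from the exercise. The anchored variant using `^`, `$` and `|` is an addition that does not appear in the original text. It is shown only because the unanchored pattern also matches the space between the two words.

```javascript
const padded = " -------- Hello World -----";

const runs = padded.match(/[- ]+/g);                       // every run of hyphens/spaces
const strippedAll = padded.replace(/[- ]+/g, "");          // removes the inner space too
const strippedEnds = padded.replace(/^[- ]+|[- ]+$/g, ""); // assumption: anchors limit it to both ends

console.log(runs.length);  // 3
console.log(strippedAll);  // HelloWorld
console.log(strippedEnds); // Hello World
```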

Construct an expression to match **all three-digit** numbers in a given test string, with digits in the range **3-6**.

For example, in the string `"981356"` you should select the substring `'356'`, since it contains (exactly) three digits and each of them is in the range 3-6.

Similarly, in `"366565"` the matches should be `'366'` and `'565'`. Unacceptable matches include `"256"`, as it contains `'2'`, a digit out of range; `"35"`, as it contains only two digits; and so on.

The acceptable digits are 3-6, hence the set becomes `[3456]`.

Furthermore, since we need only numbers of length 3, we will use the custom quantifier `{3}`. Lastly, because we need to find all such matches, we will use the flag `g`.

Ultimately the expression becomes **`/[3456]{3}/g`**.
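A quick sketch to check the expression against the strings discussed in the exercise follows. The variable names and the third test string are illustrative.

```javascript
const threeDigit = /[3456]{3}/g;

const fromFirst = "981356".match(threeDigit);  // only '356' qualifies
const fromSecond = "366565".match(threeDigit); // two consecutive matches
const fromBad = "256 35".match(threeDigit);    // '2' is out of range, '35' is too short

console.log(fromFirst);  // [ '356' ]
console.log(fromSecond); // [ '366', '565' ]
console.log(fromBad);    // null
```

`String.prototype.match` returns `null` rather than an empty array when nothing matches, which is worth guarding against in real code.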

Write an expression to match **all sequences** of **spaces and newline characters** in a given test string.

As is the spotlight of this chapter, a character set will do the job.

Since the characters we need to match are space and newline characters, the set will become `[\n ]`. Furthermore, we will need the `+` quantifier to repeat this set one or more times and thus match a *whole such sequence*. Lastly, we'll need the `g` global flag to match *all* such sequences.

Hence the final expression becomes **`/[\n ]+/g`**.
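As a small illustration (the sample string and names are assumptions), the expression can be used both to find the sequences and to collapse each of them into a single space:

```javascript
const spaced = "one  two\n\nthree \n four";

const gaps = spaced.match(/[\n ]+/g);             // each run of spaces/newlines
const collapsed = spaced.replace(/[\n ]+/g, " "); // one space per run

console.log(gaps.length); // 3
console.log(collapsed);   // one two three four
```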

## Range of characters

There are many instances when we want to include a *range* of characters inside a character set.

For example, when matching one of the characters *'a', 'b', 'c' or 'd'* we can either use the set `[abcd]` (as we saw above) or, noticing that this is a range of characters, i.e. *a to d*, reduce it down to `[a-d]`.

The **hyphen** `-` is used to denote a range of characters when it comes **between two characters** inside a set.

Going with the same example, we can use the set `[a-d]` instead of `[abcd]` to match either *a, b, c or d*.

In the same way, the set `[a-z]` will match any character between a and z; `[A-Z]` will match any character between A and Z. Digits can also be given, as in the set `[0-9]`, which will match any digit between 0 and 9.

A set cannot express a numeric range such as `[100-200]`! First of all, it contradicts the idea that a set matches a single character, since here we would have it matching three characters. Secondly, it doesn't make sense because the hyphen operates on its *adjacent* characters in a set, which in this case are `0` and `2`: [10**0-2**00].

The idea of a range in a set is quite powerful and efficient. Instead of writing sets like `[abcdefghijklmnop]` we can simply define them in terms of ranges, i.e. `[a-p]`, and get our job done in a matter of seconds.

What's even more interesting is that we can define **multiple ranges** in one set.

For example, the set `[a-z0-9]` matches any *digit* or *lowercase alphabetic character*. As you can see here, we've used two hyphens with characters on each of their sides to denote two ranges. Amazing, isn't it?
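The behaviour of ranges described in this section can be checked with a brief sketch. The sample strings and variable names are illustrative.

```javascript
const rangeAD = "abcdef".match(/[a-d]/g);     // range form
const listAD = "abcdef".match(/[abcd]/g);     // equivalent explicit list
const twoRanges = "x7Q_".match(/[a-z0-9]/g);  // lowercase letters and digits
const notNumeric = "150".match(/[100-200]/g); // really the set of '1', '0' and the range 0-2

console.log(rangeAD);    // [ 'a', 'b', 'c', 'd' ]
console.log(listAD);     // [ 'a', 'b', 'c', 'd' ]
console.log(twoRanges);  // [ 'x', '7' ]
console.log(notNumeric); // [ '1', '0' ]
```

The last line shows why `[100-200]` is not a numeric range: the `'5'` in `"150"` is simply not a member of the set.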

Construct an expression to match all numbers in a given string.

For example, in the string `"10 plus 20 is 30"` the matches shall be `'10'`, `'20'` and `'30'`.

Numbers have digits; hence the set will be `[0-9]`. Furthermore, numbers are sequences of digits, which means that the set needs to be quantified by `+`. Lastly, to match all such occurrences, we need the `g` flag.

All together, we get the expression `/[0-9]+/g`.
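Here is a short sketch of the solution in use. The conversion to numeric values with `Number` is an extra step added for illustration and is not part of the exercise.

```javascript
const numberStrings = "10 plus 20 is 30".match(/[0-9]+/g); // matches are strings
const numberValues = numberStrings.map(Number);            // optional: convert to numbers

console.log(numberStrings); // [ '10', '20', '30' ]
console.log(numberValues);  // [ 10, 20, 30 ]
```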

Construct an expression to match all sequences *S* in a given string, where *S* is **any digit or lowercase letter occurring one or more times**.

For example, in the string `"0a8ejuds ABDaloe5-l99"` the matches shall be "**0a8ejuds** ABD**aloe5**-**l99**".

To match any digit or lowercase letter we use the set `[0-9a-z]`. This set needs to be quantified by the `+` quantifier to match a sequence *S*. To end with, in order to match all such sequences we use the `g` flag.

All together, the expression becomes **`/[0-9a-z]+/g`**.
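The sketch below runs the expression on the example string and also checks that swapping the order of the two ranges gives the same result. The variable names are illustrative.

```javascript
const seqText = "0a8ejuds ABDaloe5-l99";

const digitsFirst = seqText.match(/[0-9a-z]+/g);
const lettersFirst = seqText.match(/[a-z0-9]+/g); // same set, ranges swapped

console.log(digitsFirst);  // [ '0a8ejuds', 'aloe5', 'l99' ]
console.log(lettersFirst); // [ '0a8ejuds', 'aloe5', 'l99' ]
```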

The set could also be written as `[a-z0-9]` instead of `[0-9a-z]`. Both are exactly the same!

### Thing to note!

One thing to note when working with ranges is that the starting character must not come after the ending character in character-code order.

What this means is that a range like `[a-Z]` won't be accepted by the regexp engine: the character code of lowercase `a` (97) is greater than that of uppercase `Z` (90), so the range is out of order and, in JavaScript, the expression throws a `SyntaxError`. The reverse, `[A-z]`, is accepted but also matches the punctuation characters that sit between `Z` and `a`, which is rarely what is intended.

In contrast, `[a-z]` and `[A-Z]` are both in order and denote the ranges lower a to lower z, and upper A to upper Z respectively.
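The following sketch shows how a JavaScript engine treats such ranges. The `RegExp` constructor is used instead of a literal so that the invalid pattern can be caught at runtime. The variable names are illustrative.

```javascript
// [a-Z] is rejected: 'a' (code 97) comes after 'Z' (code 90).
let rangeError = null;
try {
  new RegExp("[a-Z]");
} catch (err) {
  rangeError = err;
}
const isSyntaxError = rangeError instanceof SyntaxError;

// [A-z] is accepted, but it also covers the characters between 'Z' and 'a'.
const upperToLower = "A_z".match(/[A-z]/g);
// Two separate ranges match letters only.
const lettersOnly = "A_z".match(/[A-Za-z]/g);

console.log(isSyntaxError); // true
console.log(upperToLower);  // [ 'A', '_', 'z' ]
console.log(lettersOnly);   // [ 'A', 'z' ]
```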

## Negated sets

Say you want to write an expression to match all sequences of characters in a string that don't include `'a'` in them. If you have begun to construct the expression by writing all possible characters on your keyboard, then you've already taken the wrong way!

What we need is a negated set...

Where a normal character set matches the characters listed within it, a *negated/complemented* **character set matches the characters NOT listed within it. It's denoted by** `[^ ]`.

Whatever is inside a complemented set is excluded from the match, and everything *else* is considered instead.

For example, `[^a]` denotes a negated set that matches any character that is NOT `'a'`.

Although this is a fairly simple example, the ways in which you can construct a normal set also apply to negated sets, i.e. negated sets can also have ranges.
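A minimal sketch of negated sets follows, including a negated range and the case where the caret is not the first character of the set. The sample strings and names are illustrative.

```javascript
const notA = "banana".match(/[^a]/g);      // anything except 'a'
const notLower = "a1b2".match(/[^a-z]/g);  // negated range: anything except a to z
const caretLiteral = "a^b".match(/[a^]/g); // caret not first: a literal '^'

console.log(notA);         // [ 'b', 'n', 'n' ]
console.log(notLower);     // [ '1', '2' ]
console.log(caretLiteral); // [ 'a', '^' ]
```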

The caret `^` is only considered *special* if it appears right after the opening square bracket, i.e. `[^`. If, however, it appears in any other place, for example `[a^]`, then it won't denote a negated set and will literally match `^`. Hence `[a^]` will match an `a` or a `^`.

Construct an expression to match all sequences of characters, in a given test string, that are NOT **uppercase letters** (from A to Z) using a character set.

Solving this problem is fairly easy: a negated set will do the job.

Since we have to match all characters that are **not uppercase letters**, the set becomes `[^A-Z]`. To match such a sequence we will need the `+` quantifier. Lastly, to match *all* such sequences we will need the `g` flag.

Altogether our expression becomes `/[^A-Z]+/g`.

Construct an expression to match all sequences of characters, in a given test string, that do NOT include **any digits** at all.

Since we need to match everything except for digits, we'll need the negated set `[^0-9]`. The expression will thus become `/[^0-9]+/g`.
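Both negated-set exercises can be checked with a sketch like this one. The sample strings and variable names are illustrative.

```javascript
// Sequences that contain no uppercase letters.
const nonUpper = "abc DEF ghi".match(/[^A-Z]+/g);

// Sequences that contain no digits.
const nonDigits = "10 plus 20 is 30".match(/[^0-9]+/g);

console.log(nonUpper);  // [ 'abc ', ' ghi' ]
console.log(nonDigits); // [ ' plus ', ' is ' ]
```

Notice that spaces are part of the matches: a negated set matches *everything* that is not listed, whitespace included.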