RegExp Character Classes

Chapter 6 17 mins

Learning outcomes:

  1. What are character classes
  2. Using character classes

Introduction

As we saw in the previous RegExp Character Sets chapter, character sets can be used to gather multiple candidate characters to be matched with a character in the test string. Some sets, such as the digit set [0-9], are so common in regular expressions that they are provided as predefined character classes.

In this chapter we will explore many character classes and use them to solve common problems in a more comprehensive manner. It will be a fun trek, so let's hop on.

What are character classes?

Put as simply as possible:

A character class is a special character that denotes a predefined character set.

Character classes are nothing more than predefined character sets. They are given to help you in constructing common expressions more quickly and easily.

Following is a table showing all the character classes in regular expressions.

Class   Name        Character set
.       Wildcard    [^\r\n] - matches anything except for a newline.
\d      Digit       [0-9] - matches a digit.
\D      Non-digit   [^0-9] - matches all but a digit.
\w      Word        [a-zA-Z0-9_] - matches a word character.
\W      Non-word    [^a-zA-Z0-9_] - matches all but a word character.
\s      Space       Matches any whitespace character, including spaces and tabs.
\S      Non-space   Matches all but a whitespace character.

To give you a quick example, consider the class \d used to represent numeric digits. It's equivalent to the character set [0-9], which means that the two expressions /\d/ and /[0-9]/ are identical to each other.

You can use either one; the result is exactly the same.

Along the same lines, we can use the class \w when we want to look for word characters. The expression /\w+/ will match 'Hello' in the string "Hello World".
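As a quick check of this equivalence, the following sketch runs both expressions against a sample string and then applies /\w+/ to "Hello World". The sample string and the variable names are assumptions made purely for illustration.

```javascript
// Sample string chosen only for illustration.
var sample = "Room 42, floor 7";

// Both patterns should find exactly the same digits.
var byClass = sample.match(/\d/g);
var bySet = sample.match(/[0-9]/g);
console.log(byClass); // ["4", "2", "7"]
console.log(bySet);   // ["4", "2", "7"]

// Without the global flag, only the first run of word characters is matched.
var firstWord = "Hello World".match(/\w+/)[0];
console.log(firstWord); // "Hello"
```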

Let's have a look over some common character classes.

Wildcard class

Perhaps one of the most useful character classes in regular expressions is the wildcard character class.

Denoted by a . period character, the wildcard class matches everything but a newline character. It's equivalent to the set [^\r\n].

\r is the carriage return character used to denote newlines in certain platforms.

As you might've realised, using the wildcard class together with the + quantifier will match the first line in a given string.

Consider the example below:

Construct an expression to match all lines in a given test string.

For example, in the following string str:

var str = `Hello world
How are you?
Thanks for your time..`;

the expression should match each of the three lines 'Hello world', 'How are you?' and 'Thanks for your time..'.

To match a single line, in fact the first line, we need to use the pattern /.+/. With the global flag set, this expression will then match all lines.

Altogether we get /.+/g.
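To see this expression at work, here is a minimal sketch that applies it to the string str from the question using the standard match() method. The variable name lines is just an illustrative choice.

```javascript
var str = `Hello world
How are you?
Thanks for your time..`;

// The wildcard stops at each newline, so every match is one whole line.
var lines = str.match(/.+/g);
console.log(lines);
// ["Hello world", "How are you?", "Thanks for your time.."]
```

Note that because + requires at least one character, empty lines would simply be skipped.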

You have been given a string with numerous occurrences of the word 'football' in it. The problem is that each of these words is followed by some unknown, mistyped character.

Your task is to construct an expression that can match all these words along with their mistyped characters.

For example, say the string is "Football* is good. I love to play football]. Football> is my passion.". Your expression should match 'Football*', 'football]' and 'football>'.

What we need to match is every word 'football' along with its mistyped character. Since the mistyped character is not exactly known, we can use the wildcard class to match it.

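Recall that the wildcard class matches everything but newlines, and is therefore well suited to our problem.

The pattern /football./ will match the word 'football' along with its mistyped character, and the global flag will serve to take into account all such occurrences. Since the word is capitalized in some places in the example string ('Football'), we also need the ignore-case flag i; without it, only 'football]' would be matched.

The solution is /football./gi.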
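The sketch below runs the pattern against the example string from the question. Because that string capitalizes some occurrences ('Football'), a case-sensitive pattern finds only one of them, so the sketch also shows the result with the i (ignore-case) flag added. The variable names are illustrative.

```javascript
var str = "Football* is good. I love to play football]. Football> is my passion.";

// Case-sensitive: the capitalized occurrences are skipped.
var caseSensitive = str.match(/football./g);
console.log(caseSensitive); // ["football]"]

// With the i flag, 'Football' is picked up as well.
var matches = str.match(/football./gi);
console.log(matches); // ["Football*", "football]", "Football>"]
```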

Word and non-word characters

We'll start by looking at the word character class - \w.

Its equivalent character set is [a-zA-Z0-9_]. In other words, it'll match any lowercase or uppercase letter, a digit, or an underscore _ character.

Consider the example below:

Construct an expression to match all words in a given test string.

The definition of a word here is a sequence of letters (lowercase and uppercase), digits and underscore characters.

The definition of a word given in the question matches the definition of the class \w. Accordingly, we'll use \w to represent a single word character.

This then needs to be quantified by the + quantifier to match a sequence, and finally flagged with the global flag so that it matches all such sequences.

Altogether we have /\w+/g.
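Here is a small sketch of this expression in use. The test string is an assumption chosen to include an underscore and some digits inside a word.

```javascript
// Illustrative test string; spaces and '!' are not word characters.
var str = "The year_2024 was fun!";

var words = str.match(/\w+/g);
console.log(words); // ["The", "year_2024", "was", "fun"]
```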

What will the expression /\w+/ match if executed on the string "Hello-world."?

  • 'Hello'
  • 'Hello-world'
  • 'Hello', 'world'
  • 'Hello', 'world', '.'

One important thing to note here is that \w won't match word characters from other languages, like á, ö, ū, because by definition \w is [a-zA-Z0-9_], i.e. a lowercase or uppercase English letter, a digit, or an underscore.

To match word characters from other languages, we need to use Unicode representations in character sets.
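As a rough sketch of what this can look like, the code below extends \w with a range of Unicode escapes. The range \u00C0-\u024F is an assumption made for this example: it covers the Latin-1 Supplement and Latin Extended-A and -B blocks, which include á, ö and ū (along with a couple of non-letter symbols such as × and ÷), and would need adjusting for other scripts.

```javascript
var str = "Málaga Köln Rūta";

// \w alone breaks each word apart at its accented letter.
var asciiOnly = str.match(/\w+/g);
console.log(asciiOnly); // ["M", "laga", "K", "ln", "R", "ta"]

// Assumed range: \u00C0-\u024F, merged into a set together with \w.
var extended = str.match(/[\w\u00C0-\u024F]+/g);
console.log(extended); // ["Málaga", "Köln", "Rūta"]
```

In engines that support Unicode property escapes, a set such as [\p{L}\p{N}_] used with the u flag is another option worth considering.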

Where \w matches a word character, its counter class \W matches everything but a word character.

The character class \W matches a non-word character. It's the negation of \w.

This idea of negated classes shouldn't be surprising to you - it follows directly from the idea of negated character sets, where the set matches everything but the characters listed within it.

Similarly, \W matches everything, except for the characters matched by \w.

Remember that character classes denote character sets and thus by themselves always match a single character.

Construct an expression to match all non-word sequences in a given test string.

To match a non-word sequence we need to use the class \W along with the + quantifier. To match all such sequences we need to add the global flag.

The final expression hence becomes /\W+/g.
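The following sketch shows what this expression returns for a sample sentence (the sentence is an illustrative assumption). Notice that spaces and punctuation both count as non-word characters, so adjacent ones are grouped into a single match.

```javascript
var str = "Hello, world! How are you?";

// Each match is a run of characters that \w would NOT match.
var gaps = str.match(/\W+/g);
console.log(gaps); // [", ", "! ", " ", " ", "?"]
```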

Digits and non-digits

Matching digits is a fairly common practice in the world of regular expressions and so is the use of the \d class.

The digit character class, denoted by \d, matches a single digit (a number from 0 to 9).

Blending it with other aspects of regular expressions, one can construct patterns to match integers and floats in a string. The following examples demonstrate this.

Construct an expression to match all integers in a given test string.

Remember that integers are signed whole numbers, such as -5, 0, 12.

We'll start by matching the sign of every integer, if it exists. This will be accomplished using the hyphen - symbol and the ? quantifier since the sign can either be there or not be there.

After this, we'll need to match the magnitude of the integer, which can simply be done using \d+.

Altogether we have /-?\d+/g.
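A short sketch of this expression applied to a sample string follows. The string is an assumption built around the integers mentioned above.

```javascript
var str = "Lows of -5, 0 and 12 degrees";

// -? optionally matches the sign; \d+ matches the magnitude.
var integers = str.match(/-?\d+/g);
console.log(integers); // ["-5", "0", "12"]
```

Keep in mind that match() returns strings; each one can be passed to Number() if actual numeric values are needed.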

Consider the following string:

var str = "JavaScript has 10 characters in it";

Replace the number '10' in it with '50' using only the character class \d.

Save the resultant string in a new variable replacedStr.

The way we can solve this problem is by using the character class \d twice in the pattern to denote two digits. Hence the expression becomes: /\d\d/.

var str = "JavaScript has 10 characters in it";
var replacedStr = str.replace(/\d\d/, "50");

Obviously we could've also used a quantifier in this case but the manual work of writing \d twice wasn't that tiring to do!
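For comparison, here is a sketch of the quantifier-based alternative mentioned above. It assumes the {2} quantifier (exactly two repetitions), which behaves the same as writing \d twice.

```javascript
var str = "JavaScript has 10 characters in it";

// \d{2} means exactly two digits, equivalent to \d\d.
var replacedStr = str.replace(/\d{2}/, "50");
console.log(replacedStr); // "JavaScript has 50 characters in it"
```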

Construct an expression to match all floats in a given test string.

Remember that floats are numbers with a decimal point, such as -5.66, 0.155, 12.001.

We'll start by matching the sign of every float, if it exists. As before, this will be accomplished using -?.

After this, we'll need to match the rest of the number, which is a sequence of digits, followed by a dot, followed by another sequence of digits. The pattern \d+\.\d+ can easily do this.

Note that we've used \., instead of ., simply because the latter doesn't just denote a dot - rather it denotes every character (except for newlines). We need to strictly match the dot '.' character and so we have to escape . using a backslash.

Altogether we have the expression /-?\d+\.\d+/g.
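The sketch below tries this expression on a sample string containing the floats listed above plus one plain integer. The string is an illustrative assumption.

```javascript
var str = "Readings: -5.66, 0.155, 12.001 and 7";

// The escaped dot \. matches a literal '.', so the integer 7 is ignored.
var floats = str.match(/-?\d+\.\d+/g);
console.log(floats); // ["-5.66", "0.155", "12.001"]
```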

As with \w, the class \d also has a counter negated class, denoted by \D.

It matches everything except for digit characters and is equivalent to the set [^0-9].
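One place where \D can come in handy is stripping everything but the digits out of a string. The following is a minimal sketch of that idea; the phone-number string is a made-up example.

```javascript
// Hypothetical input with spaces, brackets and punctuation mixed in.
var phone = "+1 (555) 010-9999";

// Replace every non-digit character with an empty string.
var digitsOnly = phone.replace(/\D/g, "");
console.log(digitsOnly); // "15550109999"
```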

Mixing classes with sets

All character classes, except for the wildcard, come with the benefit of being able to be merged into a character set along with other characters and ranges.

Let's say someone asks you to write an expression that matches either a word character or the characters '-', '?' and '!'. How would you approach this task?

One way surely is to construct the whole set from scratch - [a-zA-Z0-9_?!-]. However, noticing that [a-zA-Z0-9_] is the \w class, you can compress it down to just [\w?!-].

We say that the \w class has been merged into a character set.

The exception to this behavior is the wildcard class denoted by a period . character - it can't be merged into a set. On appearing in a set, the period character denotes the literal period character '.', not the period class.
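Both points can be checked with a short sketch. The test strings here are assumptions chosen for illustration, and the merged set is quantified with + so that whole tokens show up in the output.

```javascript
var str = "well-done! really? yes";

// \w merged into a set together with '?', '!' and '-'.
var tokens = str.match(/[\w?!-]+/g);
console.log(tokens); // ["well-done!", "really?", "yes"]

// Inside a set, the period is just a literal dot...
var inSet = "a.b".match(/[.]/g);
console.log(inSet); // ["."]

// ...whereas outside a set it is the wildcard.
var outsideSet = "a.b".match(/./g);
console.log(outsideSet); // ["a", ".", "b"]
```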

Write an expression to match the first sequence of characters from the set A in a given string. A can be a digit, a hyphen or a period.

The expression is /[\d.-]+/.

We just have to construct a set meeting the mentioned requirements and quantify it to match a sequence. The set can resolve down to a digit, a hyphen or a period; in other words it can be \d, '-' or '.'. Thus the set becomes [\d.-], and quantifying it with + gives [\d.-]+.
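As a quick sketch, here is the set quantified with + (so that it matches a sequence, as described above) and applied to a sample string. The string is an illustrative assumption.

```javascript
var str = "Build 10.4-2 is out";

// No global flag: match() returns only the first sequence.
var first = str.match(/[\d.-]+/);
console.log(first[0]); // "10.4-2"
```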