RegExp Assertions

Chapter 10, 15 mins

Learning outcomes:

  1. What are assertions
  2. Lookaheads
  3. Lookbehinds
  4. How to use assertions

Introduction

Say you are busy constructing regular expressions when a friend asks you to write an expression that matches every occurrence of the word 'cat' followed by the substring ' is cute' in a test string.

You give him the expression /(cat) is cute/, but he objects: he wants the expression itself to match only the word 'cat', while still checking for the substring that follows it. The expression you gave him matches the whole substring 'cat is cute', with 'cat' as its single capture. That is not what he wants.
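To see what your friend is objecting to, here is a minimal sketch of the capturing-group attempt. The sample string and the variable names are made up for illustration.

```javascript
// Hypothetical test string, chosen only for illustration.
const introText = 'My cat is cute.';

// The capturing-group attempt from the text.
const introMatch = introText.match(/(cat) is cute/);

console.log(introMatch[0]); // whole match: 'cat is cute'
console.log(introMatch[1]); // first capture: 'cat'
```

The word 'cat' is available as a capture, but the match itself still spans the whole substring.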

So is there any way to match text that is followed, or even preceded, by some other text? Definitely.

What are assertions?

In the dictionary of regular expressions:

Assertions are patterns that are looked for right after a match or right before it.

Once something is matched by a regexp, assertions can serve to additionally check whether it is preceded or followed by a given pattern.

Assertions are considered by a regexp engine in the searching process but never included in the final match.

Applying this to the example above: you needed to match only the word 'cat' followed by the substring ' is cute', not the whole substring 'cat is cute'. In other words, ' is cute' had to be considered in the searching process, but not included in the final match. This is the idea behind an assertion.

At the basic level there are two types of assertions - lookaheads and lookbehinds. Let's see what each of them does.

Lookaheads - go beyond

As the name suggests, lookaheads serve to literally look ahead of a match and see if it's followed by a given pattern. The lookahead is just a part of the search, not a part of the final match returned by the expression. Remember this!

A lookahead has the general form shown below:

pattern(?=lookAheadPattern)

pattern is what we need to match, and lookAheadPattern is the pattern we need to make sure follows pattern. lookAheadPattern goes inside a pair of parentheses, right after the opening ?= symbol.

As an example let's solve the problem discussed at the start of the chapter. We needed to match each word 'cat' that is followed by the substring ' is cute'.

Now since the match of the whole expression has to be just 'cat', and not the whole substring 'cat is cute', we need to use a lookahead.

The expression will be /cat(?= is cute)/. We start with cat since it is what we actually need to match. Then comes the lookahead, where the pattern ' is cute' (including its leading space) goes inside the parentheses, as it is what we need to look for after 'cat'.

And so in this way we have used a lookahead to solve yet another regexp problem.
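A minimal sketch of the lookahead in action follows. The test string and variable names are assumptions made for illustration only.

```javascript
// Hypothetical test string: only the first 'cat' is followed by ' is cute'.
const lookaheadText = 'My cat is cute, but that cat is loud.';

const lookaheadMatch = lookaheadText.match(/cat(?= is cute)/);

console.log(lookaheadMatch[0]);     // 'cat' - the lookahead text is not part of the match
console.log(lookaheadMatch.index);  // 3 - position of the first 'cat'
console.log(lookaheadMatch.length); // 1 - a lookahead creates no capture group
```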

Shall I use ?: or ?=?

Now you might find this notation quite similar to the one for a non-capturing group - (?:). Well indeed it is and this could sometimes lead to confusion!

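One way to avoid this confusion is a small memory trick: the equal sign = is usually associated with 'assignments', a word that sounds much like 'assertions'. So whenever the equal sign is involved, as in (?=), we are talking about an assertion, whereas the colon in (?:) denotes an ordinary group whose text is part of the match.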
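The practical difference between the two notations can be sketched as follows; the sample string and variable names are illustrative assumptions.

```javascript
const groupSample = 'cat is cute';

// Non-capturing group: the grouped text is consumed and included in the match.
const withGroup = groupSample.match(/cat(?: is cute)/)[0];

// Lookahead: the text is only checked, not consumed.
const withLookahead = groupSample.match(/cat(?= is cute)/)[0];

// Because the lookahead consumes nothing, the pattern after it
// continues matching right after 'cat'.
const afterLookahead = groupSample.match(/cat(?= is cute) is/)[0];

console.log(withGroup);      // 'cat is cute'
console.log(withLookahead);  // 'cat'
console.log(afterLookahead); // 'cat is'
```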

Negation is a term that is quite familiar to you by now from the Character Sets and Character Classes chapters. Lookaheads also involve the idea of negation, to test whether a matched sequence is not followed by a given pattern.

Such lookaheads are referred to as negated lookaheads.

Let's see whether you can guess the notation used to denote a negated lookahead. Which of the following choices do you think it is?

Ask yourself this question for every choice: is this notation simple and at the same time able to distinguish itself from a positive lookahead?

  • ?!
  • ?!=
  • ?=!

Where the notation for lookaheads is simply (?=), the one for negated lookaheads is (?!). It replaces the equal sign with an exclamation mark to indicate negation.

Now how to use it? Here's an example.

Write an expression to match all words 'cat' that are not followed by the substring ' is cute'.

The question clearly indicates the usage of assertions, particularly a negated lookahead by the term 'not followed'.

The thing that has to be checked for beyond the match of 'cat' is the substring ' is cute'; hence it will go inside the negated lookahead symbols. Furthermore, because we need to look for all such matches, we use the global g flag.

Hence the expression becomes /cat(?! is cute)/g.
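As a rough sketch, the negated lookahead can be tried out like this. The test string and variable names are assumptions for illustration.

```javascript
// Hypothetical test string with three occurrences of 'cat'.
const negatedText = 'cat is cute, cat is loud, cats';

// Collect the starting index of every 'cat' NOT followed by ' is cute'.
const negatedIndexes = [...negatedText.matchAll(/cat(?! is cute)/g)]
  .map((m) => m.index);

console.log(negatedIndexes); // [13, 26] - the first 'cat' (index 0) is skipped
```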

So far the patterns we've used in our assertions have been pretty basic. However, they don't have to be so! Assertions can contain any valid regexp pattern, including alternations, character sets, classes, boundaries, groups, quantifiers etc.

However, there is little point in quantifying an assertion itself in JavaScript. An assertion checks a position without consuming any characters, so repeating it would only run the same check at the same position again.

In fact, a quantifier placed directly after a lookahead is tolerated only in legacy (non-Unicode) mode, and it is a syntax error when the u flag is set.

Instead, to quantify a pattern inside a lookahead, simply put the quantifier next to the pattern, within the lookahead.

A couple of examples are shown below.

Construct an expression to match all words 'cat' in a string that are followed by the sequence '123' occurring two or more times.

Assertions are the essence of this problem!

We have to match the word 'cat'; therefore we start with the expression /cat/.

Then we need to check whether the match is followed by the sequence '123' two or more times. Hence we first use the non-capturing group (?:123) quantified by {2,} and then put this inside a lookahead, giving us /cat(?=(?:123){2,})/.

For more info on non-capturing groups please read JavaScript RegExp Grouping.

Finally, we finish by adding the global flag to the expression to make it search for all such occurrences. Hence it becomes /cat(?=(?:123){2,})/g.
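A short sketch of this expression at work, with the quantifier placed inside the lookahead. The test string and variable names are made up for illustration.

```javascript
// Hypothetical test string: the middle 'cat' is followed by '123' only once.
const repeatText = 'cat123123 cat123 cat123123123';

const repeatMatches = [...repeatText.matchAll(/cat(?=(?:123){2,})/g)];

console.log(repeatMatches.map((m) => m[0]));    // ['cat', 'cat']
console.log(repeatMatches.map((m) => m.index)); // [0, 17]
```

Each match is just 'cat'; the digits are checked by the lookahead but never included.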

Always remember that lookaheads aren't included in the final match of any expression.

Write an expression to match all words in a string that are followed either by a space and an uppercase letter, or by one of the characters ;, :, ,.

For instance, in the string "I love Booleans; they are literally amazing" you should match 'love' (that's followed by a space and a 'B') and 'Booleans' (that's followed by a semicolon ';').

A word here is any sequence of characters matched by the \w character class.

Firstly to match words we will use the pattern \w+. Then we will need to look beyond each match using a lookahead and see whether it is followed by any of the substrings discussed above.

We will use an alternation to look either for 'a space and an uppercase letter' or for 'one of the characters ;, :, ,'.

The lookahead will hence become (?= [A-Z]|[;:,]). Merging both the patterns and adding the global flag to match all such occurrences leads to /\w+(?= [A-Z]|[;:,])/g.
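Running the expression against the example string from the task gives the expected two words. The variable names below are illustrative assumptions.

```javascript
// The example string from the task above.
const sentence = 'I love Booleans; they are literally amazing';

// Words followed by a space and an uppercase letter, or by ; : ,
const followedWords = sentence.match(/\w+(?= [A-Z]|[;:,])/g);

console.log(followedWords); // ['love', 'Booleans']
```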

These are just some of the cases where lookaheads come in handy. However, lookaheads aren't the only kind of assertion - there is another class known as lookbehinds, as we will discover below.

Lookbehinds - go back

Just like lookaheads serve to search beyond a match, lookbehinds search before it. We can use lookbehinds to check whether a given pattern is preceded by another pattern or not.

And similar to lookaheads, we have positive lookbehinds, given by (?<=) as well as negated lookbehinds, given by (?<!).

Notice the additional symbol < here denoting a lookbehind - it points to the left and so hints at going backwards, which is the essence of lookbehinds, i.e. searching backwards.

However, since lookbehinds search backwards, they come before the matching pattern in the expression, unlike lookaheads that come after it. The general form can be expressed as:

/(?<=lookBehindPattern)pattern/

If you ever forget how to denote a lookbehind, just recall that it is written exactly like a lookahead with the addition of the < character, which points to the left.

Let's consider a few examples on lookbehinds before winding up with the concept of assertions.

Write an expression to match all words 'cat' that are preceded by the substring 'good '.

The question clearly indicates the usage of a lookbehind by the term 'preceded'. The thing that has to be checked for before the match of 'cat' is the substring 'good '; hence it will go inside the lookbehind symbols. Furthermore, because we need to look for all such matches, we will use the global g flag. Hence the expression becomes /(?<=good )cat/g.
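The sketch below tries out this expression, together with its negated counterpart for comparison. The test string and variable names are assumptions for illustration, and the code assumes an engine that supports lookbehinds (they were added to the language in ES2018).

```javascript
// Hypothetical test string with three occurrences of 'cat'.
const behindText = 'good cat, bad cat, good cat';

// Positive lookbehind: 'cat' preceded by 'good '.
const goodCats = [...behindText.matchAll(/(?<=good )cat/g)].map((m) => m.index);

// Negated lookbehind: 'cat' NOT preceded by 'good '.
const otherCats = [...behindText.matchAll(/(?<!good )cat/g)].map((m) => m.index);

console.log(goodCats);  // [5, 24]
console.log(otherCats); // [14]
```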

Suppose that the variable str holds the code of a JavaScript program. Your task is to write an expression to match all variable names that are declared using the var keyword.

For simplicity, you may assume that no variable name itself contains the sequence 'var' (as in 'var bi_var = 0'), and that all variable names are legal, so that you can use \w to match a character of the name.

Also assume that the var keyword is always followed by a single space and then the name of the variable.

We have to match all variable names declared with the var keyword, or in other words, simply all words that are preceded by the substring 'var '.

Now the problem becomes pretty straightforward to solve - just use a lookbehind and get the job done. To match the variable names we will use the pattern \w+ and to further check whether they are preceded by 'var ' we will use the lookbehind (?<=var ).

Combining these two and adding the global flag to match all such occurrences, we get /(?<=var )\w+/g.
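Here is a minimal sketch of the final expression applied to a small program. The contents of str are invented for illustration, and the code again assumes an engine with lookbehind support (ES2018 or later).

```javascript
// Hypothetical program source held in str; only two names are declared with var.
const str = 'var x = 10;\nvar total_sum = x + 5;\nlet y = 2;';

// Every run of \w characters that is preceded by 'var '.
const varNames = str.match(/(?<=var )\w+/g);

console.log(varNames); // ['x', 'total_sum']
```

The name y is not matched, since it is declared with let and is therefore not preceded by 'var '.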