RegExp Grouping

Chapter 8 13 mins

Learning outcomes:

  1. What is grouping
  2. Capturing and non-capturing groups

Introduction

How would you match a sequence of the whole word 'cat' occurring one or more times in a given string? Yes, you will need a quantifier, but what pattern should precede it? You need to quantify the whole word, not just a single character!

The idea you need to solve this type of problem is known as grouping, which we will explore in detail in this chapter. Let's dive in!

What is grouping?

The name grouping is fairly self-explanatory about the purpose it serves in regular expressions. Still, to define it technically:

Grouping is to unify a pattern so that it is matched as a complete block.

Grouping simply groups up a sequence of regexp tokens into one single unit.

A token is any single element of an expression, such as a literal character, a character class or a quantifier.

To group a pattern we use a pair of parentheses (), and between them the pattern we wish to group. For example to quantify the whole sequence 'abc' one or more times we can use (abc)+.

This type of grouping is called a capturing group. There is another way to group a pattern, known as a non-capturing group, written as (?:pattern).

Since both of these forms use parentheses, we know for sure that whenever we need to group a pattern, we have to use parentheses.

We will see what both of these group types, capturing and non-capturing, mean in just a while.

Groups can contain any valid regexp pattern including alternation, character sets, classes, quantifiers, even sub groups.
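As a rough illustration of groups holding different kinds of patterns, the sketch below quantifies one group that contains alternation and another that contains a character set followed by a class. The sample strings and variable names are made up for this example.

```javascript
// Hypothetical sample patterns; each group is quantified as one unit.
const altPattern = /(cat|dog)+/;   // alternation inside a group
const setPattern = /([abc]\d)+/;   // character set and \d class inside a group

const altMatch = "my catdogcat!".match(altPattern);
const setMatch = "id: a1b2c3z9".match(setPattern);

console.log(altMatch[0]); // "catdogcat"
console.log(setMatch[0]); // "a1b2c3"
```

In both cases the + applies to everything between the parentheses, so the match keeps growing as long as the whole group can be matched again.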

Consider the following example.

Write an expression to match the first sequence of the word 'cat' occurring one or more times, in a given string.

'One or more times' indicates the usage of the + quantifier. 'First sequence' means we don't have to use the global flag. To quantify the whole word 'cat' we will need to use a group.

But before doing that let's consider some instances of an expression without a group and see how they all fail to solve our problem.

If we use /cat+/ it will match the substrings "cat", "catt", "cattt"....

If we use /ca+t/ it will match the substrings "cat", "caat", "caaat"....

Or if we use /c+at/ it will match the substrings "cat", "ccat", "cccat"....

But we need to match the substrings "cat", "catcat", "catcatcat"....

So none of the expressions, without a group, can solve our problem. We need to quantify the whole sequence 'cat', not just any single character; hence we need to use a group.

Therefore /(cat)+/ is the solution, or rather one of the solutions, to this problem.

The quantifier + will quantify its preceding expression, which in this case is the group (cat). In other words, it will quantify the whole pattern cat one or more times, and therefore match the desired substrings "cat", "catcat", "catcatcat".
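To see the difference in practice, here is a small sketch that runs an ungrouped and a grouped expression against the same string. The test string and variable names are assumptions chosen for illustration.

```javascript
// Hypothetical test string containing a run of three 'cat's.
const catRun = "my catcatcat";

const noGroup = catRun.match(/cat+/);     // + applies only to 't'
const withGroup = catRun.match(/(cat)+/); // + applies to the whole group

console.log(noGroup[0]);   // "cat"
console.log(withGroup[0]); // "catcatcat"
```

Without the group the match stops after the first 'cat', because the next character is 'c' and not another 't'.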

Capturing vs non-capturing

The examples we saw above used parentheses solely to form a group and quantify it. They used what is called a capturing group.

So what does that mean? In the dictionary of regular expressions we essentially have two types of groups, capturing and non-capturing.

Let's start with the former.

Capturing groups, denoted by (pattern), capture their matches, much like a camera captures pictures. They save their matches in memory to be used later for retrieval and further processing.

Simply whatever is contained within a group is what is captured and likewise saved in memory by that very group.

To make this concrete, the expression /(cat)/ will match the first word 'cat' in a string and save it in memory.

Similarly the expression /(\w\d)/ will match a word character followed by a digit and consequently save it in memory. For example in the string "I know a5" it will match the substring "a5" and save it in memory.

The idea is very simple and sensible; only whatever is inside a capturing group is what is captured by it.
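A minimal sketch of this, assuming we read the result with String.prototype.match: index 0 of the returned array holds the whole match and index 1 holds what the first capturing group saved. The pattern /know (\w\d)/ is a hypothetical variant of the one above, chosen so that the whole match and the captured part differ.

```javascript
// Only the part inside the parentheses is captured.
const captureResult = "I know a5".match(/know (\w\d)/);

console.log(captureResult[0]); // "know a5" (the whole match)
console.log(captureResult[1]); // "a5" (saved by the capturing group)
```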

On the other hand:

Non-capturing groups, given by (?:pattern), don't capture their matches. They only serve to group content.

When you don't need to save the matches in memory to use them for further processing, you should go with a non-capturing group.

Remember that capturing groups take up memory and add a slight overhead to the overall regexp search, so use them only when you really need them.
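The following sketch compares the two group types on the same input. It assumes the results are inspected via String.prototype.match without the g flag; the variable names are illustrative only.

```javascript
const capturing = "catcat".match(/(cat)+/);
const nonCapturing = "catcat".match(/(?:cat)+/);

console.log(capturing[0]);        // "catcat"
console.log(capturing.length);    // 2: the whole match plus one saved group
console.log(capturing[1]);        // "cat" (the last repetition of the group)

console.log(nonCapturing[0]);     // "catcat"
console.log(nonCapturing.length); // 1: the whole match only, nothing saved
```

Both expressions match the same text; they differ only in whether the group's content is stored alongside the match.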

We will explore the use cases of capturing groups in great detail in the next chapter.

Consider the string str = "catcatcatcatcat".

Replace all the sequences 'cat' with 'fit' so that the final result becomes "fitfitfitfitfit".

The solution is pretty simple. We have to replace all the sequences of 'cat' with 'fit', which means that our expression will be /cat/g, and hence the following code.

var str = "catcatcatcatcat";
var patt = /cat/g;
str = str.replace(patt, "fit"); // "fitfitfitfitfit"
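For comparison, here is a short sketch of what happens with and without the g flag. The variable names are hypothetical.

```javascript
const catString = "catcatcatcatcat";

const firstOnly = catString.replace(/cat/, "fit");  // no g flag: first match only
const allCats = catString.replace(/cat/g, "fit");   // g flag: every match

console.log(firstOnly); // "fitcatcatcatcat"
console.log(allCats);   // "fitfitfitfitfit"
```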

Now replace the whole sequence of 'cat' with the string 'rabbit' so that the final result becomes "rabbit". There is no need to save any of the matches in the string.

We need to match the whole sequence of the word 'cat' quantified, not just any single character quantified. In other words we need to use a group, together with the quantifier +.

We could also use the quantifier {5} to match exactly five occurrences of 'cat' in this case, but + does the job as well, so we will use it instead.

Since the word 'cat' has to be quantified, it will go inside the group. The question clearly tells us that there is no need to save matches, so we will use a non-capturing group.

Ultimately the final expression becomes /(?:cat)+/.

It will match any of the strings "cat", "catcat", "catcatcat"... With the expression made, the replacement is straightforward.

var str = "catcatcatcatcat";
var patt = /(?:cat)+/;
str = str.replace(patt, "rabbit"); // "rabbit"
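As a side note on the {5} alternative mentioned earlier, the sketch below shows that both quantifiers give the same result for five repetitions but differ once the run is longer. The six-'cat' string is an assumption added for illustration.

```javascript
const fiveCats = "catcatcatcatcat";
const sixCats = "catcatcatcatcatcat"; // hypothetical longer input

const fiveExact = fiveCats.replace(/(?:cat){5}/, "rabbit");
const fivePlus = fiveCats.replace(/(?:cat)+/, "rabbit");
const sixExact = sixCats.replace(/(?:cat){5}/, "rabbit");
const sixPlus = sixCats.replace(/(?:cat)+/, "rabbit");

console.log(fiveExact); // "rabbit"
console.log(fivePlus);  // "rabbit"
console.log(sixExact);  // "rabbitcat" ({5} stops after five repetitions)
console.log(sixPlus);   // "rabbit" (+ consumes the whole run)
```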

Groups within groups

As we said earlier, groups can contain any valid regexp pattern, which includes further sub-groups. For example, the pattern /(cat(\d\d)*)+/ will match each of the strings "cat", "cat54cat30" and "catcat30cat5084" in full.

Since laying out subgroups is quite a tedious task, whenever you're constructing such expressions make sure you don't miss any minute details such as a parenthesis or a quantifier. A slightly different expression can yield a very different result.
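To illustrate how much a single quantifier matters, here is a sketch comparing two nested-group expressions that differ only in the quantifier on the inner group. The anchors ^ and $ are an addition for this example, used only so that test() checks the whole string; the variable names are hypothetical.

```javascript
const innerStar = /^(cat(\d\d)*)+$/; // digit pairs after 'cat' are optional
const innerPlus = /^(cat(\d\d)+)+$/; // every 'cat' needs at least one digit pair

const samples = ["cat", "cat54cat30", "catcat30cat5084"];

const starResults = samples.map((s) => innerStar.test(s));
const plusResults = samples.map((s) => innerPlus.test(s));

console.log(starResults); // [true, true, true]
console.log(plusResults); // [false, true, false]
```

With + on the inner group, a 'cat' that is not followed by two digits can no longer be part of the match, so two of the three strings fail.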

Let's try out a few examples to see subgroups in action.

Consider the string str = "45catcat78catcatcat" and many such strings.

Write an expression to match all substrings that consist of the sequence S repeated one or more times. S is defined as two digits followed by the word "cat" zero or one times.

For str it should match the substrings "45cat" and "78cat", since a 'cat' that is not preceded by two digits is not part of S. For the string "The code is 45cat7814cat" it should match the substring "45cat7814cat", and for the string "The code is 45cat_7814catcat" it should match "45cat" and "7814cat".

Let's start by constructing a pattern for the sequence S described above: two digits followed by the word "cat" zero or one times. It will simply be \d\d(?:cat)?. Just read through the statement for the sequence and write out the tokens as you go. Notice that we've used a non-capturing group since there is no need to save the matches.

Now we just have to quantify this whole pattern for one or more times using + and employ the flag g to match all occurrences. Combining all this the final expression becomes /(?:\d\d(?:cat)?)+/g.
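A quick sketch of this expression run against the three sample strings, assuming the matches are collected with String.prototype.match (which returns every match when the g flag is set). The variable names are illustrative.

```javascript
const seqPattern = /(?:\d\d(?:cat)?)+/g;

const matchesA = "45catcat78catcatcat".match(seqPattern);
const matchesB = "The code is 45cat7814cat".match(seqPattern);
const matchesC = "The code is 45cat_7814catcat".match(seqPattern);

console.log(matchesA); // ["45cat", "78cat"]
console.log(matchesB); // ["45cat7814cat"]
console.log(matchesC); // ["45cat", "7814cat"]
```

A match ends as soon as the next two characters are not digits, which is why a 'cat' directly following another 'cat' is left out.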

And this is just one example of groups appearing within groups. Looks complex, right? Welcome to the world of regular expressions!

So this marks the completion of grouping in regular expressions, or rather the completion of non-capturing grouping.

In the next chapter we will dive a lot deeper into capturing groups, together with the idea of backreferencing, and see how we can use them to extract complex, unknown chunks of information.