RegExp Backreferencing

Chapter 9 16 mins

Learning outcomes:

  1. What is backreferencing
  2. How to backreference

Introduction

In the previous RegExp Grouping chapter, we saw how to group up individual regex tokens into a single unit and then use this unit in the matching process just like a single token.

In particular, two types of groups were explored: capturing groups, which save their matches, and non-capturing groups, which don't.

In this chapter we shall dig deeper into the former type, i.e. capturing groups, and along the way understand the concept of backreferencing.

What is backreferencing?

When a capturing group is used in a regular expression, whatever the group matches is saved in memory for later use. This is where capturing groups get their name from - they literally capture their matches.

Now storing matches in memory would obviously be useless if we couldn't use them later on. Backreferencing is the name given to the action of using these matches.

Backreferencing means referencing a match that a capturing group has captured and saved in memory.

In simple words, when we use the captures made by capturing groups, we are backreferencing these captures.

We construct a capturing group, it matches something, saves it in memory and then we use this saved value in some other place. This is backreferencing!
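To make the idea of a saved match concrete, here is a minimal sketch. It assumes the exec() method of regular expressions, which this chapter doesn't otherwise use, and the variable names are purely illustrative. The array that exec() returns holds the whole match at index 0 and the capture of the first group at index 1.

```javascript
// A capturing group around the vowel set saves whichever vowel it matched.
var result = /([aeiou])/i.exec("Abed");

var wholeMatch = result[0]; // text matched by the entire pattern
var firstGroup = result[1]; // text captured by group 1

console.log(wholeMatch, firstGroup); // logs: A A
```

Here the group spans the whole pattern, so both values are the same. With a longer pattern, index 1 would hold only the part matched inside the parentheses.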

Backreferencing isn't anything new in the world of regular expressions, but rather just an extension to the concept of capturing groups.

Let's further clarify this with the aid of an example.

Say you want to replace every vowel in a string with an opening parenthesis, followed by the vowel, followed by a closing parenthesis. For example, the string "Abed" should become "(A)b(e)d".

To accomplish this task we will definitely need the replace() method, since we need to perform replacements. Furthermore, we'll also need to save each matched vowel in memory so that while replacing it we could refer back to it and include it in the replacement string.

In contrast, if we only had to replace each e (not E) in a given string with '(e)', we could've simply used the following code:

var str = "Twenty three";
var patt = /e/g;

var replacedStr = str.replace(patt, "(e)"); // "Tw(e)nty thr(e)(e)"

Here there's no need to use a capturing group and then backreference the match, because we know exactly what will be matched - an e.

In cases where we don't know what will be matched, such as in replacing all vowels, we ought to use backreferencing to call on whatever was matched.

Don't worry if you haven't fully understood backreferencing yet. The next section, with all its examples, should be more than sufficient to explain the concept in detail.

How to backreference?

Now that we know what backreferencing is, it's time to see how to do it.

We can backreference a captured match in essentially two places:

  1. A replacement string, i.e. the second argument of the replace() method
  2. The actual pattern

Inside a replacement string, a backreference is denoted by $n, while in the pattern it's denoted by \n, where n is the number of the group.

If you don't know about the replace() method then head over to String Properties and Methods.

Group numbers start at 1. The first group has the number 1, the second has the number 2 and so on. This means that to backreference the match of the first group we would use $1 in the replacement string and \1 in the pattern.
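As a quick illustration of group numbering, the sketch below uses two groups and refers to them in reverse order in the replacement string. The sample name and the variable names are made up for this example.

```javascript
// Group 1 captures the first word, group 2 captures the second word.
var fullName = "John Smith";
var namePattern = /(\w+) (\w+)/;

// $2 inserts the second group's match, $1 inserts the first group's match.
var swapped = fullName.replace(namePattern, "$2, $1");

console.log(swapped); // logs: Smith, John
```

The groups are numbered by their position in the pattern, not by the order in which they are used in the replacement string.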

Let's solve the vowel problem we saw above using backreferencing.

Construct an expression such that it matches all vowels in a string.

After this, complete the following code to replace all the matches of this expression in str with an opening parenthesis (, followed by the match, followed by a closing parenthesis ), and then save the result in replacedStr.

var str = "Eighty one";
var patt = /* pattern goes here */;
var replacedStr = /* replaced string goes here */;

The five vowels are a, e, i, o and u, so to match these we'll use the set [aeiou]. The vowels can also appear in str in uppercase, so we'll need the i flag. With parentheses added to create a capturing group and the g flag to match all occurrences, the expression becomes /([aeiou])/ig.

The real job is to figure out the replacement string. As stated in the question, the replacement string consists of an opening parenthesis (, followed by the match, followed by a closing parenthesis ). This gives the string "($1)".

Since, in this case, we are dealing with the replacement string, the backreference will be of the form $n. Moreover, since we are referring to the first group, n will be equal to 1.

Altogether the code is:

var str = "Eighty one";
var patt = /([aeiou])/ig;
var replacedStr = str.replace(patt, "($1)"); // "(E)(i)ghty (o)n(e)"
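The same pattern and replacement string can be wrapped in a small function and tried on other inputs, including the "Abed" example from earlier. This is only a sketch; wrapVowels is a hypothetical helper name, not part of any library.

```javascript
// wrapVowels is a hypothetical helper name chosen for this example.
function wrapVowels(text) {
  // ([aeiou]) captures each vowel; $1 puts it back between parentheses.
  return text.replace(/([aeiou])/ig, "($1)");
}

console.log(wrapVowels("Abed"));       // logs: (A)b(e)d
console.log(wrapVowels("Eighty one")); // logs: (E)(i)ghty (o)n(e)
console.log(wrapVowels("rhythm"));     // logs: rhythm (no vowels, so unchanged)
```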

Let's now see how to backreference within a pattern.

Construct an expression to match all substrings in a given test string that begin with a vowel, followed by a single word character, and finally followed by the same vowel.

For example, in the string "There were two logos", the matches shall be "ere" (in "There"), "ere" (in "were") and "ogo" (in "logos").

The key requirement here is that both the vowels sitting on the ends must be the same. The most direct way to express this is with a backreference.

To match the first vowel we'll need the set [aeiou]. This will go inside a capturing group so that the match could be saved for later use.

Moving on, to match the next single word character we'll use the character class \w. For more details on \w please refer to RegExp Character Classes.

After this, we need to match the same vowel as was matched in the first capturing group; and in order to do so, we'll need to backreference it using \1.

Recall that backreferences in the actual pattern are denoted by \n.

Altogether we get the expression /([aeiou])\w\1/g.
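To see the expression at work, the sketch below runs it with match() on the example string, and then compares it against a pattern without a backreference on a second, made-up string ("pita data"). The variable names are illustrative.

```javascript
var text = "There were two logos";

// With the g flag, match() returns every full match (not the groups).
var found = text.match(/([aeiou])\w\1/g);
console.log(found); // logs the three matches: ere, ere, ogo

// Comparison on a second sample string.
var sample = "pita data";
var sameVowel = sample.match(/([aeiou])\w\1/g);   // ["ata"]
var anyVowel = sample.match(/[aeiou]\w[aeiou]/g); // ["ita", "ata"]
```

Without the backreference, "ita" is accepted even though its two vowels differ; with \1, only "ata" qualifies.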

In this way, backreferencing enables one to construct complex expressions that can match anything and then even use that anything for further refinement.

You would surely agree that backreferencing isn't that difficult.

You just have to be sure of three things: what you need to reference, whether you need a capturing group and a backreference at all to solve the problem, and which capturing group you want to refer to in an expression.

It's now your time to tackle backreferencing!

In a given test string, replace all occurrences of two digits with a hyphen '-', followed by those digits, followed by another hyphen, followed by a space.

For example, in "136593" the final result should be "-13- -65- -93- ".

Since we have to use the matches in our ultimate replacement we require a capturing group. What we need to match and save are two digits, so the expression will become /(\d\d)/g, where the global flag is given to match all occurrences.

Here we could've also used \d{2} instead of \d\d to match two digits.

With this done, the replacement string will simply be "-$1- ", just as instructed in the task.

var str = "136593";
var patt = /(\d\d)/g;
var replacedStr = str.replace(patt, "-$1- "); // "-13- -65- -93- "
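The \d{2} variant mentioned above behaves the same way. The sketch below also tries a string with an odd number of digits, which is an input chosen for this example and not part of the original task, to show that an unpaired trailing digit is left untouched.

```javascript
var twoDigits = /(\d{2})/g;

// Same input as above, using \d{2} instead of \d\d.
var evenResult = "136593".replace(twoDigits, "-$1- ");
console.log(evenResult); // logs: -13- -65- -93- (with a trailing space)

// Five digits: "12" and "34" are matched, the final "5" has no partner.
var oddResult = "12345".replace(twoDigits, "-$1- ");
console.log(oddResult); // logs: -12- -34- 5
```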

You are provided with the following set of strings. Each one has three blocks of digits delimited by hyphens.

"465-768-9076", "864-304-685", "1085-067-304", "761-20850-820"

Write some code that extracts the numbers between the hyphens and then replaces each sequence with "(", the sequence itself and finally ")". In the final string, these replacements should be separated by a single space.

For example "465-768-9076" should become "(465) (768) (9076)".

The problem is fairly straightforward and so we will approach it directly.

There are three blocks of digits delimited by hyphens, therefore we will create three capturing groups. Each of them will hold the pattern \d+ to match a sequence of one or more digits. This gives us the expression /(\d+)-(\d+)-(\d+)/.

Note that the hyphens in the expression are needed to match the way the test strings are laid out, i.e. delimited by hyphens.

With the expression out of the way, we are only left to perform the replacement. It's also fairly simple: just use the three backreferences.

// example with the string "465-768-9076"
var str = "465-768-9076";
var patt = /(\d+)-(\d+)-(\d+)/;
str = str.replace(patt, "($1) ($2) ($3)"); // "(465) (768) (9076)"
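The same expression can be applied to all four strings from the task. The following is a sketch that assumes the strings are kept in an array and processed with map(); the variable names are illustrative.

```javascript
var inputs = ["465-768-9076", "864-304-685", "1085-067-304", "761-20850-820"];
var blocks = /(\d+)-(\d+)-(\d+)/;

// Replace each string using the three backreferences $1, $2 and $3.
var outputs = inputs.map(function (input) {
  return input.replace(blocks, "($1) ($2) ($3)");
});

console.log(outputs.join("\n"));
// logs:
// (465) (768) (9076)
// (864) (304) (685)
// (1085) (067) (304)
// (761) (20850) (820)
```

Because each group uses \d+, the blocks may have different lengths, and leading zeros such as in "067" are preserved as they appear.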

Groups within groups

Now let's consider a handful of examples demonstrating groups within groups.

Consider the string str = "ghx879".

In the expression /(\w+(\d+))/, what will each of the groups capture when applied over str?

We have two capturing groups, so accordingly we will have two captures available to be used. The first group will match "ghx879". The second one, perhaps surprisingly, will match only "9": \w also matches digits, so the greedy \w+ first consumes the entire string and then gives back just one character so that \d+ can match. In other words the backreference $1 will hold "ghx879" and $2 will hold "9".
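The captures can be checked directly with exec(), as in the sketch below. Keep in mind that \w matches digits as well as letters and that + is greedy, so the outer \w+ takes as many characters as it can and leaves the inner \d+ only the single digit it needs. The second expression, which uses [a-z]+ instead of \w+, is an alternative added here for comparison.

```javascript
var str = "ghx879";

// Groups are numbered by their opening parenthesis: outer is 1, inner is 2.
var nested = /(\w+(\d+))/.exec(str);
console.log(nested[1]); // logs: ghx879
console.log(nested[2]); // logs: 9

// With [a-z]+ the outer part cannot swallow digits, so \d+ gets all of them.
var lettersOnly = /([a-z]+(\d+))/.exec(str);
console.log(lettersOnly[1]); // logs: ghx879
console.log(lettersOnly[2]); // logs: 879
```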

Just remember the old saying: whatever is inside a group is what is captured for it.

Now it's your turn to think through the expression and see what captures what.

In the string "http://localhost:5610/", what will each of the backreferences $1 and $2 hold for the expression /http:\/\/(\w+:(\d+))/, in the given order?

  • "localhost:5610", "5610"
  • "5610", "localhost:5610"
  • "localhost", "5610"
  • "5610", "localhost"
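Once you have picked an answer, you can check it by running a short script such as the sketch below. Note that inside a regular expression literal the forward slashes of the URL have to be escaped as \/, otherwise they would end the literal early. The variable names are illustrative.

```javascript
var url = "http://localhost:5610/";

// The slashes after "http:" are escaped so they don't terminate the literal.
var urlMatch = /http:\/\/(\w+:(\d+))/.exec(url);

console.log("$1 holds:", urlMatch[1]);
console.log("$2 holds:", urlMatch[2]);
```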

And this completes the concept of grouping, now that we've examined backreferencing in detail. Our regexp journey hasn't ended yet, though. There are still quite a few avenues to discover, so don't stop here - keep riding!