RegExp Boundaries

Chapter 7 14 mins

Learning outcomes:

  1. What are boundaries
  2. Using boundaries

Introduction

What choices do you have to accomplish the task of matching all strings that start with an a and end with a t for example 'ant', 'art', 'artist' and so on. Or much further, how can you match all strings that have two digits at their start and two alphabets at their end.

Well if you don't know the concept of boundaries in regular expressions then you unfortunately have no choices. In this chapter we will explore what are boundaries and how can we use them to solve complex pattern problems.

What are boundaries?

In a string,

Boundaries are the places between characters.

They are not to be confused with actual characters - rather they are only positions between characters. Think of boundaries as walls between adjacent characters.

Following is an illustration of boundaries in the string "A boundary".

RegExp boundaries

We have two types of boundaries - word boundaries and non-word boundaries.

A word boundary, given by the special character \b, is a position that bounds a word i.e a place where the word starts or ends. The definition of a word here is any sequence of the \w character class.

More specifically, a word boundary denotes a place between a word and a non-word character; the start of the string; as well as the end of the string.

For example in the string "A boundary" there exist four word boundaries - before and after 'A', and before and after 'boundary'.

RegExp word boundaries

On the other hand a non-word boundary, given by the special character \B, is the exact opposite of a word boundary. It is the negation of \b and likewise matches any position a word boundary doesn't.

More specifically it will match a place between a word and a word character; and between a non-word and a non-word character.

For example in the same string "A boundary", there exist 7 non-word boundaries - since 4 out of all the 11 boundaries are word boundaries, as we saw above.

RegExp non-word boundaries

Recall from the character classes chapter that a class given by an uppercase letter is the negation of the class given in lowercase and vice versa. Examples include \w which is the negation of \W, /d which is the negation of \D etc. The same idea applies here as well.

Write an expression to test whether in a given test string there exits a word which starts with a 'b', followed by a vowel and ends at a 't'.

Taking note of the emphasis given to word above, here we don't have to just match the character sequence described, but also make sure that it is a complete word.

You might construct the expression /b[aeiou]t/ thinking it will solve the problem but it might not. If the test string is "A battle", the expression will find a match i.e "A battle", but it won't meet our requirements that it has to be a whole single word.

To solve this problem we have to use the \b character. A complete word has two word boundaries on both of its ends, hence the expression will become /\bb[aeiou]t\b/.

Now this won't match 'bat' in "A battle", but rather in the string "A bat". We are in good hands!

The beginning and the end

The idea of boundaries isn't over as of yet - we are left with one more concept to be unveiled.

What if you were asked to write an expression to match all strings that start with an 'A'; or all those strings that end with a period; or those strings that both start with an 'A' and end at a period.

How will you solve such type of a problem? Well you will need the special characters ^ and $ to do this.

The caret ^ matches the start of a given string, or in other words the first boundary.

The dollar $ matches the end of a given string, or in other words the last boundary.

Once again note that these characters denote the boundaries of a string and hence aren't literal characters that will be included in the final result.

Hence whatever comes after ^ is matched with the beginning of the string and similarly whatever comes before $ is matched with the end of the string.

So let's now use both these ideas to solve the three problems listed at the start of this section.

Write an expression to check whether a given string starts with an 'A' or not.

Since we are concerned with the start of the string we'll use the ^ character followed by an A. ^ will match the start of the string and then a will match the first character.

Likewise the final expression will become /^A/. It will match "A", "A boy", "Alligator" but not "a", "DART".

Write an expression to check whether a given string ends with a period . or not.

Since we are concerned with the end of the string we'll use the $ character preceded by a . so that the final expression becomes /\.$/. Note that the period has to be escaped because it otherwise has a special meaning. This expression will match "Amazing.", "Regular expressions.", even "." but not "Amazing", "Bad".

Write an expression to check whether a given string starts with an 'A' and ends with an a period '.'.

Well we just have to merge the two cases shown above with an additional thing. Note that we have to only make sure that the string starts with an 'A' and ends at a period - not that the whole string shall be compromised of just these two characters.

Hence writing /^A\.$/ would be incorrect since it will match only the string "A." and nothing else. What we need to use in addition to this is the wildcard character quantified for zero or more number of times, so that the final expression becomes /^A.*\.$/.

This will make sure that we are also including those strings that have characters between an 'A' and a period '.'.

Write an expression to match all strings that start with two digits and end with two alphabets (lowercase or uppercase).

The simplest expression is /^\d\d.*[a-z]{2}$/i.

First we need to match two digits at the beginning of the string hence we start with ^\d\d. You could've also used ^\d{2}, it's all your choice.

Next we need to add .* to match any characters in between. Finally we end with [a-z]{2} to match any two alphabets at the end. Note that we didn't use the \w character class because it matches underscores as well.

The i flag is used to ignore case when matching the last two alphabets against the set [a-z]. Without i we would've had to rewrite the set as [a-zA-Z] to match both lower and upper alphabets.

Which of the following strings will the expression /^A\b.*\.$/ match?

  1. "Ants"
  2. "A good program."
  3. "Are these mine."
  4. "A lasting effect"
  • 1, 2 and 3
  • 1 only
  • 2 only
  • None

Using the multiline flag

Recall the multiline flag denoted by m that we saw back in the Flags chapter. It is now that we will explore it in detail and see how it is linked with the concept of boundaries and how it can be used to solve more variations of regexp problems.

The multiline m flag lets ^ and $ match the start and end of every single line in a string that contains multiple lines, instead of only the start and end of the string.

For example for the string "Line 1\nLine 2" the expression /^.+$/g, without the multiline flag, will match 'Line 1\nLine 2'. However the one with the multine flag, /^.+$/gm, will match the substrings 'Line 1' and then 'Line 2', separately.

In the second expression above if we had eliminated the global flag the match would've only been 'Line 1' i.e since by default regexp only searches only up to the first match.
People tend to confuse the multiline flag as making a regexp continue its search for multiple lines. This is not the purpose of m! It only serves to match the start and end of each line with ^ and $. To continue a regexp search for multiple lines we have to use the same, old g global flag!

In a given string test whether there exists any line that starts with an 'A' and ends with a period '.'.

Now since we need to match the start and end we need to use ^ and $ respectively. Furthermore, because this time every line has to be matched for the start and the end and not just the whole string, we will also need to use the multiline flag m.

Hence the final expression becomes /^A.*\.$/m. Note that we have not used the flag g since we have to only check whether there exists any such line.

And this concludes boundaries in our regexp journey. Surely boundaries are an important concept to know that you will more than likely use in your pattern matching work at some point in time.