JavaScript Regex Introduction

Chapter 1 12 mins

Learning outcomes:

  1. The importance of strings
  2. The problem with naive searching
  3. What are regular expressions
  4. About this course

The importance of strings

Before we can understand what regular expressions are, we need to spare a few minutes to appreciate the importance of strings in modern-day programming.

Strings are everywhere. They have been a fundamental and integral part of programming for a very long time. At the core, a string is merely a means of working with textual data in a program. But that textual data forms a huge part of today's computing ecosystem.

The source code of a language is treated as textual data, i.e. a string, by another program. That's a big thing in itself; it means that much of what powers modern systems today (that is, the source code of various programming languages) goes all the way back to basic pieces of text and, in turn, to strings.

Programs routinely require the user's input and also produce some output; both of these are commonly strings. The messages we send to one another and our interactions on social media platforms all involve strings at one point or another.

Talking about the web, it's intrinsically based on a lot of textual data. Strings show up in such things as URLs, HTML, CSS, JavaScript, JSON data, and so on. And let's not forget HTML form inputs: they all send their data to the server as text, which the server then handles as strings.

The list of applications of strings could go on and on and on.

In short, this brief discussion is a reminder that strings are more than just a crucial part of programming. They're used in all but the simplest or purely mathematical programs of today.

Now when strings are used in programs, they aren't just used as is. That is, more often than not, strings are processed. For example, a string might be converted to lowercase characters, or it might be sliced from a starting point to an ending point, or it might be repeated a particular number of times, and so on.
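As a quick illustration of the operations just mentioned, here's a minimal sketch using JavaScript's built-in string methods. The variable names and the sample text are made up purely for demonstration.

```javascript
var title = 'Regular Expressions';

// Convert every character to lowercase.
var lowered = title.toLowerCase();

// Slice from index 0 up to (but not including) index 7.
var sliced = title.slice(0, 7);

// Repeat a string a given number of times.
var repeated = 'ab'.repeat(3);

console.log(lowered);    // 'regular expressions'
console.log(sliced);     // 'Regular'
console.log(repeated);   // 'ababab'
```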

But perhaps the most common operation performed on strings across programming languages is that of searching or, more precisely, analyzing them for given patterns.

Validating form input, parsing source code, powering actual search utilities: all of these require strings to be analyzed for given patterns. Those patterns might sometimes simply be words, but at other times they're particularly intricate sequences of characters.

As an elementary example, consider the task of validating a 10-digit ISBN.

Putting hyphens (-) aside, and ignoring for simplicity that the final check character of a real ISBN-10 may also be X, it's really easy to determine if a particular string is an ISBN or not: check that every character is a digit and that there are exactly 10 characters in the string.
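The simplified check described above could be sketched as follows. The function name looksLikeIsbn10() is hypothetical, and the code deliberately ignores the check-digit arithmetic of real ISBNs; it only verifies the "10 digits" shape.

```javascript
// Hypothetical helper: tests only the simplified "10 digits" rule.
function looksLikeIsbn10(str) {
   // Hyphens are only separators, so remove them first.
   var digits = str.split('-').join('');

   if (digits.length !== 10) {
      return false;
   }

   // Every remaining character must be a digit.
   for (var i = 0; i < digits.length; i++) {
      if (!('0' <= digits[i] && digits[i] <= '9')) {
         return false;
      }
   }
   return true;
}

console.log(looksLikeIsbn10('0-306-40615-2'));   // true
console.log(looksLikeIsbn10('0306406152'));      // true
console.log(looksLikeIsbn10('030640615'));       // false (only 9 digits)
console.log(looksLikeIsbn10('03064061a2'));      // false (contains a letter)
```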

Validating a particular string for this format is just one instance of a common concern in programs, i.e. comparing strings with certain patterns.

And this is exactly where regular expressions step in.

But wait a minute...why can't we just use basic string processing utilities to find patterns in strings?

Well, let's see why.

The problem with naive searching

Let's imagine a very simple scenario to help us better understand the purpose of even resorting to using regular expressions for string analysis.

This is highly desirable because otherwise we'd be using regular expressions without the slightest clue as to why exactly we need them.

Say we have to verify that an arbitrary string has the following general form: a lowercase letter from the English alphabet, followed by a digit, followed by another digit. Quite simple.

We want a function isCorrect() in JavaScript to help us encapsulate this logic.

Well, the task seems pretty straightforward, so let's tackle it without further ado. We'll check the length of the string and then use three individual conditional expressions, one per character, combining everything with the logical AND (&&) operator to decide whether the string matches the required pattern or not.

Here's the definition of isCorrect():

function isCorrect(str) {
   return (
      str.length === 3
      && ('a' <= str[0] && str[0] <= 'z')
      && ('0' <= str[1] && str[1] <= '9')
      && ('0' <= str[2] && str[2] <= '9')
   );
}

Pretty easy, wasn't it?

Let's now test the function on some strings:

isCorrect('a10')
true
isCorrect('z05')
true
isCorrect('c90')
true
isCorrect('_50')
false
isCorrect('A50')
false
isCorrect('a5')
false

Yup, it works flawlessly!

Now, let's say the pattern was as follows: a lowercase letter (as before), followed by 10 digits.

How would you approach this problem?

Well, one thing is clear: using 10 manual conditional expressions would be completely insane! And so what we'll use instead is a loop.

Here's the code to solve this problem:

function isCorrect(str) {
   if (str.length === 11) {
      if ('a' <= str[0] && str[0] <= 'z') {
         for (var i = 1; i < 11; i++) {
            if (!('0' <= str[i] && str[i] <= '9')) {
               return false;
            }
         }
         return true;
      }
      return false;
   }
   return false;
}

Not really that delightful to look at, right? Without a doubt, the code is far too complex for such a basic pattern-matching problem.

Anyway, let's try the code on a couple of strings, just like we did before:

isCorrect('a1234567890')
true
isCorrect('z0987654321')
true
isCorrect('123456')
false
isCorrect('a123456789')
false
isCorrect('A1234567890')
false

And it works absolutely fine this time as well.

Now let's make the problem statement a little bit more intriguing.

Suppose that the pattern is as follows: a lowercase or uppercase letter, followed by an uppercase letter or an underscore (_), followed by 10 digits or whitespace characters (i.e. spaces, tabs or newlines), and then optionally followed by a hyphen (-) or an underscore (_).

See where this is going?

While it's still possible to address this complex problem purely using string-based searching, as you can probably guess, it's absolutely impractical to do so.
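To get a feel for how unwieldy it becomes, here's one possible manual version, written under our reading of the pattern above. The name isCorrectManual() is hypothetical, and this is only a sketch of one way to do it.

```javascript
// Hypothetical manual check for the third pattern, using no regular expressions.
function isCorrectManual(str) {
   // 1 letter + 1 uppercase letter/underscore + 10 digits/whitespace = 12 characters,
   // plus 1 more if the optional trailing character is present.
   if (str.length !== 12 && str.length !== 13) {
      return false;
   }

   // First character: a lowercase or uppercase letter.
   var first = str[0];
   if (!(('a' <= first && first <= 'z') || ('A' <= first && first <= 'Z'))) {
      return false;
   }

   // Second character: an uppercase letter or an underscore.
   var second = str[1];
   if (!(('A' <= second && second <= 'Z') || second === '_')) {
      return false;
   }

   // Next 10 characters: each one a digit, space, tab or newline.
   for (var i = 2; i < 12; i++) {
      var ch = str[i];
      var isDigit = '0' <= ch && ch <= '9';
      var isSpace = ch === ' ' || ch === '\t' || ch === '\n';
      if (!isDigit && !isSpace) {
         return false;
      }
   }

   // Optional last character: a hyphen or an underscore.
   if (str.length === 13) {
      return str[12] === '-' || str[12] === '_';
   }
   return true;
}

console.log(isCorrectManual('aB1234567890'));     // true
console.log(isCorrectManual('Z_12 34\t5678-'));   // true
console.log(isCorrectManual('ab1234567890'));     // false (second character is lowercase)
```

Every small change to the pattern would mean reworking the length check, the loop bounds and the individual conditions.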

Countless pattern-matching problems will arise throughout your programming journey, but that doesn't mean you have to draft such intricate logic for each and every one of them.

In order to describe such patterns neatly, compactly and efficiently, we have regular expressions at our disposal.

So now that we know about the exact purpose of using regular expressions in code, let's dive right into exploring what they are.

What are regular expressions?

Regular expressions enter the game exactly when we want to work with patterns in strings.

In simple words:

Regular expressions are a means of describing patterns.

Sometimes also referred to as regex, regular expressions are used to describe patterns in strings, and are ultimately used to help in finding particular sequences of characters in strings.

Many modern high-level programming languages — the likes of Python, Go, Java, PHP, JavaScript — all support regular expressions out of the box.

They are a superbly compact and powerful way of performing searching routines over strings.

As a quick example to illustrate the potential of regular expressions, let's recall the second task that we solved above — the one with a for loop. The pattern to match was as follows: a lowercase letter, followed by 10 digits.

The manual JavaScript code, as shown above, was 14 lines long. Now, let's solve the same issue using a regular expression:

function isCorrect(str) {
   return /^[a-z]\d{10}$/.test(str);
}

Notice the expression /^[a-z]\d{10}$/ here — this is the regular expression that describes the aforementioned pattern.

Here's a quick go-through on what's going on in the expression:

  • ^ is an anchor that marks the beginning of the string (that'll eventually be compared against this regular expression).
  • [a-z] denotes a character set that matches a character in the range a to z.
  • \d denotes a character class that matches a single digit.
  • {10} is a quantifier to get \d to occur exactly 10 times.
  • $ is another anchor but this time it marks the end of the string.

Altogether, this terse regular expression describes our pattern pretty neatly.

The test() method following the regular expression in the code above is a method of the RegExp interface in JavaScript. It simply tests whether a string abides by the pattern described in the calling regular expression.
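To see test() and the two anchors in action, here's a small sketch. The variable names pattern and unanchored are ours, and the sample strings are arbitrary.

```javascript
var pattern = /^[a-z]\d{10}$/;

console.log(pattern.test('a1234567890'));   // true
console.log(pattern.test('A1234567890'));   // false (uppercase first letter)
console.log(pattern.test('a123456789'));    // false (only 9 digits)

// Without ^ and $, the pattern is allowed to match anywhere inside a longer string.
var unanchored = /[a-z]\d{10}/;

console.log(unanchored.test('xxa1234567890yy'));   // true
console.log(pattern.test('xxa1234567890yy'));      // false
```

This is why the anchors matter for validation: they force the whole string, and not just a part of it, to follow the pattern.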

Just to give you a taste of what's possible with regex, and of what you'll eventually learn in this course, let's also solve the third task given above.

Here's the pattern desired: a lowercase or uppercase letter, followed by an uppercase letter or an underscore (_), followed by 10 digits or whitespace characters (i.e. spaces, tabs or newlines), and then optionally followed by a hyphen (-) or an underscore (_).

And here's the code to test a string for this pattern:

function isCorrect(str) {
   return /^[a-zA-Z][A-Z_][\d \t\n]{10}[-_]?$/.test(str);
}
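As before, we can try the function on a few strings. The sample inputs below are our own; the function is repeated so that the snippet runs on its own.

```javascript
function isCorrect(str) {
   return /^[a-zA-Z][A-Z_][\d \t\n]{10}[-_]?$/.test(str);
}

console.log(isCorrect('aB1234567890'));     // true
console.log(isCorrect('Z_12 34\t5678-'));   // true
console.log(isCorrect('ab1234567890'));     // false (second character is lowercase)
console.log(isCorrect('aB123456789'));      // false (only 9 characters in the middle part)
console.log(isCorrect('aB1234567890+'));    // false (+ is not an allowed ending)
```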

Seems complex and ugly?

Well, regular expressions can indeed turn pretty ugly very quickly. But the good thing is that this terse syntax lets them describe superbly complicated patterns in a matter of seconds. (BTW, personally, we don't feel that they're that ugly.)

And the best thing is, you won't find them ugly any more once you're done with this course.

About this course

In this course, we'll take a detailed look over regular expressions in JavaScript.

We'll start with the very basics, understanding such things as how regular expressions work; regular expression literals (//) and the RegExp interface in JavaScript; the replace() string method; and so on.
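As a small preview of these basics, the sketch below creates the same pattern with a literal and with the RegExp constructor, and then uses it with replace(). The variable names and the sample text are only illustrative.

```javascript
// A regular expression literal...
var literal = /\d+/;

// ...and the same pattern built with the RegExp constructor.
// Inside a string, the backslash itself has to be escaped.
var constructed = new RegExp('\\d+');

console.log(literal.source === constructed.source);   // true

// replace() swaps the first match of the pattern with the given text.
var message = 'Order 66 shipped';
var masked = message.replace(literal, '#');

console.log(masked);   // 'Order # shipped'
```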

From here, we'll carry it forward and explore the various concepts in regular expressions in detail.

In particular, we'll see what flags are and how they are used to modify the searching behavior of regular expressions, and then what quantifiers are and how they are used to repeat a particular pattern a given number of times.
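Here's a brief preview of both ideas, as a sketch with arbitrary sample strings: the i and g flags change how a search behaves, while the {2,4} quantifier controls repetition.

```javascript
// The i flag makes matching case-insensitive.
console.log(/hello/.test('HELLO'));    // false
console.log(/hello/i.test('HELLO'));   // true

// The g flag makes replace() act on every match, not just the first one.
console.log('a-b-c'.replace(/-/, '+'));    // 'a+b-c'
console.log('a-b-c'.replace(/-/g, '+'));   // 'a+b+c'

// The quantifier {2,4} requires between 2 and 4 repetitions of \d.
var twoToFourDigits = /^\d{2,4}$/;
console.log(twoToFourDigits.test('123'));     // true
console.log(twoToFourDigits.test('12345'));   // false
```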

We'll then see what character sets are and how we can use them to define a set of characters against which a single character in a test string is matched, before moving on to character classes, which are simply predefined character sets.
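To make the distinction concrete, here's a short sketch with our own variable names: a hand-written character set, followed by the class \d next to the equivalent set [0-9].

```javascript
// A character set: matches any one of the listed characters.
var vowel = /[aeiou]/;
console.log(vowel.test('sky'));   // false
console.log(vowel.test('sea'));   // true

// In JavaScript, the class \d is shorthand for the set [0-9].
var digitClass = /^\d$/;
var digitSet = /^[0-9]$/;
console.log(digitClass.test('7'), digitSet.test('7'));   // true true
console.log(digitClass.test('x'), digitSet.test('x'));   // false false
```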

Following this, we'll learn about grouping patterns within regular expressions using non-capturing and capturing groups. Capturing groups are used for backreferencing, which basically means saving matches for future use.
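As a rough preview (the sample strings and variable names are invented for illustration), capturing groups save parts of a match that can be read afterwards, reused in a replacement, or referred back to inside the pattern itself.

```javascript
// Two capturing groups: a 4-digit year and a 2-digit month.
var datePattern = /(\d{4})-(\d{2})/;
var match = datePattern.exec('Released 2015-06');

console.log(match[0]);   // '2015-06' (the whole match)
console.log(match[1]);   // '2015'    (first group)
console.log(match[2]);   // '06'      (second group)

// In a replacement string, $1 and $2 refer to the saved groups.
var swapped = '2015-06'.replace(datePattern, '$2/$1');
console.log(swapped);   // '06/2015'

// Inside a pattern, \1 refers back to whatever the first group matched.
var repeatedWord = /\b(\w+) \1\b/;
console.log(repeatedWord.test('it is is fine'));   // true
console.log(repeatedWord.test('it is fine'));      // false
```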

Regular expressions in JavaScript can also check what comes before a given pattern, or what comes after it, without counting that surrounding text in the overall regex match. These are known as lookbehinds and lookaheads, respectively. We'll explore them in the regex assertions chapter.
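Here's a minimal sketch of both assertions, using made-up strings. Note that lookbehinds were added in ES2018, so we're assuming a reasonably modern JavaScript engine here.

```javascript
// Lookahead: match digits only if they are followed by ' USD'.
// The ' USD' part is checked but not included in the match.
var price = '100 USD'.match(/\d+(?= USD)/);
console.log(price[0]);   // '100'

// Lookbehind: match digits only if they are preceded by a dollar sign.
// The '$' is checked but not included in the match.
var amount = 'Total: $42'.match(/(?<=\$)\d+/);
console.log(amount[0]);   // '42'
```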

Finally, we'll end with a thorough discussion on the various string methods and RegExp methods where we can use regular expressions to perform searching operations over given strings.

This whole course will make sure that, by the end of it, you feel confident in using regular expressions in your code. You'll be able to construct patterns for nearly any searching problem that you can imagine.

Yes, that's right!

And not only this but using the concepts learnt in this course, you could use regular expressions in other programming languages confidently as well.

So are you ready to begin the learning?