Camel Words Exercise - JavaScript Strings

Objective

Create a function to extract all individual words from a given camel-cased string.

Description

Consider the string 'helloWorld'. As we know, it is in camel case.

Camel case is a casing convention of text whereby given words are joined next to each other, without any space in between, with the first letter of every word uppercased, except for the first one.

So if we have the two words hello and world, then representing these together in camel case would give us the word helloWorld.
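
As a small illustration of the convention, the following sketch joins a list of words into a camel-cased string. The name toCamelCase is hypothetical and not part of this exercise; it merely mirrors the definition above, leaving the first word as it is and uppercasing the first letter of every later word.

```javascript
// Hypothetical helper, for illustration only: joins words into camel case.
function toCamelCase(words) {
   return words.map(function (word, i) {
      // The first word is kept as-is.
      if (i === 0) {
         return word;
      }
      // Every later word gets its first letter uppercased.
      return word.charAt(0).toUpperCase() + word.slice(1);
   }).join('');
}

console.log(toCamelCase(['hello', 'world'])); // helloWorld
```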

The word HTMLElement, obtained from the words HTML and Element, is an example of Pascal case, not camel case. Hence, we don't consider words that begin with an uppercase character here.

In this exercise, you have to create a function getWords() that takes in a camel-cased string and returns an array containing all the individual words in the string.

If the given string is empty (''), you must return an empty array.

Note that you must not worry about the grammar of the given string, i.e. whether or not it's a valid camel-cased string. You should assume that it's a well-formed camel-cased string.

For example, 'to There' is not a well-formed camel-cased string, but thanks to our assumption, we can rest assured that we won't be dealing with such strings in our function.

In addition to this, you should also assume that there are no digits in the given string. This assumption largely simplifies the underlying logic.

Shown below are a couple of examples of the usage of the function:

getWords('helloWorld')

['hello', 'World']

getWords('innerHTML')

['inner', 'HTML']

getWords('lastHTMLElement')

['last', 'HTML', 'Element']

getWords('insertAdjacentElement')

['insert', 'Adjacent', 'Element']

getWords('insertAdjacentHTML')

['insert', 'Adjacent', 'HTML']

getWords('')

[]

Note that there are two special cases to deal with, as shown above:

If a sequence of characters is in uppercase, then it constitutes a single word. For instance, innerHTML is comprised of the two words inner and HTML, where the latter is a sequence of uppercase characters and hence constitutes a single word.
There is an exception to this rule, and that's when a sequence of uppercase characters is followed by a lowercase character. In this case, the sequence, excluding the last character, constitutes a single word. For instance, 'lastHTMLElement' is comprised of the words 'last', 'HTML', and 'Element'. Here, the sequence HTMLE is followed by l, and hence the sequence without its last character, E, constitutes the word 'HTML'.

Once again, note that in none of the strings passed to getWords() above do we have the first letter in uppercase — the first character is always in lowercase, as per the camel case convention.

New file

Inside the directory you created for this course on JavaScript, create a new folder called Exercise-20-Camel-Words and put the .html solution files for this exercise within it.

Solution

So where should we start from?

Well, first let's set up the basic wireframe of the function and then reason about its implementation.

This is done below:

function getWords(str) {
}

Actually, we could define more here, before we start writing the real code. That is, if str is '', we could immediately return []:

function getWords(str) {
   if (str === '') {
      return [];
   }
}

So far, so good.

Let's now get to the real deal: thinking about how to extract all of the words from str. Well, it's really not that difficult to accomplish.

Here's what we could do:

Iterate over each character of str, while keeping an index-tracking variable at hand — let's call it index, initially set to 0 — to help us slice a given substring if we are sure that it represents a single word.
In the iteration, we check whether the current character is uppercase.
If it is one, this means that the preceding substring is a word.
This preceding substring starts at index index (inclusive) and ends at the index of the current character (exclusive).
To extract this word, we use the string slice() method.
In the end, we call slice() once more to slice the string from index up to its very end.

And this is just the algorithm that we need.

Let's take the help of an example to understand this better.

Suppose that str is 'firstChild'. Moreover, index = 0.

Iteration begins from index 0 in 'firstChild'.

f is not uppercase, i is not uppercase, r is again not uppercase, s is not uppercase, and t is not uppercase as well. However, C is an uppercase character and so the preceding substring is a word.

str.slice(index, 5), where 5 is the index of the current character C, helps us extract this word. This gives 'first'.

The variable index is updated to 5, which is precisely where the next word should begin.

Moving on, h is lowercase, i is lowercase, so is l, and so is d. Iteration over str completes, but we haven't yet extracted the second word, i.e. Child. To do this, str.slice(index) is called after the iteration ends, which extracts a word from index index up to the very end of str. This gives 'Child'.
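
The two slice() calls from this walkthrough can be tried out directly. The snippet below is only a sketch of the example above, with the index 5 written in by hand rather than found by a loop:

```javascript
var str = 'firstChild';
var index = 0;

// 'C' sits at index 5, so the first word spans indexes 0 to 4.
var first = str.slice(index, 5);
index = 5;

// With no second argument, slice() runs up to the end of the string.
var second = str.slice(index);

console.log(first, second); // first Child
```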

Simple?

In the code below, we implement this algorithm in getWords(), adding each extracted word onto an array words which ought to be returned in the end by the function:

function getWords(str) {
   if (str === '') {
      return [];
   }

   var words = [];
   var index = 0;
   for (var i = 0, len = str.length; i < len; i++) {
      if ('A' <= str[i] && str[i] <= 'Z') {
         words.push(str.slice(index, i));
         index = i;
      }
   }
   words.push(str.slice(index));

   return words;
}

Now the question is: can we be sure that this code works correctly for every input as shown in the exercise's description above?

Well, the best way is to just go and check it:

getWords('helloWorld')

['hello', 'World']

getWords('innerHTML')

['inner', 'H', 'T', 'M', 'L']

Oops! The second call here, i.e. getWords('innerHTML'), produces the wrong result. It considers HTML to be four words, whereas it's just one single word.

Clearly, something has to be changed/added to our getWords() function.

But what?

Well, let's walk through the algorithm with str set to 'innerHTML'. This will help us see exactly how our current approach leads to the wrong output for 'innerHTML' and then think about how to solve it.

Suppose that str is 'innerHTML', and that index = 0.

Iteration begins from index 0 in 'innerHTML'.

i is not uppercase, n is not uppercase, n is again not uppercase, e is not uppercase, and r is not uppercase. However, H is an uppercase character and so the preceding substring is a word.

str.slice(index, 5) is invoked, giving us the first word 'inner'. Thereby, index is updated to 5.

Moving on, T is also an uppercase character and so the preceding substring is again a word. str.slice(index, 6) is invoked (where index = 5), giving us the second word 'H'. index is updated to 6.

Going further, M is again an uppercase character and so the preceding substring is again a word. str.slice(index, 7) is invoked (where index = 6), giving us the third word 'T'. index is updated to 7.

Finally, L is yet another uppercase character, and so str.slice(index, 8) is invoked (where index = 7), giving us the fourth word 'M'. index is updated to 8.

Iteration ends, and so the last word extraction is made via str.slice(index). This gives us 'L'.
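
To watch this walkthrough happen, we could log every slice() call as it is made. The sketch below copies the first version of getWords() under the hypothetical name getWordsNaive and adds the logging; apart from the console.log() lines, the logic is the same.

```javascript
// Same logic as the first getWords() above, plus a log line per slice() call.
// The name getWordsNaive is made up for this illustration.
function getWordsNaive(str) {
   if (str === '') {
      return [];
   }

   var words = [];
   var index = 0;
   for (var i = 0, len = str.length; i < len; i++) {
      if ('A' <= str[i] && str[i] <= 'Z') {
         console.log('slice(' + index + ', ' + i + ') gives ' + str.slice(index, i));
         words.push(str.slice(index, i));
         index = i;
      }
   }
   console.log('slice(' + index + ') gives ' + str.slice(index));
   words.push(str.slice(index));

   return words;
}

getWordsNaive('innerHTML');
```

Each uppercase character triggers its own slice() call, which is exactly why HTML ends up split into four pieces.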

The problem is pretty apparent. Whenever we encounter an uppercase character, we just perform the preceding word's extraction without even checking if the uppercase character is preceded by another uppercase character.

The question remains: how to solve this?

Well, a pretty straightforward way is to check the preceding character each time an uppercase character is encountered.

If the preceding character is NOT uppercase, i.e. it's lowercase, this means that it's part of that preceding word, and so we should extract that word.
Otherwise, we shouldn't do anything.

Simple?

Adding this notion into our current approach, we get the following code:

function getWords(str) {
   if (str === '') {
      return [];
   }

   var words = [];
   var index = 0;
   for (var i = 0, len = str.length; i < len; i++) {
      if ('A' <= str[i] && str[i] <= 'Z'
      && 'a' <= str[i - 1] && str[i - 1] <= 'z') {
         words.push(str.slice(index, i));
         index = i;
      }
   }
   words.push(str.slice(index));

   return words;
}
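
One detail in the conditional above deserves a quick check. When i is 0, str[i - 1] is str[-1], which is undefined. For a well-formed camel-cased string the uppercase check already fails at that point, but even if it didn't, comparing a string with undefined using <= yields false, so no separate out-of-range handling is needed. A minimal sketch to confirm this:

```javascript
var str = 'innerHTML';

// Reading an index outside the string gives undefined, not an error.
console.log(str[-1]);        // undefined

// A relational comparison between a string and undefined is always false.
console.log('a' <= str[-1]); // false
console.log(str[-1] <= 'z'); // false
```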

To our surprise, even this code fails to address an edge case.

Let's see if you could figure out that edge case. Hint: it's already there in the console snippet in the exercise's description.

The problem with the approach above is that it only checks the preceding character when an uppercase character is encountered, not the next character.

Consider the third statement below:

getWords('helloWorld')

['hello', 'World']

getWords('innerHTML')

['inner', 'HTML']

getWords('lastHTMLElement')

['last', 'HTMLElement']

As stated in the description of this exercise, if a sequence of uppercase characters is followed by a lowercase character, then the sequence, excluding the last character, constitutes a word.

Hence, the sequence of uppercase characters HTMLE in 'lastHTMLElement' constitutes the word 'HTML', not 'HTMLE', as it's followed by l.

So how to address this case?

Well, we just need to add another check to our existing if conditional. With this check added, the if statement would be read as follows:

If the current character is uppercase and (the previous one is lowercase or the next one is lowercase), then the preceding word must be extracted.

The part 'or the next one is lowercase' is the new condition here, while the parentheses represent a grouped expression.

Converting this idea into the glyphs of code, we get the following:

function getWords(str) {
   if (str === '') {
      return [];
   }

   var words = [];
   var index = 0;
   for (var i = 0, len = str.length; i < len; i++) {
      if ('A' <= str[i] && str[i] <= 'Z'
      && ('a' <= str[i - 1] && str[i - 1] <= 'z'
      || 'a' <= str[i + 1] && str[i + 1] <= 'z'))
      {
         words.push(str.slice(index, i));
         index = i;
      }
   }
   words.push(str.slice(index));

   return words;
}

To improve the readability of the code, we've moved the starting brace ({) of the if statement above onto a new line.
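
If the conditional still feels dense, one option is to factor the character checks into small helper functions. The sketch below is just a variation under that assumption, not the exercise's official solution: the names isUpper, isLower and getWordsWithHelpers are hypothetical, and the helpers explicitly treat undefined (from reading one index before the start or after the end of the string) as neither uppercase nor lowercase.

```javascript
// Hypothetical helpers: both return false for undefined.
function isUpper(ch) {
   return ch !== undefined && 'A' <= ch && ch <= 'Z';
}

function isLower(ch) {
   return ch !== undefined && 'a' <= ch && ch <= 'z';
}

// Same algorithm as the final getWords(), rewritten with the helpers.
function getWordsWithHelpers(str) {
   if (str === '') {
      return [];
   }

   var words = [];
   var index = 0;
   for (var i = 0, len = str.length; i < len; i++) {
      // A new word starts at an uppercase character that has a
      // lowercase neighbor on either side.
      if (isUpper(str[i]) && (isLower(str[i - 1]) || isLower(str[i + 1]))) {
         words.push(str.slice(index, i));
         index = i;
      }
   }
   words.push(str.slice(index));

   return words;
}

console.log(getWordsWithHelpers('lastHTMLElement')); // ['last', 'HTML', 'Element']
```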

And now, the function getWords() works just as desired, without leaving out any edge cases.

Below we try it out on a couple of strings:

getWords('helloWorld')

['hello', 'World']

getWords('innerHTML')

['inner', 'HTML']

getWords('lastHTMLElement')

['last', 'HTML', 'Element']

getWords('insertAdjacentElement')

['insert', 'Adjacent', 'Element']

getWords('insertAdjacentHTML')

['insert', 'Adjacent', 'HTML']

getWords('')

[]
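
These checks can also be run in one go. The sketch below repeats the final getWords() so that it runs on its own, and compares each result against the expected array from the exercise's description. The cases array and the JSON.stringify() comparison are just one simple way of doing this, not something the exercise requires.

```javascript
function getWords(str) {
   if (str === '') {
      return [];
   }

   var words = [];
   var index = 0;
   for (var i = 0, len = str.length; i < len; i++) {
      if ('A' <= str[i] && str[i] <= 'Z'
      && ('a' <= str[i - 1] && str[i - 1] <= 'z'
      || 'a' <= str[i + 1] && str[i + 1] <= 'z'))
      {
         words.push(str.slice(index, i));
         index = i;
      }
   }
   words.push(str.slice(index));

   return words;
}

// Each entry pairs an input string with the array we expect back.
var cases = [
   ['helloWorld', ['hello', 'World']],
   ['innerHTML', ['inner', 'HTML']],
   ['lastHTMLElement', ['last', 'HTML', 'Element']],
   ['insertAdjacentElement', ['insert', 'Adjacent', 'Element']],
   ['insertAdjacentHTML', ['insert', 'Adjacent', 'HTML']],
   ['', []]
];

for (var c = 0; c < cases.length; c++) {
   var actual = getWords(cases[c][0]);
   // Arrays of strings can be compared via their JSON representations.
   var passed = JSON.stringify(actual) === JSON.stringify(cases[c][1]);
   console.log((passed ? 'PASS' : 'FAIL') + ": getWords('" + cases[c][0] + "')");
}
```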

Superb!
