PHP String Transformations

Chapter 21 26 mins

Learning outcomes:

  1. What are string transformations
  2. The htmlspecialchars() function
  3. The htmlspecialchars_decode() function
  4. The htmlentities() function
  5. The html_entity_decode() function

Introduction

Later in this course, when we explore PHP as a web server technology for delivering HTML files, we'll need to transform strings received as input in order to sanitize them against dangerous interpretations.

The exact meaning of 'dangerous' will become clear only once we start looking into PHP as a web server tool that delivers HTML files. But for now, since we're learning about strings in PHP, it's the right time to cover some string functions that allow us to transform strings.

By 'transform' we mean to replace given characters, or sequences of characters, in the string with other characters, or sequences of characters.
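To make the idea of a transformation concrete, here's a minimal sketch that replaces two characters with longer sequences using PHP's strtr() function. The $map array is hand-written purely for illustration; it is not how the functions covered in this chapter are implemented.

```php
<?php

// A hand-written mapping, purely for illustration.
$map = [
    '<' => '&lt;',
    '>' => '&gt;',
];

$original = 'a < b > c';

// strtr() replaces every key of $map found in the string with its value.
$transformed = strtr($original, $map);

echo $transformed; // a &lt; b &gt; c
```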

Special characters in HTML

As you may already know, some characters have a special meaning in HTML. The most obvious ones are the angle brackets, < and >, which are used to denote tags.

If a PHP string obtained from an untrusted source contains these characters and is then used to generate text in an HTML file, this could be extremely problematic.

We don't want our application to make it easy to inject an arbitrary piece of HTML at any location. This could have severe consequences for our web application.

This vulnerability and its associated security threat have been so common on the World Wide Web that they go by a dedicated name: XSS, short for Cross-Site Scripting.

Read more about XSS at OWASP - Cross Site Scripting (XSS).

Consider the following code:

<?php

// String contains HTML code.
$str = "Hello <script>alert('World!')</script>";

echo $str;
Hello <script>alert('World!')</script>

The string $str contains some HTML code, including a <script> tag, which is output as it is.

If this PHP file was used to generate an HTML file, with echo producing the HTML file's content, the end user seeing the HTML would have the JavaScript program (embedded inside <script>) executed in the browser.

The real threat arises from the fact that this script is written by an unknown person (we're assuming that we obtained $str from an untrusted source) yet it runs on our web application, which is trusted.

We'll look into XSS in detail later on; right now, it's most important to understand the tools that PHP puts at our disposal to prevent this kind of threat.

So how do we make $str above less dangerous?

Well, one straightforward way is to use the htmlspecialchars() function.

The htmlspecialchars() function

The htmlspecialchars() function takes a string and converts all characters that have a special meaning in HTML to their corresponding HTML entities.

There are essentially just 5 such characters: <, >, &, ' and ".

If you're not familiar with them, HTML entities are more or less like the escape sequences that we encountered in PHP strings: they can be used to denote characters that are otherwise special in HTML, without those characters being parsed as special.

For instance, < is denoted by the following entity: &lt;. An HTML parser would scan &lt; and produce the corresponding < character when generating output in the browser.

See how similar this is to using \' to denote ' in a single-quoted PHP string — we're able to denote the ' character without it being parsed as the end of the string.
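As a quick illustration of the five characters mentioned above, the following sketch prints each one next to the entity that htmlspecialchars() produces for it. The flags and the 'UTF-8' encoding are passed explicitly here (an assumption of this example) so that the result doesn't depend on the defaults of a particular PHP version.

```php
<?php

// Explicit flags: convert both kinds of quotes, using HTML 4.01 entities.
$flags = ENT_QUOTES | ENT_HTML401;

$entities = [];
foreach (['<', '>', '&', "'", '"'] as $char) {
    $entities[$char] = htmlspecialchars($char, $flags, 'UTF-8');
    echo $char, ' => ', $entities[$char], "\n";
}

// Prints:
// < => &lt;
// > => &gt;
// & => &amp;
// ' => &#039;
// " => &quot;
```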

Here's the syntax of htmlspecialchars():

htmlspecialchars(
   $str,
   $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401,
   $encoding = null,
   $double_encode = true
)

The first $str argument is the string whose special characters we want to escape.

The second optional $flags argument configures the conversion behavior of the function. As stated in the official documentation of htmlspecialchars(), $flags is a bitmask.

The third $encoding argument is also optional and specifies the encoding of $str. Some of the possible values for $encoding are: 'ISO-8859-1', 'ISO-8859-5', 'UTF-8', 'cp1251', 'cp1252', 'Shift_JIS'.

Note that this is not the complete list of the possible values for $encoding. Some values also have aliases. For example, 'ISO-8859-1' can also be expressed as 'ISO8859-1'.

For the complete set of values, refer to the official documentation of htmlspecialchars().

Finally, the fourth $double_encode argument specifies whether or not to double encode the given string $str. 'Double encode' here means to convert the ampersand (&) symbol, even if it is part of an existing HTML entity. By default, it's true.
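Since $flags is a bitmask, individual flags are combined with the bitwise OR (|) operator and can be tested with the bitwise AND (&) operator. The following is a minimal sketch of that idea; the variable names are made up for this example.

```php
<?php

// Combine two flags into a single integer with bitwise OR.
$flags = ENT_QUOTES | ENT_HTML5;

// Test whether a flag is present with bitwise AND.
$has_quotes = ($flags & ENT_QUOTES) === ENT_QUOTES;

// ENT_HTML5 on its own doesn't include the bits of ENT_QUOTES.
$html5_alone_has_quotes = (ENT_HTML5 & ENT_QUOTES) === ENT_QUOTES;

var_dump($has_quotes);             // bool(true)
var_dump($html5_alone_has_quotes); // bool(false)
```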

The point of $encoding in htmlspecialchars()

As you may know, text files don't store their encoding along with their actual data, nor even in their metadata. The same applies to PHP files, which are merely text files.

That is, when a PHP engine parses a given PHP file, it can't really tell what its encoding is. The developer, though, might already be aware of the encoding. For instance, we might create a new PHP file based on the ISO-8859-1 character set for a legacy project and then obviously remember that the charset used in the file is ISO-8859-1.

Now since the PHP engine itself isn't able to determine the given file's character encoding, using htmlspecialchars() in such a file might produce unexpected results.

Let's consider an example to help understand this better.

The Г character is represented as the byte value 0xB3 (i.e. B3 in hexadecimal) in ISO-8859-5 but as the byte sequence 0xD0 0x93 in UTF-8.

Now suppose that we have a PHP file based on the ISO-8859-5 charset and that it has a string $str in it, containing the Г character.

If we call htmlspecialchars() on this string, without a value for $encoding, the function would assume PHP's default encoding, which typically is UTF-8. In other words, the function would assume that the given string is encoded in UTF-8.

Now because just the byte value 0xB3 followed by nothing is invalid in UTF-8, this would get htmlspecialchars() to replace the byte with the byte sequence of the Unicode Replacement Character, despite the fact that there was absolutely no such need, since we weren't dealing in UTF-8 but rather in ISO-8859-5.

To summarize it, because PHP files (in fact, all text files) don't carry character encoding info along with them, it's error-prone to run htmlspecialchars() inside them without specifying the $encoding.

If we know the PHP file's encoding, it's a good idea to provide it. This especially applies to legacy projects that don't work with today's modern and standard UTF-8 encoding scheme.
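The scenario described above can be sketched as follows. The string is written with the escape sequence "\xB3" so that the example doesn't depend on how this file itself is saved, and bin2hex() is used to inspect the raw bytes of each result. The flags are passed explicitly as an assumption of this example, so that ENT_SUBSTITUTE is active regardless of the PHP version.

```php
<?php

// A single byte 0xB3: the letter Г in ISO-8859-5, but invalid on its own in UTF-8.
$str = "\xB3";
$flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401;

$as_utf8 = htmlspecialchars($str, $flags, 'UTF-8');
$as_iso  = htmlspecialchars($str, $flags, 'ISO-8859-5');

// bin2hex() shows the raw bytes regardless of the terminal's encoding.
echo bin2hex($as_utf8), "\n"; // efbfbd (U+FFFD, the Unicode Replacement Character)
echo bin2hex($as_iso), "\n";  // b3 (the byte is left untouched)
```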

Hopefully this clarifies the purpose of $encoding.

Let's consider a couple of examples of using htmlspecialchars().

In the code below, we call htmlspecialchars() on $str in order to escape all the special-for-HTML characters in it:

<?php

$str = "Hello <script>alert('World!')</script>";

echo htmlspecialchars($str);
Hello &lt;script&gt;alert(&#039;World!&#039;)&lt;/script&gt;

Notice the entities in the output here. Compared to the previous code's output, all the special characters have now been replaced by their corresponding HTML entities.

If an HTML engine were to parse this piece of text, there would be no <script> tag to be executed — XSS threat vector neutralized.
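In practice, calls like this are often wrapped in a small helper so that the same flags and encoding are applied everywhere. Here's a minimal sketch of such a wrapper. Note that escape_html() is a hypothetical name chosen for this example, not a built-in PHP function, and that the sketch assumes the application works in UTF-8.

```php
<?php

// Hypothetical helper, not a built-in: escapes a value before it is
// placed into HTML text, assuming the application works in UTF-8.
function escape_html(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8');
}

// Pretend this came from an untrusted source, such as a form field.
$comment = "<script>alert('World!')</script>";

$html = '<p>' . escape_html($comment) . '</p>';

echo $html;
// <p>&lt;script&gt;alert(&apos;World!&apos;)&lt;/script&gt;</p>
```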

Let's consider another example.

In the transformation above, the ' character gets replaced by the entity &#039;. This is the old way, specifically the one that was used in HTML4. A more modern entity is &apos;.

In order to get htmlspecialchars() to convert ' to &apos;, we ought to use the $flags argument. Specifically, we need the flag ENT_HTML5, as shown below:

<?php

$str = "Hello <script>alert('World!')</script>";

echo htmlspecialchars($str, ENT_HTML5);
Hello &lt;script&gt;alert('World!')&lt;/script&gt;

Wait a second... The output contains ' as it is.

We wanted ' to be converted to &apos; instead of to &#039;, but with the ENT_HTML5 flag set, it just fails to be converted into anything!

There's something wrong.

Well, the problem is the absence of the ENT_QUOTES flag. ENT_QUOTES means that both quote characters (' and ") should be transformed to entities. Exactly which entity the ' character gets converted to depends upon the flags ENT_HTML401 and ENT_HTML5.

In the code above, we indeed used ENT_HTML5. But we didn't use the ENT_QUOTES flag. In order to use both of these flags together, we ought to use the bitwise OR (|) operator, as follows:

<?php

$str = "Hello <script>alert('World!')</script>";

echo htmlspecialchars($str, ENT_QUOTES | ENT_HTML5);
Hello &lt;script&gt;alert(&apos;World!&apos;)&lt;/script&gt;

So far, so good.

Let's consider yet another example, this time demonstrating the $double_encode parameter.

In the code below, the given string $str contains an existing entity, &lt;. Two calls are made to htmlspecialchars(), one with $double_encode set to true and one with it set to false:

<?php

$str = "&lt; is an existing entity.";

echo htmlspecialchars($str, ENT_QUOTES | ENT_HTML5, null, true), "\n";
echo htmlspecialchars($str, ENT_QUOTES | ENT_HTML5, null, false);
&amp;lt; is an existing entity.
&lt; is an existing entity.

First of all, notice that since $double_encode is the last parameter, passing it positionally requires us to provide all the arguments before it as well.

A null value for $encoding, as used in the code above, means that htmlspecialchars() would use the default encoding, which is typically UTF-8.
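If spelling out $flags and $encoding just to reach $double_encode feels noisy, named arguments are one alternative. The following is a minimal sketch, assuming PHP 8.0 or later (where named arguments are available); $flags and $encoding simply keep their default values.

```php
<?php

$str = "&lt; is an existing entity.";

// Named argument (PHP 8.0+): only $double_encode is overridden.
$result = htmlspecialchars($str, double_encode: false);

echo $result; // &lt; is an existing entity.
```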

Besides this, focusing on the output produced, the first htmlspecialchars() call converts the ampersand (&) character to its corresponding HTML entity &amp;. The second call, however, leaves it as it is, simply because $double_encode is false in this case.

As a matter of fact, remember that with $double_encode set to true, the ampersand (&) character will always be transformed regardless of whether it's part of an HTML entity or not.

This can be seen as follows:

<?php

$str = "An ampersand: &";

echo htmlspecialchars($str);
An ampersand: &amp;

The htmlspecialchars_decode() function

htmlspecialchars_decode() is, as per the name, the opposite of htmlspecialchars(). It decodes special HTML entities from a string to their corresponding characters.

Here's the syntax of the function:

htmlspecialchars_decode(
   $str,
   $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401
)

As before, $str is the string that we want to transform, while $flags configures the transformation behavior.

You might've noticed that there isn't an argument for $encoding in htmlspecialchars_decode(), unlike in htmlspecialchars(). This seems contrary to what we'd typically expect.

Let's work out why this is the case.

Why is there no $encoding argument for htmlspecialchars_decode()?

Upon inspecting the implementation of htmlspecialchars_decode() and htmlspecialchars(), available on GitHub in the PHP Interpreter repository, we do find some reasoning as to why the former function doesn't have an $encoding parameter.

First of all, note that when converting the few special characters to entities, htmlspecialchars() doesn't actually use the $encoding provided to it, internally. This makes perfect sense because the special characters all have the same encoding in all the possible character sets.

So then what does htmlspecialchars() use $encoding for?

Well, it only uses $encoding to detect invalid byte sequences in the input string, so that the output conforms to certain HTML doctypes. HTML5 is a bit lenient, but older versions and XHTML weren't: if a particular charset is chosen, they require the underlying markup file to be valid for that charset.

Think about it: htmlspecialchars() is meant to be used in producing text that'll ultimately be inlined in an HTML document and end up in some HTML parsing engine. Conformance to the given charset is more than just important here; it's a requirement!

And so it's crucial for the developers of the PHP engine to make sure that htmlspecialchars() fixes any string that has invalid bytes in it for a given charset.

As for htmlspecialchars_decode(), it isn't meant to produce text for an HTML file. Instead, it's simply meant to decode some special entities in text taken from an HTML file. Obviously, in the decoding itself, there's no need for any $encoding, since the encoding of all the special characters is the same in all possible charsets.

However, for conformance to a particular charset, we could've had an $encoding in htmlspecialchars_decode(), but because the string provided to the function represents text from an HTML file, it's good to assume that it already conforms to the underlying charset (most probably, the HTML file was generated from PHP, with normalization done by htmlspecialchars() for the underlying charset and so it's already valid).

So to answer it concretely: htmlspecialchars_decode() doesn't have an $encoding parameter because the encoding of the characters transformed by it and by htmlspecialchars() is the same in all charsets and because there's no need to enforce conformance to a given charset for this function.

All this information is solely based on what we were able to infer from PHP's source code. A definitive answer, though, can obviously only be given by the authors of these string functions.
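The behavior inferred above can be observed with a small experiment. In this sketch, the input deliberately mixes a byte that is invalid in UTF-8 on its own (0xB3), a valid UTF-8 sequence (0xD0 0x93) and two entities. The expectation, in line with the reasoning above, is that only the entities change while every other byte passes through as it is.

```php
<?php

// 0xB3 alone is not valid UTF-8; 0xD0 0x93 is the UTF-8 encoding of Г.
$input = "\xB3 &lt; \xD0\x93 &amp;";

$decoded = htmlspecialchars_decode($input);

// Only &lt; and &amp; were decoded; the other bytes are untouched.
echo bin2hex($decoded); // b3203c20d0932026
```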

Time for some quick examples.

In the following code, we decode the string that we obtained at the start of the previous section using htmlspecialchars(), this time stored in $encoded_str:

<?php

$str = "Hello <script>alert('World!')</script>";
$encoded_str = htmlspecialchars($str);

echo $encoded_str, "\n";
echo htmlspecialchars_decode($encoded_str);
Hello &lt;script&gt;alert(&#039;World!&#039;)&lt;/script&gt;
Hello <script>alert('World!')</script>

The first output shows the encoded string while the second output shows the string obtained by decoding this encoded string.

Moving on, if a string has been encoded using ENT_HTML5 in addition to ENT_QUOTES, calling htmlspecialchars_decode() on the resulting string, without any flags, won't give back the ' characters.

The following code demonstrates this:

<?php

$str = "Hello <script>alert('World!')</script>";
$encoded_str = htmlspecialchars($str, ENT_QUOTES | ENT_HTML5);

echo $encoded_str, "\n";
echo htmlspecialchars_decode($encoded_str);
Hello &lt;script&gt;alert(&apos;World!&apos;)&lt;/script&gt;
Hello <script>alert(&apos;World!&apos;)</script>

See how the string returned by htmlspecialchars_decode() contains the &apos; entities — they haven't been decoded to '.

In order to obtain the original string back in this case, it's important to use the same set of flags that we used while calling htmlspecialchars().

This is shown below:

<?php

$str = "Hello <script>alert('World!')</script>";
$encoded_str = htmlspecialchars($str, ENT_QUOTES | ENT_HTML5);

echo $encoded_str, "\n";
echo htmlspecialchars_decode($encoded_str, ENT_QUOTES | ENT_HTML5);
Hello &lt;script&gt;alert(&apos;World!&apos;)&lt;/script&gt;
Hello <script>alert('World!')</script>

Perfect!

If we remove the ENT_QUOTES flag from the htmlspecialchars_decode() call above, the function won't consider decoding entities representing the ' character.
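Here's a short sketch of this last point, decoding the same string once with ENT_QUOTES and once without it. The variable names are made up for this example.

```php
<?php

$encoded = "alert(&apos;World!&apos;) &lt;";

// With ENT_QUOTES, &apos; is decoded back to the ' character.
$with_quotes = htmlspecialchars_decode($encoded, ENT_QUOTES | ENT_HTML5);

// Without ENT_QUOTES, &apos; is left alone, while &lt; is still decoded.
$without_quotes = htmlspecialchars_decode($encoded, ENT_HTML5);

echo $with_quotes, "\n";    // alert('World!') <
echo $without_quotes, "\n"; // alert(&apos;World!&apos;) <
```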

The htmlentities() function

htmlentities() is a more comprehensive version of htmlspecialchars().

Unlike htmlspecialchars(), it converts every character that has a corresponding HTML entity into that entity, not just the five special ones.

Here's the syntax of the function:

htmlentities(
   $str,
   $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401,
   $encoding = null,
   $double_encode = true
)

As you can see, the syntax of htmlentities() is identical to that of htmlspecialchars().

This makes perfect sense: both functions do the same thing, i.e. convert certain characters to entities. The only difference is that htmlentities() converts all convertible characters to entities while htmlspecialchars() converts only five characters (<, >, &, ' and ").

But keep in mind that the $encoding parameter for htmlentities() does have its significance besides producing text that conforms to the given character encoding.

$encoding has its importance in htmlentities()

The same character can map to different bytes in different character sets. For example, the superscript 3 character, denoted as ³, is represented as the byte value 0xB3 in ISO-8859-1 but as the byte sequence 0xC2 0xB3 in UTF-8.

This means that if we have a piece of text to be passed through htmlentities(), it's a really good idea to specify the encoding of the text, i.e. specify the character set used via the $encoding argument. In this way, we'll be able to prevent unexpected outcomes.

As a quick example, suppose that an input string contains only a single byte whose value is 0xB3. Given that we specify the charset to be ISO-8859-1 while calling htmlentities(), it would correctly get encoded to the corresponding entity &sup3;.

However, if we don't specify the charset, the function would assume PHP's default charset, which is typically UTF-8. Now because the byte value 0xB3 alone is invalid in UTF-8, it would get converted to the Unicode Replacement Character.

See the difference in the transformation?

With ISO-8859-1, the byte 0xB3 (representing ³) converts to &sup3;, while with UTF-8, it converts to the Unicode Replacement Character.

To boil it down, $encoding is essential in the case of htmlentities(). And not just in htmlentities(), but also in its reverse function, html_entity_decode(), as we shall explore later on below.
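The difference described above can be sketched in code as follows. The byte is written as "\xB3" so that the example is independent of how this file is saved, and the flags are passed explicitly (an assumption of this example) so that ENT_SUBSTITUTE is active regardless of the PHP version.

```php
<?php

// A single byte 0xB3: the character ³ in ISO-8859-1, invalid on its own in UTF-8.
$str = "\xB3";
$flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401;

$as_latin1 = htmlentities($str, $flags, 'ISO-8859-1');
$as_utf8   = htmlentities($str, $flags, 'UTF-8');

echo $as_latin1, "\n";        // &sup3;
echo bin2hex($as_utf8), "\n"; // efbfbd (U+FFFD, the Unicode Replacement Character)
```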

Let's consider a bunch of examples of using htmlentities().

In the following code, we run htmlentities() on a given heredoc string:

<?php

$str = <<<END
³ is called 'Superscript Three'.
≈ is called 'Almost Equal To'.
< is called 'Less-Than Sign'.
END;

echo htmlentities($str);
&sup3; is called &#039;Superscript Three&#039;.
&asymp; is called &#039;Almost Equal To&#039;.
&lt; is called &#039;Less-Than Sign&#039;.

Notice how the characters ³ and ≈ have been converted to their corresponding HTML entities (unlike in htmlspecialchars()), and obviously < and ' too (like in htmlspecialchars()).

Let's now suppose that we don't want to transform the ' characters around the names of the entities in $str above. How do we do that?

Fortunately, using the ENT_NOQUOTES flag, this is really easy:

<?php

$str = <<<END
³ is called 'Superscript Three'.
≈ is called 'Almost Equal To'.
< is called 'Less-Than Sign'.
END;

echo htmlentities($str, ENT_NOQUOTES);
&sup3; is called 'Superscript Three'.
&asymp; is called 'Almost Equal To'.
&lt; is called 'Less-Than Sign'.

To better understand the difference between htmlentities() and htmlspecialchars(), let's include the latter as well in this code and see the output:

<?php

$str = <<<END
³ is called 'Superscript Three'.
≈ is called 'Almost Equal To'.
< is called 'Less-Than Sign'.
END;

echo htmlentities($str, ENT_NOQUOTES), "\n\n";
echo htmlspecialchars($str, ENT_NOQUOTES);
&sup3; is called 'Superscript Three'.
&asymp; is called 'Almost Equal To'.
&lt; is called 'Less-Than Sign'.

³ is called 'Superscript Three'.
≈ is called 'Almost Equal To'.
&lt; is called 'Less-Than Sign'.

The html_entity_decode() function

Just as we have a reverse function for htmlspecialchars(), i.e. htmlspecialchars_decode(), there exists a reverse function for htmlentities(): html_entity_decode().

In particular, html_entity_decode() takes a string and decodes all the HTML entities in it that apply to the given character set into their corresponding characters.

One might've expected the reverse function to be called 'htmlentities_decode', but unfortunately the name can't be changed now without breaking compatibility with older versions of PHP.

The syntax of html_entity_decode() is as follows:

html_entity_decode(
   $str,
   $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401,
   $encoding = null
)

The first two arguments, $str and $flags, work exactly as in htmlspecialchars_decode(). The additional third $encoding argument specifies the encoding of the given string.

The reason we have the $encoding parameter in html_entity_decode() but not in htmlspecialchars_decode() was discussed in the previous section on htmlentities().
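To see the effect of $encoding on decoding, consider the following minimal sketch. It decodes the same entity for two different character sets and uses bin2hex() to show the bytes produced in each case. The flags are passed explicitly only because $encoding is the third parameter.

```php
<?php

$entity = '&sup3;';
$flags = ENT_QUOTES | ENT_HTML401;

$latin1 = html_entity_decode($entity, $flags, 'ISO-8859-1');
$utf8   = html_entity_decode($entity, $flags, 'UTF-8');

// The same character, ³, but as different bytes in each charset.
echo bin2hex($latin1), "\n"; // b3
echo bin2hex($utf8), "\n";   // c2b3
```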

Let's consider an example using html_entity_decode().

In the following code, we decode a string containing a couple of entities using html_entity_decode():

<?php

$str = <<<END
&sup3; is called 'Superscript Three'.
&asymp; is called 'Almost Equal To'.
&lt; is called 'Less-Than Sign'.
END;

echo html_entity_decode($str);
³ is called 'Superscript Three'.
≈ is called 'Almost Equal To'.
< is called 'Less-Than Sign'.

Simple, wasn't it?