Later in this course, when we explore how to use PHP as a web server technology to deliver HTML files, we'll come across the need to transform input strings in order to sanitize them against dangerous interpretations.
The exact meaning of 'dangerous' will become clear only once we start looking into PHP as a web server tool that delivers HTML files. But for now, since we're learning about strings in PHP, it's the right time to cover some string functions that allow us to transform strings.
By 'transform', we mean replacing given characters, or sequences of characters, in a string with other characters, or sequences of characters.
Special characters in HTML
As you may already know, some characters have a special meaning in HTML. The most obvious ones are < and >. The angle brackets (< and >) are used to denote tags in HTML.
If a PHP string contains these characters, was obtained from an untrusted source, and is then used to generate text in an HTML file, the result could be extremely problematic.
We don't want to build our application in a way that makes it damn easy to inject an arbitrary piece of HTML at any location. This could have severe consequences for our web application.
Such a vulnerability, and its associated security threat, was so common at one time on the World Wide Web that it goes by a fancy name: XSS, short for Cross-Site Scripting.
Consider the following code:
<?php

// String contains HTML code.
$str = "Hello <script>alert('World!')</script>";

echo $str;
$str contains some HTML code, including a <script> tag, which is output as it is.
If this PHP file was used to generate an HTML file, with the output above placed in it, we'd get the script (inside <script>) executed in the browser.
The real threat arises from the fact that this script is written by an unknown person (we're assuming that we obtained $str from an untrusted source), yet it runs on our web application, which is trusted.
We'll look into XSS in detail later on; right now, it's of utmost importance to understand the tools that PHP puts at our disposal to prevent this kind of threat.
So how do we make $str above less dangerous?
Well, one straightforward way is to use the htmlspecialchars() function.
htmlspecialchars()
The htmlspecialchars() function takes a string and converts all characters that have a special meaning in HTML to their corresponding HTML entities.
There are essentially just five such characters:
& (ampersand), converted to &amp;
" (double quote), converted to &quot;
' (single quote), converted to &#039; or &apos;, depending on the flags used
< (less than), converted to &lt;
> (greater than), converted to &gt;
If you're not familiar with them, HTML entities are more or less like the escape sequences that we encountered in PHP strings: they can be used to denote characters in HTML that are otherwise special, without being parsed as special.
For instance, < is denoted by the following entity: &lt;. An HTML parser would scan &lt; and produce the corresponding < character when generating output in the browser.
It's just like using \' in a single-quoted PHP string: we're able to denote the ' character without it being parsed as the end of the string.
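To make the mapping concrete, here's a small sketch that runs each of the five characters through htmlspecialchars() and prints the entity produced for it. The flags and the 'UTF-8' encoding are passed explicitly, as an assumption on our part, so that the result doesn't depend on the PHP version's defaults. The $special_chars and $entities names are purely illustrative.

```php
<?php

// The five characters that htmlspecialchars() treats as special.
$special_chars = ['&', '"', "'", '<', '>'];

// Maps each character to the entity produced for it.
$entities = [];

foreach ($special_chars as $char) {
    // ENT_QUOTES converts both kinds of quotes; ENT_HTML401 selects HTML 4.01 entities.
    $entities[$char] = htmlspecialchars($char, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    echo $char, ' becomes ', $entities[$char], "\n";
}
```

With these particular flags, the single quote maps to the numeric entity &#039;, since HTML 4.01 has no named entity for it.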
Here's the syntax of htmlspecialchars():
htmlspecialchars( $str, $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401, $encoding = null, $double_encode = true )
The first $str argument is the string that we want to normalize for special characters.
The second optional $flags argument configures the conversion behavior of the function. As stated in the official documentation of htmlspecialchars(), $flags is a bitmask.
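Since $flags is a bitmask, individual flags are combined with the bitwise OR (|) operator and can be tested with the bitwise AND (&) operator. The following is a minimal sketch of that idea, using constants that PHP defines for this function; the variable names are our own.

```php
<?php

// Combine two flags into a single integer bitmask.
$flags = ENT_QUOTES | ENT_SUBSTITUTE;

// A flag is part of the bitmask if all of its bits are set in it.
$has_quotes = ($flags & ENT_QUOTES) === ENT_QUOTES;
$has_substitute = ($flags & ENT_SUBSTITUTE) === ENT_SUBSTITUTE;

// ENT_HTML5 was never added to $flags, so this check fails.
$has_html5 = ($flags & ENT_HTML5) === ENT_HTML5;

var_dump($has_quotes, $has_substitute, $has_html5);
```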
The third $encoding argument is also optional and specifies the encoding of $str. Some of the possible values for $encoding are 'UTF-8', 'ISO-8859-1', 'ISO-8859-5' and 'cp1252'.
Note that this is not the complete list of the possible values for $encoding. And then we also have aliases for given values. For example, 'ISO-8859-1' can also be expressed as 'ISO8859-1'.
For the complete set of values, refer to the official documentation of htmlspecialchars().
Finally, the fourth $double_encode argument specifies whether or not to double encode the given string $str. 'Double encode' here means to convert the ampersand (&) symbol, even if it is part of an existing HTML entity. By default, it's true.
The point of $encoding
As you may know, text files don't store their encoding along with their actual data, nor even in their metadata. The same idea applies to PHP files, which are merely text files.
That is, when the PHP engine parses a given PHP file, it can't really tell what its encoding is. The developer, though, might already be aware of the encoding. For instance, we might create a new PHP file based on the ISO-8859-1 character set for a legacy project and then obviously remember that the charset used in the file is ISO-8859-1.
Now since the PHP engine itself isn't able to determine the given file's character encoding, using htmlspecialchars() in such a file might produce unexpected results.
Let's consider an example to help understand this better.
The Cyrillic Г character is represented as the byte value 0xB3 (i.e. B3 in hexadecimal) in ISO-8859-5 but as the byte sequence 0xD0 0x93 in UTF-8.
Now suppose that we have a PHP file based on the ISO-8859-5 charset and that it has a string $str in it, containing the Г character.
If we call htmlspecialchars() on this string, without a value for $encoding, the function would assume PHP's default encoding, which typically is UTF-8. In other words, the function would assume that the given string is encoded in UTF-8.
Now because the byte value 0xB3 on its own is invalid in UTF-8, htmlspecialchars() would (owing to the default ENT_SUBSTITUTE flag) replace the byte with the byte sequence of the Unicode Replacement Character, U+FFFD, despite the fact that there was absolutely no need to, since we weren't dealing with UTF-8 but rather with ISO-8859-5.
To summarize, because PHP files (in fact, all text files) don't carry character encoding info along with them, it's error-prone to run htmlspecialchars() inside them without specifying the $encoding argument.
If we know the PHP file's encoding, it's a good idea to provide it. This especially applies to legacy projects that don't work with today's modern and standard UTF-8 encoding scheme.
Hopefully this clarifies the purpose of $encoding.
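The scenario described above can be reproduced without actually saving a file in ISO-8859-5, by writing the byte directly with the "\xB3" escape sequence. This is only a sketch: the flags are spelled out explicitly rather than relying on the defaults, and bin2hex() is used to inspect the raw bytes so that the terminal's own encoding doesn't get in the way.

```php
<?php

// The single byte 0xB3 is the Cyrillic letter Г in ISO-8859-5.
$str = "\xB3";

$flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401;

// Treated as UTF-8, the lone byte 0xB3 is invalid and gets substituted.
$as_utf8 = htmlspecialchars($str, $flags, 'UTF-8');

// Treated as ISO-8859-5, the byte is a valid character and is left alone.
$as_iso = htmlspecialchars($str, $flags, 'ISO-8859-5');

echo bin2hex($as_utf8), "\n"; // efbfbd (U+FFFD encoded in UTF-8)
echo bin2hex($as_iso), "\n";  // b3
```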
Let's consider a couple of examples of using htmlspecialchars().
In the code below, we call htmlspecialchars() on $str in order to escape all the special-for-HTML characters in it:
<?php

$str = "Hello <script>alert('World!')</script>";

echo htmlspecialchars($str);
This outputs:
Hello &lt;script&gt;alert(&#039;World!&#039;)&lt;/script&gt;
Notice the entities in the output. Compared to the previous code's output, all the special characters have now been replaced by their corresponding HTML entities.
If an HTML engine were to parse this piece of text, there would be no <script> tag to be executed. XSS threat vector neutralized.
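In practice, it's common to wrap this call in a tiny helper so that the flags and the encoding are always stated explicitly. The escape_html() function below is a hypothetical helper of our own, not something that PHP provides, and it assumes that the application works with UTF-8 strings.

```php
<?php

/**
 * Hypothetical helper (not part of PHP): escapes a string for use as HTML text.
 * Assumes that the application's strings are encoded in UTF-8.
 */
function escape_html(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401, 'UTF-8');
}

// Pretend that this came from an untrusted source, such as a form field.
$untrusted = "Hello <script>alert('World!')</script>";

$safe = escape_html($untrusted);
echo $safe;
```

Keeping the flags and encoding in one place means that every call site escapes strings in the same way, regardless of the PHP version's defaults.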
Let's consider another example.
In the transformation above, the ' character gets replaced by the entity &#039;. This is the old way, specifically the one that was used in HTML4. A more modern entity is &apos;.
In order to get htmlspecialchars() to convert ' to &apos;, we ought to use the $flags argument. Specifically, we need the flag ENT_HTML5, as shown below:
<?php

$str = "Hello <script>alert('World!')</script>";

echo htmlspecialchars($str, ENT_HTML5);
This outputs:
Hello &lt;script&gt;alert('World!')&lt;/script&gt;
Wait a second... The output contains ' as it is.
We expected ' to be converted to &apos; instead of to &#039;, but with the ENT_HTML5 flag set, it just fails to be converted into anything!
There's something wrong.
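Well, the problem is the absence of the ENT_QUOTES flag.
ENT_QUOTES means that quote characters, including the single quote ('), should be transformed to entities. Exactly which entity ' gets converted to depends upon the document type flag in use, such as ENT_HTML401 or ENT_HTML5.
In the code above, we indeed used ENT_HTML5. But we didn't use the ENT_QUOTES flag, and passing a $flags value of our own replaces the default value altogether. In order to use both of these flags together, we ought to use the bitwise OR (|) operator, as follows:
<?php

$str = "Hello <script>alert('World!')</script>";

echo htmlspecialchars($str, ENT_QUOTES | ENT_HTML5);
This outputs:
Hello &lt;script&gt;alert(&apos;World!&apos;)&lt;/script&gt;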
So far, so good.
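The role that each flag plays in the examples above can be summarized in one short sketch. It encodes the same string three times and keeps each result in a variable; the variable names are our own, and 'UTF-8' is passed explicitly as an assumption about the string's encoding.

```php
<?php

$str = "It's";

// HTML 4.01 has no named entity for ', so a numeric one is used.
$html401 = htmlspecialchars($str, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// HTML5 defines &apos;, so that's used instead.
$html5 = htmlspecialchars($str, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Without ENT_QUOTES, the single quote isn't touched at all.
$no_quotes = htmlspecialchars($str, ENT_HTML5, 'UTF-8');

echo $html401, "\n";   // It&#039;s
echo $html5, "\n";     // It&apos;s
echo $no_quotes, "\n"; // It's
```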
Let's consider yet another example, this time demonstrating the $double_encode argument.
In the code below, we have an existing entity in the given string $str, namely &lt;. Two calls are made to htmlspecialchars(), one with $double_encode set to true and one with it set to false:
<?php

$str = "&lt; is an existing entity.";

echo htmlspecialchars($str, ENT_QUOTES | ENT_HTML5, null, true), "\n";
echo htmlspecialchars($str, ENT_QUOTES | ENT_HTML5, null, false);
This outputs:
&amp;lt; is an existing entity.
&lt; is an existing entity.
First of all, notice that since $double_encode is the last parameter, we have to provide all the preceding arguments to the function when passing arguments by position.
The null value for $encoding, as used in the code above, means that htmlspecialchars() would use the default encoding, which is typically UTF-8.
Besides this, focusing on the output produced, the first htmlspecialchars() call converts the ampersand (&) character to its corresponding HTML entity &amp;. The second call, however, leaves it as it is, simply because $double_encode is false in this case.
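On PHP 8.0 and later, named arguments offer a way around having to spell out every preceding argument. The sketch below assumes such a PHP version; double_encode is the name of the parameter in the function's official signature.

```php
<?php

$str = "&lt; is an existing entity.";

// Positional form: $flags and $encoding must be given to reach $double_encode.
$positional = htmlspecialchars($str, ENT_QUOTES | ENT_HTML5, null, false);

// Named form (PHP 8.0+): only the arguments that we care about are given.
$named = htmlspecialchars($str, ENT_QUOTES | ENT_HTML5, double_encode: false);

// Both calls leave the existing entity untouched.
var_dump($positional === $named);
```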
As a matter of fact, remember that with $double_encode set to true, the ampersand (&) character will always be transformed regardless of whether it's part of an HTML entity or not.
This can be seen as follows:
<?php

$str = "An ampersand: &";

echo htmlspecialchars($str);
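It's also worth seeing what happens to a lone ampersand when $double_encode is false. In the following sketch, the string contains both a lone & and an existing entity; the example string and the variable names are our own.

```php
<?php

// Contains a lone ampersand and an existing entity (&lt;).
$str = "Tom & Jerry &lt;3";

$flags = ENT_QUOTES | ENT_HTML401;

// Every ampersand is converted, including the one in &lt;.
$double_encoded = htmlspecialchars($str, $flags, 'UTF-8', true);

// Only the lone ampersand is converted; the existing entity is kept.
$single_encoded = htmlspecialchars($str, $flags, 'UTF-8', false);

echo $double_encoded, "\n"; // Tom &amp; Jerry &amp;lt;3
echo $single_encoded, "\n"; // Tom &amp; Jerry &lt;3
```

So setting $double_encode to false doesn't switch off the encoding of ampersands; it only spares those that are already part of an entity.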
htmlspecialchars_decode()
htmlspecialchars_decode() is, as the name suggests, the opposite of htmlspecialchars(). It decodes special HTML entities in a string to their corresponding characters.
Here's the syntax of the function:
htmlspecialchars_decode( $str, $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401 )
$str is the string that we want to transform, while $flags configures the transformation behavior.
You might've noticed that there isn't an $encoding argument for htmlspecialchars_decode(), unlike in htmlspecialchars(). This seems contrary to what we'd typically expect.
Let's sort out why this is the case.
Why is there no $encoding argument for htmlspecialchars_decode()?
Upon inspecting the implementation of htmlspecialchars_decode() and htmlspecialchars(), available on GitHub in The PHP Interpreter repository, we do find some reasoning as to why exactly the former function doesn't have an $encoding parameter.
First of all, note that when converting the few special characters to entities, htmlspecialchars() doesn't actually use the $encoding provided to it, internally. This makes perfect sense because the special characters all have the same encoding in all the possible character sets.
So then what does htmlspecialchars() use $encoding for?
Well, it only uses $encoding to rid the input string of byte sequences that are invalid in that encoding, so that the output conforms to certain HTML doctypes (HTML5 is a bit lenient, but older versions and XHTML weren't: if a particular charset is chosen, then they require the underlying markup file to be valid for that particular charset).
Think about it naturally: htmlspecialchars() is meant to be used in producing text that'll ultimately be inlined in an HTML document and, likewise, end up in some HTML parsing engine. Conformance to the given charset is more than just important here; it's a requirement!
And so it's crucial for the developers of the PHP engine to make sure that htmlspecialchars() fixes any string that has invalid bytes in it for a given charset.
As for htmlspecialchars_decode(), it isn't meant to produce text for an HTML file. Instead, it's simply meant to decode some special entities in text taken from an HTML file. Obviously, in the decoding itself, there's no need for any $encoding, since the encoding of all the special characters is the same in all possible charsets.
However, for conformance to a particular charset, we could've had an $encoding parameter in htmlspecialchars_decode(). But because the string provided to the function represents text from an HTML file, it's reasonable to assume that it already conforms to the underlying charset (most probably, the HTML file was generated from PHP, with normalization done by htmlspecialchars() for the underlying charset, and so it's already valid).
So to answer it concretely: htmlspecialchars_decode() doesn't have an $encoding parameter because the encoding of the characters transformed by it and by htmlspecialchars() is the same in all charsets and because there's no need to enforce conformance to a given charset for this function.
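The first half of this reasoning can be checked with a quick sketch: for a string made up of plain ASCII characters, htmlspecialchars() gives the same result whichever encoding we name. The three encodings below are just a sample that we picked, not an exhaustive test, and the variable names are our own.

```php
<?php

// A pure-ASCII string containing all five special characters.
$str = "<a href=\"x\">Tom & 'Jerry'</a>";

$flags = ENT_QUOTES | ENT_HTML401;
$encodings = ['UTF-8', 'ISO-8859-1', 'ISO-8859-5'];

// Encode the same string once per encoding.
$results = [];
foreach ($encodings as $encoding) {
    $results[$encoding] = htmlspecialchars($str, $flags, $encoding);
}

// If every result is identical, only one unique value remains.
$all_same = count(array_unique($results)) === 1;

echo $results['UTF-8'], "\n";
var_dump($all_same);
```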
Time for some quick examples.
In the following code, we decode the string that we obtained at the start of the previous section using htmlspecialchars(), this time stored in $encoded_str:
<?php

$str = "Hello <script>alert('World!')</script>";
$encoded_str = htmlspecialchars($str);

echo $encoded_str, "\n";
echo htmlspecialchars_decode($encoded_str);
This outputs:
Hello &lt;script&gt;alert(&#039;World!&#039;)&lt;/script&gt;
Hello <script>alert('World!')</script>
The first line of output shows the encoded string while the second shows the string obtained by decoding this encoded string.
Moving on, if a string has been encoded using htmlspecialchars() with ENT_HTML5 in addition to ENT_QUOTES, calling htmlspecialchars_decode() on the resulting string, without any flags, won't produce back the original string.
The following code demonstrates this:
<?php

$str = "Hello <script>alert('World!')</script>";
$encoded_str = htmlspecialchars($str, ENT_QUOTES | ENT_HTML5);

echo $encoded_str, "\n";
echo htmlspecialchars_decode($encoded_str);
This outputs:
Hello &lt;script&gt;alert(&apos;World!&apos;)&lt;/script&gt;
Hello <script>alert(&apos;World!&apos;)</script>
See how the string returned by htmlspecialchars_decode() contains the &apos; entities; they haven't been decoded to '.
In order to obtain the original string back in this case, it's important to use the same set of flags that we used while calling htmlspecialchars().
Just as shown below:
<?php

$str = "Hello <script>alert('World!')</script>";
$encoded_str = htmlspecialchars($str, ENT_QUOTES | ENT_HTML5);

echo $encoded_str, "\n";
echo htmlspecialchars_decode($encoded_str, ENT_QUOTES | ENT_HTML5);
Note that if we remove the ENT_QUOTES flag from the htmlspecialchars_decode() call above, the function won't consider decoding entities representing the ' character.
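Both observations, the round trip with matching flags and the effect of dropping ENT_QUOTES, can be put side by side in one sketch. The variable names are our own, and 'UTF-8' is assumed as the encoding of the string.

```php
<?php

$str = "Hello <script>alert('World!')</script>";
$flags = ENT_QUOTES | ENT_HTML5;

$encoded = htmlspecialchars($str, $flags, 'UTF-8');

// Same flags on both sides: the original string comes back.
$round_trip = htmlspecialchars_decode($encoded, $flags);

// ENT_QUOTES dropped: &lt; and &gt; are decoded, but &apos; is left as it is.
$without_quotes = htmlspecialchars_decode($encoded, ENT_HTML5);

var_dump($round_trip === $str);
echo $without_quotes, "\n";
```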
htmlentities()
htmlentities() is a more comprehensive version of htmlspecialchars(). It converts all characters that have corresponding HTML entities associated with them, for a particular character set, to those entities.
Here's the syntax of the function:
htmlentities( $str, $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401, $encoding = null, $double_encode = true )
As you can see, the syntax of htmlentities() is identical to that of htmlspecialchars().
This makes perfect sense: both functions do the same thing, i.e. converting certain characters to entities. It's just that htmlentities() converts all convertible characters to entities while htmlspecialchars() converts only five characters (&, ", ', < and >).
But keep in mind that the $encoding parameter of htmlentities() has a significance beyond just producing text that conforms to the given character encoding.
$encoding has its importance in htmlentities()
There are different mappings for the same characters in different character sets. For example, the superscript 3 character, denoted as ³, is represented as the byte value 0xB3 in ISO-8859-1 but as the byte sequence 0xC2 0xB3 in UTF-8.
This means that if we have a piece of text to be passed through htmlentities(), it's a really good idea to specify the encoding of the text, i.e. specify the character set used via the $encoding argument. In this way, we'll be able to prevent unexpected outcomes.
As a quick example, suppose that an input string contains only a single byte whose value is 0xB3. Given that we specify the charset to be ISO-8859-1 while calling htmlentities(), it would correctly get encoded to the corresponding entity &sup3;.
However, if we don't specify the charset, the function would assume PHP's default charset, which is typically UTF-8. Now because the byte value 0xB3 alone is invalid in UTF-8, it would get converted to the Unicode Replacement Character.
See the difference in the transformation?
With ISO-8859-1, the byte 0xB3 (representing ³) converts to &sup3;, while with UTF-8, it converts to the Unicode Replacement Character.
To boil it down, $encoding is utterly necessary in the case of htmlentities(). And not just in htmlentities(), but also in its reverse function, html_entity_decode(), as we shall explore below.
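The 0xB3 example described above can be sketched in code as follows. The byte is written with the "\xB3" escape sequence, the flags are spelled out explicitly rather than relying on the defaults, and bin2hex() is used to reveal the bytes of the replacement character; the variable names are our own.

```php
<?php

// A single byte with the value 0xB3.
$str = "\xB3";

$flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401;

// In ISO-8859-1, 0xB3 is the superscript three character.
$as_latin1 = htmlentities($str, $flags, 'ISO-8859-1');

// In UTF-8, a lone 0xB3 byte is invalid and gets substituted.
$as_utf8 = htmlentities($str, $flags, 'UTF-8');

echo $as_latin1, "\n";        // &sup3;
echo bin2hex($as_utf8), "\n"; // efbfbd (U+FFFD encoded in UTF-8)
```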
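Let's consider a bunch of examples of using htmlentities().
In the following code, we run htmlentities() on a given heredoc string:
<?php

$str = <<<END
³ is called 'Superscript Three'.
≈ is called 'Almost Equal To'.
< is called 'Less-Than Sign'.
END;

echo htmlentities($str);
Assuming that the file is saved in UTF-8, this outputs:
&sup3; is called &#039;Superscript Three&#039;.
&asymp; is called &#039;Almost Equal To&#039;.
&lt; is called &#039;Less-Than Sign&#039;.
Notice how the characters ³ and ≈ have been converted to their corresponding HTML entities (unlike in htmlspecialchars()), and obviously < and ' too (like in htmlspecialchars()).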
Let's now suppose that we don't want to transform the ' characters around the names in $str above. How do we do that?
Fortunately, using the ENT_NOQUOTES flag, this is really easy:
<?php

$str = <<<END
³ is called 'Superscript Three'.
≈ is called 'Almost Equal To'.
< is called 'Less-Than Sign'.
END;

echo htmlentities($str, ENT_NOQUOTES);
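To better understand the difference between htmlentities() and htmlspecialchars(), let's include the latter as well in this code and see the output:
<?php

$str = <<<END
³ is called 'Superscript Three'.
≈ is called 'Almost Equal To'.
< is called 'Less-Than Sign'.
END;

echo htmlentities($str, ENT_NOQUOTES), "\n\n";
echo htmlspecialchars($str, ENT_NOQUOTES);
Assuming that the file is saved in UTF-8, this outputs:
&sup3; is called 'Superscript Three'.
&asymp; is called 'Almost Equal To'.
&lt; is called 'Less-Than Sign'.

³ is called 'Superscript Three'.
≈ is called 'Almost Equal To'.
&lt; is called 'Less-Than Sign'.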
Just like we have a reverse function for htmlspecialchars(), i.e. htmlspecialchars_decode(), there exists a reverse function for htmlentities().
html_entity_decode()
html_entity_decode() takes a string and decodes all the HTML entities in it, that apply to the given character set, to their corresponding characters.
The syntax of html_entity_decode() is as follows:
html_entity_decode( $str, $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401, $encoding = null )
The first two arguments, $str and $flags, work exactly as in htmlspecialchars_decode(). The additional third $encoding argument specifies the encoding of the given string.
The reason why we have the $encoding parameter in html_entity_decode() but not in htmlspecialchars_decode() was discussed in the previous section on htmlentities().
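The effect of $encoding on decoding can be sketched by decoding the same entity under two different charsets and inspecting the resulting bytes with bin2hex(). The flags are passed explicitly here, and the variable names are our own.

```php
<?php

$entity = '&sup3;';
$flags = ENT_QUOTES | ENT_HTML401;

// Decoded for ISO-8859-1, the character is the single byte 0xB3.
$as_latin1 = html_entity_decode($entity, $flags, 'ISO-8859-1');

// Decoded for UTF-8, the same character is the byte sequence 0xC2 0xB3.
$as_utf8 = html_entity_decode($entity, $flags, 'UTF-8');

echo bin2hex($as_latin1), "\n"; // b3
echo bin2hex($as_utf8), "\n";   // c2b3
```

The entity is the same in both calls, yet the bytes produced differ, which is exactly why the function needs to know the target encoding.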
Let's consider an example using html_entity_decode().
In the following code, we decode a string containing a couple of entities using html_entity_decode():
<?php

$str = <<<END
&sup3; is called 'Superscript Three'.
&asymp; is called 'Almost Equal To'.
&lt; is called 'Less-Than Sign'.
END;

echo html_entity_decode($str);
Simple, wasn't it?