HTML Entities

Chapter 7 9 mins

Learning outcomes:

  1. What are HTML entities
  2. Named and numeric entities
  3. Commonly used named entities
  4. Numeric entities

What are entities?

Let's say we want to add the less-than symbol, <, literally inside a <p> element (or for that matter, in any element). How could we do so?

Well, based on what we've learnt so far, we might go on and do the following:

<p>2 < 3</p>

2 < 3

This works, at least in this case, because the < is followed by a space and so the browser doesn't mistake it for the start of a tag. Even so, it is NOT considered good practice.

The < character is reserved for a special purpose by HTML, i.e. to denote the beginning of a tag, and should therefore not be used as it is in code.

In fact, let's say we want to denote the text 'This is <code>' in HTML; we couldn't do the following:

<p>This is <code></p>

since <code> will be treated as an element, not literally as the text '<code>'.
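To see how easily this can go wrong with ordinary text too, here's a small sketch (the expression is made up purely for illustration) where an unescaped < is immediately followed by a letter:

```html
<!-- Here "<y and y>" is read as the start tag of an unknown element named "y". -->
<p>x<y and y>z</p>
```

In a typical browser, this paragraph should display just 'xz', since everything from < up to > is consumed as a tag instead of being shown as text.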

So what to do now?

The correct way to literally denote a character such as < or > in HTML, that is otherwise reserved for a special purpose, is to use an HTML entity.

An HTML entity is a sequence of characters that together denote a given character in the final output.

An entity begins with an ampersand (&) and ends with a semicolon (;). The characters in between identify the character that the entity stands for.

In the HTML standard, an entity is formally known as a character reference.

There are multiple ways of specifying an entity:

  • Named entities, whereby a short abbreviation is used to represent the underlying character.
  • Numeric entities, whereby a number representing the underlying character is expressed as a decimal integer or a hexadecimal integer.

Since they're easier to remember than numeric entities, we'll mostly use named entities when writing HTML code.

Anyway, let's now see how to denote < using an entity.

From mathematics, recall that the < symbol is called the less-than symbol. In HTML, the named entity &lt; denotes this < character. (You can obviously guess what 'lt' means here).

When an HTML parser encounters &lt;, it recognizes it as an entity. It then looks the name up in its large collection of named entities, finds that &lt; corresponds to <, and replaces &lt; with < in the final output.

The source code of the page would obviously still contain &lt; but the rendered output would be different, containing the < character.

Here's a quick example of using &lt;:

<p>2 &lt; 3</p>

2 < 3

In the next section, we shall learn about some of the most commonly used entities in HTML.

Commonly used named entities

HTML defines a jaw-dropping number of named entities (more than two thousand), covering a huge variety of characters and symbols from a diverse set of languages and areas of study.

Now that we understand what exactly an HTML entity is, let's spend a couple of minutes getting to know some of the most commonly used ones.

Non-breaking spaces

A non-breaking space, given as &nbsp;, represents a space character that the browser neither collapses with neighboring spaces nor uses as a place to break a line.

That is, if we have 10 &nbsp; entities, we'll get exactly 10 corresponding space characters in the output; HTML doesn't collapse them into one.

Following is an example:

<p>Here we have 5 spaces: "     "</p>
<p>Here we have 5 non-breaking spaces: "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"</p>

Here we have 5 spaces: " "

Here we have 5 non-breaking spaces: "     "

The sequence of normal space characters in the first <p> gets reduced to a single space (as per the default behavior of HTML). However, in the second <p>, the non-breaking spaces, each denoted as &nbsp;, show up exactly as they are written in the source code.

The &nbsp; entity can be really handy when we just want to add an extra space or two somewhere in HTML but don't necessarily want the power of preformatting for this.
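Besides preserving spaces, &nbsp; also keeps the words on either side of it together on one line. Here's a minimal sketch of that use (the sentence is just an illustration):

```html
<!-- A regular space: the browser may wrap the line between "100" and "km". -->
<p>The trail is 100 km long.</p>

<!-- A non-breaking space: "100" and "km" always stay on the same line. -->
<p>The trail is 100&nbsp;km long.</p>
```

Both paragraphs read 'The trail is 100 km long.' when there's enough room. The difference only shows when the line has to wrap near the number, in which case the second paragraph moves '100 km' to the next line as a single unit.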

Less-than (<) and greater-than (>)

We already saw the &lt; entity in the previous section, but let's quickly go over it once again, along with &gt;.

The less-than (<) symbol is given by the &lt; entity while the greater-than (>) symbol is given by the &gt; entity.

If we need to use either of these symbols (< and >) as literal text in HTML, we should use their corresponding entities, since both symbols are reserved for a special purpose in HTML, i.e. to denote HTML tags.

In the following code, we solve the problem discussed above, where we wanted to represent the text 'This is <code>' in HTML:

<p>This is &lt;code&gt;</p>

Let's see the output:

This is <code>

Voila! Just as we wanted.
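The same idea extends to showing a complete piece of markup as text. As a small sketch, here both the opening and the closing tag of a <b> element are written using entities:

```html
<p>To make text bold, write &lt;b&gt;like this&lt;/b&gt;.</p>
```

This should display 'To make text bold, write <b>like this</b>.' with the tags visible as plain text and nothing actually rendered in bold.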

Ampersand (&)

Suppose we want to represent literally the text '&gt;' in HTML. How could we do this?

Well, if we write &gt; as it is, we'll obviously get the corresponding > character, not the text itself, as can be seen below:

<p>Greater-than (&gt;)</p>

Greater-than (>)

What we need to do here is to replace the ampersand (&) in &gt; with something else, so that the browser doesn't treat the sequence as an entity. And that's exactly where &amp; enters the game.

The ampersand (&) character is given by the named entity &amp;.

Coming back to our question, to represent '&gt;' literally in HTML, we just need to use &amp; in place of the & character. In that way, the whole sequence won't be treated as an entity but rather as plain text.

The following code demonstrates this:

<p>Greater-than (&amp;gt;)</p>

Greater-than (&gt;)
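The &amp; entity is just as useful for an ordinary ampersand in text or inside an attribute value. The following is a minimal sketch; the URL search.html?q=html&page=2 is a made-up example:

```html
<!-- An ampersand in running text. -->
<p>Tom &amp; Jerry</p>

<!-- An ampersand separating query parameters in a link. -->
<p><a href="search.html?q=html&amp;page=2">Next page</a></p>
```

The first paragraph displays 'Tom & Jerry', and the link points to search.html?q=html&page=2, since the browser replaces &amp; with & in attribute values as well.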

Numeric entities

While there are more than two thousand named entities in HTML, they still don't cover the complete set of characters available in Unicode.

For that, we use another kind, one that specifies the code point (the number) associated with a given character, as a decimal or hexadecimal integer.

As we know, such entities are referred to as numeric entities.

  • For a decimal representation, the code point is written as it is between the & and ; characters, prefixed with # (which means that a number follows).
  • For a hexadecimal representation, the code point is converted to its corresponding hexadecimal integer and then written between & and ;, prefixed with # and additionally x.

So, in general, a decimal numeric entity can be expressed as &#code; whereas a hexadecimal numeric entity can be expressed as &#xcode;, where code denotes the number representing the code point of the underlying character.
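As a quick illustration of both forms, take the euro sign (€), which has the code point U+20AC, i.e. 8364 in decimal. It also happens to have a named entity, &euro;, so we can compare all three:

```html
<p>Named: &euro;</p>
<p>Decimal: &#8364;</p>
<p>Hexadecimal: &#x20AC;</p>
```

All three paragraphs should display the same € character after their respective labels.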

Using a numeric entity, we can express just about any character in an HTML document.

While we can reference just about any character in HTML using a numeric entity, the character will only be displayed in the browser if an available font has a glyph for it. Control characters, for example, have no visible form, so writing one as a numeric entity produces nothing visible.

Let's take the example of the < character.

< has the code point 60 in Unicode, sometimes also expressed more technically as U+003C (where 003C is the hexadecimal representation of the number 60).

60 is already a decimal number, so the decimal entity for < is simply &#60;. Converting 60 to hexadecimal gives us 3C, so the hexadecimal entity is &#x3c; (or equivalently, &#x3C;, with an uppercase C).

The code below expresses < in three different ways:

<p>Less-than symbol: &lt;</p>
<p>Less-than symbol: &#60;</p>
<p>Less-than symbol: &#x3c;</p>

Less-than symbol: <

Less-than symbol: <

Less-than symbol: <

Entities are amazing, aren't they?

If a character doesn't have a corresponding named entity, you can always use a numeric entity to express it in HTML, given that you know its code point.
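As an example, the grinning face emoji has the code point U+1F600 (128512 in decimal) and, as far as we're aware, no named entity. A minimal sketch of expressing it with numeric entities could look as follows:

```html
<p>Decimal: &#128512;</p>
<p>Hexadecimal: &#x1F600;</p>
```

Both paragraphs should display the same emoji, provided that a font on the device has a glyph for it.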

"I created Codeguage to save you from falling into the same learning conundrums that I fell into."

— Bilal Adnan, Founder of Codeguage