HTML: Foundation — Entities

HTML Entities

Learning outcomes:

  • What are HTML entities
  • Named and numeric entities
  • Commonly used named entities
  • Numeric entities

What are entities?

Let's say we want to add the less-than symbol, <, literally inside a <p> element (or for that matter, in any element). How could we do so?

Well, based on what we've learnt so far, we might go on and do the following:

HTML
<p>2 < 3</p>

2 < 3

This works, at least in this case, but it is NOT considered good practice. Strictly speaking, a stray < like this is a parse error that browsers happen to recover from.

The < character is reserved for a special purpose by HTML, i.e. to denote the beginning of a tag, and should therefore not be written as it is when we mean it literally.

To better understand this, let's say we want to denote the text 'This is <code>' literally in HTML. We couldn't do the following:

HTML
<p>This is <code></p>

since <code> will be treated as an element, NOT literally as the text '<code>'.

So what to do now?

The correct way to literally denote a character such as < or > in HTML, that is otherwise reserved for a special purpose, is to use an HTML entity.

An HTML entity is a sequence of multiple characters that, as a unit, denotes a single character in the final output.

An entity begins with an ampersand (&) and ends with a semicolon (;). Between these we put some characters to altogether represent another character.

An entity is formally known as a character reference in the HTML standard.

There are multiple ways of specifying an entity:

  • Named entities, whereby a short abbreviation is used to represent the underlying character.
  • Numeric entities, whereby a number representing the underlying character is expressed as a decimal integer or a hexadecimal integer.

Since they're easier to remember compared to numeric entities, we'll typically use named entities when writing HTML code.
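
As a rough illustration of the two kinds, the snippet below writes the copyright sign (©, code point 169, or A9 in hexadecimal) first as a named entity and then as decimal and hexadecimal numeric entities. The choice of character is ours, purely for illustration:

```html
<!-- Named entity -->
<p>Copyright sign: &copy;</p>
<!-- Decimal numeric entity (169) -->
<p>Copyright sign: &#169;</p>
<!-- Hexadecimal numeric entity (A9) -->
<p>Copyright sign: &#xA9;</p>
```

All three paragraphs should show the same © character.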

Anyway, let's now see how to denote < using an entity.

From mathematics, recall that the < symbol is called the less-than symbol. In HTML, the named entity &lt; denotes this < character. (You can probably guess what 'lt' means here.)

When an HTML parser encounters &lt;, it recognizes it as an entity. It then looks it up in its large collection of named entities, finds that &lt; corresponds to <, and replaces &lt; with < in the final output.

The source code of the page would obviously still contain &lt; but the rendered output would be different, containing the < character.

Here's a quick example of using &lt;:

HTML
<p>2 &lt; 3</p>

2 < 3

In the next section, we shall learn about some of the most commonly used entities in HTML.

Commonly used named entities

HTML defines a jaw-dropping number of named entities (more than two thousand), covering a huge variety of characters and symbols from a diverse set of languages and areas of study.

Now that we understand what exactly an HTML entity is, let's spare a couple of minutes to get to know some of the most commonly used ones.

Non-breaking spaces

Recall that in HTML, every sequence of whitespace characters (spaces, tabs, newlines, etc.) is collapsed into a single space.

This is the default behavior in HTML (to help us neatly structure our HTML files with indentations and newlines without affecting the output) unless we override it using preformatting.

A non-breaking space character, however, doesn't get treated as such a whitespace character in HTML even though it produces whitespace.

So what does this mean? Let's find out.

Denoted as &nbsp;, a non-breaking space represents a space character that follows two basic rules:

  • It does NOT get treated as a regular whitespace character, which means that it remains in the output as it is in the code.
  • It does NOT allow text to be broken at its position. (A regular space, by contrast, lets the browser wrap text on to a new line when it can't all fit on one line.)

Expanding upon the first rule, if we have 10 &nbsp; entities, we'll get exactly 10 corresponding space characters in the output. HTML doesn't collapse them.

Following is an example:

HTML
<p>Here we have 5 spaces: "     "</p>
<p>Here we have 5 non-breaking spaces: "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"</p>

Here we have 5 spaces: " "

Here we have 5 non-breaking spaces: "     "

The sequence of normal space characters in the first <p> gets reduced to a single space (as per the default behavior of HTML). However, in the second <p>, the non-breaking spaces, each denoted as &nbsp;, show up exactly as they are written in the source code.

The &nbsp; entity can be really handy when we just want to add an extra space or two somewhere in our HTML but don't necessarily want the power of preformatting for this.

The second rule, which is where the non-breaking space character gets its name from, means that a line of text can't be broken at this character, unlike at a regular space.

Shown below is an example:

HTML
<p>This is an overflowing word.</p>
<p>This is an overflowing&nbsp;word.</p>

We have two paragraphs with the same text, except that the second one has a non-breaking space between 'overflowing' and 'word' (and hence the second paragraph can't be broken at this space).

Using some CSS (which we'll explore later on in this course), we emulate the scenario of there not being sufficient width to fit the word 'word' on the same line. Take a look at both the paragraphs as follows:

This is an overflowing
word.

This is an
overflowing word.

To fit the entire text inside the <p> element, the browser breaks the line of text at a regular whitespace character.

  • In the first paragraph, the space after 'overflowing' is taken to be the breaking point for the line of text, simply because it's a regular space.
  • In the second paragraph, however, &nbsp; denotes a non-breaking space which the browser can't break at; it instead breaks the line of text at the space following 'an' (since that's a regular space).

Think of &nbsp; as gluing two words together so that they behave like a single word for line-breaking purposes, even though they still appear as two separate words.
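
One place where this grouping can help is keeping a number next to its unit. The following is a small sketch with made-up text; whether a line actually wraps depends on the available width:

```html
<!-- The non-breaking spaces keep "10 km" and "9 a.m." from being split across two lines -->
<p>The trail is 10&nbsp;km long and opens at 9&nbsp;a.m. every day.</p>
```

If the paragraph has to wrap, the break should fall at one of the regular spaces, not between a number and its unit.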

Less-than (<) and greater-than (>)

We already saw the &lt; entity in the previous section, but let's quickly see it once again, along with &gt;.

The less-than (<) symbol is given by the &lt; entity while the greater-than (>) symbol is given by the &gt; entity.

If we need to use either of these symbols (< and >) literally in HTML, we should use their corresponding entities, since both symbols are reserved for a special purpose in HTML, i.e. to denote HTML tags.

In the following code, we solve the problem discussed above, where we wanted to represent the text 'This is <code>' in HTML:

HTML
<p>This is &lt;code&gt;</p>

Let's see the output:

This is <code>

Voila! Just as we wanted.

Ampersand (&)

Suppose we want to represent literally the text '&gt;' in HTML. How could we do this?

Well, if we write &gt; as it is, we'll obviously get the corresponding > character, not the text itself, as can be seen below:

HTML
<p>Greater-than (&gt;)</p>

Greater-than (>)

What we need to do here is replace the ampersand (&) in &gt; so that the sequence isn't treated as an entity by the browser. And that's exactly where &amp; enters the game.

The ampersand (&) character is given by the named entity &amp;.

Coming back to our question, to represent '&gt;' literally in HTML, we just need to use &amp; in place of the & character. In that way, the whole sequence won't be treated as an entity but rather as plain text.

The following code demonstrates this:

HTML
<p>Greater-than (&amp;gt;)</p>

Greater-than (&gt;)
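
Putting the three entities together, here is a hedged sketch of how a line of program code might be written so that it displays literally. The condition itself is made up, chosen only for illustration:

```html
<!-- &lt; and &gt; stand in for < and >; each &amp; stands in for one & -->
<p>if (a &lt; b &amp;&amp; b &gt; c)</p>
```

The expected output is:

if (a < b && b > c)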

Numeric entities

While there are more than two thousand named entities in HTML, they still don't cover the complete set of characters available in Unicode.

For that, we use another kind, one that specifies the code point (the number) associated with a given character, as a decimal or hexadecimal integer.

As we know, such entities are referred to as numeric entities.

  • For a decimal representation, the code point is written as it is between the & and ; characters, prefixed with # (which means that a number follows).
  • For a hexadecimal representation, the code point is converted to its corresponding hexadecimal integer and then written between & and ;, prefixed with # and additionally x.

So, in general, a decimal numeric entity can be expressed as &#code; whereas a hexadecimal numeric entity can be expressed as &#xcode;, where code denotes the code point of the underlying character, written in decimal in the first form and in hexadecimal in the second.

Using a numeric entity, we can express just about any character in an HTML document.

While we can write almost any character in HTML using a numeric entity, the character is only displayed in the browser if there is a supported glyph for it. For example, a control character has no visible form, so writing one as a numeric entity produces no visible output.

Let's take the example of the < character.

< has the code point 60 in Unicode, sometimes also expressed more technically as U+003C (where 003C is the hexadecimal representation of the number 60).

Since 60 is already a decimal number, the decimal entity for < is simply &#60;. Converting 60 to hexadecimal gives 3C (60 = 3 × 16 + 12, and 12 is written as C), so the hexadecimal entity is &#x3c; (or equivalently, &#x3C;, with an uppercase C).

The code below expresses < in three different ways:

HTML
<p>Less-than symbol: &lt;</p>
<p>Less-than symbol: &#60;</p>
<p>Less-than symbol: &#x3c;</p>

Less-than symbol: <

Less-than symbol: <

Less-than symbol: <

Entities are amazing, aren't they?

If a character doesn't have a corresponding named entity, you can always use a numeric entity to express it in HTML, given that you know its code point.
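
For instance, emoji have no named entities, so a numeric entity is one way to write them using only ASCII characters. The sketch below assumes the grinning face emoji at code point U+1F600 (128512 in decimal) and a device font that has a glyph for it:

```html
<!-- Decimal form: code point 128512 -->
<p>Grinning face: &#128512;</p>
<!-- Hexadecimal form: code point 1F600 -->
<p>Grinning face: &#x1F600;</p>
```

Both paragraphs should display the same 😀 character.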
