HTML URIs

Chapter 11 40 mins

Learning outcomes:

  1. What are URIs, or Uniform Resource Identifiers
  2. The components of a URI
  3. The http and https URI schemes
  4. Absolute URLs and relative references
  5. The file URI scheme
  6. The mailto URI scheme
  7. The tel URI scheme

What are URIs?

URIs are one of the most fundamental aspects of the World Wide Web.

They allow us to uniquely identify resources on the internet and essentially help us navigate our way around it. URIs are an integral part of the web, for if we take them out of the equation, the web would cease to function the way it does right now.

URIs define the entire web rather than being a part of it.

To get to the definition:

A Uniform Resource Identifier, or simply a URI, is a uniform means of identifying a resource on the internet.

There are two classifications of URIs:

  • Uniform Resource Locators, or URLs, are used to locate resources on the web instead of just uniquely identifying them. Almost every single person who uses the internet uses URLs at one point or another.
  • Uniform Resource Names, or simply URNs, are used to identify given resources but not necessarily locate them (as is otherwise the case with URLs). URNs are not as common as URLs.

A common misconception amongst beginner, sometimes even experienced, developers is that URIs and URLs are the same. Strictly speaking, that's NOT the case.

URIs and URLs are not the same!

As we learnt above, URIs can either be URLs or URNs; all URLs are URIs but not all URIs are URLs. In other words, URIs are a superset of URLs.

This misconception, that URIs and URLs are the same, arises from the fact that most — in fact, almost all — of the URIs that we use on the web are URLs.

Sure, when a URI is a URL, we can call it either of these. But this shouldn't lead us into believing that URIs and URLs are literally the exact same thing. Absolutely not.

A URL is just one classification of a URI, the other being URNs.

The components of a URI

In theory, a URI is a very simple concept. But to further simplify it, it's broken down into individual components, each serving a different purpose.

Here's an illustration of the general syntax of a URI:

scheme:[//authority]path[?query][#fragment]

There are five components depicted here: scheme, authority, path, query and fragment.
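To make this breakdown a bit more concrete, here's a minimal sketch in Python, chosen only because its standard urllib.parse module can split a URI for us (this course doesn't otherwise rely on Python). The URI used is made up for illustration. Note that Python calls the authority netloc, and reports the query and the fragment without their leading ? and # symbols.

```python
from urllib.parse import urlsplit

# A made-up URI that uses all five components (for illustration only).
uri = "https://example.com/lectures/lecture-1.html?lang=en-us#section1"

parts = urlsplit(uri)

print(parts.scheme)    # https
print(parts.netloc)    # example.com  (the authority, without the leading //)
print(parts.path)      # /lectures/lecture-1.html
print(parts.query)     # lang=en-us   (without the leading ?)
print(parts.fragment)  # section1     (without the leading #)
```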

Let's see what each of these components does...

Scheme

The scheme, sometimes also known as the URI's naming scheme, is the most important part of a URI — the entry point into the URI.

The scheme states the very nature of what is identified by the URI (such as an http resource, or an email address, or a telephone number, and so on).

For example, the scheme that we're all mostly familiar with on the web is https. The https scheme lays out a specification of URIs that are used to identify resources served with the HTTP protocol. Similarly, mailto is used to identify resources that represent email addresses.

There are many different kinds of URI schemes. Some of the most popular ones are http, https (http with encryption), mailto, tel, telnet, ftp, file, data, ssh, irc, etc.

Each of these schemes is used to produce URIs whose meaning depends upon the semantics and syntax of the scheme itself. Different schemes use different parts of the general URI syntax (shown above) differently.

In this chapter, we'll only concern ourselves with the following schemes: https (and http), file, mailto and tel.

Authority

Many URI schemes leverage the concept of an authority, sometimes also known as the naming authority, to which the path and the following components are handed over for the purpose of resolution.

The authority simply represents an entity that takes the latter part of the URI and processes it in identifying a specific resource.

The authority begins with two forward slashes (//) and ends with the next delimiter in the URI. It is comprised of further subcomponents, which, just like a URI, can also be expressed using a generic syntax.

The generic syntax of an authority is described below:

[userinfo@]host[:port]

It's worthwhile noting that NOT every URI is composed of an authority.

As an example, in the https scheme, the authority component is simply comprised of a domain name. That is, in the HTTP URI https://example.com, the authority is //example.com, where example.com is simply the domain name of the underlying resource.

In the following section, when we explore the https (and http) URI scheme in detail, we'll explore the authority component along with it.
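As a rough illustration of the [userinfo@]host[:port] syntax, the sketch below uses Python's standard urllib.parse module to pull the subcomponents out of an authority. The user name alice and the port 8080 are hypothetical values picked just to fill in every slot.

```python
from urllib.parse import urlsplit

# Hypothetical URI with all three authority subcomponents filled in.
authority_demo = urlsplit("https://alice@example.com:8080/home.html")

print(authority_demo.netloc)    # alice@example.com:8080
print(authority_demo.username)  # alice
print(authority_demo.hostname)  # example.com
print(authority_demo.port)      # 8080

# When the authority is just a domain name, the optional parts are absent.
plain = urlsplit("https://example.com")
print(plain.username, plain.port)  # None None
```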

Path

The path is perhaps the easiest component of a URI to understand.

The path represents a hierarchical means of identifying a resource.

Just like the authority component applies to the preceding naming scheme, the path applies to the preceding naming scheme along with the naming authority (if any).

As the name suggests, the path typically addresses the 'path' to get to a resource. This path may either be physical, i.e. it represents an actual filesystem path on a computer, or be abstract, i.e. not existing in reality but processed by a computer for an appropriate response.

Once again, taking https as an example, in the following URI,

https://example.com/lectures/lecture-1.html,

the path is /lectures/lecture-1.html.

We'll learn more about paths, specifically in the https scheme, later on in this chapter when we explore the https scheme in detail.

Query

After the path comes the query in a URI.

The query component is a means of passing a query, i.e. a piece of additional information to the underlying resource.

The query begins with (and includes) the ? symbol.

We don't need to worry about understanding the query component at this stage because it's mostly used by server-side software and sometimes even by JavaScript, both of which are avenues that we're yet to discover.

Fragment

The fragment is the final component of a URI. As with the query component, not all URIs leverage the fragment.

The fragment identifies a particular section within the underlying resource.

The fragment begins with (and includes) the # symbol.

When used in https (or http), the fragment represents a destination anchor in the underlying resource, which is customarily an HTML document. The name of the anchor is simply the value of the fragment component, excluding the # symbol.

We'll see more about fragments and destination anchors in the next HTML Hyperlinks chapter.

The http and https URI schemes

It won't be wrong to say that, by far, the most popular and commonly-used set of URIs on the web fall under the category of http and https URIs. The webpage that you're currently viewing is also identified by such a URI.

So what's so special about https and http?

Well, to start off the discussion, the http scheme denotes URIs that apply to the HTTP protocol.

The Hypertext Transfer Protocol, or HTTP, is a protocol, i.e. a set of rules, that defines how exactly hypertext is transferred over the wire.

HTML, HTTP and URIs are the three core technologies that formulated the web in the early 90s. HTML was the format of the data, HTTP was the way this data was transferred, and URI was the way to identify a given piece of the data.

As simple as it could get.

HTTP can transfer much more than just hypertext!

Although the name 'HTTP' suggests that the protocol is only capable of transferring hypertext, this has changed over the years since its inception. HTTP today can transfer any kind of data, including images, videos, audio, PDFs, binary files, our very own HTML files, and a lot more.

In that sense, one might say that the term 'HTTP' today doesn't truly encompass what the protocol is capable of transferring, and it won't be wrong to say this.

Actually, the term 'HTTP' was crafted at a time when the only capability of the protocol was transferring just hypertext, but soon it evolved into a complex technology that became powerful enough to transfer just about any kind of data. In this evolution, however, the name of the protocol was kept the same.

But, frankly speaking, we feel that the term HTTP is quite cool despite the fact that it just talks about hypertext. What do you say?

Moving on, let's now talk about the https scheme.

The https scheme relies on the HTTPS protocol, which is just HTTP with an added layer of security.

HTTPS stands for Hypertext Transfer Protocol Secure and is a protocol for delivering HTTP-based data over a secure communication channel.

You might ask: Why was HTTPS created? Well, let's see...

In the nascent web, security wasn't really a big issue until people started to realize the immense vulnerabilities that the web intrinsically carried along with it.

One of these was that of eavesdropping, whereby an attacker would tap into an HTTP connection between a client and a server and read all the data being transmitted therein.

Since HTTP was all just plain text, this led to a severe vulnerability of sensitive information being possibly leaked to eavesdroppers.

The solution, HTTPS.

HTTPS doesn't do an enormous amount of engineering to HTTP; it just takes whatever is delivered in the HTTP protocol and encrypts it (converts it into scrambled data). This encryption renders all the transferring data useless to an eavesdropper, who only gets to see gibberish data.

Remember that HTTPS doesn't prevent eavesdropping, that's still possible; it only makes the data being transferred not understandable to the eavesdropper.

We've largely simplified the model of HTTPS and eavesdropping, which is one of the many ways of cracking web security, over here; in reality, the situation is far more complex than this. Getting into the details of HTTPS and the possibilities of cracking its security are both out of the scope of this text and require some highly technical topics.

Anyways, now that we know about the http and https URI schemes, let's quickly go through the more common syntax of such URIs, with an example URI.

Consider the following simple URI:

https://www.example.com/home.html
A simple URI with a scheme, authority and path.

We start off with the scheme, which is https. Next comes the authority part, which for http and https URIs is just a domain name, optionally followed by a port number.

If the port is omitted, it's implicitly assumed to be 443 for https and 80 for http.

Since the example URI above doesn't include a port and its scheme is https, it's equivalent to the following URI that includes a port:

https://www.example.com:443/home.html
An https URI with a port.

By default, browsers are configured to not display the port for http if it's 80 and for https if it's 443.
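The default-port rule can be sketched in a few lines of Python. The helper name effective_port and the DEFAULT_PORTS table are our own inventions for this illustration; they only cover the two schemes discussed here.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes covered in this section.
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(uri):
    """Return the explicit port of a URI, or the scheme's default if omitted."""
    parts = urlsplit(uri)
    if parts.port is not None:
        return parts.port
    # Schemes outside our small table yield None.
    return DEFAULT_PORTS.get(parts.scheme)

print(effective_port("https://www.example.com/home.html"))       # 443
print(effective_port("http://www.example.com/home.html"))        # 80
print(effective_port("https://www.example.com:8443/home.html"))  # 8443
```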

After the authority part, we have the path, which is /home.html.

For http and https URIs, making sense of the path is really simple. It just represents the hierarchical path on the underlying machine hosting the website that takes us to the end resource.

In the web's early days, paths were always physical paths, i.e. they were equivalent to a normal filesystem path on the server.

For instance, if the path /home.html shown above was a physical path, then /home.html would be representing an actual home.html file located inside the root directory of the website www.example.com on its respective server.

However, these days, paths are usually abstract, i.e. they don't exist for real. Abstract paths are processed by servers using some kind of a program that crafts a response based on the path requested.

For example, in the URI https://www.example.com/products/78, we might not actually have anything such as /products/78 on the underlying server (there isn't even a file extension on 78!). The path would probably be processed by the server, with the information of the product whose id is 78 obtained from a database.

Going forward, after the path, we have the query, as follows:

An https URI with a query.

The query begins with (and includes) the ? symbol. It consists of a set of name=value pairs, known as parameters, delimited by & characters.

Each of the parameters in a query acts more or less like an HTML attribute in that it provides additional information in the URI.

For instance, in the example shown above, the lang parameter set to the value en-us tells the server that the home.html resource should be returned in the English (US) language.
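To get a feel for how software reads these name=value pairs, here's a small sketch using parse_qs from Python's standard library. The lang parameter comes from the discussion above, whereas the theme parameter is a hypothetical second parameter added only to show the & delimiter at work.

```python
from urllib.parse import urlsplit, parse_qs

# 'theme=dark' is a made-up second parameter, used to show the & delimiter.
query_uri = "https://www.example.com/home.html?lang=en-us&theme=dark"

# urlsplit() gives us the query without its leading ? symbol.
query_string = urlsplit(query_uri).query
print(query_string)  # lang=en-us&theme=dark

# parse_qs() maps each name to a list, as a name may appear more than once.
params = parse_qs(query_string)
print(params["lang"])   # ['en-us']
print(params["theme"])  # ['dark']
```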

We won't be discussing the query component any further than this, as it relies upon server-side software and/or JavaScript to be parsed and acted upon, and we haven't yet explored either of these avenues.

The final thing left for an http (and https) URI is the fragment.

An https URI with a fragment.

A fragment simply represents a particular section within the resource identified by the underlying URI, which is mostly an HTML document.

A fragment begins with (and includes) the # symbol.

In the example URI above, the fragment #section1 represents a section — or better to say, a destination anchor — in the home.html resource (an HTML file) with the name section1.
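As a quick sketch, Python's standard urldefrag function separates a URI from its fragment. We're assuming here that the example URI is the earlier /home.html URI with #section1 appended. The returned fragment doesn't include the # symbol, which conveniently leaves us with just the name of the anchor.

```python
from urllib.parse import urldefrag

# Assumed form of the example URI: the earlier URI plus the #section1 fragment.
page_url, anchor_name = urldefrag("https://www.example.com/home.html#section1")

print(page_url)     # https://www.example.com/home.html
print(anchor_name)  # section1
```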

We'll be discussing fragments and destination anchors in detail in the next HTML Hyperlinks chapter.

RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1, a formal specification published by the IETF in June 1999, goes into a lot more detail regarding http URIs. It has since been superseded by newer documents, most recently RFC 9110 (HTTP Semantics, June 2022). These are highly technical documents, but there's still something of use in them for readers of any experience level.

The file URI scheme

Besides http and https, the file URI scheme is also a very commonly-used scheme.

The file scheme is used by nearly every beginner learning HTML. Even you have used it, possibly without knowing about it. The HTML files that you've been creating thus far in this course have all been launched in the browser using a file URI!

Go to any one of the HTML files that you created, open it up again in the browser, and notice the URL displayed in the address bar.

The file URI scheme represents URIs that identify resources on the underlying filesystem.

As an example, consider the following file URI on a Windows computer:

file:///C:/Users/Alice/OneDrive/Desktop/greeting.html
A file URI on Windows.

It denotes a greeting.html file that resides in the directory path C:/Users/Alice/OneDrive/Desktop.

A file URI typically has an empty authority (which is why three slashes follow file:), and its path always represents an actual physical path pointing to a file on the underlying filesystem.

The path of a file URI can never be abstract!
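Here's a minimal sketch that splits a Windows file URI using Python's standard urlsplit function. The URI is the one for the greeting.html file on Alice's desktop. Notice that the authority comes back empty and that the path carries the drive letter after a leading slash.

```python
from urllib.parse import urlsplit

file_parts = urlsplit("file:///C:/Users/Alice/OneDrive/Desktop/greeting.html")

print(file_parts.scheme)        # file
print(repr(file_parts.netloc))  # ''  (nothing between // and the path)
print(file_parts.path)          # /C:/Users/Alice/OneDrive/Desktop/greeting.html
```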

At this beginning stage, you're all good to work with HTML in file URIs.

It's only once you're done learning CSS and JavaScript that you should, and probably would, transition to http URIs with the help of some server software (which you'll set up on your own).

RFC 8089 — The "file" URI Scheme, published by the IETF in Feb 2017, discusses the file URIs in depth.

Absolute vs. relative URLs

In this section, we shall discuss two ways of referring to URLs while working with them on the web, namely absolute URLs and relative references.

First things first:

An absolute URL begins with a scheme, and self-contains all the information necessary to define a complete URL.

Some examples of absolute URLs are:

  • https://example.com/
  • https://example.com/home.html
  • http://localhost/items/1
  • http://localhost:3000/
  • file:///C:/Users/Alice/OneDrive/Desktop/greeting.html

An absolute URL doesn't have to be combined together with another URL in order to give us a complete URL to work with — it already is in complete form.

Note that we use the term 'URL' in our discussion only when we have a complete URL. For example, we'll indeed call https://example.com/home.html a URL, but not home.html, since it doesn't begin with a scheme.

As we shall see up next, home.html aligns with the second way of referring to a URL, i.e. via a relative reference.

A relative reference to a URL doesn't begin with a scheme, and is combined with an absolute URL in order to produce a final, complete URL.

As the name suggests, a relative reference is literally relative to another URL.

After resolving a relative reference, the complete URL that we get in the end is called its target URL. The URL relative to which a relative reference gets resolved is called the base URL.

Relative references or relative URLs?

Relative references are more commonly referred to as relative URLs (or relative URIs, depending on the context) in resources out there.

However, we tend to avoid this naming, being in line with RFC 3986.

In fact, RFC 3986 itself states that the term 'relative URI' was used in previous RFCs but led to some readers misunderstanding that it referred to a subset of URIs, which wasn't the case.

Consequently, the term 'relative reference' was used in lieu of 'relative URI' to emphasize that it's merely a means of referring to another URI and not a URI itself.

Relative references are pretty commonly used on the web in order to save space and keep from specifying complete, absolute URLs when the resources linked are somehow related to the current URL (in the browser's address bar). Relative references lead to shorter addresses.

A relative reference can begin with two slashes (//); in this case, it denotes a network-path reference, beginning with its authority (recall that the authority begins with //).

For instance, if our base URL is http://example.com/home.html and our network-path reference is //codeguage.com/about, we'll get the following resolved URL: http://codeguage.com/about.

As you can see, the scheme of the target URL of a network-path reference is the same as that of the base URL. That is, in our example, we had http in the base URL and again http in the resolved, target URL.

Network-path references are perhaps the least commonly used of relative references.
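The resolution of a network-path reference can be tried out with urljoin from Python's standard library, which resolves a reference against a base URL. This is just a sketch using the same base URL and reference as discussed above.

```python
from urllib.parse import urljoin

base_url = "http://example.com/home.html"

# The reference starts with //, so only the scheme is taken from the base URL.
network_target = urljoin(base_url, "//codeguage.com/about")

print(network_target)  # http://codeguage.com/about
```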

A relative reference can even begin with a single slash (/); in this case, it denotes an absolute-path reference.

An absolute-path reference denotes the complete — the 'absolute' — path of the target URL.

For instance, let's say that our base URL is http://example.com/items/watch.html and our absolute-path reference is /home.html. Then we'll get the following resolved URL: http://example.com/home.html.

Here are a handful of examples of absolute-path references, with the base URL http://example.com/items/watch.html:

  • /home.html; resolves to http://example.com/home.html
  • /about/our-story; resolves to http://example.com/about/our-story
  • /cart/checkout; resolves to http://example.com/cart/checkout
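Absolute-path references can be checked the same way. The sketch below resolves a few of them with Python's standard urljoin function against the base URL http://example.com/items/watch.html; in each case, the whole path of the base URL gets replaced.

```python
from urllib.parse import urljoin

items_base = "http://example.com/items/watch.html"

absolute_path_refs = ["/home.html", "/about/our-story", "/cart/checkout"]

# Each reference begins with /, so it replaces the entire path of the base URL.
absolute_targets = [urljoin(items_base, ref) for ref in absolute_path_refs]

for ref, target in zip(absolute_path_refs, absolute_targets):
    print(ref, "resolves to", target)
```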

If a relative reference doesn't begin with a slash (/), we have what's called a relative-path reference.

A relative-path reference denotes a partial path of the target URL; the complete path is determined by combining it with the path of the base URL (hence, the term 'relative').

Suppose that our base URL is again http://example.com/items/watch.html and our relative-path reference is cup.html. Then we'll get the following resolved URL: http://example.com/items/cup.html.

It's very easy to reason this using the analogy of files and directories on a computer. That is, think of items as being a directory, containing the file watch.html; then when we refer to cup.html, since we're still in that items directory, we just go to the cup.html file in it.

Here are a few examples of relative-path references for the base URL http://example.com/items/watch.html:

  • watches/big-watch.html; resolves to http://example.com/items/watches/big-watch.html
  • ./cup.html; resolves to http://example.com/items/cup.html
  • ../cup.html; resolves to http://example.com/cup.html

The last two examples here are worth consideration. The special values . and .. are called dot-directories.

  • . represents the current directory. A reference beginning with ./ is essentially the same as the reference with these two characters trimmed off.
  • .. represents the parent directory; it takes us one directory upwards.
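To see relative-path references and the special values . and .. in action, here's a short sketch with Python's standard urljoin function, again using the base URL http://example.com/items/watch.html.

```python
from urllib.parse import urljoin

watch_base = "http://example.com/items/watch.html"

# Resolved relative to the 'items' directory of the base URL.
print(urljoin(watch_base, "cup.html"))                # http://example.com/items/cup.html
print(urljoin(watch_base, "watches/big-watch.html"))  # http://example.com/items/watches/big-watch.html

# ./ stays in the current directory; ../ moves one directory upwards.
print(urljoin(watch_base, "./cup.html"))   # http://example.com/items/cup.html
print(urljoin(watch_base, "../cup.html"))  # http://example.com/cup.html
```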

It's paramount for you to master these references since we'll be using them a lot in the coming chapters, as we start linking more and more resources into our HTML pages.

The mailto URI scheme

Two other particularly handy URI schemes are those that represent email addresses and telephone terminals. In this section and the next one, we shall explore these two kinds of URIs.

Starting off with the former, the mailto URI scheme is used to denote email addresses.

The mailto URI scheme represents URIs that identify resources accessible via Internet mail.

In simpler words, a mailto URI represents an email address, or multiple such addresses, to which an email will be sent.

A mailto URI has neither an authority component, nor a fragment; only a scheme (obviously), followed by a path, followed by an optional query.

By default, when a mailto URI is opened up in the browser, it takes us to the system's default email app (whichever it is) and creates a new email in it, directed to the given recipient(s) as mentioned in the URI.

mailto doesn't directly send an email; it just creates a new email in the email app which we have to send ourselves. In that sense, we can edit the email before dispatching it.

Let's consider an example URI:

mailto:contact@example.com
A mailto URI.

This URI represents the email address contact@example.com. If we enter this URI into the browser's address bar, we'll be taken to the system's default emailing app, with a message crafted to be sent to the given address.
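The structure of a mailto URI (a scheme, a path, and an optional query) can be sketched with Python's standard urllib.parse module. The subject parameter and its value are assumptions added purely to illustrate the optional query; subject is one of the fields described in RFC 6068.

```python
from urllib.parse import urlsplit, parse_qs

# The '?subject=...' query is a hypothetical addition to the example address.
mail_parts = urlsplit("mailto:contact@example.com?subject=Hello%20there")

print(mail_parts.scheme)        # mailto
print(repr(mail_parts.netloc))  # ''  (no authority component)
print(mail_parts.path)          # contact@example.com

# parse_qs() also decodes percent-encoded characters such as %20 (a space).
mail_fields = parse_qs(mail_parts.query)
print(mail_fields["subject"])   # ['Hello there']
```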

RFC 6068 — The 'mailto' URI Scheme, published by the IETF in Oct 2010, explores mailto URIs in very fine, technical detail.

The tel URI scheme

Many websites provide mobile/telephone contact links which, when clicked, directly take us to the phone app with the underlying contact number filled in there, ready to be called.

Such links leverage the tel URI scheme.

The tel URI scheme represents URIs that identify resources reachable via a telephone number.

The basic syntax of tel URIs is extremely simple.

The telephone number comes right after tel: and can be delimited by hyphens (-) for better readability (by separating the country code, the area code, and so on).

There shouldn't be any spaces in a tel URI. As specified before, use hyphens (-) for delimiting different parts of the number.
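As a rough sketch of this rule, the hypothetical helper below (to_tel_uri is our own name, not part of any library) turns a phone number written with spaces into a tel URI by swapping each run of spaces for a hyphen. The number used is a made-up placeholder, and the helper does no validation of the number itself.

```python
import re

def to_tel_uri(number):
    """Build a tel URI from a number, replacing runs of whitespace with hyphens."""
    cleaned = re.sub(r"\s+", "-", number.strip())
    return "tel:" + cleaned

# A made-up placeholder number, written with spaces.
print(to_tel_uri("+1 201 555 0123"))  # tel:+1-201-555-0123
```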

Here's an example URI:

A tel URI.

The +1 at the beginning of the number is the country code of the US; hence, the shown number actually represents a phone number in the US.

Every country has an associated country code to be used in telephone numbers. For example, the UK has the code +44, Turkey has +90, France has +33, Saudi Arabia has +966, and so on.

The remaining parts of the number narrow it down to specific locations. For instance, the segment 201 in the number above, after the country code, is the area code for the northeastern part of the state of New Jersey in the US.

Different countries have different rules of defining numbers; some might use city codes and area codes whereas others might not use them at all.

Anyways, moving on, browsers are configured by default to handle tel URIs by opening the underlying number in the system's phone application.

On mobile browsers, this is a straightforward process since the phone app exists in the same system as the browser app. However, on desktop devices, opening a tel URI usually leads to some kind of a browser pop-up asking where to redirect the call; if the browser is somehow connected to our phone, we can get the tel URI to be passed on to the mobile phone.

RFC 3966 — The tel URI for Telephone Numbers, published by the IETF in Dec 2004, explores tel URIs in granular detail.
