Introduction

Strings play a central role across computer programming. Programs use them to display messages, read input, write logs, and much more.

Nearly every language has a mechanism to work with strings in one way or the other. As we learnt in the foundation chapter, Python also has one to work with strings.

In this chapter, we shall explore Python's string data type in detail, covering an array of concepts ranging from common terminology to immutability.

What are strings?

Let's start by reviewing what exactly a string is:

A string is a sequence of characters.

Each character in a string sits at a given position, formally known as its index. Indexes start at 0.

The first character has index 0, the second one has index 1, the third one has index 2, and so on and so forth.

The total number of characters in a string is referred to as the length of the string.

Python strings are sequences of Unicode characters. When such text is encoded using UTF-8, the default encoding for Python source files, each character occupies at least 8 bits (one byte) and at most 32 bits (four bytes).
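To make the difference between characters and encoded bytes concrete, here is a minimal sketch. The variable names are illustrative, and the \u00e9 escape is used only so that the example doesn't depend on how this file itself is encoded.

```python
# "h\u00e9llo" is the word "héllo", written with an escape for the accented letter.
text = "h\u00e9llo"

char_count = len(text)                  # length counts characters
byte_count = len(text.encode("utf-8"))  # number of bytes after UTF-8 encoding

print(char_count)  # 5
print(byte_count)  # 6, because the accented letter takes two bytes in UTF-8
```

The length of a string is counted in characters, not in bytes, which is why the two numbers differ.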

Creating strings

Python provides a handful of ways to create strings.

First we'll see how to create single-line strings using a pair of single quotes or a pair of double quotes, and then move on to multiline strings.

A pair of single quotes '' or a pair of double quotes "" denotes a string. Specifically, such a value is called a string literal.

Consider the code below:

s = "Hello"

We define a variable s and assign it the string literal "Hello", created via double quotes.

This string could've been written using single quotes as well, as shown below.

s = 'Hello'

Both the strings "Hello" and 'Hello' are identical to one another — there is absolutely no difference between them.

Now you might ask: what's the use of two kinds of quotes to denote a string?

Well, if you want to create a string that contains a single quote, then denoting it using single quotes would throw an error.

This is apparent in the code below:

s = 'Python's World'

Python takes the string to start after the equals sign and end right after the word 'Python'. However, this is not what we intended: we meant the string to end after the word 'World'.

This misinterpretation occurs because when a string begins with a given character (a ' single quote in this case), the interpreter eagerly looks for the same character in order to complete the pair. As soon as it finds one, it terminates the string right at that point, without going any further.
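One way to observe this error without breaking a whole script is to hand the faulty line to the built-in compile() function as text. This is only a sketch for demonstration; the names source and raised_syntax_error are made up for this example.

```python
# The faulty line is kept inside a double-quoted string so this file itself stays valid.
source = "s = 'Python's World'"
raised_syntax_error = False

try:
    compile(source, "<example>", "exec")  # parse the line without running it
except SyntaxError as error:
    raised_syntax_error = True
    print("SyntaxError:", error.msg)

print(raised_syntax_error)  # True
```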

The solution to this issue is very simple — create the string using double quotes "". In this way, the symbols used to denote the string won't come in conflict with the symbols used within the string.

The same goes for strings that have double quotes in them — create them using single quotes ''.
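Both cases can be sketched as follows (the variable names and the second sentence are illustrative):

```python
apostrophe = "Python's World"   # a single quote inside double quotes
quotation = 'She said "Hello"'  # double quotes inside single quotes

print(apostrophe)  # Python's World
print(quotation)   # She said "Hello"
```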

However, what if the string has both double and single quotes in it? There are two ways to solve this problem:

  1. Escape characters in the string.
  2. Use a multiline string.

Both these solutions are discussed in detail in the sections below.

Escaping characters

When a string is denoted by a certain symbol, the interpreter searches for its pair and terminates the string as soon as it's found.

If the symbol appears within the string, Python misinterprets it as the ending of the string and terminates the string right at that point, leading to errors.

This can be solved if we tell the interpreter that the same symbol appearing within the string doesn't mark the end of the string. How?

By escaping the characters using a \ backslash.

When we escape a character within a string, then even if it resembles the character used to denote the string, no misinterpretation occurs. This is because the interpreter now knows that the similar character within the string is escaped, and hence doesn't denote the end of the string.

Shown below is a very elementary example:

s = 'Python\'s World'
print(s)
Python's World

The string s is denoted using single quotes, therefore a single quote can't be put within it, as is. To put a single quote, we first have to escape it. This is done by using a backslash followed by the quote — \'.
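The same idea handles a string containing both kinds of quotes. In the sketch below (with illustrative variable names), we escape whichever quote matches the one used to denote the string:

```python
single = 'Python\'s saying: "Readability counts"'    # escape the single quote
double = "Python's saying: \"Readability counts\""   # escape the double quotes

print(single)            # Python's saying: "Readability counts"
print(single == double)  # True

# The backslash is not part of the resulting string.
print(len('\''))         # 1
```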

Multiline strings

If you're not a big fan of escaping characters, but still want to prevent a character within a string from being misinterpreted as the end of the string, Python has another way to save your day.

That is using multiline strings.

A multiline string can be denoted in one of the two ways shown below:

  1. Using a pair of triple single quotes - ''' '''
  2. Using a pair of triple double quotes - """ """

Why it's called multiline is something we shall explore later on. For now, let's see how to solve the character conflict problem using multiline strings.

We want to put the following text in a string: Python's saying: "Readability counts".

It has both single and double quotes, so denoting it using either of them would require every occurrence of that same quote within the string to be escaped. However, using multiline strings, the need for escaping disappears.

Consider the code below:

s = '''Python's saying: "Readability counts"'''

The given text is enclosed within a pair of ''', as is. There is no need to escape the single or double quotes.

The reason is very straightforward. The character used to denote the string is ''', so Python looks for its pair which is '''. Since this doesn't occur anywhere within the string, there is no character misinterpretation.
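As a quick check, the sketch below compares the triple-quoted string with an escaped single-quoted version of the same text (variable names are illustrative):

```python
triple = '''Python's saying: "Readability counts"'''
escaped = 'Python\'s saying: "Readability counts"'

print(triple)             # Python's saying: "Readability counts"
print(triple == escaped)  # True
```

Both spellings produce exactly the same string; only the way it's written in the source code differs.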

So, let's now see why multiline strings are called multiline. For this we'll take another example.

Let's say you want to make a message that starts with "Hello World!" on the first line, followed by "Bye." on the third line, as shown below.

Hello World!

Bye.

Doing so with a single-line string would look as follows:

s = 'Hello World!\n\nBye.'

After 'Hello World!', we put two \n characters. The \n character denotes a newline.

However, with multiline strings, we do not have to worry about these \n newline characters — just write the message as it is.

s = '''Hello World!

Bye.'''

As is clear here, multiline strings can span multiple lines, hence the name 'multiline'.
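To confirm that both spellings give the same result, we can compare them directly. This is a small sketch with illustrative variable names:

```python
single_line = 'Hello World!\n\nBye.'
multi_line = '''Hello World!

Bye.'''

print(single_line == multi_line)  # True
print(multi_line.count('\n'))     # 2 newline characters, just like the first version
```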

Indexes and length

As we've stated before, each character in a string lies at a given position known as its index, and the total number of characters is known as the length of the string.

In Python, we can retrieve a character at a given index using bracket notation, as follows:

s = 'Hello World!'

print(s[0]) # first character
print(s[2]) # third character
H
l

The first character lies at index 0, so s[0] returns the first character of s. The third character, similarly, lies at index 2 and this is what s[2] returns.

Since strings are immutable, it's invalid to change a character of a string using bracket notation. An assignment such as s[0] = 'J' raises a TypeError.
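The sketch below attempts such a change and catches the resulting error. The flag assignment_failed is just an illustrative name.

```python
s = 'Hello World!'
assignment_failed = False

try:
    s[0] = 'J'  # strings don't support item assignment
except TypeError as error:
    assignment_failed = True
    print('TypeError:', error)

print(s)  # Hello World! (the string is unchanged)
```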

Python also supports negative indexes in bracket notation, which are counted from the end of a string.

For example, s[-1] returns the last character of the string s, while s[-2] returns its second last character.

s = 'Hello World!'

print(s[-1]) # last character
print(s[-3]) # third last character
!
l
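A negative index -n refers to the same character as the index len(s) - n. The following sketch checks this for the string used above, using the len() function that returns the length of a string:

```python
s = 'Hello World!'

last = s[-1]
same_last = s[len(s) - 1]  # index 11 in this 12-character string

print(last == same_last)       # True
print(s[-3] == s[len(s) - 3])  # True
```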

Moving on, to retrieve the length of a string, we pass it to the built-in len() function. It returns the total number of characters in a given string.

The len() function is demonstrated below:

len("Hello")
5
len("Python")
6
len("Programming geeks")
17

What will len('') return?

  • -1
  • 0

Slicing a string

It's a common task to extract substrings out of a string from given positions. Formally, this is referred to as slicing a string.

A slice of a string is a segment of it, starting at a given index and ending at a given index.

In Python, we can slice a string using bracket notation. But this time, what goes inside the brackets is something different.

The general syntax of slicing is shown below for a given string s.

s[start:end]

start defines the index at which to start the slicing. This is inclusive. On the other hand, end defines the index at which to end the slicing. This is exclusive.

Shown below are a handful of slices made on the string "Hello World!":

s = "Hello World!"

slice_1 = s[0:5]
print(slice_1)

slice_2 = s[2:4]
print(slice_2)
Hello
ll

The first slice works as follows: it begins at index 0, and ends exactly at index 4, since the end, which is 5, is exclusive. It constitutes the string 'Hello'.

The second one begins at index 2 and ends exactly at index 3 constituting the string 'll'.

The end parameter is always one greater than the index up to which you want to slice a string. For example, if you wish to slice up to index 4 (including it), then you'll need to pass in 5 as the end parameter.
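A handy consequence of the exclusive end is that, when both indexes lie within the string, the length of a slice equals end minus start. Here is a small sketch with illustrative variable names:

```python
s = "Hello World!"
start, end = 6, 11

word = s[start:end]  # characters at indexes 6, 7, 8, 9 and 10

print(word)                      # World
print(len(word) == end - start)  # True
```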

Omitting the end index makes it default to the length of the string. In other words, the slice gets made from start to the end of the string.

An example is illustrated below:

"Great day..."[2:]
eat day...

The notation [2:] slices the string "Great day..." from index 2, all the way to its end.

It's also possible to omit the start index. In this case, it defaults to the start of the string, i.e. index 0.

"Great day..."[:3]
Gre

In the snippet above, the notation [:3] slices the string "Great day..." from index 0 to exactly the index 2 (3 is exclusive).

Following from both these cases of omitting the start and the end indexes, there is another case where we can omit both of them from the slice notation.

This simply returns the whole string.

Consider the code below:

"Great day..."[:]
Great day...
"Nice try"[:]
Nice try

In both cases, we get the same string back, since the slicing is performed from the very beginning to the very end of the string.
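The defaults for omitted indexes can be verified by writing them out explicitly, as in this short sketch:

```python
s = "Great day..."

print(s[2:] == s[2:len(s)])  # True: end defaults to the length of the string
print(s[:3] == s[0:3])       # True: start defaults to 0
print(s[:] == s)             # True: both defaults together give the whole string
print(s[:3] + s[3:] == s)    # True: the two halves join back into the original
```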

Raw strings

Once you learn regular expressions in Python, you'll find yourself constructing many strings that contain the backslash character literally.

As we've seen above, the backslash character is used to escape characters. This means that we can't use a backslash, as is, in a string. If we do so, it won't be treated as a backslash, but rather as the start of an escape sequence.

To include a backslash literally in a string, we have to escape it using another backslash.

So to write three backslashes we would need to write a total of six backslashes. If we need to write five backslashes we would need to write ten of them. In short, we need to write double the number of backslashes we actually want.

As the number of backslashes increases, creating strings can become quite complex.
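The doubling is easy to see in a small sketch. The variable name three is illustrative:

```python
three = '\\\\\\'   # six backslashes typed in the source code

print(three)       # \\\
print(len(three))  # 3, since each escaped pair produces a single backslash
```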

Python recognises this issue and provides a special kind of string to solve it - known as a raw string.

A raw string is denoted by the prefix r before the string. For instance, to denote 'a' in raw format we'll go as follows: r'a', r"a", r'''a''' or r"""a""".

In a raw string, a backslash isn't treated as an escape character, rather it's treated as a literal character.

Consider the following code comparing a normal string with a raw string:

s = 'New\nline'
print(s)

print()

raw_s = r'New\nline'
print(raw_s)
New
line

New\nline

First we have a normal string. In it, the sequence \n denotes a newline, so the string 'New\nline' gets printed with the text 'line' coming on a new line.

In contrast, the raw string r'New\nline' treats \n as a backslash character followed by the 'n' character, NOT as a newline character. What gets printed is New\nline, as it is.

Let's inspect raw_s in the interactive shell:

raw_s
'New\\nline'

The shell displays raw_s the way it would be written as a normal string literal, with the backslash escaped by another backslash. The string itself contains just a single backslash.

Seeing this, we can deduce that a raw string is equivalent to a normal string in which each backslash is preceded by another backslash.

Raw strings are simply syntactic sugar over the usual escaping routine required in normal strings.
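One way to check what a raw string actually contains is to compare it with the equivalent escaped literal and to count its characters. The names normal_s and newline_s below are illustrative.

```python
raw_s = r'New\nline'
normal_s = 'New\\nline'   # the same text written with an escaped backslash
newline_s = 'New\nline'   # here \n is a single newline character

print(raw_s == normal_s)  # True
print(len(raw_s))         # 9: N, e, w, backslash, n, l, i, n, e
print(len(newline_s))     # 8: the newline counts as one character
```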

String replication

To replicate a string a given number of times, use the * replication operator on the string.

When used on numbers, * performs multiplication between the numbers. However, when used on strings, it performs replication of the string.

Well, technically, even in terms of strings, * can be thought of as multiplying the string a given number of times.

Consider the snippet below:

'a' * 4
aaaa
'10_' * 6
10_10_10_10_10_10_

First the string 'a' is replicated 4 times giving 'aaaa', and then the string '10_' is replicated 6 times giving '10_10_10_10_10_10_'.
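A few more cases are sketched below. The separator line is just one illustrative use; the variable names are made up for this example.

```python
separator = '-' * 20   # a simple divider line for console output
repeated = 3 * 'ab'    # the number may also come first
empty = 'ab' * 0       # replicating zero times gives an empty string

print(separator)    # --------------------
print(repeated)     # ababab
print(empty == '')  # True
```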

String replication is used fairly often in Python programming. Many algorithms, such as those written for coding competitions, can benefit from it.

String methods

Up to this point, we've explored a lot about the string data type in Python. One huge avenue, however, is still left, and that is string methods.

Most of the time when we are working with strings, we want to manipulate them in some way or the other, for example split them into a list of substrings, lowercase all their characters, and so on.

These concerns are addressed by string methods, which we shall explore in detail in the next chapter.
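As a small preview, here is a sketch of the two tasks mentioned above, using the lower() and split() methods of strings. The sample text is illustrative.

```python
text = 'Hello World'

lowered = text.lower()  # a new string with all characters lowercased
words = text.split()    # a list of substrings, split on whitespace

print(lowered)  # hello world
print(words)    # ['Hello', 'World']
print(text)     # Hello World (the original string is unchanged)
```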