Python Strings Basics

Chapter 15 29 mins

Learning outcomes:

  1. A quick recap
  2. Escaping characters using \
  3. Multi-line strings
  4. Immutability
  5. String slicing
  6. Raw strings
  7. String replication
  8. Finding substrings within strings

Introduction

All over computer programming, strings play a huge role in powering numerous applications. They are the backbone of programs, used to display messages, obtain inputs, output logs, print stuff, and much more.

Nearly every programming language has a mechanism to work with strings in one way or the other. As we learnt in the Python Basics chapter, Python also has one to work with strings.

In this chapter, we shall explore Python's string data type in detail, covering an array of concepts ranging from common terminology to immutability.

What are strings?

Let's start by reviewing what exactly a string is:

A string is a sequence of textual characters.

In simple words, a string is basically a piece of text. Wherever we ought to work with text, we ought to work with strings.

Each character in the sequence sits at a given position, formally known as its index.

Typically, throughout the computing world, indexes start at 0. The first character in a string has index 0, the second one has index 1, the third one has index 2, and so on.

The total number of characters in a string is referred to as the length of the string.
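As a quick illustration of indexes and length, here is a minimal sketch. The variable names are only for this example; the bracket notation and the built-in len() function are standard Python.

```python
# A small illustrative string (the name `word` is just for this sketch).
word = 'Hello'

first = word[0]      # character at index 0
third = word[2]      # character at index 2
length = len(word)   # total number of characters

print(first)   # H
print(third)   # l
print(length)  # 5

# The last valid index is always one less than the length.
print(word[length - 1])  # o
```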

Python strings are sequences of Unicode characters. UTF-8 is the default encoding Python uses for source files and for converting strings to bytes. In this system, each character occupies at least 8 bits.

Depending on the character, UTF-8 might use 16, 24 or 32 bits for it. UTF-8 just states that the minimum is 8 bits, not that the exact size is 8 bits.
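To get a feel for how the size of a character varies under UTF-8, the sketch below encodes a few sample characters and counts the resulting bytes. The sample characters are arbitrary picks for illustration, written as escape codes so the snippet doesn't depend on the editor's encoding.

```python
# Four single characters: 'a', e-acute, the euro sign and an emoji.
samples = ['a', '\u00e9', '\u20ac', '\U0001F600']

# Number of bytes each character needs when encoded as UTF-8.
byte_counts = [len(ch.encode('utf-8')) for ch in samples]

for ch, n in zip(samples, byte_counts):
    # ascii() shows an escaped form, so this prints safely on any console.
    print(ascii(ch), 'length:', len(ch), 'UTF-8 bytes:', n)
```

Each sample is one character long as far as len() is concerned, yet the encoded sizes come out as 1, 2, 3 and 4 bytes respectively.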

Creating strings

Python provides a handful of ways to create strings.

First we'll see how to create single-line strings using a pair of single quotes ('') or using a pair of double quotes (""), and then head over to consider multi-line strings.

A pair of single quotes ('') or a pair of double quotes ("") denotes a string.

Specifically, such a value is called a string literal.

A string literal is a literal representation of a string in a piece of code.

It's that simple!

Let's start by creating a string literal using single quotes:

s = 'Hello'

Here, we first define a variable s and then assign it the string literal 'Hello', created by a pair of single quotes ('').

This string could've been denoted using double quotes as well, as shown below:

s = "Hello"

Both the strings 'Hello' and "Hello" are identical to one another — there is absolutely no difference between them.
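If you'd like to confirm this for yourself, a small check along the following lines should do. The variable names are made up for the example.

```python
single = 'Hello'   # created with single quotes
double = "Hello"   # created with double quotes

# Both literals produce equal values of the same type.
print(single == double)               # True
print(type(single) is type(double))   # True
```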

Now you might ask: what's the use of two kinds of quotes to denote a string?

Well if you want to create a string that has a single quote in itself, then doing so using single quotes would throw an error.

This is apparent in the code below:

s = 'Python's World'

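Here, Python figures out that the given string starts after the equals sign and ends right after the word Python (i.e. 'Python'). However, this is not what we intended: we know that the string ends at the word World. The leftover text after the second quote makes no sense to the interpreter, and so it reports a syntax error.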

This misinterpretation occurs because when a string begins with a given character — a ' single quote in this case — the interpreter looks for the same character eagerly, in order to complete a pair. As soon as it finds one, it terminates the string right at that point, without going any further.

The term 'eagerly' simply means that the interpreter can't wait to find a match: as soon as it finds one, it completes the string at that point.

The solution to this issue is very simple — create the string using double quotes "". In this way, the symbols used to denote the string won't come in conflict with the symbols used within the string.

s = "Python's World"

The same goes for strings that have double quotes in them — create them using single quotes ''.

s = 'He said, "I love you!"'

At this stage, one question pops up naturally in the mind — what if the string has both double and single quotes in it?

Let's say we want to create a string that has the following text inside it:

Python's saying: "Readability counts"

Based on the discussion above, we couldn't solve this problem in any way. If we use single quotes ('') to denote the string, the ' after Python would come in conflict. Similarly, if we use double quotes ("") to denote the string, the " just before Readability would come in conflict.

In either case, we are helpless.

Now, there are two ways to solve this problem:

  1. Escape characters in the string
  2. Use a multi-line string.

Both of these solutions are discussed in detail in the sections below.

Escaping characters

When a string is denoted by a certain symbol, the interpreter searches for its pair and terminates the string as soon as it's found.

If the symbol appears within the string, Python misinterprets it as the ending of the string and terminates the string right at that point, ultimately leading to errors.

This can be solved if we tell the interpreter that the same symbol appearing within the string doesn't mark the end of the string.

How?

By escaping the characters.

Escaping a character in a string means preceding it with a backslash character (\) in order to escape its default behavior.

When we escape a quote character within a string, Python adds the character to the content of the string instead of treating it as the end of the string.

Hence, if we escape ' in a string denoted using single quotes (''), or escape " in a string denoted using double quotes (""), the interpreter would know that the escaped quote character is part of the string's text, and not a symbol used to denote the end of the string.

Shown below is a very elementary example:

s = 'Python\'s World'
print(s)
Python's World

The string s is denoted using single quotes, therefore a single quote can't be put within it, as is. To put a single quote, we first have to escape it. This is done by using a backslash followed by the quote — \'.

Let's use this idea to solve the problem stated at the end of the last section:

s = 'Python\'s saying: "Readability counts"'
print(s)
Python's saying: "Readability counts"

Here, the string is denoted using single quotes, so we only had to escape the single quote in the string.

If we were to denote the string using double quotes (""), then we'd have to escape only the " characters in the string and not the ' character, as illustrated below:

s = "Python's saying: \"Readability counts\""
print(s)
Python's saying: "Readability counts"

As you might agree, the former code is simpler since it involves only one escape sequence.

Anyway, there is yet another way to solve this string-creation problem that's even simpler than the former code snippet.

Multi-line strings

If you're not a big fan of escaping characters, but still want to prevent the misinterpretation of a character within a string as the end of the string, then behold, Python has another way to save your day.

That is using multi-line strings.

A multi-line string can be denoted in one of the two ways shown below:

  1. Using a pair of triple single quotes — ''' '''
  2. Using a pair of triple double quotes — """ """

Why it's called 'multi-line' is something we shall explore later on. For now, let's see how to solve the character conflict problem that we saw above using multi-line strings.

The problem is that we want to create a string holding the following text:

Python's saying: "Readability counts"

It has both single and double quotes in it, and so denoting it using either of these would require the matching character to be escaped in the string. However, with multi-line strings, the need for escaping simply disappears.

Consider the code below:

s = '''Python's saying: "Readability counts"'''
print(s)
Python's saying: "Readability counts"

The string is denoted using a pair of three-single-quotes (''' '''). As can be seen, we don't escape the single or double quotes appearing within the string, simply because we don't need to.

But why?

The reason is very straightforward. The character used to denote the string is ''', so Python looks for its pair, which is ''', eagerly in the string. Since this doesn't occur anywhere within the string, there is no character misinterpretation.

Easy?

Alright, now it's time to see why multi-line strings are called 'multi-line'. For this, we'll take another example.

Let's say you want to make a message that starts with "Hello World!" on the first line, followed by "Bye." on the third line, as shown below.

Hello World!

Bye.

Doing so with a single-line string would look something like this:

s = 'Hello World!\n\nBye.'

After 'Hello World!', we put two \n characters. The \n character denotes a newline.

However, with multi-line strings, we do not have to worry about these \n newline characters — just write the message as it is.

s = '''Hello World!

Bye.'''

As is clear here, multi-line strings can span multiple lines, hence the name 'multi-line'.
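To see that both approaches produce exactly the same string, a quick comparison like the one below may help. The variable names are chosen just for this sketch.

```python
# The same message written with explicit \n escape sequences...
escaped = 'Hello World!\n\nBye.'

# ...and as a multi-line string, with real line breaks in the source.
multi = '''Hello World!

Bye.'''

print(escaped == multi)   # True
print(repr(multi))        # 'Hello World!\n\nBye.'
print(multi.count('\n'))  # 2
```

The line breaks typed inside the triple quotes are stored as ordinary \n characters, which is why the two values compare equal.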

Immutability

As we saw in the Python Data Types chapter, in the section on lists, it's possible to use bracket notation (an index inside square brackets, such as l[0]) to assign a new value to a particular position in a list.

Let's see whether we could do so with a string:

s = 'Cat'

s[0] = 'B'
print(s)
Traceback (most recent call last):
  File "stdin", line 3, in <module>
    s[0] = 'B'
TypeError: 'str' object does not support item assignment

Unfortunately, we can't.

Changing the value of a given character in a string is invalid. This is because strings in Python are immutable.

Immutable means that once created, the underlying data can't be modified by means of any operation.

In the case of strings, immutability implies the fact that we can't change any character of a string. All string operations that seem to modify a string, as we shall see later on when we discover string methods, return a new string.

If you really want to change a particular character of a string to some other character, then you must create a new string with that new character. We'll see how to do this later on in the section on string slicing.

Moving on, Python also supports negative indexes in bracket notation, which are computed from the end of a string.

For example, s[-1] returns the last character of the string s, while s[-2] returns its second last character.

This idea of negative indexes is quite useful when working with strings in Python. It saves us from manually computing the index of a given character from the end of the string, for example len(s) - 1 for the last character of the string s.

An example follows:

s = 'Hello World!'

print(s[-1]) # last character
print(s[-3]) # third last character
!
l
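The relationship between a negative index and its positive counterpart can be sketched as follows: for a string s, the index -k refers to the same character as len(s) - k. The loop values are arbitrary picks for illustration.

```python
s = 'Hello World!'

# Compare each negative index with the manually computed positive index.
for k in (1, 3, len(s)):
    print(-k, s[-k], s[len(s) - k])

# Output:
# -1 ! !
# -3 l l
# -12 H H
```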

Whatever you do, just make sure that the index is an integer. If it isn't one, Python raises an error:

s = 'Hello World!'

print(s[0.0])
Traceback (most recent call last):
  File "stdin", line 3, in <module>
    print(s[0.0])
TypeError: string indices must be integers

Slicing a string

It's a common task to extract substrings out of a string from given positions. Formally, this is referred to as slicing a string.

A slice of a string is a segment of it, starting at a given index and ending at a given index.

In Python, we can slice a string using bracket notation. But this time, what goes inside the brackets is something different.

The general syntax of slicing is shown below for a given string s:

s[start:end]

start defines the index at which to start the slicing. This is inclusive. On the other hand, end defines the index at which to end the slicing. This is exclusive.

For instance, if we want to slice out the first two characters from the string 'Hello', we'd set start to 0 (as this is the position where the slicing would begin) and end to 2 (as this is the position right before which the slicing would end).

Shown below are a handful of slices made on the string "Hello World!":

s = "Hello World!"

slice_1 = s[0:5]
print(slice_1)

slice_2 = s[2:4]
print(slice_2)
Hello
ll

The first slice works as follows: it begins at index 0, and ends exactly at index 4, since the end, which is 5, is exclusive. It constitutes the string 'Hello'.

The second one begins at index 2 and ends exactly at index 3 constituting the string 'll'.

The end parameter is always one greater than the index up to which you want to slice a string. For example, if you wish to slice up to index 4 (including it), then you'll need to pass in 5 as the end parameter.

Omitting the end index makes it default to the length of the string. In other words, the slice gets made from start to the end of the string.

An example is illustrated below:

"Great day..."[2:]
eat day...

The notation [2:] slices the string "Great day..." from index 2, all the way to its end.

It's also possible to omit the start index. In this case, it defaults to the start of the string, i.e. index 0.

"Great day..."[:3]
Gre

In the snippet above, the notation [:3] slices the string "Great day..." from index 0 to exactly the index 2 (3 is exclusive).

Following from both these cases of omitting the start and the end indexes, there is another case where we can omit both of them from the slice notation.

This simply gives back the entire string.

Consider the code below:

"Great day..."[:]
Great day...
"Nice try"[:]
Nice try

In both cases, we get the same strings returned back, since the slicing is performed from the very beginning to the very end of the strings.
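Slicing also gives us a way to 'change' a character in a string, which the discussion on immutability hinted at. The idea is to build a new string from the slices around the character we want to replace. The following is a minimal sketch under that idea; the variable names and the index 6 are picked just for this example.

```python
s = 'Cat'

# Build a new string: the new first character plus everything from index 1.
changed = 'B' + s[1:]
print(changed)  # Bat
print(s)        # Cat (the original string is untouched)

# Replacing a character in the middle: keep what comes before index i,
# add the new character, then keep what comes after index i.
word = 'Hello World!'
i = 6
patched = word[:i] + 'w' + word[i + 1:]
print(patched)  # Hello world!
```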

Why is the end parameter exclusive?

You might be wondering why exactly Python considers end to be exclusive in the slice notation shown above.

Well, this is a good question. A sensible question. Even I had it when I came across string slicing for the first time, but my first language wasn't Python; it was JavaScript.

Anyways, here's the answer to it...

The end parameter is exclusive so that we could just set it to the length of the underlying string and thus get the string sliced from the index start to the very end of the string.

If end were inclusive, we would've had to subtract 1 from the string's length in order to slice it from start to its very end. For example, to slice a string s from the start to the end, we would've had to write s[0:len(s) - 1].

To boil it down, end is considered exclusive just to make our work a bit simpler. That's it!
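The sketch below illustrates this convenience, along with two other handy consequences of an exclusive end that aren't mentioned above: the length of a slice is simply end minus start, and splitting a string at any index gives two pieces that join back into the original. The indexes used are arbitrary.

```python
s = 'Hello World!'
start = 6

# len(s) can be passed directly as the end, with no subtraction needed.
print(s[start:len(s)])  # World!
print(s[start:])        # World! (same thing, with end omitted)

# The number of characters in a slice is end - start.
piece = s[2:7]
print(len(piece))       # 5

# Splitting at index i leaves no gap and no overlap.
i = 5
print(s[:i] + s[i:] == s)  # True
```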

In many programming languages, whenever slicing a string from a starting index to an ending index, the second index is exclusive. You could even see this in JavaScript String Methods — slice().

Raw strings

Once you learn regular expressions in Python, you'll find yourself constructing many strings that contain literal backslash characters.

As we've seen above, the backslash character (\) is used to escape characters. This means that we can't reliably use a backslash, as is, in a string: if we do so, it might not be treated as a backslash, but rather as the start of an escape sequence.

For instance, consider the code below:

s = '\patt\flags denotes a regex literal in JS.'
print(s)
\patt lags denotes a regex literal in JS.

We meant to create a string with exactly the text shown above and then print it in the shell. However, some weird formatting seems to happen with the output. Where is this coming from?

Well, without actually needing knowledge of all the escape sequences in Python, we notice that the characters \ and f are missing from the output, so it must be the case that \f is some special sequence.

And indeed it is: \f is the form feed character. It causes the text that follows to continue from the same horizontal position, but on a new line.

Anyway, the main point over here is that denoting backslashes as is in a string, as we did above, could cause quite unexpected outcomes. In our case, they led to a form feed character that we never even had in the back of our mind!

To include a backslash (\) literally in a string, we have to escape it using another backslash (\).

So to write three backslashes in a string, we would need to write a total of six backslashes. If we need to write five backslashes we would need to write ten of them. In short, we need to write double the amount of backslashes we actually need in a string.

As the number of backslashes increases, the readability of the string decreases. Interpreting strings with tons of backslashes can be a nightmare. Believe it!
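To make the doubling concrete, here is a small sketch. The path used is a made-up, Windows-style example and not something from this chapter.

```python
# Six backslashes in the source produce three backslashes in the string.
three = '\\\\\\'
print(len(three))  # 3

# A hypothetical Windows-style path: every backslash has to be doubled.
path = 'C:\\new\\folder'
print(path)              # C:\new\folder
print(len(path))         # 13
print(path.count('\\'))  # 2
```

Had the path been written with single backslashes, the \n in it would have been taken as a newline and the \f as a form feed.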

Python recognises this issue and provides a special kind of string to solve it, known as a raw string.

A raw string is denoted by the prefix r before the string.

For instance, to denote 'a' in raw format we'll go as follows: r'a', r"a", r'''a''' or r"""a""".

In a raw string, a backslash isn't treated as an escape character, rather it's treated as a literal character.

Consider the following code comparing a normal string with a raw string:

s = 'New\nline'
print(s)

print()

raw_s = r'New\nline'
print(raw_s)
New
line

New\nline

First we have a normal string. In it, the character \n denotes a new line, and so the string 'New\nline' gets printed with the text 'line' coming on a new line.

In contrast, the raw string r'New\nline' treats \n as a backslash character followed by the 'n' character, NOT as a newline character. What gets printed is New\nline, as it is.

Let's inspect how raw_s really looks under the hood:

raw_s
'New\\nline'

This confirms that raw_s has a literal backslash character (\) in it.

Seeing this, we can reason that Python converts a raw string into a normal string by preceding each backslash with another backslash.

Raw strings are simply syntactic sugar over the usual escaping routine required in normal strings.
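The equivalence between a raw string and its manually escaped counterpart can be checked directly, as in the sketch below. The regular expression pattern at the end is only an illustrative value.

```python
raw_s = r'New\nline'       # raw string: backslash followed by n
escaped_s = 'New\\nline'   # normal string with the backslash escaped

print(raw_s == escaped_s)  # True
print(len(raw_s))          # 9 (the backslash and the n are two characters)
print(len('New\nline'))    # 8 (here \n is a single newline character)

# An illustrative regular expression pattern, written both ways.
pattern = r'\d+\.\d+'
print(pattern == '\\d+\\.\\d+')  # True
```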

String replication

To replicate a string a given number of times, use the * replication operator on the string.

When used on numbers, * performs multiplication between the numbers. However, when used on strings, it performs replication of the string.

Well technically, even in terms of strings, * can be thought of as multiplying the string a given number of times.

Consider the snippet below:

'a' * 4
aaaa
'10_' * 6
10_10_10_10_10_10_

First the string 'a' is replicated 4 times giving 'aaaa', and then the string '10_' is replicated 6 times giving '10_10_10_10_10_10_'.
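One everyday use of replication is drawing a simple underline or divider whose width depends on some other text. The sketch below shows this, along with two small details: replicating zero times gives an empty string, and the number may come before the string. The title text is made up for the example.

```python
title = 'Report'

# An underline exactly as wide as the title.
underline = '=' * len(title)
print(title)      # Report
print(underline)  # ======

# Replicating zero times yields an empty string.
print('ab' * 0 == '')  # True

# The order of the operands doesn't matter.
print(3 * 'ab')        # ababab
```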

String replication is far from a rarely used feature in Python programming. Many algorithms, like the ones used in coding competitions, can benefit from it.

Finding a substring

It's an extremely common thing to search for substrings within a given string.

For example, given a string s, you might be interested in finding out where the word 'as' occurs in it, or whether the word even occurs or not.

There are essentially two ways to search for stuff within a string, each having its own purpose.

They are as follows:

  1. The in operator
  2. The find() string method

Let's see how to use each of these...

The in operator

The in operator is mainly used to check for the existence of a substring within a string. Here's its syntax:

substring in string

If you notice it, this is much like a question: 'is substring in string?'

The return value of the in operation is a Boolean. In the case of strings, we get True if substring occurs in string, or otherwise False.

Using in, let's check whether a given string s has the word 'as' in it. For now, we'll suppose a value for s. But in a real world application, the value for s might be unknown, for example, it might be received from the input, or from a database, and so on and so forth.

s = 'Hello World!'
print('as' in s)
False

Since the string s here doesn't have 'as' in it, the in operator returns False.

Let's consider another example:

s = 'Hashing'
print('as' in s)
True

This time 'as' occurs in s ('Hashing'), and so we get True returned by the in operator.

So this was how to use in for our string-searching needs.
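Two details about in are worth a quick sketch: the check is case-sensitive, and it can be negated with not in. The example also uses the lower() string method, which hasn't been covered in this chapter; it's used here only to illustrate one possible way of doing a case-insensitive check.

```python
s = 'Hashing'

print('as' in s)   # True
print('AS' in s)   # False, since the comparison is case-sensitive

# One possible case-insensitive check: lowercase both sides first.
print('AS'.lower() in s.lower())  # True

# 'not in' gives the opposite answer of 'in'.
print('xyz' not in s)  # True
```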

As stated before, there is another way to do so, serving another purpose — the find() string method.

The find() method

The find() method tells us about the exact position where a substring occurs in a given string.

Often, you'll find yourself using find(), as it's usually necessary to know the position of the substring's occurrence in order to process the string further.

Here's the syntax of find():

string.find(substring)

If substring occurs in string, we get the index of its first occurrence returned. Otherwise, we get -1.

With this in mind, let's work with find().

Did this just rhyme?

Consider the following code:

s = 'Hello World!'
print(s.find('as'))
-1

'as' doesn't occur in s here, so the method returns -1.

Now, over to the second example:

s = 'Hashing'
print(s.find('as'))
1

'as' occurs this time in s ('Hashing'), so find() returns its index, which is 1.

Remember that if the given string has multiple occurrences of the substring, find() would just return the index of the first occurrence.

This is demonstrated below:

s = 'The program has crashed!'
print(s.find('as'))
13

Here, 'as' occurs twice in s ('The program has crashed!'). Nonetheless, find() returns 13, which is the index of the first occurrence of 'as'.
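If the later occurrences are needed as well, find() accepts an optional second argument specifying the index from which to start searching. This argument isn't covered in the discussion above, so treat the following as a small sketch of one way to use it; the variable names are made up for the example.

```python
s = 'The program has crashed!'

first = s.find('as')              # search from the beginning
second = s.find('as', first + 1)  # search again, just past the first match
third = s.find('as', second + 1)  # no more matches, so we get -1

print(first, second, third)  # 13 18 -1

# The returned index can be combined with slicing.
print(s[first:first + 2])    # as
```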

String methods

Up to this point, we've explored a lot about the string data type in Python. One huge avenue, however, is still left and that is string methods.

Most of the time, when we are working with strings, we want to manipulate them in some way or the other, for example split them into a list of substrings, lowercase all their characters and so on.

These concerns are addressed by string methods, which we shall explore in detail in the next chapter.