A common programming task is working with text. Python has a built-in data type just for that:
str - a string. Strings are used to display data in a form humans can easily understand, collect user input, and store information. Textual information can also be analyzed in highly sophisticated scenarios such as performing sentiment analysis on social media comments around some topic. Whatever the use may be, strings are a very important basic data type in many programming languages, and Python is no exception. This guide will show you some string basics, providing the key to understanding the nature and usage of strings in Python.
In the previous guide we saw some simple strings assigned to variables and displayed using
print(). Strings were designed to hold text. More formally: Strings in Python are a structure that holds a sequence of Unicode code points. The important distinction here is that a character in Unicode is not necessarily a single ASCII byte, so strings in Python can represent any character from any language. This makes strings readily suitable for text processing of any kind.
I just mentioned the word "sequence". Strings can contain zero or more characters and therefore have a concept of length which we'll address a bit later in this guide. But first: how do we tell Python that something is a string?
The Python interpreter knows you are providing a string literal when it sees a sequence of characters surrounded by single or double quotation marks.
>>> a = 'marklar'
>>> b = "marklar marklar"
Both a and b above are understood to contain a string. The quoted text on the right-hand side of the equal sign is understood by the interpreter as a string literal because it is surrounded by quotation marks. You may use the single-quote
' or double-quote
" to surround your text, as long as the opening and closing quotes are the same.
For longer text that may exceed the width of your screen or if you want to have some line-breaks in the text, you can use triple-quotes:
>>> mailing_address = """Chip Marklar
... 123 Main street
... Zork City
... Planet Goof"""
>>>
>>> print(mailing_address)
Chip Marklar
123 Main street
Zork City
Planet Goof
>>>
When we assign a triple-quoted string to the variable
mailing_address in the interpreter, the interpreter knows the string is not "closed" after the first line, so it prints out three dots expecting more input or a matching closing triple-quote. Once the final quote is accepted, the variable
mailing_address is finally assigned. When we print the variable, four lines are shown. This is because the newline characters you entered (by hitting enter between the lines of the address) are also captured in the string.
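To see the captured line breaks for yourself, here is a minimal sketch with a shortened, two-line address of my own. It relies on the newline escape \n, which is covered a little further down in this guide:

```python
# A two-line triple-quoted string (shortened sample, not the full address above).
address = """Chip Marklar
123 Main street"""

# repr() shows the string as a literal, which makes the captured newline visible.
print(repr(address))

# The triple-quoted form equals a one-line literal with an explicit newline.
same = address == 'Chip Marklar\n123 Main street'
print(same)  # True
```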
Assigning a long string literal without line breaks is easy too:
>>> a = 'Supercalifragilistic' 'expialidocious'
>>> print(a)
Supercalifragilisticexpialidocious
>>> b = 'Supercalifragilistic' \
... 'expialidocious'
>>> print(b)
Supercalifragilisticexpialidocious
>>>
Above, the variable
a is assigned a string using two literal parts. Python looks at the two quoted strings and joins them into one. Even though there is a space between the two literals, Python adds no space between the two strings. The resulting string has no space in it - it is one long text.
Python lets you continue a long line explicitly by putting a backslash
\ at the end of a line. The variable
b is assigned from two string literals as well. Python joined those two literals across the two visible lines of the program, treating them just the same as if they were on one line. Notice that Python did not add any space - the result is text without space between the two parts - and it didn't add any line break either. Using this technique, you can assign long text values to a variable across multiple lines without introducing any newline characters into the text.
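A common alternative to the trailing backslash is to wrap the adjacent literals in parentheses, since Python also continues a line inside open brackets. This is a small sketch, not something the example above depends on; the variable name is my own:

```python
# Adjacent string literals inside parentheses are joined just like
# literals on one line; no backslash is needed.
long_word = (
    'Supercalifragilistic'
    'expialidocious'
)

print(long_word)  # Supercalifragilisticexpialidocious
```

As with the backslash form, no space or line break ends up in the resulting string.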
But what if you do want to introduce a newline into a string? Python lets you represent a small set of special characters using an escape sequence. The newline character is represented as \n:
>>> a = 'Chip\nMarklar'
>>> print(a)
Chip
Marklar
>>>
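The backslash character \ is used to start an escape sequence. You need an escape sequence to represent characters which are otherwise hard to type or that can be confusing to interpret. A few common escape sequences are \n (newline), \t (tab), \\ (a literal backslash), \' (a single quote), and \" (a double quote):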
>>> text = 'The c:\\temp folder has temporary files\n\tBut don\'t quote me on that!'
>>> print(text)
The c:\temp folder has temporary files
        But don't quote me on that!
>>>
In the example above, the backslash in the folder path was escaped, since otherwise the backslash and the t that follows it would be interpreted as a tab character
\t. The single-quote in the word
don't was escaped as well, because I don't want Python to think the string ended there instead of at the end of the line. You can avoid the need for escape characters by choosing your quotes wisely or using the raw string prefix:
a = r'c:\temp'
b = "Doesn't break"
The prefix r before the string literal assigned to
a tells Python to treat the backslash as a plain character. By using double-quotes around the string assigned to
b, the single-quote in the word
Doesn't is treated as a plain character as well, and Python does not "close" the string until the matching double-quote at the end of that literal.
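As a quick check that the raw prefix and a manual escape describe the same text, here is a minimal sketch (the variable names are my own):

```python
escaped = 'c:\\temp'   # backslash escaped by hand
raw = r'c:\temp'       # raw prefix: the backslash is kept as typed

same_path = escaped == raw
print(same_path)       # True
print(len(raw))        # 7: the backslash counts as a single character
```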
How can we compare strings? How do we tell if two strings contain the same word or text? Python strings can be compared for equality and order.
>>> 'Chip' == 'Chip'
True
>>> 'Chip' == 'Marklar'
False
>>> 'Chip' != 'Marklar'
True
>>> 'a' < 'b'
True
>>>
As you might expect, two strings with the same content are equal (
'Chip' == 'Chip'), and two strings with different content are never equal. The not-equal operator
!= can be used to express non-equality.
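Comparing strings using
< or > lets you know which string precedes (is "smaller" lexicographically than) the other. Characters are ordered by their Unicode code points, which matches alphabetical order for letters of the same case. Take special care when strings contain numbers though: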
>>> '10' > '9'
False
>>>
The string '10' is "smaller" than the string
'9' because the character 1 is smaller than 9, and these are taken as strings - not the numerical values they may appear to be.
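If you actually want a numeric comparison, one option is to convert the strings with int() first. This is a sketch under the assumption that the strings contain only whole numbers; the versions list is a made-up sample:

```python
as_text = '10' > '9'                # compares the characters '1' and '9'
as_numbers = int('10') > int('9')   # compares the numbers 10 and 9

print(as_text)      # False
print(as_numbers)   # True

# The same distinction shows up when sorting.
versions = ['10', '9', '2']
text_order = sorted(versions)              # character-by-character order
number_order = sorted(versions, key=int)   # numeric order

print(text_order)    # ['10', '2', '9']
print(number_order)  # ['2', '9', '10']
```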
Python string comparison is case sensitive.
>>> 'Chip' == 'chip' False
An easy way to compare strings without regard to case is to convert them to lower or upper case. This can be done using the lower() and upper() methods:
>>> 'Chip'.lower() == 'chip'
True
>>> 'Chip'.upper() == 'CHIP'
True
>>>
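For text beyond plain English letters, Python 3 also offers str.casefold(), a more aggressive variant of lower() intended for caseless matching. The helper below is a hypothetical wrapper of my own, shown only as a sketch:

```python
def same_text(a, b):
    """Hypothetical helper: compare two strings while ignoring case."""
    return a.casefold() == b.casefold()

print(same_text('Chip', 'cHIP'))        # True
print(same_text('Chip', 'Marklar'))     # False

# casefold() also handles cases lower() misses, such as the German sharp s.
print('Straße'.lower() == 'STRASSE'.lower())   # False
print(same_text('Straße', 'STRASSE'))          # True
```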
Strings in Python are immutable - once created they are not modifiable. Variables may be assigned another string but that doesn't modify the original string. Consider this code:
>>> a = 'Chip'
>>> b = a
>>> a is b
True
>>> a = a + ' Marklar'
>>> a is b
False
>>> a
'Chip Marklar'
>>> b
'Chip'
>>>
Above I assign the value 'Chip' to the variable
a and then assign that same string to variable
b. Both a and b actually point to the same internal string containing 'Chip'. Using the
is comparison operator confirms this: it returns
True when both variables point to the same object in memory.
But when we then attempt to modify the string assigned to
a, we actually create a new string. Even though
a seems to start with the same sequence as b, they are totally different from each other. The original string remains and
b points to it.
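String methods follow the same rule: they return a new string and leave the original untouched. A minimal sketch, reusing the upper() method seen earlier (the variable names are my own):

```python
name = 'Chip'
shouted = name.upper()   # returns a new string; name itself is not changed

print(name)              # Chip
print(shouted)           # CHIP
print(name is shouted)   # False: two separate string objects
```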
One obvious difference between
a and b at this point is that they have different lengths. The length of a string can be obtained by using the len() function:
>>> len('Marklar')
7
>>> len('\t')
1
>>> len('')
0
>>>
The length of a string is the number of Unicode characters (code points more strictly) it contains. In the example above we see that the word "Marklar" has seven characters, a tab is one character, and that an empty string has a length of zero.
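The difference between code points and bytes becomes visible once a string contains a non-ASCII character. The sketch below uses a sample word of my own, written with a \u escape so that the accented letter is guaranteed to be a single code point, and compares its length with the size of its UTF-8 encoding:

```python
word = 'na\u00efve'   # the word "naive" spelled with a diaeresis on the i

code_points = len(word)                  # characters in the string
utf8_bytes = len(word.encode('utf-8'))   # bytes needed to store it as UTF-8

print(code_points)   # 5
print(utf8_bytes)    # 6: the accented letter takes two bytes in UTF-8
```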
A string cannot be modified, but if you need a part of a string you can slice up a string and use a portion of it:
>>> initials = 'Chip'[0] + 'Marklar'[0]
>>> print(initials)
CM
>>>
Using square brackets, you can index into - point into - the characters in a string by their offset from the start of the string. Zero is the first character, so in the example above I created
initials by smashing together the first character from two parts of a name.
>>> word = 'HELLO!'
>>> proper = word[0] + word[1:5].lower()
>>> print(proper)
Hello
>>>
To get a single character out of a string, specify a single subscript in the square brackets.
'HELLO!'[0] is 'H', 'HELLO!'[1] is 'E', and so on. The result of
'HELLO!'[0] is actually a one-character string; Python has no separate character data type.
If you want a few characters, you provide the starting and ending offset into the string. In the example above, the variable
proper is made "proper cased" by slicing out the second through the fifth characters, using one as the starting offset and five as the ending offset. The ending offset is exclusive: the slice stops just before offset five, so the '!' at that position is left out. The offsets are zero-based, so
E's subscript is 1. When you want a substring from a certain position to the end of the string, you may omit the second subscript. The following two expressions are equivalent:
>>> print('Chip Marklar'[5:12])
Marklar
>>> print('Chip Marklar'[5:])
Marklar
>>>
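One way to keep the offsets straight is to remember that a slice starts at the first offset and stops just before the second. A short sketch of what follows from that rule (the variable names are my own):

```python
word = 'HELLO!'

head = word[:2]   # offsets 0 and 1
tail = word[2:]   # offset 2 through the end

print(head)                  # HE
print(tail)                  # LLO!
print(head + tail == word)   # True: the two slices meet without overlapping
print(len(word[1:5]))        # 4, which is 5 - 1
```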
You can also specify an offset from the end of a string, using a negative integer:
>>> print('Marklar!'[0:-1])
Marklar
>>>
The example above extracts the sequence of the first through the one-before-last characters from the string. I like to read the negative subscript as "except for the last n characters". As with the second subscript, if you want all characters from the beginning of a string you may omit the first offset and write:
>>> print('Marklar!'[:-1])
Marklar
>>>
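Negative offsets work as a single subscript and as the start of a slice too. A small sketch with a made-up file name:

```python
filename = 'report.txt'     # hypothetical sample value

last_char = filename[-1]    # the last character
extension = filename[-3:]   # the last three characters
stem = filename[:-4]        # everything except the last four characters

print(last_char)   # t
print(extension)   # txt
print(stem)        # report
```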
As I mentioned, the slice of the string returned is a string. But it does not modify the original string, nor is it the original string. If we slice up the first part of a string, we are actually getting a new string, populated from the slice of the original:
>>> a = 'Chip Marklar'
>>> b = a[0:4]
>>>
>>> print(a)
Chip Marklar
>>> print(b)
Chip
>>>
>>> print(a is b)
False
>>>
Even though a and b share the same starting sequence of characters, and even though
b was seemingly created from the very same starting characters of the original
a, they are not the same string and don't point to the same memory location. A slice of a string is a new string altogether.
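Because strings are immutable, subscripts can read characters but not replace them. The sketch below shows what happens if you try, and how to build a changed copy from a slice instead (the names are my own):

```python
name = 'Chip'

try:
    name[0] = 'W'                  # item assignment is not allowed on a string
    outcome = 'modified in place'
except TypeError:
    outcome = 'TypeError raised'

# Build a new string from a literal and a slice of the original.
renamed = 'W' + name[1:]

print(outcome)   # TypeError raised
print(name)      # Chip
print(renamed)   # Whip
```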
While methods such as lower() are specific to strings, the
len() function mentioned before also operates on lists, sets, and other constructs in Python that have a length. Strings just happen to be one of the constructs
len() works on. The subscripting syntax above is also a general one: it works on other sequences, such as lists, as well as on strings.
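To illustrate that generality, here is a brief sketch applying len() and slicing to a string and to a list side by side. The sample values are my own:

```python
letters = 'Chip'
numbers = [10, 20, 30, 40]

print(len(letters))    # 4
print(len(numbers))    # 4

# The same slice syntax returns a string from a string and a list from a list.
print(letters[1:3])    # hi
print(numbers[1:3])    # [20, 30]
```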
Another common task is to split sentences into words. The
split() method is made just for that:
>>> 'Chip Marklar Sipped Sarsaparilla'.split()
['Chip', 'Marklar', 'Sipped', 'Sarsaparilla']
>>> 'Chip Marklar Sipped Sarsaparilla'.split('S')
['Chip Marklar ', 'ipped ', 'arsaparilla']
>>>
Given no arguments,
split() will break apart a string on any run of whitespace characters, such as spaces, tabs, and newlines. The result of
.split() is a list object containing individual strings remaining after breaking the original text.
If you supply a character to
.split() - in our example the capital letter S - the text will be broken along the occurrences of that character in the original string. Note that this is a case sensitive operation, so in the example above the last word contains lower-cased s because split was given the upper-cased S.
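Two related tools are worth knowing, shown here as a sketch: the join() method puts a list of words back together, and an optional second argument to split() limits the number of splits. The variable names are my own:

```python
sentence = 'Chip Marklar Sipped Sarsaparilla'

words = sentence.split()      # split on whitespace
rejoined = ' '.join(words)    # glue the words back together with single spaces
print(rejoined == sentence)   # True

# The second argument caps the number of splits: one split gives two parts.
first, rest = sentence.split(' ', 1)
print(first)   # Chip
print(rest)    # Marklar Sipped Sarsaparilla

# With no arguments, runs of mixed whitespace count as one separator.
messy = 'Chip \t Marklar\n'
print(messy.split())   # ['Chip', 'Marklar']
```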
Occasionally, you might want to know if a string contains another string. Use the
in operator to check whether a string is contained in another.
>>> 'i' in 'team'
False
>>> 'chip' in 'archipelago'
True
>>>
How many times the stated substring exists in the larger string doesn't matter; as long as it appears at least once, the
in operator returns True.
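When you do care how often or where a substring occurs, the string methods count() and find() complement the containment check. A minimal sketch with a sample string of my own:

```python
text = 'Chip Marklar'

found = 'ar' in text             # is it there at all?
occurrences = text.count('ar')   # how many non-overlapping times?
position = text.find('ar')       # offset of the first occurrence
missing = text.find('zz')        # find() returns -1 when there is no match

print(found)         # True
print(occurrences)   # 2
print(position)      # 6
print(missing)       # -1
```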
Strings are an important part of most languages - Python being no exception. This guide explained how to assign, compare, and manipulate strings in common scenarios. String and text manipulation are made very easy in Python by using string-specific methods as well as generalized functions that deal with sequences and collections. Since strings are often used to collect and manipulate values, it is well worth the time to play around and get familiar with strings, string operators and their nuances.