Douglas Starnes

Python for Data Analysts

  • Mar 11, 2020

Introduction

For those looking to enter the data analytics space, a few languages dominate the field. My preference is Python. It is simple, easy to learn, and more importantly, easy to remember. This guide will discuss the fundamentals of Python for data analysts.

Variables

If you're coming from a language like C# or Java, you already understand the general concept of variables, and some of that carries over to Python. But there are some notable differences. First, declaring a variable in Python does not require a type specifier:

i = 42

The type of the variable is inferred from the value assigned to it. Thus the preceding code would declare a variable named i of type int (for integer) and assign it the value 42.

Also, variables in Python are dynamically typed. This means the type of a variable can change along with its value.

i = 42
i = 'Hello world'

Attempting this in C# or Java, which are statically typed, will reward you with a compiler error. In Python, it simply changes the type of i from an int to a str (string). Also, notice that strings in Python can be either double or single quoted and that the statements are not terminated with a semi-colon.
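As a small sketch of dynamic typing in action, the built-in type function reports the current type of a value. The names first_type and second_type are only illustrative.

```python
i = 42
first_type = type(i)      # <class 'int'>

i = 'Hello world'         # rebinding the same name to a string is allowed
second_type = type(i)     # <class 'str'>

print(first_type.__name__, second_type.__name__)
```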

Type hinting has been supported since Python 3.5, although it is beyond the scope of this guide. Most of the Python code you see today is still dynamically typed without using hints.

Types

Python has a streamlined type system. In addition to the int type, the float and complex types complete the set of numeric types.

The str type was discussed with variables but it is worth noting that there is no character type in Python. Thus a single character in single quotes is a string:

ch = 'a'

A common task with strings is formatting. This is made easy with the format method.

first_name = 'John'
last_name = 'Johnson'
message = 'Hello {} {}'.format(first_name, last_name)

You can probably guess that message will be assigned Hello John Johnson. Python 3.6 introduced the 'f-string', which makes formatting even easier.

message = f'Hello {first_name} {last_name}'

This has the same effect as the format method. There is also an older style that works, but I recommend you don't use it in new projects.

message = 'Hello %s %s' % (first_name, last_name)

The boolean, or bool, type is either True or False. Notice that the values of bool are capitalized. The null value, None, which has a type of NoneType, is also capitalized.
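A quick sketch of these values and their types follows. The variable names are hypothetical, and the comparison with is reflects the usual idiom for testing against None.

```python
is_ready = True
nothing = None

bool_type_name = type(is_ready).__name__   # 'bool'
none_type_name = type(nothing).__name__    # 'NoneType'

# The idiomatic check for None uses `is` rather than `==`.
is_missing = nothing is None
```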

Collections

Python also supports a number of types which can have multiple values, also referred to as collections. The most commonly used is the list, which is a linear collection of valid Python values:

my_list = [42, 'hello world', False, 3.14159, None]

A Python list is surrounded by square brackets, and the values are separated by commas. The values in a list are managed with these methods:

  • append: add a value to the end of the list
  • pop: remove and return the value at the end of the list
  • index: return the 0-based position of the first occurrence of a value in the list, or raise a ValueError if the value does not exist
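The three methods can be sketched as follows. The list letters is a made-up example. Note that the built-in list's index method raises ValueError when the value is absent, so the lookup of a possibly missing value is guarded here.

```python
letters = ['a', 'b', 'c']

letters.append('d')             # letters is now ['a', 'b', 'c', 'd']
last = letters.pop()            # removes and returns 'd'
position = letters.index('b')   # 1

# index raises ValueError for a missing value, so guard the lookup.
try:
    missing_position = letters.index('z')
except ValueError:
    missing_position = None
```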

The values of a list can be added or removed and also can be changed. Values in the list are accessed by 0-based index:

pi = my_list[3]
my_list[4] = 'snafu'

A part of a list is available with a slice. To slice a list, provide a start index and a stop index separated by a colon.

five_ws = ['who', 'what', 'why', 'when', 'where']
three_ws = five_ws[1:4]

The value of three_ws will be the list ['what', 'why', 'when']. Notice the stop index is not included in the slice. If the start index is omitted, it is assumed to be 0, and if the stop index is omitted, the slice will extend to the end of the list.
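A short sketch of the omitted-index forms, reusing the five_ws list. The last line, which omits both indexes, is an addition not covered above.

```python
five_ws = ['who', 'what', 'why', 'when', 'where']

first_two = five_ws[:2]     # start omitted: ['who', 'what']
last_two = five_ws[3:]      # stop omitted: ['when', 'where']
copy_of_all = five_ws[:]    # both omitted: a shallow copy of the whole list
```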

The cousin of the list is the tuple, which looks like a list surrounded by parentheses:

my_tuple = (42, 'hello world', False, 3.14159, None)

The tuple and its values cannot be changed, so it is fixed in length and immutable. The values can be accessed by 0-based index.
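To illustrate the immutability, the sketch below reads a value by index and then attempts an assignment, which Python rejects with a TypeError. The names first and failure are only for illustration.

```python
my_tuple = (42, 'hello world', False, 3.14159, None)
first = my_tuple[0]       # reading by index works: 42

try:
    my_tuple[0] = 43      # assignment to a tuple element is rejected
except TypeError as error:
    failure = str(error)  # the interpreter's explanation of the failure
```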

An interesting feature of the tuple is destructuring or unpacking.

address = ('http', 'pluralsight.com', 80)
protocol, domain, port = address

This will assign the values in the tuple, in order, to the variables on the left-hand side. If one or more values are not needed, assign them to the throwaway variable, the underscore.

protocol, domain, _ = address

The built-in len function accepts a list or tuple and returns the number of values:

items = len(my_tuple) # 5

Single-line comments in Python are preceded by a pound sign (#). Python has no dedicated multi-line comment syntax, but a triple-quoted string that is not assigned to anything is commonly used as one:

"""
A
multi-line
comment
"""

The dict is a collection of key/value pairs. The keys and values are separated by a colon, the pairs by commas, and the collection by curly braces:

my_dict = {
    'one': 1,
    'two': 2,
    'three': 3,
    'four': 4
}

The values of the dict are accessed by key:

number_one = my_dict['one']
my_dict['ten'] = 10
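As a brief sketch of key access, assignment either adds a new pair or replaces the value of an existing key. Indexing a missing key raises a KeyError, so the get method, which returns a default instead, is often used when a key may be absent. The key 'eleven' is a made-up example.

```python
my_dict = {'one': 1, 'two': 2}

my_dict['ten'] = 10      # adds a new key/value pair
my_dict['one'] = 1.0     # replaces the value stored under an existing key

# get returns None (or the supplied default) instead of raising KeyError.
eleven = my_dict.get('eleven')
eleven_or_zero = my_dict.get('eleven', 0)
```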

Operators

Python includes the usual suspects when it comes to operators with a few exceptions. The arithmetic operators are all present with the addition of the double star (**) for exponents:

eight = 2 ** 3 # 8

Note that as of Python 3.0, dividing two integers with a single slash always results in a float:

two_and_a_half = 5 / 2 # 2.5

Use the double slash operator (//) for integer division, which rounds the result down:

two = 5 // 2 # 2

The increment (++) and decrement (--) operators are not included with Python.
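In their place, the augmented assignment operators are the usual substitute. A minimal sketch, with counter as an illustrative name:

```python
counter = 0
counter += 1    # increment: counter is now 1
counter += 1    # counter is now 2
counter -= 1    # decrement: counter is back to 1
```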

Some operators in Python are actually keywords, for example not, and, and or:

negative = not True # False

To negate equality, you may see the following:

not a == 1

instead of

a != 1

The in operator checks for membership in a collection:

my_list = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
found_f = 'f' in my_list # True

The in keyword also works with the tuple and dict. With the dict it will check for membership in the keys.
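A short sketch of membership tests on a tuple and a dict. To search the values of a dict rather than its keys, one option is to test against the values method explicitly.

```python
address = ('http', 'pluralsight.com', 80)
has_port_80 = 80 in address            # True

my_dict = {'one': 1, 'two': 2}
has_key_one = 'one' in my_dict         # True: 'one' is a key
has_key_1 = 1 in my_dict               # False: 1 is a value, not a key
has_value_1 = 1 in my_dict.values()    # True: search the values explicitly
not_there = 'three' not in my_dict     # True
```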

Loops

Python does not have an equivalent of the C-style for loop found in C# and Java. Instead, it uses the in keyword to iterate over a collection.

for x in my_list:
    print(x)  # do something with x

On each iteration, the next item in my_list will be stored in x and be in scope for the body of the loop. The syntax used to separate the body from the rest of the loop warrants more discussion.

PEP-8

A PEP is a Python Enhancement Proposal, which is another way of saying "suggestion box." If someone has an idea for a feature to add to Python, they can submit a PEP. Some of these have been added as part of the language. PEP-8 has not, but is a generally accepted style guide.

Python requires that the body of a for loop (and many other constructs) be indented, and this indentation must be consistent in both size and characters. PEP-8 recommends 4 spaces per indentation level. This is not required, but it is generally accepted among Python developers. What is not allowed is mixing 2-space and 4-space indentation within the same block, which would be bad style anyway; Python will raise an error if the indentation is not consistent. Most modern text editors that support Python will automatically indent code when required.

Another convention that PEP-8 recommends is snake case for variable and function names. This simply means that the entire name is lowercase with underscores between the words. Therefore, firstName would become first_name and testHelloWorldReturnsString would be test_hello_world_returns_string. In addition, PEP-8 recommends capitalized camel case for class names (e.g., DailyFruitSalesReport) and upper snake case for constants (e.g., NUMBER_OF_DAYS_IN_A_LEAP_YEAR).

It is worth mentioning that Python does not enforce the value of constants to be immutable. This is why the convention above is so important. If upper snake case is used to indicate a constant value, and everyone plays by the rules, there will be no confusion.
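A tiny sketch makes the point: the interpreter accepts a reassignment of a "constant" without complaint, so only the naming convention warns the reader.

```python
NUMBER_OF_DAYS_IN_A_LEAP_YEAR = 366

# Nothing stops this reassignment; only the upper snake case name
# signals that changing the value is a mistake.
NUMBER_OF_DAYS_IN_A_LEAP_YEAR = 365
```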

Conditionals

Conditionals in Python look similar to loops. There are no parentheses around the condition, and a colon terminates the statement with the body indented. The catch-all else block also ends with a colon, and its body is indented. For a conditional with several possibilities, add one or more elif (short for else if) blocks:

a = 42
if a == 42:
    print('You discovered the meaning of life!')
elif a == 41 or a == 43:
    print('Close but no cigar!')
else:
    print('Better luck next time!')

There is a ternary version of the conditional which can be used for simple situations.

evening = True
message = 'Turn the lights {}'.format('on' if evening else 'off')

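There is no switch statement in Python. One has been proposed several times but never added to the language. Since Python 3.10, the match statement provides structural pattern matching, which covers many of the same uses.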
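A common way to get switch-like behavior is a dict lookup. The sketch below is one such workaround rather than a language feature, and describe_guess is a hypothetical helper name.

```python
def describe_guess(a):
    # Map each known value to a message; get() supplies the catch-all default.
    messages = {
        42: 'You discovered the meaning of life!',
        41: 'Close but no cigar!',
        43: 'Close but no cigar!',
    }
    return messages.get(a, 'Better luck next time!')

result = describe_guess(41)
```

This works well when every case maps a value to a result. For branches that need different logic, an if/elif chain is usually clearer.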

With knowledge of the list type and the conditional, we can tackle a commonly used and powerful structure in Python, the list comprehension. The following is a pattern you have probably seen many times.

lower_names = ['liam', 'emma', 'noah', 'olivia', 'william', 'ava', 'james', 'isabella', 'oliver', 'sophia']
upper_names = []
for name in lower_names:
    upper_names.append(name.capitalize())

This will capitalize each name. Simple tasks like this one don't reveal much of an issue, but more complex needs will get messy. The list comprehension is a cleaner way to do the same thing.

upper_names = [name.capitalize() for name in lower_names]

The source of the list comprehension (lower_names) can be filtered with a conditional.

short_names = [name.capitalize() for name in lower_names if len(name) < 6]

The list of names would now include only ['Liam', 'Emma', 'Noah', 'Ava', 'James'].

Another neat list trick is the built-in enumerate function. Passing a list to enumerate yields a sequence of tuples, where the first value is the zero-based index of each element and the second is the element itself. Therefore, the first several tuples produced by enumerate(lower_names) would be (0, 'liam'), (1, 'emma'), (2, 'noah'), (3, 'olivia'). Notice that the names at even indices are male and the names at odd indices are female. So if I wanted to get only the male names, the following list comprehension would work.

boy_names = [
    name.capitalize()
    for idx, name in enumerate(lower_names)
    if idx % 2 == 0
]

This example uses tuple destructuring and a conditional inside of a list comprehension. It's a good first peek into how the features of Python can work together to quickly perform exploratory data tasks.
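To see what enumerate produces on its own, the sketch below wraps it in list, since enumerate returns an iterator rather than a list. The second part uses the optional start argument, which is not covered above, to number the names from 1.

```python
lower_names = ['liam', 'emma', 'noah', 'olivia']

pairs = list(enumerate(lower_names))
# [(0, 'liam'), (1, 'emma'), (2, 'noah'), (3, 'olivia')]

# The optional start argument changes the first index.
numbered = [
    f'{number}. {name.capitalize()}'
    for number, name in enumerate(lower_names, start=1)
]
# ['1. Liam', '2. Emma', '3. Noah', '4. Olivia']
```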

Functions

A function is merely a collection of Python statements executed together. The function declaration begins with the def keyword, followed by the name of the function (in snake case if you are following PEP-8), the parameters in parentheses, and terminated with a colon. Function bodies are indented similar to conditionals and loops.

def say_hello(first_name):
    message = f'Hello {first_name}'
    return message

Again, since Python is a dynamically typed language, it is not necessary to assign explicit types to the parameters and return values. The function is called the same way as in most other languages.

greeting = say_hello('John')

A common use of the tuple is to return multiple values from a function.

def parse_address(address):
    protocol_stop = address.find(':')
    port_start = address.rfind(':')
    protocol = address[:protocol_stop]
    port = address[port_start + 1:]
    domain = address[protocol_stop + 3:port_start]
    return (protocol, domain, port)

When used with tuple destructuring, a lot can be accomplished with a little code.

(protocol, domain, port) = parse_address('http://example.com:80')
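The sketch below repeats parse_address so it runs on its own and shows two follow-ups. The port comes back as a string because it is a slice of the address, so it is converted with int when a number is needed. Unneeded values are discarded with the underscore. The names port_number and host_only are illustrative, and the function assumes the address always includes an explicit port.

```python
def parse_address(address):
    # Same logic as above, repeated so this snippet is self-contained.
    protocol_stop = address.find(':')
    port_start = address.rfind(':')
    protocol = address[:protocol_stop]
    port = address[port_start + 1:]
    domain = address[protocol_stop + 3:port_start]
    return (protocol, domain, port)

protocol, domain, port = parse_address('http://example.com:80')
port_number = int(port)   # the slice yields the string '80'

# Keep only the domain and throw away the other two values.
_, host_only, _ = parse_address('https://docs.python.org:443')
```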

Modular Python Applications

Applications of respectable complexity need to be organized and not dumped into a single file. In Python, the idea of a module helps to prevent this. A module is simply a container for a chunk of code that can be reused in an application. The Python Standard Library contains many modules that are useful in everyday tasks. Let's consider the random module.

Generating random data is a common need. One obvious example is a deck of cards. By storing the suits and ranks in lists, we could easily use the random module to "draw" a card from a "shuffled" deck.

suits = ['hearts', 'clubs', 'spades', 'diamonds']
ranks = [str(r) for r in range(2, 11)]
ranks.extend(list('AKQJ'))

The range function generates integers from the start value up to, but not including, the stop value, so 2 through 10 in this example. The str initializer converts each integer to a string. Finally, the list initializer splits the string 'AKQJ' into a list of single-character strings.
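Each of the three pieces can be checked in isolation. This is only a sketch, with intermediate names such as numbers and face_cards introduced for illustration.

```python
numbers = list(range(2, 11))    # 2 through 10; the stop value 11 is excluded
as_text = str(10)               # '10'
face_cards = list('AKQJ')       # ['A', 'K', 'Q', 'J']

ranks = [str(r) for r in range(2, 11)]
ranks.extend(face_cards)
rank_count = len(ranks)         # 9 number ranks + 4 letters = 13
```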

Now import the random module and use the choice function to select a suit and rank.

import random
suit = random.choice(suits)
rank = random.choice(ranks)
card = f'{rank} of {suit}'

Notice that the call to choice is prefixed with the module name. This is a common idiom, but the choice function could also be explicitly imported.

from random import choice
suit = choice(suits)
rank = choice(ranks)
card = f'{rank} of {suit}'

Both work the same. Which one should you use? That depends. To avoid naming conflicts, it is best to import the module and refer to the individual members. While it is legal, it is considered poor style to do the following:

from random import *

This will import everything from the random module. While this might seem convenient, as you start to import more modules, it could raise the possibility of naming conflicts.
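As a sketch of such a conflict, the standard math and cmath modules both define a function named sqrt. The example uses explicit imports rather than * so the clash is easy to see, but the same silent replacement happens with star imports.

```python
from math import sqrt
real_root = sqrt(4)        # math.sqrt returns the float 2.0

from cmath import sqrt     # silently replaces the name sqrt
complex_root = sqrt(4)     # cmath.sqrt returns the complex number (2+0j)

# Importing the modules themselves keeps both functions available.
import math
import cmath
both = (math.sqrt(4), cmath.sqrt(4))
```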

Another common import trick is to use aliases.

import random as rnd
suit = rnd.choice(suits)
rank = rnd.choice(ranks)
card = f'{rank} of {suit}'

This is common in data analysis with Python. You will often see these two modules aliased.

import numpy as np
import pandas as pd

And in a case where you are working with submodules, it can eliminate a lot of typing.

import matplotlib.pyplot as plt

Creating a module is simple. If you create a code file, you've created a module. Suppose we are building a card game. Dealing cards will be a common task, so it makes sense to reuse the above code. Let's put it in a file named cards.py. Python files typically have a '.py' extension.

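import random

# The module needs its own copy of the card data.
suits = ['hearts', 'clubs', 'spades', 'diamonds']
ranks = [str(r) for r in range(2, 11)]
ranks.extend(list('AKQJ'))

def draw_card():
    # Pick a random suit and rank each time the function is called.
    suit = random.choice(suits)
    rank = random.choice(ranks)
    card = f'{rank} of {suit}'
    return card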

By placing this code in a file named cards.py, we have created a module. It can be used in the main file, perhaps app.py.

from cards import draw_card

def draw_hand(num_cards):
    hand = [draw_card() for _ in range(num_cards)]
    return hand

if __name__ == '__main__':
    poker_hand = draw_hand(5)

The conditional at the bottom of the file acts as the entry point. Every module has a special __name__ variable that holds its name. As an aside, this is pronounced "dunder name dunder", or just "dunder name", using "dunder" as a shortcut for "double underscore". When a file is run directly by the interpreter, its __name__ is set to '__main__', so the body of the conditional runs only when the file is started as a script and not when it is imported. For this application, that file would be app.py. Use the Python interpreter to start the app.

$ python app.py

It should be noted that this code will not prevent the drawing of duplicate cards. But I'll leave that as an exercise for the reader.
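For readers who want a starting point, one possible approach (a sketch, not the only solution) is to build the full deck once and let random.sample pick distinct cards from it. The deck list and this version of draw_hand are assumptions introduced for illustration.

```python
import random

suits = ['hearts', 'clubs', 'spades', 'diamonds']
ranks = [str(r) for r in range(2, 11)] + list('AKQJ')

# Build every rank/suit combination once, giving a 52-card deck.
deck = [f'{rank} of {suit}' for suit in suits for rank in ranks]

def draw_hand(num_cards):
    # random.sample picks distinct items, so no card repeats within a hand.
    return random.sample(deck, num_cards)

poker_hand = draw_hand(5)
```

This only prevents duplicates within a single hand. Dealing several hands from one deck would also require removing the drawn cards from deck.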

Summary

This just scratches the surface of the Python language. But you should be able to see how quickly you can get up and running and even do something useful with little code. Visit the Python documentation at https://docs.python.org for details on the Python language and the Python Standard Library. After that, you'll want to dive into the packages for data analysis, such as numpy, pandas, and matplotlib.

Thanks for reading!