Python for Data Analysts

By Douglas Starnes

Mar 11, 2020 • 16 Minute Read

Introduction

For those looking to enter the data analytics space, a few languages dominate the field. My preference is Python. It is simple, easy to learn, and more importantly, easy to remember. This guide will discuss the fundamentals of Python for data analysts.

Variables

If you're coming from a language like C# or Java, you already understand the general concept of variables, and some of that carries over to Python. But there are some notable differences. First, declaring a variable in Python does not require a type specifier:

      i = 42

The type of the variable is inferred from the value assigned to it. Thus the preceding code would declare a variable named i of type int (for integer) and assign it the value 42.

Also, variables in Python are dynamically typed. This means the type of a variable can change along with its value.

      i = 42
      i = 'Hello world'

Attempting this in C# or Java, which are statically typed, will reward you with a compiler error. In Python, it simply changes the type of i from an int to a str (string). Also, notice that strings in Python can be either double or single quoted and that the statements are not terminated with a semi-colon.

It's beyond the scope of this guide, but as of Python 3.5, type hinting is supported. But most of the Python code you see today is still dynamically typed without using hints.

Types

Python has a streamlined type system. In addition to the int type, the float and complex types complete the set of numeric types.
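As a quick illustration of the three numeric types, the sketch below inspects one value of each with the built-in type function. The variable names are made up for this example.

```python
# Illustrative values only; these names do not appear elsewhere in the guide.
whole = 42            # int
fraction = 3.14159    # float
imaginary = 2 + 3j    # complex (the j suffix marks the imaginary part)

print(type(whole).__name__)      # int
print(type(fraction).__name__)   # float
print(type(imaginary).__name__)  # complex
```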

The str type was discussed with variables but it is worth noting that there is no character type in Python. Thus a single character in single quotes is a string:

      ch = 'a'

A common task with strings is formatting. This is made easy with the format method.

      first_name = 'John'
      last_name = 'Johnson'
      message = 'Hello {} {}'.format(first_name, last_name)

You can probably guess that message will be assigned Hello John Johnson. Python 3.6 introduced the 'f-string', which makes formatting even easier.

      message = f'Hello {first_name} {last_name}'

This has the same effect as the format method. There is also an older style that works, but I recommend you don't use it in new projects.

      message = 'Hello %s %s' % (first_name, last_name)

The boolean, or bool, type is either True or False. Notice the values of bool are capitalized. The null value, None, which has a type of NoneType, is also capitalized.
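A small sketch of these values in use follows. The variable names are hypothetical, and comparing against None with the is operator is the usual idiom rather than something this guide prescribes.

```python
# Hypothetical names used only for illustration.
is_ready = True
result = None

print(type(is_ready).__name__)  # bool
print(type(result).__name__)    # NoneType
print(result is None)           # True
```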

Collections

Python also supports a number of types which can have multiple values, also referred to as collections. The most commonly used is the list, which is a linear collection of valid Python values:

      my_list = [42, 'hello world', False, 3.14159, None]

A Python list is surrounded by square brackets, and the values are separated by commas. The values in a list are managed with these methods:

append: add a value to the end of the list
pop: remove and return the value at the end of the list
index: return the 0-based position of a value in the list, or raise a ValueError if the value does not exist
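The three methods can be tried together in a short sketch. The list and variable names here are invented for the example.

```python
# A throwaway list for illustration.
letters = ['a', 'b']

letters.append('c')            # letters is now ['a', 'b', 'c']
last = letters.pop()           # last is 'c'; letters is ['a', 'b'] again
position = letters.index('b')  # 1, because positions start at 0

print(letters, last, position)
```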

Values can be added to or removed from a list, and existing values can be changed. Values in the list are accessed by 0-based index:

      pi = my_list[3]
      my_list[4] = 'snafu'

A part of a list is available with a slice. To slice a list, provide a start index and a stop index separated by a colon.

      five_ws = ['who', 'what', 'why', 'when', 'where']
      three_ws = five_ws[1:4]

The value of three_ws will be the list ['what', 'why', 'when']. Notice the stop index is not included in the slice. If the start index is omitted, it is assumed to be 0, and if the stop index is omitted, the slice will extend to the end of the list.
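The omitted-index forms can be sketched as follows, reusing the five_ws list. The other variable names are only for illustration.

```python
five_ws = ['who', 'what', 'why', 'when', 'where']

first_two = five_ws[:2]   # start omitted, so it begins at index 0
last_two = five_ws[3:]    # stop omitted, so it runs to the end
everything = five_ws[:]   # both omitted: a copy of the whole list

print(first_two)   # ['who', 'what']
print(last_two)    # ['when', 'where']
```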

The cousin of the list is a tuple, which looks like a list surrounded in parentheses:

      my_tuple = (42, 'hello world', False, 3.14159, None)

The tuple and its values cannot be changed, so it is fixed in length and immutable. Its values are accessed by 0-based index, just like a list.
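To see the immutability in practice, the sketch below reads a value by index and then attempts an assignment, which Python rejects with a TypeError. The answer variable is just an illustrative name.

```python
my_tuple = (42, 'hello world', False, 3.14159, None)

answer = my_tuple[0]  # reading by index works

try:
    my_tuple[0] = 43  # assigning to an index is not allowed
except TypeError as error:
    print(f'Cannot change a tuple: {error}')
```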

An interesting feature of the tuple is destructuring or unpacking.

      address = ('http', 'pluralsight.com', 80)
      protocol, domain, port = address

This will assign the values in the tuple, in order, to the variables on the left-hand side. If one or more values are not needed, assign them to the conventional throwaway variable, the underscore.

      protocol, domain, _ = address

The built-in len function accepts a list or tuple and returns the number of values:

      items = len(my_tuple) # 5

Single-line comments in Python are preceded by a pound sign (#). Python has no dedicated multi-line comment syntax, but a triple-quoted string that is not assigned to anything is often used for the same purpose:

      """
      A
      multi-line
      comment
      """

The dict is a collection of key/value pairs. Each key is separated from its value by a colon, the pairs are separated by commas, and the whole collection is surrounded by curly braces:

      my_dict = {
          'one': 1,
          'two': 2,
          'three': 3,
          'four': 4
      }

The values of the dict are read and assigned by key:

      number_one = my_dict['one']
      my_dict['ten'] = 10

Operators

Python includes the usual suspects when it comes to operators with a few exceptions. The arithmetic operators are all present with the addition of the double star (**) for exponents:

      eight = 2 ** 3 # 8

Note that as of Python 3.0, dividing two integers with the slash operator (/) always results in a float:

      two_and_a_half = 5 / 2 # 2.5

Use the double slash operator (//) for integer division:

      two = 5 // 2 # 2

The increment (++) and decrement (--) operators are not included with Python.
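The usual substitute is augmented assignment with += and -=. A minimal sketch, with counter as a made-up name:

```python
counter = 0
counter += 1   # counter is now 1
counter += 1   # counter is now 2
counter -= 1   # counter is back to 1

print(counter)  # 1
```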

Some operators in Python are actually keywords, for example not, and, and or:

      negative = not True # False

To negate equality, you may see the following:

      not a == 1

instead of

      a != 1

The in operator checks for membership in a collection:

      my_list = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
      found_f = 'f' in my_list # True

The in keyword also works with the tuple and dict. With the dict it will check for membership in the keys.
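The difference between keys and values is easy to check. In this sketch the collections are shortened versions of the earlier examples and the result names are invented. The values method is used to search the values explicitly.

```python
my_tuple = (42, 'hello world', False)
my_dict = {'one': 1, 'two': 2}

in_tuple = 42 in my_tuple            # True
key_found = 'one' in my_dict         # True: 'one' is a key
value_as_key = 1 in my_dict          # False: 1 is a value, not a key
value_found = 1 in my_dict.values()  # True: search the values explicitly

print(in_tuple, key_found, value_as_key, value_found)
```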

Loops

Python does not have an equivalent of the C-style for loop found in C# and Java. Instead, it repurposes the in keyword to iterate over a collection.

      for x in my_list:
          # do something with x
          print(x)

On each iteration, the next item in my_list will be stored in x and in scope for the body of the loop. The syntax used to separate the body from the rest of the loop warrants more discussion.

PEP-8

A PEP is a Python Enhancement Proposal, which is another way of saying "suggestion box." If someone has an idea for a feature to add to Python, they can submit a PEP. Some of these have been added as part of the language. PEP-8 has not, but is a generally accepted style guide.

Python requires that the body of a for loop (and many other constructs) is indented, and this indentation must be consistent in both size and characters. PEP-8 recommends that indentation be 4 spaces. This is not required, but is generally accepted among Python developers. What is not allowed is mixing indentation of 2 spaces with indentation of 4 spaces within the same block, which would be bad style anyway. Python will raise an error if the indentation is not consistent. Most modern text editors that support Python will automatically indent code when required.

Another convention that PEP-8 recommends is snake case for variable and function names. This simply means that the entire name is lowercase with underscores between the words. Therefore, firstName would become first_name and testHelloWorldReturnsString would be test_hello_world_returns_string. In addition, PEP-8 recommends capitalized camel case for class names (e.g., DailyFruitSalesReport) and upper snake case for constants (e.g., NUMBER_OF_DAYS_IN_A_LEAP_YEAR).

It is worth mentioning that Python does not enforce that constants are immutable. This is why the convention above is so important. If upper snake case is used to indicate a constant value, and everyone plays by the rules, there will be no confusion.
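A short sketch shows why the convention carries the weight here. Reassigning the "constant" runs without any complaint from Python.

```python
NUMBER_OF_DAYS_IN_A_LEAP_YEAR = 366

# Nothing stops this reassignment. Only the upper snake case name
# signals to other developers that it should not happen.
NUMBER_OF_DAYS_IN_A_LEAP_YEAR = 365

print(NUMBER_OF_DAYS_IN_A_LEAP_YEAR)  # 365
```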

Conditionals

Conditionals in Python follow the same layout as loops. There are no parentheses in the conditional statement, and a colon terminates the statement with the body indented. The catch-all else block also ends with a colon and is indented. For a conditional with several possibilities, add one or more elif (short for else if) blocks:

      if a == 42:
          print('You discovered the meaning of life!')
      elif a == 41 or a == 43:
          print('Close but no cigar!')
      else:
          print('Better luck next time!')

There is a ternary version of the conditional which can be used for simple situations.

      message = 'Turn the lights {}'.format('on' if evening else 'off')

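There is no switch statement in Python. One has been proposed several times but never added, although Python 3.10 introduced the match statement for structural pattern matching, which covers similar ground.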

With knowledge of the list type and the conditional, we can tackle a commonly used and powerful structure in Python, the list comprehension. The following pattern is one you have probably seen many times.

      lower_names = ['liam', 'emma', 'noah', 'olivia', 'william', 'ava', 'james', 'isabella', 'oliver', 'sophia']
      upper_names = []
      for name in lower_names:
          upper_names.append(name.capitalize())

This will capitalize each name. Simple tasks like this one don't reveal much of an issue, but more complex needs will get messy. The list comprehension is a cleaner way to do the same thing.

      upper_names = [name.capitalize() for name in lower_names]

The source of the list comprehension (lower_names) can be filtered with a conditional.

      short_names = [name.capitalize() for name in lower_names if len(name) < 6]

The list of names would now include only ['Liam', 'Emma', 'Noah', 'Ava', 'James'], the capitalized names with fewer than six letters.

Another neat list trick is the built-in enumerate function. When a list is passed to enumerate, it returns an iterator of tuples, with the first value of each tuple being the zero-based index of an element in the list and the second being the element itself. Therefore, the first several tuples produced by enumerate(lower_names) would be (0, 'liam'), (1, 'emma'), (2, 'noah'), (3, 'olivia'). Notice that the names at even indices are male and the names at odd indices are female. So if I wanted to get only the male names, the following list comprehension would work.

      boy_names = [
          name.capitalize()
          for (idx, name) in enumerate(lower_names)
          if idx % 2 == 0
      ]

This example uses tuple destructuring and a conditional inside of a list comprehension. It's a good first peek into how the features of Python can work together to quickly perform exploratory data tasks.

Functions

A function is merely a collection of Python statements executed together. The function declaration begins with the def keyword, followed by the name of the function (in snake case if you are following PEP-8), the parameters in parentheses, and terminated with a colon. Function bodies are indented similar to conditionals and loops.

      def say_hello(first_name):
          message = f'Hello {first_name}'
          return message

Again, since Python is a dynamically typed language, it is not necessary to assign explicit types to the parameters and return values. The function is called the same way as in most other languages.

      greeting = say_hello('John')

A common use of the tuple is to return multiple values from a function.

      def parse_address(address):
          protocol_stop = address.find(':')
          port_start = address.rfind(':')
          protocol = address[:protocol_stop]
          port = address[port_start + 1:]
          domain = address[protocol_stop + 3:port_start]
          return (protocol, domain, port)

When used with tuple destructuring, a lot can be accomplished with a little code.

      (protocol, domain, port) = parse_address('http://example.com:80')
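One detail worth checking is the type of each returned value. The sketch below repeats parse_address so it runs on its own. Because slicing a string yields strings, the port comes back as text, and the int conversion and the port_number name are additions for illustration.

```python
def parse_address(address):
    # Same function as above, repeated so this example is self-contained.
    protocol_stop = address.find(':')
    port_start = address.rfind(':')
    protocol = address[:protocol_stop]
    port = address[port_start + 1:]
    domain = address[protocol_stop + 3:port_start]
    return (protocol, domain, port)

protocol, domain, port = parse_address('http://example.com:80')
print(protocol, domain, port)  # http example.com 80

# port is the string '80', so convert it when a number is needed.
port_number = int(port)
```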

Modular Python Applications

Applications of respectable complexity need to be organized and not dumped into a single file. In Python, the idea of a module helps to prevent this. A module is simply a container for a chunk of code that can be reused in an application. The Python Standard Library contains many modules that are useful in everyday tasks. Let's consider the random module.

Generating random data is a common need. One obvious example is a deck of cards. By storing the suits and ranks in lists, we could easily use the random module to "draw" a card from a "shuffled" deck.

      suits = ['hearts', 'clubs', 'spades', 'diamonds']
      ranks = [str(r) for r in range(2, 11)]
      ranks.extend(list('AKQJ'))

The range function generates integers from the start value up to, but not including, the stop value, so 2 through 10 in this example. The str initializer converts each integer to a string. Finally, the list initializer splits the string 'AKQJ' into a list of single-character strings.
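Each of the three steps can be inspected separately. The variable names in this sketch are invented for the purpose.

```python
numbers = list(range(2, 11))         # 2 through 10; 11 is not included
as_text = [str(n) for n in numbers]  # each integer converted to a string
faces = list('AKQJ')                 # one list item per character

print(numbers)  # [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(faces)    # ['A', 'K', 'Q', 'J']
```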

Now import the random module and use the choice function to select a suit and rank.

      import random
      suit = random.choice(suits)
      rank = random.choice(ranks)
      card = f'{rank} of {suit}'

Notice that the call to choice is prefixed with the module name. This is a common idiom, but the choice function could also be explicitly imported.

      from random import choice
      suit = choice(suits)
      rank = choice(ranks)
      card = f'{rank} of {suit}'

Both work the same. Which one should you use? That depends. To avoid naming conflicts, it is best to import the module and refer to the individual members. While it is legal, it is considered poor style to do the following:

      from random import *

This will import everything from the random module. While this might seem convenient, as you start to import more modules, it could raise the possibility of naming conflicts.
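To make the risk concrete, here is a small sketch using two other standard library modules, math and cmath, which both define a function named sqrt. They are chosen only because the clash is easy to see. The second star import silently replaces the first sqrt.

```python
from math import *
root = sqrt(4)          # math.sqrt returns a float

from cmath import *     # cmath also defines sqrt, replacing the name
complex_root = sqrt(4)  # the same call now returns a complex number

print(type(root).__name__, type(complex_root).__name__)  # float complex
```

With import math and import cmath instead, math.sqrt and cmath.sqrt would both stay available under unambiguous names.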

Another common import trick is to use aliases.

      import random as rnd
      suit = rnd.choice(suits)
      rank = rnd.choice(ranks)
      card = f'{rank} of {suit}'

This is common in data analysis with Python. You will often see these two modules aliased.

      import numpy as np
      import pandas as pd

And in a case where you are working with submodules, it can eliminate a lot of typing.

      import matplotlib.pyplot as plt

Creating a module is simple. If you create a code file, you've created a module. Suppose we are building a card game. Dealing cards will be a common task, so it makes sense to reuse the above code. Let's put it in a file named cards.py. Python files typically have a '.py' extension.

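      import random

      # The suits and ranks must be defined in the module that uses them
      suits = ['hearts', 'clubs', 'spades', 'diamonds']
      ranks = [str(r) for r in range(2, 11)]
      ranks.extend(list('AKQJ'))

      def draw_card():
          suit = random.choice(suits)
          rank = random.choice(ranks)
          card = f'{rank} of {suit}'
          return card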

By placing this code in a file named cards.py, we have created a module. It can be used in the main file, perhaps app.py.

      from cards import draw_card

      def draw_hand(num_cards):
          hand = [draw_card() for _ in range(num_cards)]
          return hand

      if __name__ == '__main__':
          poker_hand = draw_hand(5)

The conditional at the bottom of the file acts as the entry point. The special __name__ variable holds the name of the current module. As an aside, this is pronounced "dunder name dunder", or just "dunder name", using "dunder" as a shortcut for "double underscore". The module named __main__ is the first module invoked, so the body of the conditional runs only when the file is started directly and not when it is imported. For this application, that file would be app.py. Use the Python interpreter to start the app.

      $ python app.py

It should be noted that this code will not prevent the drawing of duplicate cards. But I'll leave that as an exercise for the reader.
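For readers who want a starting point, one possible approach (a sketch, not the only solution) is to build the full 52-card deck once and draw from it with random.sample, which selects without replacement. The deck variable and this version of draw_hand are assumptions made for the example rather than part of the cards.py module above.

```python
import random

suits = ['hearts', 'clubs', 'spades', 'diamonds']
ranks = [str(r) for r in range(2, 11)] + list('AKQJ')

# Build every rank/suit combination once: 13 ranks x 4 suits = 52 cards.
deck = [f'{rank} of {suit}' for suit in suits for rank in ranks]

def draw_hand(num_cards):
    # random.sample picks distinct positions, so no card repeats in a hand.
    return random.sample(deck, num_cards)

poker_hand = draw_hand(5)
print(poker_hand)
```

Note that each call to this draw_hand starts from a full deck, so two separate hands could still share a card. Dealing several hands from one deck would need the drawn cards to be removed.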

Summary

This just scratches the surface of the Python language. But you should be able to see how quickly you can get up and running and even do something useful with little code. Visit the Python documentation at https://docs.python.org for details on the Python language and the Python Standard Library. After that, you'll want to dive into the packages for data analysis, such as numpy, pandas, and matplotlib.

Thanks for reading!


Douglas Starnes is a tech author, professional explainer and Microsoft Most Valuable Professional in developer technologies in Memphis, TN. He is published on Pluralsight, Real Python and SkillShare. Douglas is co-director of the Memphis Python User Group, Memphis .NET User Group, Memphis Xamarin User Group and Memphis Power Platform User Group. He is also on the organizing committees of Scenic City Summit in Chattanooga, and TDevConf, a virtual conference in the state of Tennessee. A frequent conference and user group speaker, Douglas has delivered more than 70 featured presentations and workshops at more than 35 events over the past 10 years. He holds a Bachelor of Music degree with an emphasis on Music Composition from the University of Memphis.
