A Lap Around the PyData Stack

The data analytics and science community has long been dominated by the R programming language. While quite capable, R targets a niche audience. It begins to lose its appeal in the broader field of software development. In other words, no data project is an island. The modern web has made it almost mandatory that data be accessible anywhere, at any time and by anyone. The solution is to implement data solutions using a more general-purpose language.

Python has become a popular alternative to R for implementing data solutions. Python has a very small footprint with a simple, clean syntax that is easy to use and remember. Python has a very active open-source community that is contributing packages for a wide variety of tasks, including data science and analytics. And finally, Python is cross-platform. It works equally well on the three major platforms: macOS, Linux, and even Windows.

This guide is not for Python novices. It already assumes basic Python experience. However, it should be noted that someone with significant experience in another language (i.e., Java, C#) should be able to pick up Python quickly. The concepts are largely the same. Syntax is the only major difference.

The term ‘PyData’ is not official but rather a colloquial title that refers to a collection of commonly used Python packages and tools within the data science community. So while it means different things to different Pythonistas, there are a few generally accepted members of this collection. This guide will specifically focus on those and then mention several others at the end.

The `numpy`

package handles numerical computing tasks using the Python language. Now, don't let the phrase *numerical computing* frighten you. You'll soon see how `numpy`

handles most of the hard work for you. That hard work consists largely of manipulating multidimensional data structures.

Consider the task of matrix multiplication. Multiplying a 2x3 matrix with a 3x4 matrix will, after 24 multiplication operations and 8 addition operations, yield a 2x4 matrix. And this is easily done with a few lines of Python. But recall that Python is an interpreted language and can be quite slow. While it will be tough to observe a performance hit from the above example, higher dimensional data structures will quickly overwhelm pure Python.

This is why `numpy`

is largely implemented in C. The C code compiles to native, platform specific binaries that run close to the metal, eliminating the performance issues with pure Python. But `numpy`

still provides Python bindings to the API implemented in C. This gives the developer the best of both worlds: the performance of C and the ease of use of Python.

The central data structure of `numpy`

is the `ndarray`

. The “nd” stands for “n-dimensional”, so the `ndarray`

is capable of representing multidimensional data structures. Creating an `ndarray`

from a Python list is simple:

```
1import numpy as np
2data = list(range(20))
3arr = np.array(data)
```

python

The `np`

alias for `numpy`

is a common convention. You'll see it used in a lot of PyData applications. Displaying the value of `arr`

will yield:

`1array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])`

This looks a lot like a Python list. And the `ndarray`

can be used as a substitute for a list much of the time. But it has learned a few new tricks. For example, the `reshape`

method:

`1mat = arr.reshape(5, 4)`

python

This will return an `ndarray`

with the dimensions passed to it, in this case 5 rows and 4 columns:

```
1array([[ 0, 1, 2, 3],
2 [ 4, 5, 6, 7],
3 [ 8, 9, 10, 11],
4 [12, 13, 14, 15],
5 [16, 17, 18, 19]])
```

While this looks like a nested list in Python, it's actually an `ndarray`

. Indexing an `ndarray`

has also learned some new tricks. For example, to access the value in row 1 and column 2, you would write this code with Python lists:

`1mat[1][2]`

python

This works with an `ndarray`

as well, but there is a shortcut that reduces some of the syntactic noise.

`1mat[1, 2]`

python

The cool thing about `ndarray`

indexing is that you can select every value in a dimension with a colon, like so:

`1mat[:,2]`

python

This effectively selects column two in `mat`

. The colon says to select every row, and then index two in each row.

Another benefit of `numpy`

is the idea of *broadcasting*. Consider the obligatory example of squaring a list of integers:

```
1my_list = list(range(10))
2
3squares = []
4for value in my_list:
5 sqaure_value = value ** 2
6 squares.append(square_value)
7# or
8squares = [value ** 2 for value in my_list]
```

python

Even though this example can be reduced to a single line of code with the list comprehension, it requires some thought to translate the Python code to the mathematical expression it represents. Using an `ndarray`

to do the same thing is much more readable:

```
1my_array = np.array(list(range(10))
2
3square_array = my_array ** 2
```

python

And `numpy`

scales broadcasting to high dimensions as well:

```
1my_array = np.random.randint(0, 10, size=(3, 2))
2square_array = my_array ** 2
```

python

The `random`

module in `numpy`

mirrors the `random`

module in the Python standard library with some extras to accommodate the `ndarray`

. In the above example, `randint`

will return a 3x2 `ndarray`

with integer values from 0 to 9.

Operations on more than one `ndarray`

are just as easy:

```
1A = np.random.randint(0, 10, size=(3, 2))
2B = np.random.randint(0, 10, size=(3, 2))
3mat_sum = A + B
```

python

And that matrix multiplication issue mentioned at the beginning of this section?

```
1A = np.random.randint(0, 10, size=(2, 3))
2B = np.random.randint(0, 10, size=(3, 4))
3dot_prod = A.dot(B)
```

python

Finally, the `ndarray`

can be filtered using simple Boolean expressions:

```
1arr = np.array(list(range(20)))
2odds = arr[arr % 2 == 1]
```

python

`1array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19])`

This just scratches the surface of `numpy`

. There are many other useful features (check out indexing an `ndarray`

) that this guide cannot cover.

Hopefully you can see how powerful `numpy`

is and how much time it can save. But at the end of the day, the `ndarray`

is nothing more than a big glob of numbers. This is because there is no metadata associated with an `ndarray`

. To work with data in a more intellectually palatable context, we turn to `pandas`

.

The `pandas`

API is what you are more likely to spend time with. While `numpy`

handles the core operations on numeric values, `pandas`

mimics large parts of R. This again gives you the best of both worlds: the power of R with the syntax of Python.

The central data structure in `pandas`

is the `DataFrame`

. It mimics the R data structure of the same name. In `pandas`

, a`DataFrame`

is a two-dimensional data structure with named columns and indexed rows. In other words, you can conceptually think of it as a table. The values of a `DataFrame`

are represented by an `ndarray`

, and a `DataFrame`

can be created from an `ndarray`

.

```
1import numpy as np
2import pandas pd
3
4raw = np.random.randint(0, 10, size=(5, 3))
5df = pd.DataFrame(data=raw)
```

python

The result is

```
1 0 1 2
20 3 4 0
31 1 8 6
42 0 5 4
53 8 5 8
64 0 1 2
```

The top row of headers and the left most column of indexes were generated by `pandas`

. The headers can be modified by setting the `columns`

field:

`1df.columns = ['Foo', 'Bar', 'Baz']`

python

The `DataFrame`

now:

```
1 Foo Bar Baz
20 3 4 0
31 1 8 6
42 0 5 4
53 8 5 8
64 0 1 2
```

The columns can also be accessed by name:

`1df['Foo']`

python

```
10 3
21 1
32 0
43 8
54 0
6Name: Red, dtype: int64
```

While bearing a strong resemblance to a `DataFrame`

, this is actually a `Series`

. A `Series`

is an `ndarray`

with an index. The column names of the `DataFrame`

can be used as fields as well:

`1df.Foo`

python

Like the `ndarray`

a `DataFrame`

can also be filtered.

`1df[df.Foo == 0]`

python

```
1 Foo Bar Baz
22 0 5 4
34 0 1 2
```

Notice that a `DataFrame`

is returned containing the rows in which the value of `Foo`

is zero.

A `DataFrame`

supports dynamically adding a new column:

`1df['New'] = np.random.randint(0, 10, size=5)`

python

```
1 Foo Bar Baz New
20 3 4 0 0
31 1 8 6 9
42 0 5 4 4
53 8 5 8 6
64 0 1 2 2
```

The `DataFrame`

exposes a number of statistical methods such as `sum`

:

`1df.sum()`

python

```
1Foo 12
2Bar 23
3Baz 20
4New 21
5dtype: int64
```

The result is a `Series`

with the values of the sums of the columns, and the index is the column names. To sum along the rows, add the `axis=`

keyword argument with a value of 1 (the default is 0).

`1df['Total'] = df.sum(axis=1)`

python

```
1 Foo Bar Baz New Total
20 3 4 0 0 7
31 1 8 6 9 24
42 0 5 4 4 13
53 8 5 8 6 27
64 0 1 2 2 5
```

You can also change the index:

`1df.index = [5, 10, 15, 20, 25]`

python

```
1 Foo Bar Baz New Total
25 3 4 0 0 7
310 1 8 6 9 24
415 0 5 4 4 13
520 8 5 8 6 27
625 0 1 2 2 5
```

Accessing the rows of the `DataFrame`

by index can be done using `loc`

:

`1df.loc[15]`

python

```
1Foo 0
2Bar 5
3Baz 4
4New 4
5Total 13
6Name: 15, dtype: int64
```

And by zero-based position with `iloc`

:

`1df.iloc[2]`

python

Again, this is just scratching the surface of the capabilities of `pandas`

.

It is obviously easier to use `pandas`

over the `ndarray`

. But most of the time, `DataFrame`

s will be very large. Presenting a `DataFrame`

to a knowledge worker or decision maker who needs information at a glance is not an option. For this, we turn to the visualization, which is a graphical representation of data and information. Think of it as another level of abstraction.

In the PyData stack, the `matplotlib`

package handles visualizations using Python. The `matplotlib`

API is big, in fact it is huge. I can't begin to cover it in this guide. In fact, I have a two-hour course on Pluralsight about `matplotlib`

that only covers an introduction to the package. But here are the basics.

First, the module where much of the API resides is `matplotlib.pyplot`

. This is often shortened to `plt`

with an alias. Inside of the `plt`

module are different functions to create different types of visualizations, such as `plot`

to create a line chart:

```
1import numpy as np
2import matplotlib.pyplot as plt
3
4data = np.random.randint(0, 10, 10)
5plt.plot(data)
```

python

Creating a bar chart is simple as well:

`1plt.bar(range(len(data)), data)`

python

Here's a pie chart:

`1plt.pie(data)`

python

These visualizations all render the same data in different charts. With the exception of the bar chart, the functions take an `ndarray`

or any Python iterable. For the `bar`

function, the first argument is the positions of the bars. To space the bars evenly, just use a list of sequential integers.

And here is a fancy scatter chart.

```
1x = np.random.randn(25)
2y = np.random.randn(25)
3colors = np.random.uniform(size=25)
4sizes = np.random.uniform(size=25) * 1000
5plt.scatter(x, y, c=colors, s=sizes, cmap='Blues')
```

python

The scatter chart plots point in two dimensions. The keyword arguments `c=`

and `s=`

represent the color and size of the points to show the intensity of the values. The `cmap=`

keyword argument represents the available colors to use. There are many color maps predefined in `matplotlib`

.

You could change the values of the axes in the charts:

```
1import string
2plt.plot(data)
3labels = list(string.ascii_uppercase[:len(data)])
4plt.xticks(list(range(len(labels))), labels)
```

python

Similar to the `bar`

function, the `xticks`

function expects the first argument to be the locations of the ticks and the second to be the new labels.

The are a number of keyword arguments in `matplotlib`

to get the appearance of visualizations just the way you want them. You already saw some of these for the scatter chart. The line chart also provides the following:

- Color -
`c=`

- Line Style =
`ls=`

`1plt.plot(data, c='red', ls='--')`

python

The documentation covers the complete set of keyword arguments for each visualization. As you might have guessed, there are too many to cover here.

The next one is a little different. It doesn't expose an API, but instead is a tool. The process of writing code in data science is not like that of most software development. In data science, you set up an experiment, set some initial parameters, run the experiment, collect the results, tweak the parameters, and repeat. This process thrives when you can get rapid feedback, and it also requires different tools.

Jupyter Notebook takes the concept of the Python REPL and goes wild with it. Instead of a platform-specific command line interface, Jupyter Notebook is like a Python session running on a web server. And you use a web browser to access that Python session. This has several advantages. You don't have to be running a specific operating system or even device to access Jupyter Notebook. You don't even need Python installed!

Jupyter Notebook organizes code into cells. A cell may be executed by pressing *Shift-Enter*.

Also, since everything is accessed in a web browser, the results are not limited to text. The web technologies of HTML, CSS, and even JavaScript could be used to implement rich user experiences, such as:

- Syntax highlighting and automatic indentation
- Inline images with
`matplotlib`

- Rich text with markdown
- Inline help

Jupyter Notebook also persists everything to a file in JSON. This way you can close a notebook and reopen it in the state where you last left it. It also makes sharing experiments much easier. With the Python REPL, when you close it, all of your work is lost.

I've prepared a sample notebook illustrating the concepts discussed in the previous three sections running on a service from Google called Google Colab.

Jupyter Notebook can also be installed locally via `pip`

, but Colab is a great way to get started. And it's not a toy either. You can actually get access for free to GPUs and run jobs for up to 12 hours.

If you'd like to know more about Jupyter Notebook, check out my video course on Pluralsight: Getting Started with Jupyter Notebook and Python.

There are a lot of other packages often used with the packages discussed, but maybe not as often. Here are a few:

Scikit-Learn (

`sklearn`

): This is a machine learning API. It implements many common use cases for machine learning, such as classification and regression. Generally, it is a great choice for someone who does not need to get into the details of machine learning. You can read a Pluralsight Guide I authored about Scikit-learn for more information about it.Statsmodels: Where

`sklearn`

implements common tasks in machine learning,`statsmodels`

does the same for statistics. It provides an easy-to-use API that avoids the details of implementation.Sympy: Data science will often require complex mathematical expression. While these can be written using Python, it can be difficult to read them. The

`sympy`

package is a computer algebra system that makes writing mathematical expressions in Python simpler.NLTK: This is a package for implementing tasks in natural language processing. It includes an API for tagging parts of speech, parsing and tokenizing text and over 50 popular corpora (bodies of text). The

`nltk`

package even has a free, online, full-length ebook that covers the package in detail.

Data science and analytics include a diverse range of topics and skills. While R has been a popular choice of data scientists for years, in the larger field of software engineering, developers want a more general-purpose language. By using Python, developers have access to a powerful, yet simple language that avoids the quirks of R. Thanks to the open-source Python community, developers can choose from a wide selection of packages for implementing solutions to data science problems.