Data Science Beginners

Python is an extremely powerful language for data science and this guide will explain fundamental Python concepts like Variables, Arrays, Dataframes, and more.

Oct 24, 2019 • 13 Minute Read

Introduction

Python is an extremely powerful programming language for data science and artificial intelligence. As in every programming language, there are building blocks or fundamentals in Python that need to be understood to master the language.

In this guide, you will acquire the fundamental knowledge required to succeed in data science with Python. We will go through eight concepts:

Variables
Lists
Dictionary
Arrays
Functions
Packages
Dataframes
Introduction to Machine Learning

Variables

A variable is a specific, case-sensitive name that allows you to refer to a value, as shown in the example below. In the first line, we assign the value 88 to the variable 'marks'. In the second line, we refer to the variable by name, which returns the value it stores.

      marks = 88
      marks

Output:

      88

There are many types of variables. The most common ones are float (for real numbers), int (for integers), str (for strings or text), and bool (for True or False). The type of a variable can be checked with the type() function.

      month = "march"
      average_marks = 67.3

      print(type(marks))
      print(type(month))
      print(type(average_marks))

Output:

      <class 'int'>
      <class 'str'>
      <class 'float'>
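The remaining common type, bool, and the case sensitivity of variable names can be illustrated with a minimal sketch. The names 'passed' and 'Marks' are hypothetical and exist only for this illustration.

```python
# Hypothetical variable names, chosen only for this illustration
passed = True      # bool
marks = 88         # int
Marks = 90.5       # a different variable: names are case-sensitive

print(type(passed))   # <class 'bool'>
print(marks, Marks)   # 88 90.5

# Conversion functions return a new value of the requested type
marks_as_float = float(marks)
marks_as_text = str(marks)
print(type(marks_as_float), type(marks_as_text))
```

Conversions like these leave the original variable untouched, so 'marks' is still an int afterwards.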
    

Lists

Lists are one of the most versatile data structures in Python. They contain a collection of values. Lists can contain items of the same or different types, and the elements can also be changed. It is easy to create a list by defining a collection of comma-separated values in square brackets. The lines of code below create a list, 'movie', containing the names of the most successful movies and their box office collections in billions of dollars.

      movie = ["Avengers: Endgame", 2.796, "Avatar", 2.789, "Titanic", 2.187, "Star Wars: The Force Awakens", 2.068, "Avengers: Infinity War", 2.048]

      type(movie)

Output:

      list

It is also possible to create a list of lists, as shown below.

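      movie2 = [["Avengers: Endgame", 2.796], ["Avatar", 2.789], ["Titanic", 2.187], ["Star Wars: The Force Awakens", 2.068], ["Avengers: Infinity War", 2.048]]

      movie2

Output:

      [['Avengers: Endgame', 2.796], ['Avatar', 2.789], ['Titanic', 2.187], ['Star Wars: The Force Awakens', 2.068], ['Avengers: Infinity War', 2.048]]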

Subsetting lists is easy in Python. Indexing starts at 0, which is why it is called zero-based indexing. Suppose we want to see the first element of the list 'movie'. We can do this with the syntax movie[0].

      movie[0]

Output:

      'Avengers: Endgame'

Like subsetting, list slicing is easy in Python. The syntax is list[start:end], in which the 'start' index is inclusive but the 'end' index is exclusive. The example below returns the elements at indices 3, 4, and 5 (the fourth through sixth elements), excluding the element at index 6.

      movie[3:6]

Output:

      [2.789, 'Titanic', 2.187]
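Slices can also omit a bound, count from the end with negative indices, or take a step. The sketch below rebuilds the 'movie' list so that it runs on its own; the variable names on the left are illustrative.

```python
movie = ["Avengers: Endgame", 2.796, "Avatar", 2.789, "Titanic", 2.187,
         "Star Wars: The Force Awakens", 2.068, "Avengers: Infinity War", 2.048]

first_two = movie[:2]      # omitted start means "from the beginning"
last_two = movie[-2:]      # negative indices count from the end
titles = movie[::2]        # a step of 2 picks every other element
collections = movie[1::2]  # same step, starting at index 1

print(first_two)
print(last_two)
print(titles)
print(collections)
```

Because titles and collections alternate in this list, a step of 2 separates them cleanly.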

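We can also manipulate lists to change, add, or remove elements. The example below adds a new movie and its collection. Note that the + operator returns a new list and leaves 'movie' itself unchanged.

      # Adding the new elements
      movie + ["Jurassic World", 1.67]

Output:

      ['Avengers: Endgame', 2.796, 'Avatar', 2.789, 'Titanic', 2.187, 'Star Wars: The Force Awakens', 2.068, 'Avengers: Infinity War', 2.048, 'Jurassic World', 1.67]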
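Changing and removing elements, as well as adding them in place, can be sketched as follows. A shortened copy of the 'movie' list is used so the example runs on its own, and the replacement value 2.8 is made up for the illustration.

```python
movie = ["Avengers: Endgame", 2.796, "Avatar", 2.789, "Titanic", 2.187]

# Changing an element: assign a new value to its index
movie[1] = 2.8

# Adding elements in place (unlike +, this modifies the list itself)
movie.extend(["Jurassic World", 1.67])

# Removing elements: del removes by index or slice
del movie[4:6]   # removes "Titanic" and 2.187

print(movie)
# ['Avengers: Endgame', 2.8, 'Avatar', 2.789, 'Jurassic World', 1.67]
```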

Tuples

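A tuple is a sequence of values separated by commas. The major differences between tuples and lists are that tuples cannot be changed after they are created and that tuples are displayed in parentheses, whereas lists use square brackets. The parentheses are optional when creating a tuple, as in the example below.

      tuple_1 = 20,30,40,50,60
      tuple_1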

Output:

      (20, 30, 40, 50, 60)
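The claim that tuples cannot be changed can be checked directly. In this minimal sketch, the flag 'modification_failed' is a hypothetical helper used only to record what happened.

```python
tuple_1 = (20, 30, 40, 50, 60)

print(tuple_1[0])     # indexing works as with lists: 20
print(tuple_1[1:3])   # slicing returns a new tuple: (30, 40)

modification_failed = False
try:
    tuple_1[0] = 99   # tuples do not support item assignment
except TypeError as err:
    modification_failed = True
    print("Cannot modify a tuple:", err)

# Unpacking assigns the elements to separate names
low, *middle, high = tuple_1
print(low, middle, high)   # 20 [30, 40, 50] 60
```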

Dictionary

Lists are convenient, but looking up a value by its numeric position is not always intuitive. A dictionary is a more intuitive alternative when values have natural labels. The major difference between the two is that lists are indexed by a range of numbers, whereas a dictionary is indexed by unique keys that can be used to create lookup tables. A dictionary is written as key-value pairs inside a pair of braces, {}, as shown below (an empty pair of braces creates an empty dictionary):

      mov = {"Avengers: Endgame": 2.796, "Avatar": 2.789, "Titanic": 2.187}

      mov["Avatar"]

Output:

      2.789
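Dictionaries can be changed after creation, much like lists. The following sketch adds, updates, and removes entries and then loops over the remaining pairs; the updated value 2.79 is made up for the illustration.

```python
mov = {"Avengers: Endgame": 2.796, "Avatar": 2.789, "Titanic": 2.187}

mov["Jurassic World"] = 1.67   # a new key adds an entry
mov["Avatar"] = 2.79           # an existing key updates its value
del mov["Titanic"]             # del removes an entry

# 'in' tests for a key; get() returns a fallback instead of raising KeyError
print("Titanic" in mov)                  # False
print(mov.get("Titanic", "not found"))   # not found

for title, collection in mov.items():
    print(title, collection)
```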

Numpy Arrays

Another alternative to Python lists is numpy arrays, which are collections of data points. Lists are powerful, but for data science, we need an alternative that has speed and allows mathematical operations over the elements. Numpy arrays allow for simpler computations, as illustrated in the example below.

      import numpy as np

      runs = [100, 89, 75, 28]
      np_runs = np.array(runs)

      over = [10, 9, 6, 2]
      np_over = np.array(over)

      # Element-wise division: each run total is divided by the matching overs
      runs_over = np_runs / np_over
      runs_over

Output:

      array([10.        ,  9.88888889, 12.5       , 14.        ])

It is possible to do subsetting of the numpy arrays in a similar manner as with lists.

      runs_over[0]

Output:

      10.0
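The difference between lists and arrays shows up clearly when the same operator is applied to both. The sketch below contrasts the two and also shows filtering with a boolean condition; the threshold of 80 is arbitrary.

```python
import numpy as np

runs = [100, 89, 75, 28]
np_runs = np.array(runs)

doubled_list = runs * 2       # a list is repeated, giving 8 elements
doubled_array = np_runs * 2   # an array is multiplied element by element

high_scores = np_runs[np_runs > 80]   # keep only elements above 80
mean_runs = np_runs.mean()

print(doubled_list)
print(doubled_array)
print(high_scores)
print(mean_runs)
```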

It is also possible to create n-dimensional arrays using numpy. In the example below, we create a two-dimensional array whose first row contains runs and whose second row contains overs.

      example_2d = np.array([[100, 89, 75, 28, 35],
                             [10, 9, 6, 2, 4]])

      example_2d

Output:

      array([[100,  89,  75,  28,  35],
             [ 10,   9,   6,   2,   4]])
    

Subsetting of n-dimensional numpy arrays also follows zero-based indexing. A few examples are shown in the lines of code below.

      # Extracting the first row of the array
      print(example_2d[0])

      # Extracting the second element of the first row
      print(example_2d[0][1])

      # Extracting the second and third columns of every row
      example_2d[:,1:3]

Output:

      [100  89  75  28  35]
      89
      array([[89, 75],
             [ 9,  6]])
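NumPy also accepts a row index and a column index inside a single pair of brackets, which is the more common way to address a two-dimensional array. A brief sketch, with the array rebuilt so that the block runs on its own:

```python
import numpy as np

example_2d = np.array([[100, 89, 75, 28, 35],
                       [10, 9, 6, 2, 4]])

print(example_2d.shape)    # (rows, columns): (2, 5)

# [row, column] is equivalent to [row][column]
print(example_2d[0, 1])    # 89

second_row = example_2d[1, :]     # every column of the second row
last_column = example_2d[:, -1]   # last column of every row
print(second_row)
print(last_column)
```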

Functions

Functions are among the most widely used building blocks in data science code. In simple terms, a function is a chunk of reusable code that can be called to solve a particular problem, which saves the data scientist a lot of coding work. We have already used one such function: type(). Python has many built-in functions, and for most standard tasks there is likely to be one. In the example below, we print the maximum value of a list and the type of the list with the help of the max() and type() functions.

      s1 = [20, 30, 26, 32, 43, 13]

      print(max(s1))
      print(type(s1))

Output:

      43
      <class 'list'>
    

To read the documentation of a particular function, we can use the built-in help() function.

      help(max)

Output:

      Help on built-in function max in module builtins:
    
    max(...)
        max(iterable, *[, default=obj, key=func]) -> value
        max(arg1, arg2, *args, *[, key=func]) -> value
        
        With a single iterable argument, return its biggest item. The
        default keyword-only argument specifies an object to return if
        the provided iterable is empty.
        With two or more arguments, return the largest argument.
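Besides the built-in functions, you can define your own reusable functions with the def keyword. The function 'average' and its 'ndigits' parameter below are hypothetical names for this sketch and are not part of Python itself.

```python
def average(values, ndigits=None):
    """Return the arithmetic mean of a non-empty sequence of numbers.

    If ndigits is given, the result is rounded to that many decimal places.
    """
    result = sum(values) / len(values)
    if ndigits is not None:
        result = round(result, ndigits)
    return result


s1 = [20, 30, 26, 32, 43, 13]
print(average(s1))             # unrounded mean
print(average(s1, ndigits=2))  # 27.33
```

The docstring on the first lines of the function body is what help(average) would display.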
    

Packages

Functions are powerful, but complex code can get messy, requiring a lot of maintenance. In such cases, we can get help from Python packages.

A package can be considered a directory of Python scripts, where each script is a module. These modules define functions, methods, and types. Python has several powerful packages, and some of the most common ones are:

1. NumPy - stands for Numerical Python. It is used for creating and dealing with n-dimensional arrays and contains basic linear algebra functions, along with many other numerical capabilities.

2. Matplotlib - for visualization.

3. Pandas - for structured data operations and manipulations.

4. scikit-learn - for machine learning. This is one of the most popular packages for building machine learning models.

5. Statsmodels - for statistical modeling.

6. NLTK - for natural language processing.

7. SciPy - stands for Scientific Python. It is built on NumPy and adds routines for scientific computing, such as optimization, integration, and statistics.

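There are thousands of other packages, but the ones listed above are the most widely used for data science. It is easy to import these packages using the import statement. For example, the lines of code below import the 'numpy' package in two equivalent ways, first under its full name and then under the conventional alias 'np', and create an array each time.

      # Import under the full package name
      import numpy
      numpy.array([20, 21, 22])

      # Import under the shorter alias 'np'
      import numpy as np
      np.array([20, 21, 22])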

Output:

      array([20, 21, 22])
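A third form of the import statement brings a single name out of a module. The sketch below uses two modules from Python's standard library, math and statistics, so that it runs without installing anything.

```python
import math                   # import a whole module
from statistics import mean   # import one function from a module

root = math.sqrt(16)          # called through the module name
avg = mean([20, 21, 22])      # called directly, no module prefix needed

print(root)   # 4.0
print(avg)    # 21
```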

Dataframes

In a previous section, we learned about numpy arrays, which are collections of data points. However, the limitation of an array is that it can handle only one data type. Real-world data science problems need datasets that hold different types of data, such as text, floats, and integers. The solution is the dataframe, which is the de facto data format for machine learning and predictive modeling. The typical structure of a dataframe contains observations in rows and variables in columns.

A dataframe can be constructed from a dictionary, as in the lines of code below. The first line imports the pandas library. The second statement creates a dictionary whose keys 'movie', 'collections', and 'release_yr' become the column names and whose lists hold the column values. The third converts this dictionary into a pandas dataframe, and the fourth displays the resulting dataframe.

      import pandas as pd

      dic_movie = {
          "movie": ["Avengers: Endgame", "Avatar", "Titanic", "Star Wars: The Force Awakens", "Avengers: Infinity War"],
          "collections": [2.796, 2.789, 2.187, 2.068, 2.048],
          "release_yr": [2019, 2009, 1997, 2015, 2018]}

      movie_df = pd.DataFrame(dic_movie)
      movie_df
    

Output:

|   | movie                        | collections | release_yr |
|---|------------------------------|-------------|------------|
| 0 | Avengers: Endgame            | 2.796       | 2019       |
| 1 | Avatar                       | 2.789       | 2009       |
| 2 | Titanic                      | 2.187       | 1997       |
| 3 | Star Wars: The Force Awakens | 2.068       | 2015       |
| 4 | Avengers: Infinity War       | 2.048       | 2018       |
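Once a dataframe exists, columns are selected by name and rows can be filtered with a condition. The following sketch rebuilds the dataframe so that it runs on its own; the cutoff year 2015 and the names 'titles', 'recent', and 'oldest_title' are illustrative.

```python
import pandas as pd

movie_df = pd.DataFrame({
    "movie": ["Avengers: Endgame", "Avatar", "Titanic",
              "Star Wars: The Force Awakens", "Avengers: Infinity War"],
    "collections": [2.796, 2.789, 2.187, 2.068, 2.048],
    "release_yr": [2019, 2009, 1997, 2015, 2018],
})

# Selecting one column returns a pandas Series
titles = movie_df["movie"]

# Filtering rows with a boolean condition returns a smaller dataframe
recent = movie_df[movie_df["release_yr"] >= 2015]

# Sorting by a column, then taking the first row
oldest_title = movie_df.sort_values("release_yr").iloc[0]["movie"]

print(movie_df.dtypes)   # each column keeps its own data type
print(recent)
print(oldest_title)      # Titanic
```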
    

We can also create dataframes from existing comma-separated values files, also called CSV files. The read_csv function from the pandas library reads such a file, and head() displays its first five rows, as in the lines of code below.

      df = pd.read_csv("data_desc.csv")
      df.head()
    

Output:

|   | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age | Sex |
|---|----------------|------------|-------------|--------|-------------|-------------|--------------|-----------------|-----|-----|
| 0 | Yes            | 2          | Yes         | 306800 | 43500       | 204         | Satisfactory | Yes             | 76  | M   |
| 1 | Yes            | 3          | Yes         | 702100 | 104000      | 384         | Satisfactory | Yes             | 75  | M   |
| 2 | No             | 0          | Yes         | 558800 | 66500       | 384         | Satisfactory | Yes             | 75  | M   |
| 3 | Yes            | 2          | Yes         | 534500 | 64500       | 384         | Satisfactory | Yes             | 75  | M   |
| 4 | Yes            | 2          | Yes         | 468000 | 135000      | 384         | Satisfactory | Yes             | 75  | M   |
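If the file 'data_desc.csv' is not at hand, read_csv can be tried on CSV text held in memory. The sketch below is a stand-in, not the actual file: it copies three columns from the first three rows shown above into a string and reads it through io.StringIO.

```python
import io

import pandas as pd

# Stand-in for a CSV file: three columns from the first three rows above
csv_text = """Income,Loan_amount,approval_status
306800,43500,Yes
702100,104000,Yes
558800,66500,Yes
"""

df_small = pd.read_csv(io.StringIO(csv_text))

print(df_small.shape)             # (3, 3)
print(df_small.dtypes)            # numeric and text columns are inferred
print(df_small["Income"].max())   # 702100
```

The first line of the text is treated as the header row, which is the default behavior of read_csv.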
    

Machine Learning

In the previous sections, we have learned the basic concepts related to Python for data science. However, the most popular and challenging data science task is to build machine learning models. In simple terms, machine learning is the field of teaching machines and computers to learn from existing data and to make predictions on the new, unseen data. Python is one of the most powerful languages for machine learning, and is extensively used for building data science products.

Machine learning is a vast topic and is beyond the scope of this guide. Data preparation and building machine learning models using Python are covered in separate guides.
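To give a flavor of learning from existing data and predicting on unseen data, here is a minimal sketch that fits a straight line with NumPy's polyfit. The data is made up (marks are generated as exactly 10 times the hours studied plus 40). A real project would more likely use a dedicated library such as scikit-learn.

```python
import numpy as np

# Made-up training data: marks follow 10 * hours + 40 exactly
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
marks = 10 * hours + 40

# "Learning": find the slope and intercept of the best-fitting line
slope, intercept = np.polyfit(hours, marks, deg=1)

# "Prediction": apply the fitted line to a value not in the training data
predicted = slope * 6.0 + intercept

print(slope, intercept)
print(predicted)
```

Because the toy data lies exactly on a line, the fit recovers a slope close to 10 and an intercept close to 40, up to floating-point error.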

Conclusion

In this guide, you have acquired the fundamental knowledge to succeed in data science with Python. Specifically, you now have a basic understanding of:

Variables
Lists
Dictionary
Arrays
Functions
Packages
Dataframes
Introduction to Machine Learning

Understanding these concepts will enable you to handle basic data science tasks successfully and provide a foundation for more complex skills.