Deepika Singh

Data Science Beginners

  • Oct 24, 2019
  • 13 Min read
  • 10,487 Views

Introduction

Python is an extremely powerful programming language for data science and artificial intelligence. As in every programming language, there are building blocks or fundamentals in Python that need to be understood to master the language.

In this guide, you will acquire the fundamental knowledge required to succeed in data science with Python. We will go through eight concepts:

  1. Variables
  2. Lists
  3. Dictionary
  4. Arrays
  5. Functions
  6. Packages
  7. Dataframes
  8. Introduction to Machine Learning

Variables

A variable is a specific, case-sensitive name that allows you to refer to a value, as shown in the example below. In the first line, we assign the value 88 to the variable 'marks'. In the second line, we evaluate the variable, which returns the value stored in it.

```python
marks = 88
marks
```

Output:

```
88
```

Variables can hold values of many data types. The most common ones are float (for real numbers), int (for integers), str (for strings or text), and bool (for True or False). The type of a variable can be checked with the type() function.

```python
month = "march"
average_marks = 67.3

print(type(marks))
print(type(month))
print(type(average_marks))
```

Output:

```
<class 'int'>
<class 'str'>
<class 'float'>
```
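The example above does not show the bool type or case sensitivity, so the following is a minimal sketch of both. The names 'Marks' and 'passed' and the pass mark of 40 are illustrative assumptions, not part of the original example.

```python
marks = 88
Marks = 90              # a different variable: names are case-sensitive

# A comparison produces a bool (the threshold 40 is an assumed pass mark)
passed = marks > 40

print(type(passed))     # <class 'bool'>
print(marks == Marks)   # False
print(float(marks))     # 88.0 (an int converted to a float)
```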

Lists

Lists are one of the most versatile data structures in Python. They contain a collection of values. Lists can contain items of the same or different types, and the elements can also be changed. It is easy to create a list by defining a collection of comma-separated values in square brackets. The lines of code below create a list, 'movie', containing the names of the most successful movies and their box office collections in billions of dollars.

```python
movie = ["Avengers: Endgame", 2.796, "Avatar", 2.789, "Titanic", 2.187, "Star Wars: The Force Awakens", 2.068, "Avengers: Infinity War", 2.048]

type(movie)
```

Output:

```
list
```

It is also possible to create a list of lists, as shown below.

```python
movie2 = [["Avengers: Endgame", 2.796], ["Avatar", 2.789], ["Titanic", 2.187], ["Star Wars: The Force Awakens", 2.068], ["Avengers: Infinity War", 2.048]]

movie2
```

Output:

```
[['Avengers: Endgame', 2.796],
 ['Avatar', 2.789],
 ['Titanic', 2.187],
 ['Star Wars: The Force Awakens', 2.068],
 ['Avengers: Infinity War', 2.048]]
```

Subsetting lists is easy in Python, and indexing starts at 0. This is why it is called zero-based indexing. Suppose we want to see the first element of the list 'movie'. We can do this with the syntax movie[0].

```python
movie[0]
```

Output:

```
'Avengers: Endgame'
```

Like subsetting, list slicing is easy in Python. The syntax is in the format list[start:end], in which the 'start' index is inclusive but the 'end' index is exclusive. The example below returns the elements at indices 3 through 5 (the fourth through sixth elements), excluding the element at index 6.

```python
movie[3:6]
```

Output:

```
[2.789, 'Titanic', 2.187]
```
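Slices also accept omitted bounds, negative indices, and a step. The sketch below uses these to split the flat 'movie' list into titles and collections. The variable names 'titles', 'collections', 'first_pair', and 'last_element' are illustrative.

```python
movie = ["Avengers: Endgame", 2.796, "Avatar", 2.789, "Titanic", 2.187,
         "Star Wars: The Force Awakens", 2.068, "Avengers: Infinity War", 2.048]

first_pair = movie[:2]      # omitted start means "from the beginning"
last_element = movie[-1]    # negative indices count from the end
titles = movie[::2]         # step of 2: elements at indices 0, 2, 4, ...
collections = movie[1::2]   # step of 2 starting at index 1

print(first_pair)
print(last_element)
print(titles)
print(collections)
```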

We can also manipulate a list to change, add, or remove elements. The example below adds elements with the + operator, which returns a new list and leaves 'movie' itself unchanged.

```python
# Adding new elements
movie + ["Jurassic World", 1.67]
```

Output:

```
['Avengers: Endgame',
 2.796,
 'Avatar',
 2.789,
 'Titanic',
 2.187,
 'Star Wars: The Force Awakens',
 2.068,
 'Avengers: Infinity War',
 2.048,
 'Jurassic World',
 1.67]
```
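Changing and removing elements can be sketched in a similar way. The example below works on a shortened copy of the list and modifies it in place. Rounding the first collection figure is only an illustration of assignment by index.

```python
movie = ["Avengers: Endgame", 2.796, "Avatar", 2.789]

# Changing an element: assign a new value to its index
movie[1] = round(movie[1], 1)

# Adding elements in place with append()
movie.append("Titanic")
movie.append(2.187)

# Removing elements by position: this deletes "Avatar" and 2.789
del movie[2:4]

print(movie)
```

Unlike the + operator, assignment by index, append(), and del all modify the existing list rather than creating a new one.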

Tuples

A tuple is a number of values separated by commas. The major differences between tuples and lists are that tuples cannot be changed and tuples are written with parentheses (which are optional when creating one), whereas lists use square brackets.

```python
tuple_1 = 20, 30, 40, 50, 60
tuple_1
```

Output:

```
(20, 30, 40, 50, 60)
```
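To illustrate what "cannot be changed" means in practice, the sketch below reads from a tuple and then attempts an assignment, which raises a TypeError. The 'modified' flag is an illustrative helper variable.

```python
tuple_1 = (20, 30, 40, 50, 60)

# Indexing and slicing work the same way as with lists
print(tuple_1[0])
print(tuple_1[1:3])

# Assigning to an element is not allowed
try:
    tuple_1[0] = 99
except TypeError as error:
    modified = False
    print("Cannot modify a tuple:", error)
else:
    modified = True
```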

Dictionary

Lists are convenient, but looking up a value by its position is not always intuitive. A dictionary is an intuitive alternative to lists. The major difference between the two is that lists are indexed by a range of numbers, whereas a dictionary is indexed by unique keys that can be used to create lookup tables. A pair of braces, {}, creates an empty dictionary, and placing comma-separated key: value pairs inside the braces populates it, as shown below:

```python
mov = {"Avengers: Endgame": 2.796, "Avatar": 2.789, "Titanic": 2.187}

mov["Avatar"]
```

Output:

```
2.789
```
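Dictionaries can also be updated after creation. The following sketch adds a key, deletes a key, and looks up a missing key safely with get(). It reuses the 'Jurassic World' figure from the list example, and the variable name 'missing' is illustrative.

```python
mov = {"Avengers: Endgame": 2.796, "Avatar": 2.789, "Titanic": 2.187}

# Adding a new key-value pair (or updating an existing key)
mov["Jurassic World"] = 1.67

# Removing a key
del mov["Titanic"]

# get() returns None instead of raising KeyError when the key is absent
missing = mov.get("Titanic")
print(missing)

# Iterating over keys and values together
for title, collection in mov.items():
    print(title, collection)
```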

Numpy Arrays

Another alternative to Python lists is numpy arrays, which are collections of data points. Lists are powerful, but for data science, we need an alternative that has speed and allows mathematical operations over the elements. Numpy arrays allow for simpler computations, as illustrated in the example below.

```python
import numpy as np

runs = [100, 89, 75, 28]
np_runs = np.array(runs)

over = [10, 9, 6, 2]
np_over = np.array(over)

# Element-wise division: runs scored per over
runs_over = np_runs / np_over
runs_over
```

Output:

```
array([10.        ,  9.88888889, 12.5       , 14.        ])
```

It is possible to do subsetting of the numpy arrays in a similar manner as with lists.

```python
runs_over[0]
```

Output:

```
10.0
```
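To make the difference from lists concrete, the sketch below applies the same operators to plain lists and to numpy arrays. The variable names 'list_sum', 'array_sum', and 'list_division_supported' are illustrative.

```python
import numpy as np

runs = [100, 89, 75, 28]
over = [10, 9, 6, 2]

# With lists, + concatenates rather than adding element by element
list_sum = runs + over

# With arrays, + adds element by element
array_sum = np.array(runs) + np.array(over)

# Dividing one list by another is not defined and raises TypeError
try:
    runs / over
except TypeError:
    list_division_supported = False
else:
    list_division_supported = True

print(list_sum)
print(array_sum)
print(list_division_supported)
```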

It is also possible to create n-dimensional arrays using numpy. In the example below, we create a two-dimensional array containing runs in the first row and overs in the second.

```python
example_2d = np.array([[100, 89, 75, 28, 35],
                       [10, 9, 6, 2, 4]])

example_2d
```

Output:

```
array([[100,  89,  75,  28,  35],
       [ 10,   9,   6,   2,   4]])
```

Subsetting of the n-dimensional numpy arrays also follows the zero-based indexing method. A few examples are shown in the lines of code below.

```python
# Extracting the first row of the array
print(example_2d[0])

# Extracting the second element of the first row
print(example_2d[0][1])

# Extracting the second and third columns from all rows
example_2d[:, 1:3]
```

Output:

```
[100  89  75  28  35]

89

array([[89, 75],
       [ 9,  6]])
```
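Two further details are often useful with two-dimensional arrays: the shape attribute and the combined row, column index. The sketch below also sums each row with sum(axis=1). The variable name 'totals' is illustrative.

```python
import numpy as np

example_2d = np.array([[100, 89, 75, 28, 35],
                       [10, 9, 6, 2, 4]])

# shape is (number of rows, number of columns)
print(example_2d.shape)

# A single [row, column] index is equivalent to example_2d[0][1]
print(example_2d[0, 1])

# Summing along axis=1 gives one total per row
totals = example_2d.sum(axis=1)
print(totals)
```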

Functions

Functions are arguably the most widely used component in predictive modeling. In simple terms, a function is a chunk of reusable code that can be called upon to solve a particular problem. This saves the data scientist a lot of coding work. We have already used one such function: 'type()'. There are many built-in functions in Python, and for any standard task, there is likely to be a function. In the example below, we print the maximum value of a list and the type of the list with the help of two functions.

```python
s1 = [20, 30, 26, 32, 43, 13]

print(max(s1))
print(type(s1))
```

Output:

```
43
<class 'list'>
```

To read the documentation of a particular function, we can use the help() function.

```python
help(max)
```

Output:

```
Help on built-in function max in module builtins:

max(...)
    max(iterable, *[, default=obj, key=func]) -> value
    max(arg1, arg2, *args, *[, key=func]) -> value

    With a single iterable argument, return its biggest item. The
    default keyword-only argument specifies an object to return if
    the provided iterable is empty.
    With two or more arguments, return the largest argument.
```
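Besides the built-in functions, you can define your own reusable functions with the def keyword. The following is a minimal sketch. The function name 'run_rate' and its parameters are hypothetical and chosen to match the runs and overs example from the arrays section.

```python
def run_rate(runs, overs):
    """Return the number of runs scored per over."""
    return runs / overs


# The same code is reused for different inputs
print(run_rate(100, 10))   # 10.0
print(run_rate(75, 6))     # 12.5
```

The docstring on the first line of the function body is what help(run_rate) would display.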

Packages

Functions are powerful, but complex code can get messy, requiring a lot of maintenance. In such cases, we can get help from Python packages.

A package can be considered a directory of Python scripts, where each script is a module. These modules define functions, methods, and types. Python has several powerful packages, and some of the most common ones are:

1. NumPy - stands for Numerical Python. It is used for creating and dealing with n-dimensional arrays and contains basic linear algebra functions, along with many other numerical capabilities.

2. Matplotlib - for visualization.

3. Pandas - for structured data operations and manipulations.

4. scikit-learn - for machine learning. This is one of the most popular packages for building machine learning models.

5. Statsmodels - for statistical modeling.

6. NLTK - for natural language processing.

7. SciPy - stands for Scientific Python, and is built on NumPy.

There are thousands of other packages, but the ones listed above are among the most widely used for data science. It is easy to import these packages using the import statement. For example, the lines of code below import the 'numpy' package and create an array, first using the full package name and then using the conventional alias 'np'.

```python
import numpy
numpy.array([20, 21, 22])

import numpy as np
np.array([20, 21, 22])
```

Output:

```
array([20, 21, 22])
```
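A third form of the import statement brings selected names directly into your code, so they can be used without a prefix. A minimal sketch, with 'scores' and 'average' as illustrative variable names:

```python
# Import only the names that are needed from the package
from numpy import array, mean

scores = array([20, 21, 22])
average = mean(scores)
print(average)   # 21.0
```

The alias form (import numpy as np) is generally easier to read in longer scripts because it stays clear where each function comes from.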

Dataframes

In a previous section, we learned about numpy arrays, which are collections of data points. However, the limitation of an array is that it can handle only one data type. For real-world data science problems, you need datasets that can handle different types of data, such as text, float, integer, etc. The solution is the dataframe, which is the de facto data format for machine learning and predictive modeling. The typical structure of a dataframe contains observations in rows and variables in columns.

A dataframe can be constructed from a dictionary, as in the lines of code below. The first line of code imports the pandas library. The second statement creates a dictionary whose keys 'movie', 'collections', and 'release_yr' each map to a list of values. The third statement converts this dictionary into a pandas dataframe, while the fourth displays the resulting dataframe.

```python
import pandas as pd

dic_movie = {
    "movie": ["Avengers: Endgame", "Avatar", "Titanic", "Star Wars: The Force Awakens", "Avengers: Infinity War"],
    "collections": [2.796, 2.789, 2.187, 2.068, 2.048],
    "release_yr": [2019, 2009, 1997, 2015, 2018]}

movie_df = pd.DataFrame(dic_movie)
movie_df
```

Output:

|   | movie                        | collections | release_yr |
|---|------------------------------|-------------|------------|
| 0 | Avengers: Endgame            | 2.796       | 2019       |
| 1 | Avatar                       | 2.789       | 2009       |
| 2 | Titanic                      | 2.187       | 1997       |
| 3 | Star Wars: The Force Awakens | 2.068       | 2015       |
| 4 | Avengers: Infinity War       | 2.048       | 2018       |
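Once a dataframe exists, columns and rows can be selected much like dictionary keys and array elements. The sketch below rebuilds the same dataframe and shows column selection, row filtering, and sorting. The variable names 'titles', 'recent', and 'by_year' and the cutoff year 2010 are illustrative assumptions.

```python
import pandas as pd

movie_df = pd.DataFrame({
    "movie": ["Avengers: Endgame", "Avatar", "Titanic",
              "Star Wars: The Force Awakens", "Avengers: Infinity War"],
    "collections": [2.796, 2.789, 2.187, 2.068, 2.048],
    "release_yr": [2019, 2009, 1997, 2015, 2018]})

# Selecting a single column returns a pandas Series
titles = movie_df["movie"]

# Filtering rows with a boolean condition
recent = movie_df[movie_df["release_yr"] > 2010]

# Sorting rows by a column (returns a new dataframe)
by_year = movie_df.sort_values("release_yr")

print(titles)
print(recent)
print(by_year.iloc[0]["movie"])   # Titanic
```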

We can also create dataframes from existing comma-separated values files, also called 'csv' files. The 'read_csv' function from the pandas library can be used to read such files, as in the lines of code below, where head() displays the first five rows.

```python
df = pd.read_csv("data_desc.csv")
df.head()
```

Output:

|   | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age | Sex |
|---|----------------|------------|-------------|--------|-------------|-------------|--------------|-----------------|-----|-----|
| 0 | Yes            | 2          | Yes         | 306800 | 43500       | 204         | Satisfactory | Yes             | 76  | M   |
| 1 | Yes            | 3          | Yes         | 702100 | 104000      | 384         | Satisfactory | Yes             | 75  | M   |
| 2 | No             | 0          | Yes         | 558800 | 66500       | 384         | Satisfactory | Yes             | 75  | M   |
| 3 | Yes            | 2          | Yes         | 534500 | 64500       | 384         | Satisfactory | Yes             | 75  | M   |
| 4 | Yes            | 2          | Yes         | 468000 | 135000      | 384         | Satisfactory | Yes             | 75  | M   |
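If the 'data_desc.csv' file is not available, read_csv can be tried out on an in-memory text buffer instead. The sketch below is an assumption-laden stand-in: it uses io.StringIO and only three columns and three rows copied from the output above, not the full file.

```python
import io

import pandas as pd

# A small excerpt standing in for the CSV file (three columns, three rows)
csv_text = """Marital_status,Dependents,Income
Yes,2,306800
Yes,3,702100
No,0,558800
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)     # (3, 3)
print(df.dtypes)    # text columns are read as strings, numbers as integers
print(df.head(2))   # head(n) returns the first n rows
```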

Machine Learning

In the previous sections, we have learned the basic concepts related to Python for data science. However, the most popular and challenging data science task is to build machine learning models. In simple terms, machine learning is the field of teaching machines and computers to learn from existing data and to make predictions on the new, unseen data. Python is one of the most powerful languages for machine learning, and is extensively used for building data science products.

Machine learning is a vast concept and is not in the scope of this guide. To learn about data preparation and building machine learning models using Python, please refer to the following guides:

Conclusion

In this guide, you have acquired the fundamental knowledge to succeed in data science with Python. Specifically, you now have a basic understanding of:

  1. Variables
  2. Lists
  3. Dictionary
  4. Arrays
  5. Functions
  6. Packages
  7. Dataframes
  8. Introduction to Machine Learning

Understanding these concepts will enable you to handle basic data science tasks successfully and provide a foundation for more complex skills.