Python is an extremely powerful programming language for data science and artificial intelligence. As in every programming language, there are building blocks or fundamentals in Python that need to be understood to master the language.
In this guide, you will acquire the fundamental knowledge required to succeed in data science with Python. We will go through eight concepts:
A variable is a specific, case-sensitive name that allows you to refer to a value, as shown in the example below. In the first line, we assign the value 88 to the variable 'marks'. In the second line, we call the variable, which returns the value it stores.
marks = 88
marks
Output:
88
There are many types of variables. The most common ones are float (for real numbers), int (for integers), str (for strings or text), and bool (for True or False). The type of a variable can be checked with the type() function.
month = "march"
average_marks = 67.3

print(type(marks))
print(type(month))
print(type(average_marks))
Output:
<class 'int'>
<class 'str'>
<class 'float'>
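The bool type is mentioned above but not demonstrated. The following is a minimal sketch of a Boolean variable and of converting a value between types; the names `passed`, `marks_as_float`, and `marks_as_text`, and the pass mark of 40, are illustrative assumptions rather than part of the original example.

```python
marks = 88

# A comparison produces a bool (True or False)
passed = marks >= 40             # 40 is an arbitrary, illustrative pass mark
print(type(passed))              # <class 'bool'>

# Built-in constructors convert a value from one type to another
marks_as_float = float(marks)    # 88.0
marks_as_text = str(marks)       # '88'
print(type(marks_as_float))      # <class 'float'>
print(type(marks_as_text))       # <class 'str'>
```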
Lists are one of the most versatile data structures in Python. They contain a collection of values. Lists can contain items of the same or different types, and the elements can also be changed. It is easy to create a list by simply defining a collection of comma-separated values in square brackets. The lines of code below create a list, 'movie', containing the names of the most successful movies and their box office collections in billions of dollars.
movie = ["Avengers: Endgame", 2.796, "Avatar", 2.789, "Titanic", 2.187, "Star Wars: The Force Awakens", 2.068, "Avengers: Infinity War", 2.048]

type(movie)
Output:
list
It is also possible to create a list of lists, as shown below.
movie2 = [["Avengers: Endgame", 2.796], ["Avatar", 2.789], ["Titanic", 2.187], ["Star Wars: The Force Awakens", 2.068], ["Avengers: Infinity War", 2.048]]

movie2
Output:
[['Avengers: Endgame', 2.796],
 ['Avatar', 2.789],
 ['Titanic', 2.187],
 ['Star Wars: The Force Awakens', 2.068],
 ['Avengers: Infinity War', 2.048]]
Subsetting lists is easy in Python, and it starts with the index of 0. This is why it is called zero-based indexing. Suppose we want to see the first element of the list 'movie'. We can do this easily with the syntax movie[0].
movie[0]
Output:
'Avengers: Endgame'
Like subsetting, list slicing is easy in Python. The syntax is in the format list[start:end], in which the 'start' index is inclusive but the 'end' index is exclusive. The example below returns the elements at indices 3 through 5 (the fourth through sixth elements), excluding the element at index 6.
movie[3:6]
Output:
[2.789, 'Titanic', 2.187]
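Slices can also omit either bound, and negative indices count from the end of the list. The sketch below uses a shortened copy of the 'movie' list so that it runs on its own; the names `first_two`, `from_fifth`, and `last_item` are illustrative.

```python
movie = ["Avengers: Endgame", 2.796, "Avatar", 2.789, "Titanic", 2.187]

first_two = movie[:2]     # start omitted: the slice begins at index 0
from_fifth = movie[4:]    # end omitted: the slice runs to the end of the list
last_item = movie[-1]     # negative indices count back from the end

print(first_two)          # ['Avengers: Endgame', 2.796]
print(from_fifth)         # ['Titanic', 2.187]
print(last_item)          # 2.187
```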
We can also manipulate lists to change, add, or remove elements. The example below adds new elements with the + operator, which returns a new list and leaves 'movie' itself unchanged.
# Adding the new element
movie + ["Jurassic World", 1.67]

Output:
['Avengers: Endgame',
 2.796,
 'Avatar',
 2.789,
 'Titanic',
 2.187,
 'Star Wars: The Force Awakens',
 2.068,
 'Avengers: Infinity War',
 2.048,
 'Jurassic World',
 1.67]
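Changing and removing elements follow the same pattern. The following is a minimal sketch on a shortened copy of the list; the replacement value 2.8 and the name `longer` are illustrative assumptions.

```python
movie = ["Avengers: Endgame", 2.796, "Avatar", 2.789]

# + builds a new list; 'movie' is unchanged unless the result is assigned back
longer = movie + ["Jurassic World", 1.67]

# Change an element by assigning to its index
movie[1] = 2.8

# Add in place with append, remove by position with del
movie.append("Titanic")
del movie[0]              # removes "Avengers: Endgame"

print(movie)              # [2.8, 'Avatar', 2.789, 'Titanic']
print(len(longer))        # 6
```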
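A tuple is a sequence of values separated by commas. The major differences between tuples and lists are that tuples cannot be changed after they are created and that tuples are written with parentheses, whereas lists use square brackets. The parentheses are optional when creating a tuple, as the example below shows.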
tuple_1 = 20, 30, 40, 50, 60
tuple_1
Output:
(20, 30, 40, 50, 60)
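To illustrate that tuples cannot be changed, the sketch below tries to assign to an element and catches the resulting error. The flag `modified` is an illustrative name used only to record what happened.

```python
tuple_1 = (20, 30, 40, 50, 60)

# Reading by index works exactly as it does for lists
print(tuple_1[0])         # 20

# Assigning to an index raises TypeError because tuples are immutable
modified = False
try:
    tuple_1[0] = 99
    modified = True
except TypeError as err:
    print("Cannot modify a tuple:", err)
```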
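Lists are convenient, but looking up a value by its position is not always intuitive. A dictionary is a more intuitive alternative when values need to be looked up by name. The major difference between the two is that lists are indexed by a range of numbers, whereas a dictionary is indexed by unique keys that can be used to create lookup tables. A pair of braces, {}, creates an empty dictionary; placing comma-separated key: value pairs inside the braces creates a populated one, as shown below: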
mov = {"Avengers: Endgame": 2.796, "Avatar": 2.789, "Titanic": 2.187}

mov["Avatar"]
Output:
2.789
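A dictionary can also be built up from an empty one and modified afterwards. The following is a small sketch under that assumption; the key "Frozen" is a hypothetical example of a key that is not present.

```python
mov = {}                      # a pair of braces creates an empty dictionary

# Assigning to a new key adds an entry
mov["Avatar"] = 2.789
mov["Titanic"] = 2.187

print("Avatar" in mov)        # True: membership tests check the keys
print(mov.get("Frozen"))      # None: get() avoids a KeyError for a missing key

# del removes an entry by key
del mov["Titanic"]
print(mov)                    # {'Avatar': 2.789}
```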
Another alternative to Python lists is numpy arrays, which are collections of data points. Lists are powerful, but for data science, we need an alternative that has speed and allows mathematical operations over the elements. Numpy arrays allow for simpler computations, as illustrated in the example below.
import numpy as np

runs = [100, 89, 75, 28]
np_runs = np.array(runs)

over = [10, 9, 6, 2]
np_over = np.array(over)

# Element-wise division: runs scored per over
runs_over = np_runs / np_over
runs_over
Output:
array([10.        ,  9.88888889, 12.5       , 14.        ])
It is possible to do subsetting of the numpy arrays in a similar manner as with lists.
runs_over[0]
Output:
10.0
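The difference between lists and arrays shows up as soon as an arithmetic operator is applied. The sketch below contrasts the two; the names `doubled_list` and `doubled_array` are illustrative.

```python
import numpy as np

runs = [100, 89, 75, 28]
np_runs = np.array(runs)

# On a list, * repeats the list rather than multiplying its elements
doubled_list = runs * 2
print(len(doubled_list))          # 8

# On a numpy array, * is applied to every element
doubled_array = np_runs * 2
print(doubled_array.tolist())     # [200, 178, 150, 56]
```

Dividing one plain list by another, as in `runs / over`, raises a TypeError, which is why the lists are converted to arrays first.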
It is also possible to create n-dimensional arrays using numpy. In the example below, we create a two-dimensional array containing runs and overs.
example_2d = np.array([[100, 89, 75, 28, 35],
                       [10, 9, 6, 2, 4]])

example_2d
Output:
array([[100,  89,  75,  28,  35],
       [ 10,   9,   6,   2,   4]])
Subsetting of the n-dimensional numpy arrays also follows the zero-based indexing method. A few examples are shown in the lines of code below.
# Extracting the first row of the array
print(example_2d[0])

# Extracting the second element of the first row
print(example_2d[0][1])

# Extracting the second and third columns of all rows
example_2d[:, 1:3]
Output:
[100  89  75  28  35]

89

array([[89, 75],
       [ 9,  6]])
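numpy also accepts a row index and a column index together inside a single pair of brackets, which is equivalent to the chained form used above. A minimal sketch, with `second_row` and `element` as illustrative names:

```python
import numpy as np

example_2d = np.array([[100, 89, 75, 28, 35],
                       [10, 9, 6, 2, 4]])

# shape reports (number of rows, number of columns)
print(example_2d.shape)           # (2, 5)

# [row, column] in one step gives the same result as example_2d[0][1]
element = example_2d[0, 1]
print(element)                    # 89

# A colon selects everything along that axis: here, all columns of the second row
second_row = example_2d[1, :]
print(second_row.tolist())        # [10, 9, 6, 2, 4]
```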
Functions are arguably the most widely used component in predictive modeling. In simple terms, a function is a chunk of reusable code that can be called upon to solve a particular problem. This reduces a lot of coding work at the data scientist's end. We have already used one such function: 'type()'. There are many built-in functions in Python, and for any standard task, there is likely to be a function. In the example below, we print the maximum value and the type of a list with the help of two functions.
s1 = [20, 30, 26, 32, 43, 13]

print(max(s1))
print(type(s1))
Output:
43
<class 'list'>
To read the documentation of a particular function, we can use the help() function.
help(max)
Output:
Help on built-in function max in module builtins:

max(...)
    max(iterable, *[, default=obj, key=func]) -> value
    max(arg1, arg2, *args, *[, key=func]) -> value

    With a single iterable argument, return its biggest item. The
    default keyword-only argument specifies an object to return if
    the provided iterable is empty.
    With two or more arguments, return the largest argument.
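Besides the built-in functions, you can define your own with the def keyword. The following is a minimal sketch; the function name `average` and its docstring are illustrative and not taken from any library.

```python
def average(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


s1 = [20, 30, 26, 32, 43, 13]

# The function can now be reused on any list of numbers
mean_s1 = average(s1)
print(mean_s1)
print(average([1, 2, 3]))     # 2.0
```

Because `average` has a docstring, calling `help(average)` displays it in the same way as for built-in functions.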
Functions are powerful, but complex code can get messy, requiring a lot of maintenance. In such cases, we can get help from Python packages.
A package can be considered a directory of Python scripts, where each script is a module. These modules define functions and methods. Python has several powerful packages, and some of the most common ones are:
1. NumPy - stands for Numerical Python. It is used for creating and dealing with n-dimensional arrays and contains basic linear algebra functions, along with many other numerical capabilities.
2. Matplotlib - for visualization.
3. Pandas - for structured data operations and manipulations.
4. scikit-learn - for machine learning. This is the most popular package for building machine learning models.
5. Statsmodels - for statistical modeling.
6. NLTK - for natural language processing.
7. SciPy - stands for Scientific Python, and is built on NumPy.
There are thousands of other packages, but the ones listed above are the most widely used for data science. It is easy to import these packages using the import command. For example, the lines of code below import the 'numpy' package and create an array, first under the package's full name and then under the conventional alias 'np'.
import numpy
numpy.array([20, 21, 22])

import numpy as np
np.array([20, 21, 22])
Output:
array([20, 21, 22])
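A third form imports a single name from a package so that it can be used without any prefix. A small sketch of that form, with `arr` as an illustrative variable name:

```python
# Import only the array function from numpy
from numpy import array

arr = array([20, 21, 22])

# The name is used directly, without the numpy. or np. prefix
print(arr.sum())      # 63
```

This keeps lines short, but the prefix form (`np.array`) makes it clearer where a function comes from, which is one reason the alias style is so common in data science code.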
In a previous section, we learned about numpy arrays, which are collections of data points. However, the limitation of an array is that it can handle only one data type. For real-world data science problems, you need datasets that can handle different types of data, such as text, float, and integer. The solution is the dataframe, which is the de facto data format for machine learning and predictive modeling. The typical structure of a dataframe contains observations in rows and variables in columns.
A dataframe can be constructed from a dictionary, as in the lines of code below. The first line of code below imports the pandas library. The second line creates the dictionary that contains the values stored under the keys 'movie', 'collections', and 'release_yr', respectively. The third line converts this dictionary into a pandas dataframe, while the fourth line prints the resulting dataframe.
import pandas as pd

dic_movie = {
    "movie": ["Avengers: Endgame", "Avatar", "Titanic", "Star Wars: The Force Awakens", "Avengers: Infinity War"],
    "collections": [2.796, 2.789, 2.187, 2.068, 2.048],
    "release_yr": [2019, 2009, 1997, 2015, 2018]}

movie_df = pd.DataFrame(dic_movie)
movie_df
Output:
| | movie | collections | release_yr |
|--- |------------------------------ |------------- |------------ |
| 0 | Avengers: Endgame | 2.796 | 2019 |
| 1 | Avatar | 2.789 | 2009 |
| 2 | Titanic | 2.187 | 1997 |
| 3 | Star Wars: The Force Awakens | 2.068 | 2015 |
| 4 | Avengers: Infinity War | 2.048 | 2018 |
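Once a dataframe exists, columns can be selected by name and rows can be filtered with a condition. The sketch below rebuilds the same dataframe so that it runs on its own; the cut-off year 2015 and the names `titles` and `recent` are illustrative assumptions.

```python
import pandas as pd

movie_df = pd.DataFrame({
    "movie": ["Avengers: Endgame", "Avatar", "Titanic",
              "Star Wars: The Force Awakens", "Avengers: Infinity War"],
    "collections": [2.796, 2.789, 2.187, 2.068, 2.048],
    "release_yr": [2019, 2009, 1997, 2015, 2018],
})

# Select a single column by its name
titles = movie_df["movie"]
print(titles.tolist())

# Keep only the rows whose release year is 2015 or later
recent = movie_df[movie_df["release_yr"] >= 2015]
print(recent["movie"].tolist())
# ['Avengers: Endgame', 'Star Wars: The Force Awakens', 'Avengers: Infinity War']
```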
We can also create dataframes from existing comma-separated values files, also called 'csv' files. The 'read_csv' function from the pandas library can be used to read such a file, and the 'head' method displays its first five rows, as in the lines of code below.
df = pd.read_csv("data_desc.csv")
df.head()
Output:
| | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age | Sex |
|--- |---------------- |------------ |------------- |-------- |------------- |------------- |-------------- |----------------- |----- |----- |
| 0 | Yes | 2 | Yes | 306800 | 43500 | 204 | Satisfactory | Yes | 76 | M |
| 1 | Yes | 3 | Yes | 702100 | 104000 | 384 | Satisfactory | Yes | 75 | M |
| 2 | No | 0 | Yes | 558800 | 66500 | 384 | Satisfactory | Yes | 75 | M |
| 3 | Yes | 2 | Yes | 534500 | 64500 | 384 | Satisfactory | Yes | 75 | M |
| 4 | Yes | 2 | Yes | 468000 | 135000 | 384 | Satisfactory | Yes | 75 | M |
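If the file 'data_desc.csv' is not at hand, the behaviour of 'read_csv' can still be tried on text held in memory. The sketch below is an assumption-laden stand-in: it copies three columns from the first two rows shown above into a string and reads that string through `io.StringIO`; the names `csv_text` and `df_small` are illustrative.

```python
import io

import pandas as pd

# In-memory stand-in for a csv file: header row, then one line per observation
csv_text = """Marital_status,Income,Loan_amount
Yes,306800,43500
Yes,702100,104000
"""

# read_csv accepts a file path or any file-like object
df_small = pd.read_csv(io.StringIO(csv_text))

print(df_small.shape)     # (2, 3)
print(df_small.dtypes)    # the text column and the numeric columns get different types
```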
In the previous sections, we have learned the basic concepts related to Python for data science. However, the most popular and challenging data science task is to build machine learning models. In simple terms, machine learning is the field of teaching machines and computers to learn from existing data and to make predictions on new, unseen data. Python is one of the most powerful languages for machine learning, and is extensively used for building data science products.
Machine learning is a vast concept and is not in the scope of this guide. To learn about data preparation and building machine learning models using Python, please refer to the following guides:
In this guide, you have acquired the fundamental knowledge to succeed in data science with Python. Specifically, you now have a basic understanding of:
Understanding these concepts will enable you to handle basic data science tasks successfully and provide a foundation for more complex skills.