Importing data from different sources is fundamental to data science and machine learning. The abundance of good quality data not only eliminates a lot of pre-processing steps but also determines how likely your model is going to succeed in predicting plausible outcomes. The Python Panda library is the workhorse of a data scientist when dealing with table or matrix forms of data. Panda is written on top of NumPy and provides the additional level of abstraction. This helps users focus more on solving the problem statement by hiding the elaborate implementation details. It takes the input in the form of csv, txt or sql file and converts it into the dataframe object which is then available for splicing and analysis.
In this guide, we are going to work with household_data.csv; the contents of which are displayed below. Unless explicitly mentioned, the data of file will remain throughout this guide.
1import pandas as pd
2df = pd.read_csv('household_data.csv')
3print(df)
Output:
1 Item_Category Gender Age Salary Purchased
20 Fitness Male 20 30000 Yes
31 Fitness Female 50 70000 No
42 Food Male 35 50000 Yes
53 Kitchen Male 22 40000 No
64 Kitchen Female 30 35000 Yes
1import pandas as pd
2df = pd.read_excel('household_data.xlsx', sheetname='household_data')
3print(df)
Output:
1 Item_Category Gender Age Salary Purchased
20 Fitness Male 20 30000 Yes
31 Fitness Female 50 70000 No
42 Food Male 35 50000 Yes
53 Kitchen Male 22 40000 No
64 Kitchen Female 30 35000 Yes
We are going to see how to read the contents returned by the select statement to the dataframe. The below snippet is for Oracle but the idea remains same for other databases. Only the connection details should change.
1import cx_Oracle
2import pandas as pd
3dsn_tns = cx_Oracle.makedsn('server', 'port', 'SID')
4conn = cx_Oracle.connect(user='username', password='password', dsn=dsn_tns)
5cursor = conn.cursor()
6df = pd.read_sql_query("select * from household_data where Item_Category = 'Food'",conn)
7print(df)
Output:
1 Item_Category Gender Age Salary Purchased
20 Food Male 35 50000 Yes
Panda provides various in-built functions that come in handy when dealing with the data set.
1print(df["Salary"].min())
2print(df["Salary"].max())
3print(df["Salary"].mean())
Output:
130000
270000
345000.0
Count: This method is useful when the user is interested in getting the number of elements present per column. If there is any value that is left null than that is eliminated from the count. Assume if the value of purchased is left blank for one of the rows then following would be the output.
1print(df.count(0))
Output:
1Item_Category 5
2Gender 5
3Age 5
4Salary 5
5Purchased 4
Shape is used to get the dimensions of the dataframe.
1print(df.shape)
Output:
1(5, 5)
Size is used to get the number of elements in the dataframe.
1print(df.size)
Output:
125
See the below equation:
1y = 10a + 2b - 4.3c
It demonstrates that the value of y is dependent on the value of a, b, and c. So, y is referred to as dependent feature or variable and a, b, and c are independent features or variables. Any predictive mathematical model tends to divide the observations (data) into dependent/ independent features in order to determine the causal effect. It should be noted that relationship between dependent and independent variables need not be linear, it can be polynomial. It is common practise while doing experiments to change one independent variable while keeping others constant to see the change caused on the dependent variable.
In machine learning, the concept of dependent and independent variables is important to understand. In the above dataset, if you look closely, the first four columns (Item_Category, Gender, Age, Salary) determine the outcome of the fifth, or last, column (Purchased). Intuitively, it means that the decision to buy a product of a given category (Fitness item, Food product, kitchen goods) is determined by the Gender (Male, Female), Age, and the Salary of the individual. So, we can say that Purchased is the dependent variable, the value of which is determined by the other four variables.
With this in mind, we need to split our dataset into the matrix of independent variables and the vector or dependent variable. Mathematically, Vector is defined as a matrix that has just one column.
1X = df.iloc[:, :-1].values
2print(X)
Output:
1[['Fitness' 'Male' 20 30000]
2['Fitness' 'Female' 50 70000]
3['Food' 'Male' 35 50000]
4['Kitchen' 'Male' 22 40000]
5['Kitchen' 'Female' 30 35000]]
1Y = df.iloc[:, -1].values
2print(Y)
Output:
1['Yes', 'No', 'Yes', 'No', 'Yes']
There are many other sophisticated methods available in Python Pandas that can help the user to import data from different sources to its dataframe. Once you have the data in the dataframe, it can then be used for various kinds of analysis. We also saw how to segregate the data into dependent and independent variables. In the next guide, we will see how to carry on a few more pre-processing steps before data can be presented to the machine learning models.