
Importing Data in Python

Jan 2, 2020 • 10 Minute Read

Introduction

Data scientists are expected to build high-performing machine learning models, but the starting point is getting the data into the Python environment. Only after importing the data can the data scientist clean, wrangle, visualize, and build predictive models on it.

In this guide, you'll learn techniques for importing data into Python. We will start with flat files, including .txt and .csv files, which are simple and prevalent formats for data storage.

CSV Files

One of the most common flat file formats is CSV, short for comma-separated values. A CSV file typically stores each observation as a row and each attribute as a column.

The first line of code below imports the pandas package using the alias pd. The second line reads the .csv file and stores it as a pandas dataframe using the pandas pd.read_csv() function. The third line prints the shape of the data, and the fourth line displays the first five observations.

import pandas as pd
data = pd.read_csv("data_desc.csv")
print(data.shape)
data.head(5)

Output:

      (600, 11)


|   	| Marital_status 	| Dependents 	| Is_graduate 	| Income 	| Loan_amount 	| Term_months 	| Credit_score 	| approval_status 	| Age 	| Sex 	| Investment 	|
|---	|----------------	|------------	|-------------	|--------	|-------------	|-------------	|--------------	|-----------------	|-----	|-----	|------------	|
| 0 	| Yes            	| 2          	| Yes         	| 306800 	| 43500       	| 204         	| Satisfactory 	| Yes             	| 76  	| M   	| 199420     	|
| 1 	| Yes            	| 3          	| Yes         	| 702100 	| 104000      	| 384         	| Satisfactory 	| Yes             	| 75  	| M   	| 456365     	|
| 2 	| No             	| 0          	| Yes         	| 558800 	| 66500       	| 384         	| Satisfactory 	| Yes             	| 75  	| M   	| 363220     	|
| 3 	| Yes            	| 2          	| Yes         	| 534500 	| 64500       	| 384         	| Satisfactory 	| Yes             	| 75  	| M   	| 347425     	|
| 4 	| Yes            	| 2          	| Yes         	| 468000 	| 135000      	| 384         	| Satisfactory 	| Yes             	| 75  	| M   	| 304200     	|
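The pd.read_csv() function also accepts keyword arguments to customize an import, such as reading only selected columns or a limited number of rows. A minimal sketch below, using a small in-memory CSV with hypothetical values in place of data_desc.csv:

```python
from io import StringIO

import pandas as pd

# A small in-memory CSV standing in for data_desc.csv (hypothetical values).
csv_text = """Marital_status,Dependents,Income
Yes,2,306800
No,0,558800
Yes,3,702100
"""

# usecols keeps only the named columns; nrows limits how many rows are read.
data_sub = pd.read_csv(StringIO(csv_text),
                       usecols=["Marital_status", "Income"],
                       nrows=2)
print(data_sub.shape)  # (2, 2)
```

The same keyword arguments work identically when reading from a file on disk.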
    

Text Files

The other common flat file type is the text file, which contains textual data that is not necessarily tabular. For our example, we'll work with the moby_dick.txt file, which contains the opening lines of Moby Dick, one of the greatest American novels, by Herman Melville. The novel was first published in London in 1851 as The Whale and a month later in New York City as Moby-Dick.

The first line of code below reads the text file using the pandas pd.read_table() function. The second line prints the first few lines of the file.

data3 = pd.read_table("moby_dick.txt")
print(data3)

Output:

      CHAPTER 1. Loomings.
0   Call me Ishmael. Some years ago--never mind ho...
1   little or no money in my purse, and nothing pa...
2   shore, I thought I would sail about a little a...
3   the world. It is a way I have of driving off t...
4   the circulation. Whenever I find myself growin...
5   whenever it is a damp, drizzly November in my ...
6   myself involuntarily pausing before coffin war...
7   the rear of every funeral I meet; and especial...
8   such an upper hand of me, that it requires a s...
9   prevent me from deliberately stepping into the...
10  knocking people's hats off--then, I account it...
11  as soon as I can. This is my substitute for pi...
12  philosophical flourish Cato throws himself upo...
13  take to the ship. There is nothing surprising ...
14  it, almost all men in their degree, some time ...
15  nearly the same feelings towards the ocean wit...
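When a text file is not tabular at all, the standard library's open() function is often a better fit than pandas. A minimal sketch below, which writes a small stand-in file (so it is self-contained) and reads it back line by line:

```python
# Create a small stand-in file so the sketch is self-contained;
# with the real moby_dick.txt, only the open() calls below are needed.
sample = ("CHAPTER 1. Loomings.\n"
          "Call me Ishmael. Some years ago--never mind how long precisely--\n")
with open("sample_moby.txt", "w") as f:
    f.write(sample)

# Read the whole file back as a list of lines.
with open("sample_moby.txt") as f:
    lines = f.readlines()

print(len(lines))        # 2
print(lines[0].strip())  # CHAPTER 1. Loomings.
```

Unlike pd.read_table(), this returns plain strings, which is usually what you want for free-form text.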
    

Excel Data

Excel data needs no introduction and is arguably the most widely used data format in the business world. The first line of code below loads the workbook as a pandas ExcelFile object using the pd.ExcelFile() function. The second line prints the names of the sheets in the file.

data2 = pd.ExcelFile("data_excel.xlsx")
print(data2.sheet_names)

Output:

['sheet 1']

The output shows that this Excel file has a single sheet, 'sheet 1'. If we don't specify a sheet name, pandas loads the first sheet by default. If we want to load only a particular sheet from the Excel file for analysis, we can do that using the first line of code below. The second line prints the first five rows of the data. It is also possible to customize the import, for example by skipping certain rows, importing only selected columns, or changing variable names.

df1 = data2.parse('sheet 1')
df1.head(5)

Output:

      |   	| Marital_status 	| Dependents 	| Is_graduate 	| Income 	| Loan_amount 	| Term_months 	| Credit_score 	| approval_status 	| Age 	| Sex 	| Investment 	|
|---	|----------------	|------------	|-------------	|--------	|-------------	|-------------	|--------------	|-----------------	|-----	|-----	|------------	|
| 0 	| Yes            	| 2          	| Yes         	| 306800 	| 43500       	| 204         	| Satisfactory 	| Yes             	| 76  	| M   	| 199420     	|
| 1 	| Yes            	| 3          	| Yes         	| 702100 	| 104000      	| 384         	| Satisfactory 	| Yes             	| 75  	| M   	| 456365     	|
| 2 	| No             	| 0          	| Yes         	| 558800 	| 66500       	| 384         	| Satisfactory 	| Yes             	| 75  	| M   	| 363220     	|
| 3 	| Yes            	| 2          	| Yes         	| 534500 	| 64500       	| 384         	| Satisfactory 	| Yes             	| 75  	| M   	| 347425     	|
| 4 	| Yes            	| 2          	| Yes         	| 468000 	| 135000      	| 384         	| Satisfactory 	| Yes             	| 75  	| M   	| 304200     	|
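The customizations mentioned above can be passed straight to parse() (or pd.read_excel()). The sketch below builds a small two-sheet workbook as a stand-in for data_excel.xlsx, with hypothetical sheet names and values; it assumes an Excel engine such as openpyxl is installed:

```python
import pandas as pd

# Build a small two-sheet workbook as a stand-in for data_excel.xlsx.
with pd.ExcelWriter("demo_excel.xlsx") as writer:
    pd.DataFrame({"Income": [306800, 702100], "Age": [76, 75]}).to_excel(
        writer, sheet_name="sheet 1", index=False)
    pd.DataFrame({"Medal": ["Gold", "Silver"]}).to_excel(
        writer, sheet_name="sheet 2", index=False)

xls = pd.ExcelFile("demo_excel.xlsx")
print(xls.sheet_names)  # ['sheet 1', 'sheet 2']

# Import only the Income column from the first sheet and rename it.
df1 = xls.parse("sheet 1", usecols=["Income"]).rename(
    columns={"Income": "income"})
print(df1.shape)  # (2, 1)
```

Passing usecols at import time avoids loading columns you will only drop later.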
    

Importing Data from a URL

Often, data is available on a website and can be downloaded manually to a local system, but that approach is neither reproducible nor scalable. In such cases, we can load the data directly from a website URL (Uniform Resource Locator) into the Python environment and perform the analysis there.

We will use the urllib library for performing this task, as this package provides the interface for fetching data across the web. The first two lines of code below import the required libraries. The third line assigns the URL of the file that we want to import into Python. The fourth line uses the urlretrieve function to save the file in the local environment. The last three lines of code below read the file into a dataframe and print the shape and the first few observations of the dataset.

import urllib
from urllib.request import urlretrieve

url = 'http://winterolympicsmedals.com/medals.csv'
urlretrieve(url, 'medals.csv')

df_web = pd.read_csv('medals.csv')
print(df_web.shape)
df_web.head()

Output:

      (2311, 8)


|   	| Year 	| City     	| Sport      	| Discipline     	| NOC 	| Event      	| Event gender 	| Medal  	|
|---	|------	|----------	|------------	|----------------	|-----	|------------	|--------------	|--------	|
| 0 	| 1924 	| Chamonix 	| Skating    	| Figure skating 	| AUT 	| individual 	| M            	| Silver 	|
| 1 	| 1924 	| Chamonix 	| Skating    	| Figure skating 	| AUT 	| individual 	| W            	| Gold   	|
| 2 	| 1924 	| Chamonix 	| Skating    	| Figure skating 	| AUT 	| pairs      	| X            	| Gold   	|
| 3 	| 1924 	| Chamonix 	| Bobsleigh  	| Bobsleigh      	| BEL 	| four-man   	| M            	| Bronze 	|
| 4 	| 1924 	| Chamonix 	| Ice Hockey 	| Ice Hockey     	| CAN 	| ice hockey 	| M            	| Gold   	|
    

SQL Database

Relational databases are a prominent source of data storage for many organizations, so it is important to know how to import data from them. Structured Query Language (SQL) is the standard language for querying relational databases such as Microsoft SQL Server, and we can import data from their tables by building a connection. The first step is to import the required packages; the illustration below uses the pyodbc package. The next step is to establish a connection to the database, for which you will need the server, user, and database details. Once the connection is established, you can run a SQL query to extract the relevant information and store it in a dataframe for further analysis. An illustration is shown below.

import pyodbc
sql_conn = pyodbc.connect("Driver={SQL Server};Server=serverName;UID=UserName;PWD=Password;Database=sql_db;")
df_sql = pd.read_sql_query('type your sql query here', sql_conn)
df_sql.head()
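Since running the snippet above requires a live SQL Server, here is the same pd.read_sql_query() workflow sketched against the standard library's sqlite3 module; the table name and values are hypothetical:

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database stands in for the SQL Server connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (income INTEGER, approval_status TEXT)")
conn.executemany("INSERT INTO loans VALUES (?, ?)",
                 [(306800, "Yes"), (558800, "No"), (702100, "Yes")])
conn.commit()

# The query result comes back as a pandas dataframe, ready for analysis.
df_sql = pd.read_sql_query("SELECT * FROM loans WHERE income > 400000", conn)
print(df_sql.shape)  # (2, 2)
conn.close()
```

The same pattern applies to any DB-API connection object, including the pyodbc connection above: only the connect() call changes.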
    

Conclusion

In this guide, you learned techniques for importing different types of data into Python. Knowledge of predictive modeling is important, but equally important is knowing how to import data files such as CSV, text, and Excel files from a local machine into the Python environment. You also learned how to import data directly from a website into your working environment, and the basic workflow for importing data from a SQL database.

To learn more about data science using Python, please refer to the following guides:

  1. Scikit Machine Learning

  2. Linear, Lasso, and Ridge Regression with scikit-learn

  3. Non-Linear Regression Trees with scikit-learn

  4. Machine Learning with Neural Networks Using scikit-learn

  5. Validating Machine Learning Models with scikit-learn

  6. Ensemble Modeling with scikit-learn

  7. Preparing Data for Modeling with scikit-learn

  8. Data Science Beginners

  9. Interpreting Data Using Descriptive Statistics with Python