Author avatar

Gabriel Cánepa

Importing Data from a Database Using Python

Gabriel Cánepa

  • Feb 5, 2019
  • 9 Min read
  • 40,574 Views
  • Feb 5, 2019
  • 9 Min read
  • 40,574 Views
Data
Python

Introduction

In previous guides, we have explained how to import data from Excel spreadsheets, tab-delimited files, and online APIs. As helpful as those resources are, if the data we need to handle is large in size or becomes too complex, a database (relational or otherwise) may be a better solution. Regardless of the flavor you choose, there are general principles and basic tools that we can use to query a database and manipulate the results using Python.

Prerequisites

To begin, we need to install the appropriate connector (also known as driver) for the database system that we are using. This utility comes in the form of a module that is at one's disposal either from the standard library (such as sqlite3) or a third-party package like mysql-connector-python and psycopg2-binary for Mysql / MariaDB and PostgreSQL, respectively. In any event, the Python Package Index is our go-to place to search for available adapters.

In this guide, we will use PostgreSQL, since it provides a function called ROW_TO_JSON out of the box. As its name suggests, this function returns each row in the result set as a JSON object. Since we have already learned how to work with JSON data, we will be able to manipulate the result of a query very easily.

If we use a different database system, we can iterate over the results and create a list of dictionaries where each element is a record in the result set.

That being said, we will need to install psycopg2-binary, preferably inside a virtual environment before we proceed:

1	pip install psycopg2-binary
bash

Now let's examine the PostgreSQL database we will work with, which is called nba. Figs. 1 through 3 show the structure of the coaches, players, and teams tables.

  • coaches stores the following data, where coach_id acts as the primary key. Besides the coach's first and last names, there's also a team_id which is a foreign key that references the homonymous field in the teams table.

Figure 1 - The **coaches** table

  • players, besides the player_id (primary key) and team_id (foreign key, which indicates the team he is currently playing for), also holds the first and last names, the jersey number, the height in meters, the weight in kilograms, and the country of origin.

Figure 2 - The **players** table

  • Finally, teams are described by their name, conference, current conference rank, home wins and losses, and away wins and losses. Of course, it also has the team_id primary key that is referenced in the other two tables.

Figure 3 - The **teams** table

The next step consists in writing a SQL query to retrieve the list of teams ordered by conference and rank, along with the number of players in each team and the name of its coach. And while we're at it, we can also add the number of home and away wins:

1	SELECT
2				t.name,
3				t.city,
4				t.conference,
5				t.conference_rank,
6				COUNT(p.player_id) AS number_of_players,
7				CONCAT(c.first_name, ' ', c.last_name) AS coach,
8				t.home_wins,
9				t.away_wins
10	FROM		players p, teams t, coaches c
11	WHERE		p.team_id = t.team_id
12	AND			c.team_id = t.team_id
13	GROUP BY	t.name, c.first_name, c.last_name, t.city, t.conference, t.conference_rank, t.home_wins, t.away_wins
14	ORDER BY    t.conference, t.conference_rank
sql

We will then wrap the query inside the ROW_TO_JSON function for our convenience and save it to a file named query.sql in the current directory:

1	SELECT ROW_TO_JSON(team_info) FROM (
2		SELECT
3					t.name,
4					t.city,
5					t.conference,
6					t.conference_rank,
7					COUNT(p.player_id) AS number_of_players,
8					CONCAT(c.first_name, ' ', c.last_name) AS coach,
9					t.home_wins,
10					t.away_wins
11		FROM		players p, teams t, coaches c
12		WHERE		p.team_id = t.team_id
13		AND			c.team_id = t.team_id
14		GROUP BY	t.name, c.first_name, c.last_name, t.city, t.conference, t.conference_rank, t.home_wins, t.away_wins
15		ORDER BY    t.conference, t.conference_rank
16	) AS team_info
sql

Fig. 4 shows the first records of the above query. Note that each row has the structure of a Python dictionary where the names of the fields returned by the query are the keys.

Figure 4 - Team information

Last, but not least, a word of caution. To connect to a database, we need a username and a password. It is best practice to use environment variables instead of exposing them in plain sight as part of the connection string. This is particularly important if you push your code to a version control system that other people can access. In Unix-like environments, this can be done by appending the following two lines at the end of your shell's initialization file. To apply changes, you will need to log out and log back in or source the file in the current session.

1	export DB_USER="your_PostgreSQL_username_here_inside_quotes"
2	export DB_PASS="your_password_inside_quotes"
bash

In Windows, go to Control Panel / System / Advanced system settings. Select the Advanced tab and click on Environment Variables to add them:

We are now ready to start writing Python code!

Querying the Database and Manipulating Results

At the top of our program we will import the necessary modules and one function to handle errors:

1	import os
2	import psycopg2 as p
3	from psycopg2 import Error
python

Next, we will load the contents of query.sql into query and instantiate the connection. You can also use environment variables for host, port, and database just like we did for user and password, although it is not strictly necessary to do so.

1	with open('query.sql') as sql:
2	    query = sql.read()
3
4	conn = p.connect(
5	    user = os.environ['DB_USER'],
6	  	password = os.environ['DB_PASS'],
7	  	host = 'localhost',
8	  	port = '5432',
9	  	database = 'nba'
10	)
python

Once we have successfully connected to the database, it is time to execute the query. To do so, a control structure associated with the connection and known as cursor is used. If everything went as expected, the variable called result contains a list of one-element tuples where each element is a dictionary.

1	cursor = conn.cursor()
2	cursor.execute(query)
3	result = cursor.fetchall()
python

At this point, we can iterate over result and manipulate its contents as desired. For example, we may insert them into a spreadsheet (as illustrated in Fig. 5), as we learned in Importing Data from Microsoft Excel Files with Python, or use them to feed an HTML table via a web application.

Figure 5 - Inserting team information into a spreadsheet

To catch errors, if they occur, it is necessary to wrap our code inside a try-except block. And while we are at it, adding a finally sentence allows us to clean up the connection when we are done using it:

1	try:
2		# Instantiate the connection
3		conn = p.connect(
4		    user = os.environ['DB_USER'],
5	      	password = os.environ['DB_PASS'],
6	      	host = 'localhost',
7	      	port = '5432',
8	      	database = 'nba'
9		)
10		
11		# Create cursor, execute the query, and fetch results
12		cursor = conn.cursor()
13		cursor.execute(query)
14		result = cursor.fetchall()
15		
16		# Create workbook and select active sheet
17		wb = Workbook()
18		ws = wb.active
19		
20		# Rename active sheet
21		ws.title = 'Teams'
22		
23		# Column headings
24		column_headings = [
25							'Name', 
26							'City', 
27							'Conference', 
28							'Rank', 
29							'Players', 
30							'Coach', 
31							'Home wins', 
32							'Away wins'
33		]
34		ws.append(column_headings)
35		
36		# Add players
37		for team in result:
38		    ws.append(list(team[0].values()))
39			
40		# Get coordinates of last cell
41		last_cell = ws.cell(row = ws.max_row, column = ws.max_column).coordinate
42		
43		# Create table
44		team_table = Table(displayName = 'TeamTable', ref = 'A1:{}'.format(last_cell))
45		
46		# Add 'Table Style Medium 6' style
47		style = TableStyleInfo(name = 'TableStyleMedium6', showRowStripes = True)
48		
49		# Apply style to table
50		team_table.tableStyleInfo = style
51		
52		# Add table to spreadsheet
53		ws.add_table(team_table)
54		
55		# Save spreadsheet
56		wb.save('teams.xlsx')
57		
58	except p.Error as error:
59		print('There was an error with the database operation: {}'.format(error))
60	except:
61	    print('There was an unexpected error of type {}'.format(sys.exc_info()[0]))
62	finally:
63		if conn:
64			cursor.close()
65			conn.close()
python

Both the script and the SQL file are available in Github. Feel free to use and modify them as you need.

Summary

In this guide, we have learned how to connect to a PostgreSQL database using Python, execute queries, and import the results into an Excel spreadsheet. However, you can adapt the code to do other things, such as creating and saving a JSON file or populating an HTML table.