Author avatar

Gabriel Cánepa

Importing Data from a Database Using Python

Gabriel Cánepa

  • Feb 5, 2019
  • 9 Min read
  • 492 Views
  • Feb 5, 2019
  • 9 Min read
  • 492 Views
Data
Python

Introduction

In previous guides, we have explained how to import data from Excel spreadsheets, tab-delimited files, and online APIs. As helpful as those resources are, if the data we need to handle is large in size or becomes too complex, a database (relational or otherwise) may be a better solution. Regardless of the flavor you choose, there are general principles and basic tools that we can use to query a database and manipulate the results using Python.

Prerequisites

To begin, we need to install the appropriate connector (also known as driver) for the database system that we are using. This utility comes in the form of a module that is at one's disposal either from the standard library (such as sqlite3) or a third-party package like mysql-connector-python and psycopg2-binary for Mysql / MariaDB and PostgreSQL, respectively. In any event, the Python Package Index is our go-to place to search for available adapters.

In this guide, we will use PostgreSQL, since it provides a function called ROW_TO_JSON out of the box. As its name suggests, this function returns each row in the result set as a JSON object. Since we have already learned how to work with JSON data, we will be able to manipulate the result of a query very easily.

If we use a different database system, we can iterate over the results and create a list of dictionaries where each element is a record in the result set.

That being said, we will need to install psycopg2-binary, preferably inside a virtual environment before we proceed:

1
	pip install psycopg2-binary
bash

Now let's examine the PostgreSQL database we will work with, which is called nba. Figs. 1 through 3 show the structure of the coaches, players, and teams tables.

  • coaches stores the following data, where coach_id acts as the primary key. Besides the coach's first and last names, there's also a team_id which is a foreign key that references the homonymous field in the teams table.

Figure 1 - The **coaches** table

  • players, besides the player_id (primary key) and team_id (foreign key, which indicates the team he is currently playing for), also holds the first and last names, the jersey number, the height in meters, the weight in kilograms, and the country of origin.

Figure 2 - The **players** table

  • Finally, teams are described by their name, conference, current conference rank, home wins and losses, and away wins and losses. Of course, it also has the team_id primary key that is referenced in the other two tables.

Figure 3 - The **teams** table

The next step consists in writing a SQL query to retrieve the list of teams ordered by conference and rank, along with the number of players in each team and the name of its coach. And while we're at it, we can also add the number of home and away wins:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
	SELECT
				t.name,
				t.city,
				t.conference,
				t.conference_rank,
				COUNT(p.player_id) AS number_of_players,
				CONCAT(c.first_name, ' ', c.last_name) AS coach,
				t.home_wins,
				t.away_wins
	FROM		players p, teams t, coaches c
	WHERE		p.team_id = t.team_id
	AND			c.team_id = t.team_id
	GROUP BY	t.name, c.first_name, c.last_name, t.city, t.conference, t.conference_rank, t.home_wins, t.away_wins
	ORDER BY    t.conference, t.conference_rank
sql

We will then wrap the query inside the ROW_TO_JSON function for our convenience and save it to a file named query.sql in the current directory:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
	SELECT ROW_TO_JSON(team_info) FROM (
		SELECT
					t.name,
					t.city,
					t.conference,
					t.conference_rank,
					COUNT(p.player_id) AS number_of_players,
					CONCAT(c.first_name, ' ', c.last_name) AS coach,
					t.home_wins,
					t.away_wins
		FROM		players p, teams t, coaches c
		WHERE		p.team_id = t.team_id
		AND			c.team_id = t.team_id
		GROUP BY	t.name, c.first_name, c.last_name, t.city, t.conference, t.conference_rank, t.home_wins, t.away_wins
		ORDER BY    t.conference, t.conference_rank
	) AS team_info
sql

Fig. 4 shows the first records of the above query. Note that each row has the structure of a Python dictionary where the names of the fields returned by the query are the keys.

Figure 4 - Team information

Last, but not least, a word of caution. To connect to a database, we need a username and a password. It is best practice to use environment variables instead of exposing them in plain sight as part of the connection string. This is particularly important if you push your code to a version control system that other people can access. In Unix-like environments, this can be done by appending the following two lines at the end of your shell's initialization file. To apply changes, you will need to log out and log back in or source the file in the current session.

1
2
	export DB_USER="your_PostgreSQL_username_here_inside_quotes"
	export DB_PASS="your_password_inside_quotes"
bash

In Windows, go to Control Panel / System / Advanced system settings. Select the Advanced tab and click on Environment Variables to add them:

We are now ready to start writing Python code!

Querying the Database and Manipulating Results

At the top of our program we will import the necessary modules and one function to handle errors:

1
2
3
	import os
	import psycopg2 as p
	from psycopg2 import Error
python

Next, we will load the contents of query.sql into query and instantiate the connection. You can also use environment variables for host, port, and database just like we did for user and password, although it is not strictly necessary to do so.

1
2
3
4
5
6
7
8
9
10
	with open('query.sql') as sql:
	    query = sql.read()

	conn = p.connect(
	    user = os.environ['DB_USER'],
	  	password = os.environ['DB_PASS'],
	  	host = 'localhost',
	  	port = '5432',
	  	database = 'nba'
	)
python

Once we have successfully connected to the database, it is time to execute the query. To do so, a control structure associated with the connection and known as cursor is used. If everything went as expected, the variable called result contains a list of one-element tuples where each element is a dictionary.

1
2
3
	cursor = conn.cursor()
	cursor.execute(query)
	result = cursor.fetchall()
python

At this point, we can iterate over result and manipulate its contents as desired. For example, we may insert them into a spreadsheet (as illustrated in Fig. 5), as we learned in Importing Data from Microsoft Excel Files with Python, or use them to feed an HTML table via a web application.

Figure 5 - Inserting team information into a spreadsheet

To catch errors, if they occur, it is necessary to wrap our code inside a try-except block. And while we are at it, adding a finally sentence allows us to clean up the connection when we are done using it:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
	try:
		# Instantiate the connection
		conn = p.connect(
		    user = os.environ['DB_USER'],
	      	password = os.environ['DB_PASS'],
	      	host = 'localhost',
	      	port = '5432',
	      	database = 'nba'
		)
		
		# Create cursor, execute the query, and fetch results
		cursor = conn.cursor()
		cursor.execute(query)
		result = cursor.fetchall()
		
		# Create workbook and select active sheet
		wb = Workbook()
		ws = wb.active
		
		# Rename active sheet
		ws.title = 'Teams'
		
		# Column headings
		column_headings = [
							'Name', 
							'City', 
							'Conference', 
							'Rank', 
							'Players', 
							'Coach', 
							'Home wins', 
							'Away wins'
		]
		ws.append(column_headings)
		
		# Add players
		for team in result:
		    ws.append(list(team[0].values()))
			
		# Get coordinates of last cell
		last_cell = ws.cell(row = ws.max_row, column = ws.max_column).coordinate
		
		# Create table
		team_table = Table(displayName = 'TeamTable', ref = 'A1:{}'.format(last_cell))
		
		# Add 'Table Style Medium 6' style
		style = TableStyleInfo(name = 'TableStyleMedium6', showRowStripes = True)
		
		# Apply style to table
		team_table.tableStyleInfo = style
		
		# Add table to spreadsheet
		ws.add_table(team_table)
		
		# Save spreadsheet
		wb.save('teams.xlsx')
		
	except p.Error as error:
		print('There was an error with the database operation: {}'.format(error))
	except:
	    print('There was an unexpected error of type {}'.format(sys.exc_info()[0]))
	finally:
		if conn:
			cursor.close()
			conn.close()
python

Both the script and the SQL file are available in Github. Feel free to use and modify them as you need.

Summary

In this guide, we have learned how to connect to a PostgreSQL database using Python, execute queries, and import the results into an Excel spreadsheet. However, you can adapt the code to do other things, such as creating and saving a JSON file or populating an HTML table.

1