In previous guides, we have explained how to import data from Excel spreadsheets, tab-delimited files, and online APIs. As helpful as those resources are, when the data we need to handle grows large or complex, a database (relational or otherwise) may be a better solution. Regardless of the flavor you choose, there are general principles and basic tools that we can use to query a database and manipulate the results using Python.
To begin, we need to install the appropriate connector (also known as a driver) for the database system we are using. This utility comes in the form of a module, available either from the standard library (such as sqlite3) or as a third-party package, like mysql-connector-python for MySQL / MariaDB or psycopg2-binary for PostgreSQL. In any event, the Python Package Index is our go-to place to search for available adapters.
In this guide, we will use PostgreSQL, since it provides a function called ROW_TO_JSON out of the box. As its name suggests, this function returns each row in the result set as a JSON object. Since we have already learned how to work with JSON data, we will be able to manipulate the result of a query very easily.
If we use a different database system, we can iterate over the results and create a list of dictionaries where each element is a record in the result set.
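For example, here is a minimal sketch of that approach, assuming a generic DB-API cursor on which the query has already been executed; the column names come from cursor.description:

# Build a list of dictionaries, one per record in the result set
columns = [column[0] for column in cursor.description]
records = [dict(zip(columns, row)) for row in cursor.fetchall()]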
That being said, we will need to install psycopg2-binary, preferably inside a virtual environment, before we proceed:
pip install psycopg2-binary
Now let's examine the PostgreSQL database we will work with, which is called nba. Figs. 1 through 3 show the structure of the coaches, players, and teams tables.
The next step consists of writing a SQL query to retrieve the list of teams, ordered by conference and rank, along with the number of players on each team and the name of its coach. And while we're at it, we can also add the number of home and away wins:
SELECT
    t.name,
    t.city,
    t.conference,
    t.conference_rank,
    COUNT(p.player_id) AS number_of_players,
    CONCAT(c.first_name, ' ', c.last_name) AS coach,
    t.home_wins,
    t.away_wins
FROM players p, teams t, coaches c
WHERE p.team_id = t.team_id
AND c.team_id = t.team_id
GROUP BY t.name, c.first_name, c.last_name, t.city, t.conference, t.conference_rank, t.home_wins, t.away_wins
ORDER BY t.conference, t.conference_rank
We will then wrap the query inside the ROW_TO_JSON function for our convenience and save it to a file named query.sql in the current directory:
SELECT ROW_TO_JSON(team_info) FROM (
    SELECT
        t.name,
        t.city,
        t.conference,
        t.conference_rank,
        COUNT(p.player_id) AS number_of_players,
        CONCAT(c.first_name, ' ', c.last_name) AS coach,
        t.home_wins,
        t.away_wins
    FROM players p, teams t, coaches c
    WHERE p.team_id = t.team_id
    AND c.team_id = t.team_id
    GROUP BY t.name, c.first_name, c.last_name, t.city, t.conference, t.conference_rank, t.home_wins, t.away_wins
    ORDER BY t.conference, t.conference_rank
) AS team_info
Fig. 4 shows the first few records returned by the above query. Note that each row has the structure of a Python dictionary, where the names of the fields returned by the query are the keys.
Last, but not least, a word of caution. To connect to a database, we need a username and a password. It is best practice to store them in environment variables instead of exposing them in plain sight as part of the connection string. This is particularly important if you push your code to a version control system that other people can access. In Unix-like environments, this can be done by appending the following two lines to the end of your shell's initialization file. To apply the changes, you will need to log out and log back in, or source the file in the current session:
export DB_USER="your_PostgreSQL_username_here_inside_quotes"
export DB_PASS="your_password_inside_quotes"
On Windows, go to Control Panel / System / Advanced system settings. Select the Advanced tab and click on Environment Variables to add them.
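Alternatively, the same variables can be set from a Command Prompt with the built-in setx command (the values persist for your user account, but only become visible in consoles opened afterwards):

setx DB_USER "your_PostgreSQL_username_here"
setx DB_PASS "your_password_here"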
We are now ready to start writing Python code!
At the top of our program we will import the necessary modules: os to read the environment variables, sys to report unexpected errors, psycopg2 together with its Error class to handle database errors, and the openpyxl classes we will need later to build the spreadsheet:
import os
import sys
import psycopg2 as p
from psycopg2 import Error
from openpyxl import Workbook
from openpyxl.worksheet.table import Table, TableStyleInfo
Next, we will load the contents of query.sql into query and instantiate the connection. You can also use environment variables for host, port, and database just like we did for user and password, although it is not strictly necessary to do so.
with open('query.sql') as sql:
    query = sql.read()

conn = p.connect(
    user=os.environ['DB_USER'],
    password=os.environ['DB_PASS'],
    host='localhost',
    port='5432',
    database='nba'
)
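As mentioned above, host, port, and database can also be read from the environment. A minimal sketch of that variation, assuming hypothetical DB_HOST, DB_PORT, and DB_NAME variables (os.getenv falls back to a default when a variable is not set), would be:

conn = p.connect(
    user=os.environ['DB_USER'],
    password=os.environ['DB_PASS'],
    host=os.getenv('DB_HOST', 'localhost'),
    port=os.getenv('DB_PORT', '5432'),
    database=os.getenv('DB_NAME', 'nba')
)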
Once we have successfully connected to the database, it is time to execute the query. To do so, we use a control structure associated with the connection, known as a cursor. If everything goes as expected, the result variable will contain a list of one-element tuples, where each element is a dictionary:
cursor = conn.cursor()
cursor.execute(query)
result = cursor.fetchall()
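As a quick sanity check, we can peek at the first row; the keys match the column aliases of the query above:

first_row = result[0][0]
print(first_row['name'], first_row['conference'], first_row['coach'])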
At this point, we can iterate over result and manipulate its contents as desired. For example, we may insert them into a spreadsheet (as illustrated in Fig. 5), as we learned in Importing Data from Microsoft Excel Files with Python, or use them to feed an HTML table via a web application.
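For the HTML case, a minimal, framework-agnostic sketch (assuming the result list fetched above) could look like this; a real web application would normally let its template engine generate the markup instead:

# Build one <tr> per team, with one <td> per field in the dictionary
rows = ''.join(
    '<tr>{}</tr>'.format(
        ''.join('<td>{}</td>'.format(value) for value in team[0].values())
    )
    for team in result
)
html_table = '<table>{}</table>'.format(rows)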
To catch errors, if they occur, we need to wrap our code inside a try-except block. And while we are at it, adding a finally clause allows us to clean up the connection when we are done using it:
# Make sure conn exists even if the connection attempt fails
conn = None

try:
    # Instantiate the connection
    conn = p.connect(
        user=os.environ['DB_USER'],
        password=os.environ['DB_PASS'],
        host='localhost',
        port='5432',
        database='nba'
    )

    # Create cursor, execute the query, and fetch results
    cursor = conn.cursor()
    cursor.execute(query)
    result = cursor.fetchall()

    # Create workbook and select active sheet
    wb = Workbook()
    ws = wb.active

    # Rename active sheet
    ws.title = 'Teams'

    # Column headings
    column_headings = [
        'Name',
        'City',
        'Conference',
        'Rank',
        'Players',
        'Coach',
        'Home wins',
        'Away wins'
    ]
    ws.append(column_headings)

    # Add one row per team
    for team in result:
        ws.append(list(team[0].values()))

    # Get coordinates of last cell
    last_cell = ws.cell(row=ws.max_row, column=ws.max_column).coordinate

    # Create table
    team_table = Table(displayName='TeamTable', ref='A1:{}'.format(last_cell))

    # Add 'Table Style Medium 6' style
    style = TableStyleInfo(name='TableStyleMedium6', showRowStripes=True)

    # Apply style to table
    team_table.tableStyleInfo = style

    # Add table to spreadsheet
    ws.add_table(team_table)

    # Save spreadsheet
    wb.save('teams.xlsx')

except Error as error:
    print('There was an error with the database operation: {}'.format(error))
except:
    print('There was an unexpected error of type {}'.format(sys.exc_info()[0]))
finally:
    if conn:
        cursor.close()
        conn.close()
Both the script and the SQL file are available on GitHub. Feel free to use and modify them as you need.
In this guide, we have learned how to connect to a PostgreSQL database using Python, execute queries, and import the results into an Excel spreadsheet. However, you can adapt the code to do other things, such as creating and saving a JSON file or populating an HTML table.
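As an illustration of the first alternative, here is a minimal sketch that dumps the rows fetched earlier into a hypothetical teams.json file:

import json

# Each row is a one-element tuple holding a dictionary, so keep only the dictionaries
with open('teams.json', 'w') as output:
    json.dump([team[0] for team in result], output, indent=2)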