R is great for statistics. You need data—lots of data—for better statistical results. Relational databases are great at storing lots of data.
As the saying goes, use the right tool for the right job.
If your job is to analyze lots of data, then it makes much sense to import data, get data, grab data, fetch data ... whatever you prefer to call it, just let your R script use the good stuff from that database!
To import data into your R script, you must connect to the database. Fortunately, there are dedicated R packages for connecting to most popular databases out there. Even more good news: such packages have simple names.
I like this convention: just put R in front of the database software name, cross your fingers, and you are good to go.
Let's connect to a SQLite database from the R interpreter. Which package do we need for that?
1> install.packages('RSQLite') 2> library(RSQLite) 3> con <- dbConnect(SQLite(), 'play-example.db') 4> con 5<SQLiteConnection> 6 Path: C:\Users\dan\Documents\play-example.db 7 Extensions: TRUE 8>
The first line installs the RSQLite package (I skipped the boring installation message).
The second line loads the RSQLite package, as expected.
The connection magic happens on the third line by calling the
dbConnect() function with two arguments:
SQLite()function, which creates a driver object for SQLite under the hood.
It makes sense to store the connection into a variable. The newly created connection shows the file path of the SQLite database. Of course, expect to get a different path if you try it on your machine. By the way, can you please check the size of the SQLite file on your machine? What is the explanation?
We can expect a newly created SQLite database to be empty. Let's use some R code to check and confirm that.
1> dbListTables(con) 2character(0) 3>
Ok, no tables yet, but at least the
dbListTables() function has a nice, intuitive name.
Let's create a new table with the classic
mtcars data frame.
1> dbWriteTable(con, 'cars', mtcars) 2> dbListTables(con) 3 "cars" 4>
Excellent! Now we can play with the
Let's use some SQL to get 3 rows from the
1> dbGetQuery(con, 'SELECT * FROM cars LIMIT 3') 2 mpg cyl disp hp drat wt qsec vs am gear carb 31 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 42 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 53 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 6>
How can we get mpg and number of cylinders for only cars with mpg higher than 30, ordered by mpg in ascending order?
1>> dbGetQuery(con, 'SELECT mpg, cyl 2 FROM cars 3 WHERE mpg>30 4 ORDER BY mpg') 5 mpg cyl 61 30.4 4 72 30.4 4 83 32.4 4 94 33.9 4 10>
Of course, you can use more sophisticated queries to refine results.
If you are not in the mood to write SQL queries, then use
dplyr to generate SQL code automatically for you. The
dplyr package is great for working with data frames. Together with
dbplyr, it enables you to work with a table as if it were a typical data frame.
Let's install prerequisites and connect to the database:
1> install.packages(c('dplyr', 'dbplyr')) 2> library(dplyr) 3> library(RSQLite) 4> con <- dbConnect(SQLite(), 'play-example.db') 5> cars <- tbl(con, 'cars') 6>
Look at line 4: we use
dbConnect() again. Pay attention to the first argument: it's
SQLite(), not just
SQLite. That's very easy to miss and it happened to me a few times.
On line 5, the
tbl() function takes care of returning data from the
cars table through the connection.
Let's look at how
dplyr can help us:
1> cars %>% select(mpg) 2# Source: lazy query [?? x 1] 3# Database: sqlite 3.30.1 [C:\Users\dan\Documents\play-example.db] 4 mpg 5 <dbl> 6 1 21 7 2 21 8 3 22.8 9 4 21.4 10 5 18.7 11 6 18.1 12 7 14.3 13 8 24.4 14 9 22.8 1510 19.2 16# … with more rows 17>
What is this?
%>% is about piping data from left to right, from
cars to the
select() function that returns the
mpg column. Read it as then for convenience:
cars, then select
mpg, and so on.
Let's try another one.
Again, how can we get mpg and number of cylinders for only cars with mpg higher than 30, ordered by mpg in ascending order?
1> cars %>% 2 select(mpg, cyl) %>% 3 filter(mpg>30) %>% 4 arrange(mpg) 5# Source: lazy query [?? x 2] 6# Database: sqlite 3.30.1 [C:\Users\dan\Documents\play-example.db] 7# Ordered by: mpg 8 mpg cyl 9 <dbl> <dbl> 101 30.4 4 112 30.4 4 123 32.4 4 134 33.9 4 14>
Look at the first line:
cars, then select
cyl, then filter by mpg larger than 30, then arrange by mpg. Sounds quite natural, right? Also, results are the same as using the SQL query.
By the way, did I mention that
dplyr generates SQL automatically? Let's use the
show_query() function to look at the generated SQL.
1> cars %>% 2 select(mpg, cyl) %>% 3 filter(mpg>30) %>% 4 arrange(mpg) %>% 5 show_query() 6<SQL> 7SELECT * 8FROM (SELECT `mpg`, `cyl` 9FROM `cars`) 10WHERE (`mpg` > 30.0) 11ORDER BY `mpg` 12>
This is very similar to the SQL query we wrote painstakingly in the previous section. It's great to have both options: manual SQL and
Once connected to the database, you can import data using either SQL or
dplyr and use that data further in your R code.
In this guide, we used SQLite to illustrate the main points of importing data from a relational database. These points are also applicable for working with other relational databases.
If you want to delve deeper into importing data from databases into your R code, then have a look at my Pluralsight course Importing Data from Relational Databases in R, which covers topics such as how to connect to the database with ODBC and how to secure credentials or troubleshoot connection issues.