Dániel Szabó

Scraping Your First Web Page with R

  • Apr 29, 2020

Introduction

This guide builds on the guide Web Crawling in R, which laid out the foundations of web crawling and web scraping in R. Here we narrow the focus to scraping a single webpage with R and look at techniques for extracting information from a selected website. First we'll clarify which R packages support this activity, and then we'll build a small solution to scrape a particular site of our choosing.

Web Scraping

The exponentially growing amount of data on the internet has opened up new horizons for data scientists, and web scraping has grown in popularity as well. Nowadays there is hardly a topic of interest that can't be found on the internet, and the skill to make sense of all the noise by mining relevant information has become invaluable.

Data is rarely in the appropriate format and needs to be further processed in a language that can make sense of it. When you are scraping data, it's most commonly in HTML format, which is nice for the human eye to look at through the lens of a web browser, but harder to programmatically process.

Web scraping is the art of doing just that: programmatically making sense of chunks of data in HTML format.

Let's say that, due to the pandemic, you are binge watching horror series, and your current series has only one part left. In desperation, you turn to R to write a small script to parse the appropriate category and give you the best-rated series of this genre.

In order to start scraping, you will need to open up your R console and install the rvest package.

install.packages("rvest")

In your web browser, navigate to IMDb.com and select the top-rated horror shows.

From the browser, copy the URL. This URL will serve as an anchor point where the scraping can begin. Load the rvest package and initialize the horror variable with the URL.

library("rvest")
horror <- "https://www.imdb.com/search/title/?genres=horror&explore=title_type,genres&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3396781f-d87f-4fac-8694-c56ce6f490fe&pf_rd_r=AFA6QBCC85CEAWNC6GTC&pf_rd_s=center-1&pf_rd_t=15051&pf_rd_i=genre&ref_=ft_gnr_pr1_i_3"

Now you can load the content of the page.

HTML <- read_html(horror)
HTML  # print the parsed document

Printing the HTML object should show output similar to the following.

{html_document}
<html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body id="styleguide-v2" class="fixed">\n            <img height="1" widt ...

This is the information you would like to scrape.

  1. Rank
  2. Rating
  3. Title
  4. Description

You could copy these values from the browser by hand, but the goal is to automate the process.

Information

There are two main approaches you could use in this case: CSS selectors and XPath. Both require some knowledge of HTML and how the page is structured. CSS selectors find elements by the same patterns a stylesheet uses to target them, such as tag names, classes, and IDs. XPath treats the HTML page as an XML document and uses path expressions to find a specific element on the webpage. Depending on the site, one may be easier to use than the other. For example, on a heavily styled site whose elements carry descriptive class names, CSS selectors are often the more convenient choice. Either can get the job done. This guide uses CSS selectors.
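To make the difference concrete, here is a minimal sketch that selects the same elements both ways. The HTML snippet and its class names are hypothetical and hand-written for illustration; they are not taken from the IMDb page.

```r
library(rvest)

# Hypothetical, hand-written HTML standing in for a real page
page <- read_html('
  <html><body>
    <div class="item"><span class="rank">1.</span><a href="/a">First show</a></div>
    <div class="item"><span class="rank">2.</span><a href="/b">Second show</a></div>
  </body></html>')

# CSS selector: every element carrying the class "rank"
rank_css <- html_text(html_nodes(page, css = ".rank"))

# XPath: the same elements, addressed by a path expression
rank_xpath <- html_text(html_nodes(page, xpath = "//span[@class='rank']"))

# Both approaches should yield the same character vector
identical(rank_css, rank_xpath)
```

The CSS form is shorter here because the elements carry a class name; XPath tends to be more useful when an element has to be located by its position or by its relationship to other elements.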

Open up the developer tools on the IMDb site in the browser and hover over the first-ranked series.

span.lister-item-index.unbold.text-primary identifies the relevant element, and the last class, .text-primary, is enough to work with. Use the html_nodes() function to pull out the matching HTML nodes, then extract their text with the html_text() function.

ranking_html <- html_nodes(HTML, '.text-primary')
ranking <- html_text(ranking_html)

The result should look like this.

 [1] "1."  "2."  "3."  "4."  "5."  "6."  "7."  "8."  "9."  "10." "11." "12." "13." "14." "15." "16." "17." "18." "19." "20." "21." "22." "23."
[24] "24." "25." "26." "27." "28." "29." "30." "31." "32." "33." "34." "35." "36." "37." "38." "39." "40." "41." "42." "43." "44." "45." "46."
[47] "47." "48." "49." "50."

Since the process is the same for every field, you can pull out the rating in the same way. It can be found with the .ratings-imdb-rating CSS selector.

rating_html <- html_nodes(HTML, '.ratings-imdb-rating')
rating <- html_text(rating_html)

This one is a bit trickier to get right, as the raw text contains newline characters and padding spaces around each rating value.

 [1] "\n        \n        8.2\n    " "\n        \n        7.0\n    " "\n        \n        5.3\n    " "\n        \n        7.2\n    "
 [5] "\n        \n        8.4\n    " "\n        \n        8.8\n    " "\n        \n        5.9\n    " "\n        \n        6.4\n    "
 [9] "\n        \n        8.3\n    " "\n        \n        7.1\n    " "\n        \n        7.5\n    " "\n        \n        3.7\n    "
[13] "\n        \n        7.4\n    " "\n        \n        8.1\n    " "\n        \n        5.8\n    " "\n        \n        7.7\n    "
[17] "\n        \n        6.8\n    " "\n        \n        7.4\n    " "\n        \n        8.3\n    " "\n        \n        7.5\n    "
[21] "\n        \n        7.6\n    " "\n        \n        6.9\n    " "\n        \n        7.5\n    " "\n        \n        8.7\n    "
[25] "\n        \n        4.5\n    " "\n        \n        7.0\n    " "\n        \n        8.4\n    " "\n        \n        6.7\n    "
[29] "\n        \n        5.2\n    " "\n        \n        8.2\n    " "\n        \n        6.6\n    " "\n        \n        7.3\n    "
[33] "\n        \n        8.8\n    " "\n        \n        5.0\n    " "\n        \n        5.7\n    " "\n        \n        8.4\n    "
[37] "\n        \n        7.6\n    " "\n        \n        6.3\n    " "\n        \n        7.3\n    " "\n        \n        1.9\n    "
[41] "\n        \n        6.8\n    " "\n        \n        8.4\n    " "\n        \n        7.6\n    " "\n        \n        6.9\n    "
[45] "\n        \n        8.5\n    " "\n        \n        8.2\n    " "\n        \n        7.0\n    " "\n        \n        7.0\n    "
[49] "\n        \n        6.9\n    "

The gsub() function can be used to remove unnecessary characters. First remove the \n characters, then the extra spaces.

rating <- gsub("\n", "", rating)
rating <- gsub(" ", "", rating)
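As an alternative sketch, both clean-up steps can be folded into a single gsub() call with a regular expression. The two sample strings are copied from the raw rating output shown earlier; the variable names raw_rating, clean_rating, and rating_num are made up for this example.

```r
# Sample strings copied from the raw rating output shown earlier
raw_rating <- c("\n        \n        8.2\n    ", "\n        \n        7.0\n    ")

# "\\s+" matches any run of whitespace, including spaces and newlines
clean_rating <- gsub("\\s+", "", raw_rating)

# With the padding gone, the strings convert cleanly to numbers
rating_num <- as.numeric(clean_rating)
rating_num
```

This only works because a rating contains no internal whitespace; for text such as descriptions, removing every whitespace character would also glue the words together.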

Now the ratings look fine as well.

 [1] "8.2" "7.0" "5.3" "7.2" "8.4" "8.8" "5.9" "6.4" "8.3" "7.1" "7.5" "3.7" "7.4" "8.1" "5.8" "7.7" "6.8" "7.4" "8.3" "7.5" "7.6" "6.9" "7.5"
[24] "8.7" "4.5" "7.0" "8.4" "6.7" "5.2" "8.2" "6.6" "7.3" "8.8" "5.0" "5.7" "8.4" "7.6" "6.3" "7.3" "1.9" "6.8" "8.4" "7.6" "6.9" "8.5" "8.2"
[47] "7.0" "7.0" "6.9"
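Note that the rating vector has 49 entries while the ranking vector has 50, which suggests that one title on the page had no rating. Vectors extracted separately can then drift out of alignment. One possible way around this, sketched here on hypothetical hand-written markup that is only loosely modeled on the listing page, is to select each list item first and then query inside it with html_node(), which returns one result per item and NA where nothing matches.

```r
library(rvest)

# Hypothetical markup; the second item deliberately has no rating
page <- read_html('
  <div class="lister-item"><h3 class="lister-item-header"><a>Show A</a></h3>
    <div class="ratings-imdb-rating"> 8.2 </div></div>
  <div class="lister-item"><h3 class="lister-item-header"><a>Show B</a></h3></div>
  <div class="lister-item"><h3 class="lister-item-header"><a>Show C</a></h3>
    <div class="ratings-imdb-rating"> 7.0 </div></div>')

# One node per listed title
items <- html_nodes(page, ".lister-item")

# html_node() (singular) yields exactly one result per item, NA if absent
title  <- html_text(html_node(items, ".lister-item-header a"))
rating <- as.numeric(html_text(html_node(items, ".ratings-imdb-rating"), trim = TRUE))

# Both vectors have one entry per item, so rows stay aligned
data.frame(Title = title, Rating = rating)
```

Whether the live page still uses a .lister-item wrapper is an assumption; check the current markup in the developer tools before relying on it.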

The .lister-item-header a CSS Selector allows you to pull out the titles.

title_html <- html_nodes(HTML, '.lister-item-header a')
title <- html_text(title_html)

The output does not need further processing.

 [1] "The Walking Dead"                "A platform"                      "Gretel & Hansel"                 "A láthatatlan ember"             "Odaát"                           "Különös dolgok"
 [7] "Árok"                            "Vadászat"                        "Vaják"                           "Fehér éjszakák"                  "Legacies: A sötétség öröksége"   "The Turning"
[13] "Kulcs a zárját"                  "Amerikai Horror Story"           "Vivarium"                        "Vámpírnaplók"                    "Train to Busan 2"                "Zombieland: A második lövés"
[19] "Álom doktor"                     "A királyság titkai"              "Vonat Busanba: Zombi expressz"   "A világítótorony"                "Fear the Walking Dead"           "Hang nélkül"
[25] "A Hill-ház szelleme"             "Brahms: The Boy II"              "Egy szent szarvas meggyilkolása" "Hétköznapi vámpírok"             "I See You"                       "We Summon the Darkness"
[31] "A sötétség kora"                 "Az - Második fejezet"            "Az"                              "Shingeki no kyojin"              "The Other Lamb"                  "Sea Fever"
[37] "Ragyogás"                        "Sabrina hátborzongató kalandjai" "The Wretched"                    "Örökség"                         "Verotika"                        "Aki bújt"
[43] "A nyolcadik utas: a Halál"       "28 nappal később"                "Expedíció"                       "Hannibal"                        "Castlevania"                     "Ház az erdő mélyén"
[49] "Z világháború"                   "Mi"

The last piece of information you need to pull out is the description.

description_html <- html_nodes(HTML, '.detail.sub-list div.lister-list div.lister-item.mode-advanced div.lister-item-content p.text-muted')
description <- html_text(description_html)

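The raw description output is harder to read. Each title yields two text-muted paragraphs: the first holds the runtime and genres, and the second holds the actual description.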

[1] "\n            \n                44 min\n                 | \n            \nDrama, Horror, Thriller            \n   "
[2] "\n    Sheriff Deputy Rick Grimes wakes up from a coma to learn the world is in ruins and must lead a group of survivors to stay alive"
[3] "\n            \n                94 min\n                 | \n            \nHorror, Sci-Fi, Thriller            \n   "
[4] "\n    A vertical prison with one cell per level. Two people per cell. One only food platform and two minutes per day to feed from up to down. An endlessnightmare trapped in The Hole."
[5] "\n            PG-13\n                 | \n                87 min\n                 | \n            \nFantasy, Horror, Thriller           \n

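The descriptions sit at the even positions, so use a sequence to keep every second element. R vectors are 1-indexed, so the sequence starts at 2.

# Keep elements 2, 4, 6, ... which hold the description text
description <- description[seq(2, length(description), by = 2)]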
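The indexing step can be checked on a small stand-in vector. The vector muted is hypothetical and simply alternates metadata and description text the way the scraped output does.

```r
# Hypothetical vector alternating metadata and description text
muted <- c("44 min | Drama", "Description one", "94 min | Horror", "Description two")

# Even positions hold the descriptions; R vectors are 1-indexed
even_idx <- seq(2, length(muted), by = 2)
muted[even_idx]

# Starting the sequence at 0 gives the same result, but only because
# R silently drops index 0 when subsetting
identical(muted[seq(0, length(muted), 2)], muted[even_idx])
```

Starting at 2 states the intent directly and does not depend on how R treats a zero index.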

Now all that is left is to remove the \n characters and the leading spaces. To remove the spaces, the trimws() function is the perfect choice!

description <- gsub('\n', '', description)
# Strip leading spaces only (the whitespace argument needs R 3.6.0 or later)
description <- trimws(description, which = "left", whitespace = "[ ]")
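The which argument controls which side trimws() cleans. A short sketch on a sample string (shaped like a scraped description after the \n characters are removed, with trailing spaces added for illustration) shows the three options:

```r
# Sample string with leading and trailing spaces (padding added for illustration)
x <- "    Sheriff Deputy Rick Grimes wakes up from a coma   "

left_only  <- trimws(x, which = "left")   # strips leading whitespace only
right_only <- trimws(x, which = "right")  # strips trailing whitespace only
both_sides <- trimws(x)                   # default "both" strips either side

both_sides
```

For scraped text, the default of trimming both sides is usually a safe choice, since padding can appear at either end.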

Now you have all the pieces, and you just need to glue them together as a data frame. The ranking and the rating should be converted to numeric type. Use seq_len(10) to keep only the first 10 entries of each vector.

ranking <- as.numeric(ranking)
rating <- as.numeric(rating)

# Indices of the first 10 entries (R vectors are 1-indexed)
top10 <- seq_len(10)
horrors <- data.frame(Ranking = ranking[top10], Rating = rating[top10],
                      Title = title[top10], Description = description[top10])
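Once the data frame exists, finding the best-rated entry is a matter of sorting. The sketch below uses a three-row stand-in built from the first values of the sample outputs shown earlier; the pairing of titles and ratings is assumed for illustration, not verified against the page.

```r
# Illustrative values taken from the first entries of the sample outputs;
# the pairing of titles and ratings is assumed, not verified
horrors <- data.frame(
  Ranking = 1:3,
  Rating  = c(8.2, 7.0, 5.3),
  Title   = c("The Walking Dead", "A platform", "Gretel & Hansel"),
  stringsAsFactors = FALSE
)

# Sort by rating, highest first, to surface the best-rated entry
best_first <- horrors[order(horrors$Rating, decreasing = TRUE), ]
best_first$Title[1]
```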

Now that your horrors data frame holds all the information together, you could build a visualization on top of this data with the popular ggplot2 package.
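As a rough sketch of what such a visualization might look like, the following draws a horizontal bar chart of rating per title. It assumes ggplot2 is installed, and it uses a small stand-in data frame whose values come from the first entries of the sample outputs; the title and rating pairing is assumed for illustration.

```r
library(ggplot2)

# Stand-in data; values from the sample outputs, pairing assumed
horrors <- data.frame(
  Rating = c(8.2, 7.0, 5.3),
  Title  = c("The Walking Dead", "A platform", "Gretel & Hansel"),
  stringsAsFactors = FALSE
)

# Horizontal bar chart, titles ordered by rating
p <- ggplot(horrors, aes(x = reorder(Title, Rating), y = Rating)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "IMDb rating")
p
```

The coord_flip() call keeps long titles readable by placing them on the vertical axis.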

Conclusion

In this guide you built a small web scraping solution that helps you pick your next series to binge watch. CSS selectors and XPath were introduced, and the solution used the CSS selector approach. You learned how to turn HTML chunks into readable information with the gsub() and trimws() functions. I hope this guide has been informative to you and I would like to thank you for reading it!
