Web crawling is a widely used activity, and its implementation varies greatly depending on the data you are looking for. For example, you might crawl the web to analyze product ratings on Amazon or responses to tweets from famous people. This guide will show you how anyone can get started with web scraping in R. We are going to define what web scraping and crawling are, then see how to pick the best approach for a specific task. This guide assumes a working knowledge of R.
The data you are looking for when scraping can come in many shapes and sizes. The tasks you perform and the way you mine the relevant data depend on the format of the data and the number of sites you are mining from.
The most common forms of data you'll get when scraping are API responses (SOAP, JSON), HTML, and files served over FTP or SFTP. However, data can come in other forms depending on the provider's implementation.
In general, you tell R where to look for the information and how to process what it finds. Web scraping usually means processing a single response from a website, such as raw HTML or JSON.
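As a tiny illustration of that idea, here is a hedged sketch using only base R (the URL is the demo site used later in this guide): fetch a page's raw HTML and pick out pieces of it yourself.

```r
# Fetch raw HTML from a page using base R
html <- readLines("http://r3ap3rpy.pythonanywhere.com/", warn = FALSE)

# Crudely pull out href attributes with a regular expression --
# a real scraper should use a proper HTML parser instead
links <- unlist(regmatches(html, gregexpr('href="[^"]*"', html)))
head(links)
```

This is only a sketch of the concept; the rest of the guide uses the Rcrawler package, which does this work for you.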
Crawling is very similar to what the big search engines do. In the simplest terms, crawling is a method of discovering web links starting from one URL or a list of URLs. Crawling involves a bit of scraping, but when crawling you are only interested in the links to other URLs. Scraping is the more selective of the two: when scraping, you decide exactly what you are looking for, whereas when crawling you pull as many pages from a web server as you can and process them later. The two terms are often mixed up because they are so close in concept.
When crawling the web you need to be aware of robots.txt, a standard sometimes referred to as the robots exclusion standard or robots exclusion protocol. It is the official way for a website to communicate with web crawlers and other web robots, telling them which areas of the site should not be scanned or processed. Some robots refuse to comply with this standard, as it is more of an agreement to adhere to than something websites can enforce. E-mail harvesters and spam bots usually ignore it.
The below example tells the robot to stay away from a specific file:

```
User-agent: *
Disallow: /product/secret_file.html
```
You could tell the robot to behave differently based on its type:
```
User-agent: googlebot
Disallow: /secret/

User-agent: googlebot-news
Disallow: /
```
Here, you disallowed googlebot from accessing the /secret/ folder, and completely disallowed googlebot-news from accessing your site.
After opening up the R console, install the Rcrawler package.
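Assuming you are installing from CRAN, a standard installation looks like this:

```r
# Install Rcrawler and its dependencies from CRAN
install.packages("Rcrawler")
```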
Java is required to use the Rcrawler package. To verify that Java is available, you can issue the java -version command in a terminal:
The output should look something like this:
```
java version "1.8.0_241"
Java(TM) SE Runtime Environment (build 1.8.0_241-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.241-b07, mixed mode)
```
To use the package, you need to load it:
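Loading it works like any other R package:

```r
# Load Rcrawler into the current session
library(Rcrawler)
```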
Before you jump right into crawling, consider the impact you are about to make on a specific site. Some sites implement mechanisms against denial of service attacks. These attacks happen when you request data from the web server faster than it can serve it, overloading its resources. The server may crash, or your IP address could be temporarily banned from reaching the site. You should always be polite and obey the robots exclusion protocol.
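If you want to slow your crawler down further, the Rcrawler documentation describes a RequestsDelay argument; the sketch below assumes that argument is available in your installed version, so check ?Rcrawler before relying on it:

```r
# Hedged sketch of a polite crawl: obey robots.txt, use few connections,
# and (assuming the RequestsDelay argument exists in your Rcrawler version)
# wait about two seconds between requests
Rcrawler(Website = "http://r3ap3rpy.pythonanywhere.com/",
         no_cores = 2, no_conn = 2,
         Obeyrobots = TRUE,
         RequestsDelay = 2)
```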
For your first crawl, you may want to pick a site that is small in size.
```r
Rcrawler(Website = "http://r3ap3rpy.pythonanywhere.com/",
         no_cores = 4, no_conn = 4, Obeyrobots = TRUE)
```
The output should look something like this:
```
Preparing multihreading cluster ..
In process : 1.. Progress: 100.00 % : 1 parssed from 1 | Collected pages: 1 | Level: 1
+ Check INDEX dataframe variable to see crawling details
+ Collected web pages are stored in Project folder
+ Project folder name : r3ap3rpy.pythonanywhere.com-070950
+ Project folder path : C:/Users/dszabo/Documents/r3ap3rpy.pythonanywhere.com-070950
```
The Obeyrobots argument is FALSE by default; setting it to TRUE tells Rcrawler to obey the robots.txt file's contents. The no_cores argument tells R how many cores to run the crawling activity on, and no_conn sets how many simultaneous connections each core can make. Each crawled webpage is downloaded to your local drive under your Documents directory, into a new folder named after the site you are crawling.
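As the crawler's output notes, details of the crawl are collected in an INDEX data frame created in your workspace. A quick sketch of inspecting it after a crawl (the exact columns may vary between Rcrawler versions, so treat this as an assumption to verify):

```r
# After Rcrawler() finishes, an INDEX data frame exists in the global environment
head(INDEX)    # one row per crawled page: id, URL, HTTP status, depth, etc.
nrow(INDEX)    # how many pages were collected
```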
The Rcrawler package provides many functions that let you target sites with specific information to extract. It is always wise to consult the documentation.
Let's say you would like to extract only the links from the site, including links that are external, meaning that they point to other sites.
```r
page <- LinkExtractor(url = "http://r3ap3rpy.pythonanywhere.com/",
                      ExternalLInks = TRUE)
```
Now the page variable holds the following values:

```
$Info
$Info$Id
112

$Info$Url
"http://r3ap3rpy.pythonanywhere.com/"

$Info$Crawl_status
"finished"

$InternalLinks
"http://r3ap3rpy.pythonanywhere.com/"
"http://r3ap3rpy.pythonanywhere.com/github"
"http://r3ap3rpy.pythonanywhere.com/ytube"
"http://r3ap3rpy.pythonanywhere.com/udemy"
"http://r3ap3rpy.pythonanywhere.com/education"
"http://r3ap3rpy.pythonanywhere.com/experience"
"http://r3ap3rpy.pythonanywhere.com/certificates"

$ExternalLinks
"https://r3ap3rpy.github.io/"
"http://shortenpy.pythonanywhere.com/"
"https://twitter.com/r3ap3rpy"
"https://www.linkedin.com/in/d%C3%A1niel-ern%C5%91-szab%C3%B3-081359157"
"https://github.com/r3ap3rpy"
"https://www.youtube.com/channel/UC1qkMXH8d2I9DDAtBSeEHqg"
```
Now you can reference the external links through the ExternalLinks element of the result.
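For example, a minimal follow-up using the page object from the LinkExtractor() call above:

```r
# LinkExtractor() returns a list; ExternalLinks holds the outbound URLs
external <- page$ExternalLinks
length(external)   # number of external links found
print(external)    # the links themselves
```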
The art of web scraping and crawling is a potent skill to master. It allows us to look behind the curtain, calculate our own statistics, and analyze websites from a different perspective. In this guide we covered the main difference between scraping and crawling, and what robots.txt is used for. Through a demonstration, we gained insight into how the Rcrawler package can be used to extract information from a site or crawl it all the way through. I hope this guide has been informative to you, and I would like to thank you for reading it.