Extracting structured data from a website can be implemented with the requests and Beautiful Soup libraries or with the Scrapy framework. Both approaches are sufficient for a static webpage, but Scrapy is the more compelling choice in terms of features: it has built-in support for downloading and processing content while applying crawling restrictions, whereas Beautiful Soup is only capable of parsing and extracting data.
Scrapy is an open-source Python framework developed specifically for scraping websites.
Scrapy offers a base structure for writing your own spider or crawler. Both can be used for scraping, though a crawler adds built-in support for recursive scraping by following the extracted URLs. This guide demonstrates the application and various features of Scrapy by extracting data from the GitHub Trending page to collect the details of repositories.
Most sites provide a robots.txt file under their root URL (URL/robots.txt) which defines the access policies for the website or its sub-directories, like:
```text
User-Agent: *
Disallow: /posts/

User-agent: Googlebot
Allow: /*/*/main
Disallow: /raw/*
```
Here Googlebot is allowed to access the main pages but not the raw sub-directory, and the posts section is off-limits to all bots. These rules must be followed to avoid getting blocked by the website. Additionally, you should use a delay between requests so that you do not hit the website constantly, which could degrade its performance.
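If you want to verify these rules programmatically before crawling, Python's standard library includes a robots.txt parser. Below is a minimal sketch, with the GitHub URLs used only as an example:

```python
from urllib.robotparser import RobotFileParser

# download and parse the site's robots.txt
parser = RobotFileParser()
parser.set_url("https://github.com/robots.txt")
parser.read()

# can_fetch() returns True only if the given user agent may access the URL
print(parser.can_fetch("*", "https://github.com/trending"))
```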
Before going further, it is advisable to have a basic understanding of Python, HTML, and CSS.

Use the pip command to install the required package:

```shell
pip install scrapy
```
Execute the below command to create a Scrapy project:

```shell
scrapy startproject github_trending_bot
```
The startproject command will create a github_trending_bot directory in the current directory. Use the cd command to change into it, and pwd (or cd on its own in Windows) to check the name of the current directory. The github_trending_bot directory has the following contents:
```text
github_trending_bot/
    scrapy.cfg              # configuration parameters for deployment

    github_trending_bot/    # python module, can be used for imports
        __init__.py         # required to import submodules in current directory
        __pycache__         # contains bytecode of compiled code
        items.py            # custom classes (like dictionaries) to store data
        middlewares.py      # middlewares are like interceptors to process request/response
        pipelines.py        # classes to perform tasks on scraped data, returned by spiders
        settings.py         # settings related to delay, cookies, used pipelines, etc.

        spiders/            # directory to store all spider files
            __init__.py
```
A spider is a class responsible for extracting information from a website. It also contains additional settings to apply or restrict the crawling process to specific domain names. To create a spider, use the genspider command as follows:
```shell
cd github_trending_bot    # move to the github_trending_bot project directory
scrapy genspider GithubTrendingRepo github.com/trending/
```
The above command will create a GithubTrendingRepo.py
file, shown below:
```python
# -*- coding: utf-8 -*-
import scrapy


class GithubtrendingrepoSpider(scrapy.Spider):
    name = 'GithubTrendingRepo'
    allowed_domains = ['github.com/trending/']
    start_urls = ['http://github.com/trending//']

    def parse(self, response):
        pass
```
As you may have already inferred, the GithubtrendingrepoSpider class is a subclass of scrapy.Spider, and each spider should have at least two properties:

- name: uniquely identifies the spider and is used by the crawl command.
- parse(self, response): invoked with the result of each request made to the URLs in start_urls. The response object contains the HTML text response, HTTP status code, source URL, etc.

Currently, the parse() function does nothing. Replace it with the code below to view the response content:

```python
def parse(self, response):
    print("%s : %s : %s" % (response.status, response.url, response.text))
```
Add ROBOTSTXT_OBEY = False to the settings.py file, because by default the crawl command verifies requests against robots.txt, and a True value will result in a forbidden-access response.
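The relevant line in settings.py looks like this:

```python
# github_trending_bot/settings.py
ROBOTSTXT_OBEY = False   # do not filter requests through robots.txt for this project
```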
Use the crawl command with the spider name to execute the project:

```shell
scrapy crawl GithubTrendingRepo
```
You can also skip the startproject and crawl commands altogether: write your spider class in a standalone Python script and run that spidername.py file directly with the runspider command:

```shell
scrapy runspider github_trending_bot/spiders/GithubTrendingRepo.py
```
Use the
scrapy fetch URL
command to view the HTML response from a URL for testing purposes.
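For example, the fetched HTML can be redirected to a file for inspection (the --nolog flag suppresses Scrapy's log output):

```shell
scrapy fetch --nolog https://github.com/trending > trending_page.html
```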
Extracting data is one of the crucial and common tasks that occur while scraping a website. Every HTML element can be found using either unique CSS properties or an XPath expression, as shown below:
```html
<!DOCTYPE html>
<html>
  <head>
    <title><b>CSS Demo</b></title>
    <style type="text/css">
      div {                      /* style for tag */
        padding: 8px;
      }
      .container-alignment {     /* style using css classes */
        text-align: center;
      }
      .header {
        color: darkblue;
      }
      #container {               /* style using ids */
        color: rgba(0, 136, 218, 1);
        font-size: 20px;
      }
    </style>
  </head>
  <body>
    <div id="container" class="container-alignment">
      <p class="header"> CSS Demonstration</p>
      <p>CSS Description paragraph</p>
    </div>
  </body>
</html>
```
In the above example, the header p
tag can be accessed using the header
class.
The CSS selector syntax:

- :: : a double colon selects a tag's attribute or text. a::attr(href) selects the href attribute of an anchor tag, and p::text selects the text of a paragraph tag.
- space : a space selects an inner (descendant) tag, as in title b::text.

The XPath expression syntax:

- / : selects a node starting from the root. /html/body/div[1] will find the first div.
- // : selects matching nodes at any depth. //p[1] will find the first p element.
- [@attributename='value'] : finds the node whose attribute has the required value. //div[@id='container'] will find the div element whose id is container.

A simple way to get the XPath of an element is via the inspect element option in the browser: right-click on the desired node and choose the copy XPath option.
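To see how these selectors behave, you can experiment with Scrapy's Selector class directly on the HTML snippet above. This is a small standalone sketch, not part of the project code:

```python
from scrapy.selector import Selector

html = """
<div id="container" class="container-alignment">
    <p class="header"> CSS Demonstration</p>
    <p>CSS Description paragraph</p>
</div>
"""

selector = Selector(text=html)
print(selector.css('p.header::text').get())                         # CSS: select by class
print(selector.xpath("//div[@id='container']/p[2]/text()").get())   # XPath: select by id and position
```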
Read more about XPath to learn how to combine multiple attributes or use its supported functions.
The scrapy.http.TextResponse object has a css(query) method which takes a string query and finds all matches for the given CSS pattern. To extract text with a CSS selector, pass a tag_name::text query to the css(query) method, which returns a SelectorList object, and then call get() to fetch the text of the first matched tag. Use the code below to extract the text of the title tag:

```python
def parse(self, response):
    title_text = response.css('title::text')
    print(title_text.get())
```
This can also be done using xpath:

```python
title_text = response.xpath('//title[1]/text()')
print(title_text.get())
```
//title[1]/text() selects the text node under the first title tag and can be broken down as follows:

- // : starts the search from the root of the document (the html tag) and matches nodes at any depth.
- / : indicates a child node in the hierarchical path.
- text() : indicates a text node.
Both the css and xpath methods return a SelectorList object, which itself supports css, xpath, and re (regex) methods for further data extraction.
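As a small illustration (continuing the title example above, inside a spider's parse method), the SelectorList returned by css() can be refined with chained xpath() and re() calls:

```python
def parse(self, response):
    # css() returns a SelectorList; xpath() and re() can be chained on it
    head = response.css('head')                             # SelectorList holding the <head> element
    title_text = head.xpath('./title/text()').get()         # text of the first matched title
    title_words = head.xpath('./title/text()').re(r'\w+')   # regex applied to the matched text
    print(title_text, title_words)
```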
- response.css('a::attr(href)').getall() : finds all a (anchor) tags and returns a list of their href attribute values.
- response.xpath('//a/@href').getall() : finds all a (anchor) tags from the root node and returns a list of their href attribute values.

```python
css_links = response.css('a::attr(href)').getall()
xpath_links = response.xpath('//a/@href').getall()
# print length of the lists
print(len(css_links))
print(len(xpath_links))
```
LxmlLinkExtractor can be used as a filter to extract a list of specific URLs, as shown below:

```python
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor  # import required

trending_links = LxmlLinkExtractor(
    allow=r'^https://[a-z.]+/[a-z.]+$',
    deny_domains=['shop.github.com', 'youtube.com', 'twitter.com'],
    unique=True
).extract_links(response)
for link in trending_links:
    print("%s : %s " % (link.url, link.text))
```

The type of each list element is scrapy.link.Link.
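Besides url and text, each scrapy.link.Link object also exposes fragment and nofollow attributes, for example:

```python
for link in trending_links:
    # a scrapy.link.Link carries url, text, fragment and nofollow attributes
    print(link.url, link.text, link.fragment, link.nofollow)
```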
Often it is necessary to extract links from a webpage and then extract further data from those extracted links. This process can be implemented using the CrawlSpider class, which provides a built-in mechanism to generate requests from extracted links. CrawlSpider also supports crawling rules: each Rule defines how links should be extracted from a page and how the resulting responses should be processed (via a callback).

Every Rule object takes an LxmlLinkExtractor object as a parameter, which is used to filter the links. LxmlLinkExtractor is a subclass of FilteringLinkExtractor, which does most of the filtering work. A Rule object can be created as follows:
```python
Rule(
    LxmlLinkExtractor(restrict_xpaths=["//ol[@id='repo-list']//h3/a"],
                      allow_domains=['github.com']),
    callback='parse'
)
```
- restrict_xpaths will extract links only from the HTML elements matched by this XPath.
- Here, LxmlLinkExtractor will extract links from an ordered list whose id is repo-list.
- //h3/a : // finds the h3 heading elements under the current element, i.e. the ol, and /a indicates that each h3 should have a direct anchor tag child; the extractor then reads the href attribute of that anchor.
LxmlLinkExtractor has various useful optional parameters: allow and deny to match link patterns, allow_domains and deny_domains to define desired and undesired domain names, and tags and attrs to match specific tag and attribute values. It also supports restrict_css.
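As an illustration, several of these parameters can be combined in one extractor. The patterns, domains, and CSS region below are assumptions chosen to mirror the examples in this guide, and response is the object available inside a spider callback:

```python
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

repo_link_extractor = LxmlLinkExtractor(
    allow=r'https://github\.com/[\w-]+/[\w-]+$',   # URL pattern to accept
    deny=(r'/login', r'/join'),                    # URL patterns to reject (assumed)
    allow_domains=['github.com'],                  # only keep links on this domain
    restrict_css=['ol#repo-list'],                 # same region targeted earlier via restrict_xpaths
    tags=['a'], attrs=['href'],                    # the defaults, shown explicitly
    unique=True,
)
repo_links = repo_link_extractor.extract_links(response)
```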
Follow the steps below to create a custom crawler that fetches the trending repositories:

1. Create the Githubtrendingrepocrawler class by extending CrawlSpider.
2. Define the name and start_urls properties.
3. Define the rules tuple of Rule objects.
4. Use the callback property of a Rule to provide the method that handles the result of a URL matched by that Rule.

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field


class PageContentItem(Item):  # a data storage class (like a dictionary) to store the extracted data
    url = Field()
    content = Field()

class Githubtrendingrepocrawler(CrawlSpider):  # 1
    name = 'GithubTrendingRepoCrawler'  # 2
    start_urls = ['http://github.com/trending/']  # 2

    # 3
    rules = (
        # extract links from this path only
        Rule(
            LxmlLinkExtractor(restrict_xpaths=["//ol[@id='repo-list']//h3/a"],
                              allow_domains=['github.com']),
            callback='parse'
        ),
        # links should match this pattern to create new requests
        Rule(
            LxmlLinkExtractor(allow=r'https://github.com/[\w-]+/[\w-]+$', allow_domains=['github.com']),
            callback='parse_product_page'
        ),
    )

    # 4
    def parse_product_page(self, response):
        item = PageContentItem()
        item['url'] = response.url
        item['content'] = response.css('article').get()
        yield item
```
yield is like a return statement in that it sends a value back to the caller, but it does not stop the execution of the method. Instead, yield produces a generator that can be consumed later: if the body of a function contains yield, the function automatically becomes a generator function.
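A tiny, Scrapy-independent example of how yield turns a function into a generator:

```python
def count_up_to(limit):
    n = 1
    while n <= limit:
        yield n        # hand a value back to the caller, then resume here on the next iteration
        n += 1

counter = count_up_to(3)   # calling the function returns a generator object
print(list(counter))       # [1, 2, 3]
```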
Run this crawler using the scrapy command:

```shell
scrapy crawl GithubTrendingRepoCrawler
```
If a link matches multiple rules, the first matching Rule object is applied.
To continue crawling through previously extracted links, just use follow=True in the second Rule:

```python
rules = (
    Rule(
        LxmlLinkExtractor(allow=r'https://github.com/[\w-]+/[\w-]+$', allow_domains=['github.com']),
        callback='parse_product_page', follow=True  # continue crawling through the previously extracted links
    ),
)
```
DEPTH_LIMIT = 1 can be added to the settings.py file to control how many levels deep the extraction goes; a value of 1 means links found on newly extracted pages will not be followed further. Considering the number of repositories and the links in their descriptions, this process can continue for hours or days, so use Command + C or Control + C to stop the process in the terminal.
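The same setting can also be supplied for a single run from the command line instead of editing settings.py:

```shell
scrapy crawl GithubTrendingRepoCrawler -s DEPTH_LIMIT=1
```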
Scrapy provides a convenient way to store the yielded items in a separate file using the -o option:

```shell
scrapy crawl GithubTrendingRepoCrawler -o extracted_data_files/links_JSON.json
```

Here extracted_data_files is a folder and .json is the output file format. Scrapy also supports the .csv and .xml formats.
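For example, the same crawl can be exported to CSV or XML simply by changing the file extension (the file names here are arbitrary):

```shell
scrapy crawl GithubTrendingRepoCrawler -o extracted_data_files/repos.csv
scrapy crawl GithubTrendingRepoCrawler -o extracted_data_files/repos.xml
```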
The Scrapy shell can be used to test selectors interactively against a live response:

```shell
scrapy shell 'https://github.com'
>>> response.css('title::text')
```
The settings.py file provides information about the bot and contains flags like DEPTH_LIMIT, USER_AGENT, ROBOTSTXT_OBEY, DOWNLOAD_DELAY, etc. to control the behavior of the bots executed via the crawl command.
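An illustrative settings.py excerpt (the values shown are assumptions, not recommendations):

```python
# github_trending_bot/settings.py (illustrative excerpt)
USER_AGENT = 'github_trending_bot (+https://example.com/contact)'  # hypothetical contact URL
DOWNLOAD_DELAY = 2   # seconds to wait between requests to the same domain
DEPTH_LIMIT = 1      # do not follow links found on newly extracted pages
```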
An XPath pattern also supports placeholders; the values of these placeholders can be supplied as keyword arguments:

```python
response.xpath('//div[@id=$value]/a/text()', value='content').get()
```
The re method applies a regular expression to the extracted data:

```python
response.xpath('//a/@href').re(r'https://[a-zA-Z]+\.com')  # [a-zA-Z]+ means one or more letters
```
The code is available in the github_trending_bot repo for demonstration and practice. Next, you can try to implement a news website crawler for fun.