Gaurav Singhal

Crawling the Web with Python and Scrapy

  • Jun 25, 2019
  • 19 Min read
  • 103 Views
Data
Python

Introduction

Web scraping has become popular over the last few years because it is an effective way to extract information from websites so that it can be used for further analysis.

If you are new to web scraping, check out my previous guide on extracting data with Beautiful Soup.

According to the documentation on Scrapy:

Scrapy is an application framework for crawling websites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing, or historical archival.

In this guide, we will learn how to scrape products from a Zappos product listing page. We will scrape the men's running shoes listing, which is paginated at 100 products per page, and then export the data to a CSV file. Scraping a paginated website is not always easy, and this guide lays the groundwork for handling such sites. Zappos is only an example; the same technique can be used on many other websites, such as Amazon.

Why Scrapy?

Beautiful Soup is widely used for scraping, but it is best suited to small-scale jobs (static HTML pages). Remember, Beautiful Soup is only a parsing library that parses the HTML document. However, it is easy to learn, so you can quickly use it to extract the data you want.

On the other hand, Scrapy is a web crawling framework that gives developers a complete toolkit for scraping. In Scrapy, we create spiders, which are Python classes that define how a particular site (or group of sites) will be scraped. So, if you want to build a robust, scalable, large-scale scraper, Scrapy is a good choice.

The biggest advantage of Scrapy is that it is built on top of Twisted, an asynchronous networking library that lets you write non-blocking code for concurrency, which improves spider performance considerably.
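To make the idea of non-blocking concurrency concrete, here is a minimal sketch. It uses Python's standard asyncio module as a stand-in, not Twisted and not Scrapy's actual internals, and the fake_download function and example.com URLs are hypothetical. It only illustrates that several waits on the network can overlap instead of running one after another.

```python
import asyncio


async def fake_download(url, delay, state):
    # Track how many "downloads" are waiting at the same time.
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(delay)  # stands in for waiting on the network
    state["in_flight"] -= 1
    return url


async def crawl(urls):
    state = {"in_flight": 0, "peak": 0}
    # gather() starts every download before any of them finishes.
    results = await asyncio.gather(
        *(fake_download(url, 0.05, state) for url in urls)
    )
    return results, state["peak"]


def run_demo():
    urls = [f"https://example.com/page/{n}" for n in range(3)]
    return asyncio.run(crawl(urls))


results, peak = run_demo()
print(f"{len(results)} pages fetched, up to {peak} in flight at once")
```

In a blocking design the three waits would add up; here they overlap, which is the property Scrapy gets from Twisted.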

Getting Started

Before getting started with this guide, make sure you have Python 3 installed on your system. If not, you can install it from here.

The next thing you need is the Scrapy package. Let's install it with pip.

pip3 install scrapy
shell

Note: If you are using Windows, use pip instead of pip3.

For Windows users only:

If you get the error "Microsoft Visual C++ 14.0 is required" while installing the Twisted library, you need to install the C++ build tools from the link below.

Under Downloads you will find Tools for Visual Studio 2019. From there, download Build Tools for Visual Studio 2019. After downloading and running the installer, install the Visual C++ build tools, which take up almost 1.5 GB.
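Once the installation finishes on any platform, a quick check from Python can confirm that the package is importable. The sketch below uses only the standard library; is_installed is a hypothetical helper name, not part of Scrapy.

```python
import importlib.util


def is_installed(package):
    """Return True if the named top-level package can be imported."""
    return importlib.util.find_spec(package) is not None


print("scrapy available:", is_installed("scrapy"))
```

If this prints False, the package was probably installed for a different Python interpreter than the one you are running.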

Jumping to the Code

Now that you have installed Scrapy on your system, let us jump into a simple example. As discussed in the Introduction, we will scrape the Zappos product list page for the keywords men running shoes, which is available in paginated form.

Step 1: Start a New Project

Since Scrapy is a framework, we need to follow some standards of the framework. To create a new project in Scrapy, use the command startproject. I have named my project tutorial.

scrapy startproject tutorial
shell

This will create a tutorial directory with the following contents:

tutorial
├── scrapy.cfg          -- deploy configuration file of scrapy project
└── tutorial            -- your scrapy project module.
    ├── __init__.py     -- module initializer(empty file)
    ├── items.py        -- project item definition py file
    ├── middlewares.py  -- project middleware py file
    ├── pipelines.py    -- project pipeline py file
    ├── settings.py     -- project settings py file
    └── spiders         -- directory where spiders are kept
        └── __init__.py
docs
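The settings.py file listed above is where crawl behavior is configured. As a hedged sketch, these are a few real Scrapy settings one might adjust to crawl politely; the specific values are illustrative assumptions, not recommendations from this guide.

```python
# tutorial/settings.py (excerpt) -- illustrative values only

# Respect the site's robots.txt rules (set by the project template).
ROBOTSTXT_OBEY = True

# Minimum delay, in seconds, between requests to the same site.
DOWNLOAD_DELAY = 1

# Upper limit on simultaneous requests to a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let Scrapy adapt the delay to how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
```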

Step 2: Analyze the Website

The next important step in scraping is to analyze the content of the web page you want to scrape: identify how the information can be retrieved from the HTML by looking for what makes the desired element unique.

To inspect the page in Chrome, right-click on the page and choose Inspect to open Developer Tools.

Product page overview

In this example, we intend to scrape the information about each product from the list of products. Every product's information sits inside an article tag. The sample HTML layout of a product (article) is:

<article>
    <a aria-label="" itemprop="url" href="PRODUCT URL HERE">
        <meta itemprop="image" content="">
        <div>
            <span>
                <img src="PRODUCT IMG SRC HERE" alt="alt tag" >
            </span>
        </div>
    </a>
    <div>
        <button>
        </button>
        <p itemprop="brand">
            <span itemprop="name">PRODUCT BY HERE</span>
        </p>
        <p itemprop="name">PRODUCT NAME HERE</p>
        <p><span>PRODUCT PRICE HERE</span></p>
        <p>
            <span itemprop="aggregateRating" data-star-rating="PRODUCT RATING HERE">
                <meta><meta>
                <span></span>
                <span class="screenReadersOnly"></span>
            </span>
        </p>
    </div>
</article>
html

From the above HTML code snippet, we are going to scrape the following things from each product:

  • Product Name
  • Product by
  • Product price
  • Product stars
  • Product image url
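One way to pin down these five fields before writing any selectors is a small dataclass. This is only a sketch: the Product class and its field names are assumptions made for illustration, and the spider in this guide yields plain dictionaries instead. The sample values are taken from the output shown later in this guide.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Product:
    """One product card; every field may be missing on the page."""
    name: Optional[str] = None
    by: Optional[str] = None
    price: Optional[str] = None
    stars: Optional[str] = None
    img_url: Optional[str] = None


example = Product(name="Motion 7", by="Newton Running", price="$131.25", stars="4")
print(asdict(example))
```

Writing the record down first makes it easier to notice a field that a selector silently fails to fill.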

Step 3: Creating Our First Spider

Now let's create our first spider. To create a new spider, you can use the genspider command, which takes the spider name and the domain to crawl as arguments.

scrapy genspider zappos www.zappos.com
shell

After you run the above command, you will notice that a new .py file has been created in your spiders folder.

In that file, you will see a class named ZapposSpider, which inherits from scrapy.Spider and contains a method named parse, which we will discuss in the next step.

import scrapy


class ZapposSpider(scrapy.Spider):
    name = 'zappos'
    allowed_domains = ['www.zappos.com']
    start_urls = ['http://www.zappos.com/']


    def parse(self, response):
        pass
python

To run a spider, you can use either the crawl command or the runspider command.

The crawl command takes the spider name as an argument:

scrapy crawl zappos
shell

Or you can use the runspider command. This command will take the location of the spider file.

scrapy runspider tutorial/spiders/zappos.py
shell

After you run either of the above commands, you will see output in the terminal like this:

2019-06-17 15:45:11 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: tutorial)
2019-06-17 15:45:11 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.6.7 (default, Oct 22 2018, 11:32:17) - [GCC 8.2.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c  28 May 2019), cryptography 2.7, Platform Linux-4.18.0-21-generic-x86_64-with-Ubuntu-18.04-bionic
2019-06-17 15:45:11 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
2019-06-17 15:45:11 [scrapy.extensions.telnet] INFO: Telnet Password: 8ddf42dffb5b3d2f
2019-06-17 15:45:11 [scrapy.middleware] INFO: Enabled extensions:
[...]
2019-06-17 15:45:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
[...]
2019-06-17 15:45:11 [scrapy.middleware] INFO: Enabled spider middlewares:
[...]
2019-06-17 15:45:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-06-17 15:45:11 [scrapy.core.engine] INFO: Spider opened
2019-06-17 15:45:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-06-17 15:45:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-06-17 15:45:13 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.zappos.com/robots.txt> (referer: None)
2019-06-17 15:45:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zappos.com/> (referer: None)
2019-06-17 15:45:13 [scrapy.core.engine] INFO: Closing spider (finished)
2019-06-17 15:45:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{
    ...
}
2019-06-17 15:45:13 [scrapy.core.engine] INFO: Spider closed (finished)
shell

Step 4: Extracting the Data from the Page

Now, let's write our parse method. Before that, we have to change start_urls to the URL of the web page that we wish to scrape.

We will use CSS selectors in this guide, since CSS is the easiest option for iterating over the products. The other commonly used option is XPath selectors. For more information about Scrapy selectors, refer to this documentation.

As discussed in Step 2, inspecting the elements on the web page shows that every product is wrapped in an article tag. So, we loop through each article tag and extract the product information from each product object.

The product object holds all the information about a single product. We can run further selectors on the product object to find specific details. Let's start by extracting only the product name from each product on the first page.

# -*- coding: utf-8 -*-
import scrapy


class ZapposMenShoesSpider(scrapy.Spider):
    name = "zappos"
    start_urls = ['https://www.zappos.com/men-running-shoes']
    allowed_domains = ['www.zappos.com']

    def parse(self, response):
        for product in response.css("article"):
            yield {
                "name": product.css("p[itemprop='name']::text").extract_first()
            }
python

You’ll notice the following things going on in the above code:

  • We use the selector p[itemprop='name'] to fetch the product name. It says, "Within the product object, find the p tag whose itemprop attribute is set to name."

  • We append ::text to the selector because we only want the text enclosed between the tags. ::text is a pseudo-element that Scrapy adds on top of standard CSS selectors.

  • We call extract_first() on the object returned by product.css(CSS_SELECTOR) because we only want the first element that matches the selector. This gives us a single string rather than a list of all matching elements.
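The difference between the first match and all matches can be tried out without running a crawl. The sketch below is a stand-in, not Scrapy itself: it uses the limited XPath support in the standard library's xml.etree.ElementTree on a trimmed, well-formed copy of one product card, with sample values taken from this guide's output. In a real spider you would keep using product.css(...).

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for one product card.
SAMPLE = """
<article>
  <p itemprop="brand"><span itemprop="name">Newton Running</span></p>
  <p itemprop="name">Motion 7</p>
</article>
"""

product = ET.fromstring(SAMPLE)

# Roughly what p[itemprop='name']::text with extract_first() does:
# find() returns the first matching element, or None when nothing matches.
name_node = product.find(".//p[@itemprop='name']")
name = name_node.text if name_node is not None else None

# Dropping the tag name matches any element with itemprop="name",
# and findall() returns every match in document order.
all_names = [node.text for node in product.findall(".//*[@itemprop='name']")]

print(name)
print(all_names)
```

The second lookup returns the brand as well as the product name, which is why the selector in the spider is qualified with the p tag.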

Save the spider file and run the scraper again:

scrapy crawl zappos
shell

This time, you will see the names of all 100 products listed on the first page appear in the output:

...
...
{'name': 'Motion 7'}
2019-06-18 00:09:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zappos.com/men-running-shoes>
{'name': 'Fate 5'}
2019-06-18 00:09:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zappos.com/men-running-shoes>
{'name': 'Gravity 8'}
2019-06-18 00:09:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zappos.com/men-running-shoes>
{'name': 'Distance 8'}
2019-06-18 00:09:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zappos.com/men-running-shoes>
...
...
shell

Now, let's expand the yielded dictionary by adding the price, stars, by, and image URL.

  • by: To extract the product's brand, use the selector p[itemprop='brand'] span[itemprop='name']::text. It says: within the product object, find the p tag whose itemprop attribute is set to brand, then find its descendant span whose itemprop attribute is set to name.
  • price: For the price, use the selector p span::text. Note that this selector returns two matches, so we take the second one (index 1).
  • stars: The star rating is stored in an attribute value. The selector is p span[itemprop='aggregateRating']::attr('data-star-rating'). It says: within the product object, find the span inside a p tag whose itemprop attribute is set to aggregateRating, then extract the value of its data-star-rating attribute.
  • image url: To extract the src attribute from the img tag, use the selector div span img::attr('src').

The updated spider looks like this:

# -*- coding: utf-8 -*-
import scrapy


class ZapposMenShoesSpider(scrapy.Spider):
    name = "zappos"
    start_urls = ['https://www.zappos.com/men-running-shoes']
    allowed_domains = ['www.zappos.com']

    def parse(self, response):
        for product in response.css("article"):
            yield {
                "name": product.css("p[itemprop='name']::text").extract_first(),
                "by": product.css("p[itemprop='brand'] span[itemprop='name']::text").extract_first(),
                "price": product.css("p span::text").extract()[1],
                "stars": product.css(
                    "p span[itemprop='aggregateRating']::attr('data-star-rating')"
                ).extract_first(),
                "img-url": product.css(
                    "div span img::attr('src')").extract_first()
            }
python
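One caveat about the price lookup: extract() returns a list, so indexing it with [1] raises an IndexError if a product card ever has fewer than two matching spans. A small guard avoids that. The helper below is a sketch; nth_or_none is a hypothetical name, not a Scrapy function.

```python
def nth_or_none(values, index):
    """Return values[index], or None when the list is too short."""
    return values[index] if 0 <= index < len(values) else None


# In the spider, the price line could then read:
#   "price": nth_or_none(product.css("p span::text").extract(), 1),

print(nth_or_none(["Newton Running", "$131.25"], 1))
print(nth_or_none(["Newton Running"], 1))
```

With this guard a malformed card produces a missing price instead of stopping the parse callback for the whole page.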

This time, you will see all the information about each product listed on the first page appear in the output:

...
...
{'name': 'Motion 7', 'by': 'Newton Running', 'price': '$131.25', 'stars': '4', 'img-url': 'https://m.media-amazon.com/images/I/81e878wBGeL._AC_SX255_.jpg'}
2019-06-18 00:36:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zappos.com/men-running-shoes>
{'name': 'Fate 5', 'by': 'Newton Running', 'price': '$140.00', 'stars': None, 'img-url': 'https://m.media-amazon.com/images/I/81nVby4s0lL._AC_SX255_.jpg'}
2019-06-18 00:36:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zappos.com/men-running-shoes>
{'name': 'Gravity 8', 'by': 'Newton Running', 'price': '$175.00', 'stars': None, 'img-url': 'https://m.media-amazon.com/images/I/81-iMPpgxrL._AC_SX255_.jpg'}
2019-06-18 00:36:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zappos.com/men-running-shoes>
{'name': 'Distance 8', 'by': 'Newton Running', 'price': '$155.00', 'stars': '5', 'img-url': 'https://m.media-amazon.com/images/I/81PT5DJEVCL._AC_SX255_.jpg'}
2019-06-18 00:36:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zappos.com/men-running-shoes>
...
...
shell
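Every value in this output is a string, or None when a product has no rating. If you plan to analyze the data, you may want to convert the price and stars to numbers. The following is a minimal sketch; parse_price and parse_stars are hypothetical helpers, and they assume prices are US dollar strings like the ones shown above.

```python
from decimal import Decimal


def parse_price(raw):
    """Convert a string such as '$131.25' to a Decimal; pass None through."""
    if raw is None:
        return None
    return Decimal(raw.replace("$", "").replace(",", "").strip())


def parse_stars(raw):
    """Convert a rating string such as '4' to an int; pass None through."""
    return int(raw) if raw is not None else None


item = {"name": "Fate 5", "price": "$140.00", "stars": None}
cleaned = {
    **item,
    "price": parse_price(item["price"]),
    "stars": parse_stars(item["stars"]),
}
print(cleaned)
```

Decimal is used instead of float so that prices keep their exact two decimal places.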

To save the output to a file, use the -o flag followed by the filename when running the spider.

Scrapy can export the extracted items in several file formats. Some of the commonly used ones are:

  • CSV
  • JSON
  • XML
  • pickle

For example, let's export the data into a CSV file.

scrapy crawl zappos -o zappos.csv
shell
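After the export, the CSV file can be read back with Python's standard csv module. The sketch below assumes the header row uses the dictionary keys from the spider (name, by, price, stars, img-url); load_products is a hypothetical helper, and the demo writes a two-row stand-in file so that it runs without a real crawl.

```python
import csv
import os
import tempfile


def load_products(path):
    """Read a CSV export into a list of dictionaries, one per product."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Two-row stand-in file; point load_products at zappos.csv after a real crawl.
SAMPLE = (
    "name,by,price,stars,img-url\n"
    "Motion 7,Newton Running,$131.25,4,\n"
    "Fate 5,Newton Running,$140.00,,\n"
)

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as handle:
        handle.write(SAMPLE)
    rows = load_products(sample_path)

for row in rows:
    print(row["name"], row["price"], repr(row["stars"]))
```

Note that an empty cell reads back as an empty string, so check for that rather than None when processing the file.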

Step 5: Crawling Multiple Pages

We have successfully extracted products for the first page. Now, let's extend our spider so that it navigates to all the available pages for the given keyword by fetching the next page URL.

You will notice a Next Page link at the bottom of the page:

Next page link overview

The link is rendered by the following element:

<a rel="next" href="/men-running-shoes/.zso?t=men running shoes&amp;p=1">Next<!-- -->
    <span> Page</span>
</a>
html

You can grab the next page URL from the href attribute of the a tag, which has another distinguishing attribute, rel, whose value is next for this element.

So, the CSS selector for grabbing it is: a[rel='next']::attr('href')

Modify your code as follows:

# -*- coding: utf-8 -*-
import scrapy


class ZapposMenShoesSpider(scrapy.Spider):
    name = "zappos_p"
    start_urls = ['https://www.zappos.com/men-running-shoes']
    allowed_domains = ['www.zappos.com']


    def parse(self, response):
        for product in response.css("article"):
            yield {
                "name": product.css("p[itemprop='name']::text").extract_first(),
                "by": product.css("p[itemprop='brand'] span[itemprop='name']::text").extract_first(),
                "price": product.css("p span::text").extract()[1],
                "stars": product.css(
                    "p span[itemprop='aggregateRating']::attr('data-star-rating')"
                ).extract_first(),
                "img-url": product.css(
                    "div span img::attr('src')").extract_first()
            }


        next_url_path = response.css(
            "a[rel='next']::attr('href')").extract_first()
        if next_url_path:
            yield scrapy.Request(
                response.urljoin(next_url_path),
                callback=self.parse
            )
python

Compared with the previous code, we have added just two new statements.

  • The first statement grabs the next page URL, if it exists, and stores it in the variable next_url_path.
  • The second statement checks whether next_url_path exists. If it does, we yield a new scrapy.Request for that URL with self.parse as the callback, so Scrapy fetches the next page and parses it the same way.
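The href on the Next Page link is a relative path, which is why the spider passes it through response.urljoin before requesting it. The same resolution can be reproduced with the standard library's urljoin, shown here as a stand-in for what happens inside the spider. The &amp; in the HTML source is already decoded to & by the time the selector returns the attribute value.

```python
from urllib.parse import urljoin

current_page = "https://www.zappos.com/men-running-shoes"
next_href = "/men-running-shoes/.zso?t=men running shoes&p=1"

# Resolve the relative href against the URL of the page it was found on.
next_url = urljoin(current_page, next_href)
print(next_url)
```

On the last page there is no matching link, so extract_first() returns None, no new request is yielded, and the crawl ends.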

Finally, run your spider and specify the output file. Note that the spider's name is now zappos_p:

scrapy crawl zappos_p -o zappos.csv
shell

Conclusion

In this guide, you have built a spider that extracts all the products of a given category, across every paginated page, in just 25 lines of code. This is a great start, but there is a lot more you can do with the spider. For a deeper understanding, you can follow the Scrapy documentation.

Here are some of the ways that you can expand your code for learning purposes:

  • Extract the URL of the product.
  • Scrape for multiple keywords. In this example, we have just scraped for a single keyword (men-running-shoes).
  • Try to scrape other, different types of websites.
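As a starting point for the multiple-keywords idea, the start URLs could be generated from a list of search phrases. This sketch assumes other keywords follow the same URL pattern as men-running-shoes, which has not been verified; keyword_to_url and the second keyword are hypothetical.

```python
BASE_URL = "https://www.zappos.com"


def keyword_to_url(keyword, base=BASE_URL):
    """Turn 'men running shoes' into '<base>/men-running-shoes'."""
    slug = "-".join(keyword.lower().split())
    return f"{base}/{slug}"


# Assumed keywords; only the first one is used in this guide.
KEYWORDS = ["men running shoes", "women running shoes"]
start_urls = [keyword_to_url(keyword) for keyword in KEYWORDS]
print(start_urls)
```

The resulting list could be assigned to the spider's start_urls attribute, so that each keyword is crawled and paginated by the same parse method.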

I hope you have learned a lot from this guide. Try experimenting with another website to get a better understanding of the Scrapy framework. For more information on scraping data from the web, check out Extracting Data with Beautiful Soup.
