Gaurav Singhal

Implementing Web Scraping with Scrapy

  • Mar 11, 2020
  • 12 Min read
  • 1,665 Views
Data
Scrapy

Introduction

Data is the new oil, many argue, as it becomes an increasingly valuable resource. With internet use growing, websites hold a massive amount of data. To get data from web pages, you can use an API where one is available or implement web scraping techniques. Web scrapers and crawlers read a website’s pages and feeds and analyze the site’s structure and markup for clues about where the data lives. Sometimes the data collected from scraping is fed into other programs for validation, cleaning, and input into a datastore. It may also be fed into other processes, such as natural language processing (NLP) toolchains or machine learning (ML) models.

There are a few Python packages you can use for web scraping, including Beautiful Soup and Scrapy, and we’ll focus on Scrapy in this guide. Scrapy makes it easy for us to quickly prototype and develop web scrapers.

In this guide, you will see how to scrape the IMDB website and extract some of its data into a JSON file.

What is Scrapy?

Scrapy is a free and open-source web crawling framework written in Python. It is a fast, high-level framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Scrapy uses spiders to define how a site should be scraped for information. It lets us determine how we want a spider to crawl, what information we want to extract, and how we can extract it.

Setup and Installation

Let’s talk about installation, creating a spider, and then testing it.

Step 1: Creating a Virtual Environment

It's best to create a separate virtual environment for Scrapy because that isolates the project and doesn’t affect any other programs on the machine.

First, install virtualenv using the command below.

$ pip install virtualenv

Now create a virtual environment with Python.

$ virtualenv scrapyvenv

On Linux/Mac, you can also specify which Python version the virtual environment should use.

$ virtualenv -p python3 scrapyvenv

After creating a virtual environment, activate it.

For Windows:

$ cd scrapyvenv
$ .\Scripts\activate

For Linux/Mac:

$ cd scrapyvenv
$ source bin/activate
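
As an alternative sketch, Python 3 ships a standard-library venv module that creates the same kind of isolated environment without installing virtualenv first. The helper name create_env is hypothetical and only used here for illustration; the directory name scrapyvenv matches the one used above.

```python
import venv
from pathlib import Path


def create_env(target_dir, with_pip=True):
    """Create a virtual environment in target_dir and return its path.

    create_env is a hypothetical helper for this guide, not part of Scrapy.
    """
    target = Path(target_dir)
    # with_pip=True bootstraps pip so that Scrapy can be installed afterwards.
    venv.create(target, with_pip=with_pip)
    return target
```

Calling create_env("scrapyvenv") should be roughly equivalent to running python -m venv scrapyvenv. The environment is activated with the same commands shown above.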

Step 2: Installing Scrapy

Scrapy can be installed with pip or conda, and most of its dependencies are installed automatically. Scrapy 2.0 and later require Python 3; only the older 1.x releases run on Python 2.7.

  • pip install: To install using pip, open the terminal and run the following command:

$ pip install scrapy

  • conda install: To install using conda, open the terminal and run the following command:

$ conda install -c anaconda scrapy

If you have a problem installing the Twisted library, you can download it separately and then install it locally.
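
To confirm that the installation worked, one option is to ask Python which versions it can see inside the active environment. This is a minimal sketch using only the standard library (Python 3.8 or later is assumed); installed_version is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report on Scrapy and on Twisted, the dependency mentioned above.
for package in ("scrapy", "twisted"):
    print(package, installed_version(package) or "not installed")
```

If either line prints "not installed", the virtual environment is probably not activated or the install step failed.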

Step 3: Creating a Scrapy Project

Since Scrapy is a framework, a project has to follow the framework's conventions. To create a new project in Scrapy, use the startproject command. I have named my project webscrapy.

$ scrapy startproject webscrapy

This will create a webscrapy directory with the following contents:

webscrapy
├── scrapy.cfg          -- deploy configuration file of scrapy project
└── webscrapy           -- your scrapy project module.
    ├── __init__.py     -- module initializer (empty file)
    ├── items.py        -- project item definition py file
    ├── middlewares.py  -- project middleware py file
    ├── pipelines.py    -- project pipeline py file
    ├── settings.py     -- project settings py file
    └── spiders         -- directory where spiders are kept
        └── __init__.py

Create a Spider

Now, let's create our first spider. Use the genspider command, which takes the name of the spider and the domain it will crawl:

$ cd webscrapy
$ scrapy genspider imdb www.imdb.com

After running this command, Scrapy will automatically create a Python file named imdb.py in the spiders folder.

When you open imdb.py, you will see a class named ImdbSpider that inherits from the scrapy.Spider class and contains a method named parse, which we will discuss later.

import scrapy

class ImdbSpider(scrapy.Spider):
    name = 'imdb'
    allowed_domains = ['www.imdb.com']
    start_urls = ['http://www.imdb.com/']

    def parse(self, response):
        pass

A few things to note here:

  • name: The name of the spider. In this case, it is imdb. Naming spiders properly becomes a huge relief when you have to maintain hundreds of spiders.

  • allowed_domains: An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won’t be followed.

  • parse(self, response): The default callback. Scrapy calls this method with the response of every successfully downloaded request that does not specify another callback.
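
To make the allowed_domains rule more concrete, here is a rough standard-library approximation of the check. It is a sketch of the idea only, not Scrapy's actual offsite filtering code, and is_allowed is a hypothetical helper name.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is one of the domains or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


print(is_allowed("https://www.imdb.com/chart/top", ["imdb.com"]))  # True
print(is_allowed("https://example.com/", ["imdb.com"]))            # False
```

Under this reading, listing imdb.com also covers www.imdb.com, while an unrelated host that merely ends in the same letters is rejected.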

To run this spider, use the command below. Before running it, make sure that you are in the project directory.

$ scrapy crawl imdb

Note that the above command takes the spider's name as an argument.

Scrape on IMDB

Let's now get all the table entries, such as title, year, and rating, from the table of IMDB top 250 movies.

Edit the spider imdb.py, which was created earlier.

import scrapy


class ImdbSpider(scrapy.Spider):
    name = "imdb"
    allowed_domains = ["imdb.com"]
    start_urls = ["https://www.imdb.com/chart/top"]

    def parse(self, response):
        # Each <tr> in the chart table holds one movie.
        rows = response.css('table[data-caller-name="chart-top250movie"] tbody[class="lister-list"] tr')
        for row in rows:
            # Extract the text of the title, year, and rating cells.
            yield {
                "title": row.css("td[class='titleColumn'] a::text").extract_first(),
                # default="" avoids an AttributeError when the year cell is missing.
                "year": row.css("td[class='titleColumn'] span::text").extract_first(default="").strip("() "),
                "rating": row.css("td[class='ratingColumn imdbRating'] strong::text").extract_first(),
            }
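
The strip("() ") call on the year field suggests that the raw cell text looks something like "(1994)". The standalone sketch below shows what that call does and guards against a selector that matched nothing; clean_year is a hypothetical helper, not part of the spider.

```python
def clean_year(raw):
    """Turn raw cell text such as "(1994)" into "1994"; return None for no match."""
    if raw is None:
        return None
    # strip() with an argument removes any of the listed characters from both ends.
    return raw.strip("() ")


print(clean_year("(1994)"))  # 1994
print(clean_year(None))      # None
```

Only characters at the ends of the string are removed, so digits in the middle are never touched.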

Run the above imdb spider:

$ scrapy crawl imdb

You will get the following output:

{'title': 'The Shawshank Redemption', 'year': '1994', 'rating': '9.2'}
{'title': 'The Godfather', 'year': '1972', 'rating': '9.1'}
...
{'title': 'Swades', 'year': '2004', 'rating': '8.0'}
{'title': 'Song of the Sea', 'year': '2014', 'rating': '8.0'}

Create a More Advanced Scraper

Let's get more advanced in scraping IMDB. Let's open the detailed page for each movie in the list of all 250 movies and then fetch all the important features, such as director name, genre, cast members, etc.

Before diving into creating the spider, we have to create Movie and Cast items in items.py.

For more details, read the Item section of the Scrapy documentation.

import scrapy

class MovieItem(scrapy.Item):
    title = scrapy.Field()
    rating = scrapy.Field()
    summary = scrapy.Field()
    genre = scrapy.Field()
    runtime = scrapy.Field()
    directors = scrapy.Field()
    writers = scrapy.Field()
    cast = scrapy.Field()

class CastItem(scrapy.Item):
    name = scrapy.Field()
    character = scrapy.Field()
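
The same shape can also be written with standard-library dataclasses, which may be easier to read if you are new to scrapy.Item. This is an illustrative sketch only: the class names Movie and CastMember are hypothetical, the rest of this guide keeps using MovieItem and CastItem, and yielding dataclass objects from a spider assumes Scrapy 2.2 or later.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class CastMember:
    name: str
    character: Optional[str] = None


@dataclass
class Movie:
    title: str
    rating: Optional[str] = None
    summary: Optional[str] = None
    genre: Optional[str] = None
    runtime: Optional[str] = None
    directors: Optional[str] = None
    writers: Optional[str] = None
    # default_factory gives every Movie its own empty cast list.
    cast: List[CastMember] = field(default_factory=list)


movie = Movie(title="The Dark Knight", cast=[CastMember("Christian Bale", "Bruce Wayne")])
print(asdict(movie)["cast"])
```

asdict converts the nested objects into plain dictionaries, which mirrors the nested structure of the JSON export.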

Now that the Items are created, let's extend the spider.

With this spider, we fetch the URL of each movie in the table and request it with parseDetailItem as the callback, which collects all the movie data from the movie detail page.

import scrapy
from webscrapy.items import MovieItem, CastItem


class ImdbSpider(scrapy.Spider):
    name = "imdb"
    allowed_domains = ["imdb.com"]
    start_urls = ["https://www.imdb.com/chart/top"]

    def parse(self, response):
        # Each <tr> in the chart table holds one movie.
        rows = response.css('table[data-caller-name="chart-top250movie"] tbody[class="lister-list"] tr')
        for row in rows:
            # Rank of the movie, i.e., its position in the table (stored in the "rating" field).
            rating = row.css("td[class='titleColumn']::text").extract_first().strip()
            # Relative URL of the movie's detail page.
            rel_url = row.css("td[class='titleColumn'] a::attr('href')").extract_first().strip()
            # Resolve the relative URL against the URL of the current page.
            movie_url = response.urljoin(rel_url)
            # Request the detail page and pass the rank along to parseDetailItem.
            yield scrapy.Request(movie_url, callback=self.parseDetailItem, meta={"rating": rating})

    # Called once for every movie detail page requested in parse().
    def parseDetailItem(self, response):
        # Create a movie item.
        item = MovieItem()
        # Read the rank that parse() passed along in the request meta.
        item["rating"] = response.meta["rating"]
        # Extract the required text from each element.
        item["title"] = response.css('div[class="title_wrapper"] h1::text').extract_first().strip()
        item["summary"] = response.css("div[class='summary_text']::text").extract_first().strip()
        # The first credit link is taken as the director and the second as the writer.
        item["directors"] = response.css('div[class="credit_summary_item"] a::text').extract()[0].strip()
        item["writers"] = response.css('div[class="credit_summary_item"] a::text').extract()[1].strip()
        item["genre"] = response.xpath("//*[@id='title-overview-widget']/div[1]/div[2]/div/div[2]/div[2]/div/a[1]/text()").extract_first().strip()
        item["runtime"] = response.xpath("//*[@id='title-overview-widget']/div[1]/div[2]/div/div[2]/div[2]/div/time/text()").extract_first().strip()

        # Collect the cast members in a list.
        item["cast"] = list()

        # Walk the cast table, skipping its first row.
        for cast in response.css("table[class='cast_list'] tr")[1:]:
            name = cast.xpath("td[2]/a/text()").extract_first()
            if name is None:
                # Skip rows that do not contain a cast member link.
                continue
            castItem = CastItem()
            castItem["name"] = name.strip()
            castItem["character"] = cast.css("td[class='character'] a::text").extract_first()
            item["cast"].append(castItem)

        return item
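
The links in the chart table are relative, so the spider has to turn each one into an absolute URL before requesting it. The standard-library sketch below shows how that resolution behaves; the relative path is an illustrative value, and Scrapy's response.urljoin resolves a link against the URL of the response in the same spirit.

```python
from urllib.parse import urljoin

page_url = "https://www.imdb.com/chart/top"
rel_url = "/title/tt0111161/"  # illustrative relative link, as found in an href

# A path starting with "/" replaces the whole path of the base URL.
absolute_url = urljoin(page_url, rel_url)
print(absolute_url)  # https://www.imdb.com/title/tt0111161/
```

Resolving against the page URL keeps the scheme and host of the page that was actually crawled, which avoids hard-coding the domain in the spider.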

Getting all the data on the command line is nice, but a data scientist usually prefers data in formats like CSV, Excel, or JSON that can be imported into other programs. Scrapy can export the scraped items for you, and popular formats such as JSON, JSON Lines, CSV, and XML are supported out of the box.

Let's export our data in JSON format using the command below. Scrapy infers the output format from the file extension.

$ scrapy crawl imdb -o imdbdata.json
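
The export can also be configured once in the project's settings.py instead of on the command line. The fragment below is a sketch that assumes Scrapy 2.1 or later, where the FEEDS setting is available; the file name matches the one used in this guide.

```python
# settings.py (fragment): write every scraped item to imdbdata.json.
FEEDS = {
    "imdbdata.json": {
        "format": "json",
        "encoding": "utf8",
        "indent": 4,  # pretty-print the output
    },
}
```

With this in place, a plain scrapy crawl imdb should write the file without any extra options.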

You will get output like the following in the file imdbdata.json:

[
    ...
    {
        "rating": "4",
        "title": "The Dark Knight",
        "summary": "When the menace ...",
        "directors": "Christopher Nolan",
        "writers": "Jonathan Nolan",
        "genre": "Action",
        "runtime": "2h 32min",
        "cast": [
            {
                "name": "Christian Bale",
                "character": "Bruce Wayne"
            },
            ...
        ]
    },
    ...
]
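
Once the file exists, it can be loaded back into Python for further processing. The sketch below reads the export and converts a runtime string such as "2h 32min" into minutes; load_movies and runtime_to_minutes are hypothetical helper names, and the runtime format is assumed to match the sample above.

```python
import json
import re


def runtime_to_minutes(runtime):
    """Convert a string such as "2h 32min" into a number of minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", runtime)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised runtime: {runtime!r}")
    hours, minutes = (int(part) if part else 0 for part in match.groups())
    return hours * 60 + minutes


def load_movies(path):
    """Load the list of movie dictionaries written by the JSON export."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


print(runtime_to_minutes("2h 32min"))  # 152
```

A typical use would be to call load_movies("imdbdata.json") and apply runtime_to_minutes to each movie's runtime field before sorting or plotting.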

Conclusion

In this guide, we learned the basics of Scrapy and how to export scraped data to a JSON file, though Scrapy can export to other formats such as CSV and XML as well. We have just scratched the surface of Scrapy’s potential as a web scraping tool.

I hope that this guide has helped you understand the basics of Scrapy and motivated you to go deeper with this wonderful scraping tool. For a deeper understanding, you can always follow the Scrapy Documentation or read my previous guide on Crawling the Web with Python and Scrapy.

If you have any questions related to this guide, feel free to ask me at Codealphabet.
