Extracting Structured Data from the Web Using Scrapy

Data analysts and scientists are always on the lookout for new sources of data, competitive intelligence, and new signals for proprietary models in applications. The Scrapy package in Python makes extracting raw web content easy and scalable.
Course info
Rating: (16)
Level: Beginner
Updated: Jul 6, 2018
Duration: 1h 52m
Description

Websites contain meaningful information which can drive decisions within your organization. The Scrapy package in Python makes crawling websites to scrape structured content easy and intuitive, while allowing crawls to scale to hundreds of thousands of websites. In this course, Extracting Structured Data from the Web Using Scrapy, you will learn how you can scrape raw content from web pages and save it for later use in a structured and meaningful format. You will start off by exploring how Scrapy works and how you can use CSS and XPath selectors in Scrapy to select the relevant portions of any website. You'll use the Scrapy command shell to prototype the selectors you want to use when building Spiders. Next, you'll learn how Spiders specify what to crawl, how to crawl, and how to process scraped data. You'll also learn how you can take your Spiders to the cloud using the Scrapy Cloud. The cloud platform offers advanced scraping functionality, including a cutting-edge tool called Portia with which you can build a Spider without writing a single line of code. By the end of this course, you will be able to build your own spiders and crawlers to extract insights from any website on the web. This course uses Scrapy version 1.5 and Python 3.

About the author

A problem solver at heart, Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.

Section Introduction Transcripts

Course Overview
Hi, my name is Janani Ravi, and welcome to this course on Extracting Structured Data from the Web Using Scrapy. A little about myself: I have a master's degree in electrical engineering from Stanford and have worked at companies such as Microsoft, Google, and Flipkart. At Google I was one of the first engineers working on real-time collaborative editing in Google Docs, and I hold four patents for its underlying technologies. I currently work on my own startup, Loonycorn, a studio for high-quality video content. In this course, you will learn how you can scrape raw content from web pages and save it for later use in a structured and meaningful format. We start off by understanding how Scrapy works and how we can use CSS and XPath selectors in Scrapy to select the relevant portions of any website. We'll use the Scrapy command shell to prototype the selectors we want to use when we go ahead and build spiders. Spiders are at the heart of Scrapy. They are Python classes which are called by the Scrapy framework to perform the actual scraping of sites. Spiders specify what to crawl, how to crawl, and how to process the scraped data. Scrapy allows logical grouping of data using items and data processing using input and output processors. Item pipelines allow us to chain transformations on data before it's saved to a file using feed exporters. Scrapy supports broad crawls of thousands of sites and offers advanced features to support these crawls, such as auto-throttling of requests to websites. We'll see all of this in this course. We'll also learn how we can take our spiders to the cloud using the Scrapy Cloud. The cloud platform offers advanced scraping functionality, including a cutting-edge tool called Portia, with which you can build a spider without writing a single line of code. At the end of this course, you will be able to build your own spiders and crawlers to extract insights from any website out on the internet.
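
To make that end-to-end flow concrete, here is a minimal sketch (not code from the course) of a spider that uses CSS selectors and yields structured data that Scrapy's feed exporters can save to a file. The practice site quotes.toscrape.com and the field names are purely illustrative, and the extract_first()/follow() calls match the Scrapy 1.5 API the course uses.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """A minimal spider: what to crawl, how to crawl, and how to process the data."""
    name = "quotes"
    # quotes.toscrape.com is a public practice site, used here only as an example
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # CSS selectors pick out the relevant portions of the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }
        # Follow the pagination link so the crawl continues across pages
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Running something like `scrapy crawl quotes -o quotes.json` would hand the yielded data to a feed exporter and write it out as JSON.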

Getting Started Scraping Web Sites Using Scrapy
Hi, and welcome to this course on Extracting Structured Data from the Web Using Scrapy. There is a lot of information scattered across different websites nowadays, and a good data analyst often finds that he or she needs this information in order to extract insights or gain competitive intelligence, which is why Scrapy is so popular. It's an application framework for crawling websites to extract data in a structured form. In addition to extracting data from the websites that you're crawling, Scrapy allows you to logically group data into the form that you will want to save out to a file. Scraping information from websites that are not your own is often a painstaking process. The Scrapy shell is an interactive shell that Scrapy offers so you can quickly test how you would extract information from these sites. The Scrapy shell is where you'd start prototyping your code. Selectors in Scrapy are objects that allow you to specify what portion of a website you're interested in. You can select relevant data from a site by specifying the XPath to the HTML element you want or the CSS class that applies to it.
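
As a rough illustration of that prototyping workflow (the URL is the same public practice site as above and is only an example), these are the kinds of expressions you would try interactively after launching the shell with `scrapy shell "http://quotes.toscrape.com/"`:

```python
# Inside the Scrapy shell, `response` already holds the fetched page.

# CSS selector: text of the page's <title> element
response.css("title::text").extract_first()

# XPath selector addressing the same element by path
response.xpath("//title/text()").extract_first()

# CSS class selector: every element with class="quote"
response.css("div.quote")

# XPath attribute extraction: the href of the "next page" link
response.xpath("//li[@class='next']/a/@href").extract_first()
```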

Using Spiders to Crawl Sites
In this module, we'll see how we can build Scrapy spiders. If you're using Scrapy in production, chances are you'll build a spider in order to crawl the websites that you're interested in. At the heart of Scrapy's web crawling system is the spider. Spiders are classes that let you define which websites you want to crawl, how you want those websites to be crawled, and how you want the data that you're interested in to be extracted. Scrapy allows you to logically group the data that you extract from websites into something called an item. An item allows you to specify input and output processors, which let you massage data into the format that you're interested in. Once you've extracted scraped information in raw form, you can pass it through a series of transformations using item pipelines. This transformed and extracted data can then be saved out to a file or to a database.
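
As an illustrative sketch (not taken from the course) of how those pieces fit together in Scrapy 1.5, the snippet below defines an Item that groups two made-up fields, an ItemLoader whose input and output processors clean the raw values, and an item pipeline that applies one more transformation before the data is exported.

```python
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst


class BookItem(scrapy.Item):
    # Items logically group the fields scraped from a page
    title = scrapy.Field()
    price = scrapy.Field()


class BookLoader(ItemLoader):
    # Input processors transform each value as it is extracted;
    # output processors decide what finally ends up on the item
    default_output_processor = TakeFirst()
    title_in = MapCompose(str.strip)
    price_in = MapCompose(str.strip, lambda value: value.replace("£", ""))


class PriceToFloatPipeline(object):
    # Item pipelines chain transformations on every item a spider yields,
    # before feed exporters save the items to a file or a database
    def process_item(self, item, spider):
        if item.get("price"):
            item["price"] = float(item["price"])
        return item
```

Inside a spider's parse method you would populate the item through the loader (for example loader.add_css('title', 'h1::text') followed by yield loader.load_item()), and the pipeline is switched on by listing it in the ITEM_PIPELINES setting.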

Building Crawlers Using Built-in Services in Scrapy
Hi, and welcome to this module on Building Crawlers Using Built-in Services in Scrapy. When we use Scrapy to crawl multiple websites at scale, these are called broad crawls, and Scrapy offers many useful built-in features to support this kind of crawl. You can use Scrapy to log your own events, either to the console or to a file, and process those events later. Your Scrapy crawlers can be debugged using telnet from a terminal window: you can pause crawling, restart your crawlers, and also view crawl statistics. Broad crawls involve scraping thousands of websites concurrently, and Scrapy offers a special broad crawl configuration which enables this. An important mechanism that you can use to control broad crawls is the auto throttle mechanism. Websites often have policies that block bots and other crawlers, and in order to not run afoul of these policies, it's important that you auto-throttle your crawls.
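
As a sketch of what those built-in services look like in a project's settings.py, the entries below cover logging, the telnet console, broad-crawl concurrency, and the AutoThrottle extension; the values are illustrative, not recommendations from the course.

```python
# settings.py (illustrative values)

# Send log events to a file instead of the console
LOG_FILE = "crawl.log"
LOG_LEVEL = "INFO"

# Telnet console for inspecting and debugging a running crawler from a terminal
TELNETCONSOLE_ENABLED = True

# Raise concurrency for broad crawls that hit many domains at once
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# AutoThrottle adjusts the delay between requests based on server load,
# so the crawler does not hammer the sites it visits
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

Pausing and resuming a crawl uses the JOBDIR setting rather than settings.py, for example `scrapy crawl myspider -s JOBDIR=crawls/myspider-1`, where myspider is a placeholder spider name.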

Deploying Crawlers Using Scrapy Cloud
Hi, and welcome to this module where we'll see how we can run Scrapy on the cloud. The company that develops and maintains the Scrapy application framework is called Scrapinghub Limited, and they have a website at scrapinghub.com. Scrapinghub offers a wide variety of products and services, all of which relate to scraping. They also offer paid services where experts help you with your scraping needs. What's most useful for us as developers, though, are the tools that they offer in the cloud, which allow you to scrape at scale; if you're scraping from a single machine, you're limited by that machine's hardware. In this module we'll deploy a Scrapy spider that we already built in a previous module to Scrapy Cloud and run a crawler from the Scrapinghub cloud platform. We'll also play with one of the very cool tools that Scrapinghub has to offer: Portia is a UI-based tool that allows you to build spiders using a point-and-click mechanism; there is absolutely no code that you need to write.
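
For reference, deploying to Scrapy Cloud is typically done with Scrapinghub's shub command-line client; the project ID and spider name below are placeholders, and the exact steps the course demonstrates may differ.

```
pip install shub
shub login               # paste your Scrapinghub API key when prompted
shub deploy 123456       # deploy the project in the current directory (placeholder project ID)
shub schedule myspider   # run the deployed spider in Scrapy Cloud (placeholder spider name)
```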