Extracting Structured Data from the Web Using Scrapy

Data analysts and scientists are always on the lookout for new sources of data, competitive intelligence, and new signals for the proprietary models in their applications. The Scrapy package in Python makes extracting raw web content easy and scalable.
Course info
Level: Beginner
Updated: Jul 6, 2018
Duration: 1h 53m
Description

Websites contain meaningful information that can drive decisions within your organization. The Scrapy package in Python makes crawling websites to scrape structured content easy and intuitive, and at the same time allows crawling to scale to hundreds of thousands of websites. In this course, Extracting Structured Data from the Web Using Scrapy, you will learn how to scrape raw content from web pages and save it for later use in a structured and meaningful format. You will start off by exploring how Scrapy works and how you can use CSS and XPath selectors in Scrapy to select the relevant portions of any website. You'll use the Scrapy command shell to prototype the selectors you want to use when building Spiders.

Next, you'll learn how Spiders specify what to crawl, how to crawl, and how to process the scraped data. You'll also learn how you can take your Spiders to the cloud using Scrapy Cloud. The cloud platform offers advanced scraping functionality, including a cutting-edge tool called Portia, with which you can build a Spider without writing a single line of code. At the end of this course, you will be able to build your own spiders and crawlers to extract insights from any website on the web. This course uses Scrapy version 1.5 and Python 3.
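To make the selector and Spider ideas concrete, here is a minimal sketch of the kind of Spider this workflow produces. It targets quotes.toscrape.com, the practice site used in Scrapy's own tutorial; the spider name and field names are illustrative choices rather than the course's exact examples, and the code assumes the Scrapy 1.5 API.

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # A CSS selector picks out each quote block on the page
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").extract_first(),
                    # XPath selectors work interchangeably with CSS selectors
                    "author": quote.xpath(".//small[@class='author']/text()").extract_first(),
                }
            # Follow the pagination link, if any, and parse the next page the same way
            next_page = response.css("li.next a::attr(href)").extract_first()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Before committing selectors like these to a Spider, you can try them out interactively: run scrapy shell "http://quotes.toscrape.com/" and then evaluate an expression such as response.css("span.text::text").extract_first() at the prompt.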

About the author

A problem solver at heart, Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.

More from the author
Building Features from Image Data (Advanced, 2h 10m, Aug 13, 2019)
Designing a Machine Learning Model (Intermediate, 3h 25m, Aug 13, 2019)
Section Introduction Transcripts

Course Overview
Hi, my name is Janani Ravi, and welcome to this course on Extracting Structured Data from the Web Using Scrapy. A little about myself: I have a master's degree in electrical engineering from Stanford and have worked at companies such as Microsoft, Google, and Flipkart. At Google, I was one of the first engineers working on real-time collaborative editing in Google Docs, and I hold four patents for its underlying technologies. I currently work on my own startup, Loonycorn, a studio for high-quality video content.

In this course, you will learn how you can scrape raw content from web pages and save it for later use in a structured and meaningful format. We start off by understanding how Scrapy works and how we can use CSS and XPath selectors in Scrapy to select the relevant portions of any website. We'll use the Scrapy command shell to prototype the selectors we want to use when we go ahead and build spiders.

Spiders are at the heart of Scrapy. They are Python classes which are called by the Scrapy framework to perform the actual scraping of sites. Spiders specify what to crawl, how to crawl, and how to process the scraped data. Scrapy allows logical grouping of data using items and data processing using input and output processors. Item pipelines allow us to chain transformations on data before the data is saved to file using feed exporters. Scrapy supports broad crawls of thousands of sites and offers advanced features for these crawls, such as auto-throttling of requests to websites. We'll see all of this in this course.

We'll also learn how we can take our spiders to the cloud using Scrapy Cloud. The cloud platform offers advanced scraping functionality, including a cutting-edge tool called Portia, with which you can build a spider without writing a single line of code. At the end of this course, you will be able to build your own spiders and crawlers to extract insights from any website out on the internet.
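To ground the items, processors, and pipelines described above, here is a minimal sketch using the Scrapy 1.5 APIs. The QuoteItem fields, the RequireAuthorPipeline class, and the myproject module path are illustrative inventions, not examples taken from the course.

    import scrapy
    from scrapy.exceptions import DropItem
    from scrapy.loader.processors import MapCompose, TakeFirst

    class QuoteItem(scrapy.Item):
        # An Item logically groups the fields scraped from each page.
        # The processors declared here are applied when the item is
        # populated through an ItemLoader.
        text = scrapy.Field(
            input_processor=MapCompose(str.strip),  # input processor: strip whitespace as values arrive
            output_processor=TakeFirst(),           # output processor: keep a single value per field
        )
        author = scrapy.Field(output_processor=TakeFirst())

    class RequireAuthorPipeline(object):
        # An item pipeline stage: every scraped item passes through here,
        # where it can be transformed, validated, or dropped
        def process_item(self, item, spider):
            if not item.get("author"):
                raise DropItem("missing author")
            return item

Pipelines and auto-throttling are switched on in the project's settings.py, and running a crawl with -o hands the surviving items to a feed exporter:

    # settings.py (excerpt)
    ITEM_PIPELINES = {"myproject.pipelines.RequireAuthorPipeline": 300}  # lower number = runs earlier
    AUTOTHROTTLE_ENABLED = True  # adapt the request rate to each site's responsiveness

    # command line: the feed exporter writes the scraped items as JSON
    # scrapy crawl quotes -o quotes.json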