Web Scraping with Python

Authors: Clarke Bishop, Janani Ravi, Eduardo Freitas, Pratheerth Padman

There are times when you need data but there is no API (application programming interface) to be found. Web scraping is the process of extracting data from web sites via...

What You Will Learn

  • Web scraping (also called screen scraping, web data extraction, or web harvesting) is a technique for extracting large amounts of data from websites and saving it to a local file or to a database in tabular (spreadsheet) format.

Pre-requisites

  • Data Analytics Literacy
  • Python for Data Analysts

Beginner

Describe the process of scraping data from the web, explain the legal factors, and scrape data from a web page with BeautifulSoup.

Exploring Web Scraping with Python

by Clarke Bishop

Feb 7, 2020 / 1h 34m

Description

The web is a giant database and when there’s no API, you can still retrieve the data through web scraping. In this course, Exploring Web Scraping with Python, you will learn foundational knowledge of web scraping and how to use Python’s rich set of scraping capabilities. First, you will learn how to download and extract data with Requests and Beautiful Soup. Next, you will discover how to build a spider in about 20 lines of code with Scrapy. Finally, you will explore how to use a robotic browser to solve advanced web scraping challenges. When you are finished with this course, you will know “SQL for web scraping,” and have the skills and knowledge of content selectors and Python needed to start your own website scraping projects.

Table of contents
  1. Course Overview
  2. Why Scrape the Web?
  3. The Web Scraping Process
  4. CSS Selectors and XPath
  5. Is Web Scraping Legal? Is it Ethical?
  6. Web Scraping Basics with Python
  7. Building a Web Spider with Scrapy
  8. Advanced Web Scraping with Selenium and Requests-html
  9. Course Summary
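
The content selectors this course mentions (CSS selectors and XPath) can be illustrated with a minimal sketch. It assumes a small, well-formed page fragment invented for this example and uses the limited XPath support in Python's standard-library ElementTree rather than the course's own tools. Real pages are rarely well-formed XML, so a tolerant parser such as Beautiful Soup would normally be used instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <div class="book">
      <span class="title">First Title</span>
      <span class="price">10.00</span>
    </div>
    <div class="book">
      <span class="title">Second Title</span>
      <span class="price">12.50</span>
    </div>
    <div class="ad">Sponsored content</div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# XPath-style selector: every <span class="title"> directly inside a
# <div class="book">. The equivalent CSS selector would be
# "div.book > span.title".
titles = [
    span.text
    for span in root.findall(".//div[@class='book']/span[@class='title']")
]

# The same idea for prices, converted to numbers after selection.
prices = [
    float(span.text)
    for span in root.findall(".//div[@class='book']/span[@class='price']")
]

print(titles)
print(prices)
```

The selector expresses what to extract, much as a SQL query does for a table, and the advertisement block is skipped because it does not match the path.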

Scraping Your First Web Page with Python

by Janani Ravi

Nov 5, 2019 / 2h 39m

Description

Web scraping is an important technique that is widely used as the first step in many workflows in data mining, information retrieval, and text-based machine learning. In this course, Scraping Your First Web Page with Python, you will gain the ability to apply different scraping techniques, including Beautiful Soup and Scrapy. First, you will learn and use various HTTP client libraries such as Requests, httplib2, and urllib to download HTML content. Next, you will discover how Beautiful Soup, an extremely popular Python library, does better than regular expressions in important ways. You will see how Beautiful Soup fixes up badly formed HTML and constructs a parse tree that can be traversed and queried. Finally, you will add to your toolkit the knowledge of Scrapy, a full-fledged web scraping framework that combines the steps of retrieving and parsing web content and does so at production scale. When you’re finished with this course, you will have the skills and knowledge to identify the relative strengths and use cases of different web retrieval and scraping technologies such as regular expressions, Beautiful Soup, and Scrapy.

Table of contents
  1. Course Overview
  2. Getting Started with Web Scraping
  3. Working with the Parse Tree in BeautifulSoup
  4. Selecting Elements Using the Scrapy Shell
  5. Scraping Web Sites Using Scrapy Spiders
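
The claim that a parser does better than regular expressions can be sketched without any third-party packages. The example below is an assumption-laden illustration, not the course's code: it uses the standard-library `html.parser` as a stand-in for Beautiful Soup, and the markup is invented. It shows a naive regex missing links that a tolerant parser still finds.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: the second link uses an uppercase tag and single
# quotes, the third puts another attribute first and never closes the tag.
HTML = """
<p><a href="/page1">One</a>
<A class='nav' HREF='/page2'>Two</A>
<a title="x" href="/page3">Three
"""

# Naive regex: only matches lowercase <a href="..."> with double quotes.
regex_links = re.findall(r'<a href="([^"]*)"', HTML)


class LinkCollector(HTMLParser):
    """Collects the href of every <a> start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed(HTML)
collector.close()

print("regex :", regex_links)
print("parser:", collector.links)
```

The regex could be extended to cover each variation, but every new quirk in the markup needs another pattern, whereas the parser handles them uniformly.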

Intermediate

Scrape data from a web page and extract specific page elements using BeautifulSoup.

Extracting Data from HTML with BeautifulSoup

by Janani Ravi

Nov 1, 2019 / 2h 25m

Description

Web scraping is an important technique that is widely used as the first step in many workflows in data mining, information retrieval, and text-based machine learning.

In this course, Extracting Data from HTML with BeautifulSoup, you will gain the ability to build robust, maintainable web scraping solutions using the Beautiful Soup library in Python.

First, you will learn how regular expressions can be used to scrape web content, and how Beautiful Soup does better in important ways. Next, you will discover how Beautiful Soup parses HTML from web content, fixes up badly-formed tags, and builds a clean, easily traversable parse tree. You will then see how that parse tree can be used in order to find and retrieve specific patterns.

Finally, you will round out your knowledge by leveraging advanced features of Beautiful Soup such as working with CSS and XPath. When you’re finished with this course, you will have the skills and knowledge to implement robust web scraping using Beautiful Soup.

Table of contents
  1. Course Overview
  2. Getting Started with BeautifulSoup
  3. Navigating the Parse Tree
  4. Searching for Elements in the Parse Tree
  5. Leveraging Advanced Features of BeautifulSoup

Scraping Media from the Web with Python

by Allen O'Neill

Coming Soon

Advanced

Create a spider to collect data across multiple pages and scrape a dynamically-rendered web page.

Crawling the Web with Python and Scrapy

by Eduardo Freitas

Dec 23, 2019 / 1h 32m

Description

Have you ever spent hours trying to gather high-quality data from specific websites, and wondered how you could extract this data programmatically and use it within your own applications? In this course, Crawling the Web with Python and Scrapy, you will gain the ability to write spiders that can extract data from the web, using Python and Visual Studio Code, through an advanced yet easy-to-use framework called Scrapy. First, you will learn what scraping and crawling are, and explore their implications. Next, you will discover how to scaffold a Scrapy project and write spiders. Finally, you will explore how to influence how spiders crawl websites and extract data in different formats. When you are finished with this course, you will have the skills and knowledge to use Scrapy with Python to programmatically crawl and scrape data from any website.

Table of contents
  1. Course Overview
  2. Extracting Data from the Web – Core Concepts
  3. Scaffolding and Running Your First Scrapy Web Crawler Project
  4. Achieving Common Spider Behaviors Using Built-in Classes
  5. Influencing Scrapy Crawling
  6. Scrapy Outcome and Data Export
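
The idea of a spider that collects data across multiple pages can be sketched as a plain crawl loop. This is not Scrapy and not the course's code: the "site" is a hypothetical in-memory dictionary, and `fetch` is a stand-in for downloading and parsing a page. It only illustrates the queue-and-visited-set logic that a crawling framework manages for you, followed by a JSON export.

```python
import json
from collections import deque

# Hypothetical in-memory "site": path -> (items on the page, outgoing links).
SITE = {
    "/": (["home item"], ["/page/1", "/page/2"]),
    "/page/1": (["item A", "item B"], ["/page/2", "/"]),
    "/page/2": (["item C"], ["/page/3"]),
    "/page/3": (["item D"], []),
}


def fetch(path):
    """Stand-in for an HTTP request plus parsing; returns (items, links)."""
    return SITE[path]


def crawl(start, max_pages=10):
    """Breadth-first crawl that visits each page at most once."""
    seen = {start}
    queue = deque([start])
    results = []
    while queue and len(results) < max_pages:
        path = queue.popleft()
        items, links = fetch(path)
        results.append({"url": path, "items": items})
        for link in links:
            if link not in seen:  # avoid revisiting pages and looping
                seen.add(link)
                queue.append(link)
    return results


scraped = crawl("/")

# Export the collected records as JSON, one of several possible formats.
print(json.dumps(scraped, indent=2))
```

The `max_pages` limit is one simple way to influence how far the crawl goes; a real crawler would also add delays between requests and respect the site's crawling rules.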

Scraping Dynamic Web Pages with Python and Selenium

by Pratheerth Padman

Jun 6, 2019 / 1h 7m

Description

They say data is the new oil, and given what you can do with high-quality data, you'd be hard-pressed to disagree. There are many ways to collect data, one of which is extracting the oodles of data swimming around in the form of websites. That is exactly what this course, Scraping Dynamic Web Pages with Python and Selenium, aims to teach. First, you are going to look at how to scrape data from dynamic websites. The main tool used is Selenium, and the course starts off by exploring that. Next, you will move on to the specifics, starting with opening a webpage using a web driver. Then you will learn to identify and locate dynamic elements in a webpage and hand the page source over to Beautiful Soup. Finally, to round off the course, you will explore the common challenges you will face and methods to increase scraping efficiency. When you are finished with this course, you will be able to combine Python, Selenium, and Beautiful Soup to extract data from any dynamic webpage.

Table of contents
  1. Course Overview
  2. Exploring Selenium with Python
  3. Locating Elements & Navigating Dynamic Web Pages
  4. Loading Selenium Page Source into BeautifulSoup
  5. Overcoming Challenges and Increasing Efficiency

Advanced Web Scraping Tactics: Python Playbook

by Pratheerth Padman

Mar 24, 2020 / 44m

Description

Scraping static, uncomplicated webpages is easy to do with Python. The going gets a little tougher though when you are confronted with things like login pages, checkboxes, and forms.

In this course, Advanced Web Scraping Tactics: Python Playbook, you will take what you already know about introductory web scraping and learn advanced web scraping techniques.

First, you will learn what advanced web scraping means, followed by how to handle form submissions with the Python requests module and Selenium.

Next, you will learn how to handle websites with login pages and cookies, and how to provide button input values, such as selecting checkboxes and radio buttons.

Finally, you will use Selenium to upload files, which will come in handy when websites require you to upload images, PDF files, and more before you can proceed. When you are finished with this course, you will have the skills to navigate problems when trying to scrape data from websites.

Table of contents
  1. Course Overview
  2. Introducing Advanced Web Scraping & Handling Form Submissions
  3. Submitting Cookies & Button Input Values to a URL
  4. Uploading Files to a Webpage during Scraping
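
Form submission with cookies, as described for this course, can be sketched with the standard library alone. The sketch assumes a hypothetical login URL and hypothetical field names, and it uses `urllib` in place of the requests module the course names. It builds the request without sending it, so it runs offline.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

# Hypothetical endpoint and form fields; real names come from the page's
# <form> markup. A checked checkbox is submitted as name=value ("on" is
# the browser default when the input has no value attribute).
LOGIN_URL = "https://example.com/login"
form_fields = {"username": "alice", "password": "secret", "remember": "on"}

# The cookie jar stores cookies set by responses so that later requests
# made through the same opener stay logged in.
jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(jar))

# Encode the fields the way a browser encodes a simple HTML form.
body = urlencode(form_fields).encode("ascii")
login_request = Request(
    LOGIN_URL,
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

# Supplying data makes urllib issue a POST instead of a GET.
print(login_request.get_method())
print(body.decode("ascii"))

# Sending is left commented out because the URL above is a placeholder:
# with opener.open(login_request, timeout=10) as response:
#     html = response.read().decode("utf-8", errors="replace")
```

File uploads and JavaScript-driven forms generally need a browser automation tool such as Selenium, which is why the course turns to it for those cases.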