The internet revolution has resulted in an explosion of data, and many companies are trying to extract and analyze as much of it as they can from the web. The process of programmatically retrieving web pages and extracting information from them is called web scraping. In this guide, you will learn the basics of web scraping with BeautifulSoup, a Python package for parsing HTML and XML documents.
Let's start by loading the required libraries.
import pandas as pd
import numpy as np
import bs4
import requests
import urllib
import urllib.request
import re

from bs4 import BeautifulSoup
from urllib.request import urlretrieve
from urllib.request import urlopen, Request
In this guide, we'll scrape data from a Wikipedia article on the movie Avengers: Endgame. We'll start by specifying the URL of the web page. URL is an acronym for Uniform Resource Locator, which is a web address with two components:
Protocol identifier, which is https in this case
Resource name, which is en.wikipedia.org/wiki/Avengers:_Endgame in this case
These two components specify the web address completely. The first line of code below specifies the URL of the Wikipedia article on the movie, while the second line retrieves the response as an HTML object. HTML is an acronym for HyperText Markup Language and is the standard language for web pages. Once we have the HTML object, we'll pass it to the BeautifulSoup constructor to parse the HTML document, as shown in the third line of code. The fourth line returns the type of the resulting object.
url = "https://en.wikipedia.org/wiki/Avengers:_Endgame"

html = urlopen(url)

soup = BeautifulSoup(html, 'lxml')

type(soup)
Output:
bs4.BeautifulSoup
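To make the two URL components concrete, the following sketch splits the address with the standard library's urllib.parse.urlsplit and then prepares a request that carries an explicit User-Agent header, since some servers reject requests that use Python's default one. This is an illustration under assumptions, not part of the original guide: the helper names build_request and fetch_html and the user agent string example-scraper/0.1 are hypothetical.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

URL = "https://en.wikipedia.org/wiki/Avengers:_Endgame"

# Split the URL into the two components described in the text.
parts = urlsplit(URL)
protocol = parts.scheme                # protocol identifier
resource = parts.netloc + parts.path   # resource name: host followed by path


def build_request(url, user_agent="example-scraper/0.1"):
    """Return a Request with an explicit User-Agent (hypothetical value)."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page and return its raw HTML as bytes."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Nothing is downloaded until fetch_html is called. Its return value can be passed to the parser in place of the urlopen result, for example BeautifulSoup(fetch_html(URL), 'lxml').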
We can look at the structure of the object we created above using the code below.
print(soup.prettify())
Output:
<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
 <meta charset="utf-8"/>
 <title>
 Avengers: Endgame - Wikipedia
 </title>
 <script>
 document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XjNILwpAAEIAAJRf3S8AAAAU","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Avengers:_Endgame","wgTitle":"Avengers: Endgame","wgCurRevisionId":938381569,"wgRevisionId":938381569,"wgArticleId":44254295,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 uses Russian-language script (ru)","CS1 Russian-language sources (ru)","Wikipedia pages semi-protected against vandalism","Articles with short description","Use American English from October 2019",
"All Wikipedia articles written in American English","Use mdy dates from January 2020","Use list-defined references from October 2019","Pages using multiple image with manual scaled images","Articles with Encyclopædia Britannica links","Comics navigational boxes purge","2019 films","English-language films","2010s science fiction action films","2010s sequel films","2010s superhero films","2019 3D films","Alien invasions in films","Alternate timeline films","American 3D films","American films","American science fiction action films","American sequel films","Avengers (film series)","Crossover films","Films about extraterrestrial life","Films about quantum mechanics","Films about size change","Films about time travel","Films directed by Anthony and Joe Russo","Films featuring anthropomorphic characters","Films scored by Alan Silvestri","Films set in 1970","Films set in 2012","Films set in 2013","Films set in 2014","Films set in 2018","Films set in 2023","Films set in New Jersey",

</script>
 <script>
 (RLQ=window.RLQ||[]).push(function(){mw.loader.implement("user.tokens@tffin",function($,jQuery,require,module){/*@nomin*/mw.user.tokens.set({"patrolToken":"+\\","watchToken":"+\\","csrfToken":"+\\"});
});});
 </script>
 <link href="/w/load.php?lang=en&modules=ext.cite.styles%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cjquery.makeCollapsible.styles%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.interface%7Cmediawiki.toc.styles%7Cskins.vector.styles%7Cwikibase.client.init&only=styles&skin=vector" rel="stylesheet"/>
 <script async="" src="/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector">
 </script>
 <meta content="" name="ResourceLoaderDynamicStyles"/>
 <link href="/w/load.php?lang=en&modules=site.styles&only=styles&skin=vector" rel="stylesheet"/>
 <meta content="MediaWiki 1.35.0-wmf.15" name="generator"/>
 <meta content="origin" name="referrer"/>
 <meta content="origin-when-crossorigin" name="referrer"/>
 <meta content="origin-when-cross-origin" name="referrer"/>
The command print(soup.prettify()) generates a long output, which has been truncated above for the sake of brevity.
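Because the full prettified tree is long, it is often more practical to pull out only the elements of interest. The sketch below is not taken from the original walkthrough. It parses a small hypothetical HTML string (SAMPLE_HTML, a stand-in rather than the real Wikipedia markup) with Python's built-in html.parser instead of lxml, so it runs without a network connection or extra parser dependency. The element names and the 200-character preview length are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document (hypothetical, not the real Wikipedia page).
SAMPLE_HTML = """
<html>
  <head><title>Avengers: Endgame - Wikipedia</title></head>
  <body>
    <h1 id="firstHeading">Avengers: Endgame</h1>
    <p>First paragraph.</p>
    <p>Second paragraph with a <a href="/wiki/Marvel_Studios">link</a>.</p>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Text of the <title> tag.
title_text = soup.title.string

# First <h1> element in the document.
heading_text = soup.find("h1").get_text()

# Text of every <p> element, in document order.
paragraphs = [p.get_text() for p in soup.find_all("p")]

# Target of every <a> element that has an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

# Preview only the start of the prettified tree instead of printing all of it.
preview = soup.prettify()[:200]
print(preview)
```

The same calls (soup.title, find, find_all, get_text) apply to the soup object built from the Wikipedia page, although the tags and attributes present there may differ from this stand-in.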
In this guide, you learned about the basics of web scraping using the popular BeautifulSoup library in Python. You learned how to access web data and convert it into an HTML object, along with the basic methods of parsing it with the BeautifulSoup library.
To learn more about data science using Python, please refer to the following guides.