Deepika Singh

Implementing Web Scraping with Requests

  • Feb 5, 2020
  • 17 Min read
  • 97 Views
Data
Requests

Introduction

Web scraping is the technique of extracting data from websites. The internet revolution has resulted in an explosion of data, and the ability to extract this data has become an important skill for data scientists. One of the most popular approaches to web scraping in Python is sending an HTTP request with the Requests package and then parsing the returned HTML with the BeautifulSoup package.

The Requests package is one of the most downloaded Python packages. It is typically used for two tasks: making GET requests to an API and retrieving the raw HTML content of a web page.

In this guide, you will learn the basics of implementing web scraping using the Requests package in Python, which is used for performing HTTP requests. HTTP stands for Hypertext Transfer Protocol, the foundation of data communication on the web.

Let's start by loading the required libraries.

# Third-party package for sending HTTP requests
import requests

# Standard-library helpers used in the urllib section
from urllib.request import Request, urlopen, urlretrieve

Exploring the Requests Package

In this guide, we'll scrape data from the Wikipedia article on the movie Avengers: Endgame. We'll use the URL of the web page. (URL stands for Uniform Resource Locator.)

A URL has two components that together specify the web address completely.

  1. Protocol identifier: https in this case

  2. Resource name: en.wikipedia.org/wiki/Avengers:_Endgame in this case
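As a quick check of the two components listed above, the standard library's urllib.parse.urlsplit() can break a URL into its parts. This is only a minimal sketch, and the variable names protocol and resource_name are chosen here for illustration.

```python
from urllib.parse import urlsplit

url = "https://en.wikipedia.org/wiki/Avengers:_Endgame"
parts = urlsplit(url)

# Protocol identifier (the URL scheme)
protocol = parts.scheme

# Resource name: the host followed by the path
resource_name = parts.netloc + parts.path

print(protocol)
print(resource_name)
```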

The first line of code below specifies the URL of the movie's Wikipedia page and stores it in the variable url. The second line sends the request with the function requests.get() and stores the response in the variable req1.

url = "https://en.wikipedia.org/wiki/Avengers:_Endgame"

req1 = requests.get(url)

The same task can also be done with a single line of code:

req1 = requests.get('https://en.wikipedia.org/wiki/Avengers:_Endgame')
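In practice, it can help to set a timeout and to check the status code before using the response. The sketch below wraps requests.get() in a small helper. The names fetch_page and get are hypothetical, introduced here for illustration (the get hook only exists so the helper can be tried without network access), and the 10-second timeout is an arbitrary choice.

```python
def fetch_page(url, get=None, timeout=10):
    """Return the response for url, raising an exception on HTTP error codes.

    `get` is a hypothetical hook for illustration; by default the helper
    falls back to requests.get().
    """
    if get is None:
        import requests  # imported lazily so the helper can be tried offline
        get = requests.get
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response

# Example usage (requires network access):
# req1 = fetch_page("https://en.wikipedia.org/wiki/Avengers:_Endgame")
# print(req1.status_code)
```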

We can inspect the raw body of the response object created above with its content attribute, which returns bytes.

req1.content

Output:

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Avengers: Endgame - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XjSKaApAADkAAIpdQNsAAAEX","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Avengers:_Endgame","wgTitle":"Avengers: Endgame","wgCurRevisionId":938381569,"wgRevisionId":938381569,"wgArticleId":44254295,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 uses Russian-language script (ru)","CS1 Russian-language sources (ru)","Wikipedia pages semi-protected against vandalism","Articles with short description","Use American English from October 2019",\n"All Wikipedia articles written in American English","Use mdy dates from January 2020","Use list-defined references from October 2019","Pages using multiple image with manual scaled images","Articles with Encyclop\xc3\xa6dia Britannica links","Comics navigational boxes purge","2019 films","English-language films","2010s science fiction action films","2010s sequel films","2010s superhero films","2019 3D films","Alien invasions in films","Alternate timeline films","American 3D films","American films","American science fiction action films","American sequel films","Avengers (film series)","Crossover films","Films about extraterrestrial life","Films about quantum mechanics","Films about size change","Films about time travel","Films directed by Anthony and Joe Russo","Films featuring anthropomorphic characters","Films scored by Alan Silvestri","Films 
set in 1970","Films set in 2012","Films set in 2013","Films set in 2014","Films set in 2018","Films set in 2023","Films set in New Jersey",\n"Films set in New York (state)","Films set in New York City","Films set in Norway","Films set in San Francisco","Films set in Tokyo","Films set in Wakanda","Films set in Africa","Films set in the 1940s","Films set on fictional planets","Films shot at Pinewood Atlanta Studios","Films shot in Atlanta","Films shot in County Durham","Films shot in New York (state)","Films shot in Scotland","Films using computer-generated imagery","Films with screenplays by Christopher Markus \\u0026 Stephen McFeely","IMAX films","Intergalactic travel in fiction","Marvel Cinematic Universe films","Motion capture in film","Nanotechnology in fiction","Post-apocalyptic films","Sequel films"],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Avengers:_Endgame","wgRelevantArticleId":44254295,"wgIsProbablyEditabl

We can examine the HTTP headers of the response using the code below.

req1.headers

Output:

{'Date': 'Fri, 31 Jan 2020 20:13:29 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Server': 'mw1262.eqiad.wmnet', 'X-Powered-By': 'PHP/7.2.26-1+0~20191218.33+debian9~1.gbpb5a340+wmf1', 'X-Content-Type-Options': 'nosniff', 'P3P': 'CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'Content-language': 'en', 'Vary': 'Accept-Encoding,Cookie,Authorization', 'Content-Encoding': 'gzip', 'Last-Modified': 'Fri, 31 Jan 2020 20:09:01 GMT', 'Backend-Timing': 'D=211902 t=1580501608833754', 'X-ATS-Timestamp': '1580501609', 'X-Varnish': '569435268 464972321', 'Age': '39789', 'X-Cache': 'cp2004 miss, cp2010 hit/166', 'X-Cache-Status': 'hit-front', 'Server-Timing': 'cache;desc="hit-front"', 'Strict-Transport-Security': 'max-age=106384710; includeSubDomains; preload', 'Set-Cookie': 'WMF-Last-Access=01-Feb-2020;Path=/;HttpOnly;secure;Expires=Wed, 04 Mar 2020 00:00:00 GMT, WMF-Last-Access-Global=01-Feb-2020;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Wed, 04 Mar 2020 00:00:00 GMT, GeoIP=US:TX:San_Antonio:29.42:-98.49:v4; Path=/; secure; Domain=.wikipedia.org', 'X-Client-IP': '13.84.209.100', 'Cache-Control': 'private, s-maxage=0, max-age=0, must-revalidate', 'Accept-Ranges': 'bytes', 'Content-Length': '110083', 'Connection': 'keep-alive'}
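Headers such as Content-Type pack several values into one string. As a hedged sketch, the standard library's email.message.Message class can split such a value into its media type and charset. The header value below is copied from the output above; with a live response it could be read from req1.headers["Content-Type"] instead.

```python
from email.message import Message

# Header value copied from the sample output above
content_type = "text/html; charset=UTF-8"

msg = Message()
msg["Content-Type"] = content_type

media_type = msg.get_content_type()    # media type without its parameters
charset = msg.get_content_charset()    # charset parameter, lower-cased

print(media_type, charset)
```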

We can also extract the response body as a decoded string with the text attribute of the object. The first line of code below stores this HTML content in the variable text_object. The second line prints it.

text_object = req1.text

print(text_object)

Output:

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Avengers: Endgame - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XjSKaApAADkAAIpdQNsAAAEX","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Avengers:_Endgame","wgTitle":"Avengers: Endgame","wgCurRevisionId":938381569,"wgRevisionId":938381569,"wgArticleId":44254295,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 uses Russian-language script (ru)","CS1 Russian-language sources (ru)","Wikipedia pages semi-protected against vandalism","Articles with short description","Use American English from October 2019",
"All Wikipedia articles written in American English","Use mdy dates from January 2020","Use list-defined references from October 2019","Pages using multiple image with manual scaled images","Articles with Encyclopædia Britannica links","Comics navigational boxes purge","2019 films","English-language films","2010s science fiction action films","2010s sequel films","2010s superhero films","2019 3D films","Alien invasions in films","Alternate timeline films","American 3D films","American films","American science fiction action films","American sequel films","Avengers (film series)","Crossover films","Films about extraterrestrial life","Films about quantum mechanics","Films about size change","Films about time travel","Films directed by Anthony and Joe Russo","Films featuring anthropomorphic characters","Films scored by Alan Silvestri","Films set in 1970","Films set in 2012","Films set in 2013","Films set in 2014","Films set in 2018","Films set in 2023","Films set in New Jersey",
"Films set in New York (state)","Films set in New York City","Films set in Norway","Films set in San Francisco","Films set in Tokyo","Films set in Wakanda","Films set in Africa","Films set in the 1940s","Films set on fictional planets","Films shot at Pinewood Atlanta Studios","Films shot in Atlanta","Films shot in County Durham","Films shot in New York (state)","Films shot in Scotland","Films using computer-generated imagery","Films with screenplays by Christopher Markus \u0026 Stephen McFeely","IMAX films","Intergalactic travel in fiction","Marvel Cinematic Universe films","Motion capture in film","Nanotechnology in fiction","Post-apocalyptic films","Sequel films"],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Avengers:_Endgame","wgRelevantArticleId":44254295,"wgIsProbablyEditable":!1,"wgRelevantPageIsProbablyEditable":!1,"wgRestrictionEdit":["autoconfirmed"],"wgRestrictionMove":["extendedconfirmed"],"wgMediaViewerOnClick":!0,
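The difference between content and text is visible in the two outputs shown so far: content shows the bytes Encyclop\xc3\xa6dia, while text shows the decoded string Encyclopædia. The sketch below reproduces that decoding step by hand on a short byte string taken from the output, assuming UTF-8 as declared in the page's meta charset tag.

```python
# A fragment of the raw bytes as they appear in req1.content
raw = b"Encyclop\xc3\xa6dia Britannica"

# Decoding as UTF-8 turns the two bytes \xc3\xa6 into the single character æ
decoded = raw.decode("utf-8")

print(decoded)
print(len(raw), len(decoded))
```

The byte string is one element longer than the decoded string because æ occupies two bytes in UTF-8.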

The Urllib Package

Another useful tool for retrieving web data is urllib, which is part of the Python standard library. It is not as popular as the Requests package, but is useful to know. Its urllib.request module provides urlopen() for performing a GET request and urlretrieve() for saving a resource directly to a local file.

The first line of code below specifies the web address. The second line uses the Request() class to package the request, while the third line sends it and captures the response with the function urlopen().

url = "https://en.wikipedia.org/wiki/Avengers:_Endgame"

req1 = Request(url)

response = urlopen(req1)
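A Request object can be inspected before it is sent, which is handy when attaching headers. The sketch below builds a request for the same URL with a User-Agent header. The header value is a made-up placeholder, not something the original code sets, and nothing is sent over the network until urlopen() is called.

```python
from urllib.request import Request

url = "https://en.wikipedia.org/wiki/Avengers:_Endgame"

# The User-Agent string here is a hypothetical placeholder
req = Request(url, headers={"User-Agent": "my-scraper/0.1"})

print(req.get_method())     # HTTP method that urlopen() would use
print(req.host)             # host part of the URL
print(req.header_items())   # headers attached to the request
```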

We can print the response object and its data type using the code below.

print(response)
print(type(response))

Output:

<http.client.HTTPResponse object at 0x7f0b261a5588>
<class 'http.client.HTTPResponse'>

Finally, we can read the body of the response, which is returned as bytes, using the lines of code below.

html_1 = response.read()

print(html_1)

Output:

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Avengers: Endgame - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XjSKaApAADkAAIpdQNsAAAEX","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Avengers:_Endgame","wgTitle":"Avengers: Endgame","wgCurRevisionId":938381569,"wgRevisionId":938381569,"wgArticleId":44254295,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 uses Russian-language script (ru)","CS1 Russian-language sources (ru)","Wikipedia pages semi-protected against vandalism","Articles with short description","Use American English from October 2019",\n"All Wikipedia articles written in American English","Use mdy dates from January 2020","Use list-defined references from October 2019","Pages using multiple image with manual scaled images","Articles with Encyclop\xc3\xa6dia Britannica links","Comics navigational boxes purge","2019 films","English-language films","2010s science fiction action films","2010s sequel films","2010s superhero films","2019 3D films","Alien invasions in films","Alternate timeline films","American 3D films","American films","American science fiction action films","American sequel films","Avengers (film series)","Crossover films","Films about extraterrestrial life","Films about quantum mechanics","Films about size change","Films about time travel","Films directed by Anthony and Joe Russo","Films featuring anthropomorphic characters","Films scored by Alan Silvestri","Films 
set in 1970","Films set in 2012","Films set in 2013","Films set in 2014","Films set in 2018","Films set in 2023","Films set in New Jersey",\n"Films set in New York (state)","Films set in New York City","Films set in Norway","Films set in San Francisco","Films set in Tokyo","Films set in Wakanda","Films set in Africa","Films set in the 1940s","Films set on fictional planets","Films shot at Pinewood Atlanta Studios","Films shot in Atlanta","Films shot in County Durham","Films shot in New York (state)","Films shot in Scotland","Films using computer-generated imagery","Films with screenplays by Christopher Markus \\u0026 Stephen McFeely","IMAX films","Intergalactic travel in fiction","Marvel Cinematic Universe films","Motion capture in film","Nanotechnology in fiction","Post-apocalyptic films","Sequel films"],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Avengers:_Endgame","wgRelevantArticleId":44254295,"wgIsProbablyEditable":!1,"wgRelevantPageIsProbablyEditable":!1,"wgRestrictionEdit":["autoconfirmed"],"wgRestrictionMove":["extendedconfirmed"],"wgMediaViewerOnClick":!0,\n"wgMediaViewerEnabledByDefault":!0,"wgPopupsReferencePreviews":!1,"wgPopupsConflictsWithNavPopupGadget":!1,"wgVisualEditor":{"pageLanguageCode":"en","pageLanguageDir":"ltr","pageVariantFallbacks":"en"},"wgMFDisplayWikibaseDescriptions":{"search":!0,"nearby":!0,"watchlist":!0,"tagline":!1},"wgWMESchemaEditAttemptStepOversample":!1,"wgULSCurrentAutonym":"English","wgNoticeProject":"wikipedia","wgWikibaseItemId":"Q23781155","wgCentralAuthMobileDomain":!1,"wgEditSubmitButtonLabelPublish":!0};RLSTATE={"ext.globalCssJs.user.styles":"ready","site.styles":"ready","noscript":"ready","user.styles":"ready","ext.globalCssJs.user":"ready","user":"ready"
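Because read() returns bytes, the body usually needs decoding before any further text processing. Here is a minimal sketch, assuming the server declares its charset in the Content-Type header and falling back to UTF-8 otherwise. The helper name decode_body is hypothetical.

```python
def decode_body(response):
    """Read an http.client.HTTPResponse and return its body as a string."""
    raw = response.read()
    # response.headers behaves like an email.message.Message, so the charset
    # declared in Content-Type is available via get_content_charset()
    charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset)

# Example usage (requires network access):
# from urllib.request import urlopen
# with urlopen("https://en.wikipedia.org/wiki/Avengers:_Endgame") as response:
#     html_text = decode_body(response)
```

Note that read() consumes the response, so the helper should be given a fresh response rather than one that has already been read.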
