Web scraping is the process of extracting specific information as structured data from HTML/XML content. Data scientists and researchers often need to fetch and extract data from numerous websites to create datasets, and to test or train algorithms, neural networks, and machine learning models. Usually, a website offers an API, which is the preferred way to fetch structured data. However, there are times when no API is available or you want to bypass a registration process. Under these circumstances, the data can only be accessed via the web page. A manual process can be quite cumbersome and time-consuming when dealing with dynamic data like stocks, job listings, hotel bookings, real estate, etc., which needs to be accessed frequently. Python offers an automated way, through various modules, to fetch the HTML content from the web (URL/URI) and extract data. This guide will elaborate on the process of web scraping using the beautifulsoup module.
The process of scraping includes the following steps:

1. Make a request with the requests module via a URL.
2. Retrieve the HTML content as text.
3. Examine the HTML structure of the page with the browser's developer tools to locate the elements holding the data. (In Safari, enable the developer option via Safari -> Preferences -> Advanced -> Show Develop menu in menu bar.)
4. Use BeautifulSoup to find the particular element in the response and extract its text.

Follow the Web Requests in Python guide to learn how to make web requests in Python.
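Here is a minimal sketch of these steps end to end, using the same Wikipedia page this guide scrapes later (step 3, inspecting the HTML, happens in the browser rather than in code):

import requests
from bs4 import BeautifulSoup

# Steps 1 and 2: make the request and retrieve the HTML content as text
response = requests.get('https://en.wikipedia.org/wiki/List_of_game_engines')

# Step 4: parse the HTML and extract a particular element's text
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.get_text())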
HTTP requests via a URL are responded to with an HTML webpage. HTML and XML are markup languages that define the formatting of text using tags. HTML content can also contain CSS instructions within a style tag to add various styles and decorations, which the browser interprets to apply formatting. Below is a common example of a typical HTML page:
<!DOCTYPE html>
<html>
  <head>
    <title>List of Game Engines</title>
    <style type="text/css">
      table, th, td {
        border: solid 2px black;
        border-collapse: collapse;
        border-color: lightgrey;
        padding: 4px;
      }
      table th {
        color: blue;
        font-size: 20px;
      }
      table td {
        color: green;
        font-size: 25px;
      }
    </style>
  </head>
  <body>
    <div>
      <table align="center">
        <tr>
          <th>Name</th>
          <th>Language</th>
          <th>Platform</th>
        </tr><br>
        <tr>
          <td>Unreal Engine</td>
          <td>C++</td>
          <td>Cross platform</td>
        </tr>
      </table>
    </div>
  </body>
</html>
Tags (<tag>) are written within angle brackets (<>); they can be paired (<title>) or unpaired (<br>).
The <!DOCTYPE html> declaration marks the page as HTML5, which supports some new tags like nav, header, etc.
The html tag, also known as the root tag, contains the HTML content.
head includes the CSS style code, JavaScript code, and meta tags.
CSS is used to decorate content and can be added using the style tag.
All the rendered HTML content should be placed inside the body tag.
div is used as a container to represent an area on the screen.
The table tag is used to render data in the form of a table: th is for bold heading columns, td is for data columns, and tr is for rows. There are two rows here, known as siblings.

CSS and JavaScript files can be created separately and linked to multiple HTML pages using link or script tags.
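Beautiful Soup (introduced below) can parse such HTML directly from a string, which is handy for experimenting with small documents. A quick sketch, where html_doc is a hypothetical trimmed-down version of the page above:

from bs4 import BeautifulSoup

# html_doc is a hypothetical, shortened version of the example page
html_doc = "<html><head><title>List of Game Engines</title></head><body></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.get_text())  # List of Game Engines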
Python has a rich collection of packages, and the pip tool is used to manage those packages in the current development environment. This guide will use the below modules:

beautifulsoup4: To parse HTML and extract data
requests: To make web requests

Use the below pip command to install the required packages:

pip install beautifulsoup4 requests

Use a space to separate multiple packages in a single install statement.
The first step of scraping is to get the data. For demonstration purposes, we will be using the Wikipedia List of game engines page. Let's open the page and view its structure using the inspect option.
This will bring up the developer tools window, which displays the HTML element structure. There is a div with the id bodyContent which contains all the visible HTML elements:

The table tag contains the details about the game engines.
Every tr represents an entry in the list and contains the column entries.
The cursor highlights the corresponding element on the web page; here it's highlighting the first column of the second row, i.e. the "4A Engine" heading.
The Node section provides the attributes of the selected node, like id, style, etc. Styles provides the details about the CSS code, and Layers provides the details about re-drawn content like images, etc.

The Attributes section displays the name of the class, i.e. wikitable sortable jquery-tablesorter, which is a customizable name given to a group of CSS style properties applied to this table. Now we know the specific HTML tags that contain the data, so let's jump straight into writing code.
The first step is to import the modules: BeautifulSoup for scraping and requests to make HTTP requests.

from bs4 import BeautifulSoup # BeautifulSoup is in the bs4 package
import requests
Make an HTTP request to get the HTML content via the specific URL.

URL = 'https://en.wikipedia.org/wiki/List_of_game_engines'
content = requests.get(URL)
Create a BeautifulSoup object and define the parser.

soup = BeautifulSoup(content.text, 'html.parser')
If no parser is defined, Beautiful Soup picks the best one available, preferring lxml, which is lenient and fast compared to html.parser; note, though, that lxml is a platform-dependent external package, while html.parser is part of the Python standard library. Parsers convert the input into single entities known as tokens and further convert the tokens into a tree structure for processing.
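A quick sketch of choosing the parser explicitly (the lxml line is commented out because it requires the external lxml package to be installed):

# html.parser ships with Python; no extra installation needed
soup = BeautifulSoup(content.text, 'html.parser')
# lxml is faster but must be installed separately (pip install lxml)
# soup = BeautifulSoup(content.text, 'lxml')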
BeautifulSoup can extract single or multiple occurrences of a specific tag and can also accept search criteria based on attributes:

Find: Use find to extract the first occurrence of a particular tag as:

row = soup.find('tr') # Extract and return first occurrence of tr
print(row) # Print row with HTML formatting
print("=========Text Result==========")
print(row.get_text()) # Print row as text
Find All: Use find_all to extract all the occurrences of a particular tag from the page response as:

rows = soup.find_all('tr')
for row in rows: # Print all occurrences
    print(row.get_text())
find_all returns a ResultSet object, which offers index-based access to the found occurrences and can be printed using a for loop.
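A short sketch of that index-based access:

rows = soup.find_all('tr')
print(len(rows)) # Number of tr occurrences found
print(rows[0].get_text()) # Index-based access to the first occurrence
print(rows[-1].get_text()) # ResultSet behaves like a list, so negative indices work too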
Pass List: find_all can accept a list of tags, as in soup.find_all(['th', 'td']), and parameters like id to find tags with a unique id and href to process tags with an href attribute as:

content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tags = soup.find_all(id = True, href = True)
Pass Function: A function can contain your customized logic to validate a tag and can be used as:

def isAnchorTagWithLargeText(tag):
    """ Validate that the tag is an anchor tag whose text length is greater than 50 """
    return tag.name == 'a' and len(tag.get_text()) > 50

content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tags = soup.find_all(isAnchorTagWithLargeText, limit = 10) # The function must be defined before use
for tag in tags:
    print(tag.get_text())
The result of find_all can also contain rows from other tables, so attributes like id, class, or value are used to further refine the search. Let's print the found tables to identify the attributes of the main content table as:

table = soup.find_all('table')
print(table)
The content table has a unique CSS class attribute, i.e. wikitable sortable, which can be used to find the main content table as:

contentTable = soup.find('table', { "class" : "wikitable sortable"}) # Use dictionary to pass key : value pair
rows = contentTable.find_all('tr')
for row in rows:
    print(row.get_text())
Here find is more suitable than find_all, since only one table has the wikitable sortable class property. Alternatively, the class_ attribute (not available in old versions) can be used, as in soup.find_all('table', class_ = "wikitable sortable").
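Both forms locate the same tag in the parse tree; a quick sketch to confirm (assuming the page from above is loaded into soup):

t1 = soup.find('table', { "class" : "wikitable sortable"})
t2 = soup.find('table', class_ = "wikitable sortable")
print(t1 is t2) # True: both return a reference to the same first matching tag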
Tags can also be located with a CSS selector path using the select method as:

print(soup.select("html head title")[0].get_text()) # List of game engines – Wikipedia
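select accepts any CSS selector, so classes and attribute filters work as well; a small sketch (the selector below assumes the Wikipedia page structure described earlier):

# Anchor tags carrying a title attribute inside the wikitable-classed table
links = soup.select("table.wikitable a[title]")
print(len(links))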
Regular expressions allow you to find specific tags by matching a pattern instead of the entire value of an attribute. Beautiful Soup can take regular expression objects to refine the search. Below is an example to find all the anchor tags with a title starting with Id Tech:

import re

contentTable = soup.find('table', { "class" : "wikitable sortable"})
rows = contentTable.find_all('a', title = re.compile('^Id Tech .*'))
print(rows)
for row in rows:
    print(row.get_text())
^ : Start matching from the beginning (otherwise the pattern can match anywhere, including the middle).
Id Tech : Match these exact characters.
.* : . matches any character except a line break ('\n'), and * keeps repeating that match.

Beautiful Soup offers functionality like limit, string, and recursive, which can be applied as:
Use limit = 2 to apply a limit on the result.
Use contentTable.find_all('a', string = 'Alamo') to extract all anchor tags with the text Alamo.
Use recursive = False to restrict the search to direct children only.

contentTable = soup.find('table', { "class" : "wikitable sortable"})
rows = contentTable.find_all('a', string = 'C', limit = 2
    #, recursive = False
    )
# Output: [<a href="/wiki/C_(programming_language)" title="C (programming language)">C</a>]
Beautiful Soup also allows you to mention tags as properties to find the first occurrence of a tag as:

content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
print(soup.head, soup.title)
print(soup.table.tr) # Print first row of the first table
Beautiful Soup also provides navigation properties like:

next_sibling and previous_sibling: To traverse tags at the same level, like tr or td within the same parent tag.
next_element and previous_element: To traverse the HTML elements one by one in parse order.

Multiple elements can also be traversed with next_siblings, previous_siblings, and next_elements, previous_elements.
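A minimal sketch of sibling navigation (note that next_sibling can return a whitespace text node between tags, so find_next_sibling is used to skip ahead to the next actual row):

contentTable = soup.find('table', { "class" : "wikitable sortable"})
firstRow = contentTable.find('tr')
print(repr(firstRow.next_sibling)) # Often a '\n' text node rather than a tag
secondRow = firstRow.find_next_sibling('tr') # Skips text nodes to the next tr tag
print(secondRow.get_text())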
The logic to extract the data usually depends upon the HTML structure of the webpage, so changes in the structure can break the logic.
The content of a website can be subject to applicable laws, so make sure to read the terms and conditions before scraping the content.
Use the prettify() method to print the formatted HTML response. The code for this script is available on GitHub for experimenting. It would be great to pick any content-based website and write your own script to scrape it. Happy Scraping!