Extracting Data from HTML with BeautifulSoup

Gaurav Singhal

  • Jun 11, 2019
  • 13 Min read
  • 13,521 Views

Data
BeautifulSoup

Introduction

Nowadays everyone is talking about data and how it helps uncover hidden patterns and new insights. The right set of data can help a business improve its marketing strategy and increase overall sales. And let's not forget the popular example of politicians gauging public opinion before elections. Data is powerful, but it does not come for free. Gathering the right data is always expensive; think of surveys, marketing campaigns, and so on.

The internet is a pool of data and, with the right set of skills, you can use this data to gain plenty of new information. You can always copy and paste the data into an Excel or CSV file, but that is time-consuming and expensive. Why not hire a software developer who can get the data into a readable format by writing some jibber-jabber? Yes, it is possible to extract data from the web, and this "jibber-jabber" is called web scraping.

According to Wikipedia, Web Scraping is:

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites

BeautifulSoup is a popular Python library for scraping data from the web. To get the best out of it, you only need a basic knowledge of HTML, which is covered in this guide.
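
To give a feel for the library before we dive in, here is a minimal sketch of BeautifulSoup in action on a tiny HTML string (installation is covered later in this guide):

from bs4 import BeautifulSoup

# A tiny HTML snippet to parse
html_doc = "<html><body><h1 class='heading'>Hello, BeautifulSoup!</h1></body></html>"

# Parse it with Python's built-in parser
soup = BeautifulSoup(html_doc, "html.parser")

# Grab the h1 tag and print its text
print(soup.h1.text)  # Hello, BeautifulSoup!
python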

Components of a Webpage

If you already know basic HTML, you can skip this part.

The basic syntax of any webpage is:

<!DOCTYPE html>  
<html>  
    <head>
        <meta charset="utf-8" />
        <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    </head>
    <body>
        <h1 class="heading">My first Web Scraping with Beautiful Soup</h1>
        <p>Let's scrape the website using Python.</p>
    </body>
</html>
html

Every tag in HTML can have attribute information (e.g., class, id, href, and other useful information) that helps identify the element uniquely.

For more information about basic HTML tags, check out w3schools.
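
As a quick illustration of reading those attributes with BeautifulSoup (which we install in the next section), the snippet below parses a made-up link and pulls out its class, id, and href:

from bs4 import BeautifulSoup

# A hypothetical anchor tag used purely for demonstration
snippet = '<a id="gdp-link" class="external" href="https://example.com/gdp">GDP data</a>'
tag = BeautifulSoup(snippet, "html.parser").a

print(tag.get("id"))     # gdp-link
print(tag.get("class"))  # ['external'] -- class can hold multiple values
print(tag.get("href"))   # https://example.com/gdp
python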

Steps for Scraping Any Website

To scrape a website using Python, you need to perform these four basic steps (a bare-bones skeleton follows the list):

  • Sending an HTTP GET request to the URL of the webpage that you want to scrape, which will respond with the page's HTML content. We can do this by using Python's requests library.

  • Fetching and parsing the data using BeautifulSoup and maintaining it in some data structure such as a dict or list.

  • Analyzing the HTML tags and their attributes, such as class, id, and other HTML attributes, to identify the tags where your content lives.

  • Outputting the data in a file format such as CSV, XLSX, or JSON.
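
A bare-bones skeleton of these four steps might look like the following; it also checks the HTTP status code, which the examples later in this guide omit for brevity:

import csv

import requests
from bs4 import BeautifulSoup

# Step 1: send an HTTP GET request to the page
response = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")
response.raise_for_status()  # stop early if the request failed

# Step 2: parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")

# Step 3: locate the content by tag and attributes
title = soup.find("title").text

# Step 4: output the data (here, a single-row CSV)
with open("page_title.csv", "w", newline="") as out_file:
    csv.writer(out_file).writerow([title])
python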

Understanding and Inspecting the Data

Now that you know about basic HTML and its tags, the first thing to do is inspect the page that you want to scrape. Inspection is the most important job in web scraping; without knowing the structure of the webpage, it is very hard to get the information you need. To help with inspection, every browser, such as Google Chrome or Mozilla Firefox, comes with a handy tool called developer tools.

In this guide, we will be working with Wikipedia to scrape some table data from the page List of countries by GDP (nominal). This page contains a Lists heading with three tables of countries sorted by rank and GDP value, as reported by the International Monetary Fund, the World Bank, and the United Nations. Note that these three tables are enclosed in an outer table.

To learn about any element that you wish to scrape, just right-click on it and choose Inspect to examine its tags and attributes.

Understanding the data

Jump into the Code

In this guide, we will learn how to do simple web scraping using Python and BeautifulSoup.

Install the Essential Python Libraries

pip3 install requests beautifulsoup4 lxml
shell

Note: If you are using Windows, use pip instead of pip3

Importing the Essential Libraries

Import the "requests" library to fetch the page content and bs4 (Beautiful Soup) for parsing the HTML page content.

from bs4 import BeautifulSoup
import requests
python

Collecting and Parsing a Webpage

In the next step, we will make a GET request to the URL and create a parse tree object (soup) with the help of BeautifulSoup and the "lxml" parser.

# importing the libraries
from bs4 import BeautifulSoup
import requests

url="https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the html content
soup = BeautifulSoup(html_content, "lxml")
print(soup.prettify()) # print the parsed data of html
python
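
The "lxml" parser is a third-party dependency (installed above alongside requests and beautifulsoup4). If you prefer to avoid the extra package, Python's built-in "html.parser" also works for this page; a minimal alternative:

# html.parser ships with Python, so no extra install is needed
soup = BeautifulSoup(html_content, "html.parser")
python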

With our BeautifulSoup object, soup, we can move on and collect the required table data.

Before going to the actual code, let's first play with the soup object and print some basic information from it:

Example 1:

Let's first print the title of the webpage.

print(soup.title)
python

It will give an output as follows:

<title>List of countries by GDP (nominal) - Wikipedia</title>

To get the text without the HTML tags, we just use .text:

print(soup.title.text)
python

Which results in:

List of countries by GDP (nominal) - Wikipedia

Example 2:

Now, let's get all the links on the page along with their attributes, such as href, title, and their inner text.

for link in soup.find_all("a"):
    print("Inner Text: {}".format(link.text))
    print("Title: {}".format(link.get("title")))
    print("href: {}".format(link.get("href")))
python

This will output all the available links from the page, along with the attributes mentioned above.
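
Since find_all returns every anchor tag on the page, that output is long. As a hypothetical refinement, you can filter the results in plain Python, for example keeping only internal Wikipedia article links:

for link in soup.find_all("a"):
    href = link.get("href")
    # keep only internal article links, e.g. /wiki/Gross_domestic_product
    if href and href.startswith("/wiki/"):
        print(link.text, "->", href)
python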

Now, let's get back on track and find our goal table.

Analyzing the outer table, we can see that it has the class wikitable and contains two tr tags inside its tbody.

Table element

If you expand the tr tags, you will find that the first tr tag holds the headings of all three tables and the second tr tag holds the table data for all three inner tables.

Let's first get all three table headings:

Note that we are removing the newlines and extra spaces from the left and right of the text by using simple string methods available in Python.

gdp_table = soup.find("table", attrs={"class": "wikitable"})
gdp_table_data = gdp_table.tbody.find_all("tr")  # contains 2 rows

# Get all the headings of Lists
headings = []
for td in gdp_table_data[0].find_all("td"):
    # remove any newlines and extra spaces from left and right
    headings.append(td.b.text.replace('\n', ' ').strip())

print(headings)
python

This will give an output as:

['Per the International Monetary Fund (2018)', 'Per the World Bank (2017)', 'Per the United Nations (2017)']

Moving on to the second tr tag of the outer table, let's get the content of all three tables by iterating over each table and its rows.

Per the International Monetary Fund table

data = {}
for table, heading in zip(gdp_table_data[1].find_all("table"), headings):
    # Get headers of table i.e., Rank, Country, GDP.
    t_headers = []
    for th in table.find_all("th"):
        # remove any newlines and extra spaces from left and right
        t_headers.append(th.text.replace('\n', ' ').strip())
    # Get all the rows of table
    table_data = []
    for tr in table.tbody.find_all("tr"): # find all tr's from table's tbody
        t_row = {}
        # Each table row is stored in the form of
        # t_row = {'Rank': '', 'Country/Territory': '', 'GDP(US$million)': ''}

        # find all td's(3) in tr and zip it with t_header
        for td, th in zip(tr.find_all("td"), t_headers): 
            t_row[th] = td.text.replace('\n', '').strip()
        table_data.append(t_row)

    # Put the data for the table under its heading.
    data[heading] = table_data

print(data)
python
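
Before writing anything to disk, it can help to sanity-check the structure we just built; for example, peek at the first few rows of the IMF table (the exact figures depend on the live page):

# Each value in `data` is a list of row dictionaries;
# the first entry is empty because the header row has no td cells
imf_rows = data[headings[0]]
print(headings[0], "->", len(imf_rows), "rows")
for row in imf_rows[:3]:
    print(row)
python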

Writing Data to CSV

Now that we have created our data structure, we can export it to a CSV file by just iterating over it.

import csv

for topic, table in data.items():
    # Create csv file for each table
    # newline='' avoids blank lines between rows on Windows
    with open(f"{topic}.csv", 'w', newline='') as out_file:
        # Each of the 3 tables has the same headers
        headers = [ 
            "Country/Territory",
            "GDP(US$million)",
            "Rank"
        ] # == t_headers
        writer = csv.DictWriter(out_file, headers)
        # write the header
        writer.writeheader()
        for row in table:
            if row:
                writer.writerow(row)
python

Per the International Monetary Fund csv table
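
As an optional alternative, if you already have pandas installed (pip install pandas), the same export can be written more compactly, since each table is just a list of dictionaries:

import pandas as pd

for topic, table in data.items():
    # skip the empty dicts produced by header-only rows, as above
    rows = [row for row in table if row]
    pd.DataFrame(rows).to_csv(f"{topic}.csv", index=False)
python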

Putting It Together

Let's join all the above code snippets.

Our complete code looks like this:

# importing the libraries
from bs4 import BeautifulSoup
import requests
import csv


# Step 1: Sending a HTTP request to a URL
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text


# Step 2: Parse the html content
soup = BeautifulSoup(html_content, "lxml")
# print(soup.prettify()) # print the parsed data of html


# Step 3: Analyze the HTML tag, where your content lives
# Create a data dictionary to store the data.
data = {}
#Get the table having the class wikitable
gdp_table = soup.find("table", attrs={"class": "wikitable"})
gdp_table_data = gdp_table.tbody.find_all("tr")  # contains 2 rows

# Get all the headings of Lists
headings = []
for td in gdp_table_data[0].find_all("td"):
    # remove any newlines and extra spaces from left and right
    headings.append(td.b.text.replace('\n', ' ').strip())

# Get all the 3 tables contained in "gdp_table"
for table, heading in zip(gdp_table_data[1].find_all("table"), headings):
    # Get headers of table i.e., Rank, Country, GDP.
    t_headers = []
    for th in table.find_all("th"):
        # remove any newlines and extra spaces from left and right
        t_headers.append(th.text.replace('\n', ' ').strip())
    
    # Get all the rows of table
    table_data = []
    for tr in table.tbody.find_all("tr"): # find all tr's from table's tbody
        t_row = {}
        # Each table row is stored in the form of
        # t_row = {'Rank': '', 'Country/Territory': '', 'GDP(US$million)': ''}

        # find all td's(3) in tr and zip it with t_header
        for td, th in zip(tr.find_all("td"), t_headers): 
            t_row[th] = td.text.replace('\n', '').strip()
        table_data.append(t_row)

    # Put the data for the table under its heading.
    data[heading] = table_data


# Step 4: Export the data to csv
"""
For this example let's create 3 seperate csv for 
3 tables respectively
"""
for topic, table in data.items():
    # Create csv file for each table
    # newline='' avoids blank lines between rows on Windows
    with open(f"{topic}.csv", 'w', newline='') as out_file:
        # Each of the 3 tables has the same headers
        headers = [ 
            "Country/Territory",
            "GDP(US$million)",
            "Rank"
        ] # == t_headers
        writer = csv.DictWriter(out_file, headers)
        # write the header
        writer.writeheader()
        for row in table:
            if row:
                writer.writerow(row)
python

Beware: Scraping Rules

Now that you have a basic idea of scraping with Python, it is important to understand the legality of web scraping before you start scraping a website. Generally, if you are using scraped data for personal use and do not plan to republish it, it may not cause any problems. Read the Terms of Use, Conditions of Use, and the robots.txt file before scraping a website. You must follow the robots.txt rules; otherwise, the website owner has every right to take legal action against you.
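
Python's standard library includes urllib.robotparser, which can check whether a given URL may be fetched under a site's robots.txt. A minimal sketch:

from urllib.robotparser import RobotFileParser

# Load and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
print(rp.can_fetch("*", url))  # True if the generic user agent may fetch this page
python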

Conclusion

This guide walked through the process of scraping a Wikipedia page using Python 3 and Beautiful Soup and exporting the result to CSV files. We have learned how to scrape a basic website and fetch all the useful data in just a couple of minutes.

You can continue to build on the art of scraping by moving on to new websites. Some good examples of data to scrape are:

  • Weather forecasts
  • Customer reviews and product pages
  • Stock Prices
  • Articles

Beautiful Soup is simple and effective for small-scale web scraping. If you want to scrape webpages on a large scale, consider more advanced tools like Scrapy and Selenium. You can read about these in more detail in other Pluralsight guides.
