Data on websites has become a very rich source of information for many organizations, and the way to get that data is to ‘Scrape’ it. Web-scraping is an easy skill to get started with and a valuable tool in every developer’s toolbox. In this guide, you will learn how to scrape your first website with Python.
Opening a webpage in a browser is quick and easy, but the browser is doing lots of work behind the scenes. It gets the data from the web-server, parses it, and then displays the page to us - these are three discrete actions. When we web-scrape, we mostly only do the first two actions: we get the data and then we parse it. Getting the data involves connecting to the web server, requesting a specific file (usually HTML), and then downloading that file. Parsing is the technique used to examine the file we downloaded and extract information from it.
For this guide, we are going to use the Python ‘Requests’ library to GET the data, and the ‘lxml’ library to PARSE the HTML that we download. Both are straightforward to use and suitable for most web-scraping purposes.
In general, once you have Python 3 installed correctly, you can install requests and lxml using the ‘pip’ utility:
pip install requests
pip install lxml
If pip is not installed, you can download and install it from https://pypi.org/project/pip/
For simple web-scraping, a lightweight editor like Visual Studio Code (free to download and use) is a great choice, and it works on Windows, Linux, and Mac.
The webpage that we are going to test our skills on is a demo webpage for web scraping learning purposes.
The webpage is a product listing for a book. It contains the book title as the main heading, below the title is the author of the book, and below this is a simple table. The table contains two cells: the first holds an image of the book’s front cover, and the second a description of what the book is about. Together, these cells provide the product details a shopper would need to decide on a purchase.
The page uses very basic HTML and no external CSS styling; so, it’s a great example to start with. You can see, in the image below, how the webpage source HTML code relates to the rendered webpage above.
We will start by creating a new Python file. As with other languages, since we are using external libraries, we need to import them into the file using import statements. Here we import both lxml and requests.
from lxml import html, etree
import requests
To download the page, we simply need to ask the requests library to ‘get’ it. So we declare a variable, called ‘page’ in this example, and the result of the call to ‘get’ is loaded into this variable.
page = requests.get("http://www.howtowebscrape.com/examples/simplescrape1.html")
The variable page has several properties - the one we are interested in for this guide is ‘content’. This property contains the raw HTML of the page we are downloading, presented as a bytes object. We can print it out to view what we received:

print(page.content)
In its raw state, the content we have just received looks like this:
We can see the HTML tags there, but we also have extra ‘noise’ which, in this case, consists of escape characters (\r and \n) that indicate line breaks in the raw source. In theory, that’s it - we have completed the first stage of the web-scrape. Although a lot more goes on under the hood, the requests library makes it simple to download a webpage’s source HTML by simply issuing the ‘get’ call.
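Incidentally, those escape characters are just Python’s way of displaying line breaks inside a bytes object; they disappear once the bytes are decoded. Here is a small sketch using a made-up fragment (not the downloaded page) to illustrate:

```python
# A made-up raw response body, for illustration only.
raw = b"<html>\r\n<body>\r\n<h1>Title</h1>\r\n</body>\r\n</html>"

# Printing the bytes object shows the \r\n escape sequences...
print(raw)

# ...while decoding to text turns them into real line breaks.
print(raw.decode("utf-8"))
```

Either way, the parsing step that follows handles this whitespace for us, so there is no need to clean it up by hand.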
Once we have the raw data available to us, we use a parsing library to extract information from it. As the raw data is HTML, we can use the lxml library to unpack it into a structured element tree, which is easier to work with. We use the ‘html.fromstring’ function to convert the single big content string into a variable ‘extractedHtml’ - this removes the noise that we don’t want and exposes the data as an easily navigable HTML tree.
extractedHtml = html.fromstring(page.content)
The final part of the parsing process is identifying where the data we require actually sits in the HTML structure itself. As the parsed document consists of a tree of nodes, we can use the XPath syntax to identify the ‘route’ to the data that we want to extract. The title of the book, for example, is contained within the first ‘h1’ tag in the HTML file, therefore we can extract it using a path that shows the route from the top of the document down to the h1 node itself:
Let's see that in action - here is the code:
extractedHtml = html.fromstring(page.content)
bookTitle = extractedHtml.xpath("/html/body/center/h1")
print(bookTitle[0].text)

Note that xpath() always returns a list of matching nodes, so we take the first element with [0] before reading its text.
and the output:
Grokking Algorithms: An illustrated guide for programmers and other curious people
For most parsing tasks, you can use the XPath syntax to locate and extract data from your webpage source. You can learn more about XPath and how powerful it can be here.
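To get a feel for the syntax, here is a short, self-contained sketch of a few common XPath patterns with lxml. The HTML fragment is made up for illustration - it is not the demo page:

```python
from lxml import html

# A small, hypothetical HTML fragment for illustration only.
fragment = """
<html><body>
<h1>Sample Title</h1>
<span id="author">Jane Doe</span>
<img src="cover.jpg"/>
</body></html>
"""

tree = html.fromstring(fragment)

# An absolute path walks from the document root down to the node.
title = tree.xpath("/html/body/h1")[0].text          # 'Sample Title'

# '//' searches the whole document; [@id=...] filters by attribute.
author = tree.xpath("//span[@id='author']")[0].text  # 'Jane Doe'

# '@src' selects the attribute's value itself, returned as a string.
src = tree.xpath("//img/@src")[0]                    # 'cover.jpg'

print(title, author, src)
```

The absolute-path form is brittle (it breaks if the page layout changes), while the `//` form keeps working as long as the tag and attribute survive - a trade-off worth keeping in mind when you write your own expressions.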
Let's finish now by extracting all the pieces of information about the book we need.
bookTitle = extractedHtml.xpath("/html/body/center/h1")
author = extractedHtml.xpath("//span[@id='author']")
image = extractedHtml.xpath("//img/@src")
bookData = extractedHtml.xpath("/html/body/center/table/tr/td")
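Remember that each of these calls returns a list, so the values still need to be unpacked. Here is a runnable sketch of that last step, using an inline fragment that approximates the structure described earlier (the real page’s markup may differ in its details):

```python
from lxml import html

# An approximation of the demo page's structure, for illustration only.
page_source = """
<html><body><center>
<h1>Grokking Algorithms: An illustrated guide for programmers and other curious people</h1>
<span id="author">Aditya Bhargava</span>
<table><tr>
<td><img src="cover.jpg"/></td>
<td>An illustrated, friendly introduction to algorithms.</td>
</tr></table>
</center></body></html>
"""

extractedHtml = html.fromstring(page_source)

bookTitle = extractedHtml.xpath("/html/body/center/h1")
author = extractedHtml.xpath("//span[@id='author']")
image = extractedHtml.xpath("//img/@src")
bookData = extractedHtml.xpath("/html/body/center/table/tr/td")

print(bookTitle[0].text)   # the <h1> text
print(author[0].text)      # the author span's text
print(image[0])            # the image URL string (an attribute value)
print(bookData[1].text)    # the second table cell: the description
```

Note that the element results (bookTitle, author, bookData) need .text to read their contents, whereas the @src query already yields plain strings.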
Web-scraping is an important skill to have, especially for developers who work with data, as well as for business intelligence and data science professionals. This guide has given you a fast-track introduction to the basics. If you wish to learn more about the subject, please consider the following courses Pluralsight has to offer: