Scraping Dynamic Web Pages with Python and Selenium

As a result of dynamic interaction and interface automation, it's difficult to use a simple http agent - we need an automated web-browser, such as Selenium.

By Allen O'Neill

May 17, 2019 • 10 Minute Read


Introduction

Over the past several years, front-end design methods and technologies for websites have developed greatly, and frameworks such as React, Angular, and Vue have become extremely popular. These frameworks enable front-end developers to work efficiently, and they make websites and the webpages they serve more usable and appealing for the user. Dynamically generated webpages can offer a faster user experience because the elements on the page are created and modified in the browser. This contrasts with the more traditional method of server-based page generation, where the data and elements on a page are set once and a full round-trip to the web server is required to get the next piece of data to serve to the user. The easiest websites to scrape are these traditional, server-generated ones, because they are the most predictable and consistent.

While dynamic websites are of great benefit to the end user and the developer, they can be problematic when we want to scrape data from them. In a dynamic webpage, much of the functionality happens in response to user actions and the execution of JavaScript code in the context of the browser. Data that is generated automatically, or that appears ‘on demand’ as a result of user interaction with the page, can be difficult to replicate programmatically at a low level – a browser is a pretty sophisticated piece of software after all!
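To make the problem concrete, here is a minimal sketch that uses only Python's standard library. The page source is a made-up example (it is not taken from any real site) in which a list is filled in by JavaScript. A plain parser never runs the script, so it only ever sees the empty list.

```python
from html.parser import HTMLParser

# Hypothetical page source: the <ul> stays empty until a browser runs the script.
PAGE_SOURCE = """
<html><body>
  <ul id="results"></ul>
  <script>
    var list = document.getElementById("results");
    ["first", "second", "third"].forEach(function (text) {
      var item = document.createElement("li");
      item.textContent = text;
      list.appendChild(item);
    });
  </script>
</body></html>
"""


class ListItemCounter(HTMLParser):
    """Counts <li> start tags found in the raw markup."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.count += 1


def count_list_items(html):
    """Return how many <li> elements appear in the unexecuted HTML."""
    parser = ListItemCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# The script is never executed here, so no <li> elements are found.
print(count_list_items(PAGE_SOURCE))
```

A real browser would run the script and show three list items, which is the gap that a browser automation tool closes.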

As a result of this level of dynamic interaction and interface automation, it is difficult to use a simple HTTP agent to work with these websites, and we need a different approach. The simplest solution to scraping data from dynamic websites is to use an automated web browser, driven by a tool such as Selenium and controlled from a programming language such as Python. In this guide, we will explore an example of how to set up and use Selenium with Python for scraping dynamic websites, and some of the useful features available to us that are not easily achieved using more traditional scraping methods.

Requirements

For this guide, we are going to use the ‘Selenium’ library to both GET and PARSE the data.

Prerequisites:

In general, once you have Python 3 installed correctly, you can download Selenium using the ‘PIP’ utility:

      pip install -U selenium

You will also need a browser driver for Selenium to control; the Chrome driver works well for this. You can install it using the chromedriver-install pip wrapper:

      pip install chromedriver-install

If Pip is not installed, you can download and install it here
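If you want to confirm that both packages installed correctly before going further, a small check along the following lines may help. This is only a sketch: it assumes Python 3.8 or later (for `importlib.metadata`), and the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report on the two packages installed with pip above.
for name in ("selenium", "chromedriver-install"):
    version = installed_version(name)
    print(name, version if version else "not installed")
```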

For simple web-scraping, an interactive editor like Microsoft Visual Studio Code (free to download and use) is a great choice, and it works on Windows, Linux, and Mac.

Getting Started Using Selenium

After running the pip installs, we can start writing some code. One of the initial blocks of code checks whether the Chromedriver is installed and, if not, downloads everything required. I like to specify the folder that the Chromedriver operates from, so I pass the download and install folder as an argument to the install library.

      import chromedriver_install as cdi
      path = cdi.install(file_directory='c:\\data\\chromedriver\\', verbose=True, chmod=True, overwrite=False, version=None)
      print('Installed chromedriver to path: %s' % path)
    

The main body of code is then called – this creates the Chromedriver instance, pointing the starting point to the folder I installed it to.

      from selenium import webdriver
      from selenium.webdriver.common.keys import Keys

      driver = webdriver.Chrome("c:\\data\\chromedriver\\chromedriver.exe")
    

Once this line executes, a version of Chrome will appear on the desktop – we can hide this, but for our initial test purposes it's good to see what's happening. We direct the driver to open a webpage by calling the ‘get’ method, with a parameter of the page we want to visit.

      driver.get("http://www.python.org")
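Hiding the browser window, as mentioned above, is usually done by starting Chrome in headless mode. The sketch below is one possible way to do that and is not part of the original walkthrough. It assumes Selenium 4, where the driver location is passed through a `Service` object; the helper names `chrome_arguments` and `make_driver` and the default window size are made up for illustration.

```python
def chrome_arguments(headless=True, window_size=(1280, 800)):
    """Build the list of Chrome command-line switches to pass to the driver."""
    arguments = []
    if headless:
        # Run Chrome without showing a window on the desktop.
        arguments.append("--headless")
    if window_size:
        width, height = window_size
        arguments.append("--window-size=%d,%d" % (width, height))
    return arguments


def make_driver(driver_path, headless=True):
    """Create a Chrome driver; driver_path is wherever chromedriver was installed."""
    # Imported here so chrome_arguments() can be used without Selenium present.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments(headless):
        options.add_argument(argument)
    # Assumption: Selenium 4 style, with the driver path wrapped in a Service.
    return webdriver.Chrome(service=Service(driver_path), options=options)
```

With the install folder used earlier, a call might look like `driver = make_driver("c:\\data\\chromedriver\\chromedriver.exe")`. Passing `headless=False` keeps the visible window, which is handy while debugging.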

The power of Selenium is that it allows the chrome-driver to do the heavy lifting while it acts as a virtual user, interacting with the webpage and sending your commands as required. To illustrate this, let's run a search on the Python website by adding some text to the search box. We first look for the element called ‘q’ – this is the “inputbox” used to send the search to the website. We clear it, then send in the keyboard string ‘pycon’.

      elem = driver.find_element_by_name("q")
      elem.clear()
      elem.send_keys("pycon")
    

We can then virtually hit ‘enter/return’ by sending ‘key strokes’ to the inputbox – the webpage submits, and the search results are shown to us.

      elem.send_keys(Keys.RETURN)
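The search steps above can be gathered into one reusable function. The following is a sketch rather than the author's code: the name `run_search` is made up, and it uses the generic `find_element(by, value)` call with the strategy string `"name"`, which is the value behind `By.NAME`.

```python
def run_search(driver, url, field_name, query, submit_key):
    """Open url, type query into the input named field_name, and submit it."""
    driver.get(url)
    # "name" is the locator strategy string behind By.NAME.
    box = driver.find_element("name", field_name)
    box.clear()
    box.send_keys(query)
    # The caller supplies the key to press, for example Keys.RETURN.
    box.send_keys(submit_key)
    # Report where the browser ended up after the search was submitted.
    return driver.current_url
```

For the example in this guide, the call would be `run_search(driver, "http://www.python.org", "q", "pycon", Keys.RETURN)`.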

Navigating Webpages Using Selenium

We have seen how simple it is to get up and running with Selenium. Next, we will look at how to navigate around a webpage, and indeed a full website, using navigation commands. As humans, when we want to carry out a task on a webpage, we identify what we want to do visually, such as drag and drop, scroll, click a button, etc. We then move the mouse and click, or use the keyboard, accordingly. Things are not that simple (yet!) with Selenium, so we need to give it a bit of assistance. In order to navigate around a webpage, we need to tell Selenium what objects on the page to interact with. We do this by identifying page elements with XPaths and then calling functions appropriate to the task we wish to carry out.

In the case of our first example, the search box, we did the following:

Tasked the driver to find a browser element named ‘q’.
Gave an instruction to send a series of characters to the element identified.
Gave an instruction to send the key command for ‘RETURN’.

This was the equivalent of us as humans, clicking into the search box, entering the search term, and hitting RETURN or ENTER on our keyboard.

The pattern of navigation in Selenium therefore is:

Identify the element you wish to interact with.
Interact as required (set some text, extract a value, send a keystroke, etc.).
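The "extract a value" case of this pattern can be sketched as follows. The function name `extract_texts` is made up for this example, and any selector passed to it is an assumption about the page being scraped. It uses the generic `find_elements(by, value)` call with the strategy string `"css selector"`, which is the value behind `By.CSS_SELECTOR`.

```python
def extract_texts(driver, css_selector):
    """Return the visible text of every element matching css_selector."""
    # Step 1: identify the elements we wish to interact with.
    elements = driver.find_elements("css selector", css_selector)
    # Step 2: interact with each one, here by reading its rendered text.
    texts = []
    for element in elements:
        text = element.text.strip()
        if text:  # skip elements that render no text
            texts.append(text)
    return texts
```

A call such as `extract_texts(driver, "h3 a")` would collect the link text inside every third-level heading; the selector `"h3 a"` is purely illustrative and would need to match the markup of the real page.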

Elements can be located using XPath with ‘driver.find_element_by_xpath’, or with higher-level methods such as ‘find_element_by_id’.

      <input type="text" name="searchbox" id="someUniqueId" />

      element = driver.find_element_by_id("someUniqueId")
      element = driver.find_element_by_name("searchbox")
      element = driver.find_element_by_xpath("//input[@id='someUniqueId']")
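A note for readers on newer releases: Selenium 4 deprecated the `find_element_by_*` helpers and later 4.x versions removed them, in favour of the generic `find_element(by, value)` call, which is also available in Selenium 3. The sketch below wraps that call. The `locate` helper and its short strategy names are made up for this example; the strings on the right-hand side are the values behind `By.ID`, `By.NAME`, `By.XPATH`, and `By.CSS_SELECTOR`.

```python
# Short names mapped to the strategy strings Selenium uses internally
# (the same values as By.ID, By.NAME, By.XPATH and By.CSS_SELECTOR).
STRATEGIES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "css": "css selector",
}


def locate(driver, how, value):
    """Find one element using a short strategy name such as 'id' or 'xpath'."""
    if how not in STRATEGIES:
        raise ValueError("unknown locator strategy: %r" % how)
    return driver.find_element(STRATEGIES[how], value)
```

The three lookups shown above would then read `locate(driver, "id", "someUniqueId")`, `locate(driver, "name", "searchbox")`, and `locate(driver, "xpath", "//input[@id='someUniqueId']")`.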
    

Sending interaction instructions, such as setting text, selecting a radio box, and hitting ‘RETURN’ (on the keyboard), can be achieved using the ‘send_keys’ method:

      element.send_keys("Set some text")

In addition to sending text, we can also send keystrokes, individually or combined, with the text.

      element.send_keys(Keys.RETURN)
      element.send_keys("Set text", Keys.ARROW_DOWN)
    

Working with Forms

Working with forms in Selenium is straightforward and combines what we have learned with some additional functionality. Filling in a form on a webpage generally involves setting values of text boxes, perhaps selecting options from a drop-box or radio control, and clicking on a submit button. We have already seen how to identify and send data into a text field. Locating and selecting an option control requires us to:

Locate the control.
Iterate through its options.
Set the option we want to choose as the ‘selected’ value.

In the following example, we search a select control for the value ‘Ms’ and, when we find it, we click it to select it:

      element = driver.find_element_by_xpath("//select[@name='Salutation']")
      all_options = element.find_elements_by_tag_name("option")
      for option in all_options:
          if option.get_attribute("value") == "Ms":
              option.click()
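The same loop can be packaged as a helper that stops at the first match and reports whether anything was selected. This is a sketch with a made-up name, `select_option_by_value`, and it uses the strategy string `"tag name"`, which is the value behind `By.TAG_NAME`. Selenium also ships a `Select` class in `selenium.webdriver.support.ui`, with a `select_by_value` method, that covers the same ground.

```python
def select_option_by_value(select_element, wanted_value):
    """Click the <option> whose value attribute equals wanted_value.

    Returns True if a matching option was found, otherwise False.
    """
    # "tag name" is the locator strategy string behind By.TAG_NAME.
    for option in select_element.find_elements("tag name", "option"):
        if option.get_attribute("value") == wanted_value:
            option.click()
            return True  # stop at the first match
    return False
```

For the control above, the call would be `select_option_by_value(element, "Ms")`. A return value of False is a useful signal that the page markup has changed.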
    

The final part of working with forms is knowing how to send the data in the form back to the server. This is achieved by either locating the submit button and sending a click event, or selecting any control within the form and calling ‘submit’ against that:

      driver.find_element_by_id("SubmitButton").click()

      someElement = driver.find_element_by_name("searchbox")
      someElement.submit()
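Putting the pieces together, filling in several text fields and submitting the form could look like the sketch below. The function name `fill_and_submit` and its parameters are made up for illustration, and the field names and button id you pass in depend entirely on the page being automated.

```python
def fill_and_submit(driver, field_values, submit_id=None):
    """Type each value into the input with the matching name, then submit."""
    last_field = None
    for name, value in field_values.items():
        last_field = driver.find_element("name", name)
        last_field.clear()
        last_field.send_keys(value)
    if submit_id is not None:
        # Option 1: locate the submit button and click it.
        driver.find_element("id", submit_id).click()
    elif last_field is not None:
        # Option 2: call submit() on a control inside the form.
        last_field.submit()
```

For example, `fill_and_submit(driver, {"searchbox": "pycon"}, submit_id="SubmitButton")` clicks the button, while leaving out `submit_id` falls back to calling ‘submit’ on the last field that was filled in.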
    

Smile! … Taking a Screenshot

One of the benefits of using Selenium is that you can take a screenshot of what the browser has rendered. This can be useful for debugging an issue and also for keeping a record of what the webpage looked like when it was scraped.

Taking a screenshot could not be easier. We call the ‘save_screenshot’ method and pass in a location and filename to save the image.

      driver.save_screenshot('WebsiteScreenShot.png')
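When screenshots are kept as a record of each scrape, a timestamp in the filename stops later runs from overwriting earlier ones. The following is a sketch under that assumption; the helper names `screenshot_path` and `capture` and the filename layout are made up for this example. It relies on ‘save_screenshot’ returning True when the file was written.

```python
from datetime import datetime
from pathlib import Path


def screenshot_path(directory, label, when=None):
    """Build a path such as <directory>/<label>-YYYYMMDD-HHMMSS.png."""
    when = when or datetime.now()
    stamp = when.strftime("%Y%m%d-%H%M%S")
    return Path(directory) / ("%s-%s.png" % (label, stamp))


def capture(driver, directory, label):
    """Save a timestamped screenshot and return its path, or None on failure."""
    target = screenshot_path(directory, label)
    # Create the output folder first so the save does not fail on a missing path.
    target.parent.mkdir(parents=True, exist_ok=True)
    saved = driver.save_screenshot(str(target))
    return target if saved else None
```

A call such as `capture(driver, "screenshots", "python-org")` would then write one new image per run into a `screenshots` folder.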

Conclusion

Scraping websites with Selenium can be a very useful tool in your bag of tricks, especially when faced with dynamic webpages. This guide has only scratched the surface – to learn more, please visit the Selenium website.

If you wish to learn more about web-scraping please consider the following courses Pluralsight has to offer:

Web Scraping: Python Data Playbook by Ian Ozsvald

Extracting Structured Data from the Web Using Scrapy by Janani Ravi

Allen O.

Allen is a consulting engineer with a background in enterprise systems. He runs his own company specializing in systems architecture, optimisation and scaling. He is also involved in a number of start-ups. Allen is a chartered engineer, a Fellow of the British Computing Society, a Microsoft MVP and insider, and both a CodeProject and C-SharpCorner MVP. He is a regular speaker at events both locally and internationally. His core technology interests are Big Data, Data Science and Machine Learning, combining these to create intelligent agents for the web. Allen has numerous qualifications including IT, Law and Training. He is also a ball throwing slave to his family dogs.
