Over the past several years, front-end design methods and technologies for websites have developed greatly, and frameworks such as React, Angular, and Vue have become extremely popular. These frameworks enable front-end developers to work efficiently and make websites, and the webpages they serve, much more usable and appealing for the user. Dynamically generated webpages can offer a faster user experience because the elements on the page are created and modified in the browser. This contrasts with the more traditional method of server-based page generation, where the data and elements on a page are set once and a full round-trip to the web server is required to get the next piece of data to serve to the user. When we scrape websites, the traditional, simple, server-based ones are the easiest to work with because they are the most predictable and consistent.
Because dynamic websites build and change their content in the browser, it is difficult to use a simple HTTP agent to work with them, and we need a different approach. The simplest solution to scraping data from dynamic websites is to use an automated web browser, driven by a tool such as Selenium, which is controlled by a programming language such as Python. In this guide, we will explore an example of how to set up and use Selenium with Python for scraping dynamic websites, and some of the useful features available to us that are not easily achieved using more traditional scraping methods.
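To make the contrast concrete, here is a minimal sketch that uses only Python's standard library. The two HTML strings are invented for illustration and do not come from a real site: one is a server-rendered page with its data in the markup, the other is the kind of near-empty shell that a JavaScript framework fills in later. A plain HTTP agent only ever sees the raw markup, so it finds nothing to scrape in the second case.

```python
from html.parser import HTMLParser

# Hypothetical pages, made up for this illustration.
SERVER_RENDERED = """
<html><body>
  <ul id="prices">
    <li class="price">10.00</li>
    <li class="price">12.50</li>
  </ul>
</body></html>
"""

# A dynamic page often ships an empty container plus a script that
# fills it in later, inside the browser.
DYNAMIC_SHELL = """
<html><body>
  <div id="app"></div>
  <script src="/static/bundle.js"></script>
</body></html>
"""


class PriceCollector(HTMLParser):
    """Collects the text of every <li class="price"> element."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    """Return the prices visible in the raw markup, without running any JavaScript."""
    collector = PriceCollector()
    collector.feed(html)
    return collector.prices


print(extract_prices(SERVER_RENDERED))  # the data is present in the markup
print(extract_prices(DYNAMIC_SHELL))    # nothing to find until a browser runs the script
```

An automated browser closes this gap because it runs the page's scripts before we read the content.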
For this guide, we are going to use the ‘Selenium’ library to both GET and PARSE the data.
In general, once you have Python 3 installed correctly, you can install Selenium using the pip utility:
pip install -U selenium
You will also need a driver so that Selenium can control a browser; Chrome works well for this. Install ChromeDriver using the chromedriver-install pip wrapper:
pip install chromedriver-install
If pip is not installed, you will need to download and install it first.
For simple web-scraping, an interactive editor like Microsoft Visual Studio Code (free to download and use) is a great choice, and it works on Windows, Linux, and Mac.
After running the pip installs, we can start writing some code. One of the initial blocks of code checks whether ChromeDriver is installed and, if not, downloads everything required. I like to specify the folder that Chrome operates from, so I pass the download and install folder as an argument to the install library.
import chromedriver_install as cdi
path = cdi.install(file_directory='c:\\data\\chromedriver\\', verbose=True, chmod=True, overwrite=False, version=None)
print('Installed chromedriver to path: %s' % path)
The main body of the code then creates the ChromeDriver instance, pointing it at the folder I installed the driver to.
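from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys

# Selenium 4 takes the driver location through a Service object
driver = webdriver.Chrome(service=Service("c:\\data\\chromedriver\\chromedriver.exe"))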
Once this line executes, a Chrome window will appear on the desktop. We can hide this, but for our initial test purposes it's good to see what's happening. We direct the driver to open a webpage by calling the ‘get’ method, passing the address of the page we want to visit.
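The call described above might look like the following sketch. The URL `https://www.python.org` is an assumption based on the Python website search used in the next step, and `make_driver`, `open_page`, and `demo` are hypothetical helper names. The sketch also assumes a recent Selenium release that can locate a Chrome driver without being given a path (otherwise, pass the driver location as shown earlier). It also shows one way to hide the browser window, using Chrome's `--headless` argument.

```python
def make_driver(headless=False):
    """Start Chrome; headless=True hides the browser window."""
    # Imported here so open_page() below can be exercised without a browser.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless")
    # Assumption: Selenium can find a Chrome driver on its own.
    return webdriver.Chrome(options=options)


def open_page(driver, url):
    """Navigate to url and return the title of the loaded page."""
    driver.get(url)
    return driver.title


def demo():
    driver = make_driver(headless=False)
    try:
        # Assumed URL, chosen to match the search example that follows.
        print(open_page(driver, "https://www.python.org"))
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `demo()` should open Chrome, load the page, print its title, and close the browser again.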
The power of Selenium is that it allows ChromeDriver to do the heavy lifting while it acts as a virtual user, interacting with the webpage and sending your commands as required. To illustrate this, let's run a search on the Python website by adding some text to the search box. We first look for the element named ‘q’, which is the input box used to send the search to the website. We clear it, then send in the keyboard string ‘pycon’.
from selenium.webdriver.common.by import By

elem = driver.find_element(By.NAME, "q")  # the search input box
elem.clear()
elem.send_keys("pycon")
We can then virtually hit ‘enter/return’ by sending ‘key strokes’ to the inputbox – the webpage submits, and the search results are shown to us.
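A sketch of that step follows. `Keys.RETURN` from `selenium.webdriver.common.keys` represents the Enter key. The helper names `run_search` and `demo` and the python.org URL are illustrative assumptions, and the sketch assumes a Selenium release that can locate a Chrome driver on its own.

```python
def run_search(search_box, query, enter_key):
    """Clear the box, type the query, then press Enter to submit it."""
    search_box.clear()
    search_box.send_keys(query)
    search_box.send_keys(enter_key)


def demo():
    # Selenium imports live here so run_search() can be tried with a stand-in element.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Chrome()  # assumption: Selenium can locate a Chrome driver
    try:
        driver.get("https://www.python.org")  # assumed URL
        run_search(driver.find_element(By.NAME, "q"), "pycon", Keys.RETURN)
        print(driver.title)  # title of the page shown after the search
    finally:
        driver.quit()
```

Sending the Enter key to the input box mimics what a user does at the keyboard, so the page reacts as it would to a real visitor.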
Working with forms in Selenium is straightforward and combines what we have learned with some additional functionality. Filling in a form on a webpage generally involves setting values of text boxes, perhaps selecting options from a drop-down or radio control, and clicking on a submit button. We have already seen how to identify and send data into a text field. Locating and selecting an option control requires us to find the select element, iterate through its options, and click the one with the value we want.
In the following example we are searching a select control for the value ‘Ms’ and, when we find it, we are clicking it to select it:
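from selenium.webdriver.common.by import By

# Locate the <select> control, then click the option whose value is "Ms"
element = driver.find_element(By.XPATH, "//select[@name='Salutation']")
all_options = element.find_elements(By.TAG_NAME, "option")
for option in all_options:
    if option.get_attribute("value") == "Ms":
        option.click()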
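The same loop can be wrapped in a small reusable helper. This is a sketch rather than part of Selenium: `select_option_by_value` is a hypothetical name, and it relies only on the `get_attribute` and `click` methods that Selenium elements provide. It stops at the first match and reports whether anything was selected, which makes a missing option easier to notice.

```python
def select_option_by_value(options, wanted):
    """Click the first option whose value attribute equals wanted.

    Returns True if an option was clicked, False if none matched.
    """
    for option in options:
        if option.get_attribute("value") == wanted:
            option.click()
            return True  # stop at the first match
    return False
```

With the elements from the example above, it would be called as `select_option_by_value(all_options, "Ms")`. Selenium also ships a `Select` helper class in `selenium.webdriver.support.ui`, whose `select_by_value` method covers the same task.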
The final part of working with forms is knowing how to send the data in the form back to the server. This is achieved by either locating the submit button and sending a click event, or selecting any control within the form and calling ‘submit’ against that:
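from selenium.webdriver.common.by import By

# Option 1: click the submit button directly
driver.find_element(By.ID, "SubmitButton").click()
# Option 2: call submit() on any element inside the form
someElement = driver.find_element(By.NAME, "searchbox")
someElement.submit()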
One of the benefits of using Selenium is that you can take a screenshot of what the browser has rendered. This can be useful for debugging an issue and also for keeping a record of what the webpage looked like when it was scraped.
Taking a screenshot could not be easier. We call the ‘save_screenshot’ method and pass in a location and filename to save the image.
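As a sketch, the call can be wrapped so that each screenshot gets a timestamped filename, which helps when keeping a record of several scrapes. The helper names `screenshot_path` and `capture`, the folder, and the label are hypothetical. `save_screenshot` is the driver method named above, and it expects a filename ending in `.png`.

```python
from datetime import datetime
from pathlib import Path


def screenshot_path(folder, label, when):
    """Build a filename such as <folder>/<label>-20200102-030405.png."""
    stamp = when.strftime("%Y%m%d-%H%M%S")
    return Path(folder) / f"{label}-{stamp}.png"


def capture(driver, folder, label):
    """Save what the browser currently shows; return the file path, or None on failure."""
    target = screenshot_path(folder, label, datetime.now())
    target.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    saved = driver.save_screenshot(str(target))       # returns True on success
    return target if saved else None


# Example call, with a hypothetical folder and an existing driver:
# capture(driver, "c:\\data\\screenshots", "python-org")
```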
Web-scraping with Selenium can be a very useful tool in your bag of tricks, especially when faced with dynamic webpages. This guide has only scratched the surface; to learn more, please visit the Selenium website.