Gaurav Singhal

Implementing Web Scraping with Selenium

  • Feb 15, 2020
  • 13 Min read
  • 10,576 Views

Introduction

In recent years, front-end frameworks such as Angular, React, and Vue have exploded in popularity. Pages built with them create and modify their elements dynamically in the browser, which can make for a faster user experience. That is a great benefit to users, but it is a problem when we want to scrape data, because the content is often missing from the initial HTML response. The simplest way to scrape these kinds of websites is to use an automated web browser, such as one driven by Selenium WebDriver, which can be controlled from several languages, including Python.

Selenium is a framework designed to automate tests for web applications. The Selenium Python API gives you access to all the functionality of Selenium WebDriver, and it provides a convenient way to drive browsers through drivers such as ChromeDriver and Firefox's geckodriver.

In this guide, we will explore how to scrape a webpage with the help of Selenium WebDriver and BeautifulSoup. An example script demonstrates the approach by scraping authors and courses from pluralsight.com for a given keyword.

Installation

Download Driver

Selenium requires a driver to interface with the chosen browser, for example ChromeDriver for Chrome or geckodriver for Firefox.

Make sure the driver is in a folder on your PATH; on Linux, for example, place it in /usr/bin or /usr/local/bin. Alternatively, place the driver in a known location and pass that location when you create the webdriver (executable_path in Selenium 3, a Service object in Selenium 4).
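One quick way to confirm that a driver is reachable before launching a browser is to ask Python where it would find the executable. The sketch below uses only the standard library; the helper name find_driver is hypothetical and is not part of Selenium.

```python
import shutil


def find_driver(name):
    """Return the full path of a driver executable found on PATH, or None."""
    # shutil.which searches the directories listed in the PATH
    # environment variable, much as a shell would.
    return shutil.which(name)


if __name__ == "__main__":
    for driver_name in ("chromedriver", "geckodriver"):
        location = find_driver(driver_name)
        if location is None:
            print(f"{driver_name}: not on PATH, pass its location explicitly")
        else:
            print(f"{driver_name}: found at {location}")
```

If a driver is reported as missing, either move it into a PATH folder or keep it where it is and give its location when creating the webdriver.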

Install required packages

Install the Selenium Python package, if it is not already installed, along with Beautiful Soup (bs4) and the lxml parser used later in this guide.

```shell
pip install selenium
pip install bs4
pip install lxml
```

Initialize the Webdriver

Let's create a function that initializes the webdriver with some options, such as headless mode. The code below defines two functions, one for Chrome and one for Firefox.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.firefox.service import Service as FirefoxService


def configure_chrome_driver():
    """Create a headless Chrome webdriver."""
    chrome_options = ChromeOptions()
    # Run the browser without a visible window.
    chrome_options.add_argument("--headless")
    # Point the Service at the driver you downloaded. If the driver is on
    # PATH, ChromeService() with no arguments is enough.
    # (Selenium 3 took this path as webdriver.Chrome(executable_path=...).)
    service = ChromeService(executable_path="./chromedriver.exe")
    return webdriver.Chrome(service=service, options=chrome_options)


def configure_firefox_driver():
    """Create a headless Firefox webdriver."""
    firefox_options = FirefoxOptions()
    # Run the browser without a visible window.
    firefox_options.add_argument("--headless")
    # Point the Service at the driver you downloaded. If the driver is on
    # PATH, FirefoxService() with no arguments is enough.
    service = FirefoxService(executable_path="./geckodriver.exe")
    return webdriver.Firefox(service=service, options=firefox_options)
```

Making Browser Headless

A headless browser runs without displaying any graphical UI, which makes it well suited to scripts and servers where nobody needs to watch the browser window. Selenium lets you run a browser headless by adding the --headless argument to its options. There are several other option parameters you can set for your Selenium webdriver; the Chrome WebDriver options documentation lists more.
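As a rough illustration of assembling such options in one place, the sketch below builds a list of Chrome command-line switches and applies them to any options object that has an add_argument method. The helper names build_chrome_arguments and apply_arguments are hypothetical; --headless and --window-size are Chrome switches.

```python
def build_chrome_arguments(headless=True, window_size=(1920, 1080)):
    """Return a list of Chrome command-line switches for a scraping session."""
    arguments = []
    if headless:
        arguments.append("--headless")
    if window_size is not None:
        width, height = window_size
        # A headless session has no visible window to resize,
        # so the size is set up front.
        arguments.append(f"--window-size={width},{height}")
    return arguments


def apply_arguments(options, arguments):
    """Add each switch to an options object that exposes add_argument()."""
    for argument in arguments:
        options.add_argument(argument)
    return options
```

With Selenium installed, apply_arguments(ChromeOptions(), build_chrome_arguments()) would return an options object ready to pass to webdriver.Chrome.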

Locating the Elements on the Page

Selenium offers several strategies for locating an element on a web page. Given the following markup, each call below locates the text input:

```html
<div id="search-field">
  <input type="text" name="search-container" id="id_search_input" class="search_input" autocomplete="off">
  <input type="submit" class="search_submit btn btn-default">
</div>
```

```python
from selenium.webdriver.common.by import By

# find_element(By..., value) works in Selenium 3 and 4; the older
# find_element_by_* helpers were removed in Selenium 4.
element = driver.find_element(By.ID, "id_search_input")  # by id
element = driver.find_element(By.CLASS_NAME, "search_input")  # by class
element = driver.find_element(By.NAME, "search-container")  # by name
element = driver.find_element(By.XPATH, "//input[@type='text']")  # by XPath
```

If the element is not found, a NoSuchElementException is raised. You can read more strategies to locate the element here.

XPath is a powerful language often used in scraping the web. You can learn more about XPath here.
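To get a feel for XPath without starting a browser, the sketch below runs the same kind of expression against the search form from above using Python's standard library. Two assumptions apply: the markup is rewritten as well-formed XML so the parser accepts it, and ElementTree supports only a subset of XPath, whereas Selenium hands the expression to the browser's own XPath engine.

```python
import xml.etree.ElementTree as ET

# The search form, written as well-formed XML (note the self-closing inputs).
SNIPPET = """
<div id="search-field">
  <input type="text" name="search-container" id="id_search_input" class="search_input" autocomplete="off" />
  <input type="submit" class="search_submit btn btn-default" />
</div>
"""

root = ET.fromstring(SNIPPET)

# .//input[@type='text'] means: any <input> below the root whose
# type attribute equals "text".
text_input = root.find(".//input[@type='text']")
print(text_input.get("id"))  # id_search_input

# Attribute predicates can target any attribute, for example name.
by_name = root.findall(".//input[@name='search-container']")
print(len(by_name))  # 1
```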

Beyond locating elements on the page, you can also fill in a form by sending key input, add cookies, switch tabs, and more. You can read more about that here.

Data Extraction

Let's now see how to extract the required data from a web page. In the code below, we define two functions, getCourses and getAuthors, which print the courses and authors, respectively, for a given search keyword.

Beautiful Soup is a convenient way to traverse the DOM and scrape the data, so after loading the URL in the browser, we will transform the page source into a BeautifulSoup object. Before doing that, we can wait for the target element to load, and also load all the paginated content by clicking Load More again and again (uncomment the loadAllContent(driver) call to see this in action). After that, we can quickly get the required information from the page source using the select method.

```python
from urllib.parse import quote_plus

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait


def getCourses(driver, search_keyword):
    # Step 1: Open the course search results for the keyword
    # (quote_plus encodes spaces and other special characters)
    driver.get(
        f"https://www.pluralsight.com/search?q={quote_plus(search_keyword)}&categories=course"
    )
    # Wait up to 5 seconds for the results container to be displayed
    WebDriverWait(driver, 5).until(
        lambda d: d.find_element(By.ID, "search-results-category-target").is_displayed()
    )

    # Load all the page data by clicking the Load More button again and again
    # loadAllContent(driver)  # Uncomment to load all the content of the page

    # Step 2: Create a parse tree of the page source after searching
    soup = BeautifulSoup(driver.page_source, "lxml")

    # Step 3: Iterate over the search results and print each course
    for course_page in soup.select("div.search-results-page"):
        for course in course_page.select("div.search-result"):
            # selectors for the required information
            title_selector = "div.search-result__info div.search-result__title a"
            author_selector = "div.search-result__details div.search-result__author"
            level_selector = "div.search-result__details div.search-result__level"
            length_selector = "div.search-result__details div.search-result__length"
            print({
                "title": course.select_one(title_selector).text,
                "author": course.select_one(author_selector).text,
                "level": course.select_one(level_selector).text,
                "length": course.select_one(length_selector).text,
            })


# Driver code: uses configure_chrome_driver() from the earlier section
driver = configure_chrome_driver()
search_keyword = "Machine Learning"
try:
    getCourses(driver, search_keyword)
finally:
    # quit() ends the browser session even if scraping fails
    driver.quit()
```
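One detail worth keeping in mind: select_one returns None when nothing matches, so calling .text on the result raises an AttributeError if a search result lacks one of the fields. The sketch below shows one way to guard against that. The markup is made up for illustration and the helper name text_or_none is hypothetical; it uses the built-in html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the second result has no level element.
HTML = """
<div class="search-result">
  <div class="search-result__title"><a href="#">Course A</a></div>
  <div class="search-result__level">Beginner</div>
</div>
<div class="search-result">
  <div class="search-result__title"><a href="#">Course B</a></div>
</div>
"""


def text_or_none(node, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    match = node.select_one(selector)
    return match.get_text(strip=True) if match is not None else None


soup = BeautifulSoup(HTML, "html.parser")
rows = [
    {
        "title": text_or_none(result, "div.search-result__title a"),
        "level": text_or_none(result, "div.search-result__level"),
    }
    for result in soup.select("div.search-result")
]
print(rows)
```

Collecting the rows in a list instead of printing them directly also makes it easier to post-process or save the results later.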

The getAuthors function follows the same pattern.

```python
from urllib.parse import quote_plus

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait


def getAuthors(driver, search_keyword):
    # Open the author search results for the keyword
    driver.get(
        f"https://www.pluralsight.com/search?q={quote_plus(search_keyword)}&categories=aem-author"
    )
    # Wait up to 5 seconds for the author list to be displayed
    WebDriverWait(driver, 5).until(
        lambda d: d.find_element(By.ID, "author-list-target").is_displayed()
    )

    # Load all the page data by clicking the Load More button again and again
    # loadAllContent(driver)  # Uncomment to load all the content of the page

    # Step 1: Create a parse tree of the page source after searching
    soup = BeautifulSoup(driver.page_source, "lxml")
    # Step 2: Iterate over the search results and print each author
    for author_page in soup.select("div.author-list-page"):
        for author in author_page.select("div.columns"):
            author_name = "div.author-name"
            author_img = "div.author-list-thumbnail img"
            author_profile = "a.cludo-result"
            print({
                "name": author.select_one(author_name).text,
                "img": author.select_one(author_img)["src"],
                "profile": author.select_one(author_profile)["href"],
            })


# Driver code: uses configure_chrome_driver() from the earlier section
driver = configure_chrome_driver()
search_keyword = "Machine Learning"
try:
    getAuthors(driver, search_keyword)
finally:
    # quit() ends the browser session even if scraping fails
    driver.quit()
```

Waits

Nowadays, most web pages use dynamic loading techniques such as AJAX. When the browser loads a page, its elements may appear at different times, which makes locating an element unreliable; a script that looks too early can fail with an exception such as ElementNotVisibleException.

Using waits, we can resolve this issue. There are two types of waits: implicit and explicit. An explicit wait blocks until a specific condition is met or a timeout expires. An implicit wait tells the driver to keep polling the DOM for up to a set amount of time whenever it tries to locate an element. You can learn more here.
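The idea behind an explicit wait can be sketched in a few lines of plain Python: call a condition repeatedly until it returns something truthy, and give up once a deadline passes. This is a simplified stand-in for illustration, not Selenium's implementation; the function name wait_until is hypothetical, and the real WebDriverWait additionally ignores NoSuchElementException while polling and raises its own TimeoutException.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Simulate an element that "appears" on the third check.
    checks = {"count": 0}

    def element_is_ready():
        checks["count"] += 1
        return checks["count"] >= 3

    wait_until(element_is_ready, timeout=2.0, poll_interval=0.1)
    print(f"ready after {checks['count']} checks")  # ready after 3 checks
```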

In our example, I have used an explicit wait, WebDriverWait, to wait for an element to load.

```python
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait


def loadAllContent(driver):
    # Wait for the cookie notification to show up, then accept it
    WebDriverWait(driver, 5).until(
        lambda d: d.find_element(By.CLASS_NAME, "cookie_notification").is_displayed()
    )
    driver.find_element(By.CLASS_NAME, "cookie_notification--opt_in").click()
    # Keep clicking Load More until it no longer appears within 3 seconds
    while True:
        try:
            WebDriverWait(driver, 3).until(
                lambda d: d.find_element(By.ID, "search-results-section-load-more").is_displayed()
            )
        except TimeoutException:
            break
        driver.find_element(By.ID, "search-results-section-load-more").click()
```

Filling in Forms

Filling in a form on a web page generally involves setting values for text boxes, perhaps selecting options from a drop-down or radio control, and clicking a submit button. We have already seen how to locate elements; methods such as send_keys and click then let you type data into an input box or press a button.

Check out more on this here.

```python
from selenium.webdriver.common.by import By


def login(driver, credentials):
    driver.get("https://app.pluralsight.com/")
    # Type the username into its input box
    uname_element = driver.find_element(By.NAME, "Username")
    uname_element.send_keys(credentials["username"])

    # Type the password into its input box
    pwd_element = driver.find_element(By.NAME, "Password")
    pwd_element.send_keys(credentials["password"])

    # Submit the form by clicking the login button
    login_btn = driver.find_element(By.ID, "login")
    login_btn.click()
```
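The login function expects a credentials dictionary with username and password keys. One way to build it without hardcoding secrets in the script is to read them from environment variables, as in the sketch below. The helper name load_credentials and the variable names PS_USERNAME and PS_PASSWORD are hypothetical, chosen only for this example.

```python
import os


def load_credentials(environ=os.environ):
    """Build the credentials dict expected by login() from environment variables."""
    # PS_USERNAME and PS_PASSWORD are hypothetical names chosen for this sketch.
    try:
        return {
            "username": environ["PS_USERNAME"],
            "password": environ["PS_PASSWORD"],
        }
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing} is not set") from None
```

With those variables set in your shell, login(driver, load_credentials()) would pass the values straight to the form.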

Conclusion

Web scraping with Selenium and BeautifulSoup is a handy addition to your Python and data toolkit, especially when you face dynamic pages and heavily JavaScript-rendered websites. This guide has covered only some aspects of Selenium and web scraping. To learn more about scraping advanced sites, see the official Selenium Python documentation.

If you want to dive deeper into web scraping, check out some of my published guides on Web scraping.

That's it for this guide. Keep scraping challenging sites. If you have questions, feel free to ask me at Codealphabet.