Implementing Web Scraping with Selenium

Gaurav Singhal

  • Feb 15, 2020
  • 13 Min read
  • 799 Views
Data
Selenium

Introduction

In recent years, front-end frameworks like Angular, React, and Vue have exploded in popularity. Dynamically generated webpages can offer a faster user experience, since the elements on the page are created and modified on the fly. These websites are of great benefit to users, but can be problematic when we want to scrape data from them. The simplest way to scrape such websites is with an automated web browser, such as Selenium WebDriver, which can be controlled from several languages, including Python.

Selenium is a framework designed to automate tests for web applications. Through the Selenium Python API, you can access all the functionality of Selenium WebDriver intuitively. It provides a convenient way to drive browsers through drivers such as ChromeDriver for Chrome, geckodriver for Firefox, etc.

In this guide, we will explore how to scrape a webpage with the help of Selenium WebDriver and BeautifulSoup. We will demonstrate with an example script that scrapes authors and courses from pluralsight.com for a given keyword.

Installation

Download Driver

Selenium requires a driver to interface with the chosen browser, e.g., ChromeDriver for Chrome and geckodriver for Firefox. Download the driver that matches your browser and browser version.

Make sure the driver is in a PATH folder; on Linux, for example, place it in /usr/bin or /usr/local/bin. Alternatively, you can place the driver in a known location and provide the executable_path argument afterward.

Install required packages

Install the Selenium Python package, if it is not already installed.

```shell
pip install selenium
```

We will also use BeautifulSoup and the lxml parser to extract data from the page source, so install those too.

```shell
pip install bs4
pip install lxml
```

Initialize the Webdriver

Let's create a function to initialize the webdriver and add some options, such as running headless. In the code below, I have created two separate functions, for Chrome and Firefox respectively.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.firefox.options import Options as FirefoxOptions

# configure Chrome WebDriver
def configure_chrome_driver():
    # Add additional options to the webdriver
    chrome_options = ChromeOptions()
    # Add the argument to make the browser headless
    chrome_options.add_argument("--headless")
    # Instantiate the webdriver with the executable path of the driver you downloaded;
    # if the driver is in PATH, there is no need to provide executable_path
    driver = webdriver.Chrome(executable_path="./chromedriver.exe", options=chrome_options)
    return driver

# configure Firefox WebDriver
def configure_firefox_driver():
    # Add additional options to the webdriver
    firefox_options = FirefoxOptions()
    # Add the argument to make the browser headless
    firefox_options.add_argument("--headless")
    # Instantiate the webdriver with the executable path of the driver you downloaded;
    # if the driver is in PATH, there is no need to provide executable_path
    driver = webdriver.Firefox(executable_path="./geckodriver.exe", options=firefox_options)
    return driver
```

Making Browser Headless

Headless browsers run without displaying any graphical UI, which makes them faster and lets scraping scripts run on machines without a display, such as servers. Selenium can make any supported browser headless by adding the --headless option argument. There are several other option parameters you can set for your Selenium webdriver; check out the Chrome WebDriver options documentation for more.

Locating the Elements on the Page

Selenium offers a wide variety of functions to locate an element on a web page:

```html
<div id="search-field">
  <input type="text" name="search-container" id="id_search_input" class="search_input" autocomplete="off">
  <input type="submit" class="search_submit btn btn-default">
</div>
```

```python
element = driver.find_element_by_id("id_search_input")          # by id
element = driver.find_element_by_class_name("search_input")     # by class
element = driver.find_element_by_name("search-container")       # by name
element = driver.find_element_by_xpath("//input[@type='text']") # by xpath
```

If the element is not found, a NoSuchElementException is raised. You can read about more strategies to locate elements here.

XPath is a powerful language often used in scraping the web. You can learn more about XPath here.
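Since we install lxml anyway, you can experiment with XPath expressions locally before wiring them into Selenium. A small sketch against the same search-form markup shown above:

```python
from lxml import html

# the same search form markup used earlier in this guide
snippet = """
<div id="search-field">
  <input type="text" name="search-container" id="id_search_input" class="search_input">
  <input type="submit" class="search_submit btn btn-default">
</div>
"""

tree = html.fromstring(snippet)

# select the id attribute of every text input
ids = tree.xpath("//input[@type='text']/@id")
print(ids)  # ['id_search_input']
```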

Not only can you locate elements on the page, but you can also fill in a form by sending key input, add cookies, switch tabs, etc. You can read more about that here.

Data Extraction

Let's now see how to extract the required data from a web page. In the code below, we define two functions, getCourses and getAuthors, which print the courses and authors respectively for a given search keyword.

Beautiful Soup remains an excellent way to traverse the DOM and scrape the data, so after making a GET request to the URL, we transform the page source into a BeautifulSoup object. Before doing that, we wait for the element to load, and we can also load all the paginated content by clicking Load More again and again (uncomment the loadAllContent(driver) call to see this in action). After that, we can quickly get the required information from the page source using the select method.
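To see how select works in isolation, here is a sketch against a static fragment shaped like the search results. The markup is a simplified stand-in for illustration, not the page's real source.

```python
from bs4 import BeautifulSoup

# simplified stand-in for the markup returned by the search page
page_source = """
<div class="search-results-page">
  <div class="search-result">
    <div class="search-result__info">
      <div class="search-result__title"><a>Sample Course</a></div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(page_source, "lxml")

# select() takes a CSS selector and returns all matching tags
titles = [a.text for a in soup.select("div.search-result div.search-result__title a")]
print(titles)  # ['Sample Course']
```

In the functions below, the same select and select_one calls run against driver.page_source instead of a hard-coded string.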

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

def getCourses(driver, search_keyword):
    # Step 1: Go to pluralsight.com and search for the keyword
    driver.get(f"https://www.pluralsight.com/search?q={search_keyword}&categories=course")
    WebDriverWait(driver, 5).until(
        lambda s: s.find_element_by_id("search-results-category-target").is_displayed()
    )

    # Load all the page data by clicking the Load More button again and again
    # loadAllContent(driver) # Uncomment to load all the content of the page

    # Step 2: Create a parse tree of the page source after searching
    soup = BeautifulSoup(driver.page_source, "lxml")

    # Step 3: Iterate over the search results and fetch the courses
    for course_page in soup.select("div.search-results-page"):
        for course in course_page.select("div.search-result"):
            # selectors for the required information
            title_selector = "div.search-result__info div.search-result__title a"
            author_selector = "div.search-result__details div.search-result__author"
            level_selector = "div.search-result__details div.search-result__level"
            length_selector = "div.search-result__details div.search-result__length"
            print({
                "title": course.select_one(title_selector).text,
                "author": course.select_one(author_selector).text,
                "level": course.select_one(level_selector).text,
                "length": course.select_one(length_selector).text,
            })

# Driver code
# create the driver object
driver = configure_chrome_driver()
search_keyword = "Machine Learning"
getCourses(driver, search_keyword)
# close the driver
driver.close()
```

Similarly, you can do the same for the getAuthors function.

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

def getAuthors(driver, search_keyword):
    driver.get(f"https://www.pluralsight.com/search?q={search_keyword}&categories=aem-author")
    WebDriverWait(driver, 5).until(
        lambda s: s.find_element_by_id("author-list-target").is_displayed()
    )

    # Load all the page data by clicking the Load More button again and again
    # loadAllContent(driver) # Uncomment to load all the content of the page

    # Step 1: Create a parse tree of the page source after searching
    soup = BeautifulSoup(driver.page_source, "lxml")
    # Step 2: Iterate over the search results and fetch the authors
    for author_page in soup.select("div.author-list-page"):
        for author in author_page.select("div.columns"):
            # selectors for the required information
            author_name = "div.author-name"
            author_img = "div.author-list-thumbnail img"
            author_profile = "a.cludo-result"
            print({
                "name": author.select_one(author_name).text,
                "img": author.select_one(author_img)["src"],
                "profile": author.select_one(author_profile)["href"]
            })

# Driver code
# create the driver object
driver = configure_chrome_driver()
search_keyword = "Machine Learning"
getAuthors(driver, search_keyword)
# close the driver
driver.close()
```

Waits

Nowadays, most web pages use dynamic loading techniques such as AJAX. When a page is loaded by the browser, the elements within it may load at different time intervals, which makes locating an element difficult; sometimes the script throws an ElementNotVisibleException.

Using waits, we can resolve this issue. There are two different types of waits: implicit and explicit. An explicit wait pauses execution until a specific condition occurs, whereas an implicit wait makes the driver poll for a certain fixed amount of time whenever it looks up an element. You can learn more here.

So, for our example, I have used the WebDriverWait explicit method to wait for an element to load.

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

def loadAllContent(driver):
    # dismiss the cookie notification first, so it does not block clicks
    WebDriverWait(driver, 5).until(
        lambda s: s.find_element_by_class_name("cookie_notification").is_displayed()
    )
    driver.find_element_by_class_name("cookie_notification--opt_in").click()
    # keep clicking Load More until the button no longer appears
    while True:
        try:
            WebDriverWait(driver, 3).until(
                lambda s: s.find_element_by_id("search-results-section-load-more").is_displayed()
            )
        except TimeoutException:
            break
        driver.find_element_by_id("search-results-section-load-more").click()
```

Filling in Forms

Filling in a form on a web page generally involves setting values for text boxes, perhaps selecting options from a drop-down or radio control, and clicking a submit button. We have already seen how to identify elements, and Selenium provides methods such as send_keys to type into an input box and click to press a button.

Check out more on this here.

```python
def login(driver, credentials):
    driver.get("https://app.pluralsight.com/")
    uname_element = driver.find_element_by_name("Username")
    uname_element.send_keys(credentials["username"])

    pwd_element = driver.find_element_by_name("Password")
    pwd_element.send_keys(credentials["password"])

    login_btn = driver.find_element_by_id("login")
    login_btn.click()
```

Conclusion

Web scraping using Selenium and BeautifulSoup can be a handy tool in your Python and data toolkit, especially when you face dynamic pages and heavily JavaScript-rendered websites. This guide has covered only some aspects of Selenium and web scraping. To learn about scraping more advanced sites, please visit the official docs of Python Selenium.

If you want to dive deeper into web scraping, check out some of my published guides on Web scraping.

That's it from this guide. Keep scraping challenging sites. For more queries, feel free to ask me at Codealphabet.
