In recent years, front-end frameworks like Angular, React, and Vue have exploded in popularity. Dynamically generated webpages can offer a faster user experience because the elements on the page are created and modified on the fly. These websites are great for users, but they can be problematic when we want to scrape data from them. The simplest way to scrape such sites is with an automated web browser, such as Selenium WebDriver, which can be controlled from several languages, including Python.
Selenium is a framework designed to automate tests for web applications. Through the Selenium Python API, you can access all the functionality of Selenium WebDriver in an intuitive way, and it provides convenient access to browser drivers such as ChromeDriver and Firefox's geckodriver.
In this guide, we will explore how to scrape a webpage with the help of Selenium WebDriver and BeautifulSoup. The example script will scrape authors and courses from pluralsight.com for a given keyword.
Selenium requires a driver to interface with the chosen browser; for example, Chrome needs ChromeDriver and Firefox needs geckodriver.
Make sure the driver is in a folder on your PATH (for Linux, place it in /usr/bin or /usr/local/bin). Alternatively, you can place the driver in a known location and provide its executable_path afterward.
Install the Selenium Python package, if it is not already installed.

pip install selenium

Also install BeautifulSoup and the lxml parser, which we will use to parse the page source.

pip install bs4
pip install lxml
Let's create a function to initialize the webdriver, adding some options such as headless mode. In the code below, I have created two different functions, one for Chrome and one for Firefox.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.firefox.options import Options as FirefoxOptions

# configure Chrome Webdriver
def configure_chrome_driver():
    # Add additional Options to the webdriver
    chrome_options = ChromeOptions()
    # add the argument and make the browser Headless.
    chrome_options.add_argument("--headless")
    # Instantiate the Webdriver: Mention the executable path of the webdriver you have downloaded
    # if driver is in PATH, no need to provide executable_path
    driver = webdriver.Chrome(executable_path="./chromedriver.exe", options=chrome_options)
    return driver

# configure Firefox Driver
def configure_firefox_driver():
    # Add additional Options to the webdriver
    firefox_options = FirefoxOptions()
    # add the argument and make the browser Headless.
    firefox_options.add_argument("--headless")

    # Instantiate the Webdriver: Mention the executable path of the webdriver you have downloaded
    # if driver is in PATH, no need to provide executable_path
    driver = webdriver.Firefox(executable_path="./geckodriver.exe", options=firefox_options)
    return driver
Headless browsers can work without displaying any graphical UI, which lets your script drive the browser entirely in the background. Selenium lets you make any browser headless by adding the --headless argument to its options. There are several other option parameters you can set for your Selenium webdriver; check out some Chrome WebDriver options here.
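For example, here is a sketch of a few other commonly used Chrome options; none of these flags is required for this guide, and they are shown purely for illustration:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions

chrome_options = ChromeOptions()
chrome_options.add_argument("--headless")                # run without a visible window
chrome_options.add_argument("--window-size=1920,1080")   # use a fixed viewport size
chrome_options.add_argument("--disable-gpu")             # commonly recommended for headless runs
chrome_options.add_argument("--incognito")               # start a private browsing session

# if chromedriver is not on PATH, point executable_path at the downloaded binary
driver = webdriver.Chrome(executable_path="./chromedriver.exe", options=chrome_options)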
Selenium offers a wide variety of functions to locate an element on a web page. For example, given the following HTML for a search form:
<div id="search-field">
    <input type="text" name="search-container" id="id_search_input" class="search_input" autocomplete="off">
    <input type="submit" class="search_submit btn btn-default">
</div>
element = driver.find_element_by_id("id_search_input")           # by id
element = driver.find_element_by_class_name("search_input")      # by class
element = driver.find_element_by_name("search-container")        # by name
element = driver.find_element_by_xpath("//input[@type='text']")  # by XPath
If the element is not found, a NoSuchElementException is raised. You can read more about strategies to locate elements here.
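For instance, a minimal sketch of guarding against a missing element, using the hypothetical search box from the snippet above:

from selenium.common.exceptions import NoSuchElementException

try:
    element = driver.find_element_by_id("id_search_input")
except NoSuchElementException:
    # the element is not present on the page; handle it gracefully
    element = None
    print("Search input not found on this page.")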
XPath is a powerful language often used in scraping the web. You can learn more about XPath here.
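For instance, here are a few XPath expressions written against the sample search form above; these are illustrations, not selectors taken from the live site:

# the submit button inside the search-field container
submit_btn = driver.find_element_by_xpath("//div[@id='search-field']/input[@type='submit']")

# an input whose name attribute equals 'search-container'
search_box = driver.find_element_by_xpath("//input[@name='search-container']")

# any element whose class attribute contains 'search_submit'
submit_btn = driver.find_element_by_xpath("//*[contains(@class, 'search_submit')]")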
Not only can you locate elements on the page, you can also fill in a form by sending keyboard input, add cookies, switch tabs, and more. You can read more about that here.
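For example, a minimal sketch of a few of these interactions, again using the hypothetical search form above (the cookie name and value are made up for illustration):

# type into an input box and submit its form
search_box = driver.find_element_by_id("id_search_input")
search_box.send_keys("Machine Learning")
search_box.submit()

# add a cookie to the current session (the domain must match the loaded page)
driver.add_cookie({"name": "example_cookie", "value": "example_value"})

# switch to the most recently opened tab/window
driver.switch_to.window(driver.window_handles[-1])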
Let's now see how to extract the required data from a web page. In the code below, we define two functions, getCourses and getAuthors, which print the courses and authors, respectively, for a given search keyword. Beautiful Soup makes it easy to traverse the DOM and scrape the data, so after navigating to the URL we transform the page source into a BeautifulSoup object. Before doing that, we can wait for the relevant element to load, and also load all the paginated content by clicking Load More again and again (uncomment loadAllContent(driver) to see this in action). After that, we can quickly get the required information from the page source using the select method.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

def getCourses(driver, search_keyword):
    # Step 1: Go to pluralsight.com
    driver.get(f"https://www.pluralsight.com/search?q={search_keyword}&categories=course")
    WebDriverWait(driver, 5).until(
        lambda s: s.find_element_by_id("search-results-category-target").is_displayed()
    )

    # Load all the page data, by clicking Load More button again and again
    # loadAllContent(driver)  # Uncomment me for loading all the content of the page

    # Step 2: Create a parse tree of page sources after searching
    soup = BeautifulSoup(driver.page_source, "lxml")

    # Step 3: Iterate over the search result and fetch the course
    for course_page in soup.select("div.search-results-page"):
        for course in course_page.select("div.search-result"):
            # selectors for the required information
            title_selector = "div.search-result__info div.search-result__title a"
            author_selector = "div.search-result__details div.search-result__author"
            level_selector = "div.search-result__details div.search-result__level"
            length_selector = "div.search-result__details div.search-result__length"
            print({
                "title": course.select_one(title_selector).text,
                "author": course.select_one(author_selector).text,
                "level": course.select_one(level_selector).text,
                "length": course.select_one(length_selector).text,
            })

# Driver code
# create the driver object.
driver = configure_chrome_driver()
search_keyword = "Machine Learning"
getCourses(driver, search_keyword)
# close the driver.
driver.close()
You can do the same thing in the getAuthors function.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

def getAuthors(driver, search_keyword):
    driver.get(f"https://www.pluralsight.com/search?q={search_keyword}&categories=aem-author")
    WebDriverWait(driver, 5).until(
        lambda s: s.find_element_by_id("author-list-target").is_displayed()
    )

    # Load all the page data, by clicking Load More button again and again
    # loadAllContent(driver)  # Uncomment me for loading all the content of the page

    # Step 1: Create a parse tree of page sources after searching
    soup = BeautifulSoup(driver.page_source, "lxml")
    # Step 2: Iterate over the search result and fetch the author
    for author_page in soup.select("div.author-list-page"):
        for author in author_page.select("div.columns"):
            # selectors for the required information
            author_name = "div.author-name"
            author_img = "div.author-list-thumbnail img"
            author_profile = "a.cludo-result"
            print({
                "name": author.select_one(author_name).text,
                "img": author.select_one(author_img)["src"],
                "profile": author.select_one(author_profile)["href"]
            })

# Driver code
# create the driver object.
driver = configure_chrome_driver()
search_keyword = "Machine Learning"
getAuthors(driver, search_keyword)
# close the driver.
driver.close()
Nowadays, most web pages use dynamic loading techniques such as AJAX. When the browser loads a page, the elements within it may appear at different time intervals, which makes locating an element difficult; sometimes the script even throws an ElementNotVisibleException.
Using waits, we can resolve this issue. There are two different types of waits: implicit and explicit. An explicit wait pauses execution until a specific condition occurs, whereas an implicit wait makes the driver poll the page for a fixed amount of time whenever it looks up an element. You can learn more here.
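A minimal sketch contrasting the two kinds of waits; the 10-second timeouts are arbitrary, and the element id is the one used in getCourses above:

from selenium.webdriver.support.ui import WebDriverWait

# implicit wait: every subsequent find_element call polls the DOM for up to 10 seconds
driver.implicitly_wait(10)

# explicit wait: block until this specific condition is met (or raise TimeoutException)
WebDriverWait(driver, 10).until(
    lambda d: d.find_element_by_id("search-results-category-target").is_displayed()
)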
So, for our example, I have used WebDriverWait, an explicit wait, to wait for an element to load.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

def loadAllContent(driver):
    # accept the cookie notification first, so it does not block the Load More button
    WebDriverWait(driver, 5).until(
        lambda s: s.find_element_by_class_name("cookie_notification").is_displayed()
    )
    driver.find_element_by_class_name("cookie_notification--opt_in").click()
    # keep clicking Load More until the button no longer appears
    while True:
        try:
            WebDriverWait(driver, 3).until(
                lambda s: s.find_element_by_id("search-results-section-load-more").is_displayed()
            )
        except TimeoutException:
            break
        driver.find_element_by_id("search-results-section-load-more").click()
Filling in a form on a web page generally involves setting values for text boxes, perhaps selecting options from a drop-down or radio control, and clicking a submit button. We have already seen how to locate an element; there are also methods for sending data to it and interacting with it, such as send_keys and click. Check out more on this here.
def login(driver, credentials):
    driver.get("https://app.pluralsight.com/")
    uname_element = driver.find_element_by_name("Username")
    uname_element.send_keys(credentials["username"])

    pwd_element = driver.find_element_by_name("Password")
    pwd_element.send_keys(credentials["password"])

    login_btn = driver.find_element_by_id("login")
    login_btn.click()
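As a usage sketch, you could call it like this; the credential values are placeholders, not real accounts:

driver = configure_chrome_driver()
credentials = {"username": "your-email@example.com", "password": "your-password"}
login(driver, credentials)
driver.close()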
Web scraping using Selenium and BeautifulSoup can be a handy tool in your bag of Python and data tricks, especially when you face dynamic pages and heavily JavaScript-rendered websites. This guide has covered only some aspects of Selenium and web scraping. To learn more about scraping advanced sites, please visit the official docs of Python Selenium.
If you want to dive deeper into web scraping, check out some of my published guides on Web scraping.
That's it from this guide. Keep scraping challenging sites. For more queries, feel free to ask me at Codealphabet.