The previous guide, Web Scraping with BeautifulSoup, explains the essential fundamentals of web scraping with `beautifulsoup`. That process is suitable for static content, which is available by making an HTTP request to get the webpage content, but dynamic websites load their data from a data source (database, file, etc.) or require a few additional action events on the web page to load the data.
In order to automate this process, our scraping script needs to interact with the browser to perform repetitive tasks like clicking, scrolling, and hovering, and Selenium is the perfect tool to automate web browser interactions.
Selenium is an automation testing framework for web applications/websites that can also control the browser to navigate a website just like a human. Selenium uses a web-driver package that takes control of the browser and mimics user-oriented actions to trigger the desired events. This guide explains the process of building a web scraping program that scrapes data and downloads files from Google Shopping Insights.
Google Shopping Insights loads its data at runtime, so any attempt to extract data using the `requests` package will be met with an empty response. To get started, install Selenium with the `pip` command in the terminal:

```shell
pip install selenium
```
Next, download the web driver for your browser and add its location to the `PATH` variable of the operating system (only required in case of manual installation). Download the drivers from the official sites for Chrome, Firefox, and Edge. Opera drivers can also be downloaded from the Opera Chromium project hosted on GitHub.
Safari 10 on OS X El Capitan and macOS Sierra has built-in support for the automation driver. This guide contains snippets to interact with popular web drivers, though Safari is used as the default browser throughout.
Other browsers like UC, Netscape, etc. cannot be used for automation. The Selenium-RC (remote-control) tool can control browsers by injecting its own JavaScript code and can be used for UI testing.
Let's get started by searching for a product and downloading the CSV file(s) with the following steps:

First, create the `webdriver` instance for a particular browser by importing it from the `selenium` module:

```python
from selenium import webdriver                   # Import module
from selenium.webdriver.common.keys import Keys  # For keyboard keys
import time                                      # Waiting function

URL = 'https://shopping.thinkwithgoogle.com'     # Define URL
browser = webdriver.Safari()                     # Create driver object, i.e. open the browser
```
By default, automation control is disabled in Safari, and it needs to be enabled for the automation environment; otherwise Safari will raise `SessionNotCreatedException`. So, enable the Develop option under the advanced settings in Safari preferences. Then open the Develop menu and select Allow Remote Automation.
Next, locate the search element using its `subjectInput` id. There's also a hidden input tag with the same id which is not required, so use `find_elements` to get the list of all elements matching the search criteria and use the index to access the visible one. Then type the search term with `send_keys`:

```python
browser.get('https://shopping.thinkwithgoogle.com')      # 1. Open the page
time.sleep(2)                                            # 2. Wait for the page to load
search = browser.find_elements_by_id('subjectInput')[1]  # 3. Find the search input
# find_elements gives the list of all elements with id subjectInput
search.send_keys('Google Pixel 3')                       # 4. Type the search term
time.sleep(2)                                            # 5. Wait for suggestions
search.send_keys(Keys.ENTER)                             # 6. Submit the search
```
To use the Firefox and Chrome browsers, use their corresponding methods to create the browser instances:

```python
# Firefox
firefoxBrowser = webdriver.Firefox(executable_path=FIREFOX_GECKO_DRIVER_PATH)
# Chrome
chromeBrowser = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH)
```
Now click the download button, a `div` tag identified by its `si-button-data download-all` class value. Because that value contains a space (a compound class), locate it with a CSS selector rather than `find_element_by_class_name`:

```python
time.sleep(2)  # Wait for the button to appear
browser.find_element_by_css_selector('.si-button-data.download-all').click()
```
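Once the download completes, the file can be processed with Python's standard `csv` module. The snippet below is a minimal sketch; the column names (`week`, `interest`) and sample values are assumptions for illustration, not the actual layout of the downloaded report:

```python
import csv
import io

# Hypothetical sample mirroring a downloaded report (column names are assumed)
sample = io.StringIO("week,interest\n2019-01-06,55\n2019-01-13,62\n")

rows = list(csv.DictReader(sample))  # Parse rows into dicts keyed by the header row
for row in rows:
    print(row['week'], row['interest'])
```

For a real downloaded file, replace the `StringIO` object with `open('downloaded_report.csv', newline='')`.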
To scrape the data shown on the page, find the `ul` list tag by its class value (again a compound class, so a CSS selector is used) and then fetch the text of all its list items:

```python
data = browser.find_element_by_css_selector('.content.content-breakpoint-gt-md')
dataList = data.find_elements_by_tag_name('li')
for item in dataList:
    text = item.text
    print(text)
```
Selenium offers a wide variety of functions to locate an element on the web page, e.g. the generic `find_element` with a `By` locator (imported from `selenium.webdriver.common.by`):

```python
from selenium.webdriver.common.by import By

search = browser.find_element(By.ID, 'subjectInput')
```

Use the overloaded versions of the functions to find all occurrences of a searched value. Just use `elements` instead of `element`:

```python
searchList = browser.find_elements(By.ID, 'subjectInput')
```
XPath is an expression-path syntax for finding an object in the DOM. XPath has its own syntax to find a node from the root element, either via an absolute path or anywhere in the document using a relative path. Below is an explanation of the XPath syntax with examples:

- `/` : Selects a node from the root. `/html/body/div[1]` will find the first `div`.
- `//` : Selects a node from the current node. `//form[1]` will find the first `form` element.
- `[@attributename='value']` : Finds the node with the required attribute value. `//input[@name='Email']` will find the first input element whose name is `Email`.

These expressions can be tried against the following sample document:
```html
<html>
  <body>
    <div class="content-login">
      <form id="loginForm">
        <div>
          <input type="text" name="Email" value="Email Address:">
          <input type="password" name="Password" value="Password:">
        </div>
        <button type="submit">Submit</button>
      </form>
    </div>
  </body>
</html>
```
Another simple way to get the XPath is via the inspect element option: just right-click the desired node and choose the Copy XPath option.
Read more about XPath to combine multiple attributes or use the supported functions.
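The relative-path and attribute expressions above can also be tried outside the browser with Python's built-in `xml.etree.ElementTree`, which supports a subset of XPath. This sketch uses the sample login form above, converted to well-formed XML with self-closing input tags:

```python
import xml.etree.ElementTree as ET

html = """
<html>
  <body>
    <div class="content-login">
      <form id="loginForm">
        <div>
          <input type="text" name="Email" value="Email Address:"/>
          <input type="password" name="Password" value="Password:"/>
        </div>
        <button type="submit">Submit</button>
      </form>
    </div>
  </body>
</html>
"""

root = ET.fromstring(html)
# Relative path with an attribute predicate, like //input[@name='Email']
email_input = root.find(".//input[@name='Email']")
print(email_input.get('value'))  # Email Address:
```

Note that `ElementTree` only implements a limited XPath subset; a browser's XPath engine (or `lxml`) supports the full syntax.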
Headless or invisible browser: during the scraping process, any user action on the browser window can interrupt the flow and cause unexpected behavior. So, for scraping applications, it is crucial to avoid external dependencies such as a visible browser window. Headless browsers work without displaying any graphical UI, which lets the scraping application remain the single source of interaction and provides a smooth experience.
Some famous headless browsers are PhantomJS and HtmlUnit. Browsers like Chrome and Firefox also support a headless mode, which can be enabled through their options:

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True       # newer Selenium versions
# options.set_headless(True)  # older versions (deprecated)

# Firefox
firefoxBrowser = webdriver.Firefox(options=options, executable_path=FIREFOX_GECKO_DRIVER_PATH)
# Chrome
chromeBrowser = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH, options=options)
```
At the time of writing this guide, Headless mode is not supported by Safari.
For reliable waiting, use `WebDriverWait` instead of `time.sleep`. It will automatically poll and proceed as soon as the searched element is present:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

browser = webdriver.Safari()
browser.get('https://shopping.thinkwithgoogle.com')
try:  # proceed if the element is found within 3 seconds, otherwise raise TimeoutException
    element = WebDriverWait(browser, 3).until(
        EC.presence_of_element_located((By.ID, 'Id_Of_Element')))
except TimeoutException:
    print("Time out!")
```
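Under the hood, `WebDriverWait` simply polls a condition until it returns a truthy value or the timeout expires. The following is a minimal pure-Python sketch of that pattern (an illustration of the idea, not Selenium's actual implementation):

```python
import time

def wait_until(condition, timeout=3.0, poll_frequency=0.1):
    """Poll condition() until it returns a truthy value or the timeout expires."""
    end = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # Mirrors WebDriverWait.until returning the found value
        if time.monotonic() > end:
            raise TimeoutError("Time out!")  # Selenium raises TimeoutException instead
        time.sleep(poll_frequency)

print(wait_until(lambda: 'found'))  # → found
```

With Selenium, the condition would be a callable like `EC.presence_of_element_located`, which returns the element once it appears in the DOM.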
Use `ActionChains` for mouse interactions like click, hold, hover, and drag-and-drop, and `TouchActions` for touch interactions like double-tap, long-press, flick, and scroll:

```python
# Perform hover and long-press
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.touch_actions import TouchActions

browser = webdriver.Firefox()
browser.get(URLString)
element_to_hover_over = browser.find_element_by_id("anyID")
ActionChains(browser).move_to_element(element_to_hover_over).perform()
scroll_up_arrow = browser.find_element_by_id("scroll_up")
TouchActions(browser).long_press(scroll_up_arrow).perform()
```
At the time of writing this guide, `ActionChains` and `TouchActions` are not supported by Safari.
To handle alerts, windows, and frames, use the `switch_to` API: `browser.switch_to.alert`, `browser.switch_to.window(window_object_2)`, and `browser.switch_to.frame(frame_object_2)`. The older `driver.switch_to_window`-style methods have been deprecated, so use the `switch_to` property instead.
The code is available on GitHub for demonstration and practice. It also contains a few more use cases and optimized code.