The previous guide, Web Scraping with BeautifulSoup, explains the essential fundamentals of web scraping:
This process works well for static content, which is available by making a single HTTP request for the webpage. Dynamic websites, however, load their data from a data source (database, file, etc.) at runtime or require additional user actions on the page before the data appears.
To automate this process, the scraping script needs to interact with the browser and perform repetitive tasks like clicking, scrolling, and hovering, and Selenium is the perfect tool to automate web browser interactions.
Selenium is an automation testing framework for web applications/websites that can also control the browser to navigate a website just like a human. Selenium uses a web-driver package that takes control of the browser and mimics user actions to trigger the desired events. This guide explains the process of building a web scraping program that scrapes data and downloads files from Google Shopping Insights.
Google Shopping Insights loads its data at runtime, so any attempt to extract it using the `requests` package will be answered with an empty response.
Install Selenium using the `pip` command in the terminal:

```shell
pip install selenium
```
Download the web-driver executable for the target browser and add its location to the `PATH` variable of the operating system (only required in case of manual installation).
Safari 10 on OS X El Capitan and macOS Sierra has built-in support for the automation driver. This guide contains snippets for interacting with popular web-drivers, though Safari is used as the default browser throughout.
Let's get started by searching a product and downloading the CSV file(s) with the following steps:
Create a `webdriver` instance for the particular browser by importing it from the selenium module:

```python
from selenium import webdriver                   # Import module
from selenium.webdriver.common.keys import Keys  # For keyboard keys
import time                                      # Waiting function

URL = 'https://shopping.thinkwithgoogle.com'     # Define URL
browser = webdriver.Safari()                     # Create driver object, i.e. open the browser
```
By default, automation control is disabled in Safari, and it needs to be enabled for the automation environment; otherwise it will raise `SessionNotCreatedException`. So, enable the `Develop` option under the advanced settings in Safari preferences. Then open the `Develop` menu and select `Allow Remote Automation`.
Locate the search element using its `subjectInput` id. There is also a hidden input tag with the same id which is not required, so use `find_elements` to get the list of all elements matching the search criteria and use an index to access the visible one:

```python
browser.get('https://shopping.thinkwithgoogle.com')      # 1
time.sleep(2)                                            # 2
# find_elements gives the list of all elements with id subjectInput;
# index 0 picks the visible search box
search = browser.find_elements_by_id('subjectInput')[0]  # 3
search.send_keys('Google Pixel 3')                       # 4
time.sleep(2)                                            # 5
search.send_keys(Keys.ENTER)                             # 6
```
To use the Firefox and Chrome browsers, create browser instances with their corresponding methods:

```python
# Firefox
firefoxBrowser = webdriver.Firefox(executable_path=FIREFOX_GOCKO_DRIVER_PATH)
# Chrome
chromeBrowser = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH)
```
Then click the download button to download the CSV file(s):

```python
time.sleep(2)  # Wait for button to appear
# find_element_by_class_name does not accept compound class names,
# so use a CSS selector to match both classes
browser.find_element_by_css_selector('.si-button-data.download-all').click()
```
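Once the download completes, the file can be processed with Python's built-in `csv` module. Below is a minimal sketch; the sample data and column names (`Week`, `Interest`) are hypothetical stand-ins, since the actual layout of the Shopping Insights CSV may differ:

```python
import csv
import io

# Hypothetical sample standing in for a downloaded Shopping Insights CSV;
# replace the StringIO with open('downloaded_file.csv') for a real file.
sample = io.StringIO(
    "Week,Interest\n"
    "2019-01-06,42\n"
    "2019-01-13,57\n"
)

# DictReader maps each row to a dict keyed by the header row
rows = list(csv.DictReader(sample))
for row in rows:
    print(row["Week"], row["Interest"])
```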
To read the rendered data, locate the `ul` list tag by its `class` value and then fetch the text of all list items:

```python
# The class value is compound, so match it with a CSS selector
data = browser.find_element_by_css_selector('.content.content-breakpoint-gt-md')
dataList = data.find_elements_by_tag_name('li')
for item in dataList:
    text = item.text
    print(text)
```
Selenium offers a wide variety of functions to locate an element on the web page, for example:

```python
from selenium.webdriver.common.by import By  # Required for the By locators

search = browser.find_element(By.ID, 'subjectInput')
```
Use the overloaded versions of these functions to find all occurrences of a searched value. Just use `find_elements` instead of `find_element`:

```python
searchList = browser.find_elements(By.ID, 'subjectInput')
```
XPath is an expression path syntax for finding an object in the DOM. XPath has its own syntax to find a node from the root element, either via an absolute path or anywhere in the document using a relative path. Below is an explanation of XPath syntax with examples:
- `/`: Select a node from the root. `/html/body/div` will find the first `div` element under `body`.
- `//`: Select a node from anywhere in the document, relative to the current node. `//form` will find the first `form` element in the document.
- `[@attributename='value']`: Use this syntax to find a node with the required attribute value. `//input[@name='Email']` will find the first input element whose `name` is `Email`.
These expressions can be tried against the following sample document:

```html
<html>
  <body>
    <div class="content-login">
      <form id="loginForm">
        <div>
          <input type="text" name="Email" value="Email Address:">
          <input type="password" name="Password" value="Password:">
        </div>
        <button type="submit">Submit</button>
      </form>
    </div>
  </body>
</html>
```
Another simple way to get the XPath is via the browser's inspect element option. Just right-click on the desired node and choose the Copy XPath option.
Read more about XPath to combine multiple attributes or use the supported functions.
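The attribute-predicate syntax above can be tried offline with Python's built-in `xml.etree.ElementTree`, which supports a limited subset of XPath. The sketch below uses a well-formed version of the sample login form (the `input` tags are self-closed so the document parses as XML):

```python
import xml.etree.ElementTree as ET

# Well-formed version of the sample login form from above
doc = ET.fromstring(
    '<html><body><div class="content-login">'
    '<form id="loginForm"><div>'
    '<input type="text" name="Email" value="Email Address:"/>'
    '<input type="password" name="Password" value="Password:"/>'
    '</div><button type="submit">Submit</button></form>'
    '</div></body></html>'
)

# Relative path with an attribute predicate:
# find the input named Email anywhere in the document
email_input = doc.find(".//input[@name='Email']")
print(email_input.get("type"))  # → text

# Path walked step by step from the root element
form = doc.find("./body/div/form")
print(form.get("id"))  # → loginForm
```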
Headless or Invisible Browser: During the scraping process, any user action on the browser window can interrupt the flow and cause unexpected behavior. So, for scraping applications, it is crucial to avoid depending on a visible browser window. Headless browsers work without displaying any graphical UI, which lets the scraping application run uninterrupted while the user keeps working normally.
Some well-known headless browsers are PhantomJS and HTMLUnit. Browsers like Chrome and Firefox also support a headless mode, which can be enabled through their options:

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True        # newer webdriver versions
# options.set_headless(True)   # deprecated in newer webdriver versions

# Firefox
firefoxBrowser = webdriver.Firefox(options=options, executable_path=FIREFOX_GOCKO_DRIVER_PATH)
# Chrome
chromeBrowser = webdriver.Chrome(CHROMEDRIVER_PATH, chrome_options=options)
```
At the time of writing this guide, headless mode is not supported by Safari.
Selenium provides `WebDriverWait` as a smarter alternative to `time.sleep`. It will automatically check and proceed as soon as the searched element is visible:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

browser = webdriver.Safari()
browser.get('https://shopping.thinkwithgoogle.com')
try:
    # Proceed if the element is found within 3 seconds,
    # otherwise a TimeoutException is raised
    element = WebDriverWait(browser, 3).until(
        EC.presence_of_element_located((By.ID, 'Id_Of_Element'))
    )
except TimeoutException:
    print("Time out!")
```
Selenium offers `ActionChains` for mouse interactions like click, hold, hover, and drag-and-drop, and `TouchActions` for touch interactions like double-tap, long-press, flick, and scroll:

```python
# Perform hover and long-press
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.touch_actions import TouchActions

browser = webdriver.Firefox()
browser.get(URLString)

element_to_hover_over = browser.find_element_by_id("anyID")
ActionChains(browser).move_to_element(element_to_hover_over).perform()

scroll_up_arrow = browser.find_element_by_id("scroll_up")
TouchActions(browser).long_press(scroll_up_arrow).perform()  # perform() executes the chain
```
At the time of writing this guide, `TouchActions` is not supported by Safari.
The individual `driver.switch_to_*` methods have been deprecated, so use the `driver.switch_to` property instead (for example, `driver.switch_to.window(window_handle)` rather than `driver.switch_to_window(window_handle)`).
The code is available on GitHub for demonstration and practice. It also contains a few more use cases and optimized code.