Getting started with Selenium WebDriver and Requests in Python


Web Scraping using Selenium WebDriver and Requests: A Beginner’s Guide

In this short tutorial, let’s look at the US Patent and Trademark Office (USPTO) website and scrape the patent database using a keyword search. We will use Selenium WebDriver to scrape the data. We will then use the Requests library to download the individual patent PDF documents.

Compiled by Srikar Kashyap Pulipaka

Last Updated: 13 June 2024

Part 1: Scraping the USPTO website for Patent Data

What is Selenium WebDriver?

Selenium WebDriver is a browser automation framework that lets you drive a real, fully featured browser from code. It is primarily used for automated testing of web applications, but it works just as well for web scraping. It is one of the components of the Selenium suite.

Installation and Setup

First, you need to install the Selenium Python package and ChromeDriver (the Chrome WebDriver executable). You can install Selenium using the following command:

pip install selenium

You can download ChromeDriver from the official downloads page: https://chromedriver.chromium.org/downloads

Note: The ChromeDriver version should match the version of Chrome installed on your system. You can check your Chrome version by navigating to chrome://settings/help.

Once you download the Chrome WebDriver, extract the file and place it in the same directory as your Python script.
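
If you are using Selenium 4, you can also point Selenium at the downloaded driver explicitly instead of relying on it being picked up from the working directory. A minimal sketch, assuming the extracted chromedriver executable sits next to your script (recent Selenium releases can also fetch a matching driver automatically via Selenium Manager):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Assumption: the extracted chromedriver binary is in the same directory as this script
service = Service(executable_path="./chromedriver")
driver = webdriver.Chrome(service=service)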

Importing the necessary libraries

We start by importing the necessary libraries. We will use Selenium WebDriver to scrape the data and the Requests library to download the PDF documents.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd

Keyword Definition

Let’s define the keyword to be used to search the USPTO database. In this case, we will use the keyword “semiconductor”.

keyword = "semiconductor"

Initializing the WebDriver and Navigating to the USPTO Website

We will initialize the WebDriver and navigate to the USPTO Basic Search page. We then type the keyword into the search box and press Enter to submit the search.

driver = webdriver.Chrome()
driver.get("https://ppubs.uspto.gov/pubwebapp/static/pages/ppubsbasic.html")
driver.find_element(By.ID, "searchText1").send_keys(keyword)
driver.find_element(By.ID, "searchText1").send_keys(Keys.RETURN)

Scraping and Saving the Data

We will scrape the data from the search results and collect it into a pandas DataFrame, then save the DataFrame to a CSV file. For this tutorial, we only scrape the first few pages of results (the loop below stops once the page counter exceeds 5). You can eventually run it for all the pages by removing the count condition.

master_df = pd.DataFrame()
count = 0
while True:
    time.sleep(3)
    df = pd.read_html(str(driver.page_source))[0]
    master_df = pd.concat([master_df, df])
    time.sleep(2)
    try:
        driver.find_element("id","paginationNextItem").click()
    except Exception as e:
        print(e)
        break
    print('Size of collection so far:', master_df.shape[0])
    count += 1
    if count > 5:
        break
master_df.reset_index(drop=True, inplace=True)
master_df.to_csv('patents.csv', index=False)

Let’s look at what happens in this piece of code:

  1. We initialize a master DataFrame using the pandas library. This DataFrame will store the data from all the pages.
  2. We also initialize a count variable to keep track of the number of pages we have scraped.
  3. We start a while loop that runs until count exceeds 5 (or until there are no more result pages). This loop scrapes the data from each page and appends it to the master DataFrame.
  4. We sleep for 3 seconds to give the page enough time to render its JavaScript elements. Adjust this delay to your internet speed and the complexity of the website (a more robust alternative using explicit waits is sketched after this list).
  5. We then pass the page source to pandas' read_html function, which parses every table in the HTML and returns a list of DataFrames. The results table is the first table on the page, so we select the first element of the list.
  6. We concatenate this new DataFrame with the master DataFrame.
  7. We try to click the next-page button. If the element is not present, Selenium raises an exception; we print it and break out of the loop. This is how we know we have reached the end of the search results.
  8. We increment the count variable by 1.
  9. Once count exceeds 5, we break out of the loop (comment out this condition if you want to scrape all the pages).
  10. We reset the index of the master DataFrame, since the index is duplicated after each concatenation.
  11. We save the master DataFrame to a CSV file.
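
As a sketch of the more robust waiting strategy mentioned in step 4, Selenium's explicit waits block until a condition is met instead of sleeping for a fixed time. The element ID below is a placeholder; inspect the results page to find the real one. The snippet reuses the driver created earlier:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for the results table to appear before reading the page source.
# "searchResults" is an assumed ID, used here only for illustration.
wait = WebDriverWait(driver, 15)
wait.until(EC.presence_of_element_located((By.ID, "searchResults")))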

Final Code (Run Only This Code)

# import the required libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd

# set the keyword
keyword = "semiconductor"

# open the browser, navigate to the website, enter the keyword, and press Enter to search
driver = webdriver.Chrome()
driver.get("https://ppubs.uspto.gov/pubwebapp/static/pages/ppubsbasic.html")
driver.find_element(By.ID, "searchText1").send_keys(keyword)
driver.find_element(By.ID, "searchText1").send_keys(Keys.RETURN)

# create an empty dataframe to store the data
master_df = pd.DataFrame()

# loop through the pages and extract the data
count = 0

while True:
    # wait for the page to load
    time.sleep(3)
    # read the data from the page and append it to the master DataFrame
    df = pd.read_html(str(driver.page_source))[0]
    master_df = pd.concat([master_df, df])
    time.sleep(2)
    try:
        driver.find_element("id","paginationNextItem").click()
    except Exception as e:
        print(e)
        break
    print('Size of collection so far:', master_df.shape[0])
    count += 1
    if count > 5:
        break
driver.quit()
master_df.reset_index(drop=True, inplace=True)
master_df.to_csv('patents.csv', index=False)
Size of collection so far: 50
Size of collection so far: 100
Size of collection so far: 150
Size of collection so far: 200
Size of collection so far: 250
Size of collection so far: 300

Let’s have a look at the data we collected.

master_df.head()
   Result #  Document/Patent number  Display      Title                                               Inventor name                     Publication date  Pages
0         1  US-20240193696-A1       Preview PDF  PROACTIVE WEATHER EVENT COMMUNICATION SYSTEM A...  Wyatt; Amber et al.               2024-06-13           21
1         2  US-20240188889-A1       Preview PDF  FLASH LED AND HEART RATE MONITOR LED INTEGRATI...  Tankiewicz; Szymon Michal et al.  2024-06-13           28
2         3  US-20240193523-A1       Preview PDF  VIRTUAL CAREER MENTOR THAT CONSIDERS SKILLS AN...  O'Donncha; Fearghal et al.        2024-06-13           16
3         4  US-20240193519-A1       Preview PDF  SYSTEMS AND METHODS FOR SYSTEM-WIDE GRANULAR A...  Holovacs; Jeremy                  2024-06-13           34
4         5  US-20240190459-A1       Preview PDF  METHODS AND SYSTEMS FOR VEHICLE CONTROL UNDER ...  Mamchuk; Tetyana V. et al.        2024-06-13           23

Part 2: Downloading the PDF Patent Documents using the Requests Library

Now that we have the patent data, we can use the following code to download the patents in PDF format. We will use the Requests library to fetch the PDF documents with simple HTTP requests.

It turns out (to our advantage) that the USPTO website stores the PDF patent documents at a predictable URL. The URL has the format https://ppubs.uspto.gov/dirsearch-public/print/downloadPdf/XXXXXXXX, where XXXXXXXX is the publication number (the middle portion of the document number, e.g. 20240193696 for US-20240193696-A1). We can use the Requests library to download the PDFs of the patents we are interested in.

import requests

# base URL for the PDF download endpoint; the publication number fills the placeholder
url = "https://ppubs.uspto.gov/dirsearch-public/print/downloadPdf/{}"

# sample the first 4 rows
sample = master_df.head(4)
for i, row in sample.iterrows():
    # "US-20240193696-A1" -> "20240193696"
    patent_number = row["Document/Patent number"].split("-")[1]
    formatted_url = url.format(patent_number)
    response = requests.get(formatted_url)
    # write the response body to disk in binary mode, named after the document number
    with open(f"{row['Document/Patent number']}.pdf", "wb") as f:
        f.write(response.content)
    print(f"Downloaded {row['Document/Patent number']}.pdf")
Downloaded US-20240193696-A1.pdf
Downloaded US-20240188889-A1.pdf
Downloaded US-20240193523-A1.pdf
Downloaded US-20240193519-A1.pdf

Let’s break down the code:

  1. We first import the necessary library: requests.
  2. We take the first four rows of the master_df DataFrame as a sample.
  3. We iterate over each row in the sample DataFrame.
  4. We extract the publication number from the "Document/Patent number" column. Since the document number has the format "US-XXXXXXXXXXX-XX" (e.g., US-20240193696-A1), we split the text on the "-" character and take the middle part.
  5. We format the URL using the extracted publication number.
  6. We make a GET request to the formatted URL (a slightly more defensive version of this request is sketched after this list).
  7. We write the content of the response to a PDF file named after the document number, opening the file in write-binary ("wb") mode since a PDF is a binary file.
  8. We print a message indicating that the PDF file has been downloaded.
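
For longer runs, it can help to make the request a little more defensive, as mentioned in step 6. Below is a minimal sketch; the download_pdf helper, the timeout, and the status check are illustrative additions, not part of the original script:

import requests

def download_pdf(document_number, timeout=30):
    """Download one patent PDF; returns True on success. (Illustrative helper.)"""
    # "US-20240193696-A1" -> "20240193696"
    publication_number = document_number.split("-")[1]
    pdf_url = f"https://ppubs.uspto.gov/dirsearch-public/print/downloadPdf/{publication_number}"
    response = requests.get(pdf_url, timeout=timeout)
    if not response.ok:
        # skip documents that fail instead of crashing the whole loop
        print(f"Failed to download {pdf_url}: HTTP {response.status_code}")
        return False
    with open(f"{document_number}.pdf", "wb") as f:
        f.write(response.content)
    return True

# example usage
download_pdf("US-20240193696-A1")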

Selenium WebDriver vs. the Requests Library

Some personal notes on the choice between Selenium WebDriver and the Requests library for web scraping:

When to use Selenium WebDriver

  • When the website is dynamic and requires JavaScript to render its content.
  • When the website requires user interaction (e.g., clicking buttons, filling out forms).

When to use the Requests library

  • When the required elements are already present in the page source on load. In this case, you can scrape the data directly with the Requests library.
  • When the website is static and does not need JavaScript to render its content.
  • When the website does not require user interaction.

Tip (that usually works): if you can find the required data/element in the page source (press Ctrl+U in Chrome to view it), you can scrape it directly with the Requests library. If you cannot find it there, you will most likely need Selenium WebDriver.
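
For example, here is a minimal Requests-only sketch of the same table-reading idea; the URL below is a placeholder for any static page that serves its table directly in the HTML:

import pandas as pd
import requests

# Placeholder URL: any static page whose table is present in the raw HTML works here.
page_url = "https://example.com/static-table-page"

response = requests.get(page_url, timeout=30)
response.raise_for_status()

# read_html parses every <table> in the HTML and returns a list of DataFrames
tables = pd.read_html(response.text)
print(tables[0].head())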