In this short tutorial, let’s look at the US Patent and Trademark Office (USPTO) website and scrape the patent database using a keyword search. We will use Selenium WebDriver to scrape the data. We will then use the Requests library to download the individual patent PDF documents.
Compiled by Srikar Kashyap Pulipaka
Last Updated: 13 June 2024
Selenium WebDriver is a browser automation framework that lets you interact with a web page through a real, fully featured browser. It is primarily used to automate web applications for testing, but it can also be used for web scraping. It is one of the components of the Selenium Test Suite.
First, you need to install the Selenium WebDriver and the Chrome WebDriver. You can install the Selenium WebDriver using the following command:
pip install selenium
You can download the Chrome WebDriver from the following link: Chrome WebDriver
Note: The Chrome WebDriver version should match the version of Chrome installed on your system. You can check your Chrome version by going to chrome://settings/help.
Once you download the Chrome WebDriver, extract the file and place it in the same directory as your Python script.
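The version check in the note above can also be done in code. The following is only a rough sketch: the helper name versions_match and the version strings are made up for illustration, and comparing just the first field (the major version) is an assumption about how closely the two need to agree. Pass a larger value for components if you want a stricter comparison.

```python
def versions_match(chrome_version, driver_version, components=1):
    """Return True if the first `components` dot-separated fields are equal."""
    chrome_parts = chrome_version.strip().split(".")[:components]
    driver_parts = driver_version.strip().split(".")[:components]
    return chrome_parts == driver_parts


# Hypothetical version strings, for illustration only.
same_major = versions_match("126.0.6478.62", "126.0.6478.55")
different_major = versions_match("126.0.6478.62", "125.0.6422.141")
print(same_major, different_major)
```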
We start by importing the necessary libraries: Selenium WebDriver to drive the browser, time for pauses between page loads, and pandas to hold the results. The Requests library is imported later, when we download the PDF documents.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
Let’s define the keyword to be used to search the USPTO database. In this case, we will use the keyword “semiconductor”.
keyword = "semiconductor"
We will initialize the WebDriver and navigate to the USPTO website. We will then search for the keyword “semiconductor” in the search bar, and press the search/enter button.
driver = webdriver.Chrome()
driver.get("https://ppubs.uspto.gov/pubwebapp/static/pages/ppubsbasic.html")
driver.find_element("id","searchText1").send_keys(keyword)
driver.find_element("id","searchText1").send_keys(Keys.RETURN)
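The results do not appear the instant the search is submitted, so the script has to wait before it reads the page. A fixed pause is the simplest option. An alternative, sketched below under some assumptions, is to poll a condition until it becomes true or a timeout expires. The helper wait_until and the stand-in condition table_present are hypothetical names used only for this illustration; Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if the condition is still falsy after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for "has the results table loaded yet?": becomes true on the third check.
calls = {"n": 0}


def table_present():
    calls["n"] += 1
    return calls["n"] >= 3


wait_result = wait_until(table_present, timeout=2.0, interval=0.01)
print(wait_result, calls["n"])
```

With a real browser, the condition could be something like lambda: driver.find_elements("tag name", "table"), which returns an empty (falsy) list until a table exists on the page.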
We will scrape the data from the search results and save it into a pandas DataFrame, then write the DataFrame to a CSV file. For this tutorial, the script stops after the first six pages of search results (the loop ends once count exceeds 5). You can run it for all the pages in its current form by removing the count condition.
from io import StringIO
master_df = pd.DataFrame()
count = 0
while True:
    time.sleep(3)
    # wrap the HTML in StringIO; recent pandas versions warn on raw HTML strings
    df = pd.read_html(StringIO(driver.page_source))[0]
    master_df = pd.concat([master_df, df])
    time.sleep(2)
    try:
        driver.find_element("id","paginationNextItem").click()
    except Exception as e:
        print(e)
        break
    print('Size of collection so far:', master_df.shape[0])
    count += 1
    if count > 5:
        break
driver.close()
master_df.reset_index(drop=True, inplace=True)
master_df.to_csv('patents.csv', index=False)
Let’s look at what happens in this piece of code:
# import the required libraries
from io import StringIO
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
# set the keyword
keyword = "semiconductor"
# open the browser and navigate to the website. Enter the keyword and click on search
driver = webdriver.Chrome()
driver.get("https://ppubs.uspto.gov/pubwebapp/static/pages/ppubsbasic.html")
driver.find_element("id","searchText1").send_keys(keyword)
driver.find_element("id","searchText1").send_keys(Keys.RETURN)
# create an empty dataframe to store the data
master_df = pd.DataFrame()
# loop through the pages and extract the data
count = 0
while True:
    # wait for the page to load
    time.sleep(3)
    # read the first table on the page and append it to the master
    # (StringIO avoids the warning recent pandas versions give for raw HTML strings)
    df = pd.read_html(StringIO(driver.page_source))[0]
    master_df = pd.concat([master_df, df])
    time.sleep(2)
    # go to the next page; stop if the button cannot be found or clicked
    try:
        driver.find_element("id","paginationNextItem").click()
    except Exception as e:
        print(e)
        break
    print('Size of collection so far:', master_df.shape[0])
    # stop once count exceeds 5, i.e. after six pages
    count += 1
    if count > 5:
        break
# close the browser and save the results
driver.close()
master_df.reset_index(drop=True, inplace=True)
master_df.to_csv('patents.csv', index=False)
Size of collection so far: 50
Size of collection so far: 100
Size of collection so far: 150
Size of collection so far: 200
Size of collection so far: 250
Size of collection so far: 300
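Because the counter is checked after each click, the loop above reads six pages before it stops, which is why the output has six progress lines. If you want the limit to mean exactly "read at most N pages", one way to structure it is sketched below. Everything here is illustrative: collect_pages and the FakeResults stand-in are hypothetical names, and the fake object simply imitates a results site with 50 rows per page so the logic can be run without a browser.

```python
def collect_pages(read_page, click_next, max_pages):
    """Read up to max_pages pages, stopping early when there is no next page."""
    pages = []
    while len(pages) < max_pages:
        pages.append(read_page())
        # Do not click "next" after the last page we intend to read.
        if len(pages) == max_pages or not click_next():
            break
    return pages


class FakeResults:
    """Stand-in for the browser: a result list with a fixed number of pages."""

    def __init__(self, total_pages, rows_per_page=50):
        self.total_pages = total_pages
        self.rows_per_page = rows_per_page
        self.current = 1

    def read_page(self):
        start = (self.current - 1) * self.rows_per_page
        return list(range(start + 1, start + self.rows_per_page + 1))

    def click_next(self):
        if self.current >= self.total_pages:
            return False
        self.current += 1
        return True


site = FakeResults(total_pages=8)
pages = collect_pages(site.read_page, site.click_next, max_pages=5)
rows = [row for page in pages for row in page]
print(len(pages), len(rows))  # prints: 5 250
```

In the real script, read_page would wrap the pd.read_html call and click_next would wrap the click on paginationNextItem, returning False when the click raises an exception.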
Let’s have a look at the data we collected.
master_df.head()
| | Result # | Document/Patent number | Display | Title | Inventor name | Publication date | Pages |
|---|---|---|---|---|---|---|---|
| 0 | 1 | US-20240193696-A1 | Preview PDF | PROACTIVE WEATHER EVENT COMMUNICATION SYSTEM A... | Wyatt; Amber et al. | 2024-06-13 | 21 |
| 1 | 2 | US-20240188889-A1 | Preview PDF | FLASH LED AND HEART RATE MONITOR LED INTEGRATI... | Tankiewicz; Szymon Michal et al. | 2024-06-13 | 28 |
| 2 | 3 | US-20240193523-A1 | Preview PDF | VIRTUAL CAREER MENTOR THAT CONSIDERS SKILLS AN... | O'Donncha; Fearghal et al. | 2024-06-13 | 16 |
| 3 | 4 | US-20240193519-A1 | Preview PDF | SYSTEMS AND METHODS FOR SYSTEM-WIDE GRANULAR A... | Holovacs; Jeremy | 2024-06-13 | 34 |
| 4 | 5 | US-20240190459-A1 | Preview PDF | METHODS AND SYSTEMS FOR VEHICLE CONTROL UNDER ... | Mamchuk; Tetyana V. et al. | 2024-06-13 | 23 |
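The preview shows that the Display column only holds the link text "Preview PDF" and that the publication dates are read as plain strings. A small cleanup pass, sketched here on two rows copied from the preview (sample_df is a stand-in for master_df, and the truncated titles are kept as displayed), could drop that column, parse the dates, and remove any repeated documents. Whether duplicates actually occur in the scraped data is an assumption; the last step is harmless if they do not.

```python
import pandas as pd

# Two rows copied from the preview above, standing in for master_df.
sample_df = pd.DataFrame({
    "Result #": [1, 2],
    "Document/Patent number": ["US-20240193696-A1", "US-20240188889-A1"],
    "Display": ["Preview PDF", "Preview PDF"],
    "Title": [
        "PROACTIVE WEATHER EVENT COMMUNICATION SYSTEM A...",
        "FLASH LED AND HEART RATE MONITOR LED INTEGRATI...",
    ],
    "Inventor name": ["Wyatt; Amber et al.", "Tankiewicz; Szymon Michal et al."],
    "Publication date": ["2024-06-13", "2024-06-13"],
    "Pages": [21, 28],
})

# Drop the column that only contains link text.
clean_df = sample_df.drop(columns=["Display"])
# Turn the date strings into real datetime values.
clean_df["Publication date"] = pd.to_datetime(clean_df["Publication date"])
# Keep one row per document number.
clean_df = clean_df.drop_duplicates(subset="Document/Patent number")

print(clean_df.dtypes)
```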
Now that we have the patents data, we can use the following code to extract/download the patents in PDF format. We will be using the Requests library to download the PDF documents using simple HTTP requests.
It turns out (to our advantage) that the USPTO website stores the patent PDF documents at predictable URLs, which we can use to download the PDFs of the patents we are interested in. The URL has the form https://ppubs.uspto.gov/dirsearch-public/print/downloadPdf/XXXXXXXX, where XXXXXXXX is the number in the middle of the document number (for example, 20240193696 for US-20240193696-A1).
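import requests
# URL template; {} is filled with the numeric part of the document number
url = "https://ppubs.uspto.gov/dirsearch-public/print/downloadPdf/{}"
# sample 4 rows
sample = master_df.head(4)
for i, row in sample.iterrows():
    # "US-20240193696-A1" -> "20240193696"
    patent_number = row["Document/Patent number"].split("-")[1]
    formatted_url = url.format(patent_number)
    response = requests.get(formatted_url)
    # stop here if the server returned an error status
    response.raise_for_status()
    # save the PDF under the full document number
    with open(f"{row['Document/Patent number']}.pdf", "wb") as f:
        f.write(response.content)
    print(f"Downloaded {row['Document/Patent number']}.pdf")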
Downloaded US-20240193696-A1.pdf
Downloaded US-20240188889-A1.pdf
Downloaded US-20240193523-A1.pdf
Downloaded US-20240193519-A1.pdf
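The URL construction can be pulled into a small function, together with a quick check that a response body really is a PDF before it is saved. This is a sketch under two assumptions: that every document number has the three-part form seen in the table (country, number, kind code), and that a valid PDF can be recognised by its leading "%PDF-" marker. The names pdf_url and looks_like_pdf are introduced here for illustration only.

```python
PDF_URL_TEMPLATE = "https://ppubs.uspto.gov/dirsearch-public/print/downloadPdf/{}"


def pdf_url(document_number):
    """Build the download URL from a document number such as 'US-20240193696-A1'."""
    parts = document_number.split("-")
    # Assumption: country code, number, kind code, separated by hyphens.
    if len(parts) != 3 or not parts[1]:
        raise ValueError(f"unexpected document number format: {document_number!r}")
    return PDF_URL_TEMPLATE.format(parts[1])


def looks_like_pdf(content):
    """Return True if the bytes start with the PDF file signature."""
    return content.startswith(b"%PDF-")


example_url = pdf_url("US-20240193696-A1")
print(example_url)
print(looks_like_pdf(b"%PDF-1.7\n..."), looks_like_pdf(b"<html>Not found</html>"))
```

In the download loop, looks_like_pdf(response.content) could be checked before writing the file, so that an HTML error page is not saved with a .pdf extension.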
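Let’s break down the code: the url string is a template whose {} placeholder receives the patent number. For each of the first four rows of master_df, we split the document number on "-" and keep the middle part (US-20240193696-A1 becomes 20240193696), fill it into the template, and fetch the result with requests.get. The response body is then written in binary mode to a file named after the full document number.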
Some personal notes on the choice between Selenium WebDriver and the Requests library for web scraping:
When to use Selenium WebDriver
When to use the Requests library
Tip (that usually works): If you can find the required data/element in the page source (Ctrl+U in most browsers), you can scrape it directly with the Requests library. If it is not in the page source, the page is probably building it with JavaScript after loading, and you might need Selenium WebDriver.
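The same check can be done in code instead of by eye. The sketch below uses only the standard library to test whether an element with a given id appears in a page's raw HTML. The two HTML snippets and the id "results" are invented for illustration, and IdFinder and id_in_static_html are hypothetical helper names.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag carries the target id attribute."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.found = True


def id_in_static_html(html, target_id):
    """Return True if an element with id=target_id exists in the raw HTML."""
    finder = IdFinder(target_id)
    finder.feed(html)
    return finder.found


# Invented examples: one page ships its table in the HTML, the other builds it with JavaScript.
static_page = '<html><body><table id="results"><tr><td>1</td></tr></table></body></html>'
dynamic_page = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

print(id_in_static_html(static_page, "results"))   # True: Requests would be enough
print(id_in_static_html(dynamic_page, "results"))  # False: likely needs a browser
```

For a live page, the html argument would be the text of a plain HTTP response, for example requests.get(page_url).text, where page_url is whatever page you are inspecting.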