Web Scraping on Heroku With NodeJS, Chromedriver, and Selenium
Motivation
More and more websites these days are making use of client-side rendering, which often means that using a simple curl or NodeJS fetch command will no longer work for scraping data.
To successfully scrape sites that use JavaScript to update HTML, the best option is to use a headless browser, coupled with some sort of browser automation library.
Tools of choice
- Selenium is an EXTREMELY widely-used and full-featured browser automation library.
  - There are Selenium libraries for many different languages (e.g. JS, Python, Java, etc.). We'll be using JS in this tutorial.
  - Selenium is also extremely well-documented.
- Chrome + chromedriver gives us access to an automatable headless browser with which Selenium can interact.
Goal of this walk-through
In this walkthrough, I will show you how to automatically load up a site, wait for it to finish rendering, then grab the HTML of the page.
This should be enough to allow you to scrape almost ANY site on the internet (and, once you get more comfortable with Selenium, you'll find that Selenium can do MUCH more than just pull down fully-rendered HTML).
Assumptions
This overview is intended for people that:
- Are familiar with NodeJS and npm
  - Make sure you have one of the more modern versions of NodeJS installed on your machine, as we'll be using some of the latest NodeJS syntax (e.g. async / await) - I highly recommend checking out nvm if you don't use it already.
- Are familiar with git
- Have some basic web-scraping experience
- Know how to create and deploy a Heroku app (if you care about the Heroku part)
The Walk-Through
Setup
In this part we'll set everything up and make sure we have all of our required tools installed.
- Create a new (blank) git repository and hook it up with Heroku, as sketched below.
  - git remote -v should include a line that looks like heroku https://git.heroku.com/your-heroku-repo-name.git
  - If you just want to run your scraping script locally, you DO NOT need Heroku.
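If you're starting from scratch, the whole setup looks something like this (the directory and app names here are hypothetical placeholders, and this assumes you already have the Heroku CLI installed):
mkdir my-scraper && cd my-scraper
git init
heroku create your-heroku-repo-name
git remote -v    # should now list the heroku remote described above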
- Set up npm in the repo if you haven't already: npm init
- Ok, you should now have a basic repo where you can install npm packages, create and run new NodeJS scripts, and push anything you create to your heroku remote.
- Install selenium-webdriver:
npm install --save selenium-webdriver
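After that command, your package.json should contain something like the following sketch (the name comes from your npm init answers, and the exact selenium-webdriver version will vary):
{
  "name": "my-scraper",
  "version": "1.0.0",
  "dependencies": {
    "selenium-webdriver": "^4.0.0"
  }
}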
- Make sure you have the chrome browser installed on your local machine.
- Download chromedriver to somewhere on your computer.
  - Find out which chromedriver version you need by opening your Chrome browser and typing chrome://version/ in the URL bar.
    - You will want to download the chromedriver version that most closely matches the version of your existing Chrome browser.
  - Go to the chromedriver downloads page and download / unzip the appropriate chromedriver version.
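Once it's downloaded and unzipped, you can sanity-check the version from a terminal (the path below is a placeholder for wherever you put the file):
~/Downloads/chromedriver --version
# Should print something like "ChromeDriver 78.0.3904.70" - the major version should match your browser's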
Write the script
Ok, now that we have everything installed, we're ready to start writing our scraping script. For this example, we'll just scrape google.com, but this will work for any other site as well.
- Create a file called sampleScraper.js and put the following into it:
const {
  Browser,
  Builder,
  until,
} = require('selenium-webdriver');
const {
  Options,
  ServiceBuilder,
} = require('selenium-webdriver/chrome');

let options = new Options();
// This tells Selenium where to find your Chrome browser executable
options.setChromeBinaryPath(process.env.CHROME_BINARY_PATH);
// These options are necessary if you'd like to deploy to Heroku
// options.addArguments("--headless");
// options.addArguments("--disable-gpu");
// options.addArguments("--no-sandbox");

(async function run() {
  // Necessary to tell Selenium where to find your newly-installed chromedriver executable
  let serviceBuilder = new ServiceBuilder(process.env.CHROME_DRIVER_PATH);

  // Create a new "driver", which controls the browser and does all the actual scraping
  let driver = new Builder()
    .forBrowser(Browser.CHROME)
    .setChromeOptions(options)
    .setChromeService(serviceBuilder)
    .build();

  try {
    // Open up google.com in the browser
    await driver.get('https://www.google.com');
    // Wait on this page until this condition is met
    await driver.wait(until.titleMatches(/Google/));
    // Get the full HTML of the page and log it
    const html = await driver.getPageSource();
    console.log(`HTML is:\n\n${html}\n\n`);
  } finally {
    await driver.quit();
  }
})();
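A quick note on those commented-out flags, since they matter once we get to Heroku: --headless runs Chrome without a visible window (required on a server with no display), and --no-sandbox disables Chrome's process sandbox, which generally can't run inside a Heroku dyno. --disable-gpu has historically been recommended alongside headless mode and should be harmless to include.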
- Add the required ENV variables to your system.
  - You'll notice that this script makes use of some environment variables: "CHROME_DRIVER_PATH" and "CHROME_BINARY_PATH".
  - These env variables tell Selenium where to find your chromedriver and your chrome browser, respectively.
  - Set these variables to the correct values. On Ubuntu, you can do this by adding the following lines to your ~/.bashrc file:
export CHROME_DRIVER_PATH="wherever-you-downloaded-your-driver-earlier-this-walkthrough"
export CHROME_BINARY_PATH="wherever-your-chrome-binary-is"
  - On Ubuntu, you can find the location of your chrome binary by typing which google-chrome.
  - For the chromedriver location, make sure you give the full path to the chromedriver file you downloaded (and make sure it's unzipped!).
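If you'd rather not edit your ~/.bashrc, you can also set both variables inline for a single run (both paths below are placeholders for your actual locations):
CHROME_DRIVER_PATH="$HOME/Downloads/chromedriver" CHROME_BINARY_PATH="$(which google-chrome)" node sampleScraper.js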
- Run the script: node sampleScraper.js
  - If you see the error The ChromeDriver could not be found on the current PATH. Please download the latest version of the ChromeDriver from http://chromedriver.storage.googleapis.com/index.html and ensure it can be found on your PATH., then you didn't correctly enter the location of your chromedriver in the CHROME_DRIVER_PATH env variable. You DO NOT need to add chromedriver to your PATH to make this work.
  - If you see an error like session not created: This version of ChromeDriver only supports Chrome version 78, it means that the chromedriver you downloaded earlier is the incorrect version. Download a different version, and make sure it matches your existing Chrome version (as described earlier).
- Ok, you've successfully scraped google.com using Selenium and Chrome!
  - This same script should work with any site, even if it's client-side rendered.
  - Note: the driver.wait(until...) statement is what tells the browser to wait before it attempts to capture any HTML.
    - There are many options for this until logic, and you can use any of them to make sure a page is "ready" before you attempt to scrape from it.
    - One of my favorites is until.elementLocated, since it allows your script to wait until a specific element has been rendered before continuing (see the sketch after this list).
- (Optional: read the comments above to see what each line of the script is meant to do.)
- (Optional: to get this working on Heroku, keep reading below.)
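For example, here's a minimal sketch of an until.elementLocated wait. The By helper is part of selenium-webdriver, but the CSS selector here is hypothetical - swap in one that actually exists on your target page. These lines would replace the titleMatches wait inside the try block above:
const { By } = require('selenium-webdriver');

// Wait up to 10 seconds for a specific element to appear in the DOM
const element = await driver.wait(
  until.elementLocated(By.css('#search-results')),  // hypothetical selector
  10000,
);
console.log(await element.getText());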
Connect with Heroku
Ok, our last (optional) step is to get this working on Heroku. The good news is that by including ENV variables in the example above, we've already done most of the work. Here are the last few steps:
- Add Heroku buildpacks for chromedriver and chrome:
heroku buildpacks:add https://github.com/heroku/heroku-buildpack-chromedriver
heroku buildpacks:add https://github.com/heroku/heroku-buildpack-google-chrome
- The buildpacks make sure chrome and chromedriver are installed on your Heroku app. Update your Heroku env variables to match these new install locations:
heroku config:set CHROME_DRIVER_PATH=/app/.chromedriver/bin/chromedriver
heroku config:set CHROME_BINARY_PATH=/app/.apt/opt/google/chrome/chrome
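You can double-check that both values were saved by listing your app's config vars:
heroku config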
- Uncomment the required lines in the script above (e.g. the lines that run the browser in headless mode).
- Save your changes and push your script to your Heroku app:
git add .
git commit
git push heroku master
- Test it out by running your script on Heroku!
heroku run bash
(This should give you a bash prompt inside your Heroku app.)
node sampleScraper.js
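Alternatively, you can skip the interactive bash prompt and run the script in one shot:
heroku run node sampleScraper.js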
And that's it! Now you should have a working web scraper that can run on Heroku.
Please comment if you have questions, comments, or if you spot anything I've missed.