Web Scraping on Heroku With NodeJS, Chromedriver, and Selenium
Motivation
More and more websites these days are making use of client-side rendering, which often means that using a simple curl or NodeJS fetch command will no longer work for scraping data.
To successfully scrape sites that use JavaScript to update HTML, the best option is to use a headless browser, coupled with some sort of browser automation library.
Tools of choice
- Selenium is an EXTREMELY widely-used and full-featured browser automation library.
  - There are Selenium libraries for many different languages (e.g. JS, Python, Java, etc.). We'll be using JS in this tutorial.
  - Selenium is also extremely well-documented.
- Chrome + chromedriver gives us access to an automatable headless browser with which Selenium can interact.
Goal of this walk-through
In this walkthrough, I will show you how to automatically load up a site, wait for it to finish rendering, then grab the HTML of the page.
This should be enough to allow you to scrape almost ANY site on the internet (and, once you get more comfortable with Selenium, you'll find that Selenium can do MUCH more than just pull down fully-rendered HTML).
Assumptions
This overview is intended for people that:
- Are familiar with NodeJS and npm
  - Make sure you have one of the more modern versions of NodeJS installed on your machine, as we'll be using some of the latest NodeJS syntax (e.g. async / await) - I highly recommend checking out nvm if you don't use it already.
- Are familiar with git
- Have some basic web-scraping experience
- Know how to create and deploy a Heroku app (if you care about the Heroku part)
The Walk-Through
Setup
In this part we'll set everything up and make sure we have all of our required tools installed.
- Create a new (blank) git repository and hook it up with Heroku, as sketched below.
  - git remote -v should include a line that looks like heroku https://git.heroku.com/your-heroku-repo-name.git
  - If you just want to run your scraping script locally, you DO NOT need Heroku.
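If you're starting from scratch, the whole setup looks something like this (the directory and app names here are hypothetical placeholders, and this assumes you already have the Heroku CLI installed):
mkdir my-scraper && cd my-scraper
git init
heroku create your-heroku-repo-name
git remote -v    # should now list the heroku remote described above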
- Set up npm in the repo if you haven't already: npm init
- Ok, you should now have a basic repo where you can install npm packages, create and run new NodeJS scripts, and push anything you create to your heroku remote.
- Install selenium-webdriver:
npm install --save selenium-webdriver
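After that command, your package.json should contain something like the following sketch (the name comes from your npm init answers, and the exact selenium-webdriver version will vary):
{
  "name": "my-scraper",
  "version": "1.0.0",
  "dependencies": {
    "selenium-webdriver": "^4.0.0"
  }
}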
- Make sure you have the chrome browser installed on your local machine.
- Download chromedriver to somewhere on your computer.
  - Find out which chromedriver version you need by opening your Chrome browser and typing chrome://version/ in the URL bar.
    - You will want to download the chromedriver version that most closely matches the version of your existing Chrome browser.
  - Go to the chromedriver downloads page and download / unzip the appropriate chromedriver version.
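Once it's downloaded and unzipped, you can sanity-check the version from a terminal (the path below is a placeholder for wherever you put the file):
~/Downloads/chromedriver --version
# Should print something like "ChromeDriver 78.0.3904.70" - the major version should match your browser's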
Write the script
Ok, now that we have everything installed, we're ready to start writing our scraping script. For this example, we'll just scrape google.com, but this will work for any other site as well.
- Create a file called sampleScraper.js and put the following into it:
const {
  Browser,
  Builder,
  until,
} = require('selenium-webdriver');
const {
  Options,
  ServiceBuilder,
} = require('selenium-webdriver/chrome');

let options = new Options();
// This tells Selenium where to find your Chrome browser executable
options.setChromeBinaryPath(process.env.CHROME_BINARY_PATH);
// These options are necessary if you'd like to deploy to Heroku
// options.addArguments("--headless");
// options.addArguments("--disable-gpu");
// options.addArguments("--no-sandbox");

(async function run() {
  // Necessary to tell Selenium where to find your newly-installed chromedriver executable
  let serviceBuilder = new ServiceBuilder(process.env.CHROME_DRIVER_PATH);

  // Create a new "driver", which controls the browser and does all the actual scraping
  let driver = new Builder()
    .forBrowser(Browser.CHROME)
    .setChromeOptions(options)
    .setChromeService(serviceBuilder)
    .build();

  try {
    // Open up google.com in the browser
    await driver.get('https://www.google.com');
    // Wait on this page until this condition is met
    await driver.wait(until.titleMatches(/Google/));
    // Get the full HTML of the page and log it
    const html = await driver.getPageSource();
    console.log(`HTML is:\n\n${html}\n\n`);
  } finally {
    await driver.quit();
  }
})();
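A quick note on those commented-out flags, since they matter once we get to Heroku: --headless runs Chrome without a visible window (required on a server with no display), and --no-sandbox disables Chrome's process sandbox, which generally can't run inside a Heroku dyno. --disable-gpu has historically been recommended alongside headless mode and should be harmless to include.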
- Add the required ENV variables to your system.
  - You'll notice that this script makes use of some environment variables: "CHROME_DRIVER_PATH" and "CHROME_BINARY_PATH".
  - These env variables tell Selenium where to find your chromedriver and your chrome browser, respectively.
  - Set these variables to the correct values. On Ubuntu, you can do this by adding the following lines to your ~/.bashrc file:
export CHROME_DRIVER_PATH="wherever-you-downloaded-your-driver-earlier-this-walkthrough"
export CHROME_BINARY_PATH="wherever-your-chrome-binary-is"
  - On Ubuntu, you can find the location of your chrome binary by typing which google-chrome.
  - For the chromedriver location, make sure you give the full path to the chromedriver file you downloaded (and make sure it's unzipped!).
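If you'd rather not edit your ~/.bashrc, you can also set both variables inline for a single run (both paths below are placeholders for your actual locations):
CHROME_DRIVER_PATH="$HOME/Downloads/chromedriver" CHROME_BINARY_PATH="$(which google-chrome)" node sampleScraper.js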
- Run the script: node sampleScraper.js
  - If you see the error The ChromeDriver could not be found on the current PATH. Please download the latest version of the ChromeDriver from http://chromedriver.storage.googleapis.com/index.html and ensure it can be found on your PATH., then you didn't correctly enter the location of your chromedriver in the CHROME_DRIVER_PATH env variable. You DO NOT need to add chromedriver to your PATH to make this work.
  - If you see an error like session not created: This version of ChromeDriver only supports Chrome version 78, it means that the chromedriver you downloaded earlier is the incorrect version. Download a different version, and make sure it matches your existing Chrome version (as described earlier).
- Ok, you've successfully scraped google.com using Selenium and Chrome!
  - This same script should work with any site, even if it's client-side rendered.
  - Note: the driver.wait(until...) statement is what tells the browser to wait before it attempts to capture any HTML.
    - There are many options for this until logic, and you can use any of them to make sure a page is "ready" before you attempt to scrape from it.
    - One of my favorites is until.elementLocated, since it allows your script to wait until a specific element has been rendered before continuing (see the sketch after this list).
- (Optional: read the comments above to see what each line of the script is meant to do.)
- (Optional: to get this working on Heroku, keep reading below.)
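For example, here's a minimal sketch of an until.elementLocated wait. The By helper is part of selenium-webdriver, but the CSS selector here is hypothetical - swap in one that actually exists on your target page. These lines would replace the titleMatches wait inside the try block above:
const { By } = require('selenium-webdriver');

// Wait up to 10 seconds for a specific element to appear in the DOM
const element = await driver.wait(
  until.elementLocated(By.css('#search-results')),  // hypothetical selector
  10000,
);
console.log(await element.getText());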
Connect with Heroku
Ok, our last (optional) step is to get this working on Heroku. The good news is that by including ENV variables in the example above, we've already done most of the work. Here are the last few steps:
- Add Heroku buildpacks for chromedriver and chrome:
heroku buildpacks:add https://github.com/heroku/heroku-buildpack-chromedriver
heroku buildpacks:add https://github.com/heroku/heroku-buildpack-google-chrome
- The buildpacks make sure chrome and chromedriver are installed on your Heroku app. Update your Heroku env variables to match these new install locations:
heroku config:set CHROME_DRIVER_PATH=/app/.chromedriver/bin/chromedriver
heroku config:set CHROME_BINARY_PATH=/app/.apt/opt/google/chrome/chrome
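You can double-check that both values were saved by listing your app's config vars:
heroku config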
- Uncomment the required lines in the script above (e.g. the lines that run the browser in headless mode).
- Save your changes and push your script to your Heroku app:
git add .
git commit
git push heroku master
- Test it out by running your script on Heroku!
heroku run bash
(This should give you a bash prompt inside your Heroku app.)
node sampleScraper.js
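Alternatively, you can skip the interactive bash prompt and run the script in one shot:
heroku run node sampleScraper.js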
And that's it! Now you should have a working web scraper that can run on Heroku.
Please comment if you have questions, comments, or if you spot anything I've missed.