Thal

Getting started with Puppeteer and Chrome Headless for Web Scraping


Here is a link to Medium Article

Here is the Chinese Version thanks to @csbun

A Desert in painters perception

Puppeteer is the official tool for Chrome Headless from the Google Chrome team. Since the official announcement of Chrome Headless, many of the industry-standard libraries for automated testing, including PhantomJS, have been discontinued by their maintainers. Selenium IDE for Firefox has also been discontinued due to a lack of maintainers.

With Chrome being the market leader in web browsing, Chrome Headless is likely to become the industry leader in automated testing of web applications. So I have put together this starter guide on web scraping with Chrome Headless.

TL;DR

In this guide we will log in to GitHub, scrape it, and extract and save users' emails using Chrome Headless, Puppeteer, Node and MongoDB. Don't worry: GitHub has a rate-limiting mechanism in place to keep you under control, but this post will give you a good idea of scraping with Chrome Headless and Node. Also, always stay up to date with the documentation, because Puppeteer is under active development and its APIs are prone to change.

Getting Started

Before we start, we need Node and MongoDB installed. Head over to their websites and install them.

Project setup

Start off by making the project directory

$ mkdir thal
$ cd thal

Initialize npm and fill in the necessary details.

$ npm init

Install Puppeteer. It's not stable yet and its repository is updated daily. If you want the latest functionality, you can install it directly from its GitHub repository.

$ npm i --save puppeteer

Puppeteer includes its own Chrome / Chromium build that is guaranteed to work headless. So each time you install or update Puppeteer, it downloads its specific Chrome version.

Coding

We will start by taking a screenshot of the page. This is code from their documentation.

Screenshot

const puppeteer = require('puppeteer');

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://github.com');
  await page.screenshot({ path: 'screenshots/github.png' });

  // close() returns a Promise, so wait for the browser to shut down
  await browser.close();
}

run();

If it's your first time using Node 7 or 8, you might be unfamiliar with the async and await keywords. Put simply, an async function returns a Promise, and that Promise eventually resolves with the result you asked for. Prefixing the call with await waits for the Promise and hands you the result directly, in a single line. Save this code as index.js inside the project directory.
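To make the async/await description above concrete, here is a minimal sketch that does not involve Puppeteer at all. The names wait and getTitle are made up for illustration; wait simply stands in for any slow operation such as loading a page.

```javascript
// Stand-in for a slow operation: resolves with `value` after `ms` milliseconds.
function wait(ms, value) {
  return new Promise((resolve) => setTimeout(() => resolve(value), ms));
}

// Marking a function `async` makes it return a Promise.
async function getTitle() {
  // `await` pauses this function (not the whole program) until the Promise resolves
  const title = await wait(10, 'GitHub');
  return title.toUpperCase();
}

const pending = getTitle();
console.log(pending instanceof Promise); // true

// Outside an async function, the result is read with .then()
pending.then((title) => console.log(title)); // GITHUB
```

Inside run(), the same value could be read with const title = await getTitle(); which is the pattern every Puppeteer call in this guide follows.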

Also create the screenshots dir.

$ mkdir screenshots

Run the code with

$ node index.js

The screenshot is now saved inside screenshots/ dir.
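The manual mkdir step above could also be done from the script itself. The following is only a sketch: ensureDirFor is a hypothetical helper, and the recursive option of fs.mkdirSync assumes Node 10.12 or newer, which is later than the Node versions this guide was written for.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: make sure the directory that will hold `filePath` exists.
function ensureDirFor(filePath) {
  const dir = path.dirname(filePath);
  // With `recursive: true`, missing parents are created and an existing directory is not an error
  fs.mkdirSync(dir, { recursive: true });
  return dir;
}

// Possible usage inside run(), before taking the screenshot:
//   ensureDirFor('screenshots/github.png');
//   await page.screenshot({ path: 'screenshots/github.png' });
```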

GitHub

Login to GitHub

If you go to GitHub, search for john and then click the Users tab, you will see a list of all users with that name.

Johns

Some of them have made their emails publicly visible and some have chosen not to. But you can't see these emails without logging in. So, let's log in. We will make heavy use of the Puppeteer documentation.

Add a file creds.js in the project root. I highly recommend signing up for a new account with a new dummy email, because you might end up getting your account blocked.

module.exports = {
    username: '<GITHUB_USERNAME>',
    password: '<GITHUB_PASSWORD>'
}

Add another file, .gitignore, and put the following content inside it:


node_modules/
creds.js

Launch in non headless

For visual debugging, make Chrome launch with its GUI by passing an object with headless: false to the launch method.

const browser = await puppeteer.launch({
  headless: false
});

Let's navigate to the login page

await page.goto('https://github.com/login');

Open https://github.com/login in your browser. Right-click the input box below Username or email address and select Inspect. In the developer tools, right-click the highlighted code and select Copy, then Copy selector.

Copy dom element selector

Paste that value into the following constant

const USERNAME_SELECTOR = '#login_field'; // "#login_field" is the copied value

Repeat the process for the Password input box and the Sign in button. You should end up with the following

// dom element selectors
const USERNAME_SELECTOR = '#login_field';
const PASSWORD_SELECTOR = '#password';
const BUTTON_SELECTOR = '#login > form > div.auth-form-body.mt-3 > input.btn.btn-primary.btn-block';

Logging in

Puppeteer provides the methods click, to click a DOM element, and type, to type text into an input box. Let's fill in the credentials, then click the login button and wait for the redirect.

At the top of the file, require creds.js.

const CREDS = require('./creds');

And then

await page.click(USERNAME_SELECTOR);
await page.keyboard.type(CREDS.username);

await page.click(PASSWORD_SELECTOR);
await page.keyboard.type(CREDS.password);

await Promise.all([
  page.click(BUTTON_SELECTOR),
  page.waitForNavigation()
])

Search GitHub

Now we are logged in. We could programmatically click the search box, fill it in and, on the results page, click the Users tab. But there's an easier way. Search requests are usually GET requests, so everything is sent via the URL. Manually type john into the search box, click the Users tab and copy the URL. It would be

const searchUrl = 'https://github.com/search?q=john&type=Users&utf8=%E2%9C%93';

Rearranging a bit

const userToSearch = 'john';
const searchUrl = `https://github.com/search?q=${userToSearch}&type=Users&utf8=%E2%9C%93`;
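The template string above works for a simple term like john, but a term containing spaces or characters such as & would break the query string. A minimal sketch of one way to guard against that, using the standard encodeURIComponent function, is shown here. buildSearchUrl is a hypothetical helper name, not something from Puppeteer or GitHub.

```javascript
// Hypothetical helper: build the user-search URL for any search term.
function buildSearchUrl(term) {
  // encodeURIComponent escapes spaces, '&', '#' and other reserved characters
  const query = encodeURIComponent(term);
  return `https://github.com/search?q=${query}&type=Users&utf8=%E2%9C%93`;
}

console.log(buildSearchUrl('john'));
// https://github.com/search?q=john&type=Users&utf8=%E2%9C%93

console.log(buildSearchUrl('john smith'));
// https://github.com/search?q=john%20smith&type=Users&utf8=%E2%9C%93
```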

Let's navigate to this page and wait a couple of seconds to see whether it actually searched.

await page.goto(searchUrl);
await page.waitFor(2*1000);
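Newer Puppeteer releases have deprecated page.waitFor, so if the call above is not available in your installed version, a plain Promise-based pause does the same job without depending on the Puppeteer API. This is only a sketch, and delay is a hypothetical helper name.

```javascript
// Hypothetical helper: returns a Promise that resolves after `ms` milliseconds.
function delay(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Possible usage inside run(), in place of page.waitFor:
//   await page.goto(searchUrl);
//   await delay(2 * 1000);
```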

Extract Emails

We are interested in extracting the username and email of each user. Let's copy the DOM element selectors like we did above.

const LIST_USERNAME_SELECTOR = '#user_search_results > div.user-list > div:nth-child(1) div.d-flex > div > a';
const LIST_EMAIL_SELECTOR = '#user_search_results > div.user-list > div:nth-child(1) div.d-flex > div > ul > li:nth-child(2) > a';

const LENGTH_SELECTOR_CLASS = 'user-list-item';

You can see that I also added LENGTH_SELECTOR_CLASS above. If you look at the GitHub page's code in the developer tools, you will observe that each div with class user-list-item houses information about a single user.

Currently, one way to extract text from an element is to use the evaluate method of Page or ElementHandle. When we navigate to the page with search results, we will use page.evaluate to get the number of users listed on the page. The evaluate method runs the given code inside the browser context.

let listLength = await page.evaluate((sel) => {
    return document.getElementsByClassName(sel).length;
  }, LENGTH_SELECTOR_CLASS);

Let's loop through all the listed users and extract emails. As we loop through the DOM, we have to change the index inside the selectors to point to the next DOM element. So, I put the string INDEX where the index should go as we loop through.

  // const LIST_USERNAME_SELECTOR = '#user_search_results > div.user-list > div:nth-child(1) div.d-flex > div > a';
const LIST_USERNAME_SELECTOR = '#user_search_results > div.user-list > div:nth-child(INDEX) div.d-flex > div > a';
  // const LIST_EMAIL_SELECTOR = '#user_search_results > div.user-list > div:nth-child(1) div.d-flex > div > ul > li:nth-child(2) > a';
const LIST_EMAIL_SELECTOR = '#user_search_results > div.user-list > div:nth-child(INDEX) div.d-flex > div > ul > li:nth-child(2) > a';
const LENGTH_SELECTOR_CLASS = 'user-list-item';
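As a quick illustration of the INDEX placeholder idea, the sketch below shows what the selector looks like once the placeholder is filled in. selectorFor is a hypothetical helper used only for this demonstration.

```javascript
const LIST_USERNAME_SELECTOR = '#user_search_results > div.user-list > div:nth-child(INDEX) div.d-flex > div > a';

// Hypothetical helper: swap the INDEX placeholder for a 1-based position.
// replace() with a string pattern only replaces the first match, which is
// enough here because INDEX appears once per template.
function selectorFor(template, index) {
  return template.replace('INDEX', String(index));
}

console.log(selectorFor(LIST_USERNAME_SELECTOR, 3));
// #user_search_results > div.user-list > div:nth-child(3) div.d-flex > div > a
```

Note that nth-child counts from 1, not 0.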

The loop and extraction

for (let i = 1; i <= listLength; i++) {
    // change the index to the next child
    let usernameSelector = LIST_USERNAME_SELECTOR.replace("INDEX", i);
    let emailSelector = LIST_EMAIL_SELECTOR.replace("INDEX", i);

    let username = await page.evaluate((sel) => {
        return document.querySelector(sel).getAttribute('href').replace('/', '');
      }, usernameSelector);

    let email = await page.evaluate((sel) => {
        let element = document.querySelector(sel);
        return element? element.innerHTML: null;
      }, emailSelector);

    // not all users have emails visible
    if (!email)
      continue;

    console.log(username, ' -> ', email);

    // TODO save this user
  }

Now if you run the script with node index.js, you will see usernames and their corresponding emails printed.

Go over all the pages

First we will estimate the last page number of the search results. At the top of the search results page, you can see 69,769 users at the time of this writing.

Fun Fact: If you compare with the previous screenshot of the page, you will notice that 6 more johns have joined GitHub in a matter of a few hours.

Number of search items
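The arithmetic behind such an estimate can be sketched without a browser. The snippet below is an illustration only: estimateNumPages is a hypothetical helper, and the figure of 10 users per results page is an assumption that is not stated in this guide.

```javascript
// Assumption: the results page lists 10 users at a time.
const USERS_PER_PAGE = 10;

// Hypothetical helper: turn heading text such as "69,769 users" into a page count.
function estimateNumPages(headingText, perPage = USERS_PER_PAGE) {
  // Drop thousands separators, then read the first run of digits
  const match = headingText.replace(/,/g, '').match(/\d+/);
  if (!match) return 0;
  return Math.ceil(parseInt(match[0], 10) / perPage);
}

console.log(estimateNumPages('69,769 users')); // 6977
```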

Copy its selector from the developer tools. We will write a new function below the run function to return the number of pages we can go through.

async function getNumPages(page) {
  const NUM_USER_SELECTOR = '#js-pjax-container > div.container > div > div.column.three-fourths.codesearch-results.pr-6 > div.d-flex.flex-justify-between.border-bottom.pb-3 > h3';

  let inner = await page.evaluate((sel) => {
