How to Scrape Glassdoor Data with Node.js

Bibhuti Poudyal
JavaScript in Plain English
4 min read · Mar 5, 2022



There are multiple ways to scrape data off a webpage. This article explains a smarter way to scrape salary data for Kathmandu from Glassdoor using a NodeJS script. First, let's look at the tools we will be using.

JSDOM

  • It’s a JavaScript implementation of web standards, for use with NodeJS. The reason for picking this tool in particular: the same logic works in the browser’s console and in NodeJS, so there is nothing extra to learn for scraping. Personally, I try to scrape data from the browser’s console first and reuse the same code later in the scraper script. Other tools like cheerio can also serve as an alternative.
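
For instance, here is a minimal sketch of that idea: the same DOM call you would type into the browser console runs unchanged under NodeJS once the HTML is loaded into JSDOM (the markup below is a made-up stand-in, not Glassdoor’s actual page):

const { JSDOM } = require("jsdom");

// A made-up snippet of HTML standing in for a real page's source
const html = `<div data-test="salaries-list-item-0-job-title">Software Engineer</div>`;
const { document } = new JSDOM(html).window;

// Identical to what you'd run in the browser console
console.log(document.querySelector("[data-test='salaries-list-item-0-job-title']").textContent);
// -> "Software Engineer"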

Axios

  • It’s an HTTP client for NodeJS; we use it to fetch a webpage’s source (HTML).
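
In this article it is only needed for one job: fetching a page’s raw HTML, roughly like this:

const axios = require("axios");

// axios.get resolves with a response object whose `data` field holds the body.
// Glassdoor may reject requests without a browser-like User-Agent; add headers if needed.
axios
  .get("https://www.glassdoor.com/Salaries/kathmandu-salary-SRCH_IL.0,9_IM1598.htm")
  .then(({ data: html }) => console.log(html.slice(0, 200))); // first 200 chars of the source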

The smart way

Before running any scraping task, it’s necessary to analyze the page’s source code and look for patterns that make your job easier. Scraping paginated data is more cumbersome than scraping a single page.

In our case, I will be using this URL to scrape salary data for Kathmandu: https://www.glassdoor.com/Salaries/kathmandu-salary-SRCH_IL.0,9_IM1598.htm

Let’s start

Browse to the URL, and the first thing you get is a stupid “sign up” dialog covering all the data. Quickly spin up the developer tools, go to the inspector, identify the element (div#HardsellOverlay) and remove it.

Sign-up dialog covering data
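
The same removal can be done from the console with a one-liner (assuming the overlay’s id is still HardsellOverlay; Glassdoor may rename it over time):

// Remove the sign-up overlay blocking the page
document.querySelector("#HardsellOverlay")?.remove();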

Now that the data is visible, look for patterns that help in identifying the list of salary entries. Luckily, we have one: every entry carries an attribute of the form data-test="salaries-list-item-XXX".

Pattern for salary data

Exploring further, we can see that each data item follows a similar pattern, so we can scrape it with query selectors straight from the browser console. Copy-paste the code below into the browser’s console. We will reuse the same piece of code later when building the script.

// Each page lists 20 salary entries
for (let i = 0; i <= 19; i++) {
  // Every data-test attribute belonging to entry i shares this prefix
  let attrPrefix = `salaries-list-item-${i}`;
  console.log("company name", document.querySelector(`[data-test='${attrPrefix}-employer-name'] a`).textContent);
  console.log("company logo", document.querySelector(`[data-test='${attrPrefix}-employer-url'] img`).src);
  console.log("job title", document.querySelector(`[data-test='${attrPrefix}-job-title']`).textContent);
  console.log("salary", document.querySelector(`[data-test='${attrPrefix}-salary-info'] h3`).textContent);
}

Okay, now we have the data and we know how to scrape it from HTML. What we need next is more data.

When we scroll to the bottom, we see pagination controls. The next thing to do is walk through the pages and scrape data from each one. As you may have noticed, each page holds 20 data items.

Pagination. Also, note the total no. of items.

Wait! Maybe there’s another pattern that can make our task easier. Digging a bit deeper, we find one in the pagination too.

From the second page onward, the URL contains the page number mixed with some weird value. Also, as we saw earlier, there were 1632 entries in total, 20 per page. That makes 1632/20 ≈ 82 pages (rounding up the last, partial page).

pattern in pagination
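
Concretely, page navigation can be wrapped in a tiny helper. The _IPn suffix below is an assumption about how the page number appears in the URL; double-check it against the URLs in your own address bar:

// Build the URL for a given page. Page 1 is the base URL; later pages add an
// "_IPn" suffix before ".htm" (an assumption; verify in your browser).
const BASE = "https://www.glassdoor.com/Salaries/kathmandu-salary-SRCH_IL.0,9_IM1598";
const pageUrl = (page) => (page === 1 ? `${BASE}.htm` : `${BASE}_IP${page}.htm`);

// 1632 items at 20 per page -> 82 pages
const TOTAL_PAGES = Math.ceil(1632 / 20);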

NodeJS script

As we have figured out a way to scrape all that data, it’s time to write a NodeJS script that scrapes Glassdoor’s salary data and saves it to a JSON file. We will write the data to data.json after each page is done scraping. Reason: you can pause and resume at any page, and you can run the script multiple times in parallel over different page ranges.

For the sake of simplicity, I will paste the script directly here, with explanatory comments.
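
Here is a minimal sketch of such a script, assuming the _IPn URL pattern above and the data-test selectors from the console snippet. Treat it as a starting point rather than a definitive implementation:

const fs = require("fs");
const axios = require("axios");
const { JSDOM } = require("jsdom");

// Assumed URL pattern (see above); verify against real Glassdoor URLs
const BASE = "https://www.glassdoor.com/Salaries/kathmandu-salary-SRCH_IL.0,9_IM1598";
const pageUrl = (page) => (page === 1 ? `${BASE}.htm` : `${BASE}_IP${page}.htm`);
const TOTAL_PAGES = Math.ceil(1632 / 20); // 82

// Load any previously scraped data so the run can be paused and resumed
const FILE = "data.json";
const data = fs.existsSync(FILE) ? JSON.parse(fs.readFileSync(FILE, "utf8")) : [];

async function scrapePage(page) {
  // A browser-like User-Agent; Glassdoor may block the default axios one
  const { data: html } = await axios.get(pageUrl(page), {
    headers: { "User-Agent": "Mozilla/5.0" },
  });
  const { document } = new JSDOM(html).window;

  const items = [];
  for (let i = 0; i <= 19; i++) {
    // Same prefix pattern we used in the browser console
    const attrPrefix = `salaries-list-item-${i}`;
    const employer = document.querySelector(`[data-test='${attrPrefix}-employer-name'] a`);
    if (!employer) break; // the last page may hold fewer than 20 entries
    items.push({
      company: employer.textContent.trim(),
      logo: document.querySelector(`[data-test='${attrPrefix}-employer-url'] img`)?.src,
      jobTitle: document.querySelector(`[data-test='${attrPrefix}-job-title']`)?.textContent.trim(),
      salary: document.querySelector(`[data-test='${attrPrefix}-salary-info'] h3`)?.textContent.trim(),
    });
  }
  return items;
}

(async () => {
  // Resume right after the last fully scraped page (20 items per page)
  const startPage = Math.floor(data.length / 20) + 1;
  for (let page = startPage; page <= TOTAL_PAGES; page++) {
    const items = await scrapePage(page);
    data.push(...items);
    // Persist after every page: pause/resume and parallel ranges both rely on this
    fs.writeFileSync(FILE, JSON.stringify(data, null, 2));
    console.log(`page ${page}: scraped ${items.length} items`);
  }
})();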

After running the script, you will end up with a data.json file containing Glassdoor’s salary data for Kathmandu. Similarly, you can scrape the data of any other city, or job listings.

Important message: scraping doesn’t have to be brute force; there are often smarter ways to minimize the workload.

Hope you like the content. If there are any issues, please feel free to point them out.

