Monday 4 January 2016

Web Scraping in Node.js

In this post we will discuss scraping the web in Node.js using the Cheerio module. Cheerio is a tiny, fast, and elegant implementation of core jQuery designed specifically for the server. We will build a simple web scraper that gives us the name and description of the most depended-upon modules from NPM.

To get started, we need to install the Cheerio module from npm using the command npm install cheerio.


In our code we will make a single request to NPM and get the name and description of the most depended-upon modules. Once we have this information, we will display it on the console.

Code : 

var cheerio = require('cheerio');
var https = require('https');

function webScrap() {
 // Declared with var so it does not leak into the global scope
 var url = 'https://www.npmjs.com/';

 var request = https.get(url, function(response) {
  // Accumulates the response body (HTML, despite the variable name)
  var json = '';
  response.on('data', function(chunk) {
   json += chunk;
  });

  response.on('end', function() {
   // Parse the complete HTML so it can be queried with jQuery-style selectors
   var $ = cheerio.load(json);

   // Visit every element that has the class "packages"
   $(".packages").each(function() {
    var data = $(this);
    data.children().each(function() {
     var dataChild = $(this);
     dataChild.children().children().each(function () {
      var details = $(this);
      console.log('-------------------------');
      console.log('Name : ' + details.children().first().children().text());
      console.log('Description : ' + details.children('.description').text());
     });
    });
   });
  });
 });
 // Report network errors such as DNS failures or refused connections
 request.on('error', function(err) {
  console.log(err);
 });
}
webScrap();

The above code makes a request that captures the HTML of the website and passes it to Cheerio, where we traverse the DOM and extract the information we want.
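The 'data' handler receives the page in pieces, and each piece is a Buffer. As a rough sketch of an alternative way to capture the HTML, the chunks can be collected in an array and decoded once at the end, which should avoid garbling a multi-byte character that happens to be split across two chunks. The names decodeChunks and fetchHtml are hypothetical helpers introduced here for illustration; they are not part of the original code.

```javascript
var https = require('https');

// Join all Buffer chunks first, then decode once as UTF-8.
function decodeChunks(chunks) {
  return Buffer.concat(chunks).toString('utf8');
}

// Hypothetical helper: fetch a page and pass its HTML to callback(err, html).
function fetchHtml(url, callback) {
  var request = https.get(url, function (response) {
    var chunks = [];
    response.on('data', function (chunk) {
      chunks.push(chunk); // chunk is a Buffer, kept as raw bytes for now
    });
    response.on('end', function () {
      callback(null, decodeChunks(chunks));
    });
  });
  request.on('error', callback);
}

// Offline demonstration: the euro sign is three bytes in UTF-8 (E2 82 AC),
// split here across two chunks the way a network response might split it.
var parts = [Buffer.from([0xe2]), Buffer.from([0x82, 0xac])];

// Appending each chunk to a string decodes every chunk separately.
var naive = '';
parts.forEach(function (part) {
  naive += part;
});

console.log('Decoded once at the end is "€":', decodeChunks(parts) === '€');
console.log('Decoded chunk by chunk is "€":', naive === '€');
```

For a page that is plain ASCII the two approaches give the same result, so this only matters when the HTML contains non-ASCII text.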

Now let us look at the NPM website. It has a ul element with the class name packages, and each of the most depended-upon modules sits in its own li element. In the code we selected the DOM element by its class name, ".packages". Then we traversed each li element to get the name and description of the package.


When we run the above code, the name and description of each package are printed to the console, with a line of dashes before each entry.



1 comment:

  1. It might be less confusing for others if you changed the 'json' variable in your code to html, since that's what it actually is.

    It may also be useful for users to understand that chunk (in the 'data' handler) is a Buffer (https://nodejs.org/api/buffer.html#buffer_buffer), which is implicitly converted to a string on the line where you do json += chunk.
