a blog for those who code

Friday 27 January 2017

Web Scraping in Asp.Net using HtmlAgilityPack

In this post we will discuss how to do web scraping in Asp.Net using HtmlAgilityPack. Web scraping is a technique for extracting information from websites. As the volume of data on the web has increased, this practice has become increasingly widespread, and a number of powerful services have emerged to simplify it.

HTML Agility Pack is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT. In simple terms, it's a .NET library that lets you parse HTML documents, including pages fetched from the web.

Install HTML Agility Pack

You can install HTML Agility Pack using the NuGet package manager. To install it, run the following command in the Package Manager Console:

PM> Install-Package HtmlAgilityPack

After adding the reference via NuGet, import the namespace in your page's code-behind with the following using directive (the name is case-sensitive):

using HtmlAgilityPack;

First, create an instance of HtmlWeb, a utility class that fetches an HTML document over HTTP. Then call its Load method to download and parse the entire HTML document, as shown below:

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.codingdefined.com");
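
While experimenting, it can help to parse a fixed HTML string instead of hitting a live site on every run. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name LoadHtmlSketch and the method GetTitle are hypothetical names chosen for illustration. It uses HtmlDocument.LoadHtml to parse a string and SelectSingleNode to read the page title.

```csharp
using HtmlAgilityPack;

public static class LoadHtmlSketch
{
    // Parses an HTML string and returns the text of its <title> element,
    // or null when the document has no <title>.
    public static string GetTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from memory, no HTTP request involved

        // SelectSingleNode returns null when nothing matches the XPath
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        return title == null ? null : title.InnerText;
    }
}
```

Calling LoadHtmlSketch.GetTitle with a small page such as "<html><head><title>Sample page</title></head><body></body></html>" should return the title text. The same selection code then works unchanged on a document returned by HtmlWeb.Load.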

If needed, you can use any of the following Load overloads:

public HtmlDocument Load(string url, string method); // method as GET, POST, PUT etc
public HtmlDocument Load(string url, string method, WebProxy proxy, NetworkCredential credentials); // method, proxy and the credentials for authenticating
public HtmlDocument Load(string url, string proxyHost, int proxyPort, string userId, string password); // proxy host, proxy port, userid and password for authentication

The next step is to get a specific div or span by id or class. To do that, select the nodes with an XPath expression, as shown below:

doc.DocumentNode.SelectNodes("//div[contains(@class,'hentry')]"); // Select all divs whose class contains hentry
doc.DocumentNode.SelectNodes("//div[@id='main']"); // Select all divs with id = main
doc.DocumentNode.SelectNodes("//div[contains(@class,'hentry')]//a"); // Select all a elements inside divs whose class contains hentry
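
One detail worth keeping in mind with these queries: SelectNodes returns null, rather than an empty collection, when no node matches. The sketch below shows one way to guard against that. XPathSketch and GetEntryLinks are hypothetical names, and the HTML is passed in as a string only so the example runs without a network connection.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathSketch
{
    // Returns the href of every <a> found inside a div whose class
    // attribute contains "hentry". Returns an empty list when nothing matches.
    public static List<string> GetEntryLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();
        HtmlNodeCollection anchors =
            doc.DocumentNode.SelectNodes("//div[contains(@class,'hentry')]//a");

        // Guard: a query with no matches yields null, not an empty collection
        if (anchors == null)
            return hrefs;

        foreach (HtmlNode a in anchors)
        {
            // GetAttributeValue falls back to "" when the attribute is missing
            hrefs.Add(a.GetAttributeValue("href", ""));
        }
        return hrefs;
    }
}
```

Returning an empty list instead of null lets callers loop over the result without a separate check.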

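Alternatively, you can query the document using LINQ (this needs using System.Linq;):

// Contains is used because the div carries more than one class; the result is a sequence because a page usually has several posts
var nodes = doc.DocumentNode.Descendants("div").Where(d => d.GetAttributeValue("class", "").Contains("hentry"));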
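
Comparing the whole class attribute as a string can be too strict or too loose when an element carries several classes: an exact comparison misses "hentry post", while a substring test also matches an unrelated class such as "hentry-like". A hedged sketch of a stricter LINQ filter follows; it splits the attribute on whitespace and compares whole class names. LinqSketch and DivsWithClass are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinqSketch
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Returns every div whose class attribute contains className
    // as a whole, whitespace-separated token.
    public static List<HtmlNode> DivsWithClass(HtmlDocument doc, string className)
    {
        return doc.DocumentNode
            .Descendants("div")
            .Where(d => d.GetAttributeValue("class", "")
                         .Split(Whitespace, StringSplitOptions.RemoveEmptyEntries)
                         .Contains(className))
            .ToList();
    }
}
```

With this filter, a div with class "hentry post" matches a search for hentry, while a div with class "hentry-like" does not.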

Example - Scraping CodingDefined.com home page to get post name and link

If you check the home page of Coding Defined, you will see that every post is inside a div with the class names hentry and post. So first we select all the nodes having the class hentry, then the h3 tag inside each of those divs, and finally the a tag, which gives us the title and href. The code to get these details is:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using HtmlAgilityPack;

public partial class _Default : System.Web.UI.Page
{
 protected void Page_Load(object sender, EventArgs e)
 {
  HtmlWeb web = new HtmlWeb();
  HtmlDocument doc = web.Load("http://www.codingdefined.com");
  // SelectNodes returns null when nothing matches, so check before looping
  HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//div[contains(@class,'hentry')]//h3//a");
  if (anchors == null) return;
  foreach (HtmlNode node in anchors)
  {
   HyperLink link = new HyperLink();
   link.Text = node.InnerHtml; // post title
   link.NavigateUrl = node.GetAttributeValue("href", ""); // post URL
   PlaceHolder1.Controls.Add(link); // add the link itself to the placeholder
   PlaceHolder1.Controls.Add(new LiteralControl("<br />")); // line break between links
  }
 }
}

In the above code we read the title and href of each post, store them in a HyperLink, and add that HyperLink to the placeholder.
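
If you want to reuse the scraped data outside a Web Forms page, one option is to collect the titles and links into plain objects first and build the controls afterwards. The following is only a sketch under stated assumptions: PostLink, PostScraperSketch and ExtractPosts are hypothetical names, the h3/a structure is the one described above, and baseUri is a parameter added here so that relative hrefs can be resolved.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical container for one scraped post
public class PostLink
{
    public string Title { get; set; }
    public string Url { get; set; }
}

public static class PostScraperSketch
{
    // Extracts title/URL pairs from anchors inside h3 tags of "hentry" divs.
    public static List<PostLink> ExtractPosts(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var posts = new List<PostLink>();
        HtmlNodeCollection anchors =
            doc.DocumentNode.SelectNodes("//div[contains(@class,'hentry')]//h3//a");
        if (anchors == null)
            return posts; // nothing matched

        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", "");
            Uri resolved;
            // Skip anchors without a usable href; resolve relative links against baseUri
            if (href.Length == 0 || !Uri.TryCreate(baseUri, href, out resolved))
                continue;

            posts.Add(new PostLink
            {
                // InnerText keeps entities such as &amp; so decode them for display
                Title = HtmlEntity.DeEntitize(a.InnerText).Trim(),
                Url = resolved.AbsoluteUri
            });
        }
        return posts;
    }
}
```

In Page_Load you could then loop over the returned list and create one HyperLink per PostLink, which keeps the parsing logic separate from the page and easier to test.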

Results :
