In this post we will discuss how to do web scraping in ASP.NET using HtmlAgilityPack. Web scraping is a technique for extracting information from websites. As the volume of data on the web has increased, the practice has become increasingly widespread, and a number of powerful services have emerged to simplify it.
HTML Agility Pack is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT. In simple terms, it is a .NET library which allows you to parse HTML files from the web.
Install HTML Agility Pack
You can install HTML Agility Pack using the NuGet package manager. To install it, run the following command in the Package Manager Console:
PM> Install-Package HtmlAgilityPack
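Alternatively, if you are working in an SDK-style project, the equivalent .NET CLI command (run from the project folder) is:
dotnet add package HtmlAgilityPack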
After adding the reference via NuGet, include the namespace in your page with the following using directive:
using HtmlAgilityPack;
First, create an instance of HtmlWeb, which is a utility class for retrieving documents over HTTP. Then use the Load method of HtmlWeb to load the entire HTML document, as shown below:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.codingdefined.com");
If you need to, you can use any of the following Load overloads:
public HtmlDocument Load(string url, string method); // method as GET, POST, PUT etc
public HtmlDocument Load(string url, string method, WebProxy proxy, NetworkCredential credentials); // method, proxy and the credentials for authenticating
public HtmlDocument Load(string url, string proxyHost, int proxyPort, string userId, string password); // proxy host, proxy port, userid and password for authentication
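As a quick sketch of how these overloads are called (the proxy host, port and credentials below are placeholders, not real values):
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.codingdefined.com", "GET"); // explicit HTTP method
HtmlDocument docViaProxy = web.Load("http://www.codingdefined.com", "proxy.example.com", 8080, "userId", "password"); // through a proxy with credentials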
The next step is to get a specific div or span by id or class. For that, select the nodes as shown below:
doc.DocumentNode.SelectNodes("//div[contains(@class,'hentry')]") // Select all the div's having class hentry
doc.DocumentNode.SelectNodes("//div[@id='main')]") // Select all div's having id = main
doc.DocumentNode.SelectNodes("//div[contains(@class,'hentry')]//a") // Select all the a's inside class hentry
Alternatively, you can query the document using LINQ:
var nodes = doc.DocumentNode.Descendants("div").Where(d => d.GetAttributeValue("class", "").Contains("hentry")); // all divs whose class contains hentry
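To pull the post title and link out of those nodes with LINQ alone, here is a sketch that mirrors the XPath-based example below:
foreach (var div in nodes)
{
    var anchor = div.Descendants("a").FirstOrDefault();
    if (anchor != null)
    {
        string title = anchor.InnerText; // post title
        string href = anchor.GetAttributeValue("href", ""); // post URL
    }
}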
Example - Scraping CodingDefined.com home page to get post name and link
If you check the home page of Coding Defined, you will see that every post is inside a div having the class names hentry and post. So first we will get all the nodes having the class hentry. Next we get the h3 tag of that div, and finally the a tag, which gives us the title and the href. The code to get all the details is:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using HtmlAgilityPack;

public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Load the home page
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.codingdefined.com");

        // Select every anchor inside an h3 of a div whose class contains "hentry"
        var postLinks = doc.DocumentNode.SelectNodes("//div[contains(@class,'hentry')]//h3//a");
        if (postLinks == null) return; // SelectNodes returns null when nothing matches

        foreach (HtmlNode node in postLinks)
        {
            // The anchor text is the post title and the href is the post URL
            HyperLink link = new HyperLink();
            link.Text = node.InnerHtml;
            link.NavigateUrl = node.Attributes["href"].Value;
            PlaceHolder1.Controls.Add(link);
            PlaceHolder1.Controls.Add(new LiteralControl("<br />"));
        }
    }
}
In the above code we are getting the title and href of each post, saving the information in a HyperLink control and adding it to the PlaceHolder.
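If you want to test the same scraping logic outside of a Web Forms page, a minimal console sketch (same XPath, just writing to the console instead of a PlaceHolder) could look like this:
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.codingdefined.com");

        // Same XPath as above: anchors inside the h3 of every hentry div
        var links = doc.DocumentNode.SelectNodes("//div[contains(@class,'hentry')]//h3//a");
        if (links == null) return;

        foreach (HtmlNode node in links)
        {
            Console.WriteLine(node.InnerText + " -> " + node.GetAttributeValue("href", ""));
        }
    }
}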
Results :
Please Like and Share the CodingDefined Blog if you find it interesting and helpful.