Jsoup – A better HTML parser than HTMLAgilityPack

Written on: February 17, 2013

After using HTMLAgilityPack for a long time, I finally switched to Jsoup. Jsoup is better at parsing HTML, and best of all it supports jQuery-like selectors, which make it easy to select elements. I have often seen HTMLAgilityPack fail to extract the correct data. For example, when extracting the meta keywords and description from a website built with ASP.NET, it usually fails because of extra spaces added by Master pages.
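
As a rough illustration of the meta-tag case, here is a minimal sketch in C#. It assumes Jsoup has already been compiled to a .NET assembly with IKVM.NET (covered in the next section) and that the project references that assembly along with the IKVM runtime DLLs. The HTML string, the class name MetaExtractor, and the stray whitespace in the content attributes are made up for the example; they are not taken from a real site.

```csharp
using System;
using org.jsoup;        // after IKVM compilation, namespaces mirror Jsoup's Java packages
using org.jsoup.nodes;

class MetaExtractor     // hypothetical class name, for illustration only
{
    static void Main()
    {
        // Made-up markup with the kind of stray whitespace a Master page might emit.
        string html =
            "<html><head>" +
            "<meta name=\"keywords\" content=\"  jsoup, html parser  \">" +
            "<meta name=\"description\" content=\"\r\n   A sample page.   \">" +
            "</head><body></body></html>";

        Document doc = Jsoup.parse(html);

        // Attribute selectors work much like jQuery's. attr() reads from the
        // first match and returns an empty string when nothing matches.
        string keywords = doc.select("meta[name=keywords]").attr("content").Trim();
        string description = doc.select("meta[name=description]").attr("content").Trim();

        Console.WriteLine(keywords);
        Console.WriteLine(description);
    }
}
```

Parsing from a string keeps the sketch independent of any network access. Calling Trim() on the result is one simple way to drop leading and trailing whitespace from the values.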

How to bring the Java-written Jsoup to the .NET world?

The solution is very simple. Jsoup can easily be compiled into a .NET library using IKVM.NET. You will need the IKVM.NET runtime to run the compiled Jsoup library, but in the last year and a half I haven't seen any issues because of this. It has handled almost every task for which I had been using the HTMLAgilityPack library. There is a small learning curve when switching to Jsoup, but the switch is worth it. One more point: Jsoup comes with a good list of examples and selectors to make your life easy.

Because of licensing and distribution issues, I am not posting the .NET-compiled Jsoup here. If you have problems converting it, just drop me a line.

C# Snippet using Jsoup
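
What follows is a minimal sketch of what such a snippet can look like, not necessarily the code originally posted here. It assumes a project that references the IKVM-compiled Jsoup assembly and the IKVM runtime DLLs. The URL and the .demo-btn selector are placeholders borrowed from the CsQuery example in the comments below, and the class name LinkExtractor is hypothetical.

```csharp
using System;
using org.jsoup;
using org.jsoup.nodes;

class LinkExtractor     // hypothetical class name, for illustration only
{
    static void Main()
    {
        // Placeholder URL: fetch the page and parse it into a Document.
        Document doc = Jsoup.connect("http://www.sitetoscrap.com").get();

        // first() returns null when the selector matches nothing.
        Element button = doc.select(".demo-btn").first();
        if (button != null)
        {
            // The "abs:" prefix asks Jsoup to resolve the href against the page's base URL.
            string url = button.attr("abs:href").Trim();
            Console.WriteLine(url);
        }
    }
}
```

Method names keep their Java casing (select, first, attr) because the assembly is compiled straight from the Java library, while the returned strings are ordinary .NET strings, so Trim() and other .NET string methods apply.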

4 Comments

  1. jahmani says:

    Hello. I found this page while googling for a good way to extract meta info from web pages. I'm new to Jsoup, but have you tried NSoup (http://nsoup.codeplex.com)? It's a .NET port of Jsoup. Can I use it instead of compiling Jsoup for the CLR? If not, please send me the compiled Jsoup.

  2. James T says:

    If you’re looking for a CSS/jQuery solution in .NET take a look at CsQuery: http://github.com/jamietre/csquery and on nuget: csquery

    It’s a native C# jQuery port based on a C# port of the validator.nu HTML parser. Your code would look much the same

    var doc = CQ.CreateFromUrl("http://www.sitetoscrap.com");
    var url = doc.Select(".demo-btn").First().Attr("abs:href").Trim();

    .. and so on, though CsQuery also has some shorthand like [] for selectors:

    CQ urlNode = doc[".demo-btn"].First();

    [] is also an array indexer since the CQ object is array-like, and finally provides attribute access from nodes:

    IDomObject urlNode = doc.Select(".demo-btn")[0];
    var url = urlNode["abs:href"];

    .. so you could put it all together:

    var url = doc[".demo-btn"][0]["href"];

    .. though this looks nicer, jQuery style:

    var url = doc[".demo-btn:first"].Attr("href");

    Every CQ object exposes IEnumerable, a sequence of elements, so you can use LINQ in addition to the full jQuery API.

  3. bin says:

    Hi,

    Thank you for your article; it was really useful (a simple, nice-looking solution; NSoup does not work for me).
    I have a problem with HTTPS URLs. An exception is thrown:
    http://stackoverflow.com/questions/7744075/how-to-connect-via-https-using-jsoup

    I have imported the certificate, but I don't know how to set up trust in .NET code…
