After using HTMLAgilityPack for a long time, I finally switched to Jsoup. Jsoup is better at parsing HTML, and the best thing is that it supports jQuery-like selectors, which make selecting elements easy. I have often seen HTMLAgilityPack fail to extract the correct data. For example, when extracting the meta keywords and description from a website built with ASP.NET, it usually fails because of extra whitespace added by master pages.
How to bring the Java-based Jsoup to the .NET world?
The solution is very simple: Jsoup can easily be compiled to a .NET library using IKVM.NET. You will need the IKVM.NET runtime to run the compiled Jsoup library, but in the last year and a half I haven’t seen any issues because of this. It has handled almost every task for which I was using HTMLAgilityPack. There is a small learning curve when switching to Jsoup, but it is worth it. One more point: Jsoup comes with a good set of examples and selector documentation to make your life easy.
Because of licensing and distribution issues, I am not posting the .NET-compiled Jsoup here. If you have problems converting it, just drop me a line.
C# Snippet using Jsoup
// using Jsoup
var doc = Jsoup.connect("http://www.sitetoscrap.com").timeout(10000).get();
var url = doc.select(".demo-btn").first().attr("abs:href").Trim();
var tags = doc.select("td.Table-Post3").eq(3).text();
var desc = doc.select("td.Table-Post3").eq(4).text();
var name = doc.select("#post").first().select("H2").first().text();
//
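For the meta keywords and description case mentioned at the top, a minimal sketch might look like the following. It assumes the IKVM-compiled assembly exposes the Java packages as the namespaces org.jsoup and org.jsoup.nodes, and that the stray whitespace sits inside the content attribute. The MetaDemo class, the MetaContent helper and the sample HTML are made up for illustration.

```csharp
using System;
using org.jsoup;        // assumed namespace of the IKVM-compiled jsoup assembly
using org.jsoup.nodes;

public static class MetaDemo
{
    // Hypothetical helper: returns the trimmed content of <meta name="...">.
    // Elements.attr() gives back an empty string when nothing matches,
    // so a missing tag yields "" rather than an exception.
    public static string MetaContent(Document doc, string name)
    {
        return doc.select("meta[name=" + name + "]").attr("content").Trim();
    }

    public static void Main()
    {
        // Sample markup with padded attribute values (illustrative only)
        const string html =
            "<html><head>" +
            "<meta name=\"description\" content=\"   A sample page   \" />" +
            "<meta name=\"keywords\" content=\" jsoup, ikvm \" />" +
            "</head><body></body></html>";

        Document doc = Jsoup.parse(html);
        Console.WriteLine(MetaContent(doc, "description"));
        Console.WriteLine(MetaContent(doc, "keywords"));
    }
}
```

Parsing a string with Jsoup.parse keeps the example independent of any live site; for a real page, the Jsoup.connect(...).get() call from the snippet above would supply the Document instead.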
Hello. I found this page while googling for a good way to extract meta info from webpages. I’m new to Jsoup, but have you tried NSoup (http://nsoup.codeplex.com)? It’s a .NET port of Jsoup. Can I use it instead of compiling Jsoup for the CLR? If not, please send me the compiled Jsoup.
If you’re looking for a CSS/jQuery solution in .NET, take a look at CsQuery: http://github.com/jamietre/csquery (also on NuGet as csquery).
It’s a native C# jQuery port based on a C# port of the validator.nu HTML parser. Your code would look much the same:
var doc = CQ.CreateFromUrl("http://www.sitetoscrap.com");
var url = doc.Select(".demo-btn").First().Attr("href").Trim();
.. and so on, though CsQuery also has some shorthand like [] for selectors:
CQ urlNode = doc[".demo-btn"].First();
[] is also an array indexer since the CQ object is array-like, and finally provides attribute access from nodes:
IDomObject urlNode = doc.Select(".demo-btn")[0];
var url = urlNode["href"];
.. so you could put it all together:
var url = doc[".demo-btn"][0]["href"];
.. though this looks nicer, jQuery style:
var url = doc[".demo-btn:first"].Attr("href");
Every CQ object exposes IEnumerable<IDomObject>, a sequence of elements, so you can use LINQ in addition to the full jQuery API.
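As a rough sketch of the LINQ side, assuming the csquery NuGet package is referenced: the LinqDemo class, the DemoButtonHrefs helper and the sample markup are invented for illustration, and the attribute lookup uses the same node indexer shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using CsQuery;

public static class LinqDemo
{
    // Hypothetical helper: collects the href of every .demo-btn element.
    public static List<string> DemoButtonHrefs(string html)
    {
        CQ doc = CQ.Create(html);

        // A CQ selection enumerates IDomObject nodes, so LINQ operators
        // can be chained straight onto the selector result.
        return doc[".demo-btn"]
            .Where(node => !string.IsNullOrEmpty(node["href"]))
            .Select(node => node["href"])
            .ToList();
    }

    public static void Main()
    {
        // Sample markup (illustrative only)
        const string html =
            "<a class='demo-btn' href='/one'>One</a>" +
            "<a class='demo-btn' href='/two'>Two</a>" +
            "<a class='other' href='/three'>Three</a>";

        foreach (string href in DemoButtonHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Where comes first here because CQ has its own Select method for selectors; starting with Where gives a plain IEnumerable, so the following Select is unambiguously the LINQ one.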
James
Thanks for sharing, I’ll try this for sure.