Saturday, February 16, 2013

Jsoup – A better HTML parser than HTMLAgilityPack

After using HTMLAgilityPack for a long time, I finally switched to Jsoup. Jsoup is better at parsing HTML, and best of all it supports jQuery-like selectors that make it easy to select elements. I have often seen HTMLAgilityPack fail to extract the correct data. For example, when extracting the meta keywords and description from a website built with ASP.NET, it usually fails because of extra whitespace added by master pages.

How to bring the Java-based Jsoup to the .NET world?

The solution is very simple. Jsoup can easily be compiled to a .NET library using IKVM.NET. You will need the IKVM.NET runtime to run the Jsoup library, but in the last year and a half I haven't seen any issues because of this. It has handled almost every task for which I was using HTMLAgilityPack. There is a small learning curve when switching to Jsoup, but it is worth it. One more point: Jsoup comes with a good list of examples and selectors to make your life easy.

Because of licensing and distribution issues, I am not putting the .NET-compiled Jsoup here. If you have problems converting it, just drop me a line.

C# Snippet using Jsoup

// using Jsoup (compiled to a .NET assembly with IKVM.NET)
var doc = Jsoup.connect("http://www.sitetoscrap.com").timeout(10000).get();
// "abs:href" resolves the link against the page's base URL
var url = doc.select(".demo-btn").first().attr("abs:href").Trim();
// eq(n) picks the match at zero-based index n
var tags = doc.select("td.Table-Post3").eq(3).text();
var desc = doc.select("td.Table-Post3").eq(4).text();
var name = doc.select("#post").first().select("H2").first().text();
//


Saturday, January 5, 2013

Using Multiline String Literals in C#

What if you want to put a very long string literal in your code, one that spans multiple lines? The C# compiler reports an error if a regular string literal spans multiple lines.

var mylongline = @"start of string literal line1 
start of string literal line2
start of string literal line3
start of string literal line4";

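The trick is to prefix the string with the @ symbol, which makes it a verbatim string literal, and the compiler will stop complaining. The line breaks inside the literal become part of the string.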

Monday, December 10, 2012

Virtual Store on Windows 7 and Vista

I just noticed a new kind of OS folder that was created after a program tried to write to the C:\Program Files\ folder. Why was this folder created when I didn't do anything to create it?

On Windows Vista, Windows 7 and later versions there is a feature called Virtual Store. Whenever a program attempts a write operation in an OS-protected folder, the OS redirects it and stores the files in a similar location under the folder below.

C:\Users\<username>\AppData\Local\VirtualStore\Program Files\

Applications performing the write operation are not aware of this virtualization. They believe the files are stored in the original location, so no error is generated.
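
To make the redirection concrete, here is a rough sketch of how the redirected location relates to the original one. It assumes the default profile location C:\Users\<username> and that the path below the drive root is mirrored unchanged under VirtualStore. The function name, the user "alice" and the MyApp folder are made up for illustration.

```javascript
// Sketch: compute where a redirected write would land under VirtualStore.
// Assumptions: default profile folder C:\Users\<username>, and the path
// below the drive root is mirrored unchanged under VirtualStore.
function virtualStorePath(username, originalPath) {
  // Strip the drive prefix such as "C:\" to get the part that is mirrored.
  const relative = originalPath.replace(/^[A-Za-z]:\\/, "");
  return "C:\\Users\\" + username + "\\AppData\\Local\\VirtualStore\\" + relative;
}

// Hypothetical user name and application folder.
console.log(virtualStorePath("alice", "C:\\Program Files\\MyApp\\settings.ini"));
// prints C:\Users\alice\AppData\Local\VirtualStore\Program Files\MyApp\settings.ini
```

If a file you expected under Program Files seems to be missing, the mirrored path under VirtualStore is the first place to look.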

Monday, July 2, 2012

Enable Disable Anchor Element Using jQuery

Anchor elements work differently from input elements: the browser does not honor a disabled attribute on them. To enable or disable hyperlinks you need to handle the click event, prevent the default navigation, and change the visual appearance manually.

<div>
<input id="b1" value="Disable Yahoo Link" type="button">
<input id="b2" value="Enable Yahoo Link" type="button">
</div>      

<a id="goog" href="http://Google.com">Google.com</a> <br>
<a id="yahoo" href="http://Yahoo.com" target="_blank">Yahoo.com</a>    

<script>
$(document).ready(function(){
    $("#b1").click(function(){
        $("#yahoo").attr("disabled","disabled");
        $("#yahoo").css("background-color","silver");
    });

    $("#b2").click(function(){
        $("#yahoo").removeAttr("disabled");
        $("#yahoo").css("background-color","white");
    });

    $("#yahoo").click(function(e){
        if($("#yahoo").attr("disabled")=="disabled")
        {
            e.preventDefault();
        }
    });
});
</script>
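
The heart of the technique is the click guard, and it does not depend on jQuery. Below is a minimal sketch of the same check in plain JavaScript. The names guardClick, fakeLink and fakeEvent are made up for illustration, and the two fake objects only stand in for a DOM element and a click event so the sketch can run outside a browser.

```javascript
// Minimal sketch of the click guard, independent of jQuery.
// `link` is any object exposing getAttribute(); in a browser it would be the <a> element.
function guardClick(link, event) {
  // An anchor ignores the "disabled" attribute, so cancel navigation ourselves.
  if (link.getAttribute("disabled") !== null) {
    event.preventDefault();
    return false; // click was blocked
  }
  return true; // click proceeds normally
}

// Stand-ins for a DOM element and a click event (hypothetical helpers).
function fakeLink(disabled) {
  return { getAttribute: (name) => (name === "disabled" && disabled ? "disabled" : null) };
}
function fakeEvent() {
  return { defaultPrevented: false, preventDefault() { this.defaultPrevented = true; } };
}

const blocked = fakeEvent();
console.log(guardClick(fakeLink(true), blocked), blocked.defaultPrevented);  // false true
const allowed = fakeEvent();
console.log(guardClick(fakeLink(false), allowed), allowed.defaultPrevented); // true false
```

In a page, the same function could be wired up with document.getElementById("yahoo").addEventListener("click", function (e) { guardClick(this, e); }).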

Live Demo – Enable Disable Anchor Using jQuery

Wednesday, June 27, 2012

Generating .Net Webservice Proxy From Command Line

It was very disappointing to see that the Visual Studio 2010 and Visual Studio 2011 Web Express editions were not able to generate a simple web service proxy. So what if you are in a situation where Visual Studio is not available and you need to access a web service? WSDL.exe, a well-known utility that comes with the .NET SDK, can generate a .NET SOAP client library for you. Follow the steps below:

Step 1. Generate the proxy code in a .NET language (C# or VB.NET). In my case WSDL.exe was available at

[ C:\Program Files\Microsoft SDKs\Windows\v8.0A\bin\NETFX 4.0 Tools ]

Command line is :

wsdl http://webservice.in/ws.asmx?WSDL /out:d:\mygeolib.cs

Step 2. Generate the .NET library (.dll) for your project

If you are using .NET Framework 4.0, just make sure CSC.exe is in your path. In my case CSC was available at

[ C:\Windows\Microsoft.NET\Framework\v4.0.30319 ]

Command line is :

csc /out:d:\mygeolib.dll  /t:library d:\mygeolib.cs

Thursday, May 10, 2012

Web Page Scraping Using Selenium and .Net

Selenium is a popular browser automation framework with bindings available for various languages (including C#, Ruby, Python and Java). Over time I have used different methods to scrape data from the web, including JavaScript, jQuery, HTMLAgilityPack, Jsoup and so on.

The best thing about Selenium is that you can use it to scrape pages that are rendered using Ajax, JSON or client-side templates. Here is how you can download Ajax-powered pages using Selenium.

First of all, download the necessary .NET bindings from the Selenium download page.

I prefer to use LINQPad for all kinds of scripting needs. Even if you use Visual Studio, you will need to add references to "ThoughtWorks.Selenium.Core.dll" and "WebDriver.dll" to run the project. The bare minimum code snippet to run Selenium is below:

//--
//Below path should contain IEDriverServer.exe
var ie=new OpenQA.Selenium.IE.InternetExplorerDriver(@"D:\Mylibs\selenium");
//Setting Url navigates to the page, so no separate Navigate() call is needed
ie.Url=@"http://www.next.co.in/categories/Electronics-Mobiles-Mobile-Phones/cid-CU00003872.aspx";

//extract the html 
var retval=ie.ExecuteScript("return document.body.outerHTML"); 

//save it to local file for further processing 
File.WriteAllText(@"F:\f1.html", retval.ToString()); 
//Load second dynamic page where data gets loaded from ajax call 
//by click of the pager links 
ie.FindElementsByLinkText("2").SingleOrDefault().Click(); 
retval=ie.ExecuteScript("return document.body.outerHTML"); 
File.WriteAllText(@"F:\f2.html", retval.ToString()); 
//quit the browser 
ie.Quit(); 
// --
