Web Page Scraping Using Selenium and .Net
Selenium is a popular browser automation framework with bindings for several languages, including C#, Ruby, Python and Java. Over time I have used various tools to scrape data from the web, including JavaScript, jQuery, HtmlAgilityPack, Jsoup and so on.
The best thing about Selenium is that you can use it to scrape pages whose content is rendered through Ajax calls, JSON data or client-side templates. Here is how you can download Ajax-powered pages using Selenium.
First of all, download the .NET bindings from the Selenium download page.
I prefer to use LINQPad for all kinds of scripting needs. Even if you use Visual Studio, you will need to add references to “ThoughtWorks.Selenium.Core.dll” and “WebDriver.dll” to run the project. The bare minimum code snippet to run Selenium is below:
//--
// The folder below should contain IEDriverServer.exe
var ie = new OpenQA.Selenium.IE.InternetExplorerDriver(@"D:\Mylibs\selenium");
// Open the first page
ie.Navigate().GoToUrl(@"http://www.next.co.in/categories/Electronics-Mobiles-Mobile-Phones/cid-CU00003872.aspx");
// Extract the rendered HTML
var retval = ie.ExecuteScript("return document.body.outerHTML");
// Save it to a local file for further processing
File.WriteAllText(@"F:\f1.html", retval.ToString());
// Load the second page, whose data is fetched by an Ajax call
// when the pager link is clicked
// (FindElement throws NoSuchElementException if the link is missing)
ie.FindElement(OpenQA.Selenium.By.LinkText("2")).Click();
retval = ie.ExecuteScript("return document.body.outerHTML");
File.WriteAllText(@"F:\f2.html", retval.ToString());
// Quit the browser
ie.Quit();
// --
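One caveat about the pager click: the Ajax call may still be in flight when the HTML is read right after Click(), so the second file can end up holding the first page's content. A minimal sketch of one way around this is to poll until the page HTML differs from what was captured before the click. The AjaxWait class and its WaitForChange method are hypothetical names of my own, not part of Selenium, and the timeout and polling interval are assumptions you would tune for the site.

```csharp
using System;
using System.Threading;

public static class AjaxWait
{
    // Hypothetical helper (not a Selenium API): calls getHtml repeatedly
    // until its result differs from previousHtml, or throws on timeout.
    public static string WaitForChange(Func<string> getHtml, string previousHtml,
                                       TimeSpan timeout, TimeSpan pollInterval)
    {
        DateTime deadline = DateTime.UtcNow + timeout;
        while (true)
        {
            string current = getHtml();
            if (current != previousHtml)
                return current; // the content changed, so return the new HTML

            if (DateTime.UtcNow >= deadline)
                throw new TimeoutException("Page content did not change within the timeout.");

            Thread.Sleep(pollInterval); // pause before polling again
        }
    }
}
```

In the script, you would keep the first page's HTML in a string, click the pager link, and then pass a lambda that runs ie.ExecuteScript("return document.body.outerHTML").ToString() as getHtml, with something like TimeSpan.FromSeconds(10) and TimeSpan.FromMilliseconds(250) for the two time spans. Note that any change to the body ends the wait, so a loading spinner would also count. Selenium also ships its own WebDriverWait class in WebDriver.Support.dll (namespace OpenQA.Selenium.Support.UI), which is probably the better choice when you can wait on a specific element.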
Now you have the complete HTML pages saved locally. You can process them any way you want.