Screen Scraping
by Tiberius OsBurn | Published 09/08/2002 | Web Development, Web Services
Tiberius OsBurn

Tiberius OsBurn is a Senior Developer/System Analyst for The Gallup Organization (http://www.gallup.com). He recently completed a huge data warehousing project that archived data and documents from 1935 to the present - all coded in C#, SQL Server and ASP.NET.

Tiberius has extensive experience in VB, VB.NET, C#, SQL Server, ASP.NET and various other web technologies. Be sure to visit his site for his latest articles of interest to .NET developers.

http://tiberi.us

 


Article source code: screen_scrape.zip

Not too long ago, if you wanted some particular information off of a particular web site, you'd have to snake the HTML off a page and incorporate it into yours. Whether you did that manually via cut and paste or with a homegrown process was up to you - usually it involved some pain and misery to get it right.

Even today, as we teeter on the 'new age' of web services, we still have problems getting what we want from our favorite web pages - maybe we need some information that isn't exposed via a web service, and until the Frito chomping, Jolt drinking programmer that wrote the page shuts off 'Star Trek', gets up off the sofa and writes a web service, we'll have to do their job for them.

The idea of screen scraping isn't new. In fact, many unsavory types use some sort of screen scraping to retrieve email addresses and harvest images from unsuspecting sites. This is common practice on the web - one that is nefarious and ill-received by most of the Internet community.

No, I'm not going to show you how to screen scrape email addresses off of pages, so don't ask me - instead, we'll do a little constructive scraping in order to put more content out on the web.

A word of caution:

In reality, you can scrape ANY site on the web. Now, just a quick warning, this may not be the most 'legal' thing to do, especially if you haven't received permission from the owner of the content. Just make sure that you get the 'okey-dokey' from the owner of the content if you are going to redistribute their content.

Coding Offensively and Defensively

Over the years I've picked up a nice habit of adding comments to my HTML code. I'd always get lost in the many table and td tags, so I'd demark sections of HTML with a begin and end comment. For instance, the section on my site called 'HIP' is demarked with <!-- BEGIN HIP --> and <!-- END HIP -->. What we want to scrape is whatever sits between those HTML comments, namely the layout and images of that section.

I'll go on record saying that if you demark your HTML code, you'll have a hell of a lot easier time setting up a screen scraper for your site. If you'd rather not have every curious scraper snatching information off your site, I would strongly encourage you to 'bunch' up your code - making it as difficult to scrape as possible - in other words, don't format your code and don't add comments. One of the easiest ways to ward off a 'scraper' is to put your entire HTML (or the HTML output) on ONE line. With no comments or line breaks to anchor on, even the most ardent of content scrapers will be busy for hours scouring your code for a nice break.

Remember, you can scrape ANY site, so if you don't like that idea, you'll have to take measures to ensure that it's more pain than gain.

Viewing Source

If you want to scrape, you'll have to view the HTML source of the site. Let's take a quick look at the source of my default.aspx page...

<!-- BEGIN HIP -->
<tr>
<td align="left" valign="center" width="100"><br>
<br>
<IMG src="http://tiberi.us/images/hip.gif">
</td>
</tr>
<tr>
<td>
<IMG SRC="http://tiberi.us/images/hip/microsoftphone.gif">
This is very, very sweet...
Microsoft's new phone, the Pocket PC Phone Edition is sure to
ring your bell.
<br>
<a href="http://www.microsoft.com/mobile/pocketpc/phoneedition/">
Pocket PC Phone</a>
</td>
</tr>
<!-- END HIP -->

Here we can clearly see where my 'HIP' section begins and ends. This is important, because if you want to capture the content on a site, you'll have to find where it begins and where it ends. Look hard for a unique demarcation - somewhere there is a clear beginning to the content and a clear ending. Without one, you'll end up with a lot of garbage that you don't want.

Once you've become familiar with the HTML source, you're ready to craft a regular expression.
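Before writing the expression, it can help to confirm that the markers you picked really are unique on the page. Here is a minimal sketch of that check, assuming you already have the page source in a string. The MarkerCheck class and its method names are made up for this illustration and are not part of the article's download.

```csharp
using System;

public static class MarkerCheck {
    // Counts non-overlapping, case-insensitive occurrences of marker in html.
    public static int CountOccurrences(string html, string marker) {
        if (string.IsNullOrEmpty(marker)) return 0;
        int count = 0;
        int pos = 0;
        while ((pos = html.IndexOf(marker, pos, StringComparison.OrdinalIgnoreCase)) >= 0) {
            count++;
            pos += marker.Length;
        }
        return count;
    }

    // True when each marker appears exactly once and BEGIN comes before END.
    public static bool IsCleanPair(string html, string begin, string end) {
        if (CountOccurrences(html, begin) != 1) return false;
        if (CountOccurrences(html, end) != 1) return false;
        return html.IndexOf(begin, StringComparison.OrdinalIgnoreCase)
             < html.IndexOf(end, StringComparison.OrdinalIgnoreCase);
    }
}
```

If IsCleanPair comes back false for the pair you chose, you'd probably want to pick a different pair of markers before going any further.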

Firing up RegEx

So, with that in mind, we'll fire up the regular expression object, Regex, and parse out the HIP section quite painlessly.

If you're not a fan of Regular Expressions, you soon will be. If you've been a Java or C++ programmer, you've been spoiled by how nice regular expressions are. If you were a Visual Basic programmer, you were stuck with some crappy OCX or a DLL Library or regular expressions in VBScript that didn't quite work right. Now that .NET is on the scene, have no fear - you'll be using RegEx plenty.

Let's take a peek at our regular expression that we use to get out the content we want from tiberi.us:

Regex regex = new Regex("<!-- BEGIN HIP -->((.|\n)*?)<!-- END HIP -->",
    RegexOptions.IgnoreCase);

Look confusing? Naw. It's simple.

We want to get out whatever is between <!-- BEGIN HIP --> and <!-- END HIP -->. The ((.|\n)*?) part of the expression, as foreign and weird as it looks, actually isn't that bad.

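The period character matches any character except a new line. The | character followed by \n adds the new line back in as an alternative, so (.|\n) matches any character at all. The asterisk tells the RegEx engine to match on zero or more occurrences, and the question mark makes that match lazy: it grabs as little as possible, so the match stops at the first <!-- END HIP --> rather than the last one on the page.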
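To see what the question mark buys you, here is a small sketch that runs the same pattern with and without it. The sample page with two HIP sections is invented for the demonstration, and so are the LazyVersusGreedy class and its method names. Groups[1] is the outer set of parentheses, meaning the text between the two comments.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVersusGreedy {
    // Hypothetical page source containing two HIP sections.
    public const string Sample =
        "<!-- BEGIN HIP -->\nfirst\n<!-- END HIP -->\n" +
        "<p>other stuff</p>\n" +
        "<!-- BEGIN HIP -->\nsecond\n<!-- END HIP -->";

    // Lazy quantifier (*?): stops at the first END marker it can reach.
    public static string Lazy(string html) {
        Regex regex = new Regex("<!-- BEGIN HIP -->((.|\n)*?)<!-- END HIP -->",
            RegexOptions.IgnoreCase);
        return regex.Match(html).Groups[1].Value;
    }

    // Greedy quantifier (*): runs on to the last END marker on the page.
    public static string Greedy(string html) {
        Regex regex = new Regex("<!-- BEGIN HIP -->((.|\n)*)<!-- END HIP -->",
            RegexOptions.IgnoreCase);
        return regex.Match(html).Groups[1].Value;
    }
}
```

Calling Lazy(LazyVersusGreedy.Sample) gives back only the first section's contents, while Greedy swallows everything from the first BEGIN to the last END, including the paragraph in the middle.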

It's beyond the scope of this article to delve too deep into regular expressions, but there are plenty of resources out there if you'd like to learn more.

Getting down to Business

If we look at our code, you'll see that we're using a StreamReader, the WebRequest and WebResponse objects and the ubiquitous Regex object.

Coding our Screen Scraper:

private string getHip() {

    string strContent;

    //Here's the work horse of what we're doing, the WebRequest object 
    //fetches the URL
    WebRequest objRequest = WebRequest.Create("http://tiberi.us");

    //The WebResponse object gets the Request's response (the HTML).
    //The using blocks close the response and the reader when we're done.
    using (WebResponse objResponse = objRequest.GetResponse())
    //Now dump the contents of our HTML in the Response object to a 
    //Stream reader
    using (StreamReader oSR = new StreamReader(objResponse.GetResponseStream())) {
        //And dump the StreamReader into a string...
        strContent = oSR.ReadToEnd();
    }

    //Here we set up our Regular expression to snatch what's between the 
    //BEGIN and END
    Regex regex = new Regex("<!-- BEGIN HIP -->((.|\n)*?)<!-- END HIP -->",
        RegexOptions.IgnoreCase);

    //Here we apply our regular expression to our string using the 
    //Match object. 
    Match oM = regex.Match(strContent);

    //Bam! We return the value from our Match (an empty string if 
    //nothing matched), and we're in business. 
    return oM.Value;
}

I've done some pretty liberal commenting - so you'll be able to figure out what's going on. I fill my WebRequest object with the URL to my site, then fill the WebResponse object up with the resultant HTML. After that, I dump the WebResponse object into a StreamReader and then into a string, which is, in turn, parsed by the regular expression engine.

Not much to it, is there?
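If you end up scraping more than one section, the matching step can be pulled out into a helper that takes the section name as a parameter. The following is only a sketch of one way to do that: the SectionScraper class and the TryExtract method are made up for this illustration. It swaps in RegexOptions.Singleline, which lets the period match new lines too, in place of the (.|\n) trick, and it uses Regex.Escape so that a section name with special characters can't break the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SectionScraper {
    // Looks for <!-- BEGIN name --> ... <!-- END name --> in html.
    // On success, section holds the text between the two comments.
    // On failure, section is an empty string and the method returns false.
    public static bool TryExtract(string html, string name, out string section) {
        string begin = Regex.Escape("<!-- BEGIN " + name + " -->");
        string end = Regex.Escape("<!-- END " + name + " -->");

        // Singleline makes '.' match new lines as well; *? keeps the match lazy.
        Regex regex = new Regex(begin + "(.*?)" + end,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        Match m = regex.Match(html);
        section = m.Success ? m.Groups[1].Value : "";
        return m.Success;
    }
}
```

Passing the string from ReadToEnd() and "HIP" to TryExtract should give the same section as the method above, minus the two comment markers, and the false return value tells you when the page layout has changed out from under you.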

Web Service

Now that we've done the tough part - we can have a little cake with our code. Transforming a method into a full-blown web service is simple. Essentially, all we need to do is make the method public, put it in a class that lives in an .asmx web service, and whip a [WebMethod] attribute above it. Magically, we have a web service ready for the world to use.

using System;
using System.Collections;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Web;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;
using System.Text;
using System.Diagnostics;
using System.Web.Services;

namespace screenscrape {

    public class getHip : System.Web.Services.WebService {

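        public getHip() {
            //InitializeComponent() is not shown in this listing; presumably 
            //it comes from the designer-generated code in the download.
            InitializeComponent();
        }

        [WebMethod]
        public string getHipWS() {
            string strContent;
            string strURL = "http://tiberi.us";
            WebRequest objRequest = WebRequest.Create(strURL);

            //The using blocks close the response and the reader when we're done
            using (WebResponse objResponse = objRequest.GetResponse())
            using (StreamReader oSR = new StreamReader(objResponse.GetResponseStream())) {
                strContent = oSR.ReadToEnd();
            }

            Regex regex = new Regex("<!-- BEGIN HIP -->((.|\n)*?)<!-- END HIP -->",
                RegexOptions.IgnoreCase);
            Match oM = regex.Match(strContent);

            //Wrap the scraped rows in a table so they render on their own
            return "<table width=100 border=0 align=center>" + oM.Value + "</table>";
        }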

    }

}

Please feel free to download the code for this project.

Finally, if you're interested in doing some serious screen scraping, I'd suggest that you bone up on regular expressions - you'll need them. Most sites have a usage policy that you'll have to slog through as well; you don't want to be getting threatening email from angry lawyers and livid webmasters.
