Tuesday, September 15, 2009

Extracting anchor tag <a> using C# RegEx

I needed to parse an HTML file and extract all the anchor tags in it. First I tried some string manipulation, but that was a mess! Then I tried regular expressions to achieve the same, but since I am not good at regular expressions (not even bad :) ), they gave me a really hard time. But as always the web was there to save me, and by combining my search and programming expertise I was at last able to write a piece of code that extracts all anchor "<a>" tags with a given CSS class from an HTML file...

The code is given below. It first reads a web page and saves its HTML in a string variable, then runs the regular expression over that string.


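            // Needs: using System; using System.IO; using System.Net; using System.Text.RegularExpressions;
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://enggwaqas.spaces.live.com");
            try
            {
                string szResult;
                // Download the page and keep its HTML in a string.
                using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                using (StreamReader sr = new StreamReader(response.GetResponseStream()))
                {
                    szResult = sr.ReadToEnd();
                }

                // Match <a> tags whose href is directly followed by class="linkClass".
                string pattern = @"<a.*?href=[""'](?<url>.*?)[""'] ?(class=[""']linkClass[""']).*?>(?<name>.*?)</a>";
                MatchCollection matches = Regex.Matches(szResult, pattern, RegexOptions.Singleline | RegexOptions.IgnoreCase);

                foreach (Match m in matches)
                    Console.WriteLine(m.Value);
            }
            catch (Exception e)
            {
                // Report the failure instead of silently swallowing it.
                Console.WriteLine(e.Message);
            }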


It will not extract all the anchor tags, only those with the CSS class set to 'linkClass'. Why? Because I wrote the code that way :)
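If you do want every anchor regardless of its class, something along these lines should work. This is only a minimal sketch: the sample HTML string, the example.com URLs and the AnchorDemo class name are made up for illustration, and the pattern is my variation on the one above with the class part dropped. It also prints the named groups "url" and "name" instead of the whole match.

```csharp
using System;
using System.Text.RegularExpressions;

class AnchorDemo
{
    static void Main()
    {
        // Hypothetical sample HTML, used here instead of a downloaded page.
        string html = "<a href=\"http://example.com/a\" class=\"linkClass\">First</a>" +
                      "<a href='http://example.com/b'>Second</a>";

        // Same idea as the pattern above, minus the class="linkClass" part.
        // [^>]* keeps the attribute matching inside a single opening tag.
        string pattern = @"<a\b[^>]*?href=[""'](?<url>.*?)[""'][^>]*>(?<name>.*?)</a>";

        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.Singleline | RegexOptions.IgnoreCase))
        {
            // Named groups give the link text and the target separately.
            Console.WriteLine(m.Groups["name"].Value + " -> " + m.Groups["url"].Value);
        }
    }
}
```

Keep in mind that a regular expression like this is fine for quick jobs on pages you know, but it is not a real HTML parser, so unusual markup (nested tags, unquoted attributes) can still slip through.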
