Create a Web Crawler 101

Created August 9, 2010

If you would like to learn how to create a web crawler, also called a spider or a bot, it is actually a lot simpler than you may think. In this short post I will show just how easy it is to obtain the mark-up from another website, and then you will be able to see how you can parse the data to use for your own evil pleasure. In PHP, getting the markup of another site takes a single function call, file_get_contents, as shown below:

<?php
$webpage = file_get_contents('http://www.tonylea.com');
?>

Now, the variable $webpage contains all the mark-up (source) for http://www.tonylea.com.
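One thing worth knowing is that file_get_contents returns FALSE when the request fails, so it can help to check the result before parsing it. Here is a minimal sketch of that idea, not the only way to do it. The helper name fetch_page, the 10 second timeout, and the user-agent string are all assumptions made up for illustration.

```php
<?php
// Hypothetical helper: fetch a page and return its markup, or NULL on failure.
function fetch_page($url, $timeout = 10)
{
     // The 'http' context options only apply to http:// and https:// URLs.
     $context = stream_context_create(array(
          'http' => array(
               'timeout'    => $timeout,              // seconds; assumed value
               'user_agent' => 'ExampleCrawler/0.1',  // assumed name
          ),
     ));

     // @ suppresses the warning PHP raises on failure; the result is checked instead.
     $page = @file_get_contents($url, false, $context);

     return ($page === false) ? null : $page;
}
```

Calling fetch_page('http://www.tonylea.com') would then give you either the markup or NULL, which is easy to test for before handing the page to a parser.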

Okay, so basically if we want to parse the data we could do something like the following:

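<?php
$url = 'http://www.tonylea.com';
$webpage = file_get_contents($url);

// Return the attribute string of every <img> tag (empty array if none).
function get_images($page)
{
     if (empty($page)) {
          return array();
     }
     // Matches both <img ...> and self-closing <img ... />
     preg_match_all('/<img\s([^>]+?)\s*\/?>/i', $page, $images);
     return $images[1];
}

// Return the attribute string of every <a> tag (empty array if none).
function get_links($page)
{
     if (empty($page)) {
          return array();
     }
     // The 's' modifier lets link text span multiple lines
     preg_match_all('/<a\s([^>]+)>(.*?)<\/a>/is', $page, $links);
     return $links[1];
}

$images = get_images($webpage);
foreach ($images as $image)
{
     // Escape the scraped text so it is displayed rather than rendered
     echo htmlspecialchars($image).'<br />';
}
?>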

In the above example we fetch the mark-up from the specified URL and define two functions that pull out the contents of the 'img' tags and the 'a' tags. The code then prints out the data found in the 'img' tags. With a bit more parsing you can display images and links from the page you have scraped or crawled.
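As one example of that extra parsing, the attribute strings captured from the tags can be searched for a specific attribute such as src or href. This is a rough sketch under a few assumptions: the helper name get_attribute is made up for this post, it only handles quoted attribute values, and regular expressions will miss some real-world HTML.

```php
<?php
// Hypothetical helper: pull one attribute value out of a tag's attribute string.
// Returns the value, or NULL if the attribute is not present.
function get_attribute($attributes, $name)
{
     // Matches name="value" or name='value'; unquoted values are not handled.
     $pattern = '/(?:^|\s)' . preg_quote($name, '/') . '\s*=\s*(["\'])(.*?)\1/is';
     if (preg_match($pattern, $attributes, $match)) {
          return $match[2];
     }
     return null;
}

// Sample attribute strings, shaped like the captures from the 'img' and 'a' tags.
$image_attributes = ' src="/images/logo.png" alt="Logo" ';
$link_attributes  = " href='http://www.tonylea.com/about' class=\"nav\"";

echo get_attribute($image_attributes, 'src') . "\n";
echo get_attribute($link_attributes, 'href') . "\n";
```

From here you could loop over everything the image and link functions return and collect just the src and href values.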

Very cool stuff that you can build upon to do some awesome web crawling :)