Programming Examples

PHP Examples


A Simple Web Crawler

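This example demonstrates a very simple web crawler built on the Chilkat Spider component. Starting from a single seed URL, it crawls up to five pages per domain, then queues outbound links whose base domains have not been visited yet.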

Download Chilkat Spider ActiveX

<?php

//  The Chilkat Spider component/library is free.
$spider = new COM("Chilkat.Spider");

$seenDomains = new COM("Chilkat.CkStringArray");
$seedUrls = new COM("Chilkat.CkStringArray");

$seenDomains->Unique = true;
$seedUrls->Unique = true;

$seedUrls->Append('http://directory.google.com/Top/Recreation/Outdoors/Hiking/Backpacking/');

//  Set our outbound URL exclude patterns
$spider->AddAvoidOutboundLinkPattern('*?id=*');
$spider->AddAvoidOutboundLinkPattern('*.mypages.*');
$spider->AddAvoidOutboundLinkPattern('*.personal.*');
$spider->AddAvoidOutboundLinkPattern('*.comcast.*');
$spider->AddAvoidOutboundLinkPattern('*.aol.*');
$spider->AddAvoidOutboundLinkPattern('*~*');

//  Use a cache so we don't have to re-fetch URLs previously fetched.
$spider->CacheDir = 'c:/spiderCache/';
$spider->FetchFromCache = true;
$spider->UpdateCache = true;

while ($seedUrls->Count > 0) {

    $url = $seedUrls->pop();
    $spider->Initialize($url);

    //  Crawl up to 5 URLs on this domain, but first record its
    //  base domain in seenDomains so later outbound links to it are skipped.
    $domain = $spider->getDomain($url);
    $seenDomains->Append($spider->getBaseDomain($domain));

    for ($i = 0; $i < 5; $i++) {
        $success = $spider->CrawlNext();
        if ($success != true) {
            break;
        }

        //  Display the URL we just crawled.
        print $spider->lastUrl() . "\n";

        //  If the last URL was retrieved from cache,
        //  we won't wait.  Otherwise we'll wait 1 second
        //  before fetching the next URL.
        if ($spider->LastFromCache != true) {
            $spider->SleepMs(1000);
        }

    }

    //  Add the outbound links to seedUrls, except
    //  for the domains we've already seen.
    for ($i = 0; $i < $spider->NumOutboundLinks; $i++) {

        $url = $spider->getOutboundLink($i);
        $domain = $spider->getDomain($url);
        $baseDomain = $spider->getBaseDomain($domain);
        if (!$seenDomains->Contains($baseDomain)) {
            $seedUrls->Append($url);
        }

        //  Don't let our list of seedUrls grow too large.
        if ($seedUrls->Count > 1000) {
            break;
        }

    }

}

?>
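
To get a feel for which outbound links the avoid patterns above would exclude, the following is a minimal plain-PHP sketch that needs no COM objects. It assumes that `*` in a pattern matches any run of characters and that every other character, including `?`, is taken literally; the Spider component's actual matching rules may differ. The helper names `matchesWildcard` and `shouldAvoid` and the example.com URLs are hypothetical and are not part of the Chilkat API.

```php
<?php

//  Hypothetical helper, not part of the Chilkat API.
//  Assumption: '*' matches any run of characters; everything else is literal.
function matchesWildcard(string $pattern, string $url): bool
{
    //  Escape all regex metacharacters, then turn each escaped '*'
    //  back into "match anything".
    $regex = '/^' . str_replace('\*', '.*', preg_quote($pattern, '/')) . '$/';
    return preg_match($regex, $url) === 1;
}

//  Hypothetical helper: true if the URL matches any of the patterns.
function shouldAvoid(string $url, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        if (matchesWildcard($pattern, $url)) {
            return true;
        }
    }
    return false;
}

//  The same patterns passed to AddAvoidOutboundLinkPattern in the example.
$avoidPatterns = [
    '*?id=*',
    '*.mypages.*',
    '*.personal.*',
    '*.comcast.*',
    '*.aol.*',
    '*~*',
];

var_dump(shouldAvoid('http://www.example.com/page?id=42', $avoidPatterns)); // bool(true)
var_dump(shouldAvoid('http://www.example.com/~bob/', $avoidPatterns));      // bool(true)
var_dump(shouldAvoid('http://www.example.com/hiking/', $avoidPatterns));    // bool(false)
```

Under these assumptions the patterns skip query-string pages, personal home directories (the `~` convention), and a handful of consumer hosting domains, which keeps the crawl focused on ordinary site pages.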

Need a specific example? Send a request to support@chilkatsoft.com

© 2000-2008 Chilkat Software, Inc. All Rights Reserved.