Perl Examples

A Simple Web Crawler

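This example demonstrates a very simple web crawler built with the Chilkat Spider component. Starting from a single seed URL, it crawls a few pages per domain and queues outbound links that lead to domains it has not yet visited.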

Download Chilkat Perl Module

use chilkat;

#  The Chilkat Spider component/library is free.
$spider = chilkat::CkSpider->new();

$seenDomains = chilkat::CkStringArray->new();
$seedUrls = chilkat::CkStringArray->new();

$seenDomains->put_Unique(1);
$seedUrls->put_Unique(1);

$seedUrls->Append("http://directory.google.com/Top/Recreation/Outdoors/Hiking/Backpacking/");

#  Set our outbound URL exclude patterns
$spider->AddAvoidOutboundLinkPattern("*?id=*");
$spider->AddAvoidOutboundLinkPattern("*.mypages.*");
$spider->AddAvoidOutboundLinkPattern("*.personal.*");
$spider->AddAvoidOutboundLinkPattern("*.comcast.*");
$spider->AddAvoidOutboundLinkPattern("*.aol.*");
$spider->AddAvoidOutboundLinkPattern("*~*");

#  Use a cache so we don't have to re-fetch URLs previously fetched.
$spider->put_CacheDir("c:/spiderCache/");
$spider->put_FetchFromCache(1);
$spider->put_UpdateCache(1);

while ($seedUrls->get_Count() > 0) {

    $url = $seedUrls->pop();
    $spider->Initialize($url);

    #  Spider up to 5 URLs of this domain,
    #  but first save the base domain in seenDomains.
    $domain = $spider->getDomain($url);
    $seenDomains->Append($spider->getBaseDomain($domain));

    for ($i = 0; $i < 5; $i++) {
        $success = $spider->CrawlNext();
        if ($success != 1) {
            last;
        }

        #  Display the URL we just crawled.
        print $spider->lastUrl() . "\n";

        #  If the last URL was retrieved from cache,
        #  we won't wait.  Otherwise we'll wait 1 second
        #  before fetching the next URL.
        if ($spider->get_LastFromCache() != 1) {
            $spider->SleepMs(1000);
        }

    }

    #  Add the outbound links to seedUrls, except
    #  for the domains we've already seen.
    for ($i = 0; $i < $spider->get_NumOutboundLinks(); $i++) {

        $url = $spider->getOutboundLink($i);
        $domain = $spider->getDomain($url);
        $baseDomain = $spider->getBaseDomain($domain);
        if (!$seenDomains->Contains($baseDomain)) {
            $seedUrls->Append($url);
        }

        #  Don't let our list of seedUrls grow too large.
        if ($seedUrls->get_Count() > 1000) {
            last;
        }

    }

}
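
The avoid patterns passed to AddAvoidOutboundLinkPattern above use "*" as a wildcard. As a rough illustration of how such patterns could be evaluated, here is a minimal pure-Perl sketch. It assumes that "*" matches any run of characters and that the pattern must match the whole URL; the Chilkat documentation is the authority on the real matching rules. The helpers wildcard_to_regex and is_avoided are hypothetical and are not part of the Chilkat API.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Chilkat): convert a "*" wildcard
# pattern into a regex anchored at both ends of the string.
sub wildcard_to_regex {
    my ($pattern) = @_;
    # Split on "*" (keeping empty leading/trailing pieces), escape regex
    # metacharacters in each literal piece, then rejoin with ".*".
    my $body = join '.*', map { quotemeta($_) } split /\*/, $pattern, -1;
    return qr/\A$body\z/;
}

# Hypothetical helper: returns 1 if the URL matches any avoid pattern.
sub is_avoided {
    my ($url, @patterns) = @_;
    for my $pattern (@patterns) {
        my $re = wildcard_to_regex($pattern);
        return 1 if $url =~ $re;
    }
    return 0;
}

# The same patterns as in the example above.
my @avoid = ('*?id=*', '*.mypages.*', '*.personal.*',
             '*.comcast.*', '*.aol.*', '*~*');

# Illustrative URLs only; they are not taken from the original example.
for my $url ('http://www.example.com/item?id=42',
             'http://www.example.com/~bob/',
             'http://www.example.com/hiking/') {
    printf "%-40s %s\n", $url, is_avoided($url, @avoid) ? 'avoid' : 'keep';
}
```

Under these assumptions, "*~*" rejects any URL containing a tilde (commonly a personal home page), and "*?id=*" rejects URLs carrying an id query parameter.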

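The seed-queue logic of the crawler above can also be sketched without the Chilkat module or any network access. The following is a simplified model, not the Chilkat implementation: the %links hash is a hypothetical stand-in for the web, host_of is a hypothetical helper that uses the host name in place of getBaseDomain, and each seed contributes one page rather than up to five.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the web: each URL maps to its outbound links.
my %links = (
    'http://a.example/'  => ['http://b.example/x', 'http://a.example/y', 'http://c.example/'],
    'http://b.example/x' => ['http://a.example/', 'http://c.example/z'],
    'http://c.example/'  => [],
    'http://c.example/z' => [],
);

# Hypothetical helper: extract the lower-cased host name from a URL.
# (A simplification; the example above uses getBaseDomain for this.)
sub host_of {
    my ($url) = @_;
    return $url =~ m{\Ahttps?://([^/:]+)}i ? lc($1) : '';
}

# Visit seeds last-in first-out, as pop() does in the example, and only
# queue outbound links whose host has not been seen yet.
sub crawl_order {
    my ($links, @seeds) = @_;
    my (%seen, @visited);
    my %queued = map { $_ => 1 } @seeds;    # mimics put_Unique(1)

    while (@seeds) {
        my $url = pop @seeds;
        delete $queued{$url};
        $seen{ host_of($url) } = 1;         # remember this host
        push @visited, $url;

        for my $out (@{ $links->{$url} || [] }) {
            next if $seen{ host_of($out) }; # host already crawled
            next if $queued{$out};          # already waiting in the queue
            push @seeds, $out;
            $queued{$out} = 1;
        }
    }
    return @visited;
}

print "$_\n" for crawl_order(\%links, 'http://a.example/');
```

Because a host is marked as seen as soon as one of its URLs is popped, no URL on that host is queued again, so the loop terminates on any finite link graph.
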
Need a specific example? Send a request to support@chilkatsoft.com

© 2000-2007 Chilkat Software, Inc. All Rights Reserved.