Perl Examples

A Simple Web Crawler

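This example demonstrates a very simple web crawler built on the Chilkat Spider component. It starts from a single seed URL, crawls up to five pages per domain (pausing one second between fetches that were not served from the cache), and then queues outbound links whose base domain has not been seen yet.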

Chilkat Perl Module Downloads for Windows, Linux, and Mac OS X

use chilkat();

#  The Chilkat Spider component/library is free.
$spider = chilkat::CkSpider->new();

$seenDomains = chilkat::CkStringArray->new();
$seedUrls = chilkat::CkStringArray->new();

$seenDomains->put_Unique(1);
$seedUrls->put_Unique(1);

$seedUrls->Append("http://directory.google.com/Top/Recreation/Outdoors/Hiking/Backpacking/");

#  Set our outbound URL exclude patterns
$spider->AddAvoidOutboundLinkPattern("*?id=*");
$spider->AddAvoidOutboundLinkPattern("*.mypages.*");
$spider->AddAvoidOutboundLinkPattern("*.personal.*");
$spider->AddAvoidOutboundLinkPattern("*.comcast.*");
$spider->AddAvoidOutboundLinkPattern("*.aol.*");
$spider->AddAvoidOutboundLinkPattern("*~*");

#  Use a cache so we don't have to re-fetch URLs previously fetched.
$spider->put_CacheDir("c:/spiderCache/");
$spider->put_FetchFromCache(1);
$spider->put_UpdateCache(1);

while ($seedUrls->get_Count() > 0) {

    $url = $seedUrls->pop();
    $spider->Initialize($url);

    #  Crawl up to 5 URLs in this domain, but first
    #  record its base domain in seenDomains.
    $domain = $spider->getUrlDomain($url);
    $seenDomains->Append($spider->getBaseDomain($domain));

    for ($i = 0; $i < 5; $i++) {
        $success = $spider->CrawlNext();
        if ($success != 1) {
            last;
        }

        #  Display the URL we just crawled.
        print $spider->lastUrl() . "\n";

        #  If the last URL was retrieved from cache,
        #  we won't wait.  Otherwise we'll wait 1 second
        #  before fetching the next URL.
        if ($spider->get_LastFromCache() != 1) {
            $spider->SleepMs(1000);
        }

    }

    #  Add the outbound links to seedUrls, except
    #  for the domains we've already seen.
    for ($i = 0; $i < $spider->get_NumOutboundLinks(); $i++) {

        $url = $spider->getOutboundLink($i);
        $domain = $spider->getUrlDomain($url);
        $baseDomain = $spider->getBaseDomain($domain);
        if (!$seenDomains->Contains($baseDomain)) {
            $seedUrls->Append($url);
        }

        #  Don't let our list of seedUrls grow too large.
        if ($seedUrls->get_Count() > 1000) {
            last;
        }

    }

}
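
The bookkeeping in the listing above (a stack of seed URLs plus a set of base domains already visited) can be tried without the Chilkat module or a network connection. The following is a minimal sketch in plain Perl, not the Spider component's implementation. The link graph, the example.* host names, and the helpers base_domain and crawl_domains are hypothetical, and base_domain is a deliberately naive stand-in for getBaseDomain.

```perl
use strict;
use warnings;

# Naive stand-in for getBaseDomain (an assumption for illustration): keep the
# last two labels of the host name. This is wrong for suffixes like "co.uk".
sub base_domain {
    my ($url) = @_;
    my ($host) = $url =~ m{^https?://([^/:]+)}i;
    return '' unless defined $host;
    my @labels = split /\./, lc $host;
    return join '.', @labels > 2 ? @labels[-2, -1] : @labels;
}

# Walk a link graph the way the listing walks the web: pop a seed URL,
# record its base domain, then queue outbound links to unseen base domains.
sub crawl_domains {
    my ($seed, $links, $max_queue) = @_;
    my %seen;               # plays the role of seenDomains
    my @queue = ($seed);    # plays the role of seedUrls
    my @visited;

    while (@queue) {
        my $url = pop @queue;
        $seen{ base_domain($url) } = 1;
        push @visited, $url;

        for my $link (@{ $links->{$url} || [] }) {
            next if $seen{ base_domain($link) };
            # Skip duplicates, mirroring put_Unique(1) on seedUrls.
            push @queue, $link unless grep { $_ eq $link } @queue;
            # Do not let the queue grow too large.
            last if @queue > $max_queue;
        }
    }
    return @visited;
}

# Hypothetical link graph: each URL maps to the outbound links found on it.
my %links = (
    'http://hiking.example.com/' => [
        'http://gear.example.org/',
        'http://blog.hiking.example.com/',
    ],
    'http://gear.example.org/' => [
        'http://hiking.example.com/',
        'http://maps.example.net/',
    ],
    'http://maps.example.net/' => [],
);

print "$_\n" for crawl_domains('http://hiking.example.com/', \%links, 1000);
```

As in the listing, a link is filtered only at the moment it is queued, so a URL queued before its domain was visited can still be popped later. A stricter crawler would check the seen set again when popping.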

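The patterns passed to AddAvoidOutboundLinkPattern in the listing above use "*" as a wildcard. As a rough illustration of how such patterns filter links, the sketch below translates each pattern into a regular expression in plain Perl. It assumes that "*" matches any run of characters, that every other character (including "?") is literal, and that matching is case-sensitive. The helpers pattern_to_regex and is_avoided are hypothetical names, and this is not how the Spider component itself is implemented.

```perl
use strict;
use warnings;

# Convert a "*" wildcard pattern into an anchored regex.
# Assumption: "*" matches zero or more characters; all else is literal.
sub pattern_to_regex {
    my ($pattern) = @_;
    # Split on "*" (keeping empty leading/trailing pieces), escape each
    # literal piece, and rejoin the pieces with ".*".
    my $body = join '.*', map { quotemeta($_) } split /\*/, $pattern, -1;
    return qr/\A$body\z/s;
}

# Return 1 if the URL matches any of the avoid patterns, otherwise 0.
sub is_avoided {
    my ($url, @patterns) = @_;
    for my $pattern (@patterns) {
        return 1 if $url =~ pattern_to_regex($pattern);
    }
    return 0;
}

# The same patterns the crawler listing registers.
my @avoid = ('*?id=*', '*.mypages.*', '*.personal.*',
             '*.comcast.*', '*.aol.*', '*~*');

# Hypothetical URLs used only to exercise the patterns.
for my $url ('http://example.com/page?id=7',
             'http://example.com/~bob/',
             'http://example.com/trails/') {
    printf "%-32s %s\n", $url, is_avoided($url, @avoid) ? 'avoid' : 'keep';
}
```

Under these assumptions, "*~*" rejects any URL containing a tilde, which is a common marker for personal home pages, while "*?id=*" rejects URLs carrying an id query parameter.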
 

© 2000-2010 Chilkat Software, Inc. All Rights Reserved.