Tcl
Efficiently Process a Huge XML File
Demonstrates a technique for processing a huge XML file of any size, even many gigabytes. Note: This example requires Chilkat v9.5.0.80 or greater.
load ./chilkat.dll
set success 0
# This example shows a way to efficiently process a gigantic XML file -- one that may be too large
# to fit in memory.
#
# Two types of XML parsers exist: DOM parsers and SAX parsers.
# A DOM parser is a Document Object Model parser, where the entire XML is loaded into memory
# and the application has the luxury of interacting with the XML in a convenient, random-access
# way. The Chilkat Xml class is a DOM parser. Because the entire XML is loaded into memory,
# huge XML files (on the order of gigabytes) usually cannot be loaded because of memory constraints.
# A SAX parser instead parses the XML file as an input stream, and no DOM is built.
# Using a SAX parser is generally less convenient than using a DOM parser, because the
# application must keep track of its own state as parse events arrive.
#
# The technique described here is a hybrid. It streams the XML file as unstructured text
# to extract fragments that are individually treated as separate XML documents loaded into
# the Chilkat Xml parser.
#
# For example, imagine your XML file is several GBs in size, but has a relatively simple structure, such as:
#
# <Transactions>
# <Transaction id="1">
# ...
# </Transaction>
# <Transaction id="2">
# ...
# </Transaction>
# <Transaction id="3">
# ...
# </Transaction>
# ...
# </Transactions>
#
# In the following code, each <Transaction ...> ... </Transaction>
# is extracted and loaded separately into an Xml object, where it can be manipulated
# independently. The XML file is never loaded into memory in its entirety.
set fac [new_CkFileAccess]
set success [CkFileAccess_OpenForRead $fac "qa_data/xml/transactions.xml"]
if {$success == 0} then {
    puts [CkFileAccess_lastErrorText $fac]
    delete_CkFileAccess $fac
    exit
}
set xml [new_CkXml]
set sb [new_CkStringBuilder]
set firstIteration 1
set retval 1
set numTransactions 0
# The begin marker is "XML tag aware". If the begin marker begins with "<"
# and ends with ">", then it is assumed to be an XML tag, and it also matches
# text where a whitespace char appears in place of the ">". This is why
# "<Transaction>" matches tags with attributes, such as <Transaction id="1">.
set beginMarker "<Transaction>"
set endMarker "</Transaction>"
while {$retval == 1} {
    CkStringBuilder_Clear $sb
    # The retval can have the following values:
    # 0: No more fragments exist.
    # 1: Captured the next fragment. The text from beginMarker to endMarker, including the markers, is returned in sb.
    # -1: Error.
    set retval [CkFileAccess_ReadNextFragment $fac $firstIteration $beginMarker $endMarker "utf-8" $sb]
    set firstIteration 0
    if {$retval == 1} then {
        incr numTransactions
        set success [CkXml_LoadSb $xml $sb 1]
        if {$success == 0} then {
            # The fragment could not be parsed as XML. Report it and skip to the next one.
            puts [CkXml_lastErrorText $xml]
            continue
        }
        # Your application may now do what it needs with this particular XML fragment...
    }
}
if {$retval < 0} then {
    puts [CkFileAccess_lastErrorText $fac]
}
puts "numTransactions: $numTransactions"
delete_CkFileAccess $fac
delete_CkXml $xml
delete_CkStringBuilder $sb
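
# For illustration only: a rough pure-Tcl sketch of the streaming idea used above
# (scan the file as plain text and cut out one fragment at a time). This is NOT how
# Chilkat implements ReadNextFragment. The proc names nextFragment and countFragments,
# their arguments, and the 64 KB chunk size are hypothetical.
# Assumptions: fragments do not nest, the tag name contains no regular-expression
# metacharacters, and the file is utf-8.
proc nextFragment {chan tag bufVar} {
    upvar 1 $bufVar buf
    # Match "<tag" followed by ">" or whitespace, so that <Transaction id="1"> matches
    # but the root element <Transactions> does not.
    set beginRe "<${tag}\[\\s>\]"
    set endMarker "</${tag}>"
    while {1} {
        if {[regexp -indices -- $beginRe $buf m]} {
            set b [lindex $m 0]
            set e [string first $endMarker $buf $b]
            if {$e >= 0} {
                set last [expr {$e + [string length $endMarker] - 1}]
                set fragment [string range $buf $b $last]
                # Keep only the unread remainder so that memory use stays bounded.
                set buf [string range $buf [expr {$last + 1}] end]
                return $fragment
            }
            # Begin tag found but no end tag yet: drop the text before the begin tag.
            set buf [string range $buf $b end]
        } else {
            # No begin tag yet: keep a short tail in case a tag straddles two chunks.
            set buf [string range $buf end-[string length $tag] end]
        }
        # An empty string signals that no more fragments exist.
        if {[eof $chan]} { return "" }
        append buf [read $chan 65536]
    }
}

# Counts the fragments in a file. Each $fragment is a complete
# <tag ...> ... </tag> string that could be handed to an XML parser.
proc countFragments {path tag} {
    set chan [open $path r]
    fconfigure $chan -encoding utf-8
    set buf ""
    set count 0
    while {[set fragment [nextFragment $chan $tag buf]] ne ""} {
        incr count
    }
    close $chan
    return $count
}

# Possible usage, with the same file as above:
#   puts "numTransactions: [countFragments "qa_data/xml/transactions.xml" Transaction]"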