Are you familiar with that awkward moment when you realize that fn:doc() is no match for the 24 GB XML file you would like to process? Here’s a quick tip that will make you feel like an XML Rock Star.
xml:parse() is the Swiss army knife of XML parsing. This function brings the parsing utilities of top-notch XML libraries into XQuery userland: from DTD or XML Schema validation to XML entity and XInclude substitution. It saves you from unnecessary round-tripping between XQuery and other programming languages.
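To get a feel for it, here is a minimal sketch of the basic call. The toy input string is made up, and we assume the optional second argument can be left as the empty sequence when no options are needed:
import module namespace p = "http://www.zorba-xquery.com/modules/xml";
(: parse a small string and navigate the resulting node as usual :)
p:parse("<book><title>XQuery</title></book>", ())//title/text()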
Moreover, this function is especially convenient for parsing large documents or streams of data. For instance, in this live example, even though we fetch a 700 MB XML document via HTTP, Zorba consumes only the resources required to compute the result. So how does this work?
Let’s dive in.
We would like to parse a dump of the English Wikibooks and create a timeline of all pages related to XQuery.
As a first step, we read a Wikibooks dataset (available at http://dumps.wikimedia.org/enwikibooks) from a file. The dataset used in the example is a database dump containing the metadata of all pages, which lets us retrieve the creation date of each page (e.g. enwikibooks-20120520-pages-meta-history.xml). We then use xml:parse() to parse the string into XML fragments (i.e. the page elements in the dataset) and iterate over each such fragment.
import module namespace file = "http://expath.org/ns/file";
import module namespace p = "http://www.zorba-xquery.com/modules/xml";
import schema namespace opt = "http://www.zorba-xquery.com/modules/xml-options";
let $raw-data as xs:string := file:read-text("enwikibooks-20120520-pages-meta-history.xml")
let $pages := p:parse($raw-data, <opt:options>
  <opt:parse-external-parsed-entity opt:skip-root-nodes="1"/>
</opt:options>)
for $page in $pages
return
$page
In the code snippet above, it’s important to note that the execution happens in a streaming manner from top to bottom. First, the string returned by file:read-text() is streamed and never entirely materialized in memory. That is, the 24GB of raw data will never be loaded into memory at once. Second, we ask the XML parser to only return the children of the document’s root element as a sequence of elements. This is done using the "<opt:parse-external-parsed-entity opt:skip-root-nodes="1"/>" option. It makes sure that the entire document is never materialized in memory and the resulting sequence of elements can be processed in a streaming fashion.
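The effect of this option is easy to see on a toy document. In the following sketch (the input is made up for illustration), opt:skip-root-nodes="1" skips the enclosing root element and hands back its children as a sequence of independent fragments:
import module namespace p = "http://www.zorba-xquery.com/modules/xml";
import schema namespace opt = "http://www.zorba-xquery.com/modules/xml-options";
let $fragments := p:parse(
  "<pages><page id='1'/><page id='2'/><page id='3'/></pages>",
  <opt:options>
    <opt:parse-external-parsed-entity opt:skip-root-nodes="1"/>
  </opt:options>)
return count($fragments)  (: 3, one item per page element :)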
Once this is done, generating the timeline is a piece of cake. We select only the pages whose title starts with XQuery. For each XQuery page, we compute the creation date of the first revision of that page. The full query is available below.
import module namespace file = "http://expath.org/ns/file";
import module namespace p = "http://www.zorba-xquery.com/modules/xml";
import schema namespace opt = "http://www.zorba-xquery.com/modules/xml-options";
declare namespace w = "http://www.mediawiki.org/xml/export-0.6/";
let $raw-data as xs:string := file:read-text("enwikibooks-20120520-pages-meta-history.xml")
let $pages := p:parse($raw-data, <opt:options>
  <opt:parse-external-parsed-entity opt:skip-root-nodes="1"/>
</opt:options>)
for $page in $pages
where starts-with($page/w:title, "XQuery")
let $title := $page/w:title/text()
let $creation-date := min($page/w:revision/w:timestamp/xs:dateTime(.))
return
< ="{$title}" ="{$creation-date}" />
Pretty simple, no?
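The same recipe also works for data fetched over HTTP, as in the 700 MB live example mentioned earlier. Below is a hedged sketch using the EXPath http-client module; the URL is a placeholder, and whether the response body is actually streamed end-to-end depends on the processor:
import module namespace http = "http://expath.org/ns/http-client";
import module namespace p = "http://www.zorba-xquery.com/modules/xml";
import schema namespace opt = "http://www.zorba-xquery.com/modules/xml-options";
(: ask for the body as plain text so it arrives as an xs:string;
   the href below is a placeholder, not a real dump location :)
let $response := http:send-request(
  <http:request method="GET"
                 href="http://example.org/enwikibooks-pages-meta-history.xml"
                 override-media-type="text/plain"/>)
(: the first item is the http:response element, the second one is the body :)
let $raw-data as xs:string := $response[2]
let $pages := p:parse($raw-data, <opt:options>
  <opt:parse-external-parsed-entity opt:skip-root-nodes="1"/>
</opt:options>)
return count($pages)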
xml:parse() is to XML what XQXQ is to XQuery. We are looking to provide a rich programming experience that lets developers stay in XQuery userland and avoid unnecessary round trips between XQuery and other technologies.
Traditional XML streaming APIs such as XMLReader or SAX provide very good performance, but at the cost of productivity. From this perspective, xml:parse() is quite an interesting function because it does not give up on either: it delivers steady performance while staying 100% within the XQuery processing model.
We hope that you will take xml:parse() for a ride and send us feedback. And as always, may the FLWOR be with you! ;-)