Zorba - The NoSQL Query Processor

Note: this is an edited version of a previous tutorial we updated to run with Zorba 2.0.

The Atom Syndication Format (RFC 4287) is one of the most popular XML formats for aggregating XML data. Such an aggregate is called an Atom feed. Atom is heavily used in industry for web services (e.g. Google), and countless people use it to subscribe to their favourite blogs in their feed readers. The aim of this tutorial is to show how easily an Atom feed can be processed with XQuery.

We will download the Zorba blog in Atom format, create a Tiny URL for each blog post, extract some keywords and output the result as HTML.

Let’s start with two helper functions that we’ll use afterwards: one to create a Tiny URL, the other to get frequent keywords out of a longer string (such as an Atom blog entry).

Create the Tiny URL

Creating a Tiny URL is simple. It can be done by sending an HTTP GET request to http://tinyurl.com/api-create.php with the URL to shorten as a parameter named “url”.

We write a simple function that uses the EXPath http-client module. The result of an HTTP GET using that module is a sequence of two items: the first is the response header, the second is the body. We’re only interested in the response body, which will contain the shortened URL.

import module namespace http = "http://expath.org/ns/http-client";
declare namespace an = "http://www.zorba-xquery.com/annotations";

declare %an:sequential function local:create-tinyurl($url as xs:string) as xs:string
{
  let $url-encoded := encode-for-uri($url)
  let $result := http:send-request(
      <http:request href="http://tinyurl.com/api-create.php?url={$url-encoded}"
                    method="get" />
  )
  return $result[2]
};

local:create-tinyurl("http://www.zorba-xquery.com/")

You can run this code snippet in the Zorba live sandbox or with your local Zorba installation. You should get the following result:

<?xml version="1.0" encoding="UTF-8"?>
http://tinyurl.com/yc76gyh
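The encoding step inside local:create-tinyurl can be checked on its own, without any network access. The following is a minimal sketch that uses only the standard encode-for-uri and concat functions; the variable names are chosen for illustration and are not part of the original script.

```xquery
(: Build the request URL that local:create-tinyurl sends,
   without actually performing the HTTP request. :)
let $url         := "http://www.zorba-xquery.com/"
(: encode-for-uri escapes reserved characters such as ":" and "/" :)
let $url-encoded := encode-for-uri($url)
return concat("http://tinyurl.com/api-create.php?url=", $url-encoded)
```

Because ":" and "/" are escaped as %3A and %2F, the URL to shorten can be passed safely as the value of the url parameter.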

Generating Keywords

Now let’s write a function which finds the most frequent words in a string while discarding common stop words such as “or”, “and”, or “not”. For this, we use Zorba’s full-text module.

First, we tokenize the string into a sequence of words. Then we convert each word to lower case and discard the stop words. Finally, we group all occurrences of the same word together, count how many times the word appears in the string, and order by that count.

import module namespace ft = "http://www.zorba-xquery.com/modules/full-text";

declare function local:words-by-frequency($content)
{   
    for $token in ft:tokenize-string($content, xs:language("en"))
    let $word := lower-case($token)
    where not(ft:is-stop-word($word))
    group by $word
    order by count($token) descending
    return <li>{ $word }</li>
};

local:words-by-frequency("Processing text with Zorba is easy when using the full-text module.")

The result is a sequence of words (each wrapped in an HTML li tag), with the most frequent words at the beginning. Have a look at the result on the live sandbox:

<?xml version="1.0" encoding="UTF-8"?>
<li>text</li>
<li>zorba</li>
<li>easy</li>
<li>processing</li>
<li>module</li>
<li>using</li>

Processing the Atom feed

Now that we have our two helper functions, let’s finish the job.

In the same way we made an HTTP GET request to tinyurl.com, we can make one to get the Zorba blog in Atom format.

import module namespace http = "http://expath.org/ns/http-client";

http:send-request(
  <http:request href="http://www.zorba-xquery.com/blog/feed"  method="get" />
)[2]
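Since the first item of the result describes the response, it can be used to check that the download succeeded before the body is processed. The sketch below assumes the http:response element carries a status attribute, as described in the EXPath HTTP Client specification; the error name local:http-error is made up for this example.

```xquery
import module namespace http = "http://expath.org/ns/http-client";

let $result := http:send-request(
  <http:request href="http://www.zorba-xquery.com/blog/feed" method="get" />
)
(: $result[1] is the response header element, $result[2] the body :)
let $status := xs:integer($result[1]/@status)
return
  if ($status eq 200)
  then $result[2]
  else error(xs:QName("local:http-error"),
             concat("Unexpected HTTP status: ", $status))
```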

One entry in the Atom feed might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:base="http://www.28msec.com">
[...]
<entry>
<title type="text">Zorba "Eros" 2.6: The JSONiq Release</title>
<link rel="alternate" type="text/html" href="http://www.zorba-xquery.com/html/entry/2012/08/20/Eros"/>
<author>
 <name>Zorba Team</name>
 <email>zorba-dev@googlegroups.com</email>
</author>
<updated>2012-08-20T07:47:04.74+02:00</updated>
<content type="xhtml">
 <div xmlns="http://www.w3.org/1999/xhtml">
 <p>
 The Zorba development team is pleased to announce the release of Zorba 2.6, codename Eros.
 The release is available in our <a href="/html/download">download section</a>.
[...]
 </p>
 </div>
</content>
</entry>
[...]
</feed>

Have you noticed the xmlns="http://www.w3.org/2005/Atom" above? It means that the feed element, and all of its descendants that do not declare another namespace, are in the Atom XML namespace, which we bind to the prefix atom. That’s why we have to write $feed//atom:entry to get all the entries. We could also have used $feed//*:entry, but $feed//entry would not have worked, because by default an unprefixed name only matches elements in no namespace. Restricting the result with $feed//atom:entry[position() le 3] gives us a sequence containing the first three entries in the blog.
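The difference between the three path expressions can be tried out on a small inline feed. This is a minimal sketch with made-up entries; the counts element is only there to show the results side by side.

```xquery
declare namespace atom = "http://www.w3.org/2005/Atom";

(: A tiny stand-in for the downloaded feed :)
let $feed :=
  <feed xmlns="http://www.w3.org/2005/Atom">
    <entry><title>First</title></entry>
    <entry><title>Second</title></entry>
  </feed>
return
  <counts
    qualified="{ count($feed//atom:entry) }"
    wildcard="{ count($feed//*:entry) }"
    unqualified="{ count($feed//entry) }" />
```

The qualified and wildcard paths both find the two entries, while the unqualified path finds none, because it looks for entry elements in no namespace.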

For each blog entry, we get its title and its URL…

declare namespace atom = "http://www.w3.org/2005/Atom";

for $entry in $feed//atom:entry[position() le 3]
let $title := $entry/atom:title/text()
let $url   := $entry/atom:link[@rel = "alternate"]/@href/data()

…and when we invoke our previously written helper functions (local:create-tinyurl and local:words-by-frequency), and construct the XHTML to return, we get the finished script:

import module namespace http = "http://expath.org/ns/http-client";
import module namespace ft = "http://www.zorba-xquery.com/modules/full-text";

declare namespace atom = "http://www.w3.org/2005/Atom";
declare namespace an = "http://www.zorba-xquery.com/annotations";

declare %an:sequential function local:create-tinyurl($url as xs:string) as xs:string
{
  let $url-encoded := encode-for-uri($url)
  let $result := http:send-request(
    <http:request href="http://tinyurl.com/api-create.php?url={$url-encoded}" 
                  method="get" />
  )
  return $result[2]
};

declare function local:words-by-frequency($content)
{   
    for $token in ft:tokenize-string($content, xs:language("en"))
    let $word := lower-case($token)
    where not(ft:is-stop-word($word))
    group by $word
    order by count($token) descending
    return <li>{ $word }</li>
};

<html>
<body>
{ 
    (: (1) Download Atom feed :)
    let $feed := http:send-request(
      <http:request href="http://www.zorba-xquery.com/blog/feed"
                    method="get" />
    )[2]

    (: For the first three entries in the feed... :)
    for $entry in $feed//atom:entry[position() le 3]
    let $title := $entry/atom:title/text()
    let $url   := $entry/atom:link[@rel = "alternate"]/@href/data()

    (: (2) Create the tiny URL :)
    let $tinyurl := local:create-tinyurl($url)

    (: (3) Get 10 most frequent words in feed entry:)
    let $flattened-content := string-join($entry/atom:content//text(), " ")
    let $top-ten-words := local:words-by-frequency($flattened-content)[position() le 10]

    (: (4) Return it as HTML :)
    return
      <div>
        <h2><a href="{ $tinyurl }">{ $title }</a></h2>
        <ul>{ $top-ten-words }</ul>
      </div>
}
</body>
</html>

You can run the complete query online on our live sandbox here or download the source and run it locally.

Et voilà! With just a few lines of code, we were able to process an Atom feed, interact with the Tiny URL web service and do some text analysis. In such use cases, Zorba makes your life very easy. Check out the rest of Zorba’s modules, which are packed with functionality.
