Tokenization
By default, Zorba uses the ICU library for tokenization. For Roman alphabets, Zorba (ICU) considers only alpha-numeric sequences of characters to be part of a token; whitespace and punctuation characters are not part of tokens and instead separate them. However, alpha-numeric sequences matching the regular expression [0-9][.,][0-9] are retained as part of a token, e.g., "98.6" and "1,432.58" are each a single token.
Alternatively, you can implement your own tokenizer by deriving from the Tokenizer class.
The Tokenizer Class
The Tokenizer class is:
class Tokenizer {
public:
  typedef /* implementation-defined */ ptr;
  typedef /* implementation-defined */ size_type;

  struct State {
    typedef Tokenizer::size_type value_type;
    value_type token;
    value_type sent;
    value_type para;
    State();
  };

  class Callback {
  public:
    typedef Tokenizer::size_type size_type;
    virtual ~Callback();
    virtual void token( char const *utf8_s, size_type utf8_len,
                        locale::iso639_1::type lang,
                        size_type token_no, size_type sent_no, size_type para_no,
                        Item const *item = 0 ) = 0;
  };

  struct Properties {
    typedef std::vector<locale::iso639_1::type> languages_type;
    bool comments_separate_tokens;
    bool elements_separate_tokens;
    bool processing_instructions_separate_tokens;
    languages_type languages;
    char const *uri;
  };

  virtual void properties( Properties *result ) const = 0;
  virtual void destroy() const = 0;

  State& state();
  State const& state() const;

  void tokenize_node( Item const &node, locale::iso639_1::type lang,
                      Callback &callback );

  virtual void tokenize_string( char const *utf8_s, size_type utf8_len,
                                locale::iso639_1::type lang, bool wildcards,
                                Callback &callback, Item const *item = 0 ) = 0;

protected:
  Tokenizer( State& );
  virtual ~Tokenizer();

  bool find_lang_attribute( Item const&, locale::iso639_1::type *lang );
  virtual void item( Item const&, bool entering );
  virtual void tokenize_node_impl( Item const&, locale::iso639_1::type,
                                   Callback&, bool tokenize_acp );
};
For details about the ptr type, the destroy() function, and why the destructor is protected, see the Memory Management document.
The State struct is created by Zorba and passed to your constructor. It simply keeps track of the current token, sentence, and paragraph numbers.
To implement a Tokenizer, you need to implement the tokenize_string() function where:
utf8_s | A pointer to the UTF-8 byte sequence comprising the string to be tokenized. |
utf8_len | The number of bytes in the string to be tokenized. |
lang | The language of the string. |
wildcards | If true, allows XQuery wildcard syntax characters to be part of tokens. |
callback | The Callback to call once per token. |
item | The Item whence this token came. If the token occurred within an element, the Item is the text node. If the token occurred within an attribute, the Item is the attribute node. |
A complete implementation of tokenize_string() is non-trivial and therefore an example is beyond the scope of this API documentation. However, the things a tokenizer should take into consideration include:
- Detecting sentence termination ('.', '?', and '!' characters).
- Handling floating-point numbers with possible thousands separators in US and European formats, e.g. "98.7", "98,7", "10,000", etc.
- Distinguishing '.' used as a sentence terminator from '.' used as a decimal point.
- Handling apostrophes, e.g., "men's".
- Handling acronyms, e.g., "AT&T".
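While a production-quality tokenize_string() is beyond the scope of this documentation, the following deliberately simplified sketch illustrates the mechanics of reporting tokens through the Callback and the State counters. MyTokenizer is a hypothetical class name; the sketch treats every maximal run of ASCII alphanumeric characters as a token and ignores sentence detection, number formats, apostrophes, acronyms, non-ASCII characters, and the wildcards flag:
class MyTokenizer : public Tokenizer {
public:
  MyTokenizer( State &s ) : Tokenizer( s ) { }
  void destroy() const;                       // see the Memory Management document
  void properties( Properties *result ) const;
  void tokenize_string( char const *utf8_s, size_type utf8_len,
                        locale::iso639_1::type lang, bool wildcards,
                        Callback &callback, Item const *item = 0 );
protected:
  ~MyTokenizer() { }
};

// Assumes <cctype> has been included for std::isalnum().
void MyTokenizer::tokenize_string( char const *utf8_s, size_type utf8_len,
                                   locale::iso639_1::type lang, bool wildcards,
                                   Callback &callback, Item const *item ) {
  size_type start = 0;
  bool in_token = false;
  for ( size_type i = 0; i <= utf8_len; ++i ) {
    bool const is_word_char = i < utf8_len &&
      std::isalnum( static_cast<unsigned char>( utf8_s[i] ) );
    if ( is_word_char && !in_token ) {
      start = i;                              // a token starts here
      in_token = true;
    } else if ( !is_word_char && in_token ) {
      // Report the token that just ended and advance the token counter.
      callback.token( utf8_s + start, i - start, lang,
                      state().token++, state().sent, state().para, item );
      in_token = false;
    }
  }
}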
The task of iterating over an XML element's child nodes is done by tokenize_node_impl(). Its default implementation treats XML elements, comments, and processing instructions as token separators. (See Properties.) If you want to change that, you need to override tokenize_node_impl().
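As a hedged illustration only (assuming the hypothetical MyTokenizer declares this override), the following sketch keeps the default traversal by forwarding to the base class and marks where custom handling of elements, comments, and processing instructions would go:
// Hypothetical override; currently it only forwards to the default implementation.
void MyTokenizer::tokenize_node_impl( Item const &node, locale::iso639_1::type lang,
                                      Callback &callback, bool tokenize_acp ) {
  // Custom handling (e.g., skipping certain elements) would go here.
  Tokenizer::tokenize_node_impl( node, lang, callback, tokenize_acp );
}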
Paragraphs
By default, Zorba increments the current paragraph number once for each XML element encountered. However, this doesn't work well for mixed content. For example, in the XHTML:
<p>The <em>best</em> thing ever!</p>
all the tokens are in the same sentence and the same paragraph, but by default Zorba will consider that to be 3 paragraphs.
Your tokenizer can take control over when the paragraph number is incremented by overriding the item() function. The item() function is passed the Item of the current XML element and a flag indicating whether the item is being entered or exited. For example, the item() function for tokenizing XHTML would be along the lines of:
void MyTokenizer::item( Item const &item, bool entering ) {
  if ( entering && item.isNode() &&
       item.getNodeKind() == store::StoreConsts::elementNode ) {
    Item qname;
    item.getNodeName( qname );
    if ( is_block_element( qname ) )   // hypothetical helper: true only for
      ++state().para;                  // block-level elements such as <p>
  }
}
Properties
To implement a Tokenizer, you also need to implement the properties() function that fills in the Properties struct where:
comments_separate_tokens | If true, XML comments separate tokens. For example, net<!-- -->work would be 2 tokens instead of 1. |
elements_separate_tokens | If true, XML elements separate tokens. For example, <b>B</b>old would be 2 tokens instead of 1. |
processing_instructions_separate_tokens | If true, XML processing instructions separate tokens. For example, net<?PI pi?>work would be 2 tokens instead of 1. |
languages | The list of languages supported by the tokenizer. |
uri | The URI that uniquely identifies the Tokenizer. |
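As an illustration, a minimal properties() for the hypothetical MyTokenizer sketched earlier could fill in the struct as follows; the URI and the supported-language list are made-up values:
void MyTokenizer::properties( Properties *result ) const {
  result->comments_separate_tokens = true;
  result->elements_separate_tokens = true;
  result->processing_instructions_separate_tokens = true;
  result->languages.clear();
  result->languages.push_back( locale::iso639_1::en );
  result->uri = "http://example.com/my-tokenizer";   // hypothetical; must uniquely identify the tokenizer
}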
The TokenizerProvider Class
In addition to a Tokenizer, you must also implement a TokenizerProvider that, given a language, provides a Tokenizer for that language:
class TokenizerProvider {
public:
  virtual ~TokenizerProvider();
  virtual bool getTokenizer( locale::iso639_1::type lang,
                             Tokenizer::State *state = 0,
                             Tokenizer::ptr *t = 0 ) const = 0;
};
Specifically, you need to implement the getTokenizer() function where:
lang | The language to tokenize. |
state | The State to use. If null, t is not set. |
t | If not null, set to point to a Tokenizer for lang. |
A simple TokenizerProvider for our tokenizer can be implemented as:
class MyTokenizerProvider : public TokenizerProvider {
public:
  bool getTokenizer( locale::iso639_1::type lang, Tokenizer::State *state = 0,
                     Tokenizer::ptr *t = 0 ) const;
};

bool MyTokenizerProvider::getTokenizer( locale::iso639_1::type lang,
                                        Tokenizer::State *state,
                                        Tokenizer::ptr *t ) const {
  switch ( lang ) {
    case locale::iso639_1::en:
      if ( state && t )
        t->reset( new MyTokenizer( *state ) );  // pass the State to the constructor
      return true;
    default:
      return false;
  }
}