URI Resolvers

Motivation

In JSONiq or XQuery, many resources are named by URIs - most notably schemas and modules, but also documents, full-text stop-word lists and thesauri, and so on. By convention, URIs are often given with the http: scheme and the domain name of the provider. However, it is generally not desirable to load the resource via HTTP, and in many cases the resource is not actually available at the named URI. In other words, these are URIs (Uniform Resource Identifiers), not URLs (Uniform Resource Locators).

Zorba provides a built-in mechanism for mapping URIs to local filesystem paths. It can also load resources via HTTP in situations where that is in fact appropriate. These built-in mechanisms solve a large number of resource-loading problems. Zorba also offers a flexible API for extending the built-in mechanisms, which lets you resolve URIs from queries in almost any way needed to handle application-specific problems.

Zorba's Built-in URI Resolver

JSONiq and XQuery themselves offer an approach to locating resources via the "at" clause in import module and import schema statements. However, we recommend against this approach for most applications that need to be robust.

For example, the following code snippet imports a library module with target namespace "http://www.example.com/modules/utils".

   (: Evil import statement :)
   import module namespace utils = "http://www.example.com/modules/utils"
     at "/home/foo/xquery/utils.xq";

In this import statement, the user specifies that the file containing the module is physically located at "/home/foo/xquery/utils.xq".

Hard-coding the physical location is undesirable for two reasons. First, the physical location of a module or schema may change during development, and the module will frequently be located in a different directory after the application has been deployed. Second, the user may want to package and distribute the application without requiring other developers to change the code so that the location hints are valid on their systems.

Out of the box, Zorba will attempt to map URIs to local filesystem locations. It does so in two steps:

First, the URI is transformed to a relative path via the algorithm listed below.
Second, Zorba attempts to load the relative path within every directory listed on its URI path. (Throughout this documentation, a "path" is defined as an ordered list of directories in the filesystem.)

URI Transformation

Consider a library module with the namespace http://www.example.com/modules/utils. This namespace URI will be rewritten by the following steps:

The domain (authority) component of the URI is reversed, and then transformed into a relative filename separated by forward slashes: www.example.com => com/example/www
The path component of the URI is appended: /modules/utils
If the path component ends with a trailing slash, the word "index" will be appended.
Finally, an appropriate filename extension will be added if it is not already present. For modules, the suffix is .xq, and for schemas, the suffix is .xsd.

So, the final relative path in our example would be com/example/www/modules/utils.xq. A couple more examples (all are presumed to be module URIs):

=> com/example/www/modules/utils/index.xq
=> com/example/www/modules/mylib.xq
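
The transformation steps above can be sketched as a small standalone function. This is an illustrative model, not Zorba's implementation: the name uriToRelativePath is hypothetical, the scheme is simply stripped, and ports, user info, and query strings are ignored. Treating a URI that has no path as if its path were "/" is also an assumption of this sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not part of Zorba's API.
// extension is ".xq" for modules and ".xsd" for schemas.
std::string uriToRelativePath(const std::string& uri,
                              const std::string& extension)
{
  assert(!extension.empty());

  // Drop the scheme ("http://") if there is one.
  std::string rest = uri;
  const std::string::size_type schemeEnd = rest.find("://");
  if (schemeEnd != std::string::npos) {
    rest = rest.substr(schemeEnd + 3);
  }

  // Split into authority ("www.example.com") and path ("/modules/utils").
  // Assumption of this sketch: a URI without a path is treated as path "/".
  const std::string::size_type slash = rest.find('/');
  const std::string authority = rest.substr(0, slash);
  const std::string path =
      (slash == std::string::npos) ? std::string("/") : rest.substr(slash);

  // Step 1: split the authority on '.' and join the labels in reverse order.
  std::vector<std::string> labels;
  std::string::size_type start = 0;
  for (;;) {
    const std::string::size_type dot = authority.find('.', start);
    if (dot == std::string::npos) {
      labels.push_back(authority.substr(start));
      break;
    }
    labels.push_back(authority.substr(start, dot - start));
    start = dot + 1;
  }
  std::string result;
  for (auto it = labels.rbegin(); it != labels.rend(); ++it) {
    if (!result.empty()) result += '/';
    result += *it;
  }

  // Step 2: append the path component.
  result += path;

  // Step 3: a trailing slash gets the word "index".
  if (result[result.size() - 1] == '/') result += "index";

  // Step 4: add the extension unless it is already present.
  const bool hasExtension =
      result.size() >= extension.size() &&
      result.compare(result.size() - extension.size(),
                     extension.size(), extension) == 0;
  if (!hasExtension) result += extension;

  return result;
}
```

With the module suffix .xq, this sketch maps http://www.example.com/modules/utils to com/example/www/modules/utils.xq, and a URI ending in a slash, such as http://www.example.com/modules/utils/, to com/example/www/modules/utils/index.xq.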

Zorba's URI Path

Zorba will now attempt to find the named file relative to each directory in the URI Path. By default, the only entry on the URI path is the directory /usr/share/zorba/uris/ (on Unix and MacOS X) or C:\Program Files\Zorba 2.1.0\share\zorba\uris (on Windows). So, to complete this example, assuming Zorba is installed on a Unix system in the default location, Zorba will attempt to resolve the module namespace URI to the file

   /usr/share/zorba/uris/com/example/www/modules/utils.xq

If you have modules or schemas installed in other locations on your system, you may provide additional search directories either by passing the --uri-path command-line argument to Zorba, or by setting the ZORBA_URI_PATH environment variable. In both cases, the value is an ordered list of filesystem directories separated by ":" (on Unix/MacOS X) or ";" (on Windows). Zorba will search each directory on the URI path in the order specified, and the first match found will be used. So, for example, if you invoke Zorba as follows (example is for a Unix system):

   zorba --uri-path '/home/foo/xquery/uris:/opt/share/xquery/uris' \
     -q 'import module namespace utils="http://www.example.com/modules/utils"; 1'

Zorba will attempt to load the module from the following locations, in order:

   /home/foo/xquery/uris/com/example/www/modules/utils.xq
   /opt/share/xquery/uris/com/example/www/modules/utils.xq
   /usr/share/zorba/uris/com/example/www/modules/utils.xq
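
The search order above can be modelled in a few lines. This is a sketch under stated assumptions rather than Zorba's own code: splitPathList and candidateFiles are hypothetical names, the built-in default directory is passed in explicitly, and only the directories named above are considered.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: splits a path-list value (as given to --uri-path or
// ZORBA_URI_PATH) on the separator, ':' on Unix or ';' on Windows.
// Empty entries are skipped.
std::vector<std::string> splitPathList(const std::string& value, char sep)
{
  std::vector<std::string> dirs;
  std::string::size_type start = 0;
  while (start <= value.size()) {
    std::string::size_type end = value.find(sep, start);
    if (end == std::string::npos) end = value.size();
    if (end > start) dirs.push_back(value.substr(start, end - start));
    start = end + 1;
  }
  return dirs;
}

// Hypothetical helper: lists the files that would be tried, in order.
// User-supplied directories come first; the default directory comes last.
std::vector<std::string> candidateFiles(const std::vector<std::string>& userDirs,
                                        const std::string& defaultDir,
                                        const std::string& relativePath)
{
  assert(!relativePath.empty());

  std::vector<std::string> dirs = userDirs;
  dirs.push_back(defaultDir);

  std::vector<std::string> files;
  for (const std::string& dir : dirs) {
    std::string file = dir;
    // Join with exactly one '/' between directory and relative path.
    if (!file.empty() && file[file.size() - 1] != '/') file += '/';
    file += relativePath;
    files.push_back(file);
  }
  return files;
}
```

Calling candidateFiles(splitPathList("/home/foo/xquery/uris:/opt/share/xquery/uris", ':'), "/usr/share/zorba/uris/", "com/example/www/modules/utils.xq") yields the three locations listed above, in the same order. The first file that exists would be the one used.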

If, after searching all URI path directories, no match is found for a given URI, Zorba will by default fall back to interpreting the URI as a URL and loading the resource via HTTP / HTTPS / FTP (depending on the URI scheme). You can turn this behaviour off by disabling the http-uri-resolution Zorba option (see Enabling or Disabling Features).

Zorba's Library Path

The above URI path mechanism is used for all URIs that are resolved by Zorba - most especially module and schema imports, but also full-text thesaurus and stop-word lists, documents, and so on.

When considering modules in particular, however, there is another important path: the Library Path. Zorba will look on this path when it needs to load dynamic libraries containing the implementation of external functions for a module (i.e., those functions implemented in C++; see External Functions in C++). This path is separate from the URI path because on certain systems - most notably Fedora, but also other Linux distributions - it is important that platform-dependent binary files (such as dynamic libraries) be installed in a separate directory structure from platform-independent files (such as .xq module files and .xsd schema files).

The Library Path mechanism is exactly parallel to the URI Path mechanism. The default, built-in entry on this path is /usr/lib/zorba (on Unix and MacOS X) or C:\Program Files\Zorba 2.1.0\lib\zorba (on Windows). You can add directories to this path using the --lib-path command-line argument, or by setting the ZORBA_LIB_PATH environment variable.

Internal (Core) Paths

There is actually one additional built-in directory on Zorba's URI Path, and one additional built-in directory on Zorba's Library Path, in addition to the default values mentioned above. These directories hold Zorba's built-in "core" modules (see Core Modules). Each is a subdirectory of the corresponding default directory, named "core/ZORBA_VERSION". So for example, for Zorba 2.1.0 on Unix, the default internal URI directory is /usr/share/zorba/uris/core/2.1.0 and the default internal library directory is /usr/lib/zorba/core/2.1.0.

These directories are kept separate to make it easier to upgrade Zorba, or to support multiple installed versions of Zorba, while allowing non-core modules to be installed and versioned independently of the Zorba version. Normally you should not modify the contents of these directories.

Changing the Default Paths

All four paths mentioned above - the core and non-core URI Path, and the core and non-core Library Path - have compiled-in default values as discussed. All four of these default values can be modified at Zorba build time to meet your environment's requirements. You can do this by specifying alternate values for the following CMake variables:

ZORBA_NONCORE_URI_DIR
ZORBA_CORE_URI_DIR
ZORBA_NONCORE_LIB_DIR
ZORBA_CORE_LIB_DIR

Note that these are relative directories, and will be resolved relative to CMAKE_INSTALL_PREFIX.

Zorba's "Module Path"

Earlier versions of Zorba had a single path for specifying where both platform-dependent library files and platform-independent module and schema files were located. This was somewhat inaccurately named the "module path". For backwards compatibility, Zorba still supports a --module-path command-line argument and a ZORBA_MODULE_PATH environment variable (and the C++ API has a StaticContext::setModulePaths() method). Specifying a set of directories as the "module path" using any of these mechanisms is exactly the same as specifying that set of directories as both the URI path and library path.

The Module Path is deprecated, and these mechanisms may be removed in a future major version of Zorba. There is no "default module path".

C++ API for URI Resolving

Modifying the URI Path programmatically

The simplest modification to Zorba's default behavior is setting the URI path programmatically. This allows you to have different URI paths per static context, if you wish.

The StaticContext C++ API class provides the setURIPath() method for this purpose. It is passed a vector of values, each being an absolute directory to add to the URI path.

For example, the following code snippet creates a StaticContext object; adds two directories to the URI path component of this static context; and compiles and executes a query given the information that is present in this static context (passed as the second parameter to the compileQuery() method).

   // Create a new static context
   zorba::StaticContext_t staticCtx = zorba->createStaticContext();
   
   // Set the URI Path
   std::vector<zorba::String> uriPath(2);
   uriPath[0] = "/home/foo/xquery/uris";
   uriPath[1] = "/opt/share/xquery/uris";
   
   staticCtx->setURIPath(uriPath);
   
   // Compile a query using the static context created above
   zorba::XQuery_t query = zorba->compileQuery(
       "import module namespace m='http://example.com/module'; m:foo()",
       staticCtx);
   
   // execute the compiled query printing the result to standard out
   std::cout << query << std::endl;

Modifying the Library Path programmatically

Similarly, StaticContext has a method named setLibPath() for specifying the Library Path to use. In operation it behaves exactly like setURIPath().

URI Mappers and URL Resolvers

Now we will discuss more advanced techniques for manipulating Zorba's URI resolution mechanism. The built-in mechanisms described above are in fact implemented internally by using these same techniques.

There are two types of class that you may implement to modify the URI resolution process:

A URI Mapper takes a URI and returns one or more "Candidate URIs", which are alternative URIs Zorba will attempt to resolve. This allows you to change the URI, or provide several different potential URIs, for Zorba to resolve.
A URL Resolver takes a URI (which is presumed to be a URL, that is, a URI which actually points to a resource) and returns a Resource object which Zorba will use to load the resource data.

The general algorithm used by Zorba when resolving a URI is as follows:

Zorba will start by passing the original URI to the first registered URI Mapper. If that Mapper returns any candidate URIs, then Zorba will pass each such URI to the next registered URI Mapper, and so on. If any Mapper does not return any candidate URIs for a given input URI, Zorba will simply pass the input URI unchanged to the next Mapper.

After all Mappers have been invoked, Zorba will have a set of candidate URIs. It will then pass each candidate URI, in order, to each registered URL Resolver. The first time a URL Resolver returns a Resource object, Zorba will use that Resource as the final source of content for the URI. If no URL Resolver ever returns a Resource, then Zorba will raise an appropriate "resource not found" error.
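
The algorithm above can be modelled without any Zorba types. In this sketch, Mapper and Resolver are hypothetical stand-ins built on std::function (they are not the URIMapper and URLResolver classes), a resource is reduced to a string of content, and the loop order (candidates outermost, resolvers innermost) is one reading of "each candidate URI, in order, to each registered URL Resolver".

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Stand-in for a URI Mapper: returns candidate URIs, or nothing.
using Mapper = std::function<std::vector<std::string>(const std::string&)>;
// Stand-in for a URL Resolver: returns true and fills content on success.
using Resolver = std::function<bool(const std::string&, std::string&)>;

// Runs the URI through every mapper in registration order.
std::vector<std::string> applyMappers(const std::string& uri,
                                      const std::vector<Mapper>& mappers)
{
  std::vector<std::string> current(1, uri);
  for (const Mapper& mapper : mappers) {
    assert(static_cast<bool>(mapper));  // every registered mapper is callable
    std::vector<std::string> next;
    for (const std::string& input : current) {
      const std::vector<std::string> mapped = mapper(input);
      if (mapped.empty()) {
        next.push_back(input);  // no candidates: pass the input on unchanged
      } else {
        next.insert(next.end(), mapped.begin(), mapped.end());
      }
    }
    current = next;
  }
  return current;
}

// Returns true and fills content from the first resolver that succeeds.
// A false result corresponds to the "resource not found" error.
bool resolve(const std::string& uri,
             const std::vector<Mapper>& mappers,
             const std::vector<Resolver>& resolvers,
             std::string& content)
{
  const std::vector<std::string> candidates = applyMappers(uri, mappers);
  for (const std::string& candidate : candidates) {
    for (const Resolver& resolver : resolvers) {
      if (resolver(candidate, content)) return true;
    }
  }
  return false;
}
```

Because a mapper that returns candidates replaces its input, a later mapper never sees the original URI unless an earlier mapper passed it through or pushed it back explicitly.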

URI Mappers

Zorba includes a few built-in URI Mappers. For instance, the mechanism which iterates through the URI path and produces a set of filesystem files where the URI might be located is implemented as a URI Mapper.

To implement your own URI Mapper, subclass the C++ API class URIMapper and implement the mapURI() method:

   virtual void mapURI(const zorba::String aUri,
     EntityData const* aEntityData, std::vector<zorba::String>& oUris);

and then register an instance of your subclass with the static context using the method registerURIMapper():

   StaticContext_t lContext = aZorba->createStaticContext();
   MyURIMapperSubclass* lMapper = new MyURIMapperSubclass();
   lContext->registerURIMapper(lMapper);

Note that the memory ownership of the URIMapper instance remains with the client program; it must de-allocate it appropriately when the static context is no longer used.

In your mapURI() implementation, aUri is the input URI. aEntityData is a pointer to additional information about the URI being resolved. As of Zorba 2.0, the only method on EntityData is getKind(), which will return an enumerated value describing what kind of URI is being resolved: SCHEMA, MODULE, THESAURUS, STOP_WORDS, COLLECTION, or DOCUMENT.

oUris is where you place any candidate URIs, by calling the method push_back(). (You should not look at any existing contents of the vector.) If you push any candidate URIs onto the vector, Zorba will replace the input URI with the set of candidate URIs you provide. That means that if you want Zorba to consider the original URI in addition to the alternative URIs you provide, you must push the original URI onto the vector as well.

The mapURI() method should not throw any exceptions.

As a limited but functional example, here is a full class which will change the URI of a specific schema to an alternative URI. You could use this, for example, if you have a large body of XQueries using a particular schema that you do want Zorba to download from the web (hence the built-in filesystem mapping mechanism is not appropriate), but the URI that the schema is available from has changed and you do not wish to modify all the import schema statements.

   class MySchemaURIMapper : public URIMapper
   {
     public:
   
     virtual ~MySchemaURIMapper() {}
   
     virtual void mapURI(const zorba::String aUri,
       EntityData const* aEntityData,
       std::vector<zorba::String>& oUris)
     {
       if (aEntityData->getKind() != EntityData::SCHEMA) {
         return;
       }
       if(aUri == "http://www.example.com/helloworld") {
         oUris.push_back("http://example.com/schemas/helloworld.xsd");
       }
     }
   };

Note that the first thing mapURI() does is check that the URI being resolved is in fact for a schema. This is generally good practice to prevent surprises in the off-chance that the same URI is also used to identify some other kind of resource.

URL Resolvers

Zorba includes two built-in URL Resolvers: one which handles file: URLs, and one which handles http:, https:, and ftp: URLs. These are implemented by using Zorba's file and http-client modules, respectively.

To implement your own URL Resolver, subclass the C++ API class URLResolver and implement the resolveURL() method:

   virtual Resource* resolveURL(const zorba::String& aUrl,
      EntityData const* aEntityData);

and then register an instance of your subclass with the static context using the method registerURLResolver():

   StaticContext_t lContext = aZorba->createStaticContext();
   MyURLResolverSubclass* lResolver = new MyURLResolverSubclass();
   lContext->registerURLResolver(lResolver);

You will note that this mechanism is exactly parallel to the URI Mapper mechanism. Also, as with URI Mappers, the memory ownership of the URLResolver instance remains with the client program; it must de-allocate it appropriately when the static context is no longer used.

In your resolveURL() method, the aUrl and aEntityData arguments have exactly the same meanings as aUri and aEntityData do for mapURI().

If your code recognizes the URL and wishes to return content for it, it must return a newly-allocated instance of some subclass of the abstract class Resource. In Zorba 2.0's public API, there is only one such subclass, StreamResource, which wraps around a std::istream.

It is important that all URL Resolvers check the EntityData's Kind and only return Resources for the appropriate kind of URIs, for two reasons:

They must return an appropriate subclass of Resource depending on the entity kind. Zorba is prepared to accept StreamResource for schemas, modules, stop-word lists, thesauri, and documents, but not for collections. In future, there will likely be additional resource subclasses and additional entity kinds, and returning an inappropriate resource subclass for a given entity type will have negative consequences.
They must return a resource which produces data appropriate for the kind of entity being resolved. For instance, when resolving a schema or document URI, Zorba expects the resource to produce well-formed XML. When resolving a module URI, Zorba expects the resource to produce an XQuery library module. Returning a resource which outputs incorrect data will result in errors.

As a fairly silly but functional example, here is a URL Resolver that returns a small hard-coded module for a specific URL:

   using namespace zorba;

   static void streamReleaser(std::istream* aStream)
   {
     delete aStream;
   }

   class FoobarModuleURLResolver : public URLResolver
   {
     public:
     virtual ~FoobarModuleURLResolver() {}
   
     virtual Resource* resolveURL(const String& aUrl,
       EntityData const* aEntityData)
     {
       // we have only one module
       if (aEntityData->getKind() == EntityData::MODULE &&
         aUrl == "http://www.example.com/foobar") 
       {
         return StreamResource::create
           (new std::istringstream
             ("module namespace lm = 'http://www.example.com/foobar'; "
              "declare function lm:foo() { 'foo' };"), &streamReleaser);
       }
       else {
         return NULL;
       }
     }
   };

A more realistic example would be a resolver that takes URLs (possibly with a non-standard scheme, such as db:) and loads the content for those URLs from a database. As long as the database API allows you to obtain the information as a std::istream, you may stream this data directly to Zorba.

Two notes about memory management: First, when a user-defined URLResolver returns a Resource, Zorba will take memory ownership of the Resource, and will free it when it is no longer needed. Second, when user code creates a StreamResource, the StreamResource assumes memory ownership of the std::istream that is wrapped. However, Zorba cannot free the std::istream itself, because it was instantiated inside user code rather than inside Zorba's own library. On Windows, in some circumstances, if a DLL deletes an object that was not instantiated inside that DLL, the application will crash. Therefore, the StreamResource factory function create() also takes a StreamReleaser, which is a function pointer. Zorba will call this function pointer, passing the std::istream, when it is no longer needed; the function is expected to free the std::istream.

Unlike mapURI(), it is acceptable for resolveURL() to throw exceptions. A URL Resolver should throw an exception if it believes that it is canonical for the URL (that is, it "should be able" to resolve it) but had some error during the attempt to resolve it. However, because Zorba may be attempting to resolve a number of candidate URIs, any exceptions thrown from a URL Resolver will be caught and consumed by Zorba. It will never re-throw any of these exceptions. It will merely remember the message of the first exception (assuming that it extends std::exception). If and only if no URL Resolver ever returns a valid Resource, Zorba will then throw a new exception with the saved message from the first-thrown exception.
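
The exception handling described above can be illustrated with a standalone model. Nothing here is Zorba code: Resolver is a hypothetical std::function stand-in for a URL Resolver, the resolved resource is reduced to a string, and the fallback message "resource not found" is a placeholder for whatever error Zorba actually raises when nothing was thrown.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for a URL Resolver: returns true and fills content on success,
// returns false if it does not handle the URL, and may throw on failure.
using Resolver = std::function<bool(const std::string&, std::string&)>;

// Tries every candidate against every resolver. Exceptions are consumed;
// only the message of the first std::exception is remembered.
std::string resolveFirst(const std::vector<std::string>& candidates,
                         const std::vector<Resolver>& resolvers)
{
  std::string firstError;
  bool sawError = false;

  for (const std::string& candidate : candidates) {
    for (const Resolver& resolver : resolvers) {
      assert(static_cast<bool>(resolver));  // every resolver is callable
      try {
        std::string content;
        if (resolver(candidate, content)) {
          return content;  // first success wins; earlier errors are dropped
        }
      } catch (const std::exception& e) {
        if (!sawError) {
          firstError = e.what();
          sawError = true;
        }
      } catch (...) {
        // Consumed as well; there is no message to remember.
      }
    }
  }

  // Nothing resolved: raise a new exception carrying the saved message.
  throw std::runtime_error(sawError ? firstError : "resource not found");
}
```

In this model, a resolver that throws for one candidate does not prevent a later resolver, or a later candidate, from succeeding. The saved message only surfaces when every attempt has failed.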

Component URI Mappers for modules

In XQuery, it is possible for a particular module to actually be implemented as a set of several .xq files. When a query imports the module's URI, the query processor is expected to provide some mechanism whereby multiple files will be loaded and combined to form the whole module definition.

Zorba uses Component URI Mappers to allow for this. The API is the same URIMapper class, and they are registered with the static context using the same registerURIMapper() method.

How does Zorba know which URI Mappers are intended to provide a set of URIs for the components of a module, and which are intended to provide a set of possible candidate URIs for other purposes? There is actually another method on the URIMapper class: mapperKind(). This method should return a value from the enumeration URIMapper::Kind. There are two possible values: COMPONENT and CANDIDATE. The default implementation of URIMapper returns CANDIDATE, so for normal URI Mappers there is no need to override this method.

When Zorba needs to resolve a URI for a module import, it first invokes all registered component URI mappers to form a set of component URIs. Then, it resolves each of these component URIs using the full URI resolution mechanism documented above - calling all candidate URI mappers and URL resolvers in turn. Assuming that it successfully loads a Resource for each component URI, it then assembles all of these Resources into the final, loaded module.

Here is an example of a component URI mapper, which tells Zorba to load two other URIs to form a complete module:

   class MyModuleURIMapper : public URIMapper
   {
     public:
   
     virtual ~MyModuleURIMapper() {}
   
     virtual URIMapper::Kind mapperKind() throw() { return URIMapper::COMPONENT; }
   
     virtual void mapURI(const zorba::String aUri,
       EntityData const* aEntityData,
       std::vector<zorba::String>& oUris)
     {
       if (aEntityData->getKind() != EntityData::MODULE) {
         return;
       }
       if(aUri == "http://www.example.com/mymodule") {
         oUris.push_back("http://www.example.com/mymodule/mod1");
         oUris.push_back("http://www.example.com/mymodule/mod2");
       }
     }
   };

As mentioned, each of the component URIs will go through the full URI resolution mechanism, including Zorba's built-in mechanisms. So, given the code above and a default Unix installation, Zorba will proceed to attempt to load the following files:

   /usr/share/zorba/uris/com/example/www/mymodule/mod1.xq
   /usr/share/zorba/uris/com/example/www/mymodule/mod2.xq

If both are found, then the two together will be taken to form the complete definition for the module.

Disallowing URIs

Sometimes, it might be required to forbid access to a certain URI. For example, a user might disallow access to the file module because she doesn't want XQuery developers to access files stored locally on the machine that is running Zorba. (Remember that the built-in URL Resolver for file: URLs is implemented in terms of the file module, so doing this will effectively disable all file access - module and schema importing, loading documents, and so on.)

Therefore, the static context also provides a mechanism for preventing developers from importing particular modules. This is also handled with a URIMapper in a simple way: If any registered URI Mapper (candidate or component) ever returns the value URIMapper::DENY_ACCESS in the oUris vector, Zorba will immediately throw a "URI access denied" exception.

Here is an example URI Mapper which will suppress the loading of the file module:

   class DenyFileAccessURIMapper : public URIMapper
   {
     public:
   
     virtual ~DenyFileAccessURIMapper() {}
   
     virtual void mapURI(const zorba::String aUri,
       EntityData const* aEntityData,
       std::vector<zorba::String>& oUris)
     {
       if(aEntityData->getKind() == EntityData::MODULE &&
          aUri == "http://www.zorba-xquery.com/modules/file") {
         oUris.push_back(URIMapper::DENY_ACCESS);
       }
     }
   };
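
The effect of DENY_ACCESS can be illustrated with a standalone model. This is a simplified sketch, not Zorba's implementation: Mapper is a hypothetical std::function stand-in that mirrors the out-parameter style of mapURI(), kDenyAccess is an invented sentinel standing in for URIMapper::DENY_ACCESS (whose actual value is not assumed here), and the chaining of mappers is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Invented sentinel for this sketch; stands in for URIMapper::DENY_ACCESS.
const std::string kDenyAccess = "<deny-access>";

// Stand-in for mapURI(): the mapper pushes its results onto the vector.
using Mapper =
    std::function<void(const std::string&, std::vector<std::string>&)>;

// Throws as soon as any mapper answers with the sentinel for this URI.
void checkAccess(const std::string& uri, const std::vector<Mapper>& mappers)
{
  for (const Mapper& mapper : mappers) {
    assert(static_cast<bool>(mapper));  // every registered mapper is callable
    std::vector<std::string> out;
    mapper(uri, out);
    if (std::find(out.begin(), out.end(), kDenyAccess) != out.end()) {
      throw std::runtime_error("URI access denied: " + uri);
    }
  }
}
```

A mapper that pushes the sentinel for the file module's URI makes checkAccess() throw for that URI only; URIs it does not recognise pass through untouched because the mapper leaves the vector empty.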