This library module provides hybrid string similarity functions, combining the properties of character-based string similarity functions and token-based string similarity functions.
The logic contained in this module is not specific to any particular XQuery implementation, although the module requires the trigonometric functions of XQuery 3.0 or a math extension function such as sqrt($x as numeric) for computing the square root.
monge-elkan-jaro-winkler
($s1 as xs:string, $s2 as xs:string, $prefix as xs:integer, $fact as xs:double) as xs:double
Returns the Monge-Elkan similarity coefficient between two strings, using the Jaro-Winkler similarity function to discover token identity.
soft-cosine-tokens-edit-distance
($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:integer) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings, using the Edit Distance similarity function to discover token identity.
soft-cosine-tokens-jaro-winkler
($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:double, $prefix as xs:integer?, $fact as xs:double?) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings, using the Jaro-Winkler similarity function to discover token identity.
soft-cosine-tokens-jaro
($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:double) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings, using the Jaro similarity function to discover token identity.
soft-cosine-tokens-metaphone
($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings, using the Metaphone phonetic similarity function to discover token identity.
soft-cosine-tokens-soundex
($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings, using the Soundex phonetic similarity function to discover token identity.
declare function simh:monge-elkan-jaro-winkler($s1 as xs:string, $s2 as xs:string, $prefix as xs:integer, $fact as xs:double) as xs:double
Returns the Monge-Elkan similarity coefficient between two strings, using the Jaro-Winkler
similarity function to discover token identity.
Example usage :
monge-elkan-jaro-winkler("Comput. Sci. and Eng. Dept., University of California, San Diego", "Department of Computer Scinece, Univ. Calif., San Diego", 4, 0.1)
The function invocation in the example above returns :
0.992
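As an illustration of the Monge-Elkan scheme itself, the following is a minimal XQuery 3.0 sketch, not the module's implementation. It averages, over the tokens of the first string, the best inner-similarity score found among the tokens of the second string. The names local:monge-elkan and local:exact-sim are hypothetical, whitespace tokenization is an assumption, and a case-insensitive exact match stands in for Jaro-Winkler, so the scores differ from those of the module.

```xquery
xquery version "3.0";

(: Stand-in inner similarity (an assumption of this sketch): 1.0 when the
   tokens are equal ignoring case, 0.0 otherwise. The module uses
   Jaro-Winkler in this role. :)
declare function local:exact-sim($a as xs:string, $b as xs:string) as xs:double {
  if (lower-case($a) = lower-case($b)) then 1.0e0 else 0.0e0
};

(: Monge-Elkan: the average, over the tokens of $s1, of the best
   inner-similarity score against any token of $s2. :)
declare function local:monge-elkan(
  $s1 as xs:string,
  $s2 as xs:string,
  $sim as function(xs:string, xs:string) as xs:double
) as xs:double {
  (: Whitespace tokenization is an assumption of this sketch. :)
  let $t1 := tokenize(normalize-space($s1), " ")
  let $t2 := tokenize(normalize-space($s2), " ")
  return
    if (empty($t1) or empty($t2)) then 0.0e0
    else avg(
      for $a in $t1
      return max(for $b in $t2 return $sim($a, $b))
    )
};

(
  local:monge-elkan("San Diego Univ.", "university of san diego", local:exact-sim#2),
  local:monge-elkan("university of san diego", "San Diego Univ.", local:exact-sim#2)
)
```

With the stand-in similarity, the first call evaluates to 2 div 3 (two of the three tokens find an exact partner) and the second to 0.5, which shows that the Monge-Elkan coefficient is not symmetric in its arguments.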
declare function simh:soft-cosine-tokens-edit-distance($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:integer) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings.
The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval).
The Edit Distance similarity function is used to discover token identity, and tokens having an edit distance at or below a given threshold are considered matching tokens.
Example usage :
soft-cosine-tokens-edit-distance("The FLWOR Foundation", "FLWOR Found.", " +", 0 )
The function invocation in the example above returns :
0.408248290463863
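The value above can be checked by hand: with a threshold of 0 only identical tokens match, so the two strings share the single token "FLWOR", and the coefficient is 1 div sqrt(3 * 2). The following XQuery 3.0 sketch covers only that threshold-0 case, using plain term-frequency vectors. It is not the module's code, and the names local:tf and local:tf-cosine are hypothetical.

```xquery
xquery version "3.0";
declare namespace math = "http://www.w3.org/2005/xpath-functions/math";

(: Term frequency: how often $v occurs in the token sequence. :)
declare function local:tf($tokens as xs:string*, $v as xs:anyAtomicType) as xs:integer {
  count($tokens[. = $v])
};

(: Cosine similarity between the term-frequency vectors of two strings,
   where $r is the regular expression used to split each string into tokens.
   Only identical tokens are treated as matching. :)
declare function local:tf-cosine(
  $s1 as xs:string, $s2 as xs:string, $r as xs:string
) as xs:double {
  let $t1 := tokenize($s1, $r)
  let $t2 := tokenize($s2, $r)
  let $dot := sum(for $v in distinct-values($t1)
                  return local:tf($t1, $v) * local:tf($t2, $v))
  let $sq1 := sum(for $v in distinct-values($t1)
                  return local:tf($t1, $v) * local:tf($t1, $v))
  let $sq2 := sum(for $v in distinct-values($t2)
                  return local:tf($t2, $v) * local:tf($t2, $v))
  return
    if ($sq1 * $sq2 = 0) then 0.0e0
    else $dot div math:sqrt($sq1 * $sq2)
};

local:tf-cosine("The FLWOR Foundation", "FLWOR Found.", " +")
```

The query evaluates to 1 div sqrt(6), approximately 0.408, in line with the documented result.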
declare function simh:soft-cosine-tokens-jaro-winkler($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:double, $prefix as xs:integer?, $fact as xs:double?) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings.
The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval).
The Jaro-Winkler similarity function is used to discover token identity, and tokens having a Jaro-Winkler similarity at or above a given threshold are considered matching tokens.
Example usage :
soft-cosine-tokens-jaro-winkler("The FLWOR Foundation", "FLWOR Found.", " +", 1, 4, 0.1 )
The function invocation in the example above returns :
0.45
declare function simh:soft-cosine-tokens-jaro($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:double) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings.
The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval).
The Jaro similarity function is used to discover token identity, and tokens having a Jaro similarity at or above a given threshold are considered matching tokens.
Example usage :
soft-cosine-tokens-jaro("The FLWOR Foundation", "FLWOR Found.", " +", 1 )
The function invocation in the example above returns :
0.5
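To illustrate how a similarity threshold can turn near-matches into matching tokens, here is one plausible reading sketched in XQuery 3.0. It is an assumption, not the module's algorithm, and it does not attempt to reproduce the module's numeric results. Each token of the second string is rewritten to the first token of the first string whose similarity reaches the threshold, and a term-frequency cosine is then computed. The names local:prefix-sim, local:sum-sq and local:soft-cosine are hypothetical, and a simple shared-prefix ratio stands in for the Jaro function.

```xquery
xquery version "3.0";
declare namespace math = "http://www.w3.org/2005/xpath-functions/math";

(: Hypothetical stand-in for Jaro: length of the shared prefix divided by
   the length of the longer token. :)
declare function local:prefix-sim($a as xs:string, $b as xs:string) as xs:double {
  let $longest := max((string-length($a), string-length($b)))
  let $shortest := min((string-length($a), string-length($b)))
  let $common := count((1 to $shortest)[substring($a, 1, .) = substring($b, 1, .)])
  return if ($longest = 0) then 1.0e0 else xs:double($common) div $longest
};

(: Sum of squared term frequencies of a token sequence. :)
declare function local:sum-sq($tokens as xs:string*) as xs:integer {
  sum(for $v in distinct-values($tokens)
      return count($tokens[. = $v]) * count($tokens[. = $v]))
};

declare function local:soft-cosine(
  $s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:double,
  $sim as function(xs:string, xs:string) as xs:double
) as xs:double {
  let $t1 := tokenize($s1, $r)
  (: Rewrite each token of $s2 to the first token of $s1 whose similarity
     is at or above $t; tokens without such a partner stay unchanged. :)
  let $t2 :=
    for $b in tokenize($s2, $r)
    return (($t1[$sim(., $b) ge $t])[1], $b)[1]
  let $dot := sum(for $v in distinct-values($t1)
                  return count($t1[. = $v]) * count($t2[. = $v]))
  let $norm := math:sqrt(local:sum-sq($t1) * local:sum-sq($t2))
  return if ($norm = 0) then 0.0e0 else $dot div $norm
};

(
  local:soft-cosine("The FLWOR Foundation", "FLWOR Found.", " +", 0.5, local:prefix-sim#2),
  local:soft-cosine("The FLWOR Foundation", "FLWOR Found.", " +", 1, local:prefix-sim#2)
)
```

Under these assumptions the first call evaluates to 2 div sqrt(6), approximately 0.816, because "Found." reaches the 0.5 threshold against "Foundation". The second call evaluates to 1 div sqrt(6), approximately 0.408, because a threshold of 1 leaves only the exact match on "FLWOR".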
declare function simh:soft-cosine-tokens-metaphone($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings.
The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval).
The Metaphone phonetic similarity function is used to discover token identity, which is equivalent to saying that this function returns the cosine similarity coefficient between sets of Metaphone keys.
Example usage :
soft-cosine-tokens-metaphone("ALEKSANDER SMITH", "ALEXANDER SMYTH", " +" )
The function invocation in the example above returns :
1.0
declare function simh:soft-cosine-tokens-soundex($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings.
The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval).
The Soundex phonetic similarity function is used to discover token identity, which is equivalent to saying that this function returns the cosine similarity coefficient between sets of Soundex keys.
Example usage :
soft-cosine-tokens-soundex("ALEKSANDER SMITH", "ALEXANDER SMYTH", " +")
The function invocation in the example above returns :
1.0
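The equivalence with a cosine over Soundex keys can be illustrated with a short XQuery 3.0 sketch. It is not the module's code: local:soundex is a hypothetical, simplified American Soundex that assumes ASCII letters only, and local:soundex-cosine is likewise a hypothetical name. Every token is first mapped to its key, and the term-frequency cosine is then taken over the keys.

```xquery
xquery version "3.0";
declare namespace math = "http://www.w3.org/2005/xpath-functions/math";

(: Simplified American Soundex; assumes the word contains ASCII letters only. :)
declare function local:soundex($word as xs:string) as xs:string {
  let $u := upper-case($word)
  let $first := substring($u, 1, 1)
  (: H and W after the first letter do not separate equal codes, so drop them. :)
  let $letters := concat($first, translate(substring($u, 2), "HW", ""))
  (: Map consonants to their digit classes and vowels to 0. :)
  let $digits := translate($letters,
                           "BFPVCGJKQSXZDTLMNRAEIOUYHW",
                           "11112222222233455600000000")
  (: Collapse runs of the same code, then discard the first letter's code
     and the vowel markers. :)
  let $collapsed := replace($digits, "(.)\1+", "$1")
  let $tail := translate(substring($collapsed, 2), "0", "")
  return substring(concat($first, $tail, "000"), 1, 4)
};

(: Term-frequency cosine computed over Soundex keys instead of raw tokens. :)
declare function local:soundex-cosine(
  $s1 as xs:string, $s2 as xs:string, $r as xs:string
) as xs:double {
  let $k1 := for $w in tokenize($s1, $r) return local:soundex($w)
  let $k2 := for $w in tokenize($s2, $r) return local:soundex($w)
  let $dot := sum(for $k in distinct-values($k1)
                  return count($k1[. = $k]) * count($k2[. = $k]))
  let $sq1 := sum(for $k in distinct-values($k1)
                  return count($k1[. = $k]) * count($k1[. = $k]))
  let $sq2 := sum(for $k in distinct-values($k2)
                  return count($k2[. = $k]) * count($k2[. = $k]))
  return
    if ($sq1 * $sq2 = 0) then 0.0e0
    else $dot div math:sqrt($sq1 * $sq2)
};

(
  local:soundex("ALEKSANDER"),
  local:soundex("SMYTH"),
  local:soundex-cosine("ALEKSANDER SMITH", "ALEXANDER SMYTH", " +")
)
```

In this sketch "ALEKSANDER" and "ALEXANDER" both map to the key A425, and "SMITH" and "SMYTH" both map to S530, so the two key sets coincide and the cosine evaluates to 1, consistent with the documented result.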