This library module provides token-based string similarity functions that view strings as sets or multi-sets of tokens and use set-related properties to compute similarity scores.
The tokens correspond to groups of characters extracted from the strings being compared, such as individual words or character n-grams.
These functions are particularly useful for matching near-duplicate strings in cases where typographical conventions often lead to a rearrangement of words (e.g., "John Smith" versus "Smith, John").
The logic contained in this module is not specific to any particular XQuery implementation, although the module requires either the trigonometric functions of XQuery 3.0 or a math extension function such as sqrt($x as numeric) to compute square roots.
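As a rough illustration of why a set view of tokens tolerates word rearrangement, the following Python sketch compares the two name forms mentioned above. It is not part of the module: the `word_tokens` helper and its rule of splitting on runs of non-word characters are assumptions made only for this example.

```python
import re


def word_tokens(s: str) -> set[str]:
    """Split on runs of non-word characters and keep the distinct, non-empty tokens.

    Hypothetical helper for illustration; the module's own tokenization may differ.
    """
    return {t for t in re.split(r"\W+", s) if t}


a = word_tokens("John Smith")
b = word_tokens("Smith, John")

# Word order and punctuation are lost once the strings become sets of tokens.
print(a == b)  # True
```

Because both strings reduce to the same set of tokens, any set-based coefficient treats them as identical.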
| Function | Signature | Description |
|---|---|---|
| cosine-ngrams | ($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double | Returns the cosine similarity coefficient between sets of character n-grams extracted from two strings. |
| cosine-tokens | ($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double | Returns the cosine similarity coefficient between sets of tokens extracted from two strings. |
| cosine | ($desc1 as xs:string*, $desc2 as xs:string*) as xs:double | Auxiliary function for computing the cosine similarity coefficient between strings, using string descriptors based on sets of character n-grams or sets of tokens extracted from two strings. |
| dice-ngrams | ($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double | Returns the Dice similarity coefficient between sets of character n-grams extracted from two strings. |
| dice-tokens | ($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double | Returns the Dice similarity coefficient between sets of tokens extracted from two strings. |
| jaccard-ngrams | ($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double | Returns the Jaccard similarity coefficient between sets of character n-grams extracted from two strings. |
| jaccard-tokens | ($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double | Returns the Jaccard similarity coefficient between sets of tokens extracted from two strings. |
| ngrams | ($s as xs:string, $n as xs:integer) as xs:string* | Returns the individual character n-grams forming a string. |
| overlap-ngrams | ($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double | Returns the overlap similarity coefficient between sets of character n-grams extracted from two strings. |
| overlap-tokens | ($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double | Returns the overlap similarity coefficient between sets of tokens extracted from two strings. |
declare function simt:cosine-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double
Returns the cosine similarity coefficient between sets of character n-grams extracted from two strings.
The n-grams from each string are weighted according to their occurrence frequency (i.e., according to the term-frequency heuristic from Information Retrieval).
Example usage:
cosine-ngrams("DWAYNE", "DUANE", 2)
The function invocation in the example above returns:
0.2401922307076307
declare function simt:cosine-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings. The tokens from each string are weighted according to their occurrence frequency (i.e., according to the term-frequency heuristic from Information Retrieval).
Example usage:
cosine-tokens("The FLWOR Foundation", "FLWOR Found.", " +")
The function invocation in the example above returns:
0.408248290463863
declare function simt:cosine($desc1 as xs:string*, $desc2 as xs:string*) as xs:double
Auxiliary function for computing the cosine similarity coefficient between strings, using string descriptors based on sets of character n-grams or sets of tokens extracted from two strings.
Example usage:
cosine(("aa", "bb"), ("bb", "aa"))
The function invocation in the example above returns:
1.0
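The term-frequency weighting described for the cosine functions can be sketched in Python as follows. This is an illustration under stated assumptions, not the module's source: the `cosine` function here is a hypothetical stand-in that counts how often each descriptor occurs and computes the standard cosine of the two count vectors, and returning 0.0 for an empty descriptor list is a choice made for this sketch.

```python
from collections import Counter
from math import sqrt


def cosine(desc1: list[str], desc2: list[str]) -> float:
    """Cosine of two term-frequency vectors built from descriptor lists."""
    tf1, tf2 = Counter(desc1), Counter(desc2)
    # Dot product over the descriptors the two lists share.
    dot = sum(tf1[t] * tf2[t] for t in tf1.keys() & tf2.keys())
    # Squared Euclidean norms of the two frequency vectors.
    sq1 = sum(c * c for c in tf1.values())
    sq2 = sum(c * c for c in tf2.values())
    if sq1 == 0 or sq2 == 0:
        return 0.0  # assumption: an empty descriptor list is similar to nothing
    return dot / sqrt(sq1 * sq2)


print(cosine(["aa", "bb"], ["bb", "aa"]))  # 1.0

# Token descriptors of "The FLWOR Foundation" and "FLWOR Found." split on spaces:
# one shared token out of three and two, i.e. 1 / sqrt(6).
print(cosine(["The", "FLWOR", "Foundation"], ["FLWOR", "Found."]))
```

The order of the descriptors does not matter, only how often each one occurs, which is why the two-element example scores 1.0.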
declare function simt:dice-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double
Returns the Dice similarity coefficient between sets of character n-grams extracted from two strings.
Example usage:
dice-ngrams("DWAYNE", "DUANE", 2)
The function invocation in the example above returns:
0.4615384615384616
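The value above is consistent with the usual Dice formula, twice the size of the intersection divided by the sum of the two set sizes. The Python sketch below reproduces it under one assumption inferred from the `ngrams` example in this document: each string is padded with `n - 1` underscore characters on both sides before the n-grams are extracted. The `padded_ngrams` and `dice` helpers are hypothetical names used only here.

```python
def padded_ngrams(s: str, n: int, pad: str = "_") -> set[str]:
    """Distinct character n-grams of s, padded with n-1 pad characters per side.

    The padding scheme is an assumption inferred from the documented examples.
    """
    padded = pad * (n - 1) + s + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a: set[str], b: set[str]) -> float:
    """Dice coefficient: 2 * |A intersect B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0  # assumption: two empty sets score 0
    return 2 * len(a & b) / (len(a) + len(b))


x = padded_ngrams("DWAYNE", 2)  # 7 distinct bigrams
y = padded_ngrams("DUANE", 2)   # 6 distinct bigrams

print(sorted(x & y))  # ['E_', 'NE', '_D']
print(dice(x, y))     # 2 * 3 / (7 + 6) = 6/13
```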
declare function simt:dice-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double
Returns the Dice similarity coefficient between sets of tokens extracted from two strings.
Example usage:
dice-tokens("The FLWOR Foundation", "FLWOR Found.", " +")
The function invocation in the example above returns:
0.4
declare function simt:jaccard-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double
Returns the Jaccard similarity coefficient between sets of character n-grams extracted from two strings.
Example usage:
jaccard-ngrams("DWAYNE", "DUANE", 2)
The function invocation in the example above returns:
0.3
declare function simt:jaccard-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double
Returns the Jaccard similarity coefficient between sets of tokens extracted from two strings.
Example usage:
jaccard-tokens("The FLWOR Foundation", "FLWOR Found.", " +")
The function invocation in the example above returns:
0.25
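A minimal Python sketch of the same computation follows. It assumes that `$r` is a regular expression matching the separators between tokens, as the `" +"` argument in the example suggests; the `tokens` and `jaccard` helpers are hypothetical, and Python's regular-expression dialect is not identical to the one used by XQuery.

```python
import re


def tokens(s: str, separator_regex: str) -> set[str]:
    """Distinct non-empty tokens obtained by splitting s on the separator regex."""
    return {t for t in re.split(separator_regex, s) if t}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard coefficient: |A intersect B| / |A union B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


a = tokens("The FLWOR Foundation", " +")  # {'The', 'FLWOR', 'Foundation'}
b = tokens("FLWOR Found.", " +")          # {'FLWOR', 'Found.'}

# Only "FLWOR" is shared; the union holds four distinct tokens.
print(jaccard(a, b))  # 0.25
```

Note that "Foundation" and "Found." count as different tokens, which is why token-based scores can be low for abbreviated names.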
declare function simt:ngrams($s as xs:string, $n as xs:integer) as xs:string*
Returns the individual character n-grams forming a string.
Example usage:
ngrams("FLWOR", 2)
The function invocation in the example above returns:
("_F" , "FL" , "LW" , "WO" , "LW" , "WO" , "OR" , "R_")
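For comparison, a plain sliding-window extraction can be sketched in Python as below. The padding with `n - 1` underscore characters on each side is an assumption inferred from the "_F" and "R_" entries in the output above, and the `ngrams` function here is a hypothetical stand-in, not the module's code. This sketch emits each window position exactly once, so it yields six bigrams for "FLWOR", whereas the output listed above contains "LW" and "WO" twice; the module's actual behavior may therefore differ from this sketch.

```python
def ngrams(s: str, n: int, pad: str = "_") -> list[str]:
    """Character n-grams of s in order, one per window position.

    Assumption: the string is padded with n-1 pad characters on each side.
    """
    padded = pad * (n - 1) + s + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(ngrams("FLWOR", 2))  # ['_F', 'FL', 'LW', 'WO', 'OR', 'R_']
```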
declare function simt:overlap-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double
Returns the overlap similarity coefficient between sets of character n-grams extracted from two strings.
Example usage:
overlap-ngrams("DWAYNE", "DUANE", 2)
The function invocation in the example above returns:
0.5
declare function simt:overlap-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double
Returns the overlap similarity coefficient between sets of tokens extracted from two strings.
Example usage:
overlap-tokens("The FLWOR Foundation", "FLWOR Found.", " +")
The function invocation in the example above returns:
0.5
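The overlap coefficient divides the size of the intersection by the size of the smaller set, so a short string that is fully contained in a longer one scores 1.0. The Python sketch below applies that formula to the token sets of the example, written out by hand on the assumption that `" +"` splits on runs of spaces; the `overlap` helper is a hypothetical name used only for illustration.

```python
def overlap(a: set[str], b: set[str]) -> float:
    """Overlap coefficient: |A intersect B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0  # assumption: an empty set overlaps with nothing
    return len(a & b) / smaller


s1 = {"The", "FLWOR", "Foundation"}  # tokens of "The FLWOR Foundation"
s2 = {"FLWOR", "Found."}             # tokens of "FLWOR Found."

# One shared token, and the smaller set has two tokens.
print(overlap(s1, s2))  # 0.5
```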