http://zorba.io/modules/data-cleaning/token-based-string-similarity

This library module provides token-based string similarity functions that view strings as sets or multi-sets of tokens and use set-related properties to compute similarity scores.

The tokens correspond to groups of characters extracted from the strings being compared, such as individual words or character n-grams.

These functions are particularly useful for matching near duplicate strings in cases where typographical conventions often lead to rearrangement of words (e.g., "John Smith" versus "Smith, John").

The logic contained in this module is not specific to any particular XQuery implementation, although the module requires the trigonometric functions of XQuery 3.0 or a math extension function such as sqrt($x as numeric) for computing the square root.

Function Summary


                                        
cosine-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the cosine similarity coefficient between sets of character n-grams extracted from two strings.


                                        
cosine-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.


                                        
cosine($desc1 as xs:string*, $desc2 as xs:string*) as xs:double

Auxiliary function for computing the cosine similarity coefficient between strings, using string descriptors based on sets of character n-grams or sets of tokens extracted from two strings.


                                        
dice-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the Dice similarity coefficient between sets of character n-grams extracted from two strings.


                                        
dice-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the Dice similarity coefficient between sets of tokens extracted from two strings.


                                        
jaccard-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the Jaccard similarity coefficient between sets of character n-grams extracted from two strings.


                                        
jaccard-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the Jaccard similarity coefficient between sets of tokens extracted from two strings.


                                        
ngrams($s as xs:string, $n as xs:integer) as xs:string*

Returns the individual character n-grams forming a string.


                                        
overlap-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the overlap similarity coefficient between sets of character n-grams extracted from two strings.


                                        
overlap-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the overlap similarity coefficient between sets of tokens extracted from two strings.

Functions

cosine-ngrams#3

declare  function simt:cosine-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the cosine similarity coefficient between sets of character n-grams extracted from two strings.

The n-grams from each string are weighted according to their occurrence frequency (i.e., according to the term-frequency heuristic from Information Retrieval).

Example usage :

 cosine-ngrams("DWAYNE", "DUANE", 2 )

The function invocation in the example above returns :

 0.2401922307076307

Parameters

s1 as xs:string: The first string.

s2 as xs:string: The second string.

n as xs:integer: The number of characters to consider when extracting n-grams.

Returns

xs:double: The cosine similarity coefficient between the sets of n-grams extracted from the two strings.
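To make the term-frequency weighting concrete, the following Python sketch counts how often each n-gram occurs in the sequence this page documents for ngrams("FLWOR", 2). It is an illustration only, not the module's XQuery source, and the helper name tf_vector is hypothetical.

```python
from collections import Counter


def tf_vector(descriptor):
    """Map each distinct item to its occurrence count (its term frequency)."""
    return Counter(descriptor)


# Sequence documented on this page for ngrams("FLWOR", 2).
bigrams = ["_F", "FL", "LW", "WO", "LW", "WO", "OR", "R_"]
weights = tf_vector(bigrams)

# An n-gram that occurs twice gets weight 2; one that occurs once gets 1.
print(weights["LW"], weights["_F"])
```

Under this weighting, n-grams that are repeated in a descriptor contribute more to the cosine coefficient than n-grams that occur once.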

cosine-tokens#3

declare  function simt:cosine-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings. The tokens from each string are weighted according to their occurrence frequency (i.e., according to the term-frequency heuristic from Information Retrieval).

Example usage :

 cosine-tokens("The FLWOR Foundation", "FLWOR Found.", " +" )

The function invocation in the example above returns :

 0.408248290463863

Parameters

s1 as xs:string: The first string.

s2 as xs:string: The second string.

r as xs:string: A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

xs:double: The cosine similarity coefficient between the sets of tokens extracted from the two strings.

cosine#2

declare  function simt:cosine($desc1 as xs:string*, $desc2 as xs:string*) as xs:double

Auxiliary function for computing the cosine similarity coefficient between strings, using string descriptors based on sets of character n-grams or sets of tokens extracted from two strings.

Example usage :

 cosine( ("aa","bb") , ("bb","aa"))

The function invocation in the example above returns :

1.0

Parameters

desc1 as xs:string*: The descriptor for the first string.

desc2 as xs:string*: The descriptor for the second string.

Returns

xs:double: The cosine similarity coefficient between the descriptors for the two strings.
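As a rough illustration of the computation, here is a minimal Python sketch of a term-frequency-weighted cosine coefficient over two descriptors. It is not the module's XQuery source; it uses the standard cosine formula (dot product divided by the product of the vector lengths), and returning 0.0 for an empty descriptor is an assumption made here.

```python
import math
from collections import Counter


def cosine(desc1, desc2):
    """Cosine similarity of two descriptors, weighting items by frequency."""
    tf1, tf2 = Counter(desc1), Counter(desc2)
    # Dot product over the items of the first descriptor; a Counter
    # returns 0 for items that are missing from the second one.
    dot = sum(count * tf2[item] for item, count in tf1.items())
    norm_sq1 = sum(c * c for c in tf1.values())
    norm_sq2 = sum(c * c for c in tf2.values())
    if norm_sq1 == 0 or norm_sq2 == 0:
        return 0.0  # assumption: an empty descriptor matches nothing
    return dot / math.sqrt(norm_sq1 * norm_sq2)


same_items = cosine(["aa", "bb"], ["bb", "aa"])
token_score = cosine(["The", "FLWOR", "Foundation"], ["FLWOR", "Found."])
print(same_items, token_score)
```

With these inputs the sketch agrees with the values documented for cosine and cosine-tokens on this page (1.0 and roughly 0.408), because the order of items in a descriptor does not affect the result.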

dice-ngrams#3

declare  function simt:dice-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the Dice similarity coefficient between sets of character n-grams extracted from two strings.

Example usage :

 dice-ngrams("DWAYNE", "DUANE", 2 )

The function invocation in the example above returns :

 0.4615384615384616

Parameters

s1 as xs:string: The first string.

s2 as xs:string: The second string.

n as xs:integer: The number of characters to consider when extracting n-grams.

Returns

xs:double: The Dice similarity coefficient between the sets of character n-grams extracted from the two strings.

dice-tokens#3

declare  function simt:dice-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the Dice similarity coefficient between sets of tokens extracted from two strings.

Example usage :

 dice-tokens("The FLWOR Foundation", "FLWOR Found.", " +" )

The function invocation in the example above returns :

0.4

Parameters

s1 as xs:string: The first string.

s2 as xs:string: The second string.

r as xs:string: A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

xs:double: The Dice similarity coefficient between the sets of tokens extracted from the two strings.
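The Dice coefficient is conventionally defined as twice the size of the intersection divided by the sum of the two set sizes. The Python sketch below applies that definition to token sets; it is an illustration rather than the module's XQuery source, the function name dice_tokens is only borrowed for readability, and Python's regular-expression dialect may differ from XQuery's for patterns more complex than the one used here.

```python
import re


def dice_tokens(s1, s2, delimiter):
    """Dice coefficient over the sets of tokens split on a delimiter regex."""
    a = {t for t in re.split(delimiter, s1) if t}
    b = {t for t in re.split(delimiter, s2) if t}
    if not a and not b:
        return 0.0  # assumption: no tokens on either side scores zero
    return 2 * len(a & b) / (len(a) + len(b))


score = dice_tokens("The FLWOR Foundation", "FLWOR Found.", " +")
print(score)  # 0.4
```

Only "FLWOR" is shared between the three tokens of the first string and the two tokens of the second, giving 2 * 1 / (3 + 2).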

jaccard-ngrams#3

declare  function simt:jaccard-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the Jaccard similarity coefficient between sets of character n-grams extracted from two strings.

Example usage :

 jaccard-ngrams("DWAYNE", "DUANE", 2 )

The function invocation in the example above returns :

0.3

Parameters

s1 as xs:string: The first string.

s2 as xs:string: The second string.

n as xs:integer: The number of characters to consider when extracting n-grams.

Returns

xs:double: The Jaccard similarity coefficient between the sets of character n-grams extracted from the two strings.
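The Jaccard coefficient is conventionally the size of the intersection divided by the size of the union. The following Python sketch applies it to sets of padded character n-grams. It is an illustration, not the module's XQuery source; padded_ngrams is a hypothetical helper that assumes n - 1 underscore characters of padding on each side, in line with the "_F" and "R_" entries in the documented ngrams output.

```python
def padded_ngrams(s, n, pad="_"):
    """Character n-grams of s after padding both ends with n - 1 pad chars."""
    padded = pad * (n - 1) + s + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def jaccard_ngrams(s1, s2, n):
    """Jaccard coefficient over the sets of distinct n-grams."""
    a, b = set(padded_ngrams(s1, n)), set(padded_ngrams(s2, n))
    union = a | b
    # assumption: two empty n-gram sets score zero
    return len(a & b) / len(union) if union else 0.0


score = jaccard_ngrams("DWAYNE", "DUANE", 2)
print(score)  # 0.3
```

The two strings share the bigrams "_D", "NE" and "E_" out of ten distinct bigrams in total, which gives 3 / 10.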

jaccard-tokens#3

declare  function simt:jaccard-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the Jaccard similarity coefficient between sets of tokens extracted from two strings.

Example usage :

 jaccard-tokens("The FLWOR Foundation", "FLWOR Found.", " +" )

The function invocation in the example above returns :

 0.25

Parameters

s1 as xs:string: The first string.

s2 as xs:string: The second string.

r as xs:string: A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

xs:double: The Jaccard similarity coefficient between the sets of tokens extracted from the two strings.

ngrams#2

declare  function simt:ngrams($s as xs:string, $n as xs:integer) as xs:string*

Returns the individual character n-grams forming a string.

Example usage :

 ngrams("FLWOR", 2 )

The function invocation in the example above returns :

 ("_F" , "FL" , "LW" , "WO" , "LW" , "WO" , "OR" , "R_")

Parameters

s as xs:string: The input string.

n as xs:integer: The number of characters to consider when extracting n-grams.

Returns

xs:string*: The sequence of strings with the extracted n-grams.
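For readers who want to see the sliding-window idea spelled out, here is a minimal Python sketch of padded n-gram extraction. It is not the module's XQuery source: the helper name and the choice of n - 1 underscore padding characters on each side are assumptions inferred from the "_F" and "R_" entries in the example above. The documented output lists "LW" and "WO" twice, whereas this sketch emits each window once, so the two agree only as sets of distinct n-grams.

```python
def padded_ngrams(s, n, pad="_"):
    """Slide a window of width n over s padded with n - 1 pad chars per side."""
    padded = pad * (n - 1) + s + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# Output documented on this page for ngrams("FLWOR", 2).
documented = ["_F", "FL", "LW", "WO", "LW", "WO", "OR", "R_"]

sketch = padded_ngrams("FLWOR", 2)
print(sketch)  # ['_F', 'FL', 'LW', 'WO', 'OR', 'R_']
print(set(sketch) == set(documented))  # True
```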

overlap-ngrams#3

declare  function simt:overlap-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the overlap similarity coefficient between sets of character n-grams extracted from two strings.

Example usage :

 overlap-ngrams("DWAYNE", "DUANE", 2 )

The function invocation in the example above returns :

0.5

Parameters

s1 as xs:string: The first string.

s2 as xs:string: The second string.

n as xs:integer: The number of characters to consider when extracting n-grams.

Returns

xs:double: The overlap similarity coefficient between the sets of character n-grams extracted from the two strings.

overlap-tokens#3

declare  function simt:overlap-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the overlap similarity coefficient between sets of tokens extracted from two strings.

Example usage :

 overlap-tokens("The FLWOR Foundation", "FLWOR Found.", " +" )

The function invocation in the example above returns :

0.5

Parameters

s1 as xs:string: The first string.

s2 as xs:string: The second string.

r as xs:string: A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

xs:double: The overlap similarity coefficient between the sets of tokens extracted from the two strings.
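The overlap coefficient is conventionally the size of the intersection divided by the size of the smaller set. The Python sketch below applies that definition to token sets. It is an illustration rather than the module's XQuery source, and returning 0.0 when either string yields no tokens is an assumption made here.

```python
import re


def overlap_tokens(s1, s2, delimiter):
    """Overlap coefficient over the sets of tokens split on a delimiter regex."""
    a = {t for t in re.split(delimiter, s1) if t}
    b = {t for t in re.split(delimiter, s2) if t}
    if not a or not b:
        return 0.0  # assumption: an empty token set scores zero
    return len(a & b) / min(len(a), len(b))


score = overlap_tokens("The FLWOR Foundation", "FLWOR Found.", " +")
print(score)  # 0.5
```

Because the denominator is the smaller set, a short string whose tokens all appear in a longer string scores 1.0 even when the longer string contains many extra tokens.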
