This library module provides character-based string similarity functions that view strings as sequences of characters. Most of them compute a similarity score that corresponds to the cost of transforming one string into another. These functions are particularly useful for matching near-duplicate strings in the presence of typographical errors.
The logic contained in this module is not specific to any particular XQuery implementation.
edit-distance
($s1 as xs:string, $s2 as xs:string) as xs:integer
Returns the edit distance between two strings.
jaro-winkler
($s1 as xs:string, $s2 as xs:string, $prefix as xs:integer, $fact as xs:double) as xs:double
Returns the Jaro-Winkler similarity coefficient between two strings.
jaro
($s1 as xs:string, $s2 as xs:string) as xs:double
Returns the Jaro similarity coefficient between two strings.
needleman-wunsch
($s1 as xs:string, $s2 as xs:string, $score as xs:integer, $penalty as xs:integer) as xs:double
Returns the Needleman-Wunsch distance between two strings.
smith-waterman
($s1 as xs:string, $s2 as xs:string, $score as xs:integer, $penalty as xs:integer) as xs:double
Returns the Smith-Waterman distance between two strings.
declare function simc:edit-distance($s1 as xs:string, $s2 as xs:string) as xs:integer
Returns the edit distance between two strings.
This distance, also referred to as the Levenshtein distance, is defined as the minimum number of edits needed to transform one string into the other, where the allowable edit operations are the insertion, deletion, or substitution of a single character.
Example usage:
edit-distance("FLWOR", "FLOWER")
The function invocation in the example above returns:
2
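The definition of the edit distance can be illustrated with a short dynamic-programming sketch. It is written in Python purely for illustration and is not the module's XQuery source; the function name `edit_distance` and the row-by-row layout are assumptions made for this example. The sketch computes the minimum edit cost for every pair of prefixes and reproduces the value from the example above.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning s1 into s2."""
    # previous[j] is the distance between the processed prefix of s1 and s2[:j].
    # For the empty prefix of s1, reaching s2[:j] takes j insertions.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        # Turning s1[:i] into the empty string takes i deletions.
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            substitution = previous[j - 1] + (0 if c1 == c2 else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("FLWOR", "FLOWER"))  # prints 2
```

One way to reach the result: insert "O" after "L" (giving "FLOWOR"), then substitute the second "O" with "E".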
declare function simc:jaro-winkler($s1 as xs:string, $s2 as xs:string, $prefix as xs:integer, $fact as xs:double) as xs:double
Returns the Jaro-Winkler similarity coefficient between two strings.
This similarity coefficient corresponds to an extension of the Jaro similarity coefficient that weights or penalizes strings based on their similarity at the beginning of the string, up to a given prefix size.
Example usage:
jaro-winkler("DWAYNE", "DUANE", 4, 0.1)
The function invocation in the example above returns:
0.8577777777777778
declare function simc:jaro($s1 as xs:string, $s2 as xs:string) as xs:double
Returns the Jaro similarity coefficient between two strings.
This similarity coefficient is based on the number of transposed characters and on a weighted sum of the percentage of matched characters held within the strings. The higher the Jaro value is, the more similar the strings are. The coefficient is normalized such that 0 equates to no similarity and 1 is an exact match.
Example usage:
jaro("FLWOR Found.", "FLWOR Foundation")
The function invocation in the example above returns:
0.5853174603174603
declare function simc:needleman-wunsch($s1 as xs:string, $s2 as xs:string, $score as xs:integer, $penalty as xs:integer) as xs:double
Returns the Needleman-Wunsch distance between two strings.
The Needleman-Wunsch distance is similar to the basic edit distance metric, adding a variable cost adjustment to the cost of a gap (i.e., an insertion or deletion) in the distance metric.
Example usage:
needleman-wunsch("KAK", "KQRK", 1, 1)
The function invocation in the example above returns:
0
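The following Python sketch shows one way a Needleman-Wunsch value of this kind can be computed. It is an illustration, not the module's XQuery source. It assumes that a matching pair of characters adds `score` and that both a mismatch and a gap subtract `penalty`. That scoring convention is an assumption: it yields 0 for the example above, but the module's actual rules may differ. The name `needleman_wunsch` is likewise chosen only for this example.

```python
def needleman_wunsch(s1: str, s2: str, score: int, penalty: int) -> int:
    """Global alignment value under the assumed scoring:
    +score for a match, -penalty for a mismatch or a gap."""
    # Aligning the empty prefix of s1 with s2[:j] requires j gaps.
    previous = [-penalty * j for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        # Aligning s1[:i] with the empty string requires i gaps.
        current = [-penalty * i]
        for j, c2 in enumerate(s2, start=1):
            diagonal = previous[j - 1] + (score if c1 == c2 else -penalty)
            gap_in_s2 = previous[j] - penalty     # c1 aligned with a gap
            gap_in_s1 = current[j - 1] - penalty  # c2 aligned with a gap
            current.append(max(diagonal, gap_in_s2, gap_in_s1))
        previous = current
    return previous[-1]


print(needleman_wunsch("KAK", "KQRK", 1, 1))  # prints 0
```

Under these assumptions, one optimal alignment pairs "K-AK" with "KQRK": two matches (+2), one gap (-1), and one mismatch (-1).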
declare function simc:smith-waterman($s1 as xs:string, $s2 as xs:string, $score as xs:integer, $penalty as xs:integer) as xs:double
Returns the Smith-Waterman distance between two strings.
Example usage:
smith-waterman("ACACACTA", "AGCACACA", 2, 1)
The function invocation in the example above returns:
12
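Smith-Waterman is the local-alignment counterpart of Needleman-Wunsch: cell values are clamped at zero, and the result is the best value found anywhere in the table rather than the value for the full strings. The Python sketch below illustrates this under the same assumed scoring (a match adds `score`; a mismatch or a gap subtracts `penalty`). It is not the module's XQuery source, and the name `smith_waterman` is chosen only for this example. With these assumptions it reproduces the value from the example above.

```python
def smith_waterman(s1: str, s2: str, score: int, penalty: int) -> int:
    """Best local alignment value under the assumed scoring:
    +score for a match, -penalty for a mismatch or a gap."""
    best = 0
    # Row for the empty prefix of s1: a local alignment may start anywhere.
    previous = [0] * (len(s2) + 1)
    for c1 in s1:
        current = [0]
        for j, c2 in enumerate(s2, start=1):
            diagonal = previous[j - 1] + (score if c1 == c2 else -penalty)
            # Clamping at 0 lets an alignment restart instead of going negative.
            cell = max(0, diagonal, previous[j] - penalty, current[j - 1] - penalty)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


print(smith_waterman("ACACACTA", "AGCACACA", 2, 1))  # prints 12
```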