DIFFERENCE is different
My latest gig involves some quite heavy text-searching in SQL Server 2008 R2, and I was investigating into various approaches for phonetic searching in the case of misspellings of search phrases.
An approach that came to mind was the good olde SOUNDEX and DIFFERENCE built-in TSQL functions. So, I tried it but quickly became perplexed at how wrong comparisons can become..
Let's have a look with an example...
SELECT
SOUNDEX
(
'BUNINYONG'
),
SOUNDEX
(
'BOONANARRING')
The result is
----- -----
B555 B556
(1 row(s) affected)
Now let's add DIFFERENCE into the mix...
SELECT
SOUNDEX('BUNINYONG'), SOUNDEX('BOONANARRING'), DIFFERENCE('BUNINYONG', 'BOONANARRING'), DIFFERENCE('BOONANARRING', 'BUNINYONG')
The result is
----- ----- ----------- -----------
B555 B556 4 3
(1 row(s) affected)
Excuse me... what happened here? Wouldn't you expect the DIFFERENCE result to be the same regardless of the order that the strings are given?
It appears that the DIFFERENCE function is parameter order dependant because the comparison is performed in the manner that the second parameter is compared to the first in an ordinal AND value fashion. This old KB article explains it.
So for me, SOUNDEX and DIFFERENCE were no good, and off I was to find other algorithms and implementations.
I ending up using the Master Data Services similarity function, which is 'slower' but much more appropriate for my situation.
HTH!
PS: Some links were broken... and I've fixed them now. Sorry about that...