October 2, 2012 Leave a comment
“Don’t give me what I asked for! Give me what I want!” – How often did you shout that sentence at your computer during the last week?
Sometimes you misspell a word or a name. Sometimes you don’t know for sure how it is spelled but have a vague idea.
Google gives you suggestions when it thinks it knows what you really wanted as does RavenDB and many others.
Fuzzy string matching is an interesting topic. There are a million different algorithms that work better or worse in various scenarios. Just to get a feeling for what is available out there I decided to implement some of the most well-known. Namely these are
- SoundEx (the algorithm described on Wikipedia is the slightly modified American SoundEx)
- Hamming distance
- Levenshtein distance
- Jaccard index
- Damerau-Levenshtein distance
- Dice’s coefficient
The output of these algorithms varies from 4-character strings to values in the range of [0..1] to the number of differences between the compared strings. If you want to make the different algorithms interchangeable in your application you would need to convert these heterogeneous values to a specific numerical range. I made some attempts to that as well but did not put too much effort into it.
It was an interesting experiment!
Get the source code for the above algorithms here (project TecX.StringSimilarity and the test suite that puts them to use in TecX.StringSimilarity.Test).