I want to measure the similarity between two strings. Wikipedia describes several string similarity metrics, and code.google has a Python implementation of Levenshtein distance.
Is there a better algorithm (and hopefully a Python library) under these constraints?
- I want to do fuzzy matches between strings, e.g. matches('Hello, All you people', 'hello, all You peopl') should return True.
- False negatives are acceptable; false positives are not, except in extremely rare cases.
- This is done in a non-realtime setting, so speed is not (much) of a concern.
- [Edit] I am comparing multi-word strings.
Would something other than Levenshtein distance (or Levenshtein ratio) be a better algorithm for my case?
Answers:
Method 1
I realize it’s not the same thing, but this is close enough:
>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095
You can wrap this in a function:
def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9
>>> similar(a, b)
True
>>> similar('Hello, world', 'Hi, world')
False
Method 2
There’s a great resource for string similarity metrics at the University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and has open-source implementations of them. Looks like many of them should be easy to adapt into Python.
http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
Here’s a bit of the list:
- Hamming distance
- Levenshtein distance
- Needleman-Wunsch distance or Sellers Algorithm
- and many more…
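To give a feel for how easy the first two metrics on that list are to write in Python, here is a minimal sketch. It is not taken from the Sheffield implementations; the function names are my own, and the Levenshtein version is the textbook two-row dynamic programming approach.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, x in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(hamming("karolin", "kathrin"))     # 3
print(levenshtein("kitten", "sitting"))  # 3
```

Hamming distance only works on strings of equal length, so for the multi-word strings in the question Levenshtein is the more useful of the two.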
Method 3
This snippet will calculate the difflib, Levenshtein, Sørensen, and Jaccard similarity values for two strings. In the snippet below, I was iterating over a tsv in which the strings of interest occupied columns [3] and [4] of the tsv. (pip install python-Levenshtein and pip install distance):
import codecs, difflib, Levenshtein, distance

with codecs.open("titles.tsv", "r", "utf-8") as f:
    # One row per line; drop the empty string left after the trailing newline.
    title_list = f.read().split("\n")[:-1]

for row in title_list:
    # Columns are tab-separated; the two strings to compare are in columns 3 and 4.
    sr = row.lower().split("\t")
    diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
    lev = Levenshtein.ratio(sr[3], sr[4])
    sor = 1 - distance.sorensen(sr[3], sr[4])
    jac = 1 - distance.jaccard(sr[3], sr[4])
    print(diffl, lev, sor, jac)
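For readers who have not met the last two measures, here is a rough sketch of the set-based definitions of Jaccard and Sørensen similarity. This is my own illustration, not the code of the distance package, and I have not checked that the package treats its input the same way. Passing a string compares sets of characters; passing a list of words compares sets of words.

```python
def jaccard_similarity(a, b):
    """|A & B| / |A | B| for two collections of hashable items."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty inputs count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def sorensen_similarity(a, b):
    """2 * |A & B| / (|A| + |B|) for two collections of hashable items."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty inputs count as identical
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


# Character sets: {'a', 'p', 'e'} vs {'a', 'p', 'l', 'e'}
print(jaccard_similarity("ape", "apple"))  # 0.75

# Word sets: three of the five distinct words are shared
left = "hello, all you people".split()
right = "hello, all you peopl".split()
print(jaccard_similarity(left, right))     # 0.6
```

Set-based measures ignore order and repetition, so on characters they score anagrams such as 'listen' and 'silent' as identical. Given the question's low tolerance for false positives, that is worth keeping in mind.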
Method 4
I would use Levenshtein distance, or the so-called Damerau distance (which takes transpositions into account), rather than the difflib stuff, for two reasons: (1) “fast enough” (dynamic programming algo) and “whoooosh” (bit-bashing) C code is available, and (2) the behaviour is well understood, e.g. Levenshtein satisfies the triangle inequality and thus can be used in e.g. a Burkhard-Keller tree.
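To show why the triangle inequality matters, here is a minimal Burkhard-Keller tree sketch. The class and method names are my own invention and the Levenshtein function is a plain textbook version, so treat it as an illustration of the idea rather than a tuned implementation.

```python
def levenshtein(a, b):
    """Textbook two-row dynamic programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (x != y)))
        previous = current
    return previous[-1]


class BKTree:
    """Each node is (word, children), where children maps distance -> child node."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_distance):
        """Return sorted (distance, word) pairs within max_distance of word."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_distance:
                results.append((d, candidate))
            # Triangle inequality: a match can only sit under an edge whose
            # label lies in [d - max_distance, d + max_distance].
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)


words = ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]
tree = BKTree(levenshtein)
for w in words:
    tree.add(w)

print(tree.search("bok", 1))  # [(1, 'boo'), (1, 'book')]
```

The pruning step is only valid for a true metric, which is the point of reason (2): a similarity score without the triangle inequality cannot be indexed this way.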
Threshold: you should treat as “positive” only those cases where distance < (1 - X) * max(len(string1), len(string2)) and adjust X (the similarity factor) to suit yourself. One way of choosing X is to get a sample of matches, calculate X for each, ignore cases where X < say 0.8 or 0.9, then sort the remainder in descending order of X, eyeball them, record the correct result for each, and calculate some cost-of-mistakes measure for various levels of X.
N.B. Your ape/apple example has distance 2, so X is 0.6 … I would only use a threshold as low as 0.75 if I were desperately looking for something and had a high false-negative penalty.
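The threshold rule can be sketched as follows. I have used the optimal string alignment variant of Damerau distance (Levenshtein plus adjacent transpositions) as the distance; that choice, the function names, and the handling of identical strings are my assumptions, and any edit distance can be dropped in instead.

```python
def osa_distance(a, b):
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def similarity_factor(a, b):
    """The X of a pair: 1 - distance / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings are identical
    return 1 - osa_distance(a, b) / longest


def is_match(a, b, x=0.9):
    """Positive only if distance < (1 - x) * max(len(a), len(b))."""
    if a == b:
        return True  # assumption: identical strings always match
    return osa_distance(a, b) < (1 - x) * max(len(a), len(b))


print(osa_distance("ape", "apple"))    # 2
print(osa_distance("appel", "apple"))  # 1 (one transposition)
print(is_match("hello, all you people", "hello, all you peopl"))  # True
print(is_match("ape", "apple", x=0.75))                           # False
```

Case is left to the caller, so lower-case both strings first if that is what you want. Note that the optimal string alignment variant does not satisfy the triangle inequality, so it is fine for pairwise thresholding like this but not for a Burkhard-Keller tree.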
Method 5
Is that what you mean?
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
See http://docs.python.org/library/difflib.html#difflib.get_close_matches for details.
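Since false positives are the main worry, it may help to know that get_close_matches accepts n (maximum number of results, default 3) and cutoff (minimum ratio, default 0.6). Here is a small sketch that raises the cutoff; the helper name best_match and the 0.9 default are my own choices, not part of difflib.

```python
from difflib import get_close_matches


def best_match(word, candidates, cutoff=0.9):
    """Return the single closest candidate, or None if nothing reaches cutoff."""
    matches = get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


candidates = ['ape', 'apple', 'peach', 'puppy']

# 'appel' vs 'apple' has ratio 0.8, so the strict default rejects it.
print(best_match('appel', candidates))              # None
print(best_match('appel', candidates, cutoff=0.7))  # apple
```

A higher cutoff trades more false negatives for fewer false positives, which matches the constraints in the question.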
Method 6
I know this isn’t the same but you can adjust the ratio to filter out strings that are not similar enough and return the closest match to the string you are looking for.
Perhaps you would be more interested in semantic similarity metrics.
I realize you said speed is not an issue, but if you are processing a lot of strings, the code below is very helpful.
def spellcheck(self, sentence):
    #return ' '.join([difflib.get_close_matches(word, wordlist,1 , 0)[0] for word in sentence.split()])
    return ' '.join( [ sorted( { Levenshtein.ratio(x, word):x for x in wordlist }.items(), reverse=True)[0][1] for word in sentence.split() ] )
It’s about 20 times faster than difflib.
https://pypi.python.org/pypi/python-Levenshtein/
import Levenshtein
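The same idea reads a little more easily with max() and a key function, which also avoids building and sorting a dictionary per word. This is only a sketch: the standalone signature, the wordlist parameter, and the pluggable score argument are my own additions, and it defaults to difflib so it runs without any third-party package.

```python
import difflib


def difflib_ratio(a, b):
    """Similarity in [0, 1] from the standard library."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def spellcheck(sentence, wordlist, score=difflib_ratio):
    """Replace each word in sentence with its highest-scoring entry in wordlist."""
    return ' '.join(
        max(wordlist, key=lambda known: score(known, word))
        for word in sentence.split()
    )


print(spellcheck('appel pie', ['apple', 'pie', 'peach']))  # apple pie
```

With python-Levenshtein installed you can pass score=Levenshtein.ratio to get the faster scorer from the answer. Like the original, this always returns some word, even for a poor match, so add a minimum score if false positives matter.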
Method 7
To avoid false positives, the method nratio() from the library ngramratio may help.
$ pip install ngramratio
>>> from ngramratio import ngramratio
>>> SequenceMatcherExtended = ngramratio.SequenceMatcherExtended
>>> a = 'Hi there'
>>> b = 'Hit here'
>>> seq = SequenceMatcherExtended(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.875
>>> seq.nratio(1)  # this replicates `seq.ratio`
0.875
>>> seq.nratio(2)
0.75
>>> seq.nratio(3)
0.5
nratio(n) only matches n-grams of length >= n.
You can pick a value for n, say n = 2, and create a boolean similarity function as Nadia did in a previous reply (Method 1):
def similar(seq1, seq2):
    return SequenceMatcherExtended(a=seq1.lower(), b=seq2.lower()).nratio(2) > 0.8
>>> similar(a, b)
False
>>> similar('Hi there', 'Hi ther')
True
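If you would rather not install anything, the idea can be approximated with difflib alone. This sketch is not the ngramratio source; it assumes nratio(n) means "count only matching blocks of at least n characters", which does reproduce the numbers shown above for 'hi there' and 'hit here'.

```python
import difflib


def nratio(a, b, n=1):
    """Like SequenceMatcher.ratio(), but ignoring matching blocks shorter than n."""
    matcher = difflib.SequenceMatcher(None, a, b)
    matched = sum(block.size for block in matcher.get_matching_blocks()
                  if block.size >= n)
    total = len(a) + len(b)
    return 2.0 * matched / total if total else 1.0


a, b = 'hi there', 'hit here'
print(nratio(a, b, 1))  # 0.875
print(nratio(a, b, 2))  # 0.75
print(nratio(a, b, 3))  # 0.5
```

Ignoring short blocks stops scattered single-character coincidences from inflating the score, which is why a larger n cuts down on false positives.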
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.