I need to find the longest sequence in a string with the caveat that the sequence must be repeated three or more times. So, for example, if my string is:
fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld
then I would like the value “helloworld” to be returned.
I know of a few ways of accomplishing this but the problem I’m facing is that the actual string is absurdly large so I’m really looking for a method that can do it in a timely fashion.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
This problem is a variant of the longest repeated substring problem and there is an O(n)-time algorithm for solving it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate all the nodes in the tree with the number of descendants (time O(n) using a DFS), and then to find the deepest node in the tree with at least three descendants (time O(n) using a DFS). This overall algorithm takes time O(n).
That said, suffix trees are notoriously hard to construct, so you would probably want to find a Python library that implements suffix trees for you before attempting this implementation. A quick Google search turns up this library, though I’m not sure whether this is a good implementation.
Another option would be to use suffix arrays in conjunction with LCP arrays. You can iterate over pairs of adjacent elements in the LCP array, taking the minimum of each pair, and store the largest number you find this way. That will correspond to the length of the longest string that repeats at least three times, and from there you can then read off the string itself.
There are several simple algorithms for building suffix arrays (the Manber-Myers algorithm runs in time O(n log n) and isn’t too hard to code up), and Kasai’s algorithm builds LCP arrays in time O(n) and is fairly straightforward to code up.
Hope this helps!
Method 2
Use defaultdict to tally each substring beginning with each position in the input string. The OP wasn’t clear whether overlapping matches should or shouldn’t be included, this brute force method includes them.
from collections import defaultdict
def getsubs(loc, s):
substr = s[loc:]
i = -1
while(substr):
yield substr
substr = s[loc:i]
i -= 1
def longestRepetitiveSubstring(r, minocc=3):
occ = defaultdict(int)
# tally all occurrences of all substrings
for i in range(len(r)):
for sub in getsubs(i,r):
occ[sub] += 1
# filter out all substrings with fewer than minocc occurrences
occ_minocc = [k for k,v in occ.items() if v >= minocc]
if occ_minocc:
maxkey = max(occ_minocc, key=len)
return maxkey, occ[maxkey]
else:
raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r,minocc))
prints:
('helloworld', 3)
Method 3
Let’s start from the end, count the frequency and stop as soon as the most frequent element appears 3 or more times.
from collections import Counter
a='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times=3
for n in range(1,len(a)/times+1)[::-1]:
substrings=[a[i:i+n] for i in range(len(a)-n+1)]
freqs=Counter(substrings)
if freqs.most_common(1)[0][1]>=3:
seq=freqs.most_common(1)[0][0]
break
print "sequence '%s' of length %s occurs %s or more times"%(seq,n,times)
Result:
>>> sequence 'helloworld' of length 10 occurs 3 or more times
Edit: if you have the feeling that you’re dealing with random input and the common substring should be of small length, you better start (if you need the speed) with small substrings and stop when you can’t find any that appear at least 3 time:
from collections import Counter
a='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times=3
for n in range(1,len(a)/times+1):
substrings=[a[i:i+n] for i in range(len(a)-n+1)]
freqs=Counter(substrings)
if freqs.most_common(1)[0][1]<3:
n-=1
break
else:
seq=freqs.most_common(1)[0][0]
print "sequence '%s' of length %s occurs %s or more times"%(seq,n,times)
The same result as above.
Method 4
The first idea that came to mind is searching with progressively larger regular expressions:
import re
text = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
largest = ''
i = 1
while 1:
m = re.search("(" + ("w" * i) + ").*\1.*\1", text)
if not m:
break
largest = m.group(1)
i += 1
print largest # helloworld
The code ran successfully. The time complexity appears to be at least O(n^2).
Method 5
If you reverse the input string, then feed it to a regex like (.+)(?:.*1){2}
It should give you the longest string repeated 3 times. (Reverse capture group 1 for the answer)
Edit:
I have to say cancel this way. It’s dependent on the first match. Unless its tested against a curr length vs max length so far, in an itterative loop, regex won’t work for this.
Method 6
from collections import Counter
def Longest(string):
b = []
le = []
for i in set(string):
for j in range(Counter(string)[i]+1):
b.append(i* (j+1))
for i in b:
if i in string:
le.append(i)
return ([s for s in le if len(s)==len(max( le , key = len))])
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0