Best way to determine if a sequence is in another sequence?

This is a generalization of the “string contains substring” problem to (more) arbitrary types.

Given an sequence (such as a list or tuple), what’s the best way of determining whether another sequence is inside it? As a bonus, it should return the index of the element where the subsequence starts:

Example usage (Sequence in Sequence):

>>> seq_in_seq([5,6],  [4,'a',3,5,6])
3
>>> seq_in_seq([5,7],  [4,'a',3,5,6])
-1 # or None, or whatever

So far, I just rely on brute force and it seems slow, ugly, and clumsy.

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

I second the Knuth-Morris-Pratt algorithm. By the way, your problem (and the KMP solution) is exactly recipe 5.13 in Python Cookbook 2nd edition. You can find the related code at http://code.activestate.com/recipes/117214/

It finds all the correct subsequences in a given sequence, and should be used as an iterator:

>>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,6]): print s
3
>>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,7]): print s
(nothing)

Method 2

Here’s a brute-force approach O(n*m) (similar to @mcella’s answer). It might be faster than the Knuth-Morris-Pratt algorithm implementation in pure Python O(n+m) (see @Gregg Lind answer) for small input sequences.

#!/usr/bin/env python
def index(subseq, seq):
    """Return an index of `subseq`uence in the `seq`uence.

    Or `-1` if `subseq` is not a subsequence of the `seq`.

    The time complexity of the algorithm is O(n*m), where

        n, m = len(seq), len(subseq)

    >>> index([1,2], range(5))
    1
    >>> index(range(1, 6), range(5))
    -1
    >>> index(range(5), range(5))
    0
    >>> index([1,2], [0, 1, 0, 1, 2])
    3
    """
    i, n, m = -1, len(seq), len(subseq)
    try:
        while True:
            i = seq.index(subseq[0], i + 1, n - m + 1)
            if subseq == seq[i:i + m]:
               return i
    except ValueError:
        return -1

if __name__ == '__main__':
    import doctest; doctest.testmod()

I wonder how large is the small in this case?

Method 3

A simple approach: Convert to strings and rely on string matching.

Example using lists of strings:

 >>> f = ["foo", "bar", "baz"]
 >>> g = ["foo", "bar"]
 >>> ff = str(f).strip("[]")
 >>> gg = str(g).strip("[]")
 >>> gg in ff
 True

Example using tuples of strings:

>>> x = ("foo", "bar", "baz")
>>> y = ("bar", "baz")
>>> xx = str(x).strip("()")
>>> yy = str(y).strip("()")
>>> yy in xx
True

Example using lists of numbers:

>>> f = [1 , 2, 3, 4, 5, 6, 7]
>>> g = [4, 5, 6]
>>> ff = str(f).strip("[]")
>>> gg = str(g).strip("[]")
>>> gg in ff
True

Method 4

Same thing as string matching sir…Knuth-Morris-Pratt string matching

Method 5

>>> def seq_in_seq(subseq, seq):
...     while subseq[0] in seq:
...         index = seq.index(subseq[0])
...         if subseq == seq[index:index + len(subseq)]:
...             return index
...         else:
...             seq = seq[index + 1:]
...     else:
...         return -1
... 
>>> seq_in_seq([5,6], [4,'a',3,5,6])
3
>>> seq_in_seq([5,7], [4,'a',3,5,6])
-1

Sorry I’m not an algorithm expert, it’s just the fastest thing my mind can think about at the moment, at least I think it looks nice (to me) and I had fun coding it. 😉

Most probably it’s the same thing your brute force approach is doing.

Method 6

Brute force may be fine for small patterns.

For larger ones, look at the Aho-Corasick algorithm.

Method 7

Here is another KMP implementation:

from itertools import tee

def seq_in_seq(seq1,seq2):
    '''
    Return the index where seq1 appears in seq2, or -1 if 
    seq1 is not in seq2, using the Knuth-Morris-Pratt algorithm

    based heavily on code by Neale Pickett <<a href="https://getridbug.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="d4bab1b5b8b194a3bbbbaeb8b1fabba6b3">[email protected]</a>>
    found at:  woozle.org/~neale/src/python/kmp.py

    >>> seq_in_seq(range(3),range(5))
    0
    >>> seq_in_seq(range(3)[-1:],range(5))
    2
    >>>seq_in_seq(range(6),range(5))
    -1
    '''
    def compute_prefix_function(p):
        m = len(p)
        pi = [0] * m
        k = 0
        for q in xrange(1, m):
            while k > 0 and p[k] != p[q]:
                k = pi[k - 1]
            if p[k] == p[q]:
                k = k + 1
            pi[q] = k
        return pi

    t,p = list(tee(seq2)[0]), list(tee(seq1)[0])
    m,n = len(p),len(t)
    pi = compute_prefix_function(p)
    q = 0
    for i in range(n):
        while q > 0 and p[q] != t[i]:
            q = pi[q - 1]
        if p[q] == t[i]:
            q = q + 1
        if q == m:
            return i - m + 1
    return -1

Method 8

I’m a bit late to the party, but here’s something simple using strings:

>>> def seq_in_seq(sub, full):
...     f = ''.join([repr(d) for d in full]).replace("'", "")
...     s = ''.join([repr(d) for d in sub]).replace("'", "")
...     #return f.find(s) #<-- not reliable for finding indices in all cases
...     return s in f
...
>>> seq_in_seq([5,6], [4,'a',3,5,6])
True
>>> seq_in_seq([5,7], [4,'a',3,5,6])
False
>>> seq_in_seq([4,'abc',33], [4,'abc',33,5,6])
True

As noted by Ilya V. Schurov, the find method in this case will not return the correct indices with multi-character strings or multi-digit numbers.

Method 9

For what it’s worth, I tried using a deque like so:

from collections import deque
from itertools import islice

def seq_in_seq(needle, haystack):
    """Generator of indices where needle is found in haystack."""
    needle = deque(needle)
    haystack = iter(haystack)  # Works with iterators/streams!
    length = len(needle)
    # Deque will automatically call deque.popleft() after deque.append()
    # with the `maxlen` set equal to the needle length.
    window = deque(islice(haystack, length), maxlen=length)
    if needle == window:
        yield 0  # Match at the start of the haystack.
    for index, value in enumerate(haystack, start=1):
        window.append(value)
        if needle == window:
            yield index

One advantage of the deque implementation is that it makes only a single linear pass over the haystack. So if the haystack is streaming then it will still work (unlike the solutions that rely on slicing).

The solution is still brute-force, O(n*m). Some simple local benchmarking showed it was ~100x slower than the C-implementation of string searching in str.index.

Method 10

Another approach, using sets:

set([5,6])== set([5,6])&set([4,'a',3,5,6])
True


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x