— I just parsed a big file and I created a list containing 42.000 strings/words. I want to query [against this list] to check if a given word/string belongs to it. So my question is:
What is the most efficient way for such a lookup?
A first approach is to sort the list (list.sort()) and then just use
>> if word in list: print 'word'
which is really trivial and I am sure there is a better way to do it. My goal is to apply a fast lookup that finds whether a given string is in this list or not. If you have any ideas of another data structure, they are welcome. Yet, I want to avoid for now more sophisticated data-structures like Tries etc. I am interested in hearing ideas (or tricks) about fast lookups or any other python library methods that might do the search faster than the simple in.
And also i want to know the index of the search item
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Don’t create a list, create a set. It does lookups in constant time.
If you don’t want the memory overhead of a set then keep a sorted list and search through it with the bisect module.
from bisect import bisect_left
def bi_contains(lst, item):
""" efficient `item in lst` for sorted lists """
# if item is larger than the last its not in the list, but the bisect would
# find `len(lst)` as the index to insert, so check that first. Else, if the
# item is in the list then it has to be at index bisect_left(lst, item)
return (item <= lst[-1]) and (lst[bisect_left(lst, item)] == item)
Method 2
A point about sets versus lists that hasn’t been considered: in “parsing a big file” one would expect to need to handle duplicate words/strings. You haven’t mentioned this at all.
Obviously adding new words to a set removes duplicates on the fly, at no additional cost of CPU time or your thinking time. If you try that with a list it ends up O(N**2). If you append everything to a list and remove duplicates at the end, the smartest way of doing that is … drum roll … use a set, and the (small) memory advantage of a list is likely to be overwhelmed by the duplicates.
Method 3
Using this program it looks like dicts are the fastes, set second, list with bi_contains third:
from datetime import datetime
def ReadWordList():
""" Loop through each line in english.txt and add it to the list in uppercase.
Returns:
Returns array with all the words in english.txt.
"""
l_words = []
with open(r'c:english.txt', 'r') as f_in:
for line in f_in:
line = line.strip().upper()
l_words.append(line)
return l_words
# Loop through each line in english.txt and add it to the l_words list in uppercase.
l_words = ReadWordList()
l_words = {key: None for key in l_words}
#l_words = set(l_words)
#l_words = tuple(l_words)
t1 = datetime.now()
for i in range(10000):
#w = 'ZEBRA' in l_words
w = bi_contains(l_words, 'ZEBRA')
t2 = datetime.now()
print('After: ' + str(t2 - t1))
# list = 41.025293 seconds
# dict = 0.001488 seconds
# set = 0.001499 seconds
# tuple = 38.975805 seconds
# list with bi_contains = 0.014000 seconds
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0