What is the complexity of the function most_common provided by the collections.Counter object in Python?
More specifically, is Counter keeping some kind of sorted list while it’s counting, allowing it to perform the most_common operation faster than O(n) when n is the number of (unique) items added to the counter? For you information, I am processing some large amount of text data trying to find the n-th most frequent tokens.
I checked the official documentation and the TimeComplexity article on the CPython wiki but I couldn’t find the answer.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
From the source code of collections.py, we see that if we don’t specify a number of returned elements, most_common returns a sorted list of the counts. This is an O(n log n) algorithm.
If we use most_common to return k > 1 elements, then we use heapq.nlargest. This is an O(k) + O((n - k) log k) + O(k log k) algorithm, which is very good for a small constant k, since it’s essentialy linear. The O(k) part comes from heapifying the initial k counts, the second part from n - k calls to heappushpop method and the third part from sorting the final heap of k elements. Since k <= n we can conclude that the complexity is:
O(n log k)
If k = 1 then it’s easy to show that the complexity is:
O(n)
Method 2
The source shows exactly what happens:
def most_common(self, n=None):
'''List the n most common elements and their counts from the most
common to the least. If n is None, then list all element counts.
>>> Counter('abracadabra').most_common(3)
[('a', 5), ('r', 2), ('b', 2)]
'''
# Emulate Bag.sortedByCount from Smalltalk
if n is None:
return sorted(self.iteritems(), key=_itemgetter(1), reverse=True)
return _heapq.nlargest(n, self.iteritems(), key=_itemgetter(1))
heapq.nlargest is defined in heapq.py
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0