I obtained very surprising results with timeit, can someone tell me if I am doing something wrong ? I am using Python 2.7.
This is the contents of file speedtest_init.py:
import random to_count = [random.randint(0, 100) for r in range(60)]
These are the contents of speedtest.py:
__author__ = 'BlueTrin'
import timeit
def test_init1():
print(timeit.timeit('import speedtest_init'))
def test_counter1():
s = """
d = defaultdict(int);
for i in speedtest_init.to_count:
d[i] += 1
"""
print(timeit.timeit(s, 'from collections import defaultdict; import speedtest_init;'))
def test_counter2():
print(timeit.timeit('d = Counter(speedtest_init.to_count);', 'from collections import Counter; import speedtest_init;'))
if __name__ == "__main__":
test_init1()
test_counter1()
test_counter2()
The console output is:
C:Python27python.exe C:/Dev/codility/chlorum2014/speedtest.py 2.71501962931 65.7090444503 91.2953839048 Process finished with exit code 0
I think by default timeit() runs 1000000 times the code, so I need to divide the times by 1000000, but what is surprising is that the Counter is slower than the defaultdict().
Is that expected ?
EDIT:
Also using a dict is faster than a defaultdict(int):
def test_counter3():
s = """
d = {};
for i in speedtest_init.to_count:
if i not in d:
d[i] = 1
else:
d[i] += 1
"""
print(timeit.timeit(stmt=s, setup='from collections import defaultdict; import speedtest_init;')
this last version is faster than the defaultdict(int) meaning that unless you care more about readability you should use the dict() rather than the defaultdict().
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Yes, this is expected; the Counter() constructor uses Counter.update() which uses self.get() to load initial values rather than rely on __missing__.
Moreover, the defaultdict __missing__ factory is handled entirely in C code, especially when using another type like int() that is itself implemented in C. The Counter source is pure Python and as such the Counter.__missing__ method requires a Python frame to execute.
Because dict.get() is still handled in C, the constructor approach is the faster approach for a Counter(), provided you use the same trick Counter.update() uses and alias the self.get lookup as a local first:
>>> import timeit
>>> import random
>>> import sys
>>> sys.version_info
sys.version_info(major=2, minor=7, micro=9, releaselevel='final', serial=0)
>>> to_count = [random.randint(0, 100) for r in range(60)]
>>> timeit.timeit('for i in to_count: c[i] += 1',
... 'from collections import Counter; from __main__ import to_count; c = Counter()',
... number=10000)
0.2510359287261963
>>> timeit.timeit('for i in to_count: c[i] = c_get(i, 0) + 1',
... 'from collections import Counter; from __main__ import to_count; c = Counter(); c_get = c.get',
... number=10000)
0.20978617668151855
Both defaultdict and Counter are helpful classes built for their functionality, not their performance; not relying on the __missing__ hook can be faster still:
>>> timeit.timeit('for i in to_count: d[i] = d_get(i, 0) + 1',
... 'from __main__ import to_count; d = {}; d_get = d.get',
... number=10000)
0.11437392234802246
That’s a regular dictionary using an aliased dict.get() method for maximum speed. But then you’ll also have to re-implement the bag behaviour of Counter, or the Counter.most_common() method. The defaultdict use cases go way beyond counting.
In Python 3.2, updating a Counter() got a speed boost by adding a C library that handles this case; see issue 10667. Testing on Python 3.4, the Counter() constructor now beats the aliased dict.get case:
>>> timeit.timeit('Counter(to_count)',
... 'from collections import Counter; from __main__ import to_count',
... number=100000)
0.8332311600097455
>>> timeit.timeit('for i in to_count: d[i] = d_get(i, 0) + 1',
... 'from __main__ import to_count; d = {}; d_get = d.get',
... number=100000)
0.961191965994658
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=4, micro=2, releaselevel='final', serial=0)
(Note: to get a meaningful timing result the number of iterations was increased from 10k to 100k; so if you compare these against the dict.get() case above you need to take the timing there times ten, at 1.144 seconds).
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0