How can the length of the lists in the column be determine without iteration?
I have a dataframe like this:
CreationDate 2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] 2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] 2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik]
I am calculating the length of lists in the CreationDate column and making a new Length column like this:
df['Length'] = df.CreationDate.apply(lambda x: len(x))
Which gives me this:
CreationDate Length 2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] 3 2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] 4 2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik] 4
Is there a more pythonic way to do this?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
You can use the str accessor for some list operations as well. In this example,
df['CreationDate'].str.len()
returns the length of each list. See the docs for str.len.
df['Length'] = df['CreationDate'].str.len()
df
Out:
CreationDate Length
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] 3
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] 4
2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik] 4
For these operations, vanilla Python is generally faster. pandas handles NaNs though. Here are timings:
ser = pd.Series([random.sample(string.ascii_letters,
random.randint(1, 20)) for _ in range(10**6)])
%timeit ser.apply(lambda x: len(x))
1 loop, best of 3: 425 ms per loop
%timeit ser.str.len()
1 loop, best of 3: 248 ms per loop
%timeit [len(x) for x in ser]
10 loops, best of 3: 84 ms per loop
%timeit pd.Series([len(x) for x in ser], index=ser.index)
1 loop, best of 3: 236 ms per loop
Method 2
-
pandas.Series.map(len)andpandas.Series.apply(len)are equivalent in execution time, and slightly faster thanpandas.Series.str.len(). - Difference between map, applymap and apply methods in Pandas
import pandas as pd
data = {'os': [['ubuntu', 'mac-osx', 'syslinux'], ['ubuntu', 'mod-rewrite', 'laconica', 'apache-2.2'], ['ubuntu', 'nat', 'squid', 'mikrotik']]}
index = ['2013-12-22 15:25:02', '2009-12-14 14:29:32', '2013-12-22 15:42:00']
df = pd.DataFrame(data, index)
# create Length column
df['Length'] = df.os.map(len)
# display(df)
os Length
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] 3
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] 4
2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik] 4
%timeit
import pandas as pd
import random
import string
random.seed(365)
ser = pd.Series([random.sample(string.ascii_letters, random.randint(1, 20)) for _ in range(10**6)])
%timeit ser.str.len()
252 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ser.map(len)
220 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ser.apply(len)
222 ms ± 8.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0