I have a dataframe where one column is a list of groups each of my users belongs to. Something like:
index  groups
0      ['a','b','c']
1      ['c']
2      ['b','c','e']
3      ['a','c']
4      ['b','e']
What I would like to do is create a series of dummy columns that identify which groups each user belongs to, so that I can run some analyses:
index  a  b  c  d  e
0      1  1  1  0  0
1      0  0  1  0  0
2      0  1  1  0  1
3      1  0  1  0  0
4      0  1  0  0  0
pd.get_dummies(df['groups']) won’t work because that just returns a column for each different list in my column.
The solution needs to be efficient as the dataframe will contain 500,000+ rows. Any advice would be appreciated!
Answers:
Method 1
Using s for your df['groups']:
In [21]: s = pd.Series({0: ['a', 'b', 'c'], 1:['c'], 2: ['b', 'c', 'e'], 3: ['a', 'c'], 4: ['b', 'e'] })
In [22]: s
Out[22]:
0    [a, b, c]
1          [c]
2    [b, c, e]
3       [a, c]
4       [b, e]
dtype: object
This is a possible solution:
In [23]: pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
Out[23]:
   a  b  c  e
0  1  1  1  0
1  0  0  1  0
2  0  1  1  1
3  1  0  1  0
4  0  1  0  1
The logic of this is:
.apply(Series) converts the series of lists to a dataframe
.stack() puts everything in one column again (creating a multi-level index)
pd.get_dummies( ) creating the dummies
.sum(level=0) for remerging the different rows that should be one row (by summing up the second level, only keeping the original level (level=0))
A slightly different equivalent is pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='').sum(level=0, axis=1)
I don’t know whether this will be efficient enough, but in any case, if performance is important, storing lists in a dataframe is not a very good idea.
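The four steps described above can be sketched one at a time as follows. This is only an illustration, not the original answer's code: it assumes a recent pandas, where .sum(level=0) is no longer available (it was removed in pandas 2.0) and .groupby(level=0).sum() takes its place. The intermediate names wide, stacked, dummies and result are made up for this sketch.

```python
import pandas as pd

s = pd.Series({0: ['a', 'b', 'c'], 1: ['c'], 2: ['b', 'c', 'e'],
               3: ['a', 'c'], 4: ['b', 'e']})

# Step 1: one column per list position; shorter lists are padded with NaN.
wide = s.apply(pd.Series)

# Step 2: back to a single column with a (row, position) MultiIndex.
# stack() has traditionally dropped the NaN padding; the explicit dropna()
# is an assumption-proofing step in case that default differs in your version.
stacked = wide.stack().dropna()

# Step 3: one indicator column per distinct label.
dummies = pd.get_dummies(stacked)

# Step 4: collapse the position level so each original row appears once.
# groupby(level=0).sum() replaces the older .sum(level=0) spelling.
result = dummies.groupby(level=0).sum().astype(int)

print(result)
```

The final astype(int) is there because newer pandas versions return boolean dummies; summing and casting gives plain 0/1 integers either way.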
Method 2
Very fast solution in case you have a large dataframe
Using sklearn.preprocessing.MultiLabelBinarizer
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
df = pd.DataFrame(
    {'groups':
        [['a','b','c'],
         ['c'],
         ['b','c','e'],
         ['a','c'],
         ['b','e']]
    }, columns=['groups'])
s = df['groups']
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index)
Result:
   a  b  c  e
0  1  1  1  0
1  0  0  1  0
2  0  1  1  1
3  1  0  1  0
4  0  1  0  1
This worked for me and has also been suggested elsewhere.
Method 3
This is even faster:
pd.get_dummies(df['groups'].explode()).sum(level=0)
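This uses .explode() instead of .apply(pd.Series).stack(). On pandas 2.0 and later, where .sum(level=0) has been removed, write .groupby(level=0).sum() instead.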
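As a rough sketch of how the .explode() approach behaves on edge cases, the example below adds one user with an empty list (an assumed case, not part of the original data) and then reindexes the columns against a hypothetical list of all known groups, all_groups, so that a group nobody belongs to (such as d in the question) still gets a column of zeros. It uses .groupby(level=0).sum() so that it also runs on current pandas.

```python
import pandas as pd

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e'],
                              []]})  # index 5: assumed user with no groups

# explode() gives one row per list element and repeats the original index;
# an empty list becomes a single NaN row.
exploded = df['groups'].explode()

# get_dummies() ignores NaN by default, so the empty-list user ends up all zeros.
dummies = pd.get_dummies(exploded).groupby(level=0).sum().astype(int)

# Hypothetical full list of groups, including one with no members.
all_groups = ['a', 'b', 'c', 'd', 'e']
dummies = dummies.reindex(columns=all_groups, fill_value=0)

print(dummies)
```

Reindexing the columns is optional; without it, only the groups that actually occur in the data get a column.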
Comparing with the other solutions:
import timeit
import pandas as pd
setup = '''
import time
import pandas as pd
s = pd.Series({0:['a','b','c'],1:['c'],2:['b','c','e'],3:['a','c'],4:['b','e']})
df = s.rename('groups').to_frame()
'''
m1 = "pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)"
m2 = "df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')"
m3 = "pd.get_dummies(df['groups'].explode()).sum(level=0)"
times = {f"m{i+1}":min(timeit.Timer(m, setup=setup).repeat(7, 1000)) for i, m in enumerate([m1, m2, m3])}
pd.DataFrame([times],index=['ms'])
#           m1        m2        m3
# ms  5.586517  3.821662  2.547167
Method 4
Even though this question has already been answered, I have a faster solution:
df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')
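And, in case you have empty groups or NaN, you could filter those rows out first (note that the groups column has to be selected before calling apply):
df.loc[df.groups.str.len() > 0, 'groups'].apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')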
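A small sketch of that filtering step, under a few assumptions: the sample data with an empty list and a NaN is made up, the groups column is selected explicitly before apply, and .fillna(0).astype(int) stands in for downcast='infer'. The final reindex, which puts the filtered-out users back as all-zero rows, is an addition of this sketch rather than part of the answer.

```python
import numpy as np
import pandas as pd

# Assumed sample data: one empty list and one missing value.
df = pd.DataFrame({'groups': [['a', 'b'], [], np.nan, ['c']]})

# .str.len() also works on lists; it yields NaN for missing values,
# so the comparison is False for both [] and NaN.
has_groups = df['groups'].str.len() > 0

dummies = (
    df.loc[has_groups, 'groups']
      .apply(lambda x: pd.Series([1] * len(x), index=x))
      .fillna(0)
      .astype(int)
)

# Restore the rows that were filtered out, as all-zero rows.
dummies = dummies.reindex(df.index, fill_value=0)

print(dummies)
```

Without the reindex, the result simply has fewer rows than df, which matters if it is later joined back by position rather than by index.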
How it works
Inside the lambda, x is your list, for example ['a', 'b', 'c']. So pd.Series will be as follows:
In [2]: pd.Series([1, 1, 1], index=['a', 'b', 'c'])
Out[2]:
a    1
b    1
c    1
dtype: int64
When all the pd.Series come together, they become a pd.DataFrame and their indexes become columns; a label that is missing from a row shows up as NaN in that column, as you can see next:
In [4]: a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])
In [5]: b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])
In [6]: pd.DataFrame([a, b])
Out[6]:
     a    b    c    d
0  1.0  1.0  1.0  NaN
1  1.0  1.0  NaN  1.0
Now fillna fills those NaN with 0:
In [7]: pd.DataFrame([a, b]).fillna(0)
Out[7]:
     a    b    c    d
0  1.0  1.0  1.0  0.0
1  1.0  1.0  0.0  1.0
And downcast='infer' is to downcast from float to int:
In [11]: pd.DataFrame([a, b]).fillna(0, downcast='infer')
Out[11]:
   a  b  c  d
0  1  1  1  0
1  1  1  0  1
P.S.: Using .fillna(0, downcast='infer') is not required.
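For readers on recent pandas 2.x releases, where the downcast argument of fillna is deprecated, a minimal sketch of the same idea is shown below. It assumes that an explicit .astype(int) after .fillna(0) is an acceptable substitute, which holds here because every value is either 0 or 1; the names indicator and result are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

# Each list becomes a Series of ones indexed by its labels; pandas aligns
# these into a DataFrame and leaves NaN where a label is absent for a row.
indicator = df['groups'].apply(lambda x: pd.Series([1] * len(x), index=x))

# Fill the gaps with 0, then cast from float to int explicitly
# instead of relying on downcast='infer'.
result = indicator.fillna(0).astype(int)

# Fix the column order explicitly for display.
result = result[['a', 'b', 'c', 'e']]

print(result)
```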
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.