Using sample data:
df = pd.DataFrame({'key1' : ['a','a','b','b','a'],
'key2' : ['one', 'two', 'one', 'two', 'one'],
'data1' : np.random.randn(5),
'data2' : np. random.randn(5)})
df
data1 data2 key1 key2 0 0.361601 0.375297 a one 1 0.069889 0.809772 a two 2 1.468194 0.272929 b one 3 -1.138458 0.865060 b two 4 -0.268210 1.250340 a one
I’m trying to figure out how to group the data by key1 and sum only the data1 values where key2 equals ‘one’.
Here’s what I’ve tried
def f(d,a,b):
d.ix[d[a] == b, 'data1'].sum()
df.groupby(['key1']).apply(f, a = 'key2', b = 'one').reset_index()
But this gives me a dataframe with ‘None’ values
index key1 0 0 a None 1 b None
Any ideas here? I’m looking for the Pandas equivalent of the following SQL:
SELECT Key1, SUM(CASE WHEN Key2 = 'one' then data1 else 0 end) FROM df GROUP BY key1
FYI – I’ve seen conditional sums for pandas aggregate but couldn’t transform the answer provided there to work with sums rather than counts.
Thanks in advance
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
First groupby the key1 column:
In [11]: g = df.groupby('key1')
and then for each group take the subDataFrame where key2 equals ‘one’ and sum the data1 column:
In [12]: g.apply(lambda x: x[x['key2'] == 'one']['data1'].sum()) Out[12]: key1 a 0.093391 b 1.468194 dtype: float64
To explain what’s going on let’s look at the ‘a’ group:
In [21]: a = g.get_group('a')
In [22]: a
Out[22]:
data1 data2 key1 key2
0 0.361601 0.375297 a one
1 0.069889 0.809772 a two
4 -0.268210 1.250340 a one
In [23]: a[a['key2'] == 'one']
Out[23]:
data1 data2 key1 key2
0 0.361601 0.375297 a one
4 -0.268210 1.250340 a one
In [24]: a[a['key2'] == 'one']['data1']
Out[24]:
0 0.361601
4 -0.268210
Name: data1, dtype: float64
In [25]: a[a['key2'] == 'one']['data1'].sum()
Out[25]: 0.093391000000000002
It may be slightly easier/clearer to do this by restricting the dataframe to just those with key2 equals one first:
In [31]: df1 = df[df['key2'] == 'one']
In [32]: df1
Out[32]:
data1 data2 key1 key2
0 0.361601 0.375297 a one
2 1.468194 0.272929 b one
4 -0.268210 1.250340 a one
In [33]: df1.groupby('key1')['data1'].sum()
Out[33]:
key1
a 0.093391
b 1.468194
Name: data1, dtype: float64
Method 2
I think that today with pandas 0.23 you can do this:
import numpy as np
df.assign(result = np.where(df['key2']=='one',df.data1,0))
.groupby('key1').agg({'result':sum})
The advantage of this is that you can apply it to more than one column of the same dataframe
df.assign(
result1 = np.where(df['key2']=='one',df.data1,0),
result2 = np.where(df['key2']=='two',df.data1,0)
).groupby('key1').agg({'result1':sum, 'result2':sum})
Method 3
You can filter your dataframe before you perform your groupby operation. If this reduces your series index due to all values being out-of-scope, you can use reindex with fillna:
res = df.loc[df['key2'].eq('one')]
.groupby('key1')['data1'].sum()
.reindex(df['key1'].unique()).fillna(0)
print(res)
key1
a 3.631610
b 0.978738
c 0.000000
Name: data1, dtype: float64
Setup
I have added an additional row for demonstration purposes.
np.random.seed(0)
df = pd.DataFrame({'key1': ['a','a','b','b','a','c'],
'key2': ['one', 'two', 'one', 'two', 'one', 'two'],
'data1': np.random.randn(6),
'data2': np.random.randn(6)})
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0