I have a dataframe that looks like:
data = {'index': ['2014-06-22 10:46:00', '2014-06-24 19:52:00', '2014-06-25 17:02:00', '2014-06-25 17:55:00', '2014-07-02 11:36:00', '2014-07-06 12:40:00', '2014-07-05 12:46:00', '2014-07-27 15:12:00'],
'type': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'C'],
'sum_col': [1, 2, 3, 1, 1, 3, 2, 1]}
df = pd.DataFrame(data, columns=['index', 'type', 'sum_col'])
df['index'] = pd.to_datetime(df['index'])
df = df.set_index('index')
df['weekofyear'] = df.index.weekofyear
df['date'] = df.index.date
df['date'] = pd.to_datetime(df['date'])
type sum_col weekofyear date
index
2014-06-22 10:46:00 A 1 25 2014-06-22
2014-06-24 19:52:00 B 2 26 2014-06-24
2014-06-25 17:02:00 C 3 26 2014-06-25
2014-06-25 17:55:00 A 1 26 2014-06-25
2014-07-02 11:36:00 B 1 27 2014-07-02
2014-07-06 12:40:00 C 3 27 2014-07-06
2014-07-05 12:46:00 A 2 27 2014-07-05
2014-07-27 15:12:00 C 1 30 2014-07-27
I’m looking to groupby the weekofyear, then sum up the sum_col. In addition, I need to find the earliest, and the latest date for the week. The first part is pretty easy:
gb = df.groupby(['type', 'weekofyear'])
gb['sum_col'].agg({'sum_col' : np.sum})
I’ve tried to find the min/max date with this, but haven’t been successful:
gb = df.groupby(['type', 'weekofyear'])
gb.agg({'sum_col' : np.sum,
'date' : np.min,
'date' : np.max})
How would one find the earliest/latest date that appears?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
You need to combine the functions that apply to the same column, like this:
In [116]: gb.agg({'sum_col' : np.sum,
...: 'date' : [np.min, np.max]})
Out[116]:
date sum_col
amin amax sum
type weekofyear
A 25 2014-06-22 2014-06-22 1
26 2014-06-25 2014-06-25 1
27 2014-07-05 2014-07-05 2
B 26 2014-06-24 2014-06-24 2
27 2014-07-02 2014-07-02 1
C 26 2014-06-25 2014-06-25 3
27 2014-07-06 2014-07-06 3
30 2014-07-27 2014-07-27 1
Method 2
Simple code can be
df.groupby([key_field]).agg({'time_field': [np.min,np.max]})
where key_field here can be event_id and time_field can be timestamp field.
Method 3
Another possible solution where you can have more control over the resulting column names:
gb = df.groupby(['type', 'weekofyear'])
gb.agg(
sum_col=('sum_col', np.sum),
first_date=('date', np.min),
last_date=('date', np.max)
).reset_index()
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0