I have a pandas DataFrame like the following:
import pandas as pd

df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4,5,6,6,6,7,7],
                   'value' : ["first","second","second","first",
                              "second","first","third","fourth",
                              "fifth","second","fifth","first",
                              "first","second","third","fourth","fifth"]})
I want to group this by ["id","value"] and get the first row of each group:
    id   value
0    1   first
1    1  second
2    1  second
3    2   first
4    2  second
5    3   first
6    3   third
7    3  fourth
8    3   fifth
9    4  second
10   4   fifth
11   5   first
12   6   first
13   6  second
14   6   third
15   7  fourth
16   7   fifth
Expected outcome:
id value
1 first
2 first
3 first
4 second
5 first
6 first
7 fourth
I tried the following, which only gives the first row of the DataFrame. Any help with this is appreciated.
In [25]: for index, row in df.iterrows():
   ....:     df2 = pd.DataFrame(df.groupby(['id','value']).reset_index().ix[0])
Answers:
Method 1
>>> df.groupby('id').first()
value
id
1 first
2 first
3 first
4 second
5 first
6 first
7 fourth
If you need id as a column:
>>> df.groupby('id').first().reset_index()
id value
0 1 first
1 2 first
2 3 first
3 4 second
4 5 first
5 6 first
6 7 fourth
To get the first n records of each group, you can use head():
>>> df.groupby('id').head(2).reset_index(drop=True)
id value
0 1 first
1 1 second
2 2 first
3 2 second
4 3 first
5 3 third
6 4 second
7 4 fifth
8 5 first
9 6 first
10 6 second
11 7 fourth
12 7 fifth
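As a minimal sketch of the same idea, assuming the question's df, passing as_index=False to groupby keeps id as a regular column, so no separate reset_index() call is needed. The name first_rows is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first",
              "second", "first", "third", "fourth",
              "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# as_index=False returns the group key 'id' as an ordinary column
first_rows = df.groupby('id', as_index=False).first()
print(first_rows)
```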
Method 2
This will give you the second row of each group (nth is zero-indexed, so nth(0) selects the first row, much like first() apart from how missing values are handled):
df.groupby('id').nth(1)
Documentation: http://pandas.pydata.org/pandas-docs/stable/groupby.html#taking-the-nth-row-of-each-group
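A short sketch of nth(1) on the question's data, for illustration. Two things to keep in mind: groups with fewer than two rows (here id 5) simply do not appear in the result, and the shape of the result depends on the pandas version. Since pandas 2.0, nth acts as a filter and keeps the original row labels, while older versions index the result by id. The name second_rows is an assumption for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first",
              "second", "first", "third", "fourth",
              "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# Second row (position 1) of every group that has at least two rows
second_rows = df.groupby('id').nth(1)
print(second_rows)
```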
Method 3
I’d suggest using .nth(0) rather than .first() if you need the first row.
The difference between them is how they handle NaNs: .nth(0) returns the first row of each group regardless of the values in that row, while .first() returns the first non-NaN value in each column.
E.g. if your dataset is:
import numpy as np

df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4],
                   'value' : ["first","second","third", np.nan,
                              "second","first","second","third",
                              "fourth","first","second"]})
>>> df.groupby('id').nth(0)
value
id
1 first
2 NaN
3 first
4 first
And
>>> df.groupby('id').first()
value
id
1 first
2 second
3 first
4 first
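To make the NaN handling concrete, here is a small sketch that contrasts first() with head(1), which, like nth(0), returns the physical first row of each group. It uses head(1) rather than nth(0) only because its output is the same across pandas versions. The names df_nan, first_non_null and first_physical are illustrative.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4],
    'value': ["first", "second", "third", np.nan,
              "second", "first", "second", "third",
              "fourth", "first", "second"],
})

# first() skips missing values column by column
first_non_null = df_nan.groupby('id').first()

# head(1) keeps the first row of each group as it is, NaN included
first_physical = df_nan.groupby('id').head(1)

print(first_non_null)
print(first_physical)
```

For id 2, first() reports "second", while head(1) keeps the row whose value is NaN.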
Method 4
If you only need the first row from each group, you can use drop_duplicates. Note that its default is keep='first'.
df.drop_duplicates('id')
Out[1027]:
id value
0 1 first
3 2 first
5 3 first
9 4 second
11 5 first
12 6 first
15 7 fourth
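A brief sketch of how the keep parameter changes the result, assuming the question's df. The names first_per_id and last_per_id are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first",
              "second", "first", "third", "fourth",
              "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# keep='first' (the default) retains the first row seen for each id
first_per_id = df.drop_duplicates('id')

# keep='last' retains the last row seen for each id instead
last_per_id = df.drop_duplicates('id', keep='last')

print(first_per_id)
print(last_per_id)
```

Both results keep the original row labels; chain .reset_index(drop=True) if a fresh 0..n-1 index is preferred.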
Method 5
Maybe this is what you want: sort within each group, then take the top n rows per group.
import pandas as pd
idx = pd.MultiIndex.from_product([['state1','state2'], ['county1','county2','county3','county4']])
df = pd.DataFrame({'pop': [12,15,65,42,78,67,55,31]}, index=idx)
pop
state1 county1 12
county2 15
county3 65
county4 42
state2 county1 78
county2 67
county3 55
county4 31
df.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values('pop', ascending=False)).groupby(level=0).head(3)
Out[29]:
pop
state1 county3 65
county4 42
county2 15
state2 county1 78
county2 67
county3 55
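One possible alternative to the apply call above is a single sort followed by head(3). This is a sketch, not the answer's own method: it assumes the index levels are given the names 'state' and 'county' (they are unnamed in the original) so that sort_values can refer to the state level by name. The names df_pop and top3 are also illustrative.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['state1', 'state2'], ['county1', 'county2', 'county3', 'county4']],
    names=['state', 'county'],  # assumed level names, not in the original
)
df_pop = pd.DataFrame({'pop': [12, 15, 65, 42, 78, 67, 55, 31]}, index=idx)

# Sort by state, then by population descending within each state,
# and keep the first three rows of every state
top3 = (
    df_pop.sort_values(['state', 'pop'], ascending=[True, False])
          .groupby(level='state')
          .head(3)
)
print(top3)
```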
Method 6
I suppose “first” means you have already sorted your DataFrame as you want.
What I do is:
df.groupby('id').agg('first')
value
id
1 first
2 first
3 first
4 second
5 first
6 first
7 fourth
The nice thing is that you can plug in any function you want:
df.groupby('id').agg(['first','last','count'])
value
first last count
id
1 first second 3
2 first second 2
3 first fifth 4
4 second fifth 2
5 first first 1
6 first third 3
7 fourth fifth 2
The output DataFrame has MultiIndex columns:
MultiIndex([('value', 'first'),
('value', 'last'),
('value', 'count')],
)
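If flat column names are preferred over MultiIndex columns, named aggregation is one option. The sketch below assumes the question's df; the output column names first_value, last_value and n_rows are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first",
              "second", "first", "third", "fourth",
              "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# Each keyword is an output column: (source column, aggregation function)
summary = df.groupby('id').agg(
    first_value=('value', 'first'),
    last_value=('value', 'last'),
    n_rows=('value', 'count'),
)
print(summary)
```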
Method 7
Considering that the 'id' column is of numeric type, such as int32/int64, one might also use groupby.rank() as follows. Note that this groups by 'value' and keeps, for each value, the row with the lowest id:
[In]: df[df.groupby('value')['id'].rank() == 1]
[Out]:
id value
0 1 first
6 3 third
7 3 fourth
8 3 fifth
If one wants to reset the index, chain .reset_index(), for example:
[In]: df[df.groupby('value')['id'].rank() == 1].reset_index()
[Out]:
index id value
0 0 1 first
1 6 3 third
2 7 3 fourth
3 8 3 fifth
If the index and id columns are not needed, drop them from the result:
[In]: df[df.groupby('value')['id'].rank() == 1].reset_index().drop(['index', 'id'], axis=1)
[Out]:
value
0 first
1 third
2 fourth
3 fifth
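In the outputs above, "second" is missing: its two lowest ids are both 1, and rank() by default gives tied values their average rank (1.5), so no row in that group has rank exactly 1. As a hedged sketch, passing method='first' breaks ties by order of appearance, so every value keeps exactly one row. The name lowest_id_rows is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first",
              "second", "first", "third", "fourth",
              "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# method='first' assigns distinct ranks to ties, in order of appearance
mask = df.groupby('value')['id'].rank(method='first') == 1
lowest_id_rows = df[mask]
print(lowest_id_rows)
```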
Method 8
You can use the take method, which accepts a list of positional indices to select within each group:
df.groupby('id').take([0])
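Another way to select by position within each group, shown here as a swapped-in technique rather than a variant of take, is a boolean mask built from cumcount(), which numbers the rows of each group from 0. The sketch assumes the question's df; first_by_position is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first",
              "second", "first", "third", "fourth",
              "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
})

# cumcount() gives each row its 0-based position inside its group
position = df.groupby('id').cumcount()

# Position 0 is the first row of every group
first_by_position = df[position == 0]
print(first_by_position)
```

Changing the condition (for example position < 2) selects the first few rows per group instead.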
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.