Pandas dataframe get first row of each group

I have a pandas DataFrame like following:

df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4,5,6,6,6,7,7],
                'value'  : ["first","second","second","first",
                            "second","first","third","fourth",
                            "fifth","second","fifth","first",
                            "first","second","third","fourth","fifth"]})

I want to group this by ["id","value"] and get the first row of each group:

        id   value
0        1   first
1        1  second
2        1  second
3        2   first
4        2  second
5        3   first
6        3   third
7        3  fourth
8        3   fifth
9        4  second
10       4   fifth
11       5   first
12       6   first
13       6  second
14       6   third
15       7  fourth
16       7   fifth

Expected outcome:

    id   value
     1   first
     2   first
     3   first
     4  second
     5  first
     6  first
     7  fourth

I tried following, which only gives the first row of the DataFrame. Any help regarding this is appreciated.

In [25]: for index, row in df.iterrows():
   ....:     df2 = pd.DataFrame(df.groupby(['id','value']).reset_index().ix[0])

Contents hide

Answers:

Method 1

Method 2

Method 3

Method 4

Method 5

Method 6

Method 7

Method 8

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

>>> df.groupby('id').first()
     value
id        
1    first
2    first
3    first
4   second
5    first
6    first
7   fourth

If you need id as column:

>>> df.groupby('id').first().reset_index()
   id   value
0   1   first
1   2   first
2   3   first
3   4  second
4   5   first
5   6   first
6   7  fourth

To get n first records, you can use head():

>>> df.groupby('id').head(2).reset_index(drop=True)
    id   value
0    1   first
1    1  second
2    2   first
3    2  second
4    3   first
5    3   third
6    4  second
7    4   fifth
8    5   first
9    6   first
10   6  second
11   7  fourth
12   7   fifth

Method 2

This will give you the second row of each group (zero indexed, nth(0) is the same as first()):

df.groupby('id').nth(1)

Documentation: http://pandas.pydata.org/pandas-docs/stable/groupby.html#taking-the-nth-row-of-each-group

Method 3

I’d suggest to use .nth(0) rather than .first() if you need to get the first row.

The difference between them is how they handle NaNs, so .nth(0) will return the first row of group no matter what are the values in this row, while .first() will eventually return the first not NaN value in each column.

E.g. if your dataset is :

df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4],
            'value'  : ["first","second","third", np.NaN,
                        "second","first","second","third",
                        "fourth","first","second"]})

>>> df.groupby('id').nth(0)
    value
id        
1    first
2    NaN
3    first
4    first

And

>>> df.groupby('id').first()
    value
id        
1    first
2    second
3    first
4    first

Method 4

If you only need the first row from each group we can do with drop_duplicates, Notice the function default method keep='first'.

df.drop_duplicates('id')
Out[1027]: 
    id   value
0    1   first
3    2   first
5    3   first
9    4  second
11   5   first
12   6   first
15   7  fourth

Method 5

maybe this is what you want

import pandas as pd
idx = pd.MultiIndex.from_product([['state1','state2'],   ['county1','county2','county3','county4']])
df = pd.DataFrame({'pop': [12,15,65,42,78,67,55,31]}, index=idx)

                pop
state1 county1   12
       county2   15
       county3   65
       county4   42
state2 county1   78
       county2   67
       county3   55
       county4   31

df.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values('pop', ascending=False)).groupby(level=0).head(3)

> Out[29]: 
                pop
state1 county3   65
       county4   42
       county2   15
state2 county1   78
       county2   67
       county3   55

Method 6

I suppose “first” means you have already sorted your DataFrame as you want.

What I do is :

df.groupby(‘id’).agg(‘first’)
I suppose “first” means you have already sorted your DataFrame as you want.
What I do is :

df.groupby('id').agg('first')
     value
id        
1    first
2    first
3    first
4   second
5    first
6    first
7   fourth

the nice thing is that you can plug any function you want :

df.groupby('id').agg(['first','last','count']))
     value              
     first    last count
id                      
1    first  second     3
2    first  second     2
3    first   fifth     4
4   second   fifth     2
5    first   first     1
6    first   third     3
7   fourth   fifth     2

Output DataFrame has MultiIndex columns

MultiIndex([('value', 'first'),
            ('value',  'last'),
            ('value', 'count')],
           )

Method 7

Considering that the 'id' column is of numeric type, such as int32/int64, one might also use groupby.rank() as following

[In]: df[df.groupby('value')['id'].rank() == 1]
[Out]:
   id   value
0   1   first
6   3   third
7   3  fourth
8   3   fifth

If one wants to reset the index, just pass .reset_index() such as

[In]: df[df.groupby('value')['id'].rank() == 1].reset_index()
[Out]:
   index  id   value
0      0   1   first
1      6   3   third
2      7   3  fourth
3      8   3   fifth

If the index and id columns are not needed

[In]: df.drop(['index', 'id'], axis=1, inplace=True)
[Out]:
    value
0   first
1   third
2  fourth
3   fifth

Method 8

You can use the method take that accepts a list of indices of elements to select:

df.groupby('id').take([0])

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating