Python: Random selection per group

Say that I have a dataframe that looks like:

Name Group_Id
AAA  1
ABC  1
CCC  2
XYZ  2
DEF  3 
YYH  3

How can I randomly select one (or more) rows for each Group_Id? For example, with one random draw per Group_Id, I might get:

Name Group_Id
AAA  1
XYZ  2
DEF  3

Answers:

Method 1

From pandas 0.16.x onwards, pd.DataFrame.sample returns a random sample of items from an axis of the object. Applied per group, it draws one row from each:

In [664]: df.groupby('Group_Id').apply(lambda x: x.sample(1)).reset_index(drop=True)
Out[664]:
  Name  Group_Id
0  ABC         1
1  XYZ         2
2  DEF         3
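
For readers new to sample, here is a minimal sketch of the call on its own, using the question's data. The random_state seed is an assumption added only to make the draws repeatable; it is not part of the answer above.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({
    "Name": ["AAA", "ABC", "CCC", "XYZ", "DEF", "YYH"],
    "Group_Id": [1, 1, 2, 2, 3, 3],
})

# n picks a fixed number of rows; frac picks a fraction of them.
two_rows = df.sample(n=2, random_state=0)
half_rows = df.sample(frac=0.5, random_state=0)

# axis=1 samples columns instead of rows.
one_column = df.sample(n=1, axis=1, random_state=0)

print(two_rows)
print(half_rows)
print(one_column)
```

By default the draw is without replacement, so asking for more rows than exist raises an error unless replace=True is passed.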

Method 2

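import numpy as np

size = 2        # rows to draw per group
replace = True  # sample with replacement
# Pick `size` index labels from the group, then select those rows.
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace), :]
df.groupby('Group_Id', as_index=False).apply(fn)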
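
The same idea can be written as a plain loop, which may be easier to step through than apply. This is only a sketch: the helper name sample_with_replacement and the seeded generator are hypothetical, introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["AAA", "ABC", "CCC", "XYZ", "DEF", "YYH"],
    "Group_Id": [1, 1, 2, 2, 3, 3],
})


def sample_with_replacement(frame, by, size, rng):
    """Draw `size` rows per group, allowing the same row to repeat."""
    pieces = []
    for _, group in frame.groupby(by):
        # Choose index labels with replacement, then look the rows up.
        labels = rng.choice(group.index.to_numpy(), size=size, replace=True)
        pieces.append(group.loc[labels])
    return pd.concat(pieces)


rng = np.random.default_rng(0)  # seeded only so the example is repeatable
result = sample_with_replacement(df, "Group_Id", size=2, rng=rng)
print(result)
```

Because rows are drawn with replacement, a group with fewer rows than size still yields size rows, possibly with repeats.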

Method 3

There are two simple ways to do this. The first uses only basic pandas syntax:

df[['x','y']].groupby('x').agg(pd.DataFrame.sample)

This takes 14.4 ms on a 50k-row dataset.

The other, slightly faster method uses numpy:

df[['x','y']].groupby('x').agg(np.random.choice)

This takes 10.9 ms on the same 50k-row dataset.

Generally speaking, when using pandas, it's preferable to stick with its native syntax, especially for beginners.
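
To make the numpy variant concrete, here is a minimal sketch on the question's data, with the columns renamed to x and y to match the snippets above (an assumption about what those names stand for). Note that agg calls the function once per column within each group, so with several value columns each column is drawn independently and the picks need not come from the same row. The sketch therefore selects a single column.

```python
import numpy as np
import pandas as pd

# Assumed layout: x is the grouping column, y the value to pick from.
df = pd.DataFrame({
    "x": [1, 1, 2, 2, 3, 3],
    "y": ["AAA", "ABC", "CCC", "XYZ", "DEF", "YYH"],
})

np.random.seed(0)  # seeded only so repeated runs match

# np.random.choice receives each group's y values and returns one of them.
picked = df.groupby("x")["y"].agg(np.random.choice)
print(picked)
```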

Method 4

Using groupby and random.choice in an elegant one liner:

import random
df.groupby('Group_Id').apply(lambda x: x.iloc[random.choice(range(len(x)))])

Method 5

To randomly select just one row per group, try:

df.sample(frac=1.0).groupby('Group_Id').head(1)
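
The shuffle-then-head idea can be sketched end to end as follows. The variable n and the random_state seed are assumptions added for illustration. The final sort is optional and only tidies the display.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["AAA", "ABC", "CCC", "XYZ", "DEF", "YYH"],
    "Group_Id": [1, 1, 2, 2, 3, 3],
})

n = 1  # rows to keep per group

# Shuffle all rows, then keep the first n rows seen in each group.
shuffled = df.sample(frac=1.0, random_state=0)
picked = shuffled.groupby("Group_Id").head(n)

# Optional: restore a tidy order for display.
picked = picked.sort_values("Group_Id").reset_index(drop=True)
print(picked)
```

Since head(n) returns at most n rows per group, a group with fewer than n rows is returned whole rather than raising an error.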

Method 6

df.groupby('Group_Id').sample(n=1)

New in version 1.1.0.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html
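
A short sketch of this method with a fixed seed, assuming pandas 1.1.0 or later. The random_state values are arbitrary and chosen only so that repeated runs give the same rows.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["AAA", "ABC", "CCC", "XYZ", "DEF", "YYH"],
    "Group_Id": [1, 1, 2, 2, 3, 3],
})

# Same seed, same draw: useful for tests and reproducible reports.
first = df.groupby("Group_Id").sample(n=1, random_state=42)
second = df.groupby("Group_Id").sample(n=1, random_state=42)

# frac samples a fraction of each group instead of a fixed count.
half = df.groupby("Group_Id").sample(frac=0.5, random_state=42)

print(first)
print(half)
```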

Method 7

The solutions above that sample without replacement fail if a group has fewer rows than the desired sample size n. Capping the sample size at the group's length avoids this:

n = 10
df.groupby('Group_Id').apply(lambda x: x.sample(min(n,len(x)))).reset_index(drop=True)
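
To see the cap in action, here is a sketch on a hypothetical frame with uneven groups (the extra name ABD and the group sizes are invented for illustration). It uses a loop with pd.concat rather than apply, which is a swapped-in technique and not the answer's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical uneven groups: group 1 has three rows, group 3 has only one.
df = pd.DataFrame({
    "Name": ["AAA", "ABC", "ABD", "CCC", "XYZ", "DEF"],
    "Group_Id": [1, 1, 1, 2, 2, 3],
})

n = 2  # desired rows per group
rng = np.random.RandomState(0)  # shared so each group gets a separate draw

pieces = [
    # Never ask for more rows than the group contains.
    group.sample(n=min(n, len(group)), random_state=rng)
    for _, group in df.groupby("Group_Id")
]
capped = pd.concat(pieces).reset_index(drop=True)
print(capped)
```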

Method 8

A very pandas-ish way:

n = 1  # rows to sample per group
takesamp = lambda d: d.sample(n)
df = df.groupby('Group_Id').apply(takesamp)

Method 9

Using random.choice, you can do something like this:

import random
name_group = {'AAA': 1, 'ABC':1, 'CCC':2, 'XYZ':2, 'DEF':3, 'YYH':3}

names = list(name_group)  # list of the keys in the name_group dict

first_name = random.choice(names)
first_group = name_group[first_name]
print(first_name, first_group)

random.choice(seq)

Return a random element from the non-empty sequence seq. If seq is empty, raises IndexError.
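
The snippet above draws a single name from the whole dict. To get one name per group, as the question asks, the mapping can first be inverted. This is a sketch only: names_by_group and picks are hypothetical names, and the seed is there just to make the run repeatable.

```python
import random
from collections import defaultdict

name_group = {"AAA": 1, "ABC": 1, "CCC": 2, "XYZ": 2, "DEF": 3, "YYH": 3}

# Invert the mapping: group id -> list of names in that group.
names_by_group = defaultdict(list)
for name, group in name_group.items():
    names_by_group[group].append(name)

random.seed(0)  # assumption: seeded only so the example is repeatable

# One random.choice call per group.
picks = {group: random.choice(names) for group, names in names_by_group.items()}
print(picks)
```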

Method 10

You can use a combination of pandas.groupby, pandas.concat and random.sample:

import pandas as pd
import random

df = pd.DataFrame({
        'Name': ['AAA', 'ABC', 'CCC', 'XYZ', 'DEF', 'YYH'],
        'Group_ID': [1,1,2,2,3,3]
     })

grouped = df.groupby('Group_ID')
df_sampled = pd.concat([d.loc[random.sample(list(d.index), 1)] for _, d in grouped]).reset_index(drop=True)
print(df_sampled)

Output (one possible draw):

  Name  Group_ID
0  AAA         1
1  XYZ         2
2  DEF         3

Method 11

I found another one:

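import numpy as np

size = 2
# Assumes each group's rows are contiguous in df, in order of first appearance.
count_s = df.groupby('Group_Id', sort=False).size()
start_s = count_s.cumsum().shift(fill_value=0)  # position of each group's first row
df.iloc[np.concatenate([start + np.random.choice(count, size)
                        for count, start in zip(count_s, start_s)])]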

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
