Say that I have a dataframe that looks like:
Name  Group_Id
AAA   1
ABC   1
CCC   2
XYZ   2
DEF   3
YYH   3
How could I randomly select one (or more) rows for each Group_Id? For example, with one random draw per Group_Id, I might get:
Name  Group_Id
AAA   1
XYZ   2
DEF   3
Answers:
Method 1
From pandas 0.16.x onwards, pd.DataFrame.sample returns a random sample of items from an axis of the object, so it can be applied to each group:
In [664]: df.groupby('Group_Id').apply(lambda x: x.sample(1)).reset_index(drop=True)
Out[664]:
Name Group_Id
0 ABC 1
1 XYZ 2
2 DEF 3
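As a hedged sketch of the same idea, the per-group sample can also be written as an explicit loop over the groups, which avoids relying on how groupby(...).apply treats the grouping column (this has varied between pandas releases). The helper name sample_per_group and its seed parameter are hypothetical, introduced only for illustration; the sample data is the frame from the question.

```python
import numpy as np
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({
    "Name": ["AAA", "ABC", "CCC", "XYZ", "DEF", "YYH"],
    "Group_Id": [1, 1, 2, 2, 3, 3],
})

def sample_per_group(frame, key, n=1, seed=None):
    """Hypothetical helper: draw n rows from every group of `frame`."""
    # One shared random state, so groups do not all reuse the same draw.
    rng = np.random.RandomState(seed)
    parts = [group.sample(n=n, random_state=rng)
             for _, group in frame.groupby(key)]
    return pd.concat(parts).reset_index(drop=True)

picked = sample_per_group(df, "Group_Id", n=1, seed=42)
print(picked)
```

Passing a seed makes the draw repeatable, which is handy when the sampled rows feed into tests or reports.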
Method 2
import numpy as np
size = 2        # sample size per group
replace = True  # sample with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace), :]
df.groupby('Group_Id', as_index=False).apply(fn)
Method 3
There are two ways to do this very simply, one without using anything except basic pandas syntax:
df[['x','y']].groupby('x').agg(pd.DataFrame.sample)
This takes 14.4 ms on a 50k-row dataset.
The other, slightly faster method uses numpy:
df[['x','y']].groupby('x').agg(np.random.choice)
This takes 10.9 ms on the same 50k-row dataset.
Generally speaking, when using pandas, it's preferable to stick with its native syntax, especially for beginners.
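One caveat worth illustrating: agg works column by column, so with more than one value column each column is drawn independently and the result may combine values from different rows of a group. The sketch below uses a made-up frame (columns x, y, z) and a hypothetical helper pick_one; it also shows one way to keep whole rows by aggregating index labels instead of values.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Made-up data: x is the group key, y and z are value columns.
df = pd.DataFrame({
    "x": [1, 1, 1, 2, 2, 2],
    "y": [10, 11, 12, 20, 21, 22],
    "z": ["a", "b", "c", "d", "e", "f"],
})

def pick_one(values):
    """Hypothetical helper: return one random element of a Series."""
    return values.iloc[rng.randint(len(values))]

# Each column is sampled on its own, so y and z may come from different rows.
per_column = df.groupby("x").agg(pick_one)

# To keep rows intact, pick one index label per group and select those rows.
labels = df.groupby("x")["y"].agg(lambda s: s.index[rng.randint(len(s))])
whole_rows = df.loc[labels.to_numpy()]

print(per_column)
print(whole_rows)
```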
Method 4
Using groupby and random.choice in a one-liner (after import random):
df.groupby('Group_Id').apply(lambda x: x.iloc[random.choice(range(len(x)))])
Method 5
To randomly select just one row per group, shuffle the whole frame and take the first row of each group:
df.sample(frac = 1.0).groupby('Group_Id').head(1)
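A minimal sketch of the shuffle-then-head idea on the question's data. The random_state value and the final sort_values call are additions for illustration: the first makes the draw repeatable, the second puts the picked rows back in group order. Note that head(1) keeps the original index labels, so the selected rows can be traced back to df.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["AAA", "ABC", "CCC", "XYZ", "DEF", "YYH"],
    "Group_Id": [1, 1, 2, 2, 3, 3],
})

# Shuffle all rows, then keep the first row seen for each group.
shuffled = df.sample(frac=1.0, random_state=7)
one_per_group = shuffled.groupby("Group_Id").head(1).sort_values("Group_Id")

print(one_per_group)
```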
Method 6
df.groupby('Group_Id').sample(n=1)
DataFrameGroupBy.sample is new in pandas version 1.1.0.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html
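A short sketch of the main options of DataFrameGroupBy.sample on the question's data, assuming pandas 1.1.0 or later. The random_state values are arbitrary and only there to make the draws repeatable.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["AAA", "ABC", "CCC", "XYZ", "DEF", "YYH"],
    "Group_Id": [1, 1, 2, 2, 3, 3],
})

grouped = df.groupby("Group_Id")

# A fixed number of rows per group.
one_each = grouped.sample(n=1, random_state=0)

# A fraction of each group (half of 2 rows is 1 row here).
half_each = grouped.sample(frac=0.5, random_state=0)

# More rows than a group holds is only possible with replacement.
with_replacement = grouped.sample(n=3, replace=True, random_state=0)

print(one_each)
```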
Method 7
The solutions above fail, when sampling without replacement, if a group has fewer rows than the desired sample size n. Capping the sample size at the group size addresses this problem:
n = 10
df.groupby('Group_Id').apply(lambda x: x.sample(min(n,len(x)))).reset_index(drop=True)
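An alternative sketch that reaches the same result without apply: shuffle the frame, number the rows within each group with cumcount, and keep those whose position is below n. Groups smaller than n simply contribute all their rows. The uneven sample data below is made up for illustration.

```python
import pandas as pd

# Made-up data with uneven group sizes: 3, 1 and 1 rows.
df = pd.DataFrame({
    "Name": ["AAA", "ABC", "CCC", "XYZ", "DEF"],
    "Group_Id": [1, 1, 1, 2, 3],
})

n = 2  # desired rows per group (upper bound)

shuffled = df.sample(frac=1.0, random_state=0)
# 0-based position of each row inside its own group, in shuffled order.
position_in_group = shuffled.groupby("Group_Id").cumcount()
capped = (shuffled[position_in_group < n]
          .sort_values("Group_Id")
          .reset_index(drop=True))

sizes = capped.groupby("Group_Id").size().to_dict()
print(capped)
```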
Method 8
A very pandas-ish way:
n = 1  # number of rows to draw per group
takesamp = lambda d: d.sample(n)
df = df.groupby('Group_Id').apply(takesamp)
Method 9
Using random.choice, you can do something like this:
import random
name_group = {'AAA': 1, 'ABC':1, 'CCC':2, 'XYZ':2, 'DEF':3, 'YYH':3}
names = list(name_group)  # create a list out of the keys in the name_group dict
first_name = random.choice(names)
first_group = name_group[first_name]
print(first_name, first_group)
random.choice(seq): Return a random element from the non-empty sequence seq. If seq is empty, raises IndexError.
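The snippet above draws a single name from the whole dict. As a hedged sketch of how the same random.choice call could give one draw per group, the names can first be bucketed by group. The variable names names_by_group and picked, and the fixed seed, are assumptions made for illustration.

```python
import random
from collections import defaultdict

name_group = {'AAA': 1, 'ABC': 1, 'CCC': 2, 'XYZ': 2, 'DEF': 3, 'YYH': 3}

# Bucket the names by their group id.
names_by_group = defaultdict(list)
for name, group in name_group.items():
    names_by_group[group].append(name)

# A seeded generator keeps the draw repeatable; drop the seed for fresh draws.
rng = random.Random(0)
picked = {group: rng.choice(names)
          for group, names in sorted(names_by_group.items())}

print(picked)
```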
Method 10
You can use a combination of DataFrame.groupby, pandas.concat and random.sample:
import pandas as pd
import random
df = pd.DataFrame({
'Name': ['AAA', 'ABC', 'CCC', 'XYZ', 'DEF', 'YYH'],
'Group_ID': [1,1,2,2,3,3]
})
grouped = df.groupby('Group_ID')
df_sampled = pd.concat([d.loc[random.sample(list(d.index), 1)] for _, d in grouped]).reset_index(drop=True)
print(df_sampled)
Output (one possible draw):
  Name  Group_ID
0  AAA         1
1  XYZ         2
2  DEF         3
Method 11
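I found another one, which works on integer positions and needs the rows sorted by group:
size = 2
df = df.sort_values('Group_Id')
count_s = df.groupby('Group_Id').size()   # rows per group, in sorted group order
start_s = count_s.cumsum() - count_s      # position of each group's first row
df.iloc[np.concatenate([start + np.random.choice(count, size)
                        for count, start in zip(count_s, start_s)])]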
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.