Is there a way to select random rows from a DataFrame in pandas?
In R, using the car package, there is a useful function some(x, n), which is similar to head but selects n rows (for example, 10) at random from x.
I have also looked at the slicing documentation and there seems to be nothing equivalent.
Update
Now using version 0.20. There is a sample method.
df.sample(n)
Answers:
Method 1
With pandas version 0.16.1 and up, there is now a DataFrame.sample method built-in:
import numpy as np
import pandas as pd

# pandas.np is no longer available in current pandas, so import numpy directly
df = pd.DataFrame(np.random.random(100))

# Randomly sample 70% of your dataframe
df_percent = df.sample(frac=0.7)

# Randomly sample 7 elements from your dataframe
df_elements = df.sample(n=7)
For either approach above, you can get the rest of the rows by doing:
df_rest = df.loc[~df.index.isin(df_percent.index)]
Per Pedram's comment, if you would like to get reproducible samples, pass the random_state parameter.
df_percent = df.sample(frac=0.7, random_state=42)
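Putting the pieces above together, a minimal sketch of a reproducible 70/30 split might look like the following. The helper name split_sample and the column name "value" are made up for illustration; only DataFrame.sample, Index.isin, and random_state come from the answer. The sketch assumes the DataFrame has a unique index, since the "rest" rows are found by index label.

```python
import numpy as np
import pandas as pd


def split_sample(df, frac, seed=None):
    """Return (sampled, rest) for a DataFrame with a unique index.

    Hypothetical helper, not part of pandas.
    """
    # Draw a random fraction of the rows; seed makes the draw repeatable
    sampled = df.sample(frac=frac, random_state=seed)
    # Everything whose index label was not drawn
    rest = df.loc[~df.index.isin(sampled.index)]
    return sampled, rest


df = pd.DataFrame({"value": np.arange(100)})
df_percent, df_rest = split_sample(df, frac=0.7, seed=42)
print(len(df_percent), len(df_rest))
```

If the index contains duplicate labels, the isin filter would drop every row sharing a sampled label, so resetting the index first is probably safer in that case.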
Method 2
Something like this?
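import random

def some(x, n):
    # random.sample needs a sequence, so convert the index to a list first
    return x.loc[random.sample(list(x.index), n)]

Note: The original version of this answer used x.ix. As of pandas v0.20.0, ix has been deprecated in favour of loc for label-based indexing, and it was removed in pandas 1.0, so loc is used here.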
Method 3
sample
As of v0.20.0, you can use pd.DataFrame.sample, which can be used to return a random sample of a fixed number of rows, or a percentage of rows:
df = df.sample(n=k)     # k rows
df = df.sample(frac=k)  # about len(df.index) * k rows (k is a fraction here)
For reproducibility, you can specify an integer random_state, which seeds the draw much like calling np.random.seed does. So, instead of calling, for example, np.random.seed(0) beforehand, you can:
df = df.sample(n=k, random_state=0)
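As a quick illustration of what random_state buys you, the sketch below draws the same sample twice and compares the results. The DataFrame and its column name "x" are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"x": range(20)})

# Two independent calls with the same integer seed
first = df.sample(n=5, random_state=0)
second = df.sample(n=5, random_state=0)

# The same rows come back in the same order
print(first.index.equals(second.index))
```

Note that the example keeps df intact and assigns the sample to new names; reassigning to df, as in the snippet above, discards the unsampled rows.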
Method 4
One way to do this is with the sample function from the random module:
import numpy as np
import pandas as pd
from random import sample
# given data frame df
# create random index (range replaces the Python 2 xrange)
rindex = np.array(sample(range(len(df)), 10))
# get 10 random rows from df by position (ix is no longer available in pandas)
dfr = df.iloc[rindex]
Method 5
The line below randomly selects n rows from the dataframe df without replacement.
df = df.take(np.random.permutation(len(df))[:n])
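A self-contained variant of the same idea is sketched below. It swaps the global np.random.permutation for a seeded NumPy Generator so the result is repeatable; the seed value, the letter index, and the column name "x" are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)}, index=list("abcdefghij"))
n = 3

# Seeded generator; the seed 0 is arbitrary
rng = np.random.default_rng(0)

# Shuffle the row positions 0..len(df)-1 and keep the first n
positions = rng.permutation(len(df))[:n]

# take works on positions, so it does not depend on the index labels
subset = df.take(positions)
print(subset)
```

Because a permutation contains each position exactly once, the first n entries can never repeat, which is what makes this sampling without replacement.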
Method 6
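Note that np.random.random_integers(0, len(df), N) draws with replacement, so it will give you repeated indices when N is a large number. Its upper bound is also inclusive, so it can return len(df), which is not a valid row position.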
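To make the difference concrete, here is a small sketch contrasting draws with and without replacement. It uses the NumPy Generator API (default_rng, integers, choice) rather than random_integers; the seed and the five-row DataFrame are assumptions for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(5)})
rng = np.random.default_rng(0)  # arbitrary seed

# With replacement: 20 draws from 5 positions must contain repeats.
# The upper bound of integers() is exclusive, so every value is a valid position.
with_repl = rng.integers(0, len(df), size=20)
rows_repeated = df.iloc[with_repl]

# Without replacement: each position appears at most once
without_repl = rng.choice(len(df), size=3, replace=False)
rows_unique = df.iloc[without_repl]

# pandas can also sample with replacement directly
rows_bootstrap = df.sample(n=20, replace=True, random_state=0)

print(len(rows_repeated), len(rows_unique), len(rows_bootstrap))
```

Sampling with replacement is what you want for something like a bootstrap; for picking a subset of distinct rows, the without-replacement forms are the ones to reach for.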
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.