I’m a beginner in Pandas. I have a data file with information on 10000 users, stored as 5 columns and 10000 rows. One of the columns is the user's district, which groups users by where they live (there are just 7 distinct locations, and some number of users live in each). For example, out of these 10000 users, 300 live in USA, 250 live in Canada, and so on.
I want to build a DataFrame that includes five random rows of users from each of these districts: USA, Canada, LA, NY and Japan. Also, the dimensions need to be 20*5. Can you please help me with how to do that?
I know that to choose random rows I need to use
s = df.sample(n=5)
but how can I tell it to choose 5 random rows from the users in those locations, and how do I get the required dimensions?
Answers:
Method 1
You can also sample from groups generated with groupby:
df.groupby('district').sample(n=5)
To restrict the sampling to those districts you can filter the df beforehand:
df[df['district'].isin(['USA', 'Canada', 'LA', 'NY', 'Japan'])].groupby('district').sample(n=5)
This assumes 'district' is the name of the district column. Also, if I understood correctly, since you are sampling 5 rows from each of 5 districts, the dimensions of the final DataFrame should be (5*5)x5 = 25×5 (25 rows and 5 columns).
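As a rough end-to-end sketch of the filter-then-sample approach, the snippet below builds a synthetic stand-in for the data and checks the resulting shapes. Every column name other than 'district', the two extra locations ('UK', 'France') and the fixed random_state are assumptions made up for illustration, not part of the original data. It also shows that sampling 4 rows per district, rather than 5, is one way to end up with the 20×5 shape mentioned in the question.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the asker's file: 10000 rows, 5 columns.
# All column names except 'district' are hypothetical, and 'UK'/'France'
# are made-up placeholders for the two unnamed locations.
rng = np.random.default_rng(0)
districts = ['USA', 'Canada', 'LA', 'NY', 'Japan', 'UK', 'France']
df = pd.DataFrame({
    'user_id': np.arange(10000),
    'age': rng.integers(18, 80, size=10000),
    'score': rng.random(10000),
    'active': rng.integers(0, 2, size=10000).astype(bool),
    'district': rng.choice(districts, size=10000),
})

# Keep only the rows whose district is one of the five of interest.
wanted = ['USA', 'Canada', 'LA', 'NY', 'Japan']
subset = df[df['district'].isin(wanted)]

# 5 rows per district -> 25 rows in total (needs pandas >= 1.1.0).
sample_25 = subset.groupby('district').sample(n=5, random_state=0)

# 4 rows per district -> 20 rows in total.
sample_20 = subset.groupby('district').sample(n=4, random_state=0)

print(sample_25.shape)
print(sample_20.shape)
```

Dropping random_state gives a different draw on every run; it is fixed here only so the example is reproducible.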
You need pandas version >= 1.1.0 to use this method.
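On an older pandas where groupby(...).sample is not available, one possible workaround is to sample each group separately and concatenate the pieces. This is only a sketch: the helper name sample_per_group and the tiny example frame are hypothetical, and it is a substitute for the method above, not the same call.

```python
import pandas as pd

def sample_per_group(df, column, values, n, random_state=None):
    """Return n random rows for each of the given values of `column`."""
    # Restrict to the requested groups first.
    subset = df[df[column].isin(values)]
    # Sample each group on its own, then stack the results.
    parts = [
        group.sample(n=n, random_state=random_state)
        for _, group in subset.groupby(column)
    ]
    return pd.concat(parts)

# Small hypothetical frame: 10 users in each of three districts.
df = pd.DataFrame({
    'district': ['USA'] * 10 + ['Canada'] * 10 + ['Japan'] * 10,
    'user_id': range(30),
})

picked = sample_per_group(df, 'district', ['USA', 'Canada'], n=5, random_state=1)
print(picked.shape)
```

Note that DataFrame.sample raises an error if a group has fewer than n rows, so each requested district needs at least n users.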
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0