I have a pandas DataFrame as below:
id  age   gender  country  sales_year
1   None  M       India    2016
2   23    F       India    2016
1   20    M       India    2015
2   25    F       India    2015
3   30    M       India    2019
4   36    None    India    2019
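For anyone who wants to reproduce this, the frame above could be built roughly as follows. This is only a sketch: it assumes the None entries are real missing values (not the string 'None'), in which case pandas stores age as float with NaN.

```python
import pandas as pd

# Assumption: None marks a real missing value, not the literal string 'None'.
df = pd.DataFrame({
    "id": [1, 2, 1, 2, 3, 4],
    "age": [None, 23, 20, 25, 30, 36],
    "gender": ["M", "F", "M", "F", "M", None],
    "country": ["India"] * 6,
    "sales_year": [2016, 2016, 2015, 2015, 2019, 2019],
})
print(df)
```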
I want to group by id and keep one row per id: the latest row as per sales_year, with each column filled with its latest non-null value.
Expected output:
id  age  gender  country  sales_year
1   20   M       India    2016
2   23   F       India    2016
3   30   M       India    2019
4   36   None    India    2019
In PySpark I would write:
from pyspark.sql import Window, functions as f
df = df.withColumn('age', f.first('age', True).over(Window.partitionBy("id").orderBy(df.sales_year.desc())))
But I need the same solution in pandas.
EDIT:
This can be the case with all the columns, not just age. I need it to pick up the latest non-null data (where it exists) for all the ids.
Answers:
Method 1
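Use GroupBy.first, which returns the first non-null value of each column within each group (this relies on the rows for each id already being ordered newest first, as in the sample):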
df1 = df.groupby('id', as_index=False).first()
print (df1)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019
If the sales_year column is not already sorted in descending order, sort it first:
df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print (df2)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019
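Put together as a self-contained sketch (the sample data is rebuilt here under the assumption that None means a real missing value; the variable name latest is only illustrative):

```python
import pandas as pd

# Sample data from the question; None is assumed to be a real missing value.
df = pd.DataFrame({
    "id": [1, 2, 1, 2, 3, 4],
    "age": [None, 23, 20, 25, 30, 36],
    "gender": ["M", "F", "M", "F", "M", None],
    "country": ["India"] * 6,
    "sales_year": [2016, 2016, 2015, 2015, 2019, 2019],
})

# Newest rows first, then take the first non-null value of every column per id.
latest = (
    df.sort_values("sales_year", ascending=False)
      .groupby("id", as_index=False)
      .first()
)
print(latest)
```

Because first() skips nulls column by column, id 1 gets age 20 from its 2015 row while sales_year still comes from the 2016 row. A column that is null in every row of a group (gender for id 4) stays null.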
Method 2
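Drop the rows with a missing gender, sort by sales_year in descending order, and take the first age per id: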
df.dropna(subset=['gender']).sort_values('sales_year', ascending=False).groupby('id')['age'].first()
Output
id
1    20
2    23
3    30
4    36
Name: age, dtype: object
Remove the ['age'] selection (and the subset argument of dropna) to get full rows:
df.dropna().sort_values('sales_year', ascending=False).groupby('id').first()
Output
    age  gender  country  sales_year
id
1   20   M       India    2015
2   23   F       India    2016
3   30   M       India    2019
4   36   None    India    2019
You can put the id back as a column with reset_index():
df.dropna().sort_values('sales_year', ascending=False).groupby('id').first().reset_index()
Output
   id  age  gender  country  sales_year
0  1   20   M       India    2015
1  2   23   F       India    2016
2  3   30   M       India    2019
3  4   36   None    India    2019
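One thing worth checking with this method: dropna() without arguments removes every row that contains any missing value before the grouping happens. The sketch below rebuilds the sample under the assumption that None is a real missing value (the names complete_rows and result are only illustrative). With that assumption the result differs from the question's expected output.

```python
import pandas as pd

# Sample data from the question; None is assumed to be a real missing value.
df = pd.DataFrame({
    "id": [1, 2, 1, 2, 3, 4],
    "age": [None, 23, 20, 25, 30, 36],
    "gender": ["M", "F", "M", "F", "M", None],
    "country": ["India"] * 6,
    "sales_year": [2016, 2016, 2015, 2015, 2019, 2019],
})

# Whole rows with any null are discarded here, not just the null cells.
complete_rows = df.dropna()

result = (
    complete_rows.sort_values("sales_year", ascending=False)
                 .groupby("id")
                 .first()
                 .reset_index()
)
print(result)
```

Under that assumption id 4 disappears, because its only row has a missing gender, and id 1 reports sales_year 2015, because its 2016 row was dropped for the missing age.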
Method 3
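import numpy as np
print(df.replace('None', np.nan).groupby('id').first())
- first replace the string 'None' with NaN, so that it counts as a missing value
- next use groupby() to group by 'id'
- then use first() to take the first non-null value of each column in every group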
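The steps above can be sketched end to end as follows. Two assumptions are made here: the missing entries arrived as the literal string 'None' (for example from a CSV), and the frame is sorted by sales_year in descending order before grouping, a step the one-liner above does not include. The names cleaned and latest are only illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: missing entries are the literal string 'None', not real nulls.
df = pd.DataFrame({
    "id": [1, 2, 1, 2, 3, 4],
    "age": ["None", 23, 20, 25, 30, 36],
    "gender": ["M", "F", "M", "F", "M", "None"],
    "country": ["India"] * 6,
    "sales_year": [2016, 2016, 2015, 2015, 2019, 2019],
})

# Turn the placeholder strings into real missing values.
cleaned = df.replace("None", np.nan)

# Sort newest first (added step), then take the first non-null value per column.
latest = (
    cleaned.sort_values("sales_year", ascending=False)
           .groupby("id")
           .first()
)
print(latest)
```

Without the replace step, first() would treat 'None' as an ordinary string and return it as a valid value.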
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.