Without using groupby how would I filter out data without NaN?
Let say I have a matrix where customers will fill in 'N/A','n/a' or any of its variations and others leave it blank:
import pandas as pd
import numpy as np
df = pd.DataFrame({'movie': ['thg', 'thg', 'mol', 'mol', 'lob', 'lob'],
'rating': [3., 4., 5., np.nan, np.nan, np.nan],
'name': ['John', np.nan, 'N/A', 'Graham', np.nan, np.nan]})
nbs = df['name'].str.extract('^(N/A|NA|na|n/a)')
nms=df[(df['name'] != nbs) ]
output:
>>> nms movie name rating 0 thg John 3 1 thg NaN 4 3 mol Graham NaN 4 lob NaN NaN 5 lob NaN NaN
How would I filter out NaN values so I can get results to work with like this:
movie name rating 0 thg John 3 3 mol Graham NaN
I am guessing I need something like ~np.isnan but the tilda does not work with strings.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Just drop them:
nms.dropna(thresh=2)
this will drop all rows where there are at least two non-NaN.
Then you could then drop where name is NaN:
In [87]: nms Out[87]: movie name rating 0 thg John 3 1 thg NaN 4 3 mol Graham NaN 4 lob NaN NaN 5 lob NaN NaN [5 rows x 3 columns] In [89]: nms = nms.dropna(thresh=2) In [90]: nms[nms.name.notnull()] Out[90]: movie name rating 0 thg John 3 3 mol Graham NaN [2 rows x 3 columns]
EDIT
Actually looking at what you originally want you can do just this without the dropna call:
nms[nms.name.notnull()]
UPDATE
Looking at this question 3 years later, there is a mistake, firstly thresh arg looks for at least n non-NaN values so in fact the output should be:
In [4]: nms.dropna(thresh=2) Out[4]: movie name rating 0 thg John 3.0 1 thg NaN 4.0 3 mol Graham NaN
It’s possible that I was either mistaken 3 years ago or that the version of pandas I was running had a bug, both scenarios are entirely possible.
Method 2
Simplest of all solutions:
filtered_df = df[df['name'].notnull()]
Thus, it filters out only rows that doesn’t have NaN values in ‘name’ column.
For multiple columns:
filtered_df = df[df[['name', 'country', 'region']].notnull().all(1)]
Method 3
df.dropna(subset=['columnName1', 'columnName2'])
Method 4
df = pd.DataFrame({'movie': ['thg', 'thg', 'mol', 'mol', 'lob', 'lob'],'rating': [3., 4., 5., np.nan, np.nan, np.nan],'name': ['John','James', np.nan, np.nan, np.nan,np.nan]})
for col in df.columns:
df = df[~pd.isnull(df[col])]
Method 5
You can also use query:
out = df.query("name.notna() & name !='N/A'", engine='python')
Output:
movie rating name 0 thg 3.0 John 3 mol NaN Graham
Method 6
Inside query() pass column_name == column_name to keep the rows where column_name is not NA.
For your case:
nms.query('name == name')
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0