I have a large dataframe with ID numbers:
ID.head()
Out[64]:
0    4806105017087
1    4806105017087
2    4806105017087
3    4901295030089
4    4901295030089
These are all strings at the moment.
I want to convert to int without using loops – for this I use ID.astype(int).
The problem is that some of my rows contain dirty data which cannot be converted to int, for example:
ID[154382]
Out[58]: 'CN414149'
How can I (without using loops) remove these kinds of values so that I can use astype with peace of mind?
Answers:
Method 1
You need to pass the parameter errors='coerce' to the function to_numeric:
ID = pd.to_numeric(ID, errors='coerce')
If ID is a column:
df.ID = pd.to_numeric(df.ID, errors='coerce')
However, non-numeric values are converted to NaN, so all values become float.
To get int, you need to replace NaN with some value, e.g. 0, and then cast to int:
df.ID = pd.to_numeric(df.ID, errors='coerce').fillna(0).astype(np.int64)
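Before replacing anything, it may help to see which values would actually be coerced. The following is a minimal sketch, assuming a small made-up frame shaped like the one in the question; the names converted and dirty are only illustrative:

```python
import pandas as pd

# Hypothetical sample data mirroring the question: two valid IDs and one dirty value.
df = pd.DataFrame({'ID': ['4806105017087', '4901295030089', 'CN414149']})

# Coerce to numbers; anything that cannot be parsed becomes NaN.
converted = pd.to_numeric(df['ID'], errors='coerce')

# Values that failed to parse but were not missing in the original column.
dirty = df.loc[converted.isna() & df['ID'].notna(), 'ID']
print(dirty.tolist())
```

This only lists the offending values; what to do with them (fill, drop, or fix by hand) is a separate decision.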
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':['4806105017087','4806105017087','CN414149']})
print (df)
              ID
0  4806105017087
1  4806105017087
2       CN414149
print (pd.to_numeric(df.ID, errors='coerce'))
0    4.806105e+12
1    4.806105e+12
2             NaN
Name: ID, dtype: float64
df.ID = pd.to_numeric(df.ID, errors='coerce').fillna(0).astype(np.int64)
print (df)
              ID
0  4806105017087
1  4806105017087
2              0
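The question asks about removing the dirty rows rather than replacing them with 0. One possible way to do that, sketched here on the same sample data (the name cleaned is just illustrative), is to coerce first and then drop the rows that became NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['4806105017087', '4806105017087', 'CN414149']})

# Coerce first, then drop the rows where the conversion produced NaN.
df['ID'] = pd.to_numeric(df['ID'], errors='coerce')
cleaned = df.dropna(subset=['ID']).copy()

# With no NaN left, the cast to a plain integer dtype succeeds.
cleaned['ID'] = cleaned['ID'].astype(np.int64)
print(cleaned)
```

Note that the values pass through float64 on the way, which is exact for the 13-digit IDs shown here but not for integers larger than 2**53.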
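EDIT: With pandas 0.25+ it is possible to use the nullable integer dtype (integer_na), which keeps the missing value instead of replacing it (pandas 1.0+ prints it as <NA> rather than NaN):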
df.ID = pd.to_numeric(df.ID, errors='coerce').astype('Int64')
print (df)
              ID
0  4806105017087
1  4806105017087
2            NaN
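As a rough check of what the nullable dtype gives you, the sketch below (same sample data, assuming a pandas version that supports 'Int64') counts the values that could not be parsed and then extracts the valid ones as ordinary integers:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['4806105017087', '4806105017087', 'CN414149']})
df['ID'] = pd.to_numeric(df['ID'], errors='coerce').astype('Int64')

# The column keeps an integer dtype even though one value is missing.
print(df['ID'].dtype)

# Number of values that could not be parsed.
print(df['ID'].isna().sum())

# Valid IDs as plain Python ints, once the missing entries are dropped.
valid_ids = df['ID'].dropna().astype('int64').tolist()
print(valid_ids)
```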
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.