Consider the following situation:
In [2]: a = pd.Series([1,2,3,4,'.'])
In [3]: a
Out[3]:
0 1
1 2
2 3
3 4
4 .
dtype: object
In [8]: a.astype('float64', raise_on_error = False)
Out[8]:
0 1
1 2
2 3
3 4
4 .
dtype: object
I would have expected an option that allows conversion while turning erroneous values (such as that .) to NaNs. Is there a way to achieve this?
Answers:
Method 1
Use pd.to_numeric with errors='coerce'
# Setup
s = pd.Series(['1', '2', '3', '4', '.'])
s

0    1
1    2
2    3
3    4
4    .
dtype: object

pd.to_numeric(s, errors='coerce')

0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
dtype: float64
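Because errors='coerce' silently replaces anything unparseable with NaN, it can help to check which entries were affected. The following is a minimal sketch, assuming the same Series as in the setup; the names converted, coerced_mask, and bad_values are illustrative only.

```python
import pandas as pd

# Same data as the setup above: numeric strings plus one invalid entry.
s = pd.Series(['1', '2', '3', '4', '.'])

converted = pd.to_numeric(s, errors='coerce')

# An entry was coerced if it was present originally but is NaN afterwards.
coerced_mask = s.notna() & converted.isna()

# Original values that could not be parsed as numbers.
bad_values = s[coerced_mask]
print(bad_values.tolist())
```

Inspecting bad_values before filling or dropping the NaNs makes it easier to tell genuinely missing data apart from formatting problems in the input.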
If you need the NaNs filled in, use Series.fillna.
pd.to_numeric(s, errors='coerce').fillna(0, downcast='infer')

0    1
1    2
2    3
3    4
4    0
dtype: int64
Note, downcast='infer' will attempt to downcast floats to integers where possible. Remove the argument if you don’t want that.
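If the downcast argument is unavailable in your pandas version (newer releases may warn about it, which is an assumption worth checking against your installed version), an explicit cast should give the same integer result. This is a sketch rather than a drop-in replacement: it only casts when every value is a whole number, and the name filled is illustrative.

```python
import pandas as pd

s = pd.Series(['1', '2', '3', '4', '.'])

# Coerce invalid entries to NaN, then replace them with 0 (still float64).
filled = pd.to_numeric(s, errors='coerce').fillna(0)

# Cast only when every value is a whole number, so nothing is truncated.
if (filled % 1 == 0).all():
    filled = filled.astype('int64')

print(filled.dtype)
```

The whole-number check matters because astype('int64') would otherwise drop fractional parts without complaint.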
Since v0.24, pandas provides a nullable integer type, which allows integers to coexist with NaNs. If you have integers in your column, you can use:

pd.__version__
# '0.24.1'

pd.to_numeric(s, errors='coerce').astype('Int32')

0      1
1      2
2      3
3      4
4    NaN
dtype: Int32

There are other options to choose from as well; read the docs for more.
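To illustrate what the nullable type buys you, here is a small sketch, assuming pandas 0.24 or later and the same input data. It uses 'Int64' instead of 'Int32' purely for illustration, and the names nullable and shifted are hypothetical.

```python
import pandas as pd

s = pd.Series(['1', '2', '3', '4', '.'])

# Coerce to float first, then move to the nullable integer dtype.
nullable = pd.to_numeric(s, errors='coerce').astype('Int64')

# The invalid entry stays missing instead of forcing the column to float.
missing = nullable.isna()

# Arithmetic keeps the nullable integer dtype; the missing entry stays missing.
shifted = nullable + 1

print(shifted.dtype)
print(int(nullable.sum()))  # missing values are skipped by default
```

Note the capital "I" in the dtype name: 'Int64' is the nullable extension type, while 'int64' is the ordinary NumPy integer type, which cannot hold missing values.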
Extension for DataFrames
If you need to extend this to DataFrames, you will need to apply it to each column. You can do this using DataFrame.apply.
# Setup.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({
'A' : np.random.choice(10, 5),
'C' : np.random.choice(10, 5),
'B' : ['1', '###', '...', 50, '234'],
'D' : ['23', '1', '...', '268', '$$']}
)[list('ABCD')]
df

   A    B  C    D
0  5    1  9   23
1  0  ###  3    1
2  3  ...  5  ...
3  3   50  2  268
4  7  234  4   $$

df.dtypes

A     int64
B    object
C     int64
D    object
dtype: object
df2 = df.apply(pd.to_numeric, errors='coerce')
df2

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN

df2.dtypes

A      int64
B    float64
C      int64
D    float64
dtype: object
You can also do this with DataFrame.transform; although my tests indicate this is marginally slower:
df.transform(pd.to_numeric, errors='coerce')

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN
If you have many columns (some numeric, some not), you can make this a little more performant by applying pd.to_numeric to the non-numeric columns only.
df.dtypes.eq(object)

A    False
B     True
C    False
D     True
dtype: bool

cols = df.columns[df.dtypes.eq(object)]
# Actually, `cols` can be any list of columns you need to convert.
cols
# Index(['B', 'D'], dtype='object')

df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
# Alternatively,
# for c in cols:
#     df[c] = pd.to_numeric(df[c], errors='coerce')

df

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN
Applying pd.to_numeric along the columns (i.e., axis=0, the default) should be slightly faster for long DataFrames.
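The column-wise approach can be wrapped in a small helper. This is only a sketch: coerce_object_columns is a hypothetical name, not a pandas function, and it assumes the non-numeric columns have object dtype, as in the example above. The sample frame is a reduced, made-up variant of the setup data.

```python
import pandas as pd


def coerce_object_columns(frame):
    """Return a copy of `frame` with object columns passed through pd.to_numeric."""
    result = frame.copy()
    # Only object-dtype columns are touched; numeric columns keep their dtype.
    for col in result.select_dtypes(include='object').columns:
        result[col] = pd.to_numeric(result[col], errors='coerce')
    return result


# Illustrative data: one integer column and two mixed object columns.
df = pd.DataFrame({
    'A': [5, 0, 3],
    'B': ['1', '###', 50],
    'D': ['23', 1, '$$'],
})

clean = coerce_object_columns(df)
print(clean.dtypes)
```

Working on a copy leaves the original frame untouched, which makes it easy to compare the two and see which cells were coerced.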
Method 2
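On older pandas versions, Series.convert_objects does the same thing. Note that it was deprecated and then removed in pandas 1.0, so prefer pd.to_numeric (Method 1) on current versions.
In [30]: pd.Series([1,2,3,4,'.']).convert_objects(convert_numeric=True)
Out[30]:
0 1
1 2
2 3
3 4
4 NaN
dtype: float64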
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.