Convert pandas.Series from dtype object to float, and errors to nans

Consider the following situation:

In [2]: a = pd.Series([1,2,3,4,'.'])

In [3]: a
Out[3]: 
0    1
1    2
2    3
3    4
4    .
dtype: object

In [8]: a.astype('float64', raise_on_error = False)
Out[8]: 
0    1
1    2
2    3
3    4
4    .
dtype: object

I would have expected an option that allows conversion while turning erroneous values (such as that .) to NaNs. Is there a way to achieve this?

Contents hide

Answers:

Method 1

Use pd.to_numeric with errors='coerce'

Extension for DataFrames

Method 2

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Use `pd.to_numeric` with `errors='coerce'`

# Setup
s = pd.Series(['1', '2', '3', '4', '.'])
s

0    1
1    2
2    3
3    4
4    .
dtype: object

pd.to_numeric(s, errors='coerce')

0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
dtype: float64

If you need the NaNs filled in, use Series.fillna.

pd.to_numeric(s, errors='coerce').fillna(0, downcast='infer')

0    1
1    2
2    3
3    4
4    0
dtype: float64

Note, downcast='infer' will attempt to downcast floats to integers where possible. Remove the argument if you don’t want that.

From v0.24+, pandas introduces a Nullable Integer type, which allows
integers to coexist with NaNs. If you have integers in your column,
you can use
pd.__version__
# '0.24.1'

pd.to_numeric(s, errors='coerce').astype('Int32')

0      1
1      2
2      3
3      4
4    NaN
dtype: Int32
There are other options to choose from as well, read the docs for more.

Extension for `DataFrames`

If you need to extend this to DataFrames, you will need to apply it to each row. You can do this using DataFrame.apply.

# Setup.
np.random.seed(0)
df = pd.DataFrame({
    'A' : np.random.choice(10, 5), 
    'C' : np.random.choice(10, 5), 
    'B' : ['1', '###', '...', 50, '234'], 
    'D' : ['23', '1', '...', '268', '$$']}
)[list('ABCD')]
df

   A    B  C    D
0  5    1  9   23
1  0  ###  3    1
2  3  ...  5  ...
3  3   50  2  268
4  7  234  4   $$

df.dtypes

A     int64
B    object
C     int64
D    object
dtype: object

df2 = df.apply(pd.to_numeric, errors='coerce')
df2

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN

df2.dtypes

A      int64
B    float64
C      int64
D    float64
dtype: object

You can also do this with DataFrame.transform; although my tests indicate this is marginally slower:

df.transform(pd.to_numeric, errors='coerce')

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN

If you have many columns (numeric; non-numeric), you can make this a little more performant by applying pd.to_numeric on the non-numeric columns only.

df.dtypes.eq(object)

A    False
B     True
C    False
D     True
dtype: bool

cols = df.columns[df.dtypes.eq(object)]
# Actually, `cols` can be any list of columns you need to convert.
cols
# Index(['B', 'D'], dtype='object')

df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
# Alternatively,
# for c in cols:
#     df[c] = pd.to_numeric(df[c], errors='coerce')

df

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN

Applying pd.to_numeric along the columns (i.e., axis=0, the default) should be slightly faster for long DataFrames.

Method 2

In [30]: pd.Series([1,2,3,4,'.']).convert_objects(convert_numeric=True)
Out[30]: 
0     1
1     2
2     3
3     4
4   NaN
dtype: float64

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating