I have a pandas dataframe with mixed type columns, and I’d like to apply sklearn’s min_max_scaler to some of the columns. Ideally, I’d like to do these transformations in place, but haven’t figured out a way to do that yet. I’ve written the following code that works:
import pandas as pd
import numpy as np
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],'B':[103.02,107.26,110.35,114.23,114.68], 'C':['big','small','big','small','small']})
min_max_scaler = preprocessing.MinMaxScaler()
def scaleColumns(df, cols_to_scale):
for col in cols_to_scale:
df[col] = pd.DataFrame(min_max_scaler.fit_transform(pd.DataFrame(dfTest[col])),columns=[col])
return df
dfTest
A B C
0 14.00 103.02 big
1 90.20 107.26 small
2 90.95 110.35 big
3 96.27 114.23 small
4 91.21 114.68 small
scaled_df = scaleColumns(dfTest,['A','B'])
scaled_df
A B C
0 0.000000 0.000000 big
1 0.926219 0.363636 small
2 0.935335 0.628645 big
3 1.000000 0.961407 small
4 0.938495 1.000000 small
I’m curious if this is the preferred/most efficient way to do this transformation. Is there a way I could use df.apply that would be better?
I’m also surprised I can’t get the following code to work:
bad_output = min_max_scaler.fit_transform(dfTest['A'])
If I pass an entire dataframe to the scaler it works:
dfTest2 = dfTest.drop('C', axis = 1)
good_output = min_max_scaler.fit_transform(dfTest2)
good_output
I’m confused why passing a series to the scaler fails. In my full working code above I had hoped to just pass a series to the scaler then set the dataframe column = to the scaled series.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
I am not sure if previous versions of pandas prevented this but now the following snippet works perfectly for me and produces exactly what you want without having to use apply
>>> import pandas as pd
>>> from sklearn.preprocessing import MinMaxScaler
>>> scaler = MinMaxScaler()
>>> dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],
'B':[103.02,107.26,110.35,114.23,114.68],
'C':['big','small','big','small','small']})
>>> dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A', 'B']])
>>> dfTest
A B C
0 0.000000 0.000000 big
1 0.926219 0.363636 small
2 0.935335 0.628645 big
3 1.000000 0.961407 small
4 0.938495 1.000000 small
Method 2
Like this?
dfTest = pd.DataFrame({
'A':[14.00,90.20,90.95,96.27,91.21],
'B':[103.02,107.26,110.35,114.23,114.68],
'C':['big','small','big','small','small']
})
dfTest[['A','B']] = dfTest[['A','B']].apply(
lambda x: MinMaxScaler().fit_transform(x))
dfTest
A B C
0 0.000000 0.000000 big
1 0.926219 0.363636 small
2 0.935335 0.628645 big
3 1.000000 0.961407 small
4 0.938495 1.000000 small
Method 3
df = pd.DataFrame(scale.fit_transform(df.values), columns=df.columns, index=df.index)
This should work without depreciation warnings.
Method 4
As it is being mentioned in pir’s comment – the .apply(lambda el: scale.fit_transform(el)) method will produce the following warning:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19. Reshape your data either using
X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1)
if it contains a single sample.
Converting your columns to numpy arrays should do the job (I prefer StandardScaler):
from sklearn.preprocessing import StandardScaler scale = StandardScaler() dfTest[['A','B','C']] = scale.fit_transform(dfTest[['A','B','C']].as_matrix())
— Edit Nov 2018 (Tested for pandas 0.23.4)–
As Rob Murray mentions in the comments, in the current (v0.23.4) version of pandas .as_matrix() returns FutureWarning. Therefore, it should be replaced by .values:
from sklearn.preprocessing import StandardScaler scaler = StandardScaler() scaler.fit_transform(dfTest[['A','B']].values)
— Edit May 2019 (Tested for pandas 0.24.2)–
As joelostblom mentions in the comments, “Since 0.24.0, it is recommended to use .to_numpy() instead of .values.”
Updated example:
import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
dfTest = pd.DataFrame({
'A':[14.00,90.20,90.95,96.27,91.21],
'B':[103.02,107.26,110.35,114.23,114.68],
'C':['big','small','big','small','small']
})
dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A','B']].to_numpy())
dfTest
A B C
0 -1.995290 -1.571117 big
1 0.436356 -0.603995 small
2 0.460289 0.100818 big
3 0.630058 0.985826 small
4 0.468586 1.088469 small
Method 5
You can do it using pandas only:
In [235]:
dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],'B':[103.02,107.26,110.35,114.23,114.68], 'C':['big','small','big','small','small']})
df = dfTest[['A', 'B']]
df_norm = (df - df.min()) / (df.max() - df.min())
print df_norm
print pd.concat((df_norm, dfTest.C),1)
A B
0 0.000000 0.000000
1 0.926219 0.363636
2 0.935335 0.628645
3 1.000000 0.961407
4 0.938495 1.000000
A B C
0 0.000000 0.000000 big
1 0.926219 0.363636 small
2 0.935335 0.628645 big
3 1.000000 0.961407 small
4 0.938495 1.000000 small
Method 6
I know it’s a very old comment, but still:
Instead of using single bracket (dfTest['A']), use double brackets (dfTest[['A']]).
i.e: min_max_scaler.fit_transform(dfTest[['A']]).
I believe this will give the desired result.
Method 7
(Tested for pandas 1.0.5)
Based on @athlonshi answer (it had ValueError: could not convert string to float: ‘big’, on C column), full working example without warning:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
scale = preprocessing.MinMaxScaler()
df = pd.DataFrame({
'A':[14.00,90.20,90.95,96.27,91.21],
'B':[103.02,107.26,110.35,114.23,114.68],
'C':['big','small','big','small','small']
})
print(df)
df[["A","B"]] = pd.DataFrame(scale.fit_transform(df[["A","B"]].values), columns=["A","B"], index=df.index)
print(df)
A B C
0 14.00 103.02 big
1 90.20 107.26 small
2 90.95 110.35 big
3 96.27 114.23 small
4 91.21 114.68 small
A B C
0 0.000000 0.000000 big
1 0.926219 0.363636 small
2 0.935335 0.628645 big
3 1.000000 0.961407 small
4 0.938495 1.000000 small
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0