I have a dataframe in pandas where each column has a different value range. For example:
df:
A     B   C
1000  10  0.5
765   5   0.35
800   7   0.09
Any idea how I can normalize the columns of this dataframe so that each value is between 0 and 1?
My desired output is:
A      B    C
1      1    1
0.765  0.5  0.7
0.8    0.7  0.18 (which is 0.09/0.5)
Answers:
Method 1
One easy way using pandas (here I use z-score standardization: subtract the mean, then divide by the standard deviation):
normalized_df=(df-df.mean())/df.std()
To use min-max normalization:
normalized_df=(df-df.min())/(df.max()-df.min())
Edit: To address some concerns: pandas automatically applies the operations in the code above column-wise.
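As a rough check of the column-wise behaviour described above, here is a minimal sketch that applies both formulas to the data from the question. The variable names `standardized` and `min_max` are chosen only for this illustration.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "A": [1000, 765, 800],
    "B": [10, 5, 7],
    "C": [0.5, 0.35, 0.09],
})

# df.mean() and df.std() return one value per column; the arithmetic
# aligns those values with df's columns, so each column is scaled on its own.
standardized = (df - df.mean()) / df.std()

# Min-max normalization: each column is mapped onto [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

print(standardized)
print(min_max)
```

After standardization each column has mean 0 and (sample) standard deviation 1; after min-max scaling each column has minimum 0 and maximum 1.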
Method 2
You can use the package sklearn and its associated preprocessing utilities to normalize the data.
import pandas as pd
from sklearn import preprocessing

x = df.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.
Method 3
Based on this post: https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range
You can do the following:
def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result
You don't need to worry about whether your values are negative or positive: every column ends up spread between 0 and 1.
Method 4
Detailed Example of Normalization Methods
- Pandas normalization (unbiased)
- Sklearn normalization (biased)
- Does biased-vs-unbiased affect Machine Learning?
- Min-max scaling
References:
Wikipedia: Unbiased Estimation of Standard Deviation
Example Data
import pandas as pd
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
'C':list('abc')
})
print(df)
A B C
0 1 100 a
1 2 300 b
2 3 500 c
Normalization using pandas (Gives unbiased estimates)
When normalizing we simply subtract the mean and divide by standard deviation.
df.iloc[:,0:-1] = df.iloc[:,0:-1].apply(lambda x: (x-x.mean())/ x.std(), axis=0)
print(df)
A B C
0 -1.0 -1.0 a
1 0.0 0.0 b
2 1.0 1.0 c
Normalization using sklearn (Gives biased estimates, different from pandas)
If you do the same thing with sklearn you will get DIFFERENT output!
import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
'C':list('abc')
})
df.iloc[:,0:-1] = scaler.fit_transform(df.iloc[:,0:-1].to_numpy())
print(df)
A B C
0 -1.224745 -1.224745 a
1 0.000000 0.000000 b
2 1.224745 1.224745 c
Do the biased estimates of sklearn make machine learning less powerful?
NO.
The official documentation of sklearn.preprocessing.scale states that using a biased estimator is UNLIKELY to affect the performance of machine learning algorithms, so we can safely use it.
From official documentation:
We use a biased estimator for the standard deviation, equivalent to
numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.
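The difference between the two outputs above comes down to the `ddof` argument, and it can be reproduced with pandas alone. The following is a minimal sketch, not the scikit-learn implementation; `z_sample` and `z_population` are names chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [100, 300, 500]})

# pandas default: sample standard deviation (ddof=1)
z_sample = (df - df.mean()) / df.std()

# Population standard deviation (ddof=0), the estimator described
# in the quoted scikit-learn documentation
z_population = (df - df.mean()) / df.std(ddof=0)

print(z_sample)      # first row is -1.0 in both columns
print(z_population)  # first row is about -1.224745 in both columns
```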
What about MinMax Scaling?
There is no standard deviation calculation in MinMax scaling, so the result is the same in both pandas and scikit-learn.
import pandas as pd
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
})
(df - df.min()) / (df.max() - df.min())
A B
0 0.0 0.0
1 0.5 0.5
2 1.0 1.0
# Using sklearn
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
arr_scaled = scaler.fit_transform(df)
print(arr_scaled)
[[0. 0. ]
[0.5 0.5]
[1. 1. ]]
df_scaled = pd.DataFrame(arr_scaled, columns=df.columns,index=df.index)
print(df_scaled)
A B
0 0.0 0.0
1 0.5 0.5
2 1.0 1.0
Method 5
Your problem is actually a simple transform acting on the columns:
def f(s):
    return s/s.max()

frame.apply(f, axis=0)
Or even more terse:
frame.apply(lambda x: x/x.max(), axis=0)
Method 6
If you like using the sklearn package, you can keep the column and index names by using pandas loc like so:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_values = scaler.fit_transform(df)
df.loc[:,:] = scaled_values
Method 7
Take care with this answer: it ONLY works for data in the range [0, n]. It does not work for an arbitrary range of data.
Simple is Beautiful:
df["A"] = df["A"] / df["A"].max()
df["B"] = df["B"] / df["B"].max()
df["C"] = df["C"] / df["C"].max()
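To illustrate the caveat at the top of this method, here is a small sketch with a hypothetical column that contains negative values. Dividing by the maximum leaves values outside [0, 1], whereas the min-max formula from the other answers does not.

```python
import pandas as pd

# Hypothetical column containing negative values
s = pd.Series([-5.0, 0.0, 5.0])

# Dividing by the maximum: gives -1.0, 0.0, 1.0, so not within [0, 1]
by_max = s / s.max()

# Min-max scaling: gives 0.0, 0.5, 1.0
min_max = (s - s.min()) / (s.max() - s.min())

print(by_max.tolist())
print(min_max.tolist())
```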
Method 8
You can create a list of the columns that you want to normalize:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
column_names_to_normalize = ['A', 'E', 'G', 'sadasdsd', 'lol']
x = df[column_names_to_normalize].values
x_scaled = min_max_scaler.fit_transform(x)
df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index=df.index)
df[column_names_to_normalize] = df_temp
Your pandas DataFrame is now normalized only in the columns you want.
However, if you want the opposite, that is, to list the columns that you DON'T want to normalize, you can simply build a list of all columns and remove the undesired ones:
column_names_to_not_normalize = ['B', 'J', 'K']
column_names_to_normalize = [x for x in list(df) if x not in column_names_to_not_normalize]
Method 9
I think that a better way to do that in pandas is just
df = df/df.max().astype(np.float64)
Edit: If your data frame contains negative numbers, you should instead divide each column by its largest absolute value, which scales the data to [-1, 1]:
df = df/df.abs().max().astype(np.float64)
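One way to handle negative numbers is to divide each column by its largest absolute value, which maps the data into [-1, 1] rather than [0, 1]. This is a minimal sketch on made-up data, not necessarily what the answer's author intended.

```python
import pandas as pd

# Hypothetical data with negative values
df = pd.DataFrame({"A": [-4.0, 2.0, 1.0], "B": [10.0, -5.0, 20.0]})

# df.abs().max() is the largest absolute value of each column
scaled = df / df.abs().max()

print(scaled)
# A: -1.0, 0.5, 0.25
# B: 0.5, -0.25, 1.0
```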
Method 10
The solutions given by Sandman and Praveen work very well. The only problem is that if you have categorical variables in other columns of your data frame, this method needs some adjustments.
My solution to this type of issue is the following:
from sklearn import preprocessing

# Select the numerical columns side by side (axis=1)
x = pd.concat([df.Numerical1, df.Numerical2, df.Numerical3], axis=1)
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
x_new = pd.DataFrame(x_scaled, columns=x.columns, index=x.index)
# Put the categorical columns back next to the scaled ones
df = pd.concat([df.Categoricals, x_new], axis=1)
Method 11
You might want to normalize some columns and leave the others unchanged, as in regression tasks where the data labels or categorical columns must stay as they are. So I suggest this pythonic way (it's a combination of @shg's and @Cina's answers):
features_to_normalize = ['A', 'B', 'C']  # could be ['A', 'B']
df[features_to_normalize] = df[features_to_normalize].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
Method 12
df_normalized = df / df.max(axis=0)
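Applied to the data from the question, this one-liner appears to reproduce the desired output exactly, since the asker divides each value by its column maximum. A minimal sketch:

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "A": [1000, 765, 800],
    "B": [10, 5, 7],
    "C": [0.5, 0.35, 0.09],
})

# Divide every column by its own maximum
df_normalized = df / df.max(axis=0)

print(df_normalized)
```

Note that this only lands in [0, 1] because all values in the question are non-negative.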
Method 13
It is only simple mathematics. The answer should be as simple as below.
normed_df = (df - df.min()) / (df.max() - df.min())
Method 14
This is how you do it column-wise using list comprehension:
[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]
Method 15
You can simply use the pandas.DataFrame.transform function in this way:
df.transform(lambda x: x/x.max())
Method 16
def normalize(x):
    try:
        x = x/np.linalg.norm(x,ord=1)
        return x
    except :
        raise

data = pd.DataFrame.apply(data,normalize)
According to the pandas documentation, a DataFrame can apply an operation (function) to itself.
DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
Applies function along input axis of DataFrame.
Objects passed to functions are Series objects having index either the DataFrame’s index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty.
You can apply a custom function to operate on the DataFrame.
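Note that dividing by the L1 norm is a different kind of normalization from min-max scaling: the absolute values of each column end up summing to 1. A minimal sketch on made-up data, where `l1_normalize` is a name chosen for this illustration:

```python
import numpy as np
import pandas as pd

def l1_normalize(col):
    """Divide a column by its L1 norm (the sum of its absolute values)."""
    return col / np.linalg.norm(col, ord=1)

# Hypothetical data
data = pd.DataFrame({"A": [1.0, 2.0, 1.0], "B": [-2.0, 2.0, 4.0]})

# apply() passes each column to the function as a Series (axis=0 is the default)
normalized = data.apply(l1_normalize)

print(normalized)
print(normalized.abs().sum())  # 1.0 for each column
```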
Method 17
The following function calculates the Z score:
def standardization(dataset):
    """ Standardization of numeric fields, where all values will have mean of zero
    and standard deviation of one. (z-score)
    Args:
        dataset: A `Pandas.Dataframe`
    """
    dtypes = list(zip(dataset.dtypes.index, map(str, dataset.dtypes)))
    # Normalize numeric columns.
    for column, dtype in dtypes:
        if dtype == 'float32':
            dataset[column] -= dataset[column].mean()
            dataset[column] /= dataset[column].std()
    return dataset
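The function above only touches columns whose dtype is exactly float32. If you want every numeric column standardized, one possible variant (a sketch, with `standardize_numeric` being a hypothetical name) selects columns with `select_dtypes` and returns a copy instead of modifying the input:

```python
import pandas as pd

def standardize_numeric(dataset):
    """Return a copy in which every numeric column is converted to z-scores."""
    result = dataset.copy()
    numeric_cols = result.select_dtypes(include="number").columns
    numeric = result[numeric_cols]
    # Uses the pandas default (sample) standard deviation, ddof=1
    result[numeric_cols] = (numeric - numeric.mean()) / numeric.std()
    return result

# Hypothetical data with int, float and string columns
df = pd.DataFrame({"a": [1, 2, 3], "b": [10.0, 20.0, 60.0], "c": list("xyz")})
out = standardize_numeric(df)
print(out)
```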
Method 18
You can use minmax_scale to transform each column to a scale from 0-1.
Normalize all columns:
from sklearn.preprocessing import minmax_scale

df[:] = minmax_scale(df)
Normalize single column:
from sklearn.preprocessing import minmax_scale

df['a'] = minmax_scale(df['a'])
Normalize only numerical columns:
import numpy as np
from sklearn.preprocessing import minmax_scale

cols = df.select_dtypes(np.number).columns
df[cols] = minmax_scale(df[cols])
Full example:
# Prep
import pandas as pd
import numpy as np
from sklearn.preprocessing import minmax_scale
# Sample data
df = pd.DataFrame({'a':[0,1,2], 'b':[-10,-30,-50], 'c':['x', 'y', 'z']})
# MinMax normalize all numeric columns
cols = df.select_dtypes(np.number).columns
df[cols] = minmax_scale(df[cols])
# Result
print(df)
# a b c
# 0 0.0 1.0 x
# 1 0.5 0.5 y
# 2 1.0 0.0 z
Note: The index, column names and non-numerical variables are kept unchanged. The function is applied to each column separately.
More info on standardization and normalization:
- https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
- https://en.wikipedia.org/wiki/Normalization_(statistics)
- https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
Method 19
You can do this in one line
DF_test = DF_test.sub(DF_test.mean(axis=0), axis=1)/DF_test.mean(axis=0)
It takes the mean of each column, subtracts that mean from every value in the same column, and then divides by that same mean. What we finally get is the normalized data set.
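Be aware that this formula expresses each value as a relative deviation from its column mean, so the result is centred on 0 and is not confined to [0, 1]. A minimal sketch on made-up data (the name `centered` is only for this illustration):

```python
import pandas as pd

# Hypothetical data
DF_test = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [100.0, 300.0, 500.0]})

# (value - column mean) / column mean
centered = DF_test.sub(DF_test.mean(axis=0), axis=1) / DF_test.mean(axis=0)

print(centered)
# A: -0.5, 0.0, 0.5
# B: about -0.667, 0.0, 0.667
```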
Method 20
pandas applies these operations column-wise by default. Try the code below.
X = pd.read_csv('./data.csv')
X = (X - X.min()) / (X.max() - X.min())
The output values will be in the range 0 to 1.
Method 21
Use the apply function with a lambda, which speeds up the process:
def normalize(df_col):
    # Condition to exclude 'ID' and 'Class' feature
    if (str(df_col.name) != str('ID') and str(df_col.name)!=str('Class')):
        max_value = df_col.max()
        min_value = df_col.min()
        #It avoids NaN and return 0 instead
        if max_value == min_value:
            return 0
        sub_value = max_value - min_value
        return np.divide(np.subtract(df_col,min_value),sub_value)
    else:
        return df_col

df_normalize = df.apply(lambda x :normalize(x))
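The guard against `max_value == min_value` matters because a constant column would otherwise produce 0/0 and turn into NaN. A vectorized variant of the same idea might look like the sketch below; it assumes every column is numeric, and `min_max_safe` is a hypothetical name.

```python
import pandas as pd

def min_max_safe(df):
    """Min-max scale numeric columns; constant columns become 0 instead of NaN."""
    value_range = df.max() - df.min()
    constant = value_range == 0
    # For constant columns divide by 1 instead of 0, so (x - min) / 1 == 0
    return (df - df.min()) / value_range.where(~constant, 1)

# Hypothetical data with one constant column
df = pd.DataFrame({"varying": [1.0, 2.0, 3.0], "constant": [7.0, 7.0, 7.0]})
df_normalize = min_max_safe(df)
print(df_normalize)
```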
Method 22
If your data is positively skewed, a common way to normalize it is a log transformation (all values must be strictly positive):
df = np.log10(df)
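A log transformation reduces skew but does not by itself put the values between 0 and 1. If that range is also required, one option is to follow it with min-max scaling, as in this sketch on made-up, strictly positive data (the column name `income` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical positively skewed, strictly positive data
df = pd.DataFrame({"income": [1.0, 10.0, 100.0, 10000.0]})

# log10 compresses the long right tail: 0, 1, 2, 4
logged = np.log10(df)

# Min-max scaling then maps the logged values onto [0, 1]
rescaled = (logged - logged.min()) / (logged.max() - logged.min())

print(rescaled)
```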
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.