Normalize columns of pandas data frame

I have a dataframe in pandas where each column has a different value range. For example:

df:

A     B   C
1000  10  0.5
765   5   0.35
800   7   0.09

Any idea how I can normalize the columns of this dataframe where each value is between 0 and 1?

My desired output is:

A     B    C
1     1    1
0.765 0.5  0.7
0.8   0.7  0.18(which is 0.09/0.5)

Answers:


Method 1

One easy way is to use pandas directly. Here I use mean normalization (subtract the mean, divide by the standard deviation):

normalized_df=(df-df.mean())/df.std()

To use min-max normalization:

normalized_df=(df-df.min())/(df.max()-df.min())

Edit: To address some concerns: pandas applies these operations column-wise automatically in the code above.
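
As a minimal sketch, here are both formulas applied to the data from the question. The variable names are only illustrative. Dividing by the column maximum is included as well, since that is what reproduces the desired output in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1000, 765, 800],
    "B": [10, 5, 7],
    "C": [0.5, 0.35, 0.09],
})

# Standardization: each column ends up with mean 0 and sample std 1.
z_scored = (df - df.mean()) / df.std()

# Min-max: each column is rescaled so its minimum is 0 and its maximum is 1.
min_maxed = (df - df.min()) / (df.max() - df.min())

# Dividing by the column maximum gives the output asked for in the question.
max_scaled = df / df.max()
print(max_scaled)
```

Note that min-max scaling maps the smallest value of each column to 0, so it does not give the same numbers as dividing by the maximum unless the column minimum is already 0.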

Method 2

You can use the package sklearn and its associated preprocessing utilities to normalize the data.

import pandas as pd
from sklearn import preprocessing

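x = df.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
# pass columns and index so the labels are not replaced by 0, 1, 2, ...
df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)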

For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.

Method 3

Based on this post: https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range

You can do the following:

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

You don’t need to worry about whether your values are negative or positive: the results will always fall between 0 and 1.
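
A short usage sketch of the function above on hypothetical data that mixes negative and positive values. The sample frame is an assumption made up for illustration.

```python
import pandas as pd

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

# Hypothetical data: one column spans zero, the other is entirely negative.
df = pd.DataFrame({"a": [-5, 0, 5], "b": [-100, -50, -25]})
normalized = normalize(df)
print(normalized)
```

In both columns the smallest value becomes 0 and the largest becomes 1, regardless of sign.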

Method 4

Detailed Example of Normalization Methods

  • Pandas normalization (unbiased)
  • Sklearn normalization (biased)
  • Does biased-vs-unbiased affect Machine Learning?
  • Min-max scaling

References:
Wikipedia: Unbiased Estimation of Standard Deviation

Example Data

import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
print(df)
   A    B  C
0  1  100  a
1  2  300  b
2  3  500  c

Normalization using pandas (Gives unbiased estimates)

When normalizing we simply subtract the mean and divide by standard deviation.

df.iloc[:,0:-1] = df.iloc[:,0:-1].apply(lambda x: (x-x.mean())/ x.std(), axis=0)
print(df)
     A    B  C
0 -1.0 -1.0  a
1  0.0  0.0  b
2  1.0  1.0  c

Normalization using sklearn (Gives biased estimates, different from pandas)

If you do the same thing with sklearn you will get DIFFERENT output!

import pandas as pd

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
df.iloc[:,0:-1] = scaler.fit_transform(df.iloc[:,0:-1].to_numpy())
print(df)
          A         B  C
0 -1.224745 -1.224745  a
1  0.000000  0.000000  b
2  1.224745  1.224745  c

Do Biased Estimates in sklearn Make Machine Learning Less Powerful?

NO.

The official documentation of sklearn.preprocessing.scale states that using a biased estimator is UNLIKELY to affect the performance of machine learning algorithms, so we can safely use it.

From official documentation:

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.
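
The difference between the two results comes down to the ddof argument. As a sketch using pandas only (no sklearn call is made here), passing ddof=0 to std reproduces the 1.224745 values shown earlier:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [100, 300, 500]})

# pandas default: sample standard deviation (ddof=1).
unbiased = (df - df.mean()) / df.std()

# Population standard deviation (ddof=0), the estimator the quoted docs describe.
biased = (df - df.mean()) / df.std(ddof=0)

print(unbiased)
print(biased)
```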

What about MinMax Scaling?

There is no standard deviation calculation in MinMax scaling, so the result is the same in both pandas and scikit-learn.

import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
             })
(df - df.min()) / (df.max() - df.min())
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0


# Using sklearn
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
arr_scaled = scaler.fit_transform(df) 

print(arr_scaled)
[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]

df_scaled = pd.DataFrame(arr_scaled, columns=df.columns,index=df.index)
print(df_scaled)
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0

Method 5

Your problem is actually a simple transform acting on the columns:

def f(s):
    return s/s.max()

frame.apply(f, axis=0)

Or even more terse:

   frame.apply(lambda x: x/x.max(), axis=0)

Method 6

If you like using the sklearn package, you can keep the column and index names by using pandas loc like so:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
scaled_values = scaler.fit_transform(df) 
df.loc[:,:] = scaled_values

Method 7

Take care with this answer, as it ONLY works for data in the range [0, n]. It does not work for arbitrary ranges of data.


Simple is Beautiful:

df["A"] = df["A"] / df["A"].max()
df["B"] = df["B"] / df["B"].max()
df["C"] = df["C"] / df["C"].max()
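
To illustrate the caveat, here is a small sketch with a hypothetical all-negative column. Dividing by the maximum leaves values outside [0, 1], while min-max scaling does not.

```python
import pandas as pd

# Hypothetical column containing only negative values.
s = pd.Series([-10, -30, -50])

# The maximum is -10, so dividing by it produces values larger than 1.
by_max = s / s.max()

# Min-max scaling still maps the column onto [0, 1].
min_max = (s - s.min()) / (s.max() - s.min())

print(by_max.tolist())
print(min_max.tolist())
```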

Method 8

You can create a list of columns that you want to normalize

from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
column_names_to_normalize = ['A', 'E', 'G', 'sadasdsd', 'lol']
x = df[column_names_to_normalize].values
x_scaled = min_max_scaler.fit_transform(x)
df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index=df.index)
df[column_names_to_normalize] = df_temp

Your pandas DataFrame is now normalized only in the columns you selected.

If you want the opposite, that is, a list of columns that you DON’T want to normalize, you can build the list of all columns and remove the undesired ones:

column_names_to_not_normalize = ['B', 'J', 'K']
column_names_to_normalize = [x for x in list(df) if x not in column_names_to_not_normalize ]

Method 9

I think that a better way to do that in pandas is just

df = df/df.max().astype(np.float64)

Edit: If negative numbers are present in your data frame, divide by the largest absolute value of each column instead (the results then lie in [-1, 1]):

df = df/df.abs().max().astype(np.float64)

Method 10

The solution given by Sandman and Praveen works very well. The only problem is that if you have categorical variables in other columns of your data frame, this method needs some adjustments.

My solution to this type of issue is the following:

from sklearn import preprocessing

# Scale only the numerical columns, keeping them as a 2-D frame
num_cols = ['Numerical1', 'Numerical2', 'Numerical3']
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df[num_cols])
x_new = pd.DataFrame(x_scaled, columns=num_cols, index=df.index)
# Put the categorical column(s) back next to the scaled values
df = pd.concat([df.Categoricals, x_new], axis=1)

Method 11

You might want to normalize some columns and leave others unchanged, as in regression tasks where the data labels or categorical columns must stay as they are. So I suggest this pythonic way (it’s a combination of the answers by @shg and @Cina):

features_to_normalize = ['A', 'B', 'C']
# could be ['A','B'] 

df[features_to_normalize] = df[features_to_normalize].apply(lambda x:(x-x.min()) / (x.max()-x.min()))

Method 12

df_normalized = df / df.max(axis=0)

Method 13

It is only simple mathematics. The answer should be as simple as below.

normed_df = (df - df.min()) / (df.max() - df.min())

Method 14

This is how you do it column-wise using list comprehension:

[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]

Method 15

You can simply use the pandas.DataFrame.transform function in this way:

df.transform(lambda x: x/x.max())

Method 16

def normalize(x):
    try:
        x = x/np.linalg.norm(x,ord=1)
        return x
    except :
        raise
data = pd.DataFrame.apply(data,normalize)

According to the pandas documentation, a DataFrame can apply an operation (function) to itself.

DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)

Applies function along input axis of DataFrame.
Objects passed to functions are Series objects having index either the DataFrame’s index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty.

You can apply a custom function to operate on the DataFrame.
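
A self-contained sketch of the same idea, on hypothetical data. With ord=1, np.linalg.norm returns the sum of absolute values, so after normalization the absolute values in each column sum to 1. Negative entries stay negative, so the result is not confined to [0, 1].

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
data = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [-10.0, 30.0, 60.0]})

def normalize(x):
    # L1 norm of a column: the sum of the absolute values of its entries.
    return x / np.linalg.norm(x, ord=1)

# apply() passes each column to normalize() as a Series (axis=0 is the default).
normalized = data.apply(normalize)
print(normalized)
```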

Method 17

The following function calculates the Z score:

def standardization(dataset):
  """ Standardization of numeric fields, where all values will have mean of zero 
  and standard deviation of one. (z-score)

  Args:
    dataset: A `Pandas.Dataframe` 
  """
  dtypes = list(zip(dataset.dtypes.index, map(str, dataset.dtypes)))
  # Normalize numeric columns.
  for column, dtype in dtypes:
      if dtype == 'float32':
          dataset[column] -= dataset[column].mean()
          dataset[column] /= dataset[column].std()
  return dataset
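
The function above only touches columns whose dtype is exactly float32. Below is a minimal sketch of a variant that z-scores every numeric column and returns a copy, assuming that is the behaviour you want. The name standardize_numeric is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

def standardize_numeric(dataset):
    """Return a copy with every numeric column z-scored (hypothetical helper)."""
    result = dataset.copy()
    cols = result.select_dtypes(include=np.number).columns
    # Subtract the column mean and divide by the sample standard deviation.
    result[cols] = (result[cols] - result[cols].mean()) / result[cols].std()
    return result

# Hypothetical sample data with an integer, a float and a string column.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10.0, 20.0, 40.0], "c": ["x", "y", "z"]})
out = standardize_numeric(df)
print(out)
```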

Method 18

You can use minmax_scale to transform each column to a scale from 0-1.

Normalize all columns

from sklearn.preprocessing import minmax_scale
df[:] = minmax_scale(df)

Normalize single column

from sklearn.preprocessing import minmax_scale
df['a'] = minmax_scale(df['a'])

Normalize only numerical columns:

import numpy as np
from sklearn.preprocessing import minmax_scale
cols = df.select_dtypes(np.number).columns
df[cols] = minmax_scale(df[cols])

Full example:

# Prep
import pandas as pd
import numpy as np
from sklearn.preprocessing import minmax_scale

# Sample data
df = pd.DataFrame({'a':[0,1,2], 'b':[-10,-30,-50], 'c':['x', 'y', 'z']})

# MinMax normalize all numeric columns
cols = df.select_dtypes(np.number).columns
df[cols] = minmax_scale(df[cols])

# Result
print(df)

#      a    b  c
# 0  0.0  1.0  x
# 1  0.5  0.5  y
# 2  1.0  0.0  z

Note: The index, column names and non-numerical variables are kept unchanged. The function is applied to each column separately.


Method 19

You can do this in one line

DF_test = DF_test.sub(DF_test.mean(axis=0), axis=1)/DF_test.mean(axis=0)

It takes the mean of each column, subtracts it from every row (each column's mean is subtracted only from that column's values), and then divides by the same mean. The result is the normalized data set.
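
A small sketch of what this one-liner produces, on hypothetical data. The values are centred on 0 and expressed as a fraction of the column mean, so they are not confined to the range 0 to 1.

```python
import pandas as pd

# Hypothetical sample data.
DF_test = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [100.0, 300.0, 500.0]})

col_mean = DF_test.mean(axis=0)

# Subtract each column's mean from that column, then divide by the same mean.
result = DF_test.sub(col_mean, axis=1) / col_mean
print(result)
```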

Method 20

Pandas does column wise normalization by default. Try the code below.

X = pd.read_csv(r'.\data.csv')
X = (X-X.min())/(X.max()-X.min())

The output values will be in the range 0 to 1.

Method 21

Use the apply function with a lambda to run the normalization on each column:

def normalize(df_col):

  # Condition to exclude 'ID' and 'Class' feature
  if (str(df_col.name) != str('ID') and str(df_col.name)!=str('Class')):
        max_value = df_col.max()
        min_value = df_col.min()

        #It avoids NaN and return 0 instead
        if max_value == min_value:
          return 0

        sub_value = max_value - min_value
        return np.divide(np.subtract(df_col,min_value),sub_value)
  else:
        return df_col

df_normalize = df.apply(lambda x: normalize(x))
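
The guard against max_value == min_value can also be written in vectorized form. This is a sketch on hypothetical data, not the answer's own method: a zero range is replaced by 1 so that constant columns become 0 instead of NaN.

```python
import pandas as pd

# Hypothetical data: the "const" column has the same value in every row.
df = pd.DataFrame({"a": [1, 2, 3], "const": [5, 5, 5]})

col_range = df.max() - df.min()

# Keep non-zero ranges as they are; replace a zero range with 1.
safe_range = col_range.where(col_range != 0, 1)

scaled = (df - df.min()) / safe_range
print(scaled)
```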

Method 22

If your data is positively skewed, a log transformation is a common way to normalize it (the values must be strictly positive):

df = np.log10(df)
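
Note that a log transformation reduces skew but does not by itself put values between 0 and 1. A minimal sketch on hypothetical data, assuming you want to follow it with min-max scaling:

```python
import numpy as np
import pandas as pd

# Hypothetical, strictly positive, right-skewed data.
df = pd.DataFrame({"income": [1, 10, 100, 1000]})

# Base-10 logarithm of every value; the result is still a DataFrame.
logged = np.log10(df)

# Optional follow-up: min-max scale the logged values onto [0, 1].
scaled = (logged - logged.min()) / (logged.max() - logged.min())
print(scaled)
```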


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
