I have the following table. I want to calculate a weighted average grouped by each date based on the formula below. I can do this using some standard conventional code, but assuming that this data is in a pandas dataframe, is there any easier way to achieve this rather than through iteration?
Date ID wt value w_avg 01/01/2012 100 0.50 60 0.791666667 01/01/2012 101 0.75 80 01/01/2012 102 1.00 100 01/02/2012 201 0.50 100 0.722222222 01/02/2012 202 1.00 80
01/01/2012 w_avg = 0.5 * ( 60/ sum(60,80,100)) + .75 * (80/
sum(60,80,100)) + 1.0 * (100/sum(60,80,100))01/02/2012 w_avg = 0.5 * ( 100/ sum(100,80)) + 1.0 * ( 80/
sum(100,80))
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Let’s first create the example pandas dataframe:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: index = pd.Index(['01/01/2012','01/01/2012','01/01/2012','01/02/2012','01/02/2012'], name='Date')
In [4]: df = pd.DataFrame({'ID':[100,101,102,201,202],'wt':[.5,.75,1,.5,1],'value':[60,80,100,100,80]},index=index)
Then, the average of ‘wt’ weighted by ‘value’ and grouped by the index is obtained as:
In [5]: df.groupby(df.index).apply(lambda x: np.average(x.wt, weights=x.value)) Out[5]: Date 01/01/2012 0.791667 01/02/2012 0.722222 dtype: float64
Alternatively, one can also define a function:
In [5]: def grouped_weighted_avg(values, weights, by): ...: return (values * weights).groupby(by).sum() / weights.groupby(by).sum() In [6]: grouped_weighted_avg(values=df.wt, weights=df.value, by=df.index) Out[6]: Date 01/01/2012 0.791667 01/02/2012 0.722222 dtype: float64
Method 2
I think I would do this with two groupbys.
First to calculate the “weighted average”:
In [11]: g = df.groupby('Date')
In [12]: df.value / g.value.transform("sum") * df.wt
Out[12]:
0 0.125000
1 0.250000
2 0.416667
3 0.277778
4 0.444444
dtype: float64
If you set this as a column, you can groupby over it:
In [13]: df['wa'] = df.value / g.value.transform("sum") * df.wt
Now the sum of this column is the desired:
In [14]: g.wa.sum() Out[14]: Date 01/01/2012 0.791667 01/02/2012 0.722222 Name: wa, dtype: float64
or potentially:
In [15]: g.wa.transform("sum")
Out[15]:
0 0.791667
1 0.791667
2 0.791667
3 0.722222
4 0.722222
Name: wa, dtype: float64
Method 3
I feel the following is an elegant solution to this problem from:(Pandas DataFrame aggregate function using multiple columns)
grouped = df.groupby('Date')
def wavg(group):
d = group['value']
w = group['wt']
return (d * w).sum() / w.sum()
grouped.apply(wavg)
Method 4
I saved the table in the .csv file
df=pd.read_csv('book1.csv')
grouped=df.groupby('Date')
g_wavg= lambda x: np.average(x.wt, weights=x.value)
grouped.apply(g_wavg)
Method 5
If speed is an important factor for you, vectorizing is critical. Thus, based on the answer by Andy Hayden, here is a solution using only Pandas native functions:
def weighted_mean(df, values, weights, groupby):
df = df.copy()
grouped = df.groupby(groupby)
df['weighted_average'] = df[values] / grouped[weights].transform('sum') * df[weights]
return grouped['weighted_average'].sum(min_count=1) #min_count is required for Grouper objects
In comparison, using a custom lambda function is less code, but slower:
import numpy as np
def weighted_mean_by_lambda(df, values, weights, groupby):
return df.groupby(groupby).apply(lambda x: np.average(x[values], weights=x[weights]))
Speed test:
import time
import numpy as np
import pandas as pd
n = 100000000
df = pd.DataFrame({
'values': np.random.uniform(0, 1, size=n),
'weights': np.random.randint(0, 5, size=n),
'groupby': np.random.randint(0, 10000, size=n),
})
time1 = time.time()
weighted_mean(df, 'values', 'weights', 'groupby')
print('Time for `weighted_mean`:', time.time() - time1)
time2 = time.time()
weighted_mean_by_lambda(df, 'values', 'weights', 'groupby')
print('Time for `weighted_mean_by_lambda`:', time.time() - time2)
Speed test output:
Time for `weighted_mean`: 3.4519572257995605 Time for `weighted_mean_by_lambda`: 11.41335940361023
Method 6
Another option to an old question (using @kadee sample data):
(df.drop(columns='ID')
.assign(wt = lambda df: df.prod(1)) # product of weight and value
.groupby('Date').sum()
.pipe(lambda df: df.wt / df.value) # weighted average computation
)
Date
01/01/2012 0.791667
01/02/2012 0.722222
Should be much faster if the computation is done without the use of anonymous functions:
temp = df.drop(columns='ID')
temp = temp.assign(wt = temp.prod(1)).groupby('Date').sum()
temp.wt / temp.value
Date
01/01/2012 0.791667
01/02/2012 0.722222
dtype: float64
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0