numpy.average() has a weights option, but numpy.std() does not. Does anyone have suggestions for a workaround?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
How about the following short “manual calculation”?
def weighted_avg_and_std(values, weights):
"""
Return the weighted average and standard deviation.
values, weights -- Numpy ndarrays with the same shape.
"""
average = numpy.average(values, weights=weights)
# Fast and numerically precise:
variance = numpy.average((values-average)**2, weights=weights)
return (average, math.sqrt(variance))
Method 2
There is a class in statsmodels that makes it easy to calculate weighted statistics: statsmodels.stats.weightstats.DescrStatsW.
Assuming this dataset and weights:
import numpy as np from statsmodels.stats.weightstats import DescrStatsW array = np.array([1,2,1,2,1,2,1,3]) weights = np.ones_like(array) weights[3] = 100
You initialize the class (note that you have to pass in the correction factor, the delta degrees of freedom at this point):
weighted_stats = DescrStatsW(array, weights=weights, ddof=0)
Then you can calculate:
-
.meanthe weighted mean:>>> weighted_stats.mean 1.97196261682243
-
.stdthe weighted standard deviation:>>> weighted_stats.std 0.21434289609681711
-
.varthe weighted variance:>>> weighted_stats.var 0.045942877107170932
-
.std_meanthe standard error of weighted mean:>>> weighted_stats.std_mean 0.020818822467555047
Just in case you’re interested in the relation between the standard error and the standard deviation: The standard error is (for
ddof == 0) calculated as the weighted standard deviation divided by the square root of the sum of the weights minus 1 (corresponding source forstatsmodelsversion 0.9 on GitHub):standard_error = standard_deviation / sqrt(sum(weights) - 1)
Method 3
Here’s one more option:
np.sqrt(np.cov(values, aweights=weights))
Method 4
There doesn’t appear to be such a function in numpy/scipy yet, but there is a ticket proposing this added functionality. Included there you will find Statistics.py which implements weighted standard deviations.
Method 5
There is a very good example proposed by gaborous:
import pandas as pd import numpy as np # X is the dataset, as a Pandas' DataFrame mean = mean = np.ma.average(X, axis=0, weights=weights) # Computing the weighted sample mean (fast, efficient and precise) # Convert to a Pandas' Series (it's just aesthetic and more # ergonomic; no difference in computed values) mean = pd.Series(mean, index=list(X.keys())) xm = X-mean # xm = X diff to mean xm = xm.fillna(0) # fill NaN with 0 (because anyway a variance of 0 is just void, but at least it keeps the other covariance's values computed correctly)) sigma2 = 1./(w.sum()-1) * xm.mul(w, axis=0).T.dot(xm); # Compute the unbiased weighted sample covariance
Correct equation for weighted unbiased sample covariance, URL (version: 2016-06-28)
Method 6
A follow-up to “sample” or “unbiased” standard deviation in the “frequency weights” sense since “weighted sample standard deviation python” Google search leads to this post:
def frequency_sample_std_dev(X, n):
"""
Sample standard deviation for X and n,
where X[i] is the quantity each person in group i has,
and n[i] is the number of people in group i.
See Equation 6.4 of:
Montgomery, Douglas, C. and George C. Runger. Applied Statistics
and Probability for Engineers, Enhanced eText. Available from:
WileyPLUS, (7th Edition). Wiley Global Education US, 2018.
"""
n_groups = len(n)
n_people = sum(n)
lhs_numerator = sum([ni*Xi**2 for Xi, ni in zip(X, n)])
rhs_numerator = sum([Xi*ni for Xi, ni in zip(X,n)])**2/n_people
denominator = n_people-1
var = (lhs_numerator - rhs_numerator) / denominator
std = sqrt(var)
return std
Or modifying the answer by @Eric as follows:
def weighted_sample_avg_std(values, weights):
"""
Return the weighted average and weighted sample standard deviation.
values, weights -- Numpy ndarrays with the same shape.
Assumes that weights contains only integers (e.g. how many samples in each group).
See also https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Frequency_weights
"""
average = np.average(values, weights=weights)
variance = np.average((values-average)**2, weights=weights)
variance = variance*sum(weights)/(sum(weights)-1)
return (average, sqrt(variance))
print(weighted_sample_avg_std(X, n))
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0