How can I calculate mean values along an axis of a matrix while excluding NaN values from the calculation? (For R people, think na.rm = TRUE.)
Here is my [non-]working example:
import numpy as np
dat = np.array([[1, 2, 3],
[4, 5, np.nan],
[np.nan, 6, np.nan],
[np.nan, np.nan, np.nan]])
print(dat)
print(dat.mean(1)) # [ 2. nan nan nan]
With NaNs removed, my expected output would be:
array([ 2., 4.5, 6., nan])
Answers:
Method 1
I think what you want is a masked array:
dat = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
mdat = np.ma.masked_array(dat, np.isnan(dat))  # mask every NaN entry
mm = np.mean(mdat, axis=1)
print(mm.filled(np.nan)) # the desired answer
Edit: Combining all of the timing data
from timeit import Timer
setupstr="""
import numpy as np
from scipy.stats.stats import nanmean
dat = np.random.normal(size=(1000,1000))
ii = np.ix_(np.random.randint(0,99,size=50),np.random.randint(0,99,size=50))
dat[ii] = np.nan
"""
method1="""
mdat = np.ma.masked_array(dat,np.isnan(dat))
mm = np.mean(mdat,axis=1)
mm.filled(np.nan)
"""
N = 2
t1 = Timer(method1, setupstr).timeit(N)
t2 = Timer("[np.mean([l for l in d if not np.isnan(l)]) for d in dat]", setupstr).timeit(N)
t3 = Timer("np.array([r[np.isfinite(r)].mean() for r in dat])", setupstr).timeit(N)
t4 = Timer("np.ma.masked_invalid(dat).mean(axis=1)", setupstr).timeit(N)
t5 = Timer("nanmean(dat,axis=1)", setupstr).timeit(N)
print('Time: %f\tRatio: %f' % (t1, t1/t1))
print('Time: %f\tRatio: %f' % (t2, t2/t1))
print('Time: %f\tRatio: %f' % (t3, t3/t1))
print('Time: %f\tRatio: %f' % (t4, t4/t1))
print('Time: %f\tRatio: %f' % (t5, t5/t1))
Returns:
Time: 0.045454 Ratio: 1.000000
Time: 8.179479 Ratio: 179.950595
Time: 0.060988 Ratio: 1.341755
Time: 0.070955 Ratio: 1.561029
Time: 0.065152 Ratio: 1.433364
Method 2
If performance matters, you should use bottleneck.nanmean() instead:
http://pypi.python.org/pypi/Bottleneck
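As a minimal sketch of how this could look on the question's data, assuming the optional Bottleneck package is installed: the import is guarded, and the fallback to np.nanmean is an addition for illustration only, not part of the original answer.

```python
import warnings
import numpy as np

try:
    import bottleneck as bn  # optional third-party package
    nanmean = bn.nanmean
except ImportError:
    # Assumption for this sketch: fall back to NumPy when Bottleneck is missing.
    nanmean = np.nanmean

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

# The NumPy fallback warns about the all-NaN row; silence that locally.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = nanmean(dat, axis=1)

print(row_means)
```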
Method 3
Assuming you’ve also got SciPy installed:
http://www.scipy.org/doc/api_docs/SciPy.stats.stats.html#nanmean
Method 4
From numpy 1.8 (released 2013-10-30) onwards, nanmean does precisely what you need:
>>> import numpy as np
>>> np.nanmean(np.array([1.5, 3.5, np.nan]))
2.5
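Applied to the array from the question, a row-wise call might look like the sketch below. The warnings handling is an addition here: np.nanmean emits a RuntimeWarning for a row that contains only NaNs and returns nan for it.

```python
import warnings
import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

# The last row has no valid values, so np.nanmean returns nan for it and
# emits a RuntimeWarning ("Mean of empty slice"); silence it locally.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = np.nanmean(dat, axis=1)

print(row_means)
```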
Method 5
A masked array with the nans filtered out can also be created on the fly:
print(np.ma.masked_invalid(dat).mean(1))
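Note that the mean of a masked array is itself a masked array, so a row with no valid values comes back masked rather than as nan. A small sketch, using the question's data, of converting the result to a plain array with nan in those positions:

```python
import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

# masked_invalid masks NaN (and inf) entries; mean() then skips them.
masked_means = np.ma.masked_invalid(dat).mean(axis=1)

# The all-NaN row is masked in the result; filled() replaces it with nan.
row_means = masked_means.filled(np.nan)

print(masked_means)
print(row_means)
```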
Method 6
You can always find a workaround in something like:
numpy.nansum(dat, axis=1) / numpy.sum(numpy.isfinite(dat), axis=1)
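When this answer was written, a skipna option for numpy.mean was expected in a future NumPy release. No released version of numpy.mean has such an option; numpy.nanmean (see Method 4) covers this case instead.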
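The sum-over-count workaround can be wrapped in a small helper. This is only a sketch: the name nanmean_by_hand is hypothetical, and it counts non-NaN entries with ~np.isnan rather than np.isfinite, so infinities are treated as ordinary values.

```python
import numpy as np

def nanmean_by_hand(a, axis=None):
    """Mean that ignores NaNs: sum of valid values over their count."""
    a = np.asarray(a, dtype=float)
    total = np.nansum(a, axis=axis)        # NaNs count as zero in the sum
    count = (~np.isnan(a)).sum(axis=axis)  # number of non-NaN entries
    # A slice with no valid entries gives 0 / 0, which evaluates to nan;
    # suppress the floating-point warning for that case.
    with np.errstate(invalid="ignore", divide="ignore"):
        return total / count

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

row_means = nanmean_by_hand(dat, axis=1)
print(row_means)
```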
Method 7
This is built upon the solution suggested by JoshAdel.
Define the following function:
def nanmean(data, **args):
    return numpy.ma.filled(numpy.ma.masked_array(data,numpy.isnan(data)).mean(**args), fill_value=numpy.nan)
Example use:
data = [[0, 1, numpy.nan], [8, 5, 1]]
data = numpy.array(data)
print(data)
print(nanmean(data))
print(nanmean(data, axis=0))
print(nanmean(data, axis=1))
Will print out:
[[ 0. 1. nan]
 [ 8. 5. 1.]]
3.0
[ 4. 3. 1.]
[ 0.5 4.66666667]
Method 8
How about using Pandas to do this:
import numpy as np
import pandas as pd
dat = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
print(dat)
print(dat.mean(1))
df = pd.DataFrame(dat)
print(df.mean(axis=1))  # pandas skips NaN values by default
Gives:
0 2.0
1 4.5
2 6.0
3 NaN
Method 9
Or you can use laxarray, freshly uploaded, which is, among other things, a wrapper for masked arrays.
import laxarray as la
la.array(dat).mean(axis=1)
Following JoshAdel’s protocol, I get:
Time: 0.048791 Ratio: 1.000000
Time: 0.062242 Ratio: 1.275689 # laxarray's one-liner
So laxarray is marginally slower (I would need to check why; it may be fixable), but it is much easier to use and allows labelling dimensions with strings.
Check out: https://github.com/perrette/laxarray
EDIT: I have checked with another module, “la” (larry), which beats all tests:
import la
la.larry(dat).mean(axis=1)
By hand, Time: 0.049013 Ratio: 1.000000
Larry, Time: 0.005467 Ratio: 0.111540
laxarray Time: 0.061751 Ratio: 1.259889
Impressive!
Method 10
One more speed check for all proposed approaches:
Python 2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Jan 19 2016, 12:08:31) [MSC v.1500 64 bit (AMD64)]
IPython 4.0.1 -- An enhanced Interactive Python.
import numpy as np
from scipy.stats.stats import nanmean
dat = np.random.normal(size=(1000,1000))
ii = np.ix_(np.random.randint(0,99,size=50),np.random.randint(0,99,size=50))
dat[ii] = np.nan
In[185]: def method1():
    mdat = np.ma.masked_array(dat,np.isnan(dat))
    mm = np.mean(mdat,axis=1)
    mm.filled(np.nan)
In[190]: %timeit method1()
100 loops, best of 3: 7.09 ms per loop
In[191]: %timeit [np.mean([l for l in d if not np.isnan(l)]) for d in dat]
1 loops, best of 3: 1.04 s per loop
In[192]: %timeit np.array([r[np.isfinite(r)].mean() for r in dat])
10 loops, best of 3: 19.6 ms per loop
In[193]: %timeit np.ma.masked_invalid(dat).mean(axis=1)
100 loops, best of 3: 11.8 ms per loop
In[194]: %timeit nanmean(dat,axis=1)
100 loops, best of 3: 6.36 ms per loop
In[195]: import bottleneck as bn
In[196]: %timeit bn.nanmean(dat,axis=1)
1000 loops, best of 3: 1.05 ms per loop
In[197]: from scipy import stats
In[198]: %timeit stats.nanmean(dat)
100 loops, best of 3: 6.19 ms per loop
So the best is ‘bottleneck.nanmean(dat, axis=1)’.
‘scipy.stats.nanmean(dat)’ is not faster than numpy.nanmean(dat, axis=1).
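The session above does not time numpy.nanmean itself. A rough Python 3 sketch of a comparable benchmark using only NumPy and the standard timeit module follows; the seeded generator, the candidates dictionary, and the repeat count are choices made for this sketch, and the timings will vary by machine.

```python
import timeit
import numpy as np

# Same shape of test data as above: a 1000x1000 array with a block of NaNs.
rng = np.random.default_rng(0)
dat = rng.normal(size=(1000, 1000))
rows = rng.integers(0, 99, size=50)
cols = rng.integers(0, 99, size=50)
dat[np.ix_(rows, cols)] = np.nan

# Hypothetical registry of the NumPy-only approaches to compare.
candidates = {
    "masked_array": lambda: np.ma.masked_array(dat, np.isnan(dat)).mean(axis=1).filled(np.nan),
    "masked_invalid": lambda: np.ma.masked_invalid(dat).mean(axis=1).filled(np.nan),
    "np.nanmean": lambda: np.nanmean(dat, axis=1),
}

repeats = 10
for name, func in candidates.items():
    seconds = timeit.timeit(func, number=repeats)
    print(f"{name:15s} {seconds / repeats * 1e3:8.2f} ms per call")
```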
Method 11
# I suggest doing it this way:
import numpy as np
dat = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
dat2 = np.ma.masked_invalid(dat)
print(np.mean(dat2, axis=1))
Method 12
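# Assumes dataMat is a 2-D numpy.matrix that has already been defined.
from numpy import shape, mean, nonzero, isnan
numFeat = shape(dataMat)[1]
for i in range(numFeat):
    # Mean of column i over the rows whose value is not NaN
    meanVal = mean(dataMat[nonzero(~isnan(dataMat[:, i].A))[0], i])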
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.