NumPy: calculate averages with NaNs removed

How can I calculate mean values along an axis of a matrix while excluding NaN values from the calculation? (For R people, think na.rm = TRUE.)

Here is my [non-]working example:

import numpy as np
dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])
print(dat)
print(dat.mean(1))  # [  2.  nan  nan  nan]

With NaNs removed, my expected output would be:

array([ 2.,  4.5,  6.,  nan])

Answers:

Method 1

I think what you want is a masked array:

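import numpy as np
# use np.nan, not the string 'nan', so the array stays numeric and np.isnan works
dat = np.array([[1,2,3], [4,5,np.nan], [np.nan,6,np.nan], [np.nan,np.nan,np.nan]])
mdat = np.ma.masked_array(dat,np.isnan(dat))
mm = np.mean(mdat,axis=1)
print(mm.filled(np.nan)) # the desired answer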

Edit: Combining all of the timing data

from timeit import Timer

setupstr = """
import numpy as np
# scipy.stats.nanmean was later deprecated and removed from SciPy;
# on current versions, time np.nanmean instead
from scipy.stats.stats import nanmean
dat = np.random.normal(size=(1000,1000))
ii = np.ix_(np.random.randint(0,99,size=50),np.random.randint(0,99,size=50))
dat[ii] = np.nan
"""

method1 = """
mdat = np.ma.masked_array(dat,np.isnan(dat))
mm = np.mean(mdat,axis=1)
mm.filled(np.nan)
"""

N = 2
t1 = Timer(method1, setupstr).timeit(N)
t2 = Timer("[np.mean([l for l in d if not np.isnan(l)]) for d in dat]", setupstr).timeit(N)
t3 = Timer("np.array([r[np.isfinite(r)].mean() for r in dat])", setupstr).timeit(N)
t4 = Timer("np.ma.masked_invalid(dat).mean(axis=1)", setupstr).timeit(N)
t5 = Timer("nanmean(dat,axis=1)", setupstr).timeit(N)

# ratios are relative to method 1 (the masked-array approach)
print('Time: %f\tRatio: %f' % (t1, t1/t1))
print('Time: %f\tRatio: %f' % (t2, t2/t1))
print('Time: %f\tRatio: %f' % (t3, t3/t1))
print('Time: %f\tRatio: %f' % (t4, t4/t1))
print('Time: %f\tRatio: %f' % (t5, t5/t1))

Returns:

Time: 0.045454  Ratio: 1.000000
Time: 8.179479  Ratio: 179.950595
Time: 0.060988  Ratio: 1.341755
Time: 0.070955  Ratio: 1.561029
Time: 0.065152  Ratio: 1.433364

Method 2

If performance matters, you should use bottleneck.nanmean() instead:

http://pypi.python.org/pypi/Bottleneck
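
A minimal sketch of how this might look on the matrix from the question. It assumes the optional Bottleneck package is installed; the fallback to np.nanmean is an addition so the snippet still runs without it, and the name row_means is illustrative.

```python
import numpy as np

try:
    import bottleneck as bn   # optional third-party package
    nanmean = bn.nanmean
except ImportError:
    # Assumption for this sketch only: fall back to NumPy's own nanmean
    # when Bottleneck is not available.
    nanmean = np.nanmean

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

# Mean of each row, ignoring NaNs; the all-NaN row yields nan.
row_means = nanmean(dat, axis=1)
print(row_means)
```

Both functions take the array and an axis argument, so swapping one for the other should not require further changes to the calling code.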

Method 3

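Assuming you’ve also got SciPy installed, scipy.stats.nanmean does this (it was later deprecated and removed from SciPy in favor of numpy.nanmean):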

http://www.scipy.org/doc/api_docs/SciPy.stats.stats.html#nanmean

Method 4

From numpy 1.8 (released 2013-10-30) onwards, nanmean does precisely what you need:

>>> import numpy as np
>>> np.nanmean(np.array([1.5, 3.5, np.nan]))
2.5
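
For the matrix in the question, the same function takes an axis argument. A minimal sketch follows; the name row_means is illustrative, and the warning suppression is optional (it only silences the RuntimeWarning that NumPy emits for the all-NaN row).

```python
import warnings
import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

# The last row is all NaN, so np.nanmean warns ("Mean of empty slice")
# and returns nan for it; the warning is silenced here for clean output.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = np.nanmean(dat, axis=1)

print(row_means)  # row means: 2.0, 4.5, 6.0, nan
```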

Method 5

A masked array with the nans filtered out can also be created on the fly:

print(np.ma.masked_invalid(dat).mean(1))

Method 6

You can always find a workaround in something like:

numpy.nansum(dat, axis=1) / numpy.sum(numpy.isfinite(dat), axis=1)

numpy.mean never gained the skipna option that was once planned for it; numpy.nanmean (NumPy 1.8 and later) takes care of this instead.
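
The sum-and-count workaround divides zero by zero for a row that contains only NaNs. A sketch that handles this case quietly is shown here; nanmean_rows is a hypothetical helper name, and it counts entries with ~np.isnan rather than np.isfinite so that the count matches what np.nansum treats as missing.

```python
import numpy as np

def nanmean_rows(a):
    """Row-wise mean ignoring NaNs; all-NaN rows yield nan."""
    a = np.asarray(a, dtype=float)
    total = np.nansum(a, axis=1)           # NaNs are treated as zero in the sum
    count = np.sum(~np.isnan(a), axis=1)   # number of non-NaN entries per row
    # 0.0 / 0 gives nan; errstate only suppresses the accompanying warning.
    with np.errstate(invalid="ignore", divide="ignore"):
        return total / count

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

result = nanmean_rows(dat)
print(result)
```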

Method 7

This is built upon the solution suggested by JoshAdel.

Define the following function:

import numpy

def nanmean(data, **args):
    # mask the NaNs, take the mean, then turn fully masked results back into nan
    return numpy.ma.filled(numpy.ma.masked_array(data,numpy.isnan(data)).mean(**args), fill_value=numpy.nan)

Example use:

data = [[0, 1, numpy.nan], [8, 5, 1]]
data = numpy.array(data)
print(data)
print(nanmean(data))
print(nanmean(data, axis=0))
print(nanmean(data, axis=1))

Will print out:

[[  0.   1.  nan]
 [  8.   5.   1.]]

3.0

[ 4.  3.  1.]

[ 0.5         4.66666667]

Method 8

How about using Pandas to do this:

import numpy as np
import pandas as pd
dat = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
print(dat)
print(dat.mean(1))

df = pd.DataFrame(dat)
print(df.mean(axis=1))  # pandas skips NaN by default (skipna=True)

The final print gives the row means with NaNs skipped:

0    2.0
1    4.5
2    6.0
3    NaN
dtype: float64

Method 9

Alternatively, use laxarray (freshly uploaded), which is, among other things, a wrapper for masked arrays.

import laxarray as la
la.array(dat).mean(axis=1)

Following JoshAdel’s protocol, I get:

Time: 0.048791  Ratio: 1.000000   
Time: 0.062242  Ratio: 1.275689   # laxarray's one-liner

So laxarray is marginally slower (I would need to check why; it may be fixable), but it is much easier to use and allows labelling dimensions with strings.

Check out: https://github.com/perrette/laxarray

EDIT: I have also checked another module, “la” (larry), which beats all the tests:

import la
la.larry(dat).mean(axis=1)

By hand, Time: 0.049013 Ratio: 1.000000
Larry,   Time: 0.005467 Ratio: 0.111540
laxarray Time: 0.061751 Ratio: 1.259889

Impressive!

Method 10

One more speed check for all proposed approaches:

Python 2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Jan 19 2016, 12:08:31) [MSC v.1500 64 bit (AMD64)]
IPython 4.0.1 -- An enhanced Interactive Python.

import numpy as np
from scipy.stats.stats import nanmean    
dat = np.random.normal(size=(1000,1000))
ii = np.ix_(np.random.randint(0,99,size=50),np.random.randint(0,99,size=50))
dat[ii] = np.nan
In[185]: def method1():
    mdat = np.ma.masked_array(dat,np.isnan(dat))
    mm = np.mean(mdat,axis=1)
    mm.filled(np.nan) 

In[190]: %timeit method1()
100 loops, best of 3: 7.09 ms per loop
In[191]: %timeit [np.mean([l for l in d if not np.isnan(l)]) for d in dat]
1 loops, best of 3: 1.04 s per loop
In[192]: %timeit np.array([r[np.isfinite(r)].mean() for r in dat])
10 loops, best of 3: 19.6 ms per loop
In[193]: %timeit np.ma.masked_invalid(dat).mean(axis=1)
100 loops, best of 3: 11.8 ms per loop
In[194]: %timeit nanmean(dat,axis=1)
100 loops, best of 3: 6.36 ms per loop
In[195]: import bottleneck as bn
In[196]: %timeit bn.nanmean(dat,axis=1)
1000 loops, best of 3: 1.05 ms per loop
In[197]: from scipy import stats
In[198]: %timeit stats.nanmean(dat)
100 loops, best of 3: 6.19 ms per loop

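So the fastest is bottleneck.nanmean(dat, axis=1).
Note that the nanmean timed in In[194] is SciPy’s (imported from scipy.stats.stats), not numpy.nanmean. Calling scipy.stats.nanmean(dat) without an axis argument takes about the same time (6.19 ms versus 6.36 ms).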

Method 11

# I suggest doing it this way:
import numpy as np
dat  = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
dat2 = np.ma.masked_invalid(dat)
print(np.mean(dat2, axis=1))

Method 12

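from numpy import asmatrix, isnan, mean, nan, nonzero, shape
# define datMat as a numpy matrix; the values below are example data only
datMat = asmatrix([[1, 2, 3], [4, 5, nan], [nan, 6, nan]])
numFeat = shape(datMat)[1]
for i in range(numFeat):
    # mean of column i over the rows where that column is not NaN
    meanVal = mean(datMat[nonzero(~isnan(datMat[:, i].A))[0], i])
    print(meanVal)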

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
