I have two arrays that have the shapes N X T and M X T. I’d like to compute the correlation coefficient across T between every possible pair of rows n and m (from N and M, respectively).
What’s the fastest, most pythonic way to do this? (Looping over N and M would seem to me to be neither fast nor pythonic.) I’m expecting the answer to involve numpy and/or scipy. Right now my arrays are numpy arrays, but I’m open to converting them to a different type.
I’m expecting my output to be an array with the shape N X M.
N.B. When I say “correlation coefficient,” I mean the Pearson product-moment correlation coefficient.
Here are some things to note:
- The
numpyfunctioncorrelaterequires input arrays to be one-dimensional. - The
numpyfunctioncorrcoefaccepts two-dimensional arrays, but they must have the same shape. - The
scipy.statsfunctionpearsonrrequires input arrays to be one-dimensional.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Correlation (default ‘valid’ case) between two 2D arrays:
You can simply use matrix-multiplication np.dot like so –
out = np.dot(arr_one,arr_two.T)
Correlation with the default "valid" case between each pairwise row combinations (row1,row2) of the two input arrays would correspond to multiplication result at each (row1,row2) position.
Row-wise Correlation Coefficient calculation for two 2D arrays:
def corr2_coeff(A, B):
# Rowwise mean of input arrays & subtract from input arrays themeselves
A_mA = A - A.mean(1)[:, None]
B_mB = B - B.mean(1)[:, None]
# Sum of squares across rows
ssA = (A_mA**2).sum(1)
ssB = (B_mB**2).sum(1)
# Finally get corr coeff
return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None],ssB[None]))
This is based upon this solution to How to apply corr2 functions in Multidimentional arrays in MATLAB
Benchmarking
This section compares runtime performance with the proposed approach against generate_correlation_map & loopy pearsonr based approach listed in the other answer.(taken from the function test_generate_correlation_map() without the value correctness verification code at the end of it). Please note the timings for the proposed approach also include a check at the start to check for equal number of columns in the two input arrays, as also done in that other answer. The runtimes are listed next.
Case #1:
In [106]: A = np.random.rand(1000, 100) In [107]: B = np.random.rand(1000, 100) In [108]: %timeit corr2_coeff(A, B) 100 loops, best of 3: 15 ms per loop In [109]: %timeit generate_correlation_map(A, B) 100 loops, best of 3: 19.6 ms per loop
Case #2:
In [110]: A = np.random.rand(5000, 100) In [111]: B = np.random.rand(5000, 100) In [112]: %timeit corr2_coeff(A, B) 1 loops, best of 3: 368 ms per loop In [113]: %timeit generate_correlation_map(A, B) 1 loops, best of 3: 493 ms per loop
Case #3:
In [114]: A = np.random.rand(10000, 10) In [115]: B = np.random.rand(10000, 10) In [116]: %timeit corr2_coeff(A, B) 1 loops, best of 3: 1.29 s per loop In [117]: %timeit generate_correlation_map(A, B) 1 loops, best of 3: 1.83 s per loop
The other loopy pearsonr based approach seemed too slow, but here are the runtimes for one small datasize –
In [118]: A = np.random.rand(1000, 100) In [119]: B = np.random.rand(1000, 100) In [120]: %timeit corr2_coeff(A, B) 100 loops, best of 3: 15.3 ms per loop In [121]: %timeit generate_correlation_map(A, B) 100 loops, best of 3: 19.7 ms per loop In [122]: %timeit pearsonr_based(A, B) 1 loops, best of 3: 33 s per loop
Method 2
@Divakar provides a great option for computing the unscaled correlation, which is what I originally asked for.
In order to calculate the correlation coefficient, a bit more is required:
import numpy as np
def generate_correlation_map(x, y):
"""Correlate each n with each m.
Parameters
----------
x : np.array
Shape N X T.
y : np.array
Shape M X T.
Returns
-------
np.array
N X M array in which each element is a correlation coefficient.
"""
mu_x = x.mean(1)
mu_y = y.mean(1)
n = x.shape[1]
if n != y.shape[1]:
raise ValueError('x and y must ' +
'have the same number of timepoints.')
s_x = x.std(1, ddof=n - 1)
s_y = y.std(1, ddof=n - 1)
cov = np.dot(x,
y.T) - n * np.dot(mu_x[:, np.newaxis],
mu_y[np.newaxis, :])
return cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :])
Here’s a test of this function, which passes:
from scipy.stats import pearsonr
def test_generate_correlation_map():
x = np.random.rand(10, 10)
y = np.random.rand(20, 10)
desired = np.empty((10, 20))
for n in range(x.shape[0]):
for m in range(y.shape[0]):
desired[n, m] = pearsonr(x[n, :], y[m, :])[0]
actual = generate_correlation_map(x, y)
np.testing.assert_array_almost_equal(actual, desired)
Method 3
For those interested in computing the Pearson correlation coefficient between a 1D and 2D array, I wrote the following function, where x is a 1D array and y a 2D array.
def pearsonr_2D(x, y):
"""computes pearson correlation coefficient
where x is a 1D and y a 2D array"""
upper = np.sum((x - np.mean(x)) * (y - np.mean(y, axis=1)[:,None]), axis=1)
lower = np.sqrt(np.sum(np.power(x - np.mean(x), 2)) * np.sum(np.power(y - np.mean(y, axis=1)[:,None], 2), axis=1))
rho = upper / lower
return rho
Example run:
>>> x
Out[1]: array([1, 2, 3])
>>> y
Out[2]: array([[ 1, 2, 3],
[ 6, 7, 12],
[ 9, 3, 1]])
>>> pearsonr_2D(x, y)
Out[3]: array([ 1. , 0.93325653, -0.96076892])
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0