This question is intended to be a canonical duplicate target
Given two arrays X and Y of shapes (i, n) and (j, n), representing lists of n-dimensional coordinates,
def test_data(n, i, j, r = 100):
X = np.random.rand(i, n) * r - r / 2
Y = np.random.rand(j, n) * r - r / 2
return X, Y
X, Y = test_data(3, 1000, 1000)
what are the fastest ways to find:
- The distance
Dwith shape(i,j)between every point inXand every point inY - The indices
k_iand distancek_dof theknearest neighbors against all points inXfor every point inY - The indices
r_i,r_jand distancer_dof every point inXwithin distancerof every pointjinY
Given the following sets of restrictions:
- Only using
numpy - Using any
pythonpackage
Including the special case:
YisX
In all cases distance primarily means Euclidean distance, but feel free to highlight methods that allow other distance calculations.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
#1. All Distances
- only using
numpy
The naive method is:
D = np.sqrt(np.sum((X[:, None, :] - Y[None, :, :])**2, axis = -1))
However this takes up a lot of memory creating an (i, j, n)-shaped intermediate matrix, and is very slow
However, thanks to a trick from @Divakar (eucl_dist package, wiki), we can use a bit of algebra and np.einsum to decompose as such: (X - Y)**2 = X**2 - 2*X*Y + Y**2
D = np.sqrt( # (X - Y) ** 2
np.einsum('ij, ij ->i', X, X)[:, None] + # = X ** 2
np.einsum('ij, ij ->i', Y, Y) - # + Y ** 2
2 * X.dot(Y.T)) # - 2 * X * Y
YisX
Similar to above:
XX = np.einsum('ij, ij ->i', X, X)
D = np.sqrt(XX[:, None] + XX - 2 * X.dot(X.T))
Beware that floating-point imprecision can make the diagonal terms deviate very slightly from zero with this method. If you need to make sure they are zero, you’ll need to explicitly set it:
np.einsum('ii->i', D)[:] = 0
- Any Package
scipy.spatial.distance.cdist is the most intuitive builtin function for this, and far faster than bare numpy
from scipy.spatial.distance import cdist D = cdist(X, Y)
cdist can also deal with many, many distance measures as well as user-defined distance measures (although these are not optimized). Check the documentation linked above for details.
YisX
For self-referring distances, scipy.spatial.distance.pdist works similar to cdist, but returns a 1-D condensed distance array, saving space on the symmetric distance matrix by only having each term once. You can convert this to a square matrix using squareform
from scipy.spatial.distance import pdist, squareform D_cond = pdist(X) D = squareform(D_cond)
#2. K Nearest Neighbors (KNN)
- Only using
numpy
We could use np.argpartition to get the k-nearest indices and use those to get the corresponding distance values. So, with D as the array holding the distance values obtained above, we would have –
if k == 1:
k_i = D.argmin(0)
else:
k_i = D.argpartition(k, axis = 0)[:k]
k_d = np.take_along_axis(D, k_i, axis = 0)
However we can speed this up a bit by not taking the square roots until we have reduced our dataset. np.sqrt is the slowest part of calculating the Euclidean norm, so we don’t want to do that until the end.
D_sq = np.einsum('ij, ij ->i', X, X)[:, None] +
np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
if k == 1:
k_i = D_sq.argmin(0)
else:
k_i = D_sq.argpartition(k, axis = 0)[:k]
k_d = np.sqrt(np.take_along_axis(D_sq, k_i, axis = 0))
Now, np.argpartition performs indirect partition and doesn’t necessarily give us the elements in sorted order and only makes sure that the first k elements are the smallest ones. So, for a sorted output, we need to use argsort on the output from previous step –
sorted_idx = k_d.argsort(axis = 0) k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis = 0) k_d_sorted = np.take_along_axis(k_d, sorted_idx, axis = 0)
If you only need, k_i, you never need the square root at all:
D_sq = np.einsum('ij, ij ->i', X, X)[:, None] +
np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
if k == 1:
k_i = D_sq.argmin(0)
else:
k_i = D_sq.argpartition(k, axis = 0)[:k]
k_d_sq = np.take_along_axis(D_sq, k_i, axis = 0)
sorted_idx = k_d_sq.argsort(axis = 0)
k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis = 0)
XisY
In the above code, replace:
D_sq = np.einsum('ij, ij ->i', X, X)[:, None] +
np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
with:
XX = np.einsum('ij, ij ->i', X, X)
D_sq = XX[:, None] + XX - 2 * X.dot(X.T))
- Any Package
KD-Tree is a much faster method to find neighbors and constrained distances. Be aware the while KDTree is usually much faster than brute force solutions above for 3d (as long as oyu have more than 8 points), if you have n-dimensions, KDTree only scales well if you have more than 2**n points. For discussion and more advanced methods for high dimensions, see Here
The most recommended method for implementing KDTree is to use scipy‘s scipy.spatial.KDTree or scipy.spatial.cKDTree
from scipy.spatial import KDTree X_tree = KDTree(X) k_d, k_i = X_tree.query(Y, k = k)
Unfortunately scipy‘s KDTree implementation is slow and has a tendency to segfault for larger data sets. As pointed out by @HansMusgrave here, pykdtree increases the performance a lot, but is not as common an include as scipy and can only deal with Euclidean distance currently (while the KDTree in scipy can handle Minkowsi p-norms of any order)
XisY
Use instead:
k_d, k_i = X_tree.query(X, k = k)
- Arbitrary metrics
A BallTree has similar algorithmic properties to a KDTree. I’m not aware of a parallel/vectorized/fast BallTree in Python, but using scipy we can still have reasonable KNN queries for user-defined metrics. If available, builtin metrics will be much faster.
def d(a, b):
return max(np.abs(a-b))
tree = sklearn.neighbors.BallTree(X, metric=d)
k_d, k_i = tree.query(Y)
This answer will be wrong if d() is not a metric. The only reason a BallTree is faster than brute force is because the properties of a metric allow it to rule out some solutions. For truly arbitrary functions, brute force is actually necessary.
#3. Radius search
- Only using
numpy
The simplest method is just to use boolean indexing:
mask = D_sq < r**2 r_i, r_j = np.where(mask) r_d = np.sqrt(D_sq[mask])
- Any Package
Similar to above, you can use scipy.spatial.KDTree.query_ball_point
r_ij = X_tree.query_ball_point(Y, r = r)
or scipy.spatial.KDTree.query_ball_tree
Y_tree = KDTree(Y) r_ij = X_tree.query_ball_tree(Y_tree, r = r)
Unfortunately r_ij ends up being a list of index arrays that are a bit difficult to untangle for later use.
Much easier is to use cKDTree‘s sparse_distance_matrix, which can output a coo_matrix
from scipy.spatial import cKDTree X_cTree = cKDTree(X) Y_cTree = cKDTree(Y) D_coo = X_cTree.sparse_distance_matrix(Y_cTree, r = r, output_type = `coo_matrix`) r_i = D_coo.row r_j = D_coo.column r_d = D_coo.data
This is an extraordinarily flexible format for the distance matrix, as it stays an actual matrix (if converted to csr) can also be used for many vectorized operations.
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0