I am trying to calculate a distance matrix for a long list of locations identified by Latitude & Longitude using the Haversine formula that takes two tuples of coordinate pairs to produce the distance:

def haversine(point1, point2, miles=False): """ Calculate the great-circle distance bewteen two points on the Earth surface. :input: two 2-tuples, containing the latitude and longitude of each point in decimal degrees. Example: haversine((45.7597, 4.8422), (48.8567, 2.3508)) :output: Returns the distance bewteen the two points. The default unit is kilometers. Miles can be returned if the ``miles`` parameter is set to True. """

I can calculate the distance between all points using a nested for loop as follows:

data.head() id coordinates 0 1 (16.3457688674, 6.30354512503) 1 2 (12.494749307, 28.6263955635) 2 3 (27.794615136, 60.0324947881) 3 4 (44.4269923769, 110.114216113) 4 5 (-69.8540884125, 87.9468778773)

using a simple function:

distance = {} def haver_loop(df): for i, point1 in df.iterrows(): distance[i] = [] for j, point2 in df.iterrows(): distance[i].append(haversine(point1.coordinates, point2.coordinates)) return pd.DataFrame.from_dict(distance, orient='index')

But this takes quite a while given the time complexity, running at around 20s for 500 points and I have a much longer list. This has me looking at vectorization, and I’ve come across `numpy.vectorize`

((docs), but can’t figure out how to apply it in this context.

## Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

### Method 1

From `haversine's function definition`

, it looked pretty *parallelizable*. So, using one of the best tools for vectorization with NumPy aka `broadcasting`

and replacing the math funcs with the NumPy equivalents `ufuncs`

, here’s one vectorized solution –

# Get data as a Nx2 shaped NumPy array data = np.array(df['coordinates'].tolist()) # Convert to radians data = np.deg2rad(data) # Extract col-1 and 2 as latitudes and longitudes lat = data[:,0] lng = data[:,1] # Elementwise differentiations for lattitudes & longitudes diff_lat = lat[:,None] - lat diff_lng = lng[:,None] - lng # Finally Calculate haversine d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2 return 2 * 6371 * np.arcsin(np.sqrt(d))

Runtime tests –

The other `np.vectorize based solution`

has shown some positive promise on performance improvement over the original code, so this section would compare the posted broadcasting based approach against that one.

Function definitions –

def vectotized_based(df): haver_vec = np.vectorize(haversine, otypes=[np.int16]) return df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates))) def broadcasting_based(df): data = np.array(df['coordinates'].tolist()) data = np.deg2rad(data) lat = data[:,0] lng = data[:,1] diff_lat = lat[:,None] - lat diff_lng = lng[:,None] - lng d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2 return 2 * 6371 * np.arcsin(np.sqrt(d))

Timings –

In [123]: # Input ...: length = 500 ...: d1 = np.random.uniform(-90, 90, length) ...: d2 = np.random.uniform(-180, 180, length) ...: coords = tuple(zip(d1, d2)) ...: df = pd.DataFrame({'id':np.arange(length), 'coordinates':coords}) ...: In [124]: %timeit vectotized_based(df) 1 loops, best of 3: 1.12 s per loop In [125]: %timeit broadcasting_based(df) 10 loops, best of 3: 68.7 ms per loop

### Method 2

You would provide your function as an argument to `np.vectorize()`

, and could then use it as an argument to `pandas.groupby.apply`

as illustrated below:

haver_vec = np.vectorize(haversine, otypes=[np.int16]) distance = df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))

For instance, with sample data as follows:

length = 500 df = pd.DataFrame({'id':np.arange(length), 'coordinates':tuple(zip(np.random.uniform(-90, 90, length), np.random.uniform(-180, 180, length)))})

compare for 500 points:

def haver_vect(data): distance = data.groupby('id').apply(lambda x: pd.Series(haver_vec(data.coordinates, x.coordinates))) return distance %timeit haver_loop(df): 1 loops, best of 3: 35.5 s per loop %timeit haver_vect(df): 1 loops, best of 3: 593 ms per loop

### Method 3

start by getting all combinations using `itertools.product`

results= [(p1,p2,haversine(p1,p2))for p1,p2 in itertools.product(points,repeat=2)]

that said Im not sure how fast it will be this looks like it might be a duplicate of Python: speeding up geographic comparison

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0