# Vectorizing Haversine distance calculation in Python

I am trying to calculate a distance matrix for a long list of locations identified by Latitude & Longitude using the Haversine formula that takes two tuples of coordinate pairs to produce the distance:

```def haversine(point1, point2, miles=False):
""" Calculate the great-circle distance bewteen two points on the Earth surface.

:input: two 2-tuples, containing the latitude and longitude of each point
in decimal degrees.

Example: haversine((45.7597, 4.8422), (48.8567, 2.3508))

:output: Returns the distance bewteen the two points.
The default unit is kilometers. Miles can be returned
if the ``miles`` parameter is set to True.

"""```

I can calculate the distance between all points using a nested for loop as follows:

```data.head()

id                      coordinates
0   1   (16.3457688674, 6.30354512503)
1   2    (12.494749307, 28.6263955635)
2   3    (27.794615136, 60.0324947881)
3   4   (44.4269923769, 110.114216113)
4   5  (-69.8540884125, 87.9468778773)```

using a simple function:

```distance = {}
def haver_loop(df):
for i, point1 in df.iterrows():
distance[i] = []
for j, point2 in df.iterrows():
distance[i].append(haversine(point1.coordinates, point2.coordinates))

return pd.DataFrame.from_dict(distance, orient='index')```

But this takes quite a while given the time complexity, running at around 20s for 500 points and I have a much longer list. This has me looking at vectorization, and I’ve come across `numpy.vectorize` ((docs), but can’t figure out how to apply it in this context.

Contents

### Method 1

From `haversine's function definition`, it looked pretty parallelizable. So, using one of the best tools for vectorization with NumPy aka `broadcasting` and replacing the math funcs with the NumPy equivalents `ufuncs`, here’s one vectorized solution –

```# Get data as a Nx2 shaped NumPy array
data = np.array(df['coordinates'].tolist())

# Extract col-1 and 2 as latitudes and longitudes
lat = data[:,0]
lng = data[:,1]

# Elementwise differentiations for lattitudes & longitudes
diff_lat = lat[:,None] - lat
diff_lng = lng[:,None] - lng

# Finally Calculate haversine
d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2
return 2 * 6371 * np.arcsin(np.sqrt(d))```

Runtime tests –

The other `np.vectorize based solution` has shown some positive promise on performance improvement over the original code, so this section would compare the posted broadcasting based approach against that one.

Function definitions –

```def vectotized_based(df):
haver_vec = np.vectorize(haversine, otypes=[np.int16])
return df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))

data = np.array(df['coordinates'].tolist())
lat = data[:,0]
lng = data[:,1]
diff_lat = lat[:,None] - lat
diff_lng = lng[:,None] - lng
d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2
return 2 * 6371 * np.arcsin(np.sqrt(d))```

Timings –

```In : # Input
...: length = 500
...: d1 = np.random.uniform(-90, 90, length)
...: d2 = np.random.uniform(-180, 180, length)
...: coords = tuple(zip(d1, d2))
...: df = pd.DataFrame({'id':np.arange(length), 'coordinates':coords})
...:

In : %timeit vectotized_based(df)
1 loops, best of 3: 1.12 s per loop

10 loops, best of 3: 68.7 ms per loop```

### Method 2

You would provide your function as an argument to `np.vectorize()`, and could then use it as an argument to `pandas.groupby.apply` as illustrated below:

```haver_vec = np.vectorize(haversine, otypes=[np.int16])
distance = df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))```

For instance, with sample data as follows:

```length = 500
df = pd.DataFrame({'id':np.arange(length), 'coordinates':tuple(zip(np.random.uniform(-90, 90, length), np.random.uniform(-180, 180, length)))})```

compare for 500 points:

```def haver_vect(data):
distance = data.groupby('id').apply(lambda x: pd.Series(haver_vec(data.coordinates, x.coordinates)))
return distance

%timeit haver_loop(df): 1 loops, best of 3: 35.5 s per loop

%timeit haver_vect(df): 1 loops, best of 3: 593 ms per loop```

### Method 3

start by getting all combinations using `itertools.product`

` results= [(p1,p2,haversine(p1,p2))for p1,p2 in itertools.product(points,repeat=2)]`

that said Im not sure how fast it will be this looks like it might be a duplicate of Python: speeding up geographic comparison 