I need to find the index of the first value in a 1d NumPy array, or Pandas numeric series, satisfying a condition. The array is large and the index may be near the start or end of the array, or the condition may not be met at all. I can’t tell in advance which is more likely. If the condition is not met, the return value should be -1. I’ve considered a few approaches.
Attempt 1
# func(arr) returns a Boolean array idx = next(iter(np.where(func(arr))[0]), -1)
But this is often too slow as func(arr) applies a vectorised function on the entire array rather than stopping when the condition is met. Specifically, it is expensive when the condition is met near the start of the array.
Attempt 2
np.argmax is marginally faster, but fails to identify when a condition is never met:
np.random.seed(0) arr = np.random.rand(10**7) assert next(iter(np.where(arr > 0.999999)[0]), -1) == np.argmax(arr > 0.999999) %timeit next(iter(np.where(arr > 0.999999)[0]), -1) # 21.2 ms %timeit np.argmax(arr > 0.999999) # 17.7 ms
np.argmax(arr > 1.0) returns 0, i.e. an instance when the condition is not satisfied.
Attempt 3
# func(arr) returns a Boolean scalar idx = next((idx for idx, val in enumerate(arr) if func(arr)), -1)
But this is too slow when the condition is met near the end of the array. Presumably this is because the generator expression has an expensive overhead from a large number of __next__ calls.
Is this always a compromise or is there a way, for generic func, to extract the first index efficiently?
Benchmarking
For benchmarking, assume func finds the index when a value is greater than a given constant:
# Python 3.6.5, NumPy 1.14.3, Numba 0.38.0 import numpy as np np.random.seed(0) arr = np.random.rand(10**7) m = 0.9 n = 0.999999 # Start of array benchmark %timeit next(iter(np.where(arr > m)[0]), -1) # 43.5 ms %timeit next((idx for idx, val in enumerate(arr) if val > m), -1) # 2.5 µs # End of array benchmark %timeit next(iter(np.where(arr > n)[0]), -1) # 21.4 ms %timeit next((idx for idx, val in enumerate(arr) if val > n), -1) # 39.2 ms
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
numba
With numba it’s possible to optimise both scenarios. Syntactically, you need only construct a function with a simple for loop:
from numba import njit
@njit
def get_first_index_nb(A, k):
for i in range(len(A)):
if A[i] > k:
return i
return -1
idx = get_first_index_nb(A, 0.9)
Numba improves performance by JIT (“Just In Time”) compiling code and leveraging CPU-level optimisations. A regular for loop without the @njit decorator would typically be slower than the methods you’ve already tried for the case where the condition is met late.
For a Pandas numeric series df['data'], you can simply feed the NumPy representation to the JIT-compiled function:
idx = get_first_index_nb(df['data'].values, 0.9)
Generalisation
Since numba permits functions as arguments, and assuming the passed the function can also be JIT-compiled, you can arrive at a method to calculate the nth index where a condition is met for an arbitrary func.
@njit
def get_nth_index_count(A, func, count):
c = 0
for i in range(len(A)):
if func(A[i]):
c += 1
if c == count:
return i
return -1
@njit
def func(val):
return val > 0.9
# get index of 3rd value where func evaluates to True
idx = get_nth_index_count(arr, func, 3)
For the 3rd last value, you can feed the reverse, arr[::-1], and negate the result from len(arr) - 1, the - 1 necessary to account for 0-indexing.
Performance benchmarking
# Python 3.6.5, NumPy 1.14.3, Numba 0.38.0
np.random.seed(0)
arr = np.random.rand(10**7)
m = 0.9
n = 0.999999
@njit
def get_first_index_nb(A, k):
for i in range(len(A)):
if A[i] > k:
return i
return -1
def get_first_index_np(A, k):
for i in range(len(A)):
if A[i] > k:
return i
return -1
%timeit get_first_index_nb(arr, m) # 375 ns
%timeit get_first_index_np(arr, m) # 2.71 µs
%timeit next(iter(np.where(arr > m)[0]), -1) # 43.5 ms
%timeit next((idx for idx, val in enumerate(arr) if val > m), -1) # 2.5 µs
%timeit get_first_index_nb(arr, n) # 204 µs
%timeit get_first_index_np(arr, n) # 44.8 ms
%timeit next(iter(np.where(arr > n)[0]), -1) # 21.4 ms
%timeit next((idx for idx, val in enumerate(arr) if val > n), -1) # 39.2 ms
Method 2
I also wanted to do something similar and found out that the solutions presented in this question do not really help me. In particular, the numba solution was much slower for me than the more conventional methods presented in the question itself. I have a times_all list, typically of the order of tens of thousands of elements, and want to find the index of the first element of times_all which is bigger than a time_event. And I have thousands of time_events. My solution is to divide times_all into chunks of for example 100 elements, first decide time_event belongs to which time segment, keep the index of the first element of this segment, then find which index in that segment, and add the two indices. Here is a minimal code. For me, it runs of orders of magnitude faster than other solutions in this page.
def event_time_2_index(time_event, times_all, STEPS=100):
import numpy as np
time_indices_jumps = np.arange(0, len(times_all), STEPS)
time_list_jumps = [times_all[idx] for idx in time_indices_jumps]
time_list_jumps_idx = next((idx for idx, val in enumerate(time_list_jumps)
if val > time_event), -1)
index_in_jumps = time_indices_jumps[time_list_jumps_idx-1]
times_cropped = times_all[index_in_jumps:]
event_index_rel = next((idx for idx, val in enumerate(times_cropped)
if val > time_event), -1)
event_index = event_index_rel + index_in_jumps
return event_index
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0