If I have the following dataframe, derived like so:
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 1)))
   0
0  0
1  2
2  8
3  1
4  0
5  0
6  7
7  0
8  2
9  2
Is there an efficient way to cumsum rows with a limit so that, each time the limit is reached, a new cumsum starts? Each time the limit is reached (after however many rows), a row is created with the total cumsum.
Below I have created an example of a function that does this, but it’s very slow, especially when the dataframe becomes very large.
I don’t like that my function is looping and I am looking for a way to make it faster (I guess a way without a loop).
def foo(df, max_value):
    last_value = 0
    storage = []
    for index, row in df.iterrows():
        # running total: this row's value plus what has accumulated so far
        this_value = np.nansum([row[0], last_value])
        if this_value >= max_value:
            # limit reached: record the row label and total, then start over
            storage.append((index, this_value))
            this_value = 0
        last_value = this_value
    return storage
If you run my function like so: foo(df, 5)
In the above context, it returns:
    0
2  10
6   8
Answers:
Method 1
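The loop cannot be avoided, but it can be compiled with numba's njit. Note that prange behaves like a plain range here, because parallel=True is not set; the loop could not safely run in parallel anyway, since each step depends on the previous running total:
from numba import njit, prange

@njit
def dynamic_cumsum(seq, index, max_value):
    cumsum = []
    running = 0
    for i in prange(len(seq)):
        # the limit is tested before the current value is added
        if running > max_value:
            cumsum.append([index[i], running])
            running = 0
        running += seq[i]
    # report whatever is left over against the last index label
    cumsum.append([index[-1], running])
    return cumsum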
The index is required here, assuming your index is not numeric/monotonically increasing.
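As a rough illustration of why the labels are passed in alongside the values, here is a plain-Python sketch of the same loop (no numba, so any label type is accepted). The name `dynamic_cumsum_labels` and the letter labels are made up for this example. Like the function above, it tests the limit before adding the current value, so each total is reported against the label of the row that follows the crossing.

```python
def dynamic_cumsum_labels(values, labels, max_value):
    """Plain-Python sketch of the loop above; labels may be of any type."""
    out = []
    running = 0
    for value, label in zip(values, labels):
        # the limit is tested before the current value is added
        if running > max_value:
            out.append((label, running))
            running = 0
        running += value
    # whatever is left over is reported against the last label
    out.append((labels[-1], running))
    return out


values = [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]   # the example column from the question
labels = list("abcdefghij")               # hypothetical non-numeric index
result = dynamic_cumsum_labels(values, labels, 5)
print(result)  # [('d', 10), ('h', 8), ('j', 4)]
```

With a default integer index the labels and positions coincide, which is why the index argument can be dropped in that case.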
%timeit foo(df, 5)
1.24 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dynamic_cumsum(df.iloc(axis=1)[0].values, df.index.values, 5)
77.2 µs ± 4.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If the index is an integer index whose labels match the row positions (such as the default index), you can shorten this to:
@njit
def dynamic_cumsum2(seq, max_value):
    cumsum = []
    running = 0
    for i in prange(len(seq)):
        if running > max_value:
            cumsum.append([i, running])
            running = 0
        running += seq[i]
    cumsum.append([i, running])
    return cumsum
lst = dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
pd.DataFrame(lst, columns=['A', 'B']).set_index('A')
    B
A
3  10
7   8
9   4
%timeit foo(df, 5)
1.23 ms ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
71.4 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
njit Functions Performance
import perfplot

perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.randint(0, 10, size=(n, 1))),
    kernels=[
        lambda df: list(cumsum_limit_nb(df.iloc[:, 0].values, 5)),
        lambda df: dynamic_cumsum2(df.iloc[:, 0].values, 5)
    ],
    labels=['cumsum_limit_nb', 'dynamic_cumsum2'],
    n_range=[2**k for k in range(0, 17)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=None  # TODO - update when @jpp adds in the final `yield`
)
The log-log plot shows that the generator function is faster for larger inputs.
A possible explanation is that, as N increases, the overhead of appending to a growing list in dynamic_cumsum2 becomes prominent, while cumsum_limit_nb just has to yield.
Method 2
A loop isn’t necessarily bad. The trick is to make sure it’s performed on low-level objects. In this case, you can use Numba or Cython. For example, using a generator with numba.njit:
from numba import njit

@njit
def cumsum_limit(A, limit=5):
    count = 0
    for i in range(A.shape[0]):
        count += A[i]
        if count > limit:
            yield i, count
            count = 0

idx, vals = zip(*cumsum_limit(df[0].values))
res = pd.Series(vals, index=idx)
To demonstrate the performance benefits of JIT-compiling with Numba:
import pandas as pd, numpy as np
from numba import njit

df = pd.DataFrame({0: [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]})

@njit
def cumsum_limit_nb(A, limit=5):
    count = 0
    for i in range(A.shape[0]):
        count += A[i]
        if count > limit:
            yield i, count
            count = 0

def cumsum_limit(A, limit=5):
    count = 0
    for i in range(A.shape[0]):
        count += A[i]
        if count > limit:
            yield i, count
            count = 0

n = 10**4
df = pd.concat([df]*n, ignore_index=True)

%timeit list(cumsum_limit_nb(df[0].values))  # 4.19 ms ± 90.4 µs per loop
%timeit list(cumsum_limit(df[0].values))     # 58.3 ms ± 194 µs per loop
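The generators above stop after the last limit crossing, so a leftover total at the end of the data is not reported. Below is a minimal plain-Python sketch of one way a final yield could be added. The name `cumsum_limit_with_tail` is hypothetical, and this is only an illustration under the assumption that the remainder should be reported against the last position; it has not been tested under numba.

```python
def cumsum_limit_with_tail(values, limit=5):
    """Yield (position, total) at each crossing, then any leftover total."""
    count = 0
    last_emitted = -1          # position of the most recent yield, if any
    n = len(values)
    for i in range(n):
        count += values[i]
        if count > limit:
            yield i, count
            last_emitted = i
            count = 0
    # report the remainder unless the final row already produced a total
    if n and last_emitted != n - 1:
        yield n - 1, count


values = [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]   # the example column from the question
tail_result = list(cumsum_limit_with_tail(values))
print(tail_result)  # [(2, 10), (6, 8), (9, 4)]
```

The first two pairs are the same crossings the generators above produce for this data; the last pair is the leftover total.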
Method 3
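A simpler approach:
def dynamic_cumsum(seq, limit):
    res = []
    cs = seq.cumsum()
    for i, e in enumerate(cs):
        if cs[i] > limit:
            res.append([i, e])
            # restart the running total for everything after position i
            cs[i+1:] -= e
    # append the leftover total unless the last row already produced one
    # (the `res and` guard avoids an IndexError when the limit is never reached)
    if res and res[-1][0] == i:
        return res
    res.append([i, e])
    return res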
Result:
x = dynamic_cumsum(df[0].values, 5)
x
>> [[2, 10], [6, 8], [9, 4]]
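The in-place update `cs[i+1:] -= e` touches the whole remainder of the array at every reset, which can add up when resets are frequent. One possible variation, sketched here in plain Python under the hypothetical name `dynamic_cumsum_offset`, remembers the cumulative total at the last reset and subtracts it on the fly instead. This is an illustration of the idea rather than part of the original answer, and it has not been benchmarked.

```python
from itertools import accumulate


def dynamic_cumsum_offset(seq, limit):
    """Same output shape as dynamic_cumsum, without rewriting the cumsum."""
    res = []
    offset = 0   # cumulative total at the most recent reset
    i = -1       # stays -1 when seq is empty
    total = 0
    for i, total in enumerate(accumulate(seq)):
        segment = total - offset      # running total since the last reset
        if segment > limit:
            res.append([i, segment])
            offset = total
    # append the leftover total unless the last row already produced one
    if i >= 0 and (not res or res[-1][0] != i):
        res.append([i, total - offset])
    return res


values = [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]   # the example column from the question
offset_result = dynamic_cumsum_offset(values, 5)
print(offset_result)  # [[2, 10], [6, 8], [9, 4]]
```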
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0
