Why is apply sometimes not faster than a for loop on a Pandas dataframe?

It seems that apply speeds up operations on a dataframe in most cases, but when I use apply I don’t see any speedup. Here is my example. I have a dataframe with two columns:

>>> df
index  col1  col2
1        10    20
2        20    30
3        30    40

For each row in the dataframe, I want to apply a function R(x) to col1 and divide the result by the value in col2. For example, the result for the first row should be R(10)/20.

This is my function which will be called in apply:

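def _f(row):
    # row is one dataframe row passed in by apply(axis=1);
    # the name avoids shadowing the built-in input()
    return R(row['col1']) / row['col2']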

Then I call _f in apply: df.apply(_f, axis=1)

But I find that in this case apply is much slower than a for loop such as:

for i in df.index:
    new_df.loc[i] = R(df.loc[i, 'col1']) / df.loc[i, 'col2']

Can anyone explain the reason?

Answers:

Method 1

It is my understanding that .apply is not generally faster than iterating over the axis yourself. I believe that under the hood it is merely a loop over the axis, except that in this case you also incur the overhead of a function call on every iteration.
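One way to check this on your own data is to time both versions. The following is only a sketch: R here is a hypothetical stand-in (the question never shows the real R), and with_apply and with_loop are illustrative helper names. Which version comes out ahead can depend on the pandas version and the size of the frame, so no timings are claimed here.

```python
import timeit

import pandas as pd


def R(x):
    # Hypothetical stand-in for the question's R(x); the real function is not shown.
    return x * x + 1.0


df = pd.DataFrame(
    {"col1": [10.0, 20.0, 30.0], "col2": [20.0, 30.0, 40.0]},
    index=[1, 2, 3],
)


def with_apply(frame):
    # Row-wise apply: pandas hands each row to the function as a Series.
    return frame.apply(lambda row: R(row["col1"]) / row["col2"], axis=1)


def with_loop(frame):
    # Explicit loop over the index, writing one scalar at a time.
    out = pd.Series(index=frame.index, dtype="float64")
    for i in frame.index:
        out.loc[i] = R(frame.loc[i, "col1"]) / frame.loc[i, "col2"]
    return out


apply_result = with_apply(df)
loop_result = with_loop(df)

# Time each approach; results vary by machine and pandas version.
apply_seconds = timeit.timeit(lambda: with_apply(df), number=20)
loop_seconds = timeit.timeit(lambda: with_loop(df), number=20)
print(f"apply: {apply_seconds:.4f}s  loop: {loop_seconds:.4f}s")
```

Both functions produce the same values, so the comparison is purely about overhead.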

If we look at the source code, we can see that essentially we are iterating over the indicated axis and applying the function, collecting the individual results as series in a dictionary, and finally calling the dataframe constructor on the dictionary to return a new DataFrame:

    if axis == 0:
        series_gen = (self._ixs(i, axis=1)
                      for i in range(len(self.columns)))
        res_index = self.columns
        res_columns = self.index
    elif axis == 1:
        res_index = self.index
        res_columns = self.columns
        values = self.values
        series_gen = (Series.from_array(arr, index=res_columns, name=name,
                                        dtype=dtype)
                      for i, (arr, name) in enumerate(zip(values,
                                                          res_index)))
    else:  # pragma : no cover
        raise AssertionError('Axis must be 0 or 1, got %s' % str(axis))

    i = None
    keys = []
    results = {}
    if ignore_failures:
        successes = []
        for i, v in enumerate(series_gen):
            try:
                results[i] = func(v)
                keys.append(v.name)
                successes.append(i)
            except Exception:
                pass
        # so will work with MultiIndex
        if len(successes) < len(res_index):
            res_index = res_index.take(successes)
    else:
        try:
            for i, v in enumerate(series_gen):
                results[i] = func(v)
                keys.append(v.name)
        except Exception as e:
            if hasattr(e, 'args'):
                # make sure i is defined
                if i is not None:
                    k = res_index[i]
                    e.args = e.args + ('occurred at index %s' %
                                       pprint_thing(k), )
            raise

    if len(results) > 0 and is_sequence(results[0]):
        if not isinstance(results[0], Series):
            index = res_columns
        else:
            index = None

        result = self._constructor(data=results, index=index)
        result.columns = res_index

        if axis == 1:
            result = result.T
        result = result._convert(datetime=True, timedelta=True, copy=False)

    else:

        result = Series(results)
        result.index = res_index

    return result

Specifically:

    for i, v in enumerate(series_gen):
        results[i] = func(v)
        keys.append(v.name)

Where series_gen was constructed based on the requested axis.

To get more performance out of a function, you can follow the advice given here.

Essentially, your options are:

Write a C extension
Use numba (a JIT compiler)
Use pandas.eval to squeeze performance out of large Dataframes
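As an illustration of the last option, the sketch below uses DataFrame.eval to compute the whole column in one expression. It assumes R can be written as plain arithmetic; the formula x * x + 1 is a hypothetical placeholder, since the real R is not shown in the question. If R cannot be expressed this way, this approach does not apply and numba or a C extension would be the options to look at.

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 20, 30], "col2": [20, 30, 40]}, index=[1, 2, 3])

# Assumption: R(x) = x * x + 1, written inline as an expression string.
# The expression is evaluated over entire columns, not row by row.
eval_result = df.eval("(col1 * col1 + 1) / col2")
print(eval_result)
```

On a three-row frame the difference is not measurable; pandas.eval is aimed at large frames.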

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
