I have a dataframe with a timeindex and 3 columns containing the coordinates of a 3D vector:
x y z ts 2014-05-15 10:38 0.120117 0.987305 0.116211 2014-05-15 10:39 0.117188 0.984375 0.122070 2014-05-15 10:40 0.119141 0.987305 0.119141 2014-05-15 10:41 0.116211 0.984375 0.120117 2014-05-15 10:42 0.119141 0.983398 0.118164
I would like to apply a transformation to each row that also returns a vector
def myfunc(a, b, c):
do something
return e, f, g
but if I do:
df.apply(myfunc, axis=1)
I end up with a Pandas series whose elements are tuples. This is beacause apply will take the result of myfunc without unpacking it. How can I change myfunc so that I obtain a new df with 3 columns?
Edit:
All solutions below work. The Series solution does allow for column names, the List solution seem to execute faster.
def myfunc1(args):
e=args[0] + 2*args[1]
f=args[1]*args[2] +1
g=args[2] + args[0] * args[1]
return pd.Series([e,f,g], index=['a', 'b', 'c'])
def myfunc2(args):
e=args[0] + 2*args[1]
f=args[1]*args[2] +1
g=args[2] + args[0] * args[1]
return [e,f,g]
%timeit df.apply(myfunc1 ,axis=1)
100 loops, best of 3: 4.51 ms per loop
%timeit df.apply(myfunc2 ,axis=1)
100 loops, best of 3: 2.75 ms per loop
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Return Series and it will put them in a DataFrame.
def myfunc(a, b, c):
do something
return pd.Series([e, f, g])
This has the bonus that you can give labels to each of the resulting columns. If you return a DataFrame it just inserts multiple rows for the group.
Method 2
Based on the excellent answer by @U2EF1, I’ve created a handy function that applies a specified function that returns tuples to a dataframe field, and expands the result back to the dataframe.
def apply_and_concat(dataframe, field, func, column_names):
return pd.concat((
dataframe,
dataframe[field].apply(
lambda cell: pd.Series(func(cell), index=column_names))), axis=1)
Usage:
df = pd.DataFrame([1, 2, 3], index=['a', 'b', 'c'], columns=['A'])
print df
A
a 1
b 2
c 3
def func(x):
return x*x, x*x*x
print apply_and_concat(df, 'A', func, ['x^2', 'x^3'])
A x^2 x^3
a 1 1 1
b 2 4 8
c 3 9 27
Hope it helps someone.
Method 3
I’ve tried returning a tuple (I was using functions like scipy.stats.pearsonr which return that kind of structures) but It returned a 1D Series instead of a Dataframe which was I expected. If I created a Series manually the performance was worse, so I fixed It using the result_type as explained in the official API documentation:
Returning a Series inside the function is similar to passing
result_type=’expand’. The resulting column names will be the Series
index.
So you could edit your code this way:
def myfunc(a, b, c):
# do something
return (e, f, g)
df.apply(myfunc, axis=1, result_type='expand')
Method 4
Some of the other people’s answers contain mistakes, so I’ve summarized them below. The perfect answer is below.
Prepare the dataset. The version of pandas uses 1.1.5.
import numpy as np
import pandas as pd
import timeit
# check pandas version
print(pd.__version__)
# 1.1.5
# prepare DataFrame
df = pd.DataFrame({
'x': [0.120117, 0.117188, 0.119141, 0.116211, 0.119141],
'y': [0.987305, 0.984375, 0.987305, 0.984375, 0.983398],
'z': [0.116211, 0.122070, 0.119141, 0.120117, 0.118164]},
index=[
'2014-05-15 10:38',
'2014-05-15 10:39',
'2014-05-15 10:40',
'2014-05-15 10:41',
'2014-05-15 10:42'],
columns=['x', 'y', 'z'])
df.index.name = 'ts'
# x y z
# ts
# 2014-05-15 10:38 0.120117 0.987305 0.116211
# 2014-05-15 10:39 0.117188 0.984375 0.122070
# 2014-05-15 10:40 0.119141 0.987305 0.119141
# 2014-05-15 10:41 0.116211 0.984375 0.120117
# 2014-05-15 10:42 0.119141 0.983398 0.118164
Solution 01.
Returns pd.Series in the apply function.
def myfunc1(args):
e = args[0] + 2*args[1]
f = args[1]*args[2] + 1
g = args[2] + args[0] * args[1]
return pd.Series([e, f, g])
df[['e', 'f', 'g']] = df.apply(myfunc1, axis=1)
# x y z e f g
# ts
# 2014-05-15 10:38 0.120117 0.987305 0.116211 2.094727 1.114736 0.234803
# 2014-05-15 10:39 0.117188 0.984375 0.122070 2.085938 1.120163 0.237427
# 2014-05-15 10:40 0.119141 0.987305 0.119141 2.093751 1.117629 0.236770
# 2014-05-15 10:41 0.116211 0.984375 0.120117 2.084961 1.118240 0.234512
# 2014-05-15 10:42 0.119141 0.983398 0.118164 2.085937 1.116202 0.235327
t1 = timeit.timeit(
'df.apply(myfunc1, axis=1)',
globals=dict(df=df, myfunc1=myfunc1), number=10000)
print(round(t1, 3), 'seconds')
# 14.571 seconds
Solution 02.
Use result_type ='expand' when applying.
def myfunc2(args):
e = args[0] + 2*args[1]
f = args[1]*args[2] + 1
g = args[2] + args[0] * args[1]
return [e, f, g]
df[['e', 'f', 'g']] = df.apply(myfunc2, axis=1, result_type='expand')
# x y z e f g
# ts
# 2014-05-15 10:38 0.120117 0.987305 0.116211 2.094727 1.114736 0.234803
# 2014-05-15 10:39 0.117188 0.984375 0.122070 2.085938 1.120163 0.237427
# 2014-05-15 10:40 0.119141 0.987305 0.119141 2.093751 1.117629 0.236770
# 2014-05-15 10:41 0.116211 0.984375 0.120117 2.084961 1.118240 0.234512
# 2014-05-15 10:42 0.119141 0.983398 0.118164 2.085937 1.116202 0.235327
t2 = timeit.timeit(
"df.apply(myfunc2, axis=1, result_type='expand')",
globals=dict(df=df, myfunc2=myfunc2), number=10000)
print(round(t2, 3), 'seconds')
# 9.907 seconds
Solution 03.
If you want to make it faster, use np.vectorize. Note that args cannot be a single argument when using np.vectorize.
def myfunc3(args0, args1, args2):
e = args0 + 2*args1
f = args1*args2 + 1
g = args2 + args0 * args1
return [e, f, g]
df[['e', 'f', 'g']] = pd.DataFrame(np.row_stack(np.vectorize(myfunc3, otypes=['O'])(df['x'], df['y'], df['z'])), index=df.index)
# x y z e f g
# ts
# 2014-05-15 10:38 0.120117 0.987305 0.116211 2.094727 1.114736 0.234803
# 2014-05-15 10:39 0.117188 0.984375 0.122070 2.085938 1.120163 0.237427
# 2014-05-15 10:40 0.119141 0.987305 0.119141 2.093751 1.117629 0.236770
# 2014-05-15 10:41 0.116211 0.984375 0.120117 2.084961 1.118240 0.234512
# 2014-05-15 10:42 0.119141 0.983398 0.118164 2.085937 1.116202 0.235327
t3 = timeit.timeit(
"pd.DataFrame(np.row_stack(np.vectorize(myfunc3, otypes=['O'])(df['x'], df['y'], df['z'])), index=df.index)",
globals=dict(pd=pd, np=np, df=df, myfunc3=myfunc3), number=10000)
print(round(t3, 3), 'seconds')
# 1.598 seconds
Method 5
Just return a list instead of tuple.
In [81]: df
Out[81]:
x y z
ts
2014-05-15 10:38:00 0.120117 0.987305 0.116211
2014-05-15 10:39:00 0.117188 0.984375 0.122070
2014-05-15 10:40:00 0.119141 0.987305 0.119141
2014-05-15 10:41:00 0.116211 0.984375 0.120117
2014-05-15 10:42:00 0.119141 0.983398 0.118164
[5 rows x 3 columns]
In [82]: def myfunc(args):
....: e=args[0] + 2*args[1]
....: f=args[1]*args[2] +1
....: g=args[2] + args[0] * args[1]
....: return [e,f,g]
....:
In [83]: df.apply(myfunc ,axis=1)
Out[83]:
x y z
ts
2014-05-15 10:38:00 2.094727 1.114736 0.234803
2014-05-15 10:39:00 2.085938 1.120163 0.237427
2014-05-15 10:40:00 2.093751 1.117629 0.236770
2014-05-15 10:41:00 2.084961 1.118240 0.234512
2014-05-15 10:42:00 2.085937 1.116202 0.235327
Method 6
Found a possible solution, by changing myfunc to return an np.array like this:
import numpy as np
def myfunc(a, b, c):
do something
return np.array((e, f, g))
any better solution?
Method 7
Pandas 1.0.5 has DataFrame.apply with parameter result_type that can help here.
from the docs:
These only act when axis=1 (columns): ‘expand’ : list-like results will be turned into columns. ‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’. ‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0