I’ve got a dataframe with the following information:
filename val1 val2 t 1 file1.csv 5 10 2 file1.csv NaN NaN 3 file1.csv 15 20 6 file2.csv NaN NaN 7 file2.csv 10 20 8 file2.csv 12 15
I would like to interpolate the values in the dataframe based on the indices, but only within each file group.
To interpolate, I would normally do
df = df.interpolate(method="index")
And to group, I do
grouped = df.groupby("filename")
I would like the interpolated dataframe to look like this:
filename val1 val2 t 1 file1.csv 5 10 2 file1.csv 10 15 3 file1.csv 15 20 6 file2.csv NaN NaN 7 file2.csv 10 20 8 file2.csv 12 15
Where the NaN’s are still present at t = 6 since they are the first items in the file2 group.
I suspect I need to use “apply”, but haven’t been able to figure out exactly how…
grouped.apply(interp1d) ... TypeError: __init__() takes at least 3 arguments (2 given)
Any help would be appreciated.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
>>> df.groupby('filename').apply(lambda group: group.interpolate(method='index'))
filename val1 val2
t
1 file1.csv 5 10
2 file1.csv 10 15
3 file1.csv 15 20
6 file2.csv NaN NaN
7 file2.csv 10 20
8 file2.csv 12 15
Method 2
I ran into this as well. Instead of using apply, you can use transform, which will reduce your run time by more than 25% if you have on the order of 1000 groups:
import numpy as np
import pandas as pd
np.random.seed(500)
test_df = pd.DataFrame({
'a': np.random.randint(low=0, high=1000, size=10000),
'b': np.random.choice([1, 2, 4, 7, np.nan], size=10000, p=([0.2475]*4 + [0.01]))
})
Tests:
%timeit test_df.groupby('a').transform(pd.DataFrame.interpolate)
Output: 566 ms ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df.groupby('a').apply(pd.DataFrame.interpolate)
Output: 788 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df.groupby('a').apply(lambda group: group.interpolate())
Output: 787 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df.interpolate()
Output: 918 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You will still see a significant increase in run-time compared to a fully vectorized call to interpolate on the full DataFrame, but I don’t think you can do much better in pandas.
Method 3
Considering the long running time of above methods, I suggest use a for loop and interpolate(), which is no more than few lines of codes, but much faster in speed.
for i in range(len(df.filename.unique())):
mask = df.loc[:,'filename']==df.filename.unique()[i]
df[mask]=dfs[mask].interpolate(method='index')
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0