I have a pandas df, like this:
    ID        date  value
0   10  2022-01-01    100
1   10  2022-01-02    150
2   10  2022-01-03      0
3   10  2022-01-04      0
4   10  2022-01-05    200
5   10  2022-01-06      0
6   10  2022-01-07    150
7   10  2022-01-08      0
8   10  2022-01-09      0
9   10  2022-01-10      0
10  10  2022-01-11      0
11  10  2022-01-12    100
12  23  2022-02-01    490
13  23  2022-02-02      0
14  23  2022-02-03    350
15  23  2022-02-04    333
16  23  2022-02-05      0
17  23  2022-02-06      0
18  23  2022-02-07      0
19  23  2022-02-08    211
20  23  2022-02-09    100

I would like to calculate the number of days since the last non-zero value, as in the example below. How can I use diff() for this? The calculation should restart for each ID.
Output:
    ID        date  value  days_last_value
0   10  2022-01-01    100                0
1   10  2022-01-02    150                1
2   10  2022-01-03      0
3   10  2022-01-04      0
4   10  2022-01-05    200                3
5   10  2022-01-06      0
6   10  2022-01-07    150                2
7   10  2022-01-08      0
8   10  2022-01-09      0
9   10  2022-01-10      0
10  10  2022-01-11      0
11  10  2022-01-12    100                5
12  23  2022-02-01    490                0
13  23  2022-02-02      0
14  23  2022-02-03    350                2
15  23  2022-02-04    333                1
16  23  2022-02-05      0
17  23  2022-02-06      0
18  23  2022-02-07      0
19  23  2022-02-08    211                4
20  23  2022-02-09    100                1
Answers:
Method 1
Explanation below.
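import pandas as pd

# Sample data from the question (the 'date' column is not needed for this method)
df = pd.DataFrame({'ID': 12 * [10] + 9 * [23],
                   'value': [100, 150, 0, 0, 200, 0, 150, 0, 0, 0, 0, 100, 490, 0, 350, 333, 0, 0, 0, 211, 100]})
# Size of each (ID, block) group, shifted by one position within each ID
days = df.groupby(['ID', (df['value'] != 0).cumsum()]).size().groupby('ID').shift(fill_value=0)
# Relabel with the index of the non-zero rows so the assignment below aligns
days.index = df.index[df['value'] != 0]
df['days_last_value'] = days
df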
    ID  value  days_last_value
0   10    100              0.0
1   10    150              1.0
2   10      0              NaN
3   10      0              NaN
4   10    200              3.0
5   10      0              NaN
6   10    150              2.0
7   10      0              NaN
8   10      0              NaN
9   10      0              NaN
10  10      0              NaN
11  10    100              5.0
12  23    490              0.0
13  23      0              NaN
14  23    350              2.0
15  23    333              1.0
16  23      0              NaN
17  23      0              NaN
18  23      0              NaN
19  23    211              4.0
20  23    100              1.0
First, we’ll have to group by ‘ID’.
We also create groups for each block of days by building a True/False series that marks where value is not 0, then taking its cumulative sum. That is the part (df['value'] != 0).cumsum(), which results in:
0      1
1      2
2      2
3      2
4      3
5      3
6      4
7      4
8      4
9      4
10     4
11     5
12     6
13     6
14     7
15     8
16     8
17     8
18     8
19     9
20    10
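To see how the cumulative sum labels the blocks, here is a minimal sketch on a smaller, made-up frame (the name toy and its values are only for illustration, not from the question):

```python
import pandas as pd

# Hypothetical toy data, shorter than the frame in the question
toy = pd.DataFrame({'ID': [10, 10, 10, 10, 23, 23, 23],
                    'value': [100, 0, 0, 200, 490, 0, 350]})

is_nonzero = toy['value'] != 0   # True on rows that carry a value
block = is_nonzero.cumsum()      # counter goes up by 1 at every non-zero row

# Each non-zero row starts a new block; the zero rows after it share its label
print(pd.DataFrame({'value': toy['value'],
                    'is_nonzero': is_nonzero,
                    'block': block}))
```

Every row in a block carries the same label, so the label can serve as a grouping key.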
We can also group on the values in this series; combined with the 'ID' grouping, this gives the individual blocks of days. This is the df.groupby(['ID', (df['value'] != 0).cumsum()]) part.
Now, for each block, we get its size. A block starts at a non-zero row and runs until just before the next one, so its size is the interval in days between those two non-zero rows (provided there is one row per day, as in the example). That interval belongs to the next non-zero row, so we shift the sizes by one position so that each size lands on the following block, and fill the first entry of each ID with 0. The shift has to happen per ID, so we group by ID again before shifting (the grouping is lost after .size()).
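The size-then-shift step can be checked on a small made-up frame. This is only a sketch: toy and the level name block are illustrative, and groupby(level='ID') is used here as an explicit way of grouping on the index level.

```python
import pandas as pd

# Hypothetical toy data, shorter than the frame in the question
toy = pd.DataFrame({'ID': [10, 10, 10, 10, 23, 23, 23],
                    'value': [100, 0, 0, 200, 490, 0, 350]})
block = (toy['value'] != 0).cumsum().rename('block')

# Number of rows in each (ID, block) group
sizes = toy.groupby(['ID', block]).size()

# Move each size onto the next block of the same ID; the first block gets 0
shifted = sizes.groupby(level='ID').shift(fill_value=0)

print(sizes)
print(shifted)
```

For ID 10 the first block has 3 rows, so the second non-zero row ends up with 3; the first non-zero row of each ID gets the fill value 0.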
Now, this new series needs to be assigned back to the dataframe, but it is shorter than the dataframe. Its index also no longer matches the dataframe's, so we can't easily reassign it (not with df['days_last_value'], df.loc[…] or df.iloc).
Instead, we select the index values of the original dataframe where value is not zero, and set the index of the days equal to that.
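A minimal sketch of the relabelling and the aligned assignment it enables, on made-up data (toy, the gap values and the 'a'/'b'/'c' labels are placeholders for illustration):

```python
import pandas as pd

# Hypothetical toy frame with three non-zero rows at positions 0, 3 and 5
toy = pd.DataFrame({'value': [100, 0, 0, 200, 0, 150]})

# Made-up gap values, one per non-zero row, carrying an unrelated index
days = pd.Series([0, 3, 2], index=['a', 'b', 'c'])

# Relabel the series with the row labels of the non-zero rows
days.index = toy.index[toy['value'] != 0]

# Assignment aligns on index labels; rows without a match become NaN
toy['days_last_value'] = days
print(toy)
```

Because the column now contains NaN, pandas stores it as float, which is why the result above shows 0.0, 1.0 and so on.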
Now it's an easy step to assign the days directly to the relevant column in the dataframe: pandas will match the indices, and rows without a matching index (the zero-value rows) get NaN.
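Since the question asked about diff(), here is a possible alternative sketch that is not part of the method above. It assumes the 'date' column is (or can be converted to) datetime, and it uses a shortened, made-up sample in the same shape as the question's data. It measures calendar days between non-zero rows, so it should agree with the size-based approach only when each ID has exactly one row per day.

```python
import pandas as pd

# Shortened sample in the shape of the question's data (values are illustrative)
df = pd.DataFrame({
    'ID': 5 * [10] + 3 * [23],
    'date': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03',
                            '2022-01-04', '2022-01-05',
                            '2022-02-01', '2022-02-02', '2022-02-03']),
    'value': [100, 150, 0, 0, 200, 490, 0, 350],
})

# Keep only the rows that carry a value; the original index is preserved
nonzero = df[df['value'] != 0]

# Difference between consecutive dates within each ID, expressed in days
gaps = nonzero.groupby('ID')['date'].diff().dt.days

# The first non-zero row of each ID has no predecessor, so use 0 there
df['days_last_value'] = gaps.fillna(0)
print(df)
```

The zero-value rows are absent from gaps, so they come out as NaN after the assignment, just as in the method above.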
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.