Pandas monthly rolling operation

I ended up figuring it out while writing out this question so I’ll just post anyway and answer my own question in case someone else needs a little help.

Problem

Suppose we have a DataFrame, df, containing this data.

import pandas as pd
from io import StringIO

data = StringIO(
"""
date          spendings  category
2014-03-25    10         A
2014-04-05    20         A
2014-04-15    10         A
2014-04-25    10         B
2014-05-05    10         B
2014-05-15    10         A
2014-05-25    10         A
"""
)

df = pd.read_csv(data, sep=r"\s+", parse_dates=True, index_col="date")

Goal

For each row, sum the spendings over every row that is within one month of it, ideally using DataFrame.rolling as it’s a very clean syntax.

What I have tried

df = df.rolling("M").sum()

But this throws an exception

ValueError: <MonthEnd> is a non-fixed frequency

version: pandas==0.19.2

Answers:

Method 1

Use the "D" offset rather than "M" and specifically use "30D" for 30 days or approximately one month.

df = df.rolling("30D").sum()

Initially, I intuitively jumped to using "M" because I assumed it stands for one month. It is actually the month-end offset, which has no fixed length, so it cannot define a rolling window.
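
For reference, here is a minimal, self-contained sketch of Method 1 on the sample data. Selecting only the spendings column is my own assumption rather than part of the original answer: newer pandas versions may refuse to sum the string-valued category column, whereas older ones dropped it silently.

```python
import pandas as pd
from io import StringIO

data = StringIO(
    """
date          spendings  category
2014-03-25    10         A
2014-04-05    20         A
2014-04-15    10         A
2014-04-25    10         B
2014-05-05    10         B
2014-05-15    10         A
2014-05-25    10         A
"""
)

# Split columns on runs of whitespace and use the parsed dates as the index.
df = pd.read_csv(data, sep=r"\s+", parse_dates=True, index_col="date")

# Assumption: sum only the numeric column (see the note above).
# "30D" is a fixed-length offset, so it is accepted as a rolling window.
rolling_30d = df["spendings"].rolling("30D").sum()
print(rolling_30d)
```

The window for each row reaches back 30 days from that row's date, and the index has to be sorted in time order for an offset-based window to work.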

Method 2

As for why you cannot use offsets like "AS" or "Y" here: the "Y" offset is not "a year" but a reference to YearEnd (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases), so the rolling function does not get a fixed window (e.g. you would get a 365-day window if your index falls on Jan 1, and a 1-day window if it falls on Dec 31).
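
A small sketch of what "anchored" means in practice: adding a YearEnd offset moves a timestamp to the next December 31, so the distance covered depends on the starting date, while a Timedelta always covers the same span. The two sample dates are chosen purely for illustration.

```python
import pandas as pd

# Two illustrative starting points (not from the original data).
starts = [pd.Timestamp("2014-01-01"), pd.Timestamp("2014-12-30")]

# Adding YearEnd jumps to the next year-end anchor, whatever the distance.
year_end_gaps = [((ts + pd.offsets.YearEnd()) - ts).days for ts in starts]

# A Timedelta is a fixed duration, independent of the starting date.
fixed_gaps = [((ts + pd.Timedelta("30D")) - ts).days for ts in starts]

print(year_end_gaps)  # [364, 1]
print(fixed_gaps)     # [30, 30]
```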

The proposed solution (a "30D" offset) works if you do not need strict calendar months. Alternatively, you can iterate over your date index and slice with an offset to get more precise control over your sum.

If you have to do it in one line (separated for readability):

df['Sum'] = [
    df.loc[
        edt - pd.tseries.offsets.DateOffset(months=1):edt, 'spendings'
    ].sum() for edt in df.index
]
            spendings category  Sum
date
2014-03-25         10        A   10
2014-04-05         20        A   30
2014-04-15         10        A   40
2014-04-25         10        B   50
2014-05-05         10        B   50
2014-05-15         10        A   40
2014-05-25         10        A   40
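
To see where the two methods diverge, here is a sketch that computes both on the same spendings and prints the rows that differ. The column names sum_30d and sum_month are made up for this comparison, and the data is rebuilt inline so the snippet runs on its own.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2014-03-25", "2014-04-05", "2014-04-15", "2014-04-25",
     "2014-05-05", "2014-05-15", "2014-05-25"]
)
df = pd.DataFrame({"spendings": [10, 20, 10, 10, 10, 10, 10]}, index=dates)

# Fixed 30-day window: by default it covers (t - 30 days, t],
# so a row exactly 30 days back is excluded.
df["sum_30d"] = df["spendings"].rolling("30D").sum().astype(int)

# Calendar-month window: label-based slicing includes both endpoints,
# so a row exactly one month back is included.
df["sum_month"] = [
    df.loc[t - pd.DateOffset(months=1):t, "spendings"].sum() for t in df.index
]

# Show only the rows where the two definitions of "one month" disagree.
print(df[df["sum_30d"] != df["sum_month"]])
```

On this data the two sums agree for the first three rows and differ from 2014-04-25 onward, because the calendar-month slice is as long as or longer than 30 days there and also keeps the row sitting on its left edge.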

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
