Pandas filling missing dates and values within group

I’ve a data frame that looks like the following

x = pd.DataFrame({'user': ['a','a','b','b'], 'dt': ['2016-01-01','2016-01-02', '2016-01-05','2016-01-06'], 'val': [1,33,2,1]})

What I would like to be able to do is find the minimum and maximum date within the date column and expand that column to have all the dates there while simultaneously filling in 0 for the val column. So the desired output is

            dt user  val
0   2016-01-01    a    1
1   2016-01-02    a   33
2   2016-01-03    a    0
3   2016-01-04    a    0
4   2016-01-05    a    0
5   2016-01-06    a    0
6   2016-01-01    b    0
7   2016-01-02    b    0
8   2016-01-03    b    0
9   2016-01-04    b    0
10  2016-01-05    b    2
11  2016-01-06    b    1

I’ve tried the solution mentioned here and here but they aren’t what I’m after.
Any pointers much appreciated.

Contents hide

Answers:

Method 1

Method 2

Method 3

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Initial Dataframe:

            dt  user    val
0   2016-01-01     a      1
1   2016-01-02     a     33
2   2016-01-05     b      2
3   2016-01-06     b      1

First, convert the dates to datetime:

x['dt'] = pd.to_datetime(x['dt'])

Then, generate the dates and unique users:

dates = x.set_index('dt').resample('D').asfreq().index

>> DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', name='dt', freq='D')

users = x['user'].unique()

>> array(['a', 'b'], dtype=object)

This will allow you to create a MultiIndex:

idx = pd.MultiIndex.from_product((dates, users), names=['dt', 'user'])

>> MultiIndex(levels=[[2016-01-01 00:00:00, 2016-01-02 00:00:00, 2016-01-03 00:00:00, 2016-01-04 00:00:00, 2016-01-05 00:00:00, 2016-01-06 00:00:00], ['a', 'b']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
           names=['dt', 'user'])

You can use that to reindex your DataFrame:

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index()
Out: 
           dt user  val
0  2016-01-01    a    1
1  2016-01-01    b    0
2  2016-01-02    a   33
3  2016-01-02    b    0
4  2016-01-03    a    0
5  2016-01-03    b    0
6  2016-01-04    a    0
7  2016-01-04    b    0
8  2016-01-05    a    0
9  2016-01-05    b    2
10 2016-01-06    a    0
11 2016-01-06    b    1

which then can be sorted by users:

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index().sort_values(by='user')
Out: 
           dt user  val
0  2016-01-01    a    1
2  2016-01-02    a   33
4  2016-01-03    a    0
6  2016-01-04    a    0
8  2016-01-05    a    0
10 2016-01-06    a    0
1  2016-01-01    b    0
3  2016-01-02    b    0
5  2016-01-03    b    0
7  2016-01-04    b    0
9  2016-01-05    b    2
11 2016-01-06    b    1

Method 2

As @ayhan suggests

x.dt = pd.to_datetime(x.dt)

One-liner using mostly @ayhan’s ideas while incorporating stack/unstack and fill_value

x.set_index(
    ['dt', 'user']
).unstack(
    fill_value=0
).asfreq(
    'D', fill_value=0
).stack().sort_index(level=1).reset_index()

           dt user  val
0  2016-01-01    a    1
1  2016-01-02    a   33
2  2016-01-03    a    0
3  2016-01-04    a    0
4  2016-01-05    a    0
5  2016-01-06    a    0
6  2016-01-01    b    0
7  2016-01-02    b    0
8  2016-01-03    b    0
9  2016-01-04    b    0
10 2016-01-05    b    2
11 2016-01-06    b    1

Method 3

An old question, with already excellent answers; this is an alternative, using the complete function from pyjanitor that could help with the abstraction when generating explicitly missing rows:

#pip install pyjanitor
import pandas as pd
import janitor as jn

 x['dt'] = pd.to_datetime(x['dt'])

# generate complete list of dates
dates = dict(dt = pd.date_range(x.dt.min(), x.dt.max(), freq='1D'))

# build the new dataframe, and fill nulls with 0
x.complete('user', dates, fill_value = 0)

   user         dt  val
0     a 2016-01-01    1
1     a 2016-01-02   33
2     a 2016-01-03    0
3     a 2016-01-04    0
4     a 2016-01-05    0
5     a 2016-01-06    0
6     b 2016-01-01    0
7     b 2016-01-02    0
8     b 2016-01-03    0
9     b 2016-01-04    0
10    b 2016-01-05    2
11    b 2016-01-06    1

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating