pandas or python equivalent of tidyr complete

I have data that looks like this:

library("tidyverse")

df <- tibble(user = c(1, 1, 2, 3, 3, 3), x = c("a", "b", "a", "a", "c", "d"), y = 1)
df

#    user     x     y
# 1     1     a     1
# 2     1     b     1
# 3     2     a     1
# 4     3     a     1
# 5     3     c     1
# 6     3     d     1

Python format:

import pandas as pd
df = pd.DataFrame({'user':[1, 1, 2, 3, 3, 3], 'x':['a', 'b', 'a', 'a', 'c', 'd'], 'y':1})

I’d like to “complete” the data frame so that every user has a record for every possible x with the default y fill set to 0.

This is somewhat trivial in R (tidyverse/tidyr):

df %>% 
    complete(nesting(user), x = c("a", "b", "c", "d"), fill = list(y = 0))

#    user     x     y
# 1     1     a     1
# 2     1     b     1
# 3     1     c     0
# 4     1     d     0
# 5     2     a     1
# 6     2     b     0
# 7     2     c     0
# 8     2     d     0
# 9     3     a     1
# 10    3     b     0
# 11    3     c     1
# 12    3     d     1

Is there a complete equivalent in pandas / python that will yield the same result?

Answers:

Method 1

You can use reindex with a MultiIndex built by MultiIndex.from_product:

df = df.set_index(['user','x'])
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]],names=['user','x'])
df = df.reindex(mux, fill_value=0).reset_index()
print (df)
    user  x  y
0      1  a  1
1      1  b  1
2      1  c  0
3      1  d  0
4      2  a  1
5      2  b  0
6      2  c  0
7      2  d  0
8      3  a  1
9      3  b  0
10     3  c  1
11     3  d  1

Or set_index + unstack + stack:

df = df.set_index(['user','x'])['y'].unstack(fill_value=0).stack().reset_index(name='y')
print (df)
    user  x  y
0      1  a  1
1      1  b  1
2      1  c  0
3      1  d  0
4      2  a  1
5      2  b  0
6      2  c  0
7      2  d  0
8      3  a  1
9      3  b  0
10     3  c  1
11     3  d  1
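
The reindex approach above takes the x values from the levels already present in the data, whereas the tidyr call in the question lists them explicitly. A minimal sketch of the same idea with an explicit list follows. The helper name complete_xy and its parameters are made up for illustration, and it assumes each (user, x) pair appears at most once, since reindex needs a unique index.

```python
import pandas as pd

df = pd.DataFrame({
    "user": [1, 1, 2, 3, 3, 3],
    "x": ["a", "b", "a", "a", "c", "d"],
    "y": 1,
})


def complete_xy(frame, user_col, x_col, x_values, fill_value=0):
    """Hypothetical helper: one row per (user, x) pair, missing pairs filled."""
    # Users are taken from the data; x values are supplied by the caller,
    # mirroring x = c("a", "b", "c", "d") in the tidyr call.
    users = frame[user_col].unique()
    mux = pd.MultiIndex.from_product([users, x_values], names=[user_col, x_col])
    return (
        frame.set_index([user_col, x_col])
        .reindex(mux, fill_value=fill_value)
        .reset_index()
    )


completed = complete_xy(df, "user", "x", ["a", "b", "c", "d"])
print(completed)
```

Passing the list explicitly also makes it possible to include an x value that no user has yet, which the levels-based version cannot do.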

Method 2

It’s very easy now to use those dplyr/tidyr APIs in python with datar:

>>> from datar.all import f, c, tibble, complete, nesting
>>> df = tibble(user=c(1, 1, 2, 3, 3, 3), x=c("a", "b", "a", "a", "c", "d"), y=1)
>>> df >> complete(nesting(f.user), x=c("a", "b", "c", "d"), fill={'y': 0})
      user        x         y
   <int64> <object> <float64>
0        1        a       1.0
1        1        b       1.0
2        1        c       0.0
3        1        d       0.0
4        2        a       1.0
5        2        b       0.0
6        2        c       0.0
7        2        d       0.0
8        3        a       1.0
9        3        b       0.0
10       3        c       1.0
11       3        d       1.0

I am the author of the package. Feel free to submit issues if you have any questions.

Method 3

We could use the complete function from pyjanitor, which provides a convenient abstraction for generating the missing rows:

# pip install pyjanitor
import pandas as pd
import janitor
df.complete('user', 'x', fill_value = 0)

    user  x  y
0      1  a  1
1      1  b  1
2      1  c  0
3      1  d  0
4      2  a  1
5      2  b  0
6      2  c  0
7      2  d  0
8      3  a  1
9      3  b  0
10     3  c  1
11     3  d  1

More examples can be found in the pyjanitor documentation.

Method 4

Another pandas option is pivot + fillna + melt:

df2 = (df
       .pivot(index='user', columns='x', values='y')
       .fillna(0)
       .melt(value_name='y', ignore_index=False)
       .reset_index()
       .sort_values(['user', 'x'])
       )
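
One thing to watch with the pivot route: the pivot step introduces NaN for the missing pairs, so y ends up as float even after fillna(0). The sketch below runs the same pipeline on the question's sample data, with a cast back to integers and a fresh row index added. Both additions are optional and are my own, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "user": [1, 1, 2, 3, 3, 3],
    "x": ["a", "b", "a", "a", "c", "d"],
    "y": 1,
})

df2 = (
    df.pivot(index="user", columns="x", values="y")  # wide: one column per x
    .fillna(0)                                       # missing pairs become 0.0
    .astype("int64")                                 # added: undo the float upcast
    .melt(value_name="y", ignore_index=False)        # back to long, keeps 'user' index
    .reset_index()
    .sort_values(["user", "x"], ignore_index=True)   # added: renumber rows 0..n-1
)
print(df2)
```

The melted column is called x because melt picks up the name of the pivoted column index by default.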

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
