Reverse a get_dummies encoding in pandas

Column names are: ID,1,2,3,4,5,6,7,8,9.

The col values are either 0 or 1

My dataframe looks like this:

 ID     1    2    3    4    5    6   7   8   9 

1002    0    1    0    1    0    0   0   0   0
1003    0    0    0    0    0    0   0   0   0 
1004    1    1    0    0    0    0   0   0   0
1005    0    0    0    0    1    0   0   0   0
1006    0    0    0    0    0    1   0   0   0
1007    1    0    1    0    0    0   0   0   0
1000    0    0    0    0    0    0   0   0   0
1009    0    0    1    0    0    0   1   0   0

I want the column names in front of the ID where the value in a row is 1.

The Dataframe i want should look like this:

 ID      Col2
1002       2    // has 1 at Col(2) and Col(4)
1002       4    
1004       1    // has 1 at col(1) and col(2)
1004       2
1005       5    // has 1 at col(5)
1006       6    // has 1 at col(6)
1007       1    // has 1 at col(1) and col(3)
1007       3
1009       3    // has 1 at col(3) and col(7)
1009       7

Please help me in this, Thanks in advance

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Pretty one-liner 🙂

new_df = df.idxmax(axis=1)

Method 2

Several great answers for the OP post. However, often get_dummies is used for multiple categorical features. Pandas uses a prefix separator prefix_sep to distinguish different values for a column.

The following function collapses a “dummified” dataframe while keeping the order of columns:

def undummify(df, prefix_sep="_"):
    cols2collapse = {
        item.split(prefix_sep)[0]: (prefix_sep in item) for item in df.columns
    }
    series_list = []
    for col, needs_to_collapse in cols2collapse.items():
        if needs_to_collapse:
            undummified = (
                df.filter(like=col)
                .idxmax(axis=1)
                .apply(lambda x: x.split(prefix_sep, maxsplit=1)[1])
                .rename(col)
            )
            series_list.append(undummified)
        else:
            series_list.append(df[col])
    undummified_df = pd.concat(series_list, axis=1)
    return undummified_df

Example

>>> df
     a    b    c
0  A_1  B_1  C_1
1  A_2  B_2  C_2
>>> df2 = pd.get_dummies(df)
>>> df2
   a_A_1  a_A_2  b_B_1  b_B_2  c_C_1  c_C_2
0      1      0      1      0      1      0
1      0      1      0      1      0      1
>>> df3 = undummify(df2)
>>> df3
     a    b    c
0  A_1  B_1  C_1
1  A_2  B_2  C_2

Method 3

set_index + stack, stack will dropna by default

df.set_index('ID',inplace=True)

df[df==1].stack().reset_index().drop(0,1)
Out[363]: 
     ID level_1
0  1002       2
1  1002       4
2  1004       1
3  1004       2
4  1005       5
5  1006       6
6  1007       1
7  1007       3
8  1009       3
9  1009       7

Method 4

np.argwhere

v = np.argwhere(df.drop('ID', 1).values).T
pd.DataFrame({'ID' : df.loc[v[0], 'ID'], 'Col2' : df.columns[1:][v[1]]})

  Col2    ID
0    2  1002
0    4  1002
2    1  1004
2    2  1004
3    5  1005
4    6  1006
5    1  1007
5    3  1007
7    3  1009
7    7  1009

argwhere gets the i, j indices of all non-zero elements in your DataFrame. Use the first column of indices to index into column ID, and the second column of indices to index into df.columns.

I transpose v before step 2 for cache efficiency, and less typing.

Method 5

Use:

df = (df.melt('ID', var_name='Col2')
       .query('value== 1')
       .sort_values(['ID', 'Col2'])
       .drop('value',1))

Alternative solution:

df = (df.set_index('ID')
        .mask(lambda x: x == 0)
        .stack()
        .reset_index()
        .drop(0,1))

print (df)
      ID Col2
8   1002    2
24  1002    4
2   1004    1
10  1004    2
35  1005    5
44  1006    6
5   1007    1
21  1007    3
23  1009    3
55  1009    7

Explanation:

  1. First reshape values by melt or set_index with unstack
  2. Filter only 1 by query or convert 0 to NaNs by mask
  3. sort_values for first solution
  4. create columns from MultiIndex by reset_index
  5. Last remove unnecessary columns by drop


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x