Column names are: ID,1,2,3,4,5,6,7,8,9.
The col values are either 0 or 1
My dataframe looks like this:
ID 1 2 3 4 5 6 7 8 9 1002 0 1 0 1 0 0 0 0 0 1003 0 0 0 0 0 0 0 0 0 1004 1 1 0 0 0 0 0 0 0 1005 0 0 0 0 1 0 0 0 0 1006 0 0 0 0 0 1 0 0 0 1007 1 0 1 0 0 0 0 0 0 1000 0 0 0 0 0 0 0 0 0 1009 0 0 1 0 0 0 1 0 0
I want the column names in front of the ID where the value in a row is 1.
The Dataframe i want should look like this:
ID Col2 1002 2 // has 1 at Col(2) and Col(4) 1002 4 1004 1 // has 1 at col(1) and col(2) 1004 2 1005 5 // has 1 at col(5) 1006 6 // has 1 at col(6) 1007 1 // has 1 at col(1) and col(3) 1007 3 1009 3 // has 1 at col(3) and col(7) 1009 7
Please help me in this, Thanks in advance
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Pretty one-liner 🙂
new_df = df.idxmax(axis=1)
Method 2
Several great answers for the OP post. However, often get_dummies is used for multiple categorical features. Pandas uses a prefix separator prefix_sep to distinguish different values for a column.
The following function collapses a “dummified” dataframe while keeping the order of columns:
def undummify(df, prefix_sep="_"):
cols2collapse = {
item.split(prefix_sep)[0]: (prefix_sep in item) for item in df.columns
}
series_list = []
for col, needs_to_collapse in cols2collapse.items():
if needs_to_collapse:
undummified = (
df.filter(like=col)
.idxmax(axis=1)
.apply(lambda x: x.split(prefix_sep, maxsplit=1)[1])
.rename(col)
)
series_list.append(undummified)
else:
series_list.append(df[col])
undummified_df = pd.concat(series_list, axis=1)
return undummified_df
Example
>>> df
a b c
0 A_1 B_1 C_1
1 A_2 B_2 C_2
>>> df2 = pd.get_dummies(df)
>>> df2
a_A_1 a_A_2 b_B_1 b_B_2 c_C_1 c_C_2
0 1 0 1 0 1 0
1 0 1 0 1 0 1
>>> df3 = undummify(df2)
>>> df3
a b c
0 A_1 B_1 C_1
1 A_2 B_2 C_2
Method 3
set_index + stack, stack will dropna by default
df.set_index('ID',inplace=True)
df[df==1].stack().reset_index().drop(0,1)
Out[363]:
ID level_1
0 1002 2
1 1002 4
2 1004 1
3 1004 2
4 1005 5
5 1006 6
6 1007 1
7 1007 3
8 1009 3
9 1009 7
Method 4
np.argwhere
v = np.argwhere(df.drop('ID', 1).values).T
pd.DataFrame({'ID' : df.loc[v[0], 'ID'], 'Col2' : df.columns[1:][v[1]]})
Col2 ID
0 2 1002
0 4 1002
2 1 1004
2 2 1004
3 5 1005
4 6 1006
5 1 1007
5 3 1007
7 3 1009
7 7 1009
argwhere gets the i, j indices of all non-zero elements in your DataFrame. Use the first column of indices to index into column ID, and the second column of indices to index into df.columns.
I transpose v before step 2 for cache efficiency, and less typing.
Method 5
Use:
df = (df.melt('ID', var_name='Col2')
.query('value== 1')
.sort_values(['ID', 'Col2'])
.drop('value',1))
Alternative solution:
df = (df.set_index('ID')
.mask(lambda x: x == 0)
.stack()
.reset_index()
.drop(0,1))
print (df)
ID Col2
8 1002 2
24 1002 4
2 1004 1
10 1004 2
35 1005 5
44 1006 6
5 1007 1
21 1007 3
23 1009 3
55 1009 7
Explanation:
-
First reshape values by
meltorset_indexwithunstack -
Filter only
1byqueryor convert0toNaNs bymask -
sort_valuesfor first solution -
create columns from
MultiIndexbyreset_index -
Last remove unnecessary columns by
drop
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0