pd.get_dummies converts a categorical variable into dummy variables. Apart from the fact that it’s trivial to reconstruct the categorical variable, is there a preferred/quick way to do it?
Answers:
Method 1
It’s been a few years, so this may well not have been in the pandas toolkit back when this question was originally asked, but this approach seems a little easier to me. idxmax will return the index corresponding to the largest element (i.e. the one with a 1). We do axis=1 because we want the column name where the 1 occurs.
EDIT: I didn’t bother making it categorical instead of just a string, but you can do that the same way as @Jeff did by wrapping it with pd.Categorical (and pd.Series, if desired).
In [1]: import pandas as pd
In [2]: s = pd.Series(['a', 'b', 'a', 'c'])
In [3]: s
Out[3]:
0    a
1    b
2    a
3    c
dtype: object
In [4]: dummies = pd.get_dummies(s)
In [5]: dummies
Out[5]:
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1
In [6]: s2 = dummies.idxmax(axis=1)
In [7]: s2
Out[7]:
0    a
1    b
2    a
3    c
dtype: object
In [8]: (s2 == s).all()
Out[8]: True
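For completeness, the pd.Categorical wrapping mentioned in the EDIT above might look roughly like this. It's only a sketch: passing dtype=int to get_dummies and reusing the dummy columns as the category list are my own choices here, not something the original session did.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])
# dtype=int is an assumption: it keeps the 0/1 output regardless of pandas version.
dummies = pd.get_dummies(s, dtype=int)

# idxmax gives plain labels; pd.Categorical turns them into a categorical,
# reusing the dummy columns as the full list of categories.
labels = dummies.idxmax(axis=1)
s_cat = pd.Series(
    pd.Categorical(labels, categories=dummies.columns),
    index=dummies.index,
)
```

Passing the columns as categories means a category that happens to have no rows still shows up in s_cat.cat.categories.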
EDIT in response to @piRSquared’s comment:
This solution does indeed assume there’s one 1 per row. I think this is usually the format one has. pd.get_dummies can return rows that are all 0 if you have drop_first=True or if there are NaN values and dummy_na=False (default) (any cases I’m missing?). A row of all zeros will be treated as if it was an instance of the variable named in the first column (e.g. a in the example above).
If drop_first=True, you have no way to know from the dummies dataframe alone what the name of the “first” variable was, so that operation isn’t invertible unless you keep extra information around; I’d recommend leaving drop_first=False (default).
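If you're stuck with drop_first=True anyway, here is a rough sketch of what "keeping extra information around" could mean. The variable names (dropped, has_one, recovered) are made up for illustration, and the code assumes get_dummies ordered the labels alphabetically so that the dropped one is the smallest.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])
dummies = pd.get_dummies(s, drop_first=True, dtype=int)  # columns: b, c

# Extra information that has to be stored separately: the dropped category.
# Assumption: get_dummies sorted the labels, so the first one is the minimum.
dropped = sorted(s.unique())[0]

# idxmax would report 'b' for all-zero rows, so mask those rows and
# fill in the dropped category instead.
has_one = dummies.sum(axis=1) > 0
recovered = dummies.idxmax(axis=1).where(has_one, dropped)
```

This only works when the data has no NaNs, since an all-zero row is then unambiguous.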
Since dummy_na=False is the default, this could certainly cause problems. Please set dummy_na=True when you call pd.get_dummies if you want to use this solution to invert the “dummification” and your data contains any NaNs. Setting dummy_na=True will always add a “nan” column, even if that column is all 0s, so you probably don’t want to set this unless you actually have NaNs. A nice approach might be to set dummies = pd.get_dummies(series, dummy_na=series.isnull().any()). What’s also nice is that the idxmax solution will correctly regenerate your NaNs (not just a string that says “nan”).
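A small round trip with a missing value, as a sketch of the dummy_na suggestion. The sample series and the dtype=int argument are my own additions.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', np.nan, 'b', 'a'])

# Add the NaN indicator column only when the data actually has missing values.
dummies = pd.get_dummies(s, dummy_na=s.isnull().any(), dtype=int)

# The label of the NaN column is itself NaN, so idxmax hands back a real NaN
# for the row that was missing in the input.
s2 = dummies.idxmax(axis=1)
```

Row 1 of s2 should be a genuine missing value, so s2.isna() lines up with s.isna().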
It’s also worth mentioning that setting drop_first=True and dummy_na=False means that NaNs become indistinguishable from an instance of the first variable, so this should be strongly discouraged if your dataset may contain any NaN values.
Method 2
In [46]: s = pd.Series(list('aaabbbccddefgh')).astype('category')
In [47]: s
Out[47]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
In [48]: df = pd.get_dummies(s)
In [49]: df
Out[49]:
a b c d e f g h
0 1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0
5 0 1 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0
7 0 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0 0
9 0 0 0 1 0 0 0 0
10 0 0 0 0 1 0 0 0
11 0 0 0 0 0 1 0 0
12 0 0 0 0 0 0 1 0
13 0 0 0 0 0 0 0 1
In [50]: x = df.stack()
# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
So I think we need a function to do this, as it seems to be a natural operation. Maybe get_categories().
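For what it's worth, newer pandas releases (1.5 and later, as far as I know) ship a pd.from_dummies function that does this. A hedged sketch follows; the hasattr guard and the idxmax fallback are my own additions so the snippet also runs on older installs.

```python
import pandas as pd

s = pd.Series(list('aabbc'))
dummies = pd.get_dummies(s, dtype=int)

if hasattr(pd, "from_dummies"):
    # from_dummies returns a one-column DataFrame; take that column as a Series.
    recovered = pd.from_dummies(dummies).iloc[:, 0]
else:
    # Fallback for older pandas: the idxmax approach from Method 1.
    recovered = dummies.idxmax(axis=1)
```

As far as I can tell, from_dummies complains about rows without any 1 unless you pass default_category, so it is stricter than idxmax about malformed input.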
Method 3
This is quite a late answer, but since you ask for a quick way to do it, I assume you’re looking for the most performant strategy. On a large dataframe (for instance 10000 rows), you can get a very significant speed boost by using np.where instead of idxmax or get_level_values, and obtain the same result. The idea is to index the column names where the dummy dataframe is not 0:
Method:
Using the same sample data as @Nathan:
>>> dummies
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1
>>> s2 = pd.Series(dummies.columns[np.where(dummies!=0)[1]])
>>> s2
0    a
1    b
2    a
3    c
dtype: object
Benchmark:
On a small dummy dataframe, you won’t see much difference in performance. However, testing different strategies to solving this problem on a large series:
import numpy as np
import pandas as pd

s = pd.Series(np.random.choice(['a','b','c'], 10000))
dummies = pd.get_dummies(s)

def np_method(dummies=dummies):
    return pd.Series(dummies.columns[np.where(dummies!=0)[1]])

def idx_max_method(dummies=dummies):
    return dummies.idxmax(axis=1)

def get_level_values_method(dummies=dummies):
    x = dummies.stack()
    return pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))

def dot_method(dummies=dummies):
    return dummies.dot(dummies.columns)
import timeit
# Time each method, 1000 iterations each:
>>> timeit.timeit(np_method, number=1000)
1.0491090340074152
>>> timeit.timeit(idx_max_method, number=1000)
12.119140846014488
>>> timeit.timeit(get_level_values_method, number=1000)
4.109266621991992
>>> timeit.timeit(dot_method, number=1000)
1.6741622970002936
The np.where method is about 4 times faster than the get_level_values method and about 11.5 times faster than the idxmax method! It also beats the .dot() method outlined in an answer to a similar question, but only by a factor of about 1.6.
They all return the same result:
>>> (get_level_values_method() == np_method()).all()
True
>>> (idx_max_method() == np_method()).all()
True
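One caveat worth sketching: the results only agree when every row has exactly one 1. With an all-zero row (for example a NaN with the default dummy_na=False), np.where simply skips that row. The sample data and the aligned variant below are my own illustration, not part of the benchmark.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', np.nan, 'b', 'c'])
# Row 1 is all zeros because dummy_na defaults to False.
dummies = pd.get_dummies(s, dtype=int)

# np.where only reports nonzero cells, so the all-zero row vanishes
# and the result is one element shorter than the input.
rows, cols = np.where(dummies.to_numpy() != 0)
np_result = pd.Series(dummies.columns[cols])

# Reusing the row positions keeps each label attached to its original row;
# reindexing then restores the skipped row as NaN.
aligned = pd.Series(
    dummies.columns[cols], index=dummies.index[rows]
).reindex(dummies.index)
```

The extra reindex costs a little time, so it is only worth it if all-zero rows can actually occur.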
Method 4
Setup
Using @Jeff’s setup
s = pd.Series(list('aaabbbccddefgh')).astype('category')
df = pd.get_dummies(s)
If columns are strings
and there is only one 1 per row
df.dot(df.columns)
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: object
numpy.where
Again! Assuming only one 1 per row
i, j = np.where(df)
pd.Series(df.columns[j], i)
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: category
Categories (8, object): [a, b, c, d, e, f, g, h]
numpy.where
Not assuming one 1 per row
i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j])))
0   0    a
1   0    a
2   0    a
3   1    b
4   1    b
5   1    b
6   2    c
7   2    c
8   3    d
9   3    d
10  4    e
11  5    f
12  6    g
13  7    h
dtype: object
numpy.where
Where we don’t assume one 1 per row and we drop the index
i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j]))).reset_index(-1, drop=True)
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: object
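The sample data above still has exactly one 1 per row, so here is a small sketch with a row that really carries two 1s. The frame multi and the groupby step are my own illustration of one way to collect the labels per row, not part of the answer's code.

```python
import numpy as np
import pandas as pd

# Hypothetical multi-hot frame: row 1 belongs to both 'a' and 'c'.
multi = pd.DataFrame({'a': [1, 1, 0], 'b': [0, 0, 1], 'c': [0, 1, 0]})

i, j = np.where(multi.to_numpy())
# One entry per nonzero cell, labelled by the row it came from.
long_form = pd.Series(multi.columns[j], index=multi.index[i])
# Collect the labels of each row into a list.
per_row = long_form.groupby(level=0).agg(list)
```

Row 1 shows up twice in long_form and ends up as ['a', 'c'] in per_row.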
Method 5
Converting dat["classification"] to a one-hot encoding and back:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# dat is the DataFrame holding the "classification" column
le = LabelEncoder()
dat["labels"] = le.fit_transform(dat["classification"])
Y = pd.get_dummies(dat["labels"])

# The position of the 1 in each row is the integer label
tru = []
for i in range(len(Y)):
    tru.append(np.argmax(Y.iloc[i].to_numpy()))
tru = le.inverse_transform(tru)

# Identical check!
(tru == dat["classification"]).value_counts()
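The same round trip can be sketched without scikit-learn and without the Python loop. Note that this swaps LabelEncoder for pd.factorize, and the tiny dat frame is a made-up stand-in, so treat it as an illustration of the idea rather than the answer's exact method.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dat frame used above.
dat = pd.DataFrame({"classification": ["cat", "dog", "cat", "bird"]})

# pd.factorize plays the role of LabelEncoder here: integer codes plus the
# original labels (sort=True mirrors LabelEncoder's sorted classes).
codes, classes = pd.factorize(dat["classification"], sort=True)
Y = pd.get_dummies(codes, dtype=int)

# argmax along each row gives the position of the 1, i.e. the integer code.
recovered = classes[Y.to_numpy().argmax(axis=1)]
```

Taking argmax over the whole array at once avoids calling np.argmax once per row.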
Method 6
If you’re categorizing the rows in your dataframe based on some row-wise mutually exclusive boolean conditions (these are the “dummy” variables) which don’t form a partition (i.e. some rows are all 0 because of, for example, some missing data), it may be better to initialize a pd.Categorical filled with np.nan and then explicitly set the category of each subset. An example follows.
0. Data setup:
import numpy as np
import pandas as pd

np.random.seed(42)
student_names = list('abcdefghi')
marks = np.random.randint(0, 100, len(student_names)).astype(float)
passes = marks >= 50
marks[[1, 5]] = np.nan # artificially introduce NAs
students = pd.DataFrame({'mark': marks, 'pass': passes}, index=student_names)
>>> students
mark pass
a 51.0 True
b NaN True
c 14.0 False
d 71.0 True
e 60.0 True
f NaN False
g 82.0 True
h 86.0 True
i 74.0 True
1. Compute the value of the relevant boolean conditions:
failed = ~students['pass']
barely_passed = students['pass'] & (students['mark'] < 60)
well_passed = students['pass'] & (students['mark'] >= 60)
>>> pd.DataFrame({'f': failed, 'b': barely_passed, 'p': well_passed}).astype(int)
b f p
a 1 0 0
b 0 0 0
c 0 1 0
d 0 0 1
e 0 0 1
f 0 1 0
g 0 0 1
h 0 0 1
i 0 0 1
As you can see row b has False for all three categories (since the mark is NaN and pass is True).
2. Generate the categorical series:
cat = pd.Series(
    pd.Categorical([np.nan] * len(students), categories=["failed", "barely passed", "well passed"]),
    index=students.index
)
cat[failed] = "failed"
cat[barely_passed] = "barely passed"
cat[well_passed] = "well passed"
>>> cat
a barely passed
b NaN
c failed
d well passed
e well passed
f failed
g well passed
h well passed
i well passed
As you can see, a NaN was kept where none of the categories applied.
This approach is as performant as using np.where but allows for the flexibility of possible NaNs.
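If the conditions are already sitting in a 0/1 dataframe, a shorter variant of the same idea might be to mask idxmax wherever a row has no 1 at all. This is only a sketch under that assumption; the flags frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical indicator frame; row 'b' matches none of the conditions.
flags = pd.DataFrame(
    {'failed': [0, 0, 1], 'barely passed': [1, 0, 0], 'well passed': [0, 0, 0]},
    index=['a', 'b', 'c'],
)

# idxmax picks the first column for an all-zero row, so blank those rows out.
labels = flags.idxmax(axis=1).where(flags.any(axis=1))
# Keep every column as a category, even one that no row currently uses.
cat = labels.astype(pd.CategoricalDtype(categories=list(flags.columns)))
```

Row 'b' comes out as NaN, and 'well passed' remains a valid category although no row uses it.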
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0.