Running get_dummies on several DataFrame columns?

How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?

Answers:

Method 1

With pandas 0.19, you can do that in a single line:

pd.get_dummies(data=df, columns=['A', 'B'])

The columns argument specifies which columns to one-hot encode; the remaining columns are passed through unchanged.

>>> df
   A  B  C
0  a  c  1
1  b  c  2
2  a  b  3

>>> pd.get_dummies(data=df, columns=['A', 'B'])
   C  A_a  A_b  B_b  B_c
0  1  1.0  0.0  0.0  1.0
1  2  0.0  1.0  0.0  1.0
2  3  1.0  0.0  1.0  0.0
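The float dummies shown above reflect the pandas version used at the time; the default dtype of the dummy columns has changed across releases (pandas 2.0 returns booleans, for example). As a minimal sketch, assuming a pandas version recent enough to support the dtype parameter of get_dummies, you can request integer 0/1 columns explicitly:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["c", "c", "b"], "C": [1, 2, 3]})

# Encode only A and B; C is passed through unchanged.
# dtype=int asks for 0/1 integers instead of the version-dependent default.
encoded = pd.get_dummies(data=df, columns=["A", "B"], dtype=int)
print(encoded)
```

Pinning the dtype this way keeps downstream code from depending on which pandas release is installed.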

Method 2

Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series; see below for the workaround):

In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   ...:                    'C': [1, 2, 3]})

In [2]: df
Out[2]:
   A  B  C
0  a  c  1
1  b  c  2
2  a  b  3

In [3]: pd.get_dummies(df)
Out[3]:
   C  A_a  A_b  B_b  B_c
0  1    1    0    0    1
1  2    0    1    0    1
2  3    1    0    1    0
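Note that in the output above the numeric column C was left alone: when no columns argument is given, get_dummies only encodes string-like and categorical columns. The sketch below, which again assumes a pandas version that supports the dtype parameter, contrasts that default with explicitly listing a numeric column:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["c", "c", "b"], "C": [1, 2, 3]})

# Without `columns`, only the string columns A and B are encoded.
auto = pd.get_dummies(df, dtype=int)

# Listing a numeric column in `columns` encodes that column instead;
# A and B are then left as they are.
forced = pd.get_dummies(df, columns=["C"], dtype=int)

print(list(auto.columns))
print(list(forced.columns))
```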

Workaround for pandas < 0.15.0

You can do it for each column separately and then concat the results:

In [111]: df
Out[111]: 
   A  B
0  a  x
1  a  y
2  b  z
3  b  x
4  c  x
5  a  y
6  b  y
7  c  z

In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]: 
   A        B      
   a  b  c  x  y  z
0  1  0  0  1  0  0
1  1  0  0  0  1  0
2  0  1  0  0  0  1
3  0  1  0  1  0  0
4  0  0  1  1  0  0
5  1  0  0  0  1  0
6  0  1  0  0  1  0
7  0  0  1  0  0  1

If you don’t want MultiIndex columns, remove the keys=.. argument from the concat call.
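One caveat with dropping keys is that the resulting column names no longer say which source column they came from, and they can collide if two columns share a value. A possible variation, sketched here on a shortened version of the example data, is to pass prefix to each get_dummies call so the flat names stay unique:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "b"], "B": ["x", "y", "z"]})

# One get_dummies call per column; prefix=col keeps the source column
# in each dummy's name (A_a, A_b, B_x, ...).
parts = [pd.get_dummies(df[col], prefix=col) for col in df]
flat = pd.concat(parts, axis=1)
print(flat)
```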

Method 3

Somebody may have something more clever, but here are two approaches. Assume you have a DataFrame named df with columns ‘Name’ and ‘Year’ that you want dummies for.

First, simply iterating over the columns isn’t too bad:

In [93]: for column in ['Name', 'Year']:
    ...:     dummies = pd.get_dummies(df[column])
    ...:     df[dummies.columns] = dummies

Another idea would be to use the patsy package, which is designed to construct data matrices from R-style formulas.

In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")

Method 4

Unless I misunderstand the question, this is supported natively in get_dummies by passing the columns argument.

Method 5

The simple trick I am currently using is a for loop.
First, separate the categorical data from the DataFrame using select_dtypes(include="object").
Then loop over the columns and apply get_dummies to each one,
as shown in the code below:

# train_data and test_data are existing DataFrames with the same columns
train_cate = train_data.select_dtypes(include="object")
test_cate = test_data.select_dtypes(include="object")
# vectorize categorical data, one column at a time
for col in train_cate:
    cate1 = pd.get_dummies(train_cate[col])
    train_cate[cate1.columns] = cate1
    cate2 = pd.get_dummies(test_cate[col])
    test_cate[cate2.columns] = cate2
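Be aware that encoding the training and test sets independently can produce different dummy columns when a category appears in only one of them. One way to handle this, offered as a sketch and not as part of the original answer, is to reindex the encoded test frame to the training columns. The small frames and the cat_cols list below are hypothetical stand-ins for train_data, test_data, and the columns picked by select_dtypes:

```python
import pandas as pd

# Hypothetical stand-ins for train_data and test_data.
train_data = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})
test_data = pd.DataFrame({"color": ["green", "red"], "size": [4, 5]})

# Categorical columns, listed by hand here for clarity.
cat_cols = ["color"]

train_enc = pd.get_dummies(train_data, columns=cat_cols, dtype=int)
test_enc = pd.get_dummies(test_data, columns=cat_cols, dtype=int)

# Give the test frame exactly the training columns: categories unseen in
# training are dropped, categories missing from the test set become all-zero.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```

Whether dropping unseen test categories is acceptable depends on the model; the point here is only that both frames end up with the same columns in the same order.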


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
