How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
Answers:
Method 1
With pandas 0.19, you can do that in a single line:
pd.get_dummies(data=df, columns=['A', 'B'])
The columns argument specifies which columns to one-hot encode.
>>> df
   A  B  C
0  a  c  1
1  b  c  2
2  a  b  3

>>> pd.get_dummies(data=df, columns=['A', 'B'])
   C  A_a  A_b  B_b  B_c
0  1  1.0  0.0  0.0  1.0
1  2  0.0  1.0  0.0  1.0
2  3  1.0  0.0  1.0  0.0
Method 2
Since pandas 0.15.0, pd.get_dummies can handle a DataFrame directly (earlier versions accepted only a single Series; a workaround for those is given below):
In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   ...:                    'C': [1, 2, 3]})

In [2]: df
Out[2]:
   A  B  C
0  a  c  1
1  b  c  2
2  a  b  3

In [3]: pd.get_dummies(df)
Out[3]:
   C  A_a  A_b  B_b  B_c
0  1    1    0    0    1
1  2    0    1    0    1
2  3    1    0    1    0
Workaround for pandas < 0.15.0
You can apply it to each column separately and then concat the results:
In [111]: df
Out[111]:
   A  B
0  a  x
1  a  y
2  b  z
3  b  x
4  c  x
5  a  y
6  b  y
7  c  z

In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
   A        B
   a  b  c  x  y  z
0  1  0  0  1  0  0
1  1  0  0  0  1  0
2  0  1  0  0  0  1
3  0  1  0  1  0  0
4  0  0  1  1  0  0
5  1  0  0  0  1  0
6  0  1  0  0  1  0
7  0  0  1  0  0  1
If you don’t want the multi-index columns, remove keys=.. from the concat call.
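As a minimal sketch of the two variants described above (the small frame and the names `parts`, `nested`, and `flat` are made up for illustration and are not from the original answer):

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"A": ["a", "a", "b"], "B": ["x", "y", "z"]})

# One dummy frame per source column.
parts = [pd.get_dummies(df[col]) for col in df]

# With keys=: two-level (MultiIndex) columns such as ('A', 'a').
nested = pd.concat(parts, axis=1, keys=df.columns)

# Without keys=: flat columns named after the category values only.
flat = pd.concat(parts, axis=1)

print(nested.columns.tolist())
print(flat.columns.tolist())
```

Note that the flat variant names columns by category value alone, so two source columns that share a value would yield duplicate column names.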
Method 3
Somebody may have something more clever, but here are two approaches, assuming you have a DataFrame named df with columns ‘Name’ and ‘Year’ that you want dummies for.
First, simply iterating over the columns isn’t too bad:
In [93]: for column in ['Name', 'Year']:
    ...:     dummies = pd.get_dummies(df[column])
    ...:     df[dummies.columns] = dummies
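A possible variation on the loop, sketched with made-up data: passing prefix= keeps each dummy column traceable to its source column (otherwise the Year dummies would simply be named after the raw values), and the original column is dropped once it has been encoded. The drop-and-join step is an assumption about what is usually wanted, not part of the original answer.

```python
import pandas as pd

# Hypothetical frame using the 'Name' and 'Year' column names from the text.
df = pd.DataFrame({"Name": ["ann", "bob", "ann"], "Year": [2001, 2002, 2001]})

for column in ["Name", "Year"]:
    # prefix= yields names like 'Name_ann' and 'Year_2001'.
    dummies = pd.get_dummies(df[column], prefix=column)
    # Replace the source column with its dummy columns.
    df = df.drop(columns=column).join(dummies)

print(df.columns.tolist())
```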
Another idea would be to use the patsy package, which is designed to construct data matrices from R-type formulas.
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
Method 4
Unless I misunderstand the question, this is supported natively in get_dummies by passing the columns argument.
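For illustration, a short sketch of what the columns argument changes, using hypothetical data (the 'Score' column and all values are invented here). Without it, get_dummies encodes only the non-numeric columns; with it, every listed column is encoded, including numeric ones such as Year.

```python
import pandas as pd

# Hypothetical data; 'Name' and 'Year' echo the column names used in Method 3.
df = pd.DataFrame({
    "Name": ["ann", "bob", "ann"],
    "Year": [2001, 2002, 2001],
    "Score": [1.5, 2.0, 3.5],
})

# Default: only the string column is encoded, Year stays numeric.
default = pd.get_dummies(df)

# Explicit: every column listed in `columns` is encoded.
explicit = pd.get_dummies(df, columns=["Name", "Year"])

print(default.columns.tolist())
print(explicit.columns.tolist())
```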
Method 5
The simple trick I am currently using is a for loop. First, separate the categorical columns from the DataFrame with select_dtypes(include="object"); then apply get_dummies to each column in turn, as shown in the code below:
train_cate = train_data.select_dtypes(include="object")
test_cate = test_data.select_dtypes(include="object")

# vectorize categorical data
for col in train_cate:
    cate1 = pd.get_dummies(train_cate[col])
    train_cate[cate1.columns] = cate1
    cate2 = pd.get_dummies(test_cate[col])
    test_cate[cate2.columns] = cate2
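A loop-free variant might look like the sketch below. The frames are hypothetical stand-ins for train_data and test_data. Two things here are assumptions rather than part of the answer above: exclude="number" is used to pick the non-numeric columns regardless of how strings are stored, and the final reindex step is added because the two frames may not contain the same categories.

```python
import pandas as pd

# Hypothetical stand-ins for train_data and test_data.
train_data = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": ["S", "M", "L"],
    "price": [1.0, 2.0, 3.0],
})
test_data = pd.DataFrame({
    "color": ["blue", "green"],
    "size": ["S", "S"],
    "price": [4.0, 5.0],
})

# Names of the non-numeric columns (assumption: these are the categorical ones).
cat_cols = train_data.select_dtypes(exclude="number").columns.tolist()

# Encode all categorical columns in one call; dtype=int gives 0/1 columns.
train_enc = pd.get_dummies(train_data, columns=cat_cols, dtype=int)
test_enc = pd.get_dummies(test_data, columns=cat_cols, dtype=int)

# Give the test frame exactly the train frame's columns:
# dummies missing from test are filled with 0, unseen ones are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
```

Whether dropping categories that appear only in the test data is acceptable depends on the model being trained, so treat the alignment step as one option rather than a rule.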
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.