How to translate “bytes” objects into literal strings in pandas Dataframe, Python3.x?

I have a Python3.x pandas DataFrame whereby certain columns are strings which as expressed as bytes (like in Python2.x)

import pandas as pd
df = pd.DataFrame(...)
df
       COLUMN1         ....
0      b'abcde'        ....
1      b'dog'          ....
2      b'cat1'         ....
3      b'bird1'        ....
4      b'elephant1'    ....

When I access by column with df.COLUMN1, I see Name: COLUMN1, dtype: object

However, if I access by element, it is a “bytes” object

df.COLUMN1.ix[0].dtype
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'dtype'

How do I convert these into “regular” strings? That is, how can I get rid of this b'' prefix?

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

You can use vectorised str.decode to decode byte strings into ordinary strings:

df['COLUMN1'].str.decode("utf-8")

To do this for multiple columns you can select just the str columns:

str_df = df.select_dtypes([np.object])

convert all of them:

str_df = str_df.stack().str.decode('utf-8').unstack()

You can then swap out converted cols with the original df cols:

for col in str_df:
    df[col] = str_df[col]

Method 2

I add issue with some columns being either full of str or mixed of str and bytes in a dataframe. Solved with a minor modification of the solution provided by @Christabella Irwanto: (i’m more of fan of the str.decode('utf-8') as suggested by @Mad Physicist)

for col, dtype in df.dtypes.items():
        if dtype == np.object:  # Only process object columns.
            # decode, or return original value if decode return Nan
            df[col] = df[col].str.decode('utf-8').fillna(df[col]) 


>>> df[col]
0        Element
1     b'Element'
2         b'165'
3            165
4             25
5             25

>>> df[col].str.decode('utf-8').fillna(df[col])
0     Element
1     Element
2         165
3         165
4          25
5          25
6          25

Method 3

Combining the answers by @EdChum and @Yu Zhou, a simpler solution would be:

for col, dtype in df.dtypes.items():
    if dtype == np.object:  # Only process byte object columns.
        df[col] = df[col].apply(lambda x: x.decode("utf-8"))

Method 4

I came across this thread while trying to solve the same problem but more generally for a Series where some values my be of type str, others of type bytes. Drawing from earlier solutions, I achieved this selective decoding as follows, resulting in a Series all of whose values are of type str. (python 3.6.9, pandas 1.0.5)

>>> import pandas as pd
>>> ser = pd.Series(["value_1".encode("utf-8"), "value_2"])
>>> ser.values
array([b'value_1', 'value_2'], dtype=object)
>>> ser2 = ser.str.decode("utf-8")
>>> ser[~ser2.isna()] = ser2
>>> ser.values
array(['value_1', 'value_2'], dtype=object)

Maybe there exists a more convenient/efficient one-liner for this use case? At first I figured there would be some value to pass in the “errors” kwarg to str.decode but I didn’t find one documented.

EDIT: One can definitely achieve the same in one line, but the ways I have thought to so do so take about 25% (tested for Series of length 10^4 and 10^6), but presumably does no copying. E.g.:

ser[ser.apply(type) == bytes] = ser.str.decode("utf-8")

Method 5

df['COLUMN1'].apply(lambda x: x.decode("utf-8"))


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x