How do I get rid of the b-prefix in a string in python?

A bunch of the tweets I am importing are having this issue where they read

b'I posted a new photo to Facebook'

I gather the b indicates it is a bytes object. But this is proving problematic because in the CSV files that I end up writing, the b doesn’t go away and is interfering with later code.

Is there a simple way to remove this b prefix from my lines of text?

Keep in mind, I seem to need to have the text encoded in utf-8 or tweepy has trouble pulling them from the web.


Here’s the link content I’m analyzing:

https://www.dropbox.com/s/sjmsbuhrghj7abt/new_tweets.txt?dl=0

new_tweets = 'content in the link'

Code Attempt

outtweets = [[tweet.text.encode("utf-8").decode("utf-8")] for tweet in new_tweets]
print(outtweets)

Error

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-21-6019064596bf> in <module>()
      1 for screen_name in user_list:
----> 2     get_all_tweets(screen_name,"instance file")

<ipython-input-19-e473b4771186> in get_all_tweets(screen_name, mode)
     99             with open(os.path.join(save_location,'%s.instance' % screen_name), 'w') as f:
    100                 writer = csv.writer(f)
--> 101                 writer.writerows(outtweets)
    102         else:
    103             with open(os.path.join(save_location,'%s.csv' % screen_name), 'w') as f:

C:\Users\Stan Shunpike\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20 
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode characters in position 64-65: character maps to <undefined>

Answers:


Method 1

Decode the bytes to produce a str:

b = b'1234'
print(b.decode('utf-8'))  # '1234'
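Applied to the tweets in the question, the same call can be wrapped in a small helper. The following is only a sketch: `to_text` and the sample list `raw_tweets` are hypothetical names, not part of tweepy, and it assumes the bytes are UTF-8 encoded.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is a bytes object."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value


# Hypothetical sample data: a mix of bytes and str, as might come from earlier code.
raw_tweets = [b"I posted a new photo to Facebook", "Already a str"]

# One single-column row per tweet, matching the shape used in the question.
outtweets = [[to_text(tweet)] for tweet in raw_tweets]
print(outtweets)
```

Values that are already str pass through unchanged, so the helper can be applied to every item without checking its type first.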

Method 2

The object you are printing is not a string but a bytes object, and Python displays it as a bytes literal.

Consider creating a bytes object by typing a bytes literal (that is, defining it directly in source code with the b'' syntax) and converting it into a string by decoding it as UTF-8. (Note that converting here means decoding.)

byte_object= b"test" # byte object by literally typing characters
print(byte_object) # Prints b'test'
print(byte_object.decode('utf8')) # Prints "test" without quotations

We simply applied the .decode('utf8') method.
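One way to see why the prefix shows up in a CSV file: in Python 3, `csv.writer` stringifies non-string data with `str()`, and calling `str()` on a bytes object gives its printed form, prefix included. A small sketch using in-memory buffers (the variable names are illustrative):

```python
import csv
import io

byte_object = b"test"

# Writing bytes: the writer falls back to str(), which keeps the b'' wrapper.
buffer_bytes = io.StringIO()
csv.writer(buffer_bytes).writerow([byte_object])
bytes_row = buffer_bytes.getvalue()

# Writing the decoded str: only the text itself is written.
buffer_text = io.StringIO()
csv.writer(buffer_text).writerow([byte_object.decode("utf8")])
text_row = buffer_text.getvalue()

print(repr(bytes_row))
print(repr(text_row))
```

Decoding before the value reaches the writer is therefore what keeps the prefix out of the file.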


String literals are described by the following lexical definitions:

https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals

stringliteral   ::=  [stringprefix](shortstring | longstring)
stringprefix    ::=  "r" | "u" | "R" | "U"
shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::=  "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::=  shortstringchar | stringescapeseq
longstringitem  ::=  longstringchar | stringescapeseq
shortstringchar ::=  <any source character except "\" or newline or the quote>
longstringchar  ::=  <any source character except "\">
stringescapeseq ::=  "\" <any source character>

bytesliteral   ::=  bytesprefix(shortbytes | longbytes)
bytesprefix    ::=  "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"
shortbytes     ::=  "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
longbytes      ::=  "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
shortbytesitem ::=  shortbyteschar | bytesescapeseq
longbytesitem  ::=  longbyteschar | bytesescapeseq
shortbyteschar ::=  <any ASCII character except "\" or newline or the quote>
longbyteschar  ::=  <any ASCII character except "\">
bytesescapeseq ::=  "\" <any ASCII character>

Method 3

You need to decode it to convert it to a string. See also the answers on bytes literals in Python 3.

b'I posted a new photo to Facebook'.decode('utf-8')
# 'I posted a new photo to Facebook'

Method 4

To remove the b' ' wrapper from bytes returned by a decoding function such as base64.b64decode, decode the bytes into a string:

import base64
a='cm9vdA=='
b=base64.b64decode(a).decode('utf-8')
print(b)

Method 5

On Python 3.6 with Django 2.0, decoding a bytes value did not work as I expected.
I get the right result when I print the decoded value, but b'value' still appears in the generated link.

This is what I’m encoding

'uid': urlsafe_base64_encode(force_bytes(user.pk)),

This is what I’m decoding:

uid = force_text(urlsafe_base64_decode(uidb64))

This is what the Django 2.0 documentation says:

urlsafe_base64_encode(s)

Encodes a bytestring in base64 for use in URLs, stripping any trailing equal signs.

urlsafe_base64_decode(s)

Decodes a base64 encoded string, adding back any trailing equal signs that might have been stripped.


This is my account_activation_email_test.html file

{% autoescape off %}
Hi {{ user.username }},

Please click on the link below to confirm your registration:

http://{{ domain }}{% url 'accounts:activate' uidb64=uid token=token %}
{% endautoescape %}

This is my console response:

Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: Activate Your MySite Account
From: [email protected]
To: [email protected]
Date: Fri, 20 Apr 2018 06:26:46 -0000
Message-ID: <[email protected]>

Hi testuser,

Please click on the link below to confirm your registration:

http://127.0.0.1:8000/activate/b'MjU'/4vi-fasdtRf2db2989413ba/

As you can see, uid = b'MjU'

Expected: uid = MjU


Test in console:

$ python
Python 3.6.4 (default, Apr  7 2018, 00:45:33) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from django.utils.http import urlsafe_base64_encode, urlsafe_base64_decode
>>> from django.utils.encoding import force_bytes, force_text
>>> var1=urlsafe_base64_encode(force_bytes(3))
>>> print(var1)
b'Mw'
>>> print(var1.decode())
Mw
>>>

After investigating, it seems to be related to Python 3.
My workaround was quite simple:

'uid': user.pk,

I receive it as uidb64 in my activate function:

user = User.objects.get(pk=uidb64)

and voila:

Content-Transfer-Encoding: 7bit
Subject: Activate Your MySite Account
From: [email protected]
To: [email protected]
Date: Fri, 20 Apr 2018 20:44:46 -0000
Message-ID: <[email protected]>


Hi testuser,

Please click on the link below to confirm your registration:

http://127.0.0.1:8000/activate/45/4vi-3895fbb6b74016ad1882/

Now it works fine.
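The console test above suggests a less drastic fix: decode the encoded uid before it reaches the template. The sketch below reproduces the effect with the standard library only. `encode_uid` and `decode_uid` are hypothetical stand-ins for illustration, not Django's own functions, and the padding handling is an assumption modeled on the documented behavior quoted above.

```python
import base64


def encode_uid(pk):
    """URL-safe base64 of the primary key, without padding, returned as str."""
    raw = str(pk).encode("utf-8")
    encoded = base64.urlsafe_b64encode(raw).rstrip(b"=")
    return encoded.decode("ascii")  # decoding here removes the b'' wrapper


def decode_uid(uidb64):
    """Reverse encode_uid, adding back the stripped padding."""
    padded = uidb64 + "=" * (-len(uidb64) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


# Formatting a bytes object into text embeds its printed form, prefix included.
bytes_uid = base64.urlsafe_b64encode(b"25").rstrip(b"=")
print("/activate/{}/".format(bytes_uid))

# Formatting the decoded str embeds only the characters themselves.
print("/activate/{}/".format(encode_uid(25)))
```

With this approach the primary key stays encoded in the URL, unlike the workaround of passing user.pk directly.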

Method 6

Assuming you don’t want to immediately decode it again like others are suggesting here, you can convert it to a string with str() and then just strip the leading b' and the trailing '.

x = "Hi there 😄"
x = x.encode("utf-8")
x  # b'Hi there \xf0\x9f\x98\x84'
str(x)[2:-1]
# 'Hi there \\xf0\\x9f\\x98\\x84' (the escape sequences are now literal text, not the emoji)

Method 7

I got it working by encoding only the output as UTF-8.
Here is a code example:

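import csv

new_tweets = api.GetUserTimeline(screen_name=user, count=200)
result = new_tweets[0]
try:
    text = result.text
except AttributeError:
    text = ''

# newline='' lets the csv module control line endings
with open(file_name, 'a', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([text])  # one row with a single text column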

That is, do not encode when collecting data from the API; encode only the output (print or write).
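A self-contained sketch of the same idea, using a temporary file and made-up sample text instead of the Twitter API. Passing `encoding="utf-8"` to `open` avoids depending on the platform default (cp1252 in the traceback in the question), which cannot represent every character.

```python
import csv
import os
import tempfile

# Made-up sample rows; the emoji has no cp1252 representation.
outtweets = [["I posted a new photo to Facebook"], ["Hi there \U0001F604"]]

path = os.path.join(tempfile.mkdtemp(), "tweets.csv")

# Keep the text as str and let open() encode it while writing.
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(outtweets)

# Read the file back with the same encoding to check the round trip.
with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))

print(rows == outtweets)
```

Any program that reads the file later needs to open it with the same encoding.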

Method 8

In addition to @hiro protagonist’s answer, you can convert bytes to a string by passing the character set to str:

b = b'1234'
str(b,'utf-8') # '1234'
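`str(b, 'utf-8')` also accepts an `errors` argument (as does `bytes.decode`), which may help when some input is not valid UTF-8. A short sketch with made-up byte values:

```python
good = b"caf\xc3\xa9"  # valid UTF-8 for "café"
bad = b"caf\xe9"       # the same word in Latin-1, which is invalid as UTF-8

text_good = str(good, "utf-8")

# errors="replace" substitutes U+FFFD instead of raising UnicodeDecodeError
text_bad = str(bad, "utf-8", errors="replace")

# ascii() keeps the output printable on consoles with a limited encoding
print(ascii(text_good))
print(ascii(text_bad))
```

Replacing invalid bytes loses information, so this is a fallback for messy input rather than a substitute for knowing the real encoding.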

Method 9

Although the question is very old, I think this may help anyone facing the same problem. Here the text is a string like the one below:

text= "b'I posted a new photo to Facebook'"

Thus you cannot remove the b by decoding it, because it’s a str, not a bytes object. I did the following to remove it.

cleaned_text = text.split("b'", 1)[1][:-1]  # [:-1] drops the trailing quote

which will give "I posted a new photo to Facebook"
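If the string holds the full printed form of a bytes object, another option is to parse it back into bytes with `ast.literal_eval` and then decode it. This is a sketch that assumes the text is a valid Python bytes literal whose content is UTF-8; it raises an error otherwise.

```python
import ast

text = "b'I posted a new photo to Facebook'"

# literal_eval evaluates the literal without running arbitrary code
# and returns a bytes object
as_bytes = ast.literal_eval(text)
cleaned_text = as_bytes.decode("utf-8")

print(cleaned_text)
```

Unlike slicing off the wrapper, this also turns escape sequences such as \xc3\xa9 back into the characters they stand for.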


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
