I am extracting emails from Gmail using the following:
def getMsgs():
try:
conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
except:
print 'Failed to connect'
print 'Is your internet connection working?'
sys.exit()
try:
conn.login(username, password)
except:
print 'Failed to login'
print 'Is the username and password correct?'
sys.exit()
conn.select('Inbox')
# typ, data = conn.search(None, '(UNSEEN SUBJECT "%s")' % subject)
typ, data = conn.search(None, '(SUBJECT "%s")' % subject)
for num in data[0].split():
typ, data = conn.fetch(num, '(RFC822)')
msg = email.message_from_string(data[0][1])
yield walkMsg(msg)
def walkMsg(msg):
for part in msg.walk():
if part.get_content_type() != "text/plain":
continue
return part.get_payload()
However, some emails I get are nigh impossible for me to extract dates (using regex) from as encoding-related chars such as ‘=’, randomly land in the middle of various text fields. Here’s an example where it occurs in a date range I want to extract:
Name: KIRSTI Email:
[email protected] Phone #: + 999
99995192 Total in party: 4 total, 0
children Arrival/Departure: Oct 9=
,
2010 – Oct 13, 2010 – Oct 13, 2010
Is there a way to remove these encoding characters?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
You could/should use the email.parser module to decode mail messages, for example (quick and dirty example!):
from email.parser import FeedParser
f = FeedParser()
f.feed("<insert mail message here, including all headers>")
rootMessage = f.close()
# Now you can access the message and its submessages (if it's multipart)
print rootMessage.is_multipart()
# Or check for errors
print rootMessage.defects
# If it's a multipart message, you can get the first submessage and then its payload
# (i.e. content) like so:
rootMessage.get_payload(0).get_payload(decode=True)
Using the “decode” parameter of Message.get_payload, the module automatically decodes the content, depending on its encoding (e.g. quoted printables as in your question).
Method 2
If you are using Python3.6 or later, you can use the email.message.Message.get_content() method to decode the text automatically. This method supersedes get_payload(), though get_payload() is still available.
Say you have a string s containing this email message (based on the examples in the docs):
Subject: Ayons asperges pour le =?utf-8?q?d=C3=A9jeuner?=
From: =?utf-8?q?Pep=C3=A9?= Le Pew <<a href="https://getridbug.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="6e1e0b1e0b2e0b160f031e020b400d0103">[email protected]</a>>
To: Penelope Pussycat <<a href="https://getridbug.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="7a0a1f141f16150a1f3a1f021b170a161f54191517">[email protected]</a>>,
Fabrette Pussycat <<a href="https://getridbug.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="aacccbc8d8cfdedecfeacfd2cbc7dac6cf84c9c5c7">[email protected]</a>>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Salut!
Cela ressemble =C3=A0 un excellent recipie[1] d=C3=A9jeuner.
[1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718
--Pep=C3=A9
=20
Non-ascii characters in the string have been encoded with the quoted-printable encoding, as specified in the Content-Transfer-Encoding header.
Create an email object:
import email from email import policy msg = email.message_from_string(s, policy=policy.default)
Setting the policy is required here; otherwise policy.compat32 is used, which returns a legacy Message instance that doesn’t have the get_content method. policy.default will eventually become the default policy, but as of Python3.7 it’s still policy.compat32.
The get_content() method handles decoding automatically:
print(msg.get_content()) Salut! Cela ressemble à un excellent recipie[1] déjeuner. [1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718 --Pepé
If you have a multipart message, get_content() needs to be called on the individual parts, like this:
for part in message.iter_parts():
print(part.get_content())
Method 3
That’s known as quoted-printable encoding. You probably want to use something like quopri.decodestring – http://docs.python.org/library/quopri.html
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0