How can I get an email message's text content using Python?

Given an RFC822 message in Python 2.6, how can I get the right text/plain content part? Basically, the algorithm I want is this:

message = email.message_from_string(raw_message)
if has_mime_part(message, "text/plain"):
    mime_part = get_mime_part(message, "text/plain")
    text_content = decode_mime_part(mime_part)
elif has_mime_part(message, "text/html"):
    mime_part = get_mime_part(message, "text/html")
    html = decode_mime_part(mime_part)
    text_content = render_html_to_plaintext(html)
else:
    # fallback
    text_content = str(message)
return text_content

Of these, I have get_mime_part and has_mime_part down pat, but I’m not sure how to get the decoded text from the MIME part. I can get the encoded text using get_payload(), but when I try to use the decode parameter of get_payload() (see the documentation) on the text/plain part, I get an error:

File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/
email/message.py", line 189, in get_payload
    raise TypeError('Expected list, got %s' % type(self._payload))
TypeError: Expected list, got <type 'str'>

In addition, I don’t know how to take HTML and render it to text as closely as possible.

Answers:

Method 1

In a multipart e-mail, email.message.Message.get_payload() returns a list with one item for each part. The easiest way is to walk the message and get the payload on each part:

import email
msg = email.message_from_string(raw_message)
for part in msg.walk():
    # each part is either a non-multipart message or another multipart
    # message that contains further parts; the message is organized like a tree
    if part.get_content_type() == 'text/plain':
        print(part.get_payload())  # prints the raw, still-encoded text
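
To make the tree structure concrete, here is a minimal sketch (written for Python 3) that builds a small two-part message and lists what walk() yields. The demo message and its contents are made up for illustration; they are not from the question.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Build a small multipart/alternative message purely for illustration.
demo = MIMEMultipart("alternative")
demo.attach(MIMEText("Hello in plain text", "plain"))
demo.attach(MIMEText("<p>Hello in <b>HTML</b></p>", "html"))

# walk() yields the container itself first, then each child part in order.
content_types = [part.get_content_type() for part in demo.walk()]
print(content_types)
# ['multipart/alternative', 'text/plain', 'text/html']
```

Because the multipart container is itself yielded, filtering on get_content_type() (as above) or skipping parts where is_multipart() is true keeps the loop on the leaf parts.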

For a non-multipart message, no need to do all the walking. You can go straight to get_payload(), regardless of content_type.

msg = email.message_from_string(raw_message)
msg.get_payload()

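If the content is encoded, you need to set the decode flag. decode is the second parameter of get_payload(), so either pass None as the first parameter followed by True, or use the keyword form get_payload(decode=True). A lone positional True is treated as the part index, which raises the "Expected list" TypeError on a non-multipart part. For example, suppose that my e-mail contains an MS Word document attachment: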

msg = email.message_from_string(raw_message)
for part in msg.walk():
    if part.get_content_type() == 'application/msword':
        name = part.get_param('name') or 'MyDoc.doc'
        # None is the part index; it must stay unset because
        # part.is_multipart() is False. True is the decode flag.
        with open(name, 'wb') as f:
            f.write(part.get_payload(None, True))
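
The same flag applies to the text/plain part from the question. Below is a minimal sketch for Python 3, where get_payload(decode=True) returns bytes that still have to be decoded with the part's charset. The helper name decode_mime_part comes from the question's pseudocode, the sample message is hypothetical, and the UTF-8 fallback is an assumption for parts that declare no charset.

```python
import email

def decode_mime_part(part):
    """Return the text of a non-multipart part as a str."""
    # decode=True reverses the Content-Transfer-Encoding
    # (quoted-printable or base64) and returns bytes.
    raw_bytes = part.get_payload(decode=True)
    # Assumption: fall back to UTF-8 when the part declares no charset.
    charset = part.get_content_charset() or "utf-8"
    return raw_bytes.decode(charset, errors="replace")

# Hypothetical sample: a quoted-printable, UTF-8 text/plain message.
raw_message = (
    "Subject: demo\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "caf=C3=A9 au lait\n"
)
msg = email.message_from_string(raw_message)
text_content = decode_mime_part(msg)
print(text_content)
```

Without the decode flag, msg.get_payload() would return the still-encoded string containing "=C3=A9". A charset name that Python does not recognize would raise LookupError here; this sketch does not handle that case.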

As for getting a reasonable plain-text approximation of an HTML part, I’ve found that html2text works pretty darn well.
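
If installing a third-party package is not an option, a much rougher approximation can be sketched with the standard library's html.parser (Python 3). This is not html2text and produces no Markdown-style formatting; it only strips tags, skips script and style contents, and breaks lines at a few block-level tags. The function name render_html_to_plaintext is taken from the question's pseudocode, and the choice of block tags is an assumption.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    # Assumption: tags that should start a new line in the output.
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace and drop empty lines.
        lines = [" ".join(line.split()) for line in "".join(self._chunks).splitlines()]
        return "\n".join(line for line in lines if line)

def render_html_to_plaintext(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

sample = "<html><head><style>p{color:red}</style></head><body><p>Hello <b>world</b></p><p>Second &amp; last</p></body></html>"
plain = render_html_to_plaintext(sample)
print(plain)
```

Links, tables, and lists lose their structure with this approach, which is where a dedicated converter such as html2text does noticeably better.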

Method 2

Flat is better than nested 😉

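import email

msg = email.message_from_string(raw_message)
# a parsed message is an email.message.Message, not a MIMEMultipart,
# so check is_multipart() rather than isinstance()
assert msg.is_multipart()

for text in [k.get_payload() for k in msg.walk() if k.get_content_type() == 'text/plain']:
    print(text)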

Method 3

To add on @Jarret Hardie’s excellent answer:

I personally like to transform this kind of data structure into a dictionary that I can reuse later, with the content type as the key and the payload as the value. Note that a dict comprehension keeps only the last part seen for each content type:

import email

[...]

email_message = {
    part.get_content_type(): part.get_payload()
    for part in email.message_from_bytes(raw_email).walk()
}

print(email_message["text/plain"])
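
A variation on the same idea, as a sketch under a few assumptions (Python 3, UTF-8 as the fallback charset, and a made-up helper name parts_by_content_type): it skips multipart containers, decodes each leaf part, and keeps a list per content type so that repeated types are not overwritten.

```python
import email
from collections import defaultdict
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def parts_by_content_type(raw_email):
    """Map each content type to a list of decoded leaf-part payloads."""
    message = email.message_from_bytes(raw_email)
    parts = defaultdict(list)
    for part in message.walk():
        if part.is_multipart():
            continue  # containers hold other parts, not content
        payload = part.get_payload(decode=True)  # bytes
        if part.get_content_maintype() == "text":
            # Assumption: fall back to UTF-8 when no charset is declared.
            charset = part.get_content_charset() or "utf-8"
            payload = payload.decode(charset, errors="replace")
        parts[part.get_content_type()].append(payload)
    return dict(parts)

# Hypothetical sample message, serialized to bytes as if read from a mailbox.
demo = MIMEMultipart("alternative")
demo.attach(MIMEText("Hello", "plain"))
demo.attach(MIMEText("<p>Hello</p>", "html"))
parts = parts_by_content_type(demo.as_bytes())
print(parts["text/plain"])
```

Text parts come back as str and everything else (attachments, images) stays as bytes, so callers can write binary parts to disk unchanged.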

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.