Extract email sub-strings from large document

I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:

...<name@domain.com>...

What is the best way to have Python cycle through the entire .txt file looking for all instances of a certain @domain string, grab the entire address within the <…>’s, and add it to a list? The trouble I have is with the variable length of the different addresses.

Answers:

Method 1

This code extracts the email addresses in a string. Use it while reading the file line by line.

>>> import re
>>> line = "should we use regex more often? let me know at  jdsk@bob.com.lol"
>>> match = re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', line)
>>> match.group(0)
'jdsk@bob.com.lol'

If you have several email addresses use findall:

>>> line = "should we use regex more often? let me know at  jdsk@bob.com.lol or popop@coco.com"
>>> match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', line)
>>> match
['jdsk@bob.com.lol', 'popop@coco.com']

The regex above probably finds the most common real-world email addresses. If you want to be completely aligned with RFC 5322, you should check which address forms the specification allows; doing so helps avoid bugs when matching email addresses.
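A minimal sketch of the "reading line by line" advice, assuming the same pattern as in the examples above. The names EMAIL_RE and extract_emails are hypothetical, chosen only for illustration.

```python
import re

# Same pattern as in the examples above, compiled once for reuse.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def extract_emails(lines):
    """Collect every match from an iterable of lines (e.g. an open file)."""
    found = []
    for line in lines:
        found.extend(EMAIL_RE.findall(line))
    return found
```

An open file object is itself an iterable of lines, so it can be passed to extract_emails directly inside a with open(...) block, and only one line is held in memory at a time.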

Edit: as suggested in a comment by @kostek:
When an address ends a sentence, as in "Contact us at <address>.", my regex returns the address with the final dot attached. To avoid this, use [\w.,]+@[\w.,]+\.\w+

Edit II: another wonderful improvement was mentioned in the comments: [\w.-]+@[\w.-]+\.\w+ which will also capture addresses that contain a hyphen.

Edit III: Added further improvements as discussed in the comments: “In addition to allowing + in the beginning of the address, this also ensures that there is at least one period in the domain. It allows multiple segments of domain like abc.co.uk as well, and does NOT match [email protected] :). Finally, you don’t actually need to escape periods within a character class, so it doesn’t do that.”
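One pattern consistent with the description in Edit III, offered as a sketch and not as the exact regex from the comments: it allows + in the local part, requires at least one period in the domain, accepts multi-segment domains such as abc.co.uk, and leaves a sentence-ending dot out of the match. The name STRICT_EMAIL_RE and the sample addresses are assumptions made for illustration.

```python
import re

# Assumed pattern: every domain segment after the first must be a dot
# followed by at least one word character or hyphen.
STRICT_EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

samples = "Write to first+tag@abc.co.uk or to support@example.com. Not user@localhost"
strict_matches = STRICT_EMAIL_RE.findall(samples)
print(strict_matches)  # ['first+tag@abc.co.uk', 'support@example.com']
```

Because the group is non-capturing, findall still returns the whole address and not just the last domain segment.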

Method 2

You can also use the following to find all the email addresses in a text and print them either as a list or one per line.

import re
line = ("why people don't know what regex are? let me know asdfal2@als.com, Users1@gmail.de "
        "Dariush@dasd-asasdsa.com.lo,Dariush.lastName@someDomain.com")
match = re.findall(r'[\w.-]+@[\w.-]+', line)
for i in match:
    print(i)

If you want the addresses in a list, match already is one; just print it:

# this will print the list
print(match)

Method 3

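import re
# `text` is the string to search; the pattern needs one non-word character after each address
rgx = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]?\(?[ ]?(at|AT)[ ]?\)?[ ]?)(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'
matches = re.findall(rgx, text)
# findall returns one tuple of groups per match; the first group is the full address
get_first_group = lambda y: list(map(lambda x: x[0], y))
emails = get_first_group(matches)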

Please don’t hate me for having a go at this infamous regex. The regex works for a decent portion of email addresses shown below. I mostly used this as my basis for the valid chars in an email address.

Feel free to play around with it here

I also made a variation where the regex captures emails like name at example.com

(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]\(?[ ]?(at|AT)[ ]?\)?[ ])(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])

Method 4

If you’re looking for a specific domain:

>>> import re
>>> text = "this is an email la@test.com, it will be matched, x@y.com will not, and test@test.com will"
>>> match = re.findall(r'[\w\-._+%]+@test\.com', text) # replace test.com with the domain you're looking for, adding a backslash before periods
>>> match
['la@test.com', 'test@test.com']
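To avoid escaping the periods by hand, the domain can be passed through re.escape. The sketch below assumes that approach; the function name find_addresses_for_domain is hypothetical, and the trailing lookahead is an added assumption that rejects longer domains such as test.com.au.

```python
import re

def find_addresses_for_domain(text, domain):
    """Return the addresses in text whose domain is exactly `domain`."""
    pattern = re.compile(
        r'[\w.+%-]+@' + re.escape(domain) +
        # Reject a longer domain such as test.com.au or test.community.
        r'(?![\w-]|\.\w)'
    )
    return pattern.findall(text)

text = "this is an email la@test.com, it will be matched, x@y.com will not, and test@test.com will"
print(find_addresses_for_domain(text, "test.com"))  # ['la@test.com', 'test@test.com']
```

The match is case-sensitive as written; passing re.IGNORECASE to re.compile would relax that.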

Method 5

import re

reg_pat = r'\S+@\S+\.\S+'

test_text = 'xyz.byc@cfg-jj.com    ir_er@cu.co.kl   uiufubvcbuw bvkw  ko@com    m@urice'

emails = re.findall(reg_pat, test_text, re.IGNORECASE)
print(emails)

Output:

['xyz.byc@cfg-jj.com', 'ir_er@cu.co.kl']

Method 6

import re
mess = '''Jawadahmed@gmail.com Ahmed@gmail.com
            abc@gmail'''
email = re.compile(r'([\w.-]+@gmail\.com)')
result = email.findall(mess)

# findall returns a list (possibly empty), never None
if result:
    print(result)

The code above returns only the gmail.com addresses.

Method 7

You can use \b at the end of the pattern to mark where the email address ends, so that trailing punctuation is not captured.

The regex

[\w.-]+@[\w.-]+\b
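A short comparison of the pattern with and without the trailing \b, as a sketch; the sample sentence and address are assumptions made for illustration.

```python
import re

sentence = "Contact us at support@example.com."

# Without \b the greedy domain class also swallows the final period.
without_boundary = re.findall(r'[\w.-]+@[\w.-]+', sentence)
# With \b the match has to end right after a word character.
with_boundary = re.findall(r'[\w.-]+@[\w.-]+\b', sentence)

print(without_boundary)  # ['support@example.com.']
print(with_boundary)     # ['support@example.com']
```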

Method 8

Example: if the mail ID contains only lowercase letters a-z, digits 0-9, and at most one . or _ separator, the regex below will match it:

>>> import re
>>> str1 = "abcdef_12345@gmail.com"
>>> regex1 = r"^[a-z0-9]+[._]?[a-z0-9]+[@]\w+[.]\w{2,3}$"
>>> re_com = re.compile(regex1)
>>> re_match = re_com.search(str1)
>>> re_match
<_sre.SRE_Match object at 0x1063c9ac0>
>>> re_match.group(0)
'abcdef_12345@gmail.com'

Method 9

import re

content = ' abcdabcd jcopelan@nyx.cs.du.edu  afgh 65882@mimsy.umd.edu  qwertyuiop mangoe@cs.umd'

match_objects = re.findall(r'\w+@\w+[\.\w+]+', content)

Method 10

#    \b[\w|.]+   ---> means: begins with any letter, digit, or dot (inside [...] the | is a literal character, not "or").

import re

marks = '''

!()[]{};?#$%:'",/^&é*

'''

text = 'Hello from priyankv@gmail.com to python@gmail.com, datascience@@gmail.com and machinelearning@@yahoo..com wrong email address: farzad@google.commmm'
# list of sequences of characters:
text_pieces = text.split()
pattern = r'\b[a-zA-Z]{1}[\w|.]*@[\w|.]+\.[a-zA-Z]{2,3}$'
for p in text_pieces:
  for x in marks:
    p = p.replace(x, "") 
  if len(re.findall(pattern, p)) > 0:
    print(re.findall(pattern, p))

Method 11

Here’s another approach for this specific problem, with a regex from emailregex.com:

import re

text = "blabla <hello@world.com>><123@123.at> <huhu@fake> bla bla <myname@some-domain.pt>"

# 1. find all potential email addresses (note: < inside <> is a problem)
matches = re.findall(r'<\S+?>', text)  # ['<hello@world.com>', '<123@123.at>', '<huhu@fake>', '<myname@some-domain.pt>']

# 2. apply email regex pattern to string inside <>
emails = [ x[1:-1] for x in matches if re.match(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)", x[1:-1]) ]
print(emails)   # ['hello@world.com', '123@123.at', 'myname@some-domain.pt']
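The two steps can also be folded into one pass with a capture group, so that findall returns the text inside the <> directly. This is a sketch under the assumption that a looser check than the emailregex.com pattern is acceptable; the variable name bracketed is hypothetical.

```python
import re

text = "blabla <hello@world.com>><123@123.at> <huhu@fake> bla bla <myname@some-domain.pt>"

# The parentheses capture only the address, so findall drops the < and >.
# The pattern requires one @ and at least one dot in the domain.
bracketed = re.findall(r'<([^<>\s@]+@[^<>\s@]+\.[^<>\s@]+)>', text)
print(bracketed)  # ['hello@world.com', '123@123.at', 'myname@some-domain.pt']
```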

Method 12

import re
txt = 'hello from absc@gmail.com to par1@yahoo.com about the meeting @2PM'
email = re.findall(r'\S+@\S+', txt)
print(email)

Printed output:

['absc@gmail.com', 'par1@yahoo.com']

Method 13

import re
with open("file_name",'r') as f:
    s = f.read()
    result = re.findall(r'\S+@\S+', s)
    for r in result:
        print(r)

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
