How to use Beautiful Soup to extract string in <script> tag?

In a given .html page, I have a script tag like so:

    <script>jQuery(window).load(function () {
      setTimeout(function(){
        jQuery("input[name=Email]").val("name@email.com");
      }, 1000);
    });</script>

How can I use Beautiful Soup to extract the email address?

Answers:

Method 1

This adds a bit to @Bob’s answer, and assumes you also need to locate the right script tag in HTML that may contain other script tags.

The idea is to define a regular expression that would be used for both locating the element with BeautifulSoup and extracting the email value:

import re

from bs4 import BeautifulSoup


data = """
<body>
    <script>jQuery(window).load(function () {
      setTimeout(function(){
        jQuery("input[name=Email]").val("name@email.com");
      }, 1000);
    });</script>
</body>
"""
pattern = re.compile(r'\.val\("([^@]+@[^@]+\.[^@]+)"\);', re.MULTILINE | re.DOTALL)
soup = BeautifulSoup(data, "html.parser")

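# string= is the current name for the older text= argument
script = soup.find("script", string=pattern)
if script:
    # .string returns the script body; .text can be empty in some bs4 versions
    match = pattern.search(script.string)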
    if match:
        email = match.group(1)
        print(email)

Prints: name@email.com.

Here we use a simple regular expression for the email address. It could be made stricter, but I doubt that is necessary in practice for this problem.
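As a rough sketch of what "more strict" could look like, the pattern below still anchors on .val("...") but only accepts common email characters inside the captured group. The name STRICT_PATTERN and the chosen character classes are assumptions for illustration, not a complete email validator.

```python
import re

# Hypothetical stricter pattern: the local part and domain are limited to
# common characters, and the top-level domain must be at least two letters.
STRICT_PATTERN = re.compile(
    r'\.val\("([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"\);'
)

# Script body from the question, as plain text.
script_text = '''jQuery(window).load(function () {
  setTimeout(function(){
    jQuery("input[name=Email]").val("name@email.com");
  }, 1000);
});'''

match = STRICT_PATTERN.search(script_text)
email = match.group(1) if match else None
print(email)
```

With the script from the question this should print name@email.com, while a call such as .val("not an email"); would not match at all.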

Method 2

I ran into a similar problem and the issue seems to be that calling script_tag.text returns an empty string. Instead, you have to call script_tag.string. Maybe this changed in some version of BeautifulSoup?

Anyway, @alecxe’s answer didn’t work for me, so I modified their solution:

import re

from bs4 import BeautifulSoup

data = """
<body>
    <script>jQuery(window).load(function () {
      setTimeout(function(){
        jQuery("input[name=Email]").val("name@email.com");
      }, 1000);
    });</script>
</body>
"""
soup = BeautifulSoup(data, "html.parser")

script_tag = soup.find("script")
if script_tag:
    # contains the whole script body, e.g. "jQuery(window)..."
    script_tag_contents = script_tag.string

    # from there you can search the string using a regex, etc.
    email = re.search(r'.+val\("(.+)"\);', script_tag_contents).group(1)
    print(email)

This prints name@email.com.
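One thing to watch with .string is that it is None for a script tag with no inline body (for example one that only has a src attribute), and passing None to re.search raises a TypeError. The sketch below loops over every script tag and skips those cases. The function name find_email and the example.com URL are made up for illustration.

```python
import re

from bs4 import BeautifulSoup

html = """
<body>
    <script src="https://example.com/lib.js"></script>
    <script>jQuery(window).load(function () {
      setTimeout(function(){
        jQuery("input[name=Email]").val("name@email.com");
      }, 1000);
    });</script>
</body>
"""


def find_email(html_text):
    """Return the first .val("...") argument found in an inline script, or None."""
    soup = BeautifulSoup(html_text, "html.parser")
    for script_tag in soup.find_all("script"):
        contents = script_tag.string  # None when the tag has no inline body
        if not contents:
            continue
        match = re.search(r'\.val\("(.+?)"\);', contents)
        if match:
            return match.group(1)
    return None


print(find_email(html))
```

Returning None instead of raising leaves it to the caller to decide what a missing email means.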

Method 3

This is not possible using only BeautifulSoup, but you can do it with, for example, BeautifulSoup plus regular expressions:

import re
from bs4 import BeautifulSoup as BS

html = """<script> ... </script>"""

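bs = BS(html, "html.parser")

# .string holds the script body (see Method 2 on why not get_text())
txt = bs.script.string

# re.search rather than re.match, since the script spans several lines
email = re.search(r'\.val\("(.+?)"\);', txt).group(1)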

or like this:

...

email = txt.split('.val("')[1].split('");')[0]
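The split version raises an IndexError when the script does not contain .val(" at all. A slightly more defensive sketch of the same idea uses str.partition, which reports whether each marker was found. The helper name email_from_script is hypothetical.

```python
def email_from_script(txt):
    """Return the text between .val(" and "); or None if a marker is missing."""
    _, start_marker, rest = txt.partition('.val("')
    if not start_marker:
        return None  # opening marker not present
    value, end_marker, _ = rest.partition('");')
    return value if end_marker else None


txt = 'jQuery("input[name=Email]").val("name@email.com");'
print(email_from_script(txt))
print(email_from_script("console.log('no email here');"))
```

The first call should print name@email.com and the second None.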

Method 4

In order to get the string inside the <script> tag, you can use .contents or .string.

from bs4 import BeautifulSoup

data = """
<body>
<script>jQuery(window).load(function () {
  setTimeout(function(){
    jQuery("input[name=Email]").val("name@email.com");
  }, 1000);
});</script>
</body>
"""
soup = BeautifulSoup(data, "html.parser")

script = soup.find("script")
inner_text_with_string = script.string
inner_text_with_content = script.contents[0]

print('inner_text_with_string', inner_text_with_string)
print('inner_text_with_content', inner_text_with_content)
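To make the difference between the two attributes concrete, here is a small sketch: .contents is always a list of children, while .string is the single text child, or None when there is not exactly one child. The two-tag HTML snippet is made up for this illustration.

```python
from bs4 import BeautifulSoup

# One script with an inline body, one empty script.
soup = BeautifulSoup("<script>var a = 1;</script><script></script>", "html.parser")
inline, empty = soup.find_all("script")

print(repr(inline.string))   # the single text child
print(inline.contents)       # a list holding that same child
print(empty.string)          # None, because there is no child
print(empty.contents)        # an empty list
```

So script.contents[0] fails with an IndexError on an empty tag, whereas script.string returns None, which is easier to check for.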

Method 5

You could solve this with just a couple of lines of gazpacho and .split, no regex required!

from gazpacho import Soup

html = """
<script>jQuery(window).load(function () {
  setTimeout(function(){
    jQuery("input[name=Email]").val("name@email.com");
  }, 1000);
});</script>
"""

soup = Soup(html)
string = soup.find("script").text
string.split('.val("')[-1].split('");')[0]

Which would output:

'name@email.com'


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
