I want to extract only the text from the top-most element of my soup; however, soup.text gives the text of all the child elements as well.
I have:
import BeautifulSoup
soup=BeautifulSoup.BeautifulSoup('<html>yes<b>no</b></html>')
print soup.text
The output of this is yesno. I simply want ‘yes’.
What’s the best way of achieving this?
Edit: I also want ‘yes’ to be output when parsing ‘<html><b>no</b>yes</html>’.
Answers:
Method 1
What about .find(text=True)? It returns the first text node in document order:
>>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').find(text=True)
u'yes'
>>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').find(text=True)
u'no'
EDIT:
I think I understand what you want now. Restrict the search to the direct children of html with recursive=False:
>>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').html.find(text=True, recursive=False)
u'yes'
>>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').html.find(text=True, recursive=False)
u'yes'
Method 2
You could use contents, provided the text is the first child of html:
>>> print soup.html.contents[0]
yes
Or, to get all the text nodes directly under html, use findAll(text=True, recursive=False):
>>> soup = BeautifulSoup.BeautifulSOAP('<html>x<b>no</b>yes</html>')
>>> soup.html.findAll(text=True, recursive=False)
[u'x', u'yes']
The same result, joined to form a single string:
>>> ''.join(soup.html.findAll(text=True, recursive=False))
u'xyes'
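For readers on Python 3 with the bs4 package, the same idea might look roughly like the sketch below. It is not the answerer's code: the helper name direct_text is made up for illustration, string=True is the newer spelling of text=True (it needs a reasonably recent bs4), and the example assumes the html.parser backend, which leaves the text as a direct child of html. Other parsers may insert extra elements such as body, in which case the tag passed to the helper would have to change.

```python
from bs4 import BeautifulSoup


def direct_text(tag):
    """Join the strings that are immediate children of `tag` (hypothetical helper)."""
    # recursive=False restricts the search to direct children;
    # string=True matches text nodes only, never tags.
    return "".join(tag.find_all(string=True, recursive=False))


# Assumption: html.parser does not add <body> or other wrapper elements.
soup1 = BeautifulSoup("<html>yes<b>no</b></html>", "html.parser")
soup2 = BeautifulSoup("<html><b>no</b>yes</html>", "html.parser")
soup3 = BeautifulSoup("<html>x<b>no</b>yes</html>", "html.parser")

print(direct_text(soup1.html))  # yes
print(direct_text(soup2.html))  # yes
print(direct_text(soup3.html))  # xyes
```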
Method 3
This works for me in bs4:
import bs4
node = bs4.BeautifulSoup('<html><div>A<span>B</span>C</div></html>').find('div')
print "".join([t for t in node.contents if type(t)==bs4.element.NavigableString])
output:
AC
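One detail worth knowing about the exact type comparison: in bs4, Comment is a subclass of NavigableString, so an exact type check skips HTML comments while an isinstance check would include them. A small Python 3 sketch of the difference, assuming the html.parser backend and a sample markup string invented for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical sample: same structure as above, plus an HTML comment.
markup = "<html><div>A<!-- note --><span>B</span>C</div></html>"
node = BeautifulSoup(markup, "html.parser").find("div")

# Exact type check: keeps plain text nodes, drops Comment nodes.
strict = "".join(t for t in node.contents if type(t) is NavigableString)

# isinstance check: also accepts NavigableString subclasses such as Comment.
loose = "".join(t for t in node.contents if isinstance(t, NavigableString))

print(strict)  # AC
print(loose)   # A note C
```

Which of the two is preferable depends on whether comment text should count as part of the element's own text.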
Method 4
You might want to look into lxml’s soupparser module, which has support for XPath:
>>> from lxml.html.soupparser import fromstring
>>> s1 = '<html>yes<b>no</b></html>'
>>> s2 = '<html><b>no</b>yes</html>'
>>> soup1 = fromstring(s1)
>>> soup2 = fromstring(s2)
>>> soup1.xpath("text()")
['yes']
>>> soup2.xpath("text()")
['yes']
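Both inputs give the same result because lxml keeps text at the start of an element in .text and text that follows a child in that child's .tail, and text() selects both. The sketch below illustrates that model with the standard library's xml.etree.ElementTree, which uses the same text/tail layout. It is a stand-in rather than lxml itself, own_text is a hypothetical helper name, and it only works here because the sample markup happens to be well-formed XML.

```python
import xml.etree.ElementTree as ET


def own_text(elem):
    """Collect text belonging directly to `elem`, skipping text inside children."""
    parts = [elem.text or ""]
    # Text that follows a child element is stored on that child's .tail.
    parts.extend(child.tail or "" for child in elem)
    return "".join(parts)


root1 = ET.fromstring("<html>yes<b>no</b></html>")
root2 = ET.fromstring("<html><b>no</b>yes</html>")

print(own_text(root1))  # yes
print(own_text(root2))  # yes
```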
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.