How do I find the text I am looking for in the following HTML (line breaks marked with \n)?
...
<tr>
<td class="pos">\n
"Some text:"\n
<br>\n
<strong>some value</strong>\n
</td>
</tr>
<tr>
<td class="pos">\n
"Fixed text:"\n
<br>\n
<strong>text I am looking for</strong>\n
</td>
</tr>
<tr>
<td class="pos">\n
"Some other text:"\n
<br>\n
<strong>some other value</strong>\n
</td>
</tr>
...
The code below returns the first value found, so I need to filter by "Fixed text:" somehow.
result = soup.find('td', {'class' :'pos'}).find('strong').text
UPDATE: If I use the following code:
title = soup.find('td', text = re.compile(ur'Fixed text:(.*)', re.DOTALL), attrs = {'class': 'pos'})
self.response.out.write(str(title.string).decode('utf8'))
then it returns just Fixed text:, not the <strong>-highlighted text in that same element.
Answers:
Method 1
You can pass a regular expression to the text parameter of findAll, like so:
import BeautifulSoup
import re
columns = soup.findAll('td', text = re.compile('your regex here'), attrs = {'class' : 'pos'})
Method 2
This post got me to my answer even though the answer is missing from this post. I felt I should give back.
The challenge here is in the inconsistent behavior of BeautifulSoup.find when searching with and without text.
Note:
If you have BeautifulSoup, you can test this locally via:
curl https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.py | python
Code: https://gist.github.com/4060082
# Taken from https://gist.github.com/4060082
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
from pprint import pprint
import re
soup = BeautifulSoup(urlopen('https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.html').read())
# I'm going to assume that Peter knew that re.compile is meant to cache a computation result for a performance benefit. However, I'm going to do that explicitly here to be very clear.
pattern = re.compile('Fixed text')
# Peter's suggestion here returns a list of what appear to be strings
columns = soup.findAll('td', text=pattern, attrs={'class' : 'pos'})
# ...but it is actually a BeautifulSoup.NavigableString
print type(columns[0])
#>> <class 'BeautifulSoup.NavigableString'>
# you can reach the tag using one of the convenience attributes seen here
pprint(columns[0].__dict__)
#>> {'next': <br />,
#>> 'nextSibling': <br />,
#>> 'parent': <td class="pos">n
#>> "Fixed text:"n
#>> <br />n
#>> <strong>text I am looking for</strong>n
#>> </td>,
#>> 'previous': <td class="pos">n
#>> "Fixed text:"n
#>> <br />n
#>> <strong>text I am looking for</strong>n
#>> </td>,
#>> 'previousSibling': None}
# I feel that 'parent' is safer to use than 'previous' based on http://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names
# So, if you want to find the 'text' in the 'strong' element...
pprint([t.parent.find('strong').text for t in soup.findAll('td', text=pattern, attrs={'class' : 'pos'})])
#>> [u'text I am looking for']
# Here is what we have learned:
print soup.find('strong')
#>> <strong>some value</strong>
print soup.find('strong', text='some value')
#>> u'some value'
print soup.find('strong', text='some value').parent
#>> <strong>some value</strong>
print soup.find('strong', text='some value') == soup.find('strong')
#>> False
print soup.find('strong', text='some value') == soup.find('strong').text
#>> True
print soup.find('strong', text='some value').parent == soup.find('strong')
#>> True
Though it is most certainly too late to help the OP, I hope they will mark this as the answer, since it addresses all the quandaries around finding by text.
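For readers on Beautiful Soup 4 and Python 3, the same "find the string, then climb to its parent" idea can be sketched roughly as follows. This is not the answer author's code: the inline html string is a stand-in for the gist's test.html, and the sketch assumes the bs4 package with the built-in html.parser.

```python
import re
from bs4 import BeautifulSoup

# Inline stand-in for the question's markup (an assumption; the gist's test.html is not fetched).
html = """
<table>
<tr><td class="pos">"Some text:"<br><strong>some value</strong></td></tr>
<tr><td class="pos">"Fixed text:"<br><strong>text I am looking for</strong></td></tr>
<tr><td class="pos">"Some other text:"<br><strong>some other value</strong></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("Fixed text")

# Searching by string alone returns NavigableString objects, not tags.
labels = soup.find_all(string=pattern)

# Climb from each matching string to its enclosing <td class="pos">,
# then descend to the <strong> element inside that cell.
values = [
    label.find_parent("td", class_="pos").find("strong").get_text()
    for label in labels
]
print(values)
```

Using find_parent with a tag name and class, instead of the bare parent attribute, keeps the lookup tied to the intended cell even if the label is later wrapped in another element.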
Method 3
With bs4 4.7.1+ you can use the :contains pseudo-class to select the td containing your (filter) search string. You can then use a descendant combinator to move to the strong element containing the target text:
from bs4 import BeautifulSoup as bs
html = '''
<tr>
<td class="pos">n
"Some text:"n
<br>n
<strong>some value</strong>n
</td>
</tr>
<tr>
<td class="pos">n
"Fixed text:"n
<br>n
<strong>text I am looking for</strong>n
</td>
</tr>
<tr>
<td class="pos">n
"Some other text:"n
<br>n
<strong>some other value</strong>n
</td>
</tr>'''
soup = bs(html, 'lxml')
print(soup.select_one('td:contains("Fixed text:") strong').text)
NEW: In order to avoid conflicts with future CSS specification
changes, non-standard pseudo classes will now start with the :-soup-
prefix. As a consequence, :contains() will now be known as
:-soup-contains(), though for a time the deprecated form of
:contains() will still be allowed with a warning that users should
migrate over to :-soup-contains().
NEW: Added new non-standard pseudo class :-soup-contains-own() which
operates similar to :-soup-contains() except that it only looks at
text nodes directly associated with the currently scoped element and
not its descendants.
Quoted from @facelessuser's GitHub page.
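A minimal sketch of the renamed pseudo-classes, assuming a Soup Sieve release that includes them (the quoted notes do not give a version number here, so check your installed version). The shortened html string and the built-in html.parser are illustrative choices, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's markup, wrapped in a <table> for clarity.
html = """
<table>
<tr><td class="pos">"Some text:"<br><strong>some value</strong></td></tr>
<tr><td class="pos">"Fixed text:"<br><strong>text I am looking for</strong></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# :-soup-contains() looks at the text of the element and all of its descendants.
value = soup.select_one('td:-soup-contains("Fixed text:") strong').get_text()

# :-soup-contains-own() looks only at the element's own text nodes, so the
# <td> holding the label matches while its enclosing <tr> does not.
own_td = soup.select('td:-soup-contains-own("Fixed text:")')
own_tr = soup.select('tr:-soup-contains-own("Fixed text:")')

print(value, len(own_td), len(own_tr))
```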
Method 4
Since Beautiful Soup 4.4.0, a parameter called string does the work that text used to do in previous versions.
Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the td tags whose .string is “Elsie”:
soup.find_all("td", string="Elsie")
For more information about string, have a look at this section: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument
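Applied to the question's markup, the .string rule matters: a td with several children has no single .string, so filtering td tags by string would be expected to find nothing, while the strong tag can be matched directly. The sketch below illustrates this with a one-row stand-in for the question's HTML, assuming bs4 4.4.0 or later and the built-in html.parser.

```python
import re
from bs4 import BeautifulSoup

# One-row stand-in for the question's markup (an illustrative assumption).
html = (
    '<table><tr><td class="pos">"Fixed text:"<br>'
    "<strong>text I am looking for</strong></td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

td = soup.find("td", class_="pos")

# The <td> has a text node, a <br>, and a <strong> as children,
# so its .string is None rather than the label text.
print(td.string)

# Because matching is done against .string, this search comes back empty here.
td_matches = soup.find_all("td", string=re.compile("Fixed text"))
print(td_matches)

# The <strong> tag has exactly one text child, so string matching works on it.
strong = soup.find("strong", string="text I am looking for")
print(strong.get_text())
```

When the value itself is unknown and only the label is fixed, searching for the label string on its own and navigating to its parent is the more practical route.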
Method 5
One way to find an anchor tag whose text contains a particular keyword is the following:
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
from urllib.parse import urljoin, urlparse
# Assumes soup is an existing BeautifulSoup object and keyword is the search string.
rawLinks = soup.findAll('a', href=True)
for link in rawLinks:
    innercontent = link.text
    # Case-insensitive substring match on the link text
    if keyword.lower() in innercontent.lower():
        print(link)
Method 6
result = soup.find('strong', text='text I am looking for').text
Method 7
You could solve this with some simple gazpacho parsing:
from gazpacho import Soup
soup = Soup(html)
tds = soup.find("td", {"class": "pos"})
tds[1].find("strong").text
Which will output:
text I am looking for
Method 8
You can use Beautiful Soup’s CSS selector method.
from bs4 import BeautifulSoup
from bs4.element import Tag
from typing import List
# This will work as of BeautifulSoup 4.9.1.
result: List[Tag] = BeautifulSoup(html_string, 'lxml').select(
    'tr td strong:contains("text I am looking for")'
)
print(result)
[<strong>text I am looking for</strong>]
🤠
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.