My webpage looks like this:
<p>
<strong class="offender">YOB:</strong> 1987<br/>
<strong class="offender">RACE:</strong> WHITE<br/>
<strong class="offender">GENDER:</strong> FEMALE<br/>
<strong class="offender">HEIGHT:</strong> 5'05''<br/>
<strong class="offender">WEIGHT:</strong> 118<br/>
<strong class="offender">EYE COLOR:</strong> GREEN<br/>
<strong class="offender">HAIR COLOR:</strong> BROWN<br/>
</p>
I want to extract the info for each individual and get YOB:1987, RACE:WHITE, etc…
What I tried is:
subc = soup.find_all('p')
subc1 = subc[1]
subc2 = subc1.find_all('strong')
But this gives me only the labels (YOB:, RACE:, etc.), not the values that follow them.
Is there a way that I can get the data in YOB:1987, RACE:WHITE format?
Answers:
Method 1
Just loop through all the <strong> tags and use next_sibling to get what you want. Like this:
for strong_tag in soup.find_all('strong'):
    print(strong_tag.text, strong_tag.next_sibling)
Demo:
from bs4 import BeautifulSoup
html = '''
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
'''
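# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
for strong_tag in soup.find_all('strong'):
    # next_sibling is the text node that follows each <strong> tag
    print(strong_tag.text, strong_tag.next_sibling)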
This gives you:
YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
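If you want the exact YOB:1987, RACE:WHITE format from the question, the same next_sibling idea can be extended a little. This is only a minimal sketch: it assumes every value is a plain text node directly after its <strong> tag, and the shortened HTML and the names pairs and result are made up for the illustration.

```python
from bs4 import BeautifulSoup

html = """
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
</p>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for strong_tag in soup.find_all("strong", class_="offender"):
    label = strong_tag.get_text(strip=True)  # e.g. "YOB:"
    sibling = strong_tag.next_sibling
    # Assumption: the value is a text node; anything else becomes an empty string
    value = sibling.strip() if isinstance(sibling, str) else ""
    pairs.append(label + value)

result = ", ".join(pairs)
print(result)
```

Stripping the sibling removes the leading space that sits between the label and the value in the page source.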
Method 2
I think you can get it using subc1.text.
>>> html = """
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.text)
YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
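Once you have that plain text, turning it into label/value pairs is ordinary string handling. A rough sketch using only the standard library follows; it assumes one "LABEL: value" pair per line, and parse_fields is a hypothetical helper name, not something bs4 provides.

```python
text = """
YOB: 1987
RACE: WHITE
HEIGHT: 5'05''
"""


def parse_fields(text):
    """Split 'LABEL: value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        # partition splits at the first colon only
        label, sep, value = line.partition(":")
        if sep:  # skip blank lines and lines without a colon
            fields[label.strip()] = value.strip()
    return fields


fields = parse_fields(text)
formatted = ", ".join(f"{label}:{value}" for label, value in fields.items())
print(formatted)
```

Using partition instead of split keeps a value intact even if it happens to contain a colon of its own.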
Or, if you want to explore the structure, you can use .contents:
>>> p = soup.find('p')
>>> from pprint import pprint
>>> pprint(p.contents)
[u'\n',
 <strong class="offender">YOB:</strong>,
 u' 1987',
 <br/>,
 u'\n',
 <strong class="offender">RACE:</strong>,
 u' WHITE',
 <br/>,
 u'\n',
 <strong class="offender">GENDER:</strong>,
 u' FEMALE',
 <br/>,
 u'\n',
 <strong class="offender">HEIGHT:</strong>,
 u" 5'05''",
 <br/>,
 u'\n',
 <strong class="offender">WEIGHT:</strong>,
 u' 118',
 <br/>,
 u'\n',
 <strong class="offender">EYE COLOR:</strong>,
 u' GREEN',
 <br/>,
 u'\n',
 <strong class="offender">HAIR COLOR:</strong>,
 u' BROWN',
 <br/>,
 u'\n']
and pick the labels and values out of the list by position (every fourth item, starting at index 1 for labels and index 2 for values):
>>> data = dict(zip([x.text for x in p.contents[1::4]], [x.strip() for x in p.contents[2::4]]))
>>> pprint(data)
{u'EYE COLOR:': u'GREEN',
 u'GENDER:': u'FEMALE',
 u'HAIR COLOR:': u'BROWN',
 u'HEIGHT:': u"5'05''",
 u'RACE:': u'WHITE',
 u'WEIGHT:': u'118',
 u'YOB:': u'1987'}
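The slicing above depends on the nodes repeating in an exact group of four, which breaks as soon as a newline is missing in the page source. Since the question asks for the info of each individual, here is a hedged sketch that walks .contents and pairs each <strong> with the text node after it, one dict per <p>. The second <p> is made-up sample data, and paragraph_to_dict is a hypothetical helper name.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
</p>
<p>
<strong class="offender">YOB:</strong> 1990<br /><strong class="offender">RACE:</strong> BLACK<br />
</p>
"""

soup = BeautifulSoup(html, "html.parser")


def paragraph_to_dict(p):
    """Pair each <strong> label with the text node that follows it."""
    data = {}
    label = None
    for node in p.contents:
        if isinstance(node, Tag) and node.name == "strong":
            label = node.get_text(strip=True).rstrip(":")
        elif isinstance(node, NavigableString) and label is not None:
            data[label] = node.strip()
            label = None  # wait for the next <strong> before storing again
    return data


people = [paragraph_to_dict(p) for p in soup.find_all("p")]
print(people)
```

Note that the second paragraph has no newline between the <br /> and the next <strong>, so a fixed [1::4] stride would misalign there while this loop still pairs the fields correctly.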
Method 3
You can try this inside the find_all for loop:
item_price = item.find('span', attrs={'class':'s-item__price'}).text
It extracts only the text of the matched tag and assigns it to item_price.
Method 4
I think you could solve this with .strip() in gazpacho:
Input:
html = """
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
Code:
from gazpacho import Soup

soup = Soup(html)
text = soup.find("p").strip(whitespace=False)  # to keep \n characters intact
lines = [
    line.strip()
    for line in text.split("\n")
    if line != ""
]
# Each line looks like "YOB: 1987"; split it into a (label, value) pair
data = dict([line.split(": ") for line in lines])
Output:
print(data)
# {'YOB': '1987',
# 'RACE': 'WHITE',
# 'GENDER': 'FEMALE',
# 'HEIGHT': "5'05''",
# 'WEIGHT': '118',
# 'EYE COLOR': 'GREEN',
# 'HAIR COLOR': 'BROWN'}
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.