Using BeautifulSoup to extract text without tags

My webpage looks like this:

<p>
  <strong class="offender">YOB:</strong> 1987<br/>
  <strong class="offender">RACE:</strong> WHITE<br/>
  <strong class="offender">GENDER:</strong> FEMALE<br/>
  <strong class="offender">HEIGHT:</strong> 5'05''<br/>
  <strong class="offender">WEIGHT:</strong> 118<br/>
  <strong class="offender">EYE COLOR:</strong> GREEN<br/>
  <strong class="offender">HAIR COLOR:</strong> BROWN<br/>
</p>

I want to extract the info for each individual and get YOB:1987, RACE:WHITE, etc…

What I tried is:

subc = soup.find_all('p')
subc1 = subc[1]
subc2 = subc1.find_all('strong')

But this gives me only the values of YOB:, RACE:, etc…

Is there a way that I can get the data in YOB:1987, RACE:WHITE format?

Contents hide

Answers:

Method 1

Method 2

Method 3

Method 4

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Just loop through all the <strong> tags and use next_sibling to get what you want. Like this:

for strong_tag in soup.find_all('strong'):
    print(strong_tag.text, strong_tag.next_sibling)

Demo:

from bs4 import BeautifulSoup

html = '''
<p>
  <strong class="offender">YOB:</strong> 1987<br />
  <strong class="offender">RACE:</strong> WHITE<br />
  <strong class="offender">GENDER:</strong> FEMALE<br />
  <strong class="offender">HEIGHT:</strong> 5'05''<br />
  <strong class="offender">WEIGHT:</strong> 118<br />
  <strong class="offender">EYE COLOR:</strong> GREEN<br />
  <strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
'''

soup = BeautifulSoup(html)

for strong_tag in soup.find_all('strong'):
    print(strong_tag.text, strong_tag.next_sibling)

This gives you:

YOB:  1987
RACE:  WHITE
GENDER:  FEMALE
HEIGHT:  5'05''
WEIGHT:  118
EYE COLOR:  GREEN
HAIR COLOR:  BROWN

Method 2

I think you can get it using subc1.text.

>>> html = """
<p>
    <strong class="offender">YOB:</strong> 1987<br />
    <strong class="offender">RACE:</strong> WHITE<br />
    <strong class="offender">GENDER:</strong> FEMALE<br />
    <strong class="offender">HEIGHT:</strong> 5'05''<br />
    <strong class="offender">WEIGHT:</strong> 118<br />
    <strong class="offender">EYE COLOR:</strong> GREEN<br />
    <strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> print soup.text


YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN

Or if you want to explore it, you can use .contents :

>>> p = soup.find('p')
>>> from pprint import pprint
>>> pprint(p.contents)
[u'n',
 <strong class="offender">YOB:</strong>,
 u' 1987',
 <br/>,
 u'n',
 <strong class="offender">RACE:</strong>,
 u' WHITE',
 <br/>,
 u'n',
 <strong class="offender">GENDER:</strong>,
 u' FEMALE',
 <br/>,
 u'n',
 <strong class="offender">HEIGHT:</strong>,
 u" 5'05''",
 <br/>,
 u'n',
 <strong class="offender">WEIGHT:</strong>,
 u' 118',
 <br/>,
 u'n',
 <strong class="offender">EYE COLOR:</strong>,
 u' GREEN',
 <br/>,
 u'n',
 <strong class="offender">HAIR COLOR:</strong>,
 u' BROWN',
 <br/>,
 u'n']

and filter out the necessary items from the list:

>>> data = dict(zip([x.text for x in p.contents[1::4]], [x.strip() for x in p.contents[2::4]]))
>>> pprint(data)
{u'EYE COLOR:': u'GREEN',
 u'GENDER:': u'FEMALE',
 u'HAIR COLOR:': u'BROWN',
 u'HEIGHT:': u"5'05''",
 u'RACE:': u'WHITE',
 u'WEIGHT:': u'118',
 u'YOB:': u'1987'}

Method 3

you can try this indside findall for loop:

item_price = item.find('span', attrs={'class':'s-item__price'}).text

it extracts only text and assigs it to “item_pice”

Method 4

I think you could solve this with .strip() in gazpacho:

Input:

html = """
<p>
  <strong class="offender">YOB:</strong> 1987<br />
  <strong class="offender">RACE:</strong> WHITE<br />
  <strong class="offender">GENDER:</strong> FEMALE<br />
  <strong class="offender">HEIGHT:</strong> 5'05''<br />
  <strong class="offender">WEIGHT:</strong> 118<br />
  <strong class="offender">EYE COLOR:</strong> GREEN<br />
  <strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""

Code:

soup = Soup(html)
text = soup.find("p").strip(whitespace=False) # to keep n characters intact
lines = [
    line.strip()
    for line in text.split("n")
    if line != ""
]
data = dict([line.split(": ") for line in lines])

Output:

print(data)
# {'YOB': '1987',
#  'RACE': 'WHITE',
#  'GENDER': 'FEMALE',
#  'HEIGHT': "5'05''",
#  'WEIGHT': '118',
#  'EYE COLOR': 'GREEN',
#  'HAIR COLOR': 'BROWN'}

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating