I am scraping a website with BeautifulSoup and Python; the page has more than 100 span tags. I want to remove every pair of consecutive span tags where the first span contains the text “READ MORE:” and the second span contains an arbitrary string.
<span>Two cars collided at low speed in Lurnea on February 25, 2019.</span>, <span>The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa.</span>, <span>READ MORE: </span>, <span>Long queues form at airports as one million Aussies set to fly this Easter</span>, <span>Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred.</span>, <span>The baby boy suffered fatal injuries when the driver's airbag deployed.</span>, <span>A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care".</span>, <span>READ MORE: </span>, <span>Four female backpackers killed in horror highway crash</span>, <span>The court also heard he had earned the title of a serial traffic offender.</span>, <span>In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs.</span>, <span>Watfa will serve at least two years and three months for manslaughter.</span>, <span>He will be eligible for parole in early 2024.</span>
For example, I want to remove the 4 tags below:
<span>READ MORE: </span>, <span>Long queues form at airports as one million Aussies set to fly this Easter</span>, <span>READ MORE: </span>, <span>Four female backpackers killed in horror highway crash</span>
The output should be:
<span>Two cars collided at low speed in Lurnea on February 25, 2019.</span>, <span>The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa.</span>, <span>Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred.</span>, <span>The baby boy suffered fatal injuries when the driver's airbag deployed.</span>, <span>A judge today slammed Watfa's actions, with the court hearing the vulnerable child was "entirely dependent upon Watfa, who owed him a duty of care".</span>, <span>The court also heard he had earned the title of a serial traffic offender.</span>, <span>In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs.</span>, <span>Watfa will serve at least two years and three months for manslaughter.</span>, <span>He will be eligible for parole in early 2024.</span>
I would be grateful if someone could help me with the logic in Python. Cheers.
Answers:
Method 1
Assuming you are scraping the text of each article on a news site, you should change your strategy. Clean the tree by calling .decompose() on the elements you do not want to scrape:
for e in soup.select('span:-soup-contains("READ MORE")'):
    e.find_next('span').decompose()
    e.decompose()
Then select the body of the article and extract the text:
soup.select_one('.article__body-croppable').get_text(' ', strip=True)
This results in:
A driver has been jailed over the death of a baby boy who was sitting on his lap during a crash in Sydney’s south-west . Two cars collided at low speed in Lurnea on February 25, 2019. The accident killed an 11-month-old boy who was in the BMW sedan being driven by Peter Watfa. Peter Watfa has been jailed for at least two years and three months. (9News) Watfa has repeatedly refused to admit the 11-month-old was sitting on his lap and is adamant the baby was restrained in the backseat when the crash occurred. The baby boy suffered fatal injuries when the driver’s airbag deployed. A judge today slammed Watfa’s actions, with the court hearing the vulnerable child was “entirely dependent upon Watfa, who owed him a duty of care”. An 11-month-old boy died in the crash. (9News) The court also heard he had earned the title of a serial traffic offender. In the months after the crash, Watfa was involved in a police pursuit and caught driving under the influence of drugs. Watfa will serve at least two years and three months for manslaughter. He will be eligible for parole in early 2024.
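To try the two steps above without hitting the live site, here is a minimal sketch on an inline snippet. The HTML below is a hypothetical stand-in for the real page (only the class name .article__body-croppable is taken from the answer), and the None check on find_next() is an added precaution in case a "READ MORE" marker happens to be the last span:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real article page.
html = """
<div class="article__body-croppable">
  <span>Two cars collided at low speed in Lurnea on February 25, 2019.</span>
  <span>READ MORE: </span>
  <span>Long queues form at airports as one million Aussies set to fly this Easter</span>
  <span>The baby boy suffered fatal injuries when the driver's airbag deployed.</span>
  <span>READ MORE: </span>
  <span>Four female backpackers killed in horror highway crash</span>
  <span>He will be eligible for parole in early 2024.</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list, so removing nodes inside the loop is safe.
for marker in soup.select('span:-soup-contains("READ MORE")'):
    teaser = marker.find_next("span")  # the span right after the marker
    if teaser is not None:             # guard: marker could be the last span
        teaser.decompose()
    marker.decompose()

# Remaining spans, and the article text joined with single spaces.
remaining = [s.get_text(strip=True) for s in soup.select(".article__body-croppable span")]
text = soup.select_one(".article__body-croppable").get_text(" ", strip=True)
print(text)
```

Note that the :-soup-contains() pseudo-class needs a reasonably recent soupsieve (older releases only know the deprecated :contains() spelling).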
You could also iterate over your ResultSet and create a new list containing only the valid <span> elements, but I think that is not the best option:
# The i == 0 check stops results[i-1] from wrapping around to the last element
[x for i, x in enumerate(results) if 'READ MORE' not in x.text and (i == 0 or 'READ MORE' not in results[i-1].text)]
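If the list-filtering route is preferred, the same idea can be written as an explicit loop that needs no index arithmetic. This is only a sketch: drop_marker_pairs and its text_of parameter are hypothetical names introduced here, and the sample data is plain strings so the snippet runs without BeautifulSoup.

```python
def drop_marker_pairs(items, marker="READ MORE", text_of=str):
    """Return items without any marker entry or the entry right after it.

    text_of turns an item into the text to search (hypothetical hook;
    for bs4 tags something like `lambda tag: tag.get_text()` would fit).
    """
    kept = []
    skip_next = False
    for item in items:
        if marker in text_of(item):
            # Drop the marker itself and remember to drop its follower.
            skip_next = True
        elif skip_next:
            # This is the teaser that follows a marker: drop it too.
            skip_next = False
        else:
            kept.append(item)
    return kept


# Plain strings standing in for the text of each <span>.
spans = [
    "Two cars collided at low speed in Lurnea on February 25, 2019.",
    "READ MORE: ",
    "Long queues form at airports as one million Aussies set to fly this Easter",
    "The baby boy suffered fatal injuries when the driver's airbag deployed.",
    "READ MORE: ",
    "Four female backpackers killed in horror highway crash",
    "He will be eligible for parole in early 2024.",
]

cleaned = drop_marker_pairs(spans)
print(cleaned)
```

With a ResultSet from find_all('span'), the call would presumably look like drop_marker_pairs(results, text_of=lambda tag: tag.get_text()).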
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.