Beautiful Soup 4 find_all don't find links that Beautiful Soup 3 finds

I noticed a really annoying bug: BeautifulSoup4 (package: bs4) often finds less tags than the previous version (package: BeautifulSoup).

Here’s a reproductible instance of that issue:

import requests
import bs4
import BeautifulSoup

r = requests.get('http://wordpress.org/download/release-archive/')
s4 = bs4.BeautifulSoup(r.text)
s3 = BeautifulSoup.BeautifulSoup(r.text)

print 'With BeautifulSoup 4 : {}'.format(len(s4.findAll('a')))
print 'With BeautifulSoup 3 : {}'.format(len(s3.findAll('a')))

Output:

With BeautifulSoup 4 : 557
With BeautifulSoup 3 : 1701

The difference is not minor as you can see.

Here are the exact versions of the modules in case someone is wondering:

In [20]: bs4.__version__
Out[20]: '4.2.1'

In [21]: BeautifulSoup.__version__
Out[21]: '3.2.1'

Contents hide

Answers:

Method 1

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

You have lxml installed, which means that BeautifulSoup 4 will use that parser over the standard-library html.parser option.

You can upgrade lxml to 3.2.1 (which for me returns 1701 results for your test page); lxml itself uses libxml2 and libxslt which may be to blame too here. You may have to upgrade those instead / as well. See the lxml requirements page; currently libxml2 2.7.8 or newer is recommended.

Or explicitly specify the other parser when parsing the soup:

s4 = bs4.BeautifulSoup(r.text, 'html.parser')

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating