BeautifulSoup returns empty list when searching by compound class names

BeautifulSoup returns an empty list when I search by compound class names using a regex.

Example:

import re
from bs4 import BeautifulSoup

bs = """
    <a class="name-single name692" href="www.example.com">Example Text</a>
    """

bsObj = BeautifulSoup(bs, "html.parser")

# this returns the element
found_elements = bsObj.find_all("a", class_=re.compile(r"^(name-single.*)$"))

# this returns an empty list
found_elements = bsObj.find_all("a", class_=re.compile(r"^(name-single name\d*)$"))

I need the class selection to be very precise. Any ideas?

Answers:

Method 1

Unfortunately, when you match a regular expression against a class attribute value that contains multiple classes, BeautifulSoup applies the regular expression to each class separately.

This is all because class is a very special multi-valued attribute: every time you parse HTML, one of BeautifulSoup's tree builders (depending on the parser choice) internally splits the class string into a list of classes (quote from the HTMLTreeBuilder docstring):

# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'.  When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
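
To see why the anchored pattern from the question cannot succeed on a per-class basis, here is a minimal sketch that uses only the re module. It mimics the splitting step with str.split() rather than calling BeautifulSoup itself, and the variable names are illustrative. Some BeautifulSoup versions may additionally try the space-joined value, so treat this purely as an illustration of the per-class step:

```python
import re

# The class attribute from the question, as one string and as the list
# of separate classes that an HTML tree builder would produce.
class_value = "name-single name692"
class_list = class_value.split()

pattern = re.compile(r"^name-single name\d+$")

# Tested one class at a time, the anchored pattern never matches,
# because no single class contains a space.
per_class_hits = [c for c in class_list if pattern.search(c)]

# Tested against the whole attribute value, it matches.
whole_value_hit = pattern.search(" ".join(class_list))

print(per_class_hits)
print(bool(whole_value_hit))
```

This is why every workaround below either stops the splitting or re-joins the classes before applying the pattern.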

There are multiple workarounds, but here is a hack-ish one – we are going to ask BeautifulSoup not to handle class as a multi-valued attribute by making our simple custom tree builder:

import re

from bs4 import BeautifulSoup
from bs4.builder._htmlparser import HTMLParserTreeBuilder


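class MyBuilder(HTMLParserTreeBuilder):
    def __init__(self):
        super().__init__()

        # BeautifulSoup, please don't treat "class" specially.
        # Work on a copy so the defaults shared with other builders stay untouched.
        attrs = {tag: set(names) for tag, names in self.cdata_list_attributes.items()}
        attrs["*"].discard("class")
        self.cdata_list_attributes = attrs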


bs = """<a class="name-single name692" href="www.example.com">Example Text</a>"""
bsObj = BeautifulSoup(bs, "html.parser", builder=MyBuilder())
found_elements = bsObj.find_all("a", class_=re.compile(r"^name-single name\d+$"))

print(found_elements)

In this case the regular expression would be applied to a class attribute value as a whole.
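
If your BeautifulSoup version supports it, the same effect may be available without a custom builder: the documentation describes a multi_valued_attributes keyword (in 4.8 and later, as far as I know) that can be set to None in the constructor. A minimal sketch under that assumption, reusing the markup from the question:

```python
import re

from bs4 import BeautifulSoup

html = '<a class="name-single name692" href="www.example.com">Example Text</a>'

# multi_valued_attributes=None asks the tree builder to leave every
# attribute, including class, as a plain string (assumes a bs4 release
# that accepts this keyword).
soup = BeautifulSoup(html, "html.parser", multi_valued_attributes=None)

class_value = soup.a["class"]
found = soup.find_all("a", class_=re.compile(r"^name-single name\d+$"))

print(repr(class_value))
print(found)
```

Note that this switches off list handling for all multi-valued attributes (rel, headers and so on), not only class.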

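Alternatively, you can parse the markup with the xml features enabled (if this is applicable; it requires lxml), since attributes are not treated as multi-valued when a document is parsed as XML: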

soup = BeautifulSoup(data, "xml")

You can also use CSS selectors to match every a element that has the name-single class or whose class attribute starts with "name":

soup.select("a.name-single,a[class^=name]")

You can then apply the regular expression manually if needed:

pattern = re.compile(r"^name-single name\d+$")
for elm in bsObj.select("a.name-single,a[class^=name]"):
    match = pattern.match(" ".join(elm["class"]))
    if match:
        print(elm)

Method 2

For this use case I would simply use a custom filter, like so:

import re

from bs4 import BeautifulSoup

CLASS_PATTERN = re.compile(r"^name-single name\d+$")

def myclassfilter(tag):
    # Join the class list back into one string; tags without a class
    # attribute are skipped instead of raising KeyError
    return CLASS_PATTERN.search(" ".join(tag.get("class", []))) is not None

bs = """<a class="name-single name692" href="www.example.com">Example Text</a>"""
bsObj = BeautifulSoup(bs, "html.parser")
found_elements = bsObj.find_all(myclassfilter)

print(found_elements)
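
If the order of the classes in the markup is not guaranteed, joining them and matching one anchored pattern can miss class="name692 name-single". One possible variation, sketched here with a hypothetical helper named classes_match (not part of BeautifulSoup), checks the classes as a collection instead:

```python
import re

# Matches a class made of "name" followed by one or more digits.
NUMBERED_NAME = re.compile(r"^name\d+$")


def classes_match(classes):
    """Return True for exactly two classes: name-single plus name<digits>.

    `classes` is the list stored under tag["class"], or None/empty when
    the tag has no class attribute. The order of the classes is ignored.
    """
    classes = list(classes or [])
    if len(classes) != 2 or "name-single" not in classes:
        return False
    return any(NUMBERED_NAME.match(c) for c in classes)


print(classes_match(["name-single", "name692"]))
print(classes_match(["name692", "name-single"]))
print(classes_match(["name-single", "name692", "extra"]))
```

With the default tree builders it could be plugged into a search as bsObj.find_all(lambda tag: classes_match(tag.get("class"))).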

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
