When you use BeautifulSoup to scrape a certain part of a website, you can use
soup.find() and soup.findAll(), or soup.select().
Is there a difference between the .find() and the .select() methods?
(e.g. in performance or flexibility) Or are they the same?
Answers:
Method 1
To summarise the comments:
- select finds multiple instances and returns a list, while find returns only the first match, so they don’t do the same thing. select_one is the equivalent of find.
- I almost always use CSS selectors when chaining tags or using tag.classname; if I am looking for a single element without a class, I use find. Essentially it comes down to the use case and personal preference.
- As far as flexibility goes, I think you know the answer: soup.select("div[id=foo] > div > div > div[class=fee] > span > span > a") would look pretty ugly using multiple chained find/find_all calls.
- The only issue with the CSS selectors in bs4 is the very limited support: nth-of-type is the only pseudo-class implemented, and chaining attributes like a[href][src] is not supported, nor are many other parts of CSS selectors. (This applied when the answer was written; since bs4 4.7, select is backed by the soupsieve package, which supports most CSS selectors.) But things like a[href*=..], a[href^=..], a[href$=..] etc. are, I think, much nicer than find("a", href=re.compile(....)), but again that is personal preference.
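The points in the list above can be illustrated with a small sketch. The markup, the id foo and the class fee are made up for this example, and html.parser is used only to avoid depending on lxml; it assumes a reasonably recent bs4 is installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented purely for this illustration.
HTML = """
<div id="foo">
  <div class="fee">
    <span><a href="https://example.com/a.pdf">first</a></span>
    <span><a href="/local/b.html">second</a></span>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find returns the first match (or None); select_one is its CSS counterpart.
first_by_find = soup.find("a")
first_by_select = soup.select_one("a")

# find_all and select both return every match, in document order.
all_by_find = soup.find_all("a")
all_by_select = soup.select("a")

# A nested lookup: one selector versus chained find/find_all calls.
nested_select = soup.select("div#foo > div.fee > span > a")
nested_find = soup.find("div", id="foo").find("div", class_="fee").find_all("a")

# Attribute prefix match: a CSS operator versus a compiled regex.
https_select = soup.select('a[href^="https://"]')
https_find = soup.find_all("a", href=re.compile(r"^https://"))

print(first_by_find == first_by_select)
print([a.text for a in nested_select])
print([a["href"] for a in https_select])
```

Each pair of lookups returns the same elements here, so in a case like this the choice is mostly about which form reads better.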
For performance we can run some tests. I modified the code from an answer here and ran it on 800+ html files taken from here. It is not exhaustive, but it should give a clue to the readability of some of the options and to the performance:
The modified functions are:
from bs4 import BeautifulSoup
from glob import iglob


def parse_find(soup):
    # Same lookups as parse_select, written with find/find_all.
    author = soup.find("h4", class_="h12 talk-link__speaker").text
    title = soup.find("h4", class_="h9 m5").text
    date = soup.find("span", class_="meta__val").text.strip()
    soup.find("footer", class_="footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text.split(":")
    soup.find_all("span", class_="talk-transcript__fragment")


def parse_select(soup):
    # Same lookups as parse_find, written with CSS selectors.
    author = soup.select_one("h4.h12.talk-link__speaker").text
    title = soup.select_one("h4.h9.m5").text
    date = soup.select_one("span.meta__val").text.strip()
    soup.select_one("footer.footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text
    soup.select("span.talk-transcript__fragment")


def test(patt, func):
    # Parse every file matching the glob pattern and run func on it.
    for html in iglob(patt):
        with open(html) as f:
            func(BeautifulSoup(f, "lxml"))
Now for the timings:
In [7]: from testing import test, parse_find, parse_select
In [8]: timeit test("./talks/*.html",parse_find)
1 loops, best of 3: 51.9 s per loop
In [9]: timeit test("./talks/*.html",parse_select)
1 loops, best of 3: 32.7 s per loop
Like I said, this is not exhaustive, but on this test the CSS selector version was clearly faster (32.7 s versus 51.9 s per loop).
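The timings above include parsing every file, since test builds the soup inside the loop. To compare only the lookups, a minimal sketch along these lines may help. The page is synthetic, the fragment class and the with_find/with_select helpers are hypothetical names, and the numbers will vary with the bs4 version and the machine, so treat the output as indicative only.

```python
import timeit
from bs4 import BeautifulSoup

# Synthetic page, invented for this sketch: 200 transcript-like fragments.
ROWS = "".join(
    f'<p><span class="fragment">line {i}</span></p>' for i in range(200)
)
HTML = f'<html><body><h4 class="h9 m5">Title</h4>{ROWS}</body></html>'

# Parse once, outside the timed calls, so only the queries are measured.
soup = BeautifulSoup(HTML, "html.parser")


def with_find(soup):
    # Lookups written with find/find_all.
    title = soup.find("h4", class_="h9 m5").text
    fragments = soup.find_all("span", class_="fragment")
    return title, len(fragments)


def with_select(soup):
    # The same lookups written with CSS selectors.
    title = soup.select_one("h4.h9.m5").text
    fragments = soup.select("span.fragment")
    return title, len(fragments)


find_seconds = timeit.timeit(lambda: with_find(soup), number=50)
select_seconds = timeit.timeit(lambda: with_select(soup), number=50)

print(f"find:   {find_seconds:.4f} s for 50 runs")
print(f"select: {select_seconds:.4f} s for 50 runs")
```

Both helpers return the same title and fragment count, so any difference in the printed times comes from the lookup method rather than from the result.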
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.