I would like to scrape the fund price and date from the following URL: https://www.blackrock.com/sg/en/products/241427/blackrock-cont-european-flexible-d4rf-gbp-fund
and put these values in a table:
Date         Price
21-Oct-2021  36.68
However, in the HTML source there are several <span> elements that share the same class:
<span class="header-nav-label navAmount"> NAV as of 21-Oct-2021 </span>
<span class="header-nav-data"> GBP 36.68 </span>
<span class="header-nav-data"> 0.10 (0.27%) </span>
But I only want to pick the first span of that class, the one with the daily price in it.
I’ve tried the following code:
from bs4 import BeautifulSoup
import requests

# Create url list
urls = ['https://www.blackrock.com/sg/en/products/241427/blackrock-cont-european-flexible-d4rf-gbp-fund']

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}

# Build the scraping loop
for url in urls:
    # Extract HTML element (daily price and date) from url
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    spans = soup.findAll('span', {'class':'header-nav-data'})
    for span in spans:
        print (span.text)
    spans1 = soup.findAll('span', {'class':'header-nav-label navAmount'})
    print (spans1)
which returns:
GBP 36.68
0.10 (0.27%)
[<span class="header-nav-label navAmount"> NAV as of 21-Oct-2021 </span>]
Do you know what I need to do to select only the first <span> with that class, as I’m only interested in the price? I’m new to Python, so I would greatly appreciate the help. Thanks!
Answers:
Method 1
You could also pull out the JSON that is embedded in the HTML:
import requests
import re
import json
import pandas as pd

url = 'https://www.blackrock.com/sg/en/products/241427/blackrock-cont-european-flexible-d4rf-gbp-fund'
response = requests.get(url)

# Grab the JavaScript array assigned to navData
regex = r"(var navData = )(\[.*)(;)"
jsonStr = re.search(regex, response.text).groups()[1]

# Keep only "x:" plus the numeric value, dropping the Date.UTC(...) and Number(...) wrappers
jsonStr = re.sub(r"((x:)(Date.UTC\(\d{4},\d{1,2},\d{1,2}\)),y:Number\({1,2})([\d.]*)([\).\s\w\(]*)", r"\2\4", jsonStr)

# Quote the keys so the string is valid JSON
jsonStr = jsonStr.replace('x:','"y":')
jsonStr = jsonStr.replace('formattedX:','"Date":')

jsonData = json.loads(jsonStr)
df = pd.DataFrame(jsonData)
df = df[['Date','y']]
Output (to see only the most recent row, use print(df.tail(1))):
print(df)
                  Date      y
0     Thu, 13 Sep 2012   9.81
1     Fri, 14 Sep 2012  10.07
2     Mon, 17 Sep 2012  10.02
3     Tue, 18 Sep 2012   9.94
4     Wed, 19 Sep 2012   9.96
...                ...    ...
2275  Fri, 15 Oct 2021  36.30
2276  Mon, 18 Oct 2021  36.43
2277  Tue, 19 Oct 2021  36.48
2278  Wed, 20 Oct 2021  36.58
2279  Thu, 21 Oct 2021  36.68

[2280 rows x 2 columns]
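The clean-up step in this method can be tried offline on a small sample. The sketch below assumes the page embeds entries shaped like `{x:Date.UTC(2021,9,21),y:Number((36.68).toFixed(2)),formattedX:"Thu, 21 Oct 2021"}`. That shape is inferred from the regular expressions in the answer, not confirmed against the live page, and `sample_html` and `nav_rows` are hypothetical names.

```python
import json
import re

# Hypothetical sample: the shape of navData is an assumption inferred from
# the regular expressions in this method, not copied from the live page.
sample_html = (
    'var navData = ['
    '{x:Date.UTC(2021,9,20),y:Number((36.58).toFixed(2)),formattedX:"Wed, 20 Oct 2021"},'
    '{x:Date.UTC(2021,9,21),y:Number((36.68).toFixed(2)),formattedX:"Thu, 21 Oct 2021"}'
    '];'
)


def nav_rows(html):
    """Return a list of {"y": ..., "Date": ...} dicts from the embedded array."""
    # Capture everything from the opening bracket up to the last semicolon.
    array_text = re.search(r"var navData = (\[.*);", html).group(1)
    # Collapse 'x:Date.UTC(...),y:Number((36.68).toFixed(2))' to '"y":36.68'.
    array_text = re.sub(
        r"x:Date\.UTC\(\d{4},\d{1,2},\d{1,2}\),y:Number\({1,2}([\d.]*)[\).\s\w\(]*",
        r'"y":\1',
        array_text,
    )
    # Quote the remaining bare key so the text is valid JSON.
    array_text = array_text.replace("formattedX:", '"Date":')
    return json.loads(array_text)


rows = nav_rows(sample_html)
print(rows[-1])  # the most recent entry in the sample
```

If the real page uses a different layout for `navData`, the patterns would need adjusting, so it is worth printing the raw match before parsing it.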
Method 2
You can pass limit=1 to findAll (see the Beautiful Soup documentation) so that it stops after the first match:
from bs4 import BeautifulSoup
import requests

# Create url list
urls = ['https://www.blackrock.com/sg/en/products/241427/blackrock-cont-european-flexible-d4rf-gbp-fund']

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}

# Build the scraping loop
for url in urls:
    # Extract HTML element (daily price and date) from url
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")

    # All matching spans
    spans = soup.findAll('span', {'class':'header-nav-data'})
    print(spans)
    print('----------------------------')

    # Only the first matching span (still returned as a list)
    spans = soup.findAll('span', {'class':'header-nav-data'}, limit=1)
    print(spans)
    print('---------------------')
    print(spans[0].text)
    # or
    for span in spans:
        print (span.text)

    spans1 = soup.findAll('span', {'class':'header-nav-label navAmount'})
    print (spans1)
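When only the first match is wanted, `find` is a shorter route than `find_all(..., limit=1)`: it returns the first matching tag itself (or `None`) rather than a one-element list. Here is a minimal sketch on the HTML fragment quoted in the question, so it runs without a network request. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Static copy of the fragment from the question, used instead of a live request.
html = """
<span class="header-nav-label navAmount"> NAV as of 21-Oct-2021 </span>
<span class="header-nav-data"> GBP 36.68 </span>
<span class="header-nav-data"> 0.10 (0.27%) </span>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all with limit=1 still returns a list-like result with one element.
first_as_list = soup.find_all("span", {"class": "header-nav-data"}, limit=1)

# find returns the first matching tag directly, or None if nothing matches.
first_tag = soup.find("span", {"class": "header-nav-data"})

# strip=True removes the surrounding whitespace from the span text.
price_text = first_tag.get_text(strip=True)
print(price_text)
```

On a live page it is safer to check `first_tag is not None` before calling `get_text`, since the markup may change.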
Method 3
Based on your question, here is a working solution using CSS selectors.
Code:
from bs4 import BeautifulSoup
import requests

# Create url list
urls = ['https://www.blackrock.com/sg/en/products/241427/blackrock-cont-european-flexible-d4rf-gbp-fund']

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}

# Build the scraping loop
for url in urls:
    # Extract HTML element (daily price and date) from url
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    spans1 = soup.select_one('ul.values-list li span:nth-child(1)').get_text(strip=True).replace('NAV as of', ' ')
    spans2 = soup.select_one('ul.values-list li span:nth-child(2)').get_text(strip=True).replace('GBP', ' ')
    print('Date:'+spans1)
    print('Price:' +spans2)
Output:
Date: 21-Oct-2021
Price: 36.68
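To get from the scraped strings to the Date/Price table asked for in the question, the label and the currency code can be split off with regular expressions. This is a sketch using only the standard library, assuming the two strings look like the ones quoted in the question. `parse_nav` is a hypothetical helper name.

```python
import re
from datetime import datetime


def parse_nav(label_text, price_text):
    """Turn 'NAV as of 21-Oct-2021' and 'GBP 36.68' into (date, currency, price)."""
    date_match = re.search(r"\d{1,2}-[A-Za-z]{3}-\d{4}", label_text)
    price_match = re.search(r"([A-Z]{3})\s*([\d.,]+)", price_text)
    if not date_match or not price_match:
        raise ValueError("unexpected NAV text")
    # %b expects English month abbreviations under the default (C) locale.
    nav_date = datetime.strptime(date_match.group(), "%d-%b-%Y").date()
    # Drop thousands separators, if any, before converting to float.
    price = float(price_match.group(2).replace(",", ""))
    return nav_date, price_match.group(1), price


nav_date, currency, price = parse_nav(" NAV as of 21-Oct-2021 ", " GBP 36.68 ")

# Print a two-column table like the one in the question.
print(f"{'Date':<12}{'Price':>8}")
print(f"{nav_date.strftime('%d-%b-%Y'):<12}{price:>8.2f}")
```

Keeping the date as a `date` object and the price as a `float` makes it easier to append rows to a list or a DataFrame later.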
Method 4
Try using this:
soup.find_all('span', class_='header-nav-label navAmount')
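Because `class` is a reserved word in Python, Beautiful Soup takes the keyword argument as `class_`. A string holding two class names is compared against the exact attribute value, so a CSS selector is the more forgiving option when the class order might differ. A small sketch on the fragment from the question:

```python
from bs4 import BeautifulSoup

html = '<span class="header-nav-label navAmount"> NAV as of 21-Oct-2021 </span>'
soup = BeautifulSoup(html, "html.parser")

# Exact match on the full class attribute string.
exact = soup.find_all("span", class_="header-nav-label navAmount")

# Same two names in the other order: the string comparison no longer matches.
swapped = soup.find_all("span", class_="navAmount header-nav-label")

# A CSS selector tests each class separately, whatever the order.
by_css = soup.select("span.navAmount.header-nav-label")

print(len(exact), len(swapped), len(by_css))
```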
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.