I am trying to extract 2 sets of data from:
“https://www.kucoin.com/news/categories/listing”
using a python script and drop it into a list or dictionary. I’ve tried Selenium and BeautifulSoup as well as request.
All of them return an empty: [] or None. I’ve been at this all day with no success. I have tried to use the full xpath as well to try to index the location of the text, which had the same result. Any help at this point would be much appreciated.
########################################################## from bs4 import BeautifulSoup import requests url = requests.get('https://www.kucoin.com/news/categories/listing') soup = BeautifulSoup(url.text, features="lxml") listing = soup.find(class_='mainTitle___mbpq1') print(listing) ########################################################### import requests from lxml import html def main(): url = "https://www.kucoin.com/news/categories/listing" page = requests.get(url) tree = html.fromstring(page.content) text_val = tree.xpath('//div[@class="item___2ffLg"]') print(text_val) ###########################################################
1st text between ‘(‘ ‘)’, 2nd text is Date/Time after ‘Trade: ‘
(The only way i was even able to get the page in a text format that actually contains the part of the page i’m looking for, is by saving it as an *.mhtml format manually.)
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
As already explained, the data is loaded by an API. You can use the same to extract the details using requests
.
Have only tried for page 1
.
import requests response = requests.get("https://www.kucoin.com/_api/cms/articles?page=1&pageSize=10&category=listing&lang=en_US") jsoncode = response.json() options = jsoncode['items'] for i in range(len(options)): title = options[i]['title'] date = options[i]['summary'] print(f"{title} : {date}")
Cryowar (CWAR) Gets Listed on KuCoin! World Premiere! : Trading: 14:00 on November 12, 2021 (UTC) Deeper Network (DPR) Gets Listed on KuCoin! : Trading: 06:00 on November 12, 2021 (UTC) Vectorspace AI (VXV) Gets Listed on KuCoin! : Trading: 8:00 on November 12, 2021 (UTC) ...
Method 2
Go to Chrome Developer Mode and Refresh your site and now go to Network Tab Left side you will get search option just paste first Crypto War…. line in that
Now you will get URL which is used to reflect data in webpage you can click on headers to get URL and copy that and call it using requests
module which returns json
response
res=requests.get("https://www.kucoin.com/_api/cms/articles?page=1&pageSize=10&category=listing&lang=en_US") res.json()
Output:
{'success': True, 'code': 200, 'msg': 'success', 'timestamp': 1636695390265, 'totalNum': 461, 'items': [{'id': 10358, 'title': 'Cryowar (CWAR) Gets Listed on KuCoin! World Premiere!', 'summary': 'Trading: 14:00 on November 12, 2021 (UTC)', ...
Image:
Method 3
I checked the response from the request.get method and saw that the initial source code is plain javascript. You have to wait for its execution to finish to parse the final rendered html. If you are fine with using selenium as you have tried with that, here is a solution I came up with to get the first element. Adjust the timeout based on your internet connection speed
from selenium import webdriver from selenium.webdriver.common.keys import Keys from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.common.by import By from selenium.webdriver.support import expected_conditions as EC driver = webdriver.Firefox() driver.get("https://www.kucoin.com/news/categories/listing") try: elem = WebDriverWait(driver, 5).until( EC.presence_of_element_located((By.CLASS_NAME, "info___1vA3W")) ) title = elem.find_element_by_tag_name("a") date_desc = elem.find_element_by_tag_name("p") title_text = title.text date_text = date_desc.text print(title_text, date_text) finally: driver.quit()
Generic approach: Wait for all elements to be visible and cycle through them to print the elements. You can refer to this for running headless chrome so it doesn’t open browser window. Added regex search for convenience
import re from selenium import webdriver from selenium.webdriver.common.keys import Keys from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.firefox.options import Options from selenium.webdriver.common.by import By from selenium.webdriver.support import expected_conditions as EC options = Options() options.headless = True driver = webdriver.Firefox(options = options) driver.get("https://www.kucoin.com/news/categories/listing") try: elements = WebDriverWait(driver, 5).until( EC.presence_of_all_elements_located((By.XPATH, "//div[contains(@class, 'info___')]")) ) for el in elements: title = el.find_element_by_tag_name("a") date_desc = el.find_element_by_tag_name("p") title_text = re.search(r'(?<=()[A-Z]+', title.text, re.I)[0] date_text = re.search(r'(?<=Trading: ).+', date_desc.text)[0] print(f'Title: {title_text}; Date: {date_text}') finally: driver.quit()
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0