I would like to scrape the titles and authors of journal articles from all staff members’ official web pages, e.g.
https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah
The specific part I’m trying to access is the list of journal articles on that page.
I’m following this guide: https://www.datacamp.com/community/tutorials/r-web-scraping-rvest
but it refers to HTML tags which this site doesn’t have. Can anyone point me in the right direction, please?
Answers:
Method 1
The page loads these citations dynamically using an XHR call that returns a JSON object. In this case, we can replicate the query and parse the JSON ourselves to get the publication list:
library(httr)
library(rvest)
library(jsonlite)
url <- paste0("https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?",
"uniqueid=00970757",
"&tries=0",
"&hash=f6a214dc99686895d6bf3de25507356f",
"&citationStyle=1")
GET(url) %>%
content("text") %>%
fromJSON() %>%
`[[`("publications") %>%
`[[`("journal_article") %>%
lapply(function(x) paste(x$authors, x$title, x$journal, sep = " ; ")) %>%
unlist() %>%
as.character()
#> [1] "Adu-Amankwah S, Zajac M, Skocek J, Nemecek J, Haha MB, Black L ; Combined influence of carbonation and leaching on freeze-thaw resistance of limestone ternary cement concrete ; Construction and Building Materials"
#> [2] "Wang H, Hou P, Li Q, Adu-Amankwah S, Chen H, Xie N, Zhao P, Huang Y, Wang S, Cheng X ; Synergistic effects of supplementary cementitious materials in limestone and calcined clay-replaced slag cement ; Construction and Building Materials"
#> [3] "Shamaki M, Adu-Amankwah S, Black L ; Reuse of UK alum water treatment sludge in cement-based materials ; Construction and Building Materials"
#> [4] "Adu-Amankwah S, Bernal Lopez S, Black L ; Influence of component fineness on hydration and strength development in ternary slag-limestone cements ; RILEM Technical Letters"
#> [5] "Adu-Amankwah S, Zajac M, Skocek J, Ben Haha M, Black L ; Relationship between cement composition and the freeze-thaw resistance of concretes ; Advances in Cement Research"
#> [6] "Zajac M, Skocek J, Adu-Amankwah S, Black L, Ben Haha M ; Impact of microstructure on the performance of composite cements: Why higher total porosity can result in higher strength ; Cement and Concrete Composites"
#> [7] "Adu-Amankwah S, Black L, Skocek J, Ben Haha M, Zajac M ; Effect of sulfate additions on hydration and performance of ternary slag-limestone composite cements ; Construction and Building Materials"
#> [8] "Adu-Amankwah S, Zajac M, Stabler C, Lothenbach B, Black L ; Influence of limestone on the hydration of ternary slag cement ; Cement and Concrete Research"
#> [9] "Adu-Amankwah S, Khatib JM, Searle DE, Black L ; Effect of synthesis parameters on the performance of alkali-activated non-conformant EN 450 pulverised fuel ash ; Construction and Building Materials"
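If separate columns are more useful than one pasted string per article, the same parsed list can be turned into a data frame. The following is a minimal sketch that runs offline: `sample_json` and its contents are made-up placeholders, and its shape (`publications`, then `journal_article`, then one object per article with `authors`, `title` and `journal`) is only what the code above implies, not a confirmed description of the real response.

```r
library(jsonlite)

# Hypothetical sample mirroring the shape implied by the code above;
# the real response may contain more fields or differ in detail.
sample_json <- '{
  "publications": {
    "journal_article": {
      "a1": {"authors": "Author A, Author B", "title": "First title", "journal": "Journal X"},
      "a2": {"authors": "Author C", "title": "Second title", "journal": "Journal Y"}
    }
  }
}'

articles <- fromJSON(sample_json)$publications$journal_article

# One row per article, with authors, title and journal kept in separate columns
pubs <- do.call(rbind, lapply(articles, function(x) {
  data.frame(authors = x$authors, title = x$title, journal = x$journal,
             stringsAsFactors = FALSE)
}))
rownames(pubs) <- NULL
print(pubs)
```

With the live data, `articles` would instead come from the `GET(url)` pipeline, stopping after the `journal_article` step.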
Update
It is possible to get the json url from the html of the faculty member’s homepage with a bit of text parsing:
get_json_request <- function(url)
{
carveout <- function(string, start, end)
{
string %>% strsplit(start) %>% `[[`(1) %>% `[`(2) %>%
strsplit(end) %>% `[[`(1) %>% `[`(1)
}
params <- GET(url) %>%
content("text") %>%
carveout("var dataGetQuery = ", ";")
id <- carveout(params, "uniqueid: '", "'")
tries <- carveout(params, "tries: ", ",")
hash <- carveout(params, "hash: '", "'")
citationStyle <- carveout(params, "citationStyle: ", "\n")
paste0("https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?",
"uniqueid=", id,
"&tries=", tries,
"&hash=", hash,
"&citationStyle=", citationStyle)
}
Which allows:
url <- "https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah"
get_json_request(url)
#> [1] "https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?uniqueid=00970757&tries=0&hash=f7266eb42b24715cfdf2851f24b229c6&citationStyle=1"
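To see what the text parsing does without touching the network, the helper can be tried on a small string. This is a sketch only: the `page` snippet is a hypothetical stand-in for the script block in the staff page (its exact layout is an assumption), the hash is a dummy value, and this variant of `carveout` uses base R with `fixed = TRUE` so the delimiters are matched literally rather than as regular expressions.

```r
# Variant of carveout(): take the text between the first `start` and the next `end`
carveout <- function(string, start, end) {
  after_start <- strsplit(string, start, fixed = TRUE)[[1]][2]
  strsplit(after_start, end, fixed = TRUE)[[1]][1]
}

# Hypothetical snippet shaped like the object the function searches for
page <- "var dataGetQuery = {\n  uniqueid: '00970757',\n  tries: 0,\n  hash: 'abc123',\n  citationStyle: 1\n};"

params <- carveout(page, "var dataGetQuery = ", ";")

id            <- carveout(params, "uniqueid: '", "'")
tries         <- carveout(params, "tries: ", ",")
hash          <- carveout(params, "hash: '", "'")
citationStyle <- carveout(params, "citationStyle: ", "\n")

print(c(id = id, tries = tries, hash = hash, citationStyle = citationStyle))
```

Each value comes back as a character string, which is what `paste0` needs when the query URL is assembled.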
And, if you want to be able to just lapply a vector of homepage urls to get the final publication list:
publications_from_homepage <- function(url)
{
get_json_request(url) %>%
GET() %>%
content("text") %>%
fromJSON() %>%
`[[`("publications") %>%
`[[`("journal_article") %>%
lapply(function(x) paste(x$authors, x$title, x$journal, sep = " ; ")) %>%
unlist() %>%
as.character()
}
So you have:
publications_from_homepage(url)
#> [1] "Adu-Amankwah S, Zajac M, Skocek J, Nemecek J, Haha MB, Black L ; Combined influence of carbonation and leaching on freeze-thaw resistance of limestone ternary cement concrete ; Construction and Building Materials"
#> [2] "Wang H, Hou P, Li Q, Adu-Amankwah S, Chen H, Xie N, Zhao P, Huang Y, Wang S, Cheng X ; Synergistic effects of supplementary cementitious materials in limestone and calcined clay-replaced slag cement ; Construction and Building Materials"
#> [3] "Shamaki M, Adu-Amankwah S, Black L ; Reuse of UK alum water treatment sludge in cement-based materials ; Construction and Building Materials"
#> [4] "Adu-Amankwah S, Bernal Lopez S, Black L ; Influence of component fineness on hydration and strength development in ternary slag-limestone cements ; RILEM Technical Letters"
#> [5] "Adu-Amankwah S, Zajac M, Skocek J, Ben Haha M, Black L ; Relationship between cement composition and the freeze-thaw resistance of concretes ; Advances in Cement Research"
#> [6] "Zajac M, Skocek J, Adu-Amankwah S, Black L, Ben Haha M ; Impact of microstructure on the performance of composite cements: Why higher total porosity can result in higher strength ; Cement and Concrete Composites"
#> [7] "Adu-Amankwah S, Black L, Skocek J, Ben Haha M, Zajac M ; Effect of sulfate additions on hydration and performance of ternary slag-limestone composite cements ; Construction and Building Materials"
#> [8] "Adu-Amankwah S, Zajac M, Stabler C, Lothenbach B, Black L ; Influence of limestone on the hydration of ternary slag cement ; Cement and Concrete Research"
#> [9] "Adu-Amankwah S, Khatib JM, Searle DE, Black L ; Effect of synthesis parameters on the performance of alkali-activated non-conformant EN 450 pulverised fuel ash ; Construction and Building Materials"
Created on 2021-11-04 by the reprex package (v2.0.0)
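When running this over many staff pages, it may help to pause between requests and to keep going if one page fails. The wrapper below is a sketch, not part of the original answer: `scrape_all`, its `delay` argument and the `fake_scraper` stand-in are hypothetical names introduced here, and the scraper is passed in as an argument so the example runs without network access.

```r
# Apply a scraper function to each url, pausing between calls and
# returning NA for any page that raises an error.
scrape_all <- function(urls, scraper, delay = 1) {
  results <- lapply(urls, function(u) {
    Sys.sleep(delay)  # be polite to the server
    tryCatch(scraper(u), error = function(e) NA_character_)
  })
  names(results) <- urls
  results
}

# Offline demonstration with a stand-in scraper that fails for one input
fake_scraper <- function(u) {
  if (grepl("bad", u)) stop("page could not be parsed")
  paste("publications for", u)
}

res <- scrape_all(c("good", "bad"), fake_scraper, delay = 0)
print(res)
```

In real use the call would look like `scrape_all(staff_urls, publications_from_homepage)`, where `staff_urls` is your own character vector of homepage addresses.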
Method 2
Here is an RSelenium approach:
library(RSelenium)
library(rvest)
library(xml2)

# set up driver, client and server
driver <- rsDriver(
  browser = "firefox",
  port = 4545L,
  verbose = FALSE
)
server <- driver$server
browser <- driver$client

# go to url in browser
browser$navigate("https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah")

# get all relevant titles
doc <- xml2::read_html(browser$getPageSource()[[1]])
df <- data.frame(
  title = xml2::xml_find_all(doc, '//span[@class="title-with-parent"]') %>%
    xml2::xml_text()
)

# close everything down properly
browser$close()
server$stop()
# needed, else the port 4545 stays occupied by the java process
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
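The extraction step itself does not depend on Selenium: once the rendered HTML is available as a string, the XPath query can be tried on its own. The sketch below uses a made-up HTML fragment in place of `browser$getPageSource()[[1]]`; the surrounding markup is an assumption, and only the `title-with-parent` class is taken from the answer.

```r
library(xml2)

# Hypothetical fragment standing in for the rendered page source
html <- paste0(
  '<html><body><ul>',
  '<li><span class="title-with-parent">First title</span></li>',
  '<li><span class="title-with-parent">Second title</span></li>',
  '<li><span class="other">Not a title</span></li>',
  '</ul></body></html>'
)

doc <- read_html(html)

# Select every span whose class attribute is exactly "title-with-parent"
titles <- xml_text(xml_find_all(doc, '//span[@class="title-with-parent"]'))
print(titles)
```

Note that `@class="..."` matches the whole attribute value, so a span carrying additional classes would not be selected by this expression.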
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.