Scrape journal article title from staff web-page

I would like to scrape the titles and authors of journal articles from all staff members' official web pages, e.g.

https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah

The specific part in question that I’m trying to access is this:

[Screenshot of the relevant part of the staff page]

I’m following this guide: https://www.datacamp.com/community/tutorials/r-web-scraping-rvest but it refers to HTML tags which this site doesn’t have. Can anyone point me in the right direction please?

Answers:

Method 1

The page loads these citations dynamically through an XHR call that returns a JSON object, which is why the tags you are looking for are not in the static HTML. In this case, we can replicate the request and parse the JSON ourselves to get the publication list:

library(httr)
library(rvest)
library(jsonlite)

url <- paste0("https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?",
       "uniqueid=00970757",
       "&tries=0", 
       "&hash=f6a214dc99686895d6bf3de25507356f", 
       "&citationStyle=1")

GET(url) %>% 
  content("text") %>%
  fromJSON() %>%
  `[[`("publications") %>%
  `[[`("journal_article") %>%
  lapply(function(x) paste(x$authors, x$title, x$journal, sep = " ; ")) %>%
  unlist() %>%
  as.character()
#> [1] "Adu-Amankwah S, Zajac M, Skocek J, Nemecek J, Haha MB, Black L ; Combined influence of carbonation and leaching on freeze-thaw resistance of limestone ternary cement concrete ; Construction and Building Materials"                        
#> [2] "Wang H, Hou P, Li Q, Adu-Amankwah S, Chen H, Xie N, Zhao P, Huang Y, Wang S, Cheng X ; Synergistic effects of supplementary cementitious materials in limestone and calcined clay-replaced slag cement ; Construction and Building Materials"
#> [3] "Shamaki M, Adu-Amankwah S, Black L ; Reuse of UK alum water treatment sludge in cement-based materials ; Construction and Building Materials"                                                                                                
#> [4] "Adu-Amankwah S, Bernal Lopez S, Black L ; Influence of component fineness on hydration and strength development in ternary slag-limestone cements ; RILEM Technical Letters"                                                                 
#> [5] "Adu-Amankwah S, Zajac M, Skocek J, Ben Haha M, Black L ; Relationship between cement composition and the freeze-thaw resistance of concretes ; Advances in Cement Research"                                                                  
#> [6] "Zajac M, Skocek J, Adu-Amankwah S, Black L, Ben Haha M ; Impact of microstructure on the performance of composite cements: Why higher total porosity can result in higher strength ; Cement and Concrete Composites"                         
#> [7] "Adu-Amankwah S, Black L, Skocek J, Ben Haha M, Zajac M ; Effect of sulfate additions on hydration and performance of ternary slag-limestone composite cements ; Construction and Building Materials"                                         
#> [8] "Adu-Amankwah S, Zajac M, Stabler C, Lothenbach B, Black L ; Influence of limestone on the hydration of ternary slag cement ; Cement and Concrete Research"                                                                                   
#> [9] "Adu-Amankwah S, Khatib JM, Searle DE, Black L ; Effect of synthesis parameters on the performance of alkali-activated non-conformant EN 450 pulverised fuel ash ; Construction and Building Materials"
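
If you would rather keep authors, title and journal as separate columns instead of one pasted string, something like the following may help. It is only a sketch: `records_to_df` is a hypothetical helper, and it assumes the parsed `journal_article` element is a list of records with `authors`, `title` and `journal` fields, as the code above implies. The two records are copied from the output above so the example runs without a network call.

```r
# Stand-in for the parsed `journal_article` element (structure assumed):
# a list of records, each with authors, title and journal.
records <- list(
  list(authors = "Shamaki M, Adu-Amankwah S, Black L",
       title   = "Reuse of UK alum water treatment sludge in cement-based materials",
       journal = "Construction and Building Materials"),
  list(authors = "Adu-Amankwah S, Bernal Lopez S, Black L",
       title   = "Influence of component fineness on hydration and strength development in ternary slag-limestone cements",
       journal = "RILEM Technical Letters")
)

# Hypothetical helper: one row per article, one column per field.
records_to_df <- function(records) {
  fields <- c("authors", "title", "journal")
  rows <- lapply(records, function(x) {
    # A missing field becomes NA so that every row has the same columns.
    vals <- lapply(fields, function(f) {
      if (is.null(x[[f]])) NA_character_ else as.character(x[[f]])
    })
    as.data.frame(setNames(vals, fields), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

pubs <- records_to_df(records)
pubs$title
```

With the real response you would pass the `journal_article` element in place of `records`, provided it has the shape assumed here.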

Update

It is possible to get the JSON URL from the HTML of the staff member’s homepage with a bit of text parsing:

get_json_request <- function(url)
{
   carveout <- function(string, start, end)
   {
      string %>% strsplit(start) %>% `[[`(1) %>% `[`(2) %>%
                 strsplit(end)   %>% `[[`(1) %>% `[`(1)
   }
   
   params <- GET(url) %>% 
      content("text") %>% 
      carveout("var dataGetQuery = ", ";")
   
   id <- carveout(params, "uniqueid: '", "'")
   tries <- carveout(params, "tries: ", ",")
   hash <- carveout(params, "hash: '", "'")
   citationStyle <- carveout(params, "citationStyle: ", "\n")

   paste0("https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?",
          "uniqueid=", id,
          "&tries=", tries, 
          "&hash=", hash,
          "&citationStyle=", citationStyle)
}

Which allows:

url <- "https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah"

get_json_request(url)
#> [1] "https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?uniqueid=00970757&tries=0&hash=f7266eb42b24715cfdf2851f24b229c6&citationStyle=1"
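
To see what the text parsing does without hitting the site, the carveout idea can be tried on a small mock string. This is a sketch under assumptions: the layout of the `dataGetQuery` block is guessed from the search patterns used above, not copied from the real page, and the values are the ones shown in the URL above. It uses base R only, with `fixed = TRUE` so the delimiters are treated as literal text rather than regular expressions.

```r
# Return the text between the first occurrence of `start` and the next `end`.
carveout <- function(string, start, end) {
  after_start <- strsplit(string, start, fixed = TRUE)[[1]][2]
  strsplit(after_start, end, fixed = TRUE)[[1]][1]
}

# Mock page source; the exact layout is an assumption.
page <- paste(
  "<script>",
  "var dataGetQuery = {",
  "    uniqueid: '00970757',",
  "    tries: 0,",
  "    hash: 'f7266eb42b24715cfdf2851f24b229c6',",
  "    citationStyle: 1",
  "};",
  "</script>",
  sep = "\n"
)

# First isolate the object literal, then pull out each parameter.
params <- carveout(page, "var dataGetQuery = ", ";")
id     <- carveout(params, "uniqueid: '", "'")
tries  <- carveout(params, "tries: ", ",")
hash   <- carveout(params, "hash: '", "'")
style  <- carveout(params, "citationStyle: ", "\n")

c(id = id, tries = tries, hash = hash, citationStyle = style)
```

If the site changes the order or formatting of these parameters, the delimiters will need adjusting, so it is worth checking the extracted values before building the request URL.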

And, if you want to be able to just lapply a vector of homepage urls to get the final publication list:

publications_from_homepage <- function(url)
{
   get_json_request(url) %>%
   GET() %>% 
     content("text") %>%
     fromJSON() %>%
     `[[`("publications") %>%
     `[[`("journal_article") %>%
     lapply(function(x) paste(x$authors, x$title, x$journal, sep = " ; ")) %>%
     unlist() %>%
     as.character()
}

So you have:

publications_from_homepage(url)
#> [1] "Adu-Amankwah S, Zajac M, Skocek J, Nemecek J, Haha MB, Black L ; Combined influence of carbonation and leaching on freeze-thaw resistance of limestone ternary cement concrete ; Construction and Building Materials"                        
#> [2] "Wang H, Hou P, Li Q, Adu-Amankwah S, Chen H, Xie N, Zhao P, Huang Y, Wang S, Cheng X ; Synergistic effects of supplementary cementitious materials in limestone and calcined clay-replaced slag cement ; Construction and Building Materials"
#> [3] "Shamaki M, Adu-Amankwah S, Black L ; Reuse of UK alum water treatment sludge in cement-based materials ; Construction and Building Materials"                                                                                                
#> [4] "Adu-Amankwah S, Bernal Lopez S, Black L ; Influence of component fineness on hydration and strength development in ternary slag-limestone cements ; RILEM Technical Letters"                                                                 
#> [5] "Adu-Amankwah S, Zajac M, Skocek J, Ben Haha M, Black L ; Relationship between cement composition and the freeze-thaw resistance of concretes ; Advances in Cement Research"                                                                  
#> [6] "Zajac M, Skocek J, Adu-Amankwah S, Black L, Ben Haha M ; Impact of microstructure on the performance of composite cements: Why higher total porosity can result in higher strength ; Cement and Concrete Composites"                         
#> [7] "Adu-Amankwah S, Black L, Skocek J, Ben Haha M, Zajac M ; Effect of sulfate additions on hydration and performance of ternary slag-limestone composite cements ; Construction and Building Materials"                                         
#> [8] "Adu-Amankwah S, Zajac M, Stabler C, Lothenbach B, Black L ; Influence of limestone on the hydration of ternary slag cement ; Cement and Concrete Research"                                                                                   
#> [9] "Adu-Amankwah S, Khatib JM, Searle DE, Black L ; Effect of synthesis parameters on the performance of alkali-activated non-conformant EN 450 pulverised fuel ash ; Construction and Building Materials"

Created on 2021-11-04 by the reprex package (v2.0.0)
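
When running this over many staff pages, a single page without publication data would otherwise stop the whole `lapply`. A minimal sketch of a more forgiving loop is below. `scrape_all` and `fake_scraper` are hypothetical names, the example.org URLs are placeholders, and the one-second default pause is just a polite guess rather than anything the site documents.

```r
# Apply `scraper` to each URL, returning an empty result instead of
# stopping when one page fails, and pausing between requests.
scrape_all <- function(urls, scraper, pause = 1) {
  results <- lapply(urls, function(u) {
    out <- tryCatch(
      scraper(u),
      error = function(e) {
        warning("Failed for ", u, ": ", conditionMessage(e))
        character(0)
      }
    )
    Sys.sleep(pause)
    out
  })
  setNames(results, urls)
}

# Stand-in scraper so the example runs offline.
fake_scraper <- function(u) {
  if (grepl("broken", u)) stop("no publication data found")
  paste("publication from", u)
}

demo_urls <- c("https://example.org/staff/a", "https://example.org/staff/broken")
demo <- suppressWarnings(scrape_all(demo_urls, fake_scraper, pause = 0))
lengths(demo)
```

For the real pages you would call `scrape_all(staff_urls, publications_from_homepage)`, where `staff_urls` is your own character vector of homepage URLs.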

Method 2

Here is an RSelenium approach:

library(RSelenium)
library(rvest)
library(xml2)

#setup driver, client and server
driver <- rsDriver( browser = "firefox", port = 4545L, verbose = FALSE ) 
server <- driver$server
browser <- driver$client

#goto url in browser
browser$navigate("https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah")

#get all relevant titles
doc <- xml2::read_html(browser$getPageSource()[[1]])
df <- data.frame( title = 
                    xml2::xml_find_all(doc, '//span[@class="title-with-parent"]') %>%
                    xml2::xml_text() )

#close everything down properly
browser$close()
server$stop()
# Windows only: needed, else port 4545 stays occupied by the Java process.
# Note that this kills every running java.exe, not just the Selenium server.
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.