I’m trying to automate the download of the Panel Study of Income Dynamics files available on this web page using R. Clicking on any of those files takes the user through to this login/authentication page. After authentication, it’s easy to download the files with a web browser. Unfortunately, the httr code below does not appear to maintain the authentication. I have tried inspecting the headers in Chrome for the Login.aspx page (as described here), but the session still isn’t kept even when I believe I’m passing in all the correct values. I don’t care whether it’s done with httr, RCurl, or something else; I’d just like something that works inside R so that users of this script don’t have to download the files manually or with a completely separate program. One of my attempts is below, but it doesn’t work. Any help would be appreciated. Thanks!! 😀
require(httr)
values <-
list(
"ctl00$ContentPlaceHolder3$Login1$UserName" = "you@email.com" ,
"ctl00$ContentPlaceHolder3$Login1$Password" = "somepassword" ,
"ctl00$ContentPlaceHolder3$Login1$LoginButton" = "Log In" ,
"_LASTFOCUS" = "" ,
"_EVENTTARGET" = "" ,
"_EVENTARGUMENT" = ""
)
POST( "http://simba.isr.umich.edu/u/Login.aspx?redir=http%3a%2f%2fsimba.isr.umich.edu%2fZips%2fZipMain.aspx" , body = values )
resp <- GET( "http://simba.isr.umich.edu/Zips/GetFile.aspx" , query = list( file = "1053" ) )
Answers:
Method 1
Besides storing the cookie after authentication (see my comment above), there was another problem with your approach: the ASP.NET site puts a __VIEWSTATE key-value pair into the login page as a hidden form field, and that value has to be sent back with your login request. If you check, you could not even log in with your example (the result of the POST command holds information about how to log in, so just inspect it).
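To make the VIEWSTATE point concrete, here is a minimal sketch of pulling a hidden field out of raw HTML. The helper name `extract_hidden_field` and the HTML fragment are made up for illustration, and the pattern assumes the `id` attribute comes before `value`, as in the login page discussed here.

```r
# Hypothetical helper: return the value of a hidden <input> identified by its id.
# Assumes the attribute order id="..." ... value="..." inside the same tag.
extract_hidden_field <- function(html, field) {
  pattern <- paste0('id="', field, '"[^>]*value="([^"]*)"')
  m <- regmatches(html, regexec(pattern, html))[[1]]
  # m[1] is the whole match, m[2] the captured value; empty if nothing matched
  if (length(m) < 2) NA_character_ else m[2]
}

# Made-up login page fragment, for illustration only.
sample_html <- paste0(
  '<form method="post" action="Login.aspx">',
  '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" ',
  'value="/wEPDwUKMTIzNDU2Nzg5MGRk" />',
  '</form>'
)

viewstate <- extract_hidden_field(sample_html, "__VIEWSTATE")
print(viewstate)
```

The same helper could be pointed at any other hidden field the page happens to require; it returns `NA` when the field is absent, which is easier to check for than a silently unchanged string.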
An outline of a possible solution:
- Load the RCurl package:

    > library(RCurl)

- Set some handy curl options:

    > curl = getCurlHandle()
    > curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)

- Load the page for the first time to capture VIEWSTATE:

    > html <- getURL('https://simba.isr.umich.edu/u/Login.aspx', curl = curl)

- Extract VIEWSTATE with a regular expression or any other tool:

    > viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))

- Set the parameters as your username, password and the VIEWSTATE:

    > params <- list(
    +     'ctl00$ContentPlaceHolder3$Login1$UserName'    = '<USERNAME>',
    +     'ctl00$ContentPlaceHolder3$Login1$Password'    = '<PASSWORD>',
    +     'ctl00$ContentPlaceHolder3$Login1$LoginButton' = 'Log In',
    +     '__VIEWSTATE'                                  = viewstate
    + )

- Log in at last:

    > html = postForm('http://simba.isr.umich.edu/u/Login.aspx', .params = params, curl = curl)

  Congrats, now you are logged in and curl holds the cookie verifying that!

- Verify if you are logged in:

    > grepl('Logout', html)
    [1] TRUE

- So you can go ahead and download any file – just be sure to pass curl = curl in all your queries.
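For the download itself, something along these lines might work. This is only a sketch: the helper names `psid_file_url` and `download_psid_file` are hypothetical, the endpoint and the `file` parameter are copied from the GET call in the question, and it assumes the server returns the file as binary content once the handle is logged in.

```r
# Hypothetical helper: build the download URL for a given file id.
# Endpoint and `file` parameter are taken from the question's GET call.
psid_file_url <- function(file_id) {
  paste0("http://simba.isr.umich.edu/Zips/GetFile.aspx?file=", file_id)
}

# Hypothetical helper: fetch one file with the authenticated handle and save it.
# `curl` must be the same handle that was used to log in, so the cookie is sent.
download_psid_file <- function(file_id, destfile, curl) {
  raw_content <- RCurl::getBinaryURL(psid_file_url(file_id), curl = curl)
  writeBin(raw_content, destfile)  # write the raw bytes to disk unchanged
  invisible(destfile)
}

# Usage (needs RCurl installed and a logged-in handle; not run here):
# download_psid_file("1053", "psid_1053.zip", curl = curl)
```

`getBinaryURL` is used rather than `getURL` because the response is assumed to be an archive, and reading it as text could corrupt the bytes.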
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.