We are considering developing a new version of our website (where our customers can purchase and manage certain types of content) on the Force.com platform. While looking into the impact of governor limits, we analyzed visitor and page-request data from our current website. One conclusion was that some of our customers are very likely scraping our website for information.
If we built our website on the Force.com platform using Sites, we would be subject to the Sites limits (40 GB of bandwidth per rolling 24 hours, 60 hours of page-request processing time per rolling 24 hours, and 1 million unauthorized page requests per month). Even though these limits are fairly high, robots scraping our pages will definitely increase the rate at which we reach them (along with some governor limits, such as callouts).
How can we analyze our web traffic to find out which customers are doing this (they are authenticated users on the customer portal), and how can we block it?
Use a custom controller for each of the Visualforce pages and have it track usage either directly against the contacts logged in, or in a new custom object.
I’d suggest a TrackingController that extends StandardController, and then have each page-specific controller extend the TrackingController.
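A minimal sketch of that layering follows. It assumes a virtual base class of its own rather than a literal subclass of ApexPages.StandardController, and a hypothetical custom object Page_Hit__c with hypothetical fields User__c, Page_Url__c, User_Agent__c, Source_IP__c and Hit_Time__c. CatalogController is likewise only an illustrative name.

```apex
// Hypothetical base class. Page_Hit__c and its fields are assumed custom
// metadata that you would have to create yourself.
public virtual with sharing class TrackingController {

    // Wire this up on the page: <apex:page controller="CatalogController" action="{!trackHit}">
    // DML is not allowed in a Visualforce controller constructor, so the
    // insert runs in a page action method instead.
    public PageReference trackHit() {
        Map<String, String> headers = ApexPages.currentPage().getHeaders();
        String ua = headers.get('User-Agent');
        insert new Page_Hit__c(
            User__c       = UserInfo.getUserId(),
            Page_Url__c   = ApexPages.currentPage().getUrl(),
            User_Agent__c = (ua == null) ? null : ua.left(255), // assumes a 255-char text field
            Source_IP__c  = headers.get('X-Salesforce-SIP'),
            Hit_Time__c   = System.now()
        );
        return null; // stay on the current page
    }
}

// Separate class file: one page-specific controller that inherits trackHit().
public with sharing class CatalogController extends TrackingController {
    // page-specific properties and actions go here
}
```

Each page view then costs one DML statement and one stored record, so it may be worth purging or aggregating old rows on a schedule.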
Look in particular at the user agent HTTP header:
String userAgent = System.currentPageReference().getHeaders().get('User-Agent');
Some scrapers might set this to emulate a real browser, but you may get some insight here.
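As a rough illustration, the header value could be run through a simple heuristic like the one below. The class name UserAgentCheck and the token list are assumptions made for this sketch. The list is nowhere near complete, and a scraper that spoofs a browser user agent will pass straight through.

```apex
public class UserAgentCheck {
    // Illustrative substrings only; extend from what you see in your own logs.
    private static final List<String> SUSPECT_TOKENS = new List<String>{
        'curl', 'wget', 'python-requests', 'scrapy', 'httpclient'
    };

    // Returns true when the user agent is missing or contains a known tool name.
    public static Boolean looksAutomated(String userAgent) {
        if (String.isBlank(userAgent)) {
            return true; // real browsers normally send a User-Agent header
        }
        String ua = userAgent.toLowerCase();
        for (String token : SUSPECT_TOKENS) {
            if (ua.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```

From a controller it could be called as UserAgentCheck.looksAutomated(System.currentPageReference().getHeaders().get('User-Agent')) and the result stored alongside the tracked hit.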
Another way to think about the underlying issue:
Is the concern just limits, or do you have another reason to prevent your customers from scraping your data? If it’s just limits, could you provide the data customers want in another fashion? For example, a CSV (or even XML) file with just the data a customer wants would use far less bandwidth than a full HTML page render, and potentially fewer page requests if the customers currently have to scrape multiple pages to get all the data they need. Is the information at all cacheable? Could you precompute a download file, hourly or daily, and host it on AWS or Heroku, for example?
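To make the precomputed-file idea concrete, here is a small sketch that builds the CSV text in Apex. Content_Item__c is a hypothetical object standing in for whatever your customers are scraping, and the chosen columns are assumptions. Scheduling the job and pushing the file to external hosting are left out.

```apex
public with sharing class ContentExport {

    // Builds one CSV document from the (hypothetical) Content_Item__c object.
    public static String buildCsv() {
        List<String> lines = new List<String>{ 'Id,Name,LastModifiedDate' };
        for (Content_Item__c item : [
            SELECT Id, Name, LastModifiedDate
            FROM Content_Item__c
            ORDER BY Name
            LIMIT 10000
        ]) {
            lines.add(String.join(new List<String>{
                String.valueOf(item.Id),
                item.Name.escapeCsv(),               // quotes commas and double quotes
                String.valueOf(item.LastModifiedDate)
            }, ','));
        }
        return String.join(lines, '\n');
    }
}
```

One file like this replaces many rendered pages, which is where the bandwidth and page-request savings would come from.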
All but the most persistent and well-coded bots should be put off by this approach.
- Scrapers tend to only follow links. Create an image tag that points to a Visualforce page with an ID in the parameter, and track hits to that Visualforce page (in a custom object). If the corresponding image request is repeatedly missing, you’ve either got a person using Lynx or a scraper. (This doubles your bandwidth.)
- Rate checking. Scrapers can be built to act like browsers but tend to follow standard execution patterns, i.e. polling at regular intervals. By tracking every time your Visualforce page is read, you can detect patterns and associate probable scraping with an IP.
String ip = ApexPages.currentPage().getHeaders().get('X-Salesforce-SIP');
- Rotate your output template. Scrapers tend to be set up using anchor points and XPaths to find the data. By restructuring your output every now and again you break their scripts. This could be done automatically by generating HTML in Apex and using .
- Another method is to render content as images.
- Rate limit per IP.
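The rate checking and per-IP limiting ideas in this list could be sketched as follows. The sketch assumes every page view is already logged to a hypothetical custom object Page_Hit__c, with a Source_IP__c text field filled from the X-Salesforce-SIP header and a Hit_Time__c datetime field. The window and threshold values are placeholders, not recommendations.

```apex
// "without sharing" so the count covers hits logged by every user behind the
// same IP, not only records the running portal user can see.
public without sharing class RateCheck {
    private static final Integer WINDOW_MINUTES = 5;   // illustrative value
    private static final Integer MAX_HITS       = 100; // illustrative value

    // True when the IP has logged more than MAX_HITS hits inside the window.
    public static Boolean isOverLimit(String ip) {
        if (String.isBlank(ip)) {
            return false; // nothing to key on
        }
        Datetime since = System.now().addMinutes(-WINDOW_MINUTES);
        Integer hits = [
            SELECT COUNT()
            FROM Page_Hit__c
            WHERE Source_IP__c = :ip
            AND Hit_Time__c >= :since
        ];
        return hits > MAX_HITS;
    }
}
```

A page action method could call RateCheck.isOverLimit(ip) and redirect to a lightweight "slow down" page when it returns true. The check itself uses one SOQL query per request, so it adds to processing time rather than removing it.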
If these users are authenticated like you say (customer portal users) then most of those limits shouldn’t apply.
Doing this in Apex isn’t going to address the unauthorized request / month limit. It also will incur processing costs, although if you do it right they could be minimal.
If your robot traffic is so high that you’re in danger of crossing 1m hits/month, you should look at commercial CDNs (unless your app absolutely must be dynamic, but presumably it needn't be if you’re getting routinely scraped). They reduce the load on your site, and they also offer services that can explicitly block your scrapers.
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.