How can I prevent my ASP.NET 3.5 website from being screen scraped by my competitor?
Ideally, I want to ensure that no webbots or screen scrapers can extract data from my website.
Is there a way to detect that a webbot or screen scraper is running?
Answers:
Method 1
It is possible to try to detect screen scrapers:
Use cookies and timing; this will make it harder for out-of-the-box screen scrapers. Also check for JavaScript support, since most scrapers do not have it. Check the browser metadata sent with the request (such as the User-Agent header) to verify that it really is a web browser.
You can also count requests per minute. A user driving a browser can only make a small number of requests per minute, so server-side logic that detects too many requests per minute can presume that screen scraping is taking place and block the offending IP address for some period of time. If this starts to affect legitimate crawlers, log the IP addresses that get blocked and allow them again as needed.
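The per-minute check described above can be sketched roughly as follows. This is a minimal, framework-neutral illustration in Python rather than ASP.NET code; the class name `RequestRateGuard`, the default limits, and the in-memory storage are all assumptions made for the example, not something from the original answer.

```python
import time
from collections import defaultdict, deque


class RequestRateGuard:
    """Hypothetical helper: blocks an IP for a while once it exceeds a per-minute budget."""

    def __init__(self, max_per_minute=60, block_seconds=600, allowlist=None):
        self.max_per_minute = max_per_minute    # assumed threshold, tune for your site
        self.block_seconds = block_seconds      # assumed block duration
        self.allowlist = set(allowlist or [])   # IPs (e.g. known crawlers) never blocked
        self._hits = defaultdict(deque)         # ip -> request timestamps in the last minute
        self._blocked_until = {}                # ip -> time at which the block expires
        self.block_log = []                     # (ip, time) pairs to review later

    def allow(self, ip, now=None):
        """Return True if the request should be served, False if it should be refused."""
        now = time.time() if now is None else now
        if ip in self.allowlist:
            return True
        if self._blocked_until.get(ip, 0) > now:
            return False
        hits = self._hits[ip]
        # Drop timestamps that are older than one minute.
        while hits and hits[0] <= now - 60:
            hits.popleft()
        hits.append(now)
        if len(hits) > self.max_per_minute:
            # Too many requests in the window: block the IP and log it for review.
            self._blocked_until[ip] = now + self.block_seconds
            self.block_log.append((ip, now))
            hits.clear()
            return False
        return True
```

A request handler would call `guard.allow(client_ip)` before serving a page and return an error response when it yields `False`. Reviewing `block_log` is how you would spot legitimate crawlers that need to be added to the allowlist.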
You can also use http://www.copyscape.com/ to protect your content; this will at least tell you who is reusing your data.
See this question also:
Protection from screen scraping
Also take a look at this article on how to prevent screen scraping:
http://mvark.blogspot.com/2007/02/how-to-prevent-screen-scraping.html
Method 2
Unplug the network cable to the server.
Paraphrase: if the public can see it, it can be scraped.
Update: on second look, it appears that I am not answering the question. Sorry. Vecdid has offered a good answer.
But any half-decent coder could defeat the measures listed. In that context, my answer could be considered valid.
Method 3
I don’t think it is possible without authenticating users to your site.
Method 4
You could use a CAPTCHA.
Alternatively, you can mitigate it by throttling their connection. This won’t completely prevent them from screen scraping, but it will probably prevent them from getting enough data to be useful.
First, for cookied users, throttle connections so that they can load at most one page view per second; once the one-second timer is up, they experience no throttling whatsoever. This has no impact on normal users and a lot of impact on screen scrapers (at least if they are targeting a lot of pages).
Next, require cookies to see the data-sensitive pages.
They’ll be able to get in, but as long as you don’t accept bogus cookies, they won’t be able to screen scrape much with any real speed.
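As a rough sketch of this approach, the Python below signs cookies with an HMAC so that bogus ones can be rejected, and spaces out page views per cookie to at most one per second. Everything here is an assumption for illustration (the names `issue_cookie`, `is_valid_cookie`, `PageViewThrottle`, the cookie format, and the in-memory storage); it is not the answerer's implementation and not ASP.NET code.

```python
import hashlib
import hmac
import secrets
import time

SECRET_KEY = secrets.token_bytes(32)  # assumption: generated and kept on the server only


def _sign(session_id):
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()


def issue_cookie():
    """Return a cookie value of the (hypothetical) form 'session_id.signature'."""
    session_id = secrets.token_hex(16)
    return f"{session_id}.{_sign(session_id)}"


def is_valid_cookie(cookie):
    """Accept only cookies whose signature was produced with SECRET_KEY."""
    session_id, sep, sig = cookie.partition(".")
    if not sep:
        return False
    return hmac.compare_digest(sig.encode(), _sign(session_id).encode())


class PageViewThrottle:
    """Spaces out page views per cookie to at most one per min_interval seconds."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last_view = {}  # cookie -> time the most recent view was (or will be) served

    def delay_for(self, cookie, now=None):
        """Return seconds to wait before serving, or None if the cookie is bogus."""
        if not is_valid_cookie(cookie):
            return None
        now = time.monotonic() if now is None else now
        last = self._last_view.get(cookie)
        earliest = now if last is None else max(now, last + self.min_interval)
        self._last_view[cookie] = earliest
        return earliest - now
```

A normal visitor who clicks a link every few seconds always gets a delay of `0`, while a scraper that fires requests back to back is slowed to one page per second. Requests with a missing or forged cookie get `None` and can be refused on the data-sensitive pages.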
Method 5
Ultimately you can’t stop this.
You can make it harder for people to do, for example by setting up a robots.txt file. But you have to get the information onto legitimate users’ screens, so it has to be served somehow, and if it is served then your competitors can get to it.
If you force users to log in you can keep anonymous robots out, but there’s nothing to stop a competitor from registering for your site anyway. This may also drive potential customers away if they can’t access some information for “free”.
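To illustrate why robots.txt only helps against well-behaved bots, here is a small Python sketch using the standard library's `urllib.robotparser`. The rules, the `/prices/` path, the `example.com` URLs, and the `ExampleBot` user agent are made up for the example. A polite crawler performs a check like this before fetching; a scraper can simply skip it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content asking all bots to stay out of /prices/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /prices/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_bot_may_fetch(url, user_agent="ExampleBot"):
    """Return what a rule-following crawler would conclude for this URL."""
    return parser.can_fetch(user_agent, url)
```

Nothing on the server enforces these rules: the file is advisory, which is why it makes scraping harder only for bots that choose to honor it.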
Method 6
If your competitor is in the same country as you, have an acceptable use policy and terms of service clearly posted on your site. State that you do not allow any sort of robots, screen scraping, etc. If the scraping continues, get an attorney to send them a friendly cease and desist letter.
Method 7
I don’t think that’s possible. But whatever you come up with, it’ll be as bad for search engine optimization as it will be for the competition. Is that really desirable?
Method 8
How about serving up every bit of text as an image? Once that is done, either your competitors will be forced to invest in OCR technology, or you will find that you have no users, so the question will be moot.
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.