Fighting scripted polls/screen scrapers

Posted by Jens Dahlin on 08-Apr-2011 06:50

We have a web application based on WebSpeed. Its basic flow is:

Input Form -> Price Listing -> Selection -> Confirmation

A number of parameters are of course passed between these pages, and for the price listing we have a URL that looks something like this (a bit simplified):

departure=AAA&destination=BBB&departureDate=CCC&arrivalDate=DDD

The problem is that a number of external sources poll the Price Listing a couple of times per minute, changing the input parameters each time, and then probably scrape the resulting page electronically. It's quite obvious what they are doing, since I log all accesses and the pattern is clear:

Date (YYYY-MM-DD);Time;IP;Departure;Destination;DepartureDate;ArrivalDate

2011-04-07;14:02:59;x.x.x.x;ARN;PVZ;2011-06-15;2011-07-06
2011-04-07;14:03:25;x.x.x.x;ARN;PVZ;2011-06-16;2011-07-07
2011-04-07;14:03:49;x.x.x.x;ARN;PVZ;2011-06-17;2011-07-08
2011-04-07;14:04:11;x.x.x.x;ARN;PVZ;2011-06-18;2011-07-09
2011-04-07;14:04:26;x.x.x.x;ARN;PVZ;2011-06-19;2011-07-10
2011-04-07;14:04:49;x.x.x.x;ARN;PVZ;2011-06-20;2011-07-11
2011-04-07;14:05:19;x.x.x.x;ARN;PVZ;2011-06-21;2011-07-12
2011-04-07;14:05:48;x.x.x.x;ARN;PVZ;2011-06-22;2011-07-13
2011-04-07;14:06:18;x.x.x.x;ARN;PVZ;2011-06-23;2011-07-14
2011-04-07;14:06:47;x.x.x.x;ARN;PVZ;2011-06-24;2011-07-15
2011-04-07;14:07:21;x.x.x.x;ARN;PVZ;2011-06-25;2011-07-16
2011-04-07;14:07:38;x.x.x.x;ARN;PVZ;2011-06-26;2011-07-17
2011-04-07;14:08:11;x.x.x.x;ARN;PVZ;2011-06-27;2011-07-18
2011-04-07;14:08:21;x.x.x.x;ARN;PVZ;2011-06-28;2011-07-19
2011-04-07;14:08:48;x.x.x.x;ARN;PVZ;2011-06-29;2011-07-20
2011-04-07;14:09:28;x.x.x.x;ARN;PVZ;2011-06-30;2011-07-21
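
The pattern in this log (same IP and route, departure date stepping forward one day per request) could be flagged automatically. The following is only a minimal sketch in Python, not WebSpeed code; the function names `count_date_steps` and `suspicious_keys` and the `min_steps` threshold are made up for illustration, and it assumes the semicolon-separated layout shown above.

```python
import csv
from collections import defaultdict
from datetime import date, timedelta
from io import StringIO

# Three rows copied from the log above (the IP is masked there as x.x.x.x).
SAMPLE_LOG = """\
2011-04-07;14:02:59;x.x.x.x;ARN;PVZ;2011-06-15;2011-07-06
2011-04-07;14:03:25;x.x.x.x;ARN;PVZ;2011-06-16;2011-07-07
2011-04-07;14:03:49;x.x.x.x;ARN;PVZ;2011-06-17;2011-07-08
"""


def count_date_steps(log_text):
    """Count, per (IP, departure, destination), how often the departure
    date advances by exactly one day between consecutive requests."""
    last_date = {}
    steps = defaultdict(int)
    for row in csv.reader(StringIO(log_text), delimiter=";"):
        if len(row) != 7:
            continue  # skip blank or malformed lines
        _day, _time, ip, dep, dest, dep_date, _arr_date = row
        try:
            current = date.fromisoformat(dep_date)
        except ValueError:
            continue  # e.g. the header row
        key = (ip, dep, dest)
        previous = last_date.get(key)
        if previous is not None and current - previous == timedelta(days=1):
            steps[key] += 1
        last_date[key] = current
    return dict(steps)


def suspicious_keys(log_text, min_steps=2):
    """Return the (IP, departure, destination) keys that stepped the
    departure date forward at least min_steps times (hypothetical threshold)."""
    return [k for k, n in count_date_steps(log_text).items() if n >= min_steps]
```

A real visitor occasionally searches adjacent dates too, so the threshold would need tuning against actual traffic before anything is blocked on this basis.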

The issues we have with this are:

a) They take up WebSpeed resources, forcing us to hold more licenses than we really need

b) They take up resources on our server, forcing us to invest in faster servers than we need

c) These calls also invoke an external web service call that actually costs money

Has anybody handled a problem like this? We are considering blocking IPs when they reach a specific number of searches, but then we have to keep track of a list of IPs that are free to search as much as they want, etc.
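
To make the idea concrete, here is a rough sketch of a per-IP search limit with a list of exempt IPs. It is written in Python purely as an illustration, not as WebSpeed code; the class name `SearchLimiter`, its parameters, and the example addresses are all hypothetical, and the limits would have to be chosen from real traffic.

```python
import time
from collections import defaultdict, deque


class SearchLimiter:
    """Sliding-window limit on searches per IP, with an allowlist of
    IPs that may search as much as they want (illustrative only)."""

    def __init__(self, max_searches, window_seconds, allowlist=()):
        self.max_searches = max_searches
        self.window_seconds = window_seconds
        self.allowlist = set(allowlist)
        self._hits = defaultdict(deque)  # ip -> timestamps of recent searches

    def allow(self, ip, now=None):
        """Return True if this search may proceed, False if it should be refused."""
        if ip in self.allowlist:
            return True
        if now is None:
            now = time.monotonic()
        hits = self._hits[ip]
        # Forget searches that have fallen out of the time window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_searches:
            return False
        hits.append(now)
        return True
```

The check would ideally run before the price listing triggers the paid external web service call, so that refused requests cost nothing there. In this sketch the per-IP table lives in memory and never shrinks, so a real deployment would need shared storage and some cleanup of idle entries.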

All Replies

Posted by rbf on 08-Apr-2011 11:31

Are your legitimate users people of flesh and blood?

In that case a very simple solution is to add a CAPTCHA on your form. Check out

http://en.wikipedia.org/wiki/CAPTCHA

-peter
