[_] YAPCW (Yet another price comparison website)

Paul Lomax paul at
Tue Sep 20 09:49:11 BST 2011

The problem with scraping is that nothing stops the other side from
changing their markup and breaking your scrapers. Depending on how many
sites you are trying to scrape, fixing scrapers just to keep up could end
up being a full-time job.
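
To illustrate the breakage point, here is a minimal sketch (not part of the
original message) of a scraper that fails loudly when the expected markup
disappears, instead of silently returning nothing. The class name "price",
the PriceExtractor class and both helper functions are hypothetical names
chosen for this example; only the standard library's html.parser is used.

```python
from html.parser import HTMLParser
from typing import Optional


class PriceExtractor(HTMLParser):
    """Captures the first non-empty text found inside an element whose
    class attribute contains the given class name (assumed: "price")."""

    def __init__(self, css_class: str = "price") -> None:
        super().__init__()
        self.css_class = css_class
        self._capturing = False
        self.value: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.value is None and self.css_class in classes:
            self._capturing = True

    def handle_data(self, data):
        if self._capturing and data.strip():
            self.value = data.strip()
            self._capturing = False

    def handle_endtag(self, tag):
        # Stop looking once any element closes without yielding text.
        self._capturing = False


def extract_price(html: str) -> Optional[str]:
    """Return the price text, or None if the expected element is missing."""
    parser = PriceExtractor()
    parser.feed(html)
    parser.close()
    return parser.value


def check_scraper(html: str) -> str:
    """Raise instead of returning nothing, so a markup change is noticed."""
    price = extract_price(html)
    if price is None:
        raise ValueError("price element not found; markup may have changed")
    return price
```

A check like this does not stop a scraper from breaking; it only means a
markup change shows up as an error rather than as quietly missing data.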

You may also find that if you hammer other people's sites by scraping them
on a daily basis, they will just block the scraper's IP.
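
One common way to avoid hammering a site is to enforce a minimum gap between
requests to the same host. The sketch below is an illustration only (not
from the original message); HostThrottle and the 5 second default are
assumptions, and the clock and sleep functions are injectable purely so the
behaviour can be checked without waiting.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforces a minimum interval between requests to the same host."""

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between hits on one host
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time the last request was allowed

    def wait(self, url):
        """Sleep if needed before requesting `url`; return the delay used."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last_request[host] = now + delay
        return delay
```

Calling throttle.wait(url) before each fetch keeps the request rate per site
bounded. It reduces the chance of being blocked but does not guarantee it.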

Point being, *just* scraping isn't really a reliable method of generating

On Tue, Sep 20, 2011 at 9:27 AM, Mark Chitty <mark.chitty at> wrote:

> good morning all,
> I thought I'd put this (sort of) moral dilemma to the floor...
> I've a client who wants to create a price comparison website for a very
> niche market, so their idea is to gather links to all the pages which list
> said niche products and make them 'searchable'. I've recommended against
> screen scraping, as it's unlikely to make friends but at this stage it's
> unlikely that we will be able to gain agreements from each niche website,
> at least until we start pushing significant traffic (and even then there
> are no guarantees).
> One option I've considered is using something like Yahoo Pipes to generate
> the list of product links, but then we'd still need to scrape some of the
> data off each page so we're back at an immoral square one.
> AFAIK, most price comparison websites work by ingesting published data, as
> do sites like Rightmove etc, however the niche sites are in no way set up
> for this sort of service.
> What I'm struggling/frustrated with is that effectively, the idea is no
> more than what Google does, gathering data, making it easier to search and
> then profiting from the added value. I guess it's a case of 'walk softly
> but carry a big stick'... and we don't have a big enough stick ...
> thoughts?
> m
> --
> Mark Chitty
> -
> web:
> email: mark.chitty at
> mobile: 0777 3392821
> skype: markchitty
> -