[_] HA....y
Steve Roome
steve at pepcross.com
Tue Apr 1 12:22:37 BST 2008
On Mon, Mar 31, 2008 at 09:31:35PM +0100, Ric wrote:
> How big is your idea of a 'big' box?
> We've got 8-way 32GB Linux servers, and similar Suns and HPs, but
> nothing that doesn't have fully redundant everything counts IMHO,
> ideally where you can change everything without taking the box down.
> An E10k or similar would do it, which is what Orange used to run all
> their Internet presence on, perhaps they still do.
Yeah, that sort of monster, F15K or F25K probably now though, or you
could go buy something with an equally silly name and price tag from
HP or IBM.
Still, they're pretty pricey! ;) Anything that has to be bigger than
8-way usually seems to come with a ridiculous price tag. I've specced
up a couple of E10ks and some V classes over the years, but I've
never worked anywhere that already had one AND really needed it
rather than better SQL, which, for kit that can cost millions, is a
pretty sad state of affairs. Efficient software design could save a
bit, perhaps? (not necessary in your case)
> Even a couple of everything (server wise) can be done pretty simply
But it's not always done pretty "simply". Quite the opposite sometimes,
don't you think? And then the third one goes in ....
> Going outwards a long way requires it to be kept pretty simple. I
> believe that google have lots of thin servers at the front end.
Google seem very secretive and it's hard to analyze how clever they
are when they have so much money from stock weirdness that it's a bit
ridiculous to compare them to any normal company. IMHO!
> It's in the middle that's tough, when you have enough to add to
> complexity, but not enough to *force* simplification of deployment
> and that allows admins to configure unreliability.
Oh, I totally agree with this, it's at that middle stage that I think
the "let's make this bigger and more reliable" tends to have gone a
little wrong. In my experience anyway.
> > Every situation I've seen with 3+ staff the probability of the staff
> > accidentally cocking it up during regular maintenance has become
> > higher than the probability of some software or hardware failure.
>
> It's a combination in my experience. Someone not configuring something
> how they should have, which then causes some failure to become an
> outage that it otherwise wouldn't have. It happened the only time the
> 'resilient' network went out:
> Network reboot -> causes routing change -> internal DNS loss ->
> significant outage...
We're just people, we are thousands if not millions of times more
prone to failure than almost any decent equipment.
More of these high failure "units" seems bad for scalability. ;)
Steve