Self Denial of Service (was SETI-BOINC running but lacking connections)

In article <3n3bh1pclpaeonike8ln9ictfjb4m5uvua@4ax.com>, f/fgeorge  <*> wrote:

On Wed, 31 Aug 2005 08:46:24 +0100, John
<fredclark@consltec.demon.co.uk> wrote:

SETI-BOINC has been running several hours (started about midnight, as
far as I can see), and peak bandwidth demand seems to be over (bandwidth
settling at about 45-50 Mbps from about 90 Mbps).

Why is it still so difficult to connect and upload results and download
work units to crunch?

Different PCs on my small farm seem to have fared differently. In all
cases none of the completed WU results seem to have been sent yet. This
is despite BOINC Manager showing the connection attempts, but the
Systray connection icon shows no broadband activity. 

However, in a couple of cases PCs have succeeded in downloading work
(and are crunching). In other cases no downloads have been achieved, so
no crunching. 

This is despite being booted up and running online (ADSL) for several
hours.

Because there are over 100,000 of us with over 400,000 computers ALL
trying to connect AT THE SAME TIME!
Over the next couple of days things should settle down.


That's because Berkeley seem to have programmed a "Denial of Service"
attack against themselves.  It's clear their "backing off" strategy is
seriously flawed.  I hope they will rethink it.

Let me describe, as illustration, how my system behaved.  After the
switch on it was a couple of hours before I established a "scheduler"
connection, and was allocated 6 new work units.  I was lucky in that one
downloaded fairly soon, and I started processing.

However the others, and 6 uploads, started attempting to make a
connection.  Each download failed, backed off a minute, and reattempted.
There seemed to be an interlock so that multiple simultaneous attempts
didn't happen.  So the result was that for *hours* there was *always* a
download attempt in progress.  It was about 06:00 UTC before the backing
off had expanded enough to give the occasional gap of a few minutes, but
those gaps were still only occasional, with long runs of continual attempts
between them.
It was several hours more before all the attempts were separated.  And
that was wanting *five* workunits - there are systems wanting hundreds,
they're probably *still* sending runs of requests.

It was 18:00 (UTC) before the fetch happened - and still no uploads.
(By then my new processing was finished, and I had been idle again.)

I finally had an upload at 04:00, next day - with "Throughput 117
bytes/sec".  Hours later, I still have 7 trying to upload.

So, what's wrong?  The main problem seems to be that each request backs off
*independently*.  I would have thought that the uploads and downloads
should each be backed off as a *group*.  If that causes programming
problems then there could be congestion-processing built-in - and its
detection must be there because of the interlock I mentioned earlier -
with something like twice the congestion delay (or more) added to
whatever back-off the current algorithm provides.  (Where by "congestion
delay" I mean the time between when the access becomes due, and when the
interlock permits it.)
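
To make the idea concrete, here is a rough Python sketch of a back-off
shared by a whole group of transfers (all uploads, say), with twice the
congestion delay added on top.  It is only an illustration, not BOINC's
actual code: the class name GroupBackoff, the 60-second base, the 4-hour
cap and the doubling-with-jitter rule are all assumptions of mine, and the
sketch combines the group back-off and the congestion penalty, although
either could be used alone.

```python
import random


class GroupBackoff:
    """One shared back-off timer for a whole group of transfers.

    Hypothetical sketch: the base delay, the cap and the doubling rule
    are assumed values, not taken from the BOINC client.
    """

    def __init__(self, base=60.0, cap=4 * 3600.0, rng=None):
        self.base = base          # first delay ceiling, in seconds
        self.cap = cap            # largest delay ceiling, in seconds
        self.failures = 0         # consecutive failures across the group
        self.next_allowed = 0.0   # earliest time any member may retry
        self.rng = rng or random.Random()

    def may_attempt(self, now):
        """True if any transfer in the group is allowed to try now."""
        return now >= self.next_allowed

    def record_failure(self, now, congestion_delay=0.0):
        """Back the whole group off after one member fails.

        congestion_delay is the time the attempt waited between becoming
        due and the interlock letting it run; twice that is added.
        """
        self.failures += 1
        exponent = min(self.failures - 1, 20)   # avoid huge powers
        ceiling = min(self.cap, self.base * 2 ** exponent)
        delay = self.rng.uniform(ceiling / 2, ceiling)  # jitter
        delay += 2 * congestion_delay
        self.next_allowed = now + delay
        return delay

    def record_success(self, now):
        """A success clears the back-off for the whole group."""
        self.failures = 0
        self.next_allowed = now
```

With one such object for uploads and another for downloads, a failed
transfer delays its siblings too, so a host wanting hundreds of workunits
makes no more connection attempts than a host wanting one.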

The second problem I see is that uploads and downloads seem to be aiming
at the same FQDN (setiboincdata...).  Surely there should be two FQDNs
for that.  That still would permit one system to be used if necessary -
they could resolve to the same IP address or to two addresses on the same
computer if wanted - but would give more flexibility for tuning, to stop
uploads and downloads blocking one another.

In fact I would carry that further by using several FQDNs for each on a
round-robin basis.  That would have no effect normally, but would enable
access to be rationed during overloads by null-routing some of the access
attempts.
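
A small Python sketch of the round-robin idea, again purely illustrative:
the host names under example.org are placeholders, and round_robin and
admitted_fraction are hypothetical helpers, not anything in the BOINC
client.  It shows a client cycling through the names, and how null-routing
some of them would ration the attempts that get through.

```python
import itertools

# Placeholder names; the real project would publish its own FQDNs.
UPLOAD_HOSTS = [
    "upload1.example.org",
    "upload2.example.org",
    "upload3.example.org",
    "upload4.example.org",
]


def round_robin(hosts, start=0):
    """Return an endless iterator over hosts, beginning at index start."""
    start %= len(hosts)
    return itertools.cycle(hosts[start:] + hosts[:start])


def admitted_fraction(hosts, null_routed, attempts):
    """Fraction of round-robin attempts that avoid the null-routed names."""
    picker = round_robin(hosts)
    reached = sum(
        1 for _ in range(attempts) if next(picker) not in null_routed
    )
    return reached / attempts
```

For instance, admitted_fraction(UPLOAD_HOSTS, {"upload3.example.org"}, 8)
gives 0.75: with one name in four null-routed, a quarter of the attempts
never reach the server.  Starting each client at a different index would
spread the load evenly across the names.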

I hope that some BOINC developer will see this and give it - or
something similar - serious thought.

-- 
John F Hall