Subject: Re: Comms Failure
From: John
Date: 10/12/2005, 01:20
Newsgroups: alt.sci.seti

In article <5rOdnVCIYvcEMATeRVn-iw@giganews.com>, The Goblin
<gbb301@jaguar1.usouthal.edu> writes
so if this is the case why do i have 28 units waiting to upload? I haven't
been able to get any up or down all week.



This sort of outage is caused by the large number of Classic users
switching to BOINC; at the same time, the BOINC servers suffered
database oversizing and memory leaks. The net result is that the upload
server seems to be the main one to suffer. You can see this in
discussions at a number of sites, such as -

        http://setiweb.ssl.berkeley.edu/
- where there is a note, quoting ...

"December 9, 2005 
We will be extending the deadline for returning results so that the
troubles with the result upload handler will not result in lost credit. 

December 8, 2005 
We are experiencing heavy traffic on our data server. This is currently
preventing some result uploads, but is getting better over time. More in
Technical News." 

And similarly in the Technical News:

"December 8, 2005 - 03:00 UTC 
For the past few days our upload/download server has been dropping
connections, making for a frustrating experience for everybody involved.
We also had our hands full trying to complete the first stage of the
master science database merge. 
Currently everybody who is requesting work can get it, thanks to
splitting the uploads and downloads onto two separate servers. This
isn't reflected yet in the server status page, and it may just be a
temporary solution until we somehow obtain a machine capable of doing
both. As well, there may be more server shuffling as Classic ramps down. 

Meanwhile, we are still dropping connections on the upload server. But
the good news is that we are successfully handling about 4 result
uploads for every work unit download, which means the upload server is
indeed catching up. We're getting about 35 results a second and sending
out about 8 work units a second at the time of writing. 

We hit several snags with the master science database merge and were too
far in to revert back. Since we were running low on work we went with a
backup plan - creating a third database. Since all new work units and
results are being inserted into this third database, we can leisurely
migrate the data between the other two databases without any time
pressure. This complicates our overall merge plan a bit, but reduces a
lot of the stress in the meantime. 

December 6, 2005 - 04:30 UTC 
With the influx of new users, bottlenecks were bound to happen. A couple
nights ago we started dropping connections on the upload/download server
(Kryten). This server was also serving the new BOINC core client
downloads. We immediately moved the client downloads onto the campus
network which was ugly, as this added about 20 Mbit/sec of traffic onto
the regular campus network. 

On Monday morning we fixed this by making the secondary web server
(penguin) the BOINC client download server. In its former life penguin
was the BOINC upload/download server so it already had the plumbing and
hardware to be on the Cogent network. So without much ado, we were able
to move the core client downloads off the campus net. But what about the
secondary web server? Well, another Sun D220R (kosh) wasn't doing very
much at the time, so we plopped apache/php on that and made it the
backup web server. Some people might be getting failed connections to
our home page as DNS maps need a while to propagate throughout the
internet. 

Meanwhile, we were still dropping connections on Kryten. At first we
thought this was due to the upload directories (physically attached to
Kryten) getting too large, as the assimilators were backing up (and they
only read files in the upload dirs). Upon checking half the files in
upload were "antiques," still leftover from server issues way back in
August. We will delete these files in good time. We increased the ufs
directory cache parameters on Kryten but this didn't help at all. So our
current woes must lie in the download directories (kept on a separate
server) or some other bottleneck further down the pike we haven't
discovered yet. 

And while all this was being diagnosed and treated we actually started
the master science database merge. This is why most of the back end
services are disabled, and will remain off until the first half of the
merge is done (about 2 days from now). We hope the results-to-send queue
lasts us through this first part. Having these back-end services off is
actually helping Kryten catch up on its backlog of work to
upload/results to download. 

More to come as we discover more about current server issues and
progress further with the database merge..." 
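As a quick sanity check on the figures quoted above (about 35 result
uploads a second against about 8 work-unit downloads a second), the
numbers do bear out the "4 uploads for every download" claim, and give
a rough net drain rate on the upload backlog. A minimal sketch, using
only the rates stated in the Technical News:

```python
# Throughput figures quoted in the Technical News of 8 Dec 2005.
uploads_per_sec = 35    # results arriving
downloads_per_sec = 8   # work units going out

# Ratio of uploads to downloads - should match the "about 4" quoted.
ratio = uploads_per_sec / downloads_per_sec
print(round(ratio, 1))  # -> 4.4

# If every download eventually comes back as one upload, the backlog
# shrinks by the difference; per hour that is:
net_drain_per_hour = (uploads_per_sec - downloads_per_sec) * 3600
print(net_drain_per_hour)  # -> 97200 results per hour
```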

Further data, especially on connection errors (shown in the /BOINC
Manager/Messages tab/), including I/O errors, error 500 and -106, can
be found in the Community discussion groups at ...

        http://setiathome.berkeley.edu/forum_index.php

- under the heading /number crunching/
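If you want to check your own machine before wading through the forums,
you can scan the client's messages for the error codes above. This is
only a sketch: the sample messages below are made up for illustration,
and on most installs the real log text lives in stdoutdae.txt in the
BOINC data directory, so point grep there instead.

```shell
# Hypothetical excerpt of BOINC Manager messages saved to a file;
# on a real install, substitute the path to stdoutdae.txt in your
# BOINC data directory.
cat > messages.txt <<'EOF'
SETI@home - Temporarily failed upload of 12my03aa.result: error 500
SETI@home - Error on file upload: can't connect (-106)
SETI@home - Requesting 8640 seconds of work
EOF

# Pull out only the lines that show upload trouble.
grep -E 'error 500|-106' messages.txt
```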


--
Hugh Janus
Constipation is the thief of time, but diarrhoea waits for no man!!