Subject: Re: Connections to ssl.berkeley not propagated yet?
From: jfh@avondale.demon.co.uk (John F Hall)
Date: 28/07/2005, 13:51
Newsgroups: alt.sci.seti

In article <51c9b$42e89002$82a1d3bf$3947@news1.tudelft.nl>,
Patrick Vervoorn  <patrick.vervoorn@NOSPAM.perihelion.demon.nl> wrote:
In article <dc9jvv$7b8$1@green.home>,
John F Hall <jfh@avondale.demon.co.uk> wrote:
In article <a2a91$42e7835f$82a1d3bf$17190@news1.tudelft.nl>,
Patrick Vervoorn  <patrick.vervoorn@NOSPAM.perihelion.demon.nl> wrote:

- Why did SetiBOINC not process the WU's it had already downloaded? This 
  is a P3-733MHz machine, so those 3 WU's should have lasted it for at 
  least a day or two. Instead it decided to just sit there idling, and 
  periodically checking the Berkeley site for 'something'?

Have you checked the log?  It's quite likely that they had already been
processed, their results uploaded, and your system was waiting to tell
the scheduler, after which they would be deleted.

Which log do you mean? I didn't see anything resembling a log file in the 
local BOINC/ directory, so I assume you mean the 'log' displayed under the 
'Account' tab on SetiBOINC's website; there it showed the units had been 
sent out, but no results had been returned, so also the 'Pending validation 
...' text was not there (which would've been an indication the WU's had 
been processed, the results returned, and I only had to wait for these 
results to be validated...)

Ah, on looking deeper I see it's the standard output of boinc.  My
startup command, buried deep within /etc/rc.d/rc.boinc, captures that:

  BOINCUSER=boinc

  ## Locate the executable with highest version (ls sorts by name, so
  ## this relies on the version numbers sorting correctly as text)
  BOINCEXE=`/bin/ls -1 $BINDIR/boinc_*_$BUILD_ARCH 2>/dev/null | tail -n 1`
  if [ ! -x "$BOINCEXE" ]; then
    echo "Cannot find/run boinc executable $BOINCEXE "
    exit 2
  fi
 
  BOINCOPTS="-return_results_immediately $*"

  LOGFILE=boinc.log
  ERRORLOG=error.log

  su $BOINCUSER -c "$BOINCEXE $BOINCOPTS >>$LOGFILE 2>>$ERRORLOG &"

I usually have a "tail -f boinc.log" running on one of my virtual
terminals to keep an eye on what's happening :-).  (I've upped the
inittab commands to give myself 10 such, rather than the default 6.)
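
If the full log scrolls past too quickly, a filter along these lines
might help.  It is only a sketch: boinc_milestones is a made-up name,
and the patterns are simply the message texts quoted elsewhere in this
thread, so other client versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: pass through only the "milestone" lines of a boinc
# log read on standard input.  The patterns are assumptions taken from the
# log excerpt quoted in this thread, not from any documentation.
boinc_milestones() {
    grep -E 'Computation for result|Sending scheduler request|Scheduler request to|Started upload|Finished upload'
}

# Usage (assumes the LOGFILE name from the startup script above):
#   tail -f boinc.log | boinc_milestones
```

Everything else in the log is dropped, so the unfiltered "tail -f" is
still the thing to watch when something looks wrong.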

There's also relevant info in the client_state.xml file, though I
haven't fully analysed that.

As far as I could determine, these WU's had been downloaded by me, but 
nothing had been returned and/or processed.

I'd better start by saying I'm only another user, so everything I say is
found by observation and by my struggles to get it running as I want,
not by special knowledge.

The processing sequence seems to be:

  Contact scheduler and ask for work.  Scheduler allocates workunits and
  marks them "in progress" and sets deadline for return.

  Contact download server and collect workunits.

  Process workunits.

  Contact upload server and return result as each completes.

  Contact scheduler and register completion.  Scheduler marks workunit
  as "done" and queues it for validation.  Scheduler instructs your
  computer to delete all files associated with that workunit.  The "date
  returned" is the date the scheduler is contacted.

Yes.  Don't interfere, but let "Deferring communication with project for
..." run its course.  :-)

I'll do that next time, but this is more or less against my principles: 
I'm donating some CPU time (I have more computers running BOINC), but when 
one of them for some fuzzy reason decided to go idle, I try to fix this, 
especially if after some investigation, it turns out there is no real 
reason to idle...

But also make sure you're running boinc with the
"-return_results_immediately" flag if you're not already doing so, and
make sure your cache size is set large enough (mine is set to 4 days).

Does that flag make a lot of difference? If I look at the stdout output 
from BOINC, it seems it's returning results rather quickly already:

Before I included that flag I found that the final step of contacting
the scheduler didn't seem to start until the cache was drained a bit (to
half-full?).  The documentation (boinc_public/doc/client_unix.php) says:

  "-return_results_immediately",
      "Contact scheduler as soon as any result done."

2005-07-28 09:32:41 [SETI@home] Computation for result 
13mr04aa.15943.30386.29822.119_1 finished
[....]
2005-07-28 09:32:42 [SETI@home] Sending scheduler request to 
http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi
2005-07-28 09:32:42 [SETI@home] Started upload of 
13mr04aa.15943.30386.29822.119_1_0
2005-07-28 09:32:43 [SETI@home] Scheduler request to 
http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi succeeded
2005-07-28 09:32:44 [SETI@home] Finished upload of 
13mr04aa.15943.30386.29822.119_1_0

So that's about a 1 second delay...

Presumably it depends on how big a cache you're running.

And, how do I set up the cache to cache up to 4 days of work? Is this done 
by some fiddling with the 'contact SetiBOINC every ... hours/days'? If so, 
what value do I put there?

Under the "general preferences" on the "your account" web page there is
a "Connect to network about every:" question.  I have set it to 4 days,
which seems for me to give a reasonable compromise between having
workunits in reserve and not being the last in returning them.  I used
to run classic with a 20 workunit cache, about a week, but after
experimenting I settled on 4 days for boinc.  The effect seems to be
that the cache is topped up every time the cache is half empty - about 5
or 6 units every two days.  If a breakdown happened at the wrong time I
would just have 2 days work and could run out, but that hasn't happened
yet - if it did I suppose I would change to 5 days :-).
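
The arithmetic behind that can be sketched as below.  Two things are
assumptions rather than documented behaviour: that the client tops up
when the cache is half empty (that is only what I seem to observe), and
the figure of 3 workunits a day, which is a rough guess from the "5 or 6
units every two days" above.  The function names are made up for the
example.

```shell
#!/bin/sh
# Rough cache arithmetic in whole days (shell integer maths rounds down).
# Assumption from observation, not documentation: the client tops up when
# the cache is half empty, so the worst-case reserve is half the setting.
reserve_days() {
    echo $(( $1 / 2 ))
}

# Workunits held by a full cache: days of cache times workunits per day.
cache_units() {
    echo $(( $1 * $2 ))
}

cache_days=4      # the "Connect to network about every:" setting
units_per_day=3   # rough guess for this machine, see above

echo "full cache:    about $(cache_units "$cache_days" "$units_per_day") workunits"
echo "worst reserve: $(reserve_days "$cache_days") days"
```

A slower machine only changes units_per_day; the reserve in days depends
on the setting alone.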

When you change that setting it won't take effect until you next contact
the scheduler.  There are ways of forcing a connection, but if you've
used the return_results_immediately flag that will be next time a
workunit completes which should be soon enough (as long as seti is the
master project).  If you're running more than one project, it's the one
defined as the "master_url" in the client_state, and as "master.html",
that matters - that should be the project you started first, as long as
you didn't stop it.
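
To see what is actually recorded there, something like the following
should do.  It is a minimal sketch that assumes each master_url element
sits on a single line as <master_url>URL</master_url>; I haven't checked
that against every client version, and master_urls is just a name
invented for the example.

```shell
#!/bin/sh
# Hypothetical helper: print the contents of every <master_url> element
# in the file given as the first argument.  This is a plain text match
# that assumes one element per line; it is not a real XML parse.
master_urls() {
    sed -n 's:.*<master_url>\(.*\)</master_url>.*:\1:p' "$1"
}

# Usage, run from the BOINC directory:
#   master_urls client_state.xml
```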

I've tried to see where I got my boinc.rc from.  I seem to have got an
"init.d-boinc" that is Red Hat specific from somewhere, and hacked it to
form a "boinc.rc" for my Slackware system.  If it would help I can email
you copies (my email address above is valid).

I hope this all helps :-).

-- John F Hall