Subject: Re: SETI@home (Classic) Phenomenology
From: Martin 53N 1W
Date: 05/08/2004, 20:37
Newsgroups: alt.sci.seti

Randall Schulz wrote:
On Thu, 05 Aug 2004 13:49:07 +0000, Martin 53N 1W wrote:
[...]

It sounds like you have hardware problems causing the CPU/software to
fail intermittently.

I doubt that, really. As I said, this symptom has occurred on two
different systems. The stall always happens at the same point in a work
unit. This is too patterned to be sporadic hardware failure.

So far, my best guess is that what I'm seeing is some kind of interaction
between the SETI@home client and Ksetiwatch.

As a test, try the same conditions but without Ksetiwatch.

Also, do you still have the problem with a different WU?


[...]
If you've ever disturbed the CPU heatsink, it is very worthwhile
cleaning and reseating it with new thermally conductive grease.

It's a new system. The stock, boxed-CPU cooler assembly was installed
three days ago and has not been removed since.

(I found it interesting that the heat-sink compound they use is apparently
designed to be solid at room temperature but to melt at CPU operating
temperatures and hence to conform fully to the surfaces (CPU and heat
sink) between which it's situated regardless of its form and distribution
when the heat sink is installed.)

I've seen brand-new systems assembled with the protective plastic cover still on the thermally conductive pad! I've also seen heatsinks mounted on CPUs with no grease and no pad, and broken clips with the heatsink starting to slide off...

Such systems will keep working for a surprisingly long time under light load!

The motherboard sensors may well not accurately indicate CPU temperature.


[...]
Memtest86 and prime95 (torture mode) reveal no problems, so far. I'll
allow prime95 to run its torture test for the rest of the day to get a
better idea of the system's reliability.

See what you get. Other things to check are the HDD and any other connected peripherals that might cause some 'glitch'.


[...]
Computers should be _completely_ _reliable_, and consistent and
repeatable in their results.

Yes. Yes they should. In principle, they implement the essence of pure
mechanism (a.k.a. algorithm). But, they are not. There are just too many
moving parts (electrons and holes, that is...)

It's all a game of probabilities at that level. It's just that the probabilities are so heavily weighted that, if you are working within spec, the universe will suffer heat death, and the machine will be long obsolete, before you see an 'unexpected' result. However, cosmic ray hits might flip a bit in your machine once every few years or sooner. ECC is a good idea for multi-GByte RAMs!
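
To illustrate how ECC can recover from a single flipped bit, here is a minimal sketch using a textbook Hamming(7,4) code. This is a toy, not what a memory controller actually runs: real ECC memory typically protects much wider words with a SECDED code, and the function names below are made up for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as
    positions 1..7 = p1 p2 d1 p3 d2 d3 d4 (illustrative layout)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data_bits, error_position); error_position is 0 if the
    codeword was clean, else the 1-based position that was repaired.
    Only a single flipped bit per codeword can be corrected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3  # the syndrome reads out the bad position
    if pos:
        c[pos - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], pos


if __name__ == "__main__":
    word = [1, 0, 1, 1]
    stored = hamming74_encode(word)
    stored[4] ^= 1  # simulate a cosmic-ray hit on position 5
    recovered, where = hamming74_correct(stored)
    print(recovered == word, where)
```

The three parity bits cost 75% overhead here; wider codes need proportionally far fewer check bits, which is why ECC on large RAMs is cheap insurance.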


Good luck tracing the problem,

Regards,
Martin


-- 
----------   OS? What's that?!
- Martin -   To most people, "Operating System" is unknown & strange.
- 53N 1W -   Mandrake 10.0.1 GNU Linux
----------   http://www.mandrakelinux.com/en-gb/concept.php3