Re: aggravating

f/fgeorge <ffgeorge@yourplace.com> wrote:

Berkeley had another power outage and is trying to figure out why.


About twenty-some years ago I was working in a data center
that was terrorized for months by a really strange and
persistent but intermittent problem. Everything would be
fine for hours, days, or weeks, then we would suddenly start
getting "HOT I/O" from literally scores of devices at once
(Hot I/O is an IBM mainframe term for I/O errors which occur
while the system is still recovering from previous I/O
errors). Suddenly, dozens of devices would be "Boxed"
(rendered inert and inaccessible, probably until the system
is restarted). 

What was particularly strange was that the CPU's (a pair of
370/168's without an MCU) didn't crash through all this, but
hours of production were lost and had to be restarted,
databases restored, etc. Needless to say, this was really
great training for the Ops staff, and good for lots of
overtime.

The managers were really tearing out their hair over this
for months as every one of these outages cost the facility
thousands of dollars in overtime, lost production, and
futile service calls. During the downtime, the staff pulled
up hundreds of raised-floor sections looking for anything
suspicious. Management thought it was water leaking from one
of the 50-ton chillers or from the water-cooled CPU's; I
was betting on rodents, which had recently attacked the
wiring in my car.

We found neither one drop of water nor any rodents or their
trademarks, and no one could explain why the CPU's were
immune to the problem. They did everything but call in an
exorcist. Eventually, after a team of engineers had been
called in for about a week, someone traced the problem to
deterioration of a twelve-foot piece of big fat 4/0 aluminum
wire that connected the main panel to a large sub-panel
which fed everything in the machine room but the lights and
the chillers. 

The CPU's were not affected because they were
protected by Liebert motor-generators with big inertial
flywheels which provided the 400 Hz power the 168's required
and did an all too good job of stabilizing it. Furthermore,
if the lights had been on the circuit we would have noticed
them flicker and known what was up. What really suffered
from this problem were the hundreds of 3350 disk drives and
the "channels" (big refrigerator-sized boxes with no clearly
observable function) and various "controllers" (smaller
boxes with no clearly observable function). 

This is just one more example of how a really high-tech
problem can have a really low-tech cause (and if you want to
really screw things up you need a computer). 

(I'll be "dry" in two hours.)