Computers Built Not to Fail

Ya’akov

Joined Jan 27, 2019
10,226
When I saw the title of your post, Tandem immediately came to mind. I remember clients who talked about wanting "non-stop" hardware in the 80s/90s and I would ask them why. They told me they "couldn't afford downtime".

When I asked how much downtime cost them, they had no answer. So we did some math and downtime did look very expensive—until I showed them what non-stop would cost. Then, we worked out a rapid recovery configuration instead. They could pay for more downtime than they could reasonably expect and still be far ahead of non-stop—and it left a budget for disaster recovery which non-stop hardware doesn't address.

If they were a telecom or a financial firm, sure—buy the Tandem or something else. But then they wouldn't have been talking to me about what they needed; that wasn't my domain.
 

Futurist

Joined Apr 8, 2025
720
Tandem was overshadowed by Stratus in the 1980s. Unlike Tandem's complex software-based mechanism (which required developers to periodically embed checkpointing calls), Stratus's fault tolerance was all hardware. It was hot-pluggable: boards could be removed and inserted while the system was running.

User software was oblivious to the fault tolerance: no checkpointing or other such burdens.
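For contrast, Tandem's checkpointing burden can be sketched roughly like this. This is a toy illustration, not Tandem's actual API; all class and method names here are invented. The key point is that the application programmer had to decide where to call the checkpoint, whereas on Stratus the application never knew.

```python
# Toy sketch of process-pair checkpointing (illustrative only; all names
# invented, not Tandem's real interface). The primary process periodically
# ships its state to a backup; if the primary fails, the backup resumes
# from the last checkpoint.

class BackupProcess:
    def __init__(self):
        self.state = None  # last checkpointed state

    def receive_checkpoint(self, state):
        self.state = dict(state)  # copy, so later primary mutations don't leak

    def take_over(self):
        # Resume from the last checkpoint; any work since then is lost or replayed.
        return self.state

class PrimaryProcess:
    def __init__(self, backup):
        self.backup = backup
        self.state = {"txn_count": 0}

    def process_transaction(self):
        self.state["txn_count"] += 1
        # The developer had to decide where to embed this call --
        # the "burden" the post above refers to.
        self.backup.receive_checkpoint(self.state)

backup = BackupProcess()
primary = PrimaryProcess(backup)
for _ in range(3):
    primary.process_transaction()

# Primary "fails"; the backup resumes with the last checkpointed state.
resumed = backup.take_over()
print(resumed["txn_count"])  # 3
```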

A logical CPU consisted of two identical physical boards; each board had duplicate 68020 CPUs and all ancillary logic. These boards contained comparator logic that detected the slightest difference between the two CPUs and disabled the board whenever any such difference emerged.

So the logical processor was four processors running in lockstep 24/7. If a board took itself out of service, the partner board simply continued merrily.

This same architecture was used for memory boards and IO controller boards too.
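The pair-of-pairs arrangement can be modeled as a small simulation. This is my own toy sketch, assuming a simplified one-result "instruction"; the class names and the single-bit fault injection are invented for illustration, not Stratus internals.

```python
# Toy model of the pair-of-pairs lockstep scheme (all names invented).
# Each board holds two CPUs executing identical instructions; a comparator
# disables the board on any disagreement, and the partner board carries on.

class Board:
    def __init__(self, name):
        self.name = name
        self.in_service = True

    def execute(self, instruction, fault=False):
        # Both CPUs on the board compute the result; a fault flips one bit
        # in one CPU's output to model a transient hardware error.
        result_a = instruction()
        result_b = instruction() ^ (1 if fault else 0)
        if result_a != result_b:       # comparator: any mismatch at all
            self.in_service = False    # board takes itself out of service
            return None
        return result_a

class LogicalCPU:
    def __init__(self):
        self.boards = [Board("A"), Board("B")]

    def execute(self, instruction, faulty_board=None):
        results = [
            r for board in self.boards if board.in_service
            if (r := board.execute(instruction,
                                   fault=(board.name == faulty_board))) is not None
        ]
        # Any surviving board's result is the system's result; user code
        # never sees the failure.
        return results[0]

cpu = LogicalCPU()
print(cpu.execute(lambda: 2 + 2))                    # 4, both boards agree
print(cpu.execute(lambda: 2 + 2, faulty_board="A"))  # 4, board A drops out silently
print([b.in_service for b in cpu.boards])            # [False, True]
```

Note that no voting is needed at the system level: a board either agrees with itself (and its result is trusted) or it disagrees with itself (and it withdraws). This answers a question raised later in the thread.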

It was (still is) very impressive hardware engineering, I worked on their systems for over a decade in stock exchanges and trading firms.
 

Thread Starter

nsaspook

Joined Aug 27, 2009
16,249
Times have changed: now hardware is so reliable that it's software that causes most outages. We had checkpointing in the Harris minicomputers I installed on ships.
 

Futurist

Joined Apr 8, 2025
720
Times have changed: now hardware is so reliable that it's software that causes most outages. We had checkpointing in the Harris minicomputers I installed on ships.
This is true.

I recall being in meetings where management were saying stuff like "So why was our feed handler down, when we just spent a ton on these ultra reliable computers?". It wasn't a fun conversation.
 

atferrari

Joined Jan 6, 2004
5,001
A logical CPU consisted of two identical physical boards, each board had duplicate 68020 CPUs and all ancillary logic. These boards contained comparator logic that detected the slightest difference between the two CPUs and would disable the board whenever any such difference emerged.

So the logical processor was four processors running in lockstep 24/7. If a board took itself out of service, the partner board simply continued merrily.
What logic supports the decision of one of the "comparators" to disable the other(s)?

How does a board know it is not the faulty one?
Majority voting in action?
 

SamR

Joined Mar 19, 2019
5,470
Redundancy! One of my worst days was when I went to an operations area to back up a control system computer. It was an 80286 machine running Xenix (a Unix variant) and Foxboro DCS control software. It had only a single floppy drive, and it wasn't working. I'd already tried cleaning it (it was a dusty area) to no avail. So I opened it up, put in a fresh floppy drive, and it would not reboot. Operations were losing ~25k USD/hour. I spent the next 18 hours nonstop feeding in 360k floppy disks and compiling them to rebuild the base control system software; only then could I load the backup control configurations.

Never again. I inherited that system, but every DCS control system I specified and built afterward had redundant operator stations, so one could be used for engineering configuration, backup, and upgrades while the area operator had full access to the other station for alarm handling and control. Plus, I always specified UPS power for controlled shutdowns in the event of a blackout.

It would be hard to convince me that a single computer station, no matter how good, would suffice. But then, what the computer is controlling has to have redundant input capabilities as well.
 

Futurist

Joined Apr 8, 2025
720
What logic supports the decision of one of the "comparators" to disable the other(s)?

How does a board know it is not the faulty one?
Majority voting in action?
One logical processor consists of two identical, separately pluggable boards. Each board has two processors and two sets of support chips, plus comparator hardware. When running, all four processor chips execute identical instructions in lockstep.

If a board's comparator detects any difference between the signals of the two processors on the board, it disables the board and interrupts the OS. The board is then designated as "out of service" while the other board carries on with no impact on user code; the remaining board is designated as "running alone".

Finally, the event is surfaced and a call is sent out to order a replacement board, which usually arrives within a few hours by courier. I demo'd this several times when I ran software teams in London.

The entire system was superbly engineered. The OS, though proprietary, was very high quality and was designed by a team led by Bob Freiburghouse (I learned last year that he tried to hire Dave Cutler to build the OS but could not, as Microsoft had just snapped him up).
 

Futurist

Joined Apr 8, 2025
720
I should add, too, that in practice the system would run tests on an out-of-service board; this was a standard part of the protocol. The board would self-test and report whether the error was permanent or transient. I saw evidence of these events in the OS logs. It was uncommon, but a few times a year we'd see in the log that a board had gone out of service and then been put back into service, resynchronized with its partner a few minutes later.

All these events were handled invisibly, with zero impact on customer applications. All this data was also sent to the customer assistance center (CAC) over dialup (back in the early 80s, anyway), so Stratus had a database recording the history of every hardware fault or glitch on every customer's machine anywhere in the world.
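The service protocol described above amounts to a small state machine. The sketch below uses my own state labels, not Stratus terminology: a board that trips its comparator goes out of service, self-tests, and is resynchronized with its partner if the fault was transient, or flagged for replacement if permanent.

```python
# Toy state machine for the board service protocol (states and log strings
# are invented labels, not Stratus terminology).

RUNNING, OUT_OF_SERVICE, FAILED = "running", "out_of_service", "failed"

class BoardService:
    def __init__(self):
        self.state = RUNNING
        self.log = []  # stands in for the OS event log mentioned above

    def comparator_trip(self):
        # Comparator mismatch: board removes itself from service.
        self.state = OUT_OF_SERVICE
        self.log.append("board out of service")

    def self_test(self, transient):
        assert self.state == OUT_OF_SERVICE
        if transient:
            # Transient fault: resynchronize with the partner and rejoin.
            self.state = RUNNING
            self.log.append("self-test passed; resynced with partner, back in service")
        else:
            # Permanent fault: stay down, order a replacement via the CAC.
            self.state = FAILED
            self.log.append("permanent fault; replacement ordered via CAC")

svc = BoardService()
svc.comparator_trip()
svc.self_test(transient=True)
print(svc.state)  # running
print(svc.log)
```

The application never consults this machine at all; only the OS and the service logs see these transitions, which is the sense in which the events are "processed invisibly".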

There was a great guy at their London headquarters too, named Bob Croft. He taught OS internals courses for Stratus, and I attended quite a few of them. He had in fact been a senior trainer for Tandem, and Stratus somehow poached him.
 