Computers Built Not to Fail

Ya’akov

Joined Jan 27, 2019
10,226
When I saw the title of your post, Tandem immediately came to mind. I remember clients who talked about wanting "non-stop" hardware in the 80s/90s and I would ask them why. They told me they "couldn't afford downtime".

When I asked how much downtime cost them, they had no answer. So we did some math and downtime did look very expensive—until I showed them what non-stop would cost. Then, we worked out a rapid recovery configuration instead. They could pay for more downtime than they could reasonably expect and still be far ahead of non-stop—and it left a budget for disaster recovery which non-stop hardware doesn't address.

If they were a telecom or a financial firm, sure—buy the Tandem or something else. But then they wouldn't have been talking to me about what they needed; that wasn't my domain.
 

Futurist

Joined Apr 8, 2025
720
Tandem was overshadowed by Stratus in the 1980s. Unlike Tandem's complex software-based mechanism (which required developers to periodically embed checkpointing calls), Stratus's fault tolerance was all hardware. It was hot-pluggable: boards could be removed and inserted while the system was running.

User software was oblivious to the fault tolerance: no checkpointing or other such burdens.
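For contrast, Tandem's checkpointing burden can be sketched roughly like this. This is a toy illustration, not Tandem's actual API; all class and method names here are invented. The key point is that the application programmer had to decide where to call the checkpoint, whereas on Stratus the application never knew.

```python
# Toy sketch of process-pair checkpointing (illustrative only; all names
# invented, not Tandem's real interface). The primary process periodically
# ships its state to a backup; if the primary fails, the backup resumes
# from the last checkpoint.

class BackupProcess:
    def __init__(self):
        self.state = None  # last checkpointed state

    def receive_checkpoint(self, state):
        self.state = dict(state)  # copy, so later primary mutations don't leak

    def take_over(self):
        # Resume from the last checkpoint; any work since then is lost or replayed.
        return self.state

class PrimaryProcess:
    def __init__(self, backup):
        self.backup = backup
        self.state = {"txn_count": 0}

    def process_transaction(self):
        self.state["txn_count"] += 1
        # The developer had to decide where to embed this call --
        # the "burden" the post above refers to.
        self.backup.receive_checkpoint(self.state)

backup = BackupProcess()
primary = PrimaryProcess(backup)
for _ in range(3):
    primary.process_transaction()

# Primary "fails"; the backup resumes with the last checkpointed state.
resumed = backup.take_over()
print(resumed["txn_count"])  # 3
```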

A logical CPU consisted of two identical physical boards; each board had duplicate 68020 CPUs and all ancillary logic. These boards contained comparator logic that detected the slightest difference between the two CPUs and disabled the board whenever any such difference emerged.

So the logical processor was four processors running in lockstep 24/7. If a board took itself out of service, the partner board simply continued merrily.

This same architecture was used for memory boards and IO controller boards too.
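The pair-of-pairs arrangement can be modeled as a small simulation. This is my own toy sketch, assuming a simplified one-result "instruction"; the class names and the single-bit fault injection are invented for illustration, not Stratus internals.

```python
# Toy model of the pair-of-pairs lockstep scheme (all names invented).
# Each board holds two CPUs executing identical instructions; a comparator
# disables the board on any disagreement, and the partner board carries on.

class Board:
    def __init__(self, name):
        self.name = name
        self.in_service = True

    def execute(self, instruction, fault=False):
        # Both CPUs on the board compute the result; a fault flips one bit
        # in one CPU's output to model a transient hardware error.
        result_a = instruction()
        result_b = instruction() ^ (1 if fault else 0)
        if result_a != result_b:       # comparator: any mismatch at all
            self.in_service = False    # board takes itself out of service
            return None
        return result_a

class LogicalCPU:
    def __init__(self):
        self.boards = [Board("A"), Board("B")]

    def execute(self, instruction, faulty_board=None):
        results = [
            r for board in self.boards if board.in_service
            if (r := board.execute(instruction,
                                   fault=(board.name == faulty_board))) is not None
        ]
        # Any surviving board's result is the system's result; user code
        # never sees the failure.
        return results[0]

cpu = LogicalCPU()
print(cpu.execute(lambda: 2 + 2))                    # 4, both boards agree
print(cpu.execute(lambda: 2 + 2, faulty_board="A"))  # 4, board A drops out silently
print([b.in_service for b in cpu.boards])            # [False, True]
```

Note that no voting is needed at the system level: a board either agrees with itself (and its result is trusted) or it disagrees with itself (and it withdraws). This answers a question raised later in the thread.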

It was (still is) very impressive hardware engineering, I worked on their systems for over a decade in stock exchanges and trading firms.
 

Thread Starter

nsaspook

Joined Aug 27, 2009
16,249
Times have changed: now hardware is so reliable that it's software that causes most outages. We had checkpointing in the Harris minicomputers I installed on ships.
 

Futurist

Joined Apr 8, 2025
720
Times have changed: now hardware is so reliable that it's software that causes most outages. We had checkpointing in the Harris minicomputers I installed on ships.
This is true.

I recall being in meetings where management were saying stuff like "So why was our feed handler down, when we just spent a ton on these ultra reliable computers?". It wasn't a fun conversation.
 

atferrari

Joined Jan 6, 2004
5,001
A logical CPU consisted of two identical physical boards, each board had duplicate 68020 CPUs and all ancillary logic. These boards contained comparator logic that detected the slightest difference between the two CPUs and would disable the board whenever any such difference emerged.

So the logical processor was four processors running in lockstep 24/7. If a board took itself out of service, the partner board simply continued merrily.
What logic supports the decision of one of the "comparators" to disable the other(s)?

How does a board know it is not the faulty one?
Majority voting in action?
 

SamR

Joined Mar 19, 2019
5,470
Redundancy! One of my worst days was when I went to an operations area to back up a control system computer. It was an 80286 machine running Xenix (a Unix variant) and Foxboro DCS control software. It had only a single floppy drive, and it wasn't working. I'd already tried cleaning it (it was a dusty area) to no avail. So I opened it up, put in a fresh floppy drive, and it would not reboot. Operations were losing ~25k USD/hour. I spent the next 18 hours nonstop feeding in 360k floppy disks and compiling them to rebuild the base control system software; only then could I load the backup control configurations.

Never again. I inherited that system, but every DCS control system I specified and built afterward had redundant operator stations, so one could be used for engineering configuration, backup, and upgrades while the area operator had full access to the other station for alarm handling and control. Plus, I always specified UPS power for controlled shutdowns in the event of a blackout.

It would be hard to convince me that a single computer station, no matter how good, would suffice. But then, what the computer is controlling has to have redundant input capabilities as well.
 

Futurist

Joined Apr 8, 2025
720
What logic supports the decision of one of the "comparators" to disable the other(s)?

How does a board know it is not the faulty one?
Majority voting in action?
One logical processor consists of two identical, separately pluggable boards. Each board has two processors and two sets of support chips, plus comparator hardware. When running, all four processor chips execute identical instructions in lockstep.

If a board's comparator detects any difference between the signals of the two processors on the board, it disables the board and interrupts the OS. The board is then designated as "out of service" while the other board carries on with no impact on user code; the remaining board is designated as "running alone".

Finally, the event is surfaced and a call is sent out to order a replacement board, which usually arrives within a few hours by courier. I demo'd this several times when I ran software teams in London.

The entire system was superbly engineered. The OS, though proprietary, was very high quality and was designed by a team led by Bob Freiburghouse (I learned last year that he tried to hire Dave Cutler to build the OS but could not, as Microsoft had just snapped him up).
 

Futurist

Joined Apr 8, 2025
720
I should add, too, that in practice the system would run tests on an out-of-service board; this was a standard part of the protocol. The board would self-test and report whether the error was permanent or transient. I saw evidence of these events in the OS logs. It was uncommon, but a few times a year we'd see in the log that a board had gone out of service and then been put back into service, resynchronized with its partner a few minutes later.

All these events were handled invisibly, with zero impact on customer applications. All this data was also sent to the customer assistance center (CAC) over dialup (back in the early 80s, anyway), so Stratus had a database recording the history of every hardware fault or glitch on every customer's machine anywhere in the world.
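The service protocol described above amounts to a small state machine. The sketch below uses my own state labels, not Stratus terminology: a board that trips its comparator goes out of service, self-tests, and is resynchronized with its partner if the fault was transient, or flagged for replacement if permanent.

```python
# Toy state machine for the board service protocol (states and log strings
# are invented labels, not Stratus terminology).

RUNNING, OUT_OF_SERVICE, FAILED = "running", "out_of_service", "failed"

class BoardService:
    def __init__(self):
        self.state = RUNNING
        self.log = []  # stands in for the OS event log mentioned above

    def comparator_trip(self):
        # Comparator mismatch: board removes itself from service.
        self.state = OUT_OF_SERVICE
        self.log.append("board out of service")

    def self_test(self, transient):
        assert self.state == OUT_OF_SERVICE
        if transient:
            # Transient fault: resynchronize with the partner and rejoin.
            self.state = RUNNING
            self.log.append("self-test passed; resynced with partner, back in service")
        else:
            # Permanent fault: stay down, order a replacement via the CAC.
            self.state = FAILED
            self.log.append("permanent fault; replacement ordered via CAC")

svc = BoardService()
svc.comparator_trip()
svc.self_test(transient=True)
print(svc.state)  # running
print(svc.log)
```

The application never consults this machine at all; only the OS and the service logs see these transitions, which is the sense in which the events are "processed invisibly".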

There was a great guy at their London headquarters too, named Bob Croft. He taught OS internals courses for Stratus, and I attended quite a few of them. He had in fact been a senior trainer for Tandem, and Stratus somehow poached him.
 