Fault tolerant HW and SW systems

danadak · Aug 23, 2018

I am curious, so I would like others to post: why use DMA?

So far we have

1) Speed
2) Fault Tolerance

So community, what other compelling reasons do you have for using DMA?

Regards, Dana.

MrChips · Aug 23, 2018

I have never considered fault tolerance to be a reason for using DMA. If I had to implement a fault tolerant system I would be looking at redundancy.
At the simplest level, embedded systems have a watch-dog timer for crash recovery.
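As a rough illustration of the watch-dog idea, here is a small software model in C. It is only a sketch: the names `watchdog_t`, `wdt_init`, `wdt_kick` and `wdt_tick` are made up for this example and do not correspond to any particular MCU's registers or vendor API. On real hardware the countdown runs in a dedicated peripheral and expiry forces a reset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Software model of a watchdog timer (hypothetical names, not a real MCU API). */
typedef struct {
    uint32_t timeout_ticks;  /* reload value */
    uint32_t remaining;      /* ticks left before the watchdog fires */
} watchdog_t;

void wdt_init(watchdog_t *w, uint32_t timeout_ticks) {
    w->timeout_ticks = timeout_ticks;
    w->remaining = timeout_ticks;
}

/* Called by healthy application code to show it is still running. */
void wdt_kick(watchdog_t *w) {
    w->remaining = w->timeout_ticks;
}

/* Called once per timer tick; returns true when a reset would be forced. */
bool wdt_tick(watchdog_t *w) {
    if (w->remaining > 0) {
        w->remaining--;
    }
    return w->remaining == 0;
}
```

If the main loop hangs and stops calling `wdt_kick`, `wdt_tick` eventually returns true, which in this model stands in for the hardware reset.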

danadak · Aug 23, 2018

I have never considered fault tolerance to be a reason for using DMA.

NASA is probably the best resource for this discussion. Tons of work on both SW and HW
fault tolerance. I am reading an interesting paper now, as I am weak in this area and
have been curious about how those systems achieve their fault tolerance.

As you point out, one resorts to HW, a watchdog, as a crude method for fault detection.
More advanced medical products, like injection pumps, use dual processors to achieve
a somewhat higher level of robustness.

I think that in an earth-based system, in a non-radioactive environment, HW is more fault
tolerant than SW. SW faults in systems with memory managers and multi-threaded
applications that lean heavily on the stack are, I think, more problematic. One can ask
about environmental effects, like noise, on HW. Those we can usually design for.
But complex SW applications are never tested for all possible states of a system, and we
largely rely on the belief that logical SW modules provide a firewall against fault propagation.
Then we funnel all that through stack operations and memory managers and hope for
the best. It mostly works. To wit, how many times has one's PC hung, frozen, or tanked in
a career? I would posit that's mostly SW inadequacies in multi-million-line code
systems.

So I think any process I can do in HW, partially or fully, is more robust. So for me DMA
is one part of the solution, when possible.

I think we agree to disagree. But I am interested in what other designers think about
DMA and reasons for use. Should be instructive, at least for me.

Regards, Dana.

MrChips · Aug 23, 2018

I certainly disagree. I have not known of any PC to be fault tolerant.

You can make HW robust only so far when you consider the abuse it has to put up with:

power supply variations
brownout
power line glitch
ground bounce
temperature fluctuations
component degradation
over-temperature abuse
EMI
alpha radiation
manufacturing defects
creeping solder
lead-free solder whiskers
skewing of clock and data transitions
transmission line reflections

At the lowest level (speaking of commercial PCs), the only mechanisms I am aware of that were used as an attempt to incorporate some degree of robustness are memory parity checks, CRC in data storage, parity and error detection and correction in data transmission, brown-out detection in the power supply system and watch-dog timers in the processor.
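For readers who have not implemented these checks, the sketch below shows two of the mechanisms mentioned: a parity bit for one byte and a bitwise CRC-32. The function names are illustrative. The CRC uses the common reflected polynomial 0xEDB88320 (the zlib/Ethernet variant), which is an assumption here since the post does not say which CRC any given storage device uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Even-parity bit for one byte: 1 if the byte has an odd number of set bits. */
uint8_t parity_bit(uint8_t byte) {
    uint8_t p = 0;
    while (byte) {
        p ^= (uint8_t)(byte & 1u);
        byte >>= 1;
    }
    return p;
}

/* Bitwise CRC-32, reflected polynomial 0xEDB88320, init and final XOR 0xFFFFFFFF. */
uint32_t crc32_bitwise(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return ~crc;
}
```

Both are detection mechanisms only: a stored parity bit or CRC that no longer matches the recomputed value says the data changed, not how to repair it.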

Fault-tolerant systems rely on redundancy, plain and simple: for example, keeping a copy of the data and OS status (disk drive directory, etc.) and using majority-vote systems.

I will stick my neck out and bluntly claim that ALL SW faults are human induced. That is, it is the fault of the programmer, OS, or platform design.
For example, memory leaks are programmer errors. You can program some level of protection into the OS, for example, memory protection. OSes have become way too complex to protect from a rogue program. As an example, how come we cannot secure even the world's most secure system from hackers?

I can create a system that cannot be hacked but that is another story.

My point is, you can make SW fault tolerant but it is more difficult to do so with a single piece of HW. Fault tolerant systems rely on hardware redundancy.

Papabravo · Aug 23, 2018

You ask a surprising question for unfathomable reasons. The reasons to use DMA have little or nothing to do with fault tolerance IMHO. They have everything to do with overlapping I/O operations with processor activity. In particular, DMA eliminates the need for the processor to engage in busy waiting, typical of character I/O devices.
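The overlap argument can be made concrete with a toy cost model in C. This is not driver code and it ignores bus contention and interrupt latency. The names and the `overhead_ticks` parameter are assumptions invented for the illustration. It simply counts how many CPU ticks are left for other work during a block transfer.

```c
#include <stdint.h>

typedef struct {
    uint32_t total_ticks;  /* elapsed time for the whole transfer */
    uint32_t work_ticks;   /* ticks the CPU could spend on other tasks */
} transfer_cost_t;

/* Polled character I/O: the CPU spins on a ready flag for every byte. */
transfer_cost_t polled_transfer(uint32_t nbytes, uint32_t ticks_per_byte) {
    transfer_cost_t c = { 0, 0 };
    for (uint32_t i = 0; i < nbytes; i++) {
        for (uint32_t t = 0; t < ticks_per_byte; t++) {
            c.total_ticks++;  /* busy waiting, no useful work done */
        }
    }
    return c;
}

/* DMA: the CPU pays only setup and completion overhead, the rest is free. */
transfer_cost_t dma_transfer(uint32_t nbytes, uint32_t ticks_per_byte,
                             uint32_t overhead_ticks) {
    transfer_cost_t c;
    c.total_ticks = nbytes * ticks_per_byte;
    c.work_ticks = (c.total_ticks > overhead_ticks)
                       ? c.total_ticks - overhead_ticks
                       : 0;
    return c;
}
```

In this model a 100-byte transfer at 10 ticks per byte takes 1000 ticks either way, but with 20 ticks of assumed DMA overhead the processor keeps 980 of them for other work, versus none when polling.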

danadak · Aug 24, 2018

I certainly disagree. I have not known of any PC to be fault tolerant.

Not sure why you think a claim was made that PCs are fault tolerant. Quite the contrary, I am
in partial agreement, for the reasons you point out (some fault tolerance measures
are in place) -

At the lowest level (speaking of commercial PCs), the only mechanisms I am aware of that
were used as an attempt to incorporate some degree of robustness are memory parity checks,
CRC in data storage, parity and error detection and correction in data transmission, brown-out
detection in the power supply system and watch-dog timers in the processor.

My point is, you can make SW fault tolerant but it is more difficult to do so with a single piece
of HW.

Totally agree, a single piece of HW can only contribute part of the solution. No claim was made
that a single piece of HW is the ultimate solution. That would be crazy talk. Total and complete
fault tolerance is a goal, not a reality.

power supply variations
brownout
power line glitch
ground bounce
temperature fluctuations
component degradation
over-temperature abuse
EMI
alpha radiation
manufacturing defects
creeping solder
lead-free solder whiskers
skewing of clock and data transitions
transmission line reflections

These we mostly can design for.

I have seen several NASA papers discussing the importance of message delivery from
CPU to I/O and fault-tolerant design considerations. All agree that HW is the primary
focus for fixing this, with SW redundancy considered a secondary approach for message verification.

Regards, Dana.

ArakelTheDragon · Aug 24, 2018

danadak said:
NASA is probably the best resource for this discussion. Tons of work on both SW and HW
fault tolerance. I am reading an interesting paper now, as I am weak in this area and
have been curious about how those systems achieve their fault tolerance.

Do not read what NASA tells you. They only try to justify their budget (that is why they say we have 8 planets today).

Alec_t · Aug 24, 2018

It's my understanding that (some) aircraft have triple-redundant IT systems. Three different processors run three different independently-written programmes and a majority vote is taken of their outputs. Not sure what happens if the voting system develops a fault.
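A two-out-of-three vote of the kind described can be sketched in a few lines of C. This is an illustration only, with made-up names (`vote_result_t`, `vote_2oo3`), and it compares plain integers, whereas a real avionics voter would also have to deal with timing skew and tolerance bands on analog values.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int32_t value;  /* agreed output, meaningful only if ok is true */
    bool ok;        /* false when all three channels disagree */
    int faulty;     /* index 0..2 of the outvoted channel, or -1 if none */
} vote_result_t;

/* Majority vote over three independently computed channel outputs. */
vote_result_t vote_2oo3(int32_t a, int32_t b, int32_t c) {
    vote_result_t r = { 0, true, -1 };
    if (a == b) {
        r.value = a;
        if (c != a) {
            r.faulty = 2;   /* channel c was outvoted */
        }
    } else if (a == c) {
        r.value = a;
        r.faulty = 1;       /* channel b was outvoted */
    } else if (b == c) {
        r.value = b;
        r.faulty = 0;       /* channel a was outvoted */
    } else {
        r.ok = false;       /* no majority exists */
    }
    return r;
}
```

Note that the voter in this sketch is itself a single point of failure, which is exactly the open question raised above. The code does nothing to address it.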

MrChips · Aug 24, 2018

I certainly disagree. I have not known of any PC to be fault tolerant.

Sorry, I should not have put those two statements on the same line.

I certainly disagree.
I disagree that HW is more fault tolerant than SW. Fault tolerant systems employ redundancy.
You never hear of a programmer doing a conditional test twice or performing the same calculation twice. Redundancy is created by replicating the hardware or performing the same operations on separate systems. Faults in SW are programming errors which can be avoided by systematic programming and diligent testing procedures.

HW faults are much harder to mitigate because there are too many environmental parameters beyond one's control.

I have not known of any PC to be fault tolerant.

This was in response to your comment:
To wit, how many times has one's PC hung, frozen, or tanked in
a career? I would posit that's mostly SW inadequacies in multi-million-line code
systems.

And I would posit that SW failures are a result of poor design, poor due diligence and systems that have become too complex and non-deterministic. PC systems are the classic examples.

danadak · Aug 24, 2018

First order principles -

Google "nasa embedded fault tolerance", several papers, handbook.

Regards, Dana.

MrChips · Aug 24, 2018

Here is some food for thought.

How many times have we made this error or seen someone make this error:

if ( A = B )
{
}

This is a programmer's error not rejected by the compiler or software development system.
Should the compiler detect this as an error or flag it as a warning?
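On the compiler question: GCC and Clang both accept `if ( A = B )` as valid C but warn about it when warnings are enabled (in GCC the relevant option is -Wparentheses, which -Wall turns on). The sketch below shows what the mistaken condition actually evaluates to, plus one defensive habit: declaring operands `const` so that an accidental assignment becomes a compile error. The function names are invented for the example.

```c
#include <stdbool.h>

/*
 * The buggy form, shown only as a comment so this file compiles cleanly:
 *
 *     if (a = b) { ... }   // assigns b to a, then tests whether b is nonzero
 */

/* const parameters turn an accidental "a = b" into a compile-time error. */
bool is_equal(const int a, const int b) {
    if (a == b) {
        return true;
    }
    return false;
}

/* What the buggy condition actually computes, written out explicitly. */
bool buggy_condition_value(int a, int b) {
    a = b;          /* the accidental assignment */
    return a != 0;  /* the value that ends up being tested */
}
```

So with A = 1 and B = 2 the mistaken test is taken as true even though the values differ, and with B = 0 it is false no matter what A was.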

Or another one:
using two or more variable identifiers that are similar but using the wrong one:
Example:
int ThisHour, thisHour, thishour;

Btw, the Y2K bug was a human-introduced error resulting from poor design/engineering practice and methodology.

And we still continue to invite catastrophes when we write our dates as 08/07/09.
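A date such as 08/07/09 can be read at least three ways depending on local convention. One common remedy is the ISO 8601 form YYYY-MM-DD, which is unambiguous and also sorts correctly as plain text. A minimal C sketch using the standard `strftime` follows. The helper names `format_iso_date` and `iso_date_before` are made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Formats a calendar date as YYYY-MM-DD; returns characters written, 0 on failure. */
size_t format_iso_date(int year, int month, int day, char *out, size_t out_size) {
    struct tm t = { 0 };
    t.tm_year = year - 1900;  /* struct tm counts years from 1900 */
    t.tm_mon = month - 1;     /* months are 0..11 */
    t.tm_mday = day;
    return strftime(out, out_size, "%Y-%m-%d", &t);
}

/* ISO dates of equal length compare chronologically as ordinary strings. */
bool iso_date_before(const char *a, const char *b) {
    return strcmp(a, b) < 0;
}
```

Calling `format_iso_date(2009, 7, 8, buf, sizeof buf)` fills `buf` with `2009-07-08`, leaving no doubt about which field is the month.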

ArakelTheDragon · Aug 24, 2018

I do not believe that is true

MrChips said:
And we still continue to invite catastrophes when we write our dates as 08/07/09.

The poor engineering practices get even worse: they are written into government requirements. And when I tell someone that dates have to be written day/month/year so that there is a consistent order, he replies, with hatred and contempt, "that is not how we do it here."

danadak · Aug 24, 2018

How many times have we made this error or seen someone make this error:

if ( A = B )
{
}

I can personally attest this has been very effective in my self-induced personal hair loss program.

Regards, Dana.
