Fault tolerant HW and SW systems

Thread Starter

danadak

Joined Mar 10, 2018
4,057
I am curious, and would like others to post too: why use DMA?

So far we have

1) Speed
2) Fault Tolerance

So, community, what other compelling reasons do you have for using DMA?

Regards, Dana.
 

MrChips

Joined Oct 2, 2009
30,706
I have never considered fault tolerance to be a reason for using DMA. If I had to implement a fault-tolerant system, I would be looking at redundancy.
At the simplest level, embedded systems have a watchdog timer for crash recovery.
 

danadak

Joined Mar 10, 2018
4,057
I have never considered fault tolerance to be a reason for using DMA.

NASA is probably the best resource for this discussion, with tons of work on both SW and HW
fault tolerance. I am reading an interesting paper now, as I am weak in this area and
have been curious about how those systems achieve their fault tolerance.

As you point out, one resorts to HW, such as a watchdog, as a crude method of fault detection.
More advanced medical products, like injection pumps, use dual processors to achieve
a somewhat higher level of robustness.

I think that in an earth-based system, in a non-radioactive environment, HW is more fault
tolerant than SW. SW faults in systems with memory managers and many-threaded
applications using stack processes are, I think, more problematic. One can argue that
environmental effects, like noise, affect HW, but those we can usually design for.
Complex SW applications, however, are never tested for all possible states of a system;
we largely rely on the belief that logical SW modules provide a firewall against fault
propagation. Then we funnel all that through stack operations and memory managers and hope
for the best. It mostly works. To wit, how many times has one's PC hung, frozen, or tanked
in a career? I would posit that is mostly due to SW inadequacies in multi-million-line code
systems.

So I think any process I can do in HW, partially or fully, is more robust. So for me DMA
is one part of the solution, when possible.

I think we agree to disagree. But I am interested in what other designers think about
DMA and reasons for use. Should be instructive, at least for me.

Regards, Dana.
 

MrChips

Joined Oct 2, 2009
30,706
I certainly disagree. I have not known of any PC to be fault tolerant.

You can make HW robust only so far when you consider the abuse it has to put up with:
  • power supply variations
  • brownout
  • power line glitch
  • ground bounce
  • temperature fluctuations
  • component degradation
  • over-temperature abuse
  • EMI
  • alpha radiation
  • manufacturing defects
  • creeping solder
  • lead-free solder whiskers
  • skewing of clock and data transitions
  • transmission line reflections

At the lowest level (speaking of commercial PCs), the only mechanisms I am aware of that attempt to incorporate some degree of robustness are memory parity checks, CRC in data storage, parity and error detection and correction in data transmission, brown-out detection in the power supply system, and watchdog timers in the processor.
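
For readers who have not met parity and CRC checks, here is a rough C sketch of both. The function names and the CRC-8 polynomial (0x07, initial value 0x00) are assumptions chosen for illustration; real memory and storage systems use their own, usually wider, codes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Even-parity bit for one byte: 1 when the byte holds an odd number of set
   bits, so that byte plus parity bit together carry an even number of ones. */
static uint8_t parity_bit(uint8_t byte)
{
    uint8_t p = 0;
    while (byte != 0) {
        p ^= (uint8_t)(byte & 1u);
        byte >>= 1;
    }
    return p;
}

/* Bitwise CRC-8 with polynomial x^8 + x^2 + x + 1 (0x07), initial value 0x00.
   The polynomial is an assumption made for this example. */
static uint8_t crc8(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x80u) {
                crc = (uint8_t)((crc << 1) ^ 0x07u);
            } else {
                crc = (uint8_t)(crc << 1);
            }
        }
    }
    return crc;
}
```

A sender appends `crc8(msg, len)` to the message. The receiver recomputes the CRC over message plus appended byte and expects zero; any single flipped bit gives a nonzero result. Note that both checks only detect errors; correcting them needs redundancy on top.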

Fault-tolerant systems rely on redundancy, plain and simple: for example, keeping a copy of the data and OS status (disk drive directory, etc.) and using majority-vote systems.

I will stick my neck out and bluntly claim that ALL SW faults are human-induced. That is, they are the fault of the programmer, the OS, or the platform design.
For example, memory leaks are programmer errors. You can program some level of protection into the OS, for example, memory protection. But operating systems have become far too complex to protect against a rogue program. As an example, how come we cannot secure even the world's most secure system from hackers?

I can create a system that cannot be hacked but that is another story.

My point is, you can make SW fault tolerant, but it is more difficult to do so with a single piece of HW. Fault-tolerant systems rely on hardware redundancy.
 

Papabravo

Joined Feb 24, 2006
21,158
You ask a surprising question for unfathomable reasons. The reasons to use DMA have little or nothing to do with fault tolerance, IMHO. They have everything to do with overlapping I/O operations with processor activity. In particular, DMA eliminates the need for the processor to engage in busy waiting, which is typical of character I/O devices.
 

danadak

Joined Mar 10, 2018
4,057
I certainly disagree. I have not known of any PC to be fault tolerant.
Not sure why you think a claim was made that PCs are fault tolerant. Quite the contrary: I am
in partial agreement, for the reasons you point out (some fault tolerance measures
that are done) -

At the lowest level (speaking of commercial PCs), the only mechanisms I am aware of that
attempt to incorporate some degree of robustness are memory parity checks,
CRC in data storage, parity and error detection and correction in data transmission, brown-out
detection in the power supply system, and watchdog timers in the processor.
My point is, you can make SW fault tolerant, but it is more difficult to do so with a single piece
of HW.
Totally agree, a single piece of HW can only contribute part of the solution. No claim was made
that a single piece of HW is the ultimate solution. That would be crazy talk. Total and complete
fault tolerance is a goal, not a reality.

  • power supply variations
  • brownout
  • power line glitch
  • ground bounce
  • temperature fluctuations
  • component degradation
  • over-temperature abuse
  • EMI
  • alpha radiation
  • manufacturing defects
  • creeping solder
  • lead-free solder whiskers
  • skewing of clock and data transitions
  • transmission line reflections
These we mostly can design for.

I have seen several NASA papers discussing the importance of message delivery from
CPU to I/O and fault-tolerant design considerations, all in agreement that HW is the primary
means of fixing this, with SW redundancy considered a secondary approach for message verification.

Regards, Dana.
 

ArakelTheDragon

Joined Nov 18, 2016
1,362
Do not read what NASA tells you. They only try to justify their budget (that is why they say we have 8 planets today).
 

Alec_t

Joined Sep 17, 2013
14,280
It's my understanding that (some) aircraft have triple-redundant IT systems. Three different processors run three different independently-written programmes and a majority vote is taken of their outputs. Not sure what happens if the voting system develops a fault :).
 

MrChips

Joined Oct 2, 2009
30,706
I certainly disagree. I have not known of any PC to be fault tolerant.

Sorry, I should not have put those two statements on the same line.

I certainly disagree.
I disagree that HW is more fault tolerant than SW. Fault tolerant systems employ redundancy.
You never hear of a programmer doing a conditional test twice or performing the same calculation twice. Redundancy is created by replicating the hardware or by performing the same operations on separate systems. Faults in SW are programming errors, which can be avoided by systematic programming and diligent testing procedures.

HW faults are much harder to mitigate because there are too many environmental parameters beyond one's control.

I have not known of any PC to be fault tolerant.

This was in response to your comment:
To wit, how many times has one's PC hung, frozen, or tanked
in a career? I would posit that is mostly due to SW inadequacies in multi-million-line code
systems.

And I would posit that SW failures are a result of poor design, poor due diligence and systems that have become too complex and non-deterministic. PC systems are the classic examples.
 

danadak

Joined Mar 10, 2018
4,057
First order principles -

Google "nasa embedded fault tolerance", several papers, handbook.

Regards, Dana.
 

MrChips

Joined Oct 2, 2009
30,706
Here is some food for thought.

How many times have we made this error or seen someone make this error:

if ( A = B )
{
}

This is a programmer's error not rejected by the compiler or software development system.
Should the compiler detect this as an error or flag it as a warning?
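
For anyone who has not been bitten by this yet, here is a small C sketch of the difference. The function names are made up for illustration. GCC with `-Wall` does flag the unparenthesized form through `-Wparentheses`, so the question is less hypothetical than it sounds; the doubled parentheses below exist only so the sketch compiles without that warning.

```c
#include <assert.h>
#include <stdbool.h>

/* Buggy: '=' assigns b to a, so the test only asks "is b nonzero?". */
static bool is_equal_buggy(int a, int b)
{
    if ((a = b)) {  /* doubled parentheses silence the compiler warning */
        return true;
    }
    return false;
}

/* Intended: '==' compares the two values and modifies nothing. */
static bool is_equal_fixed(int a, int b)
{
    return a == b;
}

/* Small self-check that documents the difference in behaviour. */
static void check_assignment_pitfall(void)
{
    assert(is_equal_buggy(1, 2));   /* wrong answer: 1 != 2, yet b is nonzero */
    assert(!is_equal_buggy(0, 0));  /* wrong answer: 0 == 0, yet b is zero */
    assert(!is_equal_fixed(1, 2));
    assert(is_equal_fixed(0, 0));
}
```

Calling `check_assignment_pitfall()` from `main` passes silently, which is exactly the problem: the buggy version is perfectly legal C.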

Or another one:
using two or more variable identifiers that are similar but using the wrong one:
Example:
int ThisHour, thisHour, thishour;

Btw, the Y2K bug was a human-introduced error resulting from poor design/engineering practice and methodology.

And we still continue to invite catastrophes when we write our dates as 08/07/09.
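
One way around the ambiguity is to always emit dates in ISO 8601 order (year, month, day, with a four-digit year). Below is a hedged C sketch: `format_iso_date` is a hypothetical helper, and its range checks are deliberately crude (it does not know that February has fewer than 31 days).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write the date as YYYY-MM-DD into buf. Returns the number of characters
   written (always 10) or -1 on invalid input or a buffer that is too small. */
static int format_iso_date(char *buf, size_t size, int year, int month, int day)
{
    assert(buf != NULL);
    if (year < 0 || year > 9999 || month < 1 || month > 12 ||
        day < 1 || day > 31) {
        return -1;
    }
    int n = snprintf(buf, size, "%04d-%02d-%02d", year, month, day);
    return (n == 10 && (size_t)n < size) ? n : -1;
}
```

With an 11-byte buffer, `format_iso_date(buf, sizeof buf, 2009, 7, 8)` produces `2009-07-08`, which cannot be misread the way 08/07/09 can. As a bonus, the strings sort chronologically.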
 

ArakelTheDragon

Joined Nov 18, 2016
1,362
I do not believe that is true
The poor engineering practices only multiply: they are written into government requirements, and when I tell someone that dates have to be written day/month/year so that we have a consistent order, he replies, with hatred and contempt, that this is not how we do it here.
 

danadak

Joined Mar 10, 2018
4,057
How many times have we made this error or seen someone make this error:

if ( A = B )
{
}
I can personally attest that this has been very effective in my self-induced personal hair loss program. :)

Regards, Dana.
 