Selftest techniques

Thread Starter

bug13

Joined Feb 13, 2012
1,809
Hi team

In the context of an MCU (no external RAM), can we write some code to do a basic self-test, e.g. of RAM and I/Os? I am not sure whether a RAM self-test is necessary. But how do we self-test I/Os?
 

jpanhalt

Joined Jan 18, 2008
8,776
I am not sure what you are driving at. For example, I usually write a testing program to test an algorithm or process. As one example, the first time I used a FIFO and linear RAM in a PIC, I self tested by loading and unloading it. I wouldn't include that in any final design as I don't do stuff that is life critical.

Are you actually asking about checking for BADRAM (i.e., bad hardware)?
 

Thread Starter

bug13

Joined Feb 13, 2012
1,809
Yes, I was referring to bad hardware, e.g. a dry joint or something like that, not testing for software bugs.
 

PaulNewf

Joined Mar 24, 2020
9
We did FMEA selftests for MCUs in appliances, given that a misbehaving appliance could burn down a house.

FMEA = Failure Mode Effects Analysis.
The first FMEA was done on the paper design to evaluate what would happen if two adjacent pins were shorted together, and whether that would be detectable in a bootup routine.
The second was a check of what would happen if any part was missing or open-circuit (this can happen over time from a cracked component or a cracked/fried trace).
If the result of such a simple evaluation was catastrophic, the MCU pinout was altered by reassigning I/O pin positions.
For the MCU, the bootup test would check for any obvious shorts to adjacent pins (procedure below). If such faults couldn't be detected at bootup then, if possible, the MCU pinout was altered by reassigning I/O pin positions to try to improve the bootup test coverage.

Basically this was the procedure on powerup:
a) For pins that were outputs, you would first set them up with a weak internal pullup or pulldown resistor (if the MCU has internal pullups/pulldowns).
- Read all the pins to see if they give the appropriate highs/lows. If not, there is possibly a short to an adjacent pin.
b) Stepping through each output pin, drive it high/low and each time check for changes in state of adjacent pins (shorts). Where possible, check for correct feedback from controlled circuits (opens, wrong components).
c) Repeat for all pins that you can drive without damaging the unit.
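
As a rough illustration of steps a) to c), here is a minimal C sketch. It is not the code we shipped. The names pin_pullup, pin_drive, pin_read, io_selftest and the sim_* variables are all hypothetical, and the "pins" are a small host-side simulation so that the logic can be tried on a PC. On a real MCU the three pin_* helpers would be replaced by register accesses, and only pins that are safe to drive should be included.

```c
#include <stdbool.h>

#define PIN_COUNT 8

/* Simulated pins: a host-side stand-in for real GPIO registers (assumption). */
static bool sim_driven[PIN_COUNT];              /* true = pin actively driven */
static bool sim_level[PIN_COUNT];               /* level of a driven pin      */
static int sim_short_a = -1, sim_short_b = -1;  /* injected short, -1 = none  */

/* Release the pin: input with weak pull-up. */
static void pin_pullup(int p) { sim_driven[p] = false; }

/* Drive the pin as an output at the given level. */
static void pin_drive(int p, bool level)
{
    sim_driven[p] = true;
    sim_level[p] = level;
}

/* Sample the pin. A driven neighbour on the same shorted net overrides
 * the weak pull-up. */
static bool pin_read(int p)
{
    int other = (p == sim_short_a) ? sim_short_b
              : (p == sim_short_b) ? sim_short_a : -1;
    if (sim_driven[p]) return sim_level[p];
    if (other >= 0 && sim_driven[other]) return sim_level[other];
    return true;  /* nothing drives the net, so the pull-up wins */
}

/* Returns -1 if no fault was seen, otherwise the first pin that read wrong. */
int io_selftest(void)
{
    /* Step a: everything on weak pull-ups, so every pin should read high. */
    for (int p = 0; p < PIN_COUNT; p++) pin_pullup(p);
    for (int p = 0; p < PIN_COUNT; p++)
        if (!pin_read(p)) return p;

    /* Steps b and c: drive each pin low in turn; all others must stay high. */
    for (int p = 0; p < PIN_COUNT; p++) {
        pin_drive(p, false);
        for (int q = 0; q < PIN_COUNT; q++) {
            if (q != p && !pin_read(q)) {
                pin_pullup(p);  /* release the pin before reporting */
                return q;
            }
        }
        pin_pullup(p);
    }
    return -1;
}
```

Setting sim_short_a = 2 and sim_short_b = 3 before calling io_selftest() makes it report pin 3, because pin 3 follows pin 2 low. The sketch only covers the pull-up/drive-low half; a real test would repeat it with pull-downs and drive-high where the hardware allows.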

This isn't an exhaustive explanation by any means, but it can really help in production quality and in field diagnostics. The unit can display or log an error message or code on bootup if something is incorrect, or log to a file if it has a USB or SD card type device, like an aircraft's black box.

Scan the internet for more detailed FMEA methods and design guidelines.

Paul
 

Thread Starter

bug13

Joined Feb 13, 2012
1,809
Thanks Paul, that's a very good reference for me to start digging deeper!
 

MrChips

Joined Oct 2, 2009
20,329
When I used to service minicomputers we had tests for CPU functionality and extensive memory tests.

Testing CPU was very thorough and extensive. The first test was the HALT instruction. Following that, every conceivable ALU and CPU operation was tested at the bit level. All possible branch conditions were tested.

Memory tests were also very extensive: all zeros, all ones, alternating zeros and ones, moving zeros and ones, random patterns, X/Y addressing cross patterns, read/write, sense amp cross effects, etc. A complete memory test suite on 16K words of RAM took hours to complete.
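
A few of the simpler patterns above (all zeros, all ones, alternating bits, a moving one and a moving zero) could be sketched in C roughly like this. It is only an illustration, not the original diagnostic. The function names are hypothetical, and the final address-dependent pass is my own loose stand-in for address-related checks, not a reproduction of the X/Y cross patterns.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fill the whole region with one pattern, then read it all back. */
static bool check_pattern(volatile uint8_t *mem, size_t len, uint8_t pattern)
{
    for (size_t i = 0; i < len; i++) mem[i] = pattern;
    for (size_t i = 0; i < len; i++)
        if (mem[i] != pattern) return false;
    return true;
}

/* Destructive test: the previous contents of mem[0..len) are lost. */
bool ram_pattern_test(volatile uint8_t *mem, size_t len)
{
    /* All zeros, all ones, and the two alternating-bit patterns. */
    static const uint8_t fixed[] = { 0x00, 0xFF, 0x55, 0xAA };
    for (size_t k = 0; k < sizeof fixed; k++)
        if (!check_pattern(mem, len, fixed[k])) return false;

    /* Moving one and moving zero through each bit position. */
    for (int bit = 0; bit < 8; bit++) {
        uint8_t one = (uint8_t)(1u << bit);
        if (!check_pattern(mem, len, one)) return false;
        if (!check_pattern(mem, len, (uint8_t)~one)) return false;
    }

    /* Address-dependent data: catches cells that alias onto each other. */
    for (size_t i = 0; i < len; i++) mem[i] = (uint8_t)(i ^ (i >> 8));
    for (size_t i = 0; i < len; i++)
        if (mem[i] != (uint8_t)(i ^ (i >> 8))) return false;

    return true;
}
```

On a real MCU the test destroys whatever is in the region, so it would normally run before the C runtime sets up variables and the stack, or on one block at a time with the contents saved and restored.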

For a modern MCU, you can do a simple checksum of the firmware in flash memory, followed by read/write tests on RAM.
Testing HW modules and I/O will require much more effort.
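
For the flash check, one possible sketch is below. Note that it uses a CRC-32 rather than a plain additive checksum, which is my substitution: a CRC catches more error patterns for little extra code. The function names are hypothetical, and where the expected value is stored (for example appended to the image by the build) is left as an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, init and final XOR of
 * 0xFFFFFFFF). Slow but tiny; a table-driven version trades flash for speed. */
uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++) {
            /* If the low bit is set, shift and XOR in the polynomial. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Compare the CRC of the firmware image against the value stored at build
 * time (how and where that value is stored is an assumption here). */
bool firmware_ok(const uint8_t *image, size_t len, uint32_t expected)
{
    return crc32_calc(image, len) == expected;
}
```

The standard check string "123456789" gives 0xCBF43926 with these parameters, which is a quick way to confirm the routine before trusting it on a real image. Many MCUs also have a hardware CRC unit that could replace the inner loop.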
 

atferrari

Joined Jan 6, 2004
3,793
Could you swear that you tested everything?
 

MrChips

Joined Oct 2, 2009
20,329
Testing CPU was very interesting.

Firstly, the reason for performing a test was either (1) there was a hard fail somewhere in the CPU or (2) programs which were already validated were failing at random for unknown reasons. Similarly, memory had to be tested to make certain that memory was not the reason for failure.

So you had a situation where you had to get a diagnostic suite into a non-functional or flaky system in the first place. The solution was simple. You had to have a working system to begin with. We would load magnetic core memory boards with diagnostic programs read from paper tape and then swap memory boards with the bad computer. Magnetic core memory was/is non-volatile. You did not need power to preserve memory contents.

This is what paper tape looks like.

1585108027540.png

This is a Data General Nova 1200 minicomputer front panel. We had Nova 2 computers. You would enter binary code directly in computer memory via the toggle switches.
1585108232122.png


Here is an amazing find. I just found this board while looking for some Nova 2 photos.
Magnetic core memory boards were constantly failing and needed frequent repairs.
This is a semiconductor replacement memory board I made for DG Nova 2 computers and it is now sitting at RICM (Rhode Island Computer Museum). It provided code in UV-EPROM and battery backed SRAM. If anyone at RICM wants info on this I would be happy to provide it. The code that is sitting in those two EPROMS would be BASIC and an assembler/disassembler that I wrote.

1585108376784.png

Here is what the magnetic core boards looked like.
1585108824716.png

Anyhow, here is what I was going to say before I got distracted.

The CPU test program was interesting. The purpose of a CPU test suite was to pinpoint any hard fault as well as to exercise all aspects of the CPU in order to find intermittent faults. So how do you test a CPU that is bad or flaky? You need a working program and a working CPU, right?

You need to test the most primitive instructions first. Every subsequent test relies on all prior operations to be functional.

The first test was to execute a HALT instruction. If the program halted, then that instruction is used in the next test.
The second test would test an ALU flag. If the test failed the program would HALT. If the test passed, the program continued to the next test. This sequence would continue until a sufficient number of machine instructions were validated. From here on more sophisticated tests would be performed. There were no outputs to any screen or printout. If a fault was encountered the CPU would HALT. You had to look at the program address on the front panel and then consult the ASM listing in order to identify the fault.
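
The ordering idea can be mimicked on a modern MCU, though only loosely. The C sketch below runs a list of checks from most primitive to least and stops at the first failure, returning the step index much as the halted program address identified the fault. All names are hypothetical. Unlike the hand-written assembly diagnostics, C gives no real control over which instructions the compiler emits, so this shows the sequencing only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef bool (*selftest_fn)(void);

/* Steps are listed from most primitive to least. 'volatile' keeps the
 * compiler from folding the checks away at build time. */
bool test_compare(void)
{
    volatile uint8_t a = 0x5A;
    return a == 0x5A;
}

bool test_logic(void)
{
    volatile uint8_t a = 0x0F, b = 0xF0;
    return (uint8_t)(a | b) == 0xFF && (uint8_t)(a & b) == 0x00;
}

bool test_shift(void)
{
    volatile uint8_t a = 0x81;
    return (uint8_t)(a << 1) == 0x02 && (uint8_t)(a >> 1) == 0x40;
}

bool test_add(void)
{
    volatile uint8_t a = 0x7F, b = 0x01;
    return (uint8_t)(a + b) == 0x80;
}

/* Demo only: stands in for a step that hits a hardware fault. */
bool step_fail_demo(void) { return false; }

/* Run the steps in order and stop at the first failure. Returns the index
 * of the failing step, or count if every step passed. On a real unit the
 * caller might halt or blink the index as an error code. */
size_t run_ladder(const selftest_fn *steps, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (!steps[i]()) return i;
    return count;
}
```

For example, running the ladder { test_compare, test_logic, test_shift, test_add } returns 4 on healthy hardware, while a ladder with step_fail_demo in second place returns 1 and never reaches the later steps.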

Those were interesting times!

Incidentally, when the IBM PC came along, I diagnosed MOBO system faults in a similar fashion. I made an SRAM pod that I would plug in place of the BIOS chip. The code in the SRAM was supplied by a second working PC. In this manner I was able to exercise any part of the CPU and MOBO.
 

MrChips

Joined Oct 2, 2009
20,329
In an exhaustive CPU diagnostic suite, the purpose is not only to test for reliability but also to pinpoint HW failures within the HW architecture. Remember that this was in the era when the CPU was composed of discrete ICs and it was possible to repair such systems at the component level.

Hence every test can only use instructions that have already been validated. The designer of the sequence of tests had to have an intimate knowledge of the architecture of the CPU logic, gates, decoders, shift registers, adders, status flags, instruction cycles and phases, etc.

Thus, before one can test a simple arithmetic operation such as ADD the contents of two registers, one would validate all shift, rotate, status flags, conditional branch operations. Even a simple memory read/write operation would be way down the test list.

Hence you can imagine that in trying to write code for even a simple test, you would be severely limited in which instructions were at your disposal.
 

nsaspook

Joined Aug 27, 2009
6,951
We repaired the UYK-20 on many systems back in the day.
The built-in microcoded diagnostic routine tests the basic micro instructions, control memory, I/O, lower 8K of memory, I/O instructions and the emulate instruction. The program loaded diagnostic routines are more comprehensive and can be loaded from external memory into computer memory as needed.
http://bitsavers.org/pdf/univac/military/an_uyk-20/PX10431C_AN_UYK-20_Technical_Description_Nov76.pdf

It was a very reliable machine, but most of the time the computers would pass all tests and still halt during normal program operation after a while. The operational spaces were hot, so thermal sensors were tripping during heavy computing. This was normally cured by engaging the:

BATTLE SHORT ON/OFF switch (two-position) BATTLE SHORT indicator light
ON position disables computer over-temperature shutdown function.
 

nsaspook

Joined Aug 27, 2009
6,951
Another thing about self-test is what happens when a system fails. Is there redundancy to keep the system safely running, and/or is there a built-in backup for things like on-board power failures?

 

Thread Starter

bug13

Joined Feb 13, 2012
1,809

Just curious, why did you use a through hole part?
 