What do you do when faced with an intermittent problem? (PIC SPI)

Thread Starter

camerart

Joined Feb 25, 2013
2,506
Hi,
This is a general question, not a specific one.

EDIT: Later answers have made it more specific to CODE and SPI, so I've changed the title.


I'm using SPI on PICs with peripherals, and have asked questions on this and other forums, and received lots of help, but my problem persists.

Sometimes my PCBs work fine, but for 'no' reason they error, so what do you do in these cases? I'm sure it all depends on what area the error(s) are in, so maybe this is an impossible question, but there it is.

Any ideas, please?
Camerart.
 

MaxHeadRoom

Joined Jul 18, 2013
23,099
I don't think there is a cut-and-dried answer. It is a case of trial and error: analyze, try to zero in on the source, and then do some definitive testing from there on.
Also it could be a code issue!
Max..
 

jpanhalt

Joined Jan 18, 2008
11,088
You try to make it happen more predictably. Failing that, how do you know when it is solved?

In my experience such problems are often dependent on specific sequences that may vary from time to time. That applies not only to code I write today, but to work I did developing tests in a clinical laboratory many years ago. As a wise doctor once told a patient, "If the pain only occurs when you do that, don't do that."
 

Thread Starter

camerart

Joined Feb 25, 2013
2,506
Hi M and J2,
I can get it to error. At switch-on it either shows correct, and stays correct each time this is repeated, or it shows an error, and the errors repeat, so the rubbish is the same each time. Does this tell you anything?
I'm suspecting timing, such as skipping a clock beat, when I force the error.
Once running, it stays one way or the other, and only(??) changes when it is restarted.

J2, have you seen Tommy Cooper? He does a DR sketch similar:)
C
 

jpanhalt

Joined Jan 18, 2008
11,088
That type of error may happen when you don't initialize (clear) registers. Although that memory is not permanent, the contents of those registers are unknown at power-up. It may be worth clearing all user RAM, or at least the RAM that is used in calculations, when starting. Of course, RAM that is written to with a movwf type of instruction doesn't need to be cleared before writing to it.
 

Thread Starter

camerart

Joined Feb 25, 2013
2,506
Hi J2,
I have noticed that when working ok, the first digit of the MISO is quite often wrong, but often repeatedly the same wrong digit, then on the next loop it is correct, which could be an indication of what you said?

I'm using BASIC, and will try to find out how to clear the RAM.
C.
 

jjw

Joined Dec 24, 2013
633
Which SPI device gives the wrong first digit?
Do you know that the first byte from the slave (MISO) is whatever happened to be in the slave's SSPBUF when the master sent the first byte?
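As a rough illustration of that point, here is a minimal Python sketch. It is a toy model of a full-duplex exchange, not PIC or Oshonsoft code; the class name `SpiSlave`, the start-up value 0x5A, and the "echo the last byte" slave behaviour are all assumptions made up for the example.

```python
class SpiSlave:
    """Toy model: the slave shifts out whatever is already in its buffer."""

    def __init__(self, power_on_value):
        # Stand-in for SSPBUF; the start-up content is arbitrary (assumed 0x5A below).
        self.sspbuf = power_on_value

    def exchange(self, byte_in):
        # One full-duplex transfer: the master receives the OLD buffer content
        # while its own byte is being clocked into the slave.
        byte_out = self.sspbuf
        # Hypothetical slave firmware: load the received byte as the next reply.
        self.sspbuf = byte_in
        return byte_out


slave = SpiSlave(power_on_value=0x5A)
replies = [slave.exchange(b) for b in (0x10, 0x11, 0x12)]
print([hex(r) for r in replies])  # ['0x5a', '0x10', '0x11']
```

The first reply is the stale start-up value, and every later reply lags the master by one transfer, which would look like "first digit wrong, then correct on the next loop".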
 

jpanhalt

Joined Jan 18, 2008
11,088
Not clear whether you are referring to bits or bytes?

In either case, since SPI is always duplex, one partner can exchange rubbish and there is no effect, depending on the purpose of the exchange. For example, if the master sends an address to the slave from or to which it intends to read or write, respectively, the slave can return garbage. Maybe what you are seeing (i.e., an exchanged bit/byte) is just showing that your SPI is working. When you don't get that garbage, then maybe that indicates SPI is not working and your system fails?

If that sounds probable, I would step through the SPI code and see where it is hanging.
 

Thread Starter

camerart

Joined Feb 25, 2013
2,506
I was interested in the general question, which has been answered. 'Try to zero in on the cause by repeatedly getting it to fail, or perhaps in the CODE'
Thanks.

Hi J and J2,
As this is specific, do you mind if we go here: https://forum.allaboutcircuits.com/threads/master-and-slave-pics-using-hwspi-in-oshonsoft.156175/
 

BobaMosfet

Joined Jul 1, 2009
1,773
It's a process. Divide & Conquer. The simple answer: get more information until you have enough to understand specifically what is failing. Then you figure out why/how.

Determine if the problem is software or hardware, and then proceed accordingly. Use tools like oscilloscopes to test signals against expectation for the hardware side, and for timing issues. If you can rule out the hardware, then you know it's software and can debug the software using the same divide & conquer process.

One of the things I do in software: I designed a debugging tool that I can turn on/off by changing a single define. It causes my code to basically dump its stackchain (all values, function names, etc.) to serial while it runs. I can then take the data dump (which is just hex) and parse it through an application I wrote that turns all the hex into a list describing every single step the software took, where it was, and what it was doing. I can then compare operation to flow-charts and confirm the logic operates accordingly, or see what any error is.

For example, the serial port might get HEX:

Code:
4145AA00AA034BAA00AA0342AA00AA02AA0EAA0E
And I then parse it with my tool, and get this:

Code:
--------------------------------------------------------------------------------------------------------------
       0.   Key Down Interrupt Called                                   [A] INT2-0
       1.   - Interrupt Accepted. Determine Row/Col & Key               [E] INT2-1
                <Row (Y): 0 / Col (X): 3>
       6.   Y,X Keypress to Byte                                        [K] GETCHARFROMROWCOL
                <Val: 'A'>
      11.   - Queue An Event                                            [B] CREATEEVENT
                <Event 'what': 0010>
                '14 / 0x000E / 00001110'                                    <Val: 14/'*'>
                '14 / 0x000E / 00001110'                                    <Val: 14/'*'>
.... You can see how useful that can be
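For anyone wanting to try the idea on a small scale, here is a minimal Python sketch of a switchable trace plus a decoder. It is not the tool described above: the event IDs simply reuse the ASCII tag letters visible in the sample ([A], [E], [K], [B]), and the fixed "one ID byte plus one value byte" framing is an assumption invented for this example.

```python
DEBUG_TRACE = True  # single switch, playing the role of the "single define"

# Hypothetical event table: ID byte -> description.
EVENT_NAMES = {
    0x41: "Key Down Interrupt Called",   # 'A'
    0x45: "Interrupt Accepted",          # 'E'
    0x4B: "Y,X Keypress to Byte",        # 'K'
    0x42: "Queue An Event",              # 'B'
}

trace_log = bytearray()  # stands in for bytes sent out of the serial port


def trace(event_id, value=0):
    """Record one step as two bytes; does nothing when tracing is off."""
    if DEBUG_TRACE:
        trace_log.extend((event_id, value & 0xFF))


def decode(hex_dump):
    """Turn a hex dump of (ID, value) pairs back into readable steps."""
    raw = bytes.fromhex(hex_dump)
    steps = []
    for index in range(0, len(raw) - 1, 2):
        event_id, value = raw[index], raw[index + 1]
        name = EVENT_NAMES.get(event_id, "Unknown event 0x%02X" % event_id)
        steps.append("%3d. %-28s <Val: %d>" % (index // 2, name, value))
    return steps


# Instrumented code would call trace() at each point of interest.
trace(0x41)
trace(0x45, 3)
trace(0x4B, ord("A"))
trace(0x42, 0x10)

dump = trace_log.hex().upper()
steps = decode(dump)
print(dump)
print("\n".join(steps))
```

Keeping the on-target side down to "append two bytes" is the design choice here; all the formatting work happens afterwards in the decoder.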
 

andrewmm

Joined Feb 25, 2011
1,467
To answer the question directly:
a) shout
b) swear
c) cry to myself

d) try to find something that makes the effect happen more / less; that is a clue as to what the cause might be
 

djsfantasi

Joined Apr 11, 2010
7,689
Try to make it happen more often. Stress parts of the system nearby the problem area. Do this in a structured manner. I even program microprocessors to simulate other components but this requires advanced coding skills...
 

Thread Starter

camerart

Joined Feb 25, 2013
2,506
Hi B,
I'm not so expert, but I can mostly see your logic. I'm working through what I get the 'feel' about and try to corner it this way. Not as sophisticated as your debug tool, which is a bit too skilled for me.
Thanks, C.
 

Thread Starter

camerart

Joined Feb 25, 2013
2,506
Hi D,
Again, a bit too advanced for my coding skills.
Cheers, C.
 

Thread Starter

camerart

Joined Feb 25, 2013
2,506
Hi A,
At the moment, I think it could be some sort of timing issue, perhaps code related, or synch.
C.
 

Thread Starter

camerart

Joined Feb 25, 2013
2,506
Hi J2,
If you follow the link in #11, I have posted a readout from the terminal.
C.
 

JohnInTX

Joined Jun 26, 2012
4,443
Good suggestions above.

As you might guess from our earlier work I like to split up things into functional modules and in this case an SPI transceiver would be one of those. The issue could be hardware or more likely software so if it were me I'd test along these lines:
Split off whatever you have going on and write a test suite that exercises the SPI and flags any errors. From the master, write 0-255 and have the slave echo back to the master whatever it gets, +1.
The master writes 00h to get things started and ignores the return value.
The slave receives 00h, increments it and puts that value in SSPBUF to send as the return value for the next byte Tx/Rx.
The master writes 01h to SPI and receives the 01h from the slave.
Master compares the values and if OK, continues with 02h etc. Wrap around at FFh-00h and keep going forever.
If there is ever a mismatch, flag the error. I like to have the scope set up on the data lines and post-triggered in the one-shot mode. That way, I can scroll back through the scope's memory to see what the signals were that caused the problem.
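The steps above can be tried out on a PC before writing any PIC code. The following is a minimal Python sketch of the echo +1 sequence, with a simulated slave and an optional injected fault. The names `EchoPlusOneSlave`, `run_link_test` and `flaky`, the 0xFF start-up value, and the single flipped bit are all assumptions made up for this illustration.

```python
class EchoPlusOneSlave:
    """Simulated slave: replies with (last byte received + 1)."""

    def __init__(self):
        self.sspbuf = 0xFF  # arbitrary start-up content (assumed)

    def exchange(self, byte_in):
        byte_out = self.sspbuf              # reply prepared on the previous transfer
        self.sspbuf = (byte_in + 1) & 0xFF  # prepare the next reply, wrapping FFh -> 00h
        return byte_out


def run_link_test(exchange, transfers):
    """Run the sequence and return a list of (expected, received) mismatches."""
    errors = []
    exchange(0x00)        # priming write; the return value is ignored
    value = 0x01
    for _ in range(transfers):
        received = exchange(value)
        if received != value:
            errors.append((value, received))  # flag the error
        value = (value + 1) & 0xFF
    return errors


def flaky(exchange, bad_transfer):
    """Wrap a link so that one chosen transfer comes back with a flipped bit."""
    count = 0

    def wrapped(byte_in):
        nonlocal count
        count += 1
        reply = exchange(byte_in)
        return reply ^ 0x01 if count == bad_transfer else reply

    return wrapped


clean_errors = run_link_test(EchoPlusOneSlave().exchange, 1000)
faulty_errors = run_link_test(flaky(EchoPlusOneSlave().exchange, 5), 1000)
print(clean_errors)   # []
print(faulty_errors)  # [(4, 5)]
```

On real hardware the `errors.append` line is where an LED, a breakpoint or the scope trigger output would go.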

When both PICs are paying full-time attention to the SPI, any issues will likely be in hardware. If you can run the sequence for a few hours without fail, you probably can assume your hardware is OK but the software isn't keeping up.

When you have this little test code working, archive it. Then whenever you encounter future problems you can reload the test code, run it and re-validate the hardware link.

Once the hardware is solid and testable (and repeatable) then you know whatever your current problem is is likely the code.... Be sure to check the SPI mode error flags.

Do remember that ANY communication between devices means that the firmware of the receiver has to be responsive enough to catch any character sent as fast as it is sent. If the receiver is off doing painfully slow OSH string manipulations when multiple characters come in, you're in trouble. In master-slave SPI you should be able to send a character and wait for the character back as a single operation, but that will take time away from other things you are doing.
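To picture what "not keeping up" looks like, here is a minimal Python sketch of a receiver with a one-byte buffer. It is a simplified model, not PIC code: the `overflow` attribute is only loosely modelled on a receive-overflow flag such as SSPOV, and the rule that the newly arrived byte is the one lost is an assumption of this model.

```python
class OneByteReceiver:
    """Toy receiver with room for exactly one unread byte."""

    def __init__(self):
        self.buffer = None
        self.overflow = False  # sticky error flag, checked by the firmware

    def byte_arrives(self, value):
        if self.buffer is not None:
            # Previous byte was never read: flag it and drop the new byte.
            self.overflow = True
            return
        self.buffer = value

    def read(self):
        value, self.buffer = self.buffer, None
        return value


def receive(stream, busy_on=frozenset()):
    """Feed bytes in; the firmware skips its read on transfers listed in busy_on."""
    rx = OneByteReceiver()
    received = []
    for n, byte in enumerate(stream, start=1):
        rx.byte_arrives(byte)
        if n not in busy_on:  # firmware is free, so it services the buffer
            received.append(rx.read())
    return received, rx.overflow


print(receive(range(1, 6)))               # ([1, 2, 3, 4, 5], False)
print(receive(range(1, 6), busy_on={3}))  # ([1, 2, 3, 5], True)
```

One missed read is enough to lose a byte, and the data that does arrive still looks plausible, which is why checking the error flag matters.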

Just my .03
 

djsfantasi

Joined Apr 11, 2010
7,689
I like your .03! It is an eloquent explanation of my earlier point. The important takeaway here is that coding a system doesn’t stop at the system. A good test suite is also necessary, particularly in an application as complex as this one.

@camerart If this is beyond your current capabilities, I propose that your system may also be beyond your capabilities. You are learning that coding is more than cut and paste of someone else’s code. Hopefully, you can continue to learn and have success in the next step of creating a complex system by being able to code a test suite for all of its discrete functions.

I’ve watched several of your posts. And I don’t believe success is beyond your reach. With help from @John P , you have made incredible progress. You just have a little more to learn and a little more to do.
 