Concurrency and race conditions
There is one problem we keep seeing bend people's minds over and over again, and that is race conditions under concurrency.
Race conditions are hard to find because small changes in the code which subtly modify the timing can make the bug disappear entirely. Adding any debugging code, such as a printf, can cause the bug to mysteriously disappear. We spoke about this kind of "Heisenbug" before in the lore blog.
In order for there to be a concurrency bug you have to either have multiple threads of execution (as you would when using an RTOS) or, as we see more often in embedded systems, be using interrupts. In this context an interrupt behaves the same as a parallel thread of execution, and we have to take special care whenever both contexts of execution will be accessing the same variable.
Usually this cannot be avoided; we receive data in the ISR, for example, and want to process it in the main loop. Whenever we have this kind of interaction, where shared data will be accessed by multiple contexts, we need to serialize access to the data to ensure that the two contexts will safely navigate this. You will see that a good design minimises the amount of data shared between the two contexts and carefully manages this interaction.
Simple race condition example
The idea here is simply that we receive bytes in the ISR and process them in the main loop. When we receive a byte we increase a counter and when we process one we reduce the counter. When the counter is at 0 this means that we have no bytes left to process.
volatile uint8_t bytesInBuffer = 0;   // shared with the ISR, so it must be volatile

void interrupt serialPortInterrupt(void)
{
    enqueueByte(RCREG);   // Add received byte to buffer
    bytesInBuffer++;
}

void main(void)
{
    while (1)
    {
        if (bytesInBuffer > 0)
        {
            uint8_t newByte = dequeueByte();
            bytesInBuffer--;
            processByte(newByte);
        }
    }
}
The problem is that both the interrupt and the mainline code access the same memory here, and these interactions last for multiple instruction cycles. When an operation completes in a single cycle we call it "atomic", which means that it is not interruptible. There are a number of ways that operations which seem to be atomic in C can take multiple machine cycles to complete.
Some examples:
- Mathematical operations - these often use an accumulator or work register.
- Assignment. If I have e.g. a 32-bit variable on an 8-bit processor it can take 8 cycles or more to do x = y (see the sketch after this list).
- Most pointer operations (indirect access). [This one can be particularly nasty btw.]
- Arrays [yes, if you do array[i] the address calculation actually involves a multiply!]
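To make the assignment case concrete, here is a minimal sketch of a torn read. The names msTicks and readTicks are just for illustration, and we assume an 8-bit target where a 32-bit copy is performed one byte at a time:

#include <stdint.h>

volatile uint32_t msTicks;    // incremented in a timer ISR somewhere

uint32_t readTicks(void)
{
    // This single C statement compiles to at least 4 byte-sized moves.
    // If the timer interrupt fires between two of those moves, 'copy'
    // ends up with some old bytes and some new bytes - a torn read.
    uint32_t copy = msTicks;
    return copy;
}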
In fact this happens at two places here, the counter as well as the queue containing the bytes, but we will focus only on the first case for now, the counter "bytesInBuffer".
Consider what would happen if the code "bytesInBuffer++" compiled to something like this:
MOVFW   bytesInBuffer
ADDLW   1
MOVWF   bytesInBuffer
Similarly the code to reduce the variable could look like this:
MOVFW   bytesInBuffer
SUBLW   1
MOVWF   bytesInBuffer
The race happens once the main execution thread is trying to decrement the variable. Once the value has been copied to the work register the race is on. If the mainline code can complete the remaining 2 instructions of the calculation before an interrupt happens everything will work fine, but if an interrupt happens after the first instruction or after the 2nd instruction the data will be corrupted.
Let's look at the case where there is 1 byte in the buffer and the main code starts processing this byte, but before the processing is complete another byte arrives. Logically we would expect the counter to be incremented by 1 in the interrupt and decremented by 1 in the mainline code, so the end result should be bytesInBuffer = 1, with the newly received byte ready for processing.
The execution will proceed something like this - we will simplify to ignore clearing and checking of interrupt flags etc. (W is the work register, I_W is the interrupt context W register):
// State before : bytesInBuffer = 1. Mainline is reducing, next byte arrives during operation
Mainline               Interrupt              // state
...                                           [bytesInBuffer = 1]
MOVFW bytesInBuffer                           [bytesInBuffer = 1, W=1]
SUBLW 1                                       [bytesInBuffer = 1, W=0]
                       MOVFW bytesInBuffer    [bytesInBuffer = 1, W=0, I_W=1]
                       ADDLW 1                [bytesInBuffer = 1, W=0, I_W=2]
                       MOVWF bytesInBuffer    [bytesInBuffer = 2, W=0, I_W=2]
MOVWF bytesInBuffer                           [bytesInBuffer = 0, W=0, I_W=2]
...                                           [bytesInBuffer = 0]
As you can see we end up with bytesInBuffer = 0 instead of 1, which means the newly received byte will never be processed.
This typically leads to a bug report saying that the serial port is randomly losing bytes or double-receiving bytes.
Deadlock Empire
When I get a new junior programmer in my team I always make sure they understand this concept really well by laying down a challenge. They have to defeat "The Deadlock Empire" and bring me proof that they have won the BossFight at the end.
Deadlock Empire is a fun website which uses a series of examples to show just how hard proper serialization can be, and how, even when you try to serialize access using mutexes and/or semaphores, you can still end up with problems.
You can try it out for yourself - the link is https://deadlockempire.github.io
Even if you are a master of concurrency I bet you will still learn something new in the process!
Serialization
When we have to share data between 2 or more contexts it is critical that we properly serialize access to it. By this we mean that the one context should get to complete its operation on the data before the 2nd context can access it. We want to ensure that accesses to the variables happen in series and not in parallel, to avoid all of these problems.
The simplest way to do this (and also the least desirable) is to simply disable interrupts while accessing the variable in the mainline code. That means that the mainline code will never be interrupted in the middle of an operation on the shared data, and so we will be safe.
Something like this:
// State before : bytesInBuffer = 1. Mainline is reducing, next byte arrives during operation
Mainline                            Interrupt              // state
...                                                        [bytesInBuffer = 1]
BCF GIE  /* disable interrupts */                          [bytesInBuffer = 1]
MOVFW bytesInBuffer                                        [bytesInBuffer = 1, W=1]
SUBLW 1                                                    [bytesInBuffer = 1, W=0]
MOVWF bytesInBuffer                                        [bytesInBuffer = 0, W=0]
BSF GIE  /* enable interrupts */                           [bytesInBuffer = 0, W=0]
                                    MOVFW bytesInBuffer    [bytesInBuffer = 0, W=0, I_W=0]
                                    ADDLW 1                [bytesInBuffer = 0, W=0, I_W=1]
                                    MOVWF bytesInBuffer    [bytesInBuffer = 1, W=0, I_W=1]
...                                                        [bytesInBuffer = 1]
And the newly received byte is ready to process and our counter is correct.
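In C the same fix could look something like the sketch below. I am using the di()/ei() macros to clear and set GIE (XC8 provides these - substitute whatever interrupt-disable mechanism your compiler offers):

#include <xc.h>       // provides di()/ei() and the SFR declarations on XC8
#include <stdint.h>

void main(void)
{
    while (1)
    {
        if (bytesInBuffer > 0)    // single-byte read, atomic on this core
        {
            uint8_t newByte = dequeueByte();
            di();                 // disable interrupts - critical section start
            bytesInBuffer--;      // the read-modify-write is now safe
            ei();                 // re-enable interrupts - critical section end
            processByte(newByte);
        }
    }
}

Note that dequeueByte() touches the shared queue as well, so in a real design the queue accesses need the same kind of care.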
I am not going to go into fancy serialization mechanisms such as mutexes and semaphores here. If you are using an RTOS or are working on a fancy advanced processor then please do read up on the abilities of the processor and the mechanisms provided by the OS for atomic operations, critical sections, semaphores, mutexes and concurrent access in general.
Concurrency Checklist
We always want to explicitly go over a checklist when we are programming in any concurrent environment to make sure that we are guarding all concurrent access of shared data. My personal flow goes something like this:
- Do you ever have more than one context of execution?
  - Interrupts
  - Threads
  - Peripherals changing data (we see problems with reading/writing 16-bit timers on 8-bit PICs ALL OF THE TIME!)
- List out all data that is accessed in more than one context.
  - Look deep, sometimes it is subtle - e.g. accessing an array element may involve a multiply internally.
  - Don't miss registers, especially 16-bit values like timers or ADC results (see the read sketch after this list).
  - Is it possible to reduce the amount of data that is shared?
- Take a look at the ASM generated by the compiler to make 100% sure you understand what is happening with your data.
  - Operations that look atomic in C are often not atomic in machine instructions.
- Explicitly design the mechanism for serializing between the two contexts and make sure it is safe under all conditions.
  - Serialization almost always causes one context to be blocked. This will either slow down processing by blocking the other thread, or increase latency and jitter in the case of interrupts.
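For the 16-bit timer reading problem in particular, a common guard is to read the two halves and retry if the low byte could have rolled over in between. A minimal sketch, assuming PIC-style TMR1H/TMR1L registers on a part without a buffered 16-bit read mode (the function name is my own):

#include <xc.h>
#include <stdint.h>

uint16_t readTimer1(void)
{
    uint8_t high, low;
    do
    {
        high = TMR1H;    // sample the high byte
        low  = TMR1L;    // sample the low byte
        // If the low byte rolled over between these two reads, the
        // high byte we sampled is stale - sample again.
    } while (high != TMR1H);
    return ((uint16_t)high << 8) | low;
}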
We only looked into the counter variable above. Of course the queue holding the data is also shared, and in my experience I see many more issues with queues being corrupted by concurrent access than with simple counters, so do be careful with all shared data.
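As an aside, a single-producer/single-consumer queue can be designed so that it needs no locking at all, which also removes the need for the shared counter. This is a minimal sketch under my own assumptions (power-of-two size, no full-queue check), not the implementation from the example above:

#include <stdint.h>

#define QUEUE_SIZE 32u    // power of two, so the index mask below works

volatile uint8_t queue[QUEUE_SIZE];
volatile uint8_t head = 0;    // written ONLY by the ISR
volatile uint8_t tail = 0;    // written ONLY by the main loop

void enqueueByte(uint8_t b)   // call from the ISR only
{
    queue[head & (QUEUE_SIZE - 1u)] = b;
    head++;                   // single-byte write, atomic on an 8-bit core
}

uint8_t bytesPending(void)    // replaces the shared bytesInBuffer counter
{
    return (uint8_t)(head - tail);
}

uint8_t dequeueByte(void)     // call from the main loop only
{
    uint8_t b = queue[tail & (QUEUE_SIZE - 1u)];
    tail++;                   // single-byte write, atomic on an 8-bit core
    return b;
}

Because each index is written by exactly one context, and each of those writes is a single atomic byte, the counter race disappears entirely - this is what minimising the shared data buys you.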
One last example
I mentioned 16-bit timers a couple of times, so I have to show my favourite example of this, which happens when people write to the timer register without stopping the timer.
// Innocent enough looking code to update the Timer1 register
void updateTimer1(uint16_t value)
{
    TMR1 = value;
}

// This is more common, same problem
void updateTimer1_v2(uint16_t value)
{
    TMR1H = value >> 8;
    TMR1L = value & 0xFF;
}
With the code above the compiler is generally smart enough not to do actual shifts or masking, realizing that you are just working with the lower and upper bytes, and in both cases this compiles to something looking like this:
MOVFW   value+1   // High byte
MOVWF   TMR1H
MOVFW   value     // Low byte
MOVWF   TMR1L
And people always forget that when the timer is still running this can have disastrous results, because the timer can increment in the middle of the update, like this:
// We called updateTimer1(0xFFF0) - expecting a period of 16 cycles to the next overflow
// We show the register value to the right for just one typical error case
// The timer ticks on each instruction
...                 [TMR1 = 0x00FD]
MOVFW value+1       [TMR1 = 0x00FE]
MOVWF TMR1H         [TMR1 = 0xFFFF]
MOVFW value         [TMR1 = 0x0000]
MOVWF TMR1L         [TMR1 = 0x00F0]
...                 [TMR1 = 0x00F1]
...                 [TMR1 = 0x00F2]
This case is always hard to debug because you can only catch it failing when the update overlaps with the low byte overflowing into the high byte, which happens only 1 in every 256 timer cycles, and since the update takes 2 cycles we have only about a 0.8% (2 in 256) chance of catching it in the act.
As you can see, depending on whether you clear the interrupt flag after this operation or not you can end up with either getting an immediate interrupt due to the overflow in the middle of the operation, or a period which vastly exceeds what you are expecting (by 65280 cycles!).
Of course if the low byte did not overflow in the middle of this the high byte is unaffected and we get exactly the expected behavior.
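The usual fix is simply to stop the timer around the update. A minimal sketch - T1CONbits.TMR1ON is the Timer1 enable bit on many PICs, so adjust the name for your part, and note that the cycles lost while stopped may need compensating in a precision application:

#include <xc.h>
#include <stdint.h>

void updateTimer1Safe(uint16_t value)
{
    T1CONbits.TMR1ON = 0;    // stop the timer so it cannot tick mid-update
    TMR1H = value >> 8;      // high byte
    TMR1L = value & 0xFF;    // low byte
    T1CONbits.TMR1ON = 1;    // restart the timer
}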
When this happens we always see issues reported on the forums that sound something like "My timer is not working correctly. Every once in a while I get a really short or really long period."
When I see these issues posted in future I will try to remember to come update this page with links just for fun 🙂