
Comparing raw pin toggling speed AVR ATmega4808 vs PIC 16F15376





Toggling a pin is such a basic thing. After all, we all start with that Blinky program when we bring up a new board.

It is actually a pretty effective way to compare raw processing speed between any two microcontrollers. In this, our first Toe-to-toe showdown, we will be comparing how fast these cores can toggle a pin using just a while loop and a classic XOR toggle.

First, let's take a look at the 2 boards we used to compare these cores. These were selected solely because I had them both lying on my desk at the time. Since we are not doing anything more than toggling a pin we just needed an 8-bit AVR core and an 8-bit PIC16F1 core on any device to compare. I do like these two development boards though, so here are the details if you want to repeat this experiment.

In the blue corner, we have the AVR, represented by the ATmega4808, sporting an ATmega core (AVRxt in the instruction set manual).

Clocking at a maximum of 20MHz.

We used the AVR-IOT WG Development Board, part number AC164160.
This board can be obtained for $29 here:

Compiler: XC8 v2.05 (Free)

In the red corner, we have the PIC, represented by the 16F15376,
sporting a PIC16F1 Enhanced Midrange core.

Clocking at a maximum of 32MHz.

We used the MPLAB® Xpress PIC16F15376 Evaluation Board, part number DM164143.
This board can be obtained at $12 here:

Compiler: XC8 v2.05 (Free)



This is what we measured. The methodology we used and an analysis of the code follow below, and the source code is attached if you want to try this at home. The numbers in the graph are pin toggling frequencies in kHz, normalized to a 1MHz CPU clock speed.


How we did it (and some details about the cores)

Doing objective comparisons between 2 very different cores is always hard. 

We wanted an objective comparison between the cores that you can use to make informed decisions on your project. To get there we had to deal with two facts: the maximum clock speed of these devices is not the same, and the fundamental architecture of the two cores is very different.

In principle, the AVR is a Load-store Architecture machine with a 1 stage pipeline. This basically means that all ALU operations are performed between CPU registers, and RAM is only used to load operands from and store results to. The PIC, on the other hand, uses a Register Memory Architecture, which in short means that some ALU operations can be performed on RAM locations directly and that the machine has a much smaller set of registers.

On the PIC all instructions are 1 word long, and a word is 14 bits wide, while the data bus is 8 bits wide and all results are at most 8 bits in size. On the AVR instructions can be 16 or 32 bits wide, which results in different execution times depending on the instruction.

Both processors have a 1 stage pipeline, which means that the next instruction is fetched while the current one is being executed. A branch therefore causes an incorrect fetch and costs a penalty of one instruction cycle. One major difference is that the AVR, due to its Load-store Architecture, can complete an instruction in as little as one clock cycle, although instructions that need to use the data bus can take up to 5 clock cycles to execute. Since the PIC has to transfer data over the bus, it takes multiple clock cycles to execute an instruction. In keeping with the RISC paradigm of highly regular instruction pipeline flow, all instructions on the PIC take 4 clock cycles (one instruction cycle) to execute.
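As a quick sanity check of the timing model described above, here is a minimal C sketch that converts a clock frequency into the duration of one instruction cycle. The helper `cycle_time_ns` and the two constants are names made up for this illustration and do not come from any vendor header. The 4 and 1 clocks per instruction cycle are the figures quoted above, and the 32MHz and 20MHz inputs are the maximum clock speeds of the two boards.

```c
#include <assert.h>
#include <stdint.h>

/* Clocks per instruction cycle as described in the text:
 * the PIC16F1 needs 4 clocks, the AVR needs as little as 1. */
enum { PIC_CLOCKS_PER_CYCLE = 4, AVR_CLOCKS_PER_CYCLE = 1 };

/* Duration of one instruction cycle in nanoseconds.
 * Hypothetical helper for illustration only. */
uint32_t cycle_time_ns(uint32_t clock_hz, uint32_t clocks_per_cycle)
{
    assert(clock_hz > 0);
    return (uint32_t)((1000000000ULL * clocks_per_cycle) / clock_hz);
}
```

Calling `cycle_time_ns(32000000u, PIC_CLOCKS_PER_CYCLE)` gives the PIC figure and `cycle_time_ns(20000000u, AVR_CLOCKS_PER_CYCLE)` gives the AVR figure used in the measurements further down.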

All of this makes it tricky and technical to compare performance between these processors. What we decided to do instead is take typical tasks that occur regularly in real programs and simply measure how fast each CPU can perform them. This should allow you to work backwards from what your application will be doing during maximum throughput pressure on the CPU and figure out which CPU will perform the best for your specific problem.

Round 1: Basic test

For the first test, we used virtually the same code on both processors. Since both of these are supported by MCC it was really easy to get going.

  1. We created a blank project for the target CPU
  2. Fired up MCC
  3. Adjusted the clock speed to the maximum possible
  4. Clicked in the pin manager to make a single pin on PORTC an output
  5. Hit generate code.

After this all we added was the following simple while loop:

PIC:

while (1)
  LATC ^= 0xFF;

AVR:

while (1)
  PORTC.OUT ^= 0xFF;

The code produced by the free compilers (XC8 v2.05 in both cases) was as follows. Interestingly enough, both loops occupy the same number of program words (6 in total) including the loop jump: 6 single-word instructions on the PIC, and 4 instructions on the AVR, 2 of which are 2 words long. This is especially interesting as it shows how long a same-length loop takes to execute on each of these processors. You will notice that without optimization there is some room for improvement, but since this is how people will evaluate the cores at first glance we wanted to go with this.

PIC:

Address  Hex   Instruction
07B3     30FF  MOVLW 0xFF
07B4     00F0  MOVWF __pcstackCOMMON
07B5     0870  MOVF __pcstackCOMMON, W
07B6     0140  MOVLB 0x0
07B8     2FB3  GOTO 0x7B3

AVR:

Address  Hex   Instruction
017D     9180  LDS R24, 0x00
017E     0444
017F     9580  COM R24
0180     9380  STS 0x00, R24
0181     0444
0182     CFFA  RJMP 0x17D

We used my Saleae logic analyzer to capture the signal and measure the timing on both devices. Since the Saleae is thresholding the digital signal and the rise and fall times are not always identical you will notice a little skew in the measurements. We did run everything 512x slower to confirm that this was entirely measurement error, so it is correct to round all times to multiples of the CPU clock in all cases here.

PIC_basic.png  AVR_basic.png 


For the PIC

The clock speed was 32MHz. We know that the PIC takes 4 clock cycles to execute one instruction, which gives us an expected instruction rate of one instruction every 125ns. Rounding for measurement errors we see that the PIC has equal low and high times of 875ns. That is 7 instruction cycles for each loop iteration.

To verify if this makes sense we can look at the ASM. We see 6 instructions, the last of which is a GOTO, which we know will take 2 instruction cycles to execute. Using that fact we can verify that the loop repeats every 7 instruction cycles as expected (7 x 125ns = 875ns.)
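The cycle count above can be written down as a small C sketch. It assumes, as the text states, that every PIC instruction takes one instruction cycle except the GOTO, which takes two, and that one instruction cycle is 125ns at 32MHz. The function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add up the instruction cycles of every instruction in a loop body. */
uint32_t sum_cycles(const uint8_t *cycles, size_t count)
{
    assert(cycles != NULL);
    uint32_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += cycles[i];
    }
    return total;
}

/* Round 1 PIC loop: five single-cycle instructions followed by a
 * GOTO that costs two instruction cycles (figures from the text). */
uint32_t pic_basic_loop_ns(void)
{
    static const uint8_t cycles[] = { 1, 1, 1, 1, 1, 2 };
    const uint32_t instruction_cycle_ns = 125; /* 4 clocks at 32MHz */
    return sum_cycles(cycles, sizeof cycles / sizeof cycles[0])
           * instruction_cycle_ns;
}
```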

For the AVR

The clock speed was 20MHz. We know that the AVR can execute an instruction in as little as 1 clock cycle, which gives us an expected instruction cycle of 50ns.

Rounding for measurement errors we see that the AVR has equal low and high times of 400ns. That is 8 instruction cycles for each loop iteration.

To verify if this makes sense we again look at the ASM. We see 4 instructions, the last of which is an RJMP, which we know will take 2 cycles to execute. We also see one LDS, which takes 3 cycles because it is accessing SRAM, one STS, which takes 2 cycles, and a Complement (COM) instruction, which takes 1 more. Using those facts we can verify that the loop should repeat every 8 cycles as expected (3 + 1 + 2 + 2 = 8, and 8 x 50ns = 400ns.)
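Because AVR instructions differ in both length and duration, it helps to tabulate them. The following C sketch is only an illustration: the struct and function names are hypothetical, the cycle counts are the ones quoted in the paragraph above, and the word counts are read off the Round 1 listing (LDS and STS each occupy two 16-bit words).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *mnemonic;
    uint8_t words;  /* 16-bit program words occupied */
    uint8_t cycles; /* clock cycles, as quoted in the text */
} avr_instr_t;

/* The Round 1 AVR loop, in program order. */
const avr_instr_t avr_basic_loop[] = {
    { "LDS",  2, 3 },
    { "COM",  1, 1 },
    { "STS",  2, 2 },
    { "RJMP", 1, 2 },
};
#define AVR_BASIC_LOOP_LEN (sizeof avr_basic_loop / sizeof avr_basic_loop[0])

/* Total clock cycles for one pass through the loop. */
uint32_t avr_loop_cycles(const avr_instr_t *loop, size_t count)
{
    assert(loop != NULL);
    uint32_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += loop[i].cycles;
    }
    return total;
}

/* Loop size in bytes (2 bytes per 16-bit word). */
uint32_t avr_loop_bytes(const avr_instr_t *loop, size_t count)
{
    assert(loop != NULL);
    uint32_t words = 0;
    for (size_t i = 0; i < count; i++) {
        words += loop[i].words;
    }
    return words * 2u;
}
```

Multiplying the cycle total by the 50ns clock period gives the loop time, and the byte total is the loop code size.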


Since the 2 processors are not running at the same clock speed we need to do some math to get a fair comparison. We think 2 particular approaches would be reasonable.

  1. Compare the raw fastest speed the CPU can reach. This gives a fair benchmark where CPUs with higher clock speeds get an advantage.
  2. Normalize the results to a common clock speed. This gives us a fair comparison of capability at the same clock speed.

In the numbers below we used both methods for comparison.
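The two methods can be sketched in C as follows. This is a minimal illustration with made-up function names. It uses integer arithmetic, so results are truncated to whole Hz, and it treats one loop iteration (one pin toggle) as one event, which is how the loop times are converted to frequencies in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Method 1: raw toggle rate in Hz from the measured loop time. */
uint32_t toggle_rate_hz(uint32_t loop_ns)
{
    assert(loop_ns > 0);
    return 1000000000u / loop_ns;
}

/* Method 2: toggle rate in Hz per 1MHz of CPU clock. */
uint32_t normalized_rate_hz(uint32_t loop_ns, uint32_t clock_mhz)
{
    assert(clock_mhz > 0);
    return toggle_rate_hz(loop_ns) / clock_mhz;
}
```

For the Round 1 measurements, `toggle_rate_hz(400)` and `normalized_rate_hz(400, 20)` describe the AVR, while `toggle_rate_hz(875)` and `normalized_rate_hz(875, 32)` describe the PIC.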

The numbers

                   AVR              PIC              Notes
Clock Speed        20MHz            32MHz
Loop Speed         400ns            875ns
Maximum Speed      2.5MHz           1.143MHz         Loop speed as a toggle frequency
Normalized Speed   125kHz           35.7kHz          Loop frequency normalized to a 1MHz CPU clock
ASM Instructions   4                6
Loop Code Size     12 bytes         6 words          Due to the nuances here we compared this 3 ways
                   12 bytes         10.5 bytes
                   4 instructions   6 instructions
Total Code Size    786 bytes        101 words
                                    176.75 bytes
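The fractional byte counts for the PIC come from its 14-bit instruction word. A minimal C sketch of that conversion, with a function name invented for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a count of 14-bit PIC program words into bytes of storage.
 * The result can be fractional because 14 is not a multiple of 8. */
double pic_words_to_bytes(uint32_t words)
{
    assert(words <= UINT32_MAX / 14u); /* guard against overflow */
    return (words * 14u) / 8.0;
}
```

With this, the 6-word loop works out to 10.5 bytes and the 101-word program to 176.75 bytes, matching the table.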


Round 2: Expert Optimized test

For the second round, we tried to hand-optimize the code to squeeze out the best possible performance from each processor. After all, we do not want to just compare how well the compilers are optimizing, we want to see the absolute best the raw CPUs can achieve. You will notice that although optimization more than doubled our performance on both, it made little difference to the relative performance between the two processors.


If anyone can improve on my hand optimizations here, please report it in the comments and I will update the post accordingly.

For the PIC we wrote to LATC first to ensure we are in the right bank, and pre-set the W register, which means the loop reduces to just an XORWF and a GOTO. For the AVR we changed the code to use the Toggle register instead of doing an XOR of the OUT register for the port.
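To see why the Toggle register helps on the AVR, here is a small host-side model in C. It is not device code and every name in it is hypothetical. It only counts how many times the CPU has to touch the port in each style: the XOR approach reads the output register, flips the bits in a CPU register and writes the result back, while a write to a toggle register lets the port hardware do the flipping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t out;     /* models the port output register */
    uint32_t reads;  /* CPU reads of the port */
    uint32_t writes; /* CPU writes to the port */
} port_model_t;

/* Round 1 style: read, flip in the CPU, write back. */
void toggle_by_xor(port_model_t *port, uint8_t mask)
{
    assert(port != NULL);
    uint8_t value = port->out;
    port->reads++;
    value ^= mask;
    port->out = value;
    port->writes++;
}

/* Round 2 style: one write, the (modelled) hardware flips the bits. */
void toggle_by_register(port_model_t *port, uint8_t mask)
{
    assert(port != NULL);
    port->out ^= mask;
    port->writes++;
}

/* Run n toggles from a cleared port and return the final model state. */
port_model_t run_toggles(uint32_t n, uint8_t mask, int use_toggle_register)
{
    port_model_t port = { 0, 0, 0 };
    for (uint32_t i = 0; i < n; i++) {
        if (use_toggle_register) {
            toggle_by_register(&port, mask);
        } else {
            toggle_by_xor(&port, mask);
        }
    }
    return port;
}

/* Total number of CPU accesses to the port. */
uint32_t port_accesses(port_model_t port)
{
    return port.reads + port.writes;
}
```

Both styles leave the pins in the same state, but the toggle register needs half the port accesses and no arithmetic in the CPU.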

The optimized code looked as follows.

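PIC:

/* Writing LATC first leaves the right bank selected */
LATC = 0xFF;
/* Pre-set W with the toggle mask */
asm ("MOVLW  0xFF");
while (1)
   asm ("XORWF  LATC, F");

AVR:

/* Point Z (R31:R30) at 0x0440 and set all bits in R24 */
asm ("LDI R30,0x40");
asm ("LDI R31,0x04");
asm ("SER R24");
while (1){
   asm ("STD Z+7,R24");
}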
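The AVR setup loads the Z pointer one byte at a time, so it may help to spell out which address the STD instruction ends up writing to. The sketch below is plain host-side C with invented names. The register offsets (OUT at base + 0x04, OUTTGL at base + 0x07) are my reading of the ATmega4808 PORT register map and should be treated as an assumption, although they agree with the 0x0444 address that shows up for PORTC.OUT in the Round 1 listing.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed PORT register offsets (check against the datasheet). */
enum { PORT_OUT_OFFSET = 0x04, PORT_OUTTGL_OFFSET = 0x07 };

/* Z is the 16-bit pointer formed by R31 (high byte) and R30 (low byte). */
uint16_t z_pointer(uint8_t r31, uint8_t r30)
{
    return (uint16_t)(((uint16_t)r31 << 8) | r30);
}

/* Address written by "STD Z+q, Rr". The displacement q is limited to 0..63. */
uint16_t std_target(uint16_t z, uint8_t displacement)
{
    assert(displacement <= 63);
    return (uint16_t)(z + displacement);
}
```

With R31 = 0x04 and R30 = 0x40, Z is 0x0440, and `STD Z+7` therefore writes to 0x0447.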

The resulting ASM code after these changes now looked as follows. Note we did not include the instructions outside of the loop here as we are really just looking at the loop execution.

PIC:

Address  Hex   Instruction
07C2     00F0  GOTO 0x7C1

AVR:

Address  Hex   Instruction
0180     8387  STD Z+7,R24
0181     CFFE  RJMP 0x180

Here are the actual measurements:

Pic_optimized.png  image.png 


For the PIC we do not see how we could improve on this. The loop has to end in a GOTO, which takes 2 cycles, and 1 instruction is the least amount of work we could possibly do in the loop, so we are pretty confident that this is the best we can do. When measuring we see 3 instruction cycles per iteration, which we think is the limit here.

Note: N9WXU did suggest that we could fill all the memory with XOR instructions and let it loop around forever and in doing so save the GOTO, but we would still have to set W to FF every second instruction to have consistent timing, so this would still be 2 instructions per "loop" although it would use all the FLASH and execute in 250ns. Novel as this idea was, since that means you can do nothing else we dismissed that idea as not representative.

For the AVR we think we are also at the limits here. The toggle register lets us toggle the pin in 1 clock cycle which cannot be beaten, and the RJMP unavoidably adds 2 more. We measure 3 cycles for this.
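The limits argued above reduce to simple arithmetic, sketched here in C. The function name is hypothetical, and the cycle times (50ns per AVR clock cycle, 125ns per PIC instruction cycle) are the ones used earlier in the post.

```c
#include <assert.h>
#include <stdint.h>

/* Time for one loop iteration: cycles spent on useful work plus
 * cycles spent jumping back, times the duration of one cycle. */
uint32_t loop_time_ns(uint32_t work_cycles, uint32_t jump_cycles,
                      uint32_t cycle_ns)
{
    assert(cycle_ns > 0);
    return (work_cycles + jump_cycles) * cycle_ns;
}
```

`loop_time_ns(1, 2, 50)` is the optimized AVR loop, `loop_time_ns(1, 2, 125)` is the optimized PIC loop, and `loop_time_ns(2, 0, 125)` is the unrolled variant suggested by N9WXU, which needs two instructions per toggle but no jump.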

                   AVR        PIC         Notes
Clock Speed        20MHz      32MHz
Loop Speed         150ns      375ns
Maximum Speed      6.667MHz   2.667MHz    Loop speed as a toggle frequency
Normalized Speed   333.3kHz   83.3kHz     Loop frequency normalized to a 1MHz CPU clock
ASM Instructions   2          2
Loop Code Size     4 bytes    2 words
                   4 bytes    3.5 bytes

At this point, we can do a raw comparison of absolute toggle frequency performance after the hand optimization. Comparing this way gives the PIC the advantage of running at 32MHz while the AVR is limited to 20MHz. Interestingly the PIC gains a little as expected, but the overall picture does not change much.
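To put numbers on "the overall picture does not change much", the two comparison methods can be expressed as ratios. This is a rough C sketch with made-up function names. It takes the measured loop times and clock speeds from the two tables and reports how many times faster the AVR toggles than the PIC, first at each part's maximum clock and then clock for clock.

```c
#include <assert.h>
#include <stdint.h>

/* AVR advantage with each part at its own maximum clock speed. */
double raw_speedup(uint32_t avr_loop_ns, uint32_t pic_loop_ns)
{
    assert(avr_loop_ns > 0);
    return (double)pic_loop_ns / (double)avr_loop_ns;
}

/* AVR advantage when both are normalized to the same clock speed.
 * loop_ns * clock_mhz is proportional to the clocks used per loop. */
double per_clock_speedup(uint32_t avr_loop_ns, uint32_t avr_clock_mhz,
                         uint32_t pic_loop_ns, uint32_t pic_clock_mhz)
{
    assert(avr_loop_ns > 0 && avr_clock_mhz > 0);
    return ((double)pic_loop_ns * pic_clock_mhz)
           / ((double)avr_loop_ns * avr_clock_mhz);
}
```

For Round 1 the inputs are 400ns at 20MHz and 875ns at 32MHz, and for Round 2 they are 150ns at 20MHz and 375ns at 32MHz.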


The source code can be downloaded here:

PIC MPLAB-X Project file  MicroforumToggleTestPic16f1.zip

AVR MPLAB-X Project file  MicroforumToggleTest4808.zip

What next?

For our next installment, we have a number of options. We could add more cores/processors to this test of ours, or we can take a different task and cycle through the candidates on that. We could also vary the tools by using different compilers and see how they stack up against each other and across the architectures. Since our benchmarks will all be based on real-world tasks it should not matter HOW the CPU is performing the task or HOW we created the code, the comparison will simply be how well the job gets done.

Please do post any ideas or requests in the comments and we will see if we can either improve this one or oblige with another Toe-to-toe comparison.



This post was updated to use the 1 cycle STD instruction instead of the 2 cycle STS instruction for the hand-optimized AVR version in round 2




Recommended Comments

Your update didn't include changing the "loop size" for AVR....
You could also use the OUT instruction, which would cut down on the setup....


8 hours ago, westfw said:

Your update didn't include changing the "loop size" for AVR....
You could also use the OUT instruction, which would cut down on the setup....


Thanks for pointing that out! I looked over the code 3 times before I realized you were referring to the Table 🙂

And yes I agree there are a lot of ways to optimize what happens before the loop, we decided not to even bother with making that smaller as it did not form part of our focus here. For example I also noticed that the Start/MCC generated code for the AVR was actually quite large, more than 4x what the size was for the PIC code - in fact it took more than 800 instructions to just make a loop that toggles a pin. In ASM I can do all that in almost 100x less, but if this was going to be a reasonably fast read we had to be laser focussed.

The idea in the longer run is to add a lot of small posts that focus on one specific aspect every time, and we will be sure to cover the advantages of OUT in the future sometime based on your recommendation.

