Comparing raw pin toggling speed between Arduino platforms.

Entry posted by N9WXU · December 17, 2019

3,982 views

When comparing CPU's and architectures it is also a good idea to compare the frameworks and learn how the framework will affect your system. In this article I will be comparing a number of popular Arduino compatible systems to see how different "flavors" of Arduino stack up in the pin toggling test. When I started this effort, I thought it would be a straight forward demonstration of CPU efficiency, clock speed and compiler performance on the one side against the Arduino framework implementation on the other. As is often the case, if you poke deeply into even the most trivial of systems you will always find something to learn.

As I look around my board stash I see that there are the following Arduino compatible development kits:

Arduino Nano Every (ATMega 4809 @ 20MHz AVR Mega)
Mini Nano V3.0 (ATMega 328P @ 16MHz AVR)
RobotDyn SAMD21 M0-Mini (ATSAMD21G18A @ 48MHz Cortex M0+)
ESP-12E NodeMCU (ESP8266 @ 80MHz Tenselica)
Teensy 3.2 (MK20DX256VLH7 @ 96MHz Cortex M4)
ESP32-WROOM-32 (ESP32 @ 240MHz Tenselica)

And each of these kits has an available Arduino framework. Say what you will about the Arduino framework, there are some serious advantages to using it and a few surprises. For the purpose of this testing I will be running one program on every board. I will use vanilla "Arduino" code and make zero changes for each CPU. The Arduino framework is very useful for normalizing the API to the hardware in a very consistent and portable manner. This is mostly true at the low levels like timers, PWM and digital I/O, but it is very true as you move to higher layers like the String library or WiFi. Strangely, there are no promises of performance. For instance, every Arduino program has a setup() function where you put your initialization and a loop() function that is called very often. With this in mind it is easy to imagine the following implementation:

extern void setup(void);
extern void loop(void);

void main(void)
{
  setup();
  while(1)
  {
    loop();
  }
}

And in fact when you dig into the AVR framework you find the following code in main.cpp

int main(void)
{
   init();

   initVariant();

#if defined(USBCON)
   USBDevice.attach();
#endif
	
   setup();
    
   for (;;) {
      loop();
      if (serialEventRun) serialEventRun();
   }     
   return 0;
}

There are a few "surprises" that really should not be surprises. First, the Arduino environment needs to be initialized (init()), then the HW variant (initVariant()), then we might be using a usb device so get USB started (USBDevice.attach()) and finally, the user setup() function. Once we start our infinite loop. Between calls to the loop function the code maintains the serial connection which could be USB. I suppose that other frameworks could implement this environment a little bit differently and there could be significant consequences to these choices.

The Test

For this test I am simply going to initialize 1 pin and then set it high and low. Here is the code.

void setup()
{
  pinMode(2,OUTPUT);
}

void loop()
{
  digitalWrite(2,HIGH);
  digitalWrite(2,LOW);
}

I am expecting this to make a short high pulse and a slightly longer low pulse. The longer low pulse is to account for the extra overhead of looping back. This is not likely to be as fast as the pin toggles Orunmila did in the previous article but I do expect it to be about half as fast.

Here are the results. The 2 red lines at the bottom are the best case optimized raw speed from Orunmila's comparison.

That is a pretty interesting chart and if we simply compare the data from the ATMEGA 4809 both with ASM and Arduino code, you see a 6x difference in performance. Let us look at the details and we will summarize at the end.

Nano 328P

So here is the first victim. The venerable AVR AT328P running 16MHz. The high pulse is 3.186uS while the low pulse is 3.544uS making a pulse frequency of 148.2kHz.

Clearly the high and low pulses are nearly the same so the extra check to handle the serial ports is not very expensive but the digitalWrite abstraction is much more expensive that I was anticipating.

Nano Every

The Nano Every uses the much newer ATMega 4809 at 20Mhz. The 4809 is a different variant of the AVR CPU with some additional optimizations like set and clear registers for the ports. This should be much faster.

The high pulse is 1.192uS and the low pulse is 1.504uS. Again the pulses are almost the same size so the additional overhead outside of the loop function must be fairly small. Perhaps it is the same serial port test. Interestingly, one of the limiting factors of popular Arduino 3d printer controller projects such as GRBL is the pin toggle rate for driving the stepper motor pulses. A 4809 based controller could be 2x faster for the same stepper code.

Sam D21 Mini M0

Now we are stepping up to an ARM Cortex M0 at 48Mhz. I actually expect this to be nearly 2x performance as the 4809 simply because the instructions required to set pins high and low should be essentially the same.

Wow! I was definitely NOT expecting the timing to get worse than the 4809. The high pulse width is 1.478uS and the low pulse width is 1.916uS making the frequency 294.6kHz. Obviously toggling pins is not a great measurement of CPU performance but if you need fast pin toggling in the Arduino world, perhaps the SAMD21 is not your best choice.

Teensy 3.2

This is a NXP Cortex M4 CPU at 96 MHz. This CPU is double the clock speed as the D21 and it is a M4 CPU which has lots of great features, though those features may not help toggle pins quickly.

Interesting. Clearly this device is very fast as shown by the short high period of only 0.352uS. But, this framework must be doing quite a lot of work behind the scenes to justify the 2.274uS of loop delay.

Looking a little more closely I see a number of board options for this hardware. First, I see that I can disable the USB. Surely the USB is supported between calls to the loop function. I also see a number of compiler optimization options. If I turn off the USB and select the "fastest" optimizations, what is the result?

Teensy 3.2, No USB and Fastest optimizations

Making these two changes and re-running the same C code produces this result:

That is much better. It is interesting to see the compiler change is about 3x faster for this test (measured on the high pulse) and the lack of USB saves about 1uS in the loop rate. This is not a definitive test of the optimizations and probably the code grew a bit, but it is a stark reminder that optimization choices can make a big difference.

ESP8266

The ESP8266 is a 32-bit Tenselica CPU. This is still a load/store architecture so its performance will largely match ARM though undoubtedly there are cases where it will be a bit different. The 8266 runs at 80Mhz so I do expect the performance to be similar to the Teensy 3.2. The wildcard is the 8266 framework is intended to support WiFI so it is running FreeRTOS and the Arduino loop is just one thread in the system. I have no idea what that will do to our pin toggle so it is time to measure.

Interesting. It is actually quite slow and clearly there is quite a bit of system house-keeping happening in the main loop. The high pulse is only 0.948uS so that is very similar to Nano Every at 1/4th the clock speed. The low pulse is simply slow. This does seem to be a good device for IoT but not for pin toggling.

ESP32

The ESP32 is a dual core very fast machine, but it does run the code out of a cache. This is because the code is stored in a serial memory. Of course our test is quite short so perhaps we do not need to fear the cache miss.

Like the ESP8266, the Arduino framework is built upon a FreeRTOS task. But this has a second CPU and lots more clock speed so lets look at the results:

Interesting, the toggle rate is about 2x the Teensy while the clock speed is about 3x. I do like how the pulses are nearly symmetrical. A quick peek at the source code for the framework shows the Arduino running as a thread but the thread updates the watchdog timer and the serial drivers on each pass through the loop.

Conclusions

It is very educational to make measurements instead of assumptions when evaluating an MCU for your next project. A specific CPU may have fantastic specifications and even demonstrations but it is critical to include the complete development system and code framework in your evaluation. It is a big surprise to find the 16MHz AVR328P can actually toggle a pin faster than the ESP8266 when used in a basic Arduino project.

The summary graph at the top of the article is duplicated here:

In this graph, the Pin Toggling Speed is actually only 1/(the high period). This was done on purpose so only the pin toggle efficiency is being compared. In the test program, the low period is where the loop() function ends and other housekeeping work can take place. If we want to compare the CPU/CODE efficiency, we should really normalize the pin toggling frequency to a common clock speed. We can always compensate for inefficiency with more clock speed.

This graph is produced by dividing the frequency by the clock speed and now we can compare the relative efficiencies. That Cortex M4 and its framework in the Teensy 3.2 is quite impressive now. Clearly the ESP-32 is pretty good but using its clock speed for the win. The Mega 4809 has a reasonable framework just not enough clock speed. All that aside, the ASM versions (or even a faster framework) could seriously improve all of these numbers. The poor ESP8266 is pretty dismal.

So what is happening in the digitalWrite() function that is making this performance so slow? Put another way, what am I getting in return for the low performance? There are really 3 reasons for the performance.

Portability. Each device has work to adapt to the pin interface so the price of portability is runtime efficiency
Framework Support. There are many functions in the framework that could be affected by the writing to the pins so the digitalWrite function must modify other functions.
Application Ignorance. The framework (and this function) cannot know how the system is constructed so they must plan for the worst.

Let us look at the digitalWrite for the the AVR

void digitalWrite(uint8_t pin, uint8_t val)
{
	uint8_t timer = digitalPinToTimer(pin);
	uint8_t bit = digitalPinToBitMask(pin);
	uint8_t port = digitalPinToPort(pin);
	volatile uint8_t *out;

	if (port == NOT_A_PIN) return;

	// If the pin that support PWM output, we need to turn it off
	// before doing a digital write.
	if (timer != NOT_ON_TIMER) turnOffPWM(timer);

	out = portOutputRegister(port);

	uint8_t oldSREG = SREG;
	cli();

	if (val == LOW) {
		*out &= ~bit;
	} else {
		*out |= bit;
	}

	SREG = oldSREG;
}

Note the first thing is a few lookup functions to determine the timer, port and bit described by the pin number. These lookups can be quite fast but they do cost a few cycles. Next we ensure we have a valid pin and turn off any PWM that may be active on that pin. This is just safe programming and framework support. Next we figure out the output register for the update, turn off the interrupts (saving the interrupt state) set or clear the pin and restore interrupts. If we knew we were not using PWM (like this application) we could omit the turnOffPWM function. If we knew all of our pins were valid we could remove the NOT_A_PIN test. Unfortunately all of these optimizations require knowledge of the application which the framework cannot know. Clearly we need new tools to describe embedded applications.

This has been a fun bit of testing. I look forward to your comments and suggestions for future toe-to-toe challenges.

Good Luck and go make some measurements.

PS: I realize that this pin toggling example is simplistic at best. There are some fine Arduino libraries and peripherals that could easily toggle pins much faster than the results shown here. However, this is a simple Apples to Apples test of identical code in "identical" frameworks on different CPU's so the comparisons are valid and useful. That said, if you have any suggestions feel free to enlighten us in the comments.

2 Comments

Recommended Comments

Member

N9WXU 56

Posted January 20, 2020

More Data!

I just got a Teensy 4 and it is pretty fast.

Compiling it in "fastest" and 600Mhz provides the following results.

Strangely compiling it in "faster" provides the slightly better results. (6ns)

This is pretty fast but I was expecting a bit more performance since it is 6x faster than the Teensy 3.2 tested before.

There is undoubtedly a good reason for this performance, and I expect pin toggling to be limited by wait states in writing to the GPIO peripherals. In any case this is still a fast result.