Blogs

Featured Entries

  • Orunmila

    How to struct - lessons on Structures in C

    By Orunmila

    Structures in the C Programming Language Structures in C are one of the most misunderstood concepts. We see a lot of questions about the use of structs, often simply about the syntax and portability. I want to explore both of these and look at some best practice use of structures in this post as well as some lesser known facts. Covering it all will be pretty long so I will start off with the basics, the syntax and some examples, then I will move on to some more advanced stuff. If you are an…
    • 10 comments
    • 14,781 views
  • Orunmila

    The Ballmer Peak

    By Orunmila

    If you are going to be writing any code you can probably use all the help you can get, and in that line you better be aware of the "Ballmer Peak". Legend has it that drinking alcohol impairs your ability to write code, BUT there is a curious peak somewhere in the vicinity of a 0.14 BAC where programmers attain almost super-human programming skill. The XKCD below tries to explain the finer nuances. But seriously many studies have shown that there is some truth to this in the sense that…
    • 0 comments
    • 718 views
  • Orunmila

    How long is a nanosecond?

    By Orunmila

    Exactly how long is a nanosecond? This Lore blog is all about standing on the shoulders of giants. Back in February 1944 IBM shipped the Harvard Mark 1 to Harvard University. It looked like this: The Mark I was a remarkable machine at the time, it could perform addition in 1 cycle (which took roughly 0.3 seconds) and multiplication in 20 cycles or 6 seconds. Calculating sin(x) would run up to 60 seconds (1 minute). The team that ran this Electromechanical computer had o…
    • 0 comments
    • 1,388 views
  • N9WXU

    Initializing Strings in C

    By N9WXU

    Embedded applications are hard for a large number of reasons, but one of the main issues is memory.  Today I want to talk about how our C variables get initialized and a few assumptions we make as we use C to write embedded software. Let us take a few simple declarations such as we might make every day. char *string1 = "string1"; const char *string2 = "string2"; char const *string3 = "string3"; char * const string4 = "string4"; char const * const string5 = "string5";   In C99 th…
    • 4 comments
    • 1,421 views
  • Orunmila

    Epigrams on Programming

    By Orunmila

    Epigrams on Programming Alan J. Perlis Yale University This text has been published in SIGPLAN Notices Vol. 17, No. 9, September 1982, pages 7 - 13.  The phenomena surrounding computers are diverse and yield a surprisingly rich base for launching metaphors at individual and group activities. Conversely, classical human endeavors provide an inexhaustible source of metaphor for those of us who are in labor within computation. Such relationships between society and device are no…
    • 0 comments
    • 871 views

Our community blogs

  1. In the comments of our blog on structures in C, a member asked a specific question about something they observed. As their code is a beautiful example of a number of problems we often see, we thought it a good idea to write an entry just to discuss it, because there really is a lot going on here.

    We will cover the following:

    1. What allocation and freeing of memory on the stack means and the lifetime of objects
    2. In which direction the stack usually grows (note - the C standard does not contain the word "stack" so this is compiler-specific)
    3. Another look at deep vs. shallow copies of c strings inside structures

    In order to keep this all generic, I am going to be using the LLVM compiler on my Mac to do all my examples. The examples are all standard C and you can play with the code on your favorite compiler, but since the details of memory allocation are not mandated by the C standard your results may not look exactly like mine. I will, for example, show how the results I get change when I modify the optimization levels.

    The Question

    The OP @zakaster was asking this:

    Quote

    if i call function `eg_test_copy2`, i noticed that the memory allocated for `new_name` never gets cleared, 

    the output will be

    input new name for person:alex
    address of new_name = 000000000061FD90
    p name = alex
    address of dummy = 000000000061FDE0
    p name = alex

    and the address for `dummy` actually starts after the address of new_name + 4 bytes x 20, and i can still access the old name, is this weird.

    for example if I input another name,

    input new name for person:hello
    address of new_name = 000000000061FD90
    p name = hello
    address of dummy = 000000000061FDE0
    p name = hello

    Here is the code snippet they provided:

    struct person {
        char* name;
    };
    
    void get_person(struct person* p) {
        char new_name[20];  // on stack, gets freed when function returned
        printf("input new name for person:");
        scanf("%s", &new_name);
        p->name = new_name;
        printf("address of new_name = %p\n", &new_name[0]);
    }
    
    void eg_test_copy2(void) {
        struct person p = {"alex"};
        get_person(&p);
        printf("p name = %s\n", p.name);
    
        char dummy[20] = { 0 };
        printf("address of dummy = %p\n", &dummy[0]);
        printf("p name = %s\n", p.name);
    }

    Variable Allocation

    When you declare a variable the compiler will only reserve a memory location to be used by the variable. This process does not actually clear the memory unless the variable has static storage duration; the standard states that only objects with static storage duration (in simple terms, global and static variables) are initialized to 0 by default. If you want any other variable to start with a known value, you have to supply an initializer.
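
    As a quick, minimal illustration of this rule (compile and run it yourself):

    #include <stdio.h>
    
    int global_counter;              // static storage duration: guaranteed to start at 0
    
    int main(void)
    {
        static int call_count;       // also static storage duration: starts at 0
        int local_counter;           // automatic storage duration: value is indeterminate
    
        printf("global_counter = %d\n", global_counter); // always prints 0
        printf("call_count     = %d\n", call_count);     // always prints 0
        printf("local_counter  = %d\n", local_counter);  // may print anything - do not rely on it
        return 0;
    }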

    What actually happens before your main function starts running is that something generally referred to as "c-init" runs first. This is a bit of code that does the work required by the C standard before your code runs. One of the things it does is to clear, usually using a loop, the block of memory that will contain the variables with static storage duration. Other things that may happen here include setting up interrupt vectors and other machine registers, and of course copying the initial values of those static variables that do have initializers into the locations reserved for them.
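
    To make this concrete, here is a purely hypothetical sketch of what such a "c-init" routine could look like. The symbol names and the exact code are toolchain-specific and are only assumptions for illustration:

    /* Hypothetical c-init sketch - real startup code is generated by your toolchain. */
    extern unsigned char __bss_start, __bss_end;        /* assumed linker symbols       */
    extern unsigned char __data_start, __data_end;      /* assumed linker symbols       */
    extern const unsigned char __data_load_start;       /* initial values stored in ROM */
    
    void c_init(void)
    {
        /* Clear all objects with static storage duration that have no initializer. */
        for (unsigned char *p = &__bss_start; p < &__bss_end; p++) {
            *p = 0;
        }
    
        /* Copy the initial values of initialized static objects from ROM into RAM. */
        const unsigned char *src = &__data_load_start;
        for (unsigned char *dst = &__data_start; dst < &__data_end; dst++) {
            *dst = *src++;
        }
    
        /* Typically also sets up interrupt vectors and machine registers, then calls main(). */
    }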

    When a variable goes "out of scope" the memory is no longer reserved. This simply means that it is free for others to use; it does not mean that the memory is cleared the moment it stops being reserved. This is very important to note. It often leads to developers testing code that keeps a pointer to memory which is no longer reserved, and the code seems to work fine until the new owner of that part of memory modifies it, at which point the code inexplicably breaks! No, it was actually broken all along, and you just got lucky that the memory was not being used at the time you were accessing this unreserved piece of memory!

    The classic way this manifests can be seen in our first test (test1) below.

    #include <stdio.h>
    
    char* get_name() {
        char new_name[20];  // on stack, gets freed when function returned
        printf("Enter Name:");
        scanf("%s", new_name);
        return new_name;
    }
    
    int main(void)
    {
        char* theName;
        theName = get_name();
        printf("\r\nThe name was : %s\r\n", theName);
    
        return 0;
    }

    I compile and run this and get :

    > test1
    Enter Name:Orunmila

    The name was : Orunmila
     

    Note: Let me mention here that I was using "gcc test1.c -O3" to compile that; when I use the default optimization or -O1 it prints junk instead. When you do something which is undefined by the C standard, the behavior is not guaranteed to be the same on all machines (or even, as in this case, at different optimization levels).

    So I can easily be fooled into thinking this is working just fine, but it is actually very broken! On LLVM I actually get a compiler warning when I compile that as follows:

    test1.c:7:12: warning: address of stack memory associated with local variable 'new_name' returned
          [-Wreturn-stack-address]
        return new_name;
               ^~~~~~~~
    1 warning generated.

    Did I mention that I do love LLVM?!

    We can quickly see how this breaks down if we call the function more than once in a row like this (test2):

    #include <stdio.h>
    
    char* get_name() {
        char new_name[20];  // on stack, gets freed when function returned
        printf("Enter Name:");
        scanf("%s", new_name);
        return new_name;
    }
    
    int main(void)
    {
        char* theName;
        char* theSecondName;
      
        theName = get_name();
        theSecondName = get_name();
      
        printf("\r\nThe first name was  : %s\r\n", theName);
        printf("The second name was : %s\r\n", theSecondName);
    
        return 0;
    }

    Now we get the following obviously wrong behavior

    Enter Name:N9WXU
    Enter Name:Orunmila
     
    The first name was : Orunmila
    The second name was : Orunmila

    This happens because the declarations of theName and theSecondName in the code only reserve enough memory to store a pointer to a memory location. When the function returns it does not actually return the string containing the name; it only returns the address of the string, i.e. the address of the memory location which used to contain the string inside get_name().

    At the time when I print the name, the memory is no longer reserved, but nobody else has used it since I called the function (in other words, I did not perform any other operation which makes use of the stack), so the code still prints the name. Both name pointers are pointing to the same location in memory, which is actually just a coincidence; the compiler would have been within its rights to place the two buffers in different locations.

    If you call a function that has local variables between fetching the names and printing them, the names will be overwritten by those variables and the program will print something that looks like gibberish instead of the names I was typing. We will leave it to the reader to play with this and see how and why it breaks. I would also encourage you to add the following to the end; these print statements will clearly show you where the variables are located and why they print the same thing - you will notice that the values of both pointers are the same!

    printf("Location of theName       : %p\r\n", &theName);       // This prints the location of the first pointer
    printf("Location of theSecondName : %p\r\n", &theSecondName); // This prints the location of the second pointer
    
    printf("Value of theName       : %p\r\n", theName);       // This prints the value of the first pointer
    printf("Value of theSecondName : %p\r\n", theSecondName); // This prints the value of the second pointer

    This all should answer the question asked, which was "I can still access the old name, is this weird?". The answer is no, this is not weird at all, but it is undefined, and if you called some other functions in between you would see the memory which used to hold the old name being overwritten in weird and wonderful ways, exactly as expected.

    How does the stack grow?

    Now that we have printed out some pointers this brings us to the next question. Our OP noticed that "the address for `dummy` actually starts after the address of new_name + 4 bytes x 20".

    We need to be careful here. Since sizeof(char) is always 1, addresses of chars differ by one per character, so the addresses being "4 bytes x 20" apart carries no special meaning by itself; in this case it is pure coincidence. A couple of things should be noted here:

    1. The stack usually grows downwards in memory (we will verify this with a small sketch right after this list).
    2. The size of a char[20] buffer will always be 20 and never 4x20 (sizeof(char) is defined to be 1; see section 6.5.3.4 of the C99 standard).
    3. In the example question the address of new_name was 0x61FD90, which is smaller than 0x61FDE0; in other words new_name was placed on the stack AFTER dummy.
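
    As promised, here is a small sketch you can use to see which way the stack grows on your own compiler. Converting addresses to integers like this is implementation-defined, so treat the result as an observation about your particular toolchain and nothing more:

    #include <stdio.h>
    #include <stdint.h>
    
    static void inner(uintptr_t outer_addr)
    {
        int inner_local;
        /* Comparing the numeric values of two stack addresses: implementation-defined,
           but good enough to observe what your compiler actually does. */
        if ((uintptr_t)&inner_local < outer_addr)
            printf("The stack appears to grow downwards here\n");
        else
            printf("The stack appears to grow upwards here\n");
    }
    
    int main(void)
    {
        int outer_local;
        inner((uintptr_t)&outer_local);
        return 0;
    }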

    Here is a diagram which shows a typical layout that a C compiler may choose to use.

    [Diagram: typical stack layout a C compiler may choose]

    The reason there was a gap of 80 bytes between the pointers was simply the way the compiler decided to place the variables on the stack. It probably created some extra space on the stack for passing parameters around, and that just happened to be exactly 60 bytes, which resulted in a gap of 80.

    The C standard only defines the scope of the variables, it does not mandate how the compiler must place them in memory. This can even vary for the same compiler when you add more code as the linker may move things around and will probably change when you change the optimization settings for the compiler.

    I did some tests with LLVM: the addresses in the example differ significantly when I use optimization level -O1, but when I set it to -O3 the difference between the two pointers is exactly 20 bytes for the example code.

    Getting back to Structures and Strings

    Looking at the intent of the OP's code we can now get back to how structures and strings work in C.

    With our interface like this

    struct person {
        char* name;
    };
    
    void get_person(struct person* p);

    What we have is a struct which, very importantly, does NOT contain a string; it only contains the address of a string. That person struct will (typically) reserve the 4 bytes of RAM required to store a 32-bit address, which will be the location where a string exists in memory. If you use it like this you will find that the address of "name" is exactly the same as the address of the person struct you are passing in, so if our OP had tested the following this would have been clear:

    struct person p = {"alex"};
        
    printf("Address of p      = %p\n", &p);
    printf("Address of p.name = %p\n", &p.name);

    These two addresses must be the same, because the address of a struct is the same as the address of its first member - and this struct has only one member!

    When we want to work with a structure that contains the name of a person we have two choices, and they both have pros and cons.

    1. Let the struct contain a pointer and use malloc to allocate memory for the string on the heap. (not recommended for embedded projects!)
    2. Let the struct contain an array of chars that can contain the name.

    For option 1 the declaration is fine, but the get_person function would have to look as follows:

    void get_person(struct person* p) {
        char* new_name = malloc(20);     // on heap, so remember to check if it returns NULL !
        printf("input new name for person:");
        scanf("%s", new_name);
        p->name = new_name;
        printf("address of new_name = %p\n", new_name);
    }

    Of course, now you have to check and handle the case where we run out of memory and malloc returns NULL; we also have to be cognisant of heap fragmentation; and, most importantly, we now have to be very careful to ensure that the memory gets freed or we will have a memory leak!
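
    To complete the picture for option 1, the caller now owns that memory and has to give it back. Something along these lines would do it; free_person and eg_test_copy3 are hypothetical helpers added here purely for illustration, and this assumes <stdlib.h> has been included for malloc and free:

    /* Hypothetical helper - not part of the original code. */
    void free_person(struct person* p) {
        free(p->name);      // return the heap memory
        p->name = NULL;     // avoid leaving a dangling pointer behind
    }
    
    void eg_test_copy3(void) {
        struct person p = { NULL };
        get_person(&p);                     // get_person allocates p->name with malloc
        if (p.name != NULL) {
            printf("p name = %s\n", p.name);
        }
        free_person(&p);                    // without this we leak the buffer on every call
    }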

    For option 2 the structure and the function have to change to something like the following:

    struct person {
        char  name[20]; 
    };
    
    void get_person(struct person* p) {
        printf("input new name for person:");
        scanf("%s", p->name);
        printf("address of p->name = %p\n", p->name);
    }

    Of course now we use 20 bytes of memory regardless of how long the name is, but on the upside we do not have to worry about freeing the memory; when the instance goes out of scope the compiler takes care of that for us. We can also now assign one person struct to another, which will actually copy the entire string, and we still have the option of passing it by reference by using the address of the object!
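
    Here is a small self-contained sketch, using the option 2 layout above, that demonstrates the copy behaviour:

    #include <stdio.h>
    #include <string.h>
    
    struct person {
        char name[20];
    };
    
    int main(void)
    {
        struct person alice;
        struct person copy;
    
        strcpy(alice.name, "alice");
        copy = alice;                   // copies the entire 20-byte array, not just a pointer
        strcpy(copy.name, "bob");       // changing the copy...
    
        printf("%s\n", alice.name);     // ...does not touch the original: prints "alice"
        printf("%s\n", copy.name);      // prints "bob"
        return 0;
    }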

    Conclusion

    Be careful when using C strings in structures, there are a lot of ways these can get you into trouble. Memory leaks and shallow copies, where you make a copy of the pointer but not the string, are very likely to catch you sooner rather than later.


  2. When comparing CPUs and architectures it is also a good idea to compare the frameworks and learn how the framework will affect your system.  In this article I will be comparing a number of popular Arduino-compatible systems to see how different "flavors" of Arduino stack up in the pin toggling test.  When I started this effort, I thought it would be a straightforward demonstration of CPU efficiency, clock speed and compiler performance on the one side against the Arduino framework implementation on the other.  As is often the case, if you poke deeply into even the most trivial of systems you will always find something to learn.

    As I look around my board stash I see that there are the following Arduino compatible development kits:

    1. Arduino Nano Every (ATMega 4809 @ 20MHz AVR Mega)
    2. Mini Nano V3.0 (ATMega 328P @ 16MHz AVR)
    3. RobotDyn SAMD21 M0-Mini (ATSAMD21G18A @ 48MHz Cortex M0+)
    4. ESP-12E NodeMCU (ESP8266 @ 80MHz Tensilica)
    5. Teensy 3.2 (MK20DX256VLH7 @ 96MHz Cortex M4)
    6. ESP32-WROOM-32 (ESP32 @ 240MHz Tensilica)

    And each of these kits has an available Arduino framework.  Say what you will about the Arduino framework, there are some serious advantages to using it, and a few surprises.  For the purpose of this testing I will be running one program on every board.  I will use vanilla "Arduino" code and make zero changes for each CPU.  The Arduino framework is very useful for normalizing the API to the hardware in a consistent and portable manner.  This is mostly true at the low levels like timers, PWM and digital I/O, and even more so as you move to higher layers like the String library or WiFi.  Strangely, there are no promises of performance.  For instance, every Arduino program has a setup() function where you put your initialization, and a loop() function that is called over and over.  With this in mind it is easy to imagine the following implementation:

    extern void setup(void);
    extern void loop(void);
    
    int main(void)
    {
      setup();
      while(1)
      {
        loop();
      }
    }

    And in fact when you dig into the AVR framework you find the following code in main.cpp

    int main(void)
    {
       init();
    
       initVariant();
    
    #if defined(USBCON)
       USBDevice.attach();
    #endif
    	
       setup();
        
       for (;;) {
          loop();
          if (serialEventRun) serialEventRun();
       }     
       return 0;
    }

    There are a few "surprises" that really should not be surprises.  First, the Arduino environment needs to be initialized (init()), then the hardware variant (initVariant()); then, since we might be using a USB device, USB is started (USBDevice.attach()); and finally the user setup() function is called.  Only then do we start our infinite loop.  Between calls to the loop function the code maintains the serial connection, which could be USB.  I suppose that other frameworks could implement this environment a little bit differently, and there could be significant consequences to these choices.

    The Test

    For this test I am simply going to initialize 1 pin and then set it high and low.  Here is the code.

    void setup()
    {
      pinMode(2,OUTPUT);
    }
    
    void loop()
    {
      digitalWrite(2,HIGH);
      digitalWrite(2,LOW);
    }

    I am expecting this to make a short high pulse and a slightly longer low pulse.  The longer low pulse is to account for the extra overhead of looping back.  This is not likely to be as fast as the pin toggles Orunmila did in the previous article but I do expect it to be about half as fast.  

    Here are the results. The 2 red lines at the bottom are the best case optimized raw speed from Orunmila's comparison.

    [Chart: pin-toggle results for all boards; the two red lines at the bottom are the optimized raw-speed reference]

    That is a pretty interesting chart, and if we simply compare the data for the ATmega 4809 with ASM and with Arduino code, you see a 6x difference in performance.  Let us look at the details and summarize at the end.

    Nano 328P

    So here is the first victim: the venerable AVR ATmega328P running at 16MHz.  The high pulse is 3.186uS while the low pulse is 3.544uS, making a pulse frequency of 148.2kHz.

    Clearly the high and low pulses are nearly the same, so the extra check to handle the serial ports is not very expensive, but the digitalWrite abstraction is much more expensive than I was anticipating.

     

    [Oscilloscope capture: Nano 328P pin toggle]

    Nano Every

    The Nano Every uses the much newer ATmega 4809 at 20MHz.  The 4809 is a different variant of the AVR CPU with some additional optimizations, like set and clear registers for the ports.  This should be much faster.

    [Oscilloscope capture: Nano Every pin toggle]

    The high pulse is 1.192uS and the low pulse is 1.504uS.  Again the pulses are almost the same size, so the additional overhead outside of the loop function must be fairly small; perhaps it is the same serial port test.  Interestingly, one of the limiting factors of popular Arduino 3D printer controller projects such as GRBL is the pin toggle rate for driving the stepper motor pulses.  A 4809-based controller could be 2x faster for the same stepper code.

    Sam D21 Mini M0

    Now we are stepping up to an ARM Cortex M0+ at 48MHz.  I actually expect nearly 2x the performance of the 4809, simply because the instructions required to set pins high and low should be essentially the same.

    [Oscilloscope capture: SAMD21 M0-Mini pin toggle]

    Wow!  I was definitely NOT expecting the timing to get worse than the 4809.  The high pulse width is 1.478uS and the low pulse width is 1.916uS making the frequency 294.6kHz.  Obviously toggling pins is not a great measurement of CPU performance but if you need fast pin toggling in the Arduino world, perhaps the SAMD21 is not your best choice.

    Teensy 3.2

    This is an NXP Cortex M4 CPU at 96 MHz.  This CPU has double the clock speed of the D21, and the M4 core has lots of great features, though those features may not help toggle pins quickly.

    [Oscilloscope capture: Teensy 3.2 pin toggle, default settings]

    Interesting.  Clearly this device is very fast as shown by the short high period of only 0.352uS.  But, this framework must be doing quite a lot of work behind the scenes to justify the 2.274uS of loop delay.

    Looking a little more closely I see a number of board options for this hardware.  First, I see that I can disable the USB; presumably the USB is being serviced between calls to the loop function.  I also see a number of compiler optimization options.  If I turn off the USB and select the "fastest" optimizations, what is the result?

    Teensy 3.2, No USB and Fastest optimizations

    Making these two changes and re-running the same C code produces this result:

    [Oscilloscope capture: Teensy 3.2 pin toggle, USB disabled, fastest optimizations]

    That is much better.  It is interesting to see that the compiler change alone is about 3x faster for this test (measured on the high pulse) and that removing the USB saves about 1uS in the loop rate.  This is not a definitive test of the optimizations, and the code probably grew a bit, but it is a stark reminder that optimization choices can make a big difference.

    ESP8266

    The ESP8266 is a 32-bit Tensilica CPU.  This is still a load/store architecture, so its performance will largely match ARM, though undoubtedly there are cases where it will be a bit different.  The 8266 runs at 80MHz, so I do expect the performance to be similar to the Teensy 3.2.  The wildcard is that the 8266 framework is intended to support WiFi, so it is running FreeRTOS and the Arduino loop is just one thread in the system.  I have no idea what that will do to our pin toggle, so it is time to measure.

    [Oscilloscope capture: ESP8266 pin toggle]

    Interesting.  It is actually quite slow, and clearly there is quite a bit of system housekeeping happening in the main loop.  The high pulse is only 0.948uS, which is very similar to the Nano Every running at a quarter of the clock speed.  The low pulse is simply slow.  This does seem to be a good device for IoT, but not for pin toggling.

    ESP32

    The ESP32 is a dual-core, very fast machine, but it does run the code out of a cache because the code is stored in a serial memory.  Of course our test is quite short, so perhaps we do not need to fear a cache miss.

    Like the ESP8266, the Arduino framework is built upon a FreeRTOS task.  But this part has a second CPU and a lot more clock speed, so let's look at the results:

    [Oscilloscope capture: ESP32 pin toggle]

    Interesting, the toggle rate is about 2x the Teensy while the clock speed is about 3x.  I do like how the pulses are nearly symmetrical.  A quick peek at the source code for the framework shows the Arduino running as a thread but the thread updates the watchdog timer and the serial drivers on each pass through the loop.

    Conclusions

    It is very educational to make measurements instead of assumptions when evaluating an MCU for your next project.  A specific CPU may have fantastic specifications and even demonstrations, but it is critical to include the complete development system and code framework in your evaluation.  It is a big surprise to find that the 16MHz ATmega328P can actually toggle a pin faster than the ESP8266 when used in a basic Arduino project.

    The summary graph at the top of the article is duplicated here:

    [Chart: pin-toggle results summary, repeated from the top of the article]

    In this graph, the Pin Toggling Speed is actually 1/(the high period).  This was done on purpose so that only the pin-toggle efficiency is being compared; in the test program, the low period is where the loop() function ends and other housekeeping work can take place.  If we want to compare the CPU/code efficiency, we should really normalize the pin toggling frequency to a common clock speed, since we can always compensate for inefficiency with more clock speed.
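
    As a worked example using the Nano 328P numbers above: the high pulse was 3.186uS, so the plotted Pin Toggling Speed is 1/3.186uS ≈ 314kHz, and normalizing by the 16MHz clock gives roughly 314kHz / 16MHz ≈ 0.02, or about one toggle per 50 clock cycles.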

    [Chart: pin-toggle rate normalized by clock speed]

    This graph is produced by dividing the frequency by the clock speed, and now we can compare the relative efficiencies.  The Cortex M4 and its framework in the Teensy 3.2 look quite impressive now.  Clearly the ESP32 is pretty good, but it is using its clock speed for the win.  The Mega 4809 has a reasonable framework, just not enough clock speed.  All that aside, the ASM versions (or even a faster framework) could seriously improve all of these numbers.  The poor ESP8266 is pretty dismal.

    So what is happening in the digitalWrite() function that makes the performance so slow?  Put another way, what am I getting in return for the low performance?  There are really three reasons.

    1. Portability.  Each device needs adaptation work behind the pin interface, so the price of portability is runtime efficiency.
    2. Framework support.  There are other framework features (PWM, for example) that can be affected by writing to the pins, so digitalWrite must account for them.
    3. Application ignorance.  The framework (and this function) cannot know how the system is constructed, so it must plan for the worst.

    Let us look at the digitalWrite for the AVR:

    void digitalWrite(uint8_t pin, uint8_t val)
    {
    	uint8_t timer = digitalPinToTimer(pin);
    	uint8_t bit = digitalPinToBitMask(pin);
    	uint8_t port = digitalPinToPort(pin);
    	volatile uint8_t *out;
    
    	if (port == NOT_A_PIN) return;
    
    	// If the pin that support PWM output, we need to turn it off
    	// before doing a digital write.
    	if (timer != NOT_ON_TIMER) turnOffPWM(timer);
    
    	out = portOutputRegister(port);
    
    	uint8_t oldSREG = SREG;
    	cli();
    
    	if (val == LOW) {
    		*out &= ~bit;
    	} else {
    		*out |= bit;
    	}
    
    	SREG = oldSREG;
    }

    Note that the first thing is a few lookup functions to determine the timer, port and bit described by the pin number.  These lookups can be quite fast, but they do cost a few cycles.  Next we ensure we have a valid pin and turn off any PWM that may be active on that pin.  This is just safe programming and framework support.  Next we figure out the output register for the update, turn off the interrupts (saving the interrupt state), set or clear the pin, and restore interrupts.  If we knew we were not using PWM (as in this application) we could omit the turnOffPWM call.  If we knew all of our pins were valid we could remove the NOT_A_PIN test.  Unfortunately all of these optimizations require knowledge of the application, which the framework cannot have.  Clearly we need new tools to describe embedded applications.
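
    Just to show what application knowledge buys you, here is a hypothetical sketch of the same toggle done with direct port access.  It assumes Arduino digital pin 2 maps to PD2, which is true for the classic ATmega328P Nano/Uno pinout but not for every board:

    // Hypothetical sketch: assumes Arduino pin 2 is PD2 (classic ATmega328P pinout).
    // No pin lookups, no PWM check, no interrupt save/restore.
    #include <avr/io.h>
    
    void setup()
    {
      DDRD |= _BV(DDD2);          // make PD2 an output
    }
    
    void loop()
    {
      PORTD |= _BV(PORTD2);       // pin high - compiles to a single SBI instruction
      PORTD &= ~_BV(PORTD2);      // pin low  - compiles to a single CBI instruction
    }

    Each of those writes compiles down to a single two-cycle instruction on the 328P, which is why hand-optimized toggling leaves digitalWrite() so far behind.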

    This has been a fun bit of testing.  I look forward to your comments and suggestions for future toe-to-toe challenges.

    Good Luck and go make some measurements.

    PS:  I realize that this pin-toggling example is simplistic at best.  There are some fine Arduino libraries and peripherals that could easily toggle pins much faster than the results shown here.  However, this is a simple apples-to-apples test of identical code in "identical" frameworks on different CPUs, so the comparisons are valid and useful.  That said, if you have any suggestions feel free to enlighten us in the comments.

  3. Programming Lore

    Orunmila
    Latest Entry

    By Orunmila,

    Melvin Conway quipped back in 1967 that "organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations."

    Over the decades this old adage has proven to be quite accurate, and it has become known as "Conway's Law". Researchers from MIT and Harvard have since shown that there is strong evidence for this correlation; they called it "The Mirroring Hypothesis".

    When you read "The Mythical Man-Month" by Fred Brooks, you see that we already knew back in the seventies that there is no silver bullet when it comes to software engineering, and that the reason for this is essentially the complexity of software and how we deal with it. It turns out that adding more people to a software project increases the number of people each of us needs to communicate with and the number of people who need to understand the code. When we just make one big team where everyone has to communicate with everyone, the code tends to reflect this structure. The more people we add to such a team, the more the structure starts to resemble something we all know all too well!


    When we follow the age-old technique of divide and conquer, making small Agile teams that each work on a part of the code which is their single responsibility, it turns out that we end up getting encapsulation and modularity, with dependencies managed between the modules.

    No wonder the world is embracing agile everywhere nowadays!

    You can of course do your own research on this. Here are some org charts of well-known companies that you can use to check the hypothesis for yourself!

    [Image: org charts of several well-known companies]

     
