All Activity
  2. Hey, I just noticed that there are some overclock options. Here is the result when clocked at 960MHz. I could not get it to run at 1GHz. They did warn that cooling was required.
  3. More data! I just got a Teensy 4 and it is pretty fast. Compiling it with "fastest" at 600MHz provides the following results. Strangely, compiling it with "faster" provides slightly better results (6ns). This is pretty fast, but I was expecting a bit more performance since it is 6x faster than the Teensy 3.2 tested before. There is undoubtedly a good reason for this performance, and I expect pin toggling to be limited by wait states in writing to the GPIO peripherals. In any case this is still a fast result.
  4. Looks like it gave the segfault upon running? And yes, that is what I would expect, because on a PC your code is trying to write to memory which does not permit writes. On an embedded system the behavior will depend a lot on the underlying system. Some systems will actually crash in some way; others, like XC8-based PIC microcontrollers, actually copy the read-only section into RAM, so the code will appear to work. This is why this is so dangerous: the behavior depends on the target and the toolchain, and when the same thing is one day tried on another system it can be a real challenge to figure out the mistake because it is so easily masked.
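     For anyone who wants to reproduce the two behaviors contrasted above, here is a minimal sketch showing a string literal next to a writable array copy. Note the exact outcome of the literal write is undefined; a segfault is merely the typical result on a desktop OS:

         #include <stdio.h>

         int main(void)
         {
             char *literal = "string1"; /* points at a string literal: read-only on most hosted platforms */
             char array[]  = "string1"; /* a writable copy of the literal, with automatic storage */

             array[3] = 'a';            /* fine: this memory belongs to us -> "strang1" */
             printf("%s\n", array);

             literal[3] = 'a';          /* undefined behavior: typically a segfault at run time on a PC */
             printf("%s\n", literal);   /* usually never reached on a desktop system */

             return 0;
         }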
  5. Thanks for pointing out the mistake, we have updated the text accordingly.
  6. string1[3] = 'a'; This gave a segmentation fault on compilation
  7. Quoting the blog: "It is also permissible to do this: string1[3] = 'a'; which will change the original string into 'striag1'." I believe the original string will change to "strang1".
  8. @zakaster - I have posted a more comprehensive answer to the blog here https://www.microforum.cc/blogs/entry/49-a-question-about-structures-strings-and-pointers/
  9. In the comments of our blog on structures in C, a member asked me a specific question about what they observed. As this code is a beautiful example of a number of problems we often see, we thought it a good idea to make an entry just to discuss it, as there really is a lot going on here. We will cover the following:

     - What allocation and freeing of memory on the stack means, and the lifetime of objects
     - In which direction the stack usually grows (note - the C standard does not contain the word "stack", so this is compiler-specific)
     - Another look at deep vs. shallow copies of C strings inside structures

     In order to keep this all generic, I am going to be using the LLVM compiler on my Mac for all my examples. The examples are all standard C and you can play with the code on your favorite compiler, but since the details of memory allocation are not mandated by the C standard your results may not look exactly like mine. I will, for example, show how the results I get change when I modify the optimization levels.

     The Question

     The OP @zakaster was asking this. Here is the code snippet they provided:

         struct person {
             char* name;
         };

         void get_person(struct person* p) {
             char new_name[20]; // on stack, gets freed when function returned
             printf("input new name for person:");
             scanf("%s", &new_name);
             p->name = new_name;
             printf("address of new_name = %p\n", &new_name[0]);
         }

         void eg_test_copy2(void) {
             struct person p = {"alex"};
             get_person(&p);
             printf("p name = %s\n", p.name);
             char dummy[20] = { 0 };
             printf("address of dummy = %p\n", &dummy[0]);
             printf("p name = %s\n", p.name);
         }

     Variable Allocation

     When you declare a variable, the compiler will only reserve a memory location to be used by the variable. This process will not actually clear the memory unless the variable has static storage duration; the standard states that only variables with static storage duration (in simple terms, global variables) shall be initialized to 0. If you want any other variable to be initialized, you have to supply an initializer. What actually happens before your main function starts running is that something generally referred to as "c-init" will run. This is a bit of code that does the work required by the C standard before your code runs, and one of the things it will do is clear, usually using a loop, the block of memory which will contain the statically allocated variables. Other things that may be in here are setting up interrupt vectors and other machine registers, and of course copying the initial values of global variables that do have initializers into the locations reserved for those variables.

     When a variable goes "out of scope", the memory is no longer reserved. This simply means that it is free for others to use; it does not mean that the memory is cleared when it is no longer reserved. This is very important to note. This phenomenon often leads to developers testing code which holds a pointer to memory that is no longer reserved, and the code seems to work fine until the new owner of that part of memory modifies it, at which point the code inexplicably breaks! No, it was actually broken all along, and you just got lucky that the memory was not in use at the time you were accessing this unreserved piece of memory! The classic way this manifests can be seen in our first test (test1) below.
         #include <stdio.h>

         char* get_name() {
             char new_name[20]; // on stack, gets freed when function returns
             printf("Enter Name:");
             scanf("%s", new_name);
             return new_name;
         }

         int main(void) {
             char* theName;
             theName = get_name();
             printf("\r\nThe name was : %s\r\n", theName);
             return 0;
         }

     I compile and run this and get:

         > test1
         Enter Name:Orunmila
         The name was : Orunmila

     Note: Let me mention here that I was using "gcc test1.c -O3" to compile that; when I use the default optimization or -O1 it prints junk instead. When you do something which is undefined in the C standard, the behavior is not guaranteed to be the same on all machines.

     So I can easily be fooled into thinking this is working just fine, but it is actually very broken! On LLVM I actually get a compiler warning when I compile that, as follows:

         test1.c:7:12: warning: address of stack memory associated with local variable 'new_name' returned [-Wreturn-stack-address]
             return new_name;
                    ^~~~~~~~
         1 warning generated.

     Did I mention that I do love LLVM?!

     We can quickly see how this breaks down if we call the function more than once in a row, like this (test2):

         #include <stdio.h>

         char* get_name() {
             char new_name[20]; // on stack, gets freed when function returns
             printf("Enter Name:");
             scanf("%s", new_name);
             return new_name;
         }

         int main(void) {
             char* theName;
             char* theSecondName;
             theName = get_name();
             theSecondName = get_name();
             printf("\r\nThe first name was : %s\r\n", theName);
             printf("The second name was : %s\r\n", theSecondName);
             return 0;
         }

     Now we get the following obviously wrong behavior:

         Enter Name:N9WXU
         Enter Name:Orunmila
         The first name was : Orunmila
         The second name was : Orunmila

     This happens because the declarations of theName and theSecondName in the code only reserve enough memory to store a pointer to a memory location. When the function returns, it does not actually return the string containing the name; it only returns the address of the string - the name of the memory location which used to contain the string inside of get_name(). At the time when I print the name, the memory is no longer reserved, but nobody else has used it since I called the function (I did not perform any other operation which makes use of the stack, in other words). The code is still printing the name, but both name pointers are pointing to the same location in memory (which is actually just a coincidence; the compiler would have been within its rights to place the two strings in different locations). If you call a function that has a local variable between fetching the names and printing them, the names will be overwritten by these variables and it will print something which looks like gibberish instead of the names I was typing. We will leave it to the reader to play with this and see how/why this breaks.

     I would encourage you to also add this to the end; these print statements will clearly show you where the variables are located and why they print the same thing - you will notice that the values of both pointers are the same!

         printf("Location of theName : %p\r\n", &theName);             // This prints the location of the first pointer
         printf("Location of theSecondName : %p\r\n", &theSecondName); // This prints the location of the second pointer
         printf("Value of theName : %p\r\n", theName);                 // This prints the value of the first pointer
         printf("Value of theSecondName : %p\r\n", theSecondName);     // This prints the value of the second pointer

     This all should answer the question asked, which was "I can still access the old name, is this weird?".
     The answer is no, this is not weird at all, but it is undefined, and if you called some other functions in between you would see the memory which used to hold the old name being overwritten in weird and wonderful ways, as expected.

     How does the stack grow?

     Now that we have printed out some pointers, this brings us to the next question. Our OP noticed that "the address for `dummy` actually starts after the address of new_name + 4 bytes x 20". We need to be careful here: addresses in C count individual bytes, which means that the address being 20x4 away makes no sense by itself, and in this case it is a pure coincidence. A couple of things should be noted here:

     - The stack usually grows downwards in memory.
     - The size of a char[20] buffer will always be 20 and never 4x20 (specified in section 6.5.3.4 of the C99 standard).
     - In the example question the address of new_name was 0x61FD90, which is actually smaller than 0x61FDE0; in other words it was placed on the stack AFTER dummy.

     Here is a diagram which shows a typical layout that a C compiler may choose to use.

     The reason there was a gap of 80 between the pointers was simply due to the way the compiler decided to place the variables on the stack. It was probably creating some extra space on the stack for passing parameters around, and this just happened to be exactly 60 bytes, which resulted in a gap of 80. The C standard only defines the scope of the variables; it does not mandate how the compiler must place them in memory. This can even vary for the same compiler when you add more code, as the linker may move things around, and it will probably change when you change the optimization settings for the compiler. I did some tests with LLVM: the addresses in the example differ significantly when I am using optimization -O1, but when I set it to -O3 the difference between the two pointers is exactly 20 bytes for the example code.

     Getting back to Structures and Strings

     Looking at the intent of the OP's code, we can now get back to how structures and strings work in C. With our interface like this:

         struct person {
             char* name;
         };

         void get_person(struct person* p);

     what we have is a struct which very importantly does NOT contain a string; it only contains the address of a string. That person struct will reserve (typically) the 4 bytes of RAM required to store a 32-bit address, which will be the location where a string exists in memory. If you use it like this, you will most often find that the address of "name" will be exactly the same as the address of the person struct you are passing in, so if our OP tested the following this would have been clear:

         struct person p = {"alex"};
         printf("Address of p = %p\n", &p);
         printf("Address of p.name = %p\n", &p.name);

     These two addresses must be the same because the struct has only one member!

     When we want to work with a structure that contains the name of a person, we have 2 choices, and they both have pros and cons:

     1. Let the struct contain a pointer and use malloc to allocate memory for the string on the heap (not recommended for embedded projects!).
     2. Let the struct contain an array of chars that can contain the name.

     For option 1 the declaration is fine, but the get_person function would have to look as follows:

         void get_person(struct person* p) {
             char* new_name = malloc(20); // on heap, so remember to check if it returns NULL!
printf("input new name for person:"); scanf("%s", new_name); p->name = new_name; printf("address of new_name = %p\n", new_name); } Of course, now you have to check and handle the case where we run out of memory and malloc returns NULL, we also have to be cognisant of heap fragmentation and most importantly we now have to be very careful to ensure that the memory gets freed or we will have a memory leak! For option 2 the structure and the function has to change to something like the following: struct person { char name[20]; } void get_person(struct person* p) { printf("input new name for person:"); scanf("%s", p->name); printf("address of p->name = %p\n", p->name); } Of course now we use 20 bytes of memory regardless how long the name is, but on the upside we do not have to worry about freeing the memory, when the instance goes out of scope the compiler will take care of that for us. Also now we can assign one person struct to another which will actually copy the entire string and we still have the option of passing it by reference by using the address of the object! Conclusion Be careful when using C strings in structures, there are a lot of ways these can get you into trouble. Memory leaks and shallow copies, where you make a copy of the pointer but not the string, are very likely to catch you sooner rather than later.
  10. Your example is actually quite complex, and it contains a number of really typical and very dangerous mistakes. I think it will take up way too much space here in the comments to answer this properly, so I will write another blog just about your question and post that over the weekend when I get some time to put it together.

      A hint as to what is wrong with the code: the buffer you are using in get_person() is allocated on the stack, and that space is made available for re-use after the function returns. After this point you have a struct which contains a pointer to the buffer on the stack, but the memory no longer belongs to you, and when you call any other function it will get overwritten. There is no rule that the compiler must allocate variables on the stack consecutively, and actually XC8 uses a "compiled stack", which means that variables are placed permanently where they will not overlap. You can probably get behavior closer to what you expect if you change the stack model to use a software stack.

      The last time this happened to someone in my team, they were calling get_person for the first name and then for the second name, and after calling it twice they tried to print out both names - and both structs had the same string. If you call it many times, all the people will have the last name you entered. Try this with your code:

          struct person p1 = {"nobody"};
          struct person p2 = {"nobody"};
          get_person(&p1);
          get_person(&p2);
          printf("p1 name = %s\n", p1.name);
          printf("p2 name = %s\n", p2.name);

      You will enter a new name in get_person for each one, and after that the printing should not print "nobody" but the 2 different names you have entered, and both names should be different. Let me know if that behaves as you expected? Read my upcoming blog on stack usage and pointers to see why 🙂
  11. Sorry, I used a duplicated name. For example, if I input another name:

          input new name for person:hello
          address of new_name = 000000000061FD90
          p name = hello
          address of dummy = 000000000061FDE0
          p name = hello
  12. Hello sir, really good topic. I have a question regarding the example on deep and shallow copy though:

          struct person {
              char* name;
          };

          void get_person(struct person* p) {
              char new_name[20]; // on stack, gets freed when function returned
              printf("input new name for person:");
              scanf("%s", &new_name);
              p->name = new_name;
              printf("address of new_name = %p\n", &new_name[0]);
          }

          void eg_test_copy2(void) {
              struct person p = {"alex"};
              get_person(&p);
              printf("p name = %s\n", p.name);
              char dummy[20] = { 0 };
              printf("address of dummy = %p\n", &dummy[0]);
              printf("p name = %s\n", p.name);
          }

      If I call the function eg_test_copy2, I notice that the memory allocated for new_name never gets cleared. The output will be:

          input new name for person:alex
          address of new_name = 000000000061FD90
          p name = alex
          address of dummy = 000000000061FDE0
          p name = alex

      and the address for `dummy` actually starts after the address of new_name + 4 bytes x 20, and I can still access the old name. Is this weird?
  13. Thanks for pointing that out! I looked over the code 3 times before I realized you were referring to the table 🙂 And yes, I agree there are a lot of ways to optimize what happens before the loop; we decided not to even bother with making that smaller, as it did not form part of our focus here. For example, I also noticed that the Start/MCC generated code for the AVR was actually quite large - more than 4x the size of the PIC code; in fact it took more than 800 instructions just to make a loop that toggles a pin. In ASM I can do all that in almost 100x fewer instructions, but if this was going to be a reasonably fast read we had to be laser-focused. The idea in the longer run is to add a lot of small posts that each focus on one specific aspect, and we will be sure to cover the advantages of OUT in the future sometime, based on your recommendation.
  14. Your update didn't include changing the "loop size" for AVR... You could also have used the OUT instruction, which would cut down on the setup...
  15. When comparing CPUs and architectures, it is also a good idea to compare the frameworks and learn how the framework will affect your system. In this article I will be comparing a number of popular Arduino compatible systems to see how different "flavors" of Arduino stack up in the pin toggling test. When I started this effort, I thought it would be a straightforward demonstration of CPU efficiency, clock speed and compiler performance on the one side against the Arduino framework implementation on the other. As is often the case, if you poke deeply into even the most trivial of systems, you will always find something to learn.

      As I look around my board stash, I see the following Arduino compatible development kits:

      - Arduino Nano Every (ATmega4809 @ 20MHz AVR Mega)
      - Mini Nano V3.0 (ATmega328P @ 16MHz AVR)
      - RobotDyn SAMD21 M0-Mini (ATSAMD21G18A @ 48MHz Cortex M0+)
      - ESP-12E NodeMCU (ESP8266 @ 80MHz Tensilica)
      - Teensy 3.2 (MK20DX256VLH7 @ 96MHz Cortex M4)
      - ESP32-WROOM-32 (ESP32 @ 240MHz Tensilica)

      And each of these kits has an available Arduino framework. Say what you will about the Arduino framework, there are some serious advantages to using it, and a few surprises. For the purpose of this testing I will be running one program on every board. I will use vanilla "Arduino" code and make zero changes for each CPU.

      The Arduino framework is very useful for normalizing the API to the hardware in a very consistent and portable manner. This is mostly true at the low levels like timers, PWM and digital I/O, and even more true as you move to higher layers like the String library or WiFi. Strangely, there are no promises of performance. For instance, every Arduino program has a setup() function where you put your initialization and a loop() function that is called very often. With this in mind it is easy to imagine the following implementation:

          extern void setup(void);
          extern void loop(void);

          void main(void) {
              setup();
              while(1) {
                  loop();
              }
          }

      And in fact, when you dig into the AVR framework, you find the following code in main.cpp:

          int main(void)
          {
              init();
              initVariant();
          #if defined(USBCON)
              USBDevice.attach();
          #endif
              setup();
              for (;;) {
                  loop();
                  if (serialEventRun) serialEventRun();
              }
              return 0;
          }

      There are a few "surprises" that really should not be surprises. First, the Arduino environment needs to be initialized (init()), then the HW variant (initVariant()), then we might be using a USB device so get USB started (USBDevice.attach()), and finally the user setup() function. Then we start our infinite loop. Between calls to the loop function, the code maintains the serial connection, which could be USB. I suppose that other frameworks could implement this environment a little bit differently, and there could be significant consequences to these choices.

      The Test

      For this test I am simply going to initialize 1 pin and then set it high and low. Here is the code:

          void setup() {
              pinMode(2,OUTPUT);
          }

          void loop() {
              digitalWrite(2,HIGH);
              digitalWrite(2,LOW);
          }

      I am expecting this to make a short high pulse and a slightly longer low pulse. The longer low pulse is to account for the extra overhead of looping back. This is not likely to be as fast as the pin toggles Orunmila did in the previous article, but I do expect it to be about half as fast. Here are the results. The 2 red lines at the bottom are the best case optimized raw speed from Orunmila's comparison.
      That is a pretty interesting chart, and if we simply compare the data for the ATmega4809 with both ASM and Arduino code, you see a 6x difference in performance. Let us look at the details, and we will summarize at the end.

      Nano 328P

      So here is the first victim: the venerable AVR ATmega328P running at 16MHz. The high pulse is 3.186uS while the low pulse is 3.544uS, making a pulse frequency of 148.2kHz. Clearly the high and low pulses are nearly the same, so the extra check to handle the serial ports is not very expensive, but the digitalWrite abstraction is much more expensive than I was anticipating.

      Nano Every

      The Nano Every uses the much newer ATmega4809 at 20MHz. The 4809 is a different variant of the AVR CPU with some additional optimizations, like set and clear registers for the ports. This should be much faster. The high pulse is 1.192uS and the low pulse is 1.504uS. Again the pulses are almost the same size, so the additional overhead outside of the loop function must be fairly small. Perhaps it is the same serial port test. Interestingly, one of the limiting factors of popular Arduino 3D printer controller projects such as GRBL is the pin toggle rate for driving the stepper motor pulses. A 4809-based controller could be 2x faster for the same stepper code.

      SAMD21 Mini M0

      Now we are stepping up to an ARM Cortex M0+ at 48MHz. I actually expect this to be nearly 2x the performance of the 4809, simply because the instructions required to set pins high and low should be essentially the same. Wow! I was definitely NOT expecting the timing to get worse than the 4809. The high pulse width is 1.478uS and the low pulse width is 1.916uS, making the frequency 294.6kHz. Obviously toggling pins is not a great measurement of CPU performance, but if you need fast pin toggling in the Arduino world, perhaps the SAMD21 is not your best choice.

      Teensy 3.2

      This is an NXP Cortex M4 CPU at 96MHz. This CPU has double the clock speed of the D21 and it is an M4 CPU, which has lots of great features, though those features may not help toggle pins quickly. Interesting. Clearly this device is very fast, as shown by the short high period of only 0.352uS. But this framework must be doing quite a lot of work behind the scenes to justify the 2.274uS of loop delay. Looking a little more closely, I see a number of board options for this hardware. First, I see that I can disable the USB. Surely the USB is serviced between calls to the loop function. I also see a number of compiler optimization options. If I turn off the USB and select the "fastest" optimizations, what is the result?

      Teensy 3.2, No USB and Fastest optimizations

      Making these two changes and re-running the same C code produces this result: That is much better. It is interesting to see the compiler change is about 3x faster for this test (measured on the high pulse), and the lack of USB saves about 1uS in the loop rate. This is not a definitive test of the optimizations, and probably the code grew a bit, but it is a stark reminder that optimization choices can make a big difference.

      ESP8266

      The ESP8266 is a 32-bit Tensilica CPU. This is still a load/store architecture, so its performance will largely match ARM, though undoubtedly there are cases where it will be a bit different. The 8266 runs at 80MHz, so I do expect the performance to be similar to the Teensy 3.2. The wildcard is that the 8266 framework is intended to support WiFi, so it is running FreeRTOS and the Arduino loop is just one thread in the system.
      I have no idea what that will do to our pin toggle, so it is time to measure. Interesting. It is actually quite slow, and clearly there is quite a bit of system housekeeping happening in the main loop. The high pulse is only 0.948uS, so that is very similar to the Nano Every, which runs at a quarter of the clock speed. The low pulse is simply slow. This does seem to be a good device for IoT, but not for pin toggling.

      ESP32

      The ESP32 is a dual core, very fast machine, but it does run the code out of a cache. This is because the code is stored in a serial memory. Of course, our test is quite short, so perhaps we do not need to fear the cache miss. Like the ESP8266, the Arduino framework is built upon a FreeRTOS task. But this has a second CPU and lots more clock speed, so let's look at the results: Interesting, the toggle rate is about 2x the Teensy while the clock speed is about 3x. I do like how the pulses are nearly symmetrical. A quick peek at the source code for the framework shows the Arduino running as a thread, but the thread updates the watchdog timer and the serial drivers on each pass through the loop.

      Conclusions

      It is very educational to make measurements instead of assumptions when evaluating an MCU for your next project. A specific CPU may have fantastic specifications and even demonstrations, but it is critical to include the complete development system and code framework in your evaluation. It is a big surprise to find that the 16MHz AVR 328P can actually toggle a pin faster than the ESP8266 when used in a basic Arduino project. The summary graph at the top of the article is duplicated here:

      In this graph, the Pin Toggling Speed is actually only 1/(the high period). This was done on purpose so only the pin toggle efficiency is being compared. In the test program, the low period is where the loop() function ends and other housekeeping work can take place. If we want to compare the CPU/code efficiency, we should really normalize the pin toggling frequency to a common clock speed. We can always compensate for inefficiency with more clock speed. This graph is produced by dividing the frequency by the clock speed, and now we can compare the relative efficiencies. That Cortex M4 and its framework in the Teensy 3.2 is quite impressive now. Clearly the ESP32 is pretty good, but it is using its clock speed for the win. The mega4809 has a reasonable framework, just not enough clock speed. All that aside, the ASM versions (or even a faster framework) could seriously improve all of these numbers. The poor ESP8266 is pretty dismal.

      So what is happening in the digitalWrite() function that is making this performance so slow? Put another way, what am I getting in return for the low performance? There are really 3 reasons for the performance:

      1. Portability. Each device needs work to adapt to the pin interface, so the price of portability is runtime efficiency.
      2. Framework support. There are many features in the framework that could be affected by writing to the pins, so the digitalWrite function must account for them.
      3. Application ignorance. The framework (and this function) cannot know how the system is constructed, so they must plan for the worst.

      Let us look at the digitalWrite for the AVR:

          void digitalWrite(uint8_t pin, uint8_t val)
          {
              uint8_t timer = digitalPinToTimer(pin);
              uint8_t bit = digitalPinToBitMask(pin);
              uint8_t port = digitalPinToPort(pin);
              volatile uint8_t *out;

              if (port == NOT_A_PIN) return;

              // If the pin supports PWM output, we need to turn it off
              // before doing a digital write.
              if (timer != NOT_ON_TIMER) turnOffPWM(timer);

              out = portOutputRegister(port);

              uint8_t oldSREG = SREG;
              cli();

              if (val == LOW) {
                  *out &= ~bit;
              } else {
                  *out |= bit;
              }

              SREG = oldSREG;
          }

      Note the first thing is a few lookup functions to determine the timer, port and bit described by the pin number. These lookups can be quite fast, but they do cost a few cycles. Next we ensure we have a valid pin and turn off any PWM that may be active on that pin. This is just safe programming and framework support. Next we figure out the output register for the update, turn off the interrupts (saving the interrupt state), set or clear the pin, and restore interrupts. If we knew we were not using PWM (like this application), we could omit the turnOffPWM function. If we knew all of our pins were valid, we could remove the NOT_A_PIN test. Unfortunately, all of these optimizations require knowledge of the application which the framework cannot have. Clearly we need new tools to describe embedded applications.

      This has been a fun bit of testing. I look forward to your comments and suggestions for future toe-to-toe challenges. Good luck, and go make some measurements.

      PS: I realize that this pin toggling example is simplistic at best. There are some fine Arduino libraries and peripherals that could easily toggle pins much faster than the results shown here. However, this is a simple apples-to-apples test of identical code in "identical" frameworks on different CPUs, so the comparisons are valid and useful. That said, if you have any suggestions, feel free to enlighten us in the comments.
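      As a footnote to the digitalWrite analysis above: when the application does know its pin at compile time, all of those lookups and checks can be bypassed with direct register access. A minimal AVR sketch, assuming an Uno/Nano-class board where Arduino pin 2 is PD2 on the ATmega328P:

          #include <avr/io.h>

          int main(void)
          {
              DDRD |= (1 << DDD2);            /* configure PD2 (Arduino pin 2) as an output */
              for (;;) {
                  PORTD |= (1 << PORTD2);     /* typically compiles to a single SBI instruction */
                  PORTD &= ~(1 << PORTD2);    /* typically compiles to a single CBI instruction */
              }
          }

      This trades away exactly the portability, framework support, and safety checks listed above in exchange for speed, which is the trade-off those three reasons describe.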
  16. We just started a new blog which aims to create some benchmarks for comparing the performance of small microcontrollers. This is something we have always needed to do ourselves as part of processor selection, so we decided to share our experiments and results here for everybody to use.
  18. Comparing raw pin toggling speed: AVR ATmega4808 vs PIC 16F15376

      Toggling a pin is such a basic thing. After all, we all start with that Blinky program when we bring up a new board. It is actually a pretty effective way to compare raw processing speed between any two microcontrollers. In this, our first toe-to-toe showdown, we will be comparing how fast these cores can toggle a pin using just a while loop and a classic XOR toggle.

      First, let's take a look at the 2 boards we used to compare these cores. These were selected solely because I had them both lying on my desk at the time. Since we are not doing anything more than toggling a pin, we just needed an 8-bit AVR core and an 8-bit PIC16F1 core on any device to compare. I do like these two development boards though, so here are the details if you want to repeat this experiment.

      In the blue corner we have the AVR, represented by the ATmega4808, sporting an AVR Mega core (AVRxt in the instruction manual), clocking at a maximum of 20MHz. We used the AVR-IOT WG Development Board, part number AC164160. This board can be obtained for $29 here: https://www.microchip.com/Developmenttools/ProductDetails/AC164160. Compiler: XC8 v2.05 (Free).

      In the red corner we have the PIC, represented by the 16F15376, sporting a PIC16F1 Enhanced Midrange core, clocking at a maximum of 32MHz. We used the MPLAB® Xpress PIC16F15376 Evaluation Board, part number DM164143. This board can be obtained for $12 here: https://www.microchip.com/developmenttools/ProductDetails/DM164143. Compiler: XC8 v2.05 (Free).

      Results

      This is what we measured. All the details about the methodology we used and an analysis of the code follow below, and attached you will find all the source code we used if you want to try this at home. The numbers in the graph are pin toggling frequency in kHz after it has been normalized to a 1MHz CPU clock speed.

      How we did it (and some details about the cores)

      Doing objective comparisons between 2 very different cores is always hard. We wanted to make sure that we do an objective comparison between the cores which you can use to make informed decisions on your project. In order to do this, we had to deal with the fact that the maximum clock speed of these devices is not the same, and also that the fundamental architecture of these two cores is very different.

      In principle, the AVR is a load-store architecture machine. This basically means that all ALU operations have to be performed between CPU registers, and the RAM is used to load from and store results to. The PIC, on the other hand, uses a register-memory architecture, which means in short that some ALU operations can be performed on RAM locations directly and that the machine has a much smaller set of registers. On the PIC all instructions are 1 word in length, which is 14 bits wide, while the data bus is 8 bits in size and all results will be a maximum of 8 bits in size. On the AVR, instructions can be 16-bit or 32-bit wide, which results in different execution times depending on the instruction.

      Both processors have a 1-stage pipeline, which means that the next instruction is fetched while the current one is being executed. This means branching causes an incorrect fetch and results in a penalty of one instruction cycle. One major difference is that the AVR, due to its load-store architecture, is capable of completing an instruction in as little as just one clock cycle. When instructions need to use the data bus, they can take up to 5 clock cycles to execute.
      Since the PIC has to transfer data over the bus, it takes multiple cycles to execute an instruction. In keeping with the RISC paradigm of a highly regular instruction pipeline flow, all instructions on the PIC take 4 clock cycles to execute. All of this just makes it tricky and technical to compare performance between these processors. What we decided to do is rather take typical tasks we need the CPU to perform, which occur regularly in real programs, and simply measure how fast each CPU can perform these tasks. This should allow you to work backwards from what your application will be doing during maximum throughput pressure on the CPU and figure out which CPU will perform the best for your specific problem.

      Round 1: Basic test

      For the first test, we used virtually the same code on both processors. Since both of these are supported by MCC, it was really easy to get going. We:

      1. Created a blank project for the target CPU
      2. Fired up MCC
      3. Adjusted the clock speed to the maximum possible
      4. Clicked in the pin manager to make a single pin on PORTC an output
      5. Hit generate code

      After this, all we added was the following simple while loop:

          PIC:
          while (1) {
              LATC ^= 0xFF;
          }

          AVR:
          while (1) {
              PORTC.OUT ^= 0xFF;
          }

      The resulting code produced by the free compilers (XC8 v2.05 in both cases) was as follows. Interestingly enough, both loops came out at the same size in program words (6 in total), including the loop jump. This is especially interesting as it shows how the execution of a same-length loop plays out on each of these processors. You will notice that without optimization there is some room for improvement, but since this is how people will evaluate the cores at first glance, we wanted to go with this.

          PIC:
          Address  Hex   Instruction
          07B3     30FF  MOVLW 0xFF
          07B4     00F0  MOVWF __pcstackCOMMON
          07B5     0870  MOVF __pcstackCOMMON, W
          07B6     0140  MOVLB 0x0
          07B7     069A  XORWF LATC, F
          07B8     2FB3  GOTO 0x7B3

          AVR:
          Address  Hex   Instruction
          017D     9180  LDS R24, 0x00
          017E     0444
          017F     9580  COM R24
          0180     9380  STS 0x00, R24
          0181     0444
          0182     CFFA  RJMP 0x17D

      We used my Saleae logic analyzer to capture the signal and measure the timing on both devices. Since the Saleae is thresholding the digital signal and the rise and fall times are not always identical, you will notice a little skew in the measurements. We did run everything 512x slower to confirm that this was entirely measurement error, so it is correct to round all times to multiples of the CPU clock in all cases here.

      [Logic analyzer captures for the PIC and the AVR omitted.]

      Analysis

      For the PIC: The clock speed was 32MHz. We know that the PIC takes 4 clock cycles to execute one instruction, which gives us an expected instruction rate of one instruction every 125ns. Rounding for measurement errors, we see that the PIC has equal low and high times of 875ns. That is 7 instruction cycles for each loop iteration. To verify that this makes sense, we can look at the ASM. We see 6 instructions, the last of which is a GOTO, which we know takes 2 instruction cycles to execute. Using that fact we can verify that the loop repeats every 7 instruction cycles as expected (7 x 125ns = 875ns).

      For the AVR: The clock speed was 20MHz.
      We know that the AVR takes 1 clock cycle per instruction, which gives us an expected instruction rate of one instruction every 50ns. Rounding for measurement errors, we see that the AVR has equal low and high times of 400ns. That is 8 instruction cycles for each loop iteration. To verify that this makes sense, we again look at the ASM. We see 4 instructions, the last of which is an RJMP, which we know takes 2 instruction cycles to execute. We also see one LDS, which takes 3 cycles because it is accessing SRAM, one STS instruction which takes 2 cycles, and a complement instruction which takes 1 more. Using those facts we can verify that the loop should repeat every 8 instruction cycles as expected (8 x 50ns = 400ns).

      Comparison

      Since the 2 processors are not running at the same clock speed, we need to do some math to get a fair comparison. We think 2 particular approaches are reasonable:

      1. Compare the raw fastest speed the CPU can do; this gives a fair benchmark where CPUs with higher clock speeds get an advantage.
      2. Normalize the results to a common clock speed; this gives us a fair comparison of capability at the same clock speed.

      In the numbers below we used both methods for comparison.

      The numbers:

          Metric              AVR                    PIC                         Notes
          Clock Speed         20MHz                  32MHz
          Loop Speed          400ns                  875ns
          Maximum Speed       2.5MHz                 1.142MHz                    Loop speed as a toggle frequency
          Normalized Speed    125kHz                 35.7kHz                     Loop frequency normalized to a 1MHz CPU clock
          ASM Instructions    4                      6
          Loop Code Size      12 bytes               10.5 bytes                  Due to the nuances here we compared this 3 ways:
                              (4 instructions,       (6 instructions,            bytes, instructions and words
                              6 words)               6 words)
          Total Code Size     786 bytes              101 words = 176.75 bytes

      Round 2: Expert Optimized test

      For the second round, we tried to hand-optimize the code to squeeze out the best possible performance from each processor. After all, we do not want to just compare how well the compilers are optimizing; we want to see the absolute best the raw CPUs can achieve. You will notice that although optimization doubled our performance, it made little difference to the relative performance between the two processors.

      For the PIC, we wrote to LATC to ensure we are in the right bank and pre-set the W register, which means the loop reduces to just a XORWF and a GOTO. For the AVR, we changed the code to use the toggle register instead of doing an XOR of the OUT register for the port. The optimized code looked as follows:

          PIC:
          LATC = 0xFF;
          asm ("MOVLW 0xFF");
          while (1) {
              asm ("XORWF LATC, F");
          }

          AVR:
          asm ("LDI R30,0x40");
          asm ("LDI R31,0x04");
          asm ("SER R24");
          while (1) {
              asm ("STD Z+7,R24");
          }

      The resulting ASM code after these changes looked as follows. Note we did not include the instructions outside of the loop here, as we are really just looking at the loop execution.
          PIC:
          Address  Hex   Instruction
          07C1     069A  XORWF LATC, F
          07C2     2FC1  GOTO 0x7C1

          AVR:
          Address  Hex   Instruction
          0180     8387  STD Z+7,R24
          0181     CFFE  RJMP 0x180

      Here are the actual measurements: [logic analyzer captures for the PIC and the AVR omitted]

      Analysis

      For the PIC, we do not see how we could improve on this, as the loop has to contain a GOTO, which takes 2 cycles, and 1 instruction is the least amount of work we could possibly do in the loop, so we are pretty confident that this is the best we can do; when measuring, we see 3 instruction cycles, which we think is the limit here.

      Note: N9WXU did suggest that we could fill all the memory with XOR instructions and let it loop around forever, and in doing so save the GOTO, but we would still have to set W to 0xFF every second instruction to have consistent timing, so this would still be 2 instructions per "loop", although it would use all the flash and execute in 250ns. Novel as this idea was, since it means you can do nothing else, we dismissed it as not representative.

      For the AVR, we think we are also at the limits here. The toggle register lets us toggle the pin in 1 clock cycle, which cannot be beaten, and the RJMP unavoidably adds 2 more. We measure 3 cycles for this.

          Metric              AVR         PIC                   Notes
          Clock Speed         20MHz       32MHz
          Loop Speed          150ns       375ns
          Maximum Speed       6.667MHz    2.667MHz              Loop speed as a toggle frequency
          Normalized Speed    333.3kHz    83.3kHz               Loop frequency normalized to a 1MHz CPU clock
          ASM Instructions    2           2
          Loop Code Size      4 bytes     2 words = 3.5 bytes

      At this point, we can do a raw comparison of absolute toggle frequency performance after the hand optimization. Comparing this way gives the PIC the advantage of running at 32MHz while the AVR is limited to 20MHz. Interestingly, the PIC gains a little as expected, but the overall picture does not change much.

      The source code can be downloaded here:

      - PIC MPLAB-X project file: MicroforumToggleTestPic16f1.zip
      - AVR MPLAB-X project file: MicroforumToggleTest4808.zip

      What next?

      For our next installment, we have a number of options. We could add more cores/processors to this test of ours, or we could take a different task and cycle through the candidates on that. We could also vary the tools by using different compilers and see how they stack up against each other and across the architectures. Since our benchmarks will all be based on real-world tasks, it should not matter HOW the CPU is performing the task or HOW we created the code; the comparison will simply be how well the job gets done. Please do post any ideas or requests in the comments and we will see if we can either improve this one or oblige with another toe-to-toe comparison.

      Updates: This post was updated to use the 1-cycle STD instruction instead of the 2-cycle STS instruction for the hand-optimized AVR version in round 2.
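      As a side note, the C-level equivalent of the AVR toggle-register trick (what the STD Z+7 above is doing; OUTTGL sits at offset 7 from the port base) can be written without inline assembly. A sketch assuming the ATmega4808 device headers:

          #include <avr/io.h>

          int main(void)
          {
              PORTC.DIR = 0xFF;          /* all PORTC pins as outputs */
              while (1) {
                  PORTC.OUTTGL = 0xFF;   /* writing 1s to OUTTGL flips the pins in a single write */
              }
          }

      The compiler will not necessarily produce the 3-cycle loop shown above from this, which is exactly why the article drops to hand-written assembly for the expert round.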
  19. I think with these drivers it always depends on what you need and whether this is a good match for your requirements. If you say that this seems very complicated, it sounds like this driver is probably doing a lot more than you need, so my first question would be "what do you expect instead?". I know N9WXU has covered much of this already and also said it depends on whether you require a ring buffer. What I can offer you is the thinking behind this implementation. For this driver, the requirements were:

      - We have to be able to receive data on interrupt; this ensures we never miss any bytes due to data overrun (the next byte being received before the prior byte is read out).
      - The application does NOT want to process the data inside of the ISR, so we need to store the data temporarily.
      - We need to be able to store multiple bytes, which means we may have 7 or 8 interrupts before we process the data. This means the application can safely take much longer to process the data before we lose anything. If your serial port is running at 115200 baud, one character arrives every 87us. If we have a buffer of 16 bytes, we only need to service the serial port once every 1.4ms to be sure we never lose any data. This extra time can be very important to us, and we can make the buffer bigger to get even more time.

      If this matches your situation, you need to do the following. In the ISR you need to:

      1. Store the data in a temp buffer:

             // use this default receive interrupt handler code
             eusart2RxBuffer[eusart2RxHead++] = RCREG2;

      2. Handle wrapping of the buffer when you reach the end of the array:

             if(sizeof(eusart2RxBuffer) <= eusart2RxHead)
             {
                 eusart2RxHead = 0;
             }

      3. Keep track of how many bytes we have ready for processing:

             eusart2RxCount++;

      When the app processes the data, you need to:

      1. If you try to read 1 byte before the data is available, wait until there is something to read:

             uint8_t readValue = 0;
             while(0 == eusart2RxCount)
             {
             }

      2. Remove one byte from the ring buffer, again handling the wrapping at the end of the array:

             readValue = eusart2RxBuffer[eusart2RxTail++];
             if(sizeof(eusart2RxBuffer) <= eusart2RxTail)
             {
                 eusart2RxTail = 0;
             }

      3. Reduce the number of bytes in the buffer by one (we need to disable the ISR or a collision can happen here), and then return the retrieved byte:

             PIE3bits.RC2IE = 0;
             eusart2RxCount--;
             PIE3bits.RC2IE = 1;
             return readValue;

      I think this is about as simple as you can ever do this without reducing the functionality. Of course, if you just wanted to read a single byte without interrupts, all you need is to return RCREG, which by the way is what you will get if you uncheck the interrupt checkbox. Also, if you do not like all of the ring buffer stuff, you can enable interrupts and replace the ISR with the same thing you get when you do not use a buffer; then it is simpler, but more likely to lose data.

      PS. I did not describe the eusart2RxLastError line, as I cannot see all the code for that here and I cannot remember the details of that one line. What it is doing is updating a global (eusart2RxLastError) to indicate whether there was any error. From the looks of this code snippet, that part of the code may have some bugs in it, as the last error is not updated after the byte is read out, but I may just be missing the start of the ISR ...
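      For readers who want all the pieces above in one place, here is a compiler-agnostic sketch. The names are illustrative, not the MCC-generated ones, and the ENABLE/DISABLE_RX_INTERRUPT macros stand in for whatever your target uses (PIE3bits.RC2IE = 1/0 on the PIC above):

          #include <stdint.h>

          #define RX_BUFFER_SIZE 16u

          /* stand-ins for the target's receive interrupt enable bit */
          #define DISABLE_RX_INTERRUPT()  do { } while (0)
          #define ENABLE_RX_INTERRUPT()   do { } while (0)

          static volatile uint8_t rxBuffer[RX_BUFFER_SIZE];
          static volatile uint8_t rxHead, rxTail, rxCount;

          /* called from the UART receive interrupt with the freshly received byte */
          void rx_isr(uint8_t byteFromUart)
          {
              rxBuffer[rxHead++] = byteFromUart;          /* 1. store */
              if (rxHead >= RX_BUFFER_SIZE) rxHead = 0;   /* 2. wrap */
              rxCount++;                                  /* 3. count */
          }

          /* called from the application; blocks until a byte is available */
          uint8_t rx_read(void)
          {
              uint8_t value;
              while (rxCount == 0) { }                    /* 1. wait for the ISR to store something */
              value = rxBuffer[rxTail++];                 /* 2. remove one byte */
              if (rxTail >= RX_BUFFER_SIZE) rxTail = 0;   /*    and wrap */
              DISABLE_RX_INTERRUPT();                     /* 3. rxCount is shared with the ISR */
              rxCount--;
              ENABLE_RX_INTERRUPT();
              return value;
          }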
  20. The two functions you see are both halves of a ring buffer driver. The first function unloads the UART receive buffer and puts the bytes into the array eusart2RxBuffer. This array is indexed by eusart2RxHead. The head is always incremented and it rolls over when it reaches the maximum value. This receiving function creates a basic ring buffer insert that sacrifices error handling for speed. There are four possible errors that can occur:

      1. UART framing error. If a bad UART signal arrives, the UART will abort reception with a framing error. It can be important to know if framing errors have occurred, and it is critical that the framing error bit be cleared if it gets set.
      2. The UART receiver is overrun. This happens if a third byte begins before any bytes are removed from the UART. With an ISR unloading the receiver this is generally not a real threat, but if the baud rate is very high, and/or interrupts are disabled for too long, it can be a problem.
      3. The ring buffer head overwrites the tail. The oldest bytes will be lost, but worse, the tail is not "pushed" ahead, so the next read will return the newest data and then the oldest data. That can be a strange bug to sort out. It is better to add a check for head == tail and then increment the tail in that instance.
      4. This error is perhaps an extension of #3. The eusart2RxCount variable keeps track of the bytes in the buffer. This makes the while loop at the beginning of the read function much more efficient (probably 2 instructions on a PIC16). However, if there is a head-tail collision, the count variable will be too high, which will later cause an undetected underrun in the read function.

      The second function is meant to be called from your application to retrieve the data captured by the interrupt service routine. This function will block until data is available. If you do not want to block, there are other functions that indicate the number of bytes available. The read function does have a number of lines of code, but it is a very efficient ring buffer implementation which extends the UART buffer size and helps keep UART receive performance high.

      That said, not all UART applications require a ring buffer. If you turn off the UART interrupts, you should get simple polling code that blocks for a character but does not add any buffers. The application interface should be identical (read); there will simply be no interrupt or buffers supporting the read function.
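      One way to implement the "push the tail ahead" suggestion from point 3 (reusing the illustrative names from the sketch earlier in this stream) would be to guard the insert inside the receive ISR:

          /* inside the receive ISR, before storing the new byte */
          if (rxCount == RX_BUFFER_SIZE) {            /* full: the head is about to overwrite the tail */
              rxTail++;                               /* push the tail ahead so reads stay oldest-first */
              if (rxTail >= RX_BUFFER_SIZE) rxTail = 0;
              rxCount--;                              /* the overwritten byte is lost, but the count stays honest */
          }
          rxBuffer[rxHead++] = byteFromUart;
          if (rxHead >= RX_BUFFER_SIZE) rxHead = 0;
          rxCount++;

      Because the count is decremented before being re-incremented, this also prevents the too-high count described in point 4, at the cost of a few extra cycles in every interrupt.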
  21. And in the eusart.c file there is another function, EUSART2_RxDataHandler. Can you explain how this works?

          void EUSART2_RxDataHandler(void){
              // use this default receive interrupt handler code
              eusart2RxBuffer[eusart2RxHead++] = RCREG2;
              if(sizeof(eusart2RxBuffer) <= eusart2RxHead)
              {
                  eusart2RxHead = 0;
              }
              eusart2RxCount++;
          }

      And the EUSART2_Read function seems very complicated:

          uint8_t EUSART2_Read(void)
          {
              uint8_t readValue = 0;

              while(0 == eusart2RxCount)
              {
              }

              eusart2RxLastError = eusart2RxStatusBuffer[eusart2RxTail];
              readValue = eusart2RxBuffer[eusart2RxTail++];
              if(sizeof(eusart2RxBuffer) <= eusart2RxTail)
              {
                  eusart2RxTail = 0;
              }
              PIE3bits.RC2IE = 0;
              eusart2RxCount--;
              PIE3bits.RC2IE = 1;
              return readValue;
          }
  22. I agree with your complaints. You can mitigate the probability of making errors, but you can't get it to zero.

      1. True, you have to continue typing at least up to SERCOM_[register]_ to get more meaningful suggestions. The good thing is, there is a pattern in all the header files. If in doubt, follow the definition into the header files; they are relatively small.
      2. It will compile, just as any magic number will compile. But you will at least have a chance to find a suspicious-looking symbol by carefully reading through the statement.
      3. This is done much better with the new DFPs. Have a close look in the XC32 compiler directory. You'll find an /include directory AND an /include_mcc! The first directory is the legacy code. By default, the new definitions will be used.
      4. Similar to 2.
      5. That is a great point; it is indeed not MISRA compliant. I'd guess that an exception is made then, but it would be interesting to know how this is handled in larger companies.

      Don't mix up Atmel START (ASF libraries) with DFPs. I personally don't like to use START; too much layer-over-layer. If you use it, you have to fully rely on the code underneath all the layers. And often enough, there is a bug hidden in some hri function. At that point, ASF doesn't save time anymore.
  23. Excellent points, and I think we are almost completely in agreement. I have only a few complaints with these helper macros, and two of them are relatively minor. Consider the following snapshot from Atmel Studio in a SAMD21 project:

      1. The <CTRL><SPACE> pattern works great when you know the keyword. If you simply start with SERCOM, you get a number of matches that are not UART related.
      2. ALL the choices will compile. See the BAUD value placed in the CTRLA register...
      3. Some of these choices are used like macros... and others are not (CMODE).
      4. Placing an invalid value in these macros is completely reasonable and will compile.
      5. MISRA 19.7 disallows function-like macros (though this is not really an issue because it does not apply in this situation).

      So, these constructs are handy because they help prevent a certain sort of bookkeeping error related to sorting out all of these offsets. But the constructs allow range errors and semantic errors.

      Since we are talking about the SERCOM and I am using the SAMD21 in my example, here is what START produces:

          hri_sercomusart_write_CTRLA_reg(
              SERCOM0,
              1 << SERCOM_USART_CTRLA_DORD_Pos       /* Data Order: enabled */
              | 0 << SERCOM_USART_CTRLA_CMODE_Pos    /* Communication Mode: disabled */
              | 0 << SERCOM_USART_CTRLA_FORM_Pos     /* Frame Format: 0 */
              | 0 << SERCOM_USART_CTRLA_SAMPA_Pos    /* Sample Adjustment: 0 */
              | 0 << SERCOM_USART_CTRLA_SAMPR_Pos    /* Sample Rate: 0 */
              | 0 << SERCOM_USART_CTRLA_IBON_Pos     /* Immediate Buffer Overflow Notification: disabled */
              | 0 << SERCOM_USART_CTRLA_RUNSTDBY_Pos /* Run In Standby: disabled */
              | 1 << SERCOM_USART_CTRLA_MODE_Pos);   /* Operating Mode: enabled */

      This is completely different from the other examples provided. This style is defined in hri_sercom_d21.h, and a word to the wise: DO NOT BROWSE THIS INSIDE OF ATMEL START. This file must be huge, as the page completely freezes while it populates the source viewer.

      So I do like the construct, but often it does not help me very much because I must still go through the entire register and decide my values. When I string these values together, any mistakes made must be sorted out all at once. All that aside, I don't really care much about the HW initialization. I want it to be short, to the point, and perfectly clear about what is placed in each register. In a typical project, only 1% of the code is some sort of HW work, done as you develop the application-specific HW interfaces and bring up your board. Once these are tested, you are not likely to spend any more time on them. In a 9 month project I expect to spend 2 weeks in the HW functions, so if a magic value is clear and to the point, use it. If you want to construct your values with logical operations, go for it.
  24. Well, yes. My code example didn't show the WHY, because the point was to discuss the avoidance of magic numbers as an end in itself. The two are non-exclusive:

          void UART_init(void)
          {
              // Initialize SERCOM link
              // for details, see datasheet (DS60001507E) p. 944, 34.8.1
              // use DORD to change bit order, choose pins with RXPO/TXPO
              SERCOM4_REGS->USART_INT.SERCOM_CTRLA =
                    SERCOM_USART_INT_CTRLA_MODE(1)
                  | SERCOM_USART_INT_CTRLA_RXPO(3)
                  | SERCOM_USART_INT_CTRLA_TXPO(1)
                  | SERCOM_USART_INT_CTRLA_DORD(1);
          }

      Since this is mostly preprocessor stuff, no speed or code size is sacrificed. DFPs help me write human-readable register initializations without making a calculation error in my head. If someone else wants to change this later (switch RXPO, for example), he doesn't have to decipher the actual value of RXPO from my magic value (first source of error) and doesn't have to re-cipher it into a new magic value (second source). This "someone else" could also be me, after having an argument with my project leader. And finally, I don't need to maintain clarifying comments. I could read the statement aloud and it would still be comprehensible what I'm trying to accomplish.

      The IDE also helps a lot with getting the right symbols. The XC32 team has done a nice job of having a logic to the names, so if I start typing "SERCOM_UART_" and hit <CTRL><SPACE>, I'll get a selection which fits, so I don't need to remember the exact name. If I'm unsure whether this is the right value, I just hold <CTRL> and left-click on the symbol. MPLAB X then jumps right into the include header, letting me verify my decision:

          #define SERCOM_USART_INT_CTRLA_RXPO_Pos    _U_(20)                                          /**< (SERCOM_USART_INT_CTRLA) Receive Data Pinout Position */
          #define SERCOM_USART_INT_CTRLA_RXPO_Msk    (_U_(0x3) << SERCOM_USART_INT_CTRLA_RXPO_Pos)    /**< (SERCOM_USART_INT_CTRLA) Receive Data Pinout Mask */
          #define SERCOM_USART_INT_CTRLA_RXPO(value) (SERCOM_USART_INT_CTRLA_RXPO_Msk & ((value) << SERCOM_USART_INT_CTRLA_RXPO_Pos))
  25. But in all of your examples you are not telling me why you are doing that bit of work. I cannot possibly determine if there is a bug if I don't know why you are configuring the SERCOM with that particular value. How about simply saying:

          void configureSerialForMyDataLink(void)
          {
              // datalink specifications found in specification 4.3.2
              // using SERCOM0 as follows:
              // - Alternate Pin x,y,z
              // - 9600 baud
              // - half duplex
              // - SamD21 datasheet page 26 for specifics
              SERCOM0 = <blah blah blah>;
          }

      Now you know why. You have a function that has a clear purpose. And if the link is invalid, you can see the intent. The specifics of the bits are in the datasheet and clearly referenced. No magic here. As for the special access mode for performance...

          inline void SERCOM0_WRITE(uint32_t controlOffset, uint32_t value)
          {
              // Accessing the SERCOM via DFP offsets for high performance
              (* (uint32_t*) (0x42000400 + controlOffset)) = value;
          }

      Now a future engineer has a handy helper and the details are nicely removed. And an interested engineer can debug it because the intent is clear. Obviously you need to be a DFP expert (or have the datasheet) to understand/edit it. But no magic. However, the application should NEVER use this helper. It should be buried in the HAL. The first function is much more clear for the HAL because it conveys application-level intent; i.e. the application will rarely care about the SERCOM and will always care about its DataLink. If I port the code to something without a SERCOM, the application will still need a DataLink, so this function will simply be refilled with something suitable for the other CPU. The application remains unchanged.
  26. I wouldn't duplicate the datasheet either. But isn't describing the calculation already the same thing? To clarify, both suggestions in an example.

      Describe the calculation:

          // comparator configuration
          // bit 6 = C1OUT, bit 4 = C1INV => 0101
          // bit 3 = CIS, switch to RA0/AN0
          // CM<2:0> = 101, one independent comparator
          CMCON0 = 0x5d;

      Describe the register:

          // comparator configuration
          // bit 7 = C2OUT
          // bit 6 = C1OUT
          // bit 5 = C2INV
          // bit 4 = C1INV
          // bit 3 = CIS
          // bit <2:0> = comparator mode
          CMCON0 = 0x5d;

      The latter has the advantage of not getting out of sync, but it is tedious and error-prone work. And if you do the work, why not describe it with #define, put it into a header file, and re-use it for other projects? The datasheet is just some not-so-formal syntax describing something we need to compile in our brains to C source code each time we want to do something with peripheral registers. On 8-bit machines everything can be easily calculated manually, but by Murphy's law you will make an error. The manufacturer, as the single point of truth (and errata), should supply every developer with usable DFPs which already contain tested and verified definitions. In fact, this is the formal version of the datasheet.

      32-bit world now; what does this mean? Do I need to write comments (flavor A or B from above) for every register now? The SERCOM module has more than 10 registers; the EVSYS summary table is several pages long.

          SERCOM0->USART.CTRLA.reg = 0x40310006;

      Wait, I've already used the DFPs to access the register:

          // accessing SERCOM, base address 0x42000400, offset for CTRLA is 0x00
          (* (uint32_t*) (0x42000400 + 0x00)) = 0x40310006;

      😉 I don't see the point why I, as a customer, should bother with these numbers for the sake of clarity. In fact, the left side of the statement already is a symbol for the access, and nobody complains or starts to comment about register and offset calculations there, so why would we for the right side?
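      To make the "#define it and re-use it" suggestion concrete for the 8-bit comparator example above, a hypothetical project header might look like this (bit positions taken from the register description in the comments; the names are made up, not from a DFP):

          /* cmcon0_bits.h - hypothetical re-usable definitions for CMCON0 */
          #define CMCON0_C2OUT     (1u << 7)
          #define CMCON0_C1OUT     (1u << 6)
          #define CMCON0_C2INV     (1u << 5)
          #define CMCON0_C1INV     (1u << 4)
          #define CMCON0_CIS       (1u << 3)
          #define CMCON0_CM(mode)  ((mode) & 0x07u)   /* CM<2:0>: comparator mode */

          /* same value as the magic 0x5d, but readable aloud */
          CMCON0 = CMCON0_C1OUT | CMCON0_C1INV | CMCON0_CIS | CMCON0_CM(5);

      The OR of those symbols evaluates to 0x40 | 0x10 | 0x08 | 0x05 = 0x5d, so nothing changes in the generated code; only the reader's job gets easier.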