A question about structures, strings and pointers


Orunmila

In the comments of our blog on structures in C, a member asked me a specific question about something they observed. Their code is a beautiful example of a number of problems we often see, so we thought it a good idea to write an entry just to discuss it, as there really is a lot going on here.

We will cover the following:

  1. What allocation and freeing of memory on the stack means and the lifetime of objects
  2. In which direction the stack usually grows (note - the C standard does not contain the word "stack" so this is compiler-specific)
  3. Another look at deep vs. shallow copies of C strings inside structures

In order to keep this all generic, I am going to be using the LLVM compiler on my Mac for all my examples. The examples are all standard C and you can play with the code on your favorite compiler, but since the details of memory allocation are not mandated by the C standard, your results may not look exactly like mine. I will, for example, show how my results change when I modify the optimization level.

The Question

The OP @zakaster was asking this:

Quote

if i call function `eg_test_copy2`, i noticed that the memory allocated for `new_name` never gets cleared, 

the output will be

input new name for person:alex
address of new_name = 000000000061FD90
p name = alex
address of dummy = 000000000061FDE0
p name = alex

and the address for `dummy` actually starts after the address of new_name + 4 bytes x 20, and i can still access the old name, is this weird.

for example if I input another name,

input new name for person:hello
address of new_name = 000000000061FD90
p name = hello
address of dummy = 000000000061FDE0
p name = hello

Here is the code snippet they provided:

struct person {
    char* name;
};

void get_person(struct person* p) {
    char new_name[20];  // on stack, gets freed when function returned
    printf("input new name for person:");
    scanf("%s", &new_name);
    p->name = new_name;
    printf("address of new_name = %p\n", &new_name[0]);
}

void eg_test_copy2(void) {
    struct person p = {"alex"};
    get_person(&p);
    printf("p name = %s\n", p.name);

    char dummy[20] = { 0 };
    printf("address of dummy = %p\n", &dummy[0]);
    printf("p name = %s\n", p.name);
}

Variable Allocation

When you declare a variable, the compiler only reserves a memory location for it. This does not clear the memory unless the variable has static storage duration: the standard states that only objects with static storage duration (in simple terms, global variables and locals declared static) are initialized to 0 when no initializer is given. If you want any other variable to be initialized, you have to supply an initializer.

What actually happens before your main function starts running is that something generally referred to as "c-init" runs. This is a bit of code that does the work required by the C standard before your code runs. One of the things it does is clear, usually with a loop, the block of memory that holds the variables with static storage duration that have no initializer. It may also set up interrupt vectors and other machine registers and, of course, copy the initial values of global variables that do have initializers into the locations reserved for those variables.

When a variable goes "out of scope" its memory is no longer reserved. This simply means that the memory is free for others to use; it does not mean that the memory is cleared. This is very important to note. It often leads developers to test code that holds a pointer to memory which is no longer reserved, and the code seems to work fine until the new owner of that memory modifies it. Then the code inexplicably breaks! No, it was actually broken all along, and you just got lucky that nobody else had used the memory at the time you were accessing it!

The classic way this manifests can be seen in our first test (test1) below.

#include <stdio.h>

char* get_name() {
    char new_name[20];  // on stack, gets freed when function returned
    printf("Enter Name:");
    scanf("%s", new_name);
    return new_name;
}

int main(void)
{
    char* theName;
    theName = get_name();
    printf("\r\nThe name was : %s\r\n", theName);

    return 0;
}

I compile and run this and get :

> test1
Enter Name:Orunmila

The name was : Orunmila
 

Note: I was using "gcc test1.c -O3" to compile that. When I use the default optimization or -O1 it prints junk instead. When you do something that the C standard leaves undefined, the behavior is not guaranteed to be the same on all machines, or even across optimization levels.

So I can easily be fooled into thinking this is working just fine, but it is actually very broken! On LLVM I actually get a compiler warning when I compile that as follows:

test1.c:7:12: warning: address of stack memory associated with local variable 'new_name' returned
      [-Wreturn-stack-address]
    return new_name;
           ^~~~~~~~
1 warning generated.

Did I mention that I do love LLVM?!

We can quickly see how this breaks down if we call the function more than once in a row like this (test2):

#include <stdio.h>

char* get_name() {
    char new_name[20];  // on stack, gets freed when function returned
    printf("Enter Name:");
    scanf("%s", new_name);
    return new_name;
}

int main(void)
{
    char* theName;
    char* theSecondName;
  
    theName = get_name();
    theSecondName = get_name();
  
    printf("\r\nThe first name was  : %s\r\n", theName);
    printf("The second name was : %s\r\n", theSecondName);

    return 0;
}

Now we get the following, obviously wrong, behavior:

Enter Name:N9WXU
Enter Name:Orunmila
 
The first name was : Orunmila
The second name was : Orunmila

This happens because the declarations of theName and theSecondName only reserve enough memory to store a pointer. When the function returns, it does not return the string containing the name; it only returns the address of the memory location that held the string while get_name() was running.

At the time when I print the name, the memory is no longer reserved, but nobody else has used it since I called the function (in other words, I did not perform any other operation that makes use of the stack). So the code still prints the name, but both name pointers point to the same location in memory (which is actually just a coincidence; the compiler would have been within its rights to place the two buffers in different locations).

If you call a function that has local variables between fetching the names and printing them, the names will be overwritten by those variables and the program will print something that looks like gibberish instead of the names I typed. We will leave it to the reader to play with this and see how and why it breaks. I would encourage you to also add the following to the end of main. These print statements clearly show where the variables are located and why they print the same thing: you will notice that the values of both pointers are the same!

// %p expects a void pointer, so each argument is cast explicitly
printf("Location of theName       : %p\r\n", (void *)&theName);       // This prints the location of the first pointer
printf("Location of theSecondName : %p\r\n", (void *)&theSecondName); // This prints the location of the second pointer

printf("Value of theName       : %p\r\n", (void *)theName);       // This prints the value of the first pointer
printf("Value of theSecondName : %p\r\n", (void *)theSecondName); // This prints the value of the second pointer

This all should answer the question asked, which was "I can still access the old name, is this weird?". The answer is no, this is not weird at all, but it is undefined behavior. If you called some other functions in between, you would see the memory that used to hold the old name being overwritten in weird and wonderful ways, as expected.

How does the stack grow?

Now that we have printed out some pointers this brings us to the next question. Our OP noticed that "the address for `dummy` actually starts after the address of new_name + 4 bytes x 20".

We need to be careful here. The C standard requires pointers to be byte-addressable, which means that the address being 20x4 away makes no sense by itself, and in this case it is a pure coincidence. A couple of things should be noted here:

  1. The stack usually grows downwards in memory
  2. The size of a char[20] buffer will always be 20 and never 4x20 (specified in section 6.5.3.4 of the C99 standard)
  3. In the example question the address of new_name was 0x61FD90, which is actually smaller than 0x61FDE0. In other words, it was placed on the stack AFTER dummy.

Here is a diagram which shows a typical layout that a C compiler may choose to use.

image.png

The reason there was a gap of 80 bytes between the two addresses was simply the way the compiler decided to place the variables on the stack. It was probably creating some extra space on the stack for passing parameters around, and this just happened to be exactly 60 bytes, which together with the 20-byte buffer resulted in a gap of 80.

The C standard only defines the scope and lifetime of the variables; it does not mandate how the compiler must place them in memory. The placement can even vary for the same compiler when you add more code, as the linker may move things around, and it will probably change when you change the optimization settings for the compiler.

I did some tests with LLVM. The addresses in the example differ significantly when I use optimization level -O1, but when I set it to -O3 the difference between the two pointers is exactly 20 bytes for the example code.

Getting back to Structures and Strings

Looking at the intent of the OP's code we can now get back to how structures and strings work in C.

With our interface like this

struct person {
    char* name;
};

void get_person(struct person* p);

What we have is a struct which, very importantly, does NOT contain a string; it only contains the address of a string. The person struct reserves only the RAM required to store one address (typically 4 bytes on a 32-bit target and 8 bytes on a 64-bit one), which is the location where a string exists in memory. The address of "name" is also exactly the same as the address of the person struct you are passing in, so if our OP had tested the following this would have been clear:

struct person p = {"alex"};
    
printf("Address of p      = %p\n", (void *)&p);       // %p expects a void pointer
printf("Address of p.name = %p\n", (void *)&p.name);

These two addresses must be the same because name is the first member of the struct, and C does not allow padding before the first member!

When we want to work with a structure that contains the name of a person we have 2 choices, and they both have pros and cons.

  1. Let the struct contain a pointer and use malloc to allocate memory for the string on the heap. (not recommended for embedded projects!)
  2. Let the struct contain an array of chars that can contain the name.

For option 1 the declaration is fine, but the get_person function would have to look as follows:

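void get_person(struct person* p) {
    char* new_name = malloc(20);     // on heap, needs <stdlib.h>
    if (new_name == NULL) {
        return;                      // out of memory, leave p unchanged
    }
    printf("input new name for person:");
    scanf("%19s", new_name);         // read at most 19 chars plus the '\0'
    p->name = new_name;              // the caller must free(p->name) later
    printf("address of new_name = %p\n", (void*)new_name);
}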

Of course, now you have to check and handle the case where we run out of memory and malloc returns NULL. We also have to be cognisant of heap fragmentation and, most importantly, we now have to be very careful to ensure that the memory gets freed or we will have a memory leak!

For option 2 the structure and the function have to change to something like the following:

struct person {
    char  name[20];
};

void get_person(struct person* p) {
    printf("input new name for person:");
    scanf("%19s", p->name);          // read at most 19 chars plus the '\0'
    printf("address of p->name = %p\n", (void*)p->name);
}

Of course, now we use 20 bytes of memory regardless of how long the name is, but on the upside we do not have to worry about freeing the memory: when the instance goes out of scope, the compiler will take care of that for us. We can also assign one person struct to another, which will actually copy the entire string, and we still have the option of passing it by reference by using the address of the object!

Conclusion

Be careful when using C strings in structures, as there are a lot of ways these can get you into trouble. Memory leaks and shallow copies, where you make a copy of the pointer but not the string, are very likely to catch you sooner rather than later.

 

 

 

 

 
