
About Translation Units and how C code is compiled


Orunmila


Beginners always have a hard time understanding when to use a header file and what goes in the .c and what goes in the .h.  I think the root cause of the confusion is really a lack of information about how the compilation process of a C program actually works.

C education tends to focus on the semantics of loops and pointers way too quickly and completely skips over linkage and how C code is compiled, so today I want to shine a light on this a bit.

C code is compiled as a number of "translation units" (I like to think of them as modules) which are in the end linked together to form a complete program. In casual usage a "translation unit" is often referred to as a "Compilation Unit".

The process of compilation is quite nicely described in section 4.3 of the XC8 User's Guide, so we will look at some of the diagrams from that section shortly.

Compilation Steps

Before we get into a bit more detail I want to step back slightly and quote the C99 standard on the basic idea here.

Section 5.1.1.1 of the C99 standard refers to this process as follows:

Quote

A C program need not all be translated at the same time. The text of the program is kept in units called source files, (or preprocessing files) in this International Standard. A source file together with all the headers and source files included via the preprocessing directive #include is known as a preprocessing translation unit. After preprocessing, a preprocessing translation unit is called a translation unit. Previously translated translation units may be preserved individually or in libraries. The separate translation units of a program communicate by (for example) calls to functions whose identifiers have external linkage, manipulation of objects whose identifiers have external linkage, or manipulation of data files. Translation units may be separately translated and then later linked to produce an executable program.

image.png

A C (or C++ for that matter) compiler will process your code as shown in the diagram (pictures from XC8 User's Guide).

The C files are compiled completely independently of each other into a set of object files. After this step is completed, the object files are linked together during the second stage, like a set of Lego blocks, into a final program.

It is possible for 2 entirely different programs to share some of the same obj files this way.

Re-usable object files that perform common functions are often bundled together into an archive of object files, referred to as a "library" by the standard, which is often "zipped" together into a single file with the extension .lib or .a

This sequence is pretty much standard for all C compilers as depicted on the right.

Looking at XC8 in particular, there are a few more details that only apply to this compiler.  The PIC16 also poses some challenges for compilers as the architecture has banked memory. This means that moving code around from one location to another may not only change the addresses of objects (which is quite standard) but may also require some extra instructions (bank switch instructions) to be added, depending on where the code is ultimately placed in memory.

image.png

We will not get any deeper into the details here, but I want to point out the most important aspects.

Some useful tips:

  1. Most compilers have an option to emit and keep the output from the pre-processor so that you can look at it. When your #defines and macros are getting you under, this is an excellent debugging tool. With the latest version of XC8 the option to keep these files is "-save-temps", which can be passed to the linker as an additional argument. (They will end up in a folder called "build", and the .pre files in the diagram may have the extension ".i" depending on the processor you are on.)
  2. During the linking step all the objects (compiled translation units) that make up the final program are linked together to produce an executable. This step decides where each variable and function will be placed, and all symbolic references are replaced with actual addresses. This process is sometimes referred to as allocation, relocation or fix-up. At this step most linkers accept a linker file or "linker script" which tells the linker which memory locations you want it to use.
  3. Although the C standard does not specify the format of the object files, some common formats do exist. Many compilers used to use COFF (Common Object File Format), which typically produces files with the extension .o or .obj. Another popular format favored by many compilers today is ELF (Executable and Linkable Format).
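When keeping the pre-processor output is not convenient, a rough alternative for checking a single macro is to let the pre-processor turn its expansion into a string. The sketch below is only an illustration: the macro names SLOTS, SLOT_BYTES and BUFFER_SIZE are made up for this example, and the functions are meant to be called from your own main.

```c
#include <assert.h>
#include <string.h>

// STR turns its argument into a string literal exactly as written.
// XSTR expands its argument first and only then hands it to STR.
#define STR(x)  #x
#define XSTR(x) STR(x)

// Hypothetical macros that we want to inspect.
#define SLOTS       4
#define SLOT_BYTES  16
#define BUFFER_SIZE (SLOTS * SLOT_BYTES)

// Returns the macro name itself, because # suppresses expansion.
const char *buffer_size_as_written(void)
{
    return STR(BUFFER_SIZE);
}

// Returns the text the compiler will actually see for BUFFER_SIZE.
const char *buffer_size_expanded(void)
{
    return XSTR(BUFFER_SIZE);
}

// Documents the expected result of both helpers.
void check_expansions(void)
{
    assert(strcmp(buffer_size_as_written(), "BUFFER_SIZE") == 0);
    assert(strcmp(buffer_size_expanded(), "(4 * 16)") == 0);
}
```

Printing the string returned by buffer_size_expanded() shows what the pre-processor made of the macro without having to dig through the intermediate files.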

The most important thing to take away from all that is that your C file and the H files it includes will be combined into a single translation unit which will be processed by the compiler. The pre-processor literally just pastes your include file into the translation unit at the place where you include it. There is no magic here.

So why do I need an H file at all then?

As noted in the standard different translation units communicate with each other through either calling functions with external linkage, or manipulating objects with external linkage. When 2 translation units communicate in this way it is very important that they both have the exact same definition for the objects they are exchanging. If we just had the definitions in C files without any headers then the definitions of everything that is shared would have to be very carefully re-typed in each file, and all of these copies would later have to be maintained.

It really is that simple, the only reason we have header files is to save us from maintaining shared code in multiple places.

As you should have noticed by now, the descriptions of translation units sound very similar to "libraries" or "modules" of your program, and this is precisely what they are. Each translation unit is an independent module which may or may not depend on one or more other modules, and you can use these concepts to split your programs into more manageable and re-usable modules. This is the divide and conquer strategy.

In this scheme the sole purpose of header files is to be used by more than one translation unit. They represent a definition of the interface that 2 modules in your program can use to communicate with each other, and they save you from typing it all multiple times.

Let's look at a simple example of how a header file named module.h may be processed when it is included into your C code.

// This is  module.h - the header file
// This file defines the interface specification for this module.
// It contains all declarations of functions and/or variables with external linkage
// The purpose of this file is to provide other translation units with the names of the objects that this translation unit
//    provides, so that they can be used to communicate with this translation unit.

// This declaration promises the compiler that somewhere there will be a function called "myFunction" which the linker will be able to resolve
void myFunction(void);

// This declaration promises the compiler that somewhere there will be a variable called "i" which the linker will be able to resolve
extern int i;

And its corresponding C source file module.c

// This is module.c - the C file for my module
// This file contains the implementation of my module

// It is typical for a module to include its own interface, this makes it easier to implement by ensuring the interface and implementation are identical
#include "module.h"

// Defining a variable like this will allocate storage for it.
// In C, variables with file scope have external linkage by default (in C++ this differs only for const variables, which default to internal linkage)
int i = 42;  

// Functions have external linkage by default, it is not necessary to say extern void myFunction as the "extern" is implied
void myFunction(void)
{
    // some code here
}

As described above, the pre-processor will convert this into a file for the compiler to process which looks like this (a main function is added at the end so that the snippet can be compiled on its own):

// This is module.c - the C file for my module
// This file contains the implementation of my module

// It is typical for a module to include its own interface, this makes it easier to implement by ensuring the interface and implementation are identical
// This is  module.h - the header file
// This file defines the interface specification for this module.
// It contains all declarations of functions and/or variables with external linkage
// The purpose of this file is to provide other translation units with the names of the objects that this translation unit
//    provides, so that they can be used to communicate with this translation unit.

// This declaration promises the compiler that somewhere there will be a function called "myFunction" which the linker will be able to resolve
void myFunction(void);

// This declaration promises the compiler that somewhere there will be a variable called "i" which the linker will be able to resolve
extern int i;

// Defining a variable like this will allocate storage for it.
// In C, variables with file scope have external linkage by default (in C++ this differs only for const variables, which default to internal linkage)
int i = 42;  

// Functions have external linkage by default, it is not necessary to say extern void myFunction as the "extern" is implied
void myFunction(void)
{
    // some code here
}

void main(void)
{
    myFunction();
}

Now you will notice that with the inclusion of the header like this some things, like "int i", end up occurring in the file twice. This can be very confusing when we try to establish exactly which of these statements are just declarations of the names of variables and which ones actually allocate memory. If a symbol like "int i" is declared more than once in the file, how do we ensure that memory is not allocated more than once, especially if "int i" occurs at file scope in more than one translation unit?

In order to make more sense of this we can go over how the compiler will process the combined "translation unit" from top to bottom for our simple example.

When the compiler processes this file it first finds a declaration of a function without an implementation/definition. This tells the compiler to only declare this name in what is commonly referred to as its "dictionary". Once the name is established it is possible for the implementation to safely refer to this name. Such a declaration of a function without an implementation is called a function prototype.

The next line of code contains a declaration of an integer called "i" with external linkage (we will get to linkage in the Linkage section below). This is a declaration as opposed to a definition because it uses "extern" and does not have any initializer (an assignment of an initial value). This declaration places "i" in the dictionary, but does not allocate storage for the variable. It also marks the object as having external linkage. When 2 compilation units declare the same object with external linkage the linker will know that they are linked (refer to the same thing), and storage will only be allocated for it once, so that both translation units end up manipulating the same variable!

Later on the compiler finds "int i = 42". This is a definition of the same symbol "i", and this time it also supplies an initializer, which tells the compiler to set this variable to 42 before main is run. As this is a definition, this is the statement that will cause memory to be allocated for the variable. If you try to have 2 definitions for the same object the build will fail with an error which will alert you that the object was defined more than once: within one translation unit the compiler reports it, and across 2 separate translation units the linker does. It will say "redefinition", "duplicate symbol", "multiple definition" or something along these lines (error messages are not specified by the standard so these messages are different on each compiler).

Next we encounter the implementation/definition of the function myFunction. Lastly we encounter the implementation/definition of main, which is traditionally the entry point of the application.
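The same top-to-bottom reading can be tried on a smaller single-file sketch. The names ticket, next_ticket and take_two_tickets are invented for this illustration and are not part of the module above; the point is only that a name may be used as soon as it has been declared, even though its definition comes later in the translation unit.

```c
#include <assert.h>

// Function prototype: puts the name in the dictionary, no body yet.
int next_ticket(void);

// Declaration only: no storage is allocated by this line.
extern int ticket;

// Both names can be used here although neither has been defined yet.
int take_two_tickets(void)
{
    next_ticket();
    return next_ticket();
}

// The one definition of ticket: storage is allocated and set to 40 before main runs.
int ticket = 40;

// The definition that matches the prototype above.
int next_ticket(void)
{
    return ++ticket;
}

// Documents the expected behaviour of the functions above.
void declaration_demo(void)
{
    assert(take_two_tickets() == 42);
    assert(ticket == 42);
}
```

Removing the prototype or the extern declaration makes take_two_tickets refer to names the compiler has not seen yet, which is exactly the problem a header file solves for other translation units.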

I encourage you to cut and paste the pre-processed listing above (the one that ends with main) into an empty project and compile it to assure yourself that this works fine.

After that I want you to paste these examples so we can better understand the mechanics here, and prove that I am not smoking something here!

// Some test code to show why include files actually end up working
int j;
int j;
int j = 1;
 
void main(void)
{
   // Empty main 
}

You can compile this and note that there will be no error. (a project with this code is attached for your convenience).

This is because int j; is what is called a "tentative definition". What this means is that we are stating that there is a definition for this variable in this translation unit. If the end of the translation unit is reached and no definition has been provided (there was no definition with an initializer) then the compiler must behave as if there was a definition with an initializer of "0" at the end of the translation unit. You can have as many tentative definitions as you want for the same object, even if they are in the same compilation unit, as long as their types are all the same.

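The third line is the only actual definition of "j", and it is the one that triggers the allocation of storage for the variable. If this line is removed, the tentative definitions take over: the compiler behaves as if there was a definition with an initializer of 0 at the end of the file, and storage is allocated for that. Some toolchains implement this with "common" symbols, which the linker merges with a definition found in another translation unit if there is one, but this is not something the standard guarantees.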
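As a minimal sketch of both cases in one file, the variable names below are invented for this illustration. One object only ever gets tentative definitions and therefore starts at 0, the other gets a real definition later in the file.

```c
#include <assert.h>

// Two tentative definitions and nothing else: at the end of the
// translation unit this behaves like "int never_initialized = 0;".
int never_initialized;
int never_initialized;

// A tentative definition followed by the actual definition.
int initialized_later;
int initialized_later = 7;

int read_never_initialized(void)
{
    return never_initialized;
}

int read_initialized_later(void)
{
    return initialized_later;
}

// Documents the expected start-up values.
void tentative_demo(void)
{
    assert(read_never_initialized() == 0);
    assert(read_initialized_later() == 7);
}
```

Note that this is C behaviour only; a C++ compiler rejects the repeated lines because C++ has no tentative definitions.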

Now change the code to look as follows:

// Some test code to show why include files actually end up working
int j;
int j = 1;
int j = 1;
 
void main(void)
{
   // Empty main 
}

This will result in the following error message. Since more than one initializer is provided, we have multiple definitions for the object in the same translation unit, which is not allowed. This is because a declaration with an initializer is a definition for a variable and a definition will allocate storage for the variable. We can only allocate storage for a variable once.

Quote

make[2]: *** [build/default/production/main.p1] Error 1
main.c:4:5: error: redefinition of 'j'
make[1]: *** [.build-conf] Error 2
int j = 1;

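With Clang, when the second definition of j lives in another translation unit (main.c and module.c here), the error comes from the linker and looks slightly different: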

Quote

 

duplicate symbol _j in:
    /var/folders/wx/0hk7_sfx0qd67jdn0ms8b40c0000gn/T/main-e9a917.o
    /var/folders/wx/0hk7_sfx0qd67jdn0ms8b40c0000gn/T/module-ffb7fd.o
ld: 1 duplicate symbol for architecture x86_64
clang: error:
linker command failed with exit code 1 (use -v to see invocation)

 

Now let's try something else. Change it to look like this. This time the variable is an auto variable (also called a local variable), so it has no linkage. Because every declaration of an identifier with no linkage denotes a unique entity, we are not allowed to declare the same variable more than once in the same scope. There is no good reason (like with header files) to do this, and if it happens it is most likely a mistake, so the compiler will not allow it.

// Some test code to show why include files actually end up working
void main(void)
{
   int j;
   int j;
   int j = 1;
}

The error produced looks as follows on XC8, and I will get this even if I have only 2 j's with no initializer:

Quote

make[2]: *** [build/default/production/main.p1] Error 1
make[1]: *** [.build-conf] Error 2
make: *** [.build-impl] Error 2
main.c:7:5: error: redefinition of 'j'
int j;
    ^
main.c:6:8: note: previous definition is here
   int j;

An important note here: auto variables (local variables) will NOT be automatically initialized to 0 like variables at file scope (and any other variables with static storage duration) will be. This means that if you do not supply an initializer the value of the variable is indeterminate; it can and likely will hold whatever happened to be in that memory location.
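The difference can be sketched as follows. The names are invented for this illustration, and the automatic variable is deliberately given an explicit initializer because reading it without one would not produce a dependable result.

```c
#include <assert.h>

// File scope: static storage duration, so it starts out as 0.
int file_scope_total;

int count_calls(void)
{
    // A static local also has static storage duration (but no linkage),
    // so it starts at 0 and keeps its value between calls.
    static int calls;
    return ++calls;
}

int sum_up_to(int n)
{
    // Automatic variable: nothing initializes it for us, so we do it here.
    int total = 0;
    for (int k = 1; k <= n; k++) {
        total += k;
    }
    return total;
}

// Documents the expected behaviour on a fresh start of the program.
void initialization_demo(void)
{
    assert(file_scope_total == 0);
    assert(count_calls() == 1);
    assert(count_calls() == 2);
    assert(sum_up_to(4) == 10);
}
```

If the "= 0" is removed from total, the function may appear to work on one build and fail on the next, which is what makes this kind of bug so unpleasant.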

Linkage

We spoke about linkage quite a bit, so let's also make this clear. The C Standard states in section 6.2.2:

Quote
  1. An identifier declared in different scopes or in the same scope more than once can be made to refer to the same object or function by a process called linkage. There are three kinds of linkage: external, internal, and none.

  2. In the set of translation units and libraries that constitutes an entire program, each declaration of a particular identifier with external linkage denotes the same object or function. Within one translation unit, each declaration of an identifier with internal linkage denotes the same object or function. Each declaration of an identifier with no linkage denotes a unique entity.

For any identifier declared at file scope without a storage-class specifier, linkage is automatically external. External linkage means that all variables with this identical name in all translation units will be linked together to point to the same object.

Variables with file scope which have a "static" storage-class specifier have internal linkage. This means that all objects WITHIN THIS TRANSLATION UNIT with the same name will be linked to refer to the same object, but objects in other translation units with the same symbol name will NOT be linked to this one.
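A small sketch of how this is typically used to keep the internals of a module private is shown here. All names are invented for this illustration; only store_sample and last_sample would be listed in the header of such a module.

```c
#include <assert.h>

// Internal linkage: another translation unit can have its own "temp"
// without the two ever being linked together.
static int temp;

// Functions can be given internal linkage in the same way.
static int clamp_to_byte(int value)
{
    if (value < 0) {
        return 0;
    }
    if (value > 255) {
        return 255;
    }
    return value;
}

// External linkage: this is the interface other modules may call.
int store_sample(int value)
{
    temp = clamp_to_byte(value);
    return temp;
}

// External linkage: read back the last stored value.
int last_sample(void)
{
    return temp;
}

// Documents the expected behaviour of the interface.
void internal_linkage_demo(void)
{
    assert(store_sample(300) == 255);
    assert(store_sample(-5) == 0);
    assert(last_sample() == 0);
}
```

Making everything static that is not part of the interface keeps the set of names the linker has to match across translation units as small as possible.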

Local variables (variables with block or function scope) automatically have no linkage. This means they will never be linked, which means having the same symbol twice will cause an error (as they cannot be linked). An example of this was shown in the last sample block of the previous section.

Note that adding "extern" in front of a local variable declaration gives it linkage: unless an earlier visible declaration gave the name internal linkage, it will have external linkage, which means that it will be linked to the global variable with that name elsewhere in the program. I have made a little example project to play with which demonstrates this behavior (perhaps to your surprise!)
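A single-file sketch of this effect, using the made-up name shared, contrasts a block scope "extern" declaration with an ordinary local variable of the same name.

```c
#include <assert.h>

// File scope, external linkage.
int shared = 1;

int read_through_extern(void)
{
    // Block scope, but linked to the file scope object above.
    extern int shared;
    return shared;
}

int read_plain_local(void)
{
    // No linkage: a separate object that hides the file scope one.
    int shared = 99;
    return shared;
}

// Documents which object each function ends up reading.
void block_extern_demo(void)
{
    shared = 5;
    assert(read_through_extern() == 5);
    assert(read_plain_local() == 99);
}
```

The plain local never sees the 5 written to the file scope variable, while the block scope "extern" declaration does.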

If two translation units both contain tentative definitions (no initializer) for the same symbol with external linkage, many toolchains will only define the object once and link both to that one object, although the standard does not require this to work. If both translation units provide an initializer, you get the duplicate definition error shown earlier.

There is a nice example, as always, in the C99 standard (section 6.9.2).

int i1 = 1;			// definition, external linkage
static int i2 = 2;		// definition, internal linkage
extern int i3 = 3;		// definition, external linkage
int i4;				// tentative definition, external linkage
static int i5;			// tentative definition, internal linkage
int i1;				// valid tentative definition, refers to previous
int i2;				// 6.2.2 renders undefined, linkage disagreement
int i3;				// valid tentative definition, refers to previous
int i4;				// valid tentative definition, refers to previous
int i5;				// 6.2.2 renders undefined, linkage disagreement
extern int i1;			// refers to previous, whose linkage is external
extern int i2;			// refers to previous, whose linkage is internal 
extern int i3;			// refers to previous, whose linkage is external
extern int i4;			// refers to previous, whose linkage is external
extern int i5;			// refers to previous, whose linkage is internal

// I had to add the missing one
extern int i6;			// Valid declaration only, whose linkage is external. No storage is allocated.

Note that if we have a declaration only, such as i6 above, in a compilation unit, the unit will compile without allocating any storage to the object. At link-time the linker will attempt to locate the definition of the object that allocates its storage; if none is found in any other compilation unit for the program you will get an error, something like "reference to undefined symbol i6"

image.png

Looking over those examples you will note that the storage-class specifier "extern" should not be confused with the linkage of the variable. It is very possible for a variable declared with the "extern" storage-class specifier to have internal linkage, as indicated by the examples from the standard for "i2" and also for "i5".
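Reduced to the bare minimum, the "i2" case from the standard looks like the sketch below. The names hidden and read_hidden are invented for this illustration; the linkage itself cannot be observed at run time, so the check only confirms that both declarations refer to the same object.

```c
#include <assert.h>

// Definition with internal linkage.
static int hidden = 3;

// Refers to the declaration above, so the linkage stays internal
// even though the keyword used here is "extern".
extern int hidden;

int read_hidden(void)
{
    return hidden;
}

// Documents that both declarations name the same object.
void extern_keyword_demo(void)
{
    assert(read_hidden() == 3);
}
```

Another translation unit that declares "extern int hidden;" will not find this object, because the name never leaves this file.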

To see if you understand "extern", take a look at this example. What happens when you have one file (e.g. main.c) which declares a local variable as extern like this? [First try to predict, and then test it and see for yourself.]

#include <stdio.h>
void testFunction(void);

int i = 1;

void main(void)
{
    extern int i;
    testFunction();
    printf("%d\r\n", i);
}

And in a different file (e.g. module.c) place the following:

int i;

void testFunction(void)
{
    i = 5;
}

You should be able to tell if this will compile or not, and if not what error it would give, or will it compile just fine and print 5 ?

Also try the following:

  1. What happens when you remove the "extern" storage-class specifier?
  2. What happens when instead you just remove the entire line "extern int i;" from function main? (no externs in either file). Is that what you expected?
  3. What happens when you move the initializer from file-scope (just leaving the int i), to the function scope definition inside of main (when you have "extern int i = 1;" inside of the main function)?
  4. What happens when you add "extern" to the file scope declaration (replace "int i = 1;" with "extern int i = 1;")

In Closing

When you are breaking your C code into independent "Translation Units" or "Compilation Units", keep in mind that the entire header file is being pasted into your C file whenever you use #include. Keeping this in mind can help you resolve all kinds of mysterious bugs.

Make sure you understand the difference between a variable being declared with the "extern" storage-class specifier and a variable having external linkage. Remember that if 2 modules declare file scope variables with external linkage and the same name, they will end up being the same variable, so 2 libraries both using a file scope variable called "temp" is a bad idea, as these will end up overwriting each other and causing hard-to-locate bugs.

 

 

 

Answers Cheat Sheet:

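The code listed compiles fine and prints "5" on the toolchains used here. (It relies on the tentative definition in module.c being merged with the definition in main.c. Toolchains that do not merge them, such as GCC 10 and later with its default of -fno-common, report a duplicate definition of i at link time instead.)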

Additional exercises:

  1. When you remove the extern from the first line in function main it prints some arbitrary value (on my machine 287088694). This is because the local variable has no linkage to the other variable called i and was never initialized.
  2. When you instead remove the entire first line from the main function it compiles and prints "5" like before.
  3. Having "extern int i = 1"  inside of function main does not compile at all, complaining that "an extern variable cannot have an initializer" 
  4. Having "extern int i = 1" in file scope is allowed though! This compiles just fine as long as only one definition exists, although Clang will give a warning for this one. If you now also add an initializer to the file scope int i in module.c it will not compile any more.

 


2 Comments



A very informative Article as always! Thank you for publishing this blog! but I noticed a little mistake in the next passage

Quote

Now let's try something else. Change it to look like this. This time the variable is an auto variable, so it has internal linkage (this is also called a local variable). In this case we are not allowed to declare the same variable more than once because there is no good reason (like with header files) to do this and if this happens it would most likely be a mistake so the compiler will not allow it.

The Issue here is that automatic variables don't have a linkage at all, since they are allocated on the Stack during the Execution of the Function, so it will not be visible to the Linker at all. This case also applies to the local variables declared as static, they have no linkage, even though they are allocated before the Execution of the Function, because they are not visible outside of their block.

Other than that, everything is perfectly explained!

 


Thanks for the nice topic and your successful effort to unfold it. A little mistake Just caught my eye:

Quote

Note that if we have a declaration only such as i6 above in a compilation unit the unit will compile without allocating any storage to the object. At link-time the linker will attempt to locate the definition of the object that allocates it's storate, if none is found in any other compilation unit for the program you will get an error, something like "reference to undefined symbol i6"

That's not right. If no definition for the variable with an external linkage has been found, then the compiler will assign "0" for the variable as explained earlier in this article too:

Quote

If the end of the tranlation unit is reached and no definition has been provided (there was no definition with an initializer) then the compiler must behave as if there was a definition with an initializer of "0" at the end of the translation unit.

 
