Now that i have a really minimal program with possibly of debug output it's time to expand the capabilities a bit. While this setup barely works for blinking an LED most C++ and even most C features are not working yet.

Section garbage collection

The first step actually doesn't add a real feature. It just reduces the code memory usage. By default g++ produces object files that have all code in one section so the linker can either add all of it to the final linked file or nothing. And the linker reasonably includes all code from every object file. That’s nice and predictable. It’s also wasteful, because code get’s included that is not actually used.

There are two main reasons for this unused code. The first one is the there might be functions in the code that are not actually called. That can of course happen easily if some files are more of a library nature or some calls are conditionally compiled (depending on debug options for example).

The second reason is that g++ often inlines code, but without beeing explicitly told that it will not be used also emits an non inline version. For example in the current code the functions for using the serial port are likely inlined. But they are not declared static so g++ must prepare an object file that also has duplicate code. One fix is to always add static for these functions. But i think it’s better to let the computer figure that out automatically. Even though marking file local functions with static is a good idea, often we don't know if there is an outside user. Fundamentally static is about keeping private functions private and the global namespace clean, not about encoding cross file usage of functions in source. In case of the simple serial code serial_writebyte_wait is currently only used by serial_writestr, but it’s reasonable to export. Tracking if there currently is an external user or not and adding a static wouldn't make for cleaner code. But just for busy work and unneeded coupeling.

So i’m going to allow the linker to remove unneeded code. But how would the linker know what code is needed in the first place? If the linker has some code (or rather section) that it needs to include in the linked output it will fullfill every symbol referenced in that section either from other object files included directly into the build or from included static libraries. But there needs to be an initial section to start this from. The current linker script doesn’t have anything that would act like an anchor for the linker to start from. Using KEEP in the linker script we can declare sections that always need to be included even if the linker doesn't see any references to it from other already included sections. In this case .vectors always needs to be included because it contains the pointers the cpu uses when starting up:

     .vectors : {
-        *(.vectors)
+        KEEP(*(.vectors))
     } > FLASH

This will make sure the linker has something as root for determining what need to be included. Of course we still need to tell the linker to only include needed sections.

LDFLAGS += -Wl,--gc-sections

This enables the section garbage collection. So far this would remove completly unused files (actually it considers code and data separately, but i don't have data sections yet at this point). To make this effective at function level g++ needs to generate each independent function into it's own section. And while we are at it for data too:

CFLAGS += -ffunction-sections -fdata-sections
CXXFLAGS += -ffunction-sections -fdata-sections

These optionas are not always good. There are some tricks with weak symbols and bigger sections to ensure that when a feature of a library is used the the needed infrastructure is only then brought in. But that‘s more what you need to worry about when writing the standard c library.

The result of these flags is for this sample program a reduction of code size from 312 bytes down to 172 bytes (about 45% of the previous size). Of course this is because g++ did a lot of inlining in this case and everything is just in one file. In real applications size reduction is not expected to be this large.

Explicitly placing read only data

The example code contains a string that is printed to the serial port. g++ emits this to a .rodata section that is not placed explicitly by the linker script. While this might be convinient, i like my memory layout to be easier to predict and thus decided to place everything explicitly. Thus after the .text output section in the linker script:

    .rodata : {
    } > FLASH

This just gathers all sections beginning with .rodata into flash at this point in the linked image.

Guarding against automatic placement of unknown elf sections

To take this a bit further i like to make my build fail if it produces sections i did not explicitly map in the linker script. Often special sections need some kind of special handling. Thus i prefer the “fail early, fail hard” approch. If the build can detect a potential problem, i want it to fail so i can avoid needing to debug it later in a running micro controller.

ld has a promising sounding option --orphan-handling=error for that. Orphans beeing unmapped sections in this context. Sadly there are quite many sections even in this small sample program that are not explicitly mapped. So let’s start mapping a few of them:

    /* debug information, intentionally only support for modern dwarf */

    .debug_info 0 : { *(.debug_info*) }
    .debug_abbrev 0 : { *(.debug_abbrev*) }
    .debug_loc 0 : { *(.debug_loc*) }
    .debug_aranges 0 : { *(.debug_aranges*) }
    .debug_ranges 0 : { *(.debug_ranges*) }
    .debug_macro 0 : { *(.debug_macro*) }
    .debug_line 0 : { *(.debug_line*) }
    .debug_str 0 : { *(.debug_str*) }
    .debug_frame 0 : { *(.debug_frame*) }

The makefile creates debug information. This just passes these though to the output file. The sections’ attributes in the object files make sure that they don’t end up in the actual flash image.

    /DISCARD/ : {

These section can just be discarded for various reasons.

The linker implicitly creates some section while processing the object files. These have to be empty for what is supported by this linker script and the system initialization. So the next step is to match them all into a section and bail out if that section contains anything:

    /* complain about contents in any sections we don't support or know about that linker or assembler generate */
    .unmatched : {
        KEEP("linker stubs"(*)) /* .glue_7 .glue_7t .vfp11_veneer .v4_bx */
    } > FLASH
    ASSERT(SIZEOF(.unmatched) == 0, "allocated sections not matched. Search in linker map for .unmatched and add non empty sections explicitly in this file")

These are mostly for interworking between thumb and non thumb machine code(which can’t happen on cortex m because it only supports thumb) and advanced linker tricks (STT_GNU_IFUNC).

Now that we have discarded or matched we can finally add -Wl,--orphan-handling=error to the makefile. Sadly this is quite a big part in the linker script. But i believe it's worth knowing if something is emitted unexpectedly.

Move stack definition into linker script

The minimal example hard coded the starting value for the stack pointer in the code. But going forward the layout of data in the sram should be managed by the linker. So let’s start by moving the placement of the stack into the linker script.

First we need to define the memory for the sram:

     FLASH : ORIGIN = 0x08000000, LENGTH = 64K
+    RAM :   ORIGIN = 0x20000000, LENGTH = 20K

This adds a new memory called RAM with the given address and length.

+__stack_size = 0x400;

For convinence define a constant for the wanted stack size. This is likely more than a simple project needs. But when experimenting it‘s nice to be on the safe side for now.

    .stack : {
        __stack_start = .;
        . = . + __stack_size;
        __stack_end = .;
    } > RAM

This adds a section called .stack and sets its size by incrementing the . current address variable. Also it defines symbols __stack_start and __stack_end with which refer to the start and end of the section. The end is to be used in the c++ code for initializing the initial stack pointer. The start is not needed currently, but for other sections both are needed so i added it for .stack too.

+extern char __stack_end;
 extern void (* const vectors[])() __attribute__ ((section(".vectors"))) = {
-                (void (*)())0x20000400,
+                (void (*)())&__stack_end,

This removes the hardcoded value for the stack start. Getting the address of a symbol created in the linker script is a bit convoluted. We need to define an external variable (type doesn’t matter really) and take it’s address.

Prepare for more test code
  • Add test-language-features.cpp with an empty run_tests function.
  • Add test-language-features.h
  • Add serial.h with declaration of serial_writebyte_wait and serial_writestr
  • call the run_tests function just before the loop in mainFn()
Adding support for variables without initialization

The goal is to make this code run and not output an '!' even when the cpu is only soft reset. C and C++ standards guarantee that not explicitly initialized global variables are initialized to what effectivly is all bits zeroed on most architectures (including arm)

char test_global;

void global_variable() {
    if (test_global != 0) {
    test_global = 'a';
    test_global = 'b';

So for this to compile we need a linker section to map .bss sections where gcc will place the variable into our resulting linked program. This will not actually create anything in the rom image. But the linker can now assign addresses to the variable and resolve the relocations for the variable in the code in the object file.

    .bss : {
        __bss_start = . ;
        __bss_end = . ;
    } > RAM

This again adds symbols for start and end. This time both are needed to fill the memory with the required zero bit pattern. This is done with a small c++ function:

void init_sram_sections() {

    extern uint32_t __bss_start, __bss_end;

    for (uint32_t* dst = &__bss_start; dst< &__bss_end; dst++) {
        *dst = 0;

Again to get pointers from the symbols there are extern variables. I used uint32_t here, because that gives pointers that are useful for the initialization loop. Additionally i added a call to this function at the start of mainFn.

Common sections

Gcc per default allows multiple definitions of variables in C mode(contrary to the C standard). To do that it uses common sections that are apart from build in weak semantics similar to .bss. To test and support that we need a C test file and a little addition to the linker script.

     .bss : {
         __bss_start = . ;
+        *(COMMON)
         __bss_end = . ;
     } > RAM

And in a new file test-language-features-c.c

#include "serial.h"

char test_common;

void run_tests_c(void) {
    if (test_common != 0) {

While with the linker script addition it works, i prefer to disable that extension in the makefile. Because we should be writing correct code anyways:

CFLAGS += -fno-common
Initialized globals

Next up is support for initialized globals:

+char test_init = 'c';
 void global_variable() {
     if (test_global != 0) {
     test_global = 'a';
     test_global = 'b';
+    serial_writebyte_wait(test_init);
+    test_init = 'd';
+    serial_writebyte_wait(test_init);

For this to work the memory location where test_init is located at runtime needs to be initialized before starting the main part of the program. The program when the cpu is reset is only an flash image. So we need the initialization values somewhere in flash. And they need to be copied to ram early by the reset vector.

For this reason the linker allows us to have two locations for a section. One for an initialization image in flash that is actually included in the flash image and one for use at run time that is located on a different address that is in ram.

    .data : {
        . = ALIGN(4);
    } > RAM AT>FLASH

    __data_start_flash = LOADADDR(.data);
    __data_start_ram = ADDR(.data);
    __data_size = SIZEOF(.data);

Here we have LOADADDR that returns the address of the initialization image and ADDR that return the run time location. Together with the size of the data section that’s enough to extend init_sram_sections to copy the initialization image into the correct ram location:

    extern uint32_t __data_start_flash, __data_start_ram, __data_size;

    uint32_t *src = &__data_start_flash;
    uint32_t *dst = &__data_start_ram;
    uint32_t *dend = dst + ((uint32_t)&__data_size);

    while (dst < dend) {
        *dst++ = *src++;

Ok that’s the basics. Using only the language features enabled at this we could build quite a bit without needing to resort to strange workarounds.


But there are a few more things that are easy to enable and useful. Constructors is certainly one. Having things run implicitly in the startup might be not be a good idea if we are not careful. But maybe it is if we are. One of the classic problems is that ordering is not well defined for C++ constructors. But that's just standard C++. The gnu toolchain supports constructors with explicit priorities.

Depending on your preferences this could be a nice feature to decouple various parts of code or a pure horror show. Let’s assume we don’t abuse it and go on to add support.

The syntax that is independent of classes (available in C and C++) is:

__attribute__((constructor (202)))
static void constructor2() {

static void constructor1() __attribute__((constructor (200)));
static void constructor1() {

Both synaxes work. The priority can be from 0 to 65535. The documentation claims that 0 to 100 are reserved. So let’s avoid those.

The way this works is simple. Pointers to the constructor functions are generated into .init_array.XXXXX sections, where XXXXX is the zero padded priority. We just need to arrange for them to be called on startup in the right order. Fortunatly the other special support for this in the toolchain is that the linker can sort these sections by numeric order using SORT_BY_INIT_PRIORITY so getting a list in the right order is easy in the linker script:

    .init_array : {
        __init_array_start = .;
        __init_array_end = .;
    } > FLASH

    /* .fini_array ommited because when would we run destructors? */

As usual there’s a start and end symbol to allow for iteration in the startup code. The same infrastructure is available for destructors instead of constructors. But as the comment in the linker script says there is usually no clear time to run those on embedded microcontrollers. So i left out support for this.

The code to run the constructors is similar to the code for memory initialization. It just iterates over function pointers and calls them this time:

void run_init_data() {
    typedef void (*init_fun)(void);
    extern init_fun __init_array_start, __init_array_end;

    init_fun *ptr = &__init_array_start;
    while (ptr < &__init_array_end) {

Another way to use this for classes is the init_order attribute.

Some_Class  B  __attribute__ ((init_priority (543)));

This works exactly like the examples above from the linker and runtime support perspective. But there’s one special case. Constructors of global variables without priority are generated into the section .init_array (without a dot and a priority). To support this case we need a tiny addition to the linker script:

     .init_array : {
         __init_array_start = .;
+        KEEP(*(.init_array*))
         __init_array_end = .;
     } > FLASH

The default linker script of the toolchain places constructors without explicit priority as after all with priority which seems sensible, so i kept that ordering.

Ignoring global destructors

Now that we have support for constructors in globals, what about destructors?

As i wrote earlier global destructors are not really useful in most embedded environments. But they are often when using the same class in automatic or dynamically allocated storage. So i decided that i want to ignore them in global context. While there is a .fini_array g++ doesn't use it for destructors. Instead it uses an arm abi definied variant of atexit that is called from the function in the .init_array* section that does the construction. Likely to be sure that the destructor ordering is always the exact reverse of the construtors. And because the ARM C++ ABI specifies this.

While the micro variante of newlib shipped with the toolchain i use has an implementation of __aeabi_atexit to just ignore the calls can be done easier with an empty function. And it uses less code space. For reasons that are only relevant in environments using shared libraries the __aeabi_atexit function has an additional argument that is passed the contents of the symbol __dso_handle which is used to identify the shared object from where the call was made. As the value if the symbol doesn’t matter we can just define that symbol to 0 in the linker script.

__dso_handle = 0;

and then implement an empty __aeabi_atexit:

// No atexit, no destructors

// actual signature is bool __aeabi_atexit(void* object, void (*destroyer)(void*), void* dso_handle)
extern "C" void __aeabi_atexit() {

I used the wrong signature here to save a few bytes because the return is ignored anyway and all parameters fit into registers. This of course is a questional gain for relying on such low level details. But i sometimes i just can’t help myself from doing stupid micro optimisations.

Statics in functions

In C++ we can have static variables in functions that are initialized by code run when the function is executed for the first time. Of course as we expect from a good general purpose toolchain this is done in a thread safe manner. Nice. Well that of course needs runtime support and is totally unneeded in a single threaded design with interrupts. Well if a function that uses such a static is used from an interrupt this would be needed, and it would need to disable the interrupt. If you want to use statics in interrupts you have to reimplement __cxa_guard_acquire and __cxa_guard_release anyway. For single threaded use without usage in interrupts there is a easier way:

CXXFLAGS += -fno-threadsafe-statics

That is just to disable thread safeness and being careful in using this feature. But when implementing code that needs to be run from an interrupt being careful is always needed anyway.

For full code see here

more in this series:
6. Heap, malloc and new
7. STM32 PLL and Crystal Oscillator
8. Interrupts
Makefile for embedded projects »
« Heap, malloc and new