on 2016-03-16 in stm32-from-scratch
Now that I have a really minimal program with the possibility of debug output, it's time to expand the capabilities a bit. While this setup barely works for blinking an LED, most C++ and even most C features are not working yet.
The first step actually doesn't add a real feature. It just reduces the code memory usage. By default g++ produces object files that have all code in one section, so the linker can either add all of it to the final linked file or nothing. And the linker reasonably includes all code from every object file. That’s nice and predictable. It’s also wasteful, because code gets included that is not actually used.
There are two main reasons for this unused code. The first one is that there might be functions in the code that are not actually called. That can of course happen easily if some files are more of a library nature or some calls are conditionally compiled (depending on debug options, for example).
The second reason is that g++ often inlines code, but unless it is explicitly told that the function will not be used elsewhere, it also emits a non-inline version. For example, in the current code the functions for using the serial port are likely inlined. But they are not declared static, so g++ must produce an object file that also contains an out-of-line copy. One fix is to always add static for these functions. But I think it’s better to let the computer figure that out automatically. Even though marking file-local functions with static is a good idea, often we don't know if there is an outside user. Fundamentally, static is about keeping private functions private and the global namespace clean, not about encoding cross-file usage of functions in the source. In the case of the simple serial code, serial_writebyte_wait is currently only used by serial_writestr, but it’s reasonable to export. Tracking whether there currently is an external user and adding a static accordingly wouldn't make for cleaner code, just for busywork and unneeded coupling.
So I’m going to allow the linker to remove unneeded code. But how would the linker know what code is needed in the first place? If the linker has some code (or rather a section) that it needs to include in the linked output, it will resolve every symbol referenced in that section, either from other object files included directly in the build or from included static libraries. But there needs to be an initial section to start this from. The current linker script doesn’t have anything that would act as an anchor for the linker to start from. Using KEEP in the linker script we can declare sections that always need to be included, even if the linker doesn't see any references to them from other already included sections. In this case .vectors always needs to be included because it contains the pointers the CPU uses when starting up:
.vectors : {
- *(.vectors)
+ KEEP(*(.vectors))
} > FLASH
This will make sure the linker has something as a root for determining what needs to be included. Of course we still need to tell the linker to only include needed sections.
LDFLAGS += -Wl,--gc-sections
This enables the section garbage collection. So far this would only remove completely unused files (actually it considers code and data separately, but I don't have data sections yet at this point). To make this effective at function level, g++ needs to generate each independent function into its own section. And while we are at it, the same for data:
CFLAGS += -ffunction-sections -fdata-sections
CXXFLAGS += -ffunction-sections -fdata-sections
These options are not always good. There are some tricks with weak symbols and bigger sections that ensure the needed infrastructure is only brought in when a feature of a library is actually used. But that’s more what you need to worry about when writing the standard C library.
The result of these flags for this sample program is a reduction of code size from 312 bytes down to 172 bytes (about 55% of the previous size, so a reduction of about 45%). Of course this is because g++ did a lot of inlining in this case and everything is in just one file. In real applications the size reduction is not expected to be this large.
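To make the effect of the flags a bit more concrete, here is a small sketch. The function names are made up for illustration and are not part of the sample project. It assumes GCC's usual naming scheme, where -ffunction-sections puts each function into an input section named after its symbol.

```cpp
#include <cstdint>

// Hypothetical helpers; extern "C" keeps the symbol names unmangled.
// With -ffunction-sections each function is typically emitted into its own
// input section: .text.double_value and .text.never_called.
extern "C" uint32_t double_value(uint32_t x) {
    return x * 2u;
}

// Nothing references this function. Without -ffunction-sections it shares
// one .text section with double_value and gets linked in anyway. With the
// flag and --gc-sections the linker is free to drop its section.
extern "C" uint32_t never_called(uint32_t x) {
    return x + 1u;
}

// Stand-in for code that is reachable from the KEEP'ed vector table.
uint32_t reachable_entry() {
    return double_value(21u);
}
```

Looking at the linker map file (for example generated with -Wl,-Map=output.map) is one way to check which sections were actually discarded.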
The example code contains a string that is printed to the serial port. g++ emits this to a .rodata section that is not placed explicitly by the linker script. While this might be convenient, I like my memory layout to be easier to predict and thus decided to place everything explicitly. Thus, after the .text output section in the linker script:
.rodata : {
*(.rodata*)
} > FLASH
This just gathers all sections beginning with .rodata
into flash at this point in the linked image.
To take this a bit further, I like to make my build fail if it produces sections I did not explicitly map in the linker script. Often special sections need some kind of special handling. Thus I prefer the “fail early, fail hard” approach. If the build can detect a potential problem, I want it to fail so I can avoid needing to debug it later in a running microcontroller.
ld has a promising sounding option --orphan-handling=error for that. Orphans are unmapped sections in this context. Sadly there are quite a few sections, even in this small sample program, that are not explicitly mapped. So let’s start mapping a few of them:
/* debug information, intentionally only support for modern dwarf */
.debug_info 0 : { *(.debug_info*) }
.debug_abbrev 0 : { *(.debug_abbrev*) }
.debug_loc 0 : { *(.debug_loc*) }
.debug_aranges 0 : { *(.debug_aranges*) }
.debug_ranges 0 : { *(.debug_ranges*) }
.debug_macro 0 : { *(.debug_macro*) }
.debug_line 0 : { *(.debug_line*) }
.debug_str 0 : { *(.debug_str*) }
.debug_frame 0 : { *(.debug_frame*) }
The makefile creates debug information. This just passes these sections through to the output file. The sections’ attributes in the object files make sure that they don’t end up in the actual flash image.
/DISCARD/ : {
*(.note.GNU-stack)
*(.gnu_debuglink)
*(.gnu.lto_*)
*(.comment)
*(.ARM.attributes)
}
These sections can just be discarded for various reasons.
The linker implicitly creates some sections while processing the object files. These have to be empty for what is supported by this linker script and the system initialization. So the next step is to match them all into one section and bail out if that section contains anything:
/* complain about contents in any sections we don't support or know about that linker or assembler generate */
.unmatched : {
KEEP("linker stubs"(*)) /* .glue_7 .glue_7t .vfp11_veneer .v4_bx */
KEEP(*(.iplt))
KEEP(*(.rel.iplt))
KEEP(*(.igot.plt))
} > FLASH
ASSERT(SIZEOF(.unmatched) == 0, "allocated sections not matched. Search in linker map for .unmatched and add non empty sections explicitly in this file")
These are mostly for interworking between thumb and non-thumb machine code (which can’t happen on Cortex-M because it only supports thumb) and advanced linker tricks (STT_GNU_IFUNC).
Now that every section is either discarded or matched, we can finally add -Wl,--orphan-handling=error to the makefile. Sadly this takes up quite a big part of the linker script. But I believe it's worth knowing if something is emitted unexpectedly.
The minimal example hard coded the starting value for the stack pointer in the code. But going forward the layout of data in the sram should be managed by the linker. So let’s start by moving the placement of the stack into the linker script.
First we need to define the memory for the sram:
MEMORY {
FLASH : ORIGIN = 0x08000000, LENGTH = 64K
+ RAM : ORIGIN = 0x20000000, LENGTH = 20K
}
This adds a new memory called RAM
with the given address and length.
+__stack_size = 0x400;
For convenience, define a constant for the wanted stack size. This is likely more than a simple project needs. But when experimenting it’s nice to be on the safe side for now.
.stack : {
__stack_start = .;
. = . + __stack_size;
__stack_end = .;
} > RAM
This adds a section called .stack and sets its size by incrementing the current address variable (.). It also defines the symbols __stack_start and __stack_end, which refer to the start and end of the section. The end is used in the C++ code for initializing the initial stack pointer. The start is not needed currently, but for other sections both are needed, so I added it for .stack too.
-
+extern char __stack_end;
extern void (* const vectors[])() __attribute__ ((section(".vectors"))) = {
- (void (*)())0x20000400,
+ (void (*)())&__stack_end,
mainFn,
};
This removes the hardcoded value for the stack start. Getting the address of a symbol created in the linker script is a bit convoluted. We need to declare an external variable (the type doesn’t really matter) and take its address.
The tests live in a run_tests function that uses serial_writebyte_wait and serial_writestr. The run_tests function is called just before the loop in mainFn().
The goal is to make this code run and not output an '!', even when the CPU is only soft reset. The C and C++ standards guarantee that global variables without an explicit initializer are zero-initialized, which on most architectures (including ARM) effectively means all bits zero:
char test_global;
void global_variable() {
if (test_global != 0) {
serial_writebyte_wait('!');
}
test_global = 'a';
serial_writebyte_wait(test_global);
test_global = 'b';
serial_writebyte_wait(test_global);
}
So for this to link we need an output section in the linker script that maps the .bss sections, which is where gcc places the variable, into our resulting linked program. This will not actually create anything in the ROM image. But the linker can now assign an address to the variable and resolve the relocations that refer to it in the code of the object file.
.bss : {
__bss_start = . ;
*(.bss*)
__bss_end = . ;
} > RAM
This again adds symbols for start and end. This time both are needed to fill the memory with the required zero bit pattern. This is done with a small c++ function:
void init_sram_sections() {
extern uint32_t __bss_start, __bss_end;
for (uint32_t* dst = &__bss_start; dst< &__bss_end; dst++) {
*dst = 0;
}
}
Again, to get pointers from the symbols there are extern variables. I used uint32_t here because that gives pointers that are useful for the initialization loop. Additionally I added a call to this function at the start of mainFn.
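The zeroing loop can also be tried on a host machine. The following is only a sketch: the function takes the bounds as parameters instead of reading the linker symbols, and zero_words and zero_words_demo are names invented for this illustration.

```cpp
#include <cstdint>

// Zero every 32-bit word in [start, end). On the target, start and end
// would be &__bss_start and &__bss_end from the linker script.
void zero_words(uint32_t* start, uint32_t* end) {
    for (uint32_t* dst = start; dst < end; ++dst) {
        *dst = 0;
    }
}

// Hypothetical host-side check: fill a buffer with junk, clear the middle
// and verify that the words on both sides are left alone.
bool zero_words_demo() {
    uint32_t fake_ram[6] = {0xdeadbeef, 1, 2, 3, 4, 0xdeadbeef};
    zero_words(&fake_ram[1], &fake_ram[5]);
    return fake_ram[0] == 0xdeadbeef && fake_ram[1] == 0 &&
           fake_ram[2] == 0 && fake_ram[3] == 0 && fake_ram[4] == 0 &&
           fake_ram[5] == 0xdeadbeef;
}
```

One assumption worth stating: because the loop writes whole words, it relies on both bounds being 4-byte aligned. A `. = ALIGN(4);` in front of `__bss_end = .;` in the linker script would be one way to guarantee that for the end.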
By default gcc allows multiple definitions of variables in C mode (contrary to the C standard; newer gcc releases, starting with GCC 10, default to -fno-common instead). To do that it uses common sections, which apart from their built-in weak semantics are similar to .bss. To test and support that we need a C test file and a little addition to the linker script.
.bss : {
__bss_start = . ;
*(.bss*)
+ *(COMMON)
__bss_end = . ;
} > RAM
And in a new file test-language-features-c.c:
#include "serial.h"
char test_common;
void run_tests_c(void) {
if (test_common != 0) {
serial_writebyte_wait('!');
}
}
While it works with the linker script addition, I prefer to disable that extension in the makefile, because we should be writing correct code anyway:
CFLAGS += -fno-common
Next up is support for initialized globals:
+char test_init = 'c';
+
void global_variable() {
if (test_global != 0) {
serial_writebyte_wait('!');
}
test_global = 'a';
serial_writebyte_wait(test_global);
test_global = 'b';
serial_writebyte_wait(test_global);
+
+ serial_writebyte_wait(test_init);
+ test_init = 'd';
+ serial_writebyte_wait(test_init);
}
For this to work, the memory location where test_init is located at runtime needs to be initialized before starting the main part of the program. When the CPU is reset, the program is only a flash image. So we need the initialization values somewhere in flash. And they need to be copied to RAM early by the reset handler.
For this reason the linker allows us to have two locations for a section. One for an initialization image that is actually included in the flash image, and one for use at run time that is located at a different address in RAM.
.data : {
*(.data*)
. = ALIGN(4);
} > RAM AT>FLASH
__data_start_flash = LOADADDR(.data);
__data_start_ram = ADDR(.data);
__data_size = SIZEOF(.data);
Here we have LOADADDR, which returns the address of the initialization image, and ADDR, which returns the run time location. Together with the size of the data section that’s enough to extend init_sram_sections to copy the initialization image into the correct RAM location:
extern uint32_t __data_start_flash, __data_start_ram, __data_size;
uint32_t *src = &__data_start_flash;
uint32_t *dst = &__data_start_ram;
// __data_size is a byte count, but dst steps in 32-bit words
uint32_t *dend = dst + ((uint32_t)&__data_size) / sizeof(uint32_t);
while (dst < dend) {
*dst++ = *src++;
}
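One detail in this copy loop is easy to get wrong: SIZEOF(.data) is a byte count, while pointer arithmetic on a uint32_t pointer counts in words. The following host-side sketch shows the conversion. It passes the three values as parameters instead of using the linker symbols, and copy_words and copy_words_demo are names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Copy size_bytes bytes from src to dst in 32-bit words. On the target the
// arguments would come from __data_start_flash, __data_start_ram and
// __data_size.
void copy_words(const uint32_t* src, uint32_t* dst, std::size_t size_bytes) {
    // The linker script pads .data with ALIGN(4), so the size is expected
    // to be a multiple of the word size.
    assert(size_bytes % sizeof(uint32_t) == 0);
    // Pointer arithmetic on uint32_t* counts words, so convert from bytes.
    uint32_t* dend = dst + size_bytes / sizeof(uint32_t);
    while (dst < dend) {
        *dst++ = *src++;
    }
}

// Hypothetical check with a canary word directly behind the destination.
bool copy_words_demo() {
    const uint32_t flash_image[3] = {0x11111111, 0x22222222, 0x33333333};
    uint32_t ram[4] = {0, 0, 0, 0xcafef00d};
    copy_words(flash_image, ram, sizeof(flash_image));
    return ram[0] == 0x11111111 && ram[1] == 0x22222222 &&
           ram[2] == 0x33333333 && ram[3] == 0xcafef00d;
}
```

Without the division the loop would copy four times as much as intended and overwrite whatever follows .data in RAM, which the canary word in the demo would catch.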
OK, that’s the basics. Using only the language features enabled at this point we could build quite a bit without needing to resort to strange workarounds.
But there are a few more things that are easy to enable and useful. Constructors are certainly one. Having things run implicitly during startup might not be a good idea if we are not careful. But maybe it is if we are. One of the classic problems is that ordering is not well defined for C++ constructors. But that's just standard C++. The GNU toolchain supports constructors with explicit priorities.
Depending on your preferences this could be a nice feature to decouple various parts of code or a pure horror show. Let’s assume we don’t abuse it and go on to add support.
The syntax that is independent of classes (available in C and C++) is:
__attribute__((constructor (202)))
static void constructor2() {
serial_writebyte_wait('2');
}
static void constructor1() __attribute__((constructor (200)));
static void constructor1() {
serial_writebyte_wait('0');
}
Both syntaxes work. The priority can be from 0 to 65535. The documentation states that 0 to 100 are reserved. So let’s avoid those.
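The ordering can be observed on a host system as well. This is a sketch that assumes GCC or Clang on a hosted ELF platform such as Linux, where the C runtime already runs the constructor functions before main. Instead of writing to the serial port, the constructors append to a log buffer. All names are invented for this example.

```cpp
#include <cstddef>

// Log of constructor tags in execution order. It is zero-initialized at
// load time, before any constructor function runs.
static char ctor_log[8];
static std::size_t ctor_count;

static void log_ctor(char tag) {
    if (ctor_count < sizeof(ctor_log) - 1) {
        ctor_log[ctor_count++] = tag;
    }
}

// Defined first, but with the higher priority number, so it runs second.
__attribute__((constructor(202)))
static void late_constructor() {
    log_ctor('2');
}

__attribute__((constructor(200)))
static void early_constructor() {
    log_ctor('0');
}

// Returns the tags in the order the constructors were executed.
const char* constructor_order() {
    return ctor_log;
}
```

Called from main, constructor_order() should yield the tag of the priority 200 constructor first, even though it is defined second in the file.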
The way this works is simple. Pointers to the constructor functions are generated into .init_array.XXXXX sections, where XXXXX is the zero padded priority. We just need to arrange for them to be called on startup in the right order. Fortunately the other special support for this in the toolchain is that the linker can sort these sections in numeric order using SORT_BY_INIT_PRIORITY, so getting a list in the right order is easy in the linker script:
.init_array : {
__init_array_start = .;
KEEP(*(SORT_BY_INIT_PRIORITY(.init_array.*)))
__init_array_end = .;
} > FLASH
/* .fini_array omitted because when would we run destructors? */
As usual there’s a start and end symbol to allow for iteration in the startup code. The same infrastructure is available for destructors instead of constructors. But as the comment in the linker script says, there is usually no clear time to run those on embedded microcontrollers. So I left out support for this.
The code to run the constructors is similar to the code for memory initialization. It just iterates over function pointers and calls them this time:
void run_init_data() {
typedef void (*init_fun)(void);
extern init_fun __init_array_start, __init_array_end;
init_fun *ptr = &__init_array_start;
while (ptr < &__init_array_end) {
(*ptr)();
++ptr;
}
}
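The same loop can be exercised on a host by handing it an ordinary array of function pointers in place of the linker-provided range. This is only a sketch: run_init_range, the two init functions and the array are stand-ins invented for this example.

```cpp
#include <cstddef>

typedef void (*init_fun)(void);

static char call_log[8];
static std::size_t call_count;

static void first_init()  { call_log[call_count++] = 'a'; }
static void second_init() { call_log[call_count++] = 'b'; }

// Same loop as on the target, with the bounds passed in rather than taken
// from __init_array_start and __init_array_end.
void run_init_range(const init_fun* begin, const init_fun* end) {
    for (const init_fun* ptr = begin; ptr < end; ++ptr) {
        (*ptr)();
    }
}

// Hypothetical stand-in for the sorted .init_array contents.
const char* run_init_demo() {
    static const init_fun fake_init_array[] = {first_init, second_init};
    call_count = 0;
    run_init_range(fake_init_array, fake_init_array + 2);
    call_log[call_count] = '\0';
    return call_log;
}
```

The functions are called strictly in array order, which is why the sorting done by the linker script is all that is needed to honor the priorities.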
Another way to use this for classes is the init_priority attribute.
Some_Class B __attribute__ ((init_priority (543)));
This works exactly like the examples above from the linker and runtime support perspective. But there’s one special case. Constructors of global variables without priority are generated into the section .init_array
(without a dot and a priority). To support this case we need a tiny addition to the linker script:
.init_array : {
__init_array_start = .;
KEEP(*(SORT_BY_INIT_PRIORITY(.init_array.*)))
+ KEEP(*(.init_array*))
__init_array_end = .;
} > FLASH
The default linker script of the toolchain places constructors without explicit priority after all those with a priority, which seems sensible, so I kept that ordering.
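The resulting order for global objects can be checked with a small host program. This sketch assumes g++ or clang++ on a hosted ELF platform such as Linux with its default linker script. Tracer and the object names are invented for this example.

```cpp
#include <cstddef>

// Log of construction order, zero-initialized before any constructor runs.
static char init_log[8];
static std::size_t init_count;

// Hypothetical class whose constructor records a tag.
struct Tracer {
    explicit Tracer(char tag) {
        if (init_count < sizeof(init_log) - 1) {
            init_log[init_count++] = tag;
        }
    }
};

// No priority: expected to be constructed after the prioritized objects.
Tracer plain_object{'p'};
// A lower number means earlier construction, regardless of definition order.
Tracer late_object __attribute__((init_priority(2000))) {'l'};
Tracer early_object __attribute__((init_priority(543))) {'e'};

const char* init_order_log() {
    return init_log;
}
```

Under these assumptions the log should read e, l, p: the two prioritized objects in numeric order, followed by the object without a priority.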
Now that we have support for constructors in globals, what about destructors?
As I wrote earlier, global destructors are not really useful in most embedded environments. But they are often needed when the same class is used in automatic or dynamically allocated storage. So I decided to ignore them in the global context. While there is a .fini_array, g++ doesn't use it for destructors. Instead it uses an ARM ABI defined variant of atexit that is called from the function in the .init_array* section that does the construction. Likely to be sure that the destructor ordering is always the exact reverse of the constructors. And because the ARM C++ ABI specifies this.
While the micro variant of newlib shipped with the toolchain I use has an implementation of __aeabi_atexit, just ignoring the calls is easier done with an empty function. And it uses less code space. For reasons that are only relevant in environments using shared libraries, the __aeabi_atexit function has an additional argument that is passed the address of the symbol __dso_handle, which is used to identify the shared object from where the call was made. As the value of the symbol doesn’t matter, we can just define that symbol to 0 in the linker script.
__dso_handle = 0;
and then implement an empty __aeabi_atexit:
// No atexit, no destructors
// actual signature is int __aeabi_atexit(void* object, void (*destroyer)(void*), void* dso_handle)
extern "C" void __aeabi_atexit() {
}
I used the wrong signature here to save a few bytes, because the return value is ignored anyway and all parameters fit into registers. This of course is a questionable gain for relying on such low level details. But sometimes I just can’t help myself from doing stupid micro optimisations.
In C++ we can have static variables in functions that are initialized by code run when the function is executed for the first time. Of course, as we expect from a good general purpose toolchain, this is done in a thread safe manner. Nice. Well, that of course needs runtime support and is totally unneeded in a single threaded design with interrupts. Well, if a function that uses such a static is called from an interrupt, the protection would be needed, and it would have to disable the interrupt. If you want to use statics in interrupts you have to reimplement __cxa_guard_acquire and __cxa_guard_release anyway. For single threaded use without usage in interrupts there is an easier way:
CXXFLAGS += -fno-threadsafe-statics
This simply disables the thread safety, so the feature has to be used with care. But when implementing code that needs to be run from an interrupt, being careful is always needed anyway.
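For reference, this is the kind of code the flag affects. The sketch uses an invented Counter class and runs on a host as well as on the target.

```cpp
// Counts how often a Counter is constructed.
static int construct_count = 0;

// Hypothetical class with a constructor that has a visible side effect, so
// the compiler has to initialize the static at run time.
struct Counter {
    int value;
    Counter() : value(0) { ++construct_count; }
};

// The static is constructed on the first call only. With thread-safe
// statics g++ wraps that first-time initialization in calls to
// __cxa_guard_acquire and __cxa_guard_release. With -fno-threadsafe-statics
// it only checks and sets a guard flag.
int next_value() {
    static Counter counter;
    return ++counter.value;
}
```

Nothing is constructed before the first call, and later calls reuse the same object. If next_value were called both from the main loop and from an interrupt, the unguarded first-time initialization could run twice, which is exactly the case to avoid when the flag is set.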
For full code see here