Exploring Startup Implementations: Newlib (ARM)

Updated: 20190909

For most programmers, a C or C++ program’s life begins at the main function. They are blissfully unaware of the hidden steps that happen between invoking a program and executing main. Depending on the program and the compiler, there are all kinds of interesting functions that get run before main, automatically inserted by the compiler and linker and invisible to casual observers.

Unfortunately for programmers who are curious about the program startup process, the literature on what happens before main is quite sparse.

Embedded Artistry has been hard at work creating a C++ embedded framework. The final piece of the puzzle was implementing program startup code. To aid in the design of our framework’s boot process, I performed an exploratory survey of existing program startup implementations. My goal is to identify a general program startup model. I also want to provide a more comprehensive look into how our programs get to main.

In this six-part series, we will be investigating what it takes to get to main:

A General Overview of What Happens Before main()

Exploring Startup Implementations: Newlib (ARM)

Exploring Startup Implementations: OS X

Exploring Startup Implementations: Custom Embedded System with ThreadX

Abstracting a Generic Flow for Getting to main

Implementing our Generic Startup Flow

Now that we have a high-level understanding of how our programs get to main, we can explore real-world implementations of program startup code.

Today’s analysis focuses on Newlib. If you build embedded applications for ARM using the GNU arm-none-eabi toolchain, your program is linked with Newlib startup code by default. Newlib supports multiple architectures, but we will focus exclusively on the ARM startup path.

If you are interested in exploring Newlib startup routines on your own, you can download the Newlib source code or browse the source code online.

The boot flow is quite complicated, and it’s easy to get mentally lost. You can refer to the Visual Summary throughout the article for a visual representation of the startup procedure and call stack.

Table of Contents:

ARM Procedure Call Standard

System Configuration

Initial Exploration

Boot Path

_start Disassembly

nRF52 Initial Boot

Load from Flash to RAM

Optional: Clear .bss

SystemInit

Call _start

IRQ Handlers

nRF52 System Initialization

Newlib ARM Startup

crt0.s

Stack Setup

Initialize .bss

Target-Specific Initialization

argc and argv Initialization

Call Global Constructors

__libc_init_array

__libc_fini_array

Heap Limit and malloc

atexit Family

atexit

__cxa_atexit

__register_exitproc

Automatic Registration of Destructors

exit Family

exit

_exit

__call_exitprocs

_kill

Visual Summary

Startup Activity Checklist

Further Reading

ARM Procedure Call Standard

Since we are going to look at ARM assembly, we will need to familiarize ourselves with the basics of the Procedure Call Standard for ARM Applications.

There are sixteen 32-bit registers and a status register (CPSR) in the ARM and Thumb instruction sets:

r0 (aka a1) is Argument register 1 and a result register

r1 (aka a2) is Argument register 2 and a result register

r2 (aka a3) is Argument register 3

r3 (aka a4) is Argument register 4

r4 (aka v1) is Variable register 1

r5 (aka v2) is Variable register 2

r6 (aka v3) is Variable register 3

r7 (aka v4) is Variable register 4

r8 (aka v5) is Variable register 5

r9 usage changes depending on the platform

r10 (aka v7) is Variable register 7

r11 (aka v8) is Variable register 8

r12 is the IP special purpose register (intra-procedure-call scratch register)

r13 is the SP special register (stack pointer)

r14 is the LR special register (link register)

r15 is the PC special register (program counter)

The standard says the following for the argument registers (r0–r3):

The first four registers r0-r3 (a1-a4) are used to pass argument values into a subroutine and to return a result value from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls).

We have multiple registers to hold the value of local variables:

Typically, the registers r4-r8, r10 and r11 (v1-v5, v7 and v8) are used to hold the values of a routine’s local variables. Of these, only v1-v4 can be used uniformly by the whole Thumb instruction set, but the AAPCS does not require that Thumb code only use those registers.

We must preserve specific registers when calling functions:

A subroutine must preserve the contents of the registers r4-r8, r10, r11 and SP (and r9 in PCS variants that designate r9 as v6)

ARM specifies that the stack pointer (SP) must always be aligned to a word boundary (i.e., sp % 4 == 0). For public interfaces, the stack must be aligned to a double-word boundary (i.e., sp % 8 == 0).

The least significant bit of a function address is an ARM/Thumb flag (1 == Thumb, 0 == ARM). This bit is set by the linker.
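The alignment and interworking rules above can be expressed as a few small predicates. This is only an illustrative sketch: the function names are hypothetical and are not part of the AAPCS or of any library. It relies on the fact that, on ARM, a set bit 0 in a branch target selects Thumb state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of a function address is the instruction-set flag: set means Thumb. */
bool is_thumb_target(uint32_t function_address)
{
    return (function_address & 1u) != 0u;
}

/* The address the core actually fetches from, with the flag bit cleared. */
uint32_t instruction_address(uint32_t function_address)
{
    return function_address & ~UINT32_C(1);
}

/* The stack pointer must always be aligned to a word boundary. */
bool sp_is_word_aligned(uint32_t sp)
{
    return (sp % 4u) == 0u;
}

/* At a public interface, the stack must be aligned to a double-word boundary. */
bool sp_is_public_interface_aligned(uint32_t sp)
{
    return (sp % 8u) == 0u;
}
```

For example, a function pointer holding 0x2CF refers to Thumb code located at 0x2CE.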

When we want to call a subroutine, we need to preserve the current function’s persistent registers on the stack, store the return address in the LR register (so we know how to get back from our function), and change the PC to the subroutine address. ARM provides branching instructions which handle this process for us (e.g., bl, blx, bx), although the process may still be performed manually.

Now, there are many details that we did not cover, but this basic overview provides enough details to understand some of the assembly that we will be analyzing. Particularly important to keep in mind: values put into r0–r3 represent arguments to functions, and values put into r4–r11 represent variables used in our current subroutine.

System Configuration

For this exploration, I used a Nordic nRF52840 Development Kit. The development kit has several examples provided by Nordic; I used the blinky program. I compiled and linked the program with the GNU ARM toolchain (version 8-2018-q4-major). The Nordic blinky program links against the Newlib libraries provided by the GNU ARM toolchain.

Because this is a Cortex-M processor, the program is compiled entirely in Thumb mode. Even so, we will discuss some aspects of the boot process that only apply to processors that execute ARM instructions, such as the Cortex-A family.

Initial Exploration

Before we start blindly looking through the Newlib code base, we should do some initial exploration with our debugger as described in the last article.

To begin the investigation, I compiled the blinky example for the nRF52840 Development Kit (PCA10056 in the SDK parlance) in the “blank” configuration using the armgcc Makefile. I flashed the binary to the board with the nRF Connect Programmer.

First, let’s start with a backtrace from main in an example program so we can see what code is run. Then we will look at the disassembly for the _start function that is provided by Newlib.

Boot Path

To investigate the path our program takes to get to main, we’ll use gdb. The nRF52 DK has a USB connection with an on-board debugging chip. I fired up a J-Link gdb server and connected to my board using arm-none-eabi-gdb.

Once the board is connected, we load the symbols for our application:

    (gdb) file _build/nrf52840_xxaa.out
    A program is being debugged already.
    Are you sure you want to change the file? (y or n) y
    Reading symbols from _build/nrf52840_xxaa.out...

Set the breakpoint for main:

    (gdb) b main
    Breakpoint 1 at 0x380: file ../../../main.c, line 62.

Enable backtraces to extend past main:

    (gdb) set backtrace past-main on

Then restart and run the program:

    (gdb) mon reset
    Resetting target
    (gdb) c
    Continuing.

    Breakpoint 1, main () at ../../../main.c:62
    62 bsp_board_init(BSP_INIT_LEDS);

Our initial backtrace shows a corrupt frame prior to _start:

    (gdb) bt
    #0 main () at ../../../main.c:62
    #1 0x0000028e in _start ()
    Backtrace stopped: previous frame inner to this frame (corrupt stack?)

This can happen when the _start routine is messing with stacks or frame pointers to set up the program according to the library and ABI requirements. We can confirm this by setting a breakpoint at _start and re-starting the program. This will allow us to look at the state of the program before stack modifications.

    (gdb) b _start
    Breakpoint 2 at 0x258
    (gdb) mon reset
    Resetting target
    (gdb) c
    Continuing.

    Breakpoint 2, 0x00000258 in _start ()
    (gdb) bt
    #0 0x00000258 in _start ()
    #1 0x000002ce in Reset_Handler () at ../../../../../../modules/nrfx/mdk/gcc_startup_nrf52840.S:280

Our program receives control at the Reset_Handler function in our processor’s startup code. This is expected for an embedded platform, since the processor loads our program from memory and begins execution at the reset vector address.

Now we know that there are two areas to investigate for startup, and gdb helpfully provided the path to the gcc_startup_nrf52840.S file, which is where our investigation of the source code will begin.

_start Disassembly

Before we dive into the source code, let’s look at the disassembly for the _start function with gdb.

    (gdb) disass /m _start
    Dump of assembler code for function _start:
    0x00001240 <+0>: ldr r3, [pc, #84] ; (0x1298 <_start+88>)
    0x00001242 <+2>: cmp r3, #0
    0x00001244 <+4>: it eq
    0x00001246 <+6>: ldreq r3, [pc, #76] ; (0x1294 <_start+84>)
    0x00001248 <+8>: mov sp, r3
    0x0000124a <+10>: sub.w r10, r3, #65536 ; 0x10000
    0x0000124e <+14>: movs r1, #0
    0x00001250 <+16>: mov r11, r1
    0x00001252 <+18>: mov r7, r1
    0x00001254 <+20>: ldr r0, [pc, #76] ; (0x12a4 <_start+100>)
    0x00001256 <+22>: ldr r2, [pc, #80] ; (0x12a8 <_start+104>)
    0x00001258 <+24>: subs r2, r2, r0
    0x0000125a <+26>: bl 0x330c <memset>
    0x0000125e <+30>: ldr r3, [pc, #60] ; (0x129c <_start+92>)
    0x00001260 <+32>: cmp r3, #0
    0x00001262 <+34>: beq.n 0x1266 <_start+38>
    0x00001264 <+36>: blx r3
    0x00001266 <+38>: ldr r3, [pc, #56] ; (0x12a0 <_start+96>)
    0x00001268 <+40>: cmp r3, #0
    0x0000126a <+42>: beq.n 0x126e <_start+46>
    0x0000126c <+44>: blx r3
    0x0000126e <+46>: movs r0, #0
    0x00001270 <+48>: movs r1, #0
    0x00001272 <+50>: movs r4, r0
    0x00001274 <+52>: movs r5, r1
    0x00001276 <+54>: ldr r0, [pc, #52] ; (0x12ac <_start+108>)
    0x00001278 <+56>: cmp r0, #0
    0x0000127a <+58>: beq.n 0x1282 <_start+66>
    0x0000127c <+60>: ldr r0, [pc, #48] ; (0x12b0 <_start+112>)
    0x0000127e <+62>: nop.w
    0x00001282 <+66>: bl 0x32b4 <__libc_init_array>
    0x00001286 <+70>: movs r0, r4
    0x00001288 <+72>: movs r1, r5
    0x0000128a <+74>: bl 0x1554 <main>
    0x0000128e <+78>: bl 0x3268 <exit>
    0x00001292 <+82>: nop
    0x00001294 <+84>: movs r0, r0
    0x00001296 <+86>: movs r0, r1
    0x00001298 <+88>: movs r0, r0
    0x0000129a <+90>: movs r0, #4
    0x0000129c <+92>: movs r0, r0
    0x0000129e <+94>: movs r0, r0
    0x000012a0 <+96>: movs r0, r0
    0x000012a2 <+98>: movs r0, r0
    0x000012a4 <+100>: lsls r0, r4, #3
    0x000012a6 <+102>: movs r0, #0
    0x000012a8 <+104>: lsls r4, r4, #10
    0x000012aa <+106>: movs r0, #0
    0x000012ac <+108>: movs r0, r0
    0x000012ae <+110>: movs r0, r0
    0x000012b0 <+112>: movs r0, r0
    0x000012b2 <+114>: movs r0, r0

Disassembly Highlights

We won’t reconstruct the entire process from disassembly, but we can quickly note some highlights.

First, the routine sets up the stack pointer using the r3 register:

    0x00001248 <+8>: mov sp, r3

The Newlib _start function handles initializing the .bss section contents (which holds uninitialized global and static data) to 0. Note the call to memset: r1 holds the value we are setting (0); r0 holds the start address of the .bss section; r2 is loaded with the end address of the .bss section, and then the start address is subtracted from it, giving us the size of the section.

    0x0000124e <+14>: movs r1, #0
    [...]
    0x00001254 <+20>: ldr r0, [pc, #76] ; (0x12a4 <_start+100>)
    0x00001256 <+22>: ldr r2, [pc, #80] ; (0x12a8 <_start+104>)
    0x00001258 <+24>: subs r2, r2, r0
    0x0000125a <+26>: bl 0x330c <memset>

From the disassembly, I don’t immediately understand what’s happening after memset, but I do notice some function calls (blx instructions). I’m also guessing that _start initializes argc and argv to 0, then preserves those in r4–r5. Looking at the commented and non-optimized source will clarify this part of the process.

I do recognize the next function call, which is conveniently named. This call will initialize the global constructors:

    0x00001282 <+66>: bl 0x32b4 <__libc_init_array>

After we’ve called the global constructors, we put the (presumed) argc and argv values into our argument registers, and then call main:

    0x00001286 <+70>: movs r0, r4
    0x00001288 <+72>: movs r1, r5
    0x0000128a <+74>: bl 0x1554 <main>

Since the r0 register holds the value that main returns, we can invoke exit without needing to modify the argument registers:

    0x0000128e <+78>: bl 0x3268 <exit>

The assembly instructions following exit are a mystery to me from this view. Let’s see what the source investigation reveals.

nRF52840 Boot

Our backtrace showed us that our journey begins in the Reset_Handler function in gcc_startup_nrf52840.S (found in the nRF SDK).

The file begins by providing for stack storage:

        .section .stack
    #if defined(__STARTUP_CONFIG)
        .align __STARTUP_CONFIG_STACK_ALIGNEMENT
        .equ Stack_Size, __STARTUP_CONFIG_STACK_SIZE
    #elif defined(__STACK_SIZE)
        .align 3
        .equ Stack_Size, __STACK_SIZE
    #else
        .align 3
        .equ Stack_Size, 8192
    #endif
        .globl __StackTop
        .globl __StackLimit
    __StackLimit:
        .space Stack_Size
        .size __StackLimit, . - __StackLimit
    __StackTop:
        .size __StackTop, . - __StackTop

There are also provisions for heap storage:

        .section .heap
        .align 3
    #if defined(__STARTUP_CONFIG)
        .equ Heap_Size, __STARTUP_CONFIG_HEAP_SIZE
    #elif defined(__HEAP_SIZE)
        .equ Heap_Size, __HEAP_SIZE
    #else
        .equ Heap_Size, 8192
    #endif
        .globl __HeapBase
        .globl __HeapLimit
    __HeapBase:
        .if Heap_Size
        .space Heap_Size
        .endif
        .size __HeapBase, . - __HeapBase
    __HeapLimit:
        .size __HeapLimit, . - __HeapLimit

This file also contains a declaration of all interrupt vectors and their associated handlers. A small sample is shown:

        .section .isr_vector
        .align 2
        .globl __isr_vector
    __isr_vector:
        .long __StackTop /* Top of Stack */
        .long Reset_Handler
        .long NMI_Handler
        .long HardFault_Handler
        .long MemoryManagement_Handler
        .long BusFault_Handler
        .long UsageFault_Handler
        /// ...
        .size __isr_vector, . - __isr_vector

We then find the declaration of Reset_Handler:

        .text
        .thumb
        .thumb_func
        .align 1
        .globl Reset_Handler
        .type Reset_Handler, %function
    Reset_Handler:

Load from Flash to RAM

First, the reset handler copies data from flash to RAM.

The data is copied from the address of the __etext symbol, which represents the end of the .text section in flash storage. The data is copied to the address indicated by the __data_start__ symbol. The number of bytes to copy is calculated by subtracting the __data_start__ address from __bss_start__, which marks the beginning of the next section. As the nRF startup code explains, __bss_start__ is used (rather than the end of the .data section) so users can insert their own initialized data section before the .bss section. With this logic, the extra section is copied to RAM without any changes to the startup code.

        ldr r1, =__etext
        ldr r2, =__data_start__
        ldr r3, =__bss_start__

        subs r3, r3, r2
        ble .L_loop1_done

    .L_loop1:
        subs r3, r3, #4
        ldr r0, [r1,r3]
        str r0, [r2,r3]
        bgt .L_loop1
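For readers who find the loop easier to follow in C, here is a rough equivalent of the copy above. It is a sketch under stated assumptions: copy_data_section and its parameters are hypothetical names, and ordinary arrays stand in for the flash and RAM regions that the linker symbols describe. Note that the copy runs from the last word to the first, because the loop counts the remaining byte count down to zero.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy size_bytes (assumed to be a multiple of 4) from load_addr to run_addr,
 * last word first, mirroring the .L_loop1 loop in the reset handler.
 */
void copy_data_section(const uint32_t *load_addr, uint32_t *run_addr,
                       ptrdiff_t size_bytes)
{
    ptrdiff_t remaining = size_bytes;  /* r3 = __bss_start__ - __data_start__ */

    if (remaining <= 0) {              /* ble .L_loop1_done */
        return;
    }

    do {
        remaining -= 4;                /* subs r3, r3, #4 */
        /* ldr r0, [r1,r3] followed by str r0, [r2,r3] */
        run_addr[remaining / 4] = load_addr[remaining / 4];
    } while (remaining > 0);           /* bgt .L_loop1 */
}
```

A size of zero (or less) skips the loop entirely, which matches the ble check before the loop label.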

Optional: Clear .bss

Once the .data section contents are copied to RAM, there is an optional step for initializing the .bss section contents to 0. In our case, this code is not compiled. Newlib handles .bss initialization.

    .L_loop1_done:

    /* This part of work usually is done in C library startup code. Otherwise,
     * define __STARTUP_CLEAR_BSS to enable it in this startup. This section
     * clears the RAM where BSS data is located.
     *
     * The BSS section is specified by following symbols
     * __bss_start__: start of the BSS section.
     * __bss_end__: end of the BSS section.
     *
     * All addresses must be aligned to 4 bytes boundary.
     */
    #ifdef __STARTUP_CLEAR_BSS
        ldr r1, =__bss_start__
        ldr r2, =__bss_end__

        movs r0, 0

        subs r2, r2, r1
        ble .L_loop3_done

    .L_loop3:
        subs r2, r2, #4
        str r0, [r1, r2]
        bgt .L_loop3

    .L_loop3_done:
    #endif /* __STARTUP_CLEAR_BSS */

SystemInit

Before invoking the C runtime startup routine, a SystemInit function is called. This function, which we will look at next, is responsible for initializing the processor and applying behavioral fixes for relevant errata.

        bl SystemInit

Call _start

Once the processor is initialized, we call the _start function to initialize the C runtime. Note that the nRF startup code allows you to define a custom entry point with a compiler definition.

    /* Call _start function provided by libraries.
     * If those libraries are not accessible, define __START as your entry point.
     */
    #ifndef __START
    #define __START _start
    #endif
        bl __START

IRQ Handlers

The gcc_startup_nrf52840.S file also contains dummy exception handler function definitions. For example:

        .weak NMI_Handler
        .type NMI_Handler, %function
    NMI_Handler:
        b .
        .size NMI_Handler, . - NMI_Handler

        .weak HardFault_Handler
        .type HardFault_Handler, %function
    HardFault_Handler:
        b .
        .size HardFault_Handler, . - HardFault_Handler

A default handler is declared, which performs an infinite loop:

        .globl Default_Handler
        .type Default_Handler, %function
    Default_Handler:
        b .
        .size Default_Handler, . - Default_Handler

All other IRQ handlers are mapped to this default handler. Because the symbols are weak, users can override these handlers with their own implementations as needed.

        .macro IRQ handler
        .weak \handler
        .set \handler, Default_Handler
        .endm

        IRQ POWER_CLOCK_IRQHandler
        IRQ RADIO_IRQHandler
        IRQ UARTE0_UART0_IRQHandler
        /// ...

After the IRQ handlers are supplied, the file ends:

        .end

nRF52 System Initialization

The SystemInit function is implemented in system_nrf52840.c (found in the nRF SDK). For a normal application, this file would be modified to suit the platform’s requirements. We’ll look at the default implementation for our processor.

First, SWO trace functionality is enabled in the processor. If ENABLE_SWO is not defined, the pin is left as normal GPIO.

    #if defined (ENABLE_SWO)
        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
        NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Serial << CLOCK_TRACECONFIG_TRACEMUX_Pos;
        NRF_P1->PIN_CNF[0] = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) |
                             (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) |
                             (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);
    #endif

Next, Trace functionality is enabled in the processor. If ENABLE_TRACE is not defined, the pins are left as normal GPIO.

    #if defined (ENABLE_TRACE)
        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
        NRF_CLOCK->TRACECONFIG |= CLOCK_TRACECONFIG_TRACEMUX_Parallel << CLOCK_TRACECONFIG_TRACEMUX_Pos;
        NRF_P0->PIN_CNF[7] = (GPIO_PIN_CNF_DRIVE_H0H1 << GPIO_PIN_CNF_DRIVE_Pos) |
                             (GPIO_PIN_CNF_INPUT_Connect << GPIO_PIN_CNF_INPUT_Pos) |
                             (GPIO_PIN_CNF_DIR_Output << GPIO_PIN_CNF_DIR_Pos);
        // ... more pin configurations in the actual implementation
    #endif

Following debug configuration, the system checks for a variety of errata conditions and applies fixes as necessary. Here are a few examples:

    /* Workaround for Errata 98 "NFCT: Not able to communicate with the peer" */
    if (errata_98()){
        *(volatile uint32_t *)0x4000568Cul = 0x00038148ul;
    }

    /* Workaround for Errata 103 "CCM: Wrong reset value of CCM MAXPACKETSIZE" */
    if (errata_103()){
        NRF_CCM->MAXPACKETSIZE = 0xFBul;
    }

Following the errata section, the FPU is initialized if the program has been compiled with floating point support. The __FPU_USED macro is supplied by the compiler.

    #if (__FPU_USED == 1)
        SCB->CPACR |= (3UL << 20) | (3UL << 22);
        __DSB();
        __ISB();
    #endif
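The two shifted constants in the CPACR write are easier to read once they are named. On Cortex-M4 parts, the FPU is exposed as coprocessors CP10 and CP11, and a field value of 3 in each two-bit field grants full access. The sketch below only demonstrates the bit arithmetic on a plain integer: the macro and function names are hypothetical, and no hardware register is touched.

```c
#include <stdint.h>

/* CPACR bits [21:20] control CP10 and bits [23:22] control CP11. */
#define CPACR_CP10_FULL_ACCESS (UINT32_C(3) << 20)
#define CPACR_CP11_FULL_ACCESS (UINT32_C(3) << 22)

/* Return the CPACR value that results from granting full FPU access. */
uint32_t cpacr_with_fpu_enabled(uint32_t cpacr)
{
    return cpacr | CPACR_CP10_FULL_ACCESS | CPACR_CP11_FULL_ACCESS;
}
```

Starting from a register value of 0, the result is 0x00F00000, and any other bits that were already set are preserved by the OR.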

If NFC is not used for an nRF52 platform, the associated NFC pins are configured as normal GPIO.

    #if defined (CONFIG_NFCT_PINS_AS_GPIOS)
        if ((NRF_UICR->NFCPINS & UICR_NFCPINS_PROTECT_Msk) == (UICR_NFCPINS_PROTECT_NFC << UICR_NFCPINS_PROTECT_Pos)){
            NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
            while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
            NRF_UICR->NFCPINS &= ~UICR_NFCPINS_PROTECT_Msk;
            while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
            NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
            while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
            NVIC_SystemReset();
        }
    #endif

The nRF allows a GPIO to be configured as a reset pin. If CONFIG_GPIO_AS_PINRESET is defined, a dedicated GPIO will be configured to act as a reset pin.

    #if defined (CONFIG_GPIO_AS_PINRESET)
        if (((NRF_UICR->PSELRESET[0] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos)) ||
            ((NRF_UICR->PSELRESET[1] & UICR_PSELRESET_CONNECT_Msk) != (UICR_PSELRESET_CONNECT_Connected << UICR_PSELRESET_CONNECT_Pos))){
            NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Wen << NVMC_CONFIG_WEN_Pos;
            while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
            NRF_UICR->PSELRESET[0] = 18;
            while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
            NRF_UICR->PSELRESET[1] = 18;
            while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
            NRF_NVMC->CONFIG = NVMC_CONFIG_WEN_Ren << NVMC_CONFIG_WEN_Pos;
            while (NRF_NVMC->READY == NVMC_READY_READY_Busy){}
            NVIC_SystemReset();
        }
    #endif

Finally, the system clock is initialized:

    SystemCoreClockUpdate();

Newlib ARM Startup

After data has been relocated and the processor properly initialized, the reset handler calls the _start function. For our GCC ARM application, this function is supplied by Newlib.

The Newlib project is divided into two major parts: newlib and libgloss. The newlib portion is an implementation of libc and libm. The libgloss portion contains platform-specific code, such as startup files, board support packages, and I/O support for the C library.

When exploring the Newlib code base on your own, it is important to note the distinction between libgloss and newlib. The libgloss division happened after the inception of the Newlib project. Many of the same files are found in the newlib folder and the libgloss folder. For platform-specific code, you should prefer the libgloss implementations. These are newer, and the older implementations remain in the newlib folder for backwards compatibility with older targets.

crt0.s

The _start function for the ARM architecture is found in libgloss/arm/crt0.S.

The _start function is quite lengthy, so I will be providing highlights of the full implementation. The startup code presented below has also been simplified from the code found in crt0.S. The full implementation supports semi-hosting, where a debugger handles parts of the standard library functionality. I’ve removed the monitor-related code to simplify our current review.

Newlib implements a single runtime that supports both ARM and Thumb modes. This can be confusing, since not all operations apply to both modes. Because we are using a Cortex-M processor (the nRF52), the program is compiled entirely in Thumb mode. Some startup code only applies when ARM mode is enabled, and I will highlight this as best as I can.

The file opens with preprocessor definitions, logic for selecting the proper ARM/Thumb architecture, and a declaration of the _start function. The most important preprocessor entry for our current exploration is the HAVE_INITFINI_ARRAY selection logic:

    #ifdef HAVE_INITFINI_ARRAY
    #define _init __libc_init_array
    #define _fini __libc_fini_array
    #endif

When HAVE_INITFINI_ARRAY is defined, the _init and _fini function calls will be exchanged with __libc_init_array and __libc_fini_array respectively. This macro comes into play for us because our ARM program uses the .init_array and .fini_array sections.

We should also note an assembly macro which we will encounter in the startup code: indirect_call.

    .macro indirect_call reg
    #ifdef HAVE_CALL_INDIRECT
        blx \reg
    #else
        mov lr, pc
        mov pc, \reg
    #endif
    .endm

The indirect_call macro mimics blx behavior on architectures that do not support that instruction. It performs the call manually, as described in our summary of the ARM Procedure Call Standard: the return address is stored in lr, and then the target address is moved into pc.

We eventually reach the proper beginning of the _start function, which is aliased as _mainCRTStartup:

        FUNC_START _mainCRTStartup
        FUNC_START _start
    #if defined(__ELF__) && !defined(__USING_SJLJ_EXCEPTIONS__)
        /* Annotation for EABI unwinding tables. */
        .fnstart
    #endif

Stack Setup

The first order of business is to set up the stacks for the various ARM processor modes.

The linker script may provide the stack address with the __stack symbol, which is then made accessible to the assembly via .Lstack:

    .Lstack:
        .word __stack

The stack address is loaded and checked to make sure it is a non-zero value:

        ldr r3, .Lstack
        cmp r3, #0

If the __stack symbol is not defined, the alternate value provided in the .LC0 variable is used instead:

    #ifdef __thumb2__
        it eq
    #endif
    #ifdef THUMB1_ONLY
        bne .LC28
        ldr r3, .LC0
    .LC28:
    #else
        ldreq r3, .LC0
    #endif

Once the stack address is loaded into r3, we work through the various processor modes and set up stacks and stack limits. This operation only applies to programs compiled in ARM mode, because Thumb has no concept of user modes.

If the processor is already operating in user mode, or if Thumb mode is being used, this section is skipped. Our Cortex-M-based nRF52 only uses Thumb mode, so this section is skipped.

        /* Note: This 'mov' is essential when starting in User, and ensures we
           always get *some* sp value for the initial mode, even if we
           have somehow missed it below (in which case it gets the same
           value as FIQ - not ideal, but better than nothing.) */
        mov sp, r3
    #ifdef PREFER_THUMB
        /* XXX Fill in stack assignments for interrupt modes. */
    #else
        mrs r2, CPSR
        tst r2, #0x0F       /* Test mode bits - in User of all are 0 */
        beq .LC23           /* "eq" means r2 AND #0x0F is 0 */
        msr CPSR_c, #0xD1   /* FIRQ mode, interrupts disabled */
        mov sp, r3
        sub sl, sp, #0x1000 /* This mode also has its own sl (see below) */

        mov r3, sl
        msr CPSR_c, #0xD7   /* Abort mode, interrupts disabled */
        mov sp, r3
        sub r3, r3, #0x1000

        msr CPSR_c, #0xDB   /* Undefined mode, interrupts disabled */
        mov sp, r3
        sub r3, r3, #0x1000

        msr CPSR_c, #0xD2   /* IRQ mode, interrupts disabled */
        mov sp, r3
        sub r3, r3, #0x2000

        msr CPSR_c, #0xD3   /* Supervisory mode, interrupts disabled */
        mov sp, r3
        sub r3, r3, #0x8000 /* Min size 32k */
        bic r3, r3, #0x00FF /* Align with current 64k block */
        bic r3, r3, #0xFF00

        str r3, [r3, #-4]   /* Move value into user mode sp without */
        ldmdb r3, {sp}^     /* changing modes, via '^' form of ldm */
        orr r2, r2, #0xC0   /* Back to original mode, presumably SVC, */
        msr CPSR_c, r2      /* with FIQ/IRQ disable bits forced to 1 */
    #endif

Note that setting up each mode is currently not performed for Thumb code. Only the user mode stack is initialized for Thumb programs. That’s why we did not observe this setup code in our disassembly of _start.
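To make the ARM-mode stack carving easier to follow, the arithmetic in the assembly listing can be replayed in C. This is a model only: the mode_stacks structure and the carve_mode_stacks function are hypothetical names, the example address is arbitrary, and no processor modes are actually switched. Each stack pointer is taken from r3, which is then moved down by the size reserved for that mode.

```c
#include <stdint.h>

/* Stack pointer chosen for each ARM processor mode (illustrative model). */
struct mode_stacks {
    uint32_t fiq;
    uint32_t abort;
    uint32_t undefined;
    uint32_t irq;
    uint32_t supervisor;
    uint32_t user;
};

struct mode_stacks carve_mode_stacks(uint32_t stack_top)
{
    struct mode_stacks stacks;
    uint32_t r3 = stack_top;

    stacks.fiq = r3;         r3 -= 0x1000u;  /* 4 kB for FIQ */
    stacks.abort = r3;       r3 -= 0x1000u;  /* 4 kB for Abort */
    stacks.undefined = r3;   r3 -= 0x1000u;  /* 4 kB for Undefined */
    stacks.irq = r3;         r3 -= 0x2000u;  /* 8 kB for IRQ */
    stacks.supervisor = r3;  r3 -= 0x8000u;  /* at least 32 kB for Supervisor */

    /* The two bic instructions clear the low 16 bits (64 kB alignment). */
    r3 &= ~UINT32_C(0xFFFF);
    stacks.user = r3;

    return stacks;
}
```

With a stack top of 0x20040000, the model places the supervisor stack at 0x2003B000 and the user stack at 0x20030000.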

The last portion of the stack setup process puts an arbitrary stack limit in place. Unlike the __stack definition which is provided by the linker, the stack limit is an arbitrarily decided value of 64 kB. This may be problematic if we have a larger stack or if the stack runs into the heap.

    .LC23:
    #ifdef THUMB1_ONLY
        movs r2, #64
        lsls r2, r2, #10
        subs r2, r3, r2
        mov sl, r2
    #else
        sub sl, r3, #64 << 10 /* Still assumes 256bytes below sl */
    #endif
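The stack limit calculation is a single subtraction, and it corresponds to the sub.w r10, r3, #65536 instruction we saw in the _start disassembly (sl is another name for r10). A minimal sketch, using a hypothetical function name and an arbitrary example address:

```c
#include <stdint.h>

/* sl = stack top - (64 << 10), i.e. 64 kB below the initial stack pointer. */
uint32_t stack_limit_for(uint32_t stack_top)
{
    return stack_top - (UINT32_C(64) << 10);
}
```

The limit is derived only from the stack top. It does not take the stack size reserved by the platform's startup file or linker script into account.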

Initialize .bss

Once our stack is set up, the .bss section is cleared. The .bss section start and end addresses are made available through the .LC1 and .LC2 variables:

    .LC1:
        .word __bss_start__
    .LC2:
        .word __bss_end__

The arguments to memset are loaded into registers, and the size is calculated:

        /* Zero the memory in the .bss section. */
        movs a2, #0      /* Second arg: fill value */
        mov fp, a2       /* Null frame pointer */
        mov r7, a2       /* Null frame pointer for Thumb */

        ldr a1, .LC1     /* First arg: start of memory block */
        ldr a3, .LC2
        subs a3, a3, a1  /* Third arg: length of block */

Once the arguments are loaded, we call memset (and switch to Thumb mode if appropriate):

    #if __thumb__ && !defined(PREFER_THUMB)
        /* Enter Thumb mode.... */
        add a4, pc, #1   /* Get the address of the Thumb block */
        bx a4            /* Go there and start Thumb decoding */

        .code 16
        .global __change_mode
        .thumb_func
    __change_mode:
    #endif

        bl FUNCTION (memset)
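In C terms, the three argument registers map directly onto a memset call. The sketch below assumes a hypothetical clear_bss helper that takes the section boundaries as pointers. In the real startup code those boundaries come from the linker-provided __bss_start__ and __bss_end__ symbols.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Zero every byte in [bss_start, bss_end), as _start does for .bss. */
void clear_bss(uint8_t *bss_start, uint8_t *bss_end)
{
    size_t length = (size_t)(bss_end - bss_start);  /* a3 = end - start */

    memset(bss_start, 0, length);                   /* a1 = start, a2 = 0 */
}
```

Memory outside the range is left untouched, so the helper can be exercised against the middle of an ordinary buffer.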

Target-Specific Initialization

Once the .bss section is cleared, optional target-specific early initialization is performed.

The startup code supports two weakly-linked functions:

        .weak FUNCTION (hardware_init_hook)
        .weak FUNCTION (software_init_hook)

They are weakly-linked because they are optional. If a platform does not require this functionality, the functions will not be defined and a value of 0 will be loaded for the variable. These functions are made available via the .Lhwinit and .Lswinit variables:

    .Lhwinit:
        .word FUNCTION (hardware_init_hook)
    .Lswinit:
        .word FUNCTION (software_init_hook)

The startup code checks whether these functions are defined, and calls them if they are.

        ldr r3, .Lhwinit
        cmp r3, #0
        beq .LC24
        indirect_call r3
    .LC24:
        ldr r3, .Lswinit
        cmp r3, #0
        beq .LC25
        indirect_call r3
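Because an undefined weak symbol resolves to 0, each hook behaves like a function pointer that may be null. The following sketch models the check-then-call pattern with ordinary pointers so that it stays portable. All of the names here (init_hook_fn, call_if_defined, run_init_hooks, example_hook) are hypothetical and exist only for illustration.

```c
#include <stddef.h>

typedef void (*init_hook_fn)(void);

/* Counts how many times the example hook below has run. */
int example_hook_calls;

/* Stand-in for a platform that does provide one of the hooks. */
void example_hook(void)
{
    example_hook_calls++;
}

/* Mirrors "cmp r3, #0; beq skip; indirect_call r3". Returns 1 if called. */
int call_if_defined(init_hook_fn hook)
{
    if (hook == NULL) {
        return 0;
    }
    hook();
    return 1;
}

/* Hardware hook first, then software hook, as in crt0.S. */
int run_init_hooks(init_hook_fn hardware_init_hook,
                   init_hook_fn software_init_hook)
{
    int called = 0;

    called += call_if_defined(hardware_init_hook);
    called += call_if_defined(software_init_hook);
    return called;
}
```

Passing NULL for both hooks models our nRF52 build, where neither function is defined and both calls are skipped.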

argc and argv Initialization

The Newlib ARM startup code has a simple solution for argc and argv: they are initialized to 0:

    .LC25:
        movs r0, #0 /* no arguments */
        movs r1, #0 /* no argv either */

Call Global Constructors

Next, we call global constructors. The code is provisioned such that it will work if global constructors are not present. Constructors are enabled in our configuration.

First, we store the values of r0 and r1 to r4 and r5, since we will be calling other functions:

        movs r4, r0
        movs r5, r1

Next, we register the _fini function (which is actually __libc_fini_array thanks to the preprocessor) with atexit. This ensures that global destructors will be run when exiting the program.

Newlib supports a “light exit” implementation, which is controlled by the _LITE_EXIT compiler definition. For embedded systems, this is a wonderful option. Our programs do not perform normal exit procedures; they simply run until power is removed. Cleaning up after the program is not a requirement, and exit functions can be discarded.

If _LITE_EXIT is enabled, atexit is weakly linked. If atexit is linked in our application, it will be called with __libc_fini_array as an argument. If it is not defined, the global destructors will not be registered. Our current configuration is using _LITE_EXIT without atexit.

    #ifdef _LITE_EXIT
        /* Make reference to atexit weak to avoid unconditionally pulling in
           support code. Refer to comments in __atexit.c for more details. */
        .weak FUNCTION(atexit)
        ldr r0, .Latexit
        cmp r0, #0
        beq .Lweak_atexit
    #endif
        ldr r0, .Lfini
        bl FUNCTION (atexit)

After the global destructors are registered, the _init function is invoked (which is actually __libc_init_array thanks to the preprocessor). This function calls the global constructors, and it is always run.

    .Lweak_atexit:
        bl FUNCTION (_init)

Once we have called the global constructors, the values for argc and argv are moved into the function argument registers r0 and r1 so we can call main:

        movs r0, r4
        movs r1, r5

Call main

With the argc and argv function arguments stored in r0 and r1, we can safely call main:

        bl FUNCTION (main)

Program Exit

After main returns, exit is called using its return code. We do not expect exit to return, but if it does then we trap the program in SWI_Exit.

        bl FUNCTION (exit) /* Should not return. */

    #if __thumb__ && !defined(PREFER_THUMB)
        /* Come out of Thumb mode. This code should be redundant. */
        mov a4, pc
        bx a4

        .code 32
        .global change_back
    change_back:
        /* Halt the execution. This code should never be executed. */
        /* With no debug monitor, this probably aborts (eventually).
           With a Demon debug monitor, this halts cleanly.
           With an Angel debug monitor, this will report 'Unknown SWI'. */
        swi SWI_Exit
    #endif
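Putting the pieces together, the overall ordering of _start can be modeled in a few lines of C. This is a simulation for illustration, not Newlib code: every model_ function is a hypothetical stand-in that only appends its name to a log, and the have_atexit flag imitates whether the weak atexit reference was resolved in a _LITE_EXIT build.

```c
#include <stddef.h>
#include <string.h>

/* Records the order in which the modeled startup steps run. */
char boot_log[64];

static void record(const char *step)
{
    strcat(boot_log, step);
    strcat(boot_log, ";");
}

static void (*registered_exit_handler)(void);

static void model_fini(void) { record("fini"); }
static void model_init(void) { record("init"); }

static int model_atexit(void (*handler)(void))
{
    registered_exit_handler = handler;
    return 0;
}

static int model_main(int argc, char **argv)
{
    (void)argc;
    (void)argv;
    record("main");
    return 0;
}

static void model_exit(int status)
{
    (void)status;
    if (registered_exit_handler != NULL) {
        registered_exit_handler();  /* run registered destructors first */
    }
    record("exit");
}

/* Walk through the same sequence of steps that _start performs. */
void model_start(int have_atexit)
{
    boot_log[0] = '\0';
    registered_exit_handler = NULL;

    record("stack");                /* stack pointer and stack limit */
    record("bss");                  /* memset over .bss */
    record("hooks");                /* optional hardware/software hooks */
    if (have_atexit) {
        model_atexit(model_fini);   /* register global destructors */
    }
    model_init();                   /* global constructors, always run */
    model_exit(model_main(0, NULL));
}
```

When have_atexit is 0, the "fini" step never appears in the log, which mirrors our configuration: the destructors are simply never registered.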

Now that we’ve looked over the _start function, let’s look at the various functions that _start called.

__libc_init_array

The __libc_init_array() function can be found in newlib/libc/misc/init.c.

Depending on the architecture, compiler, and linker, constructors are placed into the .init_array section or the .init section. The Newlib ARM startup code is flexible and can handle any combination of cases. If HAVE_INITFINI_ARRAY is not defined, _start calls _init directly instead of calling __libc_init_array. If HAVE_INITFINI_ARRAY is defined, __libc_init_array calls the constructors in the .preinit_array and .init_array sections. If .init is also present for an architecture, the constructors stored in that section will also be invoked.

ARM code typically uses .init_array instead of _init. In our current case, HAVE_INITFINI_ARRAY is defined and HAVE_INIT_FINI is not.

    /* Handle ELF .{pre_init,init,fini}_array sections.  */
    #include <sys/types.h>

    #ifdef HAVE_INITFINI_ARRAY

    /* These magic symbols are provided by the linker.  */
    extern void (*__preinit_array_start []) (void) __attribute__((weak));
    extern void (*__preinit_array_end []) (void) __attribute__((weak));
    extern void (*__init_array_start []) (void) __attribute__((weak));
    extern void (*__init_array_end []) (void) __attribute__((weak));

    #ifdef HAVE_INIT_FINI
    extern void _init (void);
    #endif

    /* Iterate over all the init routines.  */
    void
    __libc_init_array (void)
    {
      size_t count;
      size_t i;

      count = __preinit_array_end - __preinit_array_start;
      for (i = 0; i < count; i++)
        __preinit_array_start[i] ();

    #ifdef HAVE_INIT_FINI
      _init ();
    #endif

      count = __init_array_end - __init_array_start;
      for (i = 0; i < count; i++)
        __init_array_start[i] ();
    }
    #endif
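To see how an entry ends up in .init_array in the first place, here is a small sketch that assumes a GCC-compatible compiler targeting ELF. The constructor attribute is a compiler extension; board_ready, board_setup, and board_is_ready are hypothetical names used only for illustration.

```c
/* Set by the constructor below. It lives in .bss, which the startup code
   zeroes before any constructor runs. */
static int board_ready;

/* GCC/Clang extension: on ELF targets the compiler emits a pointer to this
   function into .init_array, so __libc_init_array calls it before main. */
__attribute__((constructor))
static void board_setup(void)
{
    board_ready = 1;
}

/* Hypothetical helper: callable from main to confirm the constructor ran. */
int board_is_ready(void)
{
    return board_ready;
}
```

By the time main starts, board_is_ready() should already report 1, because the startup code has walked the array and called board_setup.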

__libc_fini_array

The __libc_fini_array() function can be found in newlib/libc/misc/fini.c.

Depending on the architecture, compiler, and linker, destructors are placed into the .fini_array section or the .fini section. If the program is configured with full exit support, these functions will be executed before the program exits. In a _LITE_EXIT configuration where atexit is not linked, the destructors are never registered and are therefore ignored.

Like __libc_init_array, the functionality is decided by two macros. If HAVE_INITFINI_ARRAY is not defined, _start registers _fini with atexit instead of __libc_fini_array. If HAVE_INITFINI_ARRAY is defined, the __libc_fini_array function is registered. When __libc_fini_array is invoked by exit, it calls the destructors in the .fini_array section. If .fini is also present for an architecture, the destructors stored in that section will also be invoked.

ARM code typically uses .fini_array instead of _fini. In our current case, HAVE_INITFINI_ARRAY is defined and HAVE_INIT_FINI is not.

    /* Handle ELF .{pre_init,init,fini}_array sections.  */
    #include <sys/types.h>

    #ifdef HAVE_INITFINI_ARRAY
    extern void (*__fini_array_start []) (void) __attribute__((weak));
    extern void (*__fini_array_end []) (void) __attribute__((weak));

    #ifdef HAVE_INIT_FINI
    extern void _fini (void);
    #endif

    /* Run all the cleanup routines.  */
    void
    __libc_fini_array (void)
    {
      size_t count;
      size_t i;

      count = __fini_array_end - __fini_array_start;
      for (i = count; i > 0; i--)
        __fini_array_start[i-1] ();

    #ifdef HAVE_INIT_FINI
      _fini ();
    #endif
    }
    #endif
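The two array walks mirror each other: constructors run front to back, destructors run back to front. The sketch below imitates both loops on a host machine. It is only an illustration: an ordinary array stands in for the linker-provided start and end symbols, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*hook_fn)(void);

/* Records the order in which the hooks run. */
static int order[4];
static size_t order_len;

static void record(int id)
{
    assert(order_len < 4);
    order[order_len++] = id;
}

static void hook_a(void) { record(1); }
static void hook_b(void) { record(2); }

/* Stand-in for the section bounded by the linker's start/end symbols. */
static hook_fn table[] = { hook_a, hook_b };

/* Forward walk, as __libc_init_array does for .init_array. */
void run_forward(hook_fn *start, hook_fn *end)
{
    size_t count = (size_t)(end - start);
    for (size_t i = 0; i < count; i++)
        start[i]();
}

/* Reverse walk, as __libc_fini_array does for .fini_array. */
void run_reverse(hook_fn *start, hook_fn *end)
{
    size_t count = (size_t)(end - start);
    for (size_t i = count; i > 0; i--)
        start[i - 1]();
}

/* Returns the call order as a decimal number: 1221 means a, b, b, a. */
int table_walk_demo(void)
{
    int result = 0;

    order_len = 0;
    run_forward(table, table + 2);
    run_reverse(table, table + 2);
    for (size_t i = 0; i < order_len; i++)
        result = result * 10 + order[i];
    return result;
}
```

Counting down from count and indexing with i - 1 keeps the loop variable unsigned without wrapping below zero, which is the same trick the Newlib loop uses.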

Heap Limit and malloc

The __heap_limit variable set during the _start routine is used by _sbrk, found in libgloss/arm/syscalls.c.

The _sbrk function is used to allocate memory for the platform. For more information on heap allocation and sbrk, read this article about the glibc heap implementation.

While the _sbrk function is not directly used in the startup code, we can see that setting __heap_limit during _start is effectively configuring the program’s heap. If the _start routine does not update __heap_limit, _sbrk recognizes the default value (0xcafedead) and does not check allocations against a heap limit. Only the check against the stack pointer remains.

    /* Heap limit returned from SYS_HEAPINFO Angel semihost call.  */
    uint __heap_limit = 0xcafedead;

    void * __attribute__((weak))
    _sbrk (ptrdiff_t incr)
    {
      extern char end asm ("end"); /* Defined by the linker.  */
      static char * heap_end;
      char * prev_heap_end;

      if (heap_end == NULL)
        heap_end = & end;

      prev_heap_end = heap_end;

      if ((heap_end + incr > stack_ptr)
          /* Honour heap limit if it's valid.  */
          || (__heap_limit != 0xcafedead && heap_end + incr > (char *)__heap_limit))
        {
          errno = ENOMEM;
          return (void *) -1;
        }

      heap_end += incr;

      return (void *) prev_heap_end;
    }
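The limit check is easier to experiment with on a host machine. The following is a minimal simulation, not the libgloss code: a 64-byte array stands in for the memory after the linker's end symbol, the end of that array plays the role of the stack pointer, and the limit is stored as an offset rather than an address. All sim_ names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define HEAP_LIMIT_UNSET 0xcafedeadu  /* same sentinel value as the Newlib code */

static char arena[64];                /* hypothetical stand-in for RAM after `end` */
static char *heap_end_sim;            /* current break, NULL until first use */
static size_t heap_limit_sim = HEAP_LIMIT_UNSET;  /* offset into arena, or sentinel */

/* Plays the role of _start storing the heap limit it obtained. */
void sim_set_heap_limit(size_t limit)
{
    assert(limit <= sizeof arena);
    heap_limit_sim = limit;
}

/* Bump-style break adjustment: returns the previous break on success and
   (void *)-1 with errno set to ENOMEM on failure. */
void *sim_sbrk(ptrdiff_t incr)
{
    char *prev;
    ptrdiff_t next;

    if (heap_end_sim == NULL)
        heap_end_sim = arena;
    prev = heap_end_sim;

    /* Work with offsets so no out-of-range pointer is ever formed. */
    next = (heap_end_sim - arena) + incr;
    if (next < 0
        || next > (ptrdiff_t)sizeof arena            /* "stack" collision */
        || (heap_limit_sim != HEAP_LIMIT_UNSET       /* honour a valid limit */
            && next > (ptrdiff_t)heap_limit_sim)) {
        errno = ENOMEM;
        return (void *)-1;
    }

    heap_end_sim = arena + next;
    return prev;
}
```

With the limit set to 32, two 16-byte requests succeed and a third request of any size fails with ENOMEM. Leaving the sentinel in place means only the end-of-arena check applies, which matches the behaviour described above.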

atexit Family

The atexit family of functions is responsible for registering functions to be called when the program exits, including the global destructors. We will explore the following functions:

atexit

__cxa_atexit

__register_exitproc

We don’t typically need exit functionality for our embedded platforms. Rarely is there a concept of a program “exit” which requires cleanup of resources. Instead, our programs run until they are terminated by a reset, an off switch, or a loss of power.

Newlib provides for this behavior through the _LITE_EXIT compilation option. This option changes behavior related to the exit-time requirements and reduces our binary size. Our program is technically compiled under _LITE_EXIT, but we will still analyze the normal exit-related behavior for instructional purposes.

The Newlib code comments are helpful in explaining the differences between the two exit configurations. Under normal circumstances, we can expect the following exit call graphs (an -> indicates “invokes”):

    Default (without lite exit) call graph is like:
     *  _start -> atexit -> __register_exitproc
     *  _start -> __libc_init_array -> __cxa_atexit -> __register_exitproc
     *  on_exit -> __register_exitproc
     *  _start -> exit -> __call_exitprocs

When lite exit is enabled, the call graph changes. The atexit, __register_exitproc, and __call_exitprocs functions are changed to weak symbols, which may not be linked by the final program. These function call stacks are modified:

    Lite exit makes some of above calls as weak reference, so that size expansive
    functions __register_exitproc and __call_exitprocs may not be linked. These
    calls are:
     *  _start w-> atexit
     *  __cxa_atexit w-> __register_exitproc
     *  exit w-> __call_exitprocs

Let’s look at how these exit functions operate.

atexit

The atexit function is used to register functions that should be invoked when the program exits. Most notably, this call is used to register the destructor routine for .fini or .fini_array during the startup process. If the _LITE_EXIT configuration is used and atexit is not linked, this registration step is skipped.

The atexit function is implemented in newlib/libc/stdlib/atexit.c. This implementation forwards the input function argument to __register_exitproc while noting that the call originated from atexit (using the __et_atexit argument).

    #include <stdlib.h>
    #include "atexit.h"

    int
    atexit (void (*fn) (void))
    {
      return __register_exitproc (__et_atexit, fn, NULL, NULL);
    }
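From the application's point of view, atexit is plain standard C. The sketch below registers two handlers; the handler names and register_shutdown_handlers are hypothetical. The C standard specifies that registered functions are called in reverse order of registration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical cleanup handlers; each announces itself when it runs. */
static void close_log(void)   { puts("close_log"); }
static void flush_cache(void) { puts("flush_cache"); }

/* Registers both handlers; returns 0 only if both registrations succeed. */
int register_shutdown_handlers(void)
{
    if (atexit(close_log) != 0)
        return -1;
    if (atexit(flush_cache) != 0)
        return -1;
    return 0;
}
```

When the program later calls exit or returns from main, flush_cache runs first and close_log runs second, because the most recently registered handler is called first. Checking the return value matters on small systems, where the registration table can fill up.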

__cxa_atexit

The __cxa_atexit call is used similarly to atexit, but often for handling functions to be called when a dynamic library is unloaded. In many implementations, such as this one, atexit and __cxa_atexit share implementations.

The __cxa_atexit function is implemented in newlib/libc/stdlib/cxa_atexit.c. This implementation forwards the input function and arguments to __register_exitproc while indicating that the call originated from __cxa_atexit (using the __et_cxa argument).

If the _LITE_EXIT configuration is used, then __register_exitproc may be weakly linked. In this case, __cxa_atexit will blindly return success (0).

    int
    __cxa_atexit (void (*fn) (void *), void *arg, void *d)
    {
    #ifdef _LITE_EXIT
      /* Refer to comments in __atexit.c for more details of lite exit.  */
      int __register_exitproc (int, void (*fn) (void), void *, void *)
        __attribute__ ((weak));

      if (!__register_exitproc)
        return 0;
      else
    #endif
        return __register_exitproc (__et_cxa, (void (*)(void)) fn, arg, d);
    }
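The weak-reference check used here is a general pattern for optional features. The sketch below assumes a GCC-compatible compiler and an ELF toolchain, where an undefined weak symbol resolves to a null address; other object formats may handle undefined weak references differently. trace_hook and trace_if_available are hypothetical names.

```c
#include <stddef.h>

/* Weak declaration: if no object file defines trace_hook, an ELF linker
   resolves its address to NULL instead of reporting an undefined symbol. */
extern void trace_hook(const char *msg) __attribute__((weak));

/* Returns 1 if the hook was linked in and called, 0 if it was skipped. */
int trace_if_available(const char *msg)
{
    if (trace_hook == NULL)
        return 0;
    trace_hook(msg);
    return 1;
}
```

Because nothing in this translation unit defines trace_hook, the call is skipped and no support code is pulled into the binary. Linking an object file that defines trace_hook enables the feature without changing the caller.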

__register_exitproc

We’ve seen two uses of __register_exitproc, the common routine that handles all atexit-like registration. The functions it records are called later, when the program exits or when a shared library is unloaded.

The __register_exitproc function is implemented in newlib/libc/stdlib/__atexit.c. This function must support a variety of configurations and behaviors: _LITE_EXIT vs standard exit, single-threaded vs multi-threaded, atexit vs __cxa_atexit. I’ve stripped out some of the #ifdef blocks to make the code more readable.

The function starts by acquiring a lock if threading is enabled:

    int
    __register_exitproc (int type, void (*fn) (void), void *arg, void *d)
    {
      struct _on_exit_args * args;
      register struct _atexit *p;

    #ifndef __SINGLE_THREAD__
      __lock_acquire_recursive(__atexit_recursive_mutex);
    #endif

Next, we grab the _GLOBAL_ATEXIT list of functions. If it has not been initialized yet, we assign it the initial list value.

      p = _GLOBAL_ATEXIT;
      if (p == NULL)
        {
          _GLOBAL_ATEXIT = p = _GLOBAL_ATEXIT0;
        }

The C standard requires atexit to support registering at least 32 functions (_ATEXIT_SIZE). Newlib handles this by allocating blocks that each hold 32 entries. Once the current block is full, a new block is allocated and added to the list.

If there is no malloc implementation for the system, or if dynamic allocations for atexit are not allowed, the function will fail and return an error code instead of allocating a new block.

      if (p->_ind >= _ATEXIT_SIZE)
        {
    #if !defined (_ATEXIT_DYNAMIC_ALLOC) || !defined (MALLOC_PROVIDED)
    #ifndef __SINGLE_THREAD__
          __lock_release_recursive(__atexit_recursive_mutex);
    #endif
          return -1;
    #else
          p = (struct _atexit *) malloc (sizeof *p);
          if (p == NULL)
            {
    #ifndef __SINGLE_THREAD__
              __lock_release_recursive(__atexit_recursive_mutex);
    #endif
              return -1;
            }
          p->_ind = 0;
          p->_next = _GLOBAL_ATEXIT;
          _GLOBAL_ATEXIT = p;
          p->_on_exit_args_ptr = NULL;
    #endif
        }

We observed two different type values for this call: __et_atexit and __et_cxa. If __cxa_atexit was called, additional arguments were provided and need to be stored for future retrieval. Arguments and function pointers are stored at the current index, and then the index is incremented.

      if (type != __et_atexit)
        {
          args = &p->_on_exit_args;
          args->_fnargs[p->_ind] = arg;
          args->_fntypes |= (1 << p->_ind);
          args->_dso_handle[p->_ind] = d;
          if (type == __et_cxa)
            args->_is_cxa |= (1 << p->_ind);
        }
      p->_fns[p->_ind++] = fn;

Once we are done, we can unlock and exit the function:

    #ifndef __SINGLE_THREAD__
      __lock_release_recursive(__atexit_recursive_mutex);
    #endif
      return 0;
    }
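Stripped of locking, argument storage, and block chaining, the registration logic reduces to appending to a fixed-size table. The following is a deliberately simplified sketch of the configuration without dynamic allocation. It is not the Newlib data structure, and exit_table, exit_table_register, and exit_table_demo are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_COUNT 32   /* mirrors _ATEXIT_SIZE */

typedef void (*exit_fn)(void);

/* Simplified stand-in for struct _atexit: one fixed block, no arguments. */
struct exit_table {
    size_t count;               /* next free slot, like _ind */
    exit_fn fns[SLOT_COUNT];    /* registered functions, like _fns */
};

/* Returns 0 on success and -1 when the block is full, which is what
   happens when a new block cannot be allocated. */
int exit_table_register(struct exit_table *t, exit_fn fn)
{
    assert(t != NULL && fn != NULL);
    if (t->count >= SLOT_COUNT)
        return -1;
    t->fns[t->count++] = fn;
    return 0;
}

static void noop(void) { }

/* Tries `attempts` registrations and returns how many succeeded. */
size_t exit_table_demo(size_t attempts)
{
    struct exit_table t = { 0 };
    size_t ok = 0;

    for (size_t i = 0; i < attempts; i++)
        if (exit_table_register(&t, noop) == 0)
            ok++;
    return ok;
}
```

Asking for 40 registrations yields 32 successes: the 33rd and later calls fail, just as atexit returns an error once the single block is exhausted on a system without malloc.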

Automatic Registration of Destructors

One interesting note is that Newlib provides features for registering global destructors (in .fini or .fini_array) within the C library, rather than in startup code. This automatic registration code is provided in newlib/libc/stdlib/__call_atexit.c.

A __libc_fini symbol is weakly defined. You can define __libc_fini to _fini or _fini_array in your linker script, and the C library will handle the registration so that your startup code does not need to call atexit.

    extern char __libc_fini __attribute__((weak));

A registration function is defined and marked as a high-priority constructor, which places it into the .init or .init_array section. Since exit-time functions run in LIFO order, and the .fini and .fini_array functions should run last, the constructor attempts to be the first to register with atexit.

    static void register_fini(void) __attribute__((constructor (0)));

The register function checks for a valid __libc_fini symbol and registers the destructors if it’s defined.

    static void
    register_fini(void)
    {
      if (&__libc_fini) {
    #ifdef HAVE_INITFINI_ARRAY
        extern void __libc_fini_array (void);
        atexit (__libc_fini_array);
    #else
        extern void _fini (void);
        atexit (_fini);
    #endif
      }
    }
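Constructor priorities can be observed directly. The sketch below assumes a GCC-compatible compiler, where lower priority values run earlier. Newlib uses priority 0 for register_fini, but GCC reserves values 0 through 100 for the implementation, so this example uses 101 and 102. All names are hypothetical.

```c
/* Records the order in which the constructors run. */
static int boot_order[2];
static int boot_count;

/* Lower priority values run earlier. */
__attribute__((constructor(101)))
static void early_setup(void)
{
    boot_order[boot_count++] = 101;
}

__attribute__((constructor(102)))
static void late_setup(void)
{
    boot_order[boot_count++] = 102;
}

/* Returns 1 if early_setup ran before late_setup. */
int boot_order_is_sorted(void)
{
    return boot_count == 2
        && boot_order[0] == 101
        && boot_order[1] == 102;
}
```

A library that must register its cleanup before anything else can use the same mechanism with a priority lower than any application constructor.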

exit Family

To complete our analysis of _start and crt0.s, we’ll look at the exit family of functions:

exit

__call_exitprocs

_exit

_kill

exit

The exit function is implemented in newlib/libc/stdlib/exit.c.

The Newlib exit function is a wrapper. exit calls all registered exit-time functions via __call_exitprocs. If the _LITE_EXIT configuration is used, __call_exitprocs is a weak reference and may not be linked, in which case the call is skipped.

Following the invocation of exit-time destructors, the _GLOBAL_REENT->__cleanup function is called if it is set. This function flushes stdio buffers, if necessary.

Once all destruction and cleanup activities are complete, control proceeds to _exit.

    void
    exit (int code)
    {
    #ifdef _LITE_EXIT
      /* Refer to comments in __atexit.c for more details of lite exit.  */
      void __call_exitprocs (int, void *) __attribute__((weak));
      if (__call_exitprocs)
    #endif
        __call_exitprocs (code, NULL);

      if (_GLOBAL_REENT->__cleanup)
        (*_GLOBAL_REENT->__cleanup) (_GLOBAL_REENT);
      _exit (code);
    }

__call_exitprocs

The __call_exitprocs function is responsible for calling exit-time destructor routines that were registered with the atexit family of functions. __call_exitprocs is implemented in newlib/libc/stdlib/__call_atexit.c. I’ve stripped out some of the #ifdef blocks to make the code more readable.

The function starts by acquiring a lock if threading is enabled:

    void
    __call_exitprocs (int code, void *d)
    {
      register struct _atexit *p;
      struct _atexit **lastp;
      register struct _on_exit_args * args;
      register int n;
      int i;
      void (*fn) (void);

    #ifndef __SINGLE_THREAD__
      __lock_acquire_recursive(__atexit_recursive_mutex);
    #endif

Next, the linked list of exit-time functions is accessed. Note the restart label, as it will be referenced later.

     restart:

      p = _GLOBAL_ATEXIT;
      lastp = &_GLOBAL_ATEXIT;

For each entry in the list, the following actions are performed:

Arguments are loaded

The function is removed from the list

The index is decremented

If unloading a shared library, check that the _dso_handle matches the unloaded library

Skip to the next entry if there is a mismatch

Check if the function has been called

Skip to the next entry if it has already been called

Call the function

The loop also checks the index after calling the destructor. If that function registered new exit-time functions, the loop jumps back to restart so that the LIFO order of the destructors is preserved.

      while (p)
        {
          args = &p->_on_exit_args;

          for (n = p->_ind - 1; n >= 0; n--)
            {
              int ind;

              i = 1 << n;

              /* Skip functions not from this dso.  */
              if (d && (!args || args->_dso_handle[n] != d))
                continue;

              /* Remove the function now to protect against the
                 function calling exit recursively.  */
              fn = p->_fns[n];
              if (n == p->_ind - 1)
                p->_ind--;
              else
                p->_fns[n] = NULL;

              /* Skip functions that have already been called.  */
              if (!fn)
                continue;

              ind = p->_ind;

              /* Call the function.  */
              if (!args || (args->_fntypes & i) == 0)
                fn ();
              else if ((args->_is_cxa & i) == 0)
                (*((void (*)(int, void *)) fn))(code, args->_fnargs[n]);
              else
                (*((void (*)(void *)) fn))(args->_fnargs[n]);

              /* The function we called call atexit and registered another
                 function (or functions).  Call these new functions before
                 continuing with the already registered functions.  */
              if (ind != p->_ind || *lastp != p)
                goto restart;
            } // end of for - while still in effect

At the end of each block of exit-functions, the now-empty block is removed from the list and the memory is freed. If malloc is not provided or dynamic allocations in atexit are disallowed, the function ends after the first block.

          // while still in effect
    #if !defined (_ATEXIT_DYNAMIC_ALLOC) || !defined (MALLOC_PROVIDED)
          break;
    #else
          /* Move to the next block.  Free empty blocks except the last one,
             which is part of _GLOBAL_REENT.  */
          if (p->_ind == 0 && p->_next)
            {
              /* Remove empty block from the list.  */
              *lastp = p->_next;
              free (p);
              p = *lastp;
            }
          else
            {
              lastp = &p->_next;
              p = p->_next;
            }
    #endif
        } // end of while

The lock is released, and the function exits.

    #ifndef __SINGLE_THREAD__
      __lock_release_recursive(__atexit_recursive_mutex);
    #endif
    }
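The two ideas that matter most in __call_exitprocs are removing an entry before calling it and running newly registered functions before older ones. The sketch below shows both on a single fixed-size stack. It replaces Newlib's goto restart with a simpler pop loop that re-reads the top of the stack on every iteration, so it illustrates the ordering rather than the actual implementation. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_COUNT 8

typedef void (*exit_fn)(void);

static exit_fn slots[SLOT_COUNT];   /* registered handlers, newest last */
static size_t slot_count;
static int call_log[SLOT_COUNT];    /* ids of handlers in the order they ran */
static size_t call_log_len;

static int register_fn(exit_fn fn)
{
    if (slot_count >= SLOT_COUNT)
        return -1;
    slots[slot_count++] = fn;
    return 0;
}

static void record(int id)
{
    assert(call_log_len < SLOT_COUNT);
    call_log[call_log_len++] = id;
}

static void handler_late(void) { record(3); }
static void handler_a(void)    { record(1); }

/* Registers another handler while the exit sequence is already running. */
static void handler_b(void)
{
    record(2);
    register_fn(handler_late);
}

/* Pop, then call: the slot is removed before the handler runs, so a handler
   that re-enters this function never sees itself in the table. */
static void call_all(void)
{
    while (slot_count > 0) {
        exit_fn fn = slots[--slot_count];
        fn();   /* anything fn registered is now on top and runs next */
    }
}

/* Returns the call order as a decimal number. */
int call_order_demo(void)
{
    int result = 0;

    slot_count = 0;
    call_log_len = 0;
    register_fn(handler_a);
    register_fn(handler_b);
    call_all();
    for (size_t i = 0; i < call_log_len; i++)
        result = result * 10 + call_log[i];
    return result;
}
```

The demo registers handler_a and then handler_b. handler_b runs first and registers handler_late, which runs next, and handler_a runs last, so the recorded order is 2, 3, 1.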

_exit

The _exit function is found at libgloss/arm/_exit.c. This function is simply a wrapper around _kill_shared.

    void
    _exit (int status)
    {
      /* The same SWI is used for both _exit and _kill.
         For _exit, call the SWI with "reason" set to ADP_Stopped_ApplicationExit
         to mark a standard exit.
         Note: The RDI implementation of _kill_shared throws away all its
         arguments and all implementations ignore the first argument.  */
      _kill_shared (-1, status, ADP_Stopped_ApplicationExit);
    }

_kill

The _kill_shared function is implemented in libgloss/arm/_kill.c.

When we remove the semihosting / debug monitor support, this function does nothing:

    int
    _kill_shared (int pid, int sig, int reason)
    {
      (void) pid; (void) sig;
      __builtin_unreachable();
    }

When debug monitor support is included, the __builtin_unreachable() call makes sense, because the debug monitor will trap the code in an SWI handler. If we have compiled without debug monitor support, this function will return up the call stack to crt0.s, and we will invoke the SWI handler anyway:

        swi     SWI_Exit

Visual Summary

Startup Activity Checklist

In the first article of this series, we reviewed a broad range of startup activities that occur before main is called.

Here is a checklist of actions that were observed in the Newlib ARM program startup procedures:

[x] Early low-level initialization of the processor/hardware

[x] Stack initialization

[x] Frame pointer initialization

[x] C/C++ runtime setup

[x] Handle relocations (some sections are copied from flash to RAM)

[x] Initialize .bss

[x] Call global constructors

[x] Prepare argc, argv (set to 0)

[ ] Prepare environment variables

[x] Heap initialization

[ ] stdio initialization

[ ] Initialize exception support

[x] Register destructors and other exit-time functionality

[ ] System scaffolding setup

[ ] Threading support

[ ] Thread local storage

[ ] Buffer overrun detection

[ ] Run-time error checks

[ ] Locale settings

[ ] Math error handling

[ ] Math precision

[x] Jump to main

[x] Exit after main