MCU Boot Sequence Deep Dive

A practical walkthrough of the STM32/Cortex-M boot path, from the hardware-automatic load of the stack pointer and reset vector through SystemInit(), .data/.bss setup, and C-library initialization, ending at main(). The post emphasizes the configuration-dependent steps — clock and flash latency, power scaling, FPU, caches, external memory, watchdogs, and TrustZone — that silently break firmware when overlooked.

Introduction

On a Cortex-M MCU, main() is not where the program starts — it is where a substantial amount of setup has already finished. By the time your int main(void) runs, the stack pointer is valid, global variables hold their initializers, zero-initialized data is cleared, C++ constructors have run, and (depending on configuration) the FPU, caches, external RAM, and PLL may or may not be live. Misunderstanding this prologue is a recurring source of bugs that present as impossible symptoms: a HardFault before the first user instruction, globals that read as garbage, a chip that runs at 16 MHz when you "configured" 168 MHz, or floating-point code that faults on its first multiply.

This article reconstructs the STM32 boot path step by step and then maps each step to the hardware configuration that controls it. The goal is to make the startup code (startup_stm32xxxx.s, system_stm32xxxx.c, the linker script) legible, so you can reason about — and safely modify — what happens before main().

The Cortex-M Reset Model: No Instruction at the Reset Vector

A key difference from classic ARM (and most other architectures): on Cortex-M, the reset vector does not contain an instruction. After reset deassertion, the core hardware reads two 32-bit words from the base of the vector table and loads them into registers automatically, before fetching any code:

Offset   Word                        Loaded into
0x00     Initial stack pointer       MSP (Main Stack Pointer)
0x04     Address of Reset_Handler    PC (Program Counter)

This means the stack pointer is set up by hardware, not by software; the belief that startup code must initialize SP is a common misconception. The explicit SP load in the startup code is therefore redundant after a true reset (it is kept for clarity, and for cases where SP must be re-established, such as after a bootloader jump). The vector table layout in the GCC startup file reflects this exactly:

g_pfnVectors:
    .word   _estack          /* [0x00] initial MSP: top of RAM, from linker */
    .word   Reset_Handler     /* [0x04] initial PC */
    .word   NMI_Handler
    .word   HardFault_Handler
    /* ... remaining exception and IRQ vectors ... */

_estack is a linker-defined symbol pointing at the top of SRAM. Because the Cortex-M stack is full-descending and AAPCS requires 8-byte alignment at public interfaces, _estack must be 8-byte aligned, or the very first interrupt that stacks a frame can violate alignment assumptions.
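
As a host-side illustration of the two-word load described above, the following sketch decodes the first 8 bytes of a firmware image and applies basic sanity checks. The names vector_head_t, vector_head_parse and vector_head_is_plausible are hypothetical, and the RAM base and size are supplied by the caller as assumptions about the target. The Thumb-bit check reflects the fact that Cortex-M executes only Thumb code, so a valid reset vector has bit 0 set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* First two words of a Cortex-M vector table, as the core loads them at reset. */
typedef struct {
    uint32_t initial_msp;   /* word at offset 0x00 */
    uint32_t reset_vector;  /* word at offset 0x04 */
} vector_head_t;

/* Read one little-endian 32-bit word from a byte buffer. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the first 8 bytes of a firmware image (hypothetical helper). */
vector_head_t vector_head_parse(const uint8_t image[8])
{
    assert(image != NULL);
    vector_head_t head;
    head.initial_msp  = read_le32(&image[0]);
    head.reset_vector = read_le32(&image[4]);
    return head;
}

/* True if SP is 8-byte aligned and lies in (ram_base, ram_base + ram_size],
 * and the reset vector has the Thumb bit (bit 0) set. */
bool vector_head_is_plausible(vector_head_t head, uint32_t ram_base, uint32_t ram_size)
{
    bool sp_aligned = (head.initial_msp & 0x7u) == 0u;
    bool sp_in_ram  = head.initial_msp > ram_base &&
                      (head.initial_msp - ram_base) <= ram_size;
    bool thumb_bit  = (head.reset_vector & 0x1u) != 0u;
    return sp_aligned && sp_in_ram && thumb_bit;
}
```

A bootloader can run the same kind of check on an application image before jumping to it; a blank Flash sector (all 0xFF) fails the alignment test immediately.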

Boot Configuration: What Is Mapped at 0x00000000

On most STM32 parts the core fetches the initial SP/PC from address 0x00000000 (Cortex-M7 and Cortex-M33 families instead take the boot address from option bytes). What physically lives there is decided by the boot configuration, sampled at reset:

  • BOOT0 pin (and on some parts BOOT1 / a nBOOT1 option bit) selects between main Flash, system memory (the ST factory bootloader, see AN2606), or embedded SRAM.
  • The selected memory is aliased to 0x00000000, while also remaining visible at its native address (e.g., Flash at 0x08000000, SRAM at 0x20000000).
  • Option bytes can further override this on newer families (e.g., nBOOT0 option to ignore the pin, dual-bank boot selection BFB2, RDP read-protection level).

The practical consequence: if you relocate your application (typical with a custom bootloader), the application's vector table is no longer at the aliased 0x00000000. You must set VTOR (Vector Table Offset Register) to the application's table base — usually inside SystemInit() — or every exception will dispatch through the wrong handlers.

/* In system_stm32xxxx.c, executed early in SystemInit() */
SCB->VTOR = FLASH_BASE | VECT_TAB_OFFSET;  /* e.g. 0x08020000 for an app at +128K */

Note: Cortex-M0 has no VTOR (table is fixed at 0x0; remapping is done via the SYSCFG memory-remap on STM32F0). Cortex-M0+ (STM32L0/G0) and all of M3/M4/M7/M33 have VTOR.
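
One detail that is easy to miss when choosing VECT_TAB_OFFSET is table alignment. The sketch below assumes the usual architectural rule: the table base must be aligned to the table size rounded up to a power of two, with a minimum of 128 bytes. The helper names and the interrupt count used in the usage note are illustrative; check the reference manual of the actual part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Initial-SP word plus the 15 system exception slots precede the IRQ vectors. */
#define CORE_VECTOR_SLOTS 16u

/* Required vector-table alignment in bytes for a device with num_irqs
 * external interrupts (hypothetical helper). */
uint32_t vtor_required_alignment(uint32_t num_irqs)
{
    assert(num_irqs <= 496u);               /* architectural upper bound */
    uint32_t table_bytes = (CORE_VECTOR_SLOTS + num_irqs) * 4u;
    uint32_t alignment = 128u;              /* minimum: 32 words */
    while (alignment < table_bytes) {
        alignment <<= 1;                    /* round up to a power of two */
    }
    return alignment;
}

/* True if table_base is a legal VTOR value for that interrupt count. */
bool vtor_base_is_valid(uint32_t table_base, uint32_t num_irqs)
{
    return (table_base & (vtor_required_alignment(num_irqs) - 1u)) == 0u;
}
```

For a part with 82 interrupts the table holds 98 words, so the base must be 512-byte aligned: 0x08020000 passes, 0x08020100 does not.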

Inside Reset_Handler: The C Runtime Startup

Once the core branches to Reset_Handler, software-controlled C runtime (CRT) startup begins. The canonical STM32/GCC sequence is:

Reset_Handler:
    ldr   sp, =_estack          /* explicit SP set (HW already did this) */
    bl    SystemInit            /* CMSIS: FPU, VTOR, optional early clock */

    /* Copy .data: initialized globals from Flash (LMA) to RAM (VMA) */
    ldr   r0, =_sdata           /* dest start in RAM   */
    ldr   r1, =_edata           /* dest end            */
    ldr   r2, =_sidata          /* source in Flash     */
    movs  r3, #0
    b     CheckData
CopyData:
    ldr   r4, [r2, r3]
    str   r4, [r0, r3]
    adds  r3, r3, #4
CheckData:
    adds  r4, r0, r3
    cmp   r4, r1
    bcc   CopyData

    /* Zero .bss: uninitialized globals */
    ldr   r2, =_sbss
    ldr   r4, =_ebss
    movs  r3, #0
    b     CheckBss
ZeroBss:
    str   r3, [r2]
    adds  r2, r2, #4
CheckBss:
    cmp   r2, r4
    bcc   ZeroBss

    bl    __libc_init_array     /* C++ static ctors + __attribute__((constructor)) */
    bl    main                  /* finally: application entry */

Three things deserve attention:

  • SystemInit() is minimal by design. In stock CMSIS it enables the FPU coprocessor (M4F/M7/M33), sets VTOR, and on some families resets clock registers to a known state. It does not configure the PLL — full clock setup is done later via SystemClock_Config(), typically inside main(). So all of CRT startup above runs at the default oscillator speed (HSI: 16 MHz on STM32F4, 8 MHz on STM32F1, 64 MHz on STM32H7), not your target frequency.
  • .data copy depends on the linker script. The _sidata/_sdata/_edata symbols and the AT> (load region) directive must agree. A wrong load-memory-address is the classic cause of "my initialized global is garbage." (True constants live in .rodata in Flash and are not copied at all.)
  • __libc_init_array runs user code before main(). Anything in __attribute__((constructor)), C++ object constructors for globals, and statically-initialized C++ singletons execute here — at default clock, with the FPU only available if SystemInit() already enabled it.
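
The two assembly loops above can be written in C, which makes the word-by-word behavior easier to see. This is a sketch with hypothetical function names; on target the pointers would come from the linker symbols (_sdata, _edata, _sidata, _sbss, _ebss), whereas here they are plain parameters so the logic can be exercised on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Copy initialized data word by word from its load address (Flash)
 * to its run address (RAM). Mirrors the CopyData/CheckData loop. */
void crt_copy_data(uint32_t *dst, const uint32_t *dst_end, const uint32_t *src)
{
    assert(dst <= dst_end);
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}

/* Clear the zero-initialized region word by word.
 * Mirrors the ZeroBss/CheckBss loop. */
void crt_zero_bss(uint32_t *start, const uint32_t *end)
{
    assert(start <= end);
    while (start < end) {
        *start++ = 0u;
    }
}
```

Both loops work in 32-bit words, which is why the linker script must keep the section boundaries 4-byte aligned.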

Hardware Configurations That Change Pre-main() Behavior

The skeleton above is constant, but several configuration-dependent steps determine whether it succeeds. These are the items to verify before trusting main():

Clock and Flash latency. When SystemClock_Config() raises the core frequency, Flash wait states and the prefetch/ART accelerator must be set first. On STM32F4 at 168 MHz / 3.3 V you need 5 wait states (FLASH_ACR_LATENCY_5WS). Increasing frequency before raising latency causes Flash read errors and faults. The HAL handles the ordering correctly; hand-written init must do the same.
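
The wait-state requirement and the ordering rule can be sketched as two small functions. The 30 MHz step size is an assumption that holds for the STM32F405/407 in the 2.7 V to 3.6 V supply range; other voltage ranges and other families use different tables. The function and enum names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Flash wait states for a given HCLK, assuming the STM32F405/407 table
 * for the 2.7-3.6 V range: one extra wait state per 30 MHz. */
uint32_t f4_flash_wait_states(uint32_t hclk_mhz)
{
    assert(hclk_mhz >= 1u && hclk_mhz <= 168u);
    return (hclk_mhz - 1u) / 30u;
}

typedef enum {
    STEP_SET_LATENCY,   /* program the new wait states first */
    STEP_SWITCH_CLOCK   /* switch the clock first */
} clock_step_t;

/* Which step must come first when moving from old_mhz to new_mhz.
 * Going up: raise latency before the clock. Going down: lower the clock
 * before reducing latency. Flash is then never read too fast. */
clock_step_t first_step_for_change(uint32_t old_mhz, uint32_t new_mhz)
{
    return (new_mhz > old_mhz) ? STEP_SET_LATENCY : STEP_SWITCH_CLOCK;
}
```

Moving from the 16 MHz HSI to 168 MHz therefore means programming 5 wait states first, reading the latency field back to confirm it took effect, and only then switching the system clock to the PLL.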

Power / voltage scaling. High-frequency operation requires a voltage-scaling step. On STM32F4/F7 you set the regulator scale (VOS) and, on F7/H7, enable over-drive before selecting the top frequency. On STM32H7 you must also wait for the voltage-ready flag. Skipping this leaves the part unstable at speed.

FPU. If your code (or the compiler, even for nominally integer code under -mfloat-abi=hard) touches the FPU before CP10/CP11 are enabled in CPACR, the first FP instruction raises a UsageFault with the NOCP flag set, which escalates to a HardFault if UsageFault is not enabled. SystemInit() enables the FPU, but only if you didn't strip that step from a minimal custom startup.
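
The CPACR update itself is a single read-modify-write. The sketch below models it as pure functions on the register value so it can be checked on a host; the function names are hypothetical. On target, the equivalent is an OR into SCB->CPACR followed by barrier instructions, placed before any code that might use the FPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CP10 and CP11 access fields: two bits each, 0b11 = full access. */
#define CPACR_CP10_CP11_FULL (0xFu << 20)

static_assert(CPACR_CP10_CP11_FULL == 0x00F00000u,
              "CP10/CP11 fields occupy CPACR bits 20-23");

/* Return the CPACR value with full access granted to CP10 and CP11,
 * leaving all other bits unchanged. */
uint32_t cpacr_with_fpu_enabled(uint32_t cpacr)
{
    return cpacr | CPACR_CP10_CP11_FULL;
}

/* True only if both coprocessors are set to full access. */
bool cpacr_fpu_is_enabled(uint32_t cpacr)
{
    return (cpacr & CPACR_CP10_CP11_FULL) == CPACR_CP10_CP11_FULL;
}
```

A partial setting (for example only CP10 enabled) is treated as disabled here, since both fields must be set for FP instructions to execute.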

Caches and TCM (Cortex-M7: STM32F7/H7). Enabling I-cache/D-cache is often done in SystemInit(). If .data/.bss live in DTCM there is no cache-coherency concern, but if buffers used by DMA live in cacheable SRAM, you inherit coherency problems later. Decide section placement in the linker script and cache policy at startup, before any DMA runs.
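
If DMA buffers do end up in cacheable SRAM, cache maintenance operates on whole lines, so buffer placement matters. The following sketch assumes the 32-byte D-cache line of the Cortex-M7 and computes the line-aligned range that a clean or invalidate would actually touch. The type and function names are hypothetical; on target the resulting range would be passed to the CMSIS by-address cache maintenance functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u  /* Cortex-M7 D-cache line, in bytes */

typedef struct {
    uint32_t start;  /* line-aligned start address */
    uint32_t size;   /* size in bytes, a multiple of the line size */
} cache_range_t;

/* Expand [addr, addr + len) outward to whole cache lines. */
cache_range_t dcache_maintenance_range(uint32_t addr, uint32_t len)
{
    assert(len > 0u);
    uint32_t mask  = DCACHE_LINE_SIZE - 1u;
    uint32_t start = addr & ~mask;
    uint32_t end   = (addr + len + mask) & ~mask;
    cache_range_t range = { start, end - start };
    return range;
}

/* True if the buffer occupies whole lines exactly, so maintenance on it
 * cannot affect neighbouring variables. */
bool dma_buffer_is_line_exact(uint32_t addr, uint32_t len)
{
    uint32_t mask = DCACHE_LINE_SIZE - 1u;
    return ((addr & mask) == 0u) && ((len & mask) == 0u);
}
```

A buffer that is not line-exact shares a cache line with unrelated data, and invalidating that line can discard writes to its neighbours. Aligning DMA buffers to 32 bytes in the linker script avoids the problem at the source.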

External memory (.data in SDRAM/SRAM via FMC). If a section is placed in external RAM, the FMC/FSMC controller must be initialized before the .data copy runs — otherwise the copy loop writes into dead address space. This requires an FMC bring-up hook invoked from SystemInit() (or a custom early-init), ahead of the CRT copy.

CCM RAM and ECC (STM32F4 CCMRAM, H7 ECC). Core-Coupled Memory is not initialized by the default startup — variables in .ccmram are not zeroed unless you add an explicit copy/zero loop. ECC-protected RAM may require a full write pass to initialize ECC bits before first read, or early reads fault.
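
One way to cover extra regions such as CCM or ECC-protected RAM is a small table of regions processed by a single loop, instead of one hand-written loop per section. This is a sketch, not the stock ST startup: the init_region_t type and init_regions function are hypothetical, and on target the table entries would be built from linker symbols that you define for each extra section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One RAM region to initialize before main(). */
typedef struct {
    const uint32_t *src;  /* load address in Flash, or NULL to zero-fill */
    uint32_t *dst;        /* run address in RAM */
    size_t words;         /* length in 32-bit words */
} init_region_t;

/* Copy or zero every region in the table, word by word. */
void init_regions(const init_region_t *table, size_t count)
{
    assert(table != NULL || count == 0u);
    for (size_t i = 0; i < count; i++) {
        const init_region_t *region = &table[i];
        for (size_t w = 0; w < region->words; w++) {
            region->dst[w] = (region->src != NULL) ? region->src[w] : 0u;
        }
    }
}
```

A zero-fill entry that spans an entire ECC-protected bank doubles as the full write pass mentioned above, since every word is written once before anything reads it.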

Hardware watchdog from option bytes. If the IWDG is configured as hardware-start in the option bytes, it is already counting at reset. CRT startup at the (slow) default clock can be long enough to trip it; you may need to refresh or reconfigure it very early, before main().
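
To judge whether startup can outlast the watchdog, it helps to compute the timeout from the IWDG settings. The sketch below uses the standard relation timeout = prescaler x (reload + 1) / f_LSI. The function name is hypothetical, and the LSI frequency is an input because it varies considerably between parts and over temperature; take the worst-case (highest) figure from the datasheet.

```c
#include <assert.h>
#include <stdint.h>

/* IWDG timeout in microseconds.
 * lsi_hz:        LSI oscillator frequency in Hz
 * prescaler_div: divider selected by IWDG_PR (4, 8, ..., 256)
 * reload:        12-bit value in IWDG_RLR */
uint32_t iwdg_timeout_us(uint32_t lsi_hz, uint32_t prescaler_div, uint32_t reload)
{
    assert(lsi_hz > 0u);
    assert(prescaler_div >= 4u && prescaler_div <= 256u);
    assert(reload <= 0xFFFu);
    uint64_t ticks = (uint64_t)prescaler_div * (reload + 1u);
    return (uint32_t)(ticks * 1000000u / lsi_hz);
}
```

With the reset defaults (divider 4, reload 0xFFF) and a nominal 32 kHz LSI, this gives about 512 ms. That is the budget a hardware-started watchdog imposes on everything before the first refresh, including large .bss clears and external-memory bring-up.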

TrustZone (Cortex-M33: STM32L5/U5). With TrustZone enabled, the device boots into the Secure world first: secure SystemInit(), SAU/IDAU region setup, and configuration of which peripherals and memory are non-secure, before the secure firmware branches to the Non-secure image's own reset handler — which then performs its own CRT startup. There are effectively two boot prologues and two vector tables.

Conclusion

The pre-main() prologue on STM32 is short but unforgiving. Hardware loads MSP and PC from the vector table; Reset_Handler runs SystemInit(), copies .data, zeros .bss, runs constructors, then calls main() — all at the default oscillator frequency. Everything else is configuration: VTOR for relocated images, Flash latency and voltage scaling before clock-up, FPU enable before any FP code, FMC before external-RAM data, explicit init for CCM/ECC RAM, watchdog awareness, and a full secure/non-secure split under TrustZone.

When to dig into this layer: any custom bootloader or relocated application, any use of external memory or CCM/TCM, any minimal/from-scratch startup, any low-power or high-frequency design where clock/power ordering matters, and all Cortex-M33 TrustZone projects. When not to: a single-image application using the vendor-provided startup, default linker script, and HAL SystemClock_Config() — the toolchain already does this correctly, and rewriting it by hand introduces more risk than it removes. Treat the startup files as code you can read and audit, not code you must replace.
