MCU Boot Sequence Deep Dive
Introduction
On a Cortex-M MCU, main() is not where the program starts — it is where a substantial amount of setup has already finished. By the time your int main(void) runs, the stack pointer is valid, global variables hold their initializers, zero-initialized data is cleared, C++ constructors have run, and (depending on configuration) the FPU, caches, external RAM, and PLL may or may not be live. Misunderstanding this prologue is a recurring source of bugs that present as impossible symptoms: a HardFault before the first user instruction, globals that read as garbage, a chip that runs at 16 MHz when you "configured" 168 MHz, or floating-point code that faults on its first multiply.
This article reconstructs the STM32 boot path step by step and then maps each step to the hardware configuration that controls it. The goal is to make the startup code (startup_stm32xxxx.s, system_stm32xxxx.c, the linker script) legible, so you can reason about — and safely modify — what happens before main().
The Cortex-M Reset Model: No Instruction at the Reset Vector
A key difference from classic ARM (and most other architectures): on Cortex-M, the reset vector does not contain an instruction. After reset deassertion, the core hardware reads two 32-bit words from the base of the vector table and loads them into registers automatically, before fetching any code:
| Offset | Word | Loaded into |
|---|---|---|
| 0x00 | Initial stack pointer | MSP (Main Stack Pointer) |
| 0x04 | Address of Reset_Handler | PC (Program Counter) |
This means the initial stack pointer is set up by hardware, not by software, contrary to a common assumption. The explicit SP reload in the startup code is therefore usually redundant (though it is kept for clarity, or for cases where SP must be reset after a bootloader jump). The vector table layout in the GCC startup file reflects this exactly:
g_pfnVectors:
.word _estack /* [0x00] initial MSP: top of RAM, from linker */
.word Reset_Handler /* [0x04] initial PC */
.word NMI_Handler
.word HardFault_Handler
/* ... remaining exception and IRQ vectors ... */
_estack is a linker-defined symbol pointing at the top of SRAM. Because the Cortex-M stack is full-descending and AAPCS requires 8-byte alignment at public interfaces, _estack must be 8-byte aligned, or the very first interrupt that stacks a frame can violate alignment assumptions.
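To make the two-word reset fetch concrete, here is a minimal host-side sketch in C of the sanity checks a bootloader might apply to an image's first two vector-table words before jumping to it. The function name `vector_table_looks_valid`, the `mem_map_t` struct, and the memory bounds are hypothetical, chosen only for illustration. The checks follow the rules above, plus one well-known detail: every code address in a Cortex-M vector table has bit 0 set to mark Thumb state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of the memory the image is allowed to use. */
typedef struct {
    uint32_t ram_base, ram_size;     /* e.g. 0x20000000, 128 KiB */
    uint32_t flash_base, flash_size; /* e.g. 0x08000000, 1 MiB   */
} mem_map_t;

/* Check the two words the core loads at reset: [0] -> MSP, [1] -> PC. */
static bool vector_table_looks_valid(const uint32_t *table, const mem_map_t *m)
{
    uint32_t sp = table[0];
    uint32_t pc = table[1];

    /* Full-descending stack: the initial SP may equal the end of RAM. */
    bool sp_in_ram  = sp > m->ram_base && sp <= m->ram_base + m->ram_size;
    bool sp_aligned = (sp & 0x7u) == 0u;   /* AAPCS 8-byte alignment   */
    bool pc_thumb   = (pc & 0x1u) != 0u;   /* Thumb bit must be set    */

    uint32_t target = pc & ~0x1u;          /* actual instruction address */
    bool pc_in_flash = target >= m->flash_base &&
                       target <  m->flash_base + m->flash_size;

    return sp_in_ram && sp_aligned && pc_thumb && pc_in_flash;
}
```

A blank (erased) Flash sector reads as 0xFFFFFFFF in both words, which this check rejects because the stack pointer falls outside RAM.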
Boot Configuration: What Is Mapped at 0x00000000
The core always fetches the initial SP/PC from address 0x00000000. What physically lives there is decided by the boot configuration, sampled at reset:
- BOOT0 pin (and on some parts BOOT1 / an nBOOT1 option bit) selects between main Flash, system memory (the ST factory bootloader, see AN2606), or embedded SRAM.
- The selected memory is aliased to 0x00000000, while also remaining visible at its native address (e.g., Flash at 0x08000000, SRAM at 0x20000000).
- Option bytes can further override this on newer families (e.g., nBOOT0 option to ignore the pin, dual-bank boot selection BFB2, RDP read-protection level).
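As an illustration of the pin-based selection, the sketch below encodes the classic two-pin scheme used on parts such as STM32F1 and STM32F4, where BOOT1 only matters when BOOT0 is high. The enum and function names are hypothetical, and families that replace the pins with option bits (nBOOT0, nBOOT1) follow different tables, so treat this as an example of the decision rather than a universal rule.

```c
#include <stdbool.h>

typedef enum {
    BOOT_MAIN_FLASH,     /* user application in Flash            */
    BOOT_SYSTEM_MEMORY,  /* ST factory bootloader (see AN2606)   */
    BOOT_EMBEDDED_SRAM   /* code previously loaded into SRAM     */
} boot_source_t;

/* Classic F1/F4-style mapping, as sampled at reset. */
static boot_source_t boot_source_from_pins(bool boot0, bool boot1)
{
    if (!boot0) {
        return BOOT_MAIN_FLASH;          /* BOOT1 is ignored */
    }
    return boot1 ? BOOT_EMBEDDED_SRAM : BOOT_SYSTEM_MEMORY;
}
```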
The practical consequence: if you relocate your application (typical with a custom bootloader), the application's vector table is no longer at the aliased 0x00000000. You must set VTOR (Vector Table Offset Register) to the application's table base — usually inside SystemInit() — or every exception will dispatch through the wrong handlers.
/* In system_stm32xxxx.c, executed early in SystemInit() */
SCB->VTOR = FLASH_BASE | VECT_TAB_OFFSET; /* e.g. 0x08020000 for an app at +128K */
Note: Cortex-M0 has no VTOR (table is fixed at 0x0; remapping is done via the SYSCFG memory-remap on STM32F0). Cortex-M0+ (STM32L0/G0) and all of M3/M4/M7/M33 have VTOR.
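One detail that often bites relocated images is that VTOR cannot hold an arbitrary address. On Cortex-M3/M4-class cores the table must be aligned to its own size rounded up to a power of two, with a minimum of 32 words (128 bytes). The helper names below are hypothetical, and the interrupt count is an input you would take from your part's reference manual; this is a sketch of the arithmetic, not a vendor API.

```c
#include <stdint.h>

/* Required vector-table alignment in bytes, given the number of
 * external interrupt lines implemented on the part. */
static uint32_t vtor_alignment_bytes(uint32_t num_irqs)
{
    uint32_t words = 16u + num_irqs; /* 16 system entries + IRQ entries */
    uint32_t pow2  = 32u;            /* minimum: 32 words = 128 bytes   */

    while (pow2 < words) {
        pow2 <<= 1;                  /* round up to next power of two   */
    }
    return pow2 * 4u;                /* words -> bytes                  */
}

/* Nonzero if the candidate table base satisfies that alignment. */
static int vtor_value_is_aligned(uint32_t vtor, uint32_t num_irqs)
{
    return (vtor & (vtor_alignment_bytes(num_irqs) - 1u)) == 0u;
}
```

For a part with 82 interrupt lines the table has 98 entries, which rounds up to 128 words, so the application base must be a multiple of 0x200. An offset such as 0x08020000 satisfies this; 0x08020100 does not.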
Inside Reset_Handler: The C Runtime Startup
Once the core branches to Reset_Handler, software-controlled C runtime (CRT) startup begins. The canonical STM32/GCC sequence is:
Reset_Handler:
ldr sp, =_estack /* explicit SP set (HW already did this) */
bl SystemInit /* CMSIS: FPU, VTOR, optional early clock */
/* Copy .data: initialized globals from Flash (LMA) to RAM (VMA) */
ldr r0, =_sdata /* dest start in RAM */
ldr r1, =_edata /* dest end */
ldr r2, =_sidata /* source in Flash */
movs r3, #0
b CheckData
CopyData:
ldr r4, [r2, r3]
str r4, [r0, r3]
adds r3, r3, #4
CheckData:
adds r4, r0, r3
cmp r4, r1
bcc CopyData
/* Zero .bss: uninitialized globals */
ldr r2, =_sbss
ldr r4, =_ebss
movs r3, #0
b CheckBss
ZeroBss:
str r3, [r2]
adds r2, r2, #4
CheckBss:
cmp r2, r4
bcc ZeroBss
bl __libc_init_array /* C++ static ctors + __attribute__((constructor)) */
bl main /* finally: application entry */
Three things deserve attention:
- SystemInit() is minimal by design. In stock CMSIS it enables the FPU coprocessor (M4F/M7/M33), sets VTOR, and on some families resets clock registers to a known state. It does not configure the PLL; full clock setup is done later via SystemClock_Config(), typically inside main(). So all of CRT startup above runs at the default oscillator speed (HSI: 16 MHz on STM32F4, 8 MHz on STM32F1, 64 MHz on STM32H7), not your target frequency.
- The .data copy depends on the linker script. The _sidata/_sdata/_edata symbols and the AT> (load region) directive must agree. A wrong load memory address is the classic cause of "my initialized global is garbage."
- __libc_init_array runs user code before main(). Anything in __attribute__((constructor)), C++ object constructors for globals, and statically-initialized C++ singletons execute here, at the default clock and with the FPU available only if SystemInit() already enabled it.
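The copy and zero loops are easier to reason about in C. The following is a host-side model only: on target the five bounds are linker-script symbols (_sidata, _sdata, _edata, _sbss, _ebss) rather than parameters, and the function name `crt_init_sections` is hypothetical. It assumes, as the assembly does, that all bounds are word-aligned.

```c
#include <stdint.h>

/* Model of the .data copy and .bss clear performed by Reset_Handler. */
static void crt_init_sections(const uint32_t *sidata, /* load image in Flash */
                              uint32_t *sdata, uint32_t *edata,
                              uint32_t *sbss,  uint32_t *ebss)
{
    /* Copy .data word by word from its load address to its run address. */
    for (uint32_t *dst = sdata; dst < edata; ++dst) {
        *dst = *sidata++;
    }

    /* Clear .bss so that uninitialized globals read as zero. */
    for (uint32_t *dst = sbss; dst < ebss; ++dst) {
        *dst = 0u;
    }
}
```

If the load address passed in does not match where the linker actually placed the initializer image, the first loop copies whatever happens to be there, which is exactly the garbage-global symptom described above.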
Hardware Configurations That Change Pre-main() Behavior
The skeleton above is constant, but several configuration-dependent steps determine whether it succeeds. These are the items to verify before trusting main():
Clock and Flash latency. When SystemClock_Config() raises the core frequency, the Flash wait states (and the prefetch/ART accelerator settings) must be set first. On STM32F4 at 168 MHz / 3.3 V you need 5 wait states (FLASH_ACR_LATENCY_5WS). Increasing frequency before raising latency causes Flash read errors and faults. The HAL handles this ordering correctly; hand-written init must do the same.
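The ordering rule can be sketched as a small host-side model. The wait-state formula assumes the STM32F4 table for the 2.7 V to 3.6 V range, where each started 30 MHz step adds one wait state; other voltage ranges and other families use different steps, so check the reference manual for your part. The `clock_state_t` struct and both function names are hypothetical stand-ins for the real FLASH_ACR and RCC register writes.

```c
#include <stdint.h>

/* Wait states for a given HCLK, assuming the STM32F4 2.7-3.6 V table. */
static uint32_t f4_flash_wait_states(uint32_t hclk_hz)
{
    if (hclk_hz == 0u) {
        return 0u;
    }
    return (hclk_hz - 1u) / 30000000u;
}

/* Simulated state standing in for FLASH_ACR.LATENCY and the RCC setup. */
typedef struct {
    uint32_t hclk_hz;
    uint32_t latency;
} clock_state_t;

/* Apply a frequency change in the safe order. */
static void change_frequency(clock_state_t *s, uint32_t new_hclk_hz)
{
    uint32_t new_ws = f4_flash_wait_states(new_hclk_hz);

    if (new_hclk_hz > s->hclk_hz) {
        s->latency = new_ws;      /* speeding up: raise latency first  */
        s->hclk_hz = new_hclk_hz;
    } else {
        s->hclk_hz = new_hclk_hz; /* slowing down: lower the clock first */
        s->latency = new_ws;
    }
}
```

On real hardware each step would also read the register back to confirm the new latency took effect before touching the clock switch.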
Power / voltage scaling. High-frequency operation requires a voltage-scaling step. On STM32F4/F7 you set the regulator scale (VOS) and, on F7/H7, enable over-drive before selecting the top frequency. On STM32H7 you must also wait for the voltage-ready flag. Skipping this leaves the part unstable at speed.
FPU. If your code (or the compiler, even for nominally integer code under -mfloat-abi=hard) touches the FPU before CP10/CP11 are enabled in CPACR, the first FP instruction triggers a UsageFault with the NOCP flag set, which escalates to a HardFault if the UsageFault handler is not enabled. SystemInit() enables the FPU, but only if you didn't strip that step from a minimal custom startup.
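The enable itself is a two-field write: CP10 occupies CPACR bits [21:20] and CP11 bits [23:22], and both must be set to full access (0b11). The sketch below computes and checks that value on the host; the macro and function names are hypothetical, while the bit positions are architectural.

```c
#include <stdint.h>

#define CPACR_CP10_FULL (3u << 20) /* CP10 access field, bits [21:20] */
#define CPACR_CP11_FULL (3u << 22) /* CP11 access field, bits [23:22] */

/* Return the CPACR value with full access granted to CP10 and CP11. */
static uint32_t cpacr_with_fpu_enabled(uint32_t cpacr)
{
    return cpacr | CPACR_CP10_FULL | CPACR_CP11_FULL;
}

/* Nonzero only if both coprocessor fields grant full access. */
static int fpu_access_enabled(uint32_t cpacr)
{
    uint32_t mask = CPACR_CP10_FULL | CPACR_CP11_FULL;
    return (cpacr & mask) == mask;
}
```

On target, the result would be written to SCB->CPACR and followed by __DSB() and __ISB() so the change takes effect before the next instruction; a custom startup can use the check as an early assertion before calling any code built with hard-float.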
Caches and TCM (Cortex-M7: STM32F7/H7). Enabling I-cache/D-cache is often done in SystemInit(). If .data/.bss live in DTCM there is no cache-coherency concern, but if buffers used by DMA live in cacheable SRAM, you inherit coherency problems later. Decide cache policy in the startup code and section placement in the linker script, before the application depends on either.
External memory (.data in SDRAM/SRAM via FMC). If a section is placed in external RAM, the FMC/FSMC controller must be initialized before the .data copy runs — otherwise the copy loop writes into dead address space. This requires an FMC bring-up hook invoked from SystemInit() (or a custom early-init), ahead of the CRT copy.
CCM RAM and ECC (STM32F4 CCMRAM, H7 ECC). Core-Coupled Memory is not initialized by the default startup — variables in .ccmram are not zeroed unless you add an explicit copy/zero loop. ECC-protected RAM may require a full write pass to initialize ECC bits before first read, or early reads fault.
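One way to cover extra regions such as .ccmram is a small table of regions that the startup code walks before main(). This is a sketch under assumptions: the `init_region_t` layout and `init_regions` name are hypothetical, and on target the pointers would come from linker symbols you define for each section. A NULL load pointer means "zero-fill", which also serves as the full write pass that ECC-protected RAM may need.

```c
#include <stddef.h>
#include <stdint.h>

/* One RAM region to initialize before main(). */
typedef struct {
    const uint32_t *load; /* initializer image, or NULL to zero-fill */
    uint32_t *start;      /* first word of the region                */
    uint32_t *end;        /* one past the last word                  */
} init_region_t;

/* Copy or zero every region in the table, word by word. */
static void init_regions(const init_region_t *table, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        const uint32_t *src = table[i].load;

        for (uint32_t *dst = table[i].start; dst < table[i].end; ++dst) {
            *dst = (src != NULL) ? *src++ : 0u;
        }
    }
}
```

Keeping the regions in a table means adding a new memory (CCM, DTCM, backup SRAM) is a linker-script and table change rather than another hand-written assembly loop.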
Hardware watchdog from option bytes. If the IWDG is configured as hardware-start in the option bytes, it is already counting at reset. CRT startup at the (slow) default clock can be long enough to trip it; you may need to refresh or reconfigure it very early, before main().
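A quick budget calculation shows whether this is a real risk for a given image. The sketch below assumes a nominal 32 kHz LSI, the reset-default prescaler of /4, and the reset-default reload of 0xFFF; the real LSI frequency varies considerably between parts and over temperature, so leave margin. Both function names and the cycles-per-word figure are hypothetical estimates, not measured values.

```c
#include <stdint.h>

/* IWDG timeout in microseconds for a given LSI, prescaler and reload. */
static uint32_t iwdg_timeout_us(uint32_t lsi_hz, uint32_t prescaler,
                                uint32_t reload)
{
    uint64_t ticks = (uint64_t)(reload + 1u) * prescaler;
    return (uint32_t)(ticks * 1000000u / lsi_hz);
}

/* Rough duration of a word-wise copy or zero loop over `bytes` bytes.
 * cycles_per_word is an estimate you supply, not a datasheet figure. */
static uint32_t init_time_us(uint32_t bytes, uint32_t cycles_per_word,
                             uint32_t hclk_hz)
{
    uint64_t cycles = (uint64_t)(bytes / 4u) * cycles_per_word;
    return (uint32_t)(cycles * 1000000u / hclk_hz);
}
```

With those defaults the watchdog expires after about 512 ms. Initializing 128 KiB at an assumed 8 cycles per word on a 16 MHz HSI takes roughly 16 ms, which fits comfortably, but large external-RAM sections, slow memories, or heavy static constructors can change that picture.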
TrustZone (Cortex-M33: STM32L5/U5). With TrustZone enabled, the device boots into the Secure world first: secure SystemInit(), SAU/IDAU region setup, and configuration of which peripherals and memory are non-secure, before the secure firmware branches to the Non-secure image's own reset handler — which then performs its own CRT startup. There are effectively two boot prologues and two vector tables.
Conclusion
The pre-main() prologue on STM32 is short but unforgiving. Hardware loads MSP and PC from the vector table; Reset_Handler runs SystemInit(), copies .data, zeros .bss, runs constructors, then calls main() — all at the default oscillator frequency. Everything else is configuration: VTOR for relocated images, Flash latency and voltage scaling before clock-up, FPU enable before any FP code, FMC before external-RAM data, explicit init for CCM/ECC RAM, watchdog awareness, and a full secure/non-secure split under TrustZone.
When to dig into this layer: any custom bootloader or relocated application, any use of external memory or CCM/TCM, any minimal/from-scratch startup, any low-power or high-frequency design where clock/power ordering matters, and all Cortex-M33 TrustZone projects. When not to: a single-image application using the vendor-provided startup, default linker script, and HAL SystemClock_Config() — the toolchain already does this correctly, and rewriting it by hand introduces more risk than it removes. Treat the startup files as code you can read and audit, not code you must replace.