Not Every Embedded Project Needs an RTOS - A Practical Overhead Analysis

Choosing between an RTOS and bare-metal firmware is one of the most consequential architectural decisions in embedded development — yet it is frequently made on habit or toolchain familiarity rather than technical merit. This post examines the concrete overhead introduced by an RTOS, the determinism advantages of bare-metal approaches, and the specific hardware and application profiles where each solution is genuinely appropriate. Engineers will come away with a framework for making this decision analytically rather than by default.

Introduction

Real-Time Operating Systems have become almost reflexively included in embedded firmware stacks. FreeRTOS, Zephyr, ThreadX, and their peers offer compelling abstractions: task isolation, inter-task communication primitives, and deterministic scheduling. The ecosystem is mature, the tooling is solid, and most engineers who trained in the last decade have used at least one of them.

The problem is that these same engineers often reach for an RTOS on a microcontroller (MCU) running a single control loop at 48 MHz with 32 kB of RAM — a system that would run faster, more predictably, and with fewer failure modes without one. The narrative that "RTOS equals professionalism" obscures the fact that every abstraction layer has a cost, and in constrained embedded systems, those costs are measured in microseconds, kilobytes, and CPU cycles that are not available for actual work.

This post is a structured counterargument. An RTOS is a tool, not a standard. It belongs in a subset of embedded projects, and knowing which subset requires understanding what an RTOS actually does to your system when it runs.


What an RTOS Actually Adds to Your System

An RTOS provides, at minimum:

  • A preemptive or cooperative scheduler that context-switches between tasks based on priority or time slices
  • Synchronization primitives: mutexes, semaphores, event flags, message queues
  • Stack isolation per task, with optional stack overflow detection
  • Tick-based timing infrastructure (typically 1 ms resolution with a hardware timer)

Each of these has a runtime cost. None of it is free.


Quantifying RTOS Overhead

Context Switch Latency

A context switch requires saving the current task's CPU state (registers, program counter, stack pointer) and restoring the next task's state. On a Cortex-M4 running FreeRTOS, this takes approximately 1–4 µs depending on FPU usage and compiler settings. That number seems small until your system has 20 task switches per millisecond and 20–80 µs of every millisecond (2–8% of the CPU) disappears into the scheduler.
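The budget arithmetic can be made explicit with a minimal sketch. The helper name `switch_overhead_permille` and the choice of nanosecond and per-mille units are illustrative assumptions, not part of any RTOS API.

```c
#include <stdint.h>

/* Hypothetical helper (not a FreeRTOS API): share of each millisecond spent
 * in context switches, in per-mille (tenths of a percent).
 * 1 ms = 1,000,000 ns, so per-mille = switches * ns_per_switch / 1000. */
uint32_t switch_overhead_permille(uint32_t switches_per_ms, uint32_t ns_per_switch)
{
    return (switches_per_ms * ns_per_switch) / 1000u;
}
```

With 20 switches per millisecond, a 1 µs switch costs 20 per-mille (2%) of the CPU and a 4 µs switch costs 80 per-mille (8%).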

// FreeRTOS task — each invocation incurs full context save/restore
void sensor_task(void *pvParameters) {
    TickType_t xLastWakeTime = xTaskGetTickCount();
    for (;;) {
        vTaskDelayUntil(&xLastWakeTime, pdMS_TO_TICKS(1)); // 1 ms period
        read_adc();       // ~12 µs
        filter_sample();  // ~8 µs
        // Context switch overhead: ~3 µs on Cortex-M4 @ 168 MHz
    }
}

On a bare-metal ISR-driven equivalent, the same operation has no scheduler involvement — the ADC interrupt fires, the handler executes, and the CPU returns to the main loop or idle state.
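A minimal sketch of that ISR-driven shape follows. Everything in it is an assumption for illustration: `adc_isr`, `take_sample`, and the `fake_adc_data` variable standing in for the ADC data register are hypothetical names, and the simple low-pass filter is a placeholder for whatever `filter_sample()` does in the real firmware.

```c
#include <stdint.h>

/* Stand-in for the ADC data register so the sketch runs on a host.
 * On real hardware this would be a vendor-specific register read. */
static volatile uint16_t fake_adc_data;

static volatile int32_t filtered_sample;  /* written by ISR, read by main loop */
static volatile uint8_t sample_ready;     /* set by ISR, cleared by main loop  */

/* Hypothetical handler; the real name comes from the vendor's vector table. */
void adc_isr(void)
{
    int32_t raw = fake_adc_data;
    /* First-order low-pass filter: move 1/4 of the way toward the new sample. */
    filtered_sample = filtered_sample + (raw - filtered_sample) / 4;
    sample_ready = 1;
}

/* Main-loop side: returns 1 and stores the value if a new sample arrived.
 * Assumes aligned 32-bit loads are atomic, as on Cortex-M. */
int take_sample(int32_t *out)
{
    if (!sample_ready) {
        return 0;
    }
    sample_ready = 0;
    *out = filtered_sample;
    return 1;
}
```

The main loop calls `take_sample()` on each pass and otherwise stays free for deferred work or a sleep instruction; no stack other than the main stack is involved.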

Memory Overhead

Each RTOS task requires its own stack. A minimal FreeRTOS configuration on Cortex-M3/M4 with four tasks typically consumes:

Resource        FreeRTOS (4 tasks)          Bare-metal equivalent
Stack memory    4 × 512 B = 2 kB (minimum)  Shared, single stack
RTOS kernel     ~5–10 kB Flash              0
TCB structures  ~96 B × 4 = 384 B RAM       0
Tick timer      1 hardware timer            Timer reusable

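On an MCU with 32 kB of RAM, the minimum stacks and TCBs above already take about 2.4 kB (roughly 7% of RAM). With stacks of around 1 kB per task, the figure reaches 10–15% of available memory before a single line of application logic runs.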
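The RAM share can be checked with a short calculation. This is a sketch under stated assumptions: the helper names are hypothetical, "32 kB" is taken as 32768 bytes, and only task stacks and TCBs are counted (kernel data, queues, and any heap are left out).

```c
#include <stdint.h>

/* Hypothetical helper: RAM taken by task stacks and TCBs, in bytes. */
uint32_t rtos_task_ram_bytes(uint32_t tasks, uint32_t stack_bytes, uint32_t tcb_bytes)
{
    return tasks * (stack_bytes + tcb_bytes);
}

/* Share of total RAM in per-mille (tenths of a percent), rounded down. */
uint32_t ram_share_permille(uint32_t used_bytes, uint32_t total_bytes)
{
    return (used_bytes * 1000u) / total_bytes;
}
```

Four tasks with 512 B stacks and 96 B TCBs come to 2432 B, or 74 per-mille (7.4%) of 32768 B. Doubling the stacks to 1024 B raises that to 4480 B, or 136 per-mille (13.6%).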

Interrupt-to-Task Latency

When a hardware interrupt needs to wake a task, the latency chain is: ISR fires → posts to queue or semaphore → scheduler runs on ISR exit → context switch to unblocked task. This chain typically adds 5–20 µs beyond the raw ISR latency. For hard real-time events — motor commutation, encoder capture, safety shutdowns — this is often unacceptable.

// RTOS path: ISR → queue → task (total latency: ISR + scheduler + switch)
void EXTI0_IRQHandler(void) {
    BaseType_t xHigherPriorityTaskWoken = pdFALSE;
    xSemaphoreGiveFromISR(xEventSem, &xHigherPriorityTaskWoken);
    portYIELD_FROM_ISR(xHigherPriorityTaskWoken); // scheduler invoked here
}

// Bare-metal path (alternative to the handler above, only one is linked):
// the ISR handles the event directly, with zero scheduling overhead
void EXTI0_IRQHandler(void) {
    handle_event_directly(); // deterministic, bounded, no OS involvement
}

Where Bare-Metal Genuinely Wins

Latency-Critical Control Loops

Motor control (FOC algorithms), power converter control loops (DC-DC at 200–500 kHz switching), and safety-critical analog signal chains operate on tight timing budgets, often 1–10 µs from event to response. An RTOS kernel masks interrupts during its critical sections, which adds jitter to interrupt latency that a bare-metal ISR architecture does not have unless the application disables interrupts itself. On Cortex-M devices, the Nested Vectored Interrupt Controller (NVIC) with proper priority assignment provides deterministic, hardware-managed preemption without any software scheduler involvement.
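The preemption rule that makes this work can be illustrated with a small host-side model. This is a simplified sketch, not CMSIS or vendor code: the `irq_model_*` names are hypothetical, and priority grouping and sub-priorities are ignored. It assumes the Cortex-M convention that a lower numeric priority value is more urgent and that ties go to the lowest interrupt number.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_IRQS 8
#define NO_IRQ   (-1)

typedef struct {
    uint8_t priority[NUM_IRQS]; /* lower value = more urgent */
    bool    pending[NUM_IRQS];
} irq_model_t;

/* Pick the pending IRQ with the lowest priority value.
 * Ties go to the lowest IRQ number because only a strictly lower value wins. */
int irq_model_next(const irq_model_t *m)
{
    int best = NO_IRQ;
    for (int i = 0; i < NUM_IRQS; i++) {
        if (m->pending[i] &&
            (best == NO_IRQ || m->priority[i] < m->priority[best])) {
            best = i;
        }
    }
    return best;
}

/* A pending IRQ preempts the active handler only if it is strictly more urgent.
 * Pass NO_IRQ as active_irq when no handler is running. */
bool irq_model_preempts(const irq_model_t *m, int active_irq)
{
    int next = irq_model_next(m);
    if (next == NO_IRQ) {
        return false;
    }
    if (active_irq == NO_IRQ) {
        return true;
    }
    return m->priority[next] < m->priority[active_irq];
}
```

In this model a commutation interrupt at priority 0 always preempts a communication handler at priority 1, while two handlers at the same priority run back to back. That is the property a bare-metal design relies on when it assigns the tightest deadline the most urgent priority.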

Deeply Resource-Constrained Devices

On MCUs with 8–16 kB of Flash and 4–8 kB of RAM — still a large segment of the market (PIC, AVR, STM8, Cortex-M0+) — an RTOS is not a reasonable option. A well-structured bare-metal firmware with a cooperative state machine or a super-loop with interrupt-driven peripherals is not a compromise; it is the correct architecture.

Certifiable Safety-Critical Systems

IEC 61508, ISO 26262, and DO-178C certifications impose strict requirements on software complexity and coverage. A bare-metal firmware with a static call graph, no dynamic memory allocation, and bounded execution paths is substantially easier to certify than firmware built on an RTOS, even a safety-certified one. Certifying an RTOS kernel itself is a significant engineering and documentation effort; avoiding it where unnecessary is a sound systems engineering decision.


Where an RTOS Earns Its Keep

RTOS overhead is justified when:

  • Multiple independent subsystems must coexist and share CPU time in a managed, predictable way (e.g., BLE stack + sensor fusion + display update)
  • Blocking I/O is unavoidable (USB CDC, TCP/IP via lwIP, file system access via FatFS) — tasks can block without wasting CPU cycles
  • Team development across independent software components benefits from task isolation and defined inter-task interfaces
  • Portability requirements necessitate hardware abstraction that an RTOS HAL provides

Even here, the correct RTOS configuration matters. Disabling unused features, tuning tick rate, minimizing the number of tasks, and using static allocation instead of heap allocation all reduce overhead substantially.
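As a rough illustration, a trimmed FreeRTOS configuration might look like the excerpt below. The macro names are standard `FreeRTOSConfig.h` options; the values are assumptions chosen for the example and need to be adapted to the application, and a real configuration file contains further mandatory settings not shown here.

```c
/* FreeRTOSConfig.h (excerpt) - illustrative values, not a recommendation */
#define configUSE_PREEMPTION              1
#define configTICK_RATE_HZ                100  /* 10 ms tick: fewer tick interrupts than 1 kHz */
#define configMAX_PRIORITIES              4    /* only as many levels as the design uses */
#define configMINIMAL_STACK_SIZE          128  /* in stack words, not bytes */
#define configSUPPORT_STATIC_ALLOCATION   1    /* tasks and queues live in static buffers */
#define configSUPPORT_DYNAMIC_ALLOCATION  0    /* no kernel heap */
#define configUSE_TIMERS                  0    /* no timer service task or its stack */
#define configUSE_MUTEXES                 1
#define configUSE_RECURSIVE_MUTEXES       0
#define configUSE_COUNTING_SEMAPHORES     0
#define configUSE_TRACE_FACILITY          0
#define configCHECK_FOR_STACK_OVERFLOW    2    /* requires vApplicationStackOverflowHook() */
```

The tick rate has to stay fast enough for the shortest delay the application requests; a 100 Hz tick cannot serve a task with a 1 ms period. With dynamic allocation disabled, tasks are created with `xTaskCreateStatic()` and the application supplies the stack and TCB buffers itself.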


Architectural Patterns for Bare-Metal That Scale

"Bare-metal" does not mean "unstructured." Scalable bare-metal firmware typically uses one of:

  • Foreground/background (super-loop + ISRs): ISRs handle time-critical events; the main loop processes deferred work. Works well up to moderate complexity.
  • Cooperative state machines: Each peripheral or subsystem is modeled as an explicit FSM. The main loop runs all state machines in round-robin. Predictable, testable, zero overhead.
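  • Event-driven architecture with a ring buffer: ISRs post lightweight event records to a ring buffer; a dispatcher in the main loop processes them. Scales to hundreds of event types without an RTOS.

// Event dispatcher: bare-metal, zero RTOS, supports complex workflows
#include <stdint.h>

typedef enum { EVT_ADC_DONE, EVT_UART_RX, EVT_TIMEOUT } event_t;

void process_adc(void);       // application handlers, defined elsewhere
void process_uart(void);
void handle_timeout(void);

static volatile event_t event_queue[32];
static volatile uint8_t head = 0, tail = 0;  // free-running indices, masked on access

// Called from an ISR (single producer assumed). Returns 0 if the queue is full.
int post_event(event_t e) {
    if ((uint8_t)(head - tail) >= 32) return 0;
    event_queue[head & 0x1F] = e;
    head++;  // publish only after the slot is written
    return 1;
}

void dispatch_events(void) {
    while (head != tail) {
        event_t e = event_queue[tail++ & 0x1F]; // power-of-2 mask
        switch (e) {
            case EVT_ADC_DONE:  process_adc();    break;
            case EVT_UART_RX:   process_uart();   break;
            case EVT_TIMEOUT:   handle_timeout(); break;
        }
    }
}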
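The cooperative state machine pattern from the list above can be sketched in the same spirit. The example is a hypothetical button debouncer: the names `btn_fsm_t` and `btn_fsm_step`, the 20 ms debounce time, and the millisecond time base are all assumptions for illustration. The key property is that each step returns immediately, so the main loop can call many such machines in round-robin.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBOUNCE_MS 20u  /* assumed debounce interval */

typedef enum { BTN_RELEASED, BTN_DEBOUNCING, BTN_PRESSED } btn_state_t;

typedef struct {
    btn_state_t state;
    uint32_t    t_enter_ms;  /* time at which debouncing started */
} btn_fsm_t;

/* One non-blocking step. Returns true exactly once per confirmed press.
 * now_ms is a free-running millisecond counter; the unsigned subtraction
 * stays correct across counter wrap-around. */
bool btn_fsm_step(btn_fsm_t *f, bool raw_pressed, uint32_t now_ms)
{
    switch (f->state) {
    case BTN_RELEASED:
        if (raw_pressed) {
            f->state = BTN_DEBOUNCING;
            f->t_enter_ms = now_ms;
        }
        break;
    case BTN_DEBOUNCING:
        if (!raw_pressed) {
            f->state = BTN_RELEASED;          /* glitch, start over */
        } else if ((uint32_t)(now_ms - f->t_enter_ms) >= DEBOUNCE_MS) {
            f->state = BTN_PRESSED;
            return true;                      /* press confirmed */
        }
        break;
    case BTN_PRESSED:
        if (!raw_pressed) {
            f->state = BTN_RELEASED;
        }
        break;
    }
    return false;
}
```

Because the function never waits, its worst-case execution time is a handful of instructions, which keeps the timing of the whole main loop easy to bound.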

Engineer Skill Set Comparison

Competency Area          Bare-Metal Engineer                         RTOS Engineer
Hardware understanding   Deep: must manage peripherals directly      Moderate: often abstracted by HAL/drivers
Interrupt architecture   Expert-level NVIC, priority design          Working knowledge, mostly ISR-to-task patterns
Timing analysis          Manual WCET, oscilloscope validation        Relies on scheduler guarantees
Memory management        Manual, static allocation preferred         Stack sizing, heap fragmentation awareness
Debugging tools          Logic analyzer, JTAG, printf tracing        RTOS-aware debugger (e.g., SEGGER SystemView)
Concurrency model        Explicit: the engineer owns all sequencing  Implicit: scheduler manages concurrency
Certification readiness  High: simpler static analysis               Requires certified RTOS or additional effort
Onboarding complexity    Higher initial curve                        Familiar patterns for those with OS background

Neither profile is superior — they address different problem spaces. The concern is engineers who know RTOS patterns deeply but lack the hardware-level intuition required to diagnose a bare-metal timing violation, or conversely, bare-metal specialists who reach for bespoke solutions where an RTOS would reduce integration risk on a complex multi-subsystem product.


Conclusion

The default assumption that RTOS equals better firmware quality is not supported by engineering evidence. An RTOS is a multiplexing mechanism with measurable runtime costs: context switch latency in the 1–4 µs range, static memory overhead in the kilobytes, and interrupt-to-task latency of 5–20 µs or more on loaded systems. For latency-critical control loops, resource-constrained MCUs, or safety-certifiable systems, bare-metal architectures with disciplined ISR design and cooperative state machines consistently outperform RTOS-based equivalents.

Use an RTOS when:

  • The application has multiple independent, blocking workloads that benefit from preemptive scheduling
  • Middleware with blocking semantics (TCP/IP, USB, file systems) is required
  • Team structure and interface isolation justify the abstraction

Prefer bare-metal when:

  • Deterministic latency below 10 µs is required at any point in the system
  • Total RAM is below ~32 kB or Flash below ~64 kB
  • A single control loop or a small number of ISR-driven tasks covers the full application
  • Certification against functional safety standards is on the roadmap
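The two checklists can be encoded as a small helper, shown here as a sketch rather than a definitive rule. The type and function names are hypothetical, and two choices are assumptions not stated above: a project that matches criteria from both lists is reported as a judgment call rather than forced either way, and a project that matches neither list defaults to bare-metal.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { ARCH_BARE_METAL, ARCH_RTOS, ARCH_JUDGMENT_CALL } arch_t;

typedef struct {
    uint32_t min_latency_ns;               /* tightest event-to-response deadline */
    uint32_t ram_kb;
    uint32_t flash_kb;
    bool     needs_blocking_middleware;    /* TCP/IP, USB, file systems */
    bool     many_independent_workloads;
    bool     team_needs_isolation;
    bool     safety_certification_planned;
} project_profile_t;

arch_t recommend_architecture(const project_profile_t *p)
{
    /* Any bare-metal criterion from the checklist. */
    bool bare = p->min_latency_ns < 10000u      /* below 10 us */
             || p->ram_kb < 32u
             || p->flash_kb < 64u
             || p->safety_certification_planned;

    /* Any RTOS criterion from the checklist. */
    bool rtos = p->needs_blocking_middleware
             || p->many_independent_workloads
             || p->team_needs_isolation;

    if (bare && rtos) {
        return ARCH_JUDGMENT_CALL;  /* assumption: conflicts need human review */
    }
    if (rtos) {
        return ARCH_RTOS;
    }
    return ARCH_BARE_METAL;         /* assumption: default when nothing calls for an RTOS */
}
```

A judgment-call result typically points toward a split design, with the sub-10 µs paths handled in high-priority ISRs and the blocking middleware confined to tasks.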

Understanding both approaches — and maintaining the hardware-level proficiency to implement bare-metal correctly — remains a core competency for embedded engineers, regardless of which solution a given project demands.

