On-Chip Interconnects in FPGA Design: AXI and the Art of Not Over-Implementing

On-chip interconnect standards like AXI dominate FPGA design, but full protocol compliance is frequently more than a given block requires. This post compares the common memory-mapped and streaming interconnects, identifies the single handshake primitive they share, and examines when a minimal hand-written interface — particularly around FIFO producers and consumers — is the sound engineering decision rather than instantiating a heavyweight standard end to end.

Note on code examples: All code in this post is written in Verilog (IEEE 1364-2005). The handshake rules discussed are vendor- and tool-neutral and apply equally to SystemVerilog or VHDL implementations.

Introduction

Modern FPGA designs are assembled from blocks that must exchange data without the designer re-inventing the wiring contract each time. That contract is the interconnect: the set of signals and timing rules that let one module hand data to another. The dominant standard in FPGA flows is AMBA AXI, an open, royalty-free interface family defined by Arm and adopted across vendor toolchains as the default integration bus. AXI4 belongs to AMBA 4, the fourth generation of the AMBA specification, and comes in three flavors: AXI4 (memory-mapped, burst-capable), AXI4-Lite (a lightweight non-bursting version), and AXI4-Stream (a unidirectional streaming interface).

The practical problem is that AXI4 in full is a large protocol — five independent channels, transaction IDs, bursts, out-of-order completion, and optional quality-of-service signaling — and most internal data paths use only a fraction of it. Instantiating a full crossbar to move samples from a filter to a packetizer is a recurring source of wasted logic, harder timing closure, and obscured data flow. The recurring engineering question is therefore not "which standard?" but "how much of the standard does this link actually need, and what is the cost of the parts left out?"

The interconnect landscape: two families, one primitive

On-chip interconnects split into two categories. Memory-mapped interfaces address a target by an address and carry read/write transactions to it — the model used for register banks, memories, and DMA (Direct Memory Access). Streaming interfaces carry an ordered sequence of data with no addresses at all — the model used for signal-processing pipelines, where backpressure (a downstream block signaling "not ready, hold off") matters more than addressing.

What unifies almost all of them is a single primitive: the VALID/READY handshake. The data source asserts a valid signal when it is presenting data; the destination asserts a ready signal when it can accept data; a transfer occurs on exactly the clock edge where both are high simultaneously. AXI calls these TVALID/TREADY (streaming) or xVALID/xREADY per channel; Avalon expresses the same idea with valid/ready (or an inverted waitrequest); Wishbone uses STB/ACK with a STALL signal for its pipelined mode. Understanding this one mechanism is most of what is needed to reason about — and to hand-implement — any of these buses.

The families differ mainly in how much they wrap around that primitive:

| Interconnect  | Model         | Channels               | Backpressure signal     | Bursts          | Typical use                          | Relative complexity |
|---------------|---------------|------------------------|-------------------------|-----------------|--------------------------------------|---------------------|
| AXI4          | Memory-mapped | 5 (AW, W, B, AR, R)    | VALID/READY per channel | Yes             | High-throughput masters, DMA, memory | High                |
| AXI4-Lite     | Memory-mapped | 5 (single beat)        | VALID/READY per channel | No              | Control/status register maps         | Low–medium          |
| AXI4-Stream   | Streaming     | 1                      | TVALID/TREADY           | n/a             | DSP and data pipelines               | Low                 |
| APB           | Memory-mapped | 1 (phased)             | PREADY                  | No              | Low-bandwidth peripherals            | Very low            |
| Avalon-MM     | Memory-mapped | 1                      | waitrequest             | Yes             | Intel IP integration                 | Medium              |
| Avalon-ST     | Streaming     | 1                      | ready/valid             | n/a             | Intel data pipelines                 | Low                 |
| Wishbone (B4) | Memory-mapped | 1 (classic, pipelined) | ACK / STALL             | Block transfers | Open-source IP cores                 | Low                 |

A few orientation notes. AXI3 was introduced in 2003 with AMBA 3; AMBA 4 in 2010 defined AXI4, AXI4-Lite, and AXI4-Stream; AMBA 5 with AXI5 was released later, adding atomicity, data protection, and cache operations. APB (Advanced Peripheral Bus) is the oldest and simplest AMBA member — a low-frequency, non-pipelined peripheral bus, appropriate where bandwidth is negligible and gate count is the priority. Wishbone is a public-domain specification maintained by the OpenCores community; its current B4 revision was released in 2010 and adds a pipelined traffic mode, making it a common choice in open-source cores. Avalon is Intel/Altera's equivalent family and is the path of least resistance inside that vendor's platform builder. AXI4-Stream, despite the AXI name, carries no addresses and is structurally closer to Avalon-ST than to AXI4.

When the full protocol is justified — and when it is not

The deciding factor is what the link's responsibilities actually are, not which standard is fashionable.

AXI4 in full earns its complexity when a block must behave as a general-purpose master against shared memory: issuing bursts to amortize address overhead, keeping multiple transactions outstanding to hide memory latency, using transaction IDs to allow out-of-order completion, or arbitrating among several masters through a crossbar (a switch fabric routing several masters to several slaves). A DMA engine feeding an external memory controller is the canonical case — there, IDs, bursts, and the separation of address and data channels translate directly into sustained throughput.

AXI4-Lite is the right scope for register access. A control/status register bank does not need bursts or reordering; it needs a clean address-decoded write and read path. AXI4-Lite keeps the five-channel structure for tool compatibility but drops bursting, which is why it is the standard interface for a block's configuration registers.

A streaming interface is sufficient — and preferable — for ordered data flow. A filter chain, a video line buffer, or a packet path has no addresses to express. Forcing it through a memory-mapped bus adds address decoding and response channels that carry no information. AXI4-Stream or a bare FIFO interface captures exactly what such a path needs: data, a valid flag, a ready flag, and optionally a TLAST marker for packet boundaries.
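As an illustration of how little a packet boundary costs on such a path, the following is a minimal sketch of a TLAST generator. The module name tlast_marker, the fixed packet length LEN, and the counter width CW are assumptions made for this example and do not come from any standard or vendor IP. Data, valid, and ready pass straight through; the only state is a beat counter.

```verilog
// Hypothetical helper (not from any spec or vendor library):
// marks every LEN-th accepted beat as the last beat of a packet.
// IEEE 1364-2005 Verilog. CW must be wide enough to hold LEN-1.
module tlast_marker #(parameter W = 32, parameter LEN = 4, parameter CW = 8) (
    input              clk,
    input              rst_n,
    // upstream stream without packet markers
    input  [W-1:0]     s_data,
    input              s_valid,
    output             s_ready,
    // downstream stream with a TLAST-style marker
    output [W-1:0]     m_data,
    output             m_valid,
    output             m_last,
    input              m_ready
);
    reg [CW-1:0] count;              // index of the current beat in the packet

    // The handshake passes through untouched; VALID does not look at READY.
    assign m_data  = s_data;
    assign m_valid = s_valid;
    assign s_ready = m_ready;

    // The marker depends only on the registered counter, so it stays
    // stable for as long as the current beat is stalled.
    assign m_last  = (count == LEN-1);

    wire beat = m_valid & m_ready;   // fire condition

    always @(posedge clk) begin
        if (!rst_n)
            count <= {CW{1'b0}};
        else if (beat)
            count <= m_last ? {CW{1'b0}} : count + 1'b1;
    end
endmodule
```

Because the counter advances only on the fire condition, stalls of any length do not shift the packet boundary.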

The cases where full implementation is unnecessary share a pattern: a single producer talks to a single consumer, in order, on the same clock, with no addressing. For such point-to-point internal links, most of AXI4 is inert — there is one master and one slave, so IDs and arbitration do nothing; data is consumed in order, so out-of-order machinery does nothing; the link is internal, so QoS and region signals do nothing. Carrying that unused signaling costs LUTs (Look-Up Tables — the FPGA's basic logic elements), complicates timing closure, and makes the data flow harder to read in simulation.

Rational "manual" implementation: respect the handshake, drop the rest

Hand-writing an interface is reasonable precisely when the link is point-to-point and in-order. The discipline that keeps a hand-written interface correct comes down to honoring the handshake contract, even when the surrounding protocol is stripped away.

The contract has one rule that is non-negotiable and is the source of most interoperability defects:

A source must not wait for READY before asserting VALID. The destination may wait for VALID before asserting READY, or it may assert READY first — but VALID must never depend combinationally on READY.

The reason is deadlock avoidance: if a source waited for READY and a destination waited for VALID, and that dependency closed a combinational loop, neither side could ever move first. A secondary rule follows: once VALID is asserted, VALID and the data it accompanies must both remain stable until the handshake completes; an offer may not be retracted or altered midway.
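The stability rule is easy to check mechanically in simulation. The following is a sketch of a passive monitor in plain Verilog; the module name vr_monitor and its errors counter are hypothetical, introduced only for this example. It flags a retracted VALID and data that changes while a beat is stalled. It cannot detect a combinational dependence of VALID on READY, which has to be established by inspecting the source logic.

```verilog
// Hypothetical simulation-only monitor for one VALID/READY link.
// Not synthesizable as written; intended to sit beside a design in a testbench.
// IEEE 1364-2005 Verilog.
module vr_monitor #(parameter W = 32) (
    input          clk,
    input          rst_n,
    input [W-1:0]  data,
    input          valid,
    input          ready
);
    reg         pending;   // previous cycle offered a beat that was not accepted
    reg [W-1:0] data_q;    // data that was offered in the previous cycle
    integer     errors;    // running count of violations seen

    initial errors = 0;

    always @(posedge clk) begin
        if (!rst_n) begin
            pending <= 1'b0;
        end else begin
            // Rule: VALID must stay high until the beat is accepted.
            if (pending && !valid) begin
                errors = errors + 1;
                $display("%0t: VALID retracted before handshake", $time);
            end
            // Rule: data must not change while the beat is stalled.
            if (pending && valid && (data !== data_q)) begin
                errors = errors + 1;
                $display("%0t: data changed while stalled", $time);
            end
            pending <= valid & ~ready;
            data_q  <= data;
        end
    end
endmodule
```

Attaching one instance to each hand-written link costs nothing in hardware and tends to surface violations in block-level simulation rather than in integration.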

In Verilog, the transfer condition itself is trivial — the whole protocol reduces to one expression:

// A single VALID/READY beat. IEEE 1364-2005 Verilog.
// A transfer happens on the rising edge where both are high.
wire beat = m_valid & s_ready;   // "fire" condition, used everywhere below

always @(posedge clk) begin
    if (beat)
        data_reg <= m_data;      // capture only on a successful handshake
end

The discipline is in generating valid and ready without violating the rules above, and that is where most hand-rolled interfaces go wrong. The safe building block is the skid buffer (a form of register slice): a one-deep buffer that lets the interface register its READY path for timing while still absorbing backpressure without dropping data. It is the minimal correct way to break a long combinational READY chain on a stream.

// Skid buffer / register slice for a generic VALID/READY stream.
// Registers up_ready, breaking the combinational READY path from
// downstream to upstream. VALID and data still pass straight through
// when the skid slot is empty, and VALID never depends on READY.
// IEEE 1364-2005 Verilog.
module skid_buffer #(parameter W = 32) (
    input              clk,
    input              rst_n,
    // upstream (we are the destination)
    input  [W-1:0]     up_data,
    input              up_valid,
    output             up_ready,
    // downstream (we are the source)
    output [W-1:0]     dn_data,
    output             dn_valid,
    input              dn_ready
);
    reg [W-1:0] skid_data;
    reg         skid_full;   // a beat is parked in the skid register

    // We can accept upstream data whenever the skid slot is empty.
    assign up_ready = ~skid_full;

    // Present skid data if parked, otherwise pass upstream through.
    assign dn_valid = skid_full | up_valid;
    assign dn_data  = skid_full ? skid_data : up_data;

    always @(posedge clk) begin
        if (!rst_n) begin
            skid_full <= 1'b0;
        end else begin
            // Park an incoming beat only if downstream can't take it now.
            if (up_valid & up_ready & ~dn_ready)
                {skid_full, skid_data} <= {1'b1, up_data};
            // Release the parked beat once downstream accepts it.
            else if (skid_full & dn_ready)
                skid_full <= 1'b0;
        end
    end
endmodule

This block is the honest replacement for "just register READY" (or the output), which silently breaks the protocol by dropping a beat whenever downstream stalls in the cycle that beat is already in flight.
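The skid buffer above registers the READY path and leaves dn_valid and dn_data combinational. When it is the forward path that limits timing, the complementary stage registers VALID and data instead. The following is a sketch of that stage; the name fwd_slice is an assumption for this example. Here up_ready still depends combinationally on dn_ready, so chaining this stage with the skid buffer is one way to register both directions.

```verilog
// Hypothetical forward register slice: registers VALID and data,
// leaves READY combinational. IEEE 1364-2005 Verilog.
module fwd_slice #(parameter W = 32) (
    input              clk,
    input              rst_n,
    // upstream (we are the destination)
    input  [W-1:0]     up_data,
    input              up_valid,
    output             up_ready,
    // downstream (we are the source)
    output reg [W-1:0] dn_data,
    output reg         dn_valid,
    input              dn_ready
);
    // Accept a new beat when the output register is empty,
    // or when the beat it holds is being taken this cycle.
    assign up_ready = ~dn_valid | dn_ready;

    always @(posedge clk) begin
        if (!rst_n) begin
            dn_valid <= 1'b0;
        end else if (up_ready) begin
            dn_valid <= up_valid;          // becomes empty if nothing is offered
            if (up_valid)
                dn_data <= up_data;        // load only on a real upstream beat
        end
        // When up_ready is low the register holds, so the offered
        // beat stays stable until downstream accepts it.
    end
endmodule
```

dn_valid is a flip-flop output and therefore cannot depend combinationally on dn_ready, which keeps the non-negotiable rule intact.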

FIFO producers and consumers: the common case in concrete terms

Most internal data movement reduces to a producer writing into a FIFO (First-In, First-Out buffer) and a consumer reading from it. The FIFO is the natural place where the two handshake rules become physical: the producer's ready is simply not full, and the consumer sees valid as not empty. Expressing it that way makes the interface trivially interoperable with any VALID/READY peer, including AXI4-Stream, without implementing the rest of AXI.

// Synchronous FIFO exposing VALID/READY on both ports.
// Producer side: wr_ready = ~full.  Consumer side: rd_valid = ~empty.
// IEEE 1364-2005 Verilog. Depth is a power of two.
module fifo_vr #(parameter W = 32, parameter AW = 4) (  // AW=4 -> depth 16
    input              clk,
    input              rst_n,
    // producer port (we are the destination)
    input  [W-1:0]     wr_data,
    input              wr_valid,
    output             wr_ready,
    // consumer port (we are the source)
    output [W-1:0]     rd_data,
    output             rd_valid,
    input              rd_ready
);
    localparam DEPTH = (1 << AW);
    reg [W-1:0] mem [0:DEPTH-1];
    reg [AW:0]  wptr, rptr;            // extra MSB distinguishes full vs empty

    wire empty = (wptr == rptr);
    wire full  = (wptr[AW] != rptr[AW]) && (wptr[AW-1:0] == rptr[AW-1:0]);

    assign wr_ready = ~full;           // producer may push when not full
    assign rd_valid = ~empty;          // consumer sees data when not empty
    assign rd_data  = mem[rptr[AW-1:0]];

    wire do_wr = wr_valid & wr_ready;  // handshake fire conditions
    wire do_rd = rd_valid & rd_ready;

    always @(posedge clk) begin
        if (!rst_n) begin
            wptr <= 0; rptr <= 0;
        end else begin
            if (do_wr) begin
                mem[wptr[AW-1:0]] <= wr_data;
                wptr <= wptr + 1'b1;
            end
            if (do_rd)
                rptr <= rptr + 1'b1;
        end
    end
endmodule

A producer built against this FIFO needs no protocol awareness beyond the fire condition — it asserts wr_valid when it has a sample and advances its own state only when wr_valid & wr_ready is true. A consumer mirrors this: it presents rd_ready when it can absorb a sample and acts only on rd_valid & rd_ready. The same two-line condition governs both ends, which is exactly why this style drops into an AXI4-Stream context unchanged — the FIFO's ports are a stream interface under different signal names.
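The producer and consumer described above can be sketched as follows. Both modules are illustrative: the names ramp_producer and sum_consumer, the incrementing test pattern, and the enable throttle input are assumptions made for this example. Each side touches its own state only on the fire condition.

```verilog
// Hypothetical producer: offers an incrementing sample on a VALID/READY port.
// IEEE 1364-2005 Verilog.
module ramp_producer #(parameter W = 32) (
    input              clk,
    input              rst_n,
    output [W-1:0]     wr_data,
    output             wr_valid,
    input              wr_ready
);
    reg [W-1:0] sample;
    reg         valid_r;

    assign wr_data  = sample;
    assign wr_valid = valid_r;            // registered: never looks at wr_ready

    always @(posedge clk) begin
        if (!rst_n) begin
            sample  <= {W{1'b0}};
            valid_r <= 1'b0;
        end else begin
            valid_r <= 1'b1;              // this source always has a sample
            if (wr_valid & wr_ready)
                sample <= sample + 1'b1;  // advance only when the beat fires
        end
    end
endmodule

// Hypothetical consumer: accumulates every sample it accepts.
module sum_consumer #(parameter W = 32) (
    input              clk,
    input              rst_n,
    input  [W-1:0]     rd_data,
    input              rd_valid,
    output             rd_ready,
    input              enable,            // assumed external throttle
    output reg [W-1:0] sum
);
    assign rd_ready = enable;             // READY may be asserted before VALID

    always @(posedge clk) begin
        if (!rst_n)
            sum <= {W{1'b0}};
        else if (rd_valid & rd_ready)
            sum <= sum + rd_data;         // act only on a completed handshake
    end
endmodule
```

Because both ports are plain VALID/READY, the two modules can be wired to each other directly or with the FIFO between them, and the sequence of samples the consumer sees is the same either way.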

Two implementation cautions recur with FIFO-backed links. First, single-clock versus dual-clock: the FIFO above is synchronous. The moment producer and consumer run on different clocks, the pointer comparison becomes a clock-domain-crossing problem requiring Gray-coded pointers and synchronizers — a substantively harder design that should not be improvised. Second, almost-full thresholds: a producer with internal pipeline latency cannot stop on the cycle full asserts, because data already in flight has nowhere to go. Exposing an "almost-full" flag a few entries before the boundary is the standard remedy, and omitting it is a frequent cause of dropped data under backpressure.
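Both cautions reduce to small pieces of pointer arithmetic, sketched below under stated assumptions. The module names fifo_level and bin2gray and the MARGIN parameter are hypothetical. fifo_level derives occupancy and an almost-full flag from the same (AW+1)-bit pointers used by the synchronous FIFO above. bin2gray shows only the encoding step of a dual-clock design; it is not a clock-domain crossing on its own, because the synchronizers and the full/empty comparison in each domain are still required.

```verilog
// Hypothetical occupancy / almost-full logic for a single-clock FIFO
// whose pointers carry one extra MSB. IEEE 1364-2005 Verilog.
// MARGIN should cover the producer's pipeline latency and be <= depth.
module fifo_level #(parameter AW = 4, parameter MARGIN = 4) (
    input  [AW:0] wptr,
    input  [AW:0] rptr,
    output [AW:0] level,         // number of entries currently stored
    output        almost_full    // MARGIN or fewer free slots remain
);
    localparam DEPTH = (1 << AW);

    // Subtraction wraps modulo 2^(AW+1), so it stays correct after
    // either pointer has rolled over.
    assign level       = wptr - rptr;
    assign almost_full = (level >= DEPTH - MARGIN);
endmodule

// Binary-to-Gray encoding: adjacent counts differ in exactly one bit,
// which is what makes a pointer safe to sample through a synchronizer.
module bin2gray #(parameter N = 5) (
    input  [N-1:0] bin,
    output [N-1:0] gray
);
    assign gray = bin ^ (bin >> 1);
endmodule
```

A producer with internal latency would gate its issue logic on almost_full rather than on wr_ready alone, so that beats already in flight still find room.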

By contrast, the same producer addressing a register through AXI4-Lite would have to drive a write-address channel, a write-data channel, and then wait on a write-response channel — three handshakes and a response code to move one word. That is the right cost for a configuration register that must report errors; it is unjustified overhead for a data sample whose only requirement is "in order, don't lose it." Recognizing that asymmetry is the core of choosing how much protocol to implement.
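To make the three handshakes concrete, the following is a sketch of a master that performs one AXI4-Lite write at a time. The module name axil_write_once and the local start/addr/data/done/error interface are assumptions for this example; the channel signal names follow the usual AXI convention. The sketch omits the read channels and any timeout, fixes AWPROT to zero, and always writes all byte lanes.

```verilog
// Hypothetical single-write AXI4-Lite master (write channels only).
// IEEE 1364-2005 Verilog.
module axil_write_once #(parameter ADDR_W = 8, parameter DATA_W = 32) (
    input                   clk,
    input                   rst_n,
    // local request interface (assumed, not part of AXI)
    input                   start,     // pulse while idle to launch a write
    input  [ADDR_W-1:0]     addr,
    input  [DATA_W-1:0]     data,
    output reg              done,      // one-cycle pulse when the response arrives
    output reg              error,     // set with done if BRESP was not OKAY
    // write address channel
    output reg [ADDR_W-1:0] awaddr,
    output [2:0]            awprot,
    output reg              awvalid,
    input                   awready,
    // write data channel
    output reg [DATA_W-1:0] wdata,
    output [DATA_W/8-1:0]   wstrb,
    output reg              wvalid,
    input                   wready,
    // write response channel
    input  [1:0]            bresp,
    input                   bvalid,
    output reg              bready
);
    assign awprot = 3'b000;
    assign wstrb  = {(DATA_W/8){1'b1}};

    wire idle = ~awvalid & ~wvalid & ~bready;

    always @(posedge clk) begin
        if (!rst_n) begin
            awvalid <= 1'b0; wvalid <= 1'b0; bready <= 1'b0;
            done    <= 1'b0; error  <= 1'b0;
        end else begin
            done <= 1'b0;
            if (idle && start) begin
                awaddr  <= addr;  wdata  <= data;
                awvalid <= 1'b1;  wvalid <= 1'b1;  bready <= 1'b1;
            end
            // Handshake 1 and 2: address and data complete independently,
            // in either order; each VALID drops on its own beat.
            if (awvalid && awready) awvalid <= 1'b0;
            if (wvalid  && wready)  wvalid  <= 1'b0;
            // Handshake 3: the response closes the transaction.
            if (bready && bvalid) begin
                bready <= 1'b0;
                done   <= 1'b1;
                error  <= (bresp != 2'b00);
            end
        end
    end
endmodule
```

Roughly forty lines of state and three independent fire conditions move a single word, against the one fire condition of the FIFO port.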

Conclusion

On-chip interconnects look like many distinct standards but rest on one shared mechanism — the VALID/READY handshake — wrapped in varying amounts of addressing, bursting, and arbitration. The practical decision is scoping, not branding:

  • Use full AXI4 when a block is a memory master that genuinely needs bursts, multiple outstanding transactions, IDs, or arbitration — typically DMA against shared memory.
  • Use AXI4-Lite for control/status register maps, where the response channel's error reporting is worth its modest overhead.
  • Use a streaming interface (AXI4-Stream, Avalon-ST, or a bare FIFO port) for ordered, address-less data flow, which covers the majority of internal data paths.
  • Hand-write a minimal interface when the link is point-to-point, in-order, and single-clock — provided the handshake rules are honored exactly: VALID never depends combinationally on READY, offered data stays stable until the beat fires, and timing is broken with a skid buffer rather than a naive output register.

A reduced, hand-written interface is the appropriate choice when the unused parts of a full standard would only add inert logic, harder timing closure, and obscured data flow. It becomes the wrong choice the moment requirements grow beyond that envelope — multiple masters, out-of-order completion, crossing clock domains, or third-party IP that expects strict protocol compliance. At that boundary, a verified vendor interconnect or a compliant IP core is the lower-risk path, because the cost of a subtly non-compliant hand-rolled bus surfaces late, in integration, where it is most expensive to diagnose.

References / Further Reading

  1. Arm Ltd., AMBA AXI and ACE Protocol Specification, ARM IHI 0022 (Issue H.c). Defines AXI3, AXI4, AXI4-Lite, AXI5, and the ACE coherency extensions; AXI4-Stream is specified separately in the AMBA 4 AXI4-Stream Protocol Specification (ARM IHI 0051). Available: developer.arm.com.
  2. OpenCores, WISHBONE System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores, Revision B4, 2010. Public-domain specification; classic and pipelined cycle modes. Available: opencores.org.
  3. AMD (Xilinx), AXI — Adaptive SoC and FPGA Intellectual Property, vendor documentation on AXI4/AXI4-Lite/AXI4-Stream adoption in FPGA flows. Available: amd.com.
  4. Intel Corporation, Avalon Interface Specifications (Avalon-MM and Avalon-ST), vendor reference manual.

(APB is documented in the Arm AMBA APB Protocol Specification, available from developer.arm.com.)
