Reliability, Availability, and Serviceability

Overview

Reliability, Availability, and Serviceability (RAS) aims to increase the robustness of a system by detecting hardware errors, recording them and correcting them where possible. Arm’s RAS extension provides this robustness to both the processor and system architectures.

RAS techniques help reduce unplanned outages by:

  • Detecting and correcting transient errors before they lead to application or system failure.

  • Identifying and replacing failing components.

  • Predicting failures in advance, enabling proactive maintenance during planned downtime.

In system software, there are two primary models for error handling:

  • Firmware-First Handling (FFH)

    RAS events are initially reported to firmware. Firmware reads the RAS error record provided by the hardware, translates the information into an error report, and notifies the operating system of the error. The operating system then takes the recovery actions.

  • Kernel-First Handling (KFH)

    The operating system kernel is responsible for directly handling RAS events and managing recovery actions.

Currently, this reference design implements only FFH.

Primary Compute CPU Core RAS

This implementation targets RAS support for the Cortex-A720AE CPU cores.

The RAS extension implemented for Cortex-A720AE includes cache protection. It protects against RAM bit-cell errors that could result in incorrect data being stored or read.

The Cortex-A720AE RAS includes the following features:

  • Cache protection with Single Error Detect (SED) parity on the functional RAMs that contain only clean data. This includes the L1 instruction cache tag, L1 instruction cache data, and the Memory Management Unit (MMU) RAMs.

  • Cache protection with Single Error Correct, Double Error Detect (SECDED) error-correcting code (ECC) on the functional RAMs that contain dirty data. This includes the L1 data cache tag, L1 data cache data, L2 cache tag, L2 cache data, and the L2 Transaction Queue (TQ) RAMs.

The core can continue operating in the presence of a single-bit RAM error.

Error types

For a RAS error, the following types can be recorded:

  • Corrected Error (CE)

    The error was detected and corrected. It no longer affects the node’s state and has not been silently propagated. The node continues to operate normally.

  • Deferred Error (DE)

    The error was detected but not corrected and has been deferred. It has not been silently propagated and may remain latent in the system.

  • Uncorrected Error (UE)

    The error was detected but neither corrected nor deferred. It remains latent in the system. Uncorrected errors can be further classified as:

    • Unrecoverable: The error has not been silently propagated.

    • Uncontainable: The error may have been silently propagated. If isolation is not possible, a system shutdown is required to prevent catastrophic failure.

The following diagram illustrates the taxonomy of RAS error types.

Fig. 44 RAS Taxonomy of Error Types
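
The taxonomy above can be sketched as a small classifier over an error record status value. This is a minimal illustration, not the reference design's code: the bit positions are assumptions based on the Arm RAS ERR<n>STATUS register layout and should be checked against the Cortex-A720AE TRM, and the names classify and ErrorType are hypothetical.

```python
from enum import Enum

# Assumed ERR<n>STATUS bit positions; verify against the Arm RAS
# specification and the Cortex-A720AE TRM before relying on them.
STATUS_V = 1 << 30        # record holds a valid error
STATUS_UE = 1 << 29       # uncorrected error recorded
STATUS_CE_SHIFT = 24      # 2-bit corrected-error field
STATUS_DE = 1 << 23       # deferred error recorded
STATUS_UET_SHIFT = 20     # 2-bit uncorrected-error-type field
UET_UNCONTAINABLE = 0b00  # assumed encoding
UET_UNRECOVERABLE = 0b01  # assumed encoding


class ErrorType(Enum):
    NONE = "no valid error"
    CORRECTED = "Corrected Error"
    DEFERRED = "Deferred Error"
    UNRECOVERABLE = "Uncorrected Error, Unrecoverable"
    UNCONTAINABLE = "Uncorrected Error, Uncontainable"
    OTHER_UNCORRECTED = "Uncorrected Error, sub-type not covered here"


def classify(status: int) -> ErrorType:
    """Return the most severe error type recorded in a status value."""
    if not status & STATUS_V:
        return ErrorType.NONE
    if status & STATUS_UE:
        uet = (status >> STATUS_UET_SHIFT) & 0b11
        if uet == UET_UNCONTAINABLE:
            return ErrorType.UNCONTAINABLE
        if uet == UET_UNRECOVERABLE:
            return ErrorType.UNRECOVERABLE
        return ErrorType.OTHER_UNCORRECTED
    if status & STATUS_DE:
        return ErrorType.DEFERRED
    if (status >> STATUS_CE_SHIFT) & 0b11:
        return ErrorType.CORRECTED
    return ErrorType.NONE
```

The checks run from most to least severe, so a record that holds both a corrected and an uncorrected indication is reported as uncorrected.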


Error processing

The reference design supports handling two types of RAS interrupts:

  • Fault Handling Interrupt (FHI)

  • Error Recovery Interrupt (ERI)

These interrupts are routed to both the Primary Compute and the Safety Island. Each side uses different handling logic.

The following diagram shows the overall RAS error handling process.

Fig. 45 RAS Error Processing


Primary Compute error processing

TF-A

When a corrected or deferred error occurs, the TF-A RAS handler (running at EL3) receives the interrupt. The handler takes the following actions in sequence:

  • Reads the error record, logs the details, and clears the relevant registers.

  • Generates a Common Platform Error Record (CPER) and writes it to a memory buffer shared with Linux. CPER is the standardized UEFI format for hardware error information. See UEFI Common Platform Error Record (CPER) for the CPER format definition.

  • Sets a pending interrupt, SPI 89, to notify Linux to handle the error report after TF-A returns from EL3.

SPI 89 is used as a software notification interrupt between TF-A and Linux for Firmware-First Handling. It is a platform-reserved SPI rather than an interrupt line driven by a dedicated hardware device. This allows the notification path to avoid conflicting with SPIs that are already assigned to hardware blocks, while still giving Linux a dedicated interrupt for RAS error report processing.
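
The three handler steps can be modelled in a few lines. This is a simplified simulation under stated assumptions, not TF-A code: ErrorRecord, Platform, and handle_ras_interrupt are hypothetical names, the dictionary written to the shared buffer is only a placeholder for a real CPER, and a Python set stands in for the interrupt controller's pending state.

```python
from dataclasses import dataclass, field

NOTIFY_SPI = 89  # notification SPI named in the text above


@dataclass
class ErrorRecord:
    """Hypothetical stand-in for one hardware RAS error record."""
    status: int = 0
    addr: int = 0


@dataclass
class Platform:
    """Hypothetical model of the state the EL3 handler touches."""
    record: ErrorRecord
    shared_buffer: list = field(default_factory=list)  # memory shared with Linux
    pending_spis: set = field(default_factory=set)     # pending interrupts
    log: list = field(default_factory=list)


def handle_ras_interrupt(platform: Platform) -> dict:
    rec = platform.record

    # Step 1: read the error record, log it, and clear the registers.
    report = {"status": rec.status, "addr": rec.addr}
    platform.log.append(f"RAS error: status={rec.status:#x} addr={rec.addr:#x}")
    rec.status = 0
    rec.addr = 0

    # Step 2: publish the report (placeholder for a CPER) to shared memory.
    platform.shared_buffer.append(report)

    # Step 3: pend the notification SPI so that Linux runs its handler
    # once execution returns from EL3.
    platform.pending_spis.add(NOTIFY_SPI)
    return report
```

The ordering matters in this model: the report is written before the interrupt is pended, so the consumer never observes the notification without a report to read.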

Linux

The upstream Linux kernel supports RAS FFH on ACPI-based platforms but not on Device Tree based platforms. In the reference design, the Generic Hardware Error Source (GHES) code of the ACPI-based RAS FFH framework is refactored: the general-purpose logic for handling the Generic Error Status buffer is factored out into a common part that is shared by ACPI and Device Tree, and a new RAS FFH driver is introduced for Device Tree based platforms.

After TF-A returns control to the Normal World, Linux receives the SPI interrupt, which indicates that an error report is pending. The RAS FFH driver handles the interrupt and takes the following actions in sequence:

  • Reads the CPER error report from the shared memory buffer and parses the content.

  • Invokes the Generic Error Status handler to process the error report.

  • Generates a trace event for user-space logging.
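
As an illustration of the first two steps, the sketch below decodes the fixed header of a Generic Error Status Block as defined by ACPI: five little-endian 32-bit fields. It assumes that the shared buffer begins with such a header, which this document does not state explicitly; parse_status_block and describe are hypothetical helper names, and the real driver is kernel C code rather than Python.

```python
import struct
from typing import NamedTuple

# Five little-endian 32-bit fields, 20 bytes in total.
HEADER = struct.Struct("<5I")

# Severity encodings used by the ACPI Generic Error Status Block.
SEVERITY_NAMES = {0: "recoverable", 1: "fatal", 2: "corrected", 3: "none"}


class StatusBlock(NamedTuple):
    block_status: int
    raw_data_offset: int
    raw_data_length: int
    data_length: int      # size of the error data entries that follow
    error_severity: int


def parse_status_block(buf: bytes) -> StatusBlock:
    """Decode the fixed header at the start of the shared buffer."""
    if len(buf) < HEADER.size:
        raise ValueError("buffer too short for a status block header")
    return StatusBlock(*HEADER.unpack_from(buf, 0))


def describe(block: StatusBlock) -> str:
    """Map the numeric severity to a readable name."""
    return SEVERITY_NAMES.get(block.error_severity, "unknown")
```

For example, a header packed with a data length of 72 and a severity of 2 parses to a block whose describe() result is "corrected".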

rasdaemon

rasdaemon is an upstream Linux user-space daemon that collects, decodes, and logs hardware RAS error events from the kernel. It is integrated into the reference design. See the rasdaemon upstream project.

rasdaemon listens to hardware error notifications reported by the kernel via the event tracing interface. By decoding and persisting these events in a database, rasdaemon makes complex low-level RAS information accessible for diagnostics and long-term analysis.

Safety Island error processing

When an uncorrected error occurs, Safety Island Cluster 0 receives an ERI on its GIC SPI path. This path is separate from the Primary Compute path, where TF-A handles CPU faults using the FHI delivered on the Application Processor (AP) GIC PPIs. Because the AP and the Safety Island use different GIC interrupt paths, the Safety Island can process the uncorrected error independently, without a handshake with the AP. The Safety Island handler reads the error record, assesses the impact, and performs the appropriate action:

  • If the error is uncorrected but containable, report it to the SSU as a non-critical error.

  • If the error is uncorrected and uncontainable, report it to the SSU as a critical error.

  • If the error is raised by the TFP mechanism running on the Cortex-A720AE, log only a diagnostic message.
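
The decision above reduces to a small selection function. The sketch below is illustrative only: ErrorClass, Action, and select_action are hypothetical names, and checking the TFP case first is an assumption, because the text does not say which rule takes precedence.

```python
from enum import Enum, auto


class ErrorClass(Enum):
    UNCORRECTED_CONTAINABLE = auto()
    UNCORRECTED_UNCONTAINABLE = auto()


class Action(Enum):
    SSU_NON_CRITICAL = auto()
    SSU_CRITICAL = auto()
    LOG_ONLY = auto()


def select_action(error_class: ErrorClass, raised_by_tfp: bool = False) -> Action:
    """Pick the Safety Island response for an uncorrected error."""
    # Assumption: errors raised by the TFP mechanism are only logged,
    # regardless of their class.
    if raised_by_tfp:
        return Action.LOG_ONLY
    if error_class is ErrorClass.UNCORRECTED_UNCONTAINABLE:
        return Action.SSU_CRITICAL
    return Action.SSU_NON_CRITICAL
```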

Error injection

Error injection uses detection and reporting registers to simulate errors for testing error handling mechanisms. The Cortex-A720AE core supports injection of the following error types:

  • Corrected Error

    A Corrected Error is generated for a single-bit ECC error on an L1 data cache access.

  • Deferred Error

    A Deferred Error is generated for a double-bit ECC error on eviction of a cache line from the L1 cache to the L2 cache, or as a result of a snoop on the L1 cache.

  • Uncontainable Error

    An Uncontainable Error is generated for a double-bit ECC error on the L1 and L2 tag RAM following an eviction.
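
To make the register-based mechanism more concrete, the sketch below composes a control value for each of the three injectable error types. It assumes that injection is driven through the Arm RAS pseudo-fault generation control register (ERR<n>PFGCTL) and that the bit positions shown match the Cortex-A720AE; this document states neither, so treat both as assumptions to verify against the TRM. build_pfgctl is a hypothetical helper, and the value would have to be written by privileged Secure World code, not by Python.

```python
# Assumed ERR<n>PFGCTL bit positions; verify against the Arm RAS
# specification and the Cortex-A720AE TRM before relying on them.
PFGCTL_CDNEN = 1 << 31  # enable the countdown that triggers the fault
PFGCTL_R = 1 << 30      # restart the countdown after each injected fault
PFGCTL_CE = 1 << 6      # generate a Corrected Error
PFGCTL_DE = 1 << 5      # generate a Deferred Error
PFGCTL_UC = 1 << 1      # generate an Uncontainable Error

INJECTABLE = {
    "corrected": PFGCTL_CE,
    "deferred": PFGCTL_DE,
    "uncontainable": PFGCTL_UC,
}


def build_pfgctl(kind: str, restart: bool = False) -> int:
    """Compose a control value that injects one error of the given kind."""
    try:
        type_bit = INJECTABLE[kind]
    except KeyError:
        raise ValueError(f"unsupported error type: {kind!r}") from None
    value = PFGCTL_CDNEN | type_bit
    if restart:
        value |= PFGCTL_R
    return value
```

Under these assumptions, build_pfgctl("corrected") yields 0x80000040 and build_pfgctl("deferred") yields 0x80000020.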

Error injection for the Cortex-A720AE core is performed in the Secure World by leveraging the following components:

  • ts-ras-inject

    ts-ras-inject is the user-space application that takes input commands and returns the result of the injection. It talks to the Secure World components via the TS-TEE module in Linux using the TS RPC protocol.

  • RAS EINJ LSP

    A Logical Secure Partition (LSP) is a partition running in OP-TEE at S-EL1. The RAS EINJ LSP executes the privileged injection routine according to the user request and returns the result.

  • RAS EINJ Bridge

    The S-EL1 RAS EINJ LSP is not directly connected to ts-ras-inject because it cannot access the TS RPC protocol. RAS EINJ Bridge, an S-EL0 Secure Partition, is added to link them.
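
The forwarding chain between the three components can be pictured with plain function calls. This is a conceptual model only: the function names mirror the components, the dictionary message format is hypothetical, and the real TS RPC transport and the privileged injection routine are not represented.

```python
SUPPORTED = {"corrected", "deferred", "uncontainable"}


def ras_einj_lsp(request: dict) -> dict:
    """S-EL1 stand-in: the real LSP would run the privileged injection routine."""
    if request["error_type"] not in SUPPORTED:
        return {"status": "rejected"}
    return {"status": "injected", "error_type": request["error_type"]}


def ras_einj_bridge(rpc_payload: dict) -> dict:
    """S-EL0 stand-in: unwrap the RPC payload, forward it, relay the result."""
    return ras_einj_lsp({"error_type": rpc_payload["error_type"]})


def ts_ras_inject(error_type: str) -> dict:
    """User-space stand-in: send the command and return the outcome."""
    return ras_einj_bridge({"error_type": error_type})
```

In this model the bridge adds no logic of its own; it exists only because the LSP cannot terminate the RPC protocol itself.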