Platform Fault Detection Interface (PFDI)

Overview

The Platform Fault Detection Interface (PFDI) is a modular framework designed to detect and report hardware faults.

PFDI integrates with low-level firmware, the operating system kernel, and user space to enable robust fault monitoring and system health diagnostics. It is primarily intended for use in safety-critical automotive environments, where early detection of hardware anomalies is crucial for maintaining system integrity and safety.

By default, PFDI is enabled with example reference implementations in place of actual firmware test libraries. These serve as integration placeholders for early development and bring-up. Arm Software Test Libraries (STL) can be integrated as the PFDI firmware test backend. To enable STL support and obtain access, please contact Arm or visit Arm Software Test Libraries.

Application Processor PFDI Implementation

Architecture

The PFDI framework consists of the following key components:

  1. Trusted Firmware-A

    • SMC Service Handlers: Secure Monitor Call (SMC) handlers expose PFDI services to non-secure world software such as the Linux kernel. These SMC interfaces are defined by the Arm PFDI specification and follow the Arm SMC Calling Convention (SMCCC).

    • PFDI Driver: Executes in EL3 (platform firmware) and interfaces with the platform’s fault detection logic to initiate fault checks.

    • Reference PFDI Firmware Test Implementation: A reference implementation of the PFDI firmware test APIs is provided. It is intended for use in simulation, bring-up, or in environments where platform-specific logic is not yet integrated. This implementation does not perform actual fault detection but provides structural integration to validate the PFDI framework.

  2. Linux Kernel PFDI Driver

    • A miscellaneous character device driver responsible for interacting with the firmware via Secure Monitor Calls (SMC).

    • It serves as the bridge between the user space and the firmware.

  3. User Space PFDI Library

    • The Platform Fault Detection Interface (PFDI) library provides a standardized API for interacting with the fault detection interface driver in a Linux environment.

    • It supports platform-specific tests, version management, and error forcing, and hides the complexity of the underlying input/output control (ioctl) operations.

  4. User Space PFDI Tool

    • The Platform Fault Detection Interface (PFDI) Tool helps developers analyze, correct, generate, and pack the YAML configuration files that define CPU task ranges.

  5. User Space PFDI Sample Application

    • Command-line utility or background service that initiates tests and logs the results.

    • Useful for demonstration, diagnostics, or integration with larger health monitoring systems.

  6. User Space Command Line Interface

    • Command-line utility to

      1. Query the userspace library version.

      2. Query the firmware library version.

      3. Query the Out-of-Reset (OoR) PFDI results.

      4. Inject PFDI errors.

      5. Query the PFDI test count.

Interaction Flow

The PFDI framework provides a multi-layered interaction between the user space application and the platform firmware. The typical flow is as follows:

Out-of-Reset PFDI

  1. Primary Core

    • During early cold boot, the primary core executes the OoR PFDI.

    • If the primary core fails the OoR PFDI, the boot is aborted.

  2. Secondary Cores

    • The primary core sequentially pulls the secondary cores out of reset.

    • Each secondary core runs its OoR PFDI, reports the results, and re-enters an off state.

  3. Boot Blocking on Failure

    • Linux is prevented from turning on any secondary core whose OoR PFDI has failed.
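The Out-of-Reset steps above can be summarized as a small behavioural model. This is only a sketch of the gating logic described in the list, not the firmware implementation: the names `BootModel`, `BootAborted`, `cold_boot`, and `cpu_on` are hypothetical, and the per-core results are supplied as plain input data.

```python
from dataclasses import dataclass, field


class BootAborted(Exception):
    """Raised in this model when the primary core fails its OoR PFDI."""


@dataclass
class BootModel:
    # Hypothetical input: core id -> True if that core's OoR PFDI passes.
    oor_results: dict
    primary: int = 0
    blocked: set = field(default_factory=set)

    def cold_boot(self):
        # Step 1: the primary core runs its OoR PFDI; a failure aborts boot.
        if not self.oor_results[self.primary]:
            raise BootAborted(f"primary core {self.primary} failed OoR PFDI")
        # Step 2: secondary cores are taken out of reset one at a time,
        # run their OoR PFDI, report the result, and go back to off.
        for core in sorted(self.oor_results):
            if core == self.primary:
                continue
            if not self.oor_results[core]:
                self.blocked.add(core)

    def cpu_on(self, core):
        # Step 3: a later power-on request is refused for cores that failed.
        return core not in self.blocked
```

For example, with `BootModel({0: True, 1: True, 2: False})`, calling `cold_boot()` completes, `cpu_on(1)` is allowed, and `cpu_on(2)` is refused.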

Online PFDI

  1. User Space Initiation

    • A user space application invokes a function from the libpfdi library to request a fault detection test.

    • The library constructs a control request and sends it via an ioctl call to the PFDI kernel driver.

  2. Kernel-Level Mediation

    • The Linux kernel driver receives the ioctl call and translates it into a Secure Monitor Call (SMC).

    • An SMC is issued to transition from non-secure EL1 (Linux) to secure EL3 (firmware).

  3. Platform Firmware Execution

    • The Trusted Firmware SMC handler receives the request and delegates it to the internal PFDI driver.

    • The PFDI driver provides appropriate handlers to register with the firmware test library and invoke the necessary test routines.

  4. Test Execution in EL3

    • The firmware test library performs low-level validation of CPU functional logic.

    • The result is captured and returned through the PFDI driver.

  5. Result Propagation

    • The result is passed back through the SMC call to the Linux kernel driver.

    • The kernel driver makes the test result available to the user space application via the original ioctl return or a subsequent query.
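The five steps above form a simple call chain across exception levels. The following sketch models that chain with ordinary function calls so the order of hand-offs is easy to follow. It is an illustration only: every function name, the `"RUN_TEST"` request string, and the result codes are assumptions made for this example and are not taken from libpfdi, the kernel driver, or Trusted Firmware-A.

```python
trace = []  # records which layer handled the request, in order


def firmware_test_library(cpu):
    # Stand-in for the firmware test backend; always reports a pass (0).
    trace.append("EL3: firmware test library")
    return 0


def pfdi_firmware_driver(cpu):
    # The EL3 PFDI driver invokes the test routines and returns the result.
    trace.append("EL3: PFDI driver")
    return firmware_test_library(cpu)


def smc_handler(request, cpu):
    # The SMC handler delegates recognized requests to the PFDI driver.
    trace.append("EL3: SMC handler")
    if request != "RUN_TEST":
        return -1  # hypothetical "not supported" code
    return pfdi_firmware_driver(cpu)


def kernel_driver_ioctl(request, cpu):
    # The kernel driver translates the ioctl into an SMC (modelled as a call).
    trace.append("EL1: kernel driver")
    return smc_handler(request, cpu)


def libpfdi_run_test(cpu):
    # The user space library builds the control request and issues the ioctl.
    trace.append("EL0: user space library")
    return kernel_driver_ioctl("RUN_TEST", cpu)
```

Calling `libpfdi_run_test(0)` returns the result produced at the bottom of the chain, and `trace` lists the layers in the order the request crossed them.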

The following diagram shows the components and interaction flow that implement the Platform Fault Detection Interface.



Fig. 34 Platform Fault Detection Interface


The PFDI ACS test suite is executed as part of the validation flow to verify compliance with the PFDI specification.

Refer to README.md, libpfdi/README.md, and pfdi-demo/README.md for further details on the PFDI project, the PFDI library, and the sample application.

Safety Island Cluster 1 PFDI Implementation

Architecture

The Safety Island Cluster 1 PFDI framework is implemented on top of the Zephyr real-time operating system (RTOS) and consists of the following blocks:

  1. PFDI Driver

    • PFDI Module: Executes in EL1 (platform firmware) and interfaces with the platform fault detection logic to initiate fault checks.

    • Reference PFDI Firmware Test Implementation: The framework provides a reference implementation of the PFDI firmware test APIs. It is intended for use in simulation, bring-up, or in environments where platform-specific logic is not yet integrated. This implementation does not perform actual fault detection; it provides structural integration to validate the PFDI framework.

  2. PFDI Subsystem

    • The PFDI subsystem manager interacts with the firmware through PFDI Driver public APIs.

    • The subsystem schedules periodic PFDI fault checks on each online CPU at power on. Configure the periodic interval in Kconfig.

    • It serves as the bridge between the user-facing shell environment and the firmware.

  3. PFDI Shell Utility

    • The PFDI shell utility provides a standardized command-line interface in the Zephyr shell environment.

    • Command-line utility to

      1. Run PFDI Online (Onl) test.

      2. Query the firmware library version.

      3. Query the Out-of-Reset (OoR) PFDI results.

      4. Inject PFDI errors.

      5. Query the online (Onl) status.

      6. Enable or disable the online test.

      7. Query the number of PFDI blocks and parts.

Note

The Platform Fault Detection Interface (PFDI) Specification defines an SMCCC/SMC-based firmware interface. The Safety Island Cluster 1 Cortex-R82AE implementation in this software stack does not expose PFDI through SMC, because the Armv8-R AArch64 execution model used by Cortex-R82AE has no EL3 Secure Monitor and EL2 is the highest exception level. In addition, Cortex-R82AE operates in Secure state at all implemented exception levels, so the SI CL1 PFDI runs as a local secure EL1 firmware service within Zephyr rather than as a Secure Monitor service.

Interaction Flow

The PFDI framework enables interaction between the user-facing shell utilities and platform firmware. The following steps describe the interaction flow:

Out-of-Reset PFDI

  1. Primary Core

    • During early cold boot, the primary core runs the OoR PFDI tests and reports the results.

  2. Secondary Cores

    • The primary core releases the secondary cores from reset.

    • Each secondary core runs its OoR PFDI tests and reports the results.

  3. On Failure Cases

    • In this reference implementation, the framework reports the result, logs failures, and continues to boot. The PFDI framework provides a post-run hook that can be used to implement platform-specific behavior on failure.
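The report, log, and continue behavior with an optional post-run hook can be modelled as follows. This is a sketch under stated assumptions: `run_oor`, the `run_test` callable, and the hook signature `(core, passed)` are hypothetical names chosen for this example and do not reflect the Zephyr driver's actual API.

```python
import logging

log = logging.getLogger("pfdi_model")


def run_oor(cores, run_test, post_run_hook=None):
    """Run the OoR check on every core and keep booting on failure.

    run_test(core) -> bool and post_run_hook(core, passed) are
    hypothetical callables used only in this model.
    """
    results = {}
    for core in cores:
        passed = run_test(core)
        results[core] = passed
        if not passed:
            # The reference behavior described above: log and continue.
            log.error("OoR PFDI failed on core %d", core)
        if post_run_hook is not None:
            # A platform can attach its own failure handling here.
            post_run_hook(core, passed)
    return results
```

A platform that wants stricter handling could pass a hook that records failing cores or requests a transition to a safe state, while the default (no hook) keeps the log-and-continue behavior.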

Online PFDI

  1. User Initiation

    • A user-facing PFDI shell command invokes a function from the PFDI subsystem to request a fault detection test.

    • The shell command creates a control request.

    • The subsystem sends the request to the PFDI driver.

  2. PFDI Subsystem Mediation

    • The PFDI subsystem schedules periodic PFDI fault checks on each online CPU at power on.

    • The subsystem receives the request and forwards it to the PFDI driver through the CPU worker thread.

  3. Platform Firmware Execution

    • The PFDI driver provides handlers that integrate with the firmware test library.

  4. Test Execution in EL1

    • The firmware test library validates CPU functional logic.

    • The result is captured and returned through the PFDI driver.
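One way to picture the mediation described above is a per-CPU worker that serializes both periodic checks and shell-initiated requests before they reach the driver. The sketch below uses Python threads and queues purely as an analogy for the CPU worker threads. The class names, the `"periodic"` and `"shell"` labels, and the `driver_run` callable are assumptions for illustration, not the Zephyr subsystem's interfaces.

```python
import queue
import threading


class CpuWorker(threading.Thread):
    """Per-CPU worker: runs queued PFDI requests one at a time."""

    def __init__(self, cpu, driver_run):
        super().__init__(daemon=True)
        self.cpu = cpu
        self.driver_run = driver_run  # hypothetical stand-in for the PFDI driver
        self.requests = queue.Queue()
        self.results = []

    def run(self):
        while True:
            source = self.requests.get()
            # Record who asked for the test and what the driver returned.
            self.results.append((source, self.driver_run(self.cpu)))
            self.requests.task_done()


class PfdiSubsystem:
    def __init__(self, cpus, driver_run):
        self.workers = {cpu: CpuWorker(cpu, driver_run) for cpu in cpus}
        for worker in self.workers.values():
            worker.start()

    def periodic_tick(self):
        # Periodic check: one request per online CPU.
        for worker in self.workers.values():
            worker.requests.put("periodic")

    def shell_request(self, cpu):
        # Shell-initiated request for a single CPU.
        self.workers[cpu].requests.put("shell")

    def drain(self):
        # Wait until every queued request has been handled.
        for worker in self.workers.values():
            worker.requests.join()
```

Routing both request sources through the same per-CPU queue means a shell command never runs concurrently with a periodic check on the same CPU in this model.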

The following diagram shows the modules and interaction flow that implement the Platform Fault Detection Interface.



Fig. 35 Safety Island Cluster 1 Platform Fault Detection Interface


PFDI on Cortex-R82AE follows the standard PFDI architecture semantics. The PFDI ACS compliance tests are not intended to be run on Cortex-R82AE platforms.

Platform Fault Detection Interface (PFDI) Monitoring

The diagram below illustrates the components and interaction flow that comprise the Platform Fault Detection Interface Monitoring mechanism.



Fig. 36 Platform Fault Detection Interface Monitoring


A software component, referred to as the PFDI monitor, is allocated to the Safety Island Cluster 0 (CL0) subsystem. It is responsible for executing watchdog functions based on messages received from PFDI agents.

The PFDI monitor initializes a watchdog (countdown) timer for each physical CPU in the PC subsystem and Safety Island Cluster 1 once Safety Island Cluster 0 (CL0) releases the corresponding CPU from reset.

Each time PFDI tests are completed, the PFDI agent sends a System Control and Management Interface (SCMI) message to the PFDI monitor to reset the watchdog timer. This message exchange must occur periodically once per Fault Detection Time Interval (FDTI).
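The per-CPU watchdog behavior can be sketched as a table of deadlines that each agent message pushes forward by one FDTI. This is a simplified model, not the monitor's implementation: `PfdiMonitorModel`, its method names, the CPU labels, and the time units are hypothetical, and the model only reports which timers have expired because the action taken on expiry is platform-specific.

```python
class PfdiMonitorModel:
    """Tracks one countdown deadline per monitored CPU."""

    def __init__(self, fdti):
        self.fdti = fdti      # Fault Detection Time Interval, arbitrary units
        self.deadline = {}    # cpu label -> time at which its watchdog expires

    def cpu_released(self, cpu, now):
        # The timer starts when the CPU is released from reset.
        self.deadline[cpu] = now + self.fdti

    def on_agent_message(self, cpu, now):
        # A message from the PFDI agent means the tests completed:
        # restart the countdown for that CPU.
        if cpu in self.deadline:
            self.deadline[cpu] = now + self.fdti

    def expired(self, now):
        # CPUs whose agents have not reported within one FDTI.
        return sorted(cpu for cpu, d in self.deadline.items() if now > d)
```

For instance, with `fdti=100`, two CPUs released at time 0, and only one of them reporting at time 90, `expired(150)` lists just the silent CPU.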

To facilitate independent communication, each PFDI agent is assigned a dedicated MHU (Message Handling Unit) channel and a shared memory region, effectively creating a separate SCMI communication channel for each agent.

The SCMI communication uses a custom vendor-specific protocol identifier, 0x90. During Safety Island Cluster 1 boot, the system prints the following debug output:

[00:00:00.000,000] <inf> pfdi_agent: PFDI Agent setup complete
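To show where the protocol identifier 0x90 sits on the wire, the sketch below packs and unpacks a 32-bit SCMI message header using the field layout from the SCMI specification (message ID in bits 7:0, message type in bits 9:8, protocol ID in bits 17:10, token in bits 27:18). The message ID and token values used here are placeholders: the text does not state which message IDs the PFDI vendor protocol defines.

```python
PFDI_SCMI_PROTOCOL_ID = 0x90  # vendor-specific protocol identifier from the text


def pack_scmi_header(message_id, protocol_id, message_type=0, token=0):
    """Build a 32-bit SCMI message header from its fields."""
    if not 0 <= message_id <= 0xFF:
        raise ValueError("message_id must fit in 8 bits")
    if not 0 <= message_type <= 0x3:
        raise ValueError("message_type must fit in 2 bits")
    if not 0 <= protocol_id <= 0xFF:
        raise ValueError("protocol_id must fit in 8 bits")
    if not 0 <= token <= 0x3FF:
        raise ValueError("token must fit in 10 bits")
    return (token << 18) | (protocol_id << 10) | (message_type << 8) | message_id


def unpack_scmi_header(header):
    """Split a 32-bit SCMI message header back into its fields."""
    return {
        "message_id": header & 0xFF,
        "message_type": (header >> 8) & 0x3,
        "protocol_id": (header >> 10) & 0xFF,
        "token": (header >> 18) & 0x3FF,
    }
```

With a placeholder message ID of 0x3 and token of 0x1, `pack_scmi_header(0x3, PFDI_SCMI_PROTOCOL_ID, token=0x1)` yields `0x64003`, and unpacking that value recovers protocol ID 0x90.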