# FIPER: Failure Prediction at Runtime for Generative Robot Policies
Ralf Römer<sup>1,*</sup>, Adrian Kobras<sup>1,*</sup>, Luca Worbis<sup>1</sup>, Angela P. Schoellig<sup>1</sup>
<sup>1</sup>Technical University of Munich
The official code repository for "Failure Prediction at Runtime for Generative Robot Policies," accepted to NeurIPS 2025.
<img src="fiper_dark.png" alt="FIPER"/>

## Overview
FIPER is a general framework for predicting failures of generative robot policies across different tasks. The repository handles task initialization, dataset management, policy training, evaluation of failure prediction, and result visualization.
<!-- ### Key Components 1. **TaskManager**: Handles task-specific configurations, metadata extraction, and rollout conversion. 2. **Dataset Class**: Manages data preprocessing, normalization, and iteration for training and evaluation. 3. **Failure Prediction Methods**: Includes Random Network Distillation for observation embeddings (RND-OE), action chunk entropy (ACE), and numerous baselines. 4. **EvaluationManager**: Interfaces with method-specific evaluation classes and computes evaluation metrics. 5. **ResultsManager**: Summarizes and visualizes evaluation results. -->

## Repository Structure
```
fiper/
├── configs/                  # Configuration files for tasks, evaluation, and results
│   ├── default.yaml          # Default pipeline configuration: set methods and tasks to evaluate
│   ├── eval/                 # Evaluation-specific configurations, including method hyperparameters
│   └── task/                 # Task-specific configurations, including policy parameters
├── data/                     # Task-specific data (rollouts, models, etc.) and results
│   ├── {task}/               # Subdirectories for each task (e.g., push_t, pretzel)
│   └── results/              # Generated results
├── datasets/                 # Data management
│   ├── __init__.py
│   └── rollout_datasets.py   # ProcessedRolloutDataset class implementation
├── evaluation/               # Evaluation module
│   ├── __init__.py
│   ├── evaluation_manager.py # Manages the evaluation
│   ├── results_manager.py    # Manages results generation
│   └── method_eval_classes/  # Base and method-specific evaluation classes
├── scripts/                  # Main scripts for running the pipeline and generating results
│   ├── run_fiper.py          # Main pipeline script
│   └── results_generation.py # Generates summaries and visualizations of the results
├── shared_utils/             # Shared utility functions
└── rnd/                      # Random Network Distillation (RND)-specific modules
```
## Getting Started

### Installation 🛠️

```bash
# Create and activate the Conda environment
conda env create -f environment_clean.yml
conda activate fiper

# Install PyTorch with CUDA 12.6 support
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
```
### Download Data 📁
The repository requires test and calibration rollouts generated by a generative policy. These rollouts must include all necessary data (e.g., action predictions, observation embeddings, states, RGB images, etc.) for the failure prediction methods.
Our calibration and test rollouts can be downloaded here. Huggingface dataset and LeRobot conversion scripts coming soon! After downloading, place the extracted rollouts into the following directory structure:
```
fiper/data/{task}/rollouts
```

Replace `{task}` with the name of the respective task (e.g., `push_t`). After placing the rollouts, each task folder should have a `rollouts` subfolder with a `test` and a `calibration` subfolder.
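For example, for a hypothetical `push_t` download, the extracted layout would look like:

```
fiper/data/push_t/rollouts/
├── calibration/   # calibration rollouts (*.pkl)
└── test/          # test rollouts (*.pkl)
```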
Currently, it is assumed that each rollout is saved as an individual `.pkl` file with one of the following structures:

- **Dictionary**: A dictionary with two keys, `metadata` and `rollout`, where `metadata` is a dictionary containing the metadata of the rollout and `rollout` is a list whose k-th entry is a dictionary containing the necessary rollout data of the k-th timestep.
- **List**: Only the `rollout` part of the dictionary option. In this case, it is checked whether the first entry of the rollout list contains the rollout metadata.

It is recommended to provide task-specific metadata in the corresponding task configuration file. Additionally, basic information (success and rollout ID) can be extracted from the rollout filenames.
</details>

## Adjusting the Pipeline Settings
Below is an overview of the key configuration components in the configs/ directory:
- `default.yaml`: Specifies the tasks and methods to evaluate.
- `eval/`: Contains evaluation settings:
  - `eval/base.yaml`: Common evaluation settings.
  - `eval/{method}.yaml`: Method-specific hyperparameters.
- `task/{task}.yaml`: Contains task-specific and policy parameters, such as observation and action spaces.
- `results/base.yaml`: Defines how to process results and which plots to generate.
## Running the Pipeline

Once the desired settings are configured, run FIPER:

```bash
python fiper/scripts/run_fiper.py
```
## Managing and Visualizing Results

After the pipeline run is complete, you can generate various results and visualizations by adjusting the `results/base.yaml` configuration file and running:

```bash
python fiper/scripts/results_generation.py
```
## Evaluation Details 📊

### Calibration & Threshold Design
During calibration, thresholds are calculated using Conformal Prediction (CP) based on the uncertainty scores of the calibration rollouts. During evaluation, a test rollout is flagged as failed if the uncertainty score at any step surpasses the threshold at that step.
Thresholds are controlled by:
- A quantile parameter, which defines the percentage of calibration rollouts flagged as successful.
- The window size, which indirectly influences the thresholds (see Moving Window Design).
We support the following threshold styles:

- **Constant Thresholds**: Static thresholds calculated from the maximum uncertainty score of each calibration rollout.
  - `ct_quantile`: Threshold set to a specific quantile of the maximum scores across calibration rollouts. For example, the 95th percentile ensures that 95% of calibration rollouts are classified as successful.
- **Time-Varying Thresholds**: Thresholds that vary over time, calculated for each timestep of the calibration rollouts.
  - `tvt_cp_band`: A time-varying CP threshold.
  - `tvt_quantile`: Similar to `ct_quantile`, but applied at each timestep.
Since successful rollouts are typically shorter than failed ones, the calibration set may not provide thresholds for the entire length of the test rollouts. To address this, the time-varying thresholds are extended to match the maximum length of the test rollouts. This is implemented in two ways:
- Repeat Last Value (default): Use the last available threshold value for all remaining steps.
- Repeat Mean: Use the mean of the thresholds from the calibration rollouts for the remaining steps.
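As a hedged sketch (not the repository's exact implementation), the `ct_quantile` threshold and the two extension modes could look like this; the conformal quantile index `ceil(q * (n + 1))` is a common CP convention and is an assumption here:

```python
import math


def ct_quantile_threshold(calib_scores, quantile=0.95):
    """Constant threshold: a conformal-style quantile of the per-rollout
    maximum uncertainty scores of the calibration set.

    `calib_scores` is a list of per-rollout score sequences. With
    quantile=0.95, roughly 95% of calibration rollouts stay below the
    returned threshold and are classified as successful.
    """
    max_scores = sorted(max(scores) for scores in calib_scores)
    n = len(max_scores)
    # Conformal quantile index ceil(q * (n + 1)), clipped to the sample size.
    idx = min(math.ceil(quantile * (n + 1)), n) - 1
    return max_scores[idx]


def extend_thresholds(thresholds, target_len, mode="repeat_last"):
    """Extend time-varying thresholds to the maximum test-rollout length,
    using either "repeat_last" (default) or "repeat_mean"."""
    if len(thresholds) >= target_len:
        return thresholds[:target_len]
    pad = thresholds[-1] if mode == "repeat_last" else sum(thresholds) / len(thresholds)
    return thresholds + [pad] * (target_len - len(thresholds))
```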
**Extension of Time-Varying Thresholds:**
### Moving Window Design

The moving window aggregates uncertainty scores over a fixed number of past steps (defined by `window_size`), including the current step. This lets the failure predictor account for past uncertainty scores and improves robustness by smoothing both the thresholds and the uncertainty scores, reducing sensitivity to outliers. For instance, with `window_size = 5`, the uncertainty score at step t is the aggregate of the scores from steps max(t-4, 0) to t.
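The windowed aggregation described above can be sketched as follows; using `max` as the aggregation function is an assumption for illustration (the actual choice is set in the evaluation configs):

```python
def windowed_scores(scores, window_size=5, aggregate=max):
    """Aggregate each step's uncertainty score over the last `window_size`
    steps, including the current one: step t covers max(t - window_size + 1, 0)..t.
    """
    return [
        aggregate(scores[max(t - window_size + 1, 0): t + 1])
        for t in range(len(scores))
    ]
```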
## Adding Tasks

1. **Create a Task Configuration File**: Add a task-specific configuration file in `fiper/configs/task/{task_name}.yaml`.
2. **Load Raw Rollouts**: Place the raw rollouts for the task in `fiper/data/{task_name}/rollouts/`.
3. **Update the Default Configuration**: Add the task to `available_tasks` and `tasks` in `fiper/configs/default.yaml`.
## Adding Failure Prediction Methods

### 1. Create an Evaluation Class

- Add a new evaluation class in `fiper/evaluation/method_eval_classes/`, inheriting from `BaseEvalClass` in `fiper/evaluation/method_eval_classes/base_eval_class.py`.
- Implement the `calculate_uncertainty_score` function to compute uncertainty scores for each rollout step based on the required elements.
- If the method requires model loading or preprocessing, implement the `load_model` and `execute_preprocessing` functions.
**Naming Convention**: The class name of a method evaluation class is given by `f"{method_name.replace('_', '').upper()}Eval"`. For instance, the evaluation class of the `rnd_oe` method is `RNDOEEval`.
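The naming convention above can be expressed as a small helper (a sketch, not code from the repository):

```python
def eval_class_name(method_name: str) -> str:
    """Derive the evaluation class name from a method name:
    drop underscores, uppercase, and append "Eval"."""
    return f"{method_name.replace('_', '').upper()}Eval"
```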
### 2. Create a Configuration File

- Add a configuration file in `fiper/configs/eval/{method_name}.yaml`, inheriting from the base evaluation configuration file.
- Define the method-specific parameters in this file.
### 3. Update the Default Configuration

- Add the new method to the `methods` and `implemented_methods` lists in `fiper/configs/default.yaml`.
<details> <summary> Workflow & Module Details </summary> <!-- ## Workflow & Module Details -->
The pipeline is designed to evaluate failure pred
