# FIPER: Failure Prediction at Runtime for Generative Robot Policies
Ralf Römer<sup>1,*</sup>, Adrian Kobras<sup>1,*</sup>, Luca Worbis<sup>1</sup>, Angela P. Schoellig<sup>1</sup>
<sup>1</sup>Technical University of Munich
The official code repository for "Failure Prediction at Runtime for Generative Robot Policies," accepted to NeurIPS 2025.
<img src="fiper_dark.png" alt="FIPER"/>

## Overview
FIPER is a general framework for predicting failures of generative robot policies across different tasks. The repository handles task initialization, dataset management, policy training, evaluation of failure prediction, and result visualization.
<!-- ### Key Components 1. **TaskManager**: Handles task-specific configurations, metadata extraction, and rollout conversion. 2. **Dataset Class**: Manages data preprocessing, normalization, and iteration for training and evaluation. 3. **Failure Prediction Methods**: Includes Random Network Distillation for observation embeddings (RND-OE), action chunk entropy (ACE), and numerous baselines. 4. **EvaluationManager**: Interfaces with method-specific evaluation classes and computes evaluation metrics. 5. **ResultsManager**: Summarizes and visualizes evaluation results. -->

## Repository Structure
```
fiper/
├── configs/                  # Configuration files for tasks, evaluation, and results
│   ├── default.yaml          # Default pipeline configuration: set methods and tasks to evaluate
│   ├── eval/                 # Evaluation-specific configurations, including method hyperparameters
│   └── task/                 # Task-specific configurations, including policy parameters
├── data/                     # Task-specific data (rollouts, models, etc.) and results
│   ├── {task}/               # Subdirectories for each task (e.g., push_t, pretzel)
│   └── results/              # Generated results
├── datasets/                 # Data management
│   ├── __init__.py
│   └── rollout_datasets.py   # ProcessedRolloutDataset class implementation
├── evaluation/               # Evaluation module
│   ├── __init__.py
│   ├── evaluation_manager.py # Manages the evaluation
│   ├── results_manager.py    # Manages results generation
│   └── method_eval_classes/  # Base and method-specific evaluation classes
├── scripts/                  # Main scripts for running the pipeline and generating results
│   ├── run_fiper.py          # Main pipeline script
│   └── results_generation.py # Generates summaries and visualizations of the results
├── shared_utils/             # Shared utility functions
└── rnd/                      # Random Network Distillation (RND)-specific modules
```
## Getting Started

### Installation 🛠️

```bash
# Create and activate the Conda environment
conda env create -f environment_clean.yml
conda activate fiper

# Install PyTorch with CUDA 12.6 support
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
```
### Download Data 📁
The repository requires test and calibration rollouts generated by a generative policy. These rollouts must include all necessary data (e.g., action predictions, observation embeddings, states, RGB images, etc.) for the failure prediction methods.
Our calibration and test rollouts can be downloaded here. Huggingface dataset and LeRobot conversion scripts coming soon! After downloading, place the extracted rollouts into the following directory structure:
```
fiper/data/{task}/rollouts
```

Replace `{task}` with the name of the respective task (e.g., `push_t`). After placing the rollouts, each task folder should have a `rollouts` subfolder with a `test` and a `calibration` subfolder.
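For example, for a hypothetical `push_t` download, the extracted layout would look like:

```
fiper/data/push_t/rollouts/
├── calibration/   # calibration rollouts (*.pkl)
└── test/          # test rollouts (*.pkl)
```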
Currently, it is assumed that each rollout is saved as an individual `.pkl` file with one of the following structures:

- **Dictionary**: A dictionary with two keys, `metadata` and `rollout`, where `metadata` is a dictionary containing the metadata of the rollout and `rollout` is a list whose k-th entry is a dictionary containing the necessary rollout data of the k-th timestep.
- **List**: Only the `rollout` part of the dictionary option. In this case, it is checked whether the first entry of the rollout list contains the rollout metadata.

It is recommended to provide task-specific metadata in the corresponding task configuration file. Additionally, basic information (success and rollout ID) can be extracted from the rollout filenames.
</details>

## Adjusting the Pipeline Settings
Below is an overview of the key configuration components in the configs/ directory:
- `default.yaml`: Specifies the tasks and methods to evaluate.
- `eval/`: Contains evaluation settings:
  - `eval/base.yaml`: Common evaluation settings.
  - `eval/{method}.yaml`: Method-specific hyperparameters.
- `task/{task}.yaml`: Contains task-specific and policy parameters, such as observation and action spaces.
- `results/base.yaml`: Defines how to process results and which plots to generate.
## Running the Pipeline

Once the desired settings are configured, run FIPER:

```bash
python fiper/scripts/run_fiper.py
```
## Managing and Visualizing Results

After the pipeline run is complete, you can generate various results and visualizations by adjusting the `results/base.yaml` configuration file and running:

```bash
python fiper/scripts/results_generation.py
```
## Evaluation Details 📊

### Calibration & Threshold Design
During calibration, thresholds are calculated using Conformal Prediction (CP) based on the uncertainty scores of the calibration rollouts. During evaluation, a test rollout is flagged as failed if the uncertainty score at any step surpasses the threshold at that step.
Thresholds are controlled by:
- A quantile parameter, which defines the percentage of calibration rollouts flagged as successful.
- The window size, which indirectly influences the thresholds (see Moving Window Design).
We support the following threshold styles:

- **Constant Thresholds**: Static thresholds calculated from the maximum uncertainty score of each calibration rollout.
  - `ct_quantile`: Threshold set to a specific quantile of the maximum scores across calibration rollouts. For example, the 95th percentile ensures that 95% of calibration rollouts are classified as successful.
- **Time-Varying Thresholds**: Thresholds that vary over time, calculated for each timestep of the calibration rollouts.
  - `tvt_cp_band`: A time-varying CP threshold.
  - `tvt_quantile`: Similar to `ct_quantile`, but applied at each timestep.
Since successful rollouts are typically shorter than failed ones, the calibration set may not provide thresholds for the entire length of the test rollouts. To address this, the time-varying thresholds are extended to match the maximum length of the test rollouts. This is implemented in two ways:
- Repeat Last Value (default): Use the last available threshold value for all remaining steps.
- Repeat Mean: Use the mean of the thresholds from the calibration rollouts for the remaining steps.
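As a hedged sketch (not the repository's exact implementation), the `ct_quantile` threshold and the two extension modes could look like this; the conformal quantile index `ceil(q * (n + 1))` is a common CP convention and is an assumption here:

```python
import math


def ct_quantile_threshold(calib_scores, quantile=0.95):
    """Constant threshold: a conformal-style quantile of the per-rollout
    maximum uncertainty scores of the calibration set.

    `calib_scores` is a list of per-rollout score sequences. With
    quantile=0.95, roughly 95% of calibration rollouts stay below the
    returned threshold and are classified as successful.
    """
    max_scores = sorted(max(scores) for scores in calib_scores)
    n = len(max_scores)
    # Conformal quantile index ceil(q * (n + 1)), clipped to the sample size.
    idx = min(math.ceil(quantile * (n + 1)), n) - 1
    return max_scores[idx]


def extend_thresholds(thresholds, target_len, mode="repeat_last"):
    """Extend time-varying thresholds to the maximum test-rollout length,
    using either "repeat_last" (default) or "repeat_mean"."""
    if len(thresholds) >= target_len:
        return thresholds[:target_len]
    pad = thresholds[-1] if mode == "repeat_last" else sum(thresholds) / len(thresholds)
    return thresholds + [pad] * (target_len - len(thresholds))
```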
**Extension of Time-Varying Thresholds:**
### Moving Window Design

The moving window aggregates uncertainty scores over a fixed number of past steps (defined by `window_size`), including the current step. This lets the failure predictor account for past uncertainty scores and improves robustness by smoothing both the thresholds and the uncertainty scores, reducing sensitivity to outliers. For instance, with `window_size = 5`, the uncertainty score at step t is the aggregate of the scores from steps max(t-4, 0) to t.
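The windowed aggregation described above can be sketched as follows; using `max` as the aggregation function is an assumption for illustration (the actual choice is set in the evaluation configs):

```python
def windowed_scores(scores, window_size=5, aggregate=max):
    """Aggregate each step's uncertainty score over the last `window_size`
    steps, including the current one: step t covers max(t - window_size + 1, 0)..t.
    """
    return [
        aggregate(scores[max(t - window_size + 1, 0): t + 1])
        for t in range(len(scores))
    ]
```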
## Adding Tasks

1. **Create a Task Configuration File**: Add a task-specific configuration file in `fiper/configs/task/{task_name}.yaml`.
2. **Load Raw Rollouts**: Place the raw rollouts for the task in `fiper/data/{task_name}/rollouts/`.
3. **Update the Default Configuration**: Add the task to `available_tasks` and `tasks` in `fiper/configs/default.yaml`.
## Adding Failure Prediction Methods

### 1. Create an Evaluation Class

- Add a new evaluation class in `fiper/evaluation/method_eval_classes/`, inheriting from `BaseEvalClass` in `fiper/evaluation/method_eval_classes/base_eval_class.py`.
- Implement the `calculate_uncertainty_score` function to compute uncertainty scores for each rollout step based on the required elements.
- If the method requires model loading or preprocessing, implement the `load_model` and `execute_preprocessing` functions.
**Naming Convention**: The class name of a method evaluation class is given by `f"{method_name.replace('_', '').upper()}Eval"`. For instance, the evaluation class of the `rnd_oe` method is `RNDOEEval`.
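The naming convention above can be expressed as a small helper (a sketch, not code from the repository):

```python
def eval_class_name(method_name: str) -> str:
    """Derive the evaluation class name from a method name:
    drop underscores, uppercase, and append "Eval"."""
    return f"{method_name.replace('_', '').upper()}Eval"
```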
### 2. Create a Configuration File

- Add a configuration file in `fiper/configs/eval/{method_name}.yaml`, inheriting from the base evaluation configuration file.
- Define the method-specific parameters in this file.
### 3. Update the Default Configuration

- Add the new method to the `methods` and `implemented_methods` lists in `fiper/configs/default.yaml`.
<details> <summary> Workflow & Module Details </summary> <!-- ## Workflow & Module Details -->
The pipeline is designed to evaluate failure pred
