
Robust FPCA for sparsely observed curves

Matias Salibian-Barrera & Graciela Boente 2025-12-04

This repository contains the sparseFPCA package that implements the robust FPCA method introduced in Robust functional principal components for sparse longitudinal data (Boente and Salibian-Barrera, 2021) (ArXiv).

LICENSE: The content in this repository is released under the “Creative Commons Attribution-ShareAlike 4.0 International” license. A human-readable summary and the full legal text are available at https://creativecommons.org/licenses/by-sa/4.0/.

sparseFPCA - Robust FPCA for sparsely observed curves

The sparseFPCA package implements the robust functional principal components analysis (FPCA) estimator introduced in Boente and Salibian-Barrera, 2021. sparseFPCA computes robust estimators for the mean and covariance (scatter) functions, and the corresponding eigenfunctions. It can be used with functional data sets where only a few observations per curve are available (possibly recorded at irregular intervals).
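The idea can be sketched with the truncated Karhunen-Loève expansion that underlies any FPCA method: each curve is the mean function plus a few eigenfunctions weighted by subject-specific scores. The following base-R toy (all names are hypothetical, not the package's API) recovers known scores from a noiseless curve:

```r
# A curve X(t) ~ mu(t) + xi_1 * phi_1(t) + xi_2 * phi_2(t),
# built on a common grid with orthonormal eigenfunctions.
tt   <- seq(0, 1, length.out = 100)        # evaluation grid
mu   <- 2 * tt                             # hypothetical mean function
phi1 <- sqrt(2) * sin(pi * tt)             # first eigenfunction (unit L2 norm)
phi2 <- sqrt(2) * sin(2 * pi * tt)         # second eigenfunction
xi   <- c(1.5, -0.7)                       # scores for one subject
x    <- mu + xi[1] * phi1 + xi[2] * phi2   # the subject's curve

# Scores are inner products <x - mu, phi_j>, here a Riemann sum on the grid:
h <- tt[2] - tt[1]
xi.hat <- c(sum((x - mu) * phi1) * h, sum((x - mu) * phi2) * h)
xi.hat                                     # approximately c(1.5, -0.7)
```

In the sparse setting each subject contributes only a few irregular time points, so these inner products cannot be computed directly; that is what the smoothing-based estimators below are for.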

Installing the sparseFPCA package for R

The package can be installed directly from this repository using the following command in R:

devtools::install_github('msalibian/sparseFPCA', ref = "master")

An example - CD4 counts data

Here we illustrate the use of our method and compare it with existing alternatives. We will analyze the CD4 data, available in the catdata package. These data are part of the Multicentre AIDS Cohort Study (Zeger and Diggle, 1994). They consist of 2376 measurements of CD4 cell counts taken on 369 men, with time measured in years since seroconversion (t = 0).

We first load the data set and arrange it in a suitable format. Because the data consist of trajectories of different lengths, possibly measured at different times, the software requires that the observations be arranged in two lists, one (which we call X$x below) containing the vectors (of varying lengths) of points observed in each curve, and the other (X$pp) with the corresponding times:

data(aids, package='catdata')
X <- vector('list', 2) 
names(X) <- c('x', 'pp')
X$x <- split(aids$cd4, aids$person)
X$pp <- split(aids$time, aids$person)
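For readers without catdata installed, here is a toy version of this two-list layout with made-up numbers (three subjects with 2, 4 and 3 irregularly spaced observations):

```r
# X$x: one numeric vector of measurements per subject;
# X$pp: the matching vector of observation times (same lengths).
Xtoy <- list(
  x  = list(c(500, 480), c(700, 650, 640, 600), c(300, 310, 290)),
  pp = list(c(0.1, 1.2), c(0.0, 0.5, 1.1, 2.3), c(0.2, 0.9, 1.8))
)
sapply(Xtoy$x, length)    # 2 4 3: curves may have different lengths
```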

To ensure that there are enough observations to estimate the covariance function at every pair of times (s, t), we only consider observations for which t >= 0, and remove individuals that only have one measurement.

n <- length(X$x)
shorts <- vector('logical', n)
for(i in 1:n) {
  tmp <- (X$pp[[i]] >= 0)
  X$pp[[i]] <- (X$pp[[i]])[tmp]
  X$x[[i]] <- (X$x[[i]])[tmp]
  if( length(X$pp[[i]]) <= 1 ) shorts[i] <- TRUE
}
X$x <- X$x[!shorts]
X$pp <- X$pp[!shorts]

This results in a data set with N = 292 curves, where the number of observations per individual ranges between 2 and 11 (with a median of 5):

length(X$x)
## [1] 292
summary(lens <- sapply(X$x, length))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   5.000   4.983   6.000  11.000
table(lens)
## lens
##  2  3  4  5  6  7  8  9 10 11 
## 51 52 35 43 39 20 23 10 15  4

The following figure shows the data set with three randomly chosen trajectories highlighted with solid black lines:

xmi <- min( tmp <- unlist(X$x) )   # range of the CD4 counts (vertical axis)
xma <- max( tmp )
ymi <- min( tmp <- unlist(X$pp) )  # range of the observation times (horizontal axis)
yma <- max( tmp ) 
n <- length(X$x)
plot(seq(ymi, yma, length=5), seq(xmi, xma, length=5), type='n', xlab='t', ylab='X(t)')
for(i in 1:n) { lines(X$pp[[i]], X$x[[i]], col='gray', lwd=1, type='b', pch=19, 
                     cex=1) }
lens <- sapply(X$x, length)
set.seed(22)
ii <- c(sample((1:n)[lens==2], 1), sample((1:n)[lens==5], 1), 
        sample((1:n)[lens==10], 1))
for(i in ii) lines(X$pp[[i]], X$x[[i]], col='black', lwd=4, type='b', pch=19, 
                   cex=1, lty=1)


Robust and non-robust FPCA

We will compare the robust and non-robust versions of our approach with the PACE estimator of Yao, Müller and Wang (2005) (paper - package). We need to load the following packages:

library(sparseFPCA)
library(doParallel)
library(fdapace)

The specific versions of these packages that were used here (via the output of the function sessionInfo()) can be found at the bottom of this page.

The following are parameters required for our estimator.

ncpus <- 4                        # number of cores for parallel computations
seed <- 123                       # random seed, for reproducibility
rho.param <- 1e-3 
max.kappa <- 1e3
ncov <- 50                        # size of the grid on which the covariance function is estimated
k.cv <- 10                        # number of cross-validation folds
k <- 5                            # number of principal components
s <- k 
hs.mu <- seq(.1, 1.5, by=.1)      # candidate bandwidths for the mean estimator
hs.cov <- seq(1, 7, length=10)    # candidate bandwidths for the covariance estimator

We now fit the robust and non-robust versions of our proposal, and also the PACE estimator. This step may take several minutes to run:

ours.ls <- lsfpca(X=X, ncpus=ncpus, hs.mu=hs.mu, hs.cov=hs.cov, rho.param=rho.param, 
                  k = k, s = k, trace=FALSE, seed=seed, k.cv=k.cv, ncov=ncov,
                  max.kappa=max.kappa)
ours.r <- efpca(X=X, ncpus=ncpus, hs.mu=hs.mu, hs.cov=hs.cov, rho.param=rho.param,
                alpha=0.2, k = k, s = k, trace=FALSE, seed=seed, k.cv=k.cv, ncov=ncov,
                max.kappa=max.kappa)
myop <- list(error=FALSE, methodXi='CE', dataType='Sparse', 
             userBwCov = 1.5, userBwMu= .3, kernel='epan', verbose=FALSE, nRegGrid=50)
pace <- FPCA(Ly=X$x, Lt=X$pp, optns=myop)

The coverage plot:

plot(ours.ls$ma$mt[,1], ours.ls$ma$mt[,2], pch=19, col='gray70', cex=.8, 
     xlab='s', ylab='t', cex.lab=1.2, cex.axis=1.1)
points(ours.ls$ma$mt[,2], ours.ls$ma$mt[,1], pch=19, col='gray70', cex=.8)  # mirror across the diagonal


The estimated covariance functions:

ss <- tt <- ours.r$ss
G.r <- ours.r$cov.fun
filled.contour(tt, ss, G.r, main='ROB')
ss <- tt <- ours.ls$ss
G.ls <- ours.ls$cov.fun
filled.contour(tt, ss, G.ls, main='LS')
ss <- tt <- pace$workGrid
G.pace <- pace$smoothedCov
filled.contour(tt, ss, G.pace, main='PACE')

<img src="README_files/figure-gfm/covariances-1.png" width="33%" /><img src="README_files/figure-gfm/covariances-2.png" width="33%" /><img src="README_files/figure-gfm/covariances-3.png" width="33%" />

Another take:

persp(ours.r$tt, ours.r$ss, G.r, xlab="s", ylab="t", zlab=" ",  
      zlim=c(10000, 130000), theta = -30, phi = 30, r = 50, 
      col="gray90", ltheta = 120, shade = 0.15, ticktype="detailed", 
      cex.axis=0.9, main = 'ROB')
persp(ours.ls$tt, ours.ls$ss, G.ls, xlab="s", ylab="t", zlab=" ", 
      zlim=c(10000, 130000), theta = -30, phi = 30, r = 50, 
      col="gray90", ltheta = 120, shade = 0.15, ticktype="detailed", 
      cex.axis=0.9, cex.lab=.9, main = 'LS')
persp(pace$workGrid, pace$workGrid, G.pace, xlab="s", ylab="t", zlab=" ",  
      zlim=c(10000, 130000), theta = -30, phi = 30, r = 50, 
      col="gray90", ltheta = 120, shade = 0.15, ticktype="detailed", 
      cex.axis=0.9, main = 'PACE')

<img src="README_files/figure-gfm/covariances.2-1.png" width="33%" /><img src="README_files/figure-gfm/covariances.2-2.png" width="33%" /><img src="README_files/figure-gfm/covariances.2-3.png" width="33%" />

The “proportion of variance” explained by the first few principal directions is:

dd <- eigen(ours.r$cov.fun)$values
ddls <- eigen(ours.ls$cov.fun)$values
ddp <- eigen(pace$smoothedCov)$values
rbind(ours = cumsum(dd)[1:3] / sum(dd[dd > 0]), 
      ls = cumsum(ddls)[1:3] / sum(ddls[ddls > 0]), 
      pace = cumsum(ddp)[1:3] / sum(ddp[ddp > 0]))
##           [,1]      [,2]      [,3]
## ours 0.9467379 0.9983907 0.9998978
## ls   0.9524343 0.9894052 0.9994967
## pace 0.8774258 0.9480532 0.9731165

In what follows we will use 2 principal components. The corresponding estimated scores are:

colors <- c('skyblue2', 'tomato3', 'gray70') #ROB, LS, PACE
boxplot(cbind(ours.r$xis[, 1:2], ours.ls$xis[, 1:2], pace$xiEst[, 1:2]), 
        names = rep(1:2, 3), col=rep(colors, each=2))
abline(h=0, lwd=2)
abline(v=c(2.5, 4.5), lwd=2, lty=2)
axis(3, las=1, at=c(1.5,3.5,5.5), cex.axis=1.4, lab=c('ROB', 'LS', 'PACE'),
     line=0.2, pos=NA, col="white")


We now compare the first two eigenfunctions.

G2 <- ours.r$cov.fun
G2.svd <- eigen(G2)$vectors
G.pace <- pace$smoothedCov
Gpace.svd <- eigen(G.pace)$vectors
G2.ls <- ours.ls$cov.fun
G2.ls.svd <- eigen(G2.ls)$vectors
ma <- -(mi <- -0.5) # y-axis limits
for(j in 1:2) {
  phihat <- G2.svd[,j]
  phipace <- Gpace.svd[,j]
  phils <- G2.ls.svd[,j]
  sg  <- as.numeric(sign(phihat  %*% phipace ))
  phipace <- sg * phipace
  sg <- as.numeric(sign(phihat  %*% phils ))
  phils <- sg * phils
  tt <- unique(ours.r$tt)
  tt.ls <- unique(ours.ls$tt)
  tt.pace <- pace$workGrid
  plot(tt, phihat, ylim=c(mi,ma), type='l', lwd=4, lty=1,
       xlab='t', ylab=expression(hat(phi)), cex.lab=1.1, 
       main=paste0('Eigenfunction ', j)) 
  lines(tt.ls, phils, lwd=4, lty=2) 
  lines(tt.pace, phipace, lwd=4, lty=3) 
  legend('topright', legend=c('Robust (ROB)', 'Non-robust (LS)', 
                              'PACE'), lwd=2, lty=1:3)
}

<img src="README_files/figure-gfm/eigenfunctions-1.png" width="50%" /><img src="README_files/figure-gfm/eigenfunctions-2.png" width="50%" />
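The sign(phihat %*% phipace) step above deserves a note: eigenvectors are only defined up to sign, so before overlaying two estimates one of them may need to be flipped so that their inner product is positive. A self-contained sketch of the alignment trick:

```r
# Two numerically "different" first eigenvectors of the same matrix
# that differ only in sign, aligned via their inner product.
set.seed(3)
A <- crossprod(matrix(rnorm(16), 4, 4))   # a symmetric PSD matrix
v <- eigen(A)$vectors[, 1]                # first eigenvector
w <- -v                                   # an equally valid first eigenvector
sg <- as.numeric(sign(sum(v * w)))        # -1 here: the two disagree in sign
w  <- sg * w                              # after flipping, w equals v
```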

Potential outliers

We look for potential outliers, using the scores on the first two eigenfunctions.

kk <- 2
xis.r  <- ours.r$xis[, 1:kk]
dist.ous <- RobStatTM::covRob(xis.r)$dist 
ous <- (1:length(dist.ous))[ dist.ous > qchisq(.995, df=kk)]
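The same cutoff logic can be illustrated on synthetic scores using classical estimates and base R's mahalanobis(); covRob() plays the same role above but with outlier-resistant location and scatter estimates:

```r
# Toy illustration of the chi-square cutoff on 2-dimensional scores.
set.seed(2)
kk  <- 2
xis <- matrix(rnorm(200 * kk), ncol = kk)    # synthetic, well-behaved scores
xis[1, ] <- c(8, -8)                         # plant one clear outlier
d2  <- mahalanobis(xis, colMeans(xis), cov(xis))
ous <- which(d2 > qchisq(.995, df = kk))     # same chi-square cutoff as above
1 %in% ous                                   # the planted outlier is flagged
```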

We look at the 5 most outlying curves, as flagged by the robust fit:

xmi <- min( tmp <- unlist(X$x) )
xma <- max( tmp )
ymi <- min( tmp <- unlist(X$pp) )
yma <- max( tmp )
