SparseFPCA
FPCA for sparse (longitudinal) data
Install / Use
/learn @msalibian/SparseFPCAREADME
Robust FPCA for sparsely observed curves
Matias Salibian-Barrera & Graciela Boente 2025-12-04
This repository contains the sparseFPCA package that implements the
robust FPCA method introduced in Robust functional principal components
for sparse longitudinal
data (Boente and
Salibian-Barrera, 2021) (ArXiv).
LICENSE: The content in this repository is released under the “Creative Commons Attribution-ShareAlike 4.0 International” license. See the human-readable version here and the real thing here.
sparseFPCA - Robust FPCA for sparsely observed curves
The sparseFPCA package implements the robust functional principal
components analysis (FPCA) estimator introduced in Boente and
Salibian-Barrera, 2021.
sparseFPCA computes robust estimators for the mean and covariance
(scatter) functions, and the corresponding eigenfunctions. It can be
used with functional data sets where only a few observations per curve
are available (possibly recorded at irregular intervals).
Installing the sparseFPCA package for R
The package can be installed directly from this repository using the
following command in R:
devtools::install_github('msalibian/sparseFPCA', ref = "master")
An example - CD4 counts data
Here we illustrate the use of our method and compare it with existing
alternatives. We will analyze the CD4 data, which is available in the
catdata package
(catdata). These data are
part of the Multicentre AIDS Cohort Study (Zeger and Diggle,
1994). They consist of 2376
measurements of CD4 cell counts, taken on 369 men. The times are
measured in years since seroconversion (t = 0).
We first load the data set and arrange it in a suitable format. Because
the data consist of trajectories of different lengths, possibly measured
at different times, the software requires that the observations be
arranged in two lists, one (which we call X$x below) containing the
vectors (of varying lengths) of points observed in each curve, and the
other (X$pp) with the corresponding times:
data(aids, package='catdata')
X <- vector('list', 2)
names(X) <- c('x', 'pp')
X$x <- split(aids$cd4, aids$person)
X$pp <- split(aids$time, aids$person)
To ensure that there are enough observations to estimate the covariance
function at every pair of times (s, t), we only consider observations
for which t >= 0, and remove individuals that only have one
measurement.
n <- length(X$x)
shorts <- vector('logical', n)
for(i in 1:n) {
tmp <- (X$pp[[i]] >= 0)
X$pp[[i]] <- (X$pp[[i]])[tmp]
X$x[[i]] <- (X$x[[i]])[tmp]
if( length(X$pp[[i]]) <= 1 ) shorts[i] <- TRUE
}
X$x <- X$x[!shorts]
X$pp <- X$pp[!shorts]
This results in a data set with N = 292 curves, where the number of
observations per individual ranges between 2 and 11 (with a median of
5):
length(X$x)
## [1] 292
summary(lens <- sapply(X$x, length))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 5.000 4.983 6.000 11.000
table(lens)
## lens
## 2 3 4 5 6 7 8 9 10 11
## 51 52 35 43 39 20 23 10 15 4
The following figure shows the data set with three randomly chosen trajectories highlighted with solid black lines:
xmi <- min( tmp <- unlist(X$x) )
xma <- max( tmp )
ymi <- min( tmp <- unlist(X$pp) )
yma <- max( tmp )
n <- length(X$x)
plot(seq(ymi, yma, length=5), seq(xmi, xma,length=5), type='n', xlab='t', ylab='X(t)')
for(i in 1:n) { lines(X$pp[[i]], X$x[[i]], col='gray', lwd=1, type='b', pch=19,
cex=1) }
lens <- sapply(X$x, length)
set.seed(22)
ii <- c(sample((1:n)[lens==2], 1), sample((1:n)[lens==5], 1),
sample((1:n)[lens==10], 1))
for(i in ii) lines(X$pp[[i]], X$x[[i]], col='black', lwd=4, type='b', pch=19,
cex=1, lty=1)
<!-- -->
Robust and non-robust FPCA
We will compare the robust and non-robust versions of our approach with the PACE estimator of Yao, Muller and Wang (paper - package). We need to load the following packages
library(sparseFPCA)
library(doParallel)
library(fdapace)
The specific versions of these packages that were used here (via the
output of the function sessionInfo()) can be found at the bottom of
this page.
The following are parameters required for our estimator.
ncpus <- 4
seed <- 123
rho.param <- 1e-3
max.kappa <- 1e3
ncov <- 50
k.cv <- 10
k <- 5
s <- k
hs.mu <- seq(.1, 1.5, by=.1)
hs.cov <- seq(1, 7, length=10)
We now fit the robust and non-robust versions of our proposal, and also the PACE estimator. This step may take several minutes to run:
ours.ls <- lsfpca(X=X, ncpus=ncpus, hs.mu=hs.mu, hs.cov=hs.cov, rho.param=rho.param,
k = k, s = k, trace=FALSE, seed=seed, k.cv=k.cv, ncov=ncov,
max.kappa=max.kappa)
ours.r <- efpca(X=X, ncpus=ncpus, hs.mu=hs.mu, hs.cov=hs.cov, rho.param=rho.param,
alpha=0.2, k = k, s = k, trace=FALSE, seed=seed, k.cv=k.cv, ncov=ncov,
max.kappa=max.kappa)
myop <- list(error=FALSE, methodXi='CE', dataType='Sparse',
userBwCov = 1.5, userBwMu= .3, kernel='epan', verbose=FALSE, nRegGrid=50)
pace <- FPCA(Ly=X$x, Lt=X$pp, optns=myop)
The coverage plot:
plot(ours.ls$ma$mt[,1], ours.ls$ma$mt[,2], pch=19, col='gray70', cex=.8,
xlab='s', ylab='t', cex.lab=1.2, cex.axis=1.1)
points(ours.ls$ma$mt[,1], ours.ls$ma$mt[,1], pch=19, col='gray70', cex=.8)
<!-- -->
The estimated covariance functions:
ss <- tt <- ours.r$ss
G.r <- ours.r$cov.fun
filled.contour(tt, ss, G.r, main='ROB')
ss <- tt <- ours.ls$ss
G.ls <- ours.ls$cov.fun
filled.contour(tt, ss, G.ls, main='LS')
ss <- tt <- pace$workGrid
G.pace <- pace$smoothedCov
filled.contour(tt, ss, G.pace, main='PACE')
<img src="README_files/figure-gfm/covariances-1.png" width="33%" /><img src="README_files/figure-gfm/covariances-2.png" width="33%" /><img src="README_files/figure-gfm/covariances-3.png" width="33%" />
Another take:
persp(ours.r$tt, ours.r$ss, G.r, xlab="s", ylab="t", zlab=" ",
zlim=c(10000, 130000), theta = -30, phi = 30, r = 50,
col="gray90", ltheta = 120, shade = 0.15, ticktype="detailed",
cex.axis=0.9, main = 'ROB')
persp(ours.ls$tt, ours.ls$ss, G.ls, xlab="s", ylab="t", zlab=" ",
zlim=c(10000, 130000), theta = -30, phi = 30, r = 50,
col="gray90", ltheta = 120, shade = 0.15, ticktype="detailed",
cex.axis=0.9, cex.lab=.9, main = 'LS')
persp(pace$workGrid, pace$workGrid, G.pace, xlab="s", ylab="t", zlab=" ",
zlim=c(10000, 130000), theta = -30, phi = 30, r = 50,
col="gray90", ltheta = 120, shade = 0.15, ticktype="detailed",
cex.axis=0.9, main = 'PACE')
<img src="README_files/figure-gfm/covariances.2-1.png" width="33%" /><img src="README_files/figure-gfm/covariances.2-2.png" width="33%" /><img src="README_files/figure-gfm/covariances.2-3.png" width="33%" />
The “proportion of variance” explained by the first few principal directions are:
dd <- eigen(ours.r$cov.fun)$values
ddls <- eigen(ours.ls$cov.fun)$values
ddp <- eigen(pace$smoothedCov)$values
rbind(ours = cumsum(dd)[1:3] / sum(dd[dd > 0]),
ls = cumsum(ddls)[1:3] / sum(ddls[ddls > 0]),
pace = cumsum(ddp)[1:3] / sum(ddp[ddp > 0]))
## [,1] [,2] [,3]
## ours 0.9467379 0.9983907 0.9998978
## ls 0.9524343 0.9894052 0.9994967
## pace 0.8774258 0.9480532 0.9731165
In what follows we will use 2 principal components. The corresponding estimated scores are:
colors <- c('skyblue2', 'tomato3', 'gray70') #ROB, LS, PACE
boxplot(cbind(ours.r$xis[, 1:2], ours.ls$xis[, 1:2], pace$xiEst[, 1:2]),
names = rep(1:2, 3), col=rep(colors, each=2))
abline(h=0, lwd=2)
abline(v=c(2.5, 4.5), lwd=2, lty=2)
axis(3, las=1, at=c(1.5,3.5,5.5), cex.axis=1.4, lab=c('ROB', 'LS', 'PACE'),
line=0.2, pos=NA, col="white")
<!-- -->
We now compare the first two eigenfunctions.
G2 <- ours.r$cov.fun
G2.svd <- eigen(G2)$vectors
G.pace <- pace$smoothedCov
Gpace.svd <- eigen(G.pace)$vectors
G2.ls <- ours.ls$cov.fun
G2.ls.svd <- eigen(G2.ls)$vectors
ma <- -(mi <- -0.5) # y-axis limits
for(j in 1:2) {
phihat <- G2.svd[,j]
phipace <- Gpace.svd[,j]
phils <- G2.ls.svd[,j]
sg <- as.numeric(sign(phihat %*% phipace ))
phipace <- sg * phipace
sg <- as.numeric(sign(phihat %*% phils ))
phils <- sg * phils
tt <- unique(ours.r$tt)
tt.ls <- unique(ours.ls$tt)
tt.pace <- pace$workGrid
plot(tt, phihat, ylim=c(mi,ma), type='l', lwd=4, lty=1,
xlab='t', ylab=expression(hat(phi)), cex.lab=1.1,
main=paste0('Eigenfunction ', j))
lines(tt.ls, phils, lwd=4, lty=2)
lines(tt.pace, phipace, lwd=4, lty=3)
legend('topright', legend=c('Robust (ROB)', 'Non-robust (LS)',
'PACE'), lwd=2, lty=1:3)
}
<img src="README_files/figure-gfm/eigenfunctions-1.png" width="50%" /><img src="README_files/figure-gfm/eigenfunctions-2.png" width="50%" />
Potential outliers
We look for potential outliers, using the scores on the first two eigenfunctions.
kk <- 2
xis.r <- ours.r$xis[, 1:kk]
dist.ous <- RobStatTM::covRob(xis.r)$dist
ous <- (1:length(dist.ous))[ dist.ous > qchisq(.995, df=kk)]
We look at the 5 most outlying curves, as flagged by the robust fit:
xmi <- min( tmp <- unlist(X$x) )
xma <- max( tmp )
ymi <- min( tmp
Related Skills
node-connect
335.8kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
82.7kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
335.8kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
82.7kCommit, push, and open a PR
