MLE4LTI
Maximum Likelihood and Prediction Error Method for parameter estimation in Linear Time-Invariant Systems
Install / Use
/learn @Leo-Simpson/MLE4LTIREADME
Nonlinear Programming for Maximum Likelihood Estimation of parameters in Linear Time Invariant systems.
This repository is an adaptation from the repository "SQP-for-MLE-LTV" for the special case of time-invariant linear systems, with a more efficient optimization routine. The associated article (for the LTV version) is "An Efficient Method for the Joint Estimation of System Parameters and Noise Covariances for Linear Time-Variant Systems"
Prerequisites
Python packages
- numpy
- scipy
- matplotlib
- casadi (with IPOPT)
- time
- contextlib
Description of the problem to solve
Parametric Linear Time-Invariant Systems
We assume that the data is generated through the following dynamical system (linear, and with Gaussian Noise)
\begin{equation}
\begin{aligned}
x_{k+1} &= A(\theta) x_k + B(\theta) u_k + b(\theta)+ w_k, && k = 0, \dots, N-1, \\
y_{k} &= C(\theta) x_k + D(\theta) u_k + d(\theta)+ v_k, &&k = 0, \dots, N, \\
w_k &\sim \mathcal{N}\left( 0, Q(\theta) \right), &&k = 0, \dots, N-1, \\
v_k &\sim \mathcal{N}\left( 0, R(\theta) \right), &&k = 0, \dots, N,
\end{aligned}
\end{equation}
where $\theta$ are parameters of the model
Note that the time-varying behavior comes from the inputs $u_k$, which can appear in the dynamics linearly or nonlinearly (Note: this is also different from the paper, where the concept of inputs was not present).
Also, the parameters are assumed to be in some set defined with inequality constraints:
\begin{align}
\{ \theta \in \mathbb{R}^{n_{\theta}} \; \big| \; h(\theta) \geq 0 \},
\end{align}
Note that the inequality is of the opposite sign of how it is in the paper.
The Estimation
We consider optimization problems for estimation of $\theta$. These are maximizing the performance of a Kalman filter on the training data over the parameters $\theta$.
Prediction Error / Maximum Likelihood Estimation
\begin{aligned}
&\underset{ \substack{
\theta, e,
\hat{x}}, P, L, S,
}
{\mathrm{minimize}} \; \frac{1}{N}\sum_{k=1}^{N} F(e_k, S_k) \\
& \mathrm{subject} \, \mathrm{to} \,
\\&\qquad
L_k = A( \theta) P_{k} \, C(\theta)^{\top} S_k^{-1},
\\&\qquad
S_k = C(\theta) \, P_{k} \, C(\theta)^{\top} + R(\theta),
\\&\qquad
e_k = y_k - \left( C(\theta) \hat{x}_k + D(\theta) u_k + d(\theta) \right),
\\&\qquad
\hat{x}_{k+1} = A( \theta)\hat{x}_{k} + B(\theta)u_k + L_k e_k,
\\&\qquad
P_{k+1} = A( \theta) P_kA( \theta)^{\top} - L_k S_k L_k^{\top} + Q(\theta) ,
\\&\qquad
h(\theta) \geq 0.
\end{aligned}
Regarding the cost function $F(\cdot, \cdot)$, two options are considered
\begin{align}
\begin{split}
F_{\mathrm{MLE}}(e, S) & \equiv e^{\top} S^{-1} e + \log \det S, \\
F_{\mathrm{PredErr}}(e, S) & \equiv \left\lVert e \right\rVert^2.
\end{split}
\end{align}
The first of them is referred to as "MLE" because it corresponds to the Maximum Likelihood problem. The second is called "PredErr" because it corresponds to the Prediction Error Methods.
Simplification for LTI
In the case of an LTI system, the Kalman Filter equations quickly converge to their steady state, which is given by the Riccati Equation. Replacing the Kalman Filter equation with their corresponding steady states equation considerably reduces the complexity of the equations:
\begin{aligned}
&\underset{ \substack{
\theta, e,
\hat{x}}, P, L, S,
}
{\mathrm{minimize}} \; \frac{1}{N}\sum_{k=1}^{N} F(e_k, S) \\
& \mathrm{subject} \, \mathrm{to} \,
\\&\qquad
e_k = y_k - \left( C(\theta) \hat{x}_k + D(\theta) u_k + d(\theta) \right),
\\&\qquad
\hat{x}_{k+1} = A( \theta)\hat{x}_{k} + B(\theta)u_k + L e_k,
\\&\qquad
L = A( \theta) P \, C(\theta)^{\top} S^{-1},
\\&\qquad
S = C(\theta) \, P \, C(\theta)^{\top} + R(\theta),
\\&\qquad
P = A( \theta) P A( \theta)^{\top} - L S L^{\top} + Q(\theta) ,
\\&\qquad
h(\theta) \geq 0.
\end{aligned}
Reformulation for a better optimization
Let us parameterize $S$ and $P$ with additional variables $\eta$ (for example with the Cholesky decomposition to make it positive).
We also define the following:
\begin{aligned}
\tilde{A}(\theta, L) &= A(\theta) - L C(\theta),\\
\tilde{u}_k &= \begin{bmatrix} u_k \\ 1 \\ y_k\end{bmatrix}, \\
\tilde{B}(\theta, L) &=
\begin{bmatrix} B(\theta) & b(\theta) - L d(\theta) & L \end{bmatrix}, \\
\tilde{D}(\theta, L) &=
\begin{bmatrix} D(\theta) & d(\theta) & -I \end{bmatrix}, \\
g(\theta, \eta, L) & =
\begin{bmatrix}
A( \theta) P(\eta) \, C(\theta)^{\top} - LS(\eta)
\\
C(\theta) \, P(\eta) \, C(\theta)^{\top} + R(\theta) - S(\eta)
\\
A( \theta) P(\eta) A( \theta)^{\top} - L S(\eta) L^{\top} + Q(\theta) - P(\eta)
\end{bmatrix}.
\end{aligned}
Finally, we simply stack all main optimization variables $p = (\theta, \eta, L)$, which leads to the following formulation:
\begin{aligned}
&\underset{ \substack{
p, e,
\hat{x}}
}
{\mathrm{minimize}} \; \; G\left(\; \frac{1}{N}\sum_{k=1}^{N} e_k e_k^\top, \; S(p) \right) \\
& \mathrm{subject} \, \mathrm{to} \,
\\&\qquad
e_k = C(p) \hat{x}_k + \tilde{D}(p) \tilde{u}_k,
\\&\qquad
\hat{x}_{k+1} = \tilde{A}(p) \hat{x}_{k} + \tilde{B}(p) \tilde{u}_k,
\\&\qquad
g(p) = 0,
\\&\qquad
h(p) \geq 0
\end{aligned}
where the function $G(\cdot, \cdot)$, is one of the two following
\begin{align}
\begin{split}
G_{\mathrm{MLE}}(V, S) & \equiv \mathrm{Tr}\left( S^{-1} V \right) + \log \det S, \\
G_{\mathrm{PredErr}}(V, S) & \equiv \mathrm{Tr}\left( V \right) .
\end{split}
\end{align}
Description of the algorithms
IPOPT-lifted
One option is simply to call the solver IPOPT to solve the optimization problem in this lifted form:
\begin{aligned}
&\underset{ \substack{
p, e,
\hat{x}}
}
{\mathrm{minimize}} \; \; G\left(\; \frac{1}{N}\sum_{k=1}^{N} e_k e_k^\top, \; S(p) \right) \\
& \mathrm{subject} \, \mathrm{to} \,
\\&\qquad
e_k = C(p) \hat{x}_k + \tilde{D}(p) \tilde{u}_k,
\\&\qquad
\hat{x}_{k+1} = \tilde{A}(p) \hat{x}_{k} + \tilde{B}(p) \tilde{u}_k,
\\&\qquad
g(p) = 0,
\\&\qquad
h(p) \geq 0,
\end{aligned}
IPOPT-dense
One can also condense partially the problem to remove the dependency in $N$:
\begin{aligned}
&\underset{ p }{\mathrm{minimize}} \; \;
G\left(\; V(p), \; S(p) \right) \\
& \mathrm{subject} \, \mathrm{to} \,
\\&\qquad
g(p) = 0,
\\&\qquad
h(p) \geq 0,
\end{aligned}
where the function $V(\cdot)$ is defined as follows:
\begin{aligned}
V(p) & \coloneqq \frac{1}{N} \sum_{k=1}^{N} e_k(p) e_k(p)^\top
\end{aligned}
and $e_k(p)$ are computed via the following equations:
\begin{aligned}
e_k &= C(p) \hat{x}_k + \tilde{D}(p) \tilde{u}_k,
\\
\hat{x}_{k+1} &= \tilde{A}(p) \hat{x}_{k} + \tilde{B}(p) \tilde{u}_k,
\end{aligned}
Sequential Programming (SP)
We propose a tailored SP method. Here, we solve a sequence of optimization problems similar to the one above, but where the function $V(\cdot)$ is approximated by a quadratic approximation to remove the dependency in the horizon $N$ in each optimization problem. Ultimately, for globalization, a trust region approach is adopted, where the size of the trust region $\Delta$ is reduced when necessary.
In this algorithm, we update the current solution point $p^{(i)}$ by solving the following (smaller) NonLinear Program:
\begin{aligned}
p^{(i+1)} = &\underset{ p }{\mathrm{\arg \min}} \; \;
G\left(\; V^{\textup{quad}}(p; p^{(i)}), \; S(p) \right) \\
& \mathrm{subject} \, \mathrm{to} \,
\\&\qquad
g(p) = 0,
\\&\qquad
h(p) \geq 0,
\\&\qquad
\left\lVert p - p^{(i)} \right\rVert_1 \leq \Delta^{(i)},
\end{aligned}
where $V^{\textup{quad}}(p; \bar{p})$ is a (Gauss-Newton) quadratic approximation of $V(\cdot)$ around the point $\bar{p}$:
\begin{aligned}
V^{\textup{quad}}(p; \bar{p}) &\coloneqq
\frac{1}{N}\sum_{k=1}^{N} e_k^{\textup{lin}}(p; \bar{p}) e_k^{\textup{lin}}(p; \bar{p})^\top,
\end{aligned}
where $e_k^{\textup{lin}}(p; \bar{p})$ is the linearization of $e_k(p)$:
\begin{aligned}
e_k^{\textup{lin}}(p; \bar{p})
&\coloneqq
e_k(\bar{p}) + \frac{d e_k(\bar{p})}{d p } (p - \bar{p}).
\end{aligned}
Note that $e_k(\bar{p})$ and $\frac{d e_k(\bar{p})}{d p }$ are computed via propagation of the following dynamical equations and of their derivatives
\begin{aligned}
e_k &= C(p) \hat{x}_k + \tilde{D}(p) \tilde{u}_k,
\\
\hat{x}_{k+1} &= \tilde{A}(p) \hat{x}_{k} + \tilde{B}(p) \tilde{u}_k.
\end{aligned}
Regarding the size of the trust region, whenever the objective value of the new candidate point $p^{(i+1)}$ is higher the previous point $p^{(i)}$, we reject the step and shrink the trust region: $\Delta^{(i+1)} = \gamma \Delta^{(i)}$.
Benchmark
To assess the computational speed of the three algorithms, we run them on a simple example, with generated data, for different realization of the data generation and different data lengths $N$.
Example 1
For a simple model with $3$ states, $1$ input, $2$ outputs, representing heat transfers:
\begin{aligned}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}^{+}
&=
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
+
\theta_1 \begin{bmatrix} u - x_1 \\ x_1 - x_2 \\ x_2 - x_3 \end{bmatrix}
+
\theta_2
\begin{bmatrix} 0 \\ 0 \\ 0 - x_3 \end{bmatrix} + w
\\
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} &= \begin{bmatrix} x_1 \\ x_3 \end{bmatrix} + v
\end{aligned}
and with
\begin{aligned}
\mathbb{E}\big[ w w^\top\big] = Q(\theta) &= \begin{bmatrix} \theta_3 & 0 & 0 \\ 0 & \theta_3 & 0 \\ 0 & 0 & \theta_3 \end{bmatrix} \\
\mathbb{E}\big[ v v^\top\big] = R(\theta) &= \begin{bmatrix} \theta_4 & 0 \\ 0 & \theta_4 \end{bmatrix} \\
\end{aligned}
The model is defined in [models/example1](https://github.com/Leo-Simpson/MLE4LTI/blob/main/m
Related Skills
node-connect
339.3kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
83.9kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
339.3kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
83.9kCommit, push, and open a PR
