WeightWatcher
The WeightWatcher tool for predicting the accuracy of Deep Neural Networks
Install / Use
WeightWatcher (WW) is an open-source diagnostic tool for analyzing Deep Neural Networks (DNNs), without needing access to training or even test data. It is based on theoretical research into Why Deep Learning Works, grounded in our Theory of Heavy-Tailed Self-Regularization (HT-SR). It uses ideas from Random Matrix Theory (RMT), Statistical Mechanics, and Strongly Correlated Systems.
It can be used to:
- analyze pre-trained or in-training PyTorch and Keras DNN models (Conv2D and Dense layers)
- monitor models, and the model layers, to see if they are over-trained or over-parameterized
- predict test accuracies across different models, with or without training data
- detect potential problems when compressing or fine-tuning pretrained models
- layer warning labels: over-trained; under-trained
Quick Links
- Please see our latest talk from the Silicon Valley ACM meetup
- Join the Discord Server
- For a deeper dive into the theory:
  - Dr. Martin's invited talk at NeurIPS 2023
  - the deep theory [SETOL monograph](https://arxiv.org/abs/2507.17912)
  - the most recent [Grokking paper](https://arxiv.org/abs/2506.04434)
- and some of the most recent Podcasts
- More details and demos can be found on the Calculated Content Blog
- and on the open-source landing page [weightwatcher.ai](https://weightwatcher.ai)

And in the notebooks provided in the WeightWatcher-examples GitHub repo (the examples folder here is quite old).
If you have some models you would like to analyze and get feedback on, check out WeightWatcher-Pro. It's currently in beta and free.
Installation: Version 0.7.6
pip install weightwatcher
If this fails, try the current TestPyPI version, 0.7.5.5:
python3 -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple weightwatcher
Usage
import weightwatcher as ww
import torchvision.models as models
model = models.vgg19_bn(pretrained=True)
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()
summary = watcher.get_summary(details)
It is that easy to run, and it generates a pandas DataFrame with details (and plots) for each layer,

and a summary dictionary of generalization metrics:
{'log_norm': 2.11, 'alpha': 3.06,
'alpha_weighted': 2.78,
'log_alpha_norm': 3.21,
'log_spectral_norm': 0.89,
'stable_rank': 20.90,
'mp_softrank': 0.52}
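The details DataFrame can be inspected directly with pandas. A minimal sketch, using a hypothetical DataFrame shaped like the analyze() output (the values below are illustrative, not from a real model):

```python
import pandas as pd

# Hypothetical details DataFrame, shaped like the output of watcher.analyze();
# in practice you would use the real `details` returned by WeightWatcher.
details = pd.DataFrame({
    "layer_id": [2, 5, 8],
    "alpha": [2.1, 3.4, 6.2],
    "log_spectral_norm": [0.7, 0.9, 1.1],
})

# Per HT-SR theory, alpha near 2 indicates a well-trained layer,
# while large alpha (> 6) suggests an under-trained layer.
flagged = details[details["alpha"] > 6.0]
print(flagged["layer_id"].tolist())  # layers worth a closer look
```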
Advanced Usage
The watcher object has several functions and analysis features described below
Notice the min_evals setting: the power-law fits need at least 50 eigenvalues to make sense, but describe() and the other methods do not.
watcher.analyze(model=None, layers=[], min_evals=50, max_evals=None,
plot=True, randomize=True, mp_fit=True, pool=True, savefig=True):
...
watcher.describe(model=None, layers=[], min_evals=0, max_evals=None,
plot=True, randomize=True, mp_fit=True, pool=True):
...
watcher.get_details()
watcher.get_summary(details) or get_summary()
watcher.get_ESD()
...
watcher.distances(model_1, model_2)
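For intuition on watcher.distances, here is a hedged numpy sketch of one way to measure a per-layer weight distance; the exact metric the library reports may differ, so treat this as an illustration only:

```python
import numpy as np

# Illustrative per-layer weight distance, in the spirit of
# watcher.distances(model_1, model_2); this is an assumption,
# not WeightWatcher's actual implementation.
def layer_distance(W1, W2):
    # root-mean-square difference between the two weight matrices
    return np.linalg.norm(W1 - W2) / np.sqrt(W1.size)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
print(layer_distance(W, W))        # 0.0 for identical layers
print(layer_distance(W, W + 0.1))  # grows with the size of the perturbation
```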
PEFT / LORA models (experimental)
To analyze a PEFT / LORA fine-tuned model, specify the peft option.
- peft = True: Forms the BA low-rank matrix and analyzes the delta layers, tagged with 'lora_BA' in the layer name
  details = watcher.analyze(peft=True)
- peft = 'with_base': Analyzes the base_model, the delta, and the combined layer weight matrices
  details = watcher.analyze(peft='with_base')

The base_model and fine-tuned model must have the same layer names; weightwatcher will ignore layers that do not share the same name. Also, at this point, biases are not considered. Finally, both models should be stored in the same format (i.e., safetensors).
Note: If you want to select by layer_ids, you must first run describe(peft=False), and then select both the lora_A and lora_B layers.
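The BA low-rank matrix mentioned above can be illustrated in a few lines of numpy; the names and shapes below are illustrative, not WeightWatcher internals:

```python
import numpy as np

# A LoRA layer stores two small matrices A (r x n) and B (m x r);
# the effective weight delta analyzed with peft=True is the product B @ A.
# Dimensions and names here are illustrative only.
rng = np.random.default_rng(42)
m, n, r = 128, 64, 8             # layer dims and LoRA rank
A = rng.standard_normal((r, n))  # the lora_A matrix
B = rng.standard_normal((m, r))  # the lora_B matrix

delta_W = B @ A                  # the "lora_BA" matrix whose ESD is analyzed
print(delta_W.shape)             # full layer shape, (128, 64)
print(np.linalg.matrix_rank(delta_W))  # but rank is at most r
```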
Usage: Base Model

Plotting and Fitting the Empirical Spectral Density (ESD)
WW creates plots for each layer weight matrix to observe how well the power law fits work
details = watcher.analyze(plot=True)
For each layer, WeightWatcher plots the ESD: a histogram of the eigenvalues of the layer correlation matrix X=W<sup>T</sup>W. It then fits the tail of the ESD to a (Truncated) Power Law, and plots these fits on different axes. The summary metrics (above) characterize the Shape and Scale of each ESD. Here's an example:
<img src="./img/ESD-plots.png" width='800px' height='auto' />

Generally speaking, the ESDs of the best layers, in the best DNNs, can be fit to a Power Law (PL), with PL exponents alpha closer to 2.0.
Visually, the ESD looks like a straight line on a log-log plot (above left).
Generalization Metrics
<details> <summary> The goal of the WeightWatcher project is to find generalization metrics that most accurately reflect observed test accuracies, across many different models and architectures, for pre-trained models and models undergoing training. </summary>Our HT-SR theory says that well-trained, well-correlated layers should be significantly different from the MP (Marchenko-Pastur) random bulk, and specifically should be heavy tailed. There are several layer metrics in WeightWatcher for this, including:
- rand_distance: the distance in distribution from the randomized layer
- alpha: the slope of the tail of the ESD, on a log-log scale
- alpha-hat or alpha_weighted: a scale-adjusted form of alpha (similar to the alpha-Schatten norm)
- stable_rank: a norm-adjusted measure of the scale of the ESD
- num_spikes: the number of spikes outside the MP bulk region
- max_rand_eval: scale of the random noise, etc.
All of these attempt to measure how non-random and/or non-heavy-tailed the layer ESDs are.
Scale Metrics
- log Frobenius norm: <img src="https://render.githubusercontent.com/render/math?math=\log_{10}\Vert\mathbf{W}\Vert^{2}_{F}">
- log_spectral_norm: <img src="https://render.githubusercontent.com/render/math?math=\log_{10}\lambda_{max}=\log_{10}\Vert\mathbf{W}\Vert^{2}_{\infty}">
- stable_rank: <img src="https://render.githubusercontent.com/render/math?math=R_{stable}=\Vert\mathbf{W}\Vert^{2}_{F}/\Vert\mathbf{W}\Vert^{2}_{\infty}">
- mp_softrank: <img src="https://render.githubusercontent.com/render/math?math=R_{MP}=\lambda_{MP}/\lambda_{max}">
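These Scale metrics can be computed directly from a layer weight matrix with numpy; a minimal sketch (mp_softrank is omitted, since it additionally needs the Marchenko-Pastur bulk edge lambda_MP):

```python
import numpy as np

# Minimal sketch of the Scale metrics for a single weight matrix W;
# WeightWatcher's implementation also handles Conv2D slices, etc.
rng = np.random.default_rng(0)
W = rng.standard_normal((200, 100)) / np.sqrt(200)

evals = np.linalg.eigvalsh(W.T @ W)        # eigenvalues of X = W^T W (the ESD)
lambda_max = evals.max()

log_frobenius_norm = np.log10(evals.sum())  # log10 ||W||_F^2
log_spectral_norm = np.log10(lambda_max)    # log10 lambda_max
stable_rank = evals.sum() / lambda_max      # ||W||_F^2 / ||W||_inf^2
print(log_frobenius_norm, log_spectral_norm, stable_rank)
```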
Shape Metrics
- alpha: <img src="https://render.githubusercontent.com/render/math?math=\alpha"> Power Law (PL) exponent
- D: <img src="https://render.githubusercontent.com/render/math?math=D"> (Truncated) PL quality of fit (the Kolmogorov-Smirnov distance metric)

(advanced usage)
- TPL: (alpha and Lambda) Truncated Power Law Fit
- E_TPL: (alpha and Lambda) Extended Truncated Power Law Fit
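For intuition, here is a simplified maximum-likelihood fit of the PL exponent alpha (the standard continuous power-law MLE); WeightWatcher's fitter also selects xmin and reports the KS distance D, which this sketch omits:

```python
import numpy as np

# Simplified power-law MLE for the tail exponent alpha of an ESD.
# This is a sketch of the idea, not WeightWatcher's actual fitter.
def fit_alpha(evals, xmin):
    tail = evals[evals >= xmin]
    return 1.0 + len(tail) / np.log(tail / xmin).sum()

# Sanity check on synthetic eigenvalues with a known alpha = 3 tail
rng = np.random.default_rng(1)
alpha_true = 3.0
evals = 1.0 + rng.pareto(alpha_true - 1.0, size=5000)  # Pareto tail, xmin = 1
print(fit_alpha(evals, xmin=1.0))  # close to 3.0
```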
Scale-adjusted Shape Metrics
- alpha_weighted: <img src="https://render.githubusercontent.com/render/math?math=\hat{\alpha}=\alpha\log_{10}\lambda_{max}">
- log_alpha_norm (Schatten norm): <img src="https://render.githubusercontent.com/render/math?math=\log_{10}\Vert\mathbf{X}\Vert^{\alpha}_{\alpha}">
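Given a fitted alpha and the ESD eigenvalues, the scale-adjusted metrics follow directly; the values below are illustrative:

```python
import numpy as np

# Sketch of the scale-adjusted shape metrics, for an illustrative
# fitted alpha and a small set of illustrative eigenvalues.
alpha = 3.0
evals = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
lambda_max = evals.max()

alpha_weighted = alpha * np.log10(lambda_max)      # alpha-hat
log_alpha_norm = np.log10((evals ** alpha).sum())  # log10 ||X||_alpha^alpha
print(alpha_weighted, log_alpha_norm)
```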
Direct Correlation Metrics
The random distance metric is a new, non-parametric approach that appears to work well in early testing. See this recent blog post.
rand_distance: <img src="https://render.githubusercontent.com/render/math?math=div(\mathbf{W},rand(\mathbf{W}))"> Distance of layer ESD from the ideal RMT MP ESD
There are also related metrics, including the new
- 'ww_maxdist'
- 'ww_softrank'
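To illustrate the idea behind rand_distance, here is a non-parametric sketch comparing the ESD of W to the ESD of an element-wise shuffled copy rand(W), using a Jensen-Shannon divergence (an assumption for illustration; the divergence the library actually uses may differ):

```python
import numpy as np

# Sketch of rand_distance: distance between the ESD of W and the ESD of a
# randomized copy rand(W). Illustrative only, not WeightWatcher's exact metric.
def esd_hist(W, bins):
    evals = np.linalg.eigvalsh(W.T @ W)
    counts, _ = np.histogram(evals, bins=bins)
    return counts.astype(float) + 1e-12   # smooth empty bins

def js_divergence(p, q):
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(0)
# A strongly correlated (nearly rank-one) layer sits far from its randomization
u, v = rng.standard_normal((100, 1)), rng.standard_normal((1, 80))
W = u @ v + 0.1 * rng.standard_normal((100, 80))
W_rand = rng.permutation(W.ravel()).reshape(W.shape)  # rand(W): shuffled entries

top = max(np.linalg.eigvalsh(W.T @ W).max(),
          np.linalg.eigvalsh(W_rand.T @ W_rand).max())
bins = np.linspace(0.0, top, 50)
rand_distance = js_divergence(esd_hist(W, bins), esd_hist(W_rand, bins))
print(rand_distance)   # larger = more correlation structure in W
```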
Misc Details
- N, M: matrix or tensor slice dimensions
- num_spikes: number of spikes outside the bulk region of the ESD, when fit to an MP distribution
- num_rand_spikes: number of Correlation Traps
- max_rand_eval: scale of the random noise in the layer
Summary Statistics:
The layer metrics are averaged in the summary statistics:
Get the average metrics, as a summary (dict), from the given (or current) details dataframe
details = watcher.analyze(model=model)
summary = watcher.get_summary(details)
or just
summary = watcher.get_summary()
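Since the summary is just the per-layer metrics averaged over layers, it can be sketched with a hypothetical details DataFrame:

```python
import pandas as pd

# The summary dict is the column-wise mean of the per-layer metrics;
# the DataFrame values here are hypothetical.
details = pd.DataFrame({
    "alpha": [2.5, 3.0, 3.5],
    "log_spectral_norm": [0.8, 0.9, 1.0],
})
summary = details.mean().to_dict()
print(summary)
```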
The summary statistics can be used to gauge the test error of a series of similar models, without needing access to training or test data.

