WeightWatcher
The WeightWatcher tool for predicting the accuracy of Deep Neural Networks
Install / Use
WeightWatcher (WW) is an open-source diagnostic tool for analyzing Deep Neural Networks (DNNs), without needing access to training or even test data. It is based on theoretical research into Why Deep Learning Works, grounded in our Theory of Heavy-Tailed Self-Regularization (HT-SR). It uses ideas from Random Matrix Theory (RMT), Statistical Mechanics, and Strongly Correlated Systems.
It can be used to:
- analyze pre-trained or in-training PyTorch and Keras DNN models (Conv2D and Dense layers)
- monitor models, and the model layers, to see if they are over-trained or over-parameterized
- predict test accuracies across different models, with or without training data
- detect potential problems when compressing or fine-tuning pretrained models
- layer warning labels: over-trained; under-trained
Quick Links
- Please see our latest talk from the Silicon Valley ACM meetup
- Join the Discord Server
- For a deeper dive into the theory:
  - Dr. Martin's invited talk at NeurIPS 2023
  - the deep theory [SETOL monograph](https://arxiv.org/abs/2507.17912)
  - the most recent [Grokking paper](https://arxiv.org/abs/2506.04434)
- and some of the most recent Podcasts
- More details and demos can be found on the Calculated Content Blog
- and on the open-source landing page [weightwatcher.ai](https://weightwatcher.ai)

And in the notebooks provided in the WeightWatcher-examples GitHub repo (the examples folder here is quite old).
If you have some models you would like to analyze and get feedback on, check out WeightWatcher-Pro. It's currently in beta and free.
Installation: Version 0.7.6
pip install weightwatcher
If this fails, try the current TestPyPI version, 0.7.5.5:
python3 -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple weightwatcher
Usage
import weightwatcher as ww
import torchvision.models as models
model = models.vgg19_bn(pretrained=True)
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()
summary = watcher.get_summary(details)
It is that easy to run, and it generates a pandas DataFrame with details (and plots) for each layer,

and a summary dictionary of generalization metrics:
{'log_norm': 2.11, 'alpha': 3.06,
'alpha_weighted': 2.78,
'log_alpha_norm': 3.21,
'log_spectral_norm': 0.89,
'stable_rank': 20.90,
'mp_softrank': 0.52}
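The details DataFrame can be inspected directly with pandas. A minimal sketch, using a hypothetical DataFrame shaped like the analyze() output (the values below are illustrative, not from a real model):

```python
import pandas as pd

# Hypothetical details DataFrame, shaped like the output of watcher.analyze();
# in practice you would use the real `details` returned by WeightWatcher.
details = pd.DataFrame({
    "layer_id": [2, 5, 8],
    "alpha": [2.1, 3.4, 6.2],
    "log_spectral_norm": [0.7, 0.9, 1.1],
})

# Per HT-SR theory, alpha near 2 indicates a well-trained layer,
# while large alpha (> 6) suggests an under-trained layer.
flagged = details[details["alpha"] > 6.0]
print(flagged["layer_id"].tolist())  # layers worth a closer look
```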
Advanced Usage
The watcher object has several functions and analysis features described below
Notice the min_evals setting: the power-law fits need at least 50 eigenvalues to make sense, but describe() and the other methods do not.
watcher.analyze(model=None, layers=[], min_evals=50, max_evals=None,
plot=True, randomize=True, mp_fit=True, pool=True, savefig=True):
...
watcher.describe(model=None, layers=[], min_evals=0, max_evals=None,
plot=True, randomize=True, mp_fit=True, pool=True):
...
watcher.get_details()
watcher.get_summary(details) or get_summary()
watcher.get_ESD()
...
watcher.distances(model_1, model_2)
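For intuition on watcher.distances, here is a hedged numpy sketch of one way to measure a per-layer weight distance; the exact metric the library reports may differ, so treat this as an illustration only:

```python
import numpy as np

# Illustrative per-layer weight distance, in the spirit of
# watcher.distances(model_1, model_2); this is an assumption,
# not WeightWatcher's actual implementation.
def layer_distance(W1, W2):
    # root-mean-square difference between the two weight matrices
    return np.linalg.norm(W1 - W2) / np.sqrt(W1.size)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
print(layer_distance(W, W))        # 0.0 for identical layers
print(layer_distance(W, W + 0.1))  # grows with the size of the perturbation
```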
PEFT / LORA models (experimental)
To analyze a PEFT / LORA fine-tuned model, specify the peft option.
- peft = True: Forms the BA low-rank matrix and analyzes the delta layers, tagged with 'lora_BA' in the layer name
  details = watcher.analyze(peft=True)
- peft = 'with_base': Analyzes the base_model, the delta, and the combined layer weight matrices
  details = watcher.analyze(peft='with_base')

The base_model and fine-tuned model must have the same layer names; weightwatcher will ignore layers that do not share the same name. Also, at this point, biases are not considered. Finally, both models should be stored in the same format (i.e., safetensors).
Note: If you want to select by layer_ids, you must first run describe(peft=False), and then select both the lora_A and lora_B layers.
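The BA low-rank matrix mentioned above can be illustrated in a few lines of numpy; the names and shapes below are illustrative, not WeightWatcher internals:

```python
import numpy as np

# A LoRA layer stores two small matrices A (r x n) and B (m x r);
# the effective weight delta analyzed with peft=True is the product B @ A.
# Dimensions and names here are illustrative only.
rng = np.random.default_rng(42)
m, n, r = 128, 64, 8             # layer dims and LoRA rank
A = rng.standard_normal((r, n))  # the lora_A matrix
B = rng.standard_normal((m, r))  # the lora_B matrix

delta_W = B @ A                  # the "lora_BA" matrix whose ESD is analyzed
print(delta_W.shape)             # full layer shape, (128, 64)
print(np.linalg.matrix_rank(delta_W))  # but rank is at most r
```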
Usage: Base Model

Plotting and Fitting the Empirical Spectral Density (ESD)
WW creates plots for each layer weight matrix to observe how well the power law fits work
details = watcher.analyze(plot=True)
For each layer, WeightWatcher plots the ESD: a histogram of the eigenvalues of the layer correlation matrix X=W<sup>T</sup>W. It then fits the tail of the ESD to a (Truncated) Power Law, and plots these fits on different axes. The summary metrics (above) characterize the Shape and Scale of each ESD. Here's an example:
<img src="./img/ESD-plots.png" width='800px' height='auto' />

Generally speaking, the ESDs of the best layers, in the best DNNs, can be fit to a Power Law (PL), with PL exponents alpha closer to 2.0.
Visually, the ESD looks like a straight line on a log-log plot (above left).
Generalization Metrics
<details> <summary> The goal of the WeightWatcher project is to find generalization metrics that most accurately reflect observed test accuracies, across many different models and architectures, for pre-trained models and models undergoing training. </summary>Our HT-SR theory says that well-trained, well-correlated layers should be significantly different from the MP (Marchenko-Pastur) random bulk, and specifically should be heavy tailed. There are several layer metrics in WeightWatcher for this, including:
- rand_distance: the distance in distribution from the randomized layer
- alpha: the slope of the tail of the ESD, on a log-log scale
- alpha-hat or alpha_weighted: a scale-adjusted form of alpha (similar to the alpha-Schatten norm)
- stable_rank: a norm-adjusted measure of the scale of the ESD
- num_spikes: the number of spikes outside the MP bulk region
- max_rand_eval: scale of the random noise, etc.
All of these attempt to measure how non-random and/or non-heavy-tailed the layer ESDs are.
Scale Metrics
- log Frobenius norm: <img src="https://render.githubusercontent.com/render/math?math=\log_{10}\Vert\mathbf{W}\Vert^{2}_{F}">
- log_spectral_norm: <img src="https://render.githubusercontent.com/render/math?math=\log_{10}\lambda_{max}=\log_{10}\Vert\mathbf{W}\Vert^{2}_{\infty}">
- stable_rank: <img src="https://render.githubusercontent.com/render/math?math=R_{stable}=\Vert\mathbf{W}\Vert^{2}_{F}/\Vert\mathbf{W}\Vert^{2}_{\infty}">
- mp_softrank: <img src="https://render.githubusercontent.com/render/math?math=R_{MP}=\lambda_{MP}/\lambda_{max}">
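These Scale metrics can be computed directly from a layer weight matrix with numpy; a minimal sketch (mp_softrank is omitted, since it additionally needs the Marchenko-Pastur bulk edge lambda_MP):

```python
import numpy as np

# Minimal sketch of the Scale metrics for a single weight matrix W;
# WeightWatcher's implementation also handles Conv2D slices, etc.
rng = np.random.default_rng(0)
W = rng.standard_normal((200, 100)) / np.sqrt(200)

evals = np.linalg.eigvalsh(W.T @ W)        # eigenvalues of X = W^T W (the ESD)
lambda_max = evals.max()

log_frobenius_norm = np.log10(evals.sum())  # log10 ||W||_F^2
log_spectral_norm = np.log10(lambda_max)    # log10 lambda_max
stable_rank = evals.sum() / lambda_max      # ||W||_F^2 / ||W||_inf^2
print(log_frobenius_norm, log_spectral_norm, stable_rank)
```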
Shape Metrics
- alpha: <img src="https://render.githubusercontent.com/render/math?math=\alpha"> Power Law (PL) exponent
- D: <img src="https://render.githubusercontent.com/render/math?math=D"> (Truncated) PL quality of fit (the Kolmogorov-Smirnov distance metric)

(advanced usage)
- TPL: (alpha and Lambda) Truncated Power Law Fit
- E_TPL: (alpha and Lambda) Extended Truncated Power Law Fit
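For intuition, here is a simplified maximum-likelihood fit of the PL exponent alpha (the standard continuous power-law MLE); WeightWatcher's fitter also selects xmin and reports the KS distance D, which this sketch omits:

```python
import numpy as np

# Simplified power-law MLE for the tail exponent alpha of an ESD.
# This is a sketch of the idea, not WeightWatcher's actual fitter.
def fit_alpha(evals, xmin):
    tail = evals[evals >= xmin]
    return 1.0 + len(tail) / np.log(tail / xmin).sum()

# Sanity check on synthetic eigenvalues with a known alpha = 3 tail
rng = np.random.default_rng(1)
alpha_true = 3.0
evals = 1.0 + rng.pareto(alpha_true - 1.0, size=5000)  # Pareto tail, xmin = 1
print(fit_alpha(evals, xmin=1.0))  # close to 3.0
```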
Scale-adjusted Shape Metrics
- alpha_weighted: <img src="https://render.githubusercontent.com/render/math?math=\hat{\alpha}=\alpha\log_{10}\lambda_{max}">
- log_alpha_norm (Schatten norm): <img src="https://render.githubusercontent.com/render/math?math=\log_{10}\Vert\mathbf{X}\Vert^{\alpha}_{\alpha}">
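Given a fitted alpha and the ESD eigenvalues, the scale-adjusted metrics follow directly; the values below are illustrative:

```python
import numpy as np

# Sketch of the scale-adjusted shape metrics, for an illustrative
# fitted alpha and a small set of illustrative eigenvalues.
alpha = 3.0
evals = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
lambda_max = evals.max()

alpha_weighted = alpha * np.log10(lambda_max)      # alpha-hat
log_alpha_norm = np.log10((evals ** alpha).sum())  # log10 ||X||_alpha^alpha
print(alpha_weighted, log_alpha_norm)
```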
Direct Correlation Metrics
The random distance metric is a new, non-parametric approach that appears to work well in early testing. See this recent blog post.
rand_distance: <img src="https://render.githubusercontent.com/render/math?math=div(\mathbf{W},rand(\mathbf{W}))"> Distance of layer ESD from the ideal RMT MP ESD
There are also related metrics, including the new
- 'ww_maxdist'
- 'ww_softrank'
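To illustrate the idea behind rand_distance, here is a non-parametric sketch comparing the ESD of W to the ESD of an element-wise shuffled copy rand(W), using a Jensen-Shannon divergence (an assumption for illustration; the divergence the library actually uses may differ):

```python
import numpy as np

# Sketch of rand_distance: distance between the ESD of W and the ESD of a
# randomized copy rand(W). Illustrative only, not WeightWatcher's exact metric.
def esd_hist(W, bins):
    evals = np.linalg.eigvalsh(W.T @ W)
    counts, _ = np.histogram(evals, bins=bins)
    return counts.astype(float) + 1e-12   # smooth empty bins

def js_divergence(p, q):
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(0)
# A strongly correlated (nearly rank-one) layer sits far from its randomization
u, v = rng.standard_normal((100, 1)), rng.standard_normal((1, 80))
W = u @ v + 0.1 * rng.standard_normal((100, 80))
W_rand = rng.permutation(W.ravel()).reshape(W.shape)  # rand(W): shuffled entries

top = max(np.linalg.eigvalsh(W.T @ W).max(),
          np.linalg.eigvalsh(W_rand.T @ W_rand).max())
bins = np.linspace(0.0, top, 50)
rand_distance = js_divergence(esd_hist(W, bins), esd_hist(W_rand, bins))
print(rand_distance)   # larger = more correlation structure in W
```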
Misc Details
- N, M: matrix or tensor slice dimensions
- num_spikes: number of spikes outside the bulk region of the ESD, when fit to an MP distribution
- num_rand_spikes: number of Correlation Traps
- max_rand_eval: scale of the random noise in the layer
Summary Statistics:
The layer metrics are averaged in the summary statistics:
Get the average metrics, as a summary (dict), from the given (or current) details dataframe
details = watcher.analyze(model=model)
summary = watcher.get_summary(details)
or just
summary = watcher.get_summary()
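Since the summary is just the per-layer metrics averaged over layers, it can be sketched with a hypothetical details DataFrame:

```python
import pandas as pd

# The summary dict is the column-wise mean of the per-layer metrics;
# the DataFrame values here are hypothetical.
details = pd.DataFrame({
    "alpha": [2.5, 3.0, 3.5],
    "log_spectral_norm": [0.8, 0.9, 1.0],
})
summary = details.mean().to_dict()
print(summary)
```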
The summary statistics can be used to gauge the test error of a series of similar models, without needing access to training or test data.

