SkillAgentSearch skills...

WeightWatcher

The WeightWatcher tool for predicting the accuracy of Deep Neural Networks

Install / Use

/learn @CalculatedContent/WeightWatcher
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Downloads PyPI GitHub Published in Nature Video Tutorial Discord LinkedIn Blog CalculatedContent

WeightWatcher Logo

WeightWatcher (WW) is an open-source, diagnostic tool for analyzing Deep Neural Networks (DNN), without needing access to training or even test data. It is based on theoretical research into Why Deep Learning Works, based on our Theory of Heavy-Tailed Self-Regularization (HT-SR). It uses ideas from Random Matrix Theory (RMT), Statistical Mechanics, and Strongly Correlated Systems.

It can be used to:

  • analyze pre/trained pyTorch, Keras, DNN models (Conv2D and Dense layers)
  • monitor models, and the model layers, to see if they are over-trained or over-parameterized
  • predict test accuracies across different models, with or without training data
  • detect potential problems when compressing or fine-tuning pretrained models
  • layer warning labels: over-trained; under-trained

Quick Links

And in the notebooks provided in the WeightWatcher-examples github repo (the examples folder here is quite old )

If you have some models you would like to analyze and get feedback on, check out WeightWatcher-Pro. It's currently in beta and free.

Installation: Version 0.7.6

pip install weightwatcher

if this fails try

Current TestPyPI Version 0.7.5.5

 python3 -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple weightwatcher

Usage

import weightwatcher as ww
import torchvision.models as models

model = models.vgg19_bn(pretrained=True)
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()
summary = watcher.get_summary(details)

It is as easy to run and generates a pandas dataframe with details (and plots) for each layer

Sample Details Dataframe

and summary dictionary of generalization metrics

    {'log_norm': 2.11,      'alpha': 3.06,
      'alpha_weighted': 2.78,
      'log_alpha_norm': 3.21,
      'log_spectral_norm': 0.89,
      'stable_rank': 20.90,
      'mp_softrank': 0.52}

Advanced Usage

The watcher object has several functions and analysis features described below

Notice the min_evals setting: the power law fits need at least 50 eigenvalues to make sense but the describe and other methods do not

watcher.analyze(model=None, layers=[], min_evals=50, max_evals=None,
	 plot=True, randomize=True, mp_fit=True, pool=True, savefig=True):
...
watcher.describe(self, model=None, layers=[], min_evals=0, max_evals=None,
         plot=True, randomize=True, mp_fit=True, pool=True):
...
watcher.get_details()
watcher.get_summary(details) or get_summary()
watcher.get_ESD()
...
watcher.distances(model_1, model_2)

PEFT / LORA models (experimental)

To analyze an PEFT / LORA fine-tuned model, specify the peft option.

  • peft = True: Forms the BA low rank matric and analyzes the delta layers, with 'lora_BA" tag in name

    details = watcher.analyze(peft='peft_only')

  • peft = 'with_base': Analyes the base_model, the delta, and the combined layer weight matrices.

    details = watcher.analyze(peft=True)

The base_model and fine-tuned model must have the same layer names. And weightwatcher will ignore layers that do not share the same name. Also,at this point, biases are not considered. Finally, both models should be stored in the same format (i.e safetensors)

Note: If you want to select by layer_ids, you must first run describe(peft=False), and then select both the lora_A and lora_B layers

Usage: Base Model

Usage: Base Model

Ploting and Fitting the Empirical Spectral Density (ESD)

WW creates plots for each layer weight matrix to observe how well the power law fits work

details = watcher.analyze(plot=True)

For each layer, WeightWatcher plots the ESD--a histogram of the eigenvalues of the layer correlation matrix X=W<sup>T</sup>W. It then fits the tail of ESD to a (Truncated) Power Law, and plots these fits on different axes. The summary metrics (above) characterize the Shape and Scale of each ESD. Here's an example:

<img src="./img/ESD-plots.png" width='800px' height='auto' />

Generally speaking, the ESDs in the best layers, in the best DNNs can be fit to a Power Law (PL), with PL exponents alpha closer to 2.0. Visually, the ESD looks like a straight line on a log-log plot (above left).

Generalization Metrics

<details> <summary> The goal of the WeightWatcher project is find generalization metrics that most accurately reflect observed test accuracies, across many different models and architectures, for pre-trained models and models undergoing training. </summary>

Our HTSR theory says that well trained, well correlated layers should be signficantly different from the MP (Marchenko-Pastur) random bulk, and specifically to be heavy tailed. There are different layer metrics in WeightWatcher for this, including:

  • rand_distance : the distance in distribution from the randomized layer
  • alpha : the slope of the tail of the ESD, on a log-log scale
  • alpha-hat or alpha_weighted : a scale-adjusted form of alpha (similar to the alpha-shatten-Norm)
  • stable_rank : a norm-adjusted measure of the scale of the ESD
  • num_spikes : the number of spikes outside the MP bulk region
  • max_rand_eval : scale of the random noise etc

All of these attempt to measure how on-random and/or non-heavy-tailed the layer ESDs are.

Scale Metrics

  • log Frobenius norm : <img src="https://render.githubusercontent.com/render/math?math=\log_{10}\Vert\mathbf{W}\Vert^{2}_{F}">

  • log_spectral_norm : <img src="https://render.githubusercontent.com/render/math?math=\log_{10}\lambda_{max}=\log_{10}\Vert\mathbf{W}\Vert^{2}_{\infty}">

  • stable_rank : <img src="https://render.githubusercontent.com/render/math?math=R_{stable}=\Vert\mathbf{W}\Vert^{2}_{F}/\Vert\mathbf{W}\Vert^{2}_{\infty}">

  • mp_softrank : <img src="https://render.githubusercontent.com/render/math?math=R_{MP}=\lambda_{MP}/\lambda_{max}">

Shape Metrics

  • alpha : <img src="https://render.githubusercontent.com/render/math?math=\alpha"> Power Law (PL) exponent
  • (Truncated) PL quality of fit D : <img src="https://render.githubusercontent.com/render/math?math=\D"> (the Kolmogorov Smirnov Distance metric)

(advanced usage)

  • TPL : (alpha and Lambda) Truncated Power Law Fit
  • E_TPL : (alpha and Lambda) Extended Truncated Power Law Fit

Scale-adjusted Shape Metrics

  • alpha_weighted : <img src="https://render.githubusercontent.com/render/math?math=\hat{\alpha}=\alpha\log_{10}\lambda_{max}">
  • log_alpha_norm : (Shatten norm): <img src="https://render.githubusercontent.com/render/math?math=\log_{10}\Vert\mathbf{X}\Vert^{\alpha}_{\alpha}">

Direct Correlation Metrics

The random distance metric is a new, non-parameteric approach that appears to work well in early testing. See this recent blog post

  • rand_distance : <img src="https://render.githubusercontent.com/render/math?math=div(\mathbf{W},rand(\mathbf{W}))"> Distance of layer ESD from the ideal RMT MP ESD

There re also related metrics, including the new

  • 'ww_maxdist'
  • 'ww_softrank'

Misc Details

  • N, M : Matrix or Tensor Slice Dimensions
  • num_spikes : number of spikes outside the bulk region of the ESD, when fit to an MP distribution
  • num_rand_spikes : number of Correlation Traps
  • max_rand_eval : scale of the random noise in the layer

Summary Statistics:

The layer metrics are averaged in the summary statistics:

Get the average metrics, as a summary (dict), from the given (or current) details dataframe

details = watcher.analyze(model=model)
summary = watcher.get_summary(model)

or just

summary = watcher.get_summary()

The summary statistics can be used to gauge the test error of

View on GitHub
GitHub Stars1.7k
CategoryDevelopment
Updated1d ago
Forks145

Languages

Python

Security Score

95/100

Audited on Mar 26, 2026

No findings