# pysr3

A scikit-learn compatible library for training mixed-effects models.
## Quickstart with pysr3
SR3 is a relaxation method designed for accurate feature selection. It currently supports:
- Linear Models (L0, LASSO, A-LASSO, CAD, SCAD)
- Linear Mixed-Effect Models (L0, LASSO, A-LASSO, CAD, SCAD)
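To make the relaxation idea concrete, here is a minimal numpy sketch of SR3 for the LASSO case. This is an illustration, not pysr3's actual solver: the names `sr3_lasso` and `soft_threshold` are ours, and the coupling strength `kappa` plays a role analogous to the `el`/`ell` parameter of pysr3's models. SR3 couples the coefficients `x` to an auxiliary sparse variable `w` through a quadratic penalty, so each iteration alternates a smooth linear solve for `x` with a cheap soft-thresholding step for `w`:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1 (componentwise shrinkage)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sr3_lasso(A, b, lam, kappa=1.0, n_iter=200):
    """Illustrative SR3 iteration for the LASSO (not pysr3's solver):
    minimize over (x, w):
        0.5*||Ax - b||^2 + lam*||w||_1 + 0.5*kappa*||x - w||^2
    """
    p = A.shape[1]
    H = A.T @ A + kappa * np.eye(p)  # matrix of the smooth x-subproblem
    Atb = A.T @ b
    x = np.zeros(p)
    w = np.zeros(p)
    for _ in range(n_iter):
        # x-update: exact minimizer of the smooth part,
        # (A^T A + kappa*I) x = A^T b + kappa*w
        x = np.linalg.solve(H, Atb + kappa * w)
        # w-update: proximal step, soft-thresholding at level lam/kappa
        w = soft_threshold(x, lam / kappa)
    return x, w

# Tiny demo: recover the support of a sparse signal
np.random.seed(0)
A = np.random.randn(100, 20)
x_true = np.zeros(20)
x_true[[2, 7, 15]] = 1.0
b = A @ x_true + 0.01 * np.random.randn(100)
x, w = sr3_lasso(A, b, lam=0.5)
print(sorted(np.flatnonzero(np.abs(w) > 0.1)))
```

Because the non-smooth regularizer only ever acts on `w` through its proximal operator, the same scheme extends to L0, A-LASSO, CAD, and SCAD by swapping the thresholding rule.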
## Installation

pysr3 can be installed via

```bash
pip install "pysr3>=0.3.5"
```

(the quotes prevent the shell from treating `>=` as a redirect).
```python
from pysr3.__about__ import __version__

print(f"This tutorial was generated using PySR3 v{__version__}\n"
      "You might see slightly different numerical results if you are using a different version of the library.")
```
```
This tutorial was generated using PySR3 v0.3.5
You might see slightly different numerical results if you are using a different version of the library.
```
## Requirements
Make sure that Python 3.6 or higher is installed. The package has the following dependencies, as listed in requirements.txt:
- numpy>=1.21.1
- pandas>=1.3.1
- scipy>=1.7.1
- PyYAML>=5.4.1
- scikit_learn>=0.24.2
## Usage
pysr3 models are fully compatible with the scikit-learn API, so you can use them as you would any other scikit-learn estimator.
### Linear Models
A simple example of using SR3-empowered LASSO for feature selection is shown below.
```python
import numpy as np

from pysr3.linear.problems import LinearProblem

# Create a sample dataset
seed = 42
num_objects = 300
num_features = 500
np.random.seed(seed)
# create a vector of the true model's coefficients
true_x = np.random.choice(2, size=num_features, p=np.array([0.9, 0.1]))
# create sample data
a = 10 * np.random.randn(num_objects, num_features)
b = a.dot(true_x) + np.random.randn(num_objects)
print(f"The dataset has {a.shape[0]} objects and {a.shape[1]} features; \n"
      f"The vector of true parameters contains {sum(true_x != 0)} non-zero elements out of {num_features}.")
```
```
The dataset has 300 objects and 500 features; 
The vector of true parameters contains 55 non-zero elements out of 500.
```
First, let's fit a model with a fixed parameter lambda:
```python
from pysr3.linear.models import LinearL1ModelSR3
from sklearn.metrics import confusion_matrix

lam = 0.1 * np.max(np.abs(a.T.dot(b)))
model = LinearL1ModelSR3(lam=lam, el=1e5)
```
```python
%%timeit
model.fit(a, b)
```
```
38.6 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
```python
maybe_x = model.coef_['x']
tn, fp, fn, tp = confusion_matrix(true_x, np.abs(maybe_x) > np.sqrt(model.tol_solver)).ravel()
print(f"The model found {tp} out of {tp + fn} features correctly, but also chose {fp} out of {tn + fp} extra irrelevant features. \n")
```
```
The model found 55 out of 55 features correctly, but also chose 5 out of 445 extra irrelevant features. 
```
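A note on the choice of `lam` above: for the plain LASSO, `np.max(np.abs(a.T.dot(b)))` is the classic `lambda_max`, the smallest regularization strength at which the all-zero solution is optimal, so `0.1 * lambda_max` is a common starting point. A generic illustration of that threshold with scikit-learn's own `Lasso` (which divides the data-fit term by the number of samples, hence the extra `/ n`):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 30)
w_true = (rng.rand(30) < 0.2).astype(float)  # sparse 0/1 coefficients
y = X @ w_true + 0.1 * rng.randn(100)

n = X.shape[0]
# Smallest alpha for which the LASSO solution is exactly zero
alpha_max = np.max(np.abs(X.T @ y)) / n

coef_at_max = Lasso(alpha=alpha_max, fit_intercept=False).fit(X, y).coef_
coef_below = Lasso(alpha=0.1 * alpha_max, fit_intercept=False).fit(X, y).coef_

# At alpha_max everything is zeroed out; below it, features enter the model
print((coef_at_max != 0).sum(), (coef_below != 0).sum())
```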
Now let's see if we can improve on it with a randomized search over the regularization strength:
```python
# Automatic feature selection using an information criterion
from pysr3.linear.models import LinearL1ModelSR3
from sklearn.model_selection import RandomizedSearchCV
# Newer versions of scikit-learn removed this import;
# use `from scipy.stats import loguniform` instead.
from sklearn.utils.fixes import loguniform

# Here we use SR3-empowered LASSO, but many other popular regularizers are also available.
# See the glossary of models for more details.
model = LinearL1ModelSR3()

# We will search for the best model over a range of strengths for the regularizer
params = {
    "lam": loguniform(1e-1, 1e2)
}

selector = RandomizedSearchCV(estimator=model,
                              param_distributions=params,
                              n_iter=50,
                              # The function below evaluates an information criterion
                              # on the test portion of CV-splits.
                              scoring=lambda clf, x, y: -clf.get_information_criterion(x, y, ic='bic'))
selector.fit(a, b)
maybe_x = selector.best_estimator_.coef_['x']
tn, fp, fn, tp = confusion_matrix(true_x, np.abs(maybe_x) > np.sqrt(model.tol_solver)).ravel()
print(f"The model found {tp} out of {tp + fn} features correctly, but also chose {fp} out of {tn + fp} extra irrelevant features. \n"
      f"The best parameter is {selector.best_params_}")
```
```
The model found 55 out of 55 features correctly, but also chose 1 out of 445 extra irrelevant features. 
The best parameter is {'lam': 0.15055187290939537}
```
Note that the discovered coefficients will be biased downwards due to L1 regularization.
```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
indep = list(range(num_features))
ax.plot(indep, maybe_x, label='Discovered Coefficients')
ax.plot(indep, true_x, alpha=0.5, label='True Coefficients')
ax.legend(bbox_to_anchor=(1.05, 1))
plt.show()
```

You can get rid of the bias by refitting the model using only features that were selected.
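For example, here is a numpy-only sketch of such a refit (illustrative and independent of pysr3; we take the true support for simplicity, whereas in practice you would use the support recovered by the model). Ordinary least squares restricted to the selected columns gives unbiased estimates:

```python
import numpy as np

rng = np.random.RandomState(42)
n, p = 300, 500
true_x = (rng.rand(p) < 0.1).astype(float)  # sparse 0/1 coefficients, as above
a = 10 * rng.randn(n, p)
b = a @ true_x + rng.randn(n)

# Suppose the selection step returned this support
support = true_x != 0

# Refit with ordinary least squares on the selected columns only
x_refit = np.zeros(p)
x_refit[support] = np.linalg.lstsq(a[:, support], b, rcond=None)[0]

# The refit coefficients are close to the true values, with no shrinkage bias
print(np.max(np.abs(x_refit[support] - true_x[support])))
```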
### Linear Mixed-Effects Models
Below we show how to use Linear Mixed-Effects (LME) models for simultaneous selection of fixed and random effects.
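For reference, the model class being fitted here is the standard linear mixed-effects model, written below in our notation (see the pysr3 documentation for its exact conventions), with one block of observations per group:

```latex
% beta -- fixed effects (shared across groups)
% u_i  -- random effects for group i, with variances gamma
y_i = X_i \beta + Z_i u_i + \varepsilon_i, \qquad
u_i \sim \mathcal{N}\big(0, \operatorname{diag}(\gamma)\big), \qquad
\varepsilon_i \sim \mathcal{N}(0, \sigma^2 I), \qquad i = 1, \dots, m
```

Selecting fixed effects means deciding which entries of beta are non-zero; selecting random effects means deciding which variances gamma are non-zero.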
```python
from pysr3.lme.models import L1LmeModelSR3
from pysr3.lme.problems import LMEProblem, LMEStratifiedShuffleSplit

# Here we generate a random linear mixed-effects problem.
# To use your own dataset, check LMEProblem.from_dataframe and LMEProblem.from_x_y
problem, true_parameters = LMEProblem.generate(
    groups_sizes=[10] * 8,  # 8 groups, 10 objects each
    features_labels=["fixed+random"] * 20,  # 20 features, each one having both fixed and random components
    beta=np.array([0, 1] * 10),  # true beta (fixed effects) has every other coefficient active
    gamma=np.array([0, 0, 0, 1] * 5),  # true gamma (variances of random effects) has every fourth coefficient active
    obs_var=0.1,  # the errors have standard deviations of sqrt(0.1) ~= 0.33
    seed=seed  # random seed, for reproducibility
)

# LMEProblem provides a very convenient representation
# of the problem. See the documentation for more details.
# It can also be converted to a more familiar representation:
x, y, columns_labels = problem.to_x_y()
# columns_labels describes the role of each column in x:
# fixed effect, random effect, or both, as well as group labels
# and observation standard deviations.

# You can also convert it to a pandas dataframe if you'd like:
pandas_dataframe = problem.to_dataframe()

# We use the SR3-empowered LASSO model, but many other popular models are also available.
# See the glossary of models for more details.
model = L1LmeModelSR3(practical=True)

# We're going to select features by varying the strength of the prior
# and choosing the model that yields the best information criterion
# on the validation set.
params = {
    "lam": loguniform(1e-3, 1e2),
    "ell": loguniform(1e-1, 1e2)
}

# We use the standard functionality of sklearn to perform the search.
selector = RandomizedSearchCV(estimator=model,
                              param_distributions=params,
                              n_iter=30,  # number of points from the parameter space to sample
                              # the class below implements CV-splits for LME models
                              cv=LMEStratifiedShuffleSplit(n_splits=2, test_size=0.5,
                                                           random_state=seed,
                                                           columns_labels=columns_labels),
                              # The function below evaluates the information criterion
                              # on the test sets during cross-validation.
                              # We use cAIC from Vaida, but other options (BIC, Muller's IC) are also available.
                              scoring=lambda clf, x, y: -clf.get_information_criterion(x,
                                                                                       y,
                                                                                       columns_labels=columns_labels,
                                                                                       ic="vaida_aic"),
                              random_state=seed,
                              n_jobs=20
                              )
selector.fit(x, y, columns_labels=columns_labels)
best_model = selector.best_estimator_

maybe_beta = best_model.coef_["beta"]
maybe_gamma = best_model.coef_["gamma"]

# Since the solver stops within sqrt(tol) of the minimum, we use that as the criterion
# for whether a feature is selected or not.
ftn, ffp, ffn, ftp = confusion_matrix(y_true=true_parameters["beta"],
                                      y_pred=abs(maybe_beta) > np.sqrt(best_model.tol_solver)
                                      ).ravel()
rtn, rfp, rfn, rtp = confusion_matrix(y_true=true_parameters["gamma"],
                                      y_pred=abs(maybe_gamma) > np.sqrt(best_model.tol_solver)
                                      ).ravel()
print(
    f"The model found {ftp} out of {ftp + ffn} correct fixed features, and also chose {ffp} out of {ftn + ffp} extra irrelevant fixed features. \n"
    f"It also identified {rtp} out of {rtp + rfn} random effects correctly, and got {rfp} out of {rtn + rfp} non-present random effects. \n"
    f"The best sparsity parameter is {selector.best_params_}")
```
```
The model found 10 out of 10 correct fixed features, and also chose
```