SubsetSelection.jl

Fast Subset Selection algorithm for Statistics/Machine Learning

Generate Convert Improve

Install / Use

/learn @jeanpauphilet/SubsetSelection.jl

About this skill

Quality Score

0/100

README

SubsetSelection

SubsetSelection is a Julia package that computes sparse L2-regularized estimators. Sparsity is enforced through explicit cardinality constraint or L0-penalty. Supported loss functions for regression are least squares, L1 and L2 SVR; for classification, logistic, L1 and L2 Hinge loss. The algorithm formulates the problem as a mixed-integer saddle-point problem and solves its boolean relaxation using a dual sub-gradient approach.

Quick start

To install the package:

julia> Pkg.install("SubsetSelection")

or the have the latest version

julia> Pkg.clone("git://github.com/jeanpauphilet/SubsetSelection.jl.git")

To fit a basic model:

julia> using SubsetSelection, StatsBase

julia> n = 100; p = 10000; k = 10;
julia> indices = sort(sample(1:p, StatsBase.Weights(ones(p)/p), k, replace=false));
julia> w = sample(-1:2:1, k);
julia> X = randn(n,p); Y = X[:,indices]*w;
julia> Sparse_Regressor = subsetSelection(OLS(), Constraint(k), Y, X)
SubsetSelection.SparseEstimator(SubsetSelection.OLS(),SubsetSelection.Constraint(10),10.0,[362,1548,2361,3263,3369,3598,5221,7314,7748,9267],[5.37997,-5.51019,-5.77256,-7.27197,-6.32432,-4.97585,5.94814,4.75648,5.48098,-5.91967],[-0.224588,-1.1446,2.81566,0.582427,-0.923311,4.1153,-2.43833,0.117831,0.0982258,-1.60631  …  0.783925,-1.1055,0.841752,-1.09645,-0.397962,3.48083,-1.33903,1.44676,4.03583,1.05817],0.0,19)

The algorithm returns a SparseEstimator object with the following fields: loss (loss function used), sparsity (model to enforce sparsity), indices (features selected), w (value of the estimator on the selected features only), α (values of the associated dual variables), b (bias term), iter (number of iterations required by the algorithm).

For instance, you can access the selected features directly in the indices field:

julia> Sparse_Regressor.indices
10-element Array{Int64,1}:
  362
 1548
 2361
 3263
 3369
 3598
 5221
 7314
 7748
 9267

or compute predictions

julia> Y_pred = X[:,Sparse_Regressor.indices]*Sparse_Regressor.w
100-element Array{Float64,1}:
   4.62918
   8.59952
 -16.2796
  -5.611
   1.62764
 -50.4562
  37.407
 -12.3341
  -4.75339
  25.122
   ⋮
  -7.98349
  11.0327
  -8.58172
  16.904
  -9.04211
 -36.5475
  17.2558
 -22.3915
 -57.9727
  -6.06553

For classification, we use +1/-1 labels and the convention P ( Y = y | X = x ) = 1 / (1+e^{- y x^T w}).

Required and optional parameters

subsetSelection has four required parameters:

the loss function to be minimized, to be chosen among least squares (OLS()), L1SVR (L1SVR(ɛ)), L2SVR (L2SVR(ɛ)), Logistic loss (LogReg()), Hinge Loss (L1SVM()), L2-SVM (L2SVM()). For classification, we recommend using Hinge loss or L2-SVM functions. Indeed, the Fenchel conjugate of the Logistic loss exhibits unbounded gradients, which largely hinders convergence of the algorithm and might require smaller and more steps (see optional parameters).
the model used to enforce sparsity; either by adding a hard constraint of the form "||w||_0 < k" (Constraint(k)) or by adding a penalty of the form "+ λ ||w||_0" (Penalty(λ)) to the objective. For tractability issues, we highly recommend using an explicit constraint instead of a penalty, for it ensures the size of the support remains bounded through the algorithm.
the vector of outputs Y of size n, the sample size. In classification settings, Y should be a vector of ±1s.
the matrix of covariates X of size n×p, where n and p are the number of samples and features respectively.

In addition, subsetSelection accepts the following optional parameters:

an initialization for the selected features, indInit.
an initialization for the dual variables, αInit.
the value of the ℓ2-regularization parameter γ, set to 1/√n by default.
intercept, a boolean. If true, an intercept/bias term is computed as well. By default, set to false.
the maximum number of iterations in the sub-gradient algorithm, maxIter.
the value of the gradient stepsize δ. By default, the stepsize is set to 1e-3, which demonstrates very good empirical performance. However, smaller stepsizes might be needed when dealing with very large datasets or when the Logistic loss is used.
the number of gradient updates of dual variable α performed per update of primal variable s, gradUp.
anticycling a boolean. If true, the algorithm stops as soon as the support is not unchanged from one iteration to another. Empirically, the accuracy of the resulting support is strongly sensitive to noise - to use with caution. By default, set to false.
averaging a boolean. If true, the dual solution is averaged over past iterates. By default, set to true.

Best practices

Tuning the regularization parameter γ: By default, γ is set to 1/√n, which is an appropriate scaling in most regression instances. For an optimal performance, and especially in classification or noisy settings, we recommend performing a grid search and using cross-validation to assess out-of-sample performance. The grid search should start with a very low value for γ, such as

    γ = 1.*p / k / n / maximum(sum(X[train,:].^2,2))

and iteratively increase it by a factor 2. Mean square error or Area Under the Curve (see ROCAnalysis for implementation) are commonly used performance metrics for regression and classification tasks respectively.

Instances where the algorithm fails to converge have been reported. If you occur such cases, try normalize the data matrix X and relaunch the algorithm. If the algorithm still fails to converge, reduce the stepsize δ by a factor 10 or 100 and increase the number of iterations maxIter by a factor at least 2.

Reference

Related Skills

YC-Killer

2.7k

A library of enterprise-grade AI agents designed to democratize artificial intelligence and provide free, open-source alternatives to overvalued Y Combinator startups. If you are excited about democratizing AI access & AI agents, please star ⭐️ this repository and use the link in the readme to join our open source AI research team.

best-practices-researcher

The most comprehensive Claude Code skills registry | Web Search: https://skills-registry-web.vercel.app

groundhog

398

Groundhog's primary purpose is to teach people how Cursor and all these other coding agents work under the hood. If you understand how these coding assistants work from first principles, then you can drive these tools harder (or perhaps make your own!).

isf-agent

a repo for an agent that helps researchers apply for isf funding