
The Reference Manual contains a detailed description of all the functions available in L0Learn. In what follows, we give a brief tour of the main functions and settings.

The main function in L0Learn is the fit function which has the following interface and default values:

L0Learn.fit(X,y, Loss="SquaredError", Penalty="L0", Algorithm="CD", MaxSuppSize=100, NLambda=100)
  • X is the data matrix and y is the response vector.
  • Loss specifies the loss function. Currently, we only officially support "SquaredError". We will be adding support for other loss functions soon.
  • Penalty: Three penalties are supported: "L0", "L0L2", and "L0L1". Note: when using "L0L2" and "L0L1", the number of grid points for the parameter gamma should be specified using the parameter Ngamma. Moreover, for "L0L2", the maximum and minimum values of gamma should be specified using GammaMax and GammaMin, respectively (see the sketch after this list).
  • Algorithm: The choice of optimization algorithm can have a significant effect on the quality of the solutions. Currently, we support the following two algorithms:
    • CD: A Coordinate Descent-type algorithm with all the tweaks and heuristics discussed in our paper.
    • CDPSI: This algorithm combines Coordinate Descent and Local Combinatorial Search to escape weak local minima. It typically leads to higher-quality solutions compared to CD, at the cost of additional running time. This is the CD-PSI(1) algorithm introduced in our paper.
  • MaxSuppSize specifies the maximum support size in the regularization path, after which the algorithm terminates. The toolkit's internals optimize the running time based on this parameter (this choice can affect the type of optimization algorithm used). We recommend experimenting with small values first (e.g., 5% of p), as L0 regularization typically selects a small subset of the features.
  • NLambda is the number of Lambda grid points. Note: The actual values of Lambda are data-dependent and are computed automatically by the algorithm.
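
For example, here is a minimal sketch of fitting an L0L2-regularized path, assuming X and y are a suitable data matrix and response vector; the gamma grid values below are purely illustrative, not recommendations:

# A sketch: fit an L0L2 path over a 10-point gamma grid (illustrative values)
fit2 = L0Learn.fit(X, y, Loss="SquaredError", Penalty="L0L2", Algorithm="CD",
                   Ngamma=10, GammaMax=10, GammaMin=0.0001, MaxSuppSize=50)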

A Simple Demonstration

To demonstrate how L0Learn works, we will generate the following dummy dataset:

  • A 500x1000 design matrix X with iid standard normal entries
  • A 1000x1 vector B with the first 10 entries set to 1 and the rest set to zero.
  • A 500x1 vector e with iid standard normal entries
  • Set y = XB + e
set.seed(1) # fix the seed to get a reproducible result
X = matrix(rnorm(500*1000),nrow=500,ncol=1000)
B = c(rep(1,10),rep(0,990))
e = rnorm(500)
y = X%*%B + e

Our objective is to use L0Learn to recover the true vector B from X and y alone. First, we load L0Learn:

library(L0Learn)

For this example, we are going to fit an L0-regularized model with CD and instruct the algorithm to stop when the support size reaches 50. This can be done by executing:

fit = L0Learn.fit(X, y, Loss="SquaredError", Penalty="L0", Algorithm="CD", MaxSuppSize=50)

This will generate solutions for a sequence of Lambda values. To view the path of Lambda values along with the associated support sizes, we print a summary of the regularization path as follows:

print(fit)

and we get the following output:

        lambda suppsize
1  0.068285500        1
2  0.055200200        2
3  0.049032300        3
4  0.040072500        6
5  0.038602800        7
6  0.037265300        8
7  0.032514200       10
8  0.001142920       11
9  0.000821221       13
10 0.000702287       14
11 0.000669519       15
12 0.000489943       17
13 0.000412565       22
14 0.000404252       24
15 0.000369975       27
16 0.000357211       31
17 0.000331164       40
18 0.000284271       42
19 0.000240881       50

Now, to print the learned vector B for a specific solution in the path, we use the function coef(fit, lambda), which takes the object fit as the first parameter and the value of lambda corresponding to the desired solution as the second parameter. Note that (in this example) the solution with lambda=0.0325142 has a support size of 10. We can retrieve this solution using the coef function as follows:

coef(fit,lambda=0.0325142)

to get the following output:

1001 x 1 sparse Matrix of class "dgCMatrix"
                    
Intercept 0.01052402
V1        1.01601044
V2        1.01830944
V3        1.00606875
V4        0.98309180
V5        0.97389883
V6        0.96148076
V7        1.00990714
V8        1.08535507
V9        1.02686930
V10       0.94235619
V11       .         
V12       .         
V13       .         
V14       .         
V15       .         
V16       .         
V17       .         
V18       .         
V19       .         
V20       .      
.
.
.

The only non-zeros in the above solution are the first 10 coordinates, which are exactly the coordinates in the true support (used to generate the data). Thus, this solution successfully recovers the true support. We can also make predictions using a specific solution in the grid via the function predict(fit, newx, lambda), where newx is a testing sample (vector or matrix). For example, to predict the response for the samples in the data matrix X using the solution with lambda=0.0325142, we call the prediction function:

predict(fit,X,lambda=0.0325142)
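
As a quick sanity check (a sketch; the as.vector calls simply convert the sparse Matrix outputs to base R vectors), we can extract the selected support and compute the in-sample prediction error:

beta = as.vector(coef(fit, lambda=0.0325142))       # intercept followed by the 1000 coefficients
which(beta[-1] != 0)                                # indices of the selected features; 1 through 10 here
yhat = as.vector(predict(fit, X, lambda=0.0325142)) # in-sample predictions
sqrt(mean((y - yhat)^2))                            # root mean squared error on the training data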

We have demonstrated the simple case of using an L0 penalty alone. On real datasets, the L0L2 and L0L1 penalties typically lead to better predictive performance. Moreover, using CDPSI instead of CD for optimization can lead to significant improvements, especially when the features are highly correlated.
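
For instance, here is a minimal sketch of rerunning the fit above with CDPSI (all other settings unchanged):

fitPSI = L0Learn.fit(X, y, Loss="SquaredError", Penalty="L0", Algorithm="CDPSI", MaxSuppSize=50)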
