Hussein Hazimeh edited this page Apr 9, 2018 · 31 revisions

The Reference Manual contains a detailed description of all the functions available in L0Learn. In what follows, we give a brief tour of the main functions and settings.

The main function in L0Learn is the fit function, which has the following interface and default values:

L0Learn.fit(X, y, Loss="SquaredError", Penalty="L0", Algorithm="CD", MaxSuppSize=100, NLambda=100)
  • X is the data matrix and y is the response vector.
  • Loss specifies the loss function. Currently, we only officially support "SquaredError". We will be adding support for other loss functions soon.
  • Penalty: Three penalties are supported: "L0", "L0L2", and "L0L1". Note: When using "L0L2" and "L0L1", the number of grid points for the parameter gamma should be specified using the parameter Ngamma. Moreover, for "L0L2", the maximum and minimum values of gamma should be specified using GammaMax and GammaMin, respectively.
  • Algorithm: The choice of the optimization algorithm can have a significant effect on the quality of the solutions. Currently, we support the following two algorithms:
    • CD: A Coordinate Descent-type algorithm with all the tweaks and heuristics discussed in our paper.
    • CDPSI: This algorithm combines Coordinate Descent and Local Combinatorial Search to escape weak local minima. It typically leads to higher-quality solutions compared to CD, at the cost of additional running time. This is the CD-PSI(1) algorithm introduced in our paper.
  • MaxSuppSize specifies the maximum support size in the regularization path after which the algorithm terminates. The toolkit's internals optimize the running time based on this parameter (this choice can affect the type of optimization algorithm used). We recommend experimenting with small values first (e.g., 5% of p) as L0-regularization typically selects a small portion of the features.
  • NLambda is the number of Lambda grid points. Note: The actual values of Lambda are data-dependent and are computed automatically by the algorithm.
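The parameter notes above can be illustrated with a short sketch. The parameter names (Ngamma, GammaMax, GammaMin) come from the list above, but the specific values here are illustrative choices only, not recommendations; X and y stand for a data matrix and response vector as in the demonstration below.

```r
library(L0Learn)

# Sketch: fit an L0L2-regularized path. Ngamma sets the number of
# gamma grid points; GammaMax/GammaMin bound the gamma grid.
# The values below are arbitrary illustrations.
fitL0L2 = L0Learn.fit(X, y, Loss="SquaredError", Penalty="L0L2",
                      Algorithm="CD", MaxSuppSize=100, NLambda=100,
                      Ngamma=5, GammaMax=10, GammaMin=0.0001)
```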

A Simple Demonstration

To demonstrate how L0Learn works, we will generate the following synthetic dataset:

  • A 500x1000 design matrix X with iid standard normal entries
  • A 1000x1 vector B with the first 10 entries set to 1 and the rest set to 0.
  • A 500x1 vector e with iid standard normal entries
  • Set y = XB + e
set.seed(1) # fix the seed to get a reproducible result
X = matrix(rnorm(500*1000),nrow=500,ncol=1000)
B = c(rep(1,10),rep(0,990))
e = rnorm(500)
y = X%*%B + e
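As a quick sanity check, the generated objects have the expected shapes and sparsity (base R only; this does not require L0Learn):

```r
stopifnot(all(dim(X) == c(500, 1000)))        # 500 samples, 1000 features
stopifnot(length(B) == 1000, sum(B != 0) == 10)  # 10-sparse true vector
stopifnot(length(y) == 500)                   # one response per sample
```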

We will use L0Learn to estimate B from the data (y,X). First we load L0Learn:

library(L0Learn)

To fit a path of solutions for the L0-regularized model with a maximum support size of 50 using the CD algorithm, we use the command:

fit = L0Learn.fit(X, y, Loss="SquaredError", Penalty="L0", Algorithm="CD", MaxSuppSize=50)

This will generate solutions for a sequence of Lambda values (chosen automatically by the algorithm). To view the sequence of Lambda values along with the associated support sizes, we use:

print(fit)

and we get the following output:

        lambda suppsize
1  0.068285500        1
2  0.055200200        2
3  0.049032300        3
4  0.040072500        6
5  0.038602800        7
6  0.037265300        8
7  0.032514200       10
8  0.001142920       11
9  0.000821221       13
10 0.000702287       14
11 0.000669519       15
12 0.000489943       17
13 0.000412565       22
14 0.000404252       24
15 0.000369975       27
16 0.000357211       31
17 0.000331164       40
18 0.000284271       42
19 0.000240881       50
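Rather than scanning this table by eye, one can locate the path index with a given support size programmatically. A minimal sketch, assuming the printed support sizes are exposed on the fit object as fit$suppsize (this field name is an assumption mirroring the printed column; verify with names(fit) on your version):

```r
# Assumption: fit$suppsize mirrors the "suppsize" column printed above.
idx = which(fit$suppsize == 10)[1]  # first path index with support size 10
fit$lambda[idx]                     # the corresponding lambda value
```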

The sequence of lambda values is stored in the fit object and can be accessed using fit$lambda. For example, the lambda corresponding to the first solution in the path can be retrieved using fit$lambda[1]. To print the estimated B for a particular value of lambda, we use the function coef(fit,lambda), which takes the object fit as the first parameter and the value of lambda corresponding to the desired solution as the second parameter. Note that (in this example) the solution at index 7 has a support size of 10. We can retrieve this solution using the coef function as follows:

coef(fit,lambda=fit$lambda[7])

to get the following output:

1001 x 1 sparse Matrix of class "dgCMatrix"
                    
Intercept 0.01052402
V1        1.01601044
V2        1.01830944
V3        1.00606875
V4        0.98309180
V5        0.97389883
V6        0.96148076
V7        1.00990714
V8        1.08535507
V9        1.02686930
V10       0.94235619
V11       .         
V12       .         
V13       .         
V14       .         
V15       .         
V16       .         
V17       .         
V18       .         
V19       .         
V20       .      
.
.
.

The output is a sparse vector of type dgCMatrix. The first element in the vector is the intercept, and the rest are the B coefficients. Aside from the intercept, the only non-zeros in the above solution are the coordinates V1, V2, V3, ..., V10, which are the coordinates in the true support (used to generate the data). Thus, this solution successfully recovers the true support. We can also make predictions using a specific solution in the grid via the function predict(fit,newx,lambda), where newx is a testing sample (vector or matrix). For example, to predict the response for the samples in the data matrix X using the solution at index 7, we call the prediction function:

predict(fit,X,lambda=fit$lambda[7])
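The predictions can then be compared against y with base R. A minimal sketch (it assumes predict returns a matrix-like object, so we coerce it to a plain vector first):

```r
# In-sample error of the solution at index 7; coercion to a plain
# vector is an assumption about predict's return type.
yhat = as.vector(predict(fit, X, lambda=fit$lambda[7]))
rmse = sqrt(mean((y - yhat)^2))  # root mean squared error
rmse
```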

We have demonstrated the simple case of using an L0 penalty alone.

On real datasets, the L0L2 and L0L1 penalties typically lead to better performance and are more robust to the presence of noise. Also, using Algorithm="CDPSI" instead of Algorithm="CD" can lead to significant improvements, especially when the features are highly correlated.
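For instance, switching to CDPSI requires only changing the Algorithm argument; a sketch on the synthetic data from above (expect a longer running time than with CD):

```r
# Same L0 fit as before, but using local combinatorial search (CD-PSI(1)).
fitPSI = L0Learn.fit(X, y, Loss="SquaredError", Penalty="L0",
                     Algorithm="CDPSI", MaxSuppSize=50)
print(fitPSI)  # lambda/support-size table, as with the CD fit
```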
