Comparing CG optimization with Optim.jl #219

mateuszbaran · 2023-02-20T13:07:32Z

mateuszbaran
Feb 20, 2023
Maintainer

I have the following code which optimizes a generalized Rosenbrock function using both Manopt and Optim:

using BenchmarkTools, Profile
using Manifolds, Optim, Manopt

const p = [1.0, 100.0]

function rosenbrock(M::AbstractManifold, x)
    return rosenbrock(x)
end
function rosenbrock(x)
    val = zero(eltype(x))
    for i in 1:(length(x)-1)
        val += (p[1] - x[i])^2 + p[2] * (x[i+1] - x[i]^2)^2
    end
    return val
end

function rosenbrock_grad!(M::AbstractManifold, storage, x)
    # the first part can be computed using AD tools
    rosenbrock_grad!(storage, x)
    # projection is needed because Riemannian optimizers expect
    # Riemannian gradients instead of Euclidean ones.
    project!(M, storage, x, storage)
end

function rosenbrock_grad!(storage, x)
    # the first part can be computed using AD tools
    storage .= 0.0
    for i in 1:(length(x)-1)
        storage[i] += -2.0 * (p[1] - x[i]) - 4.0 * p[2] * (x[i+1] - x[i]^2) * x[i]
        storage[i+1] += 2.0 * p[2] * (x[i+1] - x[i]^2)
    end
    return storage
end

function rosenbrock_grad(M, x)
    # the first part can be computed using AD tools
    storage = similar(x)
    return rosenbrock_grad!(M, storage, x)
end

function test_cg()
    n_dims = 5
    M = Euclidean(n_dims)
    x0 = vcat(zeros(n_dims-1), 1.0)
    x_opt = conjugate_gradient_descent(
        M,
        rosenbrock,
        rosenbrock_grad!,
        x0;
        evaluation=InplaceEvaluation(),
        stepsize=ArmijoLinesearch(M),
        coefficient=HagerZhangCoefficient(),
        stopping_criterion=StopAfterIteration(15),
        return_state=true,
    )
    return x_opt.p
end


function test_cg_optim()
    n_dims = 5
    x0 = vcat(zeros(n_dims-1), 1.0)
    optimize(rosenbrock, rosenbrock_grad!, x0, ConjugateGradient())
end

Optim works fine:

julia> test_cg_optim()
 * Status: success

 * Candidate solution
    Final objective value:     4.077585e-18

 * Found with
    Algorithm:     Conjugate Gradient

 * Convergence measures
    |x - x'|               = 5.05e-11 ≰ 0.0e+00
    |x - x'|/|x'|          = 5.05e-11 ≰ 0.0e+00
    |f(x) - f(x')|         = 1.55e-19 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 3.80e-02 ≰ 0.0e+00
    |g(x)|                 = 8.13e-09 ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    389
    f(x) calls:    976
    ∇f(x) calls:   592

but Manopt returns NaNs:

julia> test_cg()
5-element Vector{Float64}:
 NaN
 NaN
 NaN
 NaN
 NaN

A couple of leads:

Optim.jl uses a different linesearch. Does Manopt have the same one?
In Manopt Armijo returns near-zero step sizes (something like 1e-19) every other iteration which causes HagerZhangCoefficient calculation to be fairly unstable. Maybe stop_step in Armijo should be greater than 0 by default?

EDIT: fixed gradient code.

Answered by mateuszbaran

Feb 20, 2023

Sorry, I've made a typo in gradient code. By coincidence it was correct at the one point where I checked it. Now it's... better but not quite fine yet. I have to investigate a bit more before I bother you further 😉 .

View full answer

mateuszbaran · 2023-02-20T13:08:35Z

mateuszbaran
Feb 20, 2023
Maintainer Author

BTW, I'm using this branch of Manopt: #212 .

0 replies

kellertuer · 2023-02-20T13:53:54Z

kellertuer
Feb 20, 2023
Maintainer

If Armijo returns near zero step sizes, that often indicates the gradient is wrong (or the CG-direction) and is not a descent direction any more.
But sure, a zero-step (or near-zero-step) in CG causes many coefficients to collapse, because we often divide by norms or inner products with last steps (or last directions) so this might lead to NaNs.

stop_step is more like a numerical safeguard. If we set that default to something larger than zero, then only with a warming or a possibility to have a stop/debug reacting on that.

Which line search does Optim use?

1 reply

kellertuer Feb 20, 2023
Maintainer

Here's a first insight – it is indeed the coefficient that gets to NaN, if you add

        debug=[:Iteration, :Cost, " | ",:Iterate, " | ",:Stepsize, " | ", DebugEntry(:β),"\n"],

to the solver call you get

Initial F(x): 104.000000 |  |  | β: 0.0
# 1     F(x): 65.599617 | p: [0.0178484983009344, 0.0178484983009344, 0.0178484983009344, 0.0178484983009344, -0.7848498300934399] | s:0.0089242491504672 | β: -0.6360396694060821
# 2     F(x): 38.301099 | p: [0.021790141174161448, 0.004883197221383551, 0.004883197221383551, -0.022752408132118103, 0.5858507926214256] | s:0.004822308053269666 | β: 0.5870008607395764
# 3     F(x): 38.301099 | p: [0.02179014117416145, 0.00488319722138355, 0.00488319722138355, -0.022752408132118096, 0.5858507926214256] | s:1.1097518119957807e-18 | β: -0.8497946528117976
# 4     F(x): 25.087462 | p: [0.021081015370676182, 0.014815465865761827, 0.014489243412040737, 0.016330519390152072, -0.45965452317773936] | s:0.006560143180425466 | β: 1.90076786016192
# 5     F(x): 25.087462 | p: [0.021081015370676182, 0.014815465865761827, 0.014489243412040737, 0.016330519390152072, -0.45965452317773936] | s:1.287114802469768e-19 | β: NaN

and if beta is nan, then of course all continues in NaN. So the question is then why does that one NaN and can we maybe get Iterates and/or directions from Optim for comparison? Can we check that these are descent directions?

kellertuer · 2023-02-20T14:42:32Z

kellertuer
Feb 20, 2023
Maintainer

It really seems to be their quite-advanced HagerZhang step size () which I do not understand just by the code (and have not checked the paper; it does check Wolfe conditions, though. So the approach might be

    x_opt = conjugate_gradient_descent(
        M,
        rosenbrock,
        rosenbrock_grad!,
        x0;
        evaluation=InplaceEvaluation(),
        stepsize=WolfePowellLinesearch(M; linesearch_stopsize = 1e-12),
        coefficient=HagerZhangCoefficient(),
        stopping_criterion = StopAfterIteration(2000) | StopWhenGradientNormLess(1e-6) | StopWhenChangeLess(1e-12),
        debug=[:Iteration, (:Cost, "F(x): %1.4e")," | ", (:Change, "|Δp|: %1.6e"), (:GradientNorm,"|grad f(p)|: %1.6e"), " | ", :Stepsize, " | ", DebugEntry(:β),"\n"],
        return_state=true,
    )

note that this returns a Debug state, so it is better to use get_solver_result(x_opt) instead of directly accessing p.

The debug lines yields a bit of information during the iterations, the WolfePowell needs this stop check (which is more like a numerical fallback). The last line of debug and the solver result read as

# 1531  F(x): 3.6067e-08 | |Δp|: 8.567739e-07|grad f(p)|: 6.073167e-04 | s:0.000244140625 | β: -0.0004281215818537985
# 1532  F(x): 3.5946e-08 | |Δp|: 5.931164e-07|grad f(p)|: 3.780542e-04 | s:0.0009765625 | β: 2.001390522420301
# 1533  F(x): 3.5946e-08 | |Δp|: 9.959342e-13|grad f(p)|: 3.780551e-04 | s:9.313225746154785e-10 | β: -0.2251800081308954

# Solver state for `Manopt.jl`s Conjugate Gradient Descent Solver
After 1533 iterations

## Parameters
* conjugate gradient coefficient: HagerZhangCoefficient(ParallelTransport()) (last β=-0.2251800081308954)
* retraction method: ExponentialRetraction()
* vector transport method: ParallelTransport()

## Stepsize
WolfePowellLinesearch(DefaultManifold(), 0.0001, 0.999) with keyword arguments
  * retraction_method = ExponentialRetraction()
  * vector_transport_method = ParallelTransport()

## Stopping Criterion
Stop When _one_ of the following are fulfilled:
    Max Iteration 2000: not reached
    |Δf| < 1.0e-6: not reached
    |Δp| < 1.0e-12: reached
Overall: reached
This indicates convergence: Yes

which I also expect for a tough example like Rosenbrock (which we definetly should and could add to ManoptExamples).

1 reply

kellertuer Feb 20, 2023
Maintainer

Oh this was run on master where the CG Coefficient print in the overview is a bit nicer, but on the current PR we work on the number of iterations and the result are the same.

kellertuer · 2023-02-20T14:55:09Z

kellertuer
Feb 20, 2023
Maintainer

Looking at the original paper of the LInesearch used in Optim.jl – https://www.math.lsu.edu/~hozhang/papers/cg_compare.pdf – the line search is tailored very specifically for CG and a bit technical so sure with such a specific line search Optim.jl is very good, I think this can be generalised to manifolds “easily” in the sense that all their assumptions and values they use are easily statable on manifolds (with retractions, inverse retractions and vector transport where needed), it is still a bit technical to implement I think.

0 replies

mateuszbaran · 2023-02-20T18:40:14Z

mateuszbaran
Feb 20, 2023
Maintainer Author

Sorry, I've made a typo in gradient code. By coincidence it was correct at the one point where I checked it. Now it's... better but not quite fine yet. I have to investigate a bit more before I bother you further 😉 .

1 reply

kellertuer Feb 20, 2023
Maintainer

As I said in the very very first post: If Armijo collapses – that most probably indicates a wrong gradient ;)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comparing CG optimization with Optim.jl #219

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 5 comments 3 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Comparing CG optimization with Optim.jl #219

mateuszbaran Feb 20, 2023 Maintainer

Replies: 5 comments · 3 replies

mateuszbaran Feb 20, 2023 Maintainer Author

kellertuer Feb 20, 2023 Maintainer

kellertuer Feb 20, 2023 Maintainer

kellertuer Feb 20, 2023 Maintainer

kellertuer Feb 20, 2023 Maintainer

kellertuer Feb 20, 2023 Maintainer

mateuszbaran Feb 20, 2023 Maintainer Author

kellertuer Feb 20, 2023 Maintainer

mateuszbaran
Feb 20, 2023
Maintainer

Replies: 5 comments 3 replies

mateuszbaran
Feb 20, 2023
Maintainer Author

kellertuer
Feb 20, 2023
Maintainer

kellertuer Feb 20, 2023
Maintainer

kellertuer
Feb 20, 2023
Maintainer

kellertuer Feb 20, 2023
Maintainer

kellertuer
Feb 20, 2023
Maintainer

mateuszbaran
Feb 20, 2023
Maintainer Author

kellertuer Feb 20, 2023
Maintainer