-
Error-driven learning has two components: the error signal and the credit assignment factor.
In the simple delta rule equation, these two terms are:

$$\Delta w = (y^+ - y^-)\, x$$

where $(y^+ - y^-)$ is the error signal (the plus-phase target minus the minus-phase actual activity of the receiving unit), and $x$ is the credit assignment factor (the sending unit activity). These end up being mixed together in the CHL (contrastive Hebbian learning) learning rule:

$$\Delta w = x^+ y^+ - x^- y^-$$
This clearly does not have the two separable factors. The key point of the new learning rule is to unmix these factors, so that we can use a trace factor for credit assignment that integrates activity over time, as in the BellecScherrSubramoneyEtAl20 paper showing that backprop through time (BPTT) can be approximated using such a learning trace. Also, the unmixed error signal factor gives us more control over the vanishing gradient issues that otherwise arise from using only activation-based terms.

To begin, the original GeneRec (OReilly96) derivation of backprop -> CHL goes like this:

$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial g} \frac{\partial g}{\partial w}$$

where E is the overall error, w is the weight, y is the recv unit activity, g is the recv conductance (net input), and x is the sending activity. This chain rule turns into:

$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial y} \, y' \, x$$

Thus, the Err factor is:

$$\text{Err} = \frac{\partial E}{\partial y} \, y'$$

The presence of this derivative is critical -- and has many tradeoffs embedded within it, as discussed later (e.g., the ReLU effectively eliminates the derivative by using a mostly linear function, and thereby eliminates the vanishing gradient problem that otherwise occurs with sigmoidal activation functions).

The original GeneRec derivation of CHL mixes these factors by approximating the error times the derivative of the activation function using the discrete difference in receiving activation state across the two phases:

$$\frac{\partial E}{\partial y} \, y' \approx y^+ - y^-$$

In the GeneRec derivation, the approximate midpoint integration method and symmetry preservation then cause these terms to get mixed together with the sending activations, producing the CHL algorithm.

To derive the new trace-enabling rule, we avoid this mixing, and explore learning using the more separable Err * Credit form.

In practice, the key issue is which variable the temporal difference is computed on: just using the raw net input turns out to be too diffuse -- the units end up computing error gradients that are too similar, and the credit assignment is not quite sufficient to separate them out. In the axon framework in particular, the weights are constrained to be positive, and especially at the start of learning the net input terms are all fairly close in value across units. The lateral inhibition provides the critical differentiation so that only a subset of neurons are active, and thus having some contribution from the actual recv activity is critical for a learning rule in which different neurons end up specializing on different aspects of the problem.

The relative lack of this kind of differential receiver-based credit assignment in backprop nets is a critical difference from the CHL learning rule -- in the GeneRec derivation, it arises from making the learning rule symmetric, so that the credit assignment factor includes both sides of the synapse. In short, backprop is at one end of a continuum where the only credit assignment factor is presynaptic activity, and the existing weights provide a "filter" through which the Err term is processed. At the other end is the symmetric CHL equation where pre * post (xy) is in effect the credit assignment factor, and this new "trace" equation is somewhere in between.

The biology is a useful guide here: the relevant calcium signals driving learning represent a combination of a netin-like term via the NMDA channels, which are opened by sending activity, and, via VGCCs, some of the receiving unit activation. These calcium signals activate the CaMKII and DAPK1 kinases, via which the temporal-difference error signal is computed:

$$\text{Err} = \big[(1-\gamma)\, g^+ + \gamma\, y^+\big] - \big[(1-\gamma)\, g^- + \gamma\, y^-\big]$$

In practice, having about 2/3 to 3/4 of the calcium come from net input, and the remainder from VGCC / receiving activity, works best, i.e., gamma = .25 or so.
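As a rough rate-code illustration of this unmixed error factor (a minimal sketch, not the actual axon implementation; the variable names gPlus/gMinus and yPlus/yMinus are just illustrative):

```go
package main

import "fmt"

// errSignal computes the unmixed error factor as a temporal difference of a
// calcium-like signal: a net-input-like term g (via NMDA, driven by sending
// activity) mixed with receiving-unit activity y (via VGCCs).
// gamma is the fraction contributed by receiving activity (around 0.25).
func errSignal(gPlus, yPlus, gMinus, yMinus, gamma float32) float32 {
	caPlus := (1-gamma)*gPlus + gamma*yPlus    // plus-phase calcium proxy
	caMinus := (1-gamma)*gMinus + gamma*yMinus // minus-phase calcium proxy
	return caPlus - caMinus                    // temporal-difference error
}

func main() {
	fmt.Println(errSignal(0.6, 0.7, 0.5, 0.4, 0.25))
}
```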
In addition, the NMDA channels have a postsynaptic activation factor via their Mg blocking mechanism, so the calcium signal isn't purely additive as in the Err equation above. In practice, the direct NMDA conductance computed using standard equations works.

For the Credit side, the synapse-level integration of pre * post spiking works well, as used for the axon kinase synapse-level learning rule, integrated as a running-average signal over trials:

$$\text{Tr}_t = \text{Tr}_{t-1} + \frac{1}{\tau}\,\big(x\, y - \text{Tr}_{t-1}\big)$$

In practice, this works well, and provides a better fit overall to the biology in terms of the nature of the Ca signals at play.
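As a rough illustration of how the two separable factors then combine (a minimal sketch under the assumptions above, with illustrative names like creditTrace and dwt, not the actual axon code):

```go
package main

import "fmt"

// creditTrace updates a running-average trace of the product of sending (x)
// and receiving (y) activity, integrated across trials with time constant tau.
func creditTrace(prevTr, x, y, tau float32) float32 {
	return prevTr + (x*y-prevTr)/tau
}

// dwt combines the separable error and credit factors into a weight change.
func dwt(err, credit, lrate float32) float32 {
	return lrate * err * credit
}

func main() {
	tr := float32(0)
	// integrate the credit trace over a few trials of pre * post co-activity
	for i := 0; i < 5; i++ {
		tr = creditTrace(tr, 0.8, 0.6, 2)
	}
	fmt.Println(dwt(0.1, tr, 0.02)) // example weight change from Err * Credit
}
```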
-
One outstanding issue is the learning rate: it remains high throughout, and requires a lrate schedule as of now -- but perhaps some element of the y' derivative function could be re-introduced to capture this aspect. This must be done carefully so as not to just end up back at CHL and vanishing gradients -- hopefully there is a golden middle ground somewhere in there.
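Purely as a sketch of one possible form of that middle ground (not something tested here): blend a sigmoid-style derivative factor with a constant floor $\lambda$, so the error is attenuated for saturated units but never fully vanishes:

$$\text{Err}' = \text{Err} \cdot \big[\lambda + (1-\lambda)\, y\,(1-y)\big], \qquad 0 < \lambda \le 1$$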
-
Finally found the major culprit that was derailing LVis runs: the inhibitory within-layer connections use So it is now added back in, and LVis is finally proceeding well past the first 300 epochs.

Also, now that SubMean is available again, I compared SubMean = 0 vs. 1 on LVis and ra25x -- the LVis runs are virtually identical out to the current 200+ epochs! It is astounding how balanced the new learning rule is.
-
Overall, the clear picture emerging across all models is that the trace / netin err learning rule is naturally balanced in its "main effect" mean direction of weight change, such that subtracting the mean has almost no effect. The previous CHL learning rule had a significant main effect and required SubMean = 1, and even so would end up causing layers to become more active in general over time. The lack of this dynamic in trace means that you need to reduce the inhibition in general, at least in many cases.
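For reference, a minimal sketch of what subtracting the mean amounts to (illustrative names only, not the actual axon SubMean implementation):

```go
package main

import "fmt"

// subMean subtracts subFrac times the mean weight change across a receiving
// neuron's incoming synapses: subFrac = 1 removes the "main effect" mean
// direction entirely, subFrac = 0 leaves the weight changes untouched.
func subMean(dwts []float32, subFrac float32) {
	var mean float32
	for _, d := range dwts {
		mean += d
	}
	mean /= float32(len(dwts))
	for i := range dwts {
		dwts[i] -= subFrac * mean
	}
}

func main() {
	dwts := []float32{0.02, -0.01, 0.03}
	subMean(dwts, 1)
	fmt.Println(dwts) // mean-centered weight changes
}
```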
-
The trace branch implements this, and it is working well on ra25 and starting to work on objrec. Will add more details later.

Current status is that the effective learning rate for the new trace mechanism is very different from one based on activation differences, which explains the need for a lrate schedule -- will try to use the act diff to modulate the effective lrate instead of using a manual schedule...
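A minimal sketch of that idea (hypothetical names; just illustrating the proposal to let the average activation difference scale the learning rate, not an implemented mechanism):

```go
package main

import "fmt"

// effLrate scales the base learning rate by the ratio of the current
// layer-average plus-minus activation difference to a reference value,
// so the effective rate falls automatically as the error shrinks.
func effLrate(base, avgActDiff, refActDiff float32) float32 {
	return base * avgActDiff / refActDiff
}

func main() {
	fmt.Println(effLrate(0.02, 0.05, 0.2)) // smaller act diff -> smaller lrate
}
```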