-
Error-driven learning has two components: the error signal and the credit assignment factor.
In the simple delta rule equation, these two terms are:

$$\Delta w = (y^+ - y^-)\, x$$

where $(y^+ - y^-)$ is the error signal (the plus-phase target minus the minus-phase actual activity of the receiving unit), and $x$ is the credit assignment factor (the sending unit activity). These end up being mixed together in the CHL (contrastive Hebbian learning) learning rule:

$$\Delta w = x^+ y^+ - x^- y^-$$
This clearly does not have the two separable factors. The key point of the new learning rule is to unmix these factors, so that we can use a trace factor for credit assignment that integrates activity over time, as in the BellecScherrSubramoneyEtAl20 paper showing that backprop through time (BPTT) can be approximated using such a learning trace. Also, the unmixed error signal factor gives us more control over the vanishing gradient issues that otherwise arise from using only activation-based terms.

To begin, the original GeneRec (OReilly96) derivation of backprop -> CHL goes like this:

$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial g} \frac{\partial g}{\partial w}$$

where E is the overall error, w is the weight, y is the recv unit activity, g is the recv conductance (net input), and x is the sending activity. This chain rule turns into:

$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial y} \, y' \, x$$

Thus, the Err factor is:

$$\text{Err} = \frac{\partial E}{\partial y} \, y'$$

The presence of this derivative is critical -- and has many tradeoffs embedded within it, as discussed later (e.g., the ReLU effectively eliminates the derivative by using a mostly linear function, and thereby eliminates the vanishing gradient problem that otherwise occurs with sigmoidal activation functions).

The original GeneRec derivation of CHL mixes these factors by approximating the error times the derivative of the activation function using the discrete difference in receiving activation state across the two phases:

$$\frac{\partial E}{\partial y} \, y' \approx y^+ - y^-$$

In the GeneRec derivation, the approximate midpoint integration method and symmetry preservation then cause these terms to get mixed together with the sending activations, producing the CHL algorithm.

To derive the new trace-enabling rule, we avoid this mixing, and explore learning using the more separable Err * Credit form.

In practice, the key issue is which variable the temporal difference is computed on: just using the raw net input turns out to be too diffuse -- the units end up computing error gradients that are too similar, and the credit assignment is not quite sufficient to separate them out. In the axon framework in particular, the weights are constrained to be positive, and especially at the start of learning the net input terms are all fairly close in value across units. The lateral inhibition provides the critical differentiation so that only a subset of neurons are active, and thus having some contribution from the actual recv activity is critical for a learning rule in which different neurons end up specializing on different aspects of the problem.

The relative lack of this kind of differential receiver-based credit assignment in backprop nets is a critical difference from the CHL learning rule -- in the GeneRec derivation, it arises from making the learning rule symmetric, so that the credit assignment factor includes both sides of the synapse. In short, backprop is at one end of a continuum where the only credit assignment factor is presynaptic activity, and the existing weights provide a "filter" through which the Err term is processed. At the other end is the symmetric CHL equation where pre * post (xy) is in effect the credit assignment factor, and this new "trace" equation is somewhere in between.

The biology is a useful guide here: the relevant calcium signals driving learning represent a combination of a netin-like term via the NMDA channels, which are opened by sending activity, and, via VGCCs, some of the receiving unit activation. These calcium signals activate the CaMKII and DAPK1 kinases, via which the temporal-difference error signal is computed:

$$\text{Err} = \big[(1-\gamma)\, g^+ + \gamma\, y^+\big] - \big[(1-\gamma)\, g^- + \gamma\, y^-\big]$$

In practice, having about 2/3 to 3/4 of the calcium come from net input, and the remainder from VGCC / receiving activity, works best, i.e., gamma = .25 or so.
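As a rough rate-code illustration of this unmixed error factor (a minimal sketch, not the actual axon implementation; the variable names gPlus/gMinus and yPlus/yMinus are just illustrative):

```go
package main

import "fmt"

// errSignal computes the unmixed error factor as a temporal difference of a
// calcium-like signal: a net-input-like term g (via NMDA, driven by sending
// activity) mixed with receiving-unit activity y (via VGCCs).
// gamma is the fraction contributed by receiving activity (around 0.25).
func errSignal(gPlus, yPlus, gMinus, yMinus, gamma float32) float32 {
	caPlus := (1-gamma)*gPlus + gamma*yPlus    // plus-phase calcium proxy
	caMinus := (1-gamma)*gMinus + gamma*yMinus // minus-phase calcium proxy
	return caPlus - caMinus                    // temporal-difference error
}

func main() {
	fmt.Println(errSignal(0.6, 0.7, 0.5, 0.4, 0.25))
}
```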
In addition, the NMDA channels have a postsynaptic activation factor via their Mg blocking mechanism, so the calcium signal isn't purely additive as in the Err equation above. In practice, the direct NMDA conductance computed using standard equations works.

For the Credit side, the synapse-level integration of pre * post spiking works well, as used for the axon kinase synapse-level learning rule, integrated as a running-average signal over trials:

$$\text{Tr}_t = \text{Tr}_{t-1} + \frac{1}{\tau}\,\big(x\, y - \text{Tr}_{t-1}\big)$$

In practice, this works well, and provides a better fit overall to the biology in terms of the nature of the Ca signals at play.
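As a rough illustration of how the two separable factors then combine (a minimal sketch under the assumptions above, with illustrative names like creditTrace and dwt, not the actual axon code):

```go
package main

import "fmt"

// creditTrace updates a running-average trace of the product of sending (x)
// and receiving (y) activity, integrated across trials with time constant tau.
func creditTrace(prevTr, x, y, tau float32) float32 {
	return prevTr + (x*y-prevTr)/tau
}

// dwt combines the separable error and credit factors into a weight change.
func dwt(err, credit, lrate float32) float32 {
	return lrate * err * credit
}

func main() {
	tr := float32(0)
	// integrate the credit trace over a few trials of pre * post co-activity
	for i := 0; i < 5; i++ {
		tr = creditTrace(tr, 0.8, 0.6, 2)
	}
	fmt.Println(dwt(0.1, tr, 0.02)) // example weight change from Err * Credit
}
```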
-
One outstanding issue is the learning rate: it remains high throughout, and requires a lrate schedule as of now -- but perhaps some element of the y' derivative function could be re-introduced to capture this aspect. This must be done carefully so as not to just end up back at CHL and vanishing gradients -- hopefully there is a golden middle ground somewhere in there.
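Purely as a sketch of one possible form of that middle ground (not something tested here): blend a sigmoid-style derivative factor with a constant floor $\lambda$, so the error is attenuated for saturated units but never fully vanishes:

$$\text{Err}' = \text{Err} \cdot \big[\lambda + (1-\lambda)\, y\,(1-y)\big], \qquad 0 < \lambda \le 1$$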
-
Finally found the major culprit that was derailing LVis runs: the inhibitory within-layer connections use So it is now added back in, and LVis is finally proceeding well past the first 300 epochs.

Also, now that SubMean is available again, I compared SubMean = 0 vs. 1 on LVis and ra25x -- the LVis runs are virtually identical out to the current 200+ epochs! It is astounding how balanced the new learning rule is.
-
Overall, the clear picture emerging across all models is that the trace / netin err learning rule is naturally balanced in its "main effect" mean direction of weight change, such that subtracting the mean has almost no effect. The previous CHL learning rule had a significant main effect and required SubMean = 1, and even so would end up causing layers to become more active in general over time. The lack of this dynamic in trace means that you need to reduce the inhibition in general, at least in many cases.
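For reference, a minimal sketch of what subtracting the mean amounts to (illustrative names only, not the actual axon SubMean implementation):

```go
package main

import "fmt"

// subMean subtracts subFrac times the mean weight change across a receiving
// neuron's incoming synapses: subFrac = 1 removes the "main effect" mean
// direction entirely, subFrac = 0 leaves the weight changes untouched.
func subMean(dwts []float32, subFrac float32) {
	var mean float32
	for _, d := range dwts {
		mean += d
	}
	mean /= float32(len(dwts))
	for i := range dwts {
		dwts[i] -= subFrac * mean
	}
}

func main() {
	dwts := []float32{0.02, -0.01, 0.03}
	subMean(dwts, 1)
	fmt.Println(dwts) // mean-centered weight changes
}
```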
-
The trace branch implements this, and it is working well on ra25 and starting to work on objrec. Will add more details later.

Current status is that the effective learning rate for the new trace mechanism is very different from one based on activation differences, which explains the need for a lrate schedule -- will try to use the act diff to modulate the effective lrate instead of using a manual schedule...
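A minimal sketch of that idea (hypothetical names; just illustrating the proposal to let the average activation difference scale the learning rate, not an implemented mechanism):

```go
package main

import "fmt"

// effLrate scales the base learning rate by the ratio of the current
// layer-average plus-minus activation difference to a reference value,
// so the effective rate falls automatically as the error shrinks.
func effLrate(base, avgActDiff, refActDiff float32) float32 {
	return base * avgActDiff / refActDiff
}

func main() {
	fmt.Println(effLrate(0.02, 0.05, 0.2)) // smaller act diff -> smaller lrate
}
```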