**The challenge and benefits of not being able to backprop through time (BPTT)**

If you can do BPTT, then you can magically propagate error signals from later in time backwards through time, changing the trajectory of the network "next time around" to avoid those errors. However, that "next time around" is likely to be different in various ways, and therein lies the rub of BPTT: it tends to create fragile chains of dynamics that quickly fall apart. Time also has an intrinsically exponential, iterative character, so small deviations tend to multiply over iterations, magnifying the brittleness of BPTT.

The LSTM tames BPTT by eliminating the exponential character of time evolution, essentially copying fixed activity patterns forward through time indefinitely (and adding external gates to be able to update these signals). Critically, it retains the ability to actually do BPTT through this perfect linear copy function, thereby bridging potentially long gaps between "cause and effect".

The SRN also adopts the copy solution (copying the hidden activity as a context input for the next time step), but with a severely limited time step of 1, relying on an emergent exponential-like reverberatory effect for anything longer (a minimal sketch of this context copy follows at the end of this comment). It also relies on a kind of "wishful thinking": that learning shaping the current hidden layer representations will result in the relevant information being available and properly represented as a context for the next time step. This is the price of its computationally limited, but biologically appealing, property of not doing BPTT.

This SRN logic is not quite as bad as it sounds: if information is relevant across multiple time steps, and you happen to preserve some shred of that information by chance, then learning at later time steps will tend to reinforce the maintenance of that information over time. However, if something in the relatively distant past is relevant for only one time step, an SRN will have a hard time learning to maintain it.

One obvious way of improving the SRN is to introduce a bias toward maintaining information across multiple time steps in a relatively stable manner (like the LSTM), so that it is more likely to be around when you need it later. This is what the newly revised CT dynamics based on NMDA channels and recurrent connections do: they sustain longer patterns of activity over time, which is key for having relevant information still available when it is needed later.

To summarize, the advantages of the extended SRN-like solution are:

And the disadvantages are:
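To ground the discussion above, here is a minimal sketch of the SRN-style context copy (the type and function names are hypothetical, not from the actual models): the hidden activity computed on trial t is copied verbatim into the context input for trial t+1, with no gradient flowing back through the copy.

```go
// Minimal sketch of the SRN context-copy step (hypothetical names, not the
// actual model code). The hidden activity computed on trial t is copied
// verbatim into the context input used on trial t+1 -- there is no BPTT
// through this copy; learning must shape the hidden layer so that whatever
// is copied happens to be useful later.
package main

import "fmt"

type SRN struct {
	Context []float32 // copy of previous hidden activity
	Hidden  []float32 // current hidden activity
}

// Step computes a new hidden state from the current input plus the context
// (the actual activation function is elided), then copies the hidden state
// into the context for the next time step.
func (n *SRN) Step(input []float32, compute func(input, context []float32) []float32) {
	n.Hidden = compute(input, n.Context)
	// the crucial SRN move: a literal copy, with a time step of exactly 1
	n.Context = append([]float32(nil), n.Hidden...)
}

func main() {
	n := &SRN{Context: make([]float32, 3)}
	dummy := func(in, ctx []float32) []float32 { // placeholder "network"
		out := make([]float32, len(ctx))
		for i := range out {
			out[i] = in[i%len(in)] + 0.5*ctx[i] // reverberation-like carryover
		}
		return out
	}
	for t := 0; t < 3; t++ {
		n.Step([]float32{1, 0}, dummy)
		fmt.Println("t =", t, "hidden =", n.Hidden)
	}
}
```

The 0.5 scaling in the dummy compute function just illustrates the reverberation-like carryover mentioned above; in the real models the CT maintenance currents (NMDA channels, recurrent connections) play that role.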
**Trace makes some (but not too much) difference**

The new Trace learning mechanism (#49) is supposed to help improve learning in temporally extended tasks by providing a kind of approximation to BPTT that is fully biologically plausible (i.e., just exponentially integrating an activity trace over time). However, the effects are relatively minor in the various test cases (though always beneficial when there are differences). The above discussion makes sense of this: if the model is not properly biased to maintain relevant information, then the trace is not going to be a strong enough mechanism to overcome that problem. And if it is properly biased, then the existing error-driven learning is generally sufficient to shape the representations.

The trace mechanism basically amounts to integrating activity over time and using that integrated value as the credit assignment factor in the backprop learning logic (see the sketch below). However, this is really not as powerful as full BPTT: it can only achieve a kind of "associative" level of learning, where earlier states can become "associated" with (i.e., receive credit or blame for) a subsequent error gradient, but it isn't actually propagating a gradient through the different activity states that were present on prior trials, so it can't really "rescue" some earlier signal. Nevertheless, it is generally beneficial and often produces small performance advantages -- it is just nowhere near as powerful (and dangerous!) as full BPTT.
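As a rough illustration of the trace mechanism described above (a simplified sketch with hypothetical names, not the actual axon implementation): a per-synapse trace exponentially integrates sender-receiver co-activity over trials, and the weight change on each trial is the current error-driven term scaled by that trace, so earlier co-activity receives some credit or blame for later outcomes without any gradient being propagated through the intervening states.

```go
// Sketch of trace-based credit assignment (hypothetical, simplified).
// A per-synapse trace integrates co-activity over trials with time constant
// tau; the weight change on each trial is the current error-driven term
// scaled by this accumulated trace, so recently (and repeatedly) co-active
// synapses absorb more of the credit / blame.
package main

import "fmt"

type Syn struct {
	Wt    float32
	Trace float32
}

// TrialUpdate integrates the trace and applies a trace-scaled weight change.
// sendAct, recvAct: activities on this trial; errGrad: error-driven term
// (e.g., plus-phase minus minus-phase difference); lrate: learning rate;
// tau: trace integration time constant in trials.
func (sy *Syn) TrialUpdate(sendAct, recvAct, errGrad, lrate, tau float32) {
	dt := 1.0 / tau
	sy.Trace += dt * (sendAct*recvAct - sy.Trace) // exponential integration
	sy.Wt += lrate * errGrad * sy.Trace           // credit assignment via trace
}

func main() {
	sy := &Syn{Wt: 0.5}
	// co-activity early on, but the error signal only arrives later:
	acts := [][2]float32{{1, 1}, {0, 0}, {0, 0}}
	errs := []float32{0, 0, 1} // error gradient shows up on the last trial
	for t := range acts {
		sy.TrialUpdate(acts[t][0], acts[t][1], errs[t], 0.1, 2)
		fmt.Printf("trial %d: trace=%.3f wt=%.3f\n", t, sy.Trace, sy.Wt)
	}
}
```

In this toy run the synapse is co-active only on the first trial, yet it still absorbs part of the weight change when the error arrives two trials later -- associative credit assignment, but no rescue of the earlier signal itself.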
**Simple case logic for Super -> CT -> Pulvinar -> back**

The simple case for how the

This last point is a bit strange: the error signal here is about how the t-1 CT info can be used to predict the current t input -- it is not obviously directly relevant for making sure that the current Super layer encodes relevant information for the t+1 future, which is what we want it to be doing. Thus, the only way to make sense of this is in terms of shaping larger multi-trial patterns so that there is some generalization across time -- what we learn at point t will be relevant for t+1 too. The fact that this is not conveyed as a direct BPTT-style error through the CT copy and back into Super at the prior time step does not appear to be too relevant for allowing the error signal to be useful. Consistent with this, eliminating the pulvinar -> Hidden projection is lethal in most models. Furthermore, it cannot be replaced by a CT -> Super prjn, which in general does not do much. The CT layers, by virtue of their stronger maintenance currents, do not exhibit as much of a temporal-difference error signal, and thus are not a good thing to try to backpropagate through. In short, CT can just be a simple copy with almost no interesting learning itself -- it just reflects Super. This is the case in the

However, unlike reservoir models, CT does reflect the learning taking place in the Super layers, and also has the capacity to learn, e.g., in the
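A minimal sketch of this temporal logic (hypothetical and heavily simplified, not the actual axon code): CT holds a one-trial-delayed copy of Super, the Pulvinar's minus phase is CT's prediction of the current input, its plus phase is the actual driver input, and their difference is the predictive error projected back to Super.

```go
// Sketch of the Super -> CT -> Pulvinar prediction loop over trials
// (hypothetical, heavily simplified). CT is just a one-trial-delayed copy of
// Super; Pulvinar's minus phase is CT's prediction and its plus phase is the
// actual driver input; the difference is the predictive error used for
// learning in Super (and, weakly, CT).
package main

import "fmt"

func main() {
	inputs := []float32{0.2, 0.8, 0.4} // driver inputs over trials
	var ct float32                     // CT holds Super's activity from trial t-1
	for t, in := range inputs {
		super := in                 // stand-in for Super's encoding of the current input
		pulvMinus := ct             // minus phase: prediction from t-1 CT info
		pulvPlus := in              // plus phase: actual current input drives Pulvinar
		err := pulvPlus - pulvMinus // predictive error, projected back to Super
		fmt.Printf("t=%d minus=%.2f plus=%.2f err=%+.2f\n", t, pulvMinus, pulvPlus, err)
		ct = super // CT copies Super for use on the next trial
	}
}
```

The error at trial t measures how well the t-1 copy predicted the t input; as noted above, it does not directly tell Super how to encode information for t+1.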
Details on
**Additional hidden layers: not useful here**

None of the models benefit from additional, deeper hidden layers. If you make the music model's primary hidden layer smaller, it can use the units in a second hidden layer to advantage, but this seems to be more about capacity than depth.
**Alpha / Theta relationship in CT dynamics**

There is strong evidence that the 5IB bursting occurs at alpha, i.e., at 100 msec intervals. And yet, in spiking, minus-phase / plus-phase learning only works over 200 msec windows -- I haven't been able to get it working at the 100 msec window, which is also so short as to provide a very limited sample of relevant activity. One way to reconcile this is to suggest, as originally envisioned, that there are two alpha cycles within each theta cycle, with the first setting up the context for a subsequent alpha cycle in which sufficient information is present to actually make a sensible prediction. Thus, the plus phase represents the last 50 msec of this second alpha cycle. However, the models do not currently include the role of the first alpha cycle's bursting.
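To make the timing concrete, here is a small sketch of the schedule described above (the constants are assumptions based on the text, not values from the model code): a 200 msec theta cycle containing two 100 msec alpha cycles, with the plus phase occupying the final 50 msec.

```go
// Sketch of the theta / alpha / phase timing described above (assumed
// constants, not taken from the model code): one 200 msec theta cycle
// contains two 100 msec alpha cycles, and the plus phase is the final
// 50 msec of the theta cycle (i.e., the second half of the second alpha).
package main

import "fmt"

const (
	ThetaMsec = 200 // full minus+plus learning window
	AlphaMsec = 100 // 5IB bursting interval
	PlusMsec  = 50  // plus phase at the end of the theta cycle
)

func main() {
	for ms := 0; ms < ThetaMsec; ms += 25 {
		alpha := ms/AlphaMsec + 1        // 1st or 2nd alpha cycle within theta
		plus := ms >= ThetaMsec-PlusMsec // inside the plus phase?
		fmt.Printf("msec %3d: alpha cycle %d, plus phase: %v\n", ms, alpha, plus)
	}
}
```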
The corticothalamic (CT) projecting layer in the deep predictive learning framework is computationally modeled by the simple recurrent network (SRN) (Elman, 1991; Jordan, 1989). This discussion is to review and update thinking about this, in the context of the `deep_fsa`, `deep_music`, and `deep_move` models.