**The challenge and benefits of not being able to backprop through time (BPTT)**

If you can do BPTT, then you can magically propagate error signals from later in time backwards through time, changing the trajectory of the network "next time around" to avoid those errors. However, that "next time around" is likely to be different in various ways, and therein lies the rub of BPTT: it tends to create fragile chains of dynamics that quickly fall apart. Time also has an intrinsically exponential, iterative character, so small deviations tend to multiply over iterations, magnifying the brittleness of BPTT.

The LSTM tames BPTT by eliminating the exponential character of time evolution, essentially copying fixed activity patterns forward through time indefinitely (and adding external gates to be able to update these signals). Critically, it retains the ability to actually do BPTT through this perfect linear copy function, thereby bridging potentially long gaps between "cause and effect".

The SRN also adopts the copy solution (copying the hidden activity as a context input for the next time step), but with a severely limited time step of 1, relying on an emergent exponential-like reverberatory effect for anything longer (a minimal sketch of this context copy follows at the end of this comment). It also relies on a kind of "wishful thinking": that learning shaping the current hidden layer representations will result in the relevant information being available and properly represented as a context for the next time step. This is the price of its computationally limited, but biologically appealing, property of not doing BPTT.

This SRN logic is not quite as bad as it sounds: if information is relevant across multiple time steps, and you happen to preserve some shred of that information by chance, then learning at later time steps will tend to reinforce the maintenance of that information over time. However, if something in the relatively distant past is relevant for only one time step, an SRN will have a hard time learning to maintain it.

One obvious way of improving the SRN is to introduce a bias toward maintaining information across multiple time steps in a relatively stable manner (like the LSTM), so that it is more likely to be around when you need it later. This is what the newly revised CT dynamics based on NMDA channels and recurrent connections do: they sustain longer patterns of activity over time, which is key for having relevant information still available when it is needed later.

To summarize, the advantages of the extended SRN-like solution are:

And the disadvantages are:
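To ground the discussion above, here is a minimal sketch of the SRN-style context copy (the type and function names are hypothetical, not from the actual models): the hidden activity computed on trial t is copied verbatim into the context input for trial t+1, with no gradient flowing back through the copy.

```go
// Minimal sketch of the SRN context-copy step (hypothetical names, not the
// actual model code). The hidden activity computed on trial t is copied
// verbatim into the context input used on trial t+1 -- there is no BPTT
// through this copy; learning must shape the hidden layer so that whatever
// is copied happens to be useful later.
package main

import "fmt"

type SRN struct {
	Context []float32 // copy of previous hidden activity
	Hidden  []float32 // current hidden activity
}

// Step computes a new hidden state from the current input plus the context
// (the actual activation function is elided), then copies the hidden state
// into the context for the next time step.
func (n *SRN) Step(input []float32, compute func(input, context []float32) []float32) {
	n.Hidden = compute(input, n.Context)
	// the crucial SRN move: a literal copy, with a time step of exactly 1
	n.Context = append([]float32(nil), n.Hidden...)
}

func main() {
	n := &SRN{Context: make([]float32, 3)}
	dummy := func(in, ctx []float32) []float32 { // placeholder "network"
		out := make([]float32, len(ctx))
		for i := range out {
			out[i] = in[i%len(in)] + 0.5*ctx[i] // reverberation-like carryover
		}
		return out
	}
	for t := 0; t < 3; t++ {
		n.Step([]float32{1, 0}, dummy)
		fmt.Println("t =", t, "hidden =", n.Hidden)
	}
}
```

The 0.5 scaling in the dummy compute function just illustrates the reverberation-like carryover mentioned above; in the real models the CT maintenance currents (NMDA channels, recurrent connections) play that role.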
**Trace makes some (but not too much) difference**

The new Trace learning mechanism (#49) is supposed to help improve learning in temporally extended tasks by providing a kind of approximation to BPTT that is fully biologically plausible (i.e., just exponentially integrating an activity trace over time). However, the effects are relatively minor in the various test cases (though always beneficial when there are differences). The above discussion makes sense of this: if the model is not properly biased to maintain relevant information, then the trace is not going to be a strong enough mechanism to overcome that problem. And if it is properly biased, then the existing error-driven learning is generally sufficient to shape the representations.

The trace mechanism basically amounts to integrating activity over time and using that integrated value as the credit assignment factor in the backprop learning logic (see the sketch below). However, this is really not as powerful as full BPTT: it can only achieve a kind of "associative" level of learning, where earlier states can become "associated" with (i.e., receive credit or blame for) a subsequent error gradient, but it isn't actually propagating a gradient through the different activity states that were present on prior trials, so it can't really "rescue" some earlier signal. Nevertheless, it is generally beneficial and often produces small performance advantages -- it is just nowhere near as powerful (and dangerous!) as full BPTT.
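As a rough illustration of the trace mechanism described above (a simplified sketch with hypothetical names, not the actual axon implementation): a per-synapse trace exponentially integrates sender-receiver co-activity over trials, and the weight change on each trial is the current error-driven term scaled by that trace, so earlier co-activity receives some credit or blame for later outcomes without any gradient being propagated through the intervening states.

```go
// Sketch of trace-based credit assignment (hypothetical, simplified).
// A per-synapse trace integrates co-activity over trials with time constant
// tau; the weight change on each trial is the current error-driven term
// scaled by this accumulated trace, so recently (and repeatedly) co-active
// synapses absorb more of the credit / blame.
package main

import "fmt"

type Syn struct {
	Wt    float32
	Trace float32
}

// TrialUpdate integrates the trace and applies a trace-scaled weight change.
// sendAct, recvAct: activities on this trial; errGrad: error-driven term
// (e.g., plus-phase minus minus-phase difference); lrate: learning rate;
// tau: trace integration time constant in trials.
func (sy *Syn) TrialUpdate(sendAct, recvAct, errGrad, lrate, tau float32) {
	dt := 1.0 / tau
	sy.Trace += dt * (sendAct*recvAct - sy.Trace) // exponential integration
	sy.Wt += lrate * errGrad * sy.Trace           // credit assignment via trace
}

func main() {
	sy := &Syn{Wt: 0.5}
	// co-activity early on, but the error signal only arrives later:
	acts := [][2]float32{{1, 1}, {0, 0}, {0, 0}}
	errs := []float32{0, 0, 1} // error gradient shows up on the last trial
	for t := range acts {
		sy.TrialUpdate(acts[t][0], acts[t][1], errs[t], 0.1, 2)
		fmt.Printf("trial %d: trace=%.3f wt=%.3f\n", t, sy.Trace, sy.Wt)
	}
}
```

In this toy run the synapse is co-active only on the first trial, yet it still absorbs part of the weight change when the error arrives two trials later -- associative credit assignment, but no rescue of the earlier signal itself.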
**Simple case logic for Super -> CT -> Pulvinar -> back**

The simple case for how the

This last point is a bit strange: the error signal here is about how the t-1 CT info can be used to predict the current t input -- it is not obviously directly relevant for making sure that the current Super layer encodes relevant information for the t+1 future, which is what we want it to be doing. Thus, the only way to make sense of this is in terms of shaping larger multi-trial patterns so that there is some generalization across time -- what we learn at point t will be relevant for t+1 too. The fact that this is not conveyed as a direct BPTT-style error through the CT copy and back into Super at the prior time step does not appear to be too relevant for allowing the error signal to be useful. Consistent with this, eliminating the pulvinar -> Hidden projection is lethal in most models. Furthermore, it cannot be replaced by a CT -> Super prjn, which in general does not do much. The CT layers, by virtue of their stronger maintenance currents, do not exhibit as much of a temporal-difference error signal, and thus are not a good thing to try to backpropagate through. In short, CT can just be a simple copy with almost no interesting learning itself -- it just reflects Super. This is the case in the

However, unlike reservoir models, CT does reflect the learning taking place in the Super layers, and also has the capacity to learn, e.g., in the
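A minimal sketch of this temporal logic (hypothetical and heavily simplified, not the actual axon code): CT holds a one-trial-delayed copy of Super, the Pulvinar's minus phase is CT's prediction of the current input, its plus phase is the actual driver input, and their difference is the predictive error projected back to Super.

```go
// Sketch of the Super -> CT -> Pulvinar prediction loop over trials
// (hypothetical, heavily simplified). CT is just a one-trial-delayed copy of
// Super; Pulvinar's minus phase is CT's prediction and its plus phase is the
// actual driver input; the difference is the predictive error used for
// learning in Super (and, weakly, CT).
package main

import "fmt"

func main() {
	inputs := []float32{0.2, 0.8, 0.4} // driver inputs over trials
	var ct float32                     // CT holds Super's activity from trial t-1
	for t, in := range inputs {
		super := in                 // stand-in for Super's encoding of the current input
		pulvMinus := ct             // minus phase: prediction from t-1 CT info
		pulvPlus := in              // plus phase: actual current input drives Pulvinar
		err := pulvPlus - pulvMinus // predictive error, projected back to Super
		fmt.Printf("t=%d minus=%.2f plus=%.2f err=%+.2f\n", t, pulvMinus, pulvPlus, err)
		ct = super // CT copies Super for use on the next trial
	}
}
```

The error at trial t measures how well the t-1 copy predicted the t input; as noted above, it does not directly tell Super how to encode information for t+1.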
Details on
**Additional hidden layers: not useful here**

None of the models benefit from additional, deeper hidden layers. If you make the music model's primary hidden layer smaller, it can use the units in a second hidden layer to advantage, but this seems to be more about capacity than depth.
**Alpha / Theta relationship in CT dynamics**

There is strong evidence that the 5IB bursting occurs at alpha, i.e., at 100 msec intervals. And yet, in spiking, minus-phase / plus-phase learning only works over 200 msec windows -- I haven't been able to get it working at the 100 msec window, which is also so short as to provide a very limited sample of relevant activity. One way to reconcile this is to suggest, as originally envisioned, that there are two alpha cycles within each theta cycle, with the first setting up the context for a subsequent alpha cycle in which sufficient information is present to actually make a sensible prediction. Thus, the plus phase represents the last 50 msec of this second alpha cycle. However, the models do not currently include the role of the first alpha cycle's bursting.
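To make the timing concrete, here is a small sketch of the schedule described above (the constants are assumptions based on the text, not values from the model code): a 200 msec theta cycle containing two 100 msec alpha cycles, with the plus phase occupying the final 50 msec.

```go
// Sketch of the theta / alpha / phase timing described above (assumed
// constants, not taken from the model code): one 200 msec theta cycle
// contains two 100 msec alpha cycles, and the plus phase is the final
// 50 msec of the theta cycle (i.e., the second half of the second alpha).
package main

import "fmt"

const (
	ThetaMsec = 200 // full minus+plus learning window
	AlphaMsec = 100 // 5IB bursting interval
	PlusMsec  = 50  // plus phase at the end of the theta cycle
)

func main() {
	for ms := 0; ms < ThetaMsec; ms += 25 {
		alpha := ms/AlphaMsec + 1        // 1st or 2nd alpha cycle within theta
		plus := ms >= ThetaMsec-PlusMsec // inside the plus phase?
		fmt.Printf("msec %3d: alpha cycle %d, plus phase: %v\n", ms, alpha, plus)
	}
}
```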
The corticothalamic (CT) projecting layer in the deep predictive learning framework is computationally modeled by the simple recurrent network (SRN) (Elman, 1991; Jordan, 1989). This discussion is to review and update thinking about this, in the context of the `deep_fsa`, `deep_music`, and `deep_move` models.