Skip to content

Latest commit

ย 

History

History
108 lines (75 loc) ยท 6.84 KB

File metadata and controls

108 lines (75 loc) ยท 6.84 KB

Language Model

Recurrent Neural Networks (RNN)

$$h_t = \tanh (W_{hh}h_{t-1} + W_{xh}x_{t})$$

Backpropagation Through Time (BPTT)

๊ธฐ์กด์˜ ์‹ ๊ฒฝ๋ง ๊ตฌ์กฐ์—์„œ๋Š” backpropagation ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ด์šฉํ•œ๋‹ค. RNN์—์„œ๋Š” ์ด๋ฅผ ์‚ด์ง ๋ณ€ํ˜•์‹œํ‚จ ๋ฒ„์ „์ธ Backpropagation Through Time (BPTT) ์„ ์‚ฌ์šฉํ•˜๋Š”๋ฐ, ๊ทธ ์ด์œ ๋Š” ๊ฐ ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์ด ๋„คํŠธ์›Œํฌ์˜ ๋งค ์‹œ๊ฐ„ ์Šคํ…๋งˆ๋‹ค ๊ณต์œ ๋˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์ฆ‰, ๊ฐ ์‹œ๊ฐ„ ์Šคํ…์˜ ์ถœ๋ ฅ๋‹จ์—์„œ์˜ gradient๋Š” ํ˜„์žฌ ์‹œ๊ฐ„ ์Šคํ…์—์„œ์˜ ๊ณ„์‚ฐ์—๋งŒ ์˜์กดํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ์ด์ „ ์‹œ๊ฐ„ ์Šคํ…์—๋„ ์˜์กดํ•œ๋‹ค.

RNN tradeoffs

  • RNN Advantages
    • ์–ด๋–ค ๊ธธ์ด์˜ ์ž…๋ ฅ์ด๋ผ๋„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค.
    • ๊ธด ๊ธธ์ด์˜ ์ž…๋ ฅ์ด ๋“ค์–ด์™€๋„, ๋ชจ๋ธ์˜ ํฌ๊ธฐ๊ฐ€ ์ปค์ง€์ง€ ์•Š๋Š”๋‹ค.
    • ์ด๋ก ์ ์œผ๋กœ, step t์— ๋Œ€ํ•œ ๊ณ„์‚ฐ์€, ์ด์ „ ๋‹จ๊ณ„์˜ ์ •๋ณด๋ฅผ ํฌํ•จํ•œ๋‹ค.
  • RNN Disadvantages
    • Recurrent ์—ฐ์‚ฐ์ด ๋Š๋ฆฌ๋‹ค.
    • ํ˜„์‹ค์ ์œผ๋กœ, ์˜ค๋ž˜ ์ „์˜ step์—์„œ์˜ ์ •๋ณด์— ์ ‘๊ทผํ•˜๊ธฐ ์–ด๋ ต๋‹ค.
    • ๊ธฐ์šธ๊ธฐ ์†Œ์‹ค (Vanashing Gradient)
      • ์—ญ์ „ํŒŒ ๊ณผ์ •์—์„œ ์ž…๋ ฅ์ธต์œผ๋กœ ๊ฐˆ์ˆ˜๋ก ๊ธฐ์šธ๊ธฐ๊ฐ€ ์ ์ฐจ ์ž‘์•„์ง€๋Š” ํ˜„์ƒ.
      • ์ž…๋ ฅ์ธต์— ๊ฐ€๊นŒ์šด ์ธต๋“ค์—์„œ ๊ฐ€์ค‘์น˜๋“ค์ด ์—…๋ฐ์ดํŠธ๊ฐ€ ์ œ๋Œ€๋กœ ๋˜์ง€ ์•Š์œผ๋ฉด ์ตœ์ ์˜ ๋ชจ๋ธ์„ ์ฐพ์„ ์ˆ˜ ์—†๊ฒŒ ๋œ๋‹ค.
    • ๊ธฐ์šธ๊ธฐ ํญ์ฃผ (Exploding Gradient)

Vanishing Gradient๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” ์ด์œ 

ํ™œ์„ฑํ™” ํ•จ์ˆ˜๋กœ ํ”ํžˆ ์‚ฌ์šฉ๋˜๋Š” sigmoid ํ•จ์ˆ˜์™€ tanh ํ•จ์ˆ˜์˜ ๋„ํ•จ์ˆ˜๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค. ๋„ํ•จ์ˆ˜์˜ ์ตœ๋Œ€๊ฐ’์ด 1๋ณด๋‹ค ์ž‘์€ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

์—ญ์ „ํŒŒ (back-propagation) ๊ณผ์ •์—์„œ ํ™œ์„ฑํ™” ํ•จ์ˆ˜์˜ ๋ฏธ๋ถ„๊ฐ’์ด ์‚ฌ์šฉ๋˜๋Š”๋ฐ, 1๋ณด๋‹ค ์ž‘์€ ๊ฐ’์ด ๋ฐ˜๋ณตํ•ด์„œ ๊ณฑํ•ด์ง€๋ฉด 0์— ์ˆ˜๋ ดํ•˜๊ฒŒ ๋œ๋‹ค.

Why is Vanishing Gradient a Problem?

๋จผ ๊ฑฐ๋ฆฌ์˜ gradient signal์€ ๊ฐ€๊นŒ์šด ๊ฑฐ๋ฆฌ์˜ gradient signal๋ณด๋‹ค ํ›จ์”ฌ ์ž‘๊ธฐ ๋•Œ๋ฌธ์— ์†์‹ค๋œ๋‹ค.

๋”ฐ๋ผ์„œ ๋จผ ๊ฑฐ๋ฆฌ์˜ gradient signal์€ ๋ชจ๋ธ ๊ฐ€์ค‘์น˜์— ์˜ํ–ฅ์„ ๋ผ์น˜์ง€ ๋ชปํ•˜๊ณ , ๋ชจ๋ธ ๊ฐ€์ค‘์น˜๋Š” ๊ฐ€๊นŒ์šด ๊ฑฐ๋ฆฌ์˜ gradient signal์— ๋Œ€ํ•ด์„œ๋งŒ ์—…๋ฐ์ดํŠธ๋œ๋‹ค.

Exploding Gradient๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” ์ด์œ 

Back propagation ์—ฐ์‚ฐ์—๋Š” Activation function์˜ ํŽธ๋ฏธ๋ถ„ ๊ฐ’๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๊ฐ€์ค‘์น˜ ๊ฐ’๋“ค๋„ ๊ด€์—ฌํ•˜๊ฒŒ ๋œ๋‹ค.

๋งŒ์•ฝ, ๋ชจ๋ธ์˜ ๊ฐ€์ค‘์น˜๋“ค์ด ์ถฉ๋ถ„ํžˆ ํฐ ๊ฐ’์ด๋ผ๊ณ  ๊ฐ€์ •์„ ํ•˜๋ฉด, ๋ ˆ์ด์–ด๊ฐ€ ๊นŠ์–ด์งˆ์ˆ˜๋ก ์ถฉ๋ถ„ํžˆ ํฐ ๊ฐ€์ค‘์น˜๋“ค์ด ๋ฐ˜๋ณต์ ์œผ๋กœ ๊ณฑํ•ด์ง€๋ฉด์„œ backward value๊ฐ€ ํญ๋ฐœ์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜๊ฒŒ ๋œ๋‹ค. ์ด๋ฅผ Exploding gradient๋ผ ํ•œ๋‹ค.

Why is Exploding Gradient a Problem?

Gradient๊ฐ€ ๋„ˆ๋ฌด ํฌ๋ฉด, SGD update step์ด ์ปค์ง€๊ฒŒ ๋œ๋‹ค.

๋„ˆ๋ฌด ํฐ step์œผ๋กœ ์—…๋ฐ์ดํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜์—ฌ ์ž˜๋ชป๋œ ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ตฌ์„ฑ์— ๋„๋‹ฌํ•˜๋ฉด, ํฐ ์†์‹ค์ด ๋ฐœ์ƒํ•˜์—ฌ ์ž˜๋ชป๋œ ์—…๋ฐ์ดํŠธ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค.

How to fix vanishing/exploding gradient problem?

  • Two new types of RNN
  • Other fixes for vanishing (or exploding) gradient:
    • Gradient clipping
      • ๊ทธ๋ผ๋ฐ์ด์…˜์˜ norm์ด threshold๋ณด๋‹ค ํฐ ๊ฒฝ์šฐ, SGD ์—…๋ฐ์ดํŠธ๋ฅผ ์ ์šฉํ•˜๊ธฐ ์ „์— ํฌ๊ธฐ๋ฅผ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•
    • Skip connections
  • More fancy RNN variants:
    • Bidirectional RNNs
    • Multi-layer RNNs

Long Short-Term Memory (LSTM)

  • LSTM์˜ ์‚ฌ์šฉ ์ด์œ 
    • RNN์€ ๋ฐ์ดํ„ฐ์˜ ์žฅ๊ธฐ ์˜์กด์„ฑ(long-term dependency)์„ ํ•™์Šตํ•˜๋Š” ๋ฐ ์–ด๋ ค์›€์„ ๊ฒช์ง€๋งŒ, LSTM์€ ๋” ๊ธด long-term ์‹œํ€€์Šค๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋„๋ก ๋งŒ๋“ค์–ด์กŒ๋‹ค.
    • LSTM์€ Memory Cell๊ณผ Gate๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์žฅ๊ธฐ ๋ฐ ๋‹จ๊ธฐ Hidden State๋ฅผ ํ†ต๊ณผํ•œ๋‹ค.
    • Cell state๋ฅผ ๋„์ž…ํ•˜๋ฉด์„œ gradient flow๊ฐ€ ๋ฐฉํ•ด ๋ฐ›์ง€ ์•Š๊ฒŒ ํ•˜๋ฉฐ, gradient๊ฐ€ ์ž˜ ํ˜๋Ÿฌ๊ฐ€์ง€ ์•Š๋Š” ๊ฒƒ์„ ๋ง‰์•„์ค€๋‹ค.

LSTM์˜ ๊ฐ Gate๋“ค์˜ ์—ญํ• 

1. Forget Gate ($f_{t}$)

  • ๊ณผ๊ฑฐ ์ •๋ณด๋ฅผ ์–ผ๋งˆ๋งŒํผ ์œ ์ง€ํ•  ๊ฒƒ์ธ๊ฐ€?
  • Forget Gate๋Š” cell state์—์„œ sigmoid($\sigma{}$) layer๋ฅผ ๊ฑฐ์ณ ์ด์ „ hidden state๋ฅผ ์–ผ๋งˆ๋‚˜ ๊ธฐ์–ตํ• ์ง€ ์ •ํ•œ๋‹ค.

2. Input Gate ($i_{t}$)

  • ์ƒˆ๋กœ ์ž…๋ ฅ๋œ ์ •๋ณด๋Š” ์–ผ๋งˆ๋งŒํผ ํ™œ์šฉํ•  ๊ฒƒ์ธ๊ฐ€?
  • Input Gate๋Š” ์ƒˆ๋กœ ์ž…๋ ฅ๋œ ์ •๋ณด ์ค‘ ์–ด๋–ค ๊ฒƒ์„ cell state์— ์ €์žฅํ•  ๊ฒƒ์ธ์ง€๋ฅผ ์ •ํ•œ๋‹ค. ๋จผ์ € sigmoid($\sigma{}$) layer๋ฅผ ๊ฑฐ์ฒ˜ ์–ด๋–ค ๊ฐ’์„ ์—…๋ฐ์ดํŠธ ํ•  ๊ฒƒ์ธ์ง€๋ฅผ ์ •ํ•œ ํ›„, $\tanh$ layer์—์„œ ์ƒˆ๋กœ์šด ํ›„๋ณด Vector๋ฅผ ๋งŒ๋“ ๋‹ค.
    • ํ™œ์„ฑํ™”ํ•จ์ˆ˜๋กœ์จ sigmoid๋ฅผ ์“ฐ์ง€ ์•Š๊ณ , $\tanh$๋ฅผ ์“ฐ๋Š” ์ด์œ ๋Š” ์—ฌ๊ธฐ ์ฐธ๊ณ 

3. Output Gate ($o_{t}$)

  • ๋‘ ์ •๋ณด๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ๋‚˜์˜จ ์ถœ๋ ฅ ์ •๋ณด๋ฅผ ์–ผ๋งˆ๋งŒํผ ๋„˜๊ฒจ์ค„ ๊ฒƒ์ธ๊ฐ€?
  • Output Gate๋Š” ์–ด๋–ค ์ •๋ณด๋ฅผ output์œผ๋กœ ๋‚ด๋ณด๋‚ผ์ง€ ์ •ํ•œ๋‹ค. ๋จผ์ € sigmoid($\sigma{}$) layer์— input data๋ฅผ ๋„ฃ์–ด output ์ •๋ณด๋ฅผ ์ •ํ•œ ํ›„ Cell state๋ฅผ $\tanh$ layer์— ๋„ฃ์–ด sigmoid($\sigma{}$) layer์˜ output๊ณผ ๊ณฑํ•˜์—ฌ output์œผ๋กœ ๋‚ด๋ณด๋‚ธ๋‹ค.

References

  1. ์ธ๊ณต์ง€๋Šฅ ์‘์šฉ (ICE4104), ์ธํ•˜๋Œ€ํ•™๊ต ์ •๋ณดํ†ต์‹ ๊ณตํ•™๊ณผ ํ™์„ฑ์€ ๊ต์ˆ˜๋‹˜
  2. Stanford CS224N - NLP w/ DL | Winter 2021 | Lecture 5 - Recurrent Neural networks (RNNs)
  3. [๋จธ์‹ ๋Ÿฌ๋‹/๋”ฅ๋Ÿฌ๋‹] 10-1. Recurrent Neural Network(RNN)
  4. RNN Tutorial Part 2 - Python, NumPy์™€ Theano๋กœ RNN ๊ตฌํ˜„ํ•˜๊ธฐ
  5. Neural Network Notes
  6. (2) ํ™œ์„ฑํ•จ์ˆ˜์™€ ์ˆœ์ „ํŒŒ ๋ฐ ์—ญ์ „ํŒŒ
  7. Vanishing gradient / Exploding gradient
  8. LSTM(Long-Short Term Memory)๊ณผ GRU(gated Recurrent Unit)(์ฝ”๋“œ ์ถ”๊ฐ€ํ•ด์•ผํ•จ)
  9. 07-07 ๊ธฐ์šธ๊ธฐ ์†Œ์‹ค(Gradient Vanishing)๊ณผ ํญ์ฃผ(Exploding) - ๋”ฅ ๋Ÿฌ๋‹์„ ์ด์šฉํ•œ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ์ž…๋ฌธ
  10. LSTM - ์ธ์ฝ”๋ค, ์ƒ๋ฌผ์ •๋ณด ์ „๋ฌธ์œ„ํ‚ค
  11. [๋”ฅ๋Ÿฌ๋‹] LSTM ์‰ฝ๊ฒŒ ์ดํ•ดํ•˜๊ธฐ - YouTube
  12. Loner์˜ ํ•™์Šต๋…ธํŠธ :: LSTM ๊ฐœ์ธ์ •๋ฆฌ