Optimization

Epoch vs Batch vs Iteration

  • Epoch
    • One full pass over the entire training dataset.
  • (Mini-)Batch
    • A small chunk of the dataset; training processes the data one mini-batch at a time.
  • Iteration / Step
    • The number of mini-batches needed to complete one epoch (a small worked calculation follows below).
      • Processing every batch once completes one epoch.
    • Equivalently, the number of parameter updates performed during one epoch.
      • One parameter update is performed per mini-batch.

Choosing the Batch Size and the Learning Rate

If we know the batch size and learning rate of a baseline model, how should the learning rate be adjusted when the batch size is increased or decreased? A common heuristic is to scale the learning rate linearly with the batch size (see the sketch below).

  • Suppose the baseline uses batch_size = 128 and learning_rate = 0.001
    • If batch_size is halved to 64
      • learning_rate is likewise halved to 0.0005
    • If batch_size is doubled to 256
      • learning_rate is likewise doubled to 0.002
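A minimal sketch of this linear scaling rule, using the baseline values above; the helper name scale_learning_rate is purely illustrative.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling rule: scale the learning rate by the same factor as the batch size."""
    return base_lr * new_batch_size / base_batch_size

print(scale_learning_rate(0.001, 128, 64))   # 0.0005
print(scale_learning_rate(0.001, 128, 256))  # 0.002
```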

Gradient Descent

Scalar Derivative vs. Gradient

  • Scalar derivative
    • The derivative of a scalar function is itself a scalar value.
    • Taking the scalar derivative therefore gives the instantaneous slope at a given point.
      • The instantaneous slope can be read as the rate of change (velocity) at that point.
    • The slope does indicate a direction, but only along a single axis: positive or negative in one dimension.
  • Gradient $\nabla f$
    • The gradient expresses the derivative as a vector of partial derivatives.
    • Like the scalar derivative, it is a derivative evaluated at a point, but it gives a direction in the multi-dimensional input space rather than along a single axis.
    • $\nabla f$ points in the direction in which $f$ increases most steeply.
    • For Gradient Descent to converge toward the global minimum of the loss, the parameters are moved in $-\nabla f$, the direction of steepest decrease (a minimal update sketch follows below).
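A minimal sketch of the gradient-descent update on a toy quadratic, just to make the step in the $-\nabla f$ direction concrete; the function, starting point, and learning rate are illustrative assumptions.

```python
import numpy as np

def f(w):            # toy loss: f(w) = ||w||^2, minimized at w = 0
    return np.sum(w ** 2)

def grad_f(w):       # gradient of f: grad f(w) = 2w
    return 2 * w

w = np.array([3.0, -4.0])        # illustrative starting point
lr = 0.1                         # illustrative learning rate
for _ in range(100):
    w = w - lr * grad_f(w)       # move in the -grad f direction

print(w, f(w))                   # w is now close to [0, 0], loss close to 0
```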

Stochastic Gradient Descent

Mini-batch SGD: Loop

  1. Sample a batch of data
  2. Forward prop it through the graph (network), get loss
  3. Backprop to calculate the gradients
  4. Update the parameters using the gradients (see the minimal sketch below)
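A minimal PyTorch-style sketch of this loop; the tiny linear model, loss function, and random tensors stand in for a real network and dataset and are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative setup: a tiny linear model and random data standing in for a real dataset
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

dataset = torch.utils.data.TensorDataset(torch.randn(512, 10), torch.randn(512, 1))
train_loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True)

for x, y in train_loader:    # 1. sample a batch of data
    pred = model(x)          # 2. forward-prop it through the network
    loss = loss_fn(pred, y)  #    ... and get the loss
    optimizer.zero_grad()
    loss.backward()          # 3. backprop to calculate the gradients
    optimizer.step()         # 4. update the parameters using the gradients
```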

Momentum

  • Advantages
    1. Helps training converge quickly even when the partial derivatives with respect to different weights differ greatly in magnitude.
    2. Helps the optimizer escape saddle points and local minima (see the update sketch below).
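A minimal sketch of the classical momentum update; the momentum coefficient, learning rate, and gradient values are illustrative assumptions. The velocity accumulates past gradients, which is what damps oscillations across weights and carries the update through saddle points and shallow local minima.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.001, mu=0.9):
    """One SGD-with-momentum update (mu and lr are illustrative values):
        v <- mu * v - lr * grad
        w <- w + v
    """
    velocity = mu * velocity - lr * grad
    w = w + velocity
    return w, velocity

w = np.zeros(3)
v = np.zeros(3)
grad = np.array([0.5, -1.0, 2.0])   # illustrative gradient
w, v = sgd_momentum_step(w, grad, v)
```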

Learning Rate Scheduling

Learning Rate Decay

  • Motivation
    • With a fixed learning rate, the model has a hard time converging to the global minimum of the loss. Instead, a large learning rate is used in the early phase of training to approach the neighborhood of the global minimum quickly, and the learning rate is then gradually reduced so the model can converge to the minimum more precisely.
  • How it works
    • As described above, the philosophy of learning rate decay is to start with a large learning rate and gradually shrink it as training progresses. There are several decay schedules: the widely used Cosine Annealing schedule varies the learning rate like a cosine curve; ExponentialLR decays it like an exponential function; and StepLR reduces it by a fixed factor at user-specified steps (a minimal scheduler sketch follows below).

Warmup

  • Motivation
    • Using a large learning rate from the very start of training can make training unstable. Warmup therefore starts training with a small learning rate and gradually increases it; after the warmup phase, either a fixed learning rate or a learning rate decay schedule is used.
  • How it works
    • Warmup gradually increases the learning rate during the initial phase of training. As noted above, once warmup ends, training continues with either a fixed learning rate or a learning rate decay schedule (a minimal warmup sketch follows below).
