==============================

What I studied

[DLBasic] Optimization

[DLBasic] Optimization Assignment

[AI Math Lecture 9] First Steps with CNN

[AI Math Lecture 9 Quiz] First Steps with CNN, quizzes 1-5

[DLBasic] Optimization

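Each optimization topic could honestly fill a full semester on its own, so the lecture trims things down and covers only the notable ones. The important part is getting the exact concept behind each term.

The goal becomes finding a local minimum.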

Generalization
Under-fitting vs. over-fitting
Cross-validation
Bias-variance tradeoff
Bootstrapping
Bagging and boosting

A low training error does not imply a low test error. A model generalizes well when, after training, its test error is also low. A model that fits only the training data too closely is overfitting.

How should the validation set be split? Whatever the split, the test data must never be used during training in any form; it is only for the final prediction.
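One common answer to the splitting question is k-fold cross-validation. The sketch below is a minimal illustration, not something from the lecture code; `k_fold_indices` and its parameters are hypothetical names. It assumes the test set has already been set aside, so only training indices are split.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

# The held-out test set is split off beforehand and never enters this loop.
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(n_samples=10, k=5)):
    print(fold, sorted(val_idx))
```

Each sample is used for validation exactly once and for training in the other k-1 folds, so hyperparameters can be chosen without touching the test data.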

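Low variance is desirable: the model's outputs stay consistent for similar inputs. Bias is a separate question: high bias means the mean of the outputs is far from the true target.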

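cost = bias^2 + variance + noise. Bias and variance trade off against each other: lowering one tends to raise the other, while the noise term cannot be reduced by the model.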
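Written out, the decomposition looks as follows. The notation is an assumption of this sketch rather than the lecture's: the target is $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Var}[\epsilon] = \sigma^2$, $\hat{f}$ is the learned model, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```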

Bootstrapping is any test or metric that uses random sampling with replacement.

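Bootstrapping is a statistics term. When the training data is fixed, you create several training sets from it by subsampling: for example, with 100 samples you do not use all 100 but draw sets of 80, then build several models or metrics from them and combine the outputs, for example by averaging.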
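A minimal sketch of the idea, using only the standard library. The data values and the name `bootstrap_means` are made up for illustration. It resamples with replacement at the full sample size, as in the definition above, rather than drawing 80 out of 100.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample of `data`."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Sample n points WITH replacement: some points repeat, some are left out.
        resample = rng.choices(data, k=n)
        means.append(statistics.fmean(resample))
    return means

data = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 3.0, 2.2, 2.7, 2.6]  # toy numbers
means = bootstrap_means(data)
print("sample mean:", statistics.fmean(data))
print("bootstrap standard error:", statistics.stdev(means))
```

The spread of the resampled means gives a rough estimate of how uncertain the original sample mean is, without collecting any new data.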

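Bagging (Bootstrap aggregating)

With 100,000 training samples, instead of using all of them at once you create several training sets, and this is where bootstrapping comes in. This is commonly called an ensemble. Rather than training a single model on all 100,000 samples, you train n models on about 80,000 samples each; when a test input arrives, you run all n models and take the average of, or a vote over, their outputs. This often performs better than a single model. Ensembles used in Kaggle and other competitions often fall under bagging.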
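The procedure can be sketched in a few lines. This is a toy illustration under stated assumptions: the base model is a 1-D least-squares line rather than a neural network, and `fit_line`, `bagging_fit`, `bagging_predict`, and `sample_ratio` are hypothetical names.

```python
import random

def fit_line(points):
    """Closed-form least squares for y = a*x + b on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x

def bagging_fit(points, n_models=10, sample_ratio=0.8, seed=0):
    rng = random.Random(seed)
    k = int(len(points) * sample_ratio)
    # Each model is trained independently on its own resampled subset.
    return [fit_line(rng.choices(points, k=k)) for _ in range(n_models)]

def bagging_predict(models, x):
    # Aggregate by averaging; a classifier would use voting instead.
    return sum(a * x + b for a, b in models) / len(models)

rng = random.Random(1)
points = [(i / 10, 2.0 * (i / 10) + 1.0 + rng.gauss(0, 0.1)) for i in range(100)]
models = bagging_fit(points)
print(bagging_predict(models, 5.0))
```

The models never see each other's data or outputs during training; they are combined only at prediction time.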

Boosting

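Suppose there are 100 samples. First build a simple model, a weak learner, that predicts about 80 of them well. It will get the other 20 mostly wrong, so the next model is built to predict those 20. Unlike bagging above, the n models are not treated as independent: the weak learners are combined sequentially into one strong learner, and the information is aggregated by finding a weight for each weak learner.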
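The sequential idea can be sketched as follows. Note the swap: for simplicity this fits each new weak learner to the residuals of the current ensemble (a gradient-boosting-style variant for squared error), rather than the sample-reweighting scheme described above. All function names and the fixed `shrinkage` weight are assumptions of the sketch.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on x that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

def boost(xs, ys, n_rounds=100, shrinkage=0.5):
    stumps, preds = [], [0.0] * len(xs)
    for _ in range(n_rounds):
        # Each new weak learner targets what the current ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + shrinkage * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return stumps

def boost_predict(stumps, x, shrinkage=0.5):
    # The strong learner is the weighted sum of all weak learners.
    return sum(shrinkage * stump_predict(s, x) for s in stumps)

xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
stumps = boost(xs, ys)
mse = sum((boost_predict(stumps, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print("training MSE:", mse)
```

A single stump can only output two values, yet the sum of many of them, each trained on the previous errors, tracks the curve closely.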

practical gradient methods

Stochastic gradient descent

Update with the gradient computed from a single sample.

With 100,000 samples, look at just one at a time: compute the gradient from that sample, update the parameters, then take another sample and update again, and so on.

Mini-batch gradient descent

Update with the gradient computed from a subset of data.

Batch gradient descent

Update with the gradient computed from the whole data.
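The three variants differ only in how many samples feed each update, which the sketch below makes explicit with a single `batch_size` argument. It is a toy example on a linear model with made-up data; `gradient_descent` and its defaults are hypothetical, not part of the lecture code.

```python
import random

def gradient_descent(data, batch_size, lr=0.05, n_steps=2000, seed=0):
    """Fit y = w*x + b with MSE loss, using `batch_size` samples per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        batch = rng.sample(data, batch_size)
        # Gradient of the mean squared error over the current batch.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / batch_size
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / batch_size
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = random.Random(1)
data = [(x, 3.0 * x - 1.0 + rng.gauss(0, 0.1))
        for x in (rng.uniform(-1, 1) for _ in range(200))]

print("stochastic (1 sample):  ", gradient_descent(data, batch_size=1))
print("mini-batch (32 samples):", gradient_descent(data, batch_size=32))
print("batch (all 200 samples):", gradient_descent(data, batch_size=len(data)))
```

All three should land near w = 3 and b = -1 on this easy problem; the single-sample version is the noisiest and the full-batch version is the most expensive per step.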

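Choosing the batch size actually matters. Somewhat smaller batches tend to be better: small batches tend to converge to flat minima, while large batches tend to converge to sharp minima.

I recommend reading "On Large-batch Training for Deep Learning: Generalization Gap and Sharp Minima" (2017), which runs a range of experiments on this.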

Gradient Descent Methods

Stochastic gradient descent
Momentum
Nesterov accelerated gradient
Adagrad
Adadelta
RMSprop
Adam

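Momentum literally mimics physical momentum. The adaptive methods track how much each parameter has changed so far. EMA means gradients are not simply summed but combined with an exponential moving average.

Adam combines momentum and the adaptive approach well.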
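To make the update rules concrete, here is a minimal sketch of plain gradient descent, momentum, and Adam on the toy objective f(x) = x^2. The objective, function names, and step counts are assumptions for illustration; the hyperparameter defaults follow commonly used values.

```python
import math

def grad(x):
    """Gradient of the toy objective f(x) = x**2."""
    return 2.0 * x

def run_sgd(x, lr=0.1, n_steps=200):
    for _ in range(n_steps):
        x -= lr * grad(x)
    return x

def run_momentum(x, lr=0.1, beta=0.9, n_steps=200):
    velocity = 0.0
    for _ in range(n_steps):
        # Accumulate past gradients so the update keeps moving in a consistent direction.
        velocity = beta * velocity + grad(x)
        x -= lr * velocity
    return x

def run_adam(x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=200):
    m, v = 0.0, 0.0
    for t in range(1, n_steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # EMA of gradients (momentum part)
        v = beta2 * v + (1 - beta2) * g * g    # EMA of squared gradients (adaptive part)
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

for name, run in [("sgd", run_sgd), ("momentum", run_momentum), ("adam", run_adam)]:
    print(name, run(5.0))
```

In Adam, `m` plays the role of momentum and `v` rescales the step per parameter, which is the sense in which it mixes the two ideas.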

Regularization

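Regularization deliberately hinders training so that the model works well not only on the training data but also on the test data.

The methods below can be thought of as tools.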

Early stopping
Parameter norm penalty
Data augmentation
Noise robustness
Label smoothing
Dropout
Batch normalization

Mixup and CutMix are recommended.

Batch norm: why it works is not clear, but it just tends to work well.
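Two of the tools above are small enough to sketch directly: label smoothing and mixup. This is a plain-Python illustration on lists, not a training-ready implementation; the function names and the `alpha` defaults are assumptions.

```python
import random

def smooth_labels(one_hot, alpha=0.1):
    """Move `alpha` of the probability mass from the true class to a uniform distribution."""
    n_classes = len(one_hot)
    return [(1 - alpha) * p + alpha / n_classes for p in one_hot]

def mixup(x1, y1, x2, y2, alpha=1.0, rng=random):
    """Blend two (input, one-hot label) pairs with a Beta(alpha, alpha) mixing weight."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

print(smooth_labels([0.0, 1.0, 0.0, 0.0]))
x, y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=random.Random(0))
print(lam, x, y)
```

Both produce soft targets instead of hard one-hot labels, which discourages the model from becoming overconfident on the training data.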

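Practice

We are solving a regression problem.

1-D regression with an MLP. The point to emphasize is that even when everything else is identical, the result differs depending on the optimizer.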

Regression with Different Optimizers

[41]
!pip install matplotlib==3.3.0
Requirement already satisfied: matplotlib==3.3.0 in /usr/local/lib/python3.6/dist-packages (3.3.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib==3.3.0) (1.3.1)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib==3.3.0) (2.8.1)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.6/dist-packages (from matplotlib==3.3.0) (7.0.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib==3.3.0) (0.10.0)
Requirement already satisfied: numpy>=1.15 in /usr/local/lib/python3.6/dist-packages (from matplotlib==3.3.0) (1.19.5)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /usr/local/lib/python3.6/dist-packages (from matplotlib==3.3.0) (2.4.7)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.1->matplotlib==3.3.0) (1.15.0)

[42]
import numpy as np
import torch
from torch import nn, optim
from torch.nn import functional as F
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
%matplotlib inline  
%config InlineBackend.figure_format='retina'
print ("PyTorch version:[%s]."%(torch.__version__))
 
# Device Configuration
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print ("This notebook use [%s]."%(device))
PyTorch version:[1.7.0+cu101].
This notebook use [cuda:0].

Dataset

[43]
n_data = 10000
x_numpy = -3+6*np.random.rand(n_data,1)
y_numpy = np.exp(-(x_numpy**2)) * np.cos(10*x_numpy) + 3e-2*np.random.randn(n_data,1)
plt.figure(figsize=(8,5))
plt.plot(x_numpy, y_numpy, 'r.', ms=2)
plt.show()
x_torch = torch.Tensor(x_numpy).to(device)
y_torch = torch.Tensor(y_numpy).to(device)
print ("Done.")
Done.

Define Model

[44]
# our model
class Model(nn.Module):
    def __init__(self, name='mlp', xdim=1, hdims=[16,16], ydim=1):
        super(Model, self).__init__()
        self.name = name
        self.xdim = xdim
        self.hdims = hdims
        self.ydim = ydim
 
        self.layers = []
        prev_hdim = self.xdim
        for hdim in self.hdims:
            self.layers.append(nn.Linear(prev_hdim, hdim, bias=True))
            self.layers.append(nn.Tanh()) # activation
            prev_hdim = hdim
        # Final layer (without activation)
        self.layers.append(nn.Linear(prev_hdim, self.ydim, bias=True))
 
        # Concatenate all layers
        self.net = nn.Sequential()
        for l_idx, layer in enumerate(self.layers):
            layer_name = "%s_%02d"%(type(layer).__name__.lower(), l_idx)
            self.net.add_module(layer_name, layer)
 
        self.init_params() # initialize parameters
 
    def init_params(self):
        for m in self.modules():
            if isinstance(m,nn.Conv2d): # init conv
                nn.init.kaiming_normal_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m,nn.Linear): # init dense
                nn.init.kaiming_normal_(m.weight)
                nn.init.zeros_(m.bias)
 
    def forward(self, X):
        return self.net(X)
 
print("Done.")
Done.

[45]
LEARNING_RATE = 1e-2
# Instantiate models
model_sgd = Model(name='mlp_sgd', xdim=1, hdims=[64,64], ydim=1).to(device)
model_momentum = Model(name='mlp_momentum', xdim=1, hdims=[64,64], ydim=1).to(device)
model_adam = Model(name='mlp_adam', xdim=1, hdims=[64,64], ydim=1).to(device)
# Optimizers
loss = nn.MSELoss()
optm_sgd = optim.SGD(model_sgd.parameters(),lr=LEARNING_RATE)
optm_momentum = optim.SGD(model_momentum.parameters(),lr=LEARNING_RATE, momentum=0.9)
optm_adam = optim.Adam(model_adam.parameters(),lr=LEARNING_RATE)
print("Done.")
Done.

Check Parameters

[46]
np.set_printoptions(precision=3)
n_param = 0
for p_idx,(param_name,param) in enumerate(model_sgd.named_parameters()):
    if param.requires_grad:
        param_numpy = param.detach().cpu().numpy() # to numpy array
        n_param += len(param_numpy.reshape(-1))
        print(f"{p_idx} name:[{param_name}] shape:[{param_numpy.shape}].")
        print(f"   val:{param_numpy.reshape(-1)[:5]}")
print(f"Total number of parameters:[{n_param:,d}].")
0 name:[net.linear_00.weight] shape:[(64, 1)].
   val:[ 1.302  1.836 -0.125  0.912 -0.781]
1 name:[net.linear_00.bias] shape:[(64,)].
   val:[0. 0. 0. 0. 0.]
2 name:[net.linear_02.weight] shape:[(64, 64)].
   val:[-0.363  0.022 -0.102 -0.304  0.267]
3 name:[net.linear_02.bias] shape:[(64,)].
   val:[0. 0. 0. 0. 0.]
4 name:[net.linear_04.weight] shape:[(1, 64)].
   val:[ 0.061 -0.062  0.156 -0.012  0.027]
5 name:[net.linear_04.bias] shape:[(1,)].
   val:[0.]
Total number of parameters:[4,353].

Train

[47]
MAX_ITER, BATCH_SIZE, PLOT_EVERY = 1e4, 64, 500
 
model_sgd.init_params()
model_momentum.init_params()
model_adam.init_params()
 
model_sgd.train()
model_momentum.train()
model_adam.train()
 
for it in range(int(MAX_ITER)):
    r_idx = np.random.permutation(n_data)[:BATCH_SIZE]
    batch_x, batch_y = x_torch[r_idx], y_torch[r_idx]
 
    # Update with Adam
    y_pred_adam = model_adam.forward(batch_x)
    loss_adam = loss(y_pred_adam, batch_y)
    optm_adam.zero_grad()
    loss_adam.backward()
    optm_adam.step()
 
    # Update with Momentum
    y_pred_momentum = model_momentum.forward(batch_x)
    loss_momentum = loss(y_pred_momentum, batch_y)
    optm_momentum.zero_grad()
    loss_momentum.backward()
    optm_momentum.step()
 
    # Update with SGD
    y_pred_sgd = model_sgd.forward(batch_x)
    loss_sgd = loss(y_pred_sgd, batch_y)
    optm_sgd.zero_grad()
    loss_sgd.backward()
    optm_sgd.step()
 
    # Plot
    if ((it%PLOT_EVERY) == 0) or (it==0) or (it==(MAX_ITER-1)):
        with torch.no_grad():
            y_sgd_numpy = model_sgd.forward(x_torch).cpu().detach().numpy()
            y_momentum_numpy = model_momentum.forward(x_torch).cpu().detach().numpy()
            y_adam_numpy = model_adam.forward(x_torch).cpu().detach().numpy()
 
            plt.figure(figsize=(8,4))
            plt.plot(x_numpy, y_numpy, 'r.', ms=4, label='GT')
            plt.plot(x_numpy, y_sgd_numpy, 'g.', ms=2, label='SGD')
            plt.plot(x_numpy, y_momentum_numpy, 'b.', ms=2, label='Momentum')
            plt.plot(x_numpy, y_adam_numpy, 'k.', ms=2, label='ADAM')
            plt.title("[%d/%d]"%(it, MAX_ITER),fontsize=15)
            plt.legend(labelcolor='linecolor', loc='upper right', fontsize=15)
            plt.show()
 
print("Done.")
Done.