skorch callbacks (3) ML Pipeline

2022-06-09

2022-06-12

callback, neural network, pipeline, pytorch, sklearn, visualization

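PyTorch is currently one of the most popular deep learning libraries.
skorch lets you use PyTorch together with scikit-learn.
skorch provides a variety of callbacks for specifying the details of training.
PyTorch components such as learning rate schedulers can be plugged in through these callbacks.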

The post is long, so it is split into three parts.
In this third part, we complete the ML pipeline with skorch.
Several callbacks are used to apply the detailed settings.

3.2. skorch

Colab code: skorch callbacks

skorch is not installed on Colab by default.
Install it with a single command.

!pip install skorch

3.2.1. Building a pipeline that includes the preprocessor

skorch: skorch.regressor

The get_preprocessor() function built earlier returns a preprocessing Pipeline.
We extend this Pipeline by appending a neural network built with PyTorch.
The task is a regression problem: predicting the body mass of penguins.

We wrap the PyTorch network with NeuralNetRegressor(), which skorch provides.
The loss function, optimizer, and so on are passed to NeuralNetRegressor() through parameters such as criterion and optimizer.
Here Adam is selected with optimizer=optim.Adam.

The lr=1e-3 argument that would go into Adam() in plain PyTorch is passed as optimizer__lr=1e-3 instead.
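The double-underscore naming can be pictured as a prefix split: everything that starts with optimizer__ is handed to the optimizer, and the rest stays with the wrapper. The sketch below is only an illustration of that idea in plain Python. The helper split_prefixed_kwargs is a hypothetical name introduced here and is not skorch's actual implementation.

```python
# Hypothetical helper for illustration only; this is not skorch's code.
def split_prefixed_kwargs(kwargs, prefix):
    """Separate 'prefix__name' entries from the rest of the keyword arguments."""
    marker = prefix + "__"
    routed = {k[len(marker):]: v for k, v in kwargs.items() if k.startswith(marker)}
    remaining = {k: v for k, v in kwargs.items() if not k.startswith(marker)}
    return routed, remaining


net_kwargs = {"verbose": 1, "optimizer__lr": 1e-3, "optimizer__weight_decay": 0.0}
optimizer_kwargs, other_kwargs = split_prefixed_kwargs(net_kwargs, "optimizer")

print(optimizer_kwargs)  # {'lr': 0.001, 'weight_decay': 0.0}
print(other_kwargs)      # {'verbose': 1}
```

The same pattern is what makes names such as module__ninput or callbacks__print_log__floatfmt readable: each double underscore steps one level deeper.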

import numpy as np
import skorch
from skorch import NeuralNetRegressor
from torch import optim

# wrap the PyTorch neural network with skorch
net_sk = NeuralNetRegressor(Net(), device=device, verbose=1,
                            criterion=RMSELoss,         # loss function
                            optimizer=optim.Adam,       # optimizer
                            optimizer__lr=1e-3)         # learning rate of the optimizer

# training
# note: each fit() call re-initializes the network and runs max_epochs (default 10) epochs
for epoch in range(300):
    net_sk.fit(X_train_np, y_train.astype(np.float32).values.reshape(-1, 1))

Output: the log can become long enough to trigger an out-of-memory error. If that happens, calmly refresh the page.

Alternatively, change verbose=1 to verbose=0.

... (omitted) ...

     10      271.1115      286.5908  0.0111
Re-initializing module.
Re-initializing criterion.
Re-initializing optimizer.
  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1      272.8632      286.3571  0.0108
      2      267.6815      285.4084  0.0145
      3      268.0632      284.9375  0.0106
      4      269.3830      284.9683  0.0104
      5      270.7425      285.2378  0.0101
      6      271.6377      285.5552  0.0108
      7      271.9493      285.8589  0.0082
      8      271.8214      286.1465  0.0080
      9      271.4798      286.4032  0.0087
     10      271.1165      286.5942  0.0088

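Training runs through model.fit().
Simply wrapping the network in NeuralNetRegressor is enough to use the scikit-learn API.
Note the "Re-initializing module." lines in the log: with the default settings each fit() call starts from scratch and trains for 10 epochs, so the loop above repeats a 10-epoch run 300 times rather than training for 3,000 epochs. To train longer in one run, set max_epochs and call fit() once.
Instead of X_train we passed the preprocessed X_train_np.
In place of y_train we passed y_train.astype(np.float32).values.reshape(-1, 1).
That is, the pandas Series is cast to a new dtype, its values are extracted, and the shape is changed.

Prediction also uses model.predict() instead of model.forward().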

# prediction
y_pred_train = net_sk.predict(X_train_np)
y_pred_val = net_sk.predict(X_val_np)
y_pred_test = net_sk.predict(X_test_np)

# parity plot
plot_parity3(net_sk, Xs=[X_train_np, X_val_np, X_test_np])

3.2.2. Building the ML pipeline

skorch: skorch.callbacks.InputShapeSetter
stackoverflow: How to pass input dim from fit method to skorch wrapper?

We build a Pipeline that includes get_preprocessor(), the preprocessor created earlier.
The method parameter lets us choose linear regression or random forest as well as the neural network.

Unlike those methods, the neural network depends on the input dimension.
It is needed when the network architecture is constructed.
Callbacks are a way to specify such settings. We hard-wire InputShapeSetter as the default value of callbacks.

In addition, **kwargs is forwarded to NeuralNetRegressor() so that other keyword arguments can be passed in.

# machine learning models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# embedding pytorch model in scikit-learn Pipeline
from skorch import NeuralNetRegressor
from skorch.helper import predefined_split
from skorch.callbacks import Callback

# dynamic input size of the PyTorch module
class InputShapeSetter(Callback):
    def on_train_begin(self, net, X, y):
        net.set_params(module__ninput=X.shape[1])

def get_model(method="lr", device=device, cols_cat=cols_cat, cols_num=cols_num, degree=1, 
              callbacks=[InputShapeSetter()], **kwargs):
    if method == "lr":
        ml = LinearRegression(fit_intercept=True)
    elif method == "rf":
        ml = RandomForestRegressor(random_state=rng)
    elif method == "nn":
        ml = NeuralNetRegressor(Net(), device=device, callbacks=callbacks, **kwargs)
    else:
        print("# 'method' should be in ['lr', 'rf', 'nn'].")
        return None
    
    preprocessor = get_preprocessor(cols_cat=cols_cat, cols_num=cols_num, degree=degree)
    model = Pipeline([("preprocessor", preprocessor), 
                      ("ml", ml)])
    
    return model
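To make the role of InputShapeSetter a little more concrete, here is a toy version of the hook mechanism in plain Python. ToyNet and ToyInputShapeSetter are hypothetical stand-ins written for this sketch; they only imitate the idea that a callback is notified before training starts and may change the network's parameters, and they do not reproduce skorch internals.

```python
# Toy stand-ins for illustration; these classes are hypothetical, not skorch's.
class ToyNet:
    def __init__(self, ninput=12, callbacks=()):
        self.ninput = ninput
        self.callbacks = list(callbacks)

    def set_params(self, **params):
        # the skorch-style name "module__ninput" is reduced to a plain attribute update
        for name, value in params.items():
            setattr(self, name.split("__")[-1], value)

    def fit(self, X, y):
        # every callback gets a chance to act before training starts
        for cb in self.callbacks:
            cb.on_train_begin(self, X, y)
        return self


class ToyInputShapeSetter:
    def on_train_begin(self, net, X, y):
        # number of columns in the first row = number of input features
        net.set_params(module__ninput=len(X[0]))


X = [[0.1, 0.2, 0.3, 0.4, 0.5], [0.5, 0.4, 0.3, 0.2, 0.1]]
y = [[1.0], [2.0]]
net = ToyNet(callbacks=[ToyInputShapeSetter()]).fit(X, y)
print(net.ninput)  # 5
```

The default of 12 is overwritten by the width of the data, which is why the preprocessor can change the number of features (for example through degree) without touching the network definition.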

Now an ML pipeline can be built with get_model().

Depending on the value passed to method, it selects linear regression, a tree ensemble, or a neural network.

model = get_model("nn", max_epochs=epochs, verbose=1, criterion=RMSELoss, optimizer=optim.Adam, optimizer__lr=1e-3)
model

Almost every parameter ends up in NeuralNetRegressor().

We print the parameters to see what is available.

model["ml"].get_params()

Output

{'_kwargs': {'optimizer__lr': 0.001},
 'batch_size': 128,
 'callbacks': [<__main__.InputShapeSetter at 0x7f4911406cd0>],
 'callbacks__epoch_timer': <skorch.callbacks.logging.EpochTimer at 0x7f49745b2a50>,
 'callbacks__print_log': <skorch.callbacks.logging.PrintLog at 0x7f49701261d0>,
 'callbacks__print_log__floatfmt': '.4f',
 'callbacks__print_log__keys_ignored': None,
 'callbacks__print_log__sink': <function print>,
 'callbacks__print_log__stralign': 'right',
 'callbacks__print_log__tablefmt': 'simple',
 'callbacks__train_loss': <skorch.callbacks.scoring.PassthroughScoring at 0x7f49745b2450>,
 'callbacks__train_loss__lower_is_better': True,
 'callbacks__train_loss__name': 'train_loss',
 'callbacks__train_loss__on_train': True,
 'callbacks__valid_loss': <skorch.callbacks.scoring.PassthroughScoring at 0x7f49745b24d0>,
 'callbacks__valid_loss__lower_is_better': True,
 'callbacks__valid_loss__name': 'valid_loss',
 'callbacks__valid_loss__on_train': False,
 'criterion': __main__.RMSELoss,
 'dataset': skorch.dataset.Dataset,
 'device': 'cuda:0',
 'iterator_train': torch.utils.data.dataloader.DataLoader,
 'iterator_valid': torch.utils.data.dataloader.DataLoader,
 'lr': 0.01,
 'max_epochs': 1000,
 'module': Net(
   (layer0): Linear(in_features=12, out_features=16, bias=True)
   (layer1): Linear(in_features=16, out_features=16, bias=True)
   (layer2): Linear(in_features=16, out_features=12, bias=True)
   (layer3): Linear(in_features=12, out_features=8, bias=True)
   (layer4): Linear(in_features=8, out_features=1, bias=True)
   (activation): ReLU()
 ),
 'optimizer': torch.optim.adam.Adam,
 'optimizer__lr': 0.001,
 'predict_nonlinearity': 'auto',
 'train_split': <skorch.dataset.ValidSplit object at 0x7f497434cc90>,
 'verbose': 1,
 'warm_start': False}

3.2.3. train and validate (self)

skorch: NeuralNet#train_split

We train using only X_train and y_train.

model.fit(X_train, y_train.values.reshape(-1, 1).astype(np.float32))

Output

Re-initializing module because the following parameters were re-set: module__ninput.
Re-initializing criterion.
Re-initializing optimizer.
  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1     4280.8532     4388.5093  0.0188
      2     4280.8443     4388.4990  0.0129
      
      ... (omitted) ...
      
    998      295.5285      270.6504  0.0106
    999      295.5276      270.6288  0.0153
   1000      295.5247      270.6030  0.0102

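valid_loss is printed even though no validation set was supplied.
That is because skorch sets aside 20% of the training data as a validation set by default.
At the very start, only 60% of the whole dataset was assigned to the train set.
Only 80% of that is used for training here, which is 48% in total: the model learns from less than half of the data.
Passing train_split=None uses all the training data for training, but then no validation result is printed.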
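The 48% figure is just the product of the two splits. A minimal sketch of the arithmetic, using only the fractions stated in the text:

```python
# Fractions taken from the text: 60% train split, then 20% of it is held out.
train_fraction = 0.6    # share of all data assigned to the train set
internal_valid = 0.2    # share of the train set held out for validation

effective = train_fraction * (1 - internal_valid)
print(f"{effective:.0%} of all data is used for weight updates")  # 48% ...

# With train_split=None nothing is held out, so the full 60% is used.
effective_no_split = train_fraction * (1 - 0.0)
print(f"{effective_no_split:.0%}")  # 60%
```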

The learning curve can be checked by extracting the .history attribute from the network.

history = model["ml"].history
train_loss = history[:, "train_loss"]
valid_loss = history[:, "valid_loss"]

plot_epoch(train_loss, valid_loss)
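The slice history[:, "train_loss"] reads as "all epochs, one column". As a rough mental model, the history can be thought of as a list with one dict per epoch. The sketch below imitates that with a plain list; toy_history and column are names made up for this illustration, and the numbers are copied from the last three epochs of the log above.

```python
# Toy stand-in: a plain list of per-epoch dicts, values copied from the log above.
toy_history = [
    {"epoch": 998, "train_loss": 295.5285, "valid_loss": 270.6504},
    {"epoch": 999, "train_loss": 295.5276, "valid_loss": 270.6288},
    {"epoch": 1000, "train_loss": 295.5247, "valid_loss": 270.6030},
]


def column(rows, key):
    """Rough equivalent of history[:, key]: one value per epoch."""
    return [row[key] for row in rows]


toy_train_loss = column(toy_history, "train_loss")
toy_valid_loss = column(toy_history, "valid_loss")
print(toy_train_loss)  # [295.5285, 295.5276, 295.5247]
print(toy_valid_loss)  # [270.6504, 270.6288, 270.603]
```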

Training also proceeded normally.

plot_parity3(model)

3.2.4. train and validate (predefined validation set)

skorch: skorch.dataset.Dataset

To use a validation set prepared in advance, pass it to train_split.
The validation set is built with skorch's Dataset.

While we are at it, the y data is also converted to a form that suits the ML pipeline, namely float32 with shape (-1, 1), and collected in one place.
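One detail in the code that follows is worth spelling out: the preprocessor is fitted on the train set only, and the validation set is merely transformed with it. The toy scaler below sketches why the order matters. ToyScaler is a hypothetical one-column stand-in for the scikit-learn preprocessor, and the numbers are illustrative, not the penguin data.

```python
from statistics import mean, pstdev


# Hypothetical one-column scaler, standing in for the scikit-learn preprocessor.
class ToyScaler:
    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]   # illustrative numbers
val = [3.0, 6.0]

scaler = ToyScaler().fit(train)      # statistics come from the train set only
val_scaled = scaler.transform(val)   # validation data is transformed, never fitted
print(scaler.mean_)                  # 3.0
print(val_scaled[0])                 # 0.0
```

Fitting on the validation data as well would leak its statistics into preprocessing and make the validation loss look better than it should.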

from skorch.dataset import Dataset

# ys (numpy)
y_train_np = y_train.values.reshape(-1, 1).astype(np.float32)
y_val_np = y_val.values.reshape(-1, 1).astype(np.float32)
y_test_np = y_test.values.reshape(-1, 1).astype(np.float32)

# predefined validation set
preprocessor = get_preprocessor()
X_val_pp = preprocessor.fit(X_train).transform(X_val)
valid_dataset = Dataset(X_val_pp, y_val_np)

# model training
model = get_model("nn", max_epochs=epochs, verbose=1, criterion=RMSELoss, optimizer=optim.Adam, optimizer__lr = 1e-3,
                  # predefined validation set
                  train_split=predefined_split(valid_dataset))  
model.fit(X_train, y_train_np)

Output

Re-initializing module because the following parameters were re-set: module__ninput.
Re-initializing criterion.
Re-initializing optimizer.
  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1     4302.1941     4250.5781  0.0160
      2     4302.1851     4250.5693  0.0175
      
      ... (omitted) ...
      
    998      280.9618      267.7317  0.0121
    999      280.9596      267.7302  0.0122
   1000      280.9580      267.7280  0.0141

We check the learning curve.

The learning curve obtained here will serve as the reference.

history = model["ml"].history
train_loss_0 = history[:, "train_loss"]
valid_loss_0 = history[:, "valid_loss"]

plot_epoch(train_loss_0, valid_loss_0)

We also check the parity plot.

plot_parity3(model)

3.2.5. learning rate scheduler

skorch: Learning rate schedulers
Pega Devlog: Understanding Fast.ai's fit_one_cycle method

Adding a learning rate scheduler to callbacks lets us adjust the learning rate.
InputShapeSetter() is in the callbacks default to set the input dimension.

Passing callbacks replaces that default list, so take care to keep InputShapeSetter() when adding the learning rate scheduler.

from skorch.callbacks import LRScheduler
from torch.optim.lr_scheduler import OneCycleLR

model = get_model("nn", max_epochs=epochs, verbose=1, criterion=RMSELoss, optimizer=optim.Adam, optimizer__lr = 1e-3,
                  train_split=predefined_split(valid_dataset),               # predefined validation set
                  callbacks=[# input dimension setter
                             ("input_shape_setter", InputShapeSetter()),
                             
                             # LR scheduler
                             ("lr_scheduler", LRScheduler(policy=OneCycleLR, # LR scheduler
                                                         max_lr=0.1,
                                                         total_steps=epochs))])
model.fit(X_train, y_train_np)
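To get a feel for what the one-cycle policy does to the learning rate, the schedule can be sketched in plain Python. The function below is my own reconstruction, assuming the settings I understand to be the OneCycleLR defaults (cosine annealing, pct_start=0.3, div_factor=25, final_div_factor=1e4, two phases); it is not the PyTorch source. With max_lr=0.1 and 1000 steps it gives 0.0040 for epoch 1 and 0.0277 for epoch 100, which agrees with the lr column of the training log in section 3.2.8.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Cosine one-cycle learning rate at a 0-based step (two-phase sketch)."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    peak_step = pct_start * total_steps - 1      # last step of the warm-up phase
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        # moves from `start` (pct=0) to `end` (pct=1) along half a cosine wave
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1)

    if step <= peak_step:
        return cos_anneal(initial_lr, max_lr, step / peak_step)
    pct = (step - peak_step) / (last_step - peak_step)
    return cos_anneal(max_lr, min_lr, pct)


for epoch in (1, 100, 300, 1000):
    print(epoch, round(one_cycle_lr(epoch - 1, total_steps=1000, max_lr=0.1), 4))
```

The learning rate starts at max_lr / 25, climbs to max_lr over the first 30% of the steps, then decays to almost zero. Note that optimizer__lr=1e-3 is therefore not the rate actually used once the scheduler is attached.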

We skip the training log and compare the learning curves.

ax = plot_epoch(train_loss_0, valid_loss_0)
lines = ax.lines
for line in lines:
    line.set_alpha(0.3)

history = model["ml"].history
train_loss = history[:, "train_loss"]
valid_loss = history[:, "valid_loss"]

ax = plot_epoch(train_loss, valid_loss, ax=ax)
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles, labels=labels, ncol=2, title="base      OneCycleLR")

The learning rate scheduler changed how training progresses.
The parity plot still looks good.

plot_parity3(model)

3.2.6. early stopping

skorch: skorch.callbacks.EarlyStopping

To avoid training for longer than necessary, we apply early stopping.
Add EarlyStopping() to callbacks and set the criterion.

With monitor="valid_loss" and patience=20, training stops when valid_loss has not improved for 20 consecutive epochs.

from skorch.callbacks import EarlyStopping

model = get_model("nn", max_epochs=epochs, verbose=1, criterion=RMSELoss, optimizer=optim.Adam, optimizer__lr = 1e-3,
                  train_split=predefined_split(valid_dataset),               # predefined validation set

                  callbacks=[# input dimension setter
                             ("input_shape_setter", InputShapeSetter()),

                             # LR scheduler
                             ("lr_scheduler", LRScheduler(policy=OneCycleLR, 
                                                         max_lr=0.1,
                                                         total_steps=epochs)),

                             # early stopping
                             ("early_stopping", EarlyStopping(monitor="valid_loss",
                                                              patience=20))])
model.fit(X_train, y_train.values.reshape(-1, 1).astype(np.float32))
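The patience rule itself is simple enough to write out. The sketch below is a simplified version of the idea: it counts consecutive epochs without a new best validation loss and stops when the count reaches patience. It ignores the improvement threshold that skorch's EarlyStopping also supports, the function name is made up for this post, and the loss values are illustrative.

```python
def early_stopping_epoch(valid_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    misses = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:
            best, misses = loss, 0    # improvement: reset the counter
        else:
            misses += 1               # no improvement this epoch
        if misses == patience:
            return epoch
    return None


losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.4, 3.5, 3.6, 3.7]  # illustrative values
print(early_stopping_epoch(losses, patience=3))  # 9
print(early_stopping_epoch(losses, patience=2))  # 5
```

A small patience stops at the first bump (epoch 5) and misses the better value at epoch 6, which is why a generous patience is useful when the learning curve fluctuates.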

We compare the learning curves.

ax = plot_epoch(train_loss_0, valid_loss_0)
lines = ax.lines
for line in lines:
    line.set_alpha(0.3)

history = model["ml"].history
train_loss = history[:, "train_loss"]
valid_loss = history[:, "valid_loss"]

ax = plot_epoch(train_loss, valid_loss, ax=ax)
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles, labels=labels, ncol=2, title="base      LRS+ES")
ax.set_xlim(0, 200)

The parity plot looks normal as well.

plot_parity3(model)

3.2.7. Saving and Loading (manual)

skorch: Saving and Loading

Model parameters can be saved to and loaded from a pickle-format file.

# save parameters
model["ml"].save_params(f_params="nn_params.pkl")

When a new model is created, it has to be initialized before the parameters can be loaded.

Use .initialize() for this.

# load parameters
model_new = get_model(method="nn")
model_new["ml"].initialize()
model_new["ml"].load_params(f_params="nn_params.pkl")

We load the parameters into the newly created model.

This covers only the neural network parameters, so the preprocessor of the existing model is reused.

# check reproducibility
y_pred_train = model_new["ml"].predict(model["preprocessor"].transform(X_train))
y_pred_val = model_new["ml"].predict(model["preprocessor"].transform(X_val))
y_pred_test = model_new["ml"].predict(model["preprocessor"].transform(X_test))

plot_parity3(model_new, preds=[y_pred_train, y_pred_val, y_pred_test])

The parity plot is reproduced exactly, without any training.

3.2.8. Saving and Loading (callbacks)

skorch: Saving and Loading Using Callbacks

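With callbacks, the model can be saved every time the valid loss decreases.
We specify both cp, which saves whenever the valid loss improves, and train_end_cp, which saves once when training ends.

Both are set to write into a folder named exp1.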

from skorch.callbacks import Checkpoint, TrainEndCheckpoint

# save the model parameters, optimizer, and history
cp = Checkpoint(dirname='exp1')
train_end_cp = TrainEndCheckpoint(dirname='exp1')

model = get_model("nn", max_epochs=epochs, verbose=1, criterion=RMSELoss, optimizer=optim.Adam, optimizer__lr = 1e-3,
                  train_split=predefined_split(valid_dataset),                               

                  callbacks=[# input dimension setter
                             ("input_shape_setter", InputShapeSetter()),

                             # LR scheduler
                             ("lr_scheduler", LRScheduler(policy=OneCycleLR, 
                                                         max_lr=0.1,
                                                         total_steps=epochs)),

                             # early stopping
                             ("early_stopping", EarlyStopping(monitor="valid_loss",
                                                              patience=20)),
                             
                             # Checkpoints
                             ("checkpoint", cp),
                             ("train_end_checkpoint", train_end_cp)
                             ])
model.fit(X_train, y_train.values.reshape(-1, 1).astype(np.float32))

Output

Re-initializing module because the following parameters were re-set: module__ninput.
Re-initializing criterion.
Re-initializing optimizer.
  epoch    train_loss    valid_loss    cp      lr     dur
-------  ------------  ------------  ----  ------  ------
      1     4301.9097     4250.2661     +  0.0040  0.0093
      2     4301.8677     4250.2241     +  0.0040  0.0107
      3     4301.8278     4250.1914     +  0.0040  0.0083
      4     4301.7949     4250.1577     +  0.0040  0.0111
      
      ... (omitted) ...
      
     96      294.3520      265.7108        0.0260  0.0112
     97      296.6454      273.7495        0.0264  0.0098
     98      290.3884      289.0570        0.0268  0.0071
     99      289.6089      273.4566        0.0273  0.0093
    100      284.6194      261.8893     +  0.0277  0.0092
    101      290.8631      262.4448        0.0281  0.0107
    102      289.3116      275.2869        0.0286  0.0077
    103      286.5198      278.3972        0.0290  0.0075
    104      283.5437      266.5125        0.0295  0.0106

A new column named cp has appeared, with a + on some rows.
These are the epochs where valid_loss fell below the previous best.
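The + marks amount to a running minimum over the validation loss. A minimal sketch of that bookkeeping, with a made-up function name and illustrative loss values:

```python
def checkpoint_events(valid_losses):
    """Flag each epoch whose validation loss beats every earlier epoch."""
    best = float("inf")
    events = []
    for loss in valid_losses:
        improved = loss < best
        if improved:
            best = loss
        events.append(improved)
    return events


losses = [4.0, 3.0, 3.2, 2.9, 3.1, 2.9]  # illustrative values
marks = ["+" if e else " " for e in checkpoint_events(losses)]
print(marks)  # ['+', '+', ' ', '+', ' ', ' ']
```

Only a strictly lower loss counts, so the last epoch, which merely ties the best value, gets no mark in this sketch.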

The History is also loaded from file and plotted.
skorch.history.History is used to load it.

The code gets a little involved because three learning curves are overlaid.

from skorch.history import History

# base plot
ax = plot_epoch(train_loss_0, valid_loss_0)
lines = ax.lines
for line in lines:
    line.set_alpha(0.3)

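# history
# assuming skorch's default file names: Checkpoint writes history.json each time it saves,
# while TrainEndCheckpoint writes train_end_history.json at the end of training
history = History.from_file("./exp1/history.json")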
train_loss = history[:, "train_loss"]
valid_loss = history[:, "valid_loss"]
ax = plot_epoch(train_loss, valid_loss, ax=ax)

# event_cp : cp == True
epoch = history[:, "epoch"]
event_cp = history[:, "event_cp"]
df_cp = pd.DataFrame({"epoch":epoch, "event_cp":event_cp, "train_loss":train_loss, "valid_loss":valid_loss})
df_cp = df_cp.loc[df_cp["event_cp"]==True]

ax.scatter(df_cp["epoch"], df_cp["train_loss"], fc=c_train, alpha=0.5, label="train_cp")
ax.scatter(df_cp["epoch"], df_cp["valid_loss"], fc=c_val, alpha=0.5, label="valid_cp")

ax.legend(ncol=3, title="base      LRS+ES           checkpoint", loc="center right")
ax.set_xlim(0, 200)

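The faint curves are the baseline trained in skorch with the predefined validation set.
The run with the learning rate scheduler and early stopping is labeled LRS+ES.
As before, training finished much earlier.
The epochs where a checkpoint was saved are shown as a scatter plot.
Even while the learning curve fluctuates under the one-cycle policy, the checkpointed points show a monotonic decrease in both train and valid loss.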

3.2.9. Loading the checkpoint with the lowest valid loss and lowering the learning rate

skorch: skorch.callbacks.LoadInitState

Besides the history, each checkpoint stores the state of the criterion, optimizer, parameters, and so on.
Click the folder icon on the left side of Colab to see the saved files.
When training goes too far and overfits, one misses the optimum that was passed along the way.
It seems better to reload that optimum and continue training gently with a much lower learning rate.

skorch's LoadInitState can be used for this.
cp = Checkpoint() specifies where the state was saved,
and load_state = LoadInitState(cp) loads that state.

Finally, add cp and load_state to callbacks.

from skorch.callbacks import LoadInitState

cp = Checkpoint(dirname='exp1')
load_state = LoadInitState(cp)

model = get_model("nn", max_epochs=epochs, verbose=1, criterion=RMSELoss, optimizer=optim.Adam, 
                  # lower the learning rate
                  optimizer__lr = 1e-5,

                  # predefined validation set
                  train_split=predefined_split(valid_dataset),                               

                  callbacks=[# input dimension setter
                             ("input_shape_setter", InputShapeSetter()),

                             # early stopping
                             ("early_stopping", EarlyStopping(monitor="valid_loss",
                                                              patience=100)),
                             
                             # Checkpoints
                             ("checkpoint", cp),
                             ("load_initial_state", load_state)
                             ])
model.fit(X_train, y_train.values.reshape(-1, 1).astype(np.float32))

3.2.10. Saving and Loading (model itself)

pickle is the recommended way to save the whole model.

The model is written and read with pickle.dump() and pickle.load().

import pickle

# saving
with open('skorch_dl.pkl', 'wb') as f:
    pickle.dump(model, f)
    
# loading
with open('skorch_dl.pkl', 'rb') as f:
    model_pkl = pickle.load(f)

The loaded model is given the new name model_pkl.
We draw its parity plot to confirm that it was saved and loaded correctly.

# check reproducibility
plot_parity3(model=model_pkl)

The original performance is confirmed without any further training.

4. Summary: skorch ML pipeline

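Over these three posts we cleaned the data,
built a scikit-learn preprocessor,
constructed a PyTorch neural network,
tied them together with skorch, and sprinkled in various options through callbacks.

The final code is scattered across the posts, which may make it hard to reuse.
The skorch ML pipeline code is collected below.
The goal is to use it with Ctrl+C/V and a few small modifications.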

4.1. skorch ML pipeline

scikit-learn preprocessor

# preprocessors
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

# pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin

# Preprocessings for Categorical and Numerical features
def get_concat(cols_cat=cols_cat, cols_num=cols_num, degree=1):
    # categorical features: one-hot encoding
    cat_features = cols_cat
    # note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`
    cat_transformer = OneHotEncoder(sparse=False, handle_unknown="ignore")

    # numerical features: standard scaling & polynomial features
    num_features = cols_num
    num_transformer = Pipeline(steps=[("polynomial", PolynomialFeatures(degree=degree)),
                                      ("scaler", StandardScaler())])
    
    numcat = ColumnTransformer(transformers=[("categorical", cat_transformer, cat_features),
                                          ("numerical", num_transformer, num_features)])
    return numcat

# Float64 to Float32 for PyTorch
class FloatTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array(X, dtype=np.float32)


# preprocessing Pipeline
def get_preprocessor(cols_cat=cols_cat, cols_num=cols_num, degree=1):
    concat = get_concat(cols_cat=cols_cat, cols_num=cols_num, degree=degree)
    ft = FloatTransformer()

    pipeline= Pipeline(steps=[("concat", concat), 
                              ("float64to32", ft)])
    return pipeline

PyTorch Neural Network

import torch
from torch import nn
from torch import optim

# neural network: ninput(12)-16-16-12-8-1
class Net(nn.Module):
    def __init__(self, ninput=12):
        super().__init__()
        self.layer0 = nn.Linear(ninput, 16)
        self.layer1 = nn.Linear(16, 16)
        self.layer2 = nn.Linear(16, 12)
        self.layer3 = nn.Linear(12, 8)
        self.layer4 = nn.Linear(8, 1)
        self.activation = nn.ReLU()

    def forward(self, x):
        x = self.activation(self.layer0(x))
        x = self.activation(self.layer1(x))
        x = self.activation(self.layer2(x))
        x = self.activation(self.layer3(x))
        x = self.layer4(x)
        return x

# loss: RMSE
class RMSELoss(nn.Module):
    def __init__(self, eps=1e-6):
        super().__init__()
        self.mse = nn.MSELoss()
        self.eps = eps
    def forward(self, pred, true):
        # skorch calls criterion(y_pred, y_true); MSE is symmetric, so the value is unchanged
        loss = torch.sqrt(self.mse(pred, true) + self.eps)
        return loss

skorch ML Pipeline

# machine learning models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# embedding pytorch model in scikit-learn Pipeline
from skorch import NeuralNetRegressor
from skorch.helper import predefined_split

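# callbacks
from skorch.callbacks import Callback
from skorch.callbacks import LRScheduler
from skorch.callbacks import EarlyStopping
from skorch.callbacks import Checkpoint, TrainEndCheckpoint

# used by the default callbacks and by the test run
from torch.optim.lr_scheduler import OneCycleLR
from skorch.dataset import Dataset
from skorch.history import History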

# dynamic input size of the PyTorch module
class InputShapeSetter(Callback):
    def on_train_begin(self, net, X, y):
        net.set_params(module__ninput=X.shape[1])

# save the model parameters, optimizer, and history
cp = Checkpoint(dirname='exp_test')
train_end_cp = TrainEndCheckpoint(dirname='exp_test')

# skorch ML pipeline
def get_model(method="lr", device=device, cols_cat=cols_cat, cols_num=cols_num, degree=1, 
              callbacks=[("input_shape_setter", InputShapeSetter()),
                         ("lr_scheduler", LRScheduler(policy=OneCycleLR, max_lr=0.1, total_steps=epochs)),
                         ("early_stopping", EarlyStopping(monitor="valid_loss", patience=20)),
                         ("checkpoint", cp), ("train_end_checkpoint", train_end_cp)], 
              **kwargs):
    
    if method == "lr":
        ml = LinearRegression(fit_intercept=True)
    elif method == "rf":
        ml = RandomForestRegressor(random_state=rng)
    elif method == "nn":
        ml = NeuralNetRegressor(Net(), device=device, callbacks=callbacks, **kwargs)
    else:
        print("# 'method' should be in ['lr', 'rf', 'nn'].")
        return None
    
    preprocessor = get_preprocessor(cols_cat=cols_cat, cols_num=cols_num, degree=degree)
    model = Pipeline([("preprocessor", preprocessor), 
                      ("ml", ml)])
    
    return model

4.2. Visualizations

learning curve

def plot_epoch(history=None, loss_trains=None, loss_vals=None, ax=None):
    
    # note: the first positional argument is `history`; pass loss lists by keyword
    if history is None and loss_trains is None:
        print("# one of 'history' and 'loss_trains' has to be used!")
        return ax
    
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 5))

    if loss_trains is None:
        loss_trains = history[:, "train_loss"]

    if history is not None and loss_vals is None:
        loss_vals = history[:, "valid_loss"]

    ax.plot(list(range(1, len(loss_trains)+1)), loss_trains, c=c_train, label="train")
    if loss_vals is not None:
        ax.plot(list(range(1, len(loss_vals)+1)), loss_vals, c=c_val, label="valid")
    ax.grid(axis="y")
    ax.set_xlabel("epochs", fontdict=font_label)
    ax.legend()

    return ax

parity plots

def plot_parity3(model, target=["train", "val", "test"], figsize=(10, 4),
                 Xs=None, trues=None, preds=None, colors=None):
    if not Xs:
        Xs = [eval(f"X_{t}") for t in target]
    if not trues:
        trues = [eval(f"y_{t}") for t in target]
    if not preds:
        preds = [model.predict(X) for X in Xs]
    if not colors:
        colors = [eval(f"c_{t}") for t in target]

    fig, axs = plt.subplots(ncols=len(target), figsize=figsize, constrained_layout=True)
    # `titles` was never defined inside the function; the target names serve as panel titles
    for ax, true, pred, c, title in zip(axs, trues, preds, colors, target):
        plot_parity(true, pred, ax=ax, scatter_kws={"fc":c, "ec":c, "alpha":0.5}, title=title)
        if ax != axs[0]:
            ax.set_ylabel("")

4.3. test run

We run an example with the functions defined above.

# predefined validation set
preprocessor = get_preprocessor()
X_val_pp = preprocessor.fit(X_train).transform(X_val)
valid_dataset = Dataset(X_val_pp, y_val_np)

# ML pipeline preparation
model_test = get_model("nn", max_epochs=epochs, verbose=1, criterion=RMSELoss, optimizer=optim.Adam, optimizer__lr = 1e-3,
                       train_split=predefined_split(valid_dataset))
model_test.fit(X_train, y_train_np)

# learning curve
history = History.from_file("./exp_test/history.json")
ax = plot_epoch(history)

# parity plots
plot_parity3(model_test)

Output

Re-initializing module because the following parameters were re-set: module__ninput.
Re-initializing criterion.
Re-initializing optimizer.
  epoch    train_loss    valid_loss    cp      lr     dur
-------  ------------  ------------  ----  ------  ------
      1     4301.9874     4250.3501     +  0.0040  0.0135
      2     4301.9537     4250.3169     +  0.0040  0.0139
      
      ... (omitted) ...
      
    110      283.9839      270.7625        0.0322  0.0108
    111      282.3574      270.6689        0.0326  0.0145
Stopping since valid_loss has not improved in the last 20 epochs.

It runs well. :)
The code for running everything end to end is here: Notebook
