15  Miscellaneous Hyperparameters

For the optimizer I kept what was already there: AdamW.
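For concreteness, this is roughly what that looks like in PyTorch; the model here is a hypothetical stand-in, and the values in the comment are PyTorch's defaults, not necessarily the run's exact settings:

```python
import torch

# Hypothetical stand-in for whatever model was already being trained.
model = torch.nn.Linear(128, 10)

# AdamW with PyTorch's defaults: lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-2.
optimizer = torch.optim.AdamW(model.parameters())
```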

I used different learning rates over the course of the previous training, but the general pattern was this:

The higher learning rate at the beginning speeds up training, while the lower learning rate at the end fine-tunes the model.
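As a sketch of that high-then-low profile, here is a common warmup-plus-cosine-decay schedule in PyTorch; the step counts and peak rate are illustrative assumptions, not the values from my runs:

```python
import math
import torch

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # peak LR (illustrative)

warmup_steps = 1_000   # illustrative
total_steps = 100_000  # illustrative

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak LR, then cosine decay toward zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop, call optimizer.step() and then scheduler.step()
# once per batch so the multiplier moves along the warmup/decay curve.
```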

I slightly decreased β1 to complement the lower learning rate; a smaller β1 shortens the gradient momentum window, so updates follow recent gradients more closely.

I decreased weight decay as well, to stay consistent with the lower learning rate.
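Put together, the tweaks look something like this in PyTorch; the specific numbers are illustrative assumptions, not the exact values used:

```python
import torch

model = torch.nn.Linear(128, 10)

# Beta1 nudged down from the 0.9 default and weight decay down from the
# 1e-2 default; the exact values here are illustrative guesses.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,              # the lower end-of-training learning rate (illustrative)
    betas=(0.85, 0.999),  # beta1 slightly below the 0.9 default
    weight_decay=5e-3,    # slightly below the 1e-2 default
)
```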