Techniques

  • Grid Search: Exhaustively searches through all possible combinations of specified hyperparameters. Simple but can be very slow.
  • Random Search: Randomly samples combinations of hyperparameters. Often more efficient than grid search in high-dimensional spaces.
  • Bayesian Optimization: Builds a probabilistic model to predict performance and chooses hyperparameters based on expected improvement.
  • Tree-structured Parzen Estimator (TPE): Used in Hyperopt. Separates good and bad configurations and samples based on those distributions.
  • Genetic Algorithms / Evolutionary Search: Inspired by natural evolution. Evolves a population of hyperparameters through selection, mutation, and crossover.
  • Successive Halving / Hyperband: Allocates resources dynamically, pruning poor configurations early and focusing on promising ones.
  • Optuna: A modern, lightweight framework for hyperparameter tuning using TPE and pruning. Fast, flexible, and easy to use.
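As a minimal sketch of the random-search idea above (the hyperparameter names, sampling ranges, and the toy objective are made up for illustration; in practice the objective would train a model and return a validation metric):

```python
import math
import random

def objective(lr, weight_decay):
    # Toy stand-in for validation loss: minimized near lr=1e-3, wd=1e-5.
    return (math.log10(lr) + 3) ** 2 + (math.log10(weight_decay) + 5) ** 2

random.seed(0)
best = None
for _ in range(50):
    # Sample scale-type hyperparameters log-uniformly, as is common.
    lr = 10 ** random.uniform(-6, -1)
    weight_decay = 10 ** random.uniform(-8, -2)
    score = objective(lr, weight_decay)
    if best is None or score < best[0]:
        best = (score, lr, weight_decay)

print(best)  # (best score, best lr, best weight_decay)
```

Unlike grid search, each trial here explores a fresh value of every hyperparameter, which is why random search tends to win in high-dimensional spaces where only a few dimensions matter.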

Tips and Tricks

When possible, the best approach is to start from an already trained model and leave its hyperparameters as they are.

If we need to build a new model from scratch, there are some steps to follow:

  1. Check the loss. For a softmax classifier with C classes, the initial loss should be close to -log(1/C) = log(C).
  2. Try to overfit a small sample of the training data (5 to 10 mini-batches), changing the hyperparameters and/or the architecture until you reach 100% training accuracy. This is useful because it is extremely fast to test; training on the entire dataset every time would take too long.
  3. Find a learning rate that makes the loss go down significantly during the first 100 iterations. Do this on all the training data; since we only run 100 iterations, this is also fast.
  4. Choose some sets of hyperparameters which worked in the previous steps and train a few models for 1 to 5 epochs.
  5. Pick the best models from the previous step and train them for even longer (10 to 20 epochs), without using learning rate decay.
  6. We can now look at the loss and accuracy curves. Analysing the accuracy curves:
    1. If train and val follow the same trend with a gap between the two, it’s all good: keep training.
    2. If training accuracy goes up while validation accuracy goes down, the model is overfitting. Increase regularization.
    3. If train and validation curves are very close together, you are underfitting. Try training for longer or using a bigger model.
  7. Go back to step 5 and iterate until you have good results.
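The initial-loss sanity check in step 1 can be verified with a couple of lines (the function name and the CIFAR-10 example are just illustrative; plug in your own class count):

```python
import math

def expected_initial_loss(num_classes):
    # With random weights, a softmax classifier predicts roughly uniform
    # probabilities, so the cross-entropy loss starts near -log(1/C) = log(C).
    return math.log(num_classes)

# For a 10-class problem (e.g. CIFAR-10) expect an initial loss near log(10) ≈ 2.303.
print(expected_initial_loss(10))
```

If the loss you observe at iteration 0 is far from this value, suspect a bug in the loss computation, the label encoding, or the weight initialization before tuning anything else.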

Before deep learning, it was preferable for the model to have just the right amount of capacity, no more and no less. With the advent of deep learning, we instead push the model’s capacity as high as possible and then use regularization to avoid overfitting the data.