Combined Regularisation Techniques for Artificial Neural Networks
About
BSc Theoretical Physics thesis written at the Department of Astronomy and Theoretical Physics, Lund University.
Abstract
Artificial neural networks are prone to overfitting – the process of learning details specific to a particular training data set. Success in preventing overfitting by combining the L2 and dropout regularisation techniques has led to the combination's recent popularity. However, with the introduction of each additional regularisation technique to an artificial neural network, there come new hyperparameters which must be tuned in an increasingly complex and computationally expensive manner. Motivated by L2's action as a Gaussian prior on the loss function, we hypothesise an analytic relation for the optimal L2 strength's dependence on the number of training patterns. Conducted on an artificial neural network composed of a single hidden layer, this systematic study tests the hypothesis for the optimal L2 strength and considers what interactions the additional involvement of dropout and early stopping may have on the relation. On an otherwise static problem and network calibration, the results of this thesis suggest that the hypothesis holds within a valid working region. The results are useful guides for the choice of L2 strength, drop rate and early stopping usage, and give promise that the predictor may find real-world applications.
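For the technically inclined reader, the sketch below recalls the standard correspondence between an L2 penalty and a Gaussian prior on the weights, which is the motivation mentioned in the abstract. The notation and the summed-over-patterns loss convention are illustrative assumptions here, not the thesis's exact formulation.

```latex
% Illustrative sketch (amsmath), not the thesis's exact notation: an L2 penalty
% with strength \lambda acts as a zero-mean Gaussian prior on the weights
% \omega under maximum a posteriori training.
\begin{align}
  E(\omega) &= \sum_{n=1}^{p} E_n(\omega) + \frac{\lambda}{2}\lVert\omega\rVert^2, \\
  \exp\!\Bigl(-\tfrac{\lambda}{2}\lVert\omega\rVert^2\Bigr)
    &\propto \mathcal{N}\bigl(\omega \mid 0,\ \lambda^{-1} I\bigr),
\end{align}
% so minimising E(\omega) is maximum a posteriori estimation under this prior.
% The thesis hypothesises how the optimal \lambda scales with the number of
% training patterns p; the exact relation and its derivation are given in the
% thesis itself.
```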
Contribution
As the hypothesis could not be studied sufficiently in standard Keras code, the first part of the project consisted of coding our own ANN. I was responsible for the stochastic data generation, the implementation of dropout, and the statistics calculations. Additionally, I led the team's large-scale neural network training via remote access to computer clusters.
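The project code itself is not reproduced in this description, but as a rough, hypothetical illustration of the dropout part of such an implementation, the NumPy sketch below applies inverted dropout to the activations of a single hidden layer. The function name, drop rate value and choice of inverted scaling are assumptions rather than the project's actual code.

```python
import numpy as np

def dropout_forward(h, drop_rate, rng, training=True):
    """Apply inverted dropout to hidden-layer activations h (illustrative sketch).

    During training each unit is zeroed with probability drop_rate and the
    survivors are rescaled by 1 / (1 - drop_rate), so that no rescaling is
    needed at prediction time.
    """
    if not training or drop_rate == 0.0:
        return h
    keep_prob = 1.0 - drop_rate
    mask = rng.random(h.shape) < keep_prob  # Bernoulli keep/drop mask per unit
    return h * mask / keep_prob

# Example: dropout applied to the activations of a single hidden layer.
rng = np.random.default_rng(seed=0)
hidden = np.tanh(rng.standard_normal((4, 16)))  # 4 patterns, 16 hidden units
dropped = dropout_forward(hidden, drop_rate=0.5, rng=rng)
```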
Popular Science Description
Artificial Intelligence’s (AI’s) potential for incredible state-of-the-art performance has not gone unnoticed; from medicine to military, the interest of all manner of fields has been piqued [1]. This has encouraged the rapid integration of AI into our everyday lives [2]. However, in the recent swarm of industrial excitement, whilst new applications have taken the limelight, rigour and understanding have begun to lag behind. By shining light on a popular choice of mechanisms which assist in the training of AI, known as dropout, L2 and early stopping, my study aimed to be a small step towards designing AI in a more informed and understood manner.
Artificial Neural Networks (ANNs) are a collection of computational architectures inspired by the brain; they are currently the most realised form of AI. If an ANN is insufficient in size, it will lack the capacity to solve even the simplest of problems. However, if an ANN is too large, that excess capacity seldom lies dormant. Instead, in a process known as overfitting, the ANN tends to learn undesirable peculiarities of a data set, such as fuzzy noise. This, in turn, can result in an ANN that generalises poorly to new data – a tendency to perform inadequately on previously unseen variations of the same underlying problem [3].
Driven by a desire to suppress overfitting, a variety of so-called regularisation techniques have been developed. L2, dropout and early stopping are common choices. In particular, L2 and dropout have recently received praise and popularity for providing good results when applied in conjunction [4, 5]. Though regularisation techniques offer significant benefits – often being of practical necessity – their implementation does not come without cost. Notably, both L2 and dropout have associated values controlling their strengths, each of which must be exhaustively fine-tuned to the specific problem and chosen ANN architecture [6].
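As a concrete, purely illustrative picture of what these strength values look like in practice, the hypothetical Keras sketch below combines L2, dropout and early stopping on a single-hidden-layer network. The layer size, l2_strength and drop_rate values are placeholders and do not correspond to the calibration used in the study, which relied on its own ANN code.

```python
# Hypothetical sketch of the three techniques combined on a single-hidden-layer
# network; values are placeholders, not the thesis's calibration.
import tensorflow as tf
from tensorflow.keras import layers, regularizers, callbacks

l2_strength = 1e-4   # the L2 "strength" hyperparameter
drop_rate = 0.2      # the dropout "drop rate" hyperparameter

model = tf.keras.Sequential([
    layers.Dense(32, activation="tanh",
                 kernel_regularizer=regularizers.l2(l2_strength)),
    layers.Dropout(drop_rate),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Early stopping halts training once the validation error stops improving.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=500, callbacks=[early_stop])
```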
To guide what can become a lengthy and troublesome process of trial and error, my study aimed to test a hypothesised predictor for the optimal L2 strength. The predictor proposed that the optimal L2 strength is proportional to the amount of available training data. The effects on the optimal L2 strength of using L2 in conjunction with both the dropout and early stopping regularisation techniques were then observed.
The results, which suggest that the predictor is successful within a suitable working region, have helped to improve understanding of the interactions between these combined regularisation techniques. The predictor also shows promise of finding real-world use through extrapolation to situations with many training patterns, which would otherwise rely upon a time-consuming hyperparameter search.