Model complexity definition
Given a functional relationship between d and y, defined by the joint probability distribution P(d, y), the optimal solution c is given by
c(d) = E[y | d].
We seek ways to estimate the goodness of a family of models F_w(d).
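As a short reminder of why the conditional mean is the optimal predictor under squared error (a standard decomposition, added here for completeness; f denotes an arbitrary candidate predictor):

```latex
\mathbb{E}_{d,y}\!\left[(y - f(d))^2\right]
  = \mathbb{E}_{d,y}\!\left[(y - \mathbb{E}[y \mid d])^2\right]
  + \mathbb{E}_{d}\!\left[(\mathbb{E}[y \mid d] - f(d))^2\right]
```

The first (noise) term does not depend on f, so the error is minimized exactly when f(d) = c(d) = E[y | d].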
How about this error surface:
- While the surface is very smooth, it is practically impossible to reach the true minimum.
- This suggests that complexity penalties based on smoothness may be misleading.
- Breiman (1992) has shown that even for simple problems and simple nonlinear models, the degree of generalization depends strongly on the stability of the parameters.
Standard approach:
- Optimize the likelihood or mean squared error jointly with a complexity penalty (a minimal sketch follows this list).
- Some penalties: norm of the weight vector, smoothness, number of terminal leaves (in CART), variance of the weights, cross-validation, etc.
- Spend most of the computational time on optimizing the parameter solution via sophisticated gradient-descent methods, or even global-minimum-seeking methods.
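To make the first bullet concrete, here is a minimal sketch of penalized fitting, assuming an L2 (weight-norm) penalty on a linear model; the function name and data are illustrative only:

```python
import numpy as np

def fit_penalized(X, y, lam=0.1, lr=0.01, steps=2000):
    """Gradient descent on MSE plus an L2 (weight-norm) complexity penalty."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(steps):
        resid = X @ w - y
        grad = 2 * X.T @ resid / n + 2 * lam * w   # d/dw [MSE + lam * ||w||^2]
        w -= lr * grad
    return w

# toy usage: noisy linear data; the penalty shrinks the irrelevant coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)
print(fit_penalized(X, y, lam=0.1))
```

The same pattern applies to the other penalties: only the penalty term in the gradient changes.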
Alternative approach:
- Support vectors: create a representation that leads to a very simple optimization algorithm (see the sketch below).
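As an illustration of how simple the resulting optimization can be (a hedged sketch of a linear soft-margin SVM trained by subgradient descent, not the exact algorithm referred to in the slides):

```python
import numpy as np

def fit_linear_svm(X, y, lam=0.01, lr=0.1, steps=1000):
    """Subgradient descent on the primal SVM objective:
    lam * ||w||^2 + mean(max(0, 1 - y * (X @ w + b))), with labels y in {-1, +1}."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        active = margins < 1                       # points inside the margin
        grad_w = 2 * lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy usage: two separable clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([-1.0] * 50 + [1.0] * 50)
w, b = fit_linear_svm(X, y)
print(np.mean(np.sign(X @ w + b) == y))            # training accuracy
```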
Affecting the error surface: Hybrid Unsupervised/Supervised Architecture
Combined classification/reconstruction architecture
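One way to read the combined architecture (an illustrative sketch under my own assumptions, not the exact architecture from the slides): a shared hidden code drives both a reconstruction head (unsupervised) and a classification head (supervised), and the training objective is a weighted sum of the two losses, which reshapes the error surface relative to classification alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, params):
    """Shared hidden code h feeds both the reconstruction and the class scores."""
    h = np.tanh(x @ params["W_enc"])       # shared bottleneck code
    x_rec = h @ params["W_dec"]            # reconstruction head
    logits = h @ params["W_cls"]           # classification head
    return h, x_rec, logits

def combined_loss(x, y_onehot, params, alpha=0.5):
    """alpha weighs reconstruction (unsupervised) against classification (supervised)."""
    _, x_rec, logits = forward(x, params)
    rec = np.mean((x - x_rec) ** 2)
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    ce = -np.mean(np.log(p[np.arange(len(x)), y_onehot.argmax(axis=1)] + 1e-12))
    return alpha * rec + (1 - alpha) * ce

# toy usage with random data and random weights
d_in, d_hid, n_cls = 10, 4, 3
params = {"W_enc": 0.1 * rng.normal(size=(d_in, d_hid)),
          "W_dec": 0.1 * rng.normal(size=(d_hid, d_in)),
          "W_cls": 0.1 * rng.normal(size=(d_hid, n_cls))}
x = rng.normal(size=(32, d_in))
y = np.eye(n_cls)[rng.integers(0, n_cls, size=32)]
print(combined_loss(x, y, params))
```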
Remarks on optimal code
- In 1961, Horace Barlow suggested the bottleneck approach for finding an optimal neuronal code and for redundancy reduction.
- This was later supported by Atick and Redlich (1992) for visual pathway preprocessing.
- Barlow later suggested that neurons should carry out an additional task:
- While neurons should transmit the probability of occurrence of the features they learn to detect,
- they should also relay the prior probability of the features they detect.
- This follows from his seminal work on minimal entropy codes and unsupervised learning (Barlow, 1989).
Mathematical details
Addressing the computational complexity
- An ensemble of experts can give a rough estimate of the posterior distribution (a toy sketch follows this list).
- Start with a narrow Gaussian distribution in weight space.
- Start with a mixture-of-Gaussians weight distribution (Nowlan and Hinton, 1992), or with a distribution that is set by other means, such as PCA.
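A toy sketch of the ensemble idea, assuming my own illustrative setup (linear members trained by gradient descent from narrow Gaussian starting points); the spread of the ensemble's predictions serves as a rough posterior estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

def train_member(X, y, w0, lr=0.05, steps=30):
    """One ensemble member: gradient descent on MSE from its own starting point w0."""
    w = w0.copy()
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + 0.2 * rng.normal(size=100)

# Each member starts from a narrow Gaussian in weight space. With non-convex models
# different starts reach different local minima, which is what makes the spread
# informative; this convex toy only keeps diversity because training stops early.
ensemble = np.stack([train_member(X, y, w0=0.01 * rng.normal(size=3))
                     for _ in range(20)])

x_new = rng.normal(size=3)
preds = ensemble @ x_new
print(preds.mean(), preds.std())   # spread ~ rough posterior uncertainty
```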
Model selection properties of the MDL approach
- Prefers models with wide basins of attraction that lead to admissible local minima, i.e., reduced sensitivity to the model parameters (flat minima); see the probe after this list.
- It prefers non-identifiable models, namely models with many
local minima.
- While smoothness is a key issue, it rejects models with a golf-course type of smoothness, where the local minimum is difficult to find.
- Results in an ensemble of experts that is more robust to weight
changes, and to partial damage to the ensemble.
- The large ensemble can give some approximation to the posterior
probability of weights.
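A minimal probe of the "reduced sensitivity to the model parameters" point (my own illustration, not from the slides): measure how much the loss increases when trained weights are perturbed; smaller increases indicate a flatter, more robust minimum.

```python
import numpy as np

rng = np.random.default_rng(2)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

def flatness_probe(w, X, y, sigma=0.05, trials=200):
    """Average increase in loss under small Gaussian weight perturbations;
    small values indicate a flat (robust) minimum."""
    base = mse(w, X, y)
    bumps = [mse(w + sigma * rng.normal(size=w.shape), X, y) - base
             for _ in range(trials)]
    return np.mean(bumps)

# toy usage: a least-squares solution on noisy linear data
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -1.0, 0.0, 2.0]) + 0.1 * rng.normal(size=200)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(flatness_probe(w_star, X, y))
```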
Further properties of the MDL approach
- A consequence of the smoothness and the wide basin of attraction is that a simple learning rule should suffice.
- Most of the computation time is spent on finding a good architecture, not on finding a good solution.
- Finding a good architecture can be done in parallel, making good use of the multitude of neurons.
Estimating the decoding complexity of a given representation (code)
- So far we have discussed the complexity of the model (learning).
- Following the above formulation, estimating code complexity is a natural extension:
- Given the information that is required to be extracted from the code, one can build a scheme that attempts to extract that information, and estimate the complexity of that scheme as described above.
Related work