Model complexity definition
Given a functional relationship between d and y, defined by the joint probability distribution P(d, y), the optimal solution c is given by
c(d) = E[y | d].
We seek ways to estimate the goodness of a family of models F_w(d).
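As a short reminder of why the conditional mean is the optimal predictor under squared error (a standard decomposition, added here for completeness; f denotes an arbitrary candidate predictor):

```latex
\mathbb{E}_{d,y}\!\left[(y - f(d))^2\right]
  = \mathbb{E}_{d,y}\!\left[(y - \mathbb{E}[y \mid d])^2\right]
  + \mathbb{E}_{d}\!\left[(\mathbb{E}[y \mid d] - f(d))^2\right]
```

The first (noise) term does not depend on f, so the error is minimized exactly when f(d) = c(d) = E[y | d].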
How about this error surface:
- While the surface is very smooth, it is practically impossible to reach the true minimum.
- This suggests that complexity penalties based on smoothness may be misleading.
- Breiman (1992) has shown that even for simple problems and simple nonlinear models, the degree of generalization depends strongly on the stability of the parameters.
Standard approach:
- Optimize the likelihood or mean squared error jointly with a complexity penalty (a minimal sketch follows this list).
- Some penalties: norm of the weight vector, smoothness, number of terminal leaves (in CART), variance of the weights, cross-validation, etc.
- Spend most of the computational time on optimizing the parameter solution via sophisticated gradient-descent methods, or even global-minimum-seeking methods.
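To make the first bullet concrete, here is a minimal sketch of penalized fitting, assuming an L2 (weight-norm) penalty on a linear model; the function name and data are illustrative only:

```python
import numpy as np

def fit_penalized(X, y, lam=0.1, lr=0.01, steps=2000):
    """Gradient descent on MSE plus an L2 (weight-norm) complexity penalty."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(steps):
        resid = X @ w - y
        grad = 2 * X.T @ resid / n + 2 * lam * w   # d/dw [MSE + lam * ||w||^2]
        w -= lr * grad
    return w

# toy usage: noisy linear data; the penalty shrinks the irrelevant coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)
print(fit_penalized(X, y, lam=0.1))
```

The same pattern applies to the other penalties: only the penalty term in the gradient changes.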
Alternative approach:
- Support vectors: create a representation that leads to a very simple optimization algorithm (see the sketch below).
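As an illustration of how simple the resulting optimization can be (a hedged sketch of a linear soft-margin SVM trained by subgradient descent, not the exact algorithm referred to in the slides):

```python
import numpy as np

def fit_linear_svm(X, y, lam=0.01, lr=0.1, steps=1000):
    """Subgradient descent on the primal SVM objective:
    lam * ||w||^2 + mean(max(0, 1 - y * (X @ w + b))), with labels y in {-1, +1}."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        active = margins < 1                       # points inside the margin
        grad_w = 2 * lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy usage: two separable clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([-1.0] * 50 + [1.0] * 50)
w, b = fit_linear_svm(X, y)
print(np.mean(np.sign(X @ w + b) == y))            # training accuracy
```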
Affecting the error surface: Hybrid Unsupervised/Supervised Architecture
Combined classification/reconstruction architecture
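One way to read the combined architecture (an illustrative sketch under my own assumptions, not the exact architecture from the slides): a shared hidden code drives both a reconstruction head (unsupervised) and a classification head (supervised), and the training objective is a weighted sum of the two losses, which reshapes the error surface relative to classification alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, params):
    """Shared hidden code h feeds both the reconstruction and the class scores."""
    h = np.tanh(x @ params["W_enc"])       # shared bottleneck code
    x_rec = h @ params["W_dec"]            # reconstruction head
    logits = h @ params["W_cls"]           # classification head
    return h, x_rec, logits

def combined_loss(x, y_onehot, params, alpha=0.5):
    """alpha weighs reconstruction (unsupervised) against classification (supervised)."""
    _, x_rec, logits = forward(x, params)
    rec = np.mean((x - x_rec) ** 2)
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    ce = -np.mean(np.log(p[np.arange(len(x)), y_onehot.argmax(axis=1)] + 1e-12))
    return alpha * rec + (1 - alpha) * ce

# toy usage with random data and random weights
d_in, d_hid, n_cls = 10, 4, 3
params = {"W_enc": 0.1 * rng.normal(size=(d_in, d_hid)),
          "W_dec": 0.1 * rng.normal(size=(d_hid, d_in)),
          "W_cls": 0.1 * rng.normal(size=(d_hid, n_cls))}
x = rng.normal(size=(32, d_in))
y = np.eye(n_cls)[rng.integers(0, n_cls, size=32)]
print(combined_loss(x, y, params))
```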
Remarks on optimal code
- In 1961, Horace Barlow suggested the bottleneck approach for finding an optimal neuronal code and for redundancy reduction.
- This was later supported by Atick and Redlich (1992) for visual pathway preprocessing.
- Barlow later suggested that neurons should carry out an additional task:
- While neurons should transmit the probability of occurrence of the features they learn to detect,
- they should also relay the prior probability of the features they detect.
- This follows from his seminal work on minimal entropy codes and unsupervised learning (Barlow, 1989).
Mathematical details
Addressing the computational complexity
- An ensemble of experts can give a rough estimate of the posterior distribution (a toy sketch follows this list).
- Start with a narrow Gaussian distribution in weight space.
- Start with a mixture-of-Gaussians weight distribution (Nowlan and Hinton, 1992), or with a distribution that is set by other means, such as PCA.
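A toy sketch of the ensemble idea, assuming my own illustrative setup (linear members trained by gradient descent from narrow Gaussian starting points); the spread of the ensemble's predictions serves as a rough posterior estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

def train_member(X, y, w0, lr=0.05, steps=30):
    """One ensemble member: gradient descent on MSE from its own starting point w0."""
    w = w0.copy()
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + 0.2 * rng.normal(size=100)

# Each member starts from a narrow Gaussian in weight space. With non-convex models
# different starts reach different local minima, which is what makes the spread
# informative; this convex toy only keeps diversity because training stops early.
ensemble = np.stack([train_member(X, y, w0=0.01 * rng.normal(size=3))
                     for _ in range(20)])

x_new = rng.normal(size=3)
preds = ensemble @ x_new
print(preds.mean(), preds.std())   # spread ~ rough posterior uncertainty
```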
Model selection properties of the MDL approach
- Prefers models with wide basins of attraction that lead to admissible local minima, i.e., reduced sensitivity to the model parameters (flat minima); see the probe after this list.
- It prefers non-identifiable models, namely models with many
local minima.
- While smoothness is a key issue, it rejects models with a golf-course type of smoothness, where the local minimum is difficult to find.
- Results in an ensemble of experts that is more robust to weight
changes, and to partial damage to the ensemble.
- The large ensemble can give some approximation to the posterior
probability of weights.
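A minimal probe of the "reduced sensitivity to the model parameters" point (my own illustration, not from the slides): measure how much the loss increases when trained weights are perturbed; smaller increases indicate a flatter, more robust minimum.

```python
import numpy as np

rng = np.random.default_rng(2)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

def flatness_probe(w, X, y, sigma=0.05, trials=200):
    """Average increase in loss under small Gaussian weight perturbations;
    small values indicate a flat (robust) minimum."""
    base = mse(w, X, y)
    bumps = [mse(w + sigma * rng.normal(size=w.shape), X, y) - base
             for _ in range(trials)]
    return np.mean(bumps)

# toy usage: a least-squares solution on noisy linear data
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -1.0, 0.0, 2.0]) + 0.1 * rng.normal(size=200)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(flatness_probe(w_star, X, y))
```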
Further properties of the MDL approach
- A consequence of the smoothness and the wide basin of attraction is that a simple learning rule should suffice.
- Most of the computation time is spent on finding a good architecture, not on finding a good solution.
- Finding a good architecture can be done in parallel, making good use of the multitude of neurons.
Estimating the decoding complexity of a given representation (code)
- So far we have discussed the complexity of the model (learning).
- Following the above formulation, estimating code complexity is a natural extension:
- Given the information that is required to be extracted from the code, one can build a scheme that attempts to extract that information, and estimate the complexity of that scheme as described above.
Related work