Learning and Artificial Neural Networks

Final Project

Yael Stav 0.2507922.9 Guy Hoffman 0.2513264.8

Classification of Seismic Data Using RBF Networks

General Architecture

Our work was aimed at training network ensembles to classify two types of preprocessed seismic data: SONL and PSD. We had 65 samples of each set.

First, we applied a dimensionality reduction scheme that converts each input vector (originally of dimension 242 or 352, depending on the data set) into a low-dimensional vector made up of two types of coordinates: distances from a given number of cluster centers, and the highest-entropy coordinates of the original data.

Next, an ensemble of three types of RBF networks was optimized. We used Bishop's NETLAB code and two types of Orr's RBF code. Each net was optimized according to the relevant input parameters, and a confidence estimate, based on a cross validation scheme, was retained.

Networks of the same type in the ensemble differed by two criteria: They were trained on different random subsets of input vectors, and the optimization procedure had a random element in it.

Then, a reconstruction constraint was added to the networks, and a supervised gradient descent learning process on all the parameters of the network was performed. The nets were again estimated for confidence, and in case of improvement the old net was replaced by the new net in the ensemble.

Finally, one of four ensemble fusing methods can be chosen to use the ensemble as a classifier for new data.

Source Code Usage

Click here to obtain a zip file with all of our source code

Click here to obtain a zip file with bug-fixed Bishop netlab code

Click here to obtain a zip file with Orr's code (Note that Orr's code is zipped in subdirectories. Make sure your MATLAB path includes the subdirectories "meth" and "util".)

Click here to download this HTML document with images(zipped)

Source code consists of three functions and some helper functions, in addition to Orr's and Bishop's code. Some bugs in Bishop's code have been fixed, and therefore it is vital to use this version of NETLAB.

Below are the options of the main functions.

[processed, preproc] = findpreproc(train_data, train_label, procoptions, preproc)

Requirements: This function has two tasks. First, it finds an optimal preprocessing; second, it generates the preprocessed data. If preproc is supplied, the function runs the data through the preprocessing method, which must be stored completely in preproc.

Usage:

Depending on the number of arguments:
3 - finds an optimal preprocessing for the given training data
4 - runs the data through the preprocessing method (the T input is not needed in this mode)

X - training data
T - target values (can be empty when preproc is supplied)
OPTIONS (1) - data set: 0 - 'sonl' or 1 - 'psd'
OPTIONS (3) - number of entropy coordinates to choose, or 0 for optimal value
OPTIONS (4) - number of cluster centers to use, or 0 for optimal value
OPTIONS (5) - nonzero: use Gaussian kernel estimation in entropy evaluation (more precise, but time consuming)
OPTIONS (6) - nonzero: use Mahalanobis distance in clustering (very time consuming)

Uses "rentropy" code from a previous year, written by Faran, Mertens and Keinan.

[preproc, netarch] = findarch(train_data, train_label, options)

Requirements: This function finds an optimal preprocessing and a set of architectures, with their corresponding parameters, that are best for the given training data. It calls findpreproc internally.

Usage:

OPTIONS (1) to (6) are for the preprocessing:
OPTIONS (1) - data type: 0 - 'sonl' or 1 - 'psd'
OPTIONS (3) - number of entropy coordinates to choose, or -1 for optimal value
OPTIONS (4) - number of cluster centers to use, or -1 for optimal value
OPTIONS (5) - nonzero: use Gaussian kernel estimation in entropy evaluation (more precise, but time consuming)
OPTIONS (6) - nonzero: use Mahalanobis distance in clustering (very time consuming)

The other options are for the network training:
OPTIONS (7) - number of networks for each architecture in the ensemble (default 6)
OPTIONS (8) - size of the data sample for each net (default 55)
OPTIONS (9) - lambda: relative part of the reconstruction constraint in the error (default 0.1)
OPTIONS (10) - type of ensemble fusing: 0 - weighted voting, 1 - simple voting, 2 - weighted mean, 3 - simple mean

Uses 'check_bias' code from last year, written by Helman, Zivan and Friedrich.

[label_pred, confidence] = netpred(data, preproc, netarch, options)

Requirements: This function first preprocesses the given data using the preprocessing scheme that was chosen by findarch and is stored in preproc. It then predicts the class labels using the architectures that were found by findarch and are stored in netarch. Note that if a collection of several experts is found by findarch, then the algorithm for fusing the experts should also be stored in netarch to be used by netpred. You should suggest a way to calculate the confidence and justify in your written report why this confidence method is useful.

Usage:

CONFIDENCE is a structure containing the fields:
VAR - variance of classification results
DIST - distance of raw classification from 0.5

[pred, confidence, y] = netpred (X, preproc, netarch, options) also returns the raw classifications from each classifier.

OPTIONS (1) and (6) are for the preprocessing:
OPTIONS (1) - data type: 0 - 'sonl' or 1 - 'psd'
OPTIONS (6) - nonzero: use Mahalanobis distance in clustering preprocessing (very time consuming)

Preprocessing

Two methods of preprocessing were used: Maximal Entropy and Clustering. In addition some standard deviation normalization was applied as described below. Output vectors consisted of a mixture of the two preprocessing outputs:

preprocessed = [ prep_cluster prep_entropy ];

`Clustering`

Method: Input data was clustered into k clusters. The output was the distance of the input vector from each of the k cluster centers. An option enables the user to specify Mahalanobis distance from the cluster instead of regular Euclidean distance.

Source code:

find_clusters.m - find cluster centers and cov-matrix
prep_clusters.m - calculate distance from cluster centers

Remarks:

An initial clustering is performed using k log k clusters, and then the actual clustering is done.
Mahalanobis distance is computationally expensive, and didn't seem to improve later classification results.
Because the input data is sparse, a cluster can end up with a single sample, which prevents the standard computation of the Mahalanobis distance. A pseudo-inverse of the covariance matrix was used to overcome this problem.
In general, we did not succeed in finding a consistently optimal number of cluster centers. Clustering the data did not usually lead to good classification, and in the end a low number of cluster centers (3-5) was chosen as the default for our implementation.
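
The Euclidean variant of the clustering output described above can be sketched as follows. This is a minimal Python illustration, not the project's MATLAB code in prep_clusters.m; the function name and the example centers are hypothetical, and the Mahalanobis option is not covered.

```python
import math

def cluster_distance_features(x, centers):
    """Return the Euclidean distance from vector x to each cluster center."""
    return [math.dist(x, c) for c in centers]

# Hypothetical 2-D example with k = 2 cluster centers.
centers = [[0.0, 0.0], [3.0, 4.0]]
print(cluster_distance_features([0.0, 0.0], centers))  # [0.0, 5.0]
```

Each input vector is thus replaced by k numbers, one per cluster center, regardless of its original dimension.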

`Maximal Entropy`

Method: We computed an entropy value for each input coordinate. The coordinate indices were then sorted by descending entropy, and the dim highest-entropy indices were chosen as the output of this process. An option enables the user to specify Gaussian kernel estimation for the probability estimate used in the entropy computation.
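
A minimal Python sketch of this selection, assuming a simple histogram estimate of each coordinate's probability distribution. The bin count, the function names, and the toy samples are assumptions made for illustration; the project's actual estimator lives in rentropy.m and may differ.

```python
import math

def histogram_entropy(values, bins=10):
    """Shannon entropy (in nats) of an equal-width histogram over `values`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant coordinate carries no information
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in counts if c)

def top_entropy_coordinates(samples, dim, bins=10):
    """Indices of the `dim` highest-entropy coordinates, in descending order."""
    n_coords = len(samples[0])
    entropies = [histogram_entropy([s[j] for s in samples], bins)
                 for j in range(n_coords)]
    order = sorted(range(n_coords), key=lambda j: entropies[j], reverse=True)
    return order[:dim]

# Toy data: coordinate 0 is constant, 1 takes four values, 2 takes two values.
samples = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 2.0, 1.0], [1.0, 3.0, 1.0]]
print(top_entropy_coordinates(samples, 2))  # [1, 2]
```

The preprocessed vector then keeps only the selected coordinates of each sample.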

Source code:

rentropy.m -

Remarks:

Looking at a visualization of the entropy values (below) across the coordinates, one can clearly see how much the entropy varies between the different dimensions in each of the data sets. We used these graphs as a starting point for optimizing the default number of high-entropy coordinates.
An automated procedure can be considered, cutting the entropy coordinates according to the first-order derivative of the graph below. We tried such a procedure on the given data sets and it worked well. The cutting point is decided by defining a threshold on the entropy derivative normalized by the entropy value. More data sets should be examined to determine the general properties of this method.
Entropy based coordinates have proved a good preprocessing method in subsequent classification tasks.
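
One way the automated cut mentioned in the remarks could look is sketched below in Python. It assumes the entropies are first sorted in descending order and that the "normalized derivative" is the drop between neighbouring values divided by the larger one; the threshold of 0.2 and the function name are illustrative assumptions, not values from the project.

```python
def entropy_cutoff(entropies, threshold=0.2):
    """Number of coordinates to keep, cutting at the first large relative drop."""
    ordered = sorted(entropies, reverse=True)
    for i in range(len(ordered) - 1):
        current, nxt = ordered[i], ordered[i + 1]
        if current <= 0:
            return i  # nothing informative is left from here on
        relative_drop = (current - nxt) / current
        if relative_drop > threshold:
            return i + 1  # keep everything before the drop
    return len(ordered)

# The third-to-fourth step falls by about 51%, so three coordinates are kept.
print(entropy_cutoff([2.0, 1.9, 1.85, 0.9, 0.8]))  # 3
```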

`Normalization`

Method: Since the different coordinates of the input data are at extremely varying scales, a standard deviation normalization is in order. This is especially important for the clustering preprocessing, and was applied as a preliminary stage before supplying the input to the clustering process. As for the maximal entropy preprocessing, we saw that normalizing the input smooths the results, and therefore we didn't apply it to the input of the maximal entropy function.

Another normalization was performed on the combined output of the two preprocessing methods to prevent extreme influence of certain coordinates on optimization and learning.
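
A minimal Python sketch of a standard deviation normalization, assuming each coordinate is simply divided by its standard deviation over the samples. Whether normalize.m also removes the mean is not stated here, so this sketch leaves the mean untouched; the function name is hypothetical.

```python
import statistics

def std_normalize(samples):
    """Scale every coordinate (column) to unit standard deviation."""
    n_coords = len(samples[0])
    stds = [statistics.pstdev([s[j] for s in samples]) for j in range(n_coords)]
    # Leave constant coordinates untouched to avoid dividing by zero.
    scales = [sd if sd > 0 else 1.0 for sd in stds]
    return [[s[j] / scales[j] for j in range(n_coords)] for s in samples]

# Coordinate 0 has standard deviation 2, coordinate 1 is constant.
normalized = std_normalize([[0.0, 5.0], [4.0, 5.0]])
```

After this step no single large-scale coordinate dominates the Euclidean distances used in clustering.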

Source code:

normalize.m

RBF Net Optimization

Method: We have used RBF construction codes of Bishop and Orr to initialize our networks, and then tried various ways to optimize their configuration. Optimization was performed by sampling a random subset of the data, and testing the network with the current parameter set against the rest of the data.
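
The evaluation loop described above can be sketched in Python as follows. The `fit` and `predict` callables are hypothetical stand-ins for building and running an RBF network with a given parameter set; they are not functions from NETLAB or Orr's package.

```python
import random

def holdout_error(samples, labels, train_size, fit, predict, rng=random):
    """Train on a random subset and return the error rate on the remaining data."""
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    train_idx, test_idx = indices[:train_size], indices[train_size:]
    model = fit([samples[i] for i in train_idx], [labels[i] for i in train_idx])
    wrong = sum(predict(model, samples[i]) != labels[i] for i in test_idx)
    return wrong / len(test_idx)
```

With 65 samples and a training subset of 55, for example, each candidate parameter set would be scored on the 10 samples that were held out.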

Source code:

optimal_net.m -

`RBF by Bishop`

Method: Bishop's NETLAB code expects the initial configuration of the network: the number of input units, hidden units and output units, and the type of activation function. We used the 'gaussian' activation function, and optimized the number of hidden units. We tried different sets of hidden unit counts, and found it best to search a wide range of values. In practice the range [2:4:26] is searched.
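
The search over hidden unit counts can be sketched in Python as below. The MATLAB range [2:4:26] corresponds to `range(2, 27, 4)`; the `evaluate` callable is a hypothetical stand-in for training a network with the given number of hidden units and returning its validation error.

```python
def best_hidden_units(evaluate, candidates=range(2, 27, 4)):
    """Return (n_hidden, error) with the lowest error over the candidate counts."""
    scored = [(evaluate(n), n) for n in candidates]
    error, n_hidden = min(scored)  # ties resolve to the smaller network
    return n_hidden, error

print(list(range(2, 27, 4)))  # [2, 6, 10, 14, 18, 22, 26]
```

Because the underlying training is randomized, repeated runs of such a search may pick different counts.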

Source code:

optimize_bishop.m

Remarks:

During hidden layer construction (the EM algorithm) we discovered that there are cases of zero values in the probability matrix. Since this caused division by zero error, we took a precautionary measure and added epsilon to this matrix (see gmmem.m).
We have not found an optimal number of hidden units, but due to the random nature of the basic network training, a good network is usually found for at least one value in the searched range.
In very broad terms, more hidden units result in better classification, but the relationship is not monotonic.
More hidden units do, of course, hold the danger of overfitting the specific data.

`RBF by Orr`

Method: Orr provides four types of RBF networks that differ in the way the first-layer parameters are optimized. We found that the regression tree functions malfunctioned on our data, so we focussed on the "forward selection" and "ridge regression" packages.

Orr does an automatic selection of the number of hidden units. Three important parameters can, though, be optimized:

Constant radii: The initial RBF radii can either be relative to the spread of the input data on each coordinate, or equal sized along the coordinates.
Scale of widths: RBF radii can be initialized with a multiplier called "scale", to enrich the mixture of the RB functions.
Bias: A boolean modifier to specify whether to use a bias unit in the second layer or not.

We optimized the scales iteratively (with the help of the previous year's work) and found that for the "forward selection" scheme it is best to optimize over the scales [0.75 1 1.25], whereas for the "ridge regression" scheme an open-ended optimization is beneficial. In addition, constant radii and a bias unit are used only with the "ridge regression" scheme (according to experimental results).

Remarks:

In general, we have found that Orr's networks tend to overfit the input data, since the optimization algorithms applied by Orr usually select a large number of RBF centers. This results in a misleading cross validation error measure, which is usually lower than the actual generalization error (and negatively influences the ensemble fusing).
We have found constant radii to work better with the SONL data and worse with the PSD data. This conclusion should be further verified with more data.

Reconstruction Constraint

Method: A Reconstruction constraint was added to the networks as a post training step. This is due to the special kind of hybrid unsupervised/supervised training normally used for RBF networks.

The reconstruction constraint was added to the networks from the optimization step in the form of additional output targets, and then a gradient descent optimization on the RBF parameters was performed according to the formulae below:

The error function that was optimized was a weighted combination of the error on the y-target and the error on the reconstruction target (the latter divided by the size of the reconstruction vector). This weighting is governed by the lambda parameter, which the user can modify (default 0.1).
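
A minimal Python sketch of such a combined error, under two stated assumptions: both terms are sum-of-squares errors, and lambda enters as a convex weight, (1 - lambda) on the classification term and lambda on the reconstruction term. The exact form used in netopt_recons.m may differ, and the function name is hypothetical.

```python
def combined_error(y_pred, y_true, x_recon, x_true, lam=0.1):
    """Weighted sum of classification error and per-coordinate reconstruction error."""
    class_err = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    # Dividing by the input dimension keeps the reconstruction term on a
    # per-coordinate scale, as described in the text.
    recon_err = sum((r - x) ** 2 for r, x in zip(x_recon, x_true)) / len(x_true)
    return (1.0 - lam) * class_err + lam * recon_err
```

With lam = 0 the reconstruction targets are ignored and the error reduces to the plain classification error.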

Source code:

add_reconstruction.m
remove_reconstruction.m
netopt_recons.m

Remarks:

For good (low error) networks from the first stage, the reconstruction stage rarely yielded an improvement over the training data. If the first stage optimization of a network in the ensemble was less successful in classifying the test data, reconstruction usually improved the error value slightly.
The reconstruction optimization does not converge if the number of parameters (i.e. hidden units) is too large. Therefore, on most of the networks from Orr's code the reconstruction step did not succeed in lowering the error.
If a low number of hidden units is selected in Bishop's code, then the reconstruction step is usually effective.
In choosing the number of optimization iterations, there is a tradeoff between the computation time and the quality of the improvement.

Ensemble Fusing

Method: We have supplied 4 methods of ensemble fusing for the user to choose from.

Weighted Voting (default) - Each network votes on the classification (0 or 1) and each vote has a relative weight according to the network's confidence.
Simple Voting - Each network votes on the classification (0 or 1) and each vote has the same weight regardless of the network's confidence.
Weighted Mean - The prediction is based on the mean of the networks' outputs. Each output has a relative weight according to the network's confidence.
Simple Mean - The prediction is based on the mean of the networks' output.
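
The four fusing methods listed above can be sketched in Python as follows. This assumes raw network outputs in [0, 1], a decision threshold of 0.5, and one non-negative confidence weight per network; the function and method names are hypothetical and do not mirror the project's MATLAB interface.

```python
def fuse(outputs, weights, method="weighted_voting"):
    """Fuse raw network outputs in [0, 1] into a single 0/1 label."""
    if method in ("simple_voting", "simple_mean"):
        weights = [1.0] * len(outputs)  # ignore the confidences
    if method in ("weighted_voting", "simple_voting"):
        values = [1.0 if y > 0.5 else 0.0 for y in outputs]  # hard votes
    elif method in ("weighted_mean", "simple_mean"):
        values = list(outputs)  # raw outputs
    else:
        raise ValueError("unknown fusing method: " + method)
    score = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return 1 if score > 0.5 else 0

outputs = [0.9, 0.4, 0.45]   # one confident network, two undecided ones
weights = [0.8, 0.1, 0.1]
print(fuse(outputs, weights, "weighted_voting"))  # 1
print(fuse(outputs, weights, "simple_voting"))    # 0
```

The example shows how a single high-confidence network can overrule a majority under weighted voting, which is also why a wrongly trusted network can hurt the fused prediction.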

Remarks:

Even though weighted voting was found to be the best predictor, there still remains the problem of filtering out bad networks that received a good error-based confidence in the first stage. We found several networks that passed the first confidence stage with very good results, yet predicted very poorly on new data. This had a negative influence on the overall prediction.

Ensemble Confidence

We have supplied two ways of computing confidence:

Variance - Low variance of prediction between networks implies a good ensemble confidence.
Significance - Another method of computing the confidence of the prediction is the output's distance from 0.5, which is basically a computation of the significance of the prediction.
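
Both confidence measures can be sketched in Python as below. The sketch assumes the "output" whose distance from 0.5 is measured is the plain mean of the raw network outputs; the dictionary keys echo the VAR and DIST fields, but the function name is hypothetical.

```python
import statistics

def ensemble_confidence(outputs):
    """Variance across networks and distance of the mean output from 0.5."""
    mean = statistics.fmean(outputs)
    return {
        "var": statistics.pvariance(outputs),  # low value: networks agree
        "dist": abs(mean - 0.5),               # high value: decisive output
    }

confidence = ensemble_confidence([0.9, 0.85, 0.95])
```

A prediction would be considered reliable when "var" is small and "dist" is close to its maximum of 0.5.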

Conclusions

In general, we have found Bishop's network to perform far better on the given data sets. Orr's networks tend to overfit the data, and our assumption is that this is due to the large number of hidden units selected by each of Orr's first-layer optimization functions.

In addition, the reconstruction constraint has a positive effect only on initially badly performing networks, and many times, even though the reconstruction abilities of the network improved, the classification abilities didn't improve, and sometimes even deteriorated.

As to the data sets, the SONL data set is much easier to classify, and the entropy preprocessing stage implies that many of the PSD coordinates are too low-entropy to give any good indication as to the nature of the data. Finally, we couldn't achieve very good results on the PSD set, whereas for the SONL set an almost-perfect classification was often found.

In the preprocessing stage, the number of entropy-based coordinates has proven to be crucial to the success of the classification. Too few, but also too many, coordinates worsen the classification immensely. The clustering preprocessing was found to be less than optimal.

Two types of error were used in the cross validation stage: The error over all the data (train and test), and error on the test data alone. The latter seems to characterize the generalization traits of the network better.

The expert fusion was not investigated enough, but initial results point to the "weighted vote" option as being optimal. Some problems remain, though, especially with "fake experts", i.e. nets that performed well on the optimization step, but don't generalize well over new data. This is a problem especially with Orr's networks.

Future Work

A minimal entropy constraint on the first layer parameters has yet to be implemented.
Find better optimization for the PSD data set.