Learning and Artificial Neural Networks

Final Project

Yael Stav 0.2507922.9 Guy Hoffman 0.2513264.8

Classification of Seismic Data Using RBF Networks


General Architecture

Our work was aimed at training network ensembles to classify two types of preprocessed seismic data: SONL and PSD. We had 65 samples of each set.

First, we applied a dimensionality reduction scheme that converts each input vector (originally of dimension 242 or 352) into a low-dimensional vector consisting of a mixture of two types of coordinates: distances from a given number of cluster centers, and the highest-entropy coordinates of the original data.

Next, an ensemble of three types of RBF networks was optimized. We used Bishop's Netlab code and two types of Orr's RBF code. Each net was optimized over its relevant input parameters, and a confidence estimate, based on a cross-validation scheme, was retained for it.

Networks of the same type in the ensemble differed in two ways: they were trained on different random subsets of the input vectors, and the optimization procedure contained a random element.

Then, a reconstruction constraint was added to the networks, and a supervised gradient-descent learning process was performed on all the parameters of each network. The confidence of each net was estimated again, and where it improved, the new net replaced the old one in the ensemble.

Finally, one of four ensemble fusing methods can be chosen to use the ensemble as a classifier for new data.
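
As a quick orientation, here is a minimal usage sketch of the resulting pipeline (the variable names are illustrative, and we assume here that zero option entries fall back to the defaults described below):

   % Illustrative end-to-end use of the two main entry points
   options = zeros(1, 10);     % assumption: zero entries fall back to the defaults
   options(1) = 0;             % data set: 0 - 'sonl', 1 - 'psd'
   [preproc, netarch] = findarch(train_data, train_label, options);
   [label_pred, confidence] = netpred(new_data, preproc, netarch, options);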


Source Code Usage

Click here to obtain a zip file with all of our source code

Click here to obtain a zip file with the bug-fixed Bishop Netlab code

Click here to obtain a zip file with Orr's code (note that Orr's code is zipped in subdirectories; make sure your Matlab path points to the subdirectories "meth" and "util")

Click here to download this HTML document with images (zipped)

Our source code consists of three main functions and some helper functions, in addition to Orr's and Bishop's code. Several bugs in Bishop's code have been fixed, so it is vital to use the version of Netlab supplied above.

Below are the options of the main functions.
 

[processed, preproc] = findpreproc(train_data, train_label, options, preproc)

Requirements: This function has two tasks: first, it finds an optimal preprocessing; second, it generates the preprocessed data. If preproc is supplied, the function simply runs the data through the preprocessing scheme, which is kept in its entirety in preproc.

Usage:
 
  
  Depending on the number of arguments:

   3 - finds an optimal preprocessing for the given training data
   4 - runs the data through the preprocessing method stored in preproc.
       The target input is not needed in this mode.

   X - training data
   T - target values (can be empty when preproc is supplied)

   OPTIONS (1) - data set: 0 - 'sonl' or 1 - 'psd'
   OPTIONS (3) - number of entropy coordinates to choose or 0 for optimal value
   OPTIONS (4) - number of cluster centers to use or 0 for optimal value

   OPTIONS (5) - nonzero: use gaussian kernel estimation in entropy evaluation
                (more precise, but time consuming)

   OPTIONS (6) - nonzero: use Mahalanobis distance in clustering
                 (extremely time consuming)
   ---

  Uses the "rentropy" code from a previous year's project, written by Faran, Mertens and Keinan
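
  A minimal usage sketch of both modes (variable names are illustrative; option semantics as listed above):

   % Mode 1: find an optimal preprocessing for the SONL training data
   options = zeros(1, 6);
   options(1) = 0;             % data set: 0 - 'sonl'
   options(3) = 0;             % choose the number of entropy coordinates automatically
   options(4) = 0;             % choose the number of cluster centers automatically
   [processed, preproc] = findpreproc(train_data, train_label, options);

   % Mode 2: run new data through the stored preprocessing (targets not needed)
   processed_new = findpreproc(new_data, [], options, preproc);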
 

[preproc, netarch] = findarch(train_data, train_label, options)

Requirements: This function finds an optimal preprocessing and a set of architectures, with their corresponding parameters, that are best for the given training data. It calls findpreproc internally.

Usage:
 
 
  OPTIONS(1) to (6) are for the preprocessing:
 
   OPTIONS (1) - data type: 0 - 'sonl' or 1 - 'psd'
   OPTIONS (3) - number of entropy coordinates to choose or -1 for optimal value
   OPTIONS (4) - number of cluster centers to use or -1 for optimal value
 
   OPTIONS (5) - nonzero: use gaussian kernel estimation in entropy evaluation
                (more precise, but time consuming)
 
   OPTIONS (6) - nonzero: use Mahalanobis distance in clustering
                 (extremely time consuming)
 
 
   The other options are for the network training
 
   OPTIONS (7) - number of networks for each architecture in the ensemble
                 default = 6 
   OPTIONS (8) - size of sample from data for each net (default 55)
   OPTIONS (9) - lambda - relative weight of the reconstruction constraint in the error (default 0.1)
   OPTIONS (10) - type of ensemble fusing: 
                    0 - weighted voting
                    1 - simple voting
                    2 - weighted mean
                    3 - simple mean  
   Uses the 'check_bias' code from a previous year's project, written by Helman, Zivan and Friedrich
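
   A sketch of a call with the training options spelled out (the values are illustrative, not tuned):

   options = zeros(1, 10);
   options(1)  = 1;            % data type: 1 - 'psd'
   options(3)  = -1;           % optimal number of entropy coordinates
   options(4)  = -1;           % optimal number of cluster centers
   options(7)  = 6;            % networks per architecture in the ensemble
   options(8)  = 55;           % sample size from the data for each net
   options(9)  = 0.1;          % lambda, weight of the reconstruction constraint
   options(10) = 0;            % ensemble fusing: 0 - weighted voting
   [preproc, netarch] = findarch(train_data, train_label, options);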
 

[label_pred, confidence] = netpred(data, preproc, netarch, options)

Requirements: This function performs the preprocessing on the given data, based on the preprocessing scheme that was chosen by findarch and stored in preproc, and then predicts the class labels based on the architectures that were found by findarch and stored in netarch. Note that when findarch finds a collection of several experts, the algorithm for fusing the experts is also stored in netarch and used by netpred. The ways we compute confidence, and why they are useful, are described in the Ensemble Confidence section below.

Usage:
 
   CONFIDENCE is a structure containing fields: 
      VAR - variance of classification results
      DIST - distance of raw classification from 0.5

   [pred, confidence, y] = netpred(X, preproc, netarch, options)
       also returns the raw classifications from each classifier

   OPTIONS(1) and (6) are for the preprocessing:

   OPTIONS (1) - data type: 0 - 'sonl' or 1 - 'psd'
   OPTIONS (6) - nonzero: use Mahalanobis distance in the clustering preprocessing
                 (extremely time consuming)
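
   A minimal sketch of a call and of reading the confidence fields (new_data is an illustrative variable name):

   [label_pred, confidence, y] = netpred(new_data, preproc, netarch, options);
   confidence.VAR        % variance of the classification results across the ensemble
   confidence.DIST       % distance of the raw classification from 0.5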
 


Preprocessing

Two methods of preprocessing were used: maximal entropy and clustering. In addition, a standard-deviation normalization was applied as described below. Output vectors consisted of a concatenation of the two preprocessing outputs:

preprocessed = [ prep_cluster  prep_entropy ];
 

Clustering

Method: The input data was clustered into k clusters. The output was the distance of the input vector from each of the k cluster centers. An option enables the user to use the Mahalanobis distance to each cluster instead of the regular Euclidean distance.
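
A minimal sketch of the Euclidean variant, assuming Netlab's kmeans and dist2 routines (k and the variable names are illustrative):

   % Cluster the normalized training data into k centers and use the
   % distances to those centers as the new coordinates
   k = 10;                                   % illustrative number of centers
   perm = randperm(size(train_data, 1));
   centres = train_data(perm(1:k), :);       % initialize centers from random samples
   opts = zeros(1, 18); opts(14) = 100;      % at most 100 k-means iterations
   centres = kmeans(centres, train_data, opts);
   prep_cluster = sqrt(dist2(train_data, centres));   % Euclidean distance to each center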

Source code:

Remarks:

Maximal Entropy

Method: We computed an entropy value for each input coordinate. The indices were then sorted by descending entropy, and the dim highest-entropy indices were chosen as the output of this process. An option enables the user to use Gaussian kernel estimation for the probability estimate underlying the entropy.
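
A minimal sketch of the idea, using a plain histogram estimate of each coordinate's distribution (the Gaussian-kernel variant is analogous; dim and nbins are illustrative):

   % Estimate the entropy of each input coordinate from a histogram
   nbins = 10;
   ncoord = size(train_data, 2);
   H = zeros(1, ncoord);
   for i = 1:ncoord
     p = hist(train_data(:, i), nbins);
     p = p / sum(p);
     p = p(p > 0);                           % drop empty bins to avoid log(0)
     H(i) = -sum(p .* log(p));
   end
   [ignore, order] = sort(-H);               % coordinates by descending entropy
   prep_entropy = train_data(:, order(1:dim));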

Source code:

Remarks:

Normalization

Method: Since the different coordinates of the input data lie on extremely varying scales, a standard-deviation normalization is in order. This is especially important for the clustering preprocessing, so it was applied as a preliminary stage before supplying the input to the clustering process. For the maximal-entropy preprocessing, on the other hand, we saw that normalizing the input smooths out the differences in the results, and therefore we did not apply it to the input of the maximal-entropy function.

Another normalization was performed on the combined output of the two preprocessing methods, to prevent certain coordinates from having an extreme influence on the optimization and learning.

Source code:

normalize.m
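
A minimal sketch of what this step does (an illustrative stand-in, not the actual contents of normalize.m):

   % Divide each coordinate by its standard deviation
   s = std(data);
   s(s == 0) = 1;                            % guard against constant coordinates
   normalized = data ./ repmat(s, size(data, 1), 1);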


RBF Net Optimization

Method: We used the RBF construction code of Bishop and of Orr to initialize our networks, and then tried various ways to optimize their configuration. Optimization was performed by sampling a random subset of the data for training, and testing a network built with the current parameter set against the rest of the data.
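
An illustrative sketch of the sampling used to score one candidate configuration (55 is the default sample size of OPTIONS(8); the variable names are illustrative):

   % Random train/test split used to score a candidate configuration
   n = size(processed, 1);
   perm = randperm(n);
   xtrain = processed(perm(1:55), :);   ttrain = labels(perm(1:55), :);
   xtest  = processed(perm(56:n), :);   ttest  = labels(perm(56:n), :);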

Source code:

RBF by Bishop

Method: Bishop's Netlab code receives the initial configuration of the network in terms of the number of input units, hidden units and output units, and the type of activation function. We used the 'gaussian' activation function and optimized the number of hidden units. We tried different sets of hidden-unit counts, but found it best to search over a wide range of values; in practice the range 2:4:26 is searched.
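
A minimal sketch of the hidden-unit search, assuming the Netlab functions rbf, rbftrain and rbffwd and a split like the one above (not the exact code we used):

   best_err = Inf;
   for nhid = 2:4:26
     net = rbf(size(xtrain, 2), nhid, 1, 'gaussian');   % Netlab RBF network
     opts = zeros(1, 18); opts(14) = 50;                % at most 50 training cycles
     net = rbftrain(net, opts, xtrain, ttrain);
     y = rbffwd(net, xtest);
     err = mean((y > 0.5) ~= ttest);                    % held-out classification error
     if err < best_err
       best_err = err;
       best_net = net;
     end
   end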

Source code:

Remarks:

RBF by Orr

Method: Orr provides four types of RBF networks that differ in the way the first-layer parameters are optimized. We found the regression-tree functions to malfunction on our data, so we focused on the "forward selection" and "ridge regression" packages.

Orr's code selects the number of hidden units automatically. Three important parameters can nevertheless be optimized:

Constant radii
The initial RBF radii can either be relative to the spread of the input data along each coordinate, or equal-sized across the coordinates.
Scale of widths
The RBF radii can be initialized with a multiplier called "scale", to enrich the mixture of the radial basis functions.
Bias
A boolean flag specifying whether or not to use a bias unit in the second layer.

We performed the optimization of the scales in an iterative way (with the help of a previous year's work) and found that for the "forward selection" scheme it is best to optimize the scales over [0.75 1 1.25], whereas for the "ridge regression" scheme an open-ended optimization is beneficial. In addition, constant radii and the bias unit are used only in the "ridge regression" scheme (based on experimental results).


Remarks:


Reconstruction Constraint

Method: A reconstruction constraint was added to the networks as a post-training step. This is because of the special hybrid unsupervised/supervised training normally used for RBF networks.

The networks from the optimization step were extended with the reconstruction constraint in the form of additional output targets (the inputs themselves), and then a gradient-descent optimization of the RBF parameters was performed according to the formulae below:

[Formula images not preserved: the combined error function and its gradients with respect to the RBF centers, widths and second-layer weights.]

The error function that was optimized was a weighted combination of the y-target error and the reconstruction-target error (the latter divided by the size of the reconstruction vector). The weighting is governed by the lambda parameter, which the user can modify (default 0.1).
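
A minimal sketch of this combined error, assuming each network's output is extended with a reconstruction of its input (yhat is the class output, t the class target, xhat the reconstruction of the input x; the weighting shown is one plausible reading of the lambda parameter):

   % Class-target term plus a lambda-weighted reconstruction term,
   % the latter divided by the size of the reconstruction vector
   lambda = 0.1;
   d = size(x, 2);                           % size of the reconstruction vector
   err_y   = sum((yhat - t).^2);
   err_rec = sum(sum((xhat - x).^2)) / d;
   err = (1 - lambda) * err_y + lambda * err_rec;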

Source code:

Remarks:


Ensemble Fusing

Method: We supplied four methods of ensemble fusing for the user to choose from (a minimal sketch of the default rule appears after this list).
  1. Weighted Voting (default) - Each network votes on the classification (0 or 1), and each vote is weighted according to the network's confidence.
  2. Simple Voting - Each network votes on the classification (0 or 1), and each vote has the same weight regardless of the network's confidence.
  3. Weighted Mean - The prediction is based on the mean of the networks' outputs, with each output weighted according to the network's confidence.
  4. Simple Mean - The prediction is based on the unweighted mean of the networks' outputs.
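
A minimal sketch of the default weighted-voting rule (y holds the networks' raw outputs for one sample and w their confidence weights; the names are illustrative):

   % Each net casts a 0/1 vote, weighted by its confidence
   votes = double(y(:) > 0.5);
   score = sum(w(:) .* votes) / sum(w(:));   % confidence-weighted fraction of '1' votes
   label_pred = (score > 0.5);
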
Remarks:


Ensemble Confidence

We supplied two ways of computing confidence (a minimal sketch follows this list):
  1. Variance - Low variance of the predictions across the networks implies good ensemble confidence.
  2. Significance - The raw output's distance from 0.5, which is essentially a measure of the significance of the prediction.
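
A minimal sketch of both measures over the raw ensemble outputs y for a single sample (illustrative):

   % Two confidence measures over the raw outputs y of the ensemble
   conf_var  = var(y);                       % low variance across networks = high confidence
   conf_dist = abs(mean(y) - 0.5);           % distance of the fused output from 0.5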


Conclusions

In general, we found Bishop's networks to perform far better on the given data sets. Orr's networks tend to overfit the data, and our assumption is that this is related to the large number of hidden units selected by each of Orr's first-layer optimization functions.

In addition, the reconstruction constraint had a positive effect only on networks that initially performed badly; in many cases, even though the reconstruction abilities of a network improved, its classification abilities did not, and sometimes they even deteriorated.

As for the data sets, the SONL set is much easier to classify, and the entropy preprocessing stage suggests that many of the PSD coordinates have too little entropy to give any good indication of the nature of the data. Ultimately, we could not achieve very good results on the PSD set, whereas for the SONL set an almost-perfect classification was often found.

In the preprocessing stage, the number of entropy-based coordinates proved crucial to the success of the classification: too few coordinates, but also too many, worsen the classification immensely. The clustering preprocessing was found to be less than optimal.

Two types of error were used in the cross-validation stage: the error over all the data (train and test), and the error on the test data alone. The latter seems to characterize the generalization ability of a network better.

The expert fusion was not investigated thoroughly enough, but initial results point to the "weighted voting" option as optimal. Some problems remain, though, especially with "fake experts", i.e. nets that performed well in the optimization step but do not generalize well to new data. This is a problem especially with Orr's networks.
 


Future Work