In Chapter 1, we highlighted the goal of PLS as a solution to the drawbacks of NNC. Continuing that discussion, this chapter presents a brief survey of PLS, starting with one of the earliest, Hart's 1968 Condensed Nearest Neighbor (CNN) \cite{Hart68a}, and ending with a method published in 2010. The reader will see that since 1968, researchers in this field have created PLS that fit into at least one of two categories: 1) instance selection and 2) instance abstraction. %Also included in this chapter, are the evaluation methods generally used to gage the performance of different PLS.

Before moving forward, in the interest of clarity, the following terminology is used throughout the remainder of this thesis: all data-sets refer to supervised data-sets; each data-set consists of rows and columns, where each row is referred to as an instance and each column is called an attribute, except for the last column, which is the target class; a prototype is an instance selected or created to be part of the final reduced training set; and finally, consistency is defined as the ability of the final subset of prototypes to correctly classify the original training set.

\section{Prototype Learning for Nearest Neighbor Classifiers}

Research in prototype learning is an active field of study \cite{Devi2002, Bezdek2001, Chang74, Dasarathy94, Li2009, Bezdek98, Cano2005, Garain2008, Veenman2005, García2008, Kohonen98, Kohonen90}. A review of the literature in this field reveals two categories of PLS: 1) instance selection and 2) instance abstraction. Instance selection involves selecting a subset of instances from the original training set as prototypes. Using what Dasarathy \cite{Dasarathy94} terms \emph{edit rules}, instance selection can take place in four different ways.

\be
\item incremental (CNN \cite{Hart68a})
\item decremental (RNN \cite{Gates72})
\item a combination of 1 and 2
\item border points, non-border points or central points
\ee

Instance abstraction involves creating prototypes by merging the instances in a training set according to pre-determined rules. For example, Chang \cite{Chang74} merges two instances if they have the same class, are closer to each other than to any other instances, and the result of merging does not degrade the performance of NNC. \fig{psw} shows the PLS surveyed in this work.

\begin{figure}[ht!]
\begin{center}
\begin{tabular}{| l@{~}|l@{~}| c@{~}| c@{~}|} \hline
Name & Abbreviation & Cite &  Type\\\hline
Condensed Nearest Neighbor & CNN & \cite{Hart68a} & Instance Selection \\ \hline
Reduced Nearest Neighbor & RNN & \cite{Gates72} & Instance Selection \\ \hline
Minimal Consistent Set & MCS & \cite{Dasarathy94} & Instance Selection \\ \hline
Prototype Selection by Clustering & PSC & \cite{lot2010} & Instance Selection \\ \hline
The Chang and Modified Chang Algorithms & MCAs & \cite{Chang74, Bezdek98} & Instance Abstraction \\ \hline
Learning Vector Quantization & LVQ & \cite{Kohonen90} & Instance Abstraction \\ \hline
\end{tabular}
\end{center}
\caption{PLS surveyed in this work}\label{fig:psw}
\end{figure}

\subsection{Instance Selection}

\subsubsection{Condensed Nearest Neighbor (CNN)}
Different PLS use various rules to determine which instances in a training set are worthy choices as prototypes. Each also tends to focus on specific goals, such as increasing speed or performance, or reducing storage. CNN \cite{Hart68a} uses an \emph{incremental} strategy: it initializes a random subset of prototypes and adds to that list. Hart's goal with CNN was storage reduction, with the aim of creating a minimal consistent set, i.e. a smallest subset of the complete set that also classifies all the original instances correctly.

\fig{cnn} shows the pseudo-code for CNN, which begins by randomly selecting one instance from each target class and storing them as prototypes in a list. These prototypes are then used to classify (using the 1NN rule) the instances in the training set. If any of these instances are misclassified, they are added to the prototype list. This process is repeated until the prototype list can no longer be increased.

Admittedly, although a reduction of the training set with consistency was accomplished, Hart did not achieve his goal of a minimal consistent set with CNN. CNN also suffers from the following disadvantages:

\bi
\item sensitive to the initial order of input data
\item sensitive to noise which can degrade performance
\ei



\begin{figure}[h!]
\small
\begin{center}
\begin{tabular}{ p{10cm} }
\hline
\begin{verbatim}
INITIAL_PROTOTYPES = [RANDOM(FROM EACH TARGET CLASS)]
PREV = []
CUR = [INITIAL_PROTOTYPES]

REPEAT UNTIL PREV = CUR
  MISCLASSIFIED = CLASSIFY TRAIN USING CUR, RETURN MISCLASSIFIED INSTANCES
  PREV = CUR
  CUR = CUR + MISCLASSIFIED
END
\end{verbatim}
 \\ \hline
    \end{tabular}
\end{center}
\caption{Pseudo-code for CNN}\label{fig:cnn}
\end{figure} 
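For concreteness, the condensation loop above can be sketched in Python. This is a minimal illustration using 1NN with squared Euclidean distance over (features, label) pairs; the function names and data layout are ours, not Hart's:

```python
import random

def nn_classify(prototypes, x):
    """1NN: return the label of the prototype nearest to x."""
    return min(prototypes,
               key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[1]

def cnn(train, seed=0):
    """Condensed Nearest Neighbor over (features, label) pairs."""
    rng = random.Random(seed)
    by_class = {}
    for inst in train:
        by_class.setdefault(inst[1], []).append(inst)
    # one random prototype per target class
    cur = [rng.choice(insts) for insts in by_class.values()]
    grew = True
    while grew:
        grew = False
        for inst in train:
            # absorb any instance the current prototypes misclassify
            if inst not in cur and nn_classify(cur, inst[0]) != inst[1]:
                cur.append(inst)
                grew = True
    return cur
```

Note that the result is consistent (it classifies all of TRAIN correctly) but, as discussed above, not guaranteed to be minimal, and it depends on the order in which instances are absorbed.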

\subsubsection{Reduced Nearest Neighbor (RNN)}
RNN \cite{Gates72} takes the opposite approach to CNN: its strategy is \emph{decremental}. Rather than starting with a subset of the training set as CNN does, RNN uses the entire training set as the initial prototypes and reduces the list. It is computationally more expensive than CNN, but always produces a subset of a CNN result.

The algorithm begins by setting the initial prototypes to the entire training set. From here, a prototype is removed if and only if its removal does not cause the misclassification of any instance in the training set. The procedure stops when no more prototypes can be removed from the prototype list. \fig{rnn} shows the pseudo-code for RNN.

\begin{figure}[h!]
\small
\begin{center}
\begin{tabular}{ p{10cm} }
\hline
\begin{verbatim}
PREV = []
CUR = [TRAIN]

REPEAT UNTIL PREV = CUR
  PREV = CUR
  FOR EACH p IN CUR
    IF (CUR - p) DOES NOT cause misclassification of TRAIN
      CUR = CUR - p
    END
  END
END
\end{verbatim}
 \\ \hline
    \end{tabular}
\end{center}
\caption{Pseudo-code for RNN}\label{fig:rnn}
\end{figure}  
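The decremental pruning above can be sketched in Python in the same style as the CNN sketch (again a minimal illustration with our own naming; a prototype is dropped only when the remaining prototypes still classify every training instance correctly):

```python
def nn_classify(prototypes, x):
    """1NN: return the label of the prototype nearest to x."""
    return min(prototypes,
               key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[1]

def rnn(train):
    """Reduced Nearest Neighbor: start from the full training set and drop
    a prototype whenever the rest still classify all of train correctly."""
    cur = list(train)
    changed = True
    while changed:
        changed = False
        for p in list(cur):
            trial = [q for q in cur if q is not p]
            if trial and all(nn_classify(trial, x) == y for x, y in train):
                cur = trial  # removal caused no misclassification
                changed = True
    return cur
```

Each candidate removal triggers a full consistency check over the training set, which is the source of RNN's higher computational cost relative to CNN.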


\subsubsection{Minimal Consistent Set (MCS)}
The goal of MCS is to achieve what Hart \cite{Hart68a} could not: a minimal consistent set. MCS uses a voting strategy which favors instances that have the greatest number of like instances (in other words, instances with the same class) closer to them than unlike instances.

As explained in \cite{Dasarathy94}, MCS takes a \emph{decremental} approach, as RNN does: at first, the entire training set is taken as the initial prototypes. Then, for each of these prototypes, the distance to its nearest unlike neighbor (NUN), i.e. the nearest neighbor with a different class, is found. Next, all nearest like neighbors (NLN) of this prototype whose distances from the prototype are less than that of the NUN are stored. Each of these NLN is counted as a vote toward the prototype's candidacy as a final prototype. The prototype with the most votes is then designated as a candidate, and all NLN that contributed to the vote are removed from candidacy consideration, as well as from the voting lists of other prototypes.

With the votes now updated, the process is repeated by designating the prototype that now has the most votes as a candidate. This process continues until the list can no longer be reduced. Further, since the goal of MCS is to find the minimal consistent set, the entire strategy is iterative: the reduced list is used as input for the next iteration, starting with finding the distance to the NUN for each prototype. \fig{mcs1} shows the pseudo-code for MCS.

\begin{figure}[h!]
\small
\begin{center}
\begin{tabular}{ p{10cm} }
\hline
\begin{verbatim}
PREV = []
CUR = [TRAIN]
P* = []   \\ CANDIDATE PROTOTYPE

REPEAT UNTIL PREV = CUR
 PREV = CUR
 CUR = MCS(CUR)
END

MCS(CUR)
REPEAT UNTIL CUR can no longer be reduced
  FOR EACH c IN CUR
   DISTANCE_NUN = DISTANCE(c, NUN(c))
   VOTER_LIST(c) = NLN of c CLOSER THAN DISTANCE_NUN
   VOTES(c) = COUNT(VOTER_LIST(c))
  END

  P* = c WITH MAX(VOTES(c))
  CUR = CUR - VOTER_LIST(P*)
END
\end{verbatim}
 \\ \hline
    \end{tabular}
\end{center}
\caption{Pseudo-code for MCS}\label{fig:mcs1}
\end{figure}  
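The voting step can be sketched in Python. This is a simplified single pass under our own naming (the full MCS repeats such passes on the reduced set until it converges); we assume at least two classes and distinct points, so every instance votes for itself, since its distance to itself is less than its NUN distance:

```python
def mcs_pass(train):
    """One MCS-style voting pass over (features, label) pairs: repeatedly
    keep the most-voted instance and drop the like neighbors it covers."""
    def d2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    n = len(train)
    # squared distance to each instance's nearest unlike neighbor (NUN)
    nun = [min(d2(train[i][0], train[j][0])
               for j in range(n) if train[j][1] != train[i][1])
           for i in range(n)]
    # voters for i: like instances (including i) closer to i than i's NUN
    voters = [{j for j in range(n)
               if train[j][1] == train[i][1]
               and d2(train[i][0], train[j][0]) < nun[i]}
              for i in range(n)]
    kept, uncovered = [], set(range(n))
    while uncovered:
        # designate the instance covering the most still-uncovered voters
        best = max(range(n), key=lambda i: len(voters[i] & uncovered))
        kept.append(train[best])
        uncovered -= voters[best]
    return kept
```

The loop terminates because every uncovered instance still votes for itself, so each designated candidate removes at least one instance from consideration.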

\subsubsection{Prototype Selection by Clustering (PSC)}
Unlike the previous methods, PSC uses clusters to aid prototype selection. It is based on the argument that ``...in a training set interior prototypes can be removed with little effect on classification accuracy; and they also argue that border prototypes are critical for classification since they provide relevant information for preserving discrimination between classes'' \cite{lot2010}. Accordingly, PSC selects border prototypes along with some internal prototypes.

As explained in \cite{lot2010}, PSC begins by dividing the training data into regions using a clustering algorithm. Although $Cmeans$ is used in their work, it is made clear that any clustering method can be used in its place and so in our experiments for this thesis, $Kmeans$ is our algorithm of choice. 

After clustering, the internal prototypes are found. This is done by first finding the homogeneous clusters (those in which all instances have the same target class). Once these are identified, the instance nearest to the centroid of each such cluster is selected as a prototype.

Now, to find the border prototypes, the non-homogeneous clusters are identified. Within each, the instances with the majority target class are found, and the border prototypes for this class are those closest to instances of a different target class. The same is done for the instances of each other class. \fig{psc} shows the pseudo-code for PSC.

\begin{figure}[h!]
\small
\begin{center}
\begin{tabular}{ p{10cm} }
\hline
\begin{verbatim}
CLUSTERS = KMEANS(TRAIN)
PROTOTYPES = []

FOR EACH cluster IN CLUSTERS
  IF cluster IS homogeneous
    CENTROID = MEAN(cluster)
    PROTOTYPES = PROTOTYPES + INSTANCE NEAREST CENTROID
  ELSE
    LET CM BE MAJORITY CLASS IN cluster
    FOR EACH CLASS CK IN cluster (CK NOT= CM)
      FOR EACH INSTANCE_K IN CK
        IF INSTANCE_M IN CM IS CLOSEST TO INSTANCE_K
          PROTOTYPES = PROTOTYPES + INSTANCE_M
        IF INSTANCE_K IN CK IS CLOSEST TO INSTANCE_M
          PROTOTYPES = PROTOTYPES + INSTANCE_K
      END
    END
  END
END
RETURN PROTOTYPES

\end{verbatim}
 \\ \hline
    \end{tabular}
\end{center}
\caption{Pseudo-code for PSC}\label{fig:psc}
\end{figure}  
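The selection step can be sketched in Python. The clustering step is omitted: any clustering algorithm (e.g. K-means) is assumed to have already produced `clusters`, a list of lists of (features, label) pairs. The function name `psc_select` is ours, and the border rule below is a simplification that pairs each minority instance with its closest majority-class instance:

```python
def psc_select(clusters):
    """PSC-style selection: a centroid-nearest prototype per homogeneous
    cluster, and cross-class border pairs in heterogeneous clusters."""
    def d2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    prototypes = []
    for cluster in clusters:
        labels = {y for _, y in cluster}
        if len(labels) == 1:
            # homogeneous: keep the instance nearest the centroid
            dim = len(cluster[0][0])
            centroid = tuple(sum(x[k] for x, _ in cluster) / len(cluster)
                             for k in range(dim))
            prototypes.append(min(cluster,
                                  key=lambda inst: d2(inst[0], centroid)))
        else:
            # heterogeneous: keep border instances of each class
            majority = max(labels,
                           key=lambda c: sum(y == c for _, y in cluster))
            majority_insts = [inst for inst in cluster if inst[1] == majority]
            for inst in cluster:
                if inst[1] == majority:
                    continue
                closest = min(majority_insts,
                              key=lambda m: d2(m[0], inst[0]))
                for p in (inst, closest):
                    if p not in prototypes:
                        prototypes.append(p)
    return prototypes
```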


\subsection{Instance Abstraction}
\subsubsection{The Chang and Modified Chang Algorithms}
Chang's \cite{Chang74} work on finding prototypes for nearest neighbor classifiers is one of the earliest abstraction methods in the literature. The basic idea of his algorithm is to start with every sample in the training set as a prototype, then merge any two nearest prototypes of the same class ($p_1$ and $p_2$, forming $p^*$) as long as the recognition rate is not degraded. The new prototype $p^*$ can be formed by simply averaging $p_1$ and $p_2$, or by taking their weighted average (Chang uses the weighted average). $p^*$ has the same class as the individual prototypes $p_1$ and $p_2$. The merging process continues until the number of incorrect classifications of patterns in the data set starts to increase. \fig{chang} shows Chang's algorithm.

The Modified Chang Algorithm (MCA) \cite{Bezdek98} was proposed by Bezdek in 1998. In this approach, Chang's algorithm is modified in two ways: 1) the arithmetic mean is used rather than the weighted mean to merge prototypes; 2) the search for prototypes to merge is improved by partitioning the training set by class label and searching only within these partitions for the nearest prototypes to merge. This eliminates the need to check that candidate prototypes have the same class.

\begin{figure}[h!]
\small
\begin{center}
\begin{tabular}{ p{8cm} }
\hline
\begin{verbatim}
Step 1: Start with an arbitrary point tj
        in B* and assign it to A*
Step 2: For all points tk in B* such that
        class (tk) is not equal to class (tj),
        update bk to be the distance between
        tk and tj if this distance is smaller
        than the present bk. Otherwise, bk is
        unchanged.
Step 3: Among all the points in B*, find the
        point ts which has the smallest bs
        associated with it.
Step 4: If tj is not the nearest point ts such
        that the classes of tj and ts are
        different, go to Step 6. Otherwise
        continue.
Step 5: Check whether or not d(tj, ts) is less
        than bj. If no, go to Step 6. If yes,
        let bj=d(tj, ts) and continue.
Step 6: Let j=s, move ts from B* to A*, and go
        to Step 2 until B* is empty. When B* is
        empty, the final b1,...,bm are the
        desired ones.
\end{verbatim}
 \\ \hline
    \end{tabular}
\end{center}
\caption{Chang's algorithm for finding prototypes}\label{fig:chang}
\end{figure} 
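The merging loop can be sketched in Python, using the arithmetic mean as in the MCA and a 1NN consistency check over the training set standing in for the recognition-rate test; the names below are ours:

```python
def chang_merge(train):
    """MCA-style merging over (features, label) pairs: repeatedly merge
    the closest same-class pair, keeping a merge only if the merged set
    still classifies the whole training set correctly."""
    def d2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    def consistent(protos):
        return all(min(protos, key=lambda p: d2(p[0], x))[1] == y
                   for x, y in train)
    protos = list(train)
    merged = True
    while merged:
        merged = False
        # same-class candidate pairs, nearest first
        pairs = sorted((d2(p[0], q[0]), i, j)
                       for i, p in enumerate(protos)
                       for j, q in enumerate(protos)
                       if i < j and p[1] == q[1])
        for _, i, j in pairs:
            p, q = protos[i], protos[j]
            # arithmetic-mean merge, keeping the shared class label
            star = (tuple((u + v) / 2 for u, v in zip(p[0], q[0])), p[1])
            trial = [r for k, r in enumerate(protos)
                     if k not in (i, j)] + [star]
            if consistent(trial):
                protos = trial
                merged = True
                break  # restart the pair search on the reduced set
    return protos
```

Each accepted merge shrinks the prototype list by one, and the consistency check before every merge is precisely the cost discussed in the summary below.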


\subsubsection{Learning Vector Quantization (LVQ)}

Kohonen \cite{Kohonen90} introduced learning vector quantization (LVQ) in 1990 to create prototypes for the 1NN classifier using error correction rules. LVQ is a set of learning algorithms for nearest prototype classification. Its basic algorithm, LVQ1, works by first randomly selecting a certain number of prototypes from each class as initial prototypes, which ensures that each class is represented by at least one prototype. These initial prototypes are then updated using the training set, the basic idea being that prototypes are attracted to training points with the same class label and repelled by those with different class labels. For an input instance $x$, the closest prototype among the $m_i$, denoted $m_c$, is found. \eq{lvq1} to \eq{lvq} define the basic LVQ1 process for updating the prototypes \cite{Kohonen90}. Here $\alpha(t)$ is a learning rate between 0 and 1 which decreases with time.

\begin{equation}
m_c(t + 1) = m_c(t) + \alpha(t)[x(t) - m_c(t)] 
\label{eq:lvq1}
\end{equation}
if $x$ and $m_c$ have the same class,

\begin{equation}
m_c(t + 1) = m_c(t) - \alpha(t)[x(t) - m_c(t)] 
\label{eq:lvq2}
\end{equation}
if $x$ and $m_c$ have different classes, and

\begin{equation}
m_i(t + 1) = m_i(t) 
\label{eq:lvq}
\end{equation}
for $i \neq c$.
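A single LVQ1 update can be sketched in Python (a minimal encoding of \eq{lvq1} to \eq{lvq} under our own naming; prototypes are (vector, label) pairs and \texttt{alpha} stands for $\alpha(t)$):

```python
def lvq1_step(prototypes, x, y, alpha):
    """One LVQ1 update: move the closest prototype toward x if the classes
    match, away from x otherwise; all other prototypes are unchanged."""
    def d2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    # index c of the prototype m_c closest to x
    c = min(range(len(prototypes)), key=lambda i: d2(prototypes[i][0], x))
    m, label = prototypes[c]
    sign = 1.0 if label == y else -1.0  # attract on match, repel otherwise
    updated = tuple(mi + sign * alpha * (xi - mi) for mi, xi in zip(m, x))
    return [(updated, label) if i == c else p
            for i, p in enumerate(prototypes)]
```

In practice this step is applied repeatedly over the training set while $\alpha(t)$ is decreased toward zero.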

\subsection{Summary}
A troublesome feature of all the above research is the computational cost of the proposed instance reduction methods. For instance, CNN requires that for each new prototype added to the list, consistency be checked to ensure that the new addition does not degrade the classification of the training set. Similarly, the Chang and Modified Chang algorithms have a consistency issue in that before a new prototype can be created via merging, consistency must be checked. PSC, on the other hand, has a three-step process which contributes to its computational cost: first the clustering of the training set, followed by testing whether clusters are homogeneous or heterogeneous, and finally the task of finding the border prototypes of heterogeneous clusters. In the following chapter we propose a novel instance reduction method with enviable computational cost.

%\section{Evaluation of Prototype Learning Schemes}
%In order to compare the performance of different PLS, most researchers in the field focus on a few criteria, namely - the reduction, accuracy, noise tolerance and speed. To test the performance of CLIFF (discussed in Chapter \ref{chapter:cliff}, the probability of detection (pd) and the probability of false alarm (pf) are used.

%\subsection{Reduction}
%Recall that one of the main goals of PLS is to avoid having the store the entire training set during the operation of predicting an unknown instance. So an ideal prototype learning scheme will significantly reduce the size of the training set while maintaining or improving on the accuracy of the classifier.

%\subsection{Speed Increase}

%\subsection{Generalization Accuracy}

%\subsection{Noise Tolerance}

%\subsection{Probability of Detection and False Alarm}
