\section{CLIFF: Tool for Instance Selection}
\label{subsection:selector}
CLIFF is a prototype learning tool based on a novel prototype learning algorithm defined by its purpose: to reduce a training set, via various selection and/or creation methods, to a set of \emph{good} prototypes which increase the distance between NUNs and NLNs. These \emph{good} prototypes are therefore a representation of the original training data set that maintains comparable or improved classification performance.

CLIFF selects the samples from a training set that best represent their respective target classes. To accomplish this, a Bayesian ranking measure is combined with a support measure. First, we treat one class as \emph{best} and all other classes as \emph{rest}. This makes it easy to find the attribute ranges whose values have a high probability of belonging to the current \emph{best} class. The attribute ranges are found with an equal frequency binning algorithm, which sorts attribute values into $N$ regions of roughly equal frequency (making sure that no repeated value straddles the boundary between two regions). The rank values are generated by applying \eq{one} to each attribute range. The remainder of this section discusses how the attribute ranges and their corresponding ranks are generated.

\begin{equation}
P(best|E) \times support(best|E) = \frac{like(best|E)^2}{like(best|E) + like(rest|E)}
\label{eq:one}
\end{equation}
where $like$ abbreviates likelihood.
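As a concrete illustration, the rank in \eq{one} can be computed directly from the two likelihoods. The Python sketch below is ours (the function name and sample likelihood values are not part of CLIFF); it shows how the squared numerator acts as a support term:

```python
def rank(like_best, like_rest):
    """Eq. (1): the squared likelihood of best, normalised by the total
    likelihood. Squaring the numerator rewards ranges that occur often
    in the best class (the support term)."""
    return like_best ** 2 / (like_best + like_rest)

# A range common in best and rare in rest scores high ...
high = rank(0.4, 0.1)   # 0.16 / 0.5 = 0.32
# ... while one equally common in both classes is penalised.
low = rank(0.4, 0.4)    # 0.16 / 0.8 = 0.2
```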

\begin{figure}[h!]
\small
\begin{center}
\begin{tabular}{ p{8cm} }
\hline
\begin{verbatim}
        
    Outlook   Temp(F)  Humidity Windy? Class
01. Sunny      69        70     False  Lots
02. Sunny      75        70     True   Lots
03. Overcast   83        88     False  Lots
04. Overcast   64        65     True   Lots
05. Overcast   72        90     True   Lots
06. Overcast   81        75     False  Lots
07. Sunny      85        86     False  None
08. Sunny      80        90     True   None
09. Sunny      72        95     False  None
10. Rain       65        70     True   None
11. Rain       71        96     True   None
12. Rain       70        96     False  Some
13. Rain       68        80     False  Some
14. Rain       75        80     False  Some
\end{verbatim}
 \\ \hline
    \end{tabular}
\end{center}
\caption{A log of some golf-playing behavior}\label{fig:golf}
\end{figure} 

The algorithm used in this work is a Support-Based Bayesian Ranking (SBBR) algorithm. Its pseudocode is shown in \fig{sbbra}. The \emph{best-rest} function divides the data set $D$ into $best$ and $rest$ (for example, in \fig{golf}, if class $Lots$ is selected as $best$ then samples 1-6 form the best set while samples 7-14 form the rest set). As explained earlier, the $EqualFrequencyBinning$ (EFB) function finds the attribute ranges for each attribute in the data set. For instance, sorting the attribute values for $Humidity$ into three (3) bins generates the following: \newline \emph{65 70 70 70 | 75 80 80 86 | 88 90 90 95 96 96}.
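One plausible reading of this binning step is sketched below in Python; the boundary-shifting rule for repeated values is our assumption about how ties are handled, chosen so that the $Humidity$ example above is reproduced:

```python
def equal_freq_bins(values, n_bins):
    """Sort values and split them into n_bins roughly equal-frequency bins,
    shifting a boundary right while it would split a run of repeats."""
    vals = sorted(values)
    size = len(vals) // n_bins
    bins, start = [], 0
    for _ in range(n_bins - 1):
        end = start + size
        # avoid a repeated value straddling the boundary
        while end < len(vals) and vals[end] == vals[end - 1]:
            end += 1
        bins.append(vals[start:end])
        start = end
    bins.append(vals[start:])  # last bin takes the remainder
    return bins

# Humidity column from the golf data, samples 1-14
humidity = [70, 70, 88, 65, 90, 75, 86, 90, 95, 70, 96, 96, 80, 80]
bins = equal_freq_bins(humidity, 3)
# reproduces the split: 65 70 70 70 | 75 80 80 86 | 88 90 90 95 96 96
```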

Once the ranges for each attribute are found, the rank computation in the innermost loop of \fig{sbbra} is applied to each range to produce a rank value. This rank value represents the probability that a sample containing a value from a particular attribute range belongs to the $best$ class.

Finally, to choose the samples that best represent a class, a criterion is established for each class: samples containing attribute-range values with the highest $r$ rank values of the highest $n$ attributes are selected. For example, looking at the golf data set in \fig{golf}, if $r=1$ and $n=4$, the criterion for choosing the best samples in the $Lots$ class could be [Overcast, 81-83, 75-88, False]. With this criterion, samples 3 and 6 would be selected. Note that the attributes are ordered from most important to least important according to the highest rank value found among each attribute's ranges; the highest $n$ attributes are therefore chosen in this order.
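Applying such a criterion is a per-attribute membership test. The sketch below transcribes the $Lots$ rows of \fig{golf} and the example criterion; the helper name and tuple encoding are ours:

```python
# (id, Outlook, Temp, Humidity, Windy) for the Lots class of the golf data
lots = [(1, "Sunny", 69, 70, False), (2, "Sunny", 75, 70, True),
        (3, "Overcast", 83, 88, False), (4, "Overcast", 64, 65, True),
        (5, "Overcast", 72, 90, True), (6, "Overcast", 81, 75, False)]

def matches(row, outlook, temp_rng, hum_rng, windy):
    """True when every attribute falls in its criterion value/range."""
    _, o, t, h, w = row
    return (o == outlook and temp_rng[0] <= t <= temp_rng[1]
            and hum_rng[0] <= h <= hum_rng[1] and w == windy)

# The r=1, n=4 criterion from the text: [Overcast, 81-83, 75-88, False]
selected = [r[0] for r in lots
            if matches(r, "Overcast", (81, 83), (75, 88), False)]
# selected == [3, 6], the two samples named in the text
```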


\begin{figure}[h!]
\small
\begin{center}
\begin{tabular}{ p{8cm} }
\hline
\begin{verbatim}
DATA   = [3, 5, 10, 20]
A      = [Attributes]
BEST   = [Instances with current best class]
REST   = [Instances with other classes]
FREQ   = [frequency]
LIKE   = [likelihood]
EFB    = [EqualFrequencyBinning]
BIN    = [Values within attribute range]
P_BEST = [Prior probability of BEST]
P_REST = [Prior probability of REST]

FOR EACH data IN DATA
  BIN_DATA = EFB on data
  FOR EACH attribute IN A
    FOR EACH BIN IN attribute
      P_BEST = count(BEST) / count(data)
      P_REST = count(REST) / count(data)
      FREQ(BIN|BEST) = count(BIN in BEST) / count(BEST)
      FREQ(BIN|REST) = count(BIN in REST) / count(REST)
      LIKE(BEST|BIN) = FREQ(BIN|BEST) x P_BEST
      LIKE(REST|BIN) = FREQ(BIN|REST) x P_REST
      LIKE_BEST_REST = LIKE(BEST|BIN) + LIKE(REST|BIN)
      RANK = LIKE(BEST|BIN)^2 / LIKE_BEST_REST
      OUTPUT [BIN, RANK]
    END
  END
END
\end{verbatim}
 \\ \hline
    \end{tabular}
\end{center}
\caption{Pseudo code for Support Based Bayesian Ranking algorithm}\label{fig:sbbra}
\end{figure}       
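To make the SBBR loop of \fig{sbbra} concrete, the sketch below runs the rank computation for the $Humidity$ attribute of \fig{golf} with $Lots$ as $best$. The bin boundaries follow the three-bin split given earlier; the helper names and the $(low, high)$ range encoding are ours:

```python
# Humidity values from the golf data, split into best (Lots, samples 1-6)
# and rest (samples 7-14), as the best-rest function would produce.
best = [70, 70, 88, 65, 90, 75]
rest = [86, 90, 95, 70, 96, 96, 80, 80]
n = len(best) + len(rest)                      # count(data) = 14

# The three equal-frequency bins for Humidity, kept as (low, high) ranges.
bins = [(65, 70), (75, 86), (88, 96)]

def in_bin(v, rng):
    """Membership test: value falls inside the attribute range."""
    return rng[0] <= v <= rng[1]

p_best = len(best) / n                         # P_BEST = 6/14
p_rest = len(rest) / n                         # P_REST = 8/14

ranks = {}
for rng in bins:
    freq_b = sum(in_bin(v, rng) for v in best) / len(best)  # FREQ(BIN|BEST)
    freq_r = sum(in_bin(v, rng) for v in rest) / len(rest)  # FREQ(BIN|REST)
    like_b = freq_b * p_best                   # LIKE(BEST|BIN)
    like_r = freq_r * p_rest                   # LIKE(REST|BIN)
    ranks[rng] = like_b ** 2 / (like_b + like_r)            # RANK

# The 65-70 range ranks highest for Lots: 9/56, vs 1/56 and 1/21.
```

Note that the low-humidity range scores highest for $Lots$ because three of its four occurrences fall in the best set, which is what the support term in \eq{one} rewards.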
