The PLSs discussed in the previous chapter tend to be based on misclassification, clustering, stochastic search and so on. Rarely does one come across a PLS that considers the idea that certain ranges of attribute values can be critical in selecting prototypes for each class. In other words, we consider applying techniques practiced in the field of Feature Subset Selection (FSS) to instance selection.

CLIFF was born from this idea: a prototype is selected depending on whether each of its values falls within a range of values determined to be critical for distinguishing a particular class. The following sections explain the design of CLIFF in detail, starting with its core, the Support Based Bayesian Ranking algorithm (SBBR), and ending with CLIFF's time complexity. %Finding these critical range of values is the core of CLIFF and is accomplished by using a Support Based Bayesian Ranking Algorithm (SBBR) \cite{jalali08}. 

Before moving forward it is important to note that if a data-set contains numeric attributes, their values are binned using an equal frequency binning algorithm \cite{witten05}, which sorts attribute values into $N$ equal frequency regions while ensuring that no number is repeated among the boundaries between regions. We refer to the values in a bin as a \emph{range of values}, represented by a bin number. If an attribute contains discrete values, these are simply referred to as \emph{values}. In the experiments in Chapter \ref{chapter:assess}, $N = 10$.
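As an illustration of this preprocessing step, the following Python sketch (function names and data layout are our own, not CLIFF's actual implementation) splits a numeric column into roughly equal frequency regions while skipping repeated boundary numbers:

```python
def equal_frequency_bins(values, n=10):
    """Return bin boundaries for ~equal-frequency binning.

    Repeated numbers are never used as boundaries, so bins with
    heavily tied values may hold more than len(values)/n items.
    """
    ordered = sorted(values)
    step = max(1, len(ordered) // n)
    boundaries = []
    for i in range(step, len(ordered), step):
        b = ordered[i]
        if not boundaries or b != boundaries[-1]:  # no repeated boundaries
            boundaries.append(b)
    return boundaries

def to_bin(x, boundaries):
    """Map a value to its 0-based bin number."""
    for i, b in enumerate(boundaries):
        if x < b:
            return i
    return len(boundaries)
```

With $N = 10$, each numeric attribute is reduced to a small set of bin numbers, which the rest of CLIFF then treats like discrete values.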

\section{The Support Based Bayesian Ranking Algorithm (SBBR)}
\label{section:selector}

The core of CLIFF is to find, for each attribute, the ranges of values most likely to be present for a particular class. To find these ranges, CLIFF uses a Bayesian ranking measure that includes a support measure (SBBR). First, the classes are divided so that the target class is labeled $best$ and all other classes are labeled $rest$ \cite{jalali08}. This makes it easy to find the attribute values with a high probability of belonging to the current $best$ class using Bayes theorem. The theorem uses evidence $E$ and a prior probability $P(H)$ for hypothesis $H \in \{best, rest\}$ to calculate a posterior probability $P(H|E) = P(E|H)P(H)/P(E)$. When applying the theorem, likelihoods are computed from observed frequencies and then normalized to create probabilities; this normalization cancels out $P(E)$ in Bayes theorem (see \eq{one1}).

\begin{verbatim}
Let likelihood = like
\end{verbatim}
\begin{equation}
P(best|E) = \frac{like(best|E)}{like(best|E) + like(rest|E)} 
\label{eq:one1}
\end{equation}

Unfortunately, one problem arises when using the theorem: according to \cite{jalali08}, Bayes theorem is a poor ranking heuristic because it is distracted by low frequency evidence. To alleviate this problem, a support measure is introduced. Support should increase as the frequency of a value increases, and since $like(best|E)$ does exactly that, it is a valid support measure; hence \eq{one}.
 
\begin{equation}
P(best|E) \times support(best|E) = \frac{like(best|E)^2}{like(best|E) + like(rest|E)} 
\label{eq:one}
\end{equation}
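To make the two measures concrete, here is a small Python sketch (our own illustration; the raw counts are the only inputs) that computes \eq{one1} and \eq{one}:

```python
def likelihoods(e_best, n_best, e_rest, n_rest):
    """Likelihoods like(best|E) and like(rest|E) from raw counts.

    e_best / e_rest: times evidence E occurs in best / rest;
    n_best / n_rest: sizes of the best / rest partitions.
    """
    n = n_best + n_rest
    like_best = (e_best / n_best) * (n_best / n)  # freq(E|best) * P(best)
    like_rest = (e_rest / n_rest) * (n_rest / n)  # freq(E|rest) * P(rest)
    return like_best, like_rest

def bayes_rank(e_best, n_best, e_rest, n_rest):
    """Plain Bayes ranking: P(best|E)."""
    lb, lr = likelihoods(e_best, n_best, e_rest, n_rest)
    return lb / (lb + lr)

def sbbr_rank(e_best, n_best, e_rest, n_rest):
    """Support-weighted ranking: like(best|E)^2 / (like(best|E) + like(rest|E))."""
    lb, lr = likelihoods(e_best, n_best, e_rest, n_rest)
    return lb ** 2 / (lb + lr)
```

For example, a value seen once in $best$ and never in $rest$ earns a perfect Bayes score of 1.0 but only a small SBBR rank, which is exactly the damping of low frequency evidence described above.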

So in this work, each value or range of values of an attribute is ranked according to \eq{one}. Once ranked, the critical ranges for each attribute (those with the highest ranks) are extracted and used as the criterion for selecting instances from the current $best$ class. The whole process is repeated using a different class as $best$ and all the others (including the previous $best$ class) as $rest$. The process ends when a criterion has been found for each class. \fig{sbbra} shows the pseudocode for ranking each value of each attribute over an entire training data set.

\begin{figure}[h!]
\small
\begin{center}
\begin{tabular}{ p{8cm} }
\hline
\begin{verbatim}
A = [Attributes]
BEST = [Instances with current best class]
REST = [Instances with other classes]
FREQ = [frequency]
LIKE = [likelihood]
EFB = [EqualFrequencyBinning]
BIN = [Values within Attribute range]
P_BEST = [Prior probability of BEST]
P_REST = [Prior probability of REST]

BIN_DATA = EFB on data
P_BEST = count(BEST) / count(data)
P_REST = count(REST) / count(data)
FOR EACH attribute IN A
  FOR EACH BIN IN attribute
    FREQ(BIN|BEST) = count(BIN in BEST) / count(BEST)
    FREQ(BIN|REST) = count(BIN in REST) / count(REST)
    LIKE(BEST|BIN) = FREQ(BIN|BEST) x P_BEST
    LIKE(REST|BIN) = FREQ(BIN|REST) x P_REST
    LIKE_BEST_REST = LIKE(BEST|BIN) + LIKE(REST|BIN)
    RANK = LIKE(BEST|BIN)^2 / LIKE_BEST_REST
    RECORD [BIN, RANK]
  END
END
RETURN all [BIN, RANK] pairs
\end{verbatim}
 \\ \hline
    \end{tabular}
\end{center}
\caption{Pseudo code for Support Based Bayesian Ranking algorithm for finding the rank of each value in each attribute.}\label{fig:sbbra}
\end{figure}
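Assuming binning has already been applied, the ranking loop of \fig{sbbra} could be read as the following Python sketch (the data layout, a list of attribute-to-value dicts with a class key, is our own choice, not CLIFF's actual code):

```python
from collections import Counter

def rank_values(instances, attributes, best_class):
    """Rank every (attribute, value) pair for one best-vs-rest split.

    instances: list of dicts mapping attribute -> value, plus a "class" key.
    Returns {(attribute, value): rank} using the SBBR measure.
    """
    best = [i for i in instances if i["class"] == best_class]
    rest = [i for i in instances if i["class"] != best_class]
    p_best = len(best) / len(instances)   # prior probability of best
    p_rest = len(rest) / len(instances)   # prior probability of rest
    ranks = {}
    for a in attributes:
        best_counts = Counter(i[a] for i in best)
        rest_counts = Counter(i[a] for i in rest)
        for v in set(best_counts) | set(rest_counts):
            like_best = (best_counts[v] / len(best)) * p_best
            like_rest = (rest_counts[v] / len(rest)) * p_rest
            ranks[(a, v)] = like_best ** 2 / (like_best + like_rest)
    return ranks
```

Repeating this call once per class, with that class as `best_class`, yields the ranks needed to build each class's criterion.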


\section{Instance Selection Using Criteria}

Once the criterion for each class has been found, the question is: how do we use it to select instances? Applying the entire criterion at once could result in zero instances being selected. To avoid this, we apply one critical range at a time until either no further instances can be removed or the number of instances selected falls between 0 and 15\% of the number of instances in the class. Another consideration is the order in which the critical ranges are applied; they are sorted in descending order according to their rank values. \fig{isc} shows the pseudocode for this process.

\begin{figure}[h!]
\small
\begin{center}
\begin{tabular}{ p{8cm} }
\hline
\begin{verbatim}
CRITERIA = [LIST OF CRITICAL RANGES 
            SORTED FROM HIGH TO LOW RANK]
N = NUMBER OF INSTANCES IN CLASS
PROTOTYPES = [INSTANCES IN CLASS]

FOR EACH c IN CRITERIA
  PREVIOUS = PROTOTYPES
  PROTOTYPES = [INSTANCES IN PROTOTYPES WITH c]
  IF PROTOTYPES = EMPTY
    PROTOTYPES = PREVIOUS
    BREAK
  IF |PROTOTYPES| <= 15% OF N
    BREAK
END
RETURN PROTOTYPES
      
\end{verbatim}
 \\ \hline
    \end{tabular}
\end{center}
\caption{Instance Selection Using Criteria}
\label{fig:isc}
\end{figure} 
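One possible Python reading of \fig{isc} (the names, data layout, and exact stopping test are our own assumptions): apply the sorted critical ranges one at a time, keep the previous selection whenever a range would select nothing, and stop once at most 15\% of the class remains:

```python
def select_prototypes(instances, criteria, target=0.15):
    """Filter a class's instances by sorted (attribute, value) criteria.

    criteria: [(attribute, value), ...] sorted from high to low rank.
    Stops when a criterion would select nothing (keeping the previous
    result) or once the selection shrinks to at most target * N.
    """
    prototypes = list(instances)
    limit = target * len(instances)
    for attr, value in criteria:
        narrowed = [i for i in prototypes if i[attr] == value]
        if not narrowed:            # would empty the set: keep previous result
            break
        prototypes = narrowed
        if len(prototypes) <= limit:  # reached the 0-15% band
            break
    return prototypes
```

On the golf example of the next section, this sketch narrows the nine $yes$ instances to three, then two, then stops.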


%Once the ranks of the attribute values are found the critical ranges are determined 

%\be
%\item The highest rank coupled with their value are found for each attribute;
%\item These are then sorted from highest to lowest rank - this forms the criteria for the current $best$ class (bc) and are stored as [attribute, value] pairs;
%\item Taking the result from step 2, the first [attribute, value] pair in the criteria is used to select instances from the instances that make up the current best class whose [attribute, value] is the same;
%\item If the number of instances selected is greater than 15\% of $|$bc$|$, then the next [attribute, value] pair from the criteria is used to further reduce this group; 
%\item If the previous step results in an empty set, then the previous result is returned;
%\item The above steps are applied for each class until the original training set has been reduced to between 0 and 15\% or as close enough to 15\% to avoid getting an empty set.
%\ee

%In the end an instance is only selected if certain attribute values meet the criteria.  The following section clarifies CLIFF by using a simple example.


\section{CLIFF: A Simple Example}

\begin{figure}[h!]
\small
\begin{center}
\begin{tabular}{ p{8cm} }
\hline
\begin{verbatim}

    forecast  temp  humidity  windy  play
 1. sunny     hot   high      FALSE  no
 2. sunny     hot   high      TRUE   no
 3. overcast  hot   high      FALSE  yes
 4. rainy     mild  high      FALSE  yes
 5. rainy     cool  normal    FALSE  yes
 6. rainy     cool  normal    TRUE   no
 7. overcast  cool  normal    TRUE   yes
 8. sunny     mild  high      FALSE  no
 9. sunny     cool  normal    FALSE  yes
10. rainy     mild  normal    FALSE  yes
11. sunny     mild  normal    TRUE   yes
12. overcast  mild  high      TRUE   yes
13. overcast  hot   normal    FALSE  yes
14. rainy     mild  high      TRUE   no

\end{verbatim}
 \\ \hline
    \end{tabular}
\end{center}
\caption{A log of some golf-playing behavior}\label{fig:golf}
\end{figure} 

For this example we use the $weather$ data-set shown in \fig{golf}. This data-set contains four (4) attributes that report on the forecast (sunny, overcast or rainy), the temperature (hot, mild or cool), the humidity (high or normal) and whether or not the day is windy (TRUE or FALSE). Of the 14 days observed, golf was played on nine (9) and was not played on the remaining five (5).

To create a criterion for each class in this data-set, we first divide it into $best$ and $rest$. For this example, let all the instances with the $yes$ class be $best$ while the others are $rest$. Now we find the rank of each value of each attribute, with $K = 14$ (total number of instances), $|best| = 9$ and $|rest| = 5$. To find the rank of $sunny$ in forecast, the calculations in \fig{rank1} are completed.
\begin{figure}[h!]
\small
\begin{center}
\begin{tabular}{ p{8cm} }
\hline
\begin{verbatim}
E = sunny
P(best) = 9/14
P(rest) = 5/14
freq(E|best) = 2/9
freq(E|rest) = 3/5
like(best|E) = 2/9 * 9/14 = 2/14
like(rest|E) = 3/5 * 5/14 = 3/14
P(best|E) = (2/14) / (2/14 + 3/14) = 0.40

P(best|E) * support(best|E) = (2/14)^2 / (2/14 + 3/14)
                            = 0.06
\end{verbatim}
 \\ \hline
    \end{tabular}
\end{center}
\caption{Finding the rank of $sunny$}
\label{fig:rank1}
\end{figure} 

Once the ranks are found for each value of each attribute, the criterion is created by pairing each attribute with its highest ranked value, forming [attribute, value] pairs. For this example we will use the following as the criterion for the $yes$ class: 

\begin{verbatim}
[[forecast, rainy] [temp, mild] [humidity, high] [windy, FALSE]]
\end{verbatim}

With this criterion, after applying [forecast, rainy], we are left with three instances (4, 5 and 10):
\begin{center}
\begin{verbatim}
rainy     mild  high     FALSE  yes
rainy     cool  normal   FALSE  yes
rainy     mild  normal   FALSE  yes
\end{verbatim}
\end{center}
Since this is about 33\% of the data in the $yes$ class, the second pair [temp, mild] in the criterion is applied to the three instances. This yields two (2) instances (4 and 10), reducing the data in the $yes$ class to about 22\%:
\begin{center}
\begin{verbatim}
rainy     mild  high     FALSE  yes
rainy     mild  normal   FALSE  yes
\end{verbatim}
\end{center}

With the algorithm preferring a reduction to between 0 and 15\%, the third pair [humidity, high] is applied to the two (2) instances. This leaves one instance (4), about 11\% of the $yes$ class, so the process stops and that instance is kept as the prototype.
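The first two filtering steps of this walk-through can be replayed in a few lines of Python (our own encoding of the nine $yes$ instances from \fig{golf} as tuples):

```python
# The nine yes-class instances: (forecast, temp, humidity, windy).
yes = [
    ("overcast", "hot",  "high",   "FALSE"),  # 3
    ("rainy",    "mild", "high",   "FALSE"),  # 4
    ("rainy",    "cool", "normal", "FALSE"),  # 5
    ("overcast", "cool", "normal", "TRUE"),   # 7
    ("sunny",    "cool", "normal", "FALSE"),  # 9
    ("rainy",    "mild", "normal", "FALSE"),  # 10
    ("sunny",    "mild", "normal", "TRUE"),   # 11
    ("overcast", "mild", "high",   "TRUE"),   # 12
    ("overcast", "hot",  "normal", "FALSE"),  # 13
]

step1 = [r for r in yes if r[0] == "rainy"]   # [forecast, rainy]: 3 of 9 (33%)
step2 = [r for r in step1 if r[1] == "mild"]  # [temp, mild]: 2 of 9 (22%)
```

Each step is a simple filter on the surviving instances, which is all the criterion application amounts to.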

\section{CLIFF: Time Complexity}
\label{section:time}
Intuitively, the time complexity of CLIFF can be considered in terms of: 1) ranking each value of each attribute, an $O(m)$ operation where $m$ is the number of attributes; 2) finding the criteria for each class, an $O(m) + O(k)$ operation where $k$ is the number of classes; and 3) selecting instances from each class using the criteria, an $O(n)$ operation where $n$ is the number of instances.

Assuming that $n > m > k$ (which is the case for all data-sets used in this thesis), this process yields a complexity of $O(m) + O(m) + O(k) + O(n)$, which reduces to $O(n)$.
