\section{Data and Preprocessing Tools}
\subsection{Data Set Characteristics}

\fig{info} lists the eleven data sets used to assess CLIFF. For each data set, the table shows the number of instances, the number of attributes per instance, and the number of distinct classes. All of these data sets were acquired from the UCI repository \cite{Frank+Asuncion:2010} and represent a variety of data types and characteristics. For example, three of the data sets (Sonar, Soybean and Splice) have large dimensions of 60, 35 and 60 attributes respectively, while the others have dimensions in the range of 4 to 18. The number of instances ranges from 148 (Lymph) to 3190 (Splice), and the number of classes ranges from 2 to 15.

Assessing CLIFF on such a diverse set of data gives a good indication of the level of generalization CLIFF is capable of. However, in the interest of speed, two pre-processing tools are applied to the higher-dimensional data sets in \fig{info}. The following section details the algorithms used.

\begin{figure}
\begin{center}
\begin{tabular}{ | l | l | l | l |}
\hline
Data Set & Instances & Attributes & Classes \\ \hline
Balance Scale & 625 & 4 & 3 \\ \hline
Breast Cancer & 286 & 9 & 2 \\ \hline
Heart (Cleveland) & 297 & 13 & 5 \\ \hline
Iris & 150 & 4 & 3 \\ \hline
Lymph & 148 & 18 & 4 \\ \hline
Pima Diabetes & 768 & 8 & 2 \\ \hline
Sonar & 208 & 60 & 2 \\ \hline
Soybean & 562 & 35 & 15 \\ \hline
Splice & 3190 & 60 & 3 \\ \hline
Vehicle & 846 & 18 & 4 \\ \hline
Vote & 430 & 16 & 2 \\ \hline    
\end{tabular}
\end{center}
\caption{Data Set Characteristics}
\label{fig:info}
\end{figure}    

\subsection{Pre-processing tools for Dimensionality Reduction}

For this work, the pre-processing tools are aimed at dimensionality reduction, using a novel feature subset selector (FSS) and FastMap \cite{fastmap}. \fig{work} shows which pre-processing tool, if any, is used on each data set, while \fig{numf} shows the pd and pf values obtained with various numbers of features for each data set. The highlighted f values are those chosen for their high pds and low pfs; for example, when Soybean is FastMapped to f=16, it achieves its highest pd and lowest pf, so the 16-feature version of that data set is used in the remaining experiments of this work.
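The selection rule above can be made concrete with a minimal sketch. The rule assumed here (maximize pd, break ties by minimizing pf) is one plausible encoding that is consistent with every highlighted row in \fig{numf}; the `soybean` values below are copied from that table.

```python
def best_feature_count(results):
    """Pick the feature count f whose (pd, pf) pair is best:
    maximize pd, and break ties by minimizing pf.

    `results` maps f -> (pd, pf), both given in percent.
    """
    return max(results, key=lambda f: (results[f][0], -results[f][1]))

# Soybean rows from the pd/pf table: f -> (pd, pf)
soybean = {2: (0.0, 3.7), 4: (50.0, 1.9), 8: (60.0, 1.9), 16: (66.7, 1.2)}

print(best_feature_count(soybean))  # f=16, the highlighted row
```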

\begin{figure}
\begin{center}
\begin{tabular}{ | l | l | l | l |}
\hline
Data Set & CLIFF & CLIFF + FSS & CLIFF + FastMap \\ \hline
Balance Scale & yes &  &  \\ \hline
Breast Cancer & yes & yes &  \\ \hline
Heart (Cleveland) & yes & yes &  \\ \hline
Iris & yes &  &  \\ \hline
Lymph & yes &  &  \\ \hline
Pima Diabetes & yes & yes &  \\ \hline
Sonar &  & yes &  \\ \hline
Soybean &  &  & yes \\ \hline
Splice &  &  & yes \\ \hline
Vehicle &  & yes &  \\ \hline
Vote & & yes &  \\ \hline    
\end{tabular}
\end{center}
\caption{Pre-processing tool (if any) applied to each data set}
\label{fig:work}
\end{figure}    

\begin{figure}
\begin{center}
\begin{tabular}{l@{~}|l@{~}|c@{~}| c@{~}|}
\cline{1-4}
Data Set & FSS & pd\% & pf\% \\\hline
\multirow{2}{*}{Breast Cancer}
 & \colorbox{gray}{\color{white} f=2} & \colorbox{gray}{\color{white} 53.2} & \colorbox{gray}{\color{white} 46.8} \\
 & f=4 & 51.5 & 48.5 \\
 \hline
\multirow{3}{*}{Heart (Cleveland)}
 & f=2 & 0.0 & 9.4 \\
 & \colorbox{gray}{\color{white} f=4} & \colorbox{gray}{\color{white} 25.0} & \colorbox{gray}{\color{white} 9.6} \\
 & f=8 & 25.0 & 10.0 \\
 \hline
\multirow{2}{*}{Pima Diabetes}
 & f=2 & 52.1 & 47.9 \\
 & \colorbox{gray}{\color{white} f=4} & \colorbox{gray}{\color{white} 59.5} & \colorbox{gray}{\color{white} 40.5} \\
 \hline 
\multirow{4}{*}{Sonar}
 & f=2 & 74.5 & 25.5 \\
 & f=4 & 78.0 & 22.0 \\
 & \colorbox{gray}{\color{white} f=8} & \colorbox{gray}{\color{white} 80.5} & \colorbox{gray}{\color{white} 19.5} \\
 & f=16 & 78.6 & 21.4 \\
  \hline    
\multirow{4}{*}{Soybean}
 & f=2 & 0.0 & 3.7 \\
 & f=4 & 50.0 & 1.9 \\
 & f=8 & 60.0 & 1.9 \\
 & \colorbox{gray}{\color{white} f=16} & \colorbox{gray}{\color{white} 66.7} & \colorbox{gray}{\color{white} 1.2} \\
  \hline    
\multirow{4}{*}{Splice}
 & f=2 & 24.3 & 14.8 \\
 & f=4 & 22.2 & 11.9 \\
 & f=8 & 22.0 & 13.8 \\
 & f=16 & 24.3 & 15.6 \\
  \hline    
\multirow{3}{*}{Vehicle}
 & f=2 & 62.8 & 11.0 \\
 & \colorbox{gray}{\color{white} f=4} & \colorbox{gray}{\color{white} 66.3} & \colorbox{gray}{\color{white} 10.2} \\
 & f=8 & 56.8 & 11.7 \\
 \hline          
\multirow{3}{*}{Vote}
 & f=2 & 83.9 & 16.4 \\
 & f=4 & 85.7 & 14.3 \\
 & \colorbox{gray}{\color{white} f=8} & \colorbox{gray}{\color{white} 89.0} & \colorbox{gray}{\color{white} 11.0} \\
 \hline            
\end{tabular}
\end{center}
\caption{Choosing the best number of features for each data set. The best choice will have a high pd along with a low pf}\label{fig:numf}
\end{figure}

\subsubsection{FastMap}

The general goal of FASTMAP is to project items from an $n$-dimensional space to a $d$-dimensional space, where $n > d$. Each reduction step applies the cosine law to the triangle formed by an object in the feature space and the two objects that are furthest apart in the current (pre-reduction) space (see~\fig{fm1}). These two objects are referred to as the pivot objects of that step in the reduction phase ($n - d$ pivot object sets in total). Finding the two furthest-apart points exactly is an $O(N^2)$ problem (where $N$ is the total number of objects), but this is where the heuristic nature of FASTMAP comes into play.

Instead of finding the absolute furthest-apart points, FASTMAP takes a shortcut: it first randomly selects an object from the set, then finds the object furthest from it and sets that object as the first pivot point. After the first pivot point is selected, FASTMAP finds the point furthest from it and uses that as the second pivot point. The line formed by these two points becomes the line onto which all of the other points are mapped in the new $(n-1)$-dimensional space.

\begin{figure}
\begin{center}
\includegraphics[width=200pt]{fastmap1.png}
\end{center}
\caption{Example of using the cosine law to find the position of $O_i$ in
the dimension $k$}
\label{fig:fm1}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=200pt]{fastmap2.png}
\end{center}
\caption{Projections of points $O_i$ and $O_j$ onto the hyper-plane perpendicular to the line $O_a O_b$}
\label{fig:fm2}
\end{figure}

FASTMAP uses the following equation to calculate $x_i$, the position of object $O_i$ in the reduced space:
\begin{equation}
{x_i = \frac{d_{a,i}^2 + d_{a,b}^2 - d_{b,i}^2}{2d_{a,b}} }
\end{equation}
This technique can be visualized by imagining the hyper-plane perpendicular to the line formed by the pivot points, $O_a$ and $O_b$, and projecting each remaining point onto this plane (\fig{fm2}).

Because the pivot heuristic needs only two passes over the data per reduction step (one to find each pivot), FASTMAP requires a number of distance computations that is linear in the number of objects, rather than the $O(N^2)$ cost of exact pivot selection.
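The pivot heuristic and the cosine-law projection above can be sketched as follows. This is a minimal illustration, not the implementation used in this work; function and variable names (`fastmap`, `dist2`, etc.) are our own, and distances already accounted for by earlier coordinates are subtracted out, following the residual-distance trick of the FastMap algorithm.

```python
import numpy as np

def fastmap(points, d, rng=None):
    """Heuristically project N points down to d dimensions (FastMap sketch)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(points, dtype=float)
    N = X.shape[0]
    coords = np.zeros((N, d))

    def dist2(i, j, k):
        # Squared distance in the original space, minus the part already
        # captured by the first k FastMap coordinates (residual distance).
        base = np.sum((X[i] - X[j]) ** 2)
        done = np.sum((coords[i, :k] - coords[j, :k]) ** 2)
        return base - done

    for k in range(d):
        # Pivot heuristic: from a random seed object, take the furthest
        # object as pivot a, then the object furthest from a as pivot b.
        seed = int(rng.integers(N))
        a = max(range(N), key=lambda i: dist2(seed, i, k))
        b = max(range(N), key=lambda i: dist2(a, i, k))
        dab2 = dist2(a, b, k)
        if dab2 <= 0:  # all residual distances are zero; nothing left to map
            break
        for i in range(N):
            # Cosine law: x_i = (d_ai^2 + d_ab^2 - d_bi^2) / (2 d_ab)
            coords[i, k] = (dist2(a, i, k) + dab2 - dist2(b, i, k)) / (2 * np.sqrt(dab2))
    return coords
```

For points that already lie on a line, a single FastMap coordinate preserves all pairwise distances exactly, which makes the projection easy to sanity-check.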


\subsubsection{Feature Subset Selection (FSS)}

Feature Subset Selection (FSS) is a technique that explores subsets of the available features. Many FSS algorithms exist, including sequential forward selection and sequential backward selection. The simplest and fastest methods, however, rank attributes from most important to least important. InfoGain can be used effectively to rank attributes, but in this work we use a novel ranking method to find attribute subsets. Using the ranking algorithm described in Section \ref{subsection:selector}, the highest rank among each attribute's values is first stored. The attributes are then sorted from highest to lowest rank, and the top $n$ attributes are used in the experiments.
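The rank-and-keep-top-$n$ procedure can be sketched as below. Since the paper's novel ranking method is defined elsewhere (Section \ref{subsection:selector}), this sketch substitutes information gain, which the text notes is an effective ranker, as a stand-in scoring function; `top_n_attributes` accepts any per-attribute score.

```python
import numpy as np

def info_gain(x, y):
    """Information gain of a discrete attribute x with respect to class y."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))
    gain = entropy(y)
    for v in np.unique(x):
        mask = (x == v)
        gain -= mask.mean() * entropy(y[mask])  # weighted child entropy
    return gain

def top_n_attributes(data, labels, n, score=info_gain):
    """Score every attribute, sort from highest to lowest, keep the top n."""
    data, labels = np.asarray(data), np.asarray(labels)
    scores = [score(data[:, j], labels) for j in range(data.shape[1])]
    return sorted(range(data.shape[1]), key=lambda j: -scores[j])[:n]
```

With the novel ranking method plugged in as `score`, the same two-step structure (rank, then truncate) matches the procedure described above.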
