\section{New Forensic Models - Design}

\subsection{Chemometrics}
In an effort to create forensic models that avoid statistical assumptions and the need for surveys, we mirror an approach used in the field of chemistry called \emph{Chemometrics}. Chemometrics is generally defined as the application of mathematical, statistical, and computer science techniques to chemistry. In \cite{Karslake09}, chemometric techniques are applied to analyze the infrared spectra of the clear coat layer of a range of cars. The analysis proceeded as follows:

\begin{itemize}
\item Agglomerative hierarchical clustering (AHC) for grouping the data into classes
\item Principal component analysis (PCA) for reducing dimensions of the data
\item Discriminant analysis for classification, i.e., associating an unknown sample with a group or region
\end{itemize}
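The three steps above can be sketched with standard scikit-learn components. This is a minimal illustration, not the analysis of \cite{Karslake09}: the synthetic array below merely stands in for the real infrared spectra, and the class counts and component counts are taken from the description that follows.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Synthetic stand-in for infrared spectra: 30 samples x 200 wave
# numbers, drawn from three loosely separated groups.
spectra = np.vstack([rng.normal(loc=m, scale=1.0, size=(10, 200))
                     for m in (0.0, 3.0, 6.0)])

# Step 1: agglomerative hierarchical clustering groups the spectra
# using only the measurements themselves.
classes = AgglomerativeClustering(n_clusters=3).fit_predict(spectra)

# Step 2: PCA reduces the 200 wave numbers to a few factor scores.
scores = PCA(n_components=4).fit_transform(spectra)

# Step 3: discriminant analysis builds a classifier on the factor
# scores, using the AHC groups as the class labels.
model = LinearDiscriminantAnalysis().fit(scores, classes)
print(model.score(scores, classes))  # resubstitution accuracy
```

Note that the discriminant model never sees the raw spectra, only the low-dimensional factor scores, which is what makes the later leave-n-out validation cheap to repeat.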

In \cite{Karslake09}, AHC analysis of the spectra identified three classes. These classes were derived from the spectra alone and were not influenced by information known about the samples, such as the make of the vehicle or its year of production. This was done because the composition of the clear coat is unknown and cannot be assumed to be related to any known details of the vehicle.

PCA was then performed on the data to reduce the number of variables and to identify the regions of the spectra that contribute the most to the variation between the spectra. The analysis was run using both Pearson correlation and covariance so the two could be compared. The eigenvalues for the correlation analysis fell steadily after about four principal components, at which point 69\% of the variability was accounted for. The eigenvalues for the covariance analysis also fell steadily after about four principal components, but those four components accounted for about 82\% of the variability.

Covariance is the more appropriate method to use, as the values at each wave number in the spectra vary widely. As the data is not all on the same scale, correlation detects variation that is essentially noise, whereas covariance is able to filter this out and so account for more variability in fewer principal components. 
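The correlation-versus-covariance comparison can be reproduced in miniature: correlation-based PCA is equivalent to covariance-based PCA on standardized variables. The sketch below uses synthetic data (a few large-scale signal variables plus many small-scale noise variables, loosely mimicking spectra whose values vary widely across wave numbers); it is an illustration of the effect, not the original analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# A rank-one, large-scale signal spread over 5 variables, plus 95
# small-scale noise variables.
signal = rng.normal(size=(100, 1)) @ rng.normal(size=(1, 5)) * 50.0
noise = rng.normal(size=(100, 95))
X = np.hstack([signal, noise])

# Covariance-based PCA: operate on the raw (centered) data, so the
# large-scale signal dominates the leading components.
cov_ratio = PCA(n_components=4).fit(X).explained_variance_ratio_.sum()

# Correlation-based PCA: standardizing each variable first inflates
# the influence of the small-scale noise variables.
Xs = StandardScaler().fit_transform(X)
cor_ratio = PCA(n_components=4).fit(Xs).explained_variance_ratio_.sum()

print(f"covariance PCA, 4 PCs:  {cov_ratio:.2f}")
print(f"correlation PCA, 4 PCs: {cor_ratio:.2f}")
```

With data like this, four covariance-based components account for far more of the variability than four correlation-based ones, mirroring the 82\% versus 69\% figures reported above.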

Discriminant analysis of the data produced a model that correctly placed all spectra into the classes assigned by AHC. The model was built from the AHC classification and the factor scores of the first four principal components identified by PCA. It was validated by removing samples at random, rebuilding the model, and checking that the held-out samples were assigned to their original classes. First, one sample was chosen at random for validation; this was repeated for a total of 10 validations, and each time the validation sample was correctly assigned. The procedure was then repeated leaving five, ten, and twenty samples out at a time, again for ten validation analyses each, and every validation sample was correctly assigned. This shows the model is highly discriminating.
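The repeated leave-$n$-out check described above is straightforward to express in code. The sketch below uses synthetic factor scores and labels in place of the real PCA scores and AHC classes, and linear discriminant analysis as the classifier; the class sizes and hold-out sizes are illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
# Stand-in factor scores (4 PCs) for three well-separated classes,
# playing the role of the PCA output and AHC class labels.
scores = np.vstack([rng.normal(loc=m, scale=0.5, size=(20, 4))
                    for m in (0.0, 4.0, 8.0)])
labels = np.repeat([0, 1, 2], 20)

def leave_n_out(scores, labels, n, repeats, rng):
    """Hold out n random samples, refit the model on the rest, and
    return the fraction of held-out samples correctly assigned."""
    correct, total = 0, 0
    for _ in range(repeats):
        held = rng.choice(len(scores), size=n, replace=False)
        mask = np.ones(len(scores), dtype=bool)
        mask[held] = False
        model = LinearDiscriminantAnalysis().fit(scores[mask], labels[mask])
        correct += (model.predict(scores[held]) == labels[held]).sum()
        total += n
    return correct / total

# Ten validation analyses for each hold-out size, as in the text.
for n in (1, 5, 10, 20):
    print(n, leave_n_out(scores, labels, n, repeats=10, rng=rng))
```

A model is "highly discriminating" in this sense when the returned fraction stays at 1.0 even as the hold-out size grows.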

Adopting the steps of \cite{Karslake09}, we adapted the procedure by using different tools and by extending it with our SBBR contrast set learning technique, described in Section \ref{subsubsection:tec2}, to reduce brittleness (see \fig{process}). We refer to our process as CLIFF, which is characterized by the following steps:

\begin{enumerate}
\item Get the data
\item Reduce the dimensions of the data if necessary using FastMap \cite{fastmap}
\item Cluster or group data using Kmeans
\item Perform instance selection with SBBR
\item Classification with the k-nearest neighbor algorithm
\end{enumerate}
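The CLIFF pipeline above can be sketched end to end as follows. This is only an illustrative skeleton on synthetic data: the \texttt{fastmap} function is a minimal rendering of the Faloutsos--Lin projection \cite{fastmap}, and the SBBR instance-selection step (described in Section \ref{subsubsection:tec2}) is replaced here by a trivial keep-everything placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def fastmap(X, k, rng):
    """Minimal FastMap sketch: build k axes, each defined by a pair
    of distant pivot objects, and project every object onto them."""
    X = X.astype(float).copy()
    coords = np.zeros((len(X), k))
    for dim in range(k):
        # Heuristic pivot choice: start anywhere, take the farthest
        # object, then the object farthest from that one.
        a = rng.integers(len(X))
        b = np.argmax(np.linalg.norm(X - X[a], axis=1))
        a = np.argmax(np.linalg.norm(X - X[b], axis=1))
        d_ab = np.linalg.norm(X[a] - X[b])
        if d_ab == 0:
            break
        d_ax = np.linalg.norm(X - X[a], axis=1)
        d_bx = np.linalg.norm(X - X[b], axis=1)
        # Cosine-law projection onto the line through the pivots.
        coords[:, dim] = (d_ax**2 + d_ab**2 - d_bx**2) / (2 * d_ab)
        # Remove the captured component before the next axis.
        axis = (X[b] - X[a]) / d_ab
        X -= np.outer(X @ axis, axis)
    return coords

rng = np.random.default_rng(0)
# Step 1: get the data (synthetic stand-in for the spectra).
spectra = np.vstack([rng.normal(loc=m, size=(15, 100)) for m in (0, 4, 8)])

# Step 2: reduce dimensions with FastMap.
reduced = fastmap(spectra, k=4, rng=rng)

# Step 3: cluster the reduced data with k-means.
groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

# Step 4: instance selection with SBBR -- placeholder: keep everything.
selected = np.arange(len(reduced))

# Step 5: classify with k-nearest neighbor on the selected instances.
knn = KNeighborsClassifier(n_neighbors=3).fit(reduced[selected], groups[selected])
print(knn.score(reduced, groups))
```

In the full process, step 4 would prune the instance set down to a small number of prototypes per group, so the k-nearest neighbor step in step 5 compares unknown samples against far fewer instances.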

\begin{figure}[h!]
\begin{center}
\includegraphics[scale=0.32]{process}
\end{center}
\caption{Proposed procedure for the forensic evaluation of data}\label{fig:process}
\end{figure}



\subsection{CLIFF}

\subsubsection{Data and FastMap}
The performance of CLIFF was assessed using a data set donated by \cite{Karslake09}, made up of the infrared spectra of the clear coat layer of a range of cars. Details of how the data were collected and the measurements generated can be found elsewhere \cite{Karslake09}.

\subsubsection{Kmeans}

\subsubsection{K-nearest neighbor}

\subsubsection{Prototype learning with SBBR}







