\section{Introduction}
The principal goal of forensic evaluation models is to determine whether evidence found at a crime scene is similar or dissimilar to evidence found on a suspect. In building these models, attention is paid to the significance level of a solution, but its \emph{brittleness} level is never considered. The \emph{brittleness} level measures whether a solution comes from a region of similar solutions or from a region of dissimilar solutions. We contend that a solution from a region with a low level of brittleness, i.e.\ a region of similar solutions, is much better than one from a region with a high level of brittleness, i.e.\ a region of dissimilar solutions.

The concept of \emph{brittleness} is not a stranger to the world of forensic science; it is recognized as the ``fall-off-the-cliff effect'', a term coined by Ken Smalldon. In other words, Smalldon recognized that tiny changes in input data could lead to massive changes in the output. Although Walsh \cite{Walsh94} worked on reducing the brittleness of his model, to the best of our knowledge no work has been done to quantify brittleness in current forensic models, or to recognize and eliminate the causes of brittleness in these models.

In our studies of forensic evaluation models, particularly in the sub-field of glass forensics, we conjecture that brittleness is caused by the following:

%\begin{itemize}
\be
\item Tiny errors in the collection of data;
\item Inappropriate statistical assumptions, such as assuming that the distribution of refractive indices of glass collected at a crime scene or on a suspect obeys the properties of a normal distribution;
\item The use of measured parameters from surveys to calculate the \emph{frequency of occurrence} of trace evidence in a population.
\ee
%\end{itemize}

In this work we quickly eliminate the latter two causes of brittleness by using simple classification methods such as k-nearest neighbor (KNN), which are concerned with neither the distribution of the data nor its frequency of occurrence in a population. To reduce the effects of errors in data collection, a novel prototype learning algorithm (PLA) is used to augment KNN. This PLA selects the samples which best represent the region or neighborhood they come from. In other words, we expect that samples which contain errors will be poor representatives and will therefore be eliminated from further analysis. This moves neighborhoods with different outcomes further apart from each other.

%In the end our goal for this work is threefold. First we want to show the forensic scientist the importance of reporting the brittleness level of their models. Second, to encourage them to seek out and eliminate the causes of brittleness in their models and third, for those causes which cannot be eliminated

In the end our goal for this work is threefold. First, we want to develop a new generation of forensic models which avoid inappropriate statistical assumptions. Second, the new models must not be \emph{brittle}: they should not change their interpretation without sufficient evidence. Third, the models must provide not only an interpretation of the evidence but also a measure of how reliable that interpretation is; in other words, the brittleness level of the model.

Our research is guided by the following research questions:

\begin{itemize}
\item Using KNN as a model, what is the best K for each data set?
\item Are the results of using KNN better than, or comparable to, current models which use statistical assumptions and surveys?
\item Does prototype learning reduce brittleness?
\item Do the results of applying a PLA differ significantly from results of not applying a PLA?
\end{itemize}


\section{Visualization of Brittleness}
This work is motivated by a recent National Academy of Sciences report titled ``Strengthening Forensic Science" \cite{09NAS}. This report took special notice of forensic interpretation models stating:

\begin{quotation}
With the exception of nuclear DNA analysis, however, no forensic method has been rigorously shown to have the capacity to consistently, and with a high degree of certainty, demonstrate a connection between evidence and a specific individual or source. (\cite{09NAS}, p.~6)
\end{quotation}

In this study we visualize the inconsistencies of four of these forensic methods in a single way: by plotting the measurements derived from evidence from a crime scene, denoted $x$, and from a suspect, denoted $y$, against the results of interpretation. %Second, by charting the probabilities of those measurements from $x$ in certain ranges would match corresponding measurements in $y$. This is done using a Contrast Set Learning (CSL) algorithm.

The rest of this section details the four forensic models evaluated in this work, followed by a visualization of each model that highlights the brittleness responsible for the inconsistencies in its results.

\section{Glass Forensic Models}
This section provides an overview of the following glass forensic models, used in this work to demonstrate brittleness.

\begin{enumerate}
\item The 1978 Seheult model \cite{Seheult78}
\item The 1980 Grove model \cite{Grove80}
\item The 1995 Evett model \cite{Evett94}
\item The 1996 Walsh model \cite{Walsh94}
\end{enumerate}

\subsection{Seheult 1978}
\label{subsection:seh}
Seheult \cite{Seheult78} examines and simplifies Lindley's \cite{77Lindley} 6th equation for real-world application to Refractive Index (RI) analysis. According to Seheult:
\begin{quote}A measurement $x$, with normal error having known standard deviation $\sigma$, is made on the unknown refractive index $\Theta_1$ of the glass at the scene of the crime. Another measurement $y$, made on the glass found on the suspect, is also assumed to be normal but with mean $\Theta_2$ and the same standard deviation as $x$. The refractive indices $\Theta$ are assumed to be normally distributed with known mean $\mu$ and known standard deviation $\tau$. If $I$ is the event that the two pieces of glass come from the same source ($\Theta_1 = \Theta_2$) and $\bar{I}$ the contrary event, Lindley suggests that the odds on identity should be multiplied by the factor
\begin{equation}
\frac{p(x,y|I)}{p(x,y|\bar{I})} \label{eq:lin1}
\end{equation}
In this special case, it follows from Lindley's 6th equation that the factor is
\begin{equation}
\frac{1+\lambda^2}{\lambda(2+\lambda^2)^{1/2}}\exp\Big\{-\frac{1}{2(1+\lambda^2)}\big(u^2-v^2\big)\Big\} \label{eq:lin2}
\end{equation}
Where
\begin{equation*}
\lambda = \frac{\sigma}{\tau}, \qquad
u = \frac{x-y}{\sigma\sqrt{2}}, \qquad
v = \frac{z-\mu}{\tau\big(1+\frac{1}{2}\lambda^2\big)^{1/2}}, \qquad
z = \frac{1}{2}(x+y)
\end{equation*}
\end{quote}
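To make the factor concrete, the quantities above can be computed directly. The following Python sketch implements the factor under the definitions just given; the numeric inputs are illustrative assumptions on a glass refractive-index scale, not values from the text.

```python
import math

def lindley_factor(x, y, sigma, mu, tau):
    """Seheult's simplification of Lindley's factor; symbols follow the text."""
    lam = sigma / tau
    u = (x - y) / (sigma * math.sqrt(2))
    z = 0.5 * (x + y)
    v = (z - mu) / (tau * math.sqrt(1 + 0.5 * lam ** 2))
    prefactor = (1 + lam ** 2) / (lam * math.sqrt(2 + lam ** 2))
    return prefactor * math.exp(-(u ** 2 - v ** 2) / (2 * (1 + lam ** 2)))

# Illustrative (assumed) values: identical measurements versus a clear difference
same_source = lindley_factor(1.51800, 1.51800, sigma=4e-5, mu=1.518, tau=4e-3)
diff_source = lindley_factor(1.51800, 1.52000, sigma=4e-5, mu=1.518, tau=4e-3)
```

When $x=y$ the factor is large (supporting identity), and a difference of many measurement standard deviations drives it toward zero.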

\subsection{Grove 1980}
\label{subsection:gro}

By adopting the model used by Lindley and Seheult, Grove proposed a non-Bayesian approach based on likelihood ratios. The problem of deciding whether the fragments come from a common source is distinguished from the problem of deciding the guilt or innocence of the suspect. To explain his method, Grove first reviewed Lindley's method. He argued that we should, where possible, avoid parametric assumptions about the underlying distributions. Hence, in discussing the respective roles of $\theta_1$ and $\theta_2$, Grove did not attribute a probability distribution to an unknown parameter without special justification. When considering $\theta_1 \neq \theta_2$, the event $\bar{I}$ can be interpreted as saying that the fragments are present by chance, entailing a random choice of value for $\theta_2$. The simplified likelihood ratio obtained from Grove's derivation is:
\begin{equation}
\frac{\tau}{\sigma}\cdot e^{\big\{-\frac{(X-Y)^2}{4\sigma^{2}} + \frac{(Y-\mu)^2}{2\tau^2}\big\}} \label{eq:gro2}
\end{equation}

We are of course only concerned with the evidence about $I$ and $\bar{I}$ insofar as it has a bearing on the guilt or innocence of the suspect. Grove therefore also considered the event $G$, that the suspect is guilty, in the calculation of the likelihood ratio (LR). The LR now becomes
\begin{equation} \frac{p(X,Y|G)}{p(X,Y|\bar{G})}\end{equation}
In the expansion, event $T$ is the event that fragments were transferred from the broken window to the suspect and persisted until discovery, and event $A$ is the event that the suspect came into contact with glass from some other source. Writing $p(A|G)=p(A|\bar{G})=P_a$ and $p(T|G)=P_t$, the resulting expression is %\newline

  \begin{equation} 
    \frac{P(X,Y,S|G)}{P(X,Y,S|\bar{G})} = 1+P_t\Big\{\Big(\frac{1}{P_a}-1\Big)\frac{p(X,Y|I)}{p(X,Y|\bar{I})}-1\Big\}\label{eq:gro4}  
  \end{equation} 
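As a sanity check, the two likelihood ratios above can be coded directly. In this sketch the transfer and chance-acquisition probabilities, and all measurement values, are illustrative assumptions rather than values from Grove's paper.

```python
import math

def grove_lr_source(X, Y, sigma, mu, tau):
    # Simplified LR for I versus I-bar: do the fragments share a common source?
    return (tau / sigma) * math.exp(-(X - Y) ** 2 / (4 * sigma ** 2)
                                    + (Y - mu) ** 2 / (2 * tau ** 2))

def grove_lr_guilt(X, Y, sigma, mu, tau, Pt, Pa):
    # LR for G versus G-bar, folding in transfer (Pt) and chance acquisition (Pa)
    return 1 + Pt * ((1 / Pa - 1) * grove_lr_source(X, Y, sigma, mu, tau) - 1)

# Illustrative (assumed) values:
match = grove_lr_guilt(1.518, 1.518, sigma=4e-5, mu=1.518, tau=4e-3,
                       Pt=0.6, Pa=0.05)
non_match = grove_lr_guilt(1.518, 1.520, sigma=4e-5, mu=1.518, tau=4e-3,
                           Pt=0.6, Pa=0.05)
```

Note how a non-matching pair drives the source LR to zero, so the guilt LR collapses to $1 - P_t$.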
  
\subsection{Evett 1995}
\label{subsection:eve}

Evett et al.\ used data from forensic surveys to create a Bayesian approach for determining the statistical significance of finding glass fragments, and groups of glass fragments, on individuals associated with a crime \cite{Evett94}.

Evett proposes that likelihood ratios are well suited for explaining the existence of glass fragments on a person suspected of a crime. A likelihood ratio is defined in the context of this paper as the ratio of the probability that the suspected person is guilty given the existing evidence to the probability that the suspected person is innocent given the existing evidence. The given evidence, as it applies to Evett's approach, includes the number of different sets of glass and the number of fragments in each unique group of glass.

The Lambert, Satterthwaite and Harrison (LSH) survey used empirical evidence to supply probabilities relevant to Evett's proposal. The LSH survey inspected individuals and collected glass fragments from each of them. These fragments were placed into groups based on their refractive index (RI) and other distinguishing physical properties. The number of fragments and the number of sets of fragments were recorded, and the discrete probabilities were published.  In particular, there are two unique probabilities that are of great interest in calculating Evett's proposed likelihood ratio.
\begin{itemize}
\item S, the probability of finding N glass \emph{fragments} per group
\item P, the probability of finding M \emph{groups} on an individual.
\end{itemize}

%\clearpage
%\subsection{Mathematical Formulae}
The following symbols are used by Evett to express his equations:
\begin{itemize}
\item $P_n$ is the probability of finding $n$ groups of glass on the surface of a person's
clothes
\item $T_n$ is the probability that $n$ fragments of glass would be transferred, retained
and found on the suspect's clothing if he had smashed the scene window
\item $S_n$ is the probability that a group of glass fragments on a person's clothing
consists of $n$ fragments
\item $f$ is the probability that a group of fragments on a person's
clothing would match the control sample
\item $\lambda$ is the expected number of glass fragments remaining at a time, $t$
\end{itemize}
Evett utilizes the following equations to determine the likelihood ratio for the first case described in his 1994 paper. In this case, a single window is broken, and a single group of glass fragments is expected to be recovered.

\begin{equation}
LR = \frac{{P_0}{T_n}}{{P_1}{S_n}{f}}+{T_0}
\end{equation}

\begin{equation} 
T_n = \frac{e^{-\lambda}{\lambda^n}}{n!}
\end{equation}
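The Poisson transfer term and the single-group likelihood ratio can be sketched as follows. The survey-derived probabilities passed in below are placeholders for illustration, not values from the LSH survey.

```python
import math

def T(n, lam):
    # T_n: Poisson probability that n fragments are transferred,
    # retained and found, with expected count lam
    return math.exp(-lam) * lam ** n / math.factorial(n)

def evett_lr(n, lam, P0, P1, Sn, f):
    # Single-window, single-group case: LR = P0*T_n / (P1*S_n*f) + T_0
    return (P0 * T(n, lam)) / (P1 * Sn * f) + T(0, lam)

# Placeholder (assumed) probabilities for illustration only:
lr = evett_lr(n=5, lam=4.0, P0=0.636, P1=0.238, Sn=0.1, f=0.02)
```

The Poisson probabilities $T_n$ sum to one over $n$, which gives a quick check that the transfer term is well formed.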

\subsection{Walsh 1996}
\label{subsection:wal}

The equation presented by Walsh \cite{Walsh94} is similar to one of Evett's. The difference is that Walsh argues that instead of incorporating both grouping and matching, only grouping should be included, because match/non-match is really just an arbitrary line. He examines the use of this technique in interpreting the glass evidence of a specific case. The technique is as follows:
\begin{equation}
\frac{T_L P_0 p(\bar{X},\bar{Y}|S_y,S_x)}{P_1S_Lf_1}
\end{equation}
Where 

\begin{itemize}
\item
$T_L$ = the probability of 3 or more glass fragments being transferred from the crime scene to the person.
\item
$P_0$ = the probability of a person having no glass on their clothing
\item
$P_1$ = the probability of a person having one group of glass on their clothing
\item
$S_L$ = the probability that a group of glass on clothing is 3 or more fragments
\item
$\bar{X}$ and $\bar{Y}$ are the mean of the control and recovered groups respectively
\item
$S_x$ and $S_y$ are the sample standard deviations of the control and recovered groups respectively
\item
$f_1$ is the value of the probability density for glass at the mean of the recovered sample
\item
$p(\bar{X},\bar{Y}|S_y,S_x)$ is the value of the probability density for the difference between the sample means
\end{itemize} 
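Walsh's ratio is a single product of the quantities defined above. A minimal sketch, with every input value a placeholder assumption:

```python
def walsh_lr(TL, P0, P1, SL, f1, p_diff):
    # p_diff stands for p(Xbar, Ybar | Sy, Sx), the probability density for
    # the difference between the sample means
    return (TL * P0 * p_diff) / (P1 * SL * f1)

# Placeholder (assumed) values for illustration only:
lr = walsh_lr(TL=0.5, P0=0.636, P1=0.238, SL=0.08, f1=0.02, p_diff=0.9)
```

Because $f_1$ sits in the denominator, glass that is common in the population (large $f_1$) weakens the evidence, as intended.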
%%%%

\section{Visualization of Brittleness in Models}
%\subsection{The First Technique}

The result of applying the visualization technique, i.e.\ plotting the measurements derived from evidence from a crime scene ($x$) and suspect ($y$) against the results of interpretation, to the glass forensic models is shown in \fig{models}.

\begin{figure}[h!]
  \begin{center}
  \scalebox{0.97}{
    \begin{tabular}{c}
      \resizebox{60mm}{!}{\includegraphics{seheult-chart}} \\
      \resizebox{60mm}{!}{\includegraphics{grove-chart}} \\
      \resizebox{60mm}{!}{\includegraphics{walsh-chart}} \\
      \resizebox{60mm}{!}{\includegraphics{evett-chart}} \\
    \end{tabular}}
    \caption{Visualization of four glass forensic models}
    \label{fig:models}
  \end{center}
\end{figure}

For the first two models the $x$ and $y$ axes represent the mean refractive index (RI) values of evidence from a crime scene and a suspect respectively. For the Walsh model, the $x$ axis represents $f_1$, the value of the probability density for glass at the mean of the recovered sample, and the $y$ axis represents the value of the probability density for the difference between the sample means. The $x$ and $y$ axes of the Evett model represent $\lambda$ and the $f$-values respectively. The $z$ axis of every model represents the likelihood ratio (LR) generated by that model; in other words, the significance of the match/non-match of evidence to an individual or source.

Using data donated by the Royal Canadian Mounted Police (RCMP), values such as the RI ranges and their means were extracted to generate random samples for the forensic glass models. In all four models, $1000$ samples are randomly generated for the variables in each model. For instance, in the Seheult model each sample has the form [$x$, $y$, $\sigma$, $\mu$, $\tau$]; the symbols are explained in Section~\ref{subsection:seh}.

In \fig{models}, the Seheult and Grove models clearly demonstrate brittleness, or Smalldon's ``fall-off-the-cliff effect''. These models, proposed by Seheult (Section \ref{subsection:seh}) and Grove (Section \ref{subsection:gro}) respectively, show how the likelihood ratio (on the vertical axis) changes as we try different values of the refractive index of glass from two sources ($x$ and $y$). Such models could lead to incorrect interpretations if minor errors are made when measuring the refractive index of glass samples taken from a suspect's clothes. Note how, near the region where $x=y$, tiny changes in the $x$ or $y$ refractive indices lead to dramatic changes in the likelihood ratio (from zero to one).

The visualizations of the Evett (Section \ref{subsection:eve}) and Walsh (Section \ref{subsection:wal}) models show similar brittleness where the likelihood ratios are 0 and 1. For Walsh, a value located at the edge of a cliff at LR${}=1$ can easily fall to LR${}=0$ at the smallest change in the $f_1$ or $p$ values. The Evett model causes similar problems: a small change in any sample can flip the LR.

From these visualizations it is obvious that the concern of the National Academy of Sciences report \cite{09NAS} mentioned earlier in this section is a valid one. How can this concern be alleviated? We propose not only adding a \emph{brittleness} measure to a forensic method, but also moving away from forensic models which use a Bayesian approach \cite{Seheult78, Evett84, Evett90, Evett94, Walsh94} and statistical assumptions \cite{Seheult78, Grove80, Walsh94}.

The following sections give details of the models used in this work as well as the data set used to evaluate them.




%\section{CLIFF Design and Operation}
\section{Introduction}
If standard methods are brittle, what can we do? We seek our answer in the work of \cite{Karslake09}, and we explore the intuition that, to reduce brittleness, data with dissimilar outcomes should not be close neighbors. In this section the details of CLIFF's core procedure and tools are discussed, including a sub-section which further explores our intuition for brittleness reduction and the tool born from it: the CLIFF selector.

The design of CLIFF is deeply rooted in the work of \cite{Karslake09}, where analysis is done using chemometrics: the application of mathematical, statistical and/or computer science techniques to chemistry. In \cite{Karslake09}, chemometrics is applied to analyze the infrared spectra of the clear-coat layer of a range of cars. The analysis proceeded as follows:

\begin{itemize}
\item Agglomerative hierarchical clustering (AHC) for grouping the data into classes
\item Principal component analysis (PCA) for reducing dimensions of the data
\item Discriminant analysis for classification i.e. associating an unknown sample to a group or region
\end{itemize}

This technique produced a strong model which achieved 100\% accuracy, i.e.\ when validated by removing random samples from the model, all the samples were correctly assigned. The goal of CLIFF is not only to create a strong forensic model but also to show how strong the model is. To achieve this, CLIFF includes a brittleness measure as well as a method to reduce brittleness. Also, in an effort to keep CLIFF simple, we substituted different tools to perform the analysis done in \cite{Karslake09}: K-means is used instead of AHC for grouping the data into classes, FastMap is used for dimensionality reduction, and K-nearest neighbor is used for classification. The basic operation of CLIFF is shown in \fig{process}. The data is collected and its dimensions are reduced if necessary. Clusters are then created from the data, and classification is done along with a brittleness measure (further discussed in Section \ref{subsection:bm}). Finally, we test whether brittleness can be reduced using a novel prototype learning technique (instance selection).

\begin{figure}[h!]
\begin{center}
\includegraphics[scale=0.32]{process}
\end{center}
\caption{Proposed procedure for the forensic evaluation of data}\label{fig:process}
\end{figure}

\section{Dimensionality Reduction}

\subsection{Principal Component Analysis}
The goal of Principal component analysis (PCA) is to reduce the number of variables or dimensions of a data set which has a large number of correlated variables while maintaining as much of the data variation as is possible. The result of this serves two main purposes:

\be
\item To simplify analysis and
\item To aid in the visualization of the data
\ee  

To achieve this goal, the data set is transformed to a new set of variables which are not correlated and which are ordered so that the first few principal components (PCs) retain most of the variation present in all of the original variables \cite{joll02}. Let us look at an example. \fig{iris} shows a visualization of Fisher's five-dimensional iris data on a two-dimensional scatter plot. First, PCs are extracted from the four continuous variables (sepal-width, sepal-length, petal-width, and petal-length). Second, these variables are projected onto the subspace formed by the first two components extracted. Finally this two-dimensional data is shown on a scatter-plot in \fig{iris}. The fifth dimension (species) is represented by the color of the points on the scatter-plot.

\begin{figure}[h!]
\begin{center}
\includegraphics[scale=0.8]{iris}
\end{center}
\caption{PCA for iris data set}\label{fig:iris}
\end{figure}
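The projection just described can be sketched with numpy alone: center the data, take the eigenvectors of the covariance matrix, and keep the components with the largest eigenvalues. The random matrix below is a stand-in for real measurements, with one variable given a dominant variance so the ordering is visible.

```python
import numpy as np

def pca_project(X, m=2):
    """Project rows of X onto the m principal components of largest variance."""
    Xc = X - X.mean(axis=0)                    # center each variable
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1]             # eigenvalues, largest first
    return Xc @ vecs[:, order[:m]]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X[:, 2] *= 10                                  # one variable dominates the variance
Z = pca_project(X, m=2)
```

The first returned column carries the most variance, which is exactly the ordering property the scatter-plot in \fig{iris} relies on.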


The data used in our experiments contains 1,151 attributes and 185 instances. Using the data set as-is would produce a model that is computationally expensive and likely to give unacceptable results, such as high false positive rates caused by redundant and noisy data. To avoid this problem, we turn to dimensionality reduction.

Dimensionality reduction refers to reducing high-dimensional data to low-dimensional data. This is accomplished by summarizing the data with fewer terms than the original representation. While this reduces the overall information available, and thus a level of precision, it allows for easy visualization of data otherwise impossible to visualize. Algorithms that can be used for dimensionality reduction include Principal Component Analysis (PCA) and FastMap.

In \cite{Karslake09}, PCA is used to perform the dimensionality reduction. PCA can be defined as ``the orthogonal projection of the data onto a lower dimensional linear space'' \cite{pca}. In other words, for our data set of 1,151 variables and 185 samples, the goal is to project the data onto a space of dimensionality $M < 1{,}151$ while maximizing the variance of the projected data. In \cite{Karslake09}, two techniques, Pearson correlation and covariance, were compared to determine an appropriate value for M (M = 4).

To speed things up, our model uses \emph{FastMap} to reduce the dimensions of the data set. In FastMap, the basis of each reduction is the cosine law applied to the triangle formed by an object in the feature space and the two objects that are furthest apart in the current (pre-reduction) space. These two objects are referred to as the pivot objects of that step in the reduction phase ($M$ total pivot object sets). Finding the two furthest-apart points exactly is an $O(N^2)$ problem (where $N$ is the total number of objects), and this is where the heuristic nature of FastMap comes into play. Instead of finding the absolute furthest-apart points, FastMap takes a shortcut: it first randomly selects an object from the set, then finds the object furthest from it and sets this as the first pivot point. After the first pivot is selected, FastMap finds the point farthest from it and uses that as the second pivot point. The line formed by these two points becomes the line that all of the other points are mapped to in the new $M$-dimensional space. (Further details of this algorithm can be found elsewhere \cite{fastmap}.)
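The pivot heuristic and the cosine-law projection can be sketched as follows. This is a one-axis sketch of the idea, with a deterministic starting object in place of the random one.

```python
import numpy as np

def choose_pivots(X, start=0):
    # Heuristic: from a starting object, take the farthest object as pivot a,
    # then the object farthest from a as pivot b
    a = int(np.linalg.norm(X - X[start], axis=1).argmax())
    b = int(np.linalg.norm(X - X[a], axis=1).argmax())
    return a, b

def fastmap_axis(X, a, b):
    # Cosine law: coordinate of every object on the line through the pivots
    d_ab = np.linalg.norm(X[a] - X[b])
    d_ax = np.linalg.norm(X - X[a], axis=1)
    d_bx = np.linalg.norm(X - X[b], axis=1)
    return (d_ax ** 2 + d_ab ** 2 - d_bx ** 2) / (2 * d_ab)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
a, b = choose_pivots(X)
coords = fastmap_axis(X, a, b)
```

By construction, pivot $a$ maps to coordinate 0 and pivot $b$ maps to the inter-pivot distance; repeating the step on the residual distances yields further axes.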

To determine an appropriate value for M using FastMap, we experimented with different values for M. \fig{exp1} shows results for various K-nearest neighbor classifiers (discussed further in Sections \ref{subsection:knn} and \ref{assess}), with M fixed at 2, 4, 8 and 16. When M is 2 or 4, 100\% of the validation samples are predicted correctly (pd) and 0\% are predicted incorrectly (pf). For this reason, our model is analysed using M = 4.

\begin{figure}
  \begin{center}
  \scalebox{0.95}{
    \begin{tabular}{c}
      \resizebox{100mm}{!}{\includegraphics{varspd}} \\
      \resizebox{100mm}{!}{\includegraphics{varspf}} \\
      
    \end{tabular}}
    \caption{Probability of detection (pd) and Probability of False alarms (pf) using fixed values for dimensions and fixed k values for k-nearest neighbor}
    \label{fig:exp1}
  \end{center}
\end{figure}


\section{Clustering}

Clustering is the second step in the CLIFF tool. It can be defined as the partitioning of samples into groups whose members are similar in some way, with samples belonging to different clusters being dissimilar. The major goal of clustering is to determine the intrinsic grouping in a set of unlabelled data. In most clustering techniques distance is the major criterion: two objects are similar if they are close according to the given distance.

CLIFF clusters using K-means. \fig{kmeans} gives the pseudo code for the K-means algorithm. K-means starts by assuming some arbitrary number of centroids; objects are then associated with their nearest centroids, and the centroids are moved to the centers of the resulting clusters. These steps are repeated until a suitable level of convergence is attained.

\begin{figure}[h!]
\small
\begin{center}
\begin{tabular}{ p{7cm} }
\hline
\begin{verbatim}

DATA = [3, 5, 10, 20]
k = [1, ..., Number of clusters]
STOP = [Stopping criteria]
        
FOR EACH data IN DATA
 N = count(data)
 WHILE STOP IS FALSE
  // Calculate membership in clusters
  FOR EACH data point X IN data
   FIND NEAREST CENTROID_k
   ADD TO CLUSTER_k
  END
 
  // Recompute the centroids  
  FOR EACH CLUSTER
   FIND NEW CENTROIDS
  END
  
  // Check stopping criteria
  [TRUE or FALSE] = STOP
 END
END
\end{verbatim}
 \\ \hline
    \end{tabular}
\end{center}
\caption{Pseudo code for K-means}\label{fig:kmeans}
\end{figure}
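A minimal numpy rendering of the pseudo code in \fig{kmeans}. The pseudo code leaves initialization open, so for simplicity this sketch seeds the centroids with the first $k$ rows (an assumption); the toy data interleaves two well-separated blobs so that seeding spans both.

```python
import numpy as np

def kmeans(X, k, iters=100):
    centroids = X[:k].astype(float)          # simple deterministic seeding
    labels = np.full(len(X), -1)
    for _ in range(iters):
        # Calculate membership: nearest centroid for each point
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new = d.argmin(axis=1)
        if np.array_equal(new, labels):      # stopping criterion: no change
            break
        labels = new
        # Recompute the centroids
        for j in range(k):
            if np.any(labels == j):          # keep old centroid if cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(0)
A = rng.normal(0.0, 0.1, size=(20, 2))
B = rng.normal(5.0, 0.1, size=(20, 2))
X = np.vstack([A[:1], B[:1], A[1:], B[1:]])  # interleave so seeds span both blobs
labels, centroids = kmeans(X, 2)
```

On this data the assignments settle after the first pass, which triggers the stopping criterion.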

\section{Classification with KNN}
\label{subsection:knn}
%from wiki
%1973 Duda and Hart
K-nearest neighbor (KNN) classification is a simple classification method usually used when there is little or no prior knowledge about the distribution of the data. KNN is described in \cite{knn} as follows: the complete training data is stored, and new samples are classified by choosing the majority class among the k closest examples in the training data. For our particular problem, we used the Euclidean (i.e.\ sum of squares) distance to measure the distance between samples. Finally, to determine a value for k, we investigated the performance of KNN classifiers with k fixed at 2, 4, 8 and 16. \fig{exp1} shows the results, which indicate that with k equal to 4, 8 or 16 the validation of samples is 100\%. For CLIFF, k = 4 is used.
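The classifier amounts to a majority vote over Euclidean neighbors, which fits in a few lines; the toy training set below is illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=4):
    # Majority class among the k nearest training samples (Euclidean distance)
    d = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Two small illustrative classes:
train_X = np.array([[0.0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
train_y = np.array([0, 0, 0, 1, 1, 1])
```

Note that nothing here depends on the distribution of the data or on population frequencies, which is the property the text relies on.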

\section{The Brittleness Measure}
\label{subsection:bm}
Calculating the brittleness measure is a novel operation of CLIFF. We use the brittleness measure to determine whether the results of CLIFF come from a region where all the possible results are similar or one where they are dissimilar; for the purpose of this work, the optimal result comes from a region of similar results. To make this determination, once each sample from a validation set has been classified, we record its distance from its nearest unlike neighbor (NUN), i.e.\ the nearest sample with a different class, and its distance from its nearest like neighbor (NLN), i.e.\ the nearest sample with the same class. Recall that brittleness means a small change can result in a different outcome, so the closer the NUN distances are to the NLN distances, the more brittle the model. An ideal result will therefore have the greatest separation between NUN and NLN distances.

The brittleness measure will give an output of either $high$ or $low$: high indicating that there is no significant difference between the NUN and NLN values, while $low$ indicates the opposite. The significance of these values was calculated using the Mann-Whitney U test. This is a non-parametric test which replaces the distance values with their rank or position inside the population of all sorted values.%need to talk about the rank process

\eq{bm} embodies our definition of brittleness: if the NUN distances are not significantly greater than the NLN distances, then an unacceptable level of brittleness is present in the model.

\begin{equation}
\mathit{NUN} \le \mathit{NLN} \;\Rightarrow\; \mathit{brittleness}
\label{eq:bm}
\end{equation}
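The measure can be sketched in two steps: collect the NLN and NUN distances for a validated test set, then compare the two samples with a rank-based U statistic. This is a minimal Mann-Whitney computation without tie correction, and the significance lookup is omitted; the one-dimensional data is illustrative.

```python
import numpy as np

def nln_nun_distances(train_X, train_y, test_X, test_y):
    # For each validated test sample: distance to its nearest like neighbor
    # (NLN, same class) and nearest unlike neighbor (NUN, different class)
    nln, nun = [], []
    for x, y in zip(test_X, test_y):
        d = np.linalg.norm(train_X - x, axis=1)
        nln.append(d[train_y == y].min())
        nun.append(d[train_y != y].min())
    return np.array(nln), np.array(nun)

def mann_whitney_u(a, b):
    # U statistic for sample a via rank sums (sketch: assumes no ties)
    ranks = np.concatenate([a, b]).argsort().argsort() + 1
    return ranks[:len(a)].sum() - len(a) * (len(a) + 1) / 2

# Illustrative 1-D data: two tight, well-separated classes
train_X = np.array([[0.0], [0.1], [5.0], [5.1]])
train_y = np.array([0, 0, 1, 1])
nln, nun = nln_nun_distances(train_X, train_y,
                             np.array([[0.05], [5.05]]), np.array([0, 1]))
```

When every NLN distance sits below every NUN distance, the U statistic for the NLN sample is 0, the non-brittle extreme.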


%\section{CLIFF Assessment}
%\label{section:assess}

In this chapter, we evaluate CLIFF as a forensic model on a data set donated by \cite{Karslake09} in cross validation experiments. First, we describe the data set and experimental procedures. Next we present results which show the probability of detection (pd), probability of false alarm (pf) and brittleness level of CLIFF before and after the use of the selector. 

\section{Data Set and Experimental Method}
\label{section:brit}

The data set used in this work was donated by \cite{Karslake09}. It contains 37 samples, each with five replicates ($37 \times 5 = 185$ instances). Each instance has 1151 infrared measurements ranging from 1800--650~cm$^{-1}$. (Further details of this data set can be found elsewhere \cite{Karslake09}.) For our experiments we took the original data set and created four data sets, each with a different number of clusters (3, 5, 10 and 20) or groups. These clusters were created using the K-means algorithm (\fig{kmeans}).

The effectiveness of CLIFF is measured using pd, pf and brittleness level (high, low), computed as follows. Letting A, B, C and D represent true negatives, false negatives, false positives and true positives respectively, \emph{pd} (also known as recall) is the number of true positives divided by the sum of false negatives and true positives, \emph{D / (B + D)}, while pf is \emph{C / (A + C)}. The $pd$ and $pf$ values range from 0 to 1: when there are no false alarms $pf = 0$, and at 100\% detection $pd = 1$.
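Spelled out in code, with A, B, C and D as defined above:

```python
def pd_pf(A, B, C, D):
    # A = true negatives, B = false negatives,
    # C = false positives, D = true positives
    pd = D / (B + D)    # probability of detection (recall)
    pf = C / (A + C)    # probability of false alarm
    return pd, pf
```

For example, 10 true positives with no misses and no false alarms gives pd = 1 and pf = 0.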

%The results were visualized using \emph{quartile charts}. To generate these charts the performance measures for an analysis are sorted to isolate the median and lower and upper quartile of numbers. In our quartile charts, the upper and lower quartiles are marked with black lines; the median is marked with a black dot; and the vertical bars are added to mark the 50\% percentile value. 
%need to include examples
The brittleness level measure is computed as follows. First we calculate Euclidean distances between the already-validated testing set and the training set. For each instance in the validation set, the distance from its nearest like neighbor (NLN) and from its nearest unlike neighbor (NUN) is found. Using the NLN and NUN distances from the entire validation set, a Mann-Whitney U test is used to test for a statistical difference between the NLN and NUN distances. The following sections describe two experiments and discuss their results.

\section{Experiment 1: KNN as a forensic model?}

Our goal is to determine whether KNN is an adequate model for forensic evaluation. In other words, can it be used in preference to current statistical models? To answer this question, our experiment design follows the pseudo code given in \fig{knnexp1} for the four data sets created from the original data set. For each data set, tests were built from 20\% of the data, selected at random, and the models were learned from the remaining 80\% of the data.

This procedure was repeated 5 times, randomizing the order of the data each time. In the end, CLIFF is tested and trained 25 times for each data set.

\begin{figure}[h!]
\small
\begin{center}
\begin{tabular}{ p{7cm} }
\hline
\begin{verbatim}
DATA = [3, 5, 10, 20]
LEARNER = [KNN]
STAT_TEST = [Mann Whitney]

REPEAT 5 TIMES
 FOR EACH data IN DATA
  TRAIN = random 90% of data
  TEST = data - TRAIN
		
  // Construct model from TRAIN data
  MODEL = Train LEARNER with TRAIN
  // Evaluate model on test data
  [brittleness] = STAT_TEST on NLN and NUN
  [pd, pf, brittleness] = MODEL on TEST
 END
END	
\end{verbatim}
 \\ \hline
    \end{tabular}
\end{center}
\caption{Pseudo code for Experiment 1}\label{fig:knnexp1}
\end{figure}

\subsection{Results from Experiment 1}

\fig{result1} shows the 25\%, 50\% and 75\% percentile values of the $pd$, $pf$ and position values in each data set when r=1 (upper table) and r=2 (lower table). Next to these is the brittleness signal, where $high$ signals an unacceptable level of brittleness and $low$ an acceptable level. The results show that the brittleness level for each data set is $low$. The $pd$ and $pf$ results are promising: 50\% of the pd values are at or above 95\% for the data set with 3 clusters and at 100\% for the other data sets, while 50\% of the pf values are at 3\% for 3 clusters and 0\% for the others. These results show that our model is highly discriminating and can be used successfully in the evaluation of trace evidence.

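The percentile columns in the results tables are straightforward to reproduce from the per-run scores. The sketch below uses hypothetical $pd$ values (not the experiment's actual runs) to show how numpy's default linear-interpolation percentiles yield the three reported columns.

```python
import numpy as np

# Hypothetical pd scores from 10 runs, for illustration only.
pd_runs = np.array([90, 92, 95, 95, 96, 97, 98, 100, 100, 100])

# The 25%, 50% and 75% columns of the results tables are these percentiles.
q25, q50, q75 = np.percentile(pd_runs, [25, 50, 75])
```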

\begin{figure}
\begin{center}
\begin{tabular}{l@{~}|l@{~}|r@{~}r@{~}@{~}r@{~}| c@{~}|}
%\begin{tabular}{l@{~}|l@{~}|r@{~}|r@{~}|}
%\multicolumn{2}{c}{~}&\multicolumn{5}{c}{quartiles}\\\cline{3-5}
%&min& & median & &max\\\cline{3-5}
%\multicolumn{2}{c}{~}&\multicolumn{2}{c}{Before}\\
%\multicolumn{2}{c}{~}&\multicolumn{2}{c}{percentiles} \\\cline{3-4}
%Clusters& Type & Before & After \\\hline
\multicolumn{2}{c}{~}&\multicolumn{3}{c}{Before}\\
\multicolumn{2}{c}{~}&\multicolumn{3}{c}{percentiles}\\\cline{3-5}
Clusters & Types & 25\%& 50\% & 75\%  & Brittleness Level\\\hline
\multirow{3}{*}{3} & pd & 90& 95& 100 & \multirow{3}{*}{low}  \\
 & pf & 0& 3& 4 &\\
 & position & 264& 614& 1068 &\\ 
 \hline
\multirow{3}{*}{5} & pd & 94& 100& 100 & \multirow{3}{*}{low}\\
 & pf & 0& 0& 0  & \\
 & position & 374& 855& 1225 &\\
  \hline
\multirow{3}{*}{10} & pd & 75& 100& 100  & \multirow{3}{*}{low}\\
 & pf & 0& 0& 0 &\\
 & position & 361& 783& 1254 & \\
 \hline
\multirow{3}{*}{20} & pd & 0& 100& 100 & \multirow{3}{*}{low}\\
 & pf & 0& 0& 3 &\\
 & position & 377& 762& 1256 & \\
  \hline 
%\multicolumn{5}{c}{~}&~~~~~0~~~~~~~~50~~~~100 
\end{tabular}

\\ \\ \hline

\begin{tabular}{l@{~}|l@{~}|r@{~}r@{~}@{~}r@{~}| c@{~}|}
\multicolumn{2}{c}{~}&\multicolumn{3}{c}{Before}\\
\multicolumn{2}{c}{~}&\multicolumn{3}{c}{percentiles}\\\cline{3-5}
Clusters & Types & 25\%& 50\% & 75\%  & Brittleness Level\\\hline
\multirow{3}{*}{3} & pd & 89& 94& 100 & \multirow{3}{*}{low}  \\
 & pf & 0& 0& 4 &\\
 & position & 419& 905& 1351 &\\ 
 \hline
\multirow{3}{*}{5} & pd & 94& 100& 100 & \multirow{3}{*}{low}\\
 & pf & 0& 0& 0  & \\
 & position & 437& 903& 1297 &\\
  \hline
\multirow{3}{*}{10} & pd & 50& 100& 100  & \multirow{3}{*}{low}\\
 & pf & 0& 0& 3 &\\
 & position & 442& 908& 1354 & \\
 \hline
\multirow{3}{*}{20} & pd & 0& 67& 100 & \multirow{3}{*}{low}\\
 & pf & 0& 0& 0 &\\
 & position & 437& 896& 1345 & \\
  \hline 
%\multicolumn{5}{c}{~}&~~~~~0~~~~~~~~50~~~~100 
\end{tabular}

\end{center}
\caption{Results for Experiment 1 for the 4 data sets, distinguished by the number of clusters. Both tables use n=4; the upper table uses r=1 and the lower table r=2.}\label{fig:result1}
\end{figure}




\section{Experiment 2: Can brittleness be reduced?}

The first experiment shows that KNN creates strong models for forensic evaluation, with high $pd$s, low $pf$s and low brittleness levels. In Experiment 2 we ask whether these results can be improved by reducing brittleness further. Since we believe that it is the nearness of unlike neighbors which causes brittleness (see \eq{bm}), in this section we evaluate the CLIFF selector, which selects from each cluster a subset of data that best represents that cluster. We expect this to increase the distance between unlike neighbors and therefore decrease brittleness, while maintaining $pd$ and $pf$ results comparable to Experiment 1. We also expect the position values to be greater than those in Experiment 1.
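The idea of keeping only the samples that best represent their cluster can be sketched as follows. This centroid-distance heuristic is an illustrative stand-in under simplifying assumptions, not the CLIFF selector itself; the `keep` fraction is a hypothetical parameter.

```python
import numpy as np

def select_prototypes(X, labels, keep=0.5):
    """Keep, per cluster, the `keep` fraction of samples nearest its centroid.

    Returns the sorted indices of the retained samples. Samples far from
    their cluster centroid (e.g. erroneous measurements) are dropped.
    """
    kept = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = X[idx].mean(axis=0)
        d = np.linalg.norm(X[idx] - centroid, axis=1)
        n_keep = max(1, int(round(keep * len(idx))))
        kept.extend(idx[np.argsort(d)[:n_keep]])
    return np.array(sorted(kept))
```

Outliers within each cluster sit far from the centroid, so they are discarded first, which is the behavior we want from any prototype selector in this setting.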

The design for this experiment is shown in \fig{knnexp2}. It is similar to that in \fig{knnexp1}, except that the CLIFF selector, described in Section~\ref{subsection:selector}, is included.

\begin{figure}[h!]
\small
\begin{center}
\begin{tabular}{ p{8cm} }
\hline
\begin{verbatim}
DATA = [3, 5, 10, 20]
LEARNER = [KNN]
STAT_TEST = [Mann Whitney]
SELECTOR = [CLIFF selector]

REPEAT 5 TIMES
 FOR EACH data IN DATA
  TRAIN = random 80% of data
  TEST = data - TRAIN
		
  // CLIFF selector: select best from clusters
  N_TRAIN = SELECTOR with TRAIN

  // Construct model from N_TRAIN data
  MODEL = Train LEARNER with N_TRAIN
  // Evaluate model on test data
  [pd, pf, brittleness] = MODEL on TEST
  [brittleness] = STAT_TEST on NLN and NUN
 END
END	
\end{verbatim}
 \\ \hline
    \end{tabular}
\end{center}
\caption{Pseudo code for Experiment 2}\label{fig:knnexp2}
\end{figure}       

\subsection{Results from Experiment 2}                                                                


\fig{result2} shows that for 5 and 10 clusters the median (50\%) $pd$ and $pf$ values remain the same, while for 3 and 20 clusters the median $pd$s have decreased to 82\% and 67\% respectively. The brittleness level also remains low for each data set. The results in \fig{result2} do not by themselves reveal the difference between the low brittleness levels of \fig{result1} and \fig{result2}; however, the model remains strong. \fig{dist3} illustrates the reduction of brittleness after the CLIFF selector is applied. The Mann Whitney U test was also applied to these results to determine whether there is a statistical difference between the before and after results; the test indicated that the $after$ results are better than the $before$ results (see \fig{result3}). Brittleness can therefore be reduced while maintaining comparable results.
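The before/after comparison rests on the Mann Whitney U statistic. Below is a minimal rank-sum sketch of how U is computed (ignoring tie correction and the p-value step); in practice a library routine such as `scipy.stats.mannwhitneyu` would be used instead.

```python
import numpy as np

def mann_whitney_u(a, b):
    """Rank-sum U statistic for sample `a` versus sample `b` (no tie handling).

    U = 0 when every value in `a` is below every value in `b`;
    U = len(a) * len(b) when every value in `a` is above every value in `b`.
    """
    combined = np.concatenate([a, b])
    ranks = np.argsort(np.argsort(combined)) + 1  # 1-based ranks of all values
    r_a = ranks[:len(a)].sum()                    # rank sum of sample `a`
    return r_a - len(a) * (len(a) + 1) / 2
```

Applied to the before and after position values, a U near its extreme indicates that one population is systematically shifted relative to the other, which is the signal summarized in \fig{result3}.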

In summary, by using CLIFF, inappropriate statistical assumptions about the data are avoided, and we have found a successful way to reduce brittleness and create strong forensic evaluation models. One further point is worth noting: in order to evaluate data sets with multiple variables, a host of new statistical models have been built \cite{09Zadora, 09aZadora, 06Aitken, 04Aitken, 02Koons, 99Koons}. This has been the case for forensic scientists building models for glass interpretation that use the elemental composition of glass rather than just the refractive indices. With CLIFF, by contrast, an increase in the number of variables does not require a new model; it works with any data set.



\begin{figure}[ht!]
\begin{center}
\begin{tabular}{l@{~}|l@{~}|r@{~}r@{~}@{~}r@{~}| c@{~}|}
\multicolumn{2}{c}{~}&\multicolumn{3}{c}{After}\\
\multicolumn{2}{c}{~}&\multicolumn{3}{c}{percentiles}\\\cline{3-5}
Clusters & Types & 25\%& 50\% & 75\% & Brittleness Level\\\hline
\multirow{3}{*}{3} & pd & 49& 82& 100 & \multirow{3}{*}{low} \\
 & pf & 0& 9& 20 &\\
 & position & 787& 1228& 1609 &\\
  \hline
\multirow{3}{*}{5} & pd & 94& 100& 100 & \multirow{3}{*}{low} \\
 & pf & 0& 0& 0  & \\
 & position & 563& 988& 1532 &\\ 
  \hline
\multirow{3}{*}{10} & pd & 60& 100& 100 & \multirow{3}{*}{low} \\
 & pf & 0& 0& 3 & \\
 & position & 578& 1048& 1463 &\\
  \hline
\multirow{3}{*}{20} & pd & 0& 67& 100 & \multirow{3}{*}{low} \\
 & pf & 0& 0& 3 & \\
 & position & 601& 1081& 1481 &\\
  \hline 
%\multicolumn{5}{c}{~}&~~~~~0~~~~~~~~50~~~~100 


\end{tabular}

\\ \\ \hline
\begin{tabular}{l@{~}|l@{~}|r@{~}r@{~}@{~}r@{~}| c@{~}|}
\multicolumn{2}{c}{~}&\multicolumn{3}{c}{After}\\
\multicolumn{2}{c}{~}&\multicolumn{3}{c}{percentiles}\\\cline{3-5}
Clusters & Types & 25\%& 50\% & 75\% & Brittleness Level\\\hline
\multirow{3}{*}{3} & pd & 89& 100& 100 & \multirow{3}{*}{low} \\
 & pf & 0& 0& 5 &\\
 & position & 633& 1047& 1432 &\\
  \hline
\multirow{3}{*}{5} & pd & 90& 100& 100 & \multirow{3}{*}{low} \\
 & pf & 0& 0& 0  & \\
 & position & 507& 982& 1465 &\\ 
  \hline
\multirow{3}{*}{10} & pd & 100& 100& 100 & \multirow{3}{*}{low} \\
 & pf & 0& 0& 0 & \\
 & position & 506& 968& 1426 &\\
  \hline
\multirow{3}{*}{20} & pd & 0& 80& 100 & \multirow{3}{*}{low} \\
 & pf & 0& 0& 0 & \\
 & position & 495& 957& 1424 &\\
  \hline 
%\multicolumn{5}{c}{~}&~~~~~0~~~~~~~~50~~~~100 


\end{tabular}

\end{center}
\caption{Results for Experiment 2 for the 4 data sets, distinguished by the number of clusters. Both tables use n=4; the upper table uses r=1 and the lower table r=2.}\label{fig:result2}
\end{figure}



\begin{figure*}[ht!]
%  \begin{center}
  \scalebox{0.85}{
    \begin{tabular}{l}
      \resizebox{50mm}{!}{\includegraphics{bd3r1}} 
      \resizebox{50mm}{!}{\includegraphics{bd5r1}} 
      \resizebox{50mm}{!}{\includegraphics{bd10r1}}
      \resizebox{50mm}{!}{\includegraphics{bd20r1}} \\
      \resizebox{50mm}{!}{\includegraphics{bd3r2}} 
      \resizebox{50mm}{!}{\includegraphics{bd5r2}} 
      \resizebox{50mm}{!}{\includegraphics{bd10r2}}
      \resizebox{50mm}{!}{\includegraphics{bd20r2}} \\
    \end{tabular}}
    \caption{Position of values in the ``before'' and ``after'' populations for the data sets with 3, 5, 10 and 20 clusters. The first row shows the results for r=1; the second row shows the results for r=2.}
    \label{fig:dist3}
 % \end{center}
\end{figure*}

\begin{figure}[ht!]
\begin{center}
\begin{tabular}{l@{~}|l@{~}| c@{~}|}
Clusters & Treatments &  Significance\\\hline
\multirow{2}{*}{3} & before &  \multirow{2}{*}{-1} \\
 & after  &\\
  \hline
\multirow{2}{*}{5} & before & \multirow{2}{*}{-1} \\
 & after   & \\
  \hline
\multirow{2}{*}{10} & before  & \multirow{2}{*}{-1} \\
 & after  & \\
  \hline
\multirow{2}{*}{20} & before  & \multirow{2}{*}{-1} \\
 & after  & \\
  \hline 
%\multicolumn{5}{c}{~}&~~~~~0~~~~~~~~50~~~~100 
\end{tabular}
\end{center}
\caption{Mann Whitney U test results comparing the before and after treatments for Experiment 2. A value of -1 indicates that the after results are better than the before results.}\label{fig:result3}
\end{figure}

