

\usepackage{cite}
\usepackage[table]{xcolor}

\section{Building the Experiment}   

Recall from the above discussion that a good data miner should do better than the ZeroR lower bound on performance.
Table \ref{tabDependentVarsDistr} tells us that this lower bound is:
\begin{itemize}
\item For first year retention: 71.3\% 
\item For second year retention: 60.36\% 
\item For third year retention: 54.78\% 
\end{itemize}
As discussed below, we will be able to do much better than some, but not all, of these targets. This was achieved by
\bi
\item Removing spurious attributes using {\em feature subset  selection};
\item Exploring a large range of classifiers;
\ei

The following sections briefly explain each method used; results
obtained from combinations of these methods are then analyzed.

\subsection{Feature Subset Selection}

Table \fig{data} shows a sample of the 103 attributes used in this
study. Our pre-experimental suspicion was that some of the attributes
were ``noisy''; i.e., contained signals not related to the prediction
target, retention. Therefore, before learning a theory, we first
explored {\em attribute selection}.

Note that the number of attributes selected is
crucial in the analysis of the data, because it allows us to comment on
the hypotheses shown in the last section. If a hypothesis relies on attributes whose removal from the data
does not change the performance of the predictor, then that suggests the hypothesis is spurious.

In this experiment, we ranked the 103 attributes from most informative to least informative.
We then built theories using the top $n\in\{5,10,\ldots,100,103\}$ ranked attributes. Attributes
were discarded if adding them did not improve the performance of our retention predictors.

The attributes were ranked using one of four methods:

\begin{enumerate}

\item{{\em Correlation-based feature
selection} (CFS)~\cite{Hall00correlation-basedfeature} constructs a matrix
of feature-to-feature and feature-to-class correlations. CFS
uses a best-first search, expanding the best subsets until no
improvement is made, in which case the search falls back to the unexpanded
subset with the next-best evaluation, until a subset-expansion
limit is met.}

\item{{\em Information Gain} uses a concept
from information theory called {\em entropy}. Entropy measures the
amount of uncertainty, or randomness, associated with a
random variable; high entropy can thus be seen as a lack of purity
in the data. Information gain, as described in~\cite{Mitchell97},
is the expected reduction in entropy that occurs when the
examples in the data are split using a particular attribute. Therefore
an attribute with high purity (high information gain) describes
the data better than one with low purity. The
attributes are then ranked by sorting their information
gain scores in descending order.}

\item{The {\em chi-squared} statistic~\cite{Moore06} is used in
statistical tests to determine how the distributions of variables
differ from one another. Note that these variables must be
categorical in nature.
The chi-squared statistic can thus evaluate
an attribute's worth by computing the value of this statistic
with respect to a class. Attributes can then be ranked on
this statistic.}

\item{The {\em One-R} classifier, described below, can also be used to deliver
top-ranking attributes. One-R constructs and scores rules that each use a single attribute;
feature selectors based on One-R sort the attributes by these scores.}

\end{enumerate}
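To make the ranking step concrete, the information-gain ranker above can be sketched as follows. This is a minimal illustration over categorical attributes; the function names and toy data are our own, not drawn from the study:

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy of a class distribution: high entropy means low purity.
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(column, labels):
    # Expected reduction in class entropy after splitting on one attribute.
    gain = entropy(labels)
    n = len(labels)
    for value in set(column):
        subset = [l for x, l in zip(column, labels) if x == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def rank_attributes(rows, labels):
    # rows: examples as lists of categorical attribute values.
    # Returns attribute indices sorted from most to least informative.
    scores = [(info_gain([r[i] for r in rows], labels), i)
              for i in range(len(rows[0]))]
    return [i for _, i in sorted(scores, reverse=True)]
```

Building theories on the top $n$ ranked attributes then reduces to keeping only the columns named by the first $n$ entries of the returned ranking.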

\subsection{Classifiers} 

Classifiers employ machine learning
techniques to learn connections between the independent features and the dependent feature (called the {\em class}).
Once these patterns
are learned, they can be used to predict outcomes for new, unseen data.

This study tried seven different classifiers:

\begin{enumerate}
\item{Simple rule learning with {\em One-R}:  One-R, described in~\cite{holte93}, builds rules from the data by iteratively examining each value of an attribute and counting the frequency of each class for that attribute-value pair. An attribute-value is then assigned the most frequently occurring class. Error rates of each of the rules can then be calculated, and the best rules can be ranked based on the lowest error rates.}
\item{Decision tree learning with {\em C4.5}:
 C4.5~\cite{quinlanC4.5} is an 
 extension to the ID3~\cite{quinlanID3} algorithm.
 A decision tree~\cite{Mitchell97DescTree} (shown in
 Figure~\ref{fig:decisionTree}) is constructed by first determining
 the best attribute to use as the root node of the tree. ID3 chooses
 as the root the attribute that best classifies the training
 examples, as measured by the attribute's information gain (described
 above)~\cite{quinlan92}. Then, for each value of the attribute representing any
 node in the tree, the algorithm recursively builds child nodes
 based on how well another attribute from the data describes that
 specific branch of its parent node. The stopping criteria are met
 either when the tree perfectly classifies all training examples,
 or when no attribute remains unused. C4.5 extends ID3 with
 several improvements, such as the ability to operate on both
 continuous and discrete attributes, to handle training data
 containing missing values for a given attribute, and to employ
 pruning techniques on the resulting tree.}

\begin{figure}
\begin{center}
\includegraphics[width=6.5in]{figures/Investment_decision_tree.png}
\end{center}
\caption{A decision tree consists of a root node and descending child nodes that denote decisions to make in the tree's structure. This tree, for example, was constructed in an attempt to optimize investment portfolios by minimizing budgets and maximizing payoffs. The top-most branch represents the best selection in this example.}
\label{fig:decisionTree}
\end{figure}

\item{Alternating Decision Tree learning with
{\em ADTrees}: ADTrees~\cite{Freund99thealternating} are decision trees that contain
both decision nodes and prediction nodes. Decision nodes
specify a condition, while prediction nodes contain only a number.
As an example in the data follows paths in the ADTree, it
traverses only those branches whose decision nodes are true; the example
is then classified by summing all prediction nodes encountered
in this traversal. ADTrees thus differ from binary classification
trees, such as those built by C4.5, in which an example traverses
only a single path down the tree.}

\begin{figure}
\begin{center}
\includegraphics[width=4.5in]{figures/bayesnet_diagram.png}
\end{center}
\caption{In this simple Bayesian network, the variable {\em Sprinkler} depends on whether it is raining; the sprinkler is generally not turned on when it is raining. However, either event can cause the grass to become wet: rain, or the sprinkler being turned on. Bayesian networks thus excel at investigating relationships between variables.}
\label{fig:bayesnetwork}
\end{figure}


\item{Simple probabilistic learning with {\em Naive Bayes}:
A naive Bayes classifier 
uses Bayes' theorem to classify
training data. Bayes' theorem, as shown in Equation~\ref{eq:bayesTheorem},
determines the probability $P$ of an event $H$ occurring given an
amount of evidence $E$. This classifier assumes feature
independence: each feature contributes to the class probability
independently, rather than depending on other features. Surprisingly,
even though this independence assumption rarely holds in practice,
naive Bayes often
outperforms many other learners~\cite{NB-performance,domingos97optimality}.}

\begin{equation}
  Pr(H|E) = \frac{Pr(E|H)\,Pr(H)}{Pr(E)}
\label{eq:bayesTheorem}
\end{equation}


\item{Complex probabilistic modeling with {\em Bayesian Networks}: Bayesian networks, illustrated in Figure~\ref{fig:bayesnetwork}, are graphical models that use a directed acyclic graph (DAG) to represent probabilistic relationships between variables.
As stated in~\cite{Heckerman96atutorial}, Bayesian networks offer four important benefits:
\begin{enumerate}
\item{Incomplete data sets can be handled well by Bayesian networks. Because the networks encode correlations between input variables, an unobserved input will not necessarily produce inaccurate predictions, as it would with other methods.}
\item{Causal relationships can be learned via Bayesian networks: for instance, an analyst may wish to know whether a certain action would produce a specific result, and to what degree.}
\item{Bayesian networks promote the amalgamation of data and domain knowledge by allowing for a straightforward encoding of causal prior knowledge, as well as the ability to encode causal relationship strength.}
\item{Bayesian networks avoid overfitting the data, as ``smoothing'' can be used so that all available data can be used for training.}
\end{enumerate}}

\item{Neural networks using {\em Radial Basis Function Networks}: A
radial basis function network (RBFN)~\cite{rbfnIntroBors} is an
artificial neural network (ANN) that uses a radial basis function as its
activation function. An ANN's activation function introduces
non-linearity into the network. This is important for multi-layer
networks with many hidden layers, because their advantage lies in
their ability to learn from examples that are not linearly separable.}

\end{enumerate}
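As a concrete example of the simplest learner above, One-R can be sketched in a few lines. This is a minimal illustration for categorical data; the names and toy data are our own, not the study's implementation:

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    # For each attribute, map every observed value to its most frequent
    # class, then keep the attribute whose rule makes the fewest errors.
    best = None
    for i in range(len(rows[0])):
        by_value = defaultdict(Counter)
        for row, label in zip(rows, labels):
            by_value[row[i]][label] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(rule[row[i]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, i, rule)
    return best  # (error count, attribute index, value -> class rule)
```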


\subsection{Cross-Validation}

The performance of a learned theory can be assessed using Equations 1 to 4.
If we use multiple
{\em hold-out} test sets, we can also discover the variance in these performance figures.

In the process of experimentation, it is crucial to determine a method's performance. Using performance criteria, further analysis can be conducted on experimental results to aid in the search for an optimal solution. Cross-validation provides the ability to discover how well a classifier performs on any given data set, or on a treatment of that data set. It is conducted by randomly partitioning the data into two subsets: the training set and the testing set. Specifically for this experiment, before partitioning, the data were reduced to the $n$ attributes selected using an FSS method.

In the learning phase, only the training subset is used by the classifier. The testing set is then used to determine how well the concepts learned during training apply to unseen data. To reduce variability, the partitioning of the data and the subsequent evaluation are repeated multiple times. In this experiment, for example, a $5 \times 5$ cross-validation was performed: five times, we randomly partitioned the data into five bins, then tested on each bin ($\frac{1}{5}$ of the data) after training on the remaining $\frac{4}{5}$. After these rounds, the median values of the validation results are examined and assigned to the particular combination of the facets described above.
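The $5 \times 5$ cross-validation procedure can be sketched as follows. This is a simplified illustration; here {\tt learner} stands for any function that trains on data and returns a predictor, and the median is taken over all 25 train/test runs:

```python
import random

def five_by_five_cv(data, labels, learner, repeats=5, folds=5, seed=1):
    # Repeat `repeats` times: shuffle, split the indices into `folds`
    # bins, then for each bin train on the remaining 4/5 of the data
    # and test on the held-out 1/5.
    rng = random.Random(seed)
    idx = list(range(len(data)))
    scores = []
    for _ in range(repeats):
        rng.shuffle(idx)
        bins = [idx[k::folds] for k in range(folds)]
        for test in bins:
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            model = learner([data[i] for i in train],
                            [labels[i] for i in train])
            correct = sum(model(data[i]) == labels[i] for i in test)
            scores.append(correct / len(test))
    scores.sort()
    return scores[len(scores) // 2]  # median accuracy over the 25 runs
```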

\section{Analysis of Experimental Results}

\subsection{Evaluation Metrics}
The evaluation metrics used in this experiment are standard in data mining for measuring the performance of a method: probability of detection (PD), probability of false alarm (PF), and variance. PD denotes the probability that the classifier correctly identifies examples of a given class, so PD values should be maximized. PF, on the other hand, is the probability that the classifier incorrectly flags examples of other classes as belonging to that class; for this reason, PF results should be minimized.

Variance in the PD and PF values was also used as an additional measure of performance. Variance provides insight into how reliable a classifier is on the data: for example, if a method's PD values range from very low to very high, the method is not consistent in its probability of detection. A very small variance in both PD and PF values is therefore desirable.
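PD and PF can be computed from a class's confusion-matrix counts; the following sketch (our own notation, reading PD as the true-positive rate and PF as the false-positive rate) shows one standard interpretation of these definitions:

```python
def pd_pf(actual, predicted, target):
    # PD: fraction of actual `target` examples the classifier detected.
    # PF: fraction of non-`target` examples wrongly flagged as `target`.
    tp = sum(1 for a, p in zip(actual, predicted) if a == target and p == target)
    fn = sum(1 for a, p in zip(actual, predicted) if a == target and p != target)
    fp = sum(1 for a, p in zip(actual, predicted) if a != target and p == target)
    tn = sum(1 for a, p in zip(actual, predicted) if a != target and p != target)
    pd = tp / (tp + fn) if tp + fn else 0.0
    pf = fp / (fp + tn) if fp + tn else 0.0
    return pd, pf
```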

\begin{figure}
\centering
\includegraphics{figures/ret1_pd.pdf}
\includegraphics{figures/ret1_pf.pdf}
\caption{Probability of Detection (PD) and Probability of False Alarm (PF) with variances for first year retention.}
\label{fig:ret1graphs}
\end{figure}

\begin{figure}
\centering
\includegraphics{figures/ret2_pd.pdf}
\includegraphics{figures/ret2_pf.pdf}
\caption{Probability of Detection (PD) and Probability of False Alarm (PF) with variances for second year retention.}
\label{fig:ret2graphs}
\end{figure}

\begin{figure}
\centering
\includegraphics{figures/ret3_pd.pdf}
\includegraphics{figures/ret3_pf.pdf}
\caption{Probability of Detection (PD) and Probability of False Alarm (PF) with variances for third year retention.}
\label{fig:ret3graphs}
\end{figure}

\subsection{Visualizing the Results}  
     Figures~\ref{fig:ret1graphs},~\ref{fig:ret2graphs}, and~\ref{fig:ret3graphs} show the median PD and PF results for first, second and third year retention, plotted against the variance of these values. Each point represents a specific combination of the number of attributes selected, the feature subset selector used to select them, and the classifier trained on the resulting data. For example, one point on a graph might be 50/Information Gain/Naive Bayes, where 50 denotes the number of attributes used. The color of each point encodes the number of attributes used in that combination.

    The horizontal line segmenting each PD graph is a baseline reference given by the retention rates already present in the data. Thus, to predict retention in a given year, it is desirable to yield results above the baseline. As the figures show, the median PD values for first year retention do not meet the baseline, so we conclude that first year retention cannot be accurately predicted using our methods. Second year retention fares better than first year retention, but the improvement is hardly significant: most of the points still lie at or below the baseline. For this reason, second year retention is also excluded from further analysis. Third year PD values, however, successfully exceed the baseline and so warrant more thorough examination.

\subsection{Narrowing the Search}

    From the visualizations described above, we can narrow the space of combinations to examine for third year retention. The graphs of PD and PF medians show that the range of attribute counts that maximizes PD and minimizes PF while maintaining minimal variance is approximately 20 to 60. This is useful, as it allows us to filter the results and concentrate only on treatments whose attribute counts lie in this range.

\subsubsection{Ranking with the Mann-Whitney Test}

        Even after pruning the results based on this attribute range, we are left with many combinations to analyze. To rank each combination, we performed a Mann-Whitney test at 95\% confidence. A treatment's rank is determined by how many times it wins against the other combinations; the method with the most wins receives the highest rank. The table in Figure~\ref{fig:ranktable} shows the top-ranking combinations based on the PD performance measure. Note that identical ranks are given to treatments whose win counts are equal.
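The win-counting scheme can be sketched as follows. This is a simplified illustration using the normal approximation to the Mann-Whitney U statistic, without tie correction; the treatment names and scores are invented:

```python
import math
import itertools

def mann_whitney_z(xs, ys):
    # U = number of (x, y) pairs with x > y, counting ties as one half;
    # z is its normal-approximation standardization (no tie correction).
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    n1, n2 = len(xs), len(ys)
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (u - n1 * n2 / 2.0) / sigma

def rank_by_wins(results, z_crit=1.96):
    # results: {treatment: [PD scores from the cross-validation runs]}.
    # A treatment "wins" when its scores are significantly higher than
    # another's; treatments are then sorted by their win counts.
    wins = {t: 0 for t in results}
    for a, b in itertools.permutations(results, 2):
        if mann_whitney_z(results[a], results[b]) > z_crit:
            wins[a] += 1
    return sorted(wins.items(), key=lambda kv: -kv[1])
```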

\begin{figure}
  \begin{center}
    \rowcolors{1}{gray!10}{}
    \begin{tabular}{| c | c | c | c |}
      \hline
      Rank & Number of Attributes & FSS & Classifier \\
      \hline
      61&         30&oneR&bnet \\ \hline
      61&50&cfs&adtree \\ \hline
      57&       50&oneR&adtree \\ \hline
      56&       30&oneR&adtree \\ \hline
      55&        30&cfs&adtree \\ \hline 
      52&         50&oneR&bnet \\ \hline
      51&   30&infogain&adtree \\ \hline
      51&          30&cfs&bnet \\ \hline
      48&   50&infogain&adtree \\ \hline
    \end{tabular}
  \end{center}
  \caption{The top-ranking treatments for third year retention. Ranks represent how many times a particular treatment wins over all other treatments in the experiment.}
  \label{fig:ranktable}
\end{figure}

\subsection{Selected FSS and Classifier}

    Figure~\ref{fig:ranktable} shows that the top-ranking combinations of FSS and classifier use either 30 or 50 attributes. Since treatments with both attribute counts (along with their own FSS and classifier) achieve the same Mann-Whitney rank, the two are statistically similar; by focusing on only 30 selected attributes, we can concentrate on approximately one third of the original data. Our analysis of the results thus shows that the 30 attributes selected using One-R as the feature subset selection method are the most critical to third year retention.
   
\bibliographystyle{plain}
\newpage
\bibliography{refs.bib}

\end{document}

