\section{Building the Experiment}   

In Section~\ref{sect:assess}, we assert that
a good data miner should do better than the  simplistic ZeroR learner.
Table \ref{tabDepedentVarsDistr} tells us that the lower bound is:
\begin{itemize}
\item For first year retention: 71.3\%.
\item For second year retention: 60.4\%. 
\item For third year retention: 54.8\%.
\end{itemize}
As discussed below, we were able to do much better than some, but not all, of these targets. This was achieved by:
\begin{itemize}
\item Removing spurious attributes using {\em feature subset  selection};
\item Exploring {\em a large range of classifiers};
\item Assessing the learned theories by their {\em variance}, as well as their {\em median} performance;
\item Studying the {\em delta} of student factors {\em between} those who leave and those who are retained.
\end{itemize}

%The following represents brief explanations of each
%method used. Results obtained from a combination of which are then
%analyzed.

\subsection{Feature Subset Selection}

Table \ref{tabAttrswHypothesis} shows a sample of the 103 attributes used in this
study. Our
pre-experimental suspicion was that some of the attributes were
``noisy''; i.e., they contain signals not related to the target of
prediction (retention).  Therefore, before learning a theory, we first
explored {\em attribute selection}.

Note that the number of attributes to select is
crucial in the analysis of the data, because it allows us to comment on 
the hypotheses shown in the last section. If removal of attributes from a hypothesis does not change the performance of the prediction, then that hypothesis is spurious.

In this experiment, we ranked the 103 attributes from most informative to least informative.
We then built theories using the top $n\in\{5,10,15,\ldots,100,103\}$ ranked attributes. Attributes
were then discarded if adding them in did not improve the performance of our retention predictors.

The attributes were ranked using one of four methods: CFS, Information Gain, chi-squared, and One-R.
{\em Correlation-based feature
selection} (CFS) constructs a matrix
of feature-to-feature and feature-to-class correlations~\citep{Hall00correlation-basedfeature}. CFS
uses a best-first search, expanding the best subset until no
improvement is made, in which case the search backtracks to the unexpanded
subset with the next-best evaluation, until a limit on subset expansions
is met.
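To make this concrete, the following is a minimal Python sketch of CFS-style evaluation. The merit formula follows \cite{Hall00correlation-basedfeature}, but the data structures (correlation tables keyed by feature name and feature pair) are illustrative assumptions, and the search below only grows one subset greedily rather than maintaining the full queue of unexpanded subsets:

```python
import math
from itertools import combinations

def cfs_merit(subset, feat_class, feat_feat):
    """Hall's CFS merit: k*rcf / sqrt(k + k(k-1)*rff), where rcf is the mean
    feature-class correlation and rff the mean feature-feature correlation
    over the k features in the subset."""
    k = len(subset)
    rcf = sum(feat_class[f] for f in subset) / k
    pairs = list(combinations(subset, 2))
    rff = sum(feat_feat[frozenset(p)] for p in pairs) / len(pairs) if pairs else 0.0
    return k * rcf / math.sqrt(k + k * (k - 1) * rff)

def best_first(features, feat_class, feat_feat, patience=5):
    """Greedy best-first sketch: grow the current subset by the single best
    feature each step; stop after `patience` consecutive non-improvements.
    (The full algorithm also backtracks to other unexpanded subsets.)"""
    current, best, best_score, stale = [], [], 0.0, 0
    while stale < patience:
        candidates = [current + [f] for f in features if f not in current]
        if not candidates:
            break
        current = max(candidates,
                      key=lambda s: cfs_merit(s, feat_class, feat_feat))
        score = cfs_merit(current, feat_class, feat_feat)
        if score > best_score:
            best, best_score, stale = current, score, 0
        else:
            stale += 1
    return best
```

Note how the merit formula rewards subsets whose features correlate with the class but not with each other, which is how CFS discards redundant attributes.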

{\em Information Gain} uses an information theory concept called {\em entropy}. Entropy measures the
amount of uncertainty, or randomness, that is associated with a
random variable. Thus, high entropy can be seen as a lack of purity
in the data. Information gain, as described in~\cite{Mitchell97}
is an expected reduction of the entropy measure that occurs when
splitting examples in the data using a particular attribute. Therefore
an attribute that has a high purity (high information gain) is
better at describing the data than one that has a low purity. The
resulting attributes are then ranked by sorting their information
gain scores in descending order.
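The entropy and information gain calculations can be sketched in a few lines of Python; the dictionary-based row format here is an illustrative assumption, not the study's actual data layout:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels; 0 means pure."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    """Expected reduction in entropy from splitting on `attr`:
    H(labels) minus the size-weighted entropy of each value's subset."""
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

def rank_by_info_gain(rows, attrs, labels):
    """Rank attributes by descending information gain."""
    return sorted(attrs, key=lambda a: info_gain(rows, a, labels), reverse=True)
```

For example, an attribute that splits the data into pure subsets gets the maximum gain, while one whose values leave the class distribution unchanged gets a gain of zero.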

The {\em chi-squared} statistic is used in
statistical tests to determine how distributions of variables are
different from one another~\citep{Moore06}. Note that these variables must be
categorical in nature. 
Thus, the chi-squared statistic can evaluate
an attribute's worth by calculating the value of this statistic
with respect to a class. Attributes can then be ranked based on
this statistic.
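A minimal sketch of that calculation, assuming categorical attribute values and class labels arrive as parallel lists:

```python
from collections import Counter

def chi_squared(values, labels):
    """Chi-squared statistic of one categorical attribute against the class:
    sum over contingency-table cells of (observed - expected)^2 / expected,
    where `expected` assumes attribute and class are independent."""
    n = len(values)
    value_totals = Counter(values)
    class_totals = Counter(labels)
    observed = Counter(zip(values, labels))
    stat = 0.0
    for v in value_totals:
        for c in class_totals:
            expected = value_totals[v] * class_totals[c] / n
            stat += (observed[(v, c)] - expected) ** 2 / expected
    return stat
```

An attribute whose distribution is independent of the class scores near zero; the further the observed cell counts sit from the independence expectation, the higher the score and the rank.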

The {\em One-R} classifier, described below, can be used to deliver
top-ranking attributes. One-R constructs and scores rules using one attribute.
Feature selectors using One-R
sort the attributes 
based on these scores.
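A sketch of One-R-based ranking in Python (again assuming a dictionary-based row format for illustration):

```python
from collections import Counter, defaultdict

def one_r_accuracy(values, labels):
    """One-R rule for a single attribute: predict the majority class of each
    attribute value; return the rule's training accuracy (1 - error rate)."""
    by_value = defaultdict(list)
    for v, y in zip(values, labels):
        by_value[v].append(y)
    correct = sum(Counter(ys).most_common(1)[0][1] for ys in by_value.values())
    return correct / len(labels)

def rank_by_one_r(rows, attrs, labels):
    """Sort attributes by their One-R accuracy, best first."""
    return sorted(attrs,
                  key=lambda a: one_r_accuracy([r[a] for r in rows], labels),
                  reverse=True)
```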

\subsection{Classifiers} 

In data mining,  classifiers are  used to learn connections between independent features and the dependent feature (called the {\em class}).  
Once these patterns
are learned, we can use them to predict outcomes on
new, unseen data.

This study tried six different classifiers: One-R, C4.5, ADTrees, Naive Bayes, Bayes networks,
and radial basis function networks. These are well-known, standard classifiers in the machine learning field, except for ADTrees.   {\em One-R}, described in~\cite{holte93}, builds rules from the data by iteratively examining each value of an attribute and counting the frequency of each class for that attribute-value pair. Each attribute-value is then assigned the most frequently occurring class. The error rate of each rule is then calculated, and the rules are ranked from lowest to highest error rate.

A
{\em radial basis function network} (RBFN) is an
artificial neural network (ANN) that
utilizes a radial basis function as an
activation function~\citep{rbfnIntroBors}. An ANN's activation function introduces
non-linearity into the network. This is important for
multi-layer networks containing many hidden layers, because their
advantage lies in their ability to learn from non-linearly separable
examples.

 {\em C4.5}~\citep{quinlanC4.5} is an 
 extension to the ID3~\citep{quinlanID3} algorithm.
 A decision tree (shown in
 Figure~\ref{fig:decisionTree}) is constructed by first determining
 the best attribute to make as the root node of the tree~\citep{Mitchell97}. ID3 decides
 this root attribute by using one that best classifies training
 examples based upon the attribute's information gain (described
 above)~\citep{quinlanID3}. Then, for each value of the attribute representing any
 node in the tree, the algorithm recursively builds child nodes
 based on how well another attribute from the data describes that
 specific branch of its parent node. The learning stops when the tree perfectly classifies all training examples,  or when all attributes have been used. C4.5 extends ID3 by making
 several improvements, such as the ability to operate on both
 continuous and discrete attributes, to handle training data that
 contains missing values for given attributes, and to employ pruning
 techniques on the resulting tree.
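The recursive construction just described can be sketched as a minimal ID3 in Python; this is the bare algorithm only, without C4.5's continuous attributes, missing-value handling, or pruning:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    """Expected entropy reduction from splitting `rows` on `attr`."""
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    rem = sum(len(ys) / len(labels) * entropy(ys) for ys in split.values())
    return entropy(labels) - rem

def id3(rows, labels, attrs):
    """Recursive ID3 sketch: the highest-gain attribute becomes the node;
    recursion stops at pure nodes or when attributes run out."""
    if len(set(labels)) == 1:
        return labels[0]                              # pure leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]   # majority-class leaf
    best = max(attrs, key=lambda a: info_gain(rows, a, labels))
    children = {}
    for v in set(r[best] for r in rows):
        keep = [i for i, r in enumerate(rows) if r[best] == v]
        children[v] = id3([rows[i] for i in keep], [labels[i] for i in keep],
                          [a for a in attrs if a != best])
    return (best, children)
```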

\begin{figure}
\begin{center}
\includegraphics[width=6.5in]{images/Investment_decision_tree.png}
\end{center}
\caption{A decision tree consists of a root node and descending child nodes that denote decisions to make in the tree's structure. This tree, for example, was constructed in an attempt to optimize investment portfolios by minimizing budgets and maximizing pay-offs. The top-most branch represents the best selection in this example.}
\label{fig:decisionTree}
\end{figure}

{\em ADTrees} are decision trees that contain
both decision nodes, as well as prediction nodes~\citep{Freund99thealternating}. Decision nodes
specify a condition, while prediction nodes contain only a number.
Thus, as an example from the data follows paths in the ADTree, it
traverses only those branches whose decision-node conditions it satisfies. The example
is then classified by summing all prediction nodes encountered
in this traversal. ADTrees thus differ from binary classification
trees, such as those built by C4.5, which traverse only
a single path down the tree.
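The scoring step can be sketched as follows. For simplicity this sketch flattens the tree into a list of decision nodes with their two prediction values; a real ADTree also nests decision nodes beneath prediction nodes, and the attribute names in the example are hypothetical:

```python
def adtree_score(instance, root_score, nodes):
    """ADTree classification sketch: sum the root prediction value plus, for
    every decision node, the prediction value of whichever branch the
    instance satisfies. Unlike a single-path C4.5 tree, every node here
    can contribute to the final score. `nodes` is a flat list of
    (condition, score_if_true, score_if_false) triples."""
    total = root_score
    for condition, if_true, if_false in nodes:
        total += if_true if condition(instance) else if_false
    return total

# Hypothetical example: the sign of the summed score gives the class.
nodes = [(lambda r: r["gpa"] > 3.0, 0.5, -0.4),
         (lambda r: r["financial_aid"], 0.7, -0.6)]
```

A positive total would then be read as one class and a negative total as the other; the magnitude gives a rough measure of confidence.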

\begin{figure}
\begin{center}
\includegraphics[width=4.5in]{images/bayesnet_diagram.png}
\end{center}
\caption{In this simple Bayesian network, the variable {\em Sprinkler} depends on whether or not it is raining; the sprinkler is generally not turned on when it is raining. However, either event can cause the grass to become wet: rain, or the sprinkler turning on. Thus, Bayesian networks excel at representing relationships between variables.}
\label{fig:bayesnetwork}
\end{figure}


A {\em naive Bayes} classifier 
uses Bayes' theorem to classify
training data. Bayes' theorem, shown in Equation~\ref{eq:bayesTheorem},
gives the probability $Pr(H|E)$ of a hypothesis $H$ given
evidence $E$. This classifier assumes feature
independence; the algorithm examines features independently to
contribute to probabilities, as opposed to assuming that
features depend on other features. Surprisingly, even though feature
independence is an integral part of the classifier, it often
outperforms many other learners~\citep{NB-performance,domingos97optimality}.

\begin{equation}
  Pr(H|E) = \frac{Pr(E|H)\,Pr(H)}{Pr(E)}
\label{eq:bayesTheorem}
\end{equation}
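Applying Equation~\ref{eq:bayesTheorem} per class gives a count-based classifier; a minimal Python sketch follows, where the shared denominator $Pr(E)$ is dropped because it does not affect the argmax, and the Laplace smoothing is an illustrative choice rather than the study's exact configuration:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Collect the counts behind Pr(H) and each Pr(E_i|H)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)      # (attribute, class) -> value counts
    for row, y in zip(rows, labels):
        for a, v in row.items():
            value_counts[(a, y)][v] += 1
    return class_counts, value_counts

def classify_nb(row, class_counts, value_counts):
    """Return the class maximizing Pr(H) * prod_i Pr(E_i|H); the shared
    denominator Pr(E) is dropped. Uses simple Laplace smoothing so unseen
    attribute values do not zero out the whole product."""
    n = sum(class_counts.values())
    best, best_p = None, -1.0
    for y, cy in class_counts.items():
        p = cy / n                                          # prior Pr(H)
        for a, v in row.items():
            counts = value_counts[(a, y)]
            p *= (counts[v] + 1) / (cy + len(counts) + 1)   # Pr(E_i|H)
        if p > best_p:
            best, best_p = y, p
    return best
```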


{\em Bayesian networks}, illustrated in Figure~\ref{fig:bayesnetwork}, are graphical models that use a directed acyclic graph (DAG) to represent probabilistic relationships between variables. 
As stated in~\cite{Heckerman96atutorial}, Bayesian networks have four important elements to offer:
\begin{enumerate}
\item{Incomplete data sets can be handled well by Bayesian networks. Because the networks encode correlations between input variables, an unobserved input will not necessarily produce inaccurate predictions, as it would with other methods.} 
\item{Causal relationships can be learned about via Bayesian networks. For instance, we can find whether a certain action taken would produce a specific result and to what degree.}
\item{Bayesian networks promote the amalgamation of data and domain knowledge by allowing for a straightforward encoding of causal prior knowledge, as well as the ability to encode causal relationship strength.}
\item{Bayesian networks avoid overfitting the data, as ``smoothing'' can be used in a way such that all available data can be used for training.}
\end{enumerate}


\subsection{Cross-Validation}

The value of different attributes can be assessed using equations one to four.
If we use multiple
{\em hold out} test sets, we can also discover the variance in these performance figures.
In this experiment, we performed a 5 $\times$ 5
cross-validation; i.e., five times over, we partitioned the data into a
testing set consisting of $\frac{1}{5}$-th of the data and a training
set of $\frac{4}{5}$-ths of the data, rotating the test set across all
five bins. Over these rounds, we
recorded the median values of recall and false alarm rates.
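The procedure can be sketched as follows; the `score_fn` callback (train on one split, return one performance figure) is a hypothetical interface introduced for illustration:

```python
import random
import statistics

def five_by_five_cv(rows, labels, score_fn, seed=0):
    """5 x 5 cross-validation sketch: five random reshuffles, each cut into
    five folds; every fold serves once as the 1/5 test set while the
    remaining 4/5 train. score_fn(train_X, train_y, test_X, test_y)
    returns one performance figure; the median over all rounds is
    reported, and the spread of `scores` gives the variance."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    scores = []
    for _ in range(5):                       # five repetitions
        rng.shuffle(idx)
        folds = [idx[i::5] for i in range(5)]
        for f in range(5):                   # five folds each
            test = set(folds[f])
            tr = [i for i in idx if i not in test]
            te = list(test)
            scores.append(score_fn([rows[i] for i in tr], [labels[i] for i in tr],
                                   [rows[i] for i in te], [labels[i] for i in te]))
    return statistics.median(scores), scores
```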

\subsection{Contrast Set Learning}

After determining the subset of the attributes that best predict for student retention, we conducted a 
{\em contrast set study}. Contrast set learners like TAR3~\citep{me07} seek attribute ranges that are 
most {\em different} in various outcomes. One way to read these contrast sets is as {\em treatments}
that promise: if action $A$ were applied to a domain, then outcome $X$ would be favored over outcome $Y$.
In our case, we  used TAR3 in two ways:
\begin{itemize}
\item Firstly, we used TAR3 to find which treatments most select for retention;
\item Secondly, we ran TAR3 in the opposite direction to find the treatments that most select for students
      leaving university.
\end{itemize}
In the first case, TAR3 is being used to find actions that most encourage retention. In the second case,
TAR3 is being used to find the worst possible actions that most increase the probability of a student leaving.
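The core scoring idea can be sketched as a simple lift calculation; TAR3 proper weighs whole class distributions and searches combinations of attribute ranges, so this is only a simplified illustration, and the attribute names in the test data are hypothetical:

```python
def lift(rows, labels, treatment, target):
    """Simplified contrast-set scoring: the frequency of `target` among the
    examples matching `treatment` (a dict of attribute -> required value),
    divided by its baseline frequency over all examples. Lift > 1 means
    the treatment selects for the target; passing the opposite target
    finds treatments that select for attrition instead."""
    matched = [y for r, y in zip(rows, labels)
               if all(r.get(a) == v for a, v in treatment.items())]
    if not matched:
        return 0.0
    baseline = labels.count(target) / len(labels)
    return (matched.count(target) / len(matched)) / baseline
```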

\section{Analysis of Experimental Results}

\subsection{Evaluation Metrics}
The evaluation metrics used in this experiment are standard  data
mining performance measures of a method. They are:
\begin{itemize}
\item
Probability of
detection (PD);
\item
Probability of false alarm (PF);
\item 
The variance of PD and PF seen over our cross-validation study.
\end{itemize}
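Both measures come straight from the confusion matrix, as this minimal sketch shows (the class labels are illustrative):

```python
def pd_pf(actual, predicted, positive="retained"):
    """PD (probability of detection, i.e. recall): the fraction of true
    positives that are detected. PF (probability of false alarm): the
    fraction of true negatives wrongly flagged as positive."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp / (tp + fn), fp / (fp + tn)
```

A good predictor pushes PD toward 1 while keeping PF near 0; a classifier that always predicts the positive class scores PD = 1 but also PF = 1.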

Variance in these
values provides insight into how reliable a classifier is
on the data. For example, if a method's PD values range
from very low to very high, we can conclude that the particular
method is inconsistent in its probabilities of detection. 
For our studies, we rejected anything with a variance greater than $\pm 25\%$.

The above statistics were collected over 1800 experiments, which were repeated 20 times (to check for conclusion
stability). In all, we conducted 
\[
5\times 5\times 4\times 6\times 3\times 20=36{,}000\]
experiments; i.e. 5 $\times$ 5 cross-validation using four feature subset selectors
and six different learners, for the three data sets of Section~3
(recall from \S3 that those three data sets contained data about first, second,  and
third year retention). This was repeated 20 times using the top $n\in\{5,10,15,\ldots,100,103\}$
attributes as found by the feature selector.

\subsection{First Results}

After rejecting all results with (1)~a PD lower than the ZeroR limit; (2)~a PD  variance greater than $\pm 25\%$;
and (3)~a PF higher than 25\%,
we found that we had no predictors for Year1 or Year2 retention. This is the first major finding of this
research: {\em it is very difficult to predict lower-year retention}. Note that this result is consistent with
prior results discussed above in our literature review.

For the rest of this study, we will focus only on third year retention. The case for focusing on third year
retention is quite clear:
\begin{itemize}
\item
 If the goal is to provide a complete university education for a student, then surviving
until second year is less interesting than lasting until third year.
\item
Third year retention implies second and first year retention.
\end{itemize}
%\begin{figure}
%\centering
%\includegraphics{ret1_pd.pdf}
%\includegraphics{ret1_pf.pdf}
%\caption{Probability of Detection (PD) and Probability of False Alarm (PF) with variances for first year retention.}
%\label{fig:ret1graphs}
%\end{figure}
%
%\begin{figure}
%\centering
%\includegraphics{ret2_pd.pdf}
%\includegraphics{ret2_pf.pdf}
%\caption{Probability of Detection (PD) and Probability of False Alarm (PF) with variances for second year retention.}
%\label{fig:ret2graphs}
%\end{figure}
%
%\begin{figure}
%\centering
%\includegraphics{ret3_pd.pdf}
%\includegraphics{ret3_pf.pdf}
%\caption{Probability of Detection (PD) and Probability of False Alarm (PF) with variances for third year retention.}
%\label{fig:ret3graphs}
%\end{figure}
%
%\subsection{Visualizing the Results}  
%     Figures~\ref{fig:ret1graphs},~\ref{fig:ret2graphs}, and~\ref{fig:ret3graphs} show the PD and PF median results for first, second and third year retention against the variance of these values. Each point represents a specific combination of the number of attributes selected, the feature subset selector used to select them, and the classifier used to train on the resulting data. For example, one point on a graph could be seen as 50/Information Gain/Naive Bayes, where 50 denotes the number of attributes used. The color of each point shows the number of attributes used for that particular combination representing that point. 
%
%    The horizontal line segmenting the PD graphs is a baseline reference to the existing retention rates in the data. Thus, to predict for retention in a given year, it is desirable to yield results higher than the baseline. As can be seen in the figures, the median probability of detection of retention values for the first year do not meet the baseline, and therefore we can assume using our methods, we cannot accurately  predict first-year retention. Prediction of second year retention had better results than first year retention, but these results did not improve the baseline significantly. For example, most of the points lie at or below the baseline. For this reason, we did not consider second-year retention in further analysis. However, third year PD values successfully exceeded the baseline, and required more thorough examination.  




%/\subsection{Narrowing the Search}
%
%    Using the visualizations described above, we  narrowed our space
%    of possible combinations to examine for third year retention.
%    The graphs for PD and PF medians show that the range of number
%    of attributes that maximizes PD and minimizes PF values while
%    maintaining minimal variance is approximately 20 to 60. We
%    performed further analysis on the reduced datasets with 20 to
%    60 attributes.
%
\subsection{Ranking with the Mann-Whitney Test}

After pruning results with low PD, high PF, or high PD variance, we ranked
the remaining results via a Mann-Whitney test (95\% confidence).
We determined the ranks by counting how many
times a combination won compared to the other combinations.
The method that won the most times was then given
the highest rank. The table in Figure~\ref{fig:ranktable}
shows the top-ranking combinations based on the PD
performance measure. Note:
we gave identical ranks to those
treatments whose win values were equal.
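A win here means a statistically significant difference under Mann-Whitney; a minimal sketch of one pairwise comparison follows, using the normal approximation without a tie correction (the z-threshold of 1.96 corresponds to 95\% confidence, two-sided). Counting how many pairs return $+1$ gives each treatment's win count:

```python
import math

def rank_sum(a, b):
    """Midrank sum of sample `a` within the pooled data (ties share ranks)."""
    pooled = sorted(a + b)
    positions = {}
    for i, v in enumerate(pooled):
        positions.setdefault(v, []).append(i + 1)     # 1-based ranks
    midrank = {v: sum(p) / len(p) for v, p in positions.items()}
    return sum(midrank[x] for x in a)

def mw_win(a, b, z_crit=1.96):
    """+1 if sample `a` significantly beats `b` at ~95% confidence
    (two-sided normal approximation of the Mann-Whitney U test, no tie
    correction), -1 if it significantly loses, 0 otherwise."""
    n1, n2 = len(a), len(b)
    u = rank_sum(a, b) - n1 * (n1 + 1) / 2            # U statistic for a
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 1 if z > z_crit else (-1 if z < -z_crit else 0)
```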

\begin{table}
  \begin{center}
    %\rowcolors{1}{gray!10}{}
    \begin{tabular}{| c | c | c | c |}
      \hline
      Wins & Number of Attributes & FSS & Classifier \\
      \hline
      61&         30&oneR&bnet \\ 
      61&50&cfs&adtree \\ 
      57&       50&oneR&adtree \\ 
      56&       30&oneR&adtree \\ 
      55&        30&cfs&adtree \\  
      52&         50&oneR&bnet \\ 
      51&   30&infogain&adtree \\ 
      51&          30&cfs&bnet \\ 
      48&   50&infogain&adtree \\ 
      \hline
    \end{tabular}
  \end{center}
  \caption{The top-ranking treatments for third year retention. The win count of a treatment is the number of times it wins over other treatments in the experiment.}
  \label{fig:ranktable}
\end{table}

Since similar results were achieved using 30 or 50 attributes, we applied Occam's Razor and focused on the
30
attributes found to be best for oneR/bnet. For these 30 attributes, we studied all their ranges.
Figure~\ref{fig:ranktable} shows the ranges which, in isolation, select for retention at a probability
greater than the ZeroR limit (for third year, that ZeroR limit is 55\%).
In terms of assessing the different hypotheses, the third column of Figure~\ref{fig:ranktable} is most informative:
\begin{itemize}
\item The ranges shown at the top of the table are most predictive for third year retention. Note the 
dominance of ``Financial Aid'' attributes from Figure~\ref{fig:rankingsr}.
\item Attributes related to student ``Performance'' are rarer.
\item
None of the attribute ranges include the ``Faculty Type and Experience'' attributes of 
Figure~\ref{fig:rankingsr}. 
\end{itemize}
From this analysis, we made two tentative conclusions:
\begin{itemize}
\item
Using experienced faculty-level instructors
is {\em not} predictive for third year retention. 
\item
Issues relating to financial aid dominate over student performance.
\end{itemize}
\subsection{Ranking with Contrast Set Learning}
The counter-case to this conclusion might be that Figure~\ref{fig:ranktable} only discusses the effect
of attribute ranges in {\em isolation}. It is possible that a combination of factors might lead to different
conclusions.
The TAR3 treatment learner was used to test this possibility. We let TAR3 build rules of up to size 10
(i.e. ten combinations of attribute ranges) from the 30 attributes selected by the best
learning combination of Figure~\ref{fig:ranktable}.
It turned out that this max size of 10 ranges was much larger than necessary:
TAR3 never found combinations larger
than three ranges.


%\begin{table}
%\begin{center}
%\begin{tabular}{crp{1.5in}p{1.5in}rr}
%\textbf{Class Value} &  \textbf{Baseline} & \textbf{Probability of Detection (PD)} & \textbf{Probability of False Alarm (PF)} &  \textbf{Precision} &   \textbf{Accuracy} \\ \hline
%         Y &    54.78\% &  70.0\% &       34.9\% &       70.8\% &       67.8\% \\
%         N &       45.22\% & 65.1\% &         30.0\% &       64.2\% &       67.8\% \\
%\end{tabular}  
%\caption{Performance Measures obtained for RET3 using OneR as the FSS and Bayes Net as the classifier.}
%\label{tabRET3PerfMes}
%\end{center}
%\end{table}
%
\input{bestrangessorted}
%\subsection{Selected FSS and Classifier}
%
%    Figure~\ref{fig:ranktable} shows the top-most ranking combination of FSS and classifier is obtained by either using 30  or 50 attributes. Since, the two numbers of attributes (along with their own FSS and classifier) resulted in the same Mann-Whitney rank, we concluded that the results obtained using One-R/Bayes Netork and CFS/ADTree are not statistically different. As we selected top 30 attributes critical to third-year persistence, we concentrated on approximately 1/3 of the original data.     Table~\ref{tabRET3PerfMes} lists the performance measures obtained for RET3 using OneR and Bayes Net.
%
%

