\section{Literature Review}

Higher education institutions face a persistent problem of student
retention, which in turn affects graduation rates: colleges with
higher freshman retention rates tend to have higher four-year
graduation rates. The average national retention rate is close to
55\%, in some colleges fewer than 20\% of an incoming cohort graduate
\citep{Dru94}, and approximately 50\% of students entering an
engineering program leave before graduation \citep{Sca2000}.
\citet{Tin82} reported national dropout rates and BA degree completion
rates to be constant at 45 and 52 percent, respectively, over the
preceding 100 years, except for the World War II period (see Figure
\ref{figTintoCompletionRates} for the completion rates from 1880
to 1980). \citet{Till} at Valdosta State University (VSU) projected
lost revenue of \$326,811 per 10 students who do not persist beyond
their first semester. Although the gap between private and public
institutions in the percentage of first-year students returning
for the second year is closing, retention rates have remained
essentially constant for both types of institutions \citep[see Figure
\ref{figACTRetRate}]{ACT2007}. The National Center for Public Policy
and Higher Education (NCPPHE) reported the U.S. average retention
rate for the year 2002 to be 73.6\% \citep{NCP2007}. This problem
is not limited to U.S. institutions; it also affects institutions
in other countries such as the U.K. and Belgium. The U.K. national
average freshman retention rate for the year 1996 was 75\%
\citep{Lau2003}, and \citet{Van07} found that 60\% of first-generation
first-year students in Belgium fail or drop out.

\begin{figure}[htbp]
\begin{center}
\includegraphics[scale=1.0]{TintoCompletionRates}
\caption[BA Degree Completion Rates, 1880--1980]{BA Degree Completion Rates for the period 1880 to 1980, where percent completion is the number of BAs divided by the number of first-time degree enrollments four years earlier \citep{Tin82}}
\label{figTintoCompletionRates}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[scale=0.7]{FirstYearRetentionRates18Years}
\end{center}
\caption[Percentage of First-Year Students Who Return for Second Year]{Percentage of First-Year Students at Four-Year Colleges Who Return for Second Year \citep{ACT2007}}
\label{figACTRetRate}
\end{figure}

Various researchers have studied this problem extensively, using
theoretical models \citep{Tin75,Tin88,Spa70,Spa71,Bea80},  traditional
models \citep{Ter80,Pas79,Pas80}, and data mining techniques
\citep{Dru94,San95,Mas99,Ste2001,Vei2004,Bar2004,Sal2004,Sup2006,Suj2006,Her2006,Atw2006,YuD2007,Del2007}.
As shown below, we can improve those prior results by augmenting
standard data mining with
{\em discretization}, {\em attribute selection}, and {\em cross-validation}
over various algorithms. 
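
To make this concrete, the following minimal sketch (ours, not code
from any cited study) shows one way such an augmented pipeline might
look, assuming the scikit-learn library; the dataset, bin counts,
and attribute budget below are illustrative placeholders only:

{\footnotesize\begin{verbatim}
# A hedged sketch: discretize numeric attributes, select a subset of
# informative attributes, then cross-validate two candidate learners.
# The data below is random placeholder data, not real retention data.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.random((200, 10))            # 200 students, 10 attributes
y = rng.integers(0, 2, 200)          # 1 = retained, 0 = not retained

for name, learner in [("naive Bayes", MultinomialNB()),
                      ("decision tree", DecisionTreeClassifier())]:
    pipe = Pipeline([
        ("discretize", KBinsDiscretizer(n_bins=5, encode="ordinal",
                                        strategy="uniform")),
        ("select", SelectKBest(mutual_info_classif, k=5)),
        ("learn", learner),
    ])
    scores = cross_val_score(pipe, X, y, cv=10)  # 10-fold cross-validation
    print(name, scores.mean())
\end{verbatim}}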


As documented in~\citet{Ada2005}, the literature on retention in higher education is extensive. Although various researchers have tested theoretical models and noted attributes critical to student retention, these theories need to be re-tested from time to time. A new generation of data mining tools makes such testing easier, and may confirm, refine, or reject prior theories using state-of-the-art learning algorithms.
In this section, we focus on the literature
relating to data mining and the student retention problem.
The lesson of this section is that learning patterns of student retention is very difficult and, despite decades
of effort, there is much room for improvement in the current state of the art.

\subsection{Data Mining for Student Retention}

\citet{Dru94} were among the first researchers to apply a knowledge
discovery algorithm to the student retention problem. The authors
applied TETRAD II, a causal discovery program developed at Carnegie
Mellon University, to the U.S. News college ranking data to find
the factors that influenced student retention; they found that the
main factor was the average test score. Using linear regression,
the authors found that test scores alone explained 50.5\% of the
variance in the freshman retention rate. In addition, they concluded
that other factors such as student-faculty ratio, faculty salary,
and the university's educational expense per student were not
causally (directly) related to student retention, and suggested
that to increase student retention, universities should increase
student selectivity.

\citet{San95} used 49er, a pattern discovery process developed by
\citet{Zyt1993}, to find patterns in the form of regularities in
student databases related to retention and graduation. The authors
found that academic performance in high school was the best predictor
of persistence and of better performance in college, and that high
school GPA was a better predictor than the ACT composite score. In
addition, they found that no amount of financial aid influenced
students to enroll for more terms.

\citet{Mas99} applied Markov chain modelling to create predictive
models for the student dropout problem. Tracking students for 15
years, the authors created state variables for the number of exams
taken, the average marks obtained, and the continuation decision.
Using data mining, \citet{Ste2001} studied the effects of student
characteristics on persistence and success in an academic program
at a community college. They found that a student's GPA, cumulative
hours attempted, and cumulative hours completed were the significant
predictors of persistence, and that young males were a high-risk
group.

\citet{Vei2004} used decision trees (CHAID) to study high school
dropouts. Using 25-fold cross-validation, the overall misclassification
rates (dropouts classified as non-dropouts) were 15.79\% and 10.36\%.
In this study, GPA was the most significant predictor of persistence.
\citet{Sal2004} used clustering algorithms and C4.5 to study graduate
student retention at the Industrial University of Santander, Colombia.
The authors found that high marks in the national pre-university
test predicted good academic performance, and that younger students
were more likely to perform well academically.

\citet{Bar2004} used neural networks and support vector machines (SVMs) to study graduation rates; the first-year advising center (University College at the University of Oklahoma) collected the data via a survey given to all incoming freshmen. It is worth noting that \citet{Bar2004} excluded all records with missing data from the study, approximately 31\% of the total data. The overall misclassification rate was approximately 33\% for various dataset combinations. The authors used principal component analysis to reduce the number of variables from 56 to 14; however, they reported that the results using the reduced datasets were ``much worse'' than those using the complete datasets.


\citet{Sup2006} applied discriminant analysis, neural networks, random forests, and decision trees to survey data at a Belgian university to classify new students into low-risk, medium-risk, and high-risk categories. The authors found that scholastic history and socio-family background were the most significant predictors of risk. The overall classification rates for decision trees, random forests, neural networks, and linear discriminant analysis were 40.63\%, 51.78\%, 51.88\%, and 57.35\%, respectively.

Using National Student Clearinghouse (NSC) data, \citet{Suj2006} differentiated between stopout, retained, and transfer students. The overall classification rates on the validation sets using logistic regression, neural networks, and C5.0 were 80.7\%, 84.4\%, and 82.1\%, respectively. \citet{Her2006} used the American College Test (ACT) student profile section data, NSC data, and institutional student information system data to compare the results of decision trees, neural networks, and logistic regression in predicting retention and degree-completion time. The author substituted mean ACT scores for missing scores. Decision trees created using C5.0 performed best, with correct classification rates of 85\% for freshman retention, 83\% for degree completion time (three years or less), and 93\% for degree completion time (six years or more) on the validation datasets.



\citet{Atw2006} used the University of Central Florida's student demographic and survey data to study the retention problem with the help of data mining. In this study, the university retained approximately 82\% of the freshmen, and 285 variables were used to create the data mining models. The authors used a nearest neighbor algorithm to impute missing values in more than 60\% of the observations. Using decision trees with the entropy split criterion, the authors obtained a precision of 88\% for the not-retained outcome on the test data; the actual retention rate for this test dataset was 82.61\%.


\citet{YuD2007} studied data from Arizona State University using decision trees, including variables such as demographics, pre-college academic performance indicators, current curriculum, and academic achievement. Some of the important predictor variables were accumulated earned hours, in-state residence, and on-campus living.

To study the retention problem by applying data mining to admissions data, \citet{Del2007} used various attribute evaluation methods, such as chi-square, gain ratio, and information gain, to rank the attributes. In addition, the authors tested various classifiers, such as na\"ive Bayes, AdaBoost M1, BayesNet, decision trees, and rule learners; they noted that AdaBoost M1 with a Decision Stump base classifier performed best in terms of precision and recall, and hence used this classifier for further experimentation. The authors balanced the class variable (retained and not retained) and obtained over 60\% classification rates for both outcomes. They concluded that the number of programs the student applied to at that institution and the student's order of program admission preference were the most significant predictors of retention.


\citet{Pittman2008} compared various data mining techniques (artificial neural networks, logistic regression, Bayesian classifiers, and decision trees) applied to the student retention problem, and used attribute evaluators to generate rankings of important attributes. The author concluded that logistic regression performed best in terms of area under the ROC curve.



\begin{sidewaystable} 
\begin{center} 
\footnotesize
\begin{tabular}[t]{p{1.2in}p{1.2in}p{.4in}p{.5in}p{.5in}p{1.2in}p{.5in}p{1.6in}}
{\bf Author (Year)} & {\bf Notes} & {\bf Cohort Size} & {\bf Retained
(\#)} & {\bf Retained (\%)} & {\bf Measure of Accuracy} & {\bf
Coefficients Used?} & {\bf Techniques Used} \\ \hline

\citet{Spa71} &            &        683 &        615 &    90.04\%
& $R^2$ of .3132 for men and .3879 for women &        Yes & Multiple
regression \\ \hline

\citet{Bea80} &            &        906 &        769 &    84.88\%
& $R^2$ of .22 for women and 0.09 for men &        Yes & Multiple
regression \\\hline

\multirow{4}{*}{Terenzini (\citeyear{Ter80})} &     study 1 &
379 &     60 &    15.80\% & $R^2$ of .246 &        Yes &  discriminant
analysis \\

	   &     study 3 &        518 &        428 &    82.63\% &
	   $R^2$ of .256 &        Yes & Multiple regression \\

	   &     study 5 &        763 &        673 &    88.20\% &
	   $R^2$ of .309 &        Yes & discriminant analysis \\

	   &     study 6 &        763 &        673 &    88.20\% &
	   $R^2$ of .476 for men and .553 for women &        Yes &
	   discriminant analysis \\\hline

\citet{Sta89} &            &        323 &     294       &    91.00\%
&            &        Yes & Logistic regression \\\hline


\citet{Dey93} &            &        947 &      152      &    16.00\% &
Multiple R 0.354, 0.351, and 0.323 &        Yes & logit, probit,
and regression \\\hline


\citet{Mur99}   &            &       8667 &      5200      &       60\%
& estimated retention probability of 59.3\% &        Yes & Survival Analysis/Hazard
regression \\\hline

\citet{Bre2002} &            &       3535 &       3121     &    88.30\%
& $R^2$ of 0.022 &        Yes & Logistic regression \\\hline

\citet{glynn2003} & any dropout; not only first-year; accuracies
based on the training data &       3244 &       1592 &    49.08\%
& overall accuracy of 83\% &        Yes & Logistic regression \\\hline

\multirow{3}{*}{\citet{Her2005}} &            &       5261 & 4014
&    76.30\% & 77.4\% accuracy &        Yes & Logistic regression
\\

	   &            &       4298 &         3314   &    77.10\% &
	   &        Yes &            \\

	   &            &       4671 &       4040     &    83.50\% &
	   85.4\% accuracy &        Yes &            \\\hline

\multirow{3}{*}{\citet{Suj2006}} &            &      2,444 & 1943
&    79.50\% & 81.6\% accuracy on training; 80.7\% on validation &
Yes & Logistic regression \\

	   &            &      2,445 &       1994     &    79.50\% &
	   83.9\% accuracy on training; 82.1\% on validation &
	   & Neural Network \\

	   &            &      2,445 &       1994     &    79.50\% &
	   85.5\% on training; 84.4\% on validation &            &
	   C4.5 \\\hline

\citet{Her2006} &            &      8,018 &      6037      &    75.29\%
& accuracy close to 75\% &            & Neural Networks; CHAID,
C4.5, CR\&T; Logistic regression \\\hline

\multirow{2}{*}{\citet{Atw2006}} &  training  &      3,829 &
3149 &    82.24\% & precision for dropouts: 91\%, 84\%, 84\%, 78\% &
& \multirow{2}{1.2in}{decision trees (entropy, chi-square, gini) and
logistic regression} \\

	   &       test &      5,990 &      4,881 &    81.49\% &
	   precision for dropouts: 88\%, 82\%, 82\%, 73\% &            &  \\\hline

\citet{Del2007} &            &            &            &       50\%
& precision varied from 57\% to 60\% &            & AdaBoost M1
with Decision Stumps \\\hline

\citet{Pittman2008} &            &      21136 &      17139 &
81.10\% & overall accuracy of 78--81\%; not-retained precision from
44--63\% &            & Logistic regression, neural networks, Bayesian
classifiers, J48 \\\hline
\normalsize
\end{tabular} \caption{Techniques and Accuracies Reported in the
Literature} \label{tabTechniquesAcc} 
\end{center} 
\end{sidewaystable}




\subsection{Assessing the State of the Art}\label{sect:assess}
Table \ref{tabTechniquesAcc} lists the techniques used in those
studies from the literature for which cohort sizes were available,
along with the reported performance measures. Clearly, there is
much room for improvement in the current state of the art:
\begin{itemize}
\item
It is recommended data mining practice to divide the data into a
training set and a test set, learn on the training set, then assess
the learned theory on the test set~\citep{Wit2005}. Otherwise, if
a theory is tested on the data used to build that theory, the test
can over-estimate the theory's performance. For example, the Glynn
et al. result of Table~\ref{tabTechniquesAcc} seems impressive (an
83\% accuracy on a data set with a 49.08\% retention rate); however,
that result should be repeated using a {\em hold-out} test set.
\item 
All the regression studies from 1971 to 1999 report $R^2$ values
under 0.6. The $R^2$ value measures how well future outcomes are
likely to be predicted by the model; its maximum value is one, and
values under 0.6 indicate weak predictive ability.
\item
The reported accuracies are very close to the {\em ZeroR} theoretical
lower bound on performance. ZeroR is a baseline classifier that
simply returns the majority class (see the sketch after this list).
For example, Herzog studied a data set with an 83.5\% retention
rate (see Table \ref{tabTechniquesAcc}). ZeroR, applied to this
data set, would be correct in 83.5\% of cases. Therefore, the 85.4\%
accuracy of Herzog's data miners is very close to the ZeroR lower
bound; i.e., the sophisticated analysis of that paper could be very
nearly replicated using the simplest of learners (ZeroR).
\end{itemize}
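
To make the ZeroR baseline concrete, here is a minimal sketch (ours;
the 83.5\% cohort below is a hypothetical stand-in for Herzog's
data, not the actual data set):

{\footnotesize\begin{verbatim}
# ZeroR: always predict the majority class seen in the training labels.
from collections import Counter

def zero_r(train_labels):
    majority, _ = Counter(train_labels).most_common(1)[0]
    return lambda record: majority    # the prediction ignores the input

labels = ["retained"] * 835 + ["dropped"] * 165   # 83.5% retention rate
predict = zero_r(labels)
accuracy = sum(predict(None) == y for y in labels) / len(labels)
print(accuracy)   # 0.835 -- the baseline any learner must beat
\end{verbatim}}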

The last three results of Table \ref{tabTechniquesAcc} do not report
their accuracies. However, these can be calculated in the following
way. Let A, B, C, D be the true negatives, false negatives, false
positives, and true positives, respectively, of a predictor of
whether some student will attend some year of university.
From \citet{zhang07} and \citet{me07e}, 
we say that (A,B,C,D) can be used in the
following performance measures:

{\footnotesize \begin{eqnarray}
 	pd = recall &=& D / (B+D)\label{eqpd}\\
	pf = false\;alarm &=& C / (A+C) \label{eqpf}\\
	prec = precision &=& D / (C+D)  \label{eqprec}\\
	acc = accuracy &=& (A+D) / (A+B+C+D) \label{eqaccu}\\
	neg / pos &=&(A+C) / (B+D) \label{eqnegpos}
 \end{eqnarray}} 
Note that all these performance measures assess subtly different aspects of the performance of a data miner (a short code sketch after the following list makes the definitions concrete):
\begin{itemize}
\item ``Recall'' measures how much of the target was found. 
\item The ``false alarm'' rate measures  what fraction of non-targets triggered the learned theory. 
\item ``Precision'' comments on how many targets are found in the data selected by the theory.
\item ``Accuracy'' comments on how many of the targets and non-targets were accurately labeled by the learned theory.
\end{itemize}
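
The following minimal sketch (ours) restates these five measures
as code; the confusion counts in the example call are hypothetical:

{\footnotesize\begin{verbatim}
# The five performance measures above, computed from the confusion
# counts: A true negatives, B false negatives, C false positives,
# D true positives.
def measures(A, B, C, D):
    return {
        "pd (recall)":      D / (B + D),
        "pf (false alarm)": C / (A + C),
        "precision":        D / (C + D),
        "accuracy":         (A + D) / (A + B + C + D),
        "neg/pos":          (A + C) / (B + D),
    }

print(measures(A=800, B=50, C=70, D=80))   # hypothetical counts
\end{verbatim}}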
In an ideal result, we would obtain high recall, low false alarms,
high precision, and high accuracy. However, as discussed by
\citet{zhang07} and \citet{me07e}, these values are inter-related;
hence, the ideal result is not possible. These inter-relationships
are shown below:
\begin{equation}\footnotesize
\left(prec =  \frac{D}{D+C} = \frac{1}{1+\frac{C}{D}}=\frac{1}{1+\frac{neg}{pos}\cdot\frac{pf}{recall}}\right) \Rightarrow
\left(pf =  \frac{pos}{neg}\cdot\frac{(1-prec)}{prec}\cdot recall\right) \label{eqpfprec}
\end{equation}
If a publication misses a particular performance measure, it is possible to use
these equations to infer the missing value. For example:
{\footnotesize \begin{eqnarray}
D & = & recall \cdot pos\\
C & = & pf \cdot neg \\
A & = & C \cdot (1/pf - 1) \\
acc &=& (A+D) / (neg + pos)
\end{eqnarray}}
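For instance, the following sketch (ours) implements this
back-calculation; the recall values passed in are assumptions, as
in the discussion below:

{\footnotesize\begin{verbatim}
# Back-calculate pf and accuracy from precision, recall, and the class
# ratio, using the equations above (assumes pf > 0).
def infer_pf_acc(prec, recall, neg, pos):
    pf  = (pos / neg) * ((1 - prec) / prec) * recall
    D   = recall * pos              # true positives
    C   = pf * neg                  # false positives
    A   = C * (1 / pf - 1)          # true negatives, from pf = C/(A+C)
    acc = (A + D) / (neg + pos)
    return pf, acc

# An Atwell et al.-like case: precision 88%, assumed recall 65%, on a
# test set that is roughly 18% dropouts:
pf, acc = infer_pf_acc(prec=0.88, recall=0.65, neg=0.82, pos=0.18)
print(round(pf, 3), round(acc, 3))  # pf is about 0.02, i.e. 2%
\end{verbatim}}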
Using these equations, we can show that the last three results of
Table \ref{tabTechniquesAcc} merit closer scrutiny:
\begin{itemize}
\item
In \citet{Atw2006}, the precision varied from 73\% to 88\%. Using
our equations, we can estimate false alarm values ($pf$) ranging
from 2\% to 8\% (assuming recall values of 65\% to 90\%). In our
experience, it is very rare to achieve such low false alarm rates,
especially from noisy data relating to student retention. Hence,
the Atwell et al. results are somewhat surprising.
\item
In \citet{Del2007}, the precision varied
from 57\% to 60\%.  From our equations, we can estimate their false alarm rates  in the range of 49\%
to 63\% (assuming recall values of 65\% to 90\%).  Such high false alarm rates are deprecated. 
\item
In \citet{Pittman2008}, the reported precision varied from 44\% to 63\%. Our equations show that the upper part of this range is numerically unobtainable: for $0.78 \le acc \le 0.81$, $neg=17139$, and $pos=21136-neg$, the equations
only solve for $prec \le 50\%$. That is, half the precision values reported by Pittman need to be reviewed (see the sketch after this list).
\end{itemize}
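
As a check on the last point, the following sketch (ours; the recall
values are assumptions) computes the precision attainable at the
reported accuracies:

{\footnotesize\begin{verbatim}
# Cohort 21136 with 17139 retained, so for the not-retained target:
neg, pos = 17139, 21136 - 17139
N = neg + pos

# From acc*N = A + D, A = neg - C, C = D*(1-prec)/prec, D = recall*pos,
# it follows that (acc*N - neg)/D = 2 - 1/prec.
def attainable_precision(acc, recall):
    D = recall * pos
    k = (acc * N - neg) / D
    return 1.0 / (2.0 - k)

for acc in (0.78, 0.81):
    print([round(attainable_precision(acc, r), 3) for r in (0.1, 0.5, 0.9)])
# Every printed value is at most 0.5: precisions above 50% cannot be
# reconciled with the reported accuracies.
\end{verbatim}}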
In summary, learning predictors for student retention is very difficult. Despite decades of work,
there is considerable room for improvement in the methods used to find patterns in student retention.
As we show below, such improvements are possible if we augment standard data miners with some extra pre-processors.

