\section{Data}
Data used in this study came from a mid-size public university and
were extracted from the student information system on the official
census dates. These data comprised all first-year freshmen's
demographic, academic, and financial aid information (more than 100
attributes), as of the census reporting dates (two weeks after the
start of the semester). Because higher education administrators are
best placed to design effective policies when students begin their
studies, our emphasis was on detecting patterns using only first-term
data, and specifically only the data available at the beginning of
that term. We created three dependent variables: RET1, whether the
student returned after one year; RET2, whether the student returned
after two years; and RET3, whether the student returned after three
years. The overall distribution of these dependent variables is
given in Table \ref{tabDepedentVarsDistr}. For the studied time
period, the overall first-year retention rate was 71.3\%, the
second-year persistence rate was 60.4\%, and the third-year
persistence rate was 54.8\%.

\begin{table}[ht]
\begin{center}
\begin{tabular}{c|rcrcrc}
 & \multicolumn{2}{c}{\textbf{RET1}} & \multicolumn{2}{c}{\textbf{RET2}} & \multicolumn{2}{c}{\textbf{RET3}} \\ \cline{2-7}
 & Count & Percentage & Count & Percentage & Count & Percentage \\ \hline
retained=\textbf{Y} & 24,039 & 71.3\% & 18,055 & 60.4\% & 14,362 & 54.8\% \\
retained=\textbf{N} & 9,673 & 28.7\% & 11,857 & 39.6\% & 11,854 & 45.2\% \\ \hline
Total & 33,712 & 100\% & 29,912 & 100\% & 26,216 & 100\%
\end{tabular}
\caption{Distribution of Dependent Variables}
\label{tabDepedentVarsDistr}
\end{center}
\end{table}
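
As a consistency check on Table \ref{tabDepedentVarsDistr}, each reported percentage is the retained count divided by the corresponding total, rounded to one decimal place:
\[
\mathrm{RET1}: \frac{24{,}039}{33{,}712} \approx 0.713, \qquad
\mathrm{RET2}: \frac{18{,}055}{29{,}912} \approx 0.604, \qquad
\mathrm{RET3}: \frac{14{,}362}{26{,}216} \approx 0.548 .
\]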

In the Integrated Postsecondary Education Data System (IPEDS), we considered U.S. degree-granting, doctoral-degree-offering institutions of four or more years (excluding the University of Phoenix-Online Campus) with a cohort size greater than 3,000. Among these institutions, the full-time freshmen retention rate ranged from 59\% to 96\%, and the cohort size ranged from 3,117 to 8,025. In this list, Kent State University ranked 38th in full-time retention percentage and 26th in cohort size \citep{ipeds}. Thus, Kent State data are representative of other universities of similar size, and the data mining approach could be generalized to such universities.

\subsection{Attribute Groups}
The data mining methods discussed below used {\em attribute selection} to
prune measurements that are poor predictors for the target class.
Therefore, our data miners can be used to assess various hypotheses relating to student retention:
\begin{itemize}
\item If a hypothesis claims that attributes $X,Y,Z$ are important...
\item ... and if our learners prune those attributes ...
\item ... then that is evidence against that hypothesis.
\end{itemize}
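One way to state this test more precisely, using notation introduced here purely for illustration, is as follows. Let $A_k$ denote the set of attributes grouped under hypothesis $H_k$, and let $S$ denote the set of attributes that survive attribute selection. The pruning outcome then counts as evidence against $H_k$ when
\[
A_k \cap S = \emptyset ,
\]
that is, when none of the attributes associated with $H_k$ is retained. A partial overlap, where only some attributes of $A_k$ survive, is a weaker and more ambiguous signal.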
Accordingly, before applying our data miners, we take care to divide our attributes into groups according to the hypothesis that they support:
\begin{itemize}
\item[H1:] The {\em financial aid} hypothesis. \citet{San95} found that no amount of financial aid influenced students to enroll for more terms, whereas \citet{Her2005} found that upper-income students had reduced dropout odds compared to those from middle and lower incomes. According to \citet{Joh2000}, ``the research literature remains ambiguous'' regarding the influence of financial aid on recruitment and retention.

\item[H2:] The {\em academic performance} hypothesis. Although there is no doubt that high school GPA and high school preparedness have a significant impact on persistence, researchers have often questioned the effects of standardized college entrance examinations (ACT/SAT). \citet{Wau94} found that SAT and ACT scores had no relationship with retention, whereas \citet{Mur99} found that SAT scores had some predictive value, although less than high school GPA. \citet{Des2002} noted that a high GPA lowered the risk of dropout,
but that the effect diminished over time; they also noted that financial aid was an insignificant factor for increasing graduation, although it did reduce
student stopout. In their comprehensive literature review, \citet{Lot2004} found that, among the academic factors, high school GPA had the strongest relationship with college retention, whereas ACT assessment scores had a moderate impact.

\item[H3:] The {\em faculty tenure and experience} hypothesis. \citet{ehrenberg2005tenured} found that for every 10 percentage point increase in the percentage of part-time faculty, or of full-time faculty not on the tenure track, there was a 3--5 percentage point reduction in the institution's graduation rate. \citet{jacoby2006effects} found similar results at community colleges: an increase in the ratio of part-time faculty had a negative impact on graduation rates.

\end{itemize}


In the sequel, we will return to these hypotheses to comment on which were most useful for predicting student retention. Table \ref{tabAttrswHypothesis} lists attributes that we grouped together under each hypothesis.
\input{attrsbyHypothesis}
