
\subsection{Feature Subset Selection}

A repeated result in the data mining community is that simpler
models with  equivalent or higher performance can be built via
{\em feature subset selection}, algorithms that intelligently
prune useless features \cite{hall03}. Features may be pruned for
several reasons: 
\begin{itemize}
\item They may be noisy, i.e.\ contain spurious signals unrelated
  to the target class; 
\item They may be uninformative, e.g.\ contain mostly one value,
  or no repeating values;
\item They may be correlated with other variables, in which case
  they can be pruned since their signal is already present in
  other variables.
\end{itemize}
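The three pruning criteria above can be sketched as follows. This is an illustrative sketch only; the thresholds (e.g.\ the 0.95 correlation cutoff) are our own assumptions, not part of any published selector.

```python
import statistics

def prune_features(columns):
    """columns: dict mapping feature name -> list of numeric values.
    Returns the names of the features worth keeping."""

    def correlation(xs, ys):
        # Pearson correlation; 0.0 if either column is constant.
        sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
        if sx == 0 or sy == 0:
            return 0.0
        mx, my = statistics.mean(xs), statistics.mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
        return cov / (sx * sy)

    kept = []
    for name, values in columns.items():
        distinct = len(set(values))
        # Uninformative: mostly one value, or no repeating values.
        if distinct == 1 or distinct == len(values):
            continue
        # Correlated: signal already present in a kept feature.
        if any(abs(correlation(values, columns[k])) > 0.95 for k in kept):
            continue
        kept.append(name)
    return kept
```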

The reduced feature set has many advantages:
\begin{itemize}
\item Miller has shown that models containing fewer variables
  generally have less variance in their outputs \cite{miller02}.
\item The smaller the model, the fewer the demands on
  interfaces (sensors and actuators) to the external environment.
  Hence, systems designed around small models are easier to use
  (less to do) and cheaper to build.
\item In terms of this article, the most important advantage of
  learning from a reduced feature set is that it produces
  smaller models. Such smaller models are easier to explain (or audit).
\end{itemize}

The literature lists many feature subset selectors.  
In the Wrapper method, a {\em target learner} is augmented with a
pre-processor that uses a heuristic search to grow subsets of the
available features. At each step in the growth, the target learner
is called to find the accuracy of the model learned from the
current subset. Subset growth stops when the addition of new
features does not improve the accuracy.
Kohavi and John \cite{kohavi97} report experiments with Wrapper
where 83\% (on average) of the measures in a domain could be
ignored with only a minimal loss of accuracy.

The advantage of the Wrapper approach is that, if some target
learner is already implemented, then the Wrapper is simple to
implement. The disadvantage is that each
step in the heuristic search requires another call to the target
learner. Since there are many steps in such a search ($N$
features have $2^N$ subsets), Wrappers may be too slow.
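The greedy subset growth described above can be sketched as follows. This is a minimal sketch: the target learner is abstracted as any function mapping a feature subset to an accuracy estimate, whereas a real Wrapper would cross-validate the learner on each candidate subset.

```python
def wrapper_select(features, accuracy_of):
    """Greedy forward selection around a target learner.
    accuracy_of: function mapping a list of features to an accuracy score."""
    selected, best = [], accuracy_of([])
    improved = True
    while improved:
        improved = False
        # Try growing the current subset by each unused feature.
        for f in features:
            if f in selected:
                continue
            score = accuracy_of(selected + [f])
            if score > best:
                selected, best, improved = selected + [f], score, True
                break  # restart the scan from the grown subset
    # Growth stops once no single addition improves accuracy.
    return selected, best
```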

Another feature subset selector is Relief, an instance-based
learning scheme \cite{Kir92,Kon94}
that works by randomly sampling one instance from the data. It
then locates the nearest neighbors of that instance, from both
the same class and the opposite class. The feature values
of those nearest neighbors are compared to those of
the sampled instance, and a score for each feature is
updated accordingly. This process is repeated for some
user-specified number of instances $M$. Relief can handle noisy
data and other data anomalies by averaging the values of the $K$
nearest neighbors of the same and opposite class for each
instance \cite{Kon94}. For data sets with multiple classes, the
nearest neighbors from each class different from that of the
current sampled instance are selected, and their contributions are
weighted by the prior probabilities of those classes in the data set.
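The sampling-and-update loop above can be sketched for the basic two-class case. This is an illustrative simplification (squared Euclidean distance over numeric feature vectors, single nearest hit and miss, i.e.\ $K=1$), not the full Relief-F algorithm.

```python
import random

def relief(instances, labels, m, rng=None):
    """Basic two-class Relief sketch.
    instances: list of numeric feature vectors; labels: parallel class list;
    m: number of randomly sampled instances. Returns one weight per feature."""
    rng = rng or random.Random(0)
    n_features = len(instances[0])
    weights = [0.0] * n_features

    def nearest(i, same_class):
        # Index of the nearest neighbor of instance i, restricted to the
        # same class (hit) or a different class (miss).
        candidates = [j for j in range(len(instances))
                      if j != i and (labels[j] == labels[i]) == same_class]
        return min(candidates, key=lambda j: sum(
            (a - b) ** 2 for a, b in zip(instances[i], instances[j])))

    for _ in range(m):
        i = rng.randrange(len(instances))
        hit, miss = nearest(i, True), nearest(i, False)
        for f in range(n_features):
            # Reward features that separate the classes; penalize features
            # that vary within a class.
            weights[f] += abs(instances[i][f] - instances[miss][f])
            weights[f] -= abs(instances[i][f] - instances[hit][f])
    return weights
```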

The experiments of Hall and Holmes \cite{hall03} reject numerous
feature subset selection methods.  Wrapper is their preferred
option, but only for small data sets.  For larger data sets, the
stochastic nature of Relief makes it a natural choice.

Another reason to prefer Relief is that it can take advantage
of Nighthawk's stochastic search.  Currently, Relief is a
batch process that is executed after data generation is
completed.  However, it may not be necessary to wait till the end
of data generation to gain insights into which features are most
relevant.  Given the stochastic nature of the algorithm, we
foresee future work in which Relief and a genetic algorithm work in
tandem.  Consider the random selection process at the core of
Relief: a genetic algorithm exploring mutations of the current
set of parents is an excellent stochastic source of data.  In
the future, we plan to apply Relief incrementally and in
parallel to our GAs.

\subsection{Data Collection and Analysis}

We instrumented Nighthawk so that every time a chromosome was
evaluated, it printed the current value of every gene and the
final fitness function score.  (For the two BitSet gene types,
we printed only the cardinality of the set.)
For each of the 16 Collection and Map classes from {\tt java.util},
we ran Nighthawk for 40 generations.  Each class therefore
yielded 800 observations, each consisting of a gene value vector
and the chromosome score.

Relief assumes a discrete target class, but Nighthawk's performance
scores are continuous.  To enable feature subset selection, we
first discretized Nighthawk's output. A repeated pattern
across all our experiments is that the Nighthawk scores fall
into three regions:
\begin{itemize}
\item The 65\% majority of the scores are within 30\% of the top
  score for any experiment. We call this the {\em high plateau}.
\item A 10\%  minority of scores are less than 10\% of the maximum
  score.  We call this region {\em the hole}.
\item After the high plateau there is a  {\em slope} of
  increasing gradient that falls into {\em the hole}. This {\em
  slope} accounts for the remaining 25\% of the results.
\end{itemize}
Accordingly, to select our features, we sorted the results and
divided them into three classes: bottom 10\%, next 25\%, remaining
65\%. Relief then found the subset of features that distinguished
these three regions.  
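The discretization above can be sketched as a simple rank-based labelling. This is a minimal sketch; tie-breaking among equal scores is an implementation detail the text does not fix.

```python
def discretize(scores):
    """Label each score by its position in the sorted order:
    bottom 10% -> hole, next 25% -> slope, remaining 65% -> high plateau."""
    n = len(scores)
    ranked = sorted(range(n), key=lambda i: scores[i])
    labels = [None] * n
    for rank, i in enumerate(ranked):
        if rank < 0.10 * n:
            labels[i] = "hole"
        elif rank < 0.35 * n:
            labels[i] = "slope"
        else:
            labels[i] = "high plateau"
    return labels
```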

Using the above discretization policy, we ran Relief 10 times in a 10-way
cross-validation study.
The data set was divided
into $10$ buckets. Each bucket was (temporarily) removed and Relief was
run on the remaining data. This produced a list of ``merit'' figures for
each feature (this ``merit'' value is an internal heuristic measure
generated by Relief, reflecting how often neighboring instances with
different region labels differ on that feature). We took the maximum
merit ($Max$) and generated a {\it Selected} list. A feature was
{\it Selected} if its merit $m$ satisfied $m > \alpha\, Max$
(e.g.\ at $\alpha=0.5$ we selected all
features that scored at least half the maximum merit).
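The selection rule above can be sketched as follows. One detail the text does not fix is how the ten per-fold merit figures are combined before thresholding; this sketch assumes they are averaged per feature, which is our own choice.

```python
def select_features(merits_per_fold, alpha):
    """merits_per_fold: list of dicts (one per cross-validation fold),
    each mapping feature name -> merit. Keeps every feature whose
    (fold-averaged) merit exceeds alpha times the maximum merit."""
    features = merits_per_fold[0].keys()
    avg = {f: sum(fold[f] for fold in merits_per_fold) / len(merits_per_fold)
           for f in features}
    max_merit = max(avg.values())
    return sorted(f for f, m in avg.items() if m > alpha * max_merit)
```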

\begin{figure}

\begin{center}
\begin{tabular}{|c|c|}
$\alpha$ & $| {\it Selected} | $\\\hline
0.9 & 39 \\
\hline
0.8 & 62 \\
\hline
0.7 & 112 \\
\hline
0.6 & 217 \\
\hline
0.5 & 439\\
\hline
\end{tabular} \vspace{2mm} \\
\end{center}

\caption{
  Numbers of selected features for values of $\alpha$.
}
\label{selected-features-fig}
\end{figure}

Figure \ref{selected-features-fig} shows the number of features
that were {\it Selected} in
our 16 examples, using different values for $\alpha$. Note that as
$\alpha$ increases, fewer and fewer features are selected.
That is, this method allowed us to discover the genes most
important in distinguishing
the {\em high plateau} from the {\em slope} and the {\em hole}.

\begin{figure}

\begin{center}
\begin{tabular}{|l|l|l|l|l|}
Gene type & $\alpha=0.9$ & $\alpha=0.5$ & Best avg ranks & Best merit \\
\hline
{\tt candidateBitSet}         & 18 & 106 & 1, 1, 1          & 0.373 \\
\hline
{\tt chanceOfNull}            &  0 &   3 & 24.1, 24.3, 25.4 & 0.166 \\
\hline
{\tt chanceOfTrue}            &  2 &   7 & 1, 3.9, 5.7      & 0.150 \\
\hline
{\tt lowerBound}              &  1 &   9 & 4.1, 5.6, 7.2    & 0.129 \\
\hline
{\tt methodWeight}            &  0 &  11 & 7.5, 7.8, 10.6   & 0.144 \\
\hline
{\tt numberOfCalls}           &  0 &   1 & 7.2, 8.4, 16.7   & 0.195 \\
\hline
{\tt numberOfValuePools}      &  0 &   6 & 14.1, 17.9, 20   & 0.186 \\
\hline
{\tt numberOfValues}          &  0 &  28 & 5.2, 5.2, 7.4    & 0.160 \\
\hline
{\tt upperBound}              &  8 &  16 & 1, 1, 1          & 0.267 \\
\hline
{\tt valuePoolActivityBitSet} & 10 & 252 & 1, 1.4, 2        & 0.267 \\
\hline
\end{tabular}
\end{center}

\caption{
  Merit analysis of Nighthawk gene types.
}
\label{genes-merit-fig}
\end{figure}

Since our goal was to identify gene types that were not useful and
that could therefore possibly be eliminated from chromosomes, we
tabulated the properties of each of the different types of genes.
Figure \ref{genes-merit-fig} shows the result.  The three most
useful types of genes for the software under test were clearly
{\tt candidateBitSet},
{\tt valuePoolActivityBitSet}, and
{\tt upperBound},
since genes of these types sometimes had merit 90\% or more of
the maximum merit, often were ranked first or second in merit,
and were the only gene types such that some genes of that type
had merit greater than $0.2$ on some runs.  This result confirms
our intuition that changing the value reuse policy encoded in
the genes of type 
{\tt candidateBitSet} and
{\tt valuePoolActivityBitSet}
is an important way of improving the performance of a Nighthawk
chromosome.
The good performance of {\tt upperBound} is attributable to the
fact that, for all the hash table-related units, adjusting this
parameter allowed a ``load factor'' parameter in some of the
constructors to take on values that led to the table
being reallocated frequently.

Just as clearly, the two least useful gene types for the
software under test were
{\tt chanceOfNull} and
{\tt numberOfValuePools}, since neither was ranked better
than 14th in merit for any run.
This does not indicate that null values and multiple value pools
for each type were not important, only that changing the default
values of these genes (3\% chance of choosing null, and two
value pools) did not usually result in improved performance.

\subsection{Optimization and Performance Comparison}

In order to see whether our analysis was useful, we optimized
the Nighthawk code.
We decided to retain the four gene types with
the highest maximum merit
({\tt candidateBitSet},
{\tt valuePoolActivityBitSet},
{\tt upperBound}, and {\tt numberOfCalls}),
and also two other gene types:
{\tt lowerBound}, because it was paired with 
{\tt upperBound}; and
{\tt numberOfValues}, because it had the highest maximum
merit of the remaining well-ranked gene types.

Accordingly, we adjusted Nighthawk's code to delete all
references to the other four gene types
({\tt chanceOfNull},
{\tt numberOfValuePools},
{\tt chanceOfTrue}, and
{\tt methodWeight}).  We changed the code that depended on the
values stored for such genes in the current chromosome, so that
it instead used the default initial values for those gene types.
 
On each of the 16 Collection and Map classes,
we performed one run of the original Nighthawk,
and one run of the new, optimized version.
For each version of Nighthawk, we then
performed the same analysis that is reflected in Figures
\ref{collection-map-cov} and
\ref{collection-times};
that is, we asked RunChromosome to run 10 test cases with the
winning chromosome and measured the coverage, and we calculated
the clock time taken by Nighthawk to achieve the highest
coverage it achieved.  We then compared the original and
optimized Nighthawk.

The chromosomes resulting from the optimized Nighthawk achieved
slightly higher coverage than those from the original, though a
$t$ test concluded that the difference was not statistically
significant ($p=0.058$).  However, the optimized Nighthawk was
faster to a statistically significant degree ($p=0.010$).  The
time ratio between the original and optimized systems
was 1.46, i.e.\ the
optimized system was 46\% faster.  We can therefore say that
the process of analyzing the usefulness of the genes resulted
in a system that ran substantially more quickly but achieved the
same high coverage of the SUT.
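The statistical comparison above is a paired test over per-class measurements. As a hypothetical illustration (the actual timing data is not reproduced here), the paired $t$ statistic can be computed as follows; a $p$-value would then be read from the $t$ distribution with the returned degrees of freedom.

```python
import math
import statistics

def paired_t(xs, ys):
    """Paired t statistic for two matched samples (e.g. per-class runtimes
    of the original and optimized systems).
    Returns (t statistic, degrees of freedom)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    return mean / (sd / math.sqrt(n)), n - 1
```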

\subsection{Discussion}

The feature subset selection and optimization procedure worked
well for us when we collected data from a set of classes and
then optimized Nighthawk for those classes.  This does not mean
that the optimized Nighthawk would necessarily work
well for other classes.  For instance, for some classes there
might be a great deal of code that is accessible only by giving
null pointers as parameters; we could expect our newly optimized
version of Nighthawk to perform poorly on such classes, since
there would be a fixed 3\% chance of choosing null.

However, the results of the optimization exercise indicate that
FSS-based analysis is a valuable adjunct to the GA in this
application.  It is for this reason that we envision in the
future integrating FSS into a genetic-random testing algorithm,
so that the algorithm can improve its performance dynamically,
while GA mutation and recombination is going on.

