\section{Optimizing GAs}

The design of genetic algorithms is a
``black art'' \cite{ga-blackart}. Very little is known
about why GAs work when they do work and why they do not work
when they do not.  For example, consider the ten gene types that we mutate
in
Figure \ref{gene-types-fig}. This list was created after a consideration of our
domain, but how can we tell if it is the {\em best} list? Does most of the power of Nighthawk come
from just a small subset of those gene types? If we do not ask a GA to mutate the
chromosome genes associated with all those types, what is the effect on the runtimes
of Nighthawk?

To answer these questions, we turn to another data mining technique:
feature subset
selection (FSS).  As shown below,
using FSS, Nighthawk can run an order of magnitude
faster
while maintaining nearly the same level of coverage.

Given the success of FSS, we would strongly recommend that
this kind of optimization be applied to any other search-based 
SE tools.  
It would be relatively easy to do so:
the FSS optimization applied to Nighthawk
used no special knowledge of Nighthawk.
Rather, it treated Nighthawk as a black box and tried to find subsets of Nighthawk's decisions which,
if removed, did not alter the output score. 
\subsection{Motivation}

The search space of a GA is the product of all
possible values for all parts of a chromosome.
The run time cost to find the best possible chromosome
is therefore proportional to
this value times the evaluation cost of each chromosome:
\begin{equation}\label{eq:cost}
\mathit{cost} = R^L \times \mathit{eval}
\end{equation}
That is, a chromosome of twenty binary choices
takes $2^{10}>1,000$ times longer to explore than a chromosome of
length ten, for the same quality of result.
Nighthawk's chromosomes store information related to the gene
types of Figure \ref{gene-types-fig}.
If we could discard some of those gene types, then \eq{cost}
suggests that this
would lead to an exponential improvement in Nighthawk's runtimes.
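As a concrete illustration of \eq{cost}, the following sketch (the function name is ours, not Nighthawk code) computes the cost of exhaustively exploring a chromosome space:

```python
# Illustration of the cost equation: exploring a chromosome of length L,
# where each gene can take R values, costs R^L evaluations.
def search_cost(R, L, eval_cost=1.0):
    """Cost of exploring the full chromosome space: R^L * eval."""
    return (R ** L) * eval_cost

# Twenty binary genes versus ten binary genes:
ratio = search_cost(2, 20) / search_cost(2, 10)
print(ratio)  # 1024.0, i.e. 2^10 > 1,000 times more expensive
```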

Feature subset selection (FSS) removes needless information.
A repeated result in the data mining community is that simpler
models with  equivalent or higher performance can be built via FSS~\cite{hall03}. Features may be pruned for
several reasons: 
\bi
\item They may be {\em noisy} (contain spurious signals unrelated
  to the target);
\item They may be {\em uninformative} (contain mostly one value,
  or no repeating values);
\item They may be {\em correlated to other variables}; i.e.
  they can be pruned since their signal is also present in other
  variables.
\ei

Apart from reduced runtimes,
using fewer features has other advantages.
Miller has shown that models containing fewer
  variables generally have less variance in their outputs \cite{miller02}.
Also, the smaller the model, the fewer are the demands on
  interfaces  (sensors and actuators) to the external environment.
  Hence, systems designed around  small models are easier to use
  (less to do) and cheaper to build.

\subsection{Selecting an FSS Method}

The literature lists many feature subset selectors.  
In the WRAPPER method, a {\em target learner} is augmented with a
pre-processor that uses a heuristic search to grow subsets of the
available features. At each step in the growth, the target learner
is called  to find the accuracy of the model learned from the
current subset. Subset growth is stopped when the addition of new
features does not improve the accuracy. 
Kohavi and John \cite{kohavi97} report experiments with WRAPPER
where 83\% (on average) of the measures in a domain could be
ignored with only a minimal loss of accuracy.

The advantage of the WRAPPER approach is that, if some target
learner is already implemented, then the WRAPPER is simple to
implement. The disadvantage of the wrapper method is that each
step in the heuristic search requires another call to the target
learner. Since there are many steps in such a search ($N$
features have $2^N$ subsets), WRAPPERs may be too slow.
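The greedy growth step of WRAPPER can be sketched as follows. Here {\tt accuracy} stands in for a call to the target learner, and the toy learner at the bottom is invented purely for illustration (only features 0 and 2 carry signal):

```python
# Sketch of WRAPPER-style forward selection: grow the feature subset
# greedily, calling the target learner at each step, and stop when no
# added feature improves accuracy.
def wrapper_select(features, accuracy):
    subset, best = [], accuracy([])
    improved = True
    while improved:
        improved = False
        for f in features:
            if f in subset:
                continue
            score = accuracy(subset + [f])
            if score > best:            # greedy growth step
                best, subset = score, subset + [f]
                improved = True
    return subset

# Toy target learner (invented): only features 0 and 2 are informative.
toy_accuracy = lambda s: 0.5 + 0.2 * (0 in s) + 0.2 * (2 in s)
print(wrapper_select(range(5), toy_accuracy))  # → [0, 2]
```

Note that the inner loop calls the target learner once per candidate feature at every step, which is exactly why WRAPPER becomes slow on large feature sets.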

Another feature subset selector is RELIEF.  
This is an instance based learning scheme \cite{Kir92,Kon94}
that works by randomly sampling one instance within the data. It
then locates the nearest neighbors for that instance from not
only the same class but the opposite class as well. The feature
values of these nearest neighbors are then compared to those of
the sampled instance, and each feature's score is updated
accordingly. This process is repeated for a
user-specified number $M$ of instances. RELIEF can handle noisy
data and other data anomalies by averaging the values for $K$
nearest neighbors of the same and opposite class for each
instance \cite{Kon94}. For data sets with multiple classes, the
nearest neighbors for each class that is different from that of the
current sampled instance are selected and the contributions are 
determined by using the class probabilities of the class in the dataset.
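The core RELIEF weight update can be sketched as follows. This is a minimal two-class variant with $K=1$ neighbors; it assumes numeric features pre-scaled to $[0,1]$, at least two instances per class, and uses invented example data:

```python
import random

# Minimal sketch of RELIEF: sample an instance, find its nearest
# neighbor of the same class (hit) and of the other class (miss),
# and reward features that differ on the miss but agree on the hit.
def relief(data, M, rng=random.Random(0)):
    n = len(data[0][0])                    # number of features
    w = [0.0] * n
    def dist(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    for _ in range(M):
        x, label = rng.choice(data)
        hits = [v for v, l in data if l == label and v is not x]
        misses = [v for v, l in data if l != label]
        hit = min(hits, key=lambda v: dist(x, v))
        miss = min(misses, key=lambda v: dist(x, v))
        for f in range(n):
            w[f] += (abs(x[f] - miss[f]) - abs(x[f] - hit[f])) / M
    return w

# Invented data: feature 0 separates the classes; feature 1 is noise.
data = [([0.0, 0.3], 'a'), ([0.1, 0.9], 'a'),
        ([0.9, 0.2], 'b'), ([1.0, 0.8], 'b')]
weights = relief(data, M=20)
print(weights[0] > weights[1])  # → True: the informative feature wins
```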

The experiments of Hall and Holmes \cite{hall03} reject numerous
feature subset selection methods.  WRAPPER is their preferred
option, but only for small data sets.  For larger data sets, the
stochastic nature of RELIEF makes it a natural choice.

%Another reason to prefer RELIEF is that it can take advantage
%of Nighthawk's stochastic search.  Currently, RELIEF is a
%batch process that is executed after data generation is
%completed.  However, it may not be necessary to wait till the end
%of data generation to gain insights into which features are most
%relevant.  Given the stochastic nature of the algorithm, we can
%see feature work where RELIEF and a genetic algorithm work in
%tandem.  Consider the random selection process at the core of
%RELIEF: a genetic algorithm exploring mutations of the current
%set of parents is an excellent stochastic source of data.  In
%the future, we plan to apply RELIEF incrementally and in
%parallel to our GAs.
%

\subsection{Initial FSS Analysis of Nighthawk}

We changed our code
so that each chromosome 
evaluation
printed the current value of every gene and the
final fitness function score.  (For the two BitSet gene types,
we printed only the cardinality of the set.)
For each of the 16 Collection and Map classes from {\tt java.util},
we ran Nighthawk for 40 generations.  Each class therefore
yielded 800 observations, consisting of a gene value vector
and the chromosome score.

RELIEF assumes discrete data but Nighthawk's performance
scores are continuous.  We therefore
discretized Nighthawk's output:
\bi
\item The 65\% majority of the scores are within 30\% of the top
  score for any experiment. We call this the {\em high plateau}.
\item A 10\%  minority of scores are less than 10\% of the maximum
  score.  We call this region {\em the hole}.
\item  The remaining data
{\em slopes} from the plateau into the hole.
\ei
To select features, we divided the
results 
into three classes: bottom 10\%, next 25\%, remaining
65\%. RELIEF then sought features that
distinguished
the three regions.  
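The three-way split can be sketched as a simple quantile discretization (this is our reading of the policy above; the exact thresholding used may differ):

```python
# Sketch of the three-region discretization of chromosome scores:
# bottom 10% -> "hole", next 25% -> "slope", remaining 65% -> "plateau".
def discretize(scores):
    ranked = sorted(scores)
    hole_cut  = ranked[int(0.10 * len(ranked))]   # bottom 10%
    slope_cut = ranked[int(0.35 * len(ranked))]   # next 25%
    def region(s):
        if s < hole_cut:
            return 'hole'
        if s < slope_cut:
            return 'slope'
        return 'plateau'                           # remaining 65%
    return [region(s) for s in scores]

labels = discretize(list(range(100)))              # 100 evenly spread scores
print(labels.count('hole'), labels.count('slope'),
      labels.count('plateau'))                     # → 10 25 65
```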

Using the above discretization policy, we ran RELIEF 10 times in a 10-way
cross-validation study.
The data set was divided
into $10$ buckets. Each bucket was (temporarily) removed and RELIEF was
run on the remaining data. This produced a ``merit'' figure for
each feature (this ``merit'' value is an internal heuristic measure
generated by RELIEF, reflecting how well a feature separates
neighboring instances that fall in different regions).
%We therefore got
%a merit score for each of the ten runs for every gene corresponding
%to every subject unit.  Each run therefore yielded a ranked list
%$R$ of all genes, where gene 1 had the highest merit for this
%run, gene 2 had the second highest merit, and so on.
We then calculated summary statistics in order to come
up with rankings of the various gene types (recall that each
gene type $t$ corresponds to zero or more genes, depending on the
unit under test).  Best results were obtained using {\em average merit}; i.e.
\begin{itemize}
\item $avgMerit(g,u)$ is the average RELIEF merit score, across all
  10 runs of the cross-validation study, of gene $g$ derived
  from unit $u$.
%\item $avgRank(g,u)$ is the average rank in $R$, across all
%  10 runs of the cross-validation study, of gene $g$ derived
%  from unit $u$.
%\item $bestMerit(t)$ is the maximum, over all genes $g$ of
%  type $t$ and all subject units $u$, of $avgMerit(g,u)$.
%\item $bestRank(t)$ is the maximum, over all genes $g$ of
%  type $t$ and all subject units $u$, of $avgRank(g,u)$.
\end{itemize}
For example, \fig{genes-merit-fig} shows the ten gene types from
Figure \ref{gene-types-fig}, ranked by their $avgMerit$ scores as
defined above.  This ranking places {\tt numberOfCalls} at
the top, meaning that it considers genes of that type to be the
most valuable; it also places {\tt candidateBitSet} at the bottom,
meaning that it considers genes of that type to be the most
expendable.
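The $avgMerit$ summary can be sketched as follows, using an invented toy merit table in place of the real ten-run cross-validation output:

```python
# Sketch of the avgMerit summary: average each gene's RELIEF merit
# over the cross-validation runs, then rank genes by that average.
# `merits_per_run` is a hypothetical runs x genes table.
def avg_merit_ranking(merits_per_run, gene_names):
    runs = len(merits_per_run)
    avg = [sum(run[g] for run in merits_per_run) / runs
           for g in range(len(gene_names))]
    # Sort gene names by descending average merit.
    return sorted(zip(gene_names, avg), key=lambda p: -p[1])

merits_per_run = [[85, 34, 64], [87, 30, 62]]   # 2 toy runs, 3 genes
names = ['numberOfCalls', 'candidateBitSet', 'upperBound']
print(avg_merit_ranking(merits_per_run, names))
# → [('numberOfCalls', 86.0), ('upperBound', 63.0), ('candidateBitSet', 32.0)]
```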

% We took the maximum merit ($Max$)
%and generated a {\it Selected} list. A feature was {\it Selected} if its merit
%$m$ was $m> \alpha Max$ (e.g. at $\alpha=0.5$ then we selected all
%features that score at least half the maximum merit).
%
%We 
%
%The features in a Nighthawk chromosome come from the different
%gene types of  \fig{gene-types-fig}.
%We ranked and sorted each gene type using the maximum RELIEF
%merit score seen for any of its features.
%This resulted in the sort order of  \fig{genes-merit-fig}.

\begin{figure}[tp]
{
\footnotesize
\begin{center}
\begin{tabular}{r|rl}
Rank &Gene type $t$  & $avgMerit$ \\
\hline
     1&	 {\tt numberOfCalls}&  85\\
     2&	{\tt valuePoolActivityBitSet} & 83\\
     3&	{\tt upperBound} & 64\\
     4&	{\tt chanceOfTrue}&  50\\
     5&	{\tt methodWeight}&  50\\
     6&	{\tt numberOfValuePools}&  49\\
     7&	{\tt lowerBound }& 44\\
     8&	{\tt chanceOfNull}&  40\\
     9&	{\tt numberOfValues }& 40\\
    10&	{\tt candidateBitSet}&  34\\
    \end{tabular}
\end{center}
}
\caption{
Nighthawk gene types, sorted by the average RELIEF merit ($avgMerit$) of their genes.
}
\label{fig:genes-merit-fig}
\end{figure}

\subsection{An FSS Experiment}

After ranking the gene types as described above, we
proceeded to study whether eliminating gene types on the basis
of the RELIEF results sped up Nighthawk without sacrificing
the quality of the results.

First, we changed the source code so we could
replace the parts of Nighthawk's code controlled by
each gene type by code that assumed a constant value for each
gene of that type (i.e. we could disable   all genes of
a given type).

Next, we ran Nighthawk using the top ranked
$1 \le i \le 10$ gene types.
For example,
if we say ``number of gene types = 2'' for the
$avgMerit$ ranking, then we are
using only {\tt numberOfCalls} and {\tt valuePoolActivityBitSet}, and
all other gene types are disabled.
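The disabling mechanism can be sketched as follows (type names follow Figure \ref{gene-types-fig}; the default values shown here are invented, not Nighthawk's actual constants):

```python
# Sketch of "disabling" gene types: genes of a disabled type are pinned
# to a fixed default value instead of being mutated by the GA.
RANKING = ['numberOfCalls', 'valuePoolActivityBitSet', 'upperBound']
DEFAULTS = {'numberOfCalls': 100,            # invented defaults
            'valuePoolActivityBitSet': frozenset(),
            'upperBound': 1000}

def gene_value(gene_type, evolved_value, i):
    """Use the GA's evolved value only for the top-i ranked gene types."""
    if gene_type in RANKING[:i]:
        return evolved_value
    return DEFAULTS[gene_type]

print(gene_value('numberOfCalls', 42, i=2))  # active: evolved value 42
print(gene_value('upperBound', 42, i=2))     # disabled: default 1000
```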

%
%\begin{figure}[!t]\begin{center}\includegraphics[width=2.5in]{Properties_rankedByMerit.pdf}
%\end{center}\caption{Properties}\label{fig:properties}
%\end{figure}

\begin{figure}[tp]
\begin{center}\includegraphics[width=2.5in]{Hashtable_rankedByMerit.pdf}
\end{center}\caption{
  Nighthawk on Hashtable unit, eliminating gene types according
  to $avgMerit$ ranking.
}
\label{fig:hashtable}
\end{figure}
%
%\begin{figure}[tp]
%\begin{center}\includegraphics[width=2.5in]{Hashtable_rankedByAvgRank.pdf}
%\end{center}\caption{
%  Nighthawk on Hashtable unit, eliminating gene types according
%  to $bestRank$ ranking.
%}
%\label{fig:hashtable-bestRank}
%\end{figure}
%
%\begin{figure}\begin{center}\includegraphics[width=2.5in]{EnumMap_rankedByMerit.pdf}
%\end{center}\caption{EnumMap}\label{fig:enummap}
%\end{figure}
%

\subsection{Coverage Results}
In the following, each run's result is compared to
the runtime and coverage seen using all ten gene types and
running for $g=50$ generations.  \fig{hashtable} shows how the
coverage changed over one of the case studies of Figure \ref{collection-map-cov},
using the $avgMerit$ ranking; the results for this subject unit
are typical.  The y-axis of that figure is defined such that
the point (50,1) represents the coverage reached by using all 10
gene types after 50 generations.  The thick black curve on those
figures shows the performance of Nighthawk using all ten gene
types. Other curves show results from using $1 \le i \le 9$
gene types.   

A measure of interest in \fig{hashtable} is the area
under the curve, which is maximal when Nighthawk converges
to maximum coverage in a few generations.  Due to
the random nature of the GA and the randomized test data
generation, some of the curves are sometimes higher than the
$i=10$ line.


\begin{figure}[tp]
\scriptsize
\begin{center}
\begin{tabular}{r@{~}r@{~}r@{~}r@{~}r@{~}r@{~}r@{~}r@{~}r@{~}r|l}
\multicolumn{10}{c}{Number of used gene types}\\\cline{1-10}
1    &  2   &   3 &   4   &  5   & 6    & 7    & 8    &  9  & 10    & class being tested\\\hline
1.00&1.00&1.01&0.99&1.00&1.00&1.00&1.00&1.00&1.00&ArrayList\\
1.13&1.00&1.03&0.96&0.65&0.68&0.67&0.88&0.71&1.00&EnumMap\\
0.98&0.99&0.98&0.96&1.01&1.00&1.00&1.00&0.98&1.00&HashMap\\
0.99&1.00&0.99&1.00&1.00&1.00&0.99&1.00&1.00&1.00&HashSet\\
0.99&1.00&1.00&1.00&0.99&1.00&0.98&1.01&1.01&1.00&Hashtable\\
0.97&0.97&0.98&0.97&0.98&0.97&1.00&0.98&0.98&1.00&IdentityHashMap\\
0.98&1.00&1.00&1.00&0.99&1.00&1.00&0.99&0.99&1.00&LinkedHashMap\\
0.99&1.01&1.01&1.00&1.00&1.00&1.00&1.01&1.01&1.00&LinkedHashSet\\
1.01&1.01&1.01&1.02&1.00&1.01&1.01&1.02&1.00&1.00&LinkedList\\
1.01&0.99&0.97&1.01&1.03&0.99&0.95&0.98&1.00&1.00&PriorityQueue\\
1.00&1.00&1.00&1.00&1.00&1.00&1.00&1.00&0.99&1.00&Properties\\
1.00&1.00&1.00&1.00&1.00&1.00&1.00&1.00&1.00&1.00&Stack\\
0.90&0.97&0.95&0.93&0.99&0.97&0.97&1.02&1.01&1.00&TreeMap\\
0.90&0.95&0.98&0.93&0.97&0.98&0.98&1.00&0.98&1.00&TreeSet\\
0.95&0.97&0.99&0.96&0.99&0.92&0.99&0.97&1.01&1.00&Vector\\
0.96&0.97&0.97&0.98&0.98&0.97&0.98&0.99&0.97&1.00&WeakHashMap\\\hline
0.98&0.99&0.99&0.98&0.97&0.97&0.97&0.99&0.98&1.00&mean
\end{tabular}
\end{center}
\caption{Coverage found using the top $i$ ranked gene types, for $1 \le i \le 10$.
Coverages are expressed as a ratio of the coverage found using all ten gene types.
}\label{fig:coverage}
\end{figure}

If we calculate the ratio of the area under
a curve (AUC) to the area under the thick
black curve, then we can summarize all the curves of
\fig{hashtable} as the first row of \fig{coverage}. In that
figure, each column shows how many gene types were used in  a
particular run (and when $used=10$, we are ignoring the FSS
results and using all gene types). Also, the number 1.00 informs
us that we achieved 100\% of the coverage reached by using all
ten gene types.
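The AUC-ratio summary can be sketched as follows, with invented curve values; a ratio slightly above 1.00 is possible, reflecting the randomness noted above:

```python
# Sketch of the AUC summary statistic: area under a coverage-vs-
# generation curve (trapezoid rule), normalized by the area under the
# baseline curve obtained with all ten gene types.
def auc(curve):
    # Coverage sampled once per generation, equally spaced.
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))

def auc_ratio(curve, baseline):
    return auc(curve) / auc(baseline)

baseline = [0.2, 0.8, 1.0, 1.0]   # all ten gene types (toy numbers)
reduced  = [0.3, 0.9, 1.0, 1.0]   # a reduced gene-type set (toy numbers)
print(round(auc_ratio(reduced, baseline), 2))  # → 1.06
```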

\fig{coverage} 
summarizes \fig{hashtable} as well as results
from all other cases studied in Figure \ref{collection-map-cov}a.  
The last row of 
\fig{coverage} shows that the mean AUC is nearly the same for
any number $i$ of top-ranked gene types.
From this observation,
we conclude that:
\bi
\item Nighthawk's GA gains most of its efficacy from mutations of
the top-ranked gene type {\tt numberOfCalls}.
\item
In terms of coverage, there is little benefit in forcing Nighthawk's GA
to mutate more than this first ranked gene type.
\ei

\subsection{ Runtime Results}
\begin{figure}[tp]
\begin{center}
\includegraphics[width=3in]{mar21timereport.pdf}
\end{center}
\caption{Runtime and coverage using only the top-ranked gene type, relative to using all ten gene types.}\label{fig:timereport100}
\end{figure}
In order to determine whether eliminating gene types from
Nighthawk's GA is cost-effective, we must consider not only the
coverage achievable, but also the time taken to achieve that
coverage.  
We therefore made two runs of Nighthawk on all the
subject units, run
(a) using all the gene types and run (b) using just the top gene
type ranked by $avgMerit$ (i.e. {\tt numberOfCalls}).
We then divided the runtime and coverage results
from (b) by the (a) values seen after 50
generations, and plotted the results.

\fig{timereport100} shows the results, with time percentage on
the X axis and coverage percentage on the Y axis.
Note the point indicated by the arrow in \fig{timereport100}.
This point shows that it is usually
(in $\frac{15}{16}$ cases) possible to achieve 90\% of the
coverage in under 10\% of the time required to run all gene
types for 50 generations.

The two anomalous subject units in \fig{timereport100} are
{\tt EnumMap} and {\tt TreeMap}:
\bi
\item  {\tt EnumMap} looks like a
success story, since we were able to achieve 140\% of the
original coverage.  In fact, this is simply noise, since the results presented
earlier in this paper showed that we 
could achieve only 3\% coverage of {\tt EnumMap} using
Nighthawk and generic test wrappers \cite{andrews07}. The improvement
reported in \fig{timereport100}
therefore amounts to less than 5\% coverage.
\item
\fig{timereport100} shows that the run
with {\tt TreeMap} using one gene type took 1.48 times as long as using all ten gene types.
To understand this result,
note that \eq{cost} has two
components: number of options and the evaluation cost of each
option. In the case of {\tt TreeMap}, even though we reduced the
chromosome search space size, we made choices that greatly
increased the runtime cost.
\ei
\subsection{Discussion}
\subsubsection{Implications for Nighthawk}
We are currently exploring early stopping criteria within Nighthawk to take
advantage of the early plateaus in coverage seen in \fig{timereport100}.
The above results show that
the original Nighthawk was doing
much work that did not pay off in significantly better results.
Consider the nine gene types ranked the lowest in the $avgMerit$
ranking.  Maintaining all the genes associated with these gene
types meant that the original Nighthawk was spending time
mutating these genes and extracting information from the current
values of the genes (it also meant that the original Nighthawk
was using up memory storing the representations of these genes,
for each chromosome in each generation, with a concomitant
time expense).  When these less useful genes were eliminated,
no large loss of coverage was observed, but a great increase
in efficiency was observed.
Hence, we are also exploring methods to reduce Nighthawk's memory requirements
and other performance criteria.

\subsubsection{Implications for Other Systems}


At the {\it metaheuristic level}, this work suggests that it may
be useful to integrate FSS directly into metaheuristic
algorithms.  Such an integration would enable the automatic
reporting of the merits of individual features, and the
automatic or semi-automatic selection of features.  If the
results of this paper extend to other domains, this would lead
to metaheuristic algorithms that improve themselves
automatically each time they are run.  We also speculate that
this is a promising direction to look at for a further order
of magnitude improvement in performance.

Also, on the {\it theory-formation level}, this work opens
up the possibility of rapid turnover of the theoretical
foundations underlying present tools, as aspects of heuristic
and meta-heuristic approaches are shown to be consistently
valuable or extendible.  In the framework of a self-monitoring
metaheuristic algorithm, new strategies, heuristics, and
gene and mutator types can be introduced as theories evolve, and
can be evaluated efficiently by the algorithms themselves.

As an example of this latter point, the
{\tt chanceOfNull} gene type in Nighthawk determines how likely
(in percent) Nighthawk's random test data generation is to
choose a {\tt null} as a parameter in a given parameter
position.  That gene type was ranked as expendable by the
ranking explored here.  This suggests that it is not worthwhile
to search for different values for this percentage, and that the
default value (3\%) is sufficient.  Given sufficient
corroboration from other case studies and systems, this in turn
suggests that the value of generating test data with nulls is a
relatively settled problem, and that we can turn toward other
aspects of test data generation for future improvements.
