\section{Optimizing GAs}
\label{optimizing-section}

Despite Nighthawk's success at achieving high coverage,
the most pressing issue we faced was whether
Nighthawk was spending its time
usefully in its quest for good chromosomes.
The design of genetic algorithms is a
``black art'' \cite{ga-blackart}. Very little is known
about why GAs work when they do, and why they fail
when they do not.  Consider the ten gene types that we mutate,
as listed in
Figure \ref{gene-types-fig},
each of which controls some aspect of the randomized testing.
We chose to control these aspects based on our past experience
of applying randomized testing to different units,
but how can we tell if this is the {\em best} list? Does most of the power
of Nighthawk come
from being able to mutate just a small subset of those gene types? If we
do not ask a GA to mutate the
genes associated with all the types, what is the effect on the runtime
and coverage of Nighthawk?

To answer these questions, we turn to another data mining technique:
feature subset
selection (FSS).  As shown below,
using FSS, Nighthawk can run an order of magnitude
faster
while maintaining nearly the same level of coverage.
%
%Given the success of FSS, we would strongly recommend that
%this kind of optimization be applied to any other search-based 
%SE tools.  
%It would be relatively easy to do so:
%the FSS optimization applied to Nighthawk
%used no special knowledge of Nighthawk.
%Rather, it treated Nighthawk as a black-box and tied to find subsets of Nighthawk's decisions which,
%if removed, did not alter the output score. 
%
% Organization of section:
We first discuss the motivation of this work in more detail, and then
describe the FSS method we selected.  We then describe the initial FSS
analysis we did of the data collected from Nighthawk runs on the
{\tt java.util} units.  Finally, we report on how we used the information
from the analysis to progressively eliminate gene types from
consideration and measure the effect.

\subsection{Motivation}

The search space of a GA is the product of the sets of possible values for
all genes in a chromosome.  In the simplest case, where all genes have $R$
possible values and there are $L$ genes, the size of this search space is
$R^L$.
The runtime cost of finding the best possible chromosome
is therefore proportional to
this size times the evaluation cost of each chromosome:
\begin{equation}\label{eq:cost}
\mathit{cost} = R^L \cdot \mathit{eval}
\end{equation}
%For example, a chromosome of 20 binary-valued genes
%takes $2^{10}>1,000$ times longer to run than a chromosome
%10 such genes to achieve the same quality of result.
Nighthawk's chromosomes for
the {\tt java.util} classes range in size from 128
genes to 1273 genes (recall that the number of genes is
dependent on such things as the number of target methods and
the numbers of parameters of those methods), and each gene can
have a large number of values.
Nighthawk's chromosomes store information related to the gene
types of Figure \ref{gene-types-fig}.
For example, for the public methods of {\tt java.util.Vector},
Nighthawk uses 933 genes, 392 of which are
{\tt valuePoolActivityBitSet} genes, and 254 of which are
{\tt candidateBitSet} genes.
If we could discard some of those gene types, then \eq{cost}
suggests that this
would lead to a large improvement in Nighthawk's runtimes.
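The reason the improvement would be large is that \eq{cost} scales exponentially, not linearly, in the number of genes. A minimal Python sketch (the gene counts and values below are illustrative, not Nighthawk's actual numbers):

```python
# Illustration of the cost equation: cost = R^L * eval.
# The values below are illustrative only.

def search_space_size(R, L):
    """Size of the GA search space: R possible values per gene, L genes."""
    return R ** L

# Even for binary genes, halving a 20-gene chromosome divides the
# search space by 2**10 = 1024, so pruning gene types can pay off
# far more than the number of pruned genes alone suggests.
speedup = search_space_size(2, 20) // search_space_size(2, 10)
print(speedup)  # -> 1024
```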

We can get information about which genes are valuable by recording, for
each chromosome, the values of the genes and the resulting fitness score
of the chromosome.  This leads to a large volume of data, however: since
the population is of size 20, there are 20 vectors of data for each
generation for each unit being tested.  We can interpret our problem as a
data mining problem.  Essentially, what we need to know from this data is
what information within it is not needed to accurately predict the fitness
score of a chromosome.

Feature subset selection (FSS) is a data mining technique
that removes needless information.
A repeated result is that simpler
models with  equivalent or higher performance can be built via FSS~\cite{hall03}. Features may be pruned for
several reasons: 
\bi
\item They may be {\em noisy} (contain spurious signals unrelated
  to the target);
\item They may be {\em uninformative} (contain mostly one value,
  or no repeating values);
\item They may be {\em correlated with other variables}; i.e.\
  they can be pruned since their signal is also present in other
  variables.
\ei

Apart from reduced runtimes,
using fewer features has other advantages.
Miller has shown that models containing fewer
  variables generally have less variance in their outputs \cite{miller02}.
Also, the smaller the model, the fewer are the demands on
  interfaces to the external environment.
  Hence, systems designed around  small models are easier to use
  (less to do) and cheaper to build.

\subsection{Selecting an FSS Method}

%The literature lists many feature subset selectors.  
%In the WRAPPER method, a {\em target learner} is augmented with a
%pre-processor that uses a heuristic search to grow subsets of the
%available features. At each step in the growth, the target learner
%is called  to find the accuracy of the model learned from the
%current subset. Subset growth is stopped when the addition of new
%features does not improve the accuracy. 
%Kohavi and John \cite{kohavi97} report experiments with WRAPPER
%where 83\% (on average) of the measures in a domain could be
%ignored with only a minimal loss of accuracy.

%The advantage of the WRAPPER approach is that, if some target
%learner is already implemented, then the WRAPPER is simple to
%implement. The disadvantage of the wrapper method is that each
%step in the heuristic search requires another call to the target
%learner. Since there are many steps in such a search ($F$
%features have $2^F$ subsets), WRAPPERs may be too slow.

One well-studied and popular feature subset selector is RELIEF
\cite{Kir92,Kon94}.
RELIEF assumes
that the data is divided into {\em groups}\footnote{Technically, RELIEF assumes
that instances have been classified using some ``class'' attribute. However,
to avoid confusion with the concept of ``class'' discussed above, we will
describe RELIEF in terms of ``groups''.} and tries to find the features that
serve to distinguish instances in one group from instances in other groups.  

\newcommand{\FOR}{{\sffamily \underline{for}}~}
\newcommand{\TO}{{\sffamily \underline{to}}~}
\newcommand{\DO}{{\sffamily \underline{do}}~}
\newcommand{\DONE}{{\sffamily \underline{done}}~}
\newcommand{\setuptabs}{
\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{1in}\=\hspace*{.2in}\=\hspace*{.1in}
\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}
\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\=\hspace*{.2in}\kill
}
\begin{figure}
{\footnotesize
 \begin{tabbing}\setuptabs
\FOR  $f \leftarrow 1$ \TO $|features|$ \DO \\
\>$M_f =  0$ \>\> {\em // set all merits to 0}\\
\DONE\\
\FOR $i \leftarrow 1$ \TO $N$ \DO\\
\>   randomly select instance $R$ from group $G$\\
\>   find nearest hit $H$ \>\>{\em // closest thing in the same group}\\
\>   find nearest miss $M$ \>\>{\em // closest thing in a  different group}\\
\>   \FOR $f \leftarrow 1$ \TO $|features|$ \DO\\
\>\>       $M_f \leftarrow M_f - \frac{\Delta(f,R,H)}{N} + \frac{\Delta(f,R,M)}{N}$\\
\>	\DONE\\
\DONE
\end{tabbing}}
\caption{Binary RELIEF (two-group version), computing the merit $M_f$ of each
feature from $N$ sampled instances. }\label{fig:relief2}
\end{figure}

RELIEF is a stochastic
instance-based scheme
that works by randomly selecting $N$ reference instances
$R_1 \ldots R_N$; by default, $N=250$.
For data sets with two groups, RELIEF can be implemented
using the simple algorithm of \fig{relief2}. 
For each reference instance $R$, the algorithm finds two other instances: the ``hit'' is the nearest
instance
to $R$ in the same group, while
the ``miss'' is the nearest instance to $R$ in another group. RELIEF's core intuition is that
features that change value between groups are more meritorious than features that change value
within the same group.
Accordingly,
the merit of a feature (denoted $M_f$) is
{\em increased} for features whose value differs between $R$ and the ``miss'', and
{\em decreased} for features whose value differs between $R$ and the ``hit''.
The $\Delta$ function of \fig{relief2} detects differences between feature values.
If a feature is discrete, then the distance is one (if the
symbols are different) or zero (if the symbols are the same). If a feature is numeric,
then the distance is the difference in value, normalized to $0 \ldots 1$ using the
feature's minimum and maximum.
 If a feature has a
missing value, then a Bayesian statistic is used to generate an estimate for the expected
value (see ~\cite{Kir92} for details).
%
For a complex data set with $k>2$ groups, RELIEF samples
the $k$ nearest misses and hits from the same or different groups.
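The two-group algorithm of \fig{relief2} can be sketched in Python as follows. This is a minimal illustration for purely numeric features; the missing-value handling and the $k>2$ extension are omitted, and the function name and data layout are our own:

```python
import random

def relief_binary(instances, groups, n_samples=250, seed=0):
    """Binary RELIEF sketch: instances is a list of equal-length numeric
    feature vectors; groups[i] is the group label of instances[i].
    Returns one merit score M_f per feature."""
    rnd = random.Random(seed)
    n_feats = len(instances[0])
    # Pre-compute each feature's range so delta() is normalized to 0..1.
    lo = [min(x[f] for x in instances) for f in range(n_feats)]
    hi = [max(x[f] for x in instances) for f in range(n_feats)]
    span = [(h - l) or 1.0 for l, h in zip(lo, hi)]

    def delta(f, a, b):          # normalized difference for feature f
        return abs(a[f] - b[f]) / span[f]

    def dist(a, b):              # distance between two instances
        return sum(delta(f, a, b) for f in range(n_feats))

    merit = [0.0] * n_feats
    for _ in range(n_samples):
        i = rnd.randrange(len(instances))
        r = instances[i]
        same = [x for j, x in enumerate(instances)
                if groups[j] == groups[i] and j != i]
        other = [x for j, x in enumerate(instances)
                 if groups[j] != groups[i]]
        hit = min(same, key=lambda x: dist(r, x))    # nearest hit
        miss = min(other, key=lambda x: dist(r, x))  # nearest miss
        for f in range(n_feats):
            # Reward features that differ across groups; penalize
            # features that differ within the same group.
            merit[f] += (delta(f, r, miss) - delta(f, r, hit)) / n_samples
    return merit
```

For example, on four instances where feature 0 separates the two groups and feature 1 is noise, feature 0 receives the higher merit.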
%
%(respectively).
%The update function for 
%$M_f$ is modified accordingly:
%\[\footnotesize
%\begin{array}{r@{~}l}
%M_f \leftarrow M_f - &  \sum_i^k\frac{\Delta(f,R,H_i)}{N*k} \\
%				   + &  \sum_{g \not= group(R)} \sum_i^k\left(\underbrace{\frac{P(g) }{  1 - P(group(R))}}_{normalization}  * \frac{\Delta(f,R,M_i(g))}{N*k}\right) \\ 
%\end{array} 
%\]
%$P(X)$ denotes the ratio of group $X$ in the entire data.
%When reasoning about rare 
%groups (i.e. when $P(X)$ is small), 
%there is less support for the inference that feature $f$ distinguishes
%one group from another. 
%Accordingly,
%when the  group of the reference instance $R$ or the group of the miss $M$ is rare,
%then the {\em normalization} term (shown above) demotes the influence of their difference.

The experiments of Hall and Holmes \cite{hall03} reject numerous
feature subset selection methods.  A method referred to as WRAPPER is
their preferred
option, but only for small data sets.  For larger data sets, they
recommend RELIEF.

%Another reason to prefer RELIEF is that it can take advantage
%of Nighthawk's stochastic search.  Currently, RELIEF is a
%batch process that is executed after data generation is
%completed.  However, it may not be necessary to wait till the end
%of data generation to gain insights into which features are most
%relevant.  Given the stochastic nature of the algorithm, we can
%see feature work where RELIEF and a genetic algorithm work in
%tandem.  Consider the random selection process at the core of
%RELIEF: a genetic algorithm exploring mutations of the current
%set of parents is an excellent stochastic source of data.  In
%the future, we plan to apply RELIEF incrementally and in
%parallel to our GAs.
%

\subsection{Initial FSS Analysis of Nighthawk}

We changed our code
so that each chromosome 
evaluation
printed the current value of every gene and the
final fitness function score.  (For the two BitSet gene types,
we printed only the cardinality of the set.)
For each of the 16 Collection and Map classes from {\tt java.util},
we ran Nighthawk for 40 generations; by 40 generations, the
fitness score had usually stabilized, and we did not want to
bias our dataset by including many instances with high
score.  Each unit therefore
yielded 800 instances, each consisting of the gene values
and the chromosome score.

%RELIEF assumes continuous data and Nighthawk's performance
RELIEF assumes discrete data, but Nighthawk's fitness
scores are continuous.  We therefore
discretized Nighthawk's output:
\bi
\item The 65\% majority of the scores are within 30\% of the top
  score for any experiment. We call this the {\em high plateau}.
\item A 10\%  minority of scores are less than 10\% of the maximum
  score.  We call this region {\em the hole}.
\item  The remaining data
{\em slopes} from the plateau into the hole.
\ei
%To select features, we divided the
%results 
%into three groups: bottom 10\%, next 25\%, remaining
%65\%. RELIEF then sought features that
%distinguished
%the three groups.  
We therefore assigned each instance to one of three groups
(plateau, slope and hole), and gave the data to RELIEF to
seek features (in this context, genes) that distinguished between the
groups.
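This discretization can be sketched as follows (the function name is ours; the 30\% and 10\% thresholds are those given above, taken relative to the top score of the experiment):

```python
def discretize_scores(scores):
    """Map each continuous fitness score to one of the three groups
    described above: the high plateau (within 30% of the top score),
    the hole (below 10% of the top score), and the slope (the rest)."""
    top = max(scores)
    groups = []
    for s in scores:
        if s >= 0.7 * top:
            groups.append("plateau")
        elif s < 0.1 * top:
            groups.append("hole")
        else:
            groups.append("slope")
    return groups

print(discretize_scores([100, 95, 72, 40, 5]))
# -> ['plateau', 'plateau', 'plateau', 'slope', 'hole']
```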

%Using the above discretization policy, we ran RELIEF 10 times in a 10-way
%cross-validation study.
In order to compensate for the effect of possible outliers in the
data, the data set was divided
into $10$ buckets. Each bucket was temporarily removed in turn, and RELIEF was
run on the remaining data. Each such run produced a ``merit'' figure
$M_f$ for each feature.
% (this ``merit'' value is an internal heuristic measure
%generated from RELIEF, and reflects the difference ratio of neighboring
%instances that have different regions).  
%We therefore got
%a merit score for each of the ten runs for every gene corresponding
%to every subject unit.  Each run therefore yielded a ranked list
%$R$ of all genes, where gene 1 had the highest merit for this
%run, gene 2 had the second highest merit, and so on.
We then calculated summary statistics in order to come
up with rankings of the various gene types (recall that each
gene type $t$ corresponds to zero or more genes, depending on the
unit under test).  Our best results were obtained by ranking the
gene types using {\em average merit}, i.e.\ the average RELIEF
merit score, across all runs and all subject units, of any gene
of that type.
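The $avgMerit$ aggregation can be sketched as follows. The helper below is hypothetical: {\tt merits\_per\_run} holds one RELIEF merit vector per held-out bucket per subject unit, and {\tt gene\_types} maps each gene to its type.

```python
def rank_types_by_avg_merit(merits_per_run, gene_types):
    """Rank gene types by the average RELIEF merit, across all runs,
    of any gene of that type (best first)."""
    totals, counts = {}, {}
    for merits in merits_per_run:
        for g, m in enumerate(merits):
            t = gene_types[g]
            totals[t] = totals.get(t, 0.0) + m
            counts[t] = counts.get(t, 0) + 1
    avg_merit = {t: totals[t] / counts[t] for t in totals}
    return sorted(avg_merit, key=avg_merit.get, reverse=True)

# Two runs over a unit with one numberOfCalls gene and two
# chanceOfNull genes (merit values are illustrative):
types = ["numberOfCalls", "chanceOfNull", "chanceOfNull"]
runs = [[90, 40, 30], [80, 45, 35]]
print(rank_types_by_avg_merit(runs, types))
# -> ['numberOfCalls', 'chanceOfNull']
```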
% i.e.
%\begin{itemize}
%\item $avgMerit(g,u)$ is the average RELIEF merit score, across all
%  10 runs, of gene $g$ derived
%  from unit $u$.
%\item $avgRank(g,u)$ is the average rank in $R$, across all
%  10 runs of the cross-validation study, of gene $g$ derived
%  from unit $u$.
%\item $bestMerit(t)$ is the maximum, over all genes $g$ of
%  type $t$ and all subject units $u$, of $avgMerit(g,u)$.
%\item $bestRank(t)$ is the maximum, over all genes $g$ of
%  type $t$ and all subject units $u$, of $avgRank(g,u)$.
%\end{itemize}

\fig{genes-merit-fig} shows the ten gene types from
Figure \ref{gene-types-fig}, ranked in terms of their $avgMerit$ as
defined above.  This ranking places {\tt numberOfCalls} at
the top, meaning that it considers genes of that type to be the
most valuable; it also places {\tt candidateBitSet} at the bottom,
meaning that it considers genes of that type to be the most
expendable.

% We took the maximum merit ($Max$)
%and generated a {\it Selected} list. A feature was {\it Selected} if its merit
%$m$ was $m> \alpha Max$ (e.g. at $\alpha=0.5$ then we selected all
%features that score at least half the maximum merit).
%
%We 
%
%The features in a Nighthawk chromosome come from the different
%gene types of  \fig{gene-types-fig}.
%We ranked and sorted each gene type using the maximum RELIEF
%merit score seen for any of its features.
%This resulted in the sort order of  \fig{genes-merit-fig}.

\begin{figure}[tp]
{
\footnotesize
\begin{center}
\begin{tabular}{r|rl}
Rank &Gene type $t$  & $avgMerit$ \\
\hline
     1&	 {\tt numberOfCalls}&  85\\
     2&	{\tt valuePoolActivityBitSet} & 83\\
     3&	{\tt upperBound} & 64\\
     4&	{\tt chanceOfTrue}&  50\\
     5&	{\tt methodWeight}&  50\\
     6&	{\tt numberOfValuePools}&  49\\
     7&	{\tt lowerBound }& 44\\
     8&	{\tt chanceOfNull}&  40\\
     9&	{\tt numberOfValues }& 40\\
    10&	{\tt candidateBitSet}&  34\\
    \end{tabular}
\end{center}
}
\caption{
Nighthawk gene types, sorted by the average RELIEF merit ($avgMerit$) of their genes.
}
\label{fig:genes-merit-fig}
\end{figure}

\subsection{An FSS Experiment}

After ranking the gene types by $avgMerit$, we
proceeded to study whether eliminating gene types on the basis
of the RELIEF results sped up Nighthawk without sacrificing
the quality of the results.
%
First, we changed the source code so that we could
replace the parts of Nighthawk's code controlled by
each gene type with code that assumed a constant value for each
gene of that type (i.e.\ we could disable all genes of
a given type).
%
Next, we ran Nighthawk using only the top-ranked
$i$ gene types, for each $1 \le i \le 10$.
For example,
if we say ``number of gene types = 2'' for the
$avgMerit$ ranking, then we are
using only {\tt numberOfCalls} and {\tt valuePoolActivityBitSet}, and
all other gene types are disabled.
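The disabling mechanism can be sketched as follows. This is not Nighthawk's actual (Java) implementation; the class, gene layout, and default values here are illustrative assumptions.

```python
# Sketch of disabling gene types: reads of a gene whose type is
# disabled return a fixed default instead of the evolved value, so the
# GA no longer needs to represent or mutate genes of that type.
# The defaults below are illustrative assumptions.
DEFAULTS = {"numberOfCalls": 1000, "chanceOfNull": 3}

class Chromosome:
    def __init__(self, genes, enabled_types):
        self.genes = genes                  # {(gene_type, index): value}
        self.enabled = set(enabled_types)

    def value(self, gene_type, index=0):
        if gene_type in self.enabled:
            return self.genes[(gene_type, index)]   # evolved value
        return DEFAULTS[gene_type]                  # constant: type disabled

c = Chromosome({("numberOfCalls", 0): 2500, ("chanceOfNull", 0): 7},
               enabled_types=["numberOfCalls"])
print(c.value("numberOfCalls"))  # -> 2500 (still evolved)
print(c.value("chanceOfNull"))   # -> 3 (fixed default)
```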

%
%\begin{figure}[!t]\begin{center}\includegraphics[width=2.5in]{Properties_rankedByMerit.pdf}
%\end{center}\caption{Properties}\label{fig:properties}
%\end{figure}

\begin{wrapfigure}{r}{2.5in}
\begin{center}\includegraphics[width=2.5in]{Hashtable_rankedByAvgMerit.pdf}
\end{center}\caption{
  Nighthawk on Hashtable unit, eliminating gene types according
  to $avgMerit$ ranking.
}
\label{fig:hashtable}
\end{wrapfigure}
%
%\begin{figure}[tp]
%\begin{center}\includegraphics[width=2.5in]{Hashtable_rankedByAvgRank.pdf}
%\end{center}\caption{
%  Nighthawk on Hashtable unit, eliminating gene types according
%  to $bestRank$ ranking.
%}
%\label{fig:hashtable-bestRank}
%\end{figure}
%
%\begin{figure}\begin{center}\includegraphics[width=2.5in]{EnumMap_rankedByMerit.pdf}
%\end{center}\caption{EnumMap}\label{fig:enummap}
%\end{figure}
%

\subsubsection{Coverage Results}
In the following, each run's result is compared to
the runtime and coverage seen using all ten gene types and
running for $g=50$ generations.  \fig{hashtable} shows how the
coverage changed for one of the case studies of Figure \ref{collection-map-cov},
using the $avgMerit$ ranking; the results for this subject unit
are typical.  The y-axis of that figure is defined such that
the point (50,1) represents the coverage reached by using all 10
gene types after 50 generations.  The thick black curve on those
figures shows the performance of Nighthawk using all ten gene
types. Other curves show results from using $1 \le i \le 9$
gene types.   

A measure of interest in \fig{hashtable} is the area
under the curve, which is maximal when Nighthawk converges
to maximum coverage in a few generations.  Due to
the random nature of the GA and the randomized test data
generation, some of the curves are sometimes higher than the
$i=10$ line.


\begin{figure}[tp]
\scriptsize
\begin{center}
\begin{tabular}{r@{~}r@{~}r@{~}r@{~}r@{~}r@{~}r@{~}r@{~}r@{~}r|l}
\multicolumn{10}{c}{Number of used gene types}\\\cline{1-10}
1    &  2   &   3 &   4   &  5   & 6    & 7    & 8    &  9  & 10    & class being tested\\\hline
1.00&1.00&1.01&0.99&1.00&1.00&1.00&1.00&1.00&1.00&ArrayList\\
1.13&1.00&1.03&0.96&0.65&0.68&0.67&0.88&0.71&1.00&EnumMap\\
0.98&0.99&0.98&0.96&1.01&1.00&1.00&1.00&0.98&1.00&HashMap\\
0.99&1.00&0.99&1.00&1.00&1.00&0.99&1.00&1.00&1.00&HashSet\\
0.99&1.00&1.00&1.00&0.99&1.00&0.98&1.01&1.01&1.00&Hashtable\\
0.97&0.97&0.98&0.97&0.98&0.97&1.00&0.98&0.98&1.00&IdentityHashMap\\
0.98&1.00&1.00&1.00&0.99&1.00&1.00&0.99&0.99&1.00&LinkedHashMap\\
0.99&1.01&1.01&1.00&1.00&1.00&1.00&1.01&1.01&1.00&LinkedHashSet\\
1.01&1.01&1.01&1.02&1.00&1.01&1.01&1.02&1.00&1.00&LinkedList\\
1.01&0.99&0.97&1.01&1.03&0.99&0.95&0.98&1.00&1.00&PriorityQueue\\
1.00&1.00&1.00&1.00&1.00&1.00&1.00&1.00&0.99&1.00&Properties\\
1.00&1.00&1.00&1.00&1.00&1.00&1.00&1.00&1.00&1.00&Stack\\
0.90&0.97&0.95&0.93&0.99&0.97&0.97&1.02&1.01&1.00&TreeMap\\
0.90&0.95&0.98&0.93&0.97&0.98&0.98&1.00&0.98&1.00&TreeSet\\
0.95&0.97&0.99&0.96&0.99&0.92&0.99&0.97&1.01&1.00&Vector\\
0.96&0.97&0.97&0.98&0.98&0.97&0.98&0.99&0.97&1.00&WeakHashMap\\\hline
0.98&0.99&0.99&0.98&0.97&0.97&0.97&0.99&0.98&1.00&mean
\end{tabular}
\end{center}
\caption{Coverage found using the top $i$ ranked gene types for
$1 \le i \le 10$. 
Coverages are expressed as a ratio of the coverages found using all gene types.
}\label{fig:coverage}
\end{figure}

If we calculate the ratio of the area under
a curve (AUC) to the area under the thick
black curve, then we can summarize all the curves of
\fig{hashtable} as the first row of \fig{coverage}. In that
figure, each column shows how many gene types were used in a
particular run (when $used=10$, we are ignoring the FSS
results and using all gene types). A value of 1.00 indicates
that the run achieved 100\% of the coverage reached by using all
ten gene types.
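One plausible way to compute such an AUC ratio, assuming coverage is sampled once per generation and integrated with the trapezoidal rule (the exact integration scheme is our assumption, not stated in the text):

```python
def auc(coverages):
    """Trapezoidal area under a coverage-vs-generation curve,
    sampled once per generation."""
    return sum((a + b) / 2.0 for a, b in zip(coverages, coverages[1:]))

def auc_ratio(curve, baseline):
    """One table entry: AUC of a reduced-gene-type run divided by
    the AUC of the all-ten-gene-types baseline run."""
    return auc(curve) / auc(baseline)

baseline = [0.2, 0.8, 0.9, 0.9]           # coverage per generation
print(auc_ratio([0.2, 0.8, 0.9, 0.9], baseline))        # -> 1.0
print(auc_ratio([0.9, 0.9, 0.9, 0.9], baseline) > 1.0)  # -> True
```

A run that converges to maximum coverage faster than the baseline scores above 1.00, which is why some table entries exceed 1.00.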

\fig{coverage} 
summarizes \fig{hashtable} as well as results
from all other cases studied in Figure \ref{collection-map-cov}a.  
The last row of 
\fig{coverage} shows that the mean AUC is very similar using
the top-ranked $i$ gene types.  
From this observation,
we conclude that
Nighthawk's GA gains most of its efficacy from mutating just
the top-ranked gene type, {\tt numberOfCalls}.
However, a statistical comparison of the coverage measures with
only the top gene type and those with all ten gene types does
show a statistically significant difference (a paired Wilcoxon test with
$\alpha=0.05$ results in $p=.013$).
%We also conclude that
%in terms of coverage, there is little benefit in forcing Nighthawk's GA
%to mutate more than this first ranked gene type.

\subsubsection{ Runtime Results}

In order to determine whether eliminating gene types from
Nighthawk's GA is cost-effective, we must consider not only the
coverage achievable, but also the time taken to achieve that
coverage.  
We therefore made two runs of Nighthawk on each of the
subject units: run
(a) used all the gene types, and run (b) used just the top gene
type ranked by $avgMerit$ (i.e.\ {\tt numberOfCalls}).
We then divided the runtime and coverage results
from (b) by the corresponding (a) values seen after 50
generations, and plotted the results.

\fig{timereport100} shows the results, with time percentage on
the X axis and coverage percentage on the Y axis.
Note the point indicated by the arrow in \fig{timereport100}.
This point shows that it is usually
(in $\frac{13}{16}$ cases) possible to achieve 90\% of the
coverage in under 10\% of the time required to run all gene
types for 50 generations.
The two outlier subject units in \fig{timereport100},
{\tt EnumMap} and {\tt TreeMap}, can be
attributed to the low coverage achieved by the original Nighthawk
on {\tt EnumMap}, and to the stochastic nature of both the GA and
the underlying random testing.

%The two anomalous subject units in \fig{timereport100} are
%{\tt EnumMap} and {\tt TreeMap}:
%\bi
%\item  {\tt EnumMap} looks like a
%success story, since we were able to achieve 140\% of the
%original coverage.  In fact, this is simply noise, since the results presented
%earlier in this paper showed that we 
%could achieve only 3\% coverage of {\tt EnumMap} using
%Nighthawk and generic test wrappers \cite{andrews07}. The improvements
%reported in \fig{timereport100}
%shows an improvement to less than 5\%.  
%\item
%%\fig{timereport100} showed that the run
%with TreeMap using one gene type took 1.48 longer than using TreeMap with all ten gene types.
%%To understand this result,
%note that \eq{cost} has two
%components: number of options and the evaluation cost of each
%option. In the case of {\tt TreeMap}, even though we reduced the
%chromosome search space size, we made choices that greatly
%increased the runtime cost.
%\ei

\subsection{Discussion}

This work has implications in two main areas: implications
for Nighthawk itself, and implications for other systems.

\subsubsection{Implications for Nighthawk}

On the surface, the results reported above suggest
that most gene types can be eliminated.  However, there are
at least three mitigating factors.  First, the additional coverage achieved
by using all gene types is statistically significant.  Second, the
additional
\begin{wrapfigure}{r}{3in}
\includegraphics[width=3in]{mar21timereport.pdf}
\caption{Time results, 1 vs 10.}\label{fig:timereport100}
\end{wrapfigure}
coverage may be of code that is difficult to cover, and thus
this additional coverage might be more valuable to the user than the
raw coverage numbers suggest.  Third, the observations about gene types
might not carry over to other, different subject units.

Nevertheless, the results show that it is very likely that Nighthawk can
be modified to give users a better range of cost-benefit tradeoffs,
for instance by eliminating gene types or using early stopping criteria
that take advantage of the early plateaus in coverage seen in
\fig{timereport100}.

%We are currently exploring early stopping criteria within Nighthawk to take
%advantage of the early plateaus in coverage seen in \fig{timereport100}
%The above results show that
%the original Nighthawk was doing
%much work that did not pay off by achieving much better results.
%Consider the nine gene types ranked the lowest in the $avgMerit$
%ranking.  Maintaining all the genes associated with these gene
%types meant that the original Nighthawk was spending time
%mutating these genes and extracting information from the current
%values of the genes ( it also meant that the original Nighthawk
%was using up memory storing the representations of these genes,
%for each chromosome in each generation, with a concomitant
%time expense).  When these less useful genes were eliminated,
%no large loss of coverage was observed, but a great increase
%in efficiency was observed.
%Hence, we are also exploring methods to reduce Nighthawk's memory requirements
%and other performance criteria.

\subsubsection{Implications for Other Systems}

At the {\it meta-heuristic level}, this work suggests that it may
be useful to integrate FSS directly into meta-heuristic
algorithms.  Such an integration would enable the automatic
reporting of the merits of individual features, and the
automatic or semi-automatic selection of features.  If the
results of this paper extend to other domains, this would lead
to meta-heuristic algorithms that improve themselves
automatically each time they are run.
%  We also speculate that
%this is a promising direction to look at for a further order
%of magnitude improvement in performance.

Also, on the {\it theory-formation level}, this work opens
up the possibility of rapid turnover of the theoretical
foundations underlying present tools, as aspects of heuristic
and meta-heuristic approaches are shown to be consistently
valuable or expendable.
%  In the framework of a self-monitoring
%meta-heuristic algorithm, new strategies, heuristics, and
%gene and mutator types can be introduced as theories evolve, and
%can be evaluated efficiently by the algorithms themselves.

As an example of this latter point, the
{\tt chanceOfNull} gene type in Nighthawk determines how likely
(in percent) Nighthawk's random test data generation is to
choose a {\tt null} as a parameter in a given parameter
position.  That gene type was ranked as expendable by the
$avgMerit$ ranking.  This suggests that the default value for this
gene (3\%) is sufficient in most cases.  Given sufficient
corroboration from other case studies and systems, this in turn
suggests that the value of generating test data with nulls is a
relatively settled problem, and that we can turn toward other
aspects of test data generation for future improvements.

