% THIS IS SIGPROC-SP.TEX - VERSION 2.9
% WORKS WITH V3.0SP OF ACM_PROC_ARTICLE-SP.CLS
% MARCH 2007
%
% It is an example file showing how to use the 'acm_proc_article-sp.cls' V3.0SP
% LaTeX2e document class file for Conference Proceedings submissions.
% ----------------------------------------------------------------------------------------------------------------
% This .tex file (and associated .cls V3.0SP) *DOES NOT* produce:
%       1) The Permission Statement
%       2) The Conference (location) Info information
%       3) The Copyright Line with ACM data
%       4) Page numbering
% ---------------------------------------------------------------------------------------------------------------
% It is an example which *does* use the .bib file (from which the .bbl file
% is produced).
% REMEMBER HOWEVER: After having produced the .bbl file,
% and prior to final submission,
% you need to 'insert'  your .bbl file into your source .tex file so as to provide
% ONE 'self-contained' source file.
%
% Questions regarding SIGS should be sent to
% Adrienne Griscti ---> griscti@acm.org
%
% Questions/suggestions regarding the guidelines, .tex and .cls files, etc. to
% Gerald Murray ---> murray@acm.org
%
% For tracking purposes - this is V2.9SP - MARCH 2007

\documentclass{acm_proc_article-sp}

\sloppy  % JHA finds that this actually makes things neater

%-- Begin patch area for accents in 'Author Block' area - may be needed by some authors / but not all
\DeclareFixedFont{\auacc}{OT1}{phv}{m}{n}{12}   % Needed for "Author Block" accents - Patch by Gerry 3/21/07
\DeclareFixedFont{\afacc}{OT1}{phv}{m}{n}{10}   % Needed for "Author Block" accents in the affiliation/address line - Patch by Gerry 3/21/07
%--
\begin{document}

\title{Nighthawk: A Two-Level Genetic-Random \\ Unit Test Data Generator}

\numberofauthors{2}
\author{
% You can go ahead and credit any number of authors here,
% e.g. one 'row of three' or two rows (consisting of one row of three
% and a second row of one, two or three).
%
% The command \alignauthor (no curly braces needed) should
% precede each author name, affiliation/snail-mail address and
% e-mail address. Additionally, tag each line of
% affiliation/address with \affaddr, and tag the
% e-mail address with \email.
%
% 1st. author
\alignauthor
James H.\ Andrews and Felix C. H. Li\\
       \affaddr{Department of Computer Science, University of Western Ontario}\\
       \affaddr{London, Ontario, CANADA N6A 5B7}\\
       \email{\{andrews,cli9\}@csd.uwo.ca}
\alignauthor
Tim Menzies\\
       \affaddr{Lane Department of Computer Science, West Virginia University}\\
       \affaddr{PO Box 6109, Morgantown, WV, USA 26506}\\
       \email{tim@menzies.us}
}
% \date{30 July 1999}

\maketitle
\begin{abstract}
Randomized testing has been shown to be an effective method for
testing software units.  However, the thoroughness of randomized
unit testing varies widely according to the settings of certain
parameters, such as the relative frequencies with which methods
are called.  In this paper, we describe a system which uses a
genetic algorithm to find parameters for randomized unit testing
that optimize test coverage.  We compare our coverage results to
previous work, and report on case studies and experiments on
system options.
\end{abstract}

\category{D.2.5}{Software Engineering}{Testing and Debugging}[Testing tools]
\category{I.2.8}{Artificial Intelligence}{Problem Solving, Control Methods, and Search}

\terms{Algorithms, Experimentation, Measurement}

\keywords{Randomized testing, genetic algorithms, test coverage}

% Use just normal \section, \subsection, etc.

\section{Introduction}


Software testing involves running a piece of software (the
software under test, or SUT) on selected input data, and
checking the outputs for correctness.  The goals of software
testing are to force failures of the SUT, and to be thorough.
The more thoroughly we have tested an SUT without forcing
failures, the more sure we are of the reliability of the SUT.

Randomized testing is the practice of using randomization for
some aspects of test input data selection.  Several independent
studies
\cite{miller-fuzz-cacm,andrews-etal-rute-rt,%
pacheco-etal-icse2007,groce-etal-icse2007}
have found that randomized
testing of software units is effective at forcing failures in
even well-tested units.  However, there remains a question of
whether randomized testing can be thorough enough.  Using
various code coverage measures to measure thoroughness,
researchers have come to varying conclusions about the ability
of randomized testing to be thorough
\cite{michael-etal-ga-tcg,visser-etal-issta06,andrews-etal-rute-rt}.

The thoroughness of randomized unit testing is highly dependent
on parameters that control when and how randomization is
applied.  These parameters include the number of method calls
to make, the relative frequency with which different methods are
called, the ranges from which integer arguments are chosen, and
the manner in which previously-used arguments or
previously-returned values are used in new method calls.  It is
often difficult to work out the optimal values of these parameters by hand.

In this paper, we describe Nighthawk, a system for generating
unit test data.  The system can be viewed as consisting of two
levels.  The lower level is a randomized unit testing engine
which tests a set of methods according to parameter values specified
in a chromosome.  The upper level is a genetic algorithm (GA)
which uses fitness evaluation, selection, mutation and
recombination to find good values for the randomized unit
testing parameters, including parameters that encode a value
reuse policy.  Goodness is evaluated on the basis of test
coverage and number of method calls performed.
Users can use Nighthawk to find good parameters, and then
perform randomized unit testing based on those parameters.

\subsection{Randomized Testing}

``Random'' or ``randomized'' testing has a long history, being
mentioned as far back as Hetzel's 1973 book
\cite{hetzel-book-1973}.  For randomized testing, an automated
oracle is needed.  However, studies
\cite{miller-fuzz-cacm,jcrasher-spe,pacheco-etal-icse2007}
have found that even with simple, high-pass oracles, randomized
testing is effective at forcing failures.
%  For instance,
%Miller et al.\ \cite{miller-fuzz-cacm} used an oracle that
%judged a program run as failing only if it crashed or hung, and
%Pacheco et al.\ \cite{pacheco-etal-icse2007}
%used oracles that checked generally-desirable properties, such
%as that a method does not throw a null pointer exception unless
%one of its arguments is null.

Randomized testing has not, however, always been found to be
sufficiently thorough.  For instance,
Michael et al.\ \cite{michael-etal-ga-tcg} performed randomized testing
on the well-known Triangle program; this program accepts three
integers as arguments, interprets them as sides of a triangle,
and reports whether the triangle is equilateral, isosceles,
scalene, or not a triangle at all.  They concluded that
randomized testing could not achieve 50\% condition/decision
coverage of the code, even after 1000 runs.
Visser et al.\ \cite{visser-etal-issta06}
compared randomized testing with various model-checking
approaches and found that while randomized testing was good at
achieving block coverage, it failed to achieve optimal coverage
for stronger coverage measures.

In contrast, Doong and Frankl \cite{doong-frankl-tosem94} tested several
units using randomized sequences of method calls, and found that
by varying some parameters of the randomized testing, they could
greatly increase or decrease the likelihood of finding injected
faults.  The parameters included number of operations
performed, ranges of integer arguments,
and the relative frequencies of some of the methods in
the call sequence.

\subsection{Randomized Unit Testing}

Unit testing is variously defined as the
testing of a single method, a group of methods, a module or a
class.  We will use it in this paper to mean the testing of a
group $M$ of methods, called the {\it target methods}.  A unit
test is a sequence of calls to the target methods, with each
call possibly preceded by code that sets up the arguments, and
with each call possibly followed by code that checks results.

{\it Randomized unit testing} is unit testing where there is
some randomization in the selection of the target
method call sequence and/or arguments to the
method calls.  Many researchers
\cite{doong-frankl-tosem94,antoy-hamlet-tse-jan2000,%
claessen-hughes-quickcheck,sen-etal-cute,visser-etal-issta06}
have performed randomized unit testing,
sometimes combined with other tools such as model
checkers.

A key concept in randomized unit testing is that of {\it value
reuse}.  We use this term to refer to how the
testing engine reuses the receiver, arguments or return values
of past method calls when making new method calls.  In previous
research, value reuse has mostly taken the form of making a
sequence of method calls all on the same receiver object.

In our previous research, we developed a GUI-based randomized
unit testing engine called RUTE-J \cite{andrews-etal-rute-rt}.
To use RUTE-J, users write their own customized test wrapper
classes, hand-coding such parameters as relative frequencies of
method calls.  Users also hand-code a value reuse policy by
drawing receiver and argument values from value pools, and
placing return values back in value pools.  Finding good
parameters quickly, however, requires experience with the tool.

Nighthawk, the system described in this paper, builds
significantly on this work by automatically determining good parameters.
The lower, randomized-testing,
level of Nighthawk initializes and maintains one or more value
pools for all relevant types, and draws and replaces values in
the pools according to a policy specified in a chromosome.
The chromosome also specifies relative frequencies of methods,
method parameter ranges, and other testing parameters.
The upper, genetic-algorithm, level performs a search for the
parameter setting that causes the lower level to achieve a high
value of a coverage-related measure.  Nighthawk uses only the
Java reflection facility to gather information about the SUT,
making its general approach robust and adaptable to other languages.

\subsection{Contributions and Paper Organization}

The main contributions of this paper are as follows.
\begin{enumerate}
\item We describe the implementation of a novel two-level
  genetic-random testing system, Nighthawk.  In particular,
  we describe how we encode a value reuse policy in a manner
  amenable to meta-heuristic search.
\item We compare Nighthawk to other systems described in
  previous research, showing that it can achieve the same
  coverage levels.
\item We describe the results of a case study carried out on
  real-world units (the Java 1.5.0 Collection and Map classes)
  to determine the effects of different option settings on the
  basic algorithm.
\end{enumerate}

We discuss related work in section 2.
In section 3, we describe the results of an exploratory study
that suggested that a genetic-random approach was feasible and
could find useful parameter settings.  In section 4, we describe
the design and use of Nighthawk.  Section 5 contains our
comparison to previous work, and section 6 our case study.

\section{Exploratory Study}
\label{exploratory-section}

To find out whether there was any merit in the idea of a
genetic-random system, we conducted an exploratory experiment.
In this section, we describe the prototype software we developed
for the experiment, the design of the experiment and the results
of the study.

\subsection{Software Developed}

Using code from RUTE-J (see above) and the open-source genetic
algorithm package JDEAL \cite{jdeal}, we constructed a prototype
two-level genetic-random unit testing system that took Java
classes as its testing units.  For each unit under test (UUT) with
$n$ methods to call, the GA level constructed a chromosome with
$n+2$ integer genes:  the first contained the number of method
calls to make in each test case, the second contained the number
of test cases to generate, and the other $n$ contained the
relative weights (calling frequencies) of the methods.  All other
randomized testing parameters, such as the ranges of integer
parameters and the value reuse policy, were hard-coded in the
test wrappers.

The evaluation of the fitness of each chromosome $c$ proceeded as
follows.  We got the random testing level to generate the number
of test cases of the length specified in $c$, using
the method weights specified in $c$, and then
measured the number of coverage points covered; we used the
open-source coverage tool Cobertura \cite{cobertura-website}
for this, which measures line coverage.  If we had based the
fitness function only on coverage, however, then any chromosome
would have tended to benefit from having a larger number of
method calls and/or test cases, since every method call has the
potential of covering more code.
We therefore built in a brake to prevent these values from
getting unfeasibly high.  We calculated the fitness function for
the chromosome as:
\begin{center}
(number of coverage points covered) $*$ (coverage factor) \\
 $-$ (number of method calls performed)
\end{center}
We set the coverage factor to 1000, meaning that we were willing
to make up to 1000 additional method calls, but no more, in
order to cover one more coverage point.
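As a concrete illustration, the fitness computation can be sketched in Java as follows; the class and method names are ours, not part of the prototype.

```java
// Sketch of the exploratory study's fitness function (names are ours).
public class Fitness {
    // Willing to make up to COVERAGE_FACTOR extra calls per extra coverage point.
    static final long COVERAGE_FACTOR = 1000;

    static long fitness(long coveragePointsCovered, long methodCallsPerformed) {
        return coveragePointsCovered * COVERAGE_FACTOR - methodCallsPerformed;
    }

    public static void main(String[] args) {
        // Covering one more point outweighs up to 1000 additional calls:
        System.out.println(fitness(10, 500));   // 9500
        System.out.println(fitness(11, 1400));  // 9600 -- still fitter
    }
}
```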

\subsection{Experiment Design}

We chose as our subject programs three units taken from the Java
1.4.2 edition of {\tt java.util}: {\tt BitSet}, {\tt HashMap}
and {\tt TreeMap}.  Each was in wide use as part of a standard
package, and {\tt TreeMap} had been used as the basis of earlier
experiments \cite{visser-etal-issta04}.  For each UUT, we wrote a
test wrapper class containing methods that called selected
target methods of the UUT (16 methods for BitSet, 8 for HashMap
and 9 for TreeMap).  We instrumented each UUT using Cobertura.

We ran the two-level algorithm 30 times on each of the three
subject units, and recorded the amount of time taken and the
parameters recorded in the final chromosome.
In order to test whether the weights in the chromosomes were
useful given the length and number of method calls, for each
final chromosome $c$ we created a variant chromosome $c'$
with the same length and number of method calls but with all
weights equal.  We then compared the coverage achieved by
$c$ and $c'$ on 30 paired trials.
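The construction of the equal-weight variant can be sketched as follows, under the simplifying assumption (ours, not the prototype's representation) that a chromosome is a flat integer array with index 0 holding the calls per test case, index 1 the number of test cases, and the remaining entries the method weights.

```java
import java.util.Arrays;

// Sketch of building the equal-weight variant c' from a chromosome c.
// The flat-array representation is our simplification.
public class EqualWeightVariant {
    static int[] equalWeightVariant(int[] c) {
        int[] variant = Arrays.copyOf(c, c.length);
        for (int i = 2; i < variant.length; i++) {
            variant[i] = 1;  // all methods equally likely
        }
        return variant;     // same length and number of calls, equal weights
    }

    public static void main(String[] args) {
        int[] c = {200, 30, 5, 1, 9};   // calls, cases, three method weights
        System.out.println(Arrays.toString(equalWeightVariant(c)));
        // [200, 30, 1, 1, 1]
    }
}
```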


\subsection{Results}

The full results from the experiment are available in
\cite{cli-msc}.  We found that the amount of time needed by the
prototype was reasonable given the generic nature of the tools
we were using (about 25 minutes to compute 5000 generations of a
population of 100 chromosomes).

We performed two statistical tests to evaluate whether the
system was converging on a reasonable solution.  First, we
ordered the average weights discovered for each method in each
class in ascending order, and performed a $t$ test
with Bonferroni correction between each pair of
adjacent columns.
We found that for the {\tt HashMap} and {\tt TreeMap} units, the
{\tt clear} method (which removes all data from the map) had a
statistically significantly lower weight than the other methods,
indicating that the algorithm was consistently converging on a
solution in which it had a lower weight.
This is explainable by the fact that much of the code in these
units can be executed only when there is a large amount of data
in the container objects.  Since the {\tt clear} method clears
out all the data, executing it only infrequently ensured that
the objects would get large enough.

We also found that for the {\tt TreeMap} unit, the {\tt remove}
and {\tt put} methods had a statistically significantly higher
weight than the other methods.  This is explainable by the large
amount of complex code in these methods and the private methods
that they call; it takes more calls with diverse parameters to
cover this code than it does for the relatively simpler code of
the other methods.

The second statistical test we performed was designed to test
whether the weights found by the GA were actually efficient.
For this, we used the 30 trials comparing the discovered
chromosome $c$ and the equal-weight variant $c'$.  We found that
for all three units, the equal-weight chromosome covered less
code than the
original, to a statistically significant level (as measured by a
$t$ test with $\alpha=0.05$).  This can be interpreted as
meaning that the calculated weights were better than equal
weights at covering code when the specified number of test cases
of the specified length were generated.

In the course of the experiment, we found a bug in the Java
1.4.2 version of {\tt BitSet}:  when a call to {\tt set()} is
performed on a range of bits of length 0, the unit could wrongly
reallocate more memory.  We found that a bug report for this bug
had already been submitted to Sun's bug database.  It has been
corrected in the Java 1.5.0 version of the library.

In summary, the experiment indicated that the two-level
algorithm was potentially useful, and that it consistently
converged on similar solutions that covered more code than
calling all methods with equal frequency.

\section{Nighthawk:  System Description}

The results of our exploratory study encouraged us to expand
the scope of the GA to include method parameter ranges, the
value reuse policy and other randomized testing parameters.  The
result was the Nighthawk system.

In this section, we first outline the lower, randomized-testing,
level of Nighthawk, and then describe the chromosome that
controls the operation of that lower level.  We then outline the
genetic-algorithm level and the ways in which the end user can
use the system.  Finally, we describe the use of
automatically-generated test wrappers for precondition
checking, result evaluation and coverage enhancement.

\subsection{Randomized Testing Level}

In this section, we present a simplified description of the
algorithm that the lower, randomized-testing, level of Nighthawk
uses to construct and run a test case.  The algorithm takes two
parameters:  a set $M$ of Java methods, and a genetic algorithm
chromosome $c$ appropriate to $M$.  The chromosome controls
aspects of the algorithm's behaviour, such as the number of
method calls to be made.  The chromosome will be described in
more detail in the next subsection.

In the following description, the word ``choose'' is always used
to mean specifically a random choice which may partly depend on
the chromosome $c$.

We refer to $M$ as the set of ``target methods''.  We define the
set $I_M$ of {\it types of interest} corresponding to $M$ as the
union of the following sets of types\footnote{
  In this paper, the word ``type'' refers to any primitive type,
  interface, or abstract or concrete class.
}:
\begin{itemize}
\item All types of receivers of methods in $M$.
\item All types of parameters of methods in $M$.
\item All types of return values of methods in $M$.
\item All primitive types that are the types of parameters
  to constructors of other types of interest.
\end{itemize}
When $M$ is clear from context, we drop the subscript, referring
to $I_M$ simply as $I$.  In the algorithm, each type $t \in I$
is associated with an array of {\it value pools}, and each
value pool for $t$ contains an array of values of type $t$. 
Each value pool for a range primitive type (a primitive type
other than {\tt boolean} and {\tt void}) also has bounds on
the values that can appear in it.  The number of value pools,
number of values in each value pool, and the range primitive
type bounds are specified by the chromosome $c$.
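One plausible realization of a bounded value pool for a range primitive type is sketched below; the class shape is our own, with the pool size and bounds standing in for the corresponding chromosome genes.

```java
import java.util.Random;

// Sketch of a bounded value pool for a range primitive type (here int).
// Pool count, pool size and bounds would come from chromosome genes;
// this class shape is our illustration, not Nighthawk's implementation.
public class IntValuePool {
    final int lower, upper;     // inclusive bounds from the chromosome
    final int[] values;
    final Random rng;

    IntValuePool(int size, int lower, int upper, Random rng) {
        this.lower = lower;
        this.upper = upper;
        this.rng = rng;
        this.values = new int[size];
        for (int i = 0; i < size; i++) {
            // initial values drawn uniformly within the pool's bounds
            values[i] = lower + rng.nextInt(upper - lower + 1);
        }
    }

    int choose() {                       // draw a value for a method call
        return values[rng.nextInt(values.length)];
    }

    void replace(int ret) {              // store a return value, if in bounds
        if (ret >= lower && ret <= upper) {
            values[rng.nextInt(values.length)] = ret;
        }
    }

    public static void main(String[] args) {
        IntValuePool pool = new IntValuePool(30, 2, 5, new Random(42));
        System.out.println(pool.choose());   // some value in [2, 5]
    }
}
```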

We define a constructor method to be an {\it initializer} if it
has no parameters, or if all its parameters are of primitive
types; these are the methods that build objects depending only
on primitive-type values.  We define a constructor to be a
{\it reinitializer} if it has no parameters, or if all its
parameters are of types in $I_M$; these are the methods that
build objects depending only on values that can be found in
value pools.  We define the set $C_M$ of {\it callable methods}
to be the methods in $M$ plus the reinitializers of the types of
$I_M$.  The callable methods are the ones that Nighthawk calls directly.
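The initializer and reinitializer tests map directly onto Java reflection, as the sketch below shows; the code and the {\tt initializerOf} convenience wrapper are ours.

```java
import java.lang.reflect.Constructor;
import java.util.Set;

// Sketch of the initializer/reinitializer tests via Java reflection:
// an initializer has only primitive-type parameters; a reinitializer
// has only parameters whose types are in the set I of types of interest.
public class ConstructorKinds {
    static boolean isInitializer(Constructor<?> ctor) {
        for (Class<?> p : ctor.getParameterTypes()) {
            if (!p.isPrimitive()) return false;
        }
        return true;   // a no-parameter constructor also qualifies
    }

    static boolean isReinitializer(Constructor<?> ctor, Set<Class<?>> typesOfInterest) {
        for (Class<?> p : ctor.getParameterTypes()) {
            if (!typesOfInterest.contains(p)) return false;
        }
        return true;
    }

    // Hypothetical convenience wrapper (ours), looking up a constructor.
    static boolean initializerOf(Class<?> cls, Class<?>... params) {
        try {
            return isInitializer(cls.getConstructor(params));
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // BitSet(int) takes only a primitive, so it is an initializer:
        System.out.println(initializerOf(java.util.BitSet.class, int.class));      // true
        // String(StringBuilder) takes a non-primitive, so it is not:
        System.out.println(initializerOf(String.class, StringBuilder.class));      // false
    }
}
```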

A {\it call description} is an object representing one method
call that has been constructed and run.  It consists of the
method name, an indication of whether the method call succeeded,
failed or threw an exception, and one {\it object description}
for the receiver (if any), the parameters (if any) and the
result (if any).
The object description of an object of primitive type or a
string consists of the object itself.  The object description of
other objects consists of the class of interest, the value pool
number and the value number associated with the object; this is
the source from which the object came (for receivers and
parameters) or the place where the value was put (for results).
A {\it test case} is a sequence of call descriptions, together
with an indication of whether the test case succeeded or failed.

\begin{figure}
Input:  a set $M$ of target methods; a chromosome $c$. \\
Output:  a test case. \\
Steps:
\begin{enumerate}
\item For each element of each value pool of each primitive
  type in $I_M$, choose an initial value that is within the bounds
  for that value pool.
\item For each element of each value pool of each other type $t$
  in $I_M$:
  \begin{enumerate}
  \item If $t$ has no initializers, then set the element to {\tt null}.
  \item Otherwise, choose an initializer method $i$ of $t$, call
    {\sf tryRunMethod}$(i, c)$, and place the result in the
    destination element.
  \end{enumerate}
\item Initialize test case $k$ to the empty test case.
\item Repeat $n$ times, where $n$ is the number of method calls to
  perform:
  \begin{enumerate}
  \item Choose a target method $m \in M$.
  \item Run algorithm {\sf tryRunMethod}$(m, c)$, and
        add the call description returned to $k$.
  \item If {\sf tryRunMethod} returns a method call failure
    indication, return $k$ with a failure indication.
  \end{enumerate}
\item Return $k$ with a success indication.
\end{enumerate}

\caption{Algorithm {\sf constructRunTestCase}.}
\label{constructRunTestCase-fig}
\end{figure}

\begin{figure}
Input:  a method $m$; a chromosome $c$. \\
Output:  a call description. \\
Steps:
\begin{enumerate}
\item If $m$ is non-static and not a constructor:
  \begin{enumerate}
  \item Choose a type $t \in I$ which is a subtype of the receiver of $m$.  
  \item Choose a value pool $p$ for $t$.
  \item Choose one value $recv$ from $p$ to
    act as a receiver for the method call.
  \end{enumerate}
\item For each argument position to $m$:
  \begin{enumerate}
  \item Choose a type $t \in I$ which is a subtype of the argument type.
  \item Choose a value pool $p$ for $t$.
  \item Choose one value $v$ from $p$ to act as the argument.
  \end{enumerate}
\item If the method is a constructor or is static, call it with
  the chosen arguments.  Otherwise, call it on $recv$ with the
  chosen arguments.
\item If the method call threw an {\tt AssertionError},
  return a call description with a failure indication.
\item Otherwise, if the method call threw some other exception,
  return a call description with an exception indication.
\item Otherwise, if the method return type is not {\tt void},
  and the return value $ret$ is non-null:
  \begin{enumerate}
  \item Choose a type $t \in I$ which is a supertype of the
    type of the return value.
  \item Choose a value pool $p$ for $t$.
  \item If $t$ is not a primitive type, or if $t$ is a primitive
    type and $ret$ does not violate the bounds on
    $p$, then choose an element of $p$ and
    replace it by $ret$.
  \item Return a call description with a success indication.
  \end{enumerate}
\end{enumerate}

\caption{Algorithm {\sf tryRunMethod}.}
\label{tryRunMethod-fig}
\end{figure}

Nighthawk's randomized testing algorithm is referred to as
{\sf constructRunTestCase}, and is described in Figure
\ref{constructRunTestCase-fig}.  It takes the set $M$ of target
methods and the chromosome $c$ as inputs, and begins by seeding
first primitive-type value pools, and then non-primitive-type
value pools using initializers.  It then constructs and runs a
test case, and returns the test case.  It uses an auxiliary
method called {\sf tryRunMethod}, described in Figure
\ref{tryRunMethod-fig}, which takes a method as input and
returns a call description.

{\sf tryRunMethod} considers a method call to fail if and only
if it throws an {\tt AssertionError}.  It does not consider
other exceptions to be failures, since they might be correct
responses to bad input parameters.  A separate mechanism
is used for detecting precondition violations and checking
correctness of return values and exceptions; see
Section \ref{test-wrappers-section}.
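This classification of call outcomes can be sketched as follows; the enumeration and names are ours, and Nighthawk's actual call machinery uses reflection rather than {\tt Runnable}.

```java
// Sketch of tryRunMethod's outcome classification: only AssertionError
// counts as a failure; other exceptions are recorded but not treated as
// failures, since they may be correct responses to bad input parameters.
public class CallOutcome {
    enum Kind { SUCCESS, EXCEPTION, FAILURE }

    static Kind classify(Runnable call) {
        try {
            call.run();
            return Kind.SUCCESS;
        } catch (AssertionError e) {
            return Kind.FAILURE;        // a violated internal assertion
        } catch (RuntimeException e) {
            return Kind.EXCEPTION;      // possibly legitimate, e.g. bad args
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(() -> {}));                                        // SUCCESS
        System.out.println(classify(() -> { throw new IllegalArgumentException(); })); // EXCEPTION
        System.out.println(classify(() -> { throw new AssertionError(); }));           // FAILURE
    }
}
```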

For conciseness, the algorithm descriptions omit some details
which we now fill in.  These concern the treatment of nulls, the
treatment of {\tt String}, and the treatment of {\tt Object}.

The receiver of a method call cannot be null, and no parameter
can be null unless {\sf tryRunMethod} chooses it to be.  If
{\sf tryRunMethod} fails to find a non-null value when it is
looking for one, it reports failure of the attempt to call the
method (which is different from failure of the method call
itself).  {\sf constructRunTestCase} tolerates a certain number
of these attempt failures before prematurely terminating the
test case generation process.

{\tt java.lang.String} is treated as if it is a primitive type,
the values in the value pools being initialized with ``seed
strings''.  Some default seed strings are supplied by the
system, and the user can supply more.

\label{object-section}
Formal parameters of type {\tt java.lang.Object} stand
for some arbitrary object, but it is usually sufficient
to use a small number of specific
types as actual parameters;
Nighthawk uses only {\tt int} and {\tt String} by default.
A notable exception to this rule is the parameter to the {\tt equals()}
method, which can be treated specially by test wrapper objects
(see Section \ref{test-wrappers-section}).

\subsection{Chromosomes}

Aspects of the test case execution algorithms are controlled
by the genetic algorithm chromosome given as an argument.  A
{\it chromosome} is composed of a finite number of {\it genes}.
Each gene is a pair consisting of a name (a unique descriptive
string) and an integer, floating-point or boolean value.

Every Nighthawk chromosome contains a gene specifying the number
$n$ of method calls that {\sf constructRunTestCase} is to
run.  In addition, a chromosome appropriate to a set $M$
of target methods contains the following genes:
\begin{itemize}
%\item The number $n$ of method calls to perform.
\item For each method in $C_M$, the relative weight of the
  method, i.e.\ the likelihood that it will be chosen at step
  4(a) of {\sf constructRunTestCase}.
\item For each type of interest in $I_M$, the number of value
  pools for that type.
\item For each value pool, the number of values in the pool.
\item For each value pool of a range primitive type, the upper
  and lower bounds on values in the pool.  Initial values are
  drawn from this range with a uniform distribution.
\item For each method in $C_M$, the types of interest that will
  be chosen from to find a receiver, and, for each of those
  types, the value pools that will be chosen from.
\item For each method in $C_M$ and every parameter position, the
  chance that {\tt null} will be chosen as an actual parameter.
\item For each method in $C_M$ and every parameter position, the
  types of interest that will be chosen from to find an actual
  parameter, and, for each of those types, the value pools
  that will be chosen from.
\item For each method in $C_M$, the types of interest that will
  be chosen from to find a destination for the return value,
  and, for each of those types, the value pools that will be
  chosen from.
\end{itemize}
The last two genes are expressed as bit vectors, where each bit
stands for one of the types of interest that is a subtype of the
declared type.

It is clear that different gene values in the chromosome may
cause dramatically different behaviour of the algorithm on the
methods.  We illustrate this point with two concrete examples.

Consider the ``triangle'' unit mentioned earlier as one of the
subject units of Michael et al.  If the chromosome specifies
that all three parameter values are to be taken from a value
pool of 65536 values in the range -32768 to 32767, then the
chance that the randomized testing level will ever choose two
or three identical values for the parameters (needed for the
``isosceles'' and ``equilateral'' cases) is very low.  If,
on the other hand, the value pool contains only 30 integers
each chosen from the range 2 to 5, then the chance rises to
10 in 16 that at least two values will be identical, and 1 in 16
that all three values will be identical.  The amount of
additional coverage this would give would vary depending on the
UUT, but is probably nonzero.
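The quoted odds can be confirmed by exhaustively enumerating the $4^3 = 64$ equally likely draws; this check is our own, not part of Nighthawk.

```java
// Exhaustive check of the odds quoted above: drawing three values
// uniformly from {2,3,4,5}, at least two coincide in 40 of the 64
// outcomes (10/16), and all three coincide in 4 of 64 (1/16).
public class TriangleOdds {
    static int atLeastTwoEqual() {
        int n = 0;
        for (int a = 2; a <= 5; a++)
            for (int b = 2; b <= 5; b++)
                for (int c = 2; c <= 5; c++)
                    if (a == b || b == c || a == c) n++;
        return n;
    }

    static int allThreeEqual() {
        int n = 0;
        for (int a = 2; a <= 5; a++)
            for (int b = 2; b <= 5; b++)
                for (int c = 2; c <= 5; c++)
                    if (a == b && b == c) n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(atLeastTwoEqual() + "/64");  // 40/64 = 10/16
        System.out.println(allThreeEqual() + "/64");    // 4/64  = 1/16
    }
}
```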

Consider further a container class with {\tt put} and
{\tt remove} methods, each taking an integer key as its only
parameter.  If the parameters to the two methods are taken from
two different value pools of 30 values in the range 0 to 1000,
there is little chance that a key that has been put into the
container will be successfully removed.  If, however, the
parameters are taken from a single value pool of 30 values in
the range 0 to 1000, then the chance is very good that added
values are removed.  The amount of additional coverage
this would give would vary by UUT, but a {\tt remove}
method for a typical data structure executes different
code for a successful removal than it does for a failing one.

\subsection{Genetic Algorithm Level}

We take the space of possible chromosomes as a solution space to
search, and apply the genetic algorithm approach to search it
for a good solution.
GAs have been found to be superior to purely random search in
finding solutions to complex problems.  Goldberg
\cite{goldberg-ga-book} argues that their power stems from being
able to engage in ``discovery and recombination of building
blocks'' for solutions in a solution space.  For the randomized
unit testing problem, the ``building blocks'' are individual
choices for the gene values; discovering and recombining good
building blocks leads to better overall solutions.

The parameter to Nighthawk's GA is the set $M$ of target methods.
The GA performs the usual steps of chromosome
evaluation, fitness selection, mutation and recombination.
It first derives an
initial template chromosome appropriate to $M$, constructs
an initial population of size $p$ as clones of this chromosome, and
mutates the initial population.  It then performs a loop, for
the desired number $g$ of generations, of evaluating each
chromosome's fitness, retaining the $r\%$ fittest chromosomes,
discarding the rest, cloning the fit chromosomes and mutating
the population with point mutations and crossover (exchange of
genes between chromosomes).
The {\it fitness function} for a chromosome is calculated in
a manner identical to the exploratory study (Section
\ref{exploratory-section}).
%; that is, as
%\begin{center}
%(number of coverage points covered) $*$ (coverage factor) \\
% $-$ (number of method calls)
%\end{center}
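A simplified sketch of this loop, with the default settings $p=20$, $g=50$, $r=20$, appears below.  The fitness function here is a stand-in (maximizing the sum of the genes), since the real fitness requires running the randomized-testing level, and crossover is omitted for brevity; the code is our illustration, not Nighthawk's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Skeleton of the GA loop: evaluate, retain the fittest r%, clone the
// survivors, and mutate. Crossover is omitted; fitness is a stand-in.
public class GaSkeleton {
    static final int P = 20, G = 50, R = 20;   // population, generations, retain %
    static final int GENES = 8, MAX = 100;
    static final Random RNG = new Random(1);

    static long fitness(int[] c) {             // stand-in for coverage-based fitness
        long s = 0;
        for (int g : c) s += g;
        return s;
    }

    static int[] mutate(int[] c) {
        int[] m = c.clone();
        m[RNG.nextInt(GENES)] = RNG.nextInt(MAX + 1);  // point mutation
        return m;
    }

    public static int[] run() {
        List<int[]> pop = new ArrayList<>();
        int[] template = new int[GENES];       // initial template chromosome
        for (int i = 0; i < P; i++) pop.add(mutate(template));
        for (int gen = 0; gen < G; gen++) {
            pop.sort(Comparator.comparingLong(GaSkeleton::fitness).reversed());
            int keep = Math.max(1, P * R / 100);
            List<int[]> next = new ArrayList<>(pop.subList(0, keep));
            while (next.size() < P)            // clone a survivor and mutate it
                next.add(mutate(next.get(RNG.nextInt(keep))));
            pop = next;
        }
        pop.sort(Comparator.comparingLong(GaSkeleton::fitness).reversed());
        return pop.get(0);                     // fittest chromosome found
    }

    public static void main(String[] args) {
        // The best fitness climbs well above the all-zero template's 0.
        System.out.println(fitness(run()));
    }
}
```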

It is recognized that the design of genetic algorithms is a
``black art'' \cite{ga-blackart}: very little is known about
why GAs succeed when they do, or why they fail when they do
not.  Nighthawk uses default settings of
$p=20, g=50, r=20$.  These settings are different from those
taken as standard in GA literature \cite{dejong-spears-genetic},
and are motivated by a need to do as few chromosome evaluations
as possible (the primary cost driver of the system).  The
settings of other variables, such as the mutation and crossover
rate, are consistent with the literature.
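The generational loop can be sketched as below, using the default
settings $p=20$, $g=50$, $r=20$.  The {\tt int[]} chromosome and the
toy fitness function (gene sum) are stand-ins for Nighthawk's gene
encoding and its coverage-based fitness; only the loop structure
(evaluate, retain the fittest $r\%$, clone, point-mutate, cross over)
follows the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal sketch of the GA generational loop.  The chromosome
// representation and fitness function are illustrative stand-ins.
public class GaSketch {
    static final Random RNG = new Random(42);
    static final int P = 20, G = 50, R = 20, GENES = 8;

    // Toy fitness: sum of genes (a real run would measure coverage).
    static int fitness(int[] c) {
        int sum = 0;
        for (int g : c) sum += g;
        return sum;
    }

    static int[] run() {
        List<int[]> pop = new ArrayList<>();
        for (int i = 0; i < P; i++)
            pop.add(new int[GENES]);                     // clones of a template
        for (int gen = 0; gen < G; gen++) {
            pop.sort((a, b) -> fitness(b) - fitness(a)); // evaluate and rank
            int keep = P * R / 100;                      // retain fittest r%
            List<int[]> next = new ArrayList<>(pop.subList(0, keep));
            while (next.size() < P) {
                int[] child = next.get(RNG.nextInt(keep)).clone(); // clone a fit one
                child[RNG.nextInt(GENES)] = RNG.nextInt(10);       // point mutation
                int[] other = next.get(RNG.nextInt(next.size()));
                for (int j = RNG.nextInt(GENES); j < GENES; j++)
                    child[j] = other[j];                 // one-point crossover
                next.add(child);
            }
            pop = next;
        }
        pop.sort((a, b) -> fitness(b) - fitness(a));
        return pop.get(0);                               // fittest of final generation
    }
}
```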

To enhance availability of the software,
Nighthawk uses the popular open-source coverage tool Cobertura
\cite{cobertura-website} to measure coverage.  Cobertura can measure
only line coverage (each coverage point corresponds to a source
code line, and is covered if any code on the line is executed)\footnote{
  Cobertura also reports what it calls ``decision coverage'',
  but this is coverage of lines containing decisions.
}.
However, Nighthawk's algorithm is not specific to this measure;
indeed, our empirical studies (see below) show that Nighthawk
performs well when using other coverage measures.

\subsection{Top-Level Application}

\label{deep-section}
The Nighthawk application takes several switches and a set
of class names as command-line parameters.  The default
behaviour is to consider the command-line class names as a set
of ``target classes''.  If, however, the ``{\tt --deep}'' switch
is given to Nighthawk, the public declared methods of the
command-line classes are explored, and all non-primitive types
of parameters and return values of those methods are added to
the set of target classes.  The set $M$ of target {\it methods}
is computed as all public declared methods of the target
{\it classes}.  Intuitively, therefore, the {\tt --deep} switch
performs a ``deep target analysis'' by getting Nighthawk to call
methods in the layer of classes surrounding the command-line
classes.

Nighthawk runs the genetic algorithm, monitoring the
chromosomes and retaining the most fit
chromosome ever encountered.  A text description of this most
fit chromosome is the final output of the program.

After finding the most fit chromosome, a test engineer can
perform the randomized testing as specified by the chromosome.
To do this, they run a separate program, RunChromosome, which
takes the chromosome description as input and runs test cases
based on it a user-specified number of times.  Randomized
unit testing generates new test cases with new data every time
it is run, so if Nighthawk finds a parameter setting that
achieves high coverage, a test engineer can automatically
generate a large number of useful test cases with the
Nighthawk / RunChromosome combination.
\label{run-chromosome-sec}

\subsection{Test Wrappers}
\label{test-wrappers-section}

We provide a utility program that, given a class name, generates
the Java source file of a ``test wrapper'' class for the class.
Running Nighthawk on an unmodified test wrapper is the same
as running it on the target class; however, test wrappers can be
customized for precondition checking, result checking or
coverage enhancement.

A test wrapper for class X is a class that contains one private
field of class X (the ``wrapped object''), and one public method
with an identical declaration for each public declared method of
class X.  Each wrapper method simply passes the call on to the
private data member.
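For concreteness, a plain generated wrapper for {\tt java.util.Stack}
might begin as follows; only two of the delegating methods are shown,
and the real generator emits one per public declared method.

```java
import java.util.Stack;

// Sketch of a generated plain test wrapper: one private wrapped
// object, and one public delegating method per public declared
// method of the wrapped class (two shown for brevity).
public class StackWrapper {
    private Stack<Object> wrapped = new Stack<>();

    public Object push(Object item) { return wrapped.push(item); }
    public Object pop()             { return wrapped.pop(); }
}
```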

To customize a test wrapper for precondition checking, the user
can insert a check in the wrapper method before the target
method call.  If the precondition is violated, the wrapper
method can simply return.

To customize a test wrapper for test result checking, the user
can insert any result-checking code after the target method
call; examples include normal Java assertions and JML
\cite{jml-overview} contracts.
For instance, a user can choose to customize a wrapper method so
that it throws an {\tt AssertionError} (signalling test case failure)
whenever the target method throws an {\tt ArrayIndexOutOfBoundsException}.
We provide switches to the test wrapper generation program that
check commonly-desired properties, such as that a method throws
no {\tt NullPointerException} unless one of its arguments is
null.  The switch \verb/--pleb/ generates a wrapper that checks
all the Java Exception and Object contracts from
Pacheco et al.\ \cite{pacheco-etal-icse2007}.
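Both kinds of customization can be sketched in a single wrapper
method; the wrapped class and the particular checks below are
illustrative, not generated by the tool.

```java
import java.util.Stack;

// Sketch of a hand-customized wrapper method: a precondition check
// before the target call, and a result check after it.
public class CheckedStackWrapper {
    private Stack<Integer> wrapped = new Stack<>();

    public void push(Integer item) { wrapped.push(item); }

    public Integer pop() {
        if (wrapped.isEmpty()) return null;  // precondition violated: skip call
        Integer result = wrapped.pop();
        if (result == null)                  // result check: signal failure
            throw new AssertionError("pop returned null");
        return result;
    }
}
```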

Finally, to customize a test wrapper for coverage enhancement,
the user can insert extra methods that cause code to be executed
that would not otherwise be executed.  We provide two switches
for commonly-desired enhancements.  The switch \verb/--checkTypedEquals/
adds a method to the test wrapper for class X that takes one
argument of type X and passes it to the {\tt equals} method
of the wrapped object.  This is distinct from the normal wrapper
method that calls {\tt equals}, which has an argument of type
{\tt Object} and would therefore by default receive arguments
only of type {\tt int} or {\tt String} (see Section \ref{object-section}).
For classes X that implement their own {\tt equals} method, the
typed-equals method is likely to execute more code.
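For instance, the added method for a wrapper of
{\tt java.util.ArrayList} might look like the following sketch
(the wrapped class is illustrative):

```java
import java.util.ArrayList;

// Sketch of the method added by --checkTypedEquals: the same call as
// the normal equals(Object) wrapper method, but with the declared
// parameter type of the wrapped class itself, so the generator can
// pass another ArrayList and exercise ArrayList's own equals code.
public class TypedEqualsWrapper {
    private ArrayList<Object> wrapped = new ArrayList<>();

    public void add(Object o) { wrapped.add(o); }

    public boolean typedEquals(ArrayList<Object> other) {
        return wrapped.equals(other);
    }
}
```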

Customized serialization is accomplished in Java via
specially-named private methods that are inaccessible to
Nighthawk.  The test wrapper generation program switch
\verb/--checkSerialization/, for serializable objects, adds a
method to the test wrapper that writes the object to a byte
array output stream and reads it again from the byte array.
This causes Nighthawk to be able to execute the code in the
private serialization methods.
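The added method can be sketched as below, here for a wrapped
{\tt ArrayList}; the wrapped class and the error handling are
illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;

// Sketch of the method added by --checkSerialization: serialize the
// wrapped object to a byte array and deserialize it again, forcing
// execution of the private writeObject/readObject methods.
public class SerializationWrapper {
    private ArrayList<String> wrapped = new ArrayList<>();

    public void add(String s) { wrapped.add(s); }

    public Object checkSerialization() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(wrapped);
            out.flush();
            ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
            return in.readObject();
        } catch (Exception e) {
            throw new AssertionError(e);  // any failure fails the test case
        }
    }
}
```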

\section{Comparison with Previous Results}

We compared Nighthawk with two well-documented systems in the
literature by running it on the same software and measuring
the results.

\subsection{Pure GA Approach}

To compare the results of our genetic-random approach with those
of the purely genetic approach of
Michael et al.\ \cite{michael-etal-ga-tcg}, we adapted their
published C code for the Triangle program to Java, transforming
each decision so that each condition and decision direction
corresponded to an executable line of code measurable by
Cobertura.  We then ran Nighthawk 10 separate times on the
resulting class.
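The transformation can be illustrated on a single decision; the code
below is a hedged sketch (not the actual Triangle source), showing how
each condition and each decision direction becomes a separate
executable line visible to a line-coverage tool.

```java
// Sketch of the decision transformation: the decision
// "if (a == b && b == c)" is rewritten so that each condition and
// each decision direction occupies its own executable line, making
// condition/decision coverage measurable as line coverage.
public class InstrumentedDecision {
    static boolean decide(int a, int b, int c) {
        boolean c1, c2, d;
        if (a == b) c1 = true;    // condition 1, true direction
        else        c1 = false;   // condition 1, false direction
        if (b == c) c2 = true;    // condition 2, true direction
        else        c2 = false;   // condition 2, false direction
        if (c1 && c2) d = true;   // decision, true direction
        else          d = false;  // decision, false direction
        return d;
    }
}
```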

%6782 gen 8
%6914 gen 10
%6527 gen 10
%5008 gen 7
%3346 gen 5
%1790 gen 2
%2486 gen 5
%7945 gen 13
%18363 gen 21
%2509 gen 4

We found that Nighthawk reached 100\% of feasible
condition/decision coverage on average after 8.5 generations, in
an average of 6.2 seconds of clock time\footnote{
All empirical studies in this paper were performed on a Sun
UltraSPARC-IIIi running SunOS 5.10 and Java 1.5.0\_11.
}.
%There were 22 decisions having a total of 28 condition
%and decision directions.
%All condition and decision directions were coverable
%except the false direction of the last ``if'' statement; Michael
%et al.\ found that their best method also achieved 95\% coverage.

Michael et al.\ had found that a purely random approach could
not achieve even 50\% condition/decision coverage.  The
discrepancy between our results and theirs may be because
Nighthawk found a better setting of the randomized testing
parameters than the one Michael et al.\ used.

\subsection{Model-Checking and Feedback-Directed Randomization}

To compare our results with those of the model-checking approach
of Visser et al.\ \cite{visser-etal-issta06} and the
feedback-directed random testing of
Pacheco et al.\ \cite{pacheco-etal-icse2007},
we downloaded the four data structure units used in those
studies.  The units had been hand-instrumented to record
coverage of the deepest basic blocks in the code.

We first wrote restricted test wrapper classes that called only
the methods called by the previous researchers.  We ran
Nighthawk giving these test wrapper classes as command-line
classes, and observed the number of instrumented basic blocks
covered, and the number of lines covered as measured by
Cobertura.

\begin{figure}

\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
		& \multicolumn{3}{c|}{Instr Blk Cov}
				  & Time
					  & \multicolumn{2}{c|}{Line Cov} \\
UUT		& JPF & RP  & NH  & (sec) & Restr & Full \\
\hline
BinTree		& .78 & .78 & .78 &  .58  & .84   &   1  \\
\hline
BHeap		& .95 & .95 & .95 &  4.1  & .88   &  .92 \\
\hline
FibHeap		&  1  &  1  &  1  &  5.1  & .74   &  .92 \\
\hline
TreeMap		& .72 & .72 & .72 &  5.4  & .76   &  .90 \\
\hline
\end{tabular}

\caption{Comparison of results on the JPF subject units.}
\label{jpf-results-fig}
\end{figure}

Figure \ref{jpf-results-fig} shows the results of the comparison.
Columns 2 and 3 show the maximum block coverage ratio achieved
by the best Java Pathfinder (JPF)-based technique from Visser et
al.\ and by Pacheco et al.'s tool Randoop, as reported in
\cite{pacheco-etal-icse2007}.  Column 4 shows the block coverage ratio
achieved by Nighthawk using the restricted test wrappers. 
Nighthawk was able to achieve the same coverage as the previous
tools.  Column 5 shows the clock time in seconds needed by
Nighthawk to achieve its greatest coverage, and column 6 shows
the line coverage reported by Cobertura at greatest coverage.
For BHeap and FibHeap, Nighthawk runs faster than JPF, but for
the other two units it runs slower than both JPF and Randoop.
This is not surprising, since for coverage information Nighthawk
relies on general-purpose Cobertura instrumentation (which slows
down programs) rather than the efficient, but
application-specific, hand instrumentation that the other
methods used.

We then ran Nighthawk giving the target classes themselves as
command-line classes (bypassing the test wrappers), and observed
the number of lines covered.  Column 7 shows the line coverage
reported by Cobertura when we did this.  When running directly
on the target classes, Nighthawk was able to cover significantly
more lines of code.  All the blocks covered by the previous
studies were also covered by Nighthawk running on the full
target classes.

\begin{figure}

\begin{tabular}{|c|c|c|c|}
\hline
		& \multicolumn{3}{c|}{Number of cond value combinations} \\
UUT		& Total & Unreached & Rch/Uncov \\
\hline
BinTree		&  34   &     6     &    0 \\
\hline
BHeap		&  75   &     0     &    5 \\
\hline
FibHeap		&  57   &    10     &    3 \\
\hline
TreeMap		&  157  &    31     &    19 \\
\hline
\end{tabular}

\caption{Multiple condition coverage of the subject units.}
\label{jpf-mcc-results-fig}
\end{figure}

Visser et al.\ and Pacheco et al.\ also studied ``a form of
predicate coverage \cite{ball-pred-coverage} that measures the
coverage of all combinations of a set of predicates manually
derived from conditions in the source code''.  In the subject
units, the measurement of the predicate coverage is tightly
linked to the underlying Java Pathfinder code.  We therefore
studied instead multiple condition coverage (MCC), a standard
and closely-related coverage metric which is, like predicate
coverage, intermediate in strength between decision/condition
and path coverage.

We instrumented the source code so that every combination of
values of conditions in every decision caused a separate line of
code to be executed.  We then ran Nighthawk on the resulting
units, thus effectively causing it to optimize MCC rather than
just line coverage.  The results are in Figure
\ref{jpf-mcc-results-fig}.  The second column lists the number
of logically possible condition value combinations in each unit.
The third column indicates the number of those condition
value combinations that corresponded to decisions that were
not reached even in the block coverage study.  The last column
indicates the number of condition value combinations that were
uncovered despite being in reached decisions.  The results
suggest that Nighthawk was achieving between 68\% and 93\% of
MCC, or between 84\% and 100\% when only reachable condition
combinations were considered.  These results compare favourably
with commercial practice, where ``code coverage of 70-80\% is a
reasonable goal for system test of most projects with most
coverage metrics'' \cite{cornett-minimum-coverage}.
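The MCC instrumentation can be sketched on a two-condition decision;
the decision itself is illustrative.

```java
// Sketch of the MCC instrumentation: every combination of condition
// values in a decision executes a distinct line, so line coverage of
// the instrumented unit measures multiple condition coverage.
public class MccInstrumented {
    static int combination(boolean c1, boolean c2) {
        int combo;
        if (c1 && c2)       combo = 0;  // combination TT
        else if (c1 && !c2) combo = 1;  // combination TF
        else if (!c1 && c2) combo = 2;  // combination FT
        else                combo = 3;  // combination FF
        return combo;
    }
}
```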

In summary, the comparison with previous results suggests that
Nighthawk was achieving good coverage with respect to the
results achieved by previous researchers, even when strong
coverage measures such as decision/condition and MCC were taken
into consideration.

\section{Case Study}

In order to study the effects of different test wrapper
generation and command-line switches to Nighthawk, we studied
the Java 1.5.0 Collection and Map classes; these are the 16
concrete classes with public constructors in the {\tt java.util}
package that inherit from the {\tt Collection} or {\tt Map}
interface.  The source files total 12137 LOC, and Cobertura
reports that 3512 of those LOC contain executable code.  These
units are ideal subjects because they are heavily used and
contain complex code, including templates and inner classes.

For each unit,
we generated test wrappers of two kinds:  plain test wrappers
(P), and enriched wrappers (E) generated with the
{\tt --checkTypedEquals} and {\tt --checkSerialization} switches
(see Section \ref{test-wrappers-section}).
We studied two different option sets for Nighthawk: with no
command-line switches (N), and with the {\tt --deep} switch
(see Section \ref{deep-section}) turned on (D).
For each $\langle$UUT, test wrapper, option set$\rangle$ triple, we
ran Nighthawk and saved the best chromosome it found.
For each triple, we then executed RunChromosome (see Section
\ref{run-chromosome-sec}) specifying that it generate 10 test
cases with the given chromosome, and we measured the coverage
achieved.

\begin{figure}
\begin{center}
\begin{tabular}{|l|c|c|c|c|c|}
\hline
Source file	& SLOC	& PN	& EN	& PD	& ED \\
\hline
ArrayList	& 150	& 111	& 140	& 109	& 140 \\
\hline
EnumMap		& 239	& 7	& 9	& 10	& 7 \\
\hline
HashMap		& 360	& 238	& 265	& 305	& 347 \\
\hline
HashSet		& 46	& 24	& 40	& 26	& 44 \\
\hline
Hashtable	& 355	& 205	& 253	& 252	& 325 \\
\hline
IdentityHashMap	& 392	& 156	& 196	& 283	& 333 \\
\hline
LinkedHashMap	& 103	& 27	& 37	& 28	& 96 \\
\hline
LinkedHashSet	& 9	& 6	& 6	& 7	& 9 \\
\hline
LinkedList	& 227	& 156	& 173	& 196	& 225 \\
\hline
PriorityQueue	& 203	& 98	& 123	& 140	& 155 \\
\hline
Properties	& 249	& 101	& 102	& 102	& 102 \\
\hline
Stack		& 17	& 17	& 17	& 17	& 17 \\
\hline
TreeMap		& 562	& 392	& 415	& 510	& 526 \\
\hline
TreeSet		& 62	& 44	& 59	& 41	& 59 \\
\hline
Vector		& 200	& 183	& 184	& 187	& 195 \\
\hline
WeakHashMap	& 338	& 149	& 175	& 274	& 300 \\
\hline
\hline
Total		& 3512	& 1914	& 2194	& 2487	& 2880 \\
\hline
Ratio		& 	& .545	& .625	& .708	& .820 \\
\hline
\end{tabular}
\end{center}

\caption{
  Coverage achieved by configurations of Nighthawk
  on the {\tt java.util} Collection and Map classes.
}
\label{collection-map-cov}
\end{figure}

\begin{figure}
\begin{center}
\begin{tabular}{|l|c|c|c|c|}
\hline
Source file	& PN	& EN	& PD	& ED \\
\hline
ArrayList	& 75	& 91	& 29	& 48 \\
\hline
EnumMap		& 3	& 9	& 6	& 5 \\
\hline
HashMap		& 63	& 37	& 136	& 176 \\
\hline
HashSet		& 25	& 29	& 27	& 39 \\
\hline
Hashtable	& 8	& 110	& 110	& 157 \\
\hline
IdentityHashMap	& 31	& 41	& 59	& 134 \\
\hline
LinkedHashMap	& 1	& 5	& 4	& 129 \\
\hline
LinkedHashSet	& 1	& 4	& 6	& 24 \\
\hline
LinkedList	& 32	& 61	& 41	& 53 \\
\hline
PriorityQueue	& 23	& 40	& 242	& 103 \\
\hline
Properties	& 104	& 19	& 49	& 47 \\
\hline
Stack		& 5	& 10	& 5	& 26 \\
\hline
TreeMap		& 80	& 131	& 231	& 106 \\
\hline
TreeSet		& 110	& 93	& 98	& 186 \\
\hline
Vector		& 106	& 83	& 156	& 176 \\
\hline
WeakHashMap	& 37	& 35	& 92	& 110 \\
\hline
\hline
Total		& 704	& 798	& 1291	& 1519 \\
\hline
\end{tabular}
\end{center}

\caption{
  Time in seconds taken by configurations of Nighthawk
  to achieve highest coverage
  on the {\tt java.util} Collection and Map classes.
}
\label{collection-times}
\end{figure}

Figure \ref{collection-map-cov} shows the results of this study.
The column labelled SLOC shows the total number of source lines
of code reported by Cobertura (including inner classes) in the
source file associated with the class.  Column PN shows the
SLOC covered by Nighthawk with the plain test wrappers and
no Nighthawk switches; column EN shows the effect of using the
enriched wrappers, and column PD the effect of using deep target
class analysis with the plain wrappers; column ED shows the effect of
using both the enriched wrappers and deep target class analysis.
The second last line shows the totals for each column, and
the last line shows the coverage ratio attained.

With enriched test wrappers and deep target class analysis,
Nighthawk performs well, achieving over 90\% coverage on 11 out
of the 16 classes, including code in inner classes, and
achieving 82\% coverage overall.
Paired $t$ tests with Bonferroni
correction (corrected $\alpha = .00833$) on each pair of columns
in the table
indicate that there are statistically significant differences
between every pair except the (EN, PD) pair.

Nighthawk performed particularly poorly on the {\tt EnumMap}
class.  Inspection revealed that this was because the main
{\tt EnumMap} constructor expects an enumerated type as one of
its parameters.
Nighthawk had no facility for supplying such a type, and so
only a few lines of error code in constructors were executed.
When we customized the test wrapper class so that it used a
fixed enumerated type, Nighthawk covered a more respectable
204 lines of code (ratio .854), raising the total coverage
ratio for all classes to .876.

%It also performed poorly on the {\tt Properties} class, due to
%the fact that it could not automatically generate a useful
%{\tt InputStream} argument for several of the methods.
%We expect that these deficiencies could be at least partially
%addressed by hand-customizing the test wrappers for the classes.

Figure \ref{collection-times} shows the amount of time taken by
Nighthawk on the various configurations.  We report the number
of seconds of clock time taken for Nighthawk to first achieve
its best coverage.  A series of paired $t$ tests of the same
type as described above showed that the only pairs of
columns that were different to a statistically significant level
were (PN, ED) and (EN, ED).  This suggests that generating the
enriched wrappers was worthwhile, since it allowed Nighthawk to
cover significantly more code without running significantly
longer.  The results for the deep target class analysis are more
equivocal, since although it also caused Nighthawk to cover
significantly more code, it took significantly longer.

\section{Threats to Validity}

Here we discuss the threats to validity of the empirical results
in this paper (the exploratory study, the comparisons and the
case study).

The representativeness of the units under test is the major
threat to external validity.  We studied Java collection
classes because these are complex, heavily-used units that have
high quality assurance requirements.  However, other units might
have characteristics that cause Nighthawk to perform poorly.
Randomized unit testing schemes in general require many test
cases to be executed, so they perform poorly on methods that do
a significant amount of disk I/O or generate threads.  They also
perform poorly on methods expecting string inputs that are
sentences in a grammar; we expect Nighthawk to do the same
unless it is given valid sentences as seed strings.

Nighthawk uses Cobertura, which measures line coverage, a weak
coverage measure.  The results that we obtained may not extend
to stronger coverage measures.  However, the Nighthawk algorithm
treats the coverage measurement as a black-box integer, and does
not perform special checks particular to line coverage.  The
comparison studies suggest that it still performs well when
decision/condition coverage and MCC are simulated.  There is
also the question of whether code coverage measures are a good
indication of the thoroughness of testing, or whether other,
black-box measures are necessary.  This is an area of active
debate in the software testing community.

Time measurement is a threat to construct validity.  We measured
time using Java's {\tt System.currentTimeMillis()}, which gives total
wall clock time rather than CPU time.  This may give run time
numbers that do not reflect the actual cost to the user of the
testing.

\section{Related Work}

\subsection{Genetic Algorithms for Test Data Generation}

There is a long history of meta-heuristic search methods such as
GAs being applied to the problem of test suite generation.
In Rela's review of 122 applications of meta-heuristic search
in software engineering \cite{rela04}, 44\% of the
applications were related to testing.  Notable examples are Michael
et al.'s evolutionary approach for generating code-level
test data \cite{michael-etal-ga-tcg} and Guo et al.'s GA approach
for generating UIO sequences for protocol testing
\cite{guo-etal-genetic}.

\begin{figure*}

\begin{center}
\begin{tabular}{c|ccccccc|ccccc}
$x|y$ & 1 & 2 & 3 & 4 & 5 & ~~~ &
$lo|hi$ & 1 & 2 & 3 & 4 & 5 \\
\cline{1-6} \cline{8-13}
1 & 1.0 & .00 & .00 & .00 & .00 & &
1 & 1.0 & .50 & .33 & .25 & .20 \\
2 & .00 & 1.0 & .00 & .00 & .00 & &
2 & .00 & 1.0 & .50 & .33 & .25 \\
3 & .00 & .00 & 1.0 & .00 & .00 & &
3 & .00 & .00 & 1.0 & .50 & .33 \\
4 & .00 & .00 & .00 & 1.0 & .00 & &
4 & .00 & .00 & .00 & 1.0 & .50 \\
5 & .00 & .00 & .00 & .00 & 1.0 & &
5 & .00 & .00 & .00 & .00 & 1.0 \\
\end{tabular}
\end{center}

\caption{Chance of selecting two identical integers $x$ and $y$.
(left) search space as space of $(x,y)$ pairs.
(right) search space as space of lower and upper bounds
for random generation of $x$ and $y$.}
\label{search-space-fig}
\end{figure*}

GAs use a modified hill-climbing strategy, in the sense that
they take already-good solutions and look for better solutions
close by in the search space.  Such strategies do not perform
well when the search space is mostly flat, with steep jumps in
score.  Consider the problem of generating two input values
$x$ and $y$ that will cover the true direction of the decision
``{\tt x==y}''.  If we cast the problem as a search for the
two values themselves, and the score as whether we have found
two equal values, the search space is shaped as in the
left-hand side of Figure \ref{search-space-fig}: a flat plain
with a narrow ridge along the diagonal.  Most approaches
to GA white-box test data generation address this problem by
proposing other measures that detect how close the target
decision is to being true.

Our approach is essentially to instead recast the problem as a
search for the best values of two variables $lo$ and $hi$ that
will be used as the lower and upper bound for random generation
of $x$ and $y$, and the score as the probability of generating
two equal values.  Seen in this way, the search space landscape
still contains a steep ``cliff'', as seen in the right-hand
side of Figure \ref{search-space-fig}, but the cliff is
approached on one side by a gentle slope.  Our approach also
deals with non-numeric data.
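The score on the right-hand side of Figure \ref{search-space-fig}
has a simple closed form: when $lo \le hi$, two independent uniform
draws from $[lo, hi]$ are equal with probability $1/(hi - lo + 1)$,
and the probability is zero when the range is empty.  A sketch:

```java
// Probability that two integers drawn uniformly and independently
// from [lo, hi] are equal: for each of the (hi-lo+1) values of x,
// exactly one of the (hi-lo+1) values of y matches it.
public class EqualDrawProbability {
    static double probEqual(int lo, int hi) {
        if (lo > hi) return 0.0;        // empty range: nothing to draw
        return 1.0 / (hi - lo + 1);
    }
}
```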

\subsection{Other Test Data Generation Approaches}

There is also a long history of approaches to test data
generation via symbolic execution, starting with the work of
King \cite{king-symbolic-1976} and Clarke
\cite{clarke-76-testdata}.
Other source code analysis-based approaches have used such
methods as iterative relaxation of a set of constraints on
input data \cite{gupta-etal-gen-test-data}, and generation
of unit test method call sequences using goal-directed
reasoning \cite{leow-etal-icse2004}.  Some recent approaches use
model checkers \cite{visser-etal-issta04}, sometimes augmented
with randomized search
\cite{visser-etal-issta06,godefroid-etal-dart,sen-etal-cute,owen-menzies-lurch}.
Groce et al.\ \cite{groce-etal-icse2007} have concluded that
randomized testing is a good first step, before model checking,
in achieving high quality software, especially where the
existence of a reference implementation allows differential
randomized testing \cite{mckeeman-differential}.

Our approach does not require source code or bytecode analysis,
instead depending only on the Java reflection mechanism, which
is well-supported and robust.  For instance, our code was
initially written with Java 1.4 in mind, but worked seamlessly
on the Java 1.5 versions of the {\tt java.util} classes, despite
the fact that the source code of many of the units had been
heavily modified to introduce templates.  However,
model-checking approaches have other strengths, such as the
ability to analyze multithreaded code
\cite{havelund-pathfinder}, which could compensate for the
weaknesses of randomized unit testing if they are used in tandem.  

\subsection{Randomized Testing}

As noted in the Introduction, randomized testing has a long
history, although it faces two main problems:  the oracle
problem and the question of whether randomized testing can be
thorough enough.

There are two main approaches to the oracle problem.  The first
is to use general-purpose, ``high-pass'' oracles that pass many
executions but check properties that should be true of most
software.  For instance, Miller et al.\ \cite{miller-fuzz-cacm}
fail only executions that crash or hang; Csallner and
Smaragdakis \cite{jcrasher-spe} judge a randomly-generated unit
test case as failing if it throws an exception; and Pacheco
et al.\ \cite{pacheco-etal-icse2007} check general-purpose
contracts for units.

The second approach is to write unit-specific oracles
\cite{andrews-etal-rute-rt} in order to check unit-specific
properties.  These oracles, like all formal unit specifications,
are non-trivial to write; tools such as Daikon for automatically
deriving likely invariants \cite{ernst-daikon} can help here.

Previous studies also show that randomized testing is effective
in forcing failures of the SUT.  Here, our research focus is not
on measuring effectiveness, but rather on optimizing
commonly-used objective measures (i.e.\ code coverage) that can
measure thoroughness even in the absence of failures.  If a
test engineer has achieved high coverage, then they have more
assurance that their software is correct, whether or not the
testing has forced failures.

%Daniel Saab on genetic / random testing of circuits
%\cite{saab-genetic-random}.

\section{Conclusions and Future Work}

Randomized unit testing is a promising technology that has been
shown to be effective, but whose thoroughness depends on the
settings of test algorithm parameters.  In this paper, we have
described Nighthawk, a system in which an upper-level genetic
algorithm automatically derives good values for parameters for a
lower-level randomized unit test algorithm.  We have shown that
Nighthawk is able to achieve high coverage of complex Java units
with high reliability requirements.

Future work includes the integration into Nighthawk of useful
facilities from past systems, such as failure-preserving or
coverage-preserving test case minimization, and further
experiments on the effect of program options on coverage and
efficiency.

\section{Acknowledgments}

Thanks to Willem Visser for making the source code of the Java
Pathfinder subject units available, and to the JDEAL development
team for their tool package.  Thanks also for interesting
comments and discussions to Rob Hierons and Andy Podgurski. 
This research is supported by the first author's grant from the
Natural Sciences and Engineering Research Council of Canada (NSERC).

\bibliographystyle{abbrv}
\bibliography{my}


\balancecolumns
\end{document}
