\section{Collars and Clumps}\label{sec:general}

\PARstart{T}his research assumes
that many models
can be controlled by a small number of key variables,
which we call {\em collars}. Collars restrict the behavior
of a model such that its state space {\em clumps};
i.e.
only a small number of states are used at runtime.
If so,
the output of a data miner could be simplified
to constrain just the collar variables that switch
the system between a few clumps.

To visualize collars, imagine an execution trace spreading
out across a program. Initially, the trace is very small and includes only
the inputs.
Later, the trace spreads wider as {\em upstream} variables
affect more of the {\em downstream} variables (the inputs being the most
{\em upstream} variables of all).
At some time after the appearance of the inputs,
the variables hold a set of values. Some of these values were derived from
the inputs while others may be default settings that, as yet, have not been
affected by the inputs. The union of those values at time $t$ is called
the {\em state} $s_t$ of the program at that time.

Multiple execution traces are
generated when the program is run multiple times with different inputs.
These traces reach different branches of the program. Those
branches
are selected by tests on conditionals at the root of each branch.
The {\em controllers} of a program are the variables that have different
settings at the roots of different branches in different traces.
Programs have {\em collars}
when a handful of the controllers
in
an early state $s_t$ control the settings
of the majority of the variables seen in later states.

A related effect to {\em collars} is {\em clumping}.
If a program
has $v$ variables,
each with range
$r$, then the maximum number of states is $r^v$. Programs {\em clump} when
most of those states are never {\em used} at runtime; i.e.
$|used|/\left(r^v\right)\approx 0$.
Clumps can cause collars:
\bi
\item
The
size of $used$ is the cardinality of the cross product of the ranges seen in the
controllers.
\item
If that cardinality
is large, many states will be generated
and programs won't clump.
\item
But if that cross product
is small, then the deltas between the states will be small -- in which case
controlling a small number of collar variables would suffice for
selecting which states are reached at runtime.
\ei
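To make the interplay between collars and clumps concrete, the following minimal Python sketch (a toy program of our own devising, not taken from any cited study) has $v=6$ binary variables, two of which act as controllers. Although $r^v=2^6=64$ states are possible, only the $2\times2=4$ states in the controllers' cross product are ever reached:

```python
import random

random.seed(0)

# Toy program: v = 6 binary variables. Two "controllers" (c1, c2)
# determine the values of the four downstream variables.
def run(c1, c2):
    d1 = c1 and not c2
    d2 = c1 or c2
    d3 = not c1
    d4 = (c1 == c2)
    return (c1, c2, d1, d2, d3, d4)

max_states = 2 ** 6  # r^v = 64 possible states
used = {run(random.choice([False, True]), random.choice([False, True]))
        for _ in range(10_000)}

# |used|/r^v = 4/64: the program "clumps", and the clumps are
# selected entirely by the two collar variables.
print(len(used), max_states)
```

Here $|used|$ never exceeds the size of the controllers' cross product, no matter how long the simulation runs.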

There is much theoretical and empirical evidence
that many
models contain \emph{collars} and \emph{clumps}.

\subsection{Theoretical Evidence}

With Singh~\cite{me03l}, we have shown
that collars are an expected property of Monte Carlo simulations where
the output has been discretized into a small number of output classes.
After such a discretization,
many of the inputs would reach the same
goal state, albeit by different routes.
The following diagram shows two possible distinct
execution paths within  a Monte Carlo simulation
both leading to the same goal; i.e.
$a \rightarrow goal$ or
$b \rightarrow goal$.

\[ \left.
\begin{array}{r}
\stackrel{a_1}{\longrightarrow} M_1\\
\stackrel{a_2}{\longrightarrow} M_2\\
 \ldots\\
\stackrel{a_m}{\longrightarrow} M_m
\end{array}
\right\} \stackrel{c}{\longrightarrow} goal_i
\stackrel{d}{\longleftarrow} \left\{
\begin{array}{r}
N_1\stackrel{b_1}{\longleftarrow} \\
N_2 \stackrel{b_2}{\longleftarrow}\\
N_3 \stackrel{b_3}{\longleftarrow}\\
N_4 \stackrel{b_4}{\longleftarrow}\\
 \ldots\\
N_n\stackrel{b_n}{\longleftarrow}
\end{array}
\right.
\]
Each of the lower-case terms in the above diagram represents a probability
of some event; i.e. $0 \le \{a_i,b_i,c,d,goal_i\} \le 1$.
For the two pathways to reach the $goal$, they must
satisfy the collar $M$ or the larger collar $N$
(each collar is a conjunction).
As the size of $N$ grows, the product
$\prod_{j=1}^{n} b_j$ decreases and it
becomes less likely that a random
Monte Carlo simulation will take the steps of the larger collar $N$.

The magnitude of this effect is quite remarkable.
Under a variety of conditions, the  narrower collar is thousands to millions of times
more likely.  For example, when $|M|=2$ and $N>M$, the condition for selecting
the larger collar is 
$\frac{d}{c}\ge 64$; i.e. the larger collar $N$ will be used
only when the $d$ pathway is dozens
of times more likely than $c$. The effect is more pronounced
as $|M|$ grows; at $|M|=3$ and $N>M$, the condition is $\frac{d}{c}\ge 1728$;
i.e. to select the larger collar $N$, the $d$ pathway must be thousands of
times more likely than $c$
(for more details, see~\cite{me03l}).
That is, when the output space is discretized
into a small number of outputs, and there
are multiple ways to get to
the same output,  then a randomized simulation (e.g. a Monte Carlo
simulation) will naturally select for small collars. 
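This selection effect is easy to reproduce. The following Python sketch (our own illustration; the uniform step probability of $0.5$ is an assumed value, not from~\cite{me03l}) simulates two conjunctive pathways to the same goal: a narrow collar of two steps and a wider collar of five steps. Since every step in a conjunction must fire, the narrow collar reaches the goal roughly $2^3=8$ times more often:

```python
import random

random.seed(1)

def reaches_goal(steps, p=0.5):
    # A conjunctive pathway succeeds only if every one of its steps fires.
    return all(random.random() < p for _ in range(steps))

TRIALS = 100_000
via_M = sum(reaches_goal(2) for _ in range(TRIALS))  # narrow collar, |M| = 2
via_N = sum(reaches_goal(5) for _ in range(TRIALS))  # wider collar, |N| = 5

print(via_M, via_N)  # success rates near 0.5^2 = 0.25 vs 0.5^5 = 0.03125
```

Raising the per-step odds of the wider pathway (the $d/c$ ratio in the text) is the only way to make the larger collar competitive.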

While the mathematics may be arcane, the intuition is simple.  
Suppose all the power goes out in a street of hotels. If all those
hotels were designed by different architects then their internal search spaces
would be different (number of rooms per floor, distance from each room
to a flight of stairs, number of stairwells, etc).
After half an hour, some of the guests bumble around in the dark and get
outside to the street.
Amongst the
guests that have reached the street, there would be more guests from
hotels with simpler internal search spaces (fewer rooms per floor, a shorter distance
from each room to the stairs, more stairwells).



As to {\em clumping},
Druzdel~\cite{druzdel94} observed this effect in
a medical monitoring system.
The system had
 525,312 possible internal states. However, at runtime, very few were ever reached.
In fact, the system remained in one
state 52\% of the time, and a mere
49 states were used 91\% of the time.
Druzdel showed mathematically that there is nothing unusual about his application.
If a model has $n$ variables, each with its own
assignment probability distribution of $p_i$, then the probability
that  the model will fall into a particular state is
$p=p_1p_2p_3...p_n=\prod^n_{i=1} p_i$.
By taking logs of both sides, this equation becomes
\begin{equation}\label{eq:dr}
\mbox{ln}~p = \mbox{ln}\prod^n_{i=1}p_i=\sum^n_{i=1}\mbox{ln}~p_i
\end{equation}
The asymptotic behavior of such a sum of random variables
is addressed by the central limit theorem.
In the case
where we know very little about a model, $p_i$ is uniform and many
states are possible. However, the {\em more} we know about the model,
the {\em less} likely it is that the distributions are uniform.
Given
enough variance
in the individual priors and conditional probabilities
$p_i$, the expected case is that the frequency with which
we reach states will exhibit a
log-normal distribution; i.e. a small fraction of states can be
expected to cover a large portion of the total probability space; and
the remaining states have practically negligible probability.   
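Equation~\ref{eq:dr} can be checked numerically. In the following Python sketch (with the skewed priors being our own assumed values, not Druzdel's data), twelve binary variables each take their ``usual'' value 90\% of the time. Of the $2^{12}=4096$ possible states, only a few hundred are ever sampled, and a single state soaks up over a fifth of the probability mass:

```python
import collections
import random

random.seed(0)

# 12 binary variables, each with a skewed (non-uniform) prior:
# a variable takes its "usual" value 90% of the time.
N_VARS, SKEW, SAMPLES = 12, 0.9, 50_000

def sample_state():
    return tuple(random.random() < SKEW for _ in range(N_VARS))

counts = collections.Counter(sample_state() for _ in range(SAMPLES))

possible = 2 ** N_VARS                              # 4096 possible states
used = len(counts)                                  # far fewer ever appear
top_share = counts.most_common(1)[0][1] / SAMPLES   # one state dominates
print(possible, used, top_share)
```

Making the priors uniform (\texttt{SKEW = 0.5}) destroys the effect: the samples then spread thinly across nearly all 4096 states.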

The assertion that many types of models display this clumping behavior
is quite important for the style of data mining (treatment learning) that we 
advocate. When applied to a model that clumps and has collars,
Monte Carlo simulation, followed by TAR2,
suffices to summarize that model in an effective way:
\bi
\item TAR2's rules never need to be bigger than the collars. Hence, if the collars
are small, TAR2's rules can also be small.
\item
If a model clumps, then, very quickly, a Monte Carlo simulation would sample
most of the reachable states. TAR2's summarization of that simulation would
then include most of the important details of a model.
\ei
 
\subsection{Empirical Evidence}

Empirical evidence for clumps 
first appeared in the 1950s.
Writing in 1959, Samuel
studied machine learning for the game of checkers~\cite{samuel59}.
At the heart of his program was a 32-term polynomial that scored
different
configurations. For example,     \emph{king center control} means that
a king occupies one of the center positions. The program learned
the coefficient weights for these terms. After 42 games, the program
had learned that 12 variables were important, although only 5 of
these were of any real significance.

Decades later, we can assert that
deleting irrelevant variables has proven to be a useful strategy in
many domains. For example, Kohavi and John report experiments on eight real-world
datasets where, on average, 81\% of the non-collar variables can
be ignored without degrading the performance of a model automatically
learned from the data~\cite{kohavi97}.
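The pruning idea is simple to sketch. In the following Python illustration (synthetic data of our own, not Kohavi and John's benchmarks), the class depends on a single ``collar'' variable among ten; scoring each variable by how well it alone predicts the class immediately separates the one relevant variable from the nine irrelevant ones:

```python
import random

random.seed(2)

# Synthetic data: 10 binary variables, but the class equals v0 (the "collar").
rows = []
for _ in range(1000):
    vs = [random.randint(0, 1) for _ in range(10)]
    rows.append((vs, vs[0]))

def solo_accuracy(i):
    # Best accuracy achievable by predicting the class from variable i alone.
    hits = sum(vs[i] == cls for vs, cls in rows)
    return max(hits, len(rows) - hits) / len(rows)

ranked = sorted(range(10), key=solo_accuracy, reverse=True)
print(ranked[0], solo_accuracy(ranked[0]))  # variable 0 ranks first, accuracy 1.0
```

Real feature-selection methods (including Kohavi and John's wrapper approach) are more elaborate, but the principle is the same: most variables can be dropped without hurting the learned model.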

If models contain collars, or if the internal state space clumps,
then most of the reachable states of a
program can be reached very quickly. This {\em early coverage} effect
has been observed many times.
In a
telecommunications application, Avritzer, Ros, and Weyuker found that
a sample of 6\% of all inputs to the system covered 99\% of all
inputs seen in about one year of operation (and a sample of just
over 12\% covered 99.9\%)~\cite{avri96}.
Further evidence for early
coverage can be found in the mutation testing literature.  In
mutation testing, some part of a program is replaced with a
syntactically valid, but randomly selected, variant (e.g. switching
``less than'' signs to ``greater than'').  This method of testing
is useful for getting an estimate of what percentage of errors have
been discovered by testing.  
Wong compared results using $X$\% of a library of mutators, randomly selected \mbox{($X \in \{10,15,\ldots,40,100\}$)}. Most of what could be learned from the
program could be learned using only $X=10$\% of the mutators; i.e.
after a very small number of mutators, new mutators acted in the
same manner as previously used mutators~\cite{wong95}. The same
observation has been made elsewhere by Budd~\cite{budd80}
and Acree~\cite{acree80}. 
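Wong's observation can be mimicked with a toy model (entirely our own construction, not data from~\cite{wong95}): suppose a library of 100 mutators collapses into just a few behavioral equivalence classes; then a 10\% random sample already exposes most of the classes that the full library would:

```python
import random

random.seed(3)

# Assume 100 mutators that collapse into only 8 distinct behavioral classes
# (i.e. many mutators act "in the same manner", as Wong reported).
mutators = [random.randrange(8) for _ in range(100)]

full_classes = set(mutators)                # classes seen using all mutators
sampled = set(random.sample(mutators, 10))  # classes seen using a 10% sample
print(len(sampled), len(full_classes))
```

Because the classes are so few, the sample's coverage approaches the full library's almost immediately; additional mutators are redundant.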

If the space of possible
execution pathways within a program is limited,
then program execution
would be observed to clump,
since it could only ever
repeat a few simple patterns.
Empirically, such limitations have been
observed in procedural and declarative systems.
Bieman and Schultz~\cite{bieman92}
report that 10 or fewer paths through programs explored 83\% of the
du-pathways (a du-path is a set of statements in a computer program
from a definition to a use of a variable; this is one common form
of structural coverage testing).
Harrold~\cite{harrold98}
studied the control graphs of  4000 Fortran routines and 3147 C functions.
Potentially, the size of a control graph may
grow as the square of the number of statements (in the case where
every statement is linked to every other statement). This research
found that, in these case studies, the size of the control graph
is a linear function of the number of statements.
In an analogous result,
Pel\'{a}nek reviewed the structures of dozens of
formal models and concluded that
the internal structure of those models
was remarkably simple:
``state spaces are
usually sparse, without hubs, with one large SCC [strongly connected
component], with small diameter \footnote{The diameter of a graph
(of a state space here) is the number of edges on the largest
shortest path between any two vertices.} 
and small SCC quotient''\footnote{SCC quotient is a measure of the complexity of a 
graph.}~\cite{pelanek04}.  This
sparseness of state spaces was observed previously by Holzmann,
who estimated the average degree of a vertex
in a state space to be two~\cite{holzmann90}.

Pel\'{a}nek hypothesizes that
these ``observed properties of state spaces are not the result of
the way state spaces are generated nor of some features of specification
languages but rather of the way humans design/model systems''~\cite{pelanek04}.
Pel\'{a}nek does not expand on this, but we assert that, generally, SE models are simple enough to be controlled by treatment learning
since they were written by humans with
limited short-term memories~\cite{miller56} who have difficulty
designing overly-complex models.


