\section{Collars and Clumps }\label{sec:general}

\PARstart{T}his research assumes that many models can be
controlled by a small number of key variables which we call
{\em collars}. Collars restrict the behavior of a model such
that its state space {\em clumps}; i.e. only a small number of
states are used at runtime.  If so, the output of a data miner
could be simplified to constrain just the collar variables that
switch the system between a few clumps.

To visualize collars, imagine an execution trace spreading out
across a program. Initially, the trace is very small and
includes only the inputs.  Later, the trace spreads wider as
{\em upstream} variables affect more of the {\em downstream}
variables (and the inputs are the most upstream variables of
all).  At some time after the appearance of the inputs, the
variables hold a set of values. Some of these values were
derived from the inputs while others may be default settings
that, as yet, have not been affected by the inputs. The union of
those values at time $t$ is called the {\em state} $s_t$ of the
program at that time.

Multiple execution traces are generated when the program is run
multiple times with different inputs.  These traces reach
different branches of the program. Those branches are selected
by tests on conditionals at the root of each branch.  The
{\em controllers} of a program are the variables that have
different settings at the roots of different branches in
different traces.  Programs have {\em collars} when a handful of
the controllers in an early state $s_t$ control the settings of
the majority of the variables seen in later states.
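As a concrete sketch of a controller (our own toy example, not drawn from any cited system; all variable names are hypothetical), consider a program where a single \emph{mode} variable sits at the root of a branch. Whatever the other inputs are, the value of that one variable fixes the downstream settings:

```python
# Toy sketch (ours): `mode` is a controller.  Its value at the root of
# the branch fixes all downstream variables, regardless of the other
# inputs x and y.
def tiny_program(mode, x, y):
    if mode == "fast":
        cache, retries, log_level = True, 0, "warn"
    else:
        cache, retries, log_level = False, 3, "debug"
    return (cache, retries, log_level)

# 200 distinct input vectors, but only two downstream configurations:
# the collar variable `mode` alone selects between them.
downstream = {tiny_program(m, x, y)
              for m in ("fast", "slow")
              for x in range(10) for y in range(10)}
print(len(downstream))  # 2
```

Here $x$ and $y$ vary over 100 combinations, yet contribute nothing to the later state; the one collar variable does all the controlling.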

A related effect to {\em collars} is {\em clumping}.  If a
program has $v$ variables, each with range $r$, then the maximum
number of states is $r^v$.  Programs {\em clump} when most of those
states are never {\em used} at runtime; i.e.
$|used|/\left(r^v\right)\approx 0$.
Clumps can cause collars:
\bi
\item The size of $used$ is bounded by the cardinality of the
  cross product of the ranges seen in the controllers.
\item If that cardinality is large, many states will be
  generated and programs won't clump.
\item But if that cross product is small, then the deltas
  between the states will be small -- in which case controlling
  a small number of collar variables would suffice for selecting
  what states are reached at runtime.
\ei
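These points can be checked numerically (a sketch of our own; the model and its two controllers are invented). With $v=8$ variables of range $r=4$, the nominal state space holds $4^8=65{,}536$ states, yet if only two controllers select the downstream values, the used states clump into at most $4\times 4=16$ configurations:

```python
import random
random.seed(1)

r, v = 4, 8                                 # range and variable count

def run(inputs):
    c1, c2 = inputs[0], inputs[1]           # the two controllers
    downstream = [(c1 + c2) % r] * (v - 2)  # everything else follows them
    return tuple([c1, c2] + downstream)

# sample 10,000 random input vectors and collect the reached states
used = {run([random.randrange(r) for _ in range(v)])
        for _ in range(10_000)}
print(len(used), len(used) / r ** v)        # 16 states; ratio ~0.0002
```

The size of $used$ equals the cross product of the controllers' ranges ($4\times 4$), so $|used|/r^v\approx 0$, exactly the clumping condition above.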

There is much theoretical and empirical evidence that many
models contain \emph{collars} and \emph{clumps}.

\subsection{Theoretical Evidence}

With Singh~\cite{me03l}, we have shown that collars are an
expected property of Monte Carlo simulations where the output
has been discretized into a small number of output classes. 
After such a discretization, many of the inputs would reach the
same goal state, albeit by different routes.  The following
diagram shows two possible distinct execution paths within  a
Monte Carlo simulation both leading to the same goal; i.e.
$a \rightarrow goal$ or $b \rightarrow goal$.

\[ \left.
\begin{array}{r}
\stackrel{a_1}{\longrightarrow} M_1\\
\stackrel{a_2}{\longrightarrow} M_2\\
 \ldots\\
\stackrel{a_m}{\longrightarrow} M_m
\end{array}
\right\} \stackrel{c}{\longrightarrow} goal_i
\stackrel{d}{\longleftarrow} \left\{
\begin{array}{r}
N_1\stackrel{b_1}{\longleftarrow} \\
N_2 \stackrel{b_2}{\longleftarrow}\\
N_3 \stackrel{b_3}{\longleftarrow}\\
N_4 \stackrel{b_4}{\longleftarrow}\\
 \ldots\\
N_n\stackrel{b_n}{\longleftarrow}
\end{array}
\right.
\]
Each of the lower-case terms in the above diagram represents a
probability of some event; i.e. $0 \le \{a_i,b_i,c,d,goal_i\} \le 1$.
For the two pathways to reach the $goal$, they must satisfy the
collar $M$ or the larger collar $N$ (each collar is a
conjunction). As the size of $N$ grows, the product
$\prod_{j=1}^{n} b_j$ decreases and it becomes less likely that a
random Monte Carlo simulation will take the steps of the larger collar $N$.

The magnitude of this effect is quite remarkable.  Under a
variety of conditions, the  narrower collar is thousands to
millions of times more likely.  For example, when $|M|=2$ and
$|N|>|M|$, the condition for selecting the larger collar is
$\frac{d}{c}\ge 64$; i.e. the larger collar $N$ will be used
only when the $d$ pathway is dozens of times more likely than
$c$. The effect is more pronounced as $|M|$ grows; at $|M|=3$
and $|N|>|M|$, the condition is $\frac{d}{c}\ge 1728$; i.e. to
select the larger collar $N$, the $d$ pathway must be thousands
of times more likely than $c$ (for more details,
see~\cite{me03l}).  That is, when the output space is
discretized into a small number of outputs, and there are
multiple ways to get to the same output,  then a randomized
simulation (e.g. a Monte Carlo simulation) will naturally select
for small collars. 
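This selection effect is easy to reproduce in a small Monte Carlo sketch (ours; the probabilities are invented, not taken from~\cite{me03l}). If each condition holds independently with probability $p=0.5$, a conjunction of two conditions (collar $M$) succeeds about $p^2=0.25$ of the time, while a conjunction of five (collar $N$) succeeds only $p^5\approx 0.03$ of the time:

```python
import random
random.seed(0)

p, trials = 0.5, 100_000
via_m = via_n = 0
for _ in range(trials):
    # collar M: a conjunction of 2 independent conditions
    via_m += all(random.random() < p for _ in range(2))
    # collar N: a conjunction of 5 independent conditions
    via_n += all(random.random() < p for _ in range(5))

print(via_m / via_n)  # roughly 8: random sampling reaches the goal
                      # via the narrower collar far more often
```

With equal per-condition probabilities, the narrower collar dominates by a factor of $p^{|M|-|N|}$; skewing the probabilities only strengthens the effect.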

While the mathematics may be arcane, the intuition is simple. 
Suppose all the power goes out in a street of hotels. If all
those hotels were designed by different architects then their
internal search spaces would be different (number of rooms per
floor, distance from each room to a flight of stairs, number of
stairwells, etc).  After half an hour, some of the guests bumble
around in the dark and get outside to the street.  Amongst the
guests that have reached the street, there would be more guests
from hotels with simpler internal search spaces (fewer rooms per
floor, a shorter distance from each room to the stairs, more stairwells).



As to {\em clumping}, Druzdel~\cite{druzdel94} observed this
effect in a medical monitoring system.  The system had 525,312
possible internal states. However, at runtime, very few were
ever reached.  In fact, the system remained in one state 52\% of
the time, and a mere 49 states were used 91\% of the
time. Druzdel showed mathematically that there is nothing
unusual about his application.  If a model has $n$ variables,
each with its own assignment probability distribution $p_i$,
then the probability that the model will fall into a particular
state is $p=p_1p_2p_3\ldots p_n=\prod^n_{i=1} p_i$.  By taking logs
of both sides, this equation becomes
\begin{equation}\label{eq:dr} \mbox{ln}~p = \mbox{ln}\prod^n_{i=1}p_i=\sum^n_{i=1}\mbox{ln}~p_i \end{equation}
The asymptotic behavior of such a sum of random variables is
addressed by the central limit theorem.  In the case where we
know very little about a model, $p_i$ is uniform and many states
are possible. However, the {\em more} we know about the model,
the {\em less} likely it is that the distributions are uniform.
Given enough variance in the individual priors and conditional
probabilities of the $p_i$, the expected case is that the frequency
with which we reach states will exhibit a log-normal
distribution; i.e. a small fraction of states can be expected to
cover a large portion of the total probability space, and the
remaining states have practically negligible probability.
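Equation~\ref{eq:dr} can be exercised with a small simulation (ours; the skewed priors are invented and are not Druzdel's data). Give each of $n=10$ variables a 90\% chance of holding its default value, sample many states, and count how much probability mass the most frequent few states carry:

```python
import random
from collections import Counter
random.seed(0)

n, samples = 10, 50_000

def draw_state():
    # each variable keeps its default (0) 90% of the time,
    # else takes one of nine other values
    return tuple(0 if random.random() < 0.9 else random.randrange(1, 10)
                 for _ in range(n))

counts = Counter(draw_state() for _ in range(samples))
top50 = sum(c for _, c in counts.most_common(50)) / samples
print(top50)  # well over half the mass on 50 of the 10**10 states
```

The all-defaults state alone carries $0.9^{10}\approx 35\%$ of the mass, and 50 states out of a nominal $10^{10}$ cover more than half of it; the rest are practically negligible, as the log-normal argument predicts.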

The assertion that many types of models display this clumping
behavior is quite important for the style of data mining
(treatment learning) that we advocate.  When applied to a model
that clumps and has collars, Monte Carlo simulation followed by
TAR2 suffices to summarize that model effectively:
\bi
\item TAR2's rules never need to be bigger than the collars.
  Hence, if the collars are small, TAR2's rules can also be small.
\item If a model clumps, then, very quickly, a Monte Carlo
  simulation would sample most of the reachable states. TAR2's
  summarization of that simulation would then include most of
  the important details of a model.
\ei
 
\subsection{Empirical Evidence}

Empirical evidence for clumps first appeared in the 1950s. 
Writing in 1959, Samuel studied machine learning for the game of
checkers~\cite{samuel59}.  At the heart of his program was a
32-term polynomial that scored different configurations.  For
example,     \emph{king center control} means that a king
occupies one of the center positions.  The program learned
weights for these variable coefficients.  After 42 games, the
program had learned that 12 variables were important, although
only 5 of these were of any real significance.

Decades later, we can assert that deleting irrelevant variables
has proven to be a useful strategy in many domains. For example,
Kohavi and John report experiments on eight real-world datasets
where, on average, 81\% of the non-collar variables can be
ignored without degrading the performance of a model
automatically learned from the data~\cite{kohavi97}.

If models contain collars, or if the internal state space
clumps, then much of the reachable parts of a program can be
reached very quickly. This {\em early coverage} effect has been
observed many times.  In a telecommunications application,
Avritzer, Ros, and Weyuker found that a sample of 6\% of all
inputs to the system covered 99\% of all inputs seen in about
one year of operation (and a sample of just over 12\% covered
99.9\%)~\cite{avri96}.  Further evidence for early coverage can
be found in the mutation testing literature.  In mutation
testing, some part of a program is replaced with a syntactically
valid, but randomly selected, variant (e.g. switching ``less
than'' signs to ``greater than'').  This method of testing is
useful for getting an estimate of what percentage of errors have
been discovered by testing.  Wong compared results using $X$\% of
a library of mutators, randomly selected
\mbox{($X \in \{10,15,\ldots,40,100\}$)}. Most of what could be
learned from the program could be learned using only $X=10$\% of
the mutators; i.e.  after a very small number of mutators, new
mutators acted in the same manner as previously used
mutators~\cite{wong95}. The same observation has been made
elsewhere by Budd~\cite{budd80} and Acree~\cite{acree80}. 
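Early coverage is what a clumped state space predicts, and a synthetic sketch (ours; the skewed distribution over reachable states is invented) shows the saturation: after just 10\% of the random runs, most of the states that will ever be reached have already been seen:

```python
import random
random.seed(0)

# skewed reachability: low-numbered states are far more likely,
# so a handful of the 20 reachable states dominate the runs
states = [min(int(random.expovariate(0.3)), 19) for _ in range(1000)]

seen = set()
coverage_at_10pct = 0.0
for i, s in enumerate(states):
    seen.add(s)
    if i == 99:  # after the first 10% of the runs
        coverage_at_10pct = len(seen) / len(set(states))
print(coverage_at_10pct)  # most reachable states already covered
```

The cumulative-coverage curve rises steeply and then flattens, mirroring the 6\%-covers-99\% observation above: further sampling mostly revisits states already seen.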

If the space of possible execution pathways within a program is
limited, then program execution would be observed to clump since
it could only ever repeat a few simple patterns.  Empirically,
such limitations have been observed in procedural and
declarative systems.  Bieman and Schultz~\cite{bieman92} report
that 10 or fewer paths through programs explored 83\% of the
du-pathways (a du-path is a set of statements in a computer
program from a definition to a use of a variable; this is one
common form of structural coverage testing).
Harrold~\cite{harrold98} studied the control graphs of  4000
Fortran routines and 3147 C functions.  Potentially, the size of
a control graph may grow as the square of the number of
statements (in the case where every statement is linked to
every other statement).  This research found that, in these case
studies, the size of the control graph is a linear function of
the number of statements.  In an analogous result, Pel\'{a}nek
reviewed the structures of dozens of formal models and concluded
that the internal structure of those models was remarkably
simple: ``state spaces are usually sparse, without hubs, with
one large SCC [strongly connected component], with small
diameter\footnote{The diameter of a graph (of a state space
here) is the number of edges on the largest shortest path
between any two vertices.} and small SCC quotient''\footnote{SCC
quotient is a measure of the complexity of a
graph.}~\cite{pelanek04}.  This sparseness of state spaces was
observed previously by Holtzmann where he estimate the average
degree of a vertex in a state space to be 2~\cite{holzmann90}.

Pel\'{a}nek hypothesizes that these ``observed properties of state
spaces are not the result of the way state spaces are generated
nor of some features of specification languages but rather of
the way humans design/model systems''~\cite{pelanek04}.
Pel\'{a}nek does not expand on this, but we assert that,
generally, SE models are simple enough to be controlled by
treatment learning since they were written by humans with
limited short-term memories~\cite{miller56} who have difficulty
designing overly-complex models.


