	\documentclass{llncs}
	\usepackage{rotating,wrapfig,cite,url,times,graphicx}
	\newcommand{\aive}{{\"aive}}

	\newenvironment{smallitem}
	 {\setlength{\topsep}{0pt}
	  \setlength{\partopsep}{0pt}
	  \setlength{\parskip}{0pt}
	  \begin{itemize}
	   \setlength{\leftmargin}{.2in}
	  \setlength{\parsep}{0pt}
	  \setlength{\parskip}{0pt}
	  \setlength{\itemsep}{0pt}}
	 {\end{itemize}}

	\newenvironment{smallenum}
	 {\setlength{\topsep}{0pt}
	  \setlength{\partopsep}{0pt}
	  \setlength{\parskip}{0pt}
	  \begin{enumerate}
	  \setlength{\leftmargin}{.2in}
	  \setlength{\parsep}{0pt}
	  \setlength{\parskip}{0pt}
	  \setlength{\itemsep}{0pt}}
	 {\end{enumerate}}


	 \newcommand{\bd}{\begin{description}}
	\newcommand{\ed}{\end{description}}
	 \newcommand{\bi}{\begin{smallitem}}
	\newcommand{\ei}{\end{smallitem}}
	 \newcommand{\be}{\begin{smallenum}}
	\newcommand{\ee}{\end{smallenum}}
	\newcommand{\fig}[1]{Figure~\ref{fig:#1}}
	\newcommand{\eq}[1]{Equation~\ref{eq:#1}}
	\newcommand{\tion}[1]{\S\ref{sec:#1}}

	\begin{document}


	\title{
	Finding Robust Solutions in Requirements  Models 
	}

	\author{ Gregory Gay\inst{1} \and 
	Tim Menzies\inst{1} \and
	Omid Jalali\inst{1} \and
	Gregory Mundy\inst{2} \and \\
	Beau Gilkerson \inst{1} \and
	Martin Feather\inst{3} \and  
	James Kiper\inst{4}
	\authorrunning{Gay et al.}
	\institute{West Virginia University, Morgantown, WV, USA\\
\and Alderson-Broaddus College, Philippi, WV\\
	\and Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA\\
	\and Dept. of Computer Science and Systems Analysis, Miami University, Oxford, OH, USA\vspace{0.1cm}
	{\footnotesize \email{greg@greggay.com, tim@menzies.us, jalali.omid@gmail.com, beau.gilkerson@gmail.com, mundyge@ab.edu, martin.s.feather@jpl.nasa.gov, kiperjd@muohio.edu }}
}}
	\maketitle
\vspace{-0.8cm}
	\begin{abstract}
Solutions to non-linear requirements engineering problems 
may be ``brittle''; i.e. small changes 
may dramatically alter solution effectiveness.
Hence, it is not enough simply to generate solutions to requirements problems; we must also
assess solution robustness.
The KEYS2 algorithm can generate {\em decision ordering diagrams}.
Once generated, these diagrams can assess
solution robustness in linear time. In experiments with real-world requirements engineering models, we show that KEYS2 can generate
decision ordering diagrams in
$O(N^2)$ time.
When assessed in terms of (a)~reducing inference times, (b)~increasing solution quality, and (c)~decreasing the variance of the generated solutions,
KEYS2 out-performs other search algorithms (simulated annealing, ASTAR, MaxWalkSat).
	\end{abstract}

	\pagestyle{plain}
	\section{Introduction}







Consider
a ``requirements model'' where stakeholders write:
\bi
\item
Their various goals;
\item
 Their different methods to reach those goals;
\item
Their view of the possible risks that compromise
those goals;
\item
 What mitigations they believe might reduce those risks.
\ei
A ``solution'' to such  models is a
{\em least}
cost set of mitigations that reduce the {\em most} risks, thereby enabling the {\em most}
requirements. 
In theory, software tools can find the solution
that best satisfies the various goals of our different stakeholders.
Such tools might find good solutions that were missed
by the stakeholders.
Finding solutions to these requirements models is a non-linear optimization problem 
({\em minimize} the sum of the mitigation costs while {\em maximizing} the number of achieved requirements).

There are many heuristic methods that can generate  solutions to non-linear problems (see {\em Related Work}, below).
However, such heuristic methods can be {\em brittle}; i.e. small changes may dramatically alter
the effectiveness of the solution. Therefore, it is important to understand the {\em neighborhood around the solution}.
A naive approach to understanding the neighborhood  might be
to run a system $N$
times then report:
\bi
\item
the solutions appearing in more than (say) $\frac{N}{2}$
cases;
\item
results with a $\pm95\%$ confidence interval.
\ei
Note that both of these approaches require multiple runs of an analysis method.
Multiple runs are undesirable since, in our experience~\cite{feather08}, stakeholders often
ask questions across a
range of ``scenarios''; i.e. hard-wired constraints that cannot be
changed in that scenario. For example, three 
scenarios might be ``what can be achieved assuming a maximum budget
of one, two, or five million dollars?''. Scenario analysis can be time consuming.
Reflecting over (say) $d=10$ possible decisions a statistically significant number of times (e.g. $N=20$)
requires up to $20 \times 2^{10} > 20{,}000$ repeats of the analysis.

Therefore, this paper proposes a rapid method for exploring decision neighborhoods. {\em Decision ordering diagrams}
are a visual representation of the effects of changing a solution. We show below that:
\bi
\item Using these diagrams, the region around a solution can be explored in linear time. 
\item A greedy Bayesian-based method called KEYS2 can generate the decision ordering diagrams in $O(N^2)$ time.
\item KEYS2 yields solutions of higher quality than several other methods
(simulated annealing, MaxWalkSat, ASTAR).
\item Also, the variance of the solutions found by KEYS2 is less (and hence, better) than those found by the other methods.
\ei
This paper is structured as follows:
\bi
\item After some background notes on solution robustness,
we describe 
the   {\em DDP} requirements
engineering tool used at NASA's Jet Propulsion Laboratory
(the case studies for this paper come from real-world early lifecycle DDP models);
\item
DDP inputs and outputs are then reviewed;
\item
Next, decision ordering diagrams are introduced;
\item
We then define and
compare five different algorithms for generating solutions from DDP models:
KEYS, KEYS2, Simulated Annealing, MaxWalkSat, ASTAR. 
Experimental results will show that KEYS2 out-performs the other methods (measured in terms of quickly generating high quality solutions that allow us to reflect over solution robustness).
\item
Finally, we offer some notes on related work and conclusions.
\ei
This paper extends a prior publication~\cite{jalali08} in several ways:
\bi
\item That paper did not address concerns of solution robustness. 
\item That paper did not explore a range of alternate algorithms.
\item This paper introduces KEYS2, an improved version of KEYS.
\item This paper offers extensive notes on related work.
\ei



\section{Background}
Harman~\cite{harman07} argues that many software engineering problems are over-constrained: no precise solution over all variables is
achievable; therefore, partial solutions based on heuristic search methods are preferred.
{\em Solution robustness} is a major problem for such partial heuristic searches.
The results of such partial heuristic search may be ``brittle''; i.e. small changes to the search results
may dramatically alter the effectiveness of the solution~\cite{harman01}.

When offering partial solutions, it is very important to
	also offer insight into the space of options around the proposed solution. Such neighborhood information is very useful for managers with only
	partial control over their projects: it gives them confidence that, even if only some of their recommendations are effected, at least
	the range of outcomes is well understood. Harman~\cite{harman07} comments that understanding the neighborhood of our solutions is an open and
	pressing issue in search-based software engineering (SBSE):
	\begin{quote}
		``In some software engineering applications, solution robustness may be as important as solution functionality. For example, it may be 
		better to locate an area of the search space that is rich in fit solutions, rather than identifying an even better solution that is 
		surrounded by a set of far less fit solutions.''
	\end{quote}

	\begin{quote}
		``Hitherto, research on SBSE has tended to focus on the production of the fittest possible results. However, many application areas require 
		solutions in a search space that may be subject to change. This makes robustness a natural second order property to which the research 
		community could and should turn its attention~\cite{harman07}.''
	\end{quote}

This paper reports a set of experiments on AI search for robust solutions.
Our experiments address
 two important concerns.
Firstly, {\em is demonstrating solution robustness a time-consuming task}?
Secondly, is it true, as Harman suggests, that
{\em solution quality must be traded off against solution robustness}?
 That is, in our search for the
conclusions that were stable within their local neighborhood, would we
have to reject better solutions because they  are  less stable across
their local neighborhood?

At least for the NASA models described in the next section, both of these concerns are unfounded.
KEYS2 terminates in hundredths of a second (whereas our prior implementations took minutes to terminate~\cite{fea02a}).
Also, the solutions found by KEYS2 were not only of the highest quality; they also exhibited the lowest variance.
Further, KEYS2 generates the decision ordering diagrams that can 
assess solution robustness in  linear time.




\section{Requirements Modeling  Using ``DDP''}

This section
introduces the
DDP requirements modeling tool~\cite{corn01,feather08}
used to interactively document the
``Team X'' early lifecycle
meetings at
NASA's Jet Propulsion Laboratory (JPL).
These meetings are the source of the real-world requirements models used in this paper.

At ``Team X'' meetings, 
a large and diverse group of up to 30 experts from various fields (propulsion, engineering,
communication, navigation, science, etc) meet for very short periods of time
(hours to days) to produce
a
``mission concept'' document. This document may commit the project to, say,
solar power rather than nuclear power, or a particular style of guidance software.
All of the subsequent work is based on the initial decisions documented in the mission concept. 



\begin{figure}[!t]
{\footnotesize
\begin{tabular}{p{\linewidth}}\hline
\be
\item Requirement goals:
\bi
\item
Spacecraft ground-based testing \& flight problem monitoring
\item
 Spacecraft experiments with on-board Intelligent Systems Health Management (ISHM)
\ei
\item Risks:
	\bi
	\item Obstacles to spacecraft ground-based testing \& flight problem monitoring
			\bi
			\item
			Customer has no, or insufficient, money available for our use
			\item
			Difficulty of building the models / design tools
			\ei
	\item ISHM Experiment is a failure (without necessarily causing flight failure)
	\item Usability, User/Recipient-system interfaces undefined
	\item V\&V (certification path) untried and scope unknown
	\item Obstacles to Spacecraft experiments with on-board ISHM
	\bi
	\item Bug tracking / fixes / configuration management issues, Managing revisions and upgrades (multi-center tech. development issue)
	\item Concern about our technology interfering with in-flight mission
	\ei
	\ei
\item Mitigations:
	\bi
	\item  Mission-specific actions
		\bi
		\item Spacecraft ground-based testing \& flight problem monitoring
		\item  Become a team member on the operations team
		\item Use Bugzilla and CVS
		\ei
	\item  Spacecraft experiments with on-board ISHM
		\bi
		\item Become a team member on the operations team
		\item Utilize xyz's experience and guidance with certification of his technology
		\ei
	\ei
\ee\\\hline
\end{tabular}}
\caption{Sample DDP requirements, risks, mitigations.}\label{fig:ddpegeg}
\end{figure}


DDP allows for the representation of the goals, risks, and
risk-removing mitigations that belong to a specific project.  
During  a Team X meeting,
users of DDP explore combinations of mitigations that cost the least
and support the most number of requirements.  
DDP propagates influences over matrices.
For example, here is a trivial DDP model
where {\tt mitigation1} costs \$10,000 to apply and each requirement is of equal
value (100). Note that the mitigation can remove 90\% of the risk. Also, unless mitigated,
the risk will disable 10\% and 99\% of requirements one and two (respectively):
\begin{equation}\label{eq:mod1}
\overbrace{mitigation1}^{\$10,000} \underbrace{\rightarrow}_{0.9} risk1 \rightarrow
\left< \begin{array}{c} \overbrace{\rightarrow}^{0.1} (requirement1=100) \\ 
\underbrace{\rightarrow}_{0.99} (requirement2=100) \end{array} \right.\end{equation}
The other numbers show the impact of mitigations on risks, and the impact of risks
on requirements. The core of DDP is two matrices: one for {\em mitigations*risks} and another for
{\em risks*requirements}.
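To make this propagation concrete, here is a worked evaluation of the trivial model above (a sketch that assumes an a~priori risk likelihood of one and the multiplicative propagation used by the compiled models shown later):

```latex
\[
\begin{array}{ll}
\textrm{with } mitigation1: & likelihood(risk1) = 1 - 0.9 = 0.1,\\
 & requirement1 = 100\,(1 - 0.1 \times 0.1) = 99,\\
 & requirement2 = 100\,(1 - 0.1 \times 0.99) = 90.1 \quad (cost = \$10{,}000);\\[2pt]
\textrm{without } mitigation1: & likelihood(risk1) = 1,\\
 & requirement1 = 100\,(1 - 0.1) = 90,\\
 & requirement2 = 100\,(1 - 0.99) = 1 \quad (cost = \$0).
\end{array}
\]
```

That is, spending \$10,000 on the mitigation raises the total attained value from $91$ to $189.1$.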





DDP is used as follows.
A dozen experts, or more,
gather together for short, intensive knowledge
acquisition sessions (typically, 3 to 4 half-day sessions). These
sessions \textit{must} be short since it is hard to gather together
these experts for more than a very short period of time. 
The DDP tool supports a graphical interface for the rapid entry of the
assertions. Such rapid entry is essential: time is crucial, and no tool should slow the debate.
Therefore,
DDP uses a lightweight
representation for its models. Such representations are essential for early lifecycle decision making since
only high-level assertions can be collected in short knowledge acquisition
sessions 
(if the assertions get more elaborate, then experts
may be unable to understand technical arguments from outside their
own field of expertise). 
Therefore,
DDP uses the following lightweight ontology:
\bi
\item \textit{Requirements} (free text) describing the objectives
and constraints of the mission and its development process; 
\item \textit{Weights} (numbers) of each requirement, reflecting
their relative importance; 
\item \textit{Risks} (free text) are
events that damage requirements; 
\item \textit{Mitigations}: (free
text) are actions that reduce risks; 
\item \textit{Costs}: (numbers) effort associated with mitigations, and repair costs for
correcting Risks detected by Mitigations; 
\item \textit{Mappings}: directed edges between
requirements, mitigations, and risks that capture quantitative relationships among them;
\item \textit{Part-of relations}
structure the collections of requirements, risks, and mitigations.
\ei
Note that DDP models
are the same as the ``requirements models'' we defined in the introduction.
For examples of risks, requirements, and mitigations, see \fig{ddpegeg}. 
For an example of the network of connections between risks and requirements and mitigations,
see
\fig{DDPexample}.


\begin{figure}
\begin{center}
\includegraphics[angle=90, height=6in,width=5in]{fig2new.png}
\end{center}
\caption{An example of a model formed by the DDP tool. Red lines connect 
risks (middle) to requirements (left). Green lines connect mitigations (right)
to the risks.} 
\label{fig:DDPexample}
\end{figure}
Sometimes, we are asked if the analysis of DDP requirements models is a  real
problem.  The usual question is something like: ``With these
ultra-lightweight languages, aren't all open issues just obvious?''.
Such a question is usually informed by the small model fragments
that appear in the ultra-lightweight modeling literature. Those
sample model fragments are typically selected according to their
ability to fit on a page or to succinctly illustrate some point of
the authors.  Real world ultra-lightweight models can be much more
complex, paradoxically perhaps due to their simplicity: if a model
is easy to write then it is easy to write a lot of it.  \fig{DDPexample},
for example, was generated in under a week by four people discussing
one project.  It is complex and densely-connected (a close inspection
of the left and right hand sides of \fig{DDPexample} reveals the
requirements and fault trees that inter-connect concepts in this
model) and it is, by no means, the biggest or most complex DDP model
that has ever been built. 


We base our experimentation around DDP for three reasons.
Firstly,
one potential drawback with ultra-lightweight models is that they are excessively lightweight and contain no useful
information. DDP's models are demonstrably useful (that is, we are optimizing a real-world problem of some value).
Clear project improvements have been seen from DDP sessions at JPL. Cost savings in at least two sessions have exceeded \$1 million, while savings of over \$100,000 have resulted in others~\cite{feather08}. Cost savings are not the only benefits of these DDP sessions. 
Numerous design improvements such as savings of power or mass have come out of DDP sessions.
Likewise, a shifting of risks has been seen from uncertain architectural ones to more predictable and manageable ones. 
At some of these meetings, non-obvious significant risks have been identified and subsequently mitigated. 

Our second reason to use DDP is that we
can access numerous real-world requirements models
written in this format, both now and in the future.
The DDP tool can be used to document not just final decisions 
but also to review the rationale that led to those decisions.
Hence, DDP remains in use at JPL:
not only for its original purpose (group decision support),
but also as a design rationale tool to document decisions. 
Recent DDP sessions included:
\bi
\item An identification of the challenges of intelligent systems health management (ISHM) technology maturation
(to  determine the most cost-effective approach to achieving maturation)~\cite{feather08b};
\item A study on the selection and planning of deployment of prototype software~\cite{feather08c}.
\ei

Our third, and most important reason to use DDP in our research is that the
tool is representative of  other requirements modeling tools in widespread use.
At its core, DDP is a set of influences expressed in a hierarchy, augmented with the occasional equation. Edges
in the hierarchy have weights that strengthen or weaken influences that flow along those edges. At this level of abstraction, 
DDP
is just another form of QOC~\cite{shum94} or a quantitative variant of Mylopoulos's
qualitative soft goal graphs~\cite{mylo99}.

\section{Model Inputs and Outputs}
\begin{figure}
~\hrule~
{\scriptsize
\begin{verbatim}
#include "model.h"

#define M_COUNT 2
#define O_COUNT 3
#define R_COUNT 2

struct ddpStruct
{
    float oWeight[O_COUNT+1];
    float oAttainment[O_COUNT+1];
    float oAtRiskProp[O_COUNT+1];
    float rAPL[R_COUNT+1];
    float rLikelihood[R_COUNT+1];
    float mCost[M_COUNT+1];
    float roImpact[R_COUNT+1][O_COUNT+1];
    float mrEffect[M_COUNT+1][R_COUNT+1];
};

ddpStruct *ddpData;

void setupModel(void)
{
    ddpData = (ddpStruct *) malloc(sizeof(ddpStruct));
    ddpData->mCost[1]=11;
    ddpData->mCost[2]=22;
    ddpData->rAPL[1]=1;
    ddpData->rAPL[2]=1;
    ddpData->oWeight[1]=1;
    ddpData->oWeight[2]=2;
    ddpData->oWeight[3]=3;
    ddpData->roImpact[1][1] = 0.1;
    ddpData->roImpact[1][2] = 0.3;
    ddpData->roImpact[2][1] = 0.2;
    ddpData->mrEffect[1][1] = 0.9;
    ddpData->mrEffect[1][2] = 0.3;
    ddpData->mrEffect[2][1] = 0.4;
}

void model(float *cost, float *att, float m[])
{
    float costTotal, attTotal;
    ddpData->rLikelihood[1] = ddpData->rAPL[1] * (1 - m[1] * ddpData->mrEffect[1][1]) 
        			* (1 - m[2] * ddpData->mrEffect[2][1]);
    ddpData->rLikelihood[2] = ddpData->rAPL[2] * (1 - m[1] * ddpData->mrEffect[1][2]);
    ddpData->oAtRiskProp[1] = (ddpData->rLikelihood[1] * ddpData->roImpact[1][1]) 
        			+ (ddpData->rLikelihood[2] * ddpData->roImpact[2][1]);
    ddpData->oAtRiskProp[2] = (ddpData->rLikelihood[1] * ddpData->roImpact[1][2]);
    ddpData->oAtRiskProp[3] = 0;
    ddpData->oAttainment[1] = ddpData->oWeight[1] * (1 - minValue(1, ddpData->oAtRiskProp[1]));
    ddpData->oAttainment[2] = ddpData->oWeight[2] * (1 - minValue(1, ddpData->oAtRiskProp[2]));
    ddpData->oAttainment[3] = ddpData->oWeight[3] * (1 - minValue(1, ddpData->oAtRiskProp[3]));
    attTotal = ddpData->oAttainment[1] + ddpData->oAttainment[2] + ddpData->oAttainment[3];
    costTotal = m[1] * ddpData->mCost[1] + m[2] * ddpData->mCost[2];

    *cost = costTotal;
    *att = attTotal; 
}
\end{verbatim}}
~\hrule~
\caption{A trivial DDP model after knowledge compilation}
\label{fig:ddpExample}
\end{figure}

Before describing experimental comparisons of different methods for generating decision ordering diagrams,  we will first offer more details on the
DDP
models.
\subsection{Pre-Processing}
To enable fast runtimes,
a simple compiler  exports the DDP models into a form accessible by our algorithms.
This compiler stores a flattened form of the DDP requirements tree. 
In our compiled form, all computations are  performed once 
and added as a constant to each reference of the requirement.
For example, the compiler converts the trivial model of \eq{mod1}
into {\tt setupModel} and {\tt model} functions 
similar to those in \fig{ddpExample}.
The {\tt setupModel} function is called only once and sets several constant values. 
The {\tt model} function is called whenever new cost and attainment values are needed.
The topology of the mitigation network is represented as terms in equations within these functions. 
As our models grow more complex, so do these equations. For example,
our biggest model, which contains 99 mitigations, generates 1427 lines of code.
\fig{ddpModels} compares the largest model to four other real-world DDP models.

\begin{figure}[!t]
\small
\begin{center}
\begin{tabular}{|c|r|r|r|r|} \hline
Model&LOC&Objectives&Risks&Mitigations\\ \hline
model1.c&  55  &3         &2    &2          \\ \hline
model2.c& 272   &1         &30   &31         \\ \hline
model3.c& 72   &3         &2    &3         \\ \hline
model4.c& 1241   &50        &31   &58         \\ \hline
model5.c&1427    &32        &70   &99         \\ \hline
\end{tabular}
\end{center}
\caption{Details of Five DDP Models. }
\label{fig:ddpModels}
\end{figure}

Currently it takes
about two seconds to compile a model with 50 requirements, 31 risks, and 58 mitigations. 
This compilation only has to happen once, after which we will run our $2^{|d|}$ what-if scenarios.
While this is not a significant bottleneck, the current compiler (written in
unoptimized Visual Basic code) can certainly be sped up.
Experts usually change a small portion of the model then run $2^{|d|}$ what-if scenarios to
understand the impact of that change. Therefore, an incremental compiler
(that only updates changed portions) would run much faster than a full 
compilation of the entire DDP model.


\subsection{Objective Function}\label{sec:of}

When the {\tt model} function is called, a pairing of the total cost of the selected mitigations and the number of reachable requirements (attainment) is returned. 
All of our algorithms then use that information to obtain a ``score'' for the current set of mitigations. 
The two numbers are normalized to a single score that represents the distance to a {\em sweet spot} of maximum requirement attainment and minimum cost: 
\begin{equation}\label{eq:score}
score = \sqrt{\overline{cost}^2 + (\overline{attainment}-1)^2}
\end{equation}

Here, $\overline{x}$ is a normalized value:
$0 \le \frac{x - \min(x)}{\max(x) - \min(x)} \le 1$.
Since the score measures the distance to the sweet spot, it ranges
$0 \le score \le \sqrt{2}$ and {\em lower} scores are {\em better}.

\subsection{Decision Ordering Diagrams}\label{sec:dod}
The objective function described above summarizes {\em one} call to a DDP model. 
This section describes
{\em decision ordering diagrams}, which are a tool for summarizing the 
results of thousands of calls to DDP models.


Consider some recommendation for changes to a project that requires decisions $d$ of size $|d|$.
In the general case, $d$ is a subset of the space of all possible decisions $D$ ($d \subseteq D$).
When checking for solution robustness, or reflecting over modifications to
$d$, a stakeholder may
need to consider up to $N^{|d|}$ possibilities (where $N=2$ for binary decisions
of the form ``should I or should I not do this'').
This can be a slow process, especially if evaluating each decision 
requires invoking a
complex and slow simulator.

Decision ordering diagrams are a linear time method for studying the
robustness and neighborhood of a set of decisions.
The diagrams assume that some method could offer a {\em linear ordering} of the decisions $x \in d$
ranked from 
{\em most-important} to {\em least-important}.
They also assume that some
method
offers
information on the effects of applying the top-ranked $1\le x \le |d|$
decisions
(e.g. the median and variance seen in the model's
objective function after applying solution \mbox{$\{d_1 .. d_x\}$}).
For example, the {\em decision ordering diagram} of \fig{actiono} shows such a linear ordering
(this figure presents
$benefit$  and $cost$ results).
In that figure:
\bi
\item
 The x-axis denotes the number of decisions made.
\item
The y-axis shows performance statistics of an objective function seen after imposing the conjunction of decisions
$1 \le i \le x$.  
\ei
For performance statistics, we run some objective function and
report the median (the 50th percentile) and the {\em spread} (the range given by
the 75th percentile minus the 50th percentile).
We use median and spread to avoid any parametric assumptions.


\begin{figure}[!t]
\begin{center}
\includegraphics[width=3in]{dod.png}
\end{center}
\caption{A Decision Ordering Diagram. The median and spread plots show the 50th percentile and the (75th$-$50th) percentile range (respectively) of the values generated
from some objective function.}\label{fig:actiono}
\end{figure}
These diagrams can comment on the robustness and neighborhood of
solution $\{d_1.. d_x\}$ as follows:
\bi
\item By considering the variance of the performance statistics after applying $\{d_1 .. d_x\}$.
\item By comparing the results of using the first $x$ decisions to that of using the first $x-1$ or $x+1$
actions.
\ei
The neighborhood of a solution that uses decisions $\{d_1..d_x\}$ consists of the solutions
that use the decisions $\{d_1..d_{x\pm j}\}$. Since $j$ is bounded by $0 \le j \le |d|-1$, this means that
{\em reflecting
over solution neighborhoods takes time linear in the number of decisions $d$}.

Decision ordering diagrams are a natural representation for
``trade studies,''
the activity of a multidisciplinary team to identify the most balanced 
technical solution among a set of proposed viable solutions~\cite{faa06}.
For example, minimum costs and maximum benefits 
are achieved at point $x_2$ of \fig{actiono}. However, after applying only half the decisions (see $x_1$)
most of the benefits could be achieved, albeit at a somewhat higher cost.

Decision ordering diagrams are {\em useful} under at least three conditions:
\bi
\item  
The scores output by the objective functions are {\em well-behaved}; i.e. move smoothly to
a plateau.
\item
The decisions {\em tame} the variance; i.e.
the spread falls to a value much lower than the median (otherwise, it is hard to show that the decisions
have any effect on the system performance).
\item They are generated in a {\em timely} manner.
Fast runtimes are required in order to keep up with a fast-moving discussion.
\ei
According to these definitions, \fig{actiono} is a {\em useful} decision ordering diagram if it can be generated in a {\em timely} manner.

It is an open issue whether real-world requirements models generate useful
decision ordering diagrams.
The following
experiments test if, in practice, decision ordering
diagrams generated from  
real world requirements models are {\em timely} to generate while being  {\em well-behaved} and {\em tame}.




\section{Searching for Solutions}

Our experiments compare the results of numerous algorithms.
We selected these comparison algorithms with much care. 
Numerous researchers have stressed the difficulties associated
with comparing radically different algorithms.
For example, Uribe and Stickel~\cite{uribe94}
tried to make some general statements about the value
of Davis-Putnam (DP) proof procedures or binary decision diagrams
(BDDs) for constraint satisfaction problems. In the end, they
doubted that it was fair to compare
algorithms that perform constraint satisfaction and no search (like BDDs) with
methods that perform search and no constraint satisfaction (like DP).
For this reason, model checking researchers like Holzmann (pers. communication)
eschew comparisons of tools like SPIN~\cite{holzmann97}, which are search-based,  with tools like NuSMV~\cite{nusmv}, which are BDD-based.
Hence, we take care to select only algorithms that are similar to KEYS.

In terms of the Gu et al. survey~\cite{gu97}, our selected
algorithms
(simulated annealing, ASTAR, and MaxWalkSat) share several properties with KEYS and KEYS2.
They are each discrete, sequential, {\em unconstrained} algorithms
(constrained algorithms
work towards a  pre-determined number of possible solutions 
while unconstrained methods are allowed to adjust to the goal space).


For full details on simulated annealing, ASTAR, and MaxFunWalk, see below.
We observe that these algorithms share the property that at each step
of their processing, they comment on all model inputs. KEYS2, on the other hand,
explores the consequences of setting only a subset of the possible inputs.

\subsection{SA: Simulated Annealing}

Simulated Annealing is a classic stochastic search algorithm. It was first described in 1953~\cite{metropolis53} and refined in 1983~\cite{kirkpatrick83}.
The algorithm's namesake, annealing, is a technique from metallurgy in which a material is heated
and then cooled. The heat causes the atoms in the material to
wander randomly through different energy states, and the cooling process increases the chances of finding a state with a lower energy than the starting position.

\begin{figure}
~\hrule~
{\scriptsize
\begin{verbatim}
1. Procedure SA
2. MITIGATIONS:= set of mitigations
3. SCORE:= score of MITIGATIONS
4. while TIME < MAX_TIME && SCORE < MIN_SCORE //minScore is a constant score (threshold)
5.    find a NEIGHBOR close to MITIGATIONS
6.    NEIGHBOR_SCORE:= score of NEIGHBOR
7.    if NEIGHBOR_SCORE > SCORE
8.        MITIGATIONS:= NEIGHBOR
9.        SCORE:= NEIGHBOR_SCORE
10.   else if prob(SCORE, NEIGHBOR_SCORE, TIME, temp(TIME, MAX_TIME)) > RANDOM
11.       MITIGATIONS:= NEIGHBOR
12.       SCORE:= NEIGHBOR_SCORE
13.   TIME++
14. end while
15. return MITIGATIONS
\end{verbatim}
}
~\hrule~
\caption{Pseudocode for SA}
\label{fig:simplesa}
\end{figure}

For each round, SA ``picks'' a neighboring set of mitigations. To calculate this neighbor, a function traverses the mitigation settings of the current state and randomly flips each of them (with a 5\% chance).
If the neighbor has a better score, SA moves to it and sets it as the current state. If it is not better, the algorithm decides whether to move to it based on the functions:
\begin{equation}\label{eq:ps1}\small
prob(w,x,y,temp(y,z)) = e^{((w-x) * \frac{y}{temp(y,z)})}
\end{equation}
\begin{equation}\label{eq:ps}\small
temp(y,z) = \frac{(z-y)}{z}
\end{equation}
If the value of the {\tt prob} function is greater than a randomly generated number, SA will move to that state anyway.
This randomness is the cornerstone of the simulated annealing algorithm. Initially, the ``atoms'' (current solutions)
will take large random jumps, sometimes even to sub-optimal new solutions. These random jumps allow simulated
annealing to sample a large part of the search space while avoiding being trapped in local minima.
Eventually,
the ``atoms'' cool and stabilize, and the search converges to a simple hill climber.

As shown in line 4 of \fig{simplesa}, the algorithm will continue to operate until the number of tries is exhausted or a score meets the threshold requirement. 

\subsection{MaxFunWalk}
The design of simulated annealing dates back to the 1950s. In
order to benchmark our own search engine (KEYS2) against a more state-of-the-art
algorithm, we implemented the variant of 
MaxWalkSat described below.

WalkSat is a local search method designed to address the problem
of boolean satisfiability~\cite{kautz96}. MaxWalkSat is
a variant of  that algorithm that applies weights to each clause in a
conjunctive normal form equation~\cite{selman96}. 
While WalkSat tries to satisfy the entire set of clauses, MaxWalkSat
tries to maximize the sum of the weights of the satisfied
clauses. 

In one respect, both algorithms can be viewed
as variants of simulated annealing. Whereas simulated annealing always selects
the next solution randomly, the WalkSat algorithms {\em sometimes}
perform random selection and, at other times, conduct a local search for the
next best setting of a single variable.
 
\begin{figure}
~\hrule~
{\scriptsize
\begin{verbatim}
1. Procedure MaxFunWalk
2. for TRIES:=1 to MAX-TRIES
3.    SELECTION:=A randomly generated assignment of mitigations
4.    for CHANGED:=1 to MAX-CHANGES
5.        if SCORE satisfies THRESHOLD return
6.        CHOSEN:= a random selection of mitigations from SELECTION
7.        with probability P
8.            flip a random setting in CHOSEN
9.        with probability (1-P)
10.           flip a setting in CHOSEN that maximizes SCORE
11.    end for
12. end for
13. return BESTSCORE
\end{verbatim}
}
~\hrule~
\caption{Pseudocode for MaxFunWalk}
\label{fig:MWF}
\end{figure}

MaxFunWalk is a generalization of MaxWalkSat:
\bi
\item
MaxWalkSat is defined over CNF formulae.  The success of a collection of variable settings is determined
by how many clauses are ``satisfiable'' (defined using standard boolean truth tables).
\item 
MaxFunWalk, on the other hand, assumes that there exists an arbitrary function 
that can assess a collection of variable settings.
Here, we use the DDP model as that
assessment function.
\ei
Note that
$MaxFunWalk=MaxWalkSat$ if the assessment is conducted via a logical 
truth table. 

The MaxFunWalk procedure is shown in \fig{MWF}. Before running, the user supplies an ideal cost and attainment; this setting is normalized, scored, and used as a goal threshold. If the current setting of mitigations satisfies that threshold, the algorithm terminates. 

MaxFunWalk begins by randomly setting every mitigation. From there, it will attempt to make a {\em single} change until the threshold is met or the allowed number of changes runs out (100 by default). 
A random subset of mitigations is chosen and a random number P between 0 and 1 is generated. The value of P  will decide the form that the change takes:
\bi
\item P$\le\alpha$: A stochastic decision is made. A setting is changed completely at random within the set CHOSEN.
\item P$>\alpha$: Local search is utilized. Each mitigation in CHOSEN is tested until one is found that improves the current score.
\ei

The best setting of $\alpha$ is domain-specific. For this study,
we used $\alpha=0.3$.
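The stochastic-versus-greedy choice can be sketched as follows. This is only an illustration: {\tt score} stands in for the DDP scoring function, and we follow the pseudocode's best-improvement version of the greedy step rather than a first-improvement scan.

```python
import random

def flip_one(chosen, settings, score, alpha=0.3):
    # One MaxFunWalk-style change. With probability alpha, flip one
    # randomly chosen mitigation; otherwise greedily flip the mitigation
    # in `chosen` that most improves the score. `score` is a stand-in
    # for the DDP model's scoring function.
    if random.random() <= alpha:
        m = random.choice(chosen)               # stochastic step
        settings[m] = not settings[m]
    else:
        best_m, best_s = None, score(settings)  # greedy local search
        for m in chosen:
            settings[m] = not settings[m]       # trial flip
            if score(settings) > best_s:
                best_m, best_s = m, score(settings)
            settings[m] = not settings[m]       # undo trial flip
        if best_m is not None:
            settings[best_m] = not settings[best_m]
    return settings
```

With $\alpha=0.3$, roughly one change in three is random; the rest follow the local gradient of the score.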

If the threshold is not met by the time that the allowed number of changes is exhausted, the set of mitigations is completely reset and the algorithm starts over. This measure allows the algorithm to avoid becoming trapped in local maxima. 
For the DDP models, we found that the number of retries has little effect on solution quality.

If the threshold is never met, MaxFunWalk continues resetting and making changes until the maximum number of allowed resets is exhausted. At that point, it returns the best settings found. 

As an additional measure to improve the results found by MaxFunWalk, a heuristic was implemented to limit the number of mitigations that could be set at one time. If too many are set, the algorithm will turn off a few in an effort to bring the cost factor down while minimizing the effect on the attainment. 


\subsection{A* (ASTAR)}

 A* is a best-first path finding algorithm that uses distance from origin ($G$)
and estimated cost to goal ($H$)
to find the best path~\cite{hart68}. The algorithm  is widely used~\cite{pearl84,norvig03,stout97,hui04}. 

A* is a natural choice for DDP optimization since the objective
function described above
is actually a Euclidean distance measure
to the desired goal 
of maximum attainment and minimum costs.  
Hence, for the second portion of the ASTAR heuristic, we can make direct use of \eq{score}.


The ASTAR algorithm keeps a {\em closed} list in order to prevent 
backtracking. We begin by adding the starting state to the closed list.
 In each ``round,'' a list of neighbors is populated from the series of possible states reached by making a change to a single mitigation. 
If that neighbor is not on the closed list, two calculations are made: 
\bi
\item G = Distance from the start to the current state plus the additional distance between the current state and that neighbor.
\item H = Distance from that neighbor to the goal (an ideal spot, usually 0 cost and a high attainment). 
For DDP models, we use \eq{score} to compute H.
\ei
The {\em best} neighbor is the one with the lowest F=G+H. The algorithm ``travels'' to that neighbor and adds it to the closed list. 
A*'s optimality guarantee requires that $H$ underestimate the true distance to the goal. 
Since the goal here is an idealized point (zero cost, maximum attainment), ASTAR never actually reaches it. 
Our implementation terminates once it stops
finding better solutions
for a total of ten rounds. This number was chosen to give ample time for ASTAR to become ``unstuck'' if it hits a corner early on.  
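One round of the neighbor selection described above can be sketched as follows; {\tt dist} and {\tt h} are placeholders for the step distance and the \eq{score}-based heuristic, not our actual implementation:

```python
def best_neighbor(current, neighbors, closed, g_so_far, dist, h):
    # One ASTAR round: among neighbors not on the closed list, pick the
    # one with the lowest F = G + H. `dist` (step distance) and `h`
    # (heuristic distance to the goal) are stand-in functions.
    best, best_f = None, float("inf")
    for n in neighbors:
        if n in closed:
            continue                        # never revisit closed states
        g = g_so_far + dist(current, n)     # distance from start via current
        f = g + h(n)                        # estimated total path cost
        if f < best_f:
            best, best_f = n, f
    return best, best_f
```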


\begin{figure}
~\hrule~
{\scriptsize
\begin{verbatim}
1. Procedure ASTAR
2. CURRENT_POSITION:= Starting assignment of mitigations
3. CLOSED[0]:= Add starting position to closed list
4.
5. while END = false
6.    BEST_F:=INFINITY
7.    NEIGHBOR_LIST:=list of neighbors
8.    for each NEIGHBOR in NEIGHBOR_LIST
9.        if NEIGHBOR is not in CLOSED
10.           G:=distance from start
11.           H:=distance to goal
12.           F:=G+H
13.           if F<BEST_F
14.               BEST_F:=F
15.               BEST_NEIGHBOR:=NEIGHBOR
16.   end for
17.   CURRENT_POSITION:= BEST_NEIGHBOR
18.   CLOSED[++]:=Add new state to closed list
19.   if STUCK
20.       END:= true
21. end while
22. return CURRENT_POSITION
\end{verbatim}
}
~\hrule~
\caption{Pseudocode for ASTAR}
\label{fig:ASTAR}
\end{figure}



\subsection{KEYS and KEYS2}
The core premise of KEYS and KEYS2 is that the above algorithms perform over-elaborate searches.
Suppose 
that the behavior of a large system is
determined by a small number of {\em key} variables. 
If so, then solutions can be found very rapidly by
(a)~finding these {\em keys}, then
(b)~exploring the ranges of the key variables.

%The following small example illustrates this idea.
%Within a model, there are chains of {\em reasons} linking inputs to desired goals.
%Some of the links clash with others. 
%Some of those clashes are most upstream; they are not dependent on other clashes. 
%In the following chains of reasoning 
%the clashes are $\{e, \neg e\}$, 
%$\{g, \neg g\}$ 
%\& $\{j,\neg j\}$;
%the most upstream clashes 
%are
%$\{e \neg e\}$, 
%\& $\{g \neg g\}$, 
%\[\begin{array}{r@{}l@{}l}
%&a \longrightarrow b \longrightarrow c \longrightarrow d \longrightarrow e\\
%input_1 \longrightarrow  &f \longrightarrow g \longrightarrow h \longrightarrow i \longrightarrow j & \longrightarrow goal\\
%input_2 \longrightarrow  &k \rightarrow \neg g \longrightarrow l \longrightarrow m \rightarrow \neg j & \longrightarrow goal\\
%&n \longrightarrow o \longrightarrow p \longrightarrow q \longrightarrow
%\neg e
%\end{array}
%\]
%In order to optimize decision making about this model, we must first decide about these {\em most upstream clashing reasons} (the ``keys'').
%Returning to the above reasoning chains,
%any of $\{a, b, ..q\}$ are subject to discussion.
%However, most of this model is completely irrelevant to the task of
%$input_i\vdash~goal$. For example, 
%the $\{e, \neg e\}$ clash is
%unimportant to the decision making process
%as no reason uses $e$ or $\neg e$. 
%In the context of reaching $goal$ from 
%$input_i$, the only important discussions are
%the clashes $\{g,\neg g,j,\neg j\}$.
%Further, since $\{j, \neg
%j\}$ are dependent on $\{g,\neg g\}$, then the core decision
%must be about variable $g$ with two disputed values: true and
%false. 
%
%Setting the keys reduces the number of reachable states within the model. 
%Formally, the reachable states reduce to
%the cross-product of all of the ranges of the collars.
%We call this the $clumping$ effect. Only a small fraction of the possible states are actually reachable.
%The effects of clumping can be quite dramatic.
%Without knowledge of these keys,  the above model has $2^{20} > 1,000,000$ possible consistent states.
%However, in the context of 
%$input_i\vdash~goal$,
%those 1,000,000 states clumps to just the following two states: 
%$\{input_1,f,g,h,i,j,goal\}$ or
%$\{input_2,k,\neg g,l,m,\neg j,goal\}$.
%
 
As documented in our {\em Related Work} section,
this notion of $keys$ has been
discovered and rediscovered
many times by many researchers.
Historically, finding the keys has been seen as a very hard task. For example, finding the keys is analogous to finding the {\em minimal environments}
of DeKleer's ATMS algorithm~\cite{deKleer86}. Formally, this is {\em logical abduction}, an NP-hard task~\cite{by91}.

Our method for finding the keys uses a Bayesian sampling method. If a model contains keys then, by definition, those variables must appear in all solutions
to that model. If model outputs are scored by some oracle, then the key variables are those with ranges that occur with very different frequencies
in high/low scored model outputs.
Therefore, we need not search for the keys; rather, we just need to keep frequency counts on how often ranges appear in $best$ or $rest$ outputs.

KEYS contains an implementation of this Bayesian sampling method. 
It has
two main components: a greedy search and the BORE ranking heuristic.
The greedy
search explores a space of $M$ 
mitigations over the course of 
$M$ ``eras.'' 
Initially, the entire set of mitigations is set randomly. 
During each era, one more
mitigation is set to $M_i=X_j$, $X_j \in \{true,false\}$. 
In the original version of KEYS~\cite{promise08}, the greedy search fixes one variable per era.
A newer variant, KEYS2, fixes an increasing number of variables as the search progresses
(see below for details).

In KEYS (and KEYS2), each era $e$ generates
a set $<input,score>$ as follows:
\be
\item[1:]
Repeat $MaxTries$ times:
\bi
\item
$Selected[1{\ldots}(e-1)]$ are settings from previous eras.
\item
	$Guessed$ are randomly selected values for unfixed mitigations.  
\item
$Input = Selected \cup Guessed$.
\item
Call {\tt model} to compute $score=ddp(input)$;
\ei
\item[2:] The $MaxTries$ $score$s are divided into the $\beta$\% ``best'' 
and the remainder become ``rest''. 
\item[3:]
The $input$ mitigation values are then scored using BORE
(described below).
\item[4:]
The top ranked mitigations (the default is one, but the user may fix multiple mitigations at once) are fixed and stored in $selected[e]$.\ee
\begin{figure}
~\hrule~
{\scriptsize
\begin{verbatim}
1. Procedure KEYS
2. while FIXED_MITIGATIONS != TOTAL_MITIGATIONS
3.    for I:=1 to 100
4.       SELECTED[1...(I-1)] = best decisions up to this step
5.       GUESSED = random settings to the remaining mitigations
6.       INPUT = SELECTED + GUESSED
7.       SCORES= SCORE(INPUT)
8.    end for   
9.    for J:=1 to NUM_MITIGATIONS_TO_SET 
10.      TOP_MITIGATION = BORE(SCORES)       
11.      SELECTED[FIXED_MITIGATIONS++] = TOP_MITIGATION 
12.   end for  
13. end while          
14. return SELECTED \end{verbatim} }
~\hrule~
\caption{Pseudocode for KEYS}
\label{fig:KEYS}
\end{figure}
The search moves to era $e+1$ and repeats steps 1,2,3,4.
This process stops 
when every mitigation has a setting.  
The exact settings for $MaxTries$ and $\beta$ must be set via engineering judgment.
After some experimentation, we used $MaxTries=100$ and $\beta=10$.
For full details, see \fig{KEYS}. 
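One era of this greedy search can be sketched in Python. This is a schematic only: {\tt ddp} stands in for the model's scoring function and {\tt bore\_rank} stands in for the BORE heuristic (described next), assumed to return the open mitigations ranked best-first as {\tt (mitigation, value)} pairs.

```python
import random

def keys_era(selected, all_mitigations, ddp, bore_rank,
             max_tries=100, n_to_fix=1):
    # One KEYS era: sample max_tries inputs that keep earlier decisions
    # fixed while guessing the rest, score them with the model, then fix
    # the top-ranked open mitigation(s). `ddp` and `bore_rank` are
    # stand-ins for the DDP model and the BORE ranking heuristic.
    open_mits = [m for m in all_mitigations if m not in selected]
    samples = []
    for _ in range(max_tries):
        guessed = {m: random.choice([True, False]) for m in open_mits}
        inputs = {**selected, **guessed}
        samples.append((inputs, ddp(inputs)))
    for m, value in bore_rank(samples, open_mits)[:n_to_fix]:
        selected[m] = value           # this era's decision(s)
    return selected
```

KEYS calls this with {\tt n\_to\_fix=1} in every era; KEYS2 increases {\tt n\_to\_fix} as the eras advance.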

KEYS ranks mitigations using a support-based
Bayesian ranking measure called BORE.
BORE~\cite{clark05} (short for ``best or rest'')
divides the numeric scores seen over $K$ runs, storing the top 10\% in $best$ and the remaining 90\% in the set $rest$
(the $best$ set is computed by studying the delta
of each score 
to the best score seen in any era).
It then computes the probability that a value is found in $best$
using
Bayes theorem.
The theorem uses
evidence $E$ and a prior probability $P(H)$ for
hypothesis $H\in\{best, rest\}$,
to calculate a  posteriori  probability
$P(H|E)=P(E|H) P(H)\; /\; P(E)$.
When applying the theorem, {\em likelihoods}
are computed from observed
frequencies. These likelihoods (called ``like'' below) are then normalized to create
probabilities. This normalization cancels out $P(E)$ in Bayes theorem.
For example, after $K=10,000$ runs are divided into  1,000 {\em best} solutions and 9,000
{\em rest}, the  value \mbox{$mitigation31=false$} might appear 10 times in the $best$ solutions, but only
5 times in the $rest$. Hence:\newline
\begin{minipage}{1.0\linewidth}
{\scriptsize
\begin{eqnarray}\nonumber
E&=& (mitigation31=false)\\\nonumber
P(best)&=&1000/10000 = 0.1\\\nonumber
P(rest)&=&9000/10000 = 0.9\\\nonumber
freq(E|best)&=&10/1000 = 0.01\\\nonumber
freq(E|rest)&=&5/9000 = 0.00056\\\nonumber
like(best|E)&=&freq(E|best) \cdot P(best)=0.001\\\nonumber
like(rest|E)&=&freq(E|rest) \cdot P(rest)= 0.000504\\\label{eq:ten}
P(best|E)&=& \frac{like(best|E)}{like(best|E) + like(rest|E)}=0.66
\end{eqnarray}}
\end{minipage}\newline
Previously~\cite{clark05},
we have found that  Bayes theorem
is a poor ranking heuristic
since it is easily distracted by low frequency evidence.
For example, note how
the probability of $E$ belonging to the best class
is moderately high even though its support is very low; i.e. $P(best|E)=0.66$ but
$freq(E|best) = 0.01$. 

To avoid the problem of unreliable low frequency evidence, we augment \eq{ten}
with a support term.
Support should $increase$ as the frequency of a value $increases$,  i.e. 
$like(best|E)$ is a valid support measure.
 Hence, step 3 of our greedy search ranks values  via
\begin{equation}\label{eq:ps3}\small
P(best|E) * support(best|E) = \frac{like(best|E)^2}{like(best|E) + like(rest|E)}
\end{equation}
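Using the counts from the worked example above (10 of 1,000 $best$ solutions, 5 of 9,000 $rest$), the support-weighted rank of \eq{ps3} can be checked directly; the sketch below uses exact fractions rather than the rounded figures shown in the text.

```python
def bore_rank_value(n_best_E, n_best, n_rest_E, n_rest):
    # Support-weighted BORE score (Equation 5) for one value E, built
    # from raw frequency counts: how often E appears in best/rest, and
    # the sizes of the best and rest sets.
    k = n_best + n_rest
    like_best = (n_best_E / n_best) * (n_best / k)   # freq(E|best)*P(best)
    like_rest = (n_rest_E / n_rest) * (n_rest / k)   # freq(E|rest)*P(rest)
    return like_best ** 2 / (like_best + like_rest)

# mitigation31=false: 10 of 1000 best solutions, 5 of 9000 rest
score = bore_rank_value(10, 1000, 5, 9000)
```

The squared numerator rewards values with high support, so a value seen only rarely in $best$ is ranked low even when its posterior probability is high.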

For each era, KEYS samples the DDP models and fixes the top $N=1$ 
setting. KEYS2 assigns progressively larger values of $N$:
in era 1, KEYS2 behaves exactly the same as KEYS,
while
in (say) era 3, KEYS2 will
fix the top 3 ranked ranges. 
Since it sets more variables at each era,
KEYS2 terminates earlier than KEYS.

Note that
decision ordering diagrams could be 
directly generated during execution,
just by collecting statistics from the {\tt SCORES} array used in
line 7  of \fig{KEYS}.
\section{Results}

Each of the above algorithms was tested on the five  models of \fig{ddpModels}.
Note that:
\bi
\item
Models one and three are trivially small. They were used to debug our 
code, but not in the core experiments.
We report our results using
models two, four and five since they are large enough to stress test real-time optimization.
\item
Model 4 was discussed in detail in~\cite{me03n}. 
The largest, model 5, was optimized previously in~\cite{fea02a}.
At that time (2002), it took
300 seconds to generate solutions  using our
old, very slow, rule learning method.
\ei

We also studied how well KEYS and KEYS2 scale to larger models.
Further, we instrumented KEYS and KEYS2 to generate decision ordering diagrams.
The results from all of these experiments are  shown below.

\subsection{Attainment and Costs}

We ran all of our algorithms 1000 times on each model. This number
was chosen because it yielded enough data points to give a clear
picture of the span of results. At the same time, it is a low enough
number that we can generate a set of results in a fairly short time
span.  

\begin{figure}[!b]\centering
\begin{center}\scriptsize
\begin{tabular}{c@{}c@{}c@{~}c}
&Model 2&Model 4 &Model 5 \\
& (31 mitigations)& (58 mitigations)&(99 mitigations)\\ \hline
\begin{sideways}~~~~~~~~SA\end{sideways}& \includegraphics[width=4.25cm]{WithLabel/2/SA.pdf} & \includegraphics[width=4cm]{WithLabel/4/SA.pdf} & \includegraphics[width=4cm]{WithLabel/5/SA.pdf} \\\hline 
\begin{sideways}~~MaxFunWalk\end{sideways}& \includegraphics[width=4.25cm]{WithLabel/2/MaxFunWalk.pdf} & \includegraphics[width=4cm]{WithLabel/4/MaxFunWalk.pdf} & \includegraphics[width=4cm]{WithLabel/5/MaxFunWalk.pdf} \\ \hline

\begin{sideways}~~~~~~ASTAR\end{sideways}&  \includegraphics[width=4.25cm]{WithLabel/2/AStar.pdf} & \includegraphics[width=4cm]{WithLabel/4/AStar.pdf} &  \includegraphics[width=4cm]{WithLabel/5/AStar.pdf} \\ \hline

 \begin{sideways}~~~~~~~~KEYS\end{sideways} &  \includegraphics[width=4.25cm]{WithLabel/2/KEYS.pdf} & \includegraphics[width=4cm]{WithLabel/4/KEYS.pdf} & \includegraphics[width=4cm]{WithLabel/5/KEYS.pdf} \\ \hline

\begin{sideways}~~~~~~~KEYS2\end{sideways}& \includegraphics[width=4.25cm]{WithLabel/2/KEYS-L.pdf} & \includegraphics[width=4cm]{WithLabel/4/KEYS-L.pdf} & \includegraphics[width=4cm]{WithLabel/5/KEYS-L.pdf} \\


\end{tabular}
\caption{1000 results of running five algorithms on three models (15,000 runs in all).
The y-axis shows
cost and  the x-axis shows attainment. 
The size of each model is measured in number of mitigations. 
Note that better solutions fall
towards the bottom right of each plot; i.e. lower costs and higher attainment.
Also better solutions exhibit less variance, i.e. are clumped tighter together.
\label{fig:finalModelResults}
}
\end{center}
\end{figure}

The results are pictured in ~\fig{finalModelResults}. Attainment
is along the x-axis and cost (in thousands) is along the y-axis.
Note that better solutions fall
towards the bottom right of each plot; i.e. lower costs and higher attainment.
Also, better solutions exhibit less variance; that is, the results
are clumped closely together.

These graphs give a clear picture of the results obtained by our
various algorithms. Two methods are clearly inferior:
\bi
\item Simulated annealing exhibits
the worst variance, lowest attainments, and highest costs.
\item MaxFunWalk is better than SA (less variance, lower costs, higher attainment)
but its variance is still far too high to use in any critical situation.
\ei
As to the others:
\bi
\item
On larger models such as model4 and model5, KEYS and KEYS2 exhibit 
lower variance, lower costs, and higher attainments than ASTAR.
\item On smaller models such as model2,
ASTAR usually produces higher attainments and lower variance than the KEYS algorithms
(this advantage disappears on the larger models). 
However, observe the results near the $(0,0)$ point of model2's ASTAR results: sometimes
ASTAR's heuristic search failed completely for that model.
\ei

\begin{figure}[!t]
{\scriptsize
\begin{center}
\begin{tabular}{c|c|c|c} 
&Model 2~(31 mitigations)~& Model 4~(58 mitigations)~&Model 5~(99 mitigations)~\\ \hline
SA& 0.577   &1.258        &0.854  \\ \hline
MaxFunWalk& 0.122   &0.429   &0.398         \\ \hline
ASTAR&  0.003  &0.017         &0.048    \\ \hline
KEYS  & 0.011   &0.053         &0.115   \\ \hline
KEYS2 	&0.006  &0.018 		&0.038 \\ 
\end{tabular}
\end{center}
}\caption{
 Runtimes in seconds, averaged over 100 runs, measured using the
``real time'' 
value from the Unix {\tt times} command. 
The size of each model is measured in number of mitigations (and for more details
on model size, see \fig{ddpModels}). 
}
\label{fig:runtimes}
\end{figure}
\subsection{Runtime Analysis}\label{sec:runtimea}
Measured in terms of  attainment and cost, there is little difference between KEYS and KEYS2.
However, as shown by \fig{runtimes}, KEYS2 runs two to three 
times as fast as its predecessor.
Interestingly, \fig{runtimes} ranks two of the algorithms in a similar order to \fig{finalModelResults}: 
\bi
\item
Simulated annealing is clearly the slowest;
\item MaxFunWalk is somewhat better but not as fast as the other algorithms.
\ei
As to ASTAR versus KEYS or KEYS2:
\bi
\item
ASTAR is faster than KEYS;
\item
and  KEYS2 runs in time comparable to  ASTAR.
\ei
Measured purely in terms of runtimes, there is little to recommend KEYS2 over ASTAR. 
However, ASTAR's heuristic guesses were sometimes observed to be sub-optimal (recall the above discussion
on the $(0,0)$ results in model2's ASTAR results). Such sub-optimality was never observed for KEYS2.

\subsection{Scale-Up Studies}\label{sec:scaleup}

\fig{scaleupruntimes} and
\fig{scaleupmodelcalls} 
show the effect of changing the size of the model on the runtime and on the number of times that 
the model is asked to generate a score, for both KEYS and KEYS2.
To generate these graphs, an {\em instance generator} was created that:
\bi
\item
Examined the real-world DDP models of \fig{ddpModels}; 
\item
Extracted
statistics related to the different types  of nodes (mitigations or risks or requirements) and the number of
edges between different types of nodes; 
\item 
Used those statistics to build  random models that were 1,2,4 and 8 times
larger than the original models.
\ei
\begin{figure}[!t]
\begin{minipage}{.6\linewidth}
{\scriptsize
\begin{tabular}{c|c|r|r|r|r|}
\multicolumn{3}{c}{~}& \multicolumn{2}{|c|}{Runtime (secs)}&$\frac{\mbox{KEYS}}{\mbox{KEYS2}}$\\\cline{4-5}
model & expansion & model size & KEYS & KEYS2 & \\\hline
2&1&62&0.01&0.01&1.07\\
2&2&124&0.03&0.02&1.23\\
4&1&139&0.04&0.02&2.29\\
5&1&201&0.13&0.04&3.18\\
2&4&248&0.10&0.05&2.09\\
4&2&278&0.17&0.05&3.48\\
5&2&402&0.50&0.12&4.26\\
2&8&496&0.44&0.14&3.21\\
4&4&556&0.73&0.16&4.66\\
5&4&804&1.98&0.38&5.21\\
4&8&1112&2.97&0.52&5.71\\
5&8&1608&8.06&1.35&5.96
\end{tabular}}
\end{minipage}\begin{minipage}{.4\linewidth}
\includegraphics[width=1.75in]{scaleupruntimes.pdf}
\end{minipage}
\caption{Runtimes KEYS vs KEYS2 (medians over 1000 repeats) as models increase in size.
The ``model''
 number in column one corresponds to \fig{ddpModels}. The 
``expansion factor''
of column two shows how much the instance generator expanded the model.
The ``model sizes'' of column three are  the sum of mitigations, requirements, and risks
seen in the expanded model.
}\label{fig:scaleupruntimes}
\end{figure}
\begin{figure}[!t]
\begin{minipage}{.6\linewidth}
{\scriptsize
\begin{tabular}{c|c|r|r|r|r|}
\multicolumn{3}{c}{~}& \multicolumn{2}{|c|}{Calls to {\tt model}}&$\frac{\mbox{KEYS}}{\mbox{KEYS2}}$\\\cline{4-5}
model & expansion & model size & KEYS & KEYS2 & \\\hline
2&1&62&3100&800&3.9\\
2&2&124&6200&1100&5.6\\
4&1&139&5800&1100&5.3\\
5&1&201&9900&1400&7.1\\
2&4&248&12400&1600&7.8\\
4&2&278&11600&1500&7.7\\
5&2&402&19800&2000&9.9\\
2&8&496&24800&2200&11.3\\
4&4&556&23200&2200&10.5\\
5&4&804&39600&2800&14.1\\
4&8&1112&46400&3000&15.5\\
5&8&1608&79200&4000&19.8
\end{tabular}
}
\end{minipage}\begin{minipage}{.4\linewidth}
\includegraphics[width=1.75in]{scaleupcalls.pdf}
\end{minipage}
\caption{Number of model calls made by KEYS vs KEYS2 (medians over 1000 results) as models
increase in size.
This figure uses the same column structure as \fig{scaleupruntimes}.
}\label{fig:scaleupmodelcalls}
\end{figure}
\begin{figure}[!t]
{\footnotesize
\begin{center}
\begin{tabular}{r|r|r|r|r|}
\multicolumn{1}{c}{~}				&\multicolumn{2}{c}{KEYS}	&\multicolumn{2}{c}{KEYS2}\\\cline{2-5}
				&runtimes	&model calls	&runtimes	&model calls\\\hline
exponential 		&0.82	    &0.83		&	0.88&     0.93\\\hline
polynomial (of degree 2)&0.99&	0.99	&		0.99&     0.98\\
\end{tabular}
\end{center}
}
\caption{Coefficients of determination $R^2$ of KEYS/KEYS2 performance figures,
fitted to two different functions: exponential or   polynomial of degree two.
Higher values indicate a better curve fit.
In all cases, the best fit is not exponential.
}\label{fig:fits}
\end{figure}
\fig{fits} shows the results of curve fitting to the plots of
\fig{scaleupruntimes} and
\fig{scaleupmodelcalls}.
The KEYS and KEYS2 performance curves fit a low-order polynomial (of degree two)  with very high coefficients of determination ($R^2\ge 0.98$).


\fig{fits} suggests that we could scale {\em either} KEYS or KEYS2 to larger models. However, we still
recommend KEYS2.
The column marked $\frac{KEYS}{KEYS2}$ in \fig{scaleupmodelcalls}
shows the
ratio of the number of calls made by KEYS vs KEYS2.
As models get larger, the number of calls to the model is an 
order of magnitude greater in KEYS than in
KEYS2. If applied to models with slower runtimes than DDP, this 
order of magnitude would be highly undesirable.


\subsection{Decision Ordering Algorithms}

\begin{figure}
\begin{center}
\noindent
\includegraphics[width=5in]{model2.pdf}

\noindent
{\bf \fig{in5}a:} Internal Decisions on Model 2.

\noindent
~\\
\noindent
\includegraphics[width=5in]{model4.pdf}

\noindent
{\bf \fig{in5}b:} Internal Decisions on Model 4.

\noindent
~\\
\noindent
\includegraphics[width=5in]{model5.pdf}

{\bf \fig{in5}c:} Internal Decisions on Model 5.
\end{center}
\caption{Median and spread of partial solutions learned by KEYS and KEYS2. X-axis shows the number
of decisions made.  ``Median'' shows the 50-th percentile of the measured
value seen in 100 runs at each era. ``Spread'' shows the difference between the 75th  and 50th percentile. }
\label{fig:in5}
\end{figure}

The decision ordering diagrams of \fig{in5} show the effects of the 
decisions made by KEYS and KEYS2.
For both algorithms, at $x=0$, all of
the mitigations in the model are set at random. During each subsequent
era, more mitigations are fixed (KEYS sets one at a time, KEYS2 
an incrementally increasing number).
The lines in each of these
plots show the median and spread seen in the 100 calls to the {\tt
model} function during each round. 

Note that these diagrams are {\em tame} and {\em well-behaved}:
\bi
\item
{\em Tame:} The ``spread'' values quickly shrink to a small fraction of the median. 
\item
{\em Well-behaved:} The median values move smoothly to a plateau of best performance (high attainment, low costs).
\ei
On termination (at maximum value of $x$), KEYS and KEYS2 
arrive at nearly identical median results (caveat:
for model2, KEYS2 attains slightly more requirements at a slightly higher
cost than KEYS). 
The spread plots  for both algorithms are almost indistinguishable (exception: in model2, the KEYS2 spread
is less than KEYS). 
That is, KEYS2 achieves the same results as KEYS, but
(as shown in \fig{runtimes} and \fig{scaleupruntimes}) it does so in less time.


A core assumption of this work is the ``keys'' concept; i.e. a small number of variables set the remaining model variables. 
\fig{in5} offers significant support for this assumption: observe how most of the improvement in costs and attainments was achieved after KEYS and KEYS2 
made only a handful of decisions (often ten or fewer).

On another matter,
it is insightful to reflect on the effectiveness of different algorithms for generating decision ordering diagrams.
KEYS2 is the most direct and fastest method. As mentioned above, 
all of the required information can be collected during one execution.
On the other hand,
simulated annealing, ASTAR, and MaxFunWalk would require a post-processor to generate the diagrams:
\bi
\item
Given $D$ possible decisions,
KEYS and KEYS2 collect statistics at each era on partial solutions where 
$1,2,3,..,|d|$ variables are fixed (where $d$ is the set of decisions)
while the remaining $D-d$ decisions are made at random. 
\item
ASTAR, Simulated Annealing, and MaxFunWalk 
work with full solutions since at each step they offer settings to all 
$d_i\in D$ variables. In the current form, they cannot comment on partial solutions.
Modified forms of these algorithms could add in extra instrumentation and extra post-processing
to comment 
on partial solutions
using methods like
feature subset selection~\cite{hall03} or a sensitivity
analysis~\cite{saltelli00}. 
\ei

	\section{Related Work} 
\subsection{Early vs Later Life-cycle Requirements Engineering}\label{sec:emod}
The case studies presented in this paper come from the NASA Jet Propulsion 
Lab's Team X meetings.
Team X conducts early life-cycle requirement discussions.

Once a system is running, released, and being maintained or extended, 
another problem
is {\em release planning}; i.e.
what features to add to the next $N$
releases. To solve this problem,
an inference engine must
reason about how functionality extensions
to current software can best satisfy outstanding stakeholder requirements. 
The challenge of release planning is that the
 benefits of added functionality must be weighed against the cost of implementing those
extensions.

Several approaches have been applied to this problem including:
\bi
\item
The
OPTIMIZE tool of Ngo-The and Ruhe~\cite{ruhe09}, which combines linear
programming with genetic programming to optimize release plans for software projects
\item
The weighted Pareto optimal genetic algorithm approach of Zhang et al.~\cite{zhang07}
\ei
(See also the earlier comparison of exact vs greedy algorithms 
by Bagnall et al.~\cite{bagnall01}).

Without further experimentation, we cannot assert that KEYS2 will work
as well on later life-cycle models 
(such as those used in release planning) as it did above (on the earlier life-cycle Team X models).  
However, at this time, we can see no reason why KEYS2 would not work as a non-linear optimizer of these later life-cycle models.
 This could be a productive area for future work.

	\subsection{Other Optimizers}\label{sec:oo}
	As documented by the search-based SE literature~\cite{clarke03,harman01,harman04,rela04}
and Gu et al.~\cite{gu97},
there are many possible optimization methods.
	 	For example:
\bi
\item 
Gradient descent methods assume that an objective function $F(X)$ is 
differentiable at any single point $N$. A Taylor-series approximation of
$F(X)$ can be shown  to 
decrease fastest if the negative gradient ($-\Delta F(N)$) is followed 
from point $N$. 
\item
	Sequential methods run on one CPU while parallel
	methods spread the work over a distributed CPU farm. 
\item
Discrete methods
	assume model variables have a finite range (that may be quite small)
	while continuous methods assume numeric values with a very large
	(possibly infinite) range.  
\item The search-based SE literature prefers meta-heuristic methods 
like simulated annealing, genetic algorithms and tabu search.
\item
Some methods map discrete values true/false into a continuous range 1/0
	and then use integer programming methods like CPLEX~\cite{mittle07}
	to achieve results. 
\item
	Other methods find overlaps in goal expressions and generate a binary decision diagram (BDD)
	where parent nodes store the overlap of children nodes.
\ei
	This list is hardly exhaustive: Gu et al. list hundreds of other methods and no
single paper can experiment with them all.
	All the algorithms studied here  are discrete and sequential.
	We are currently exploring parallel versions of our optimizers but, so far, the communication overhead
	outweighs the benefits of parallelism.

As to the general class of gradient descent methods, we do not use them since they assume
the objective function being optimized is essentially continuous. Any model with an ``if'' statement
is not continuous since, at the ``if'' point, the program's behavior may change abruptly.
The requirements models studied here are discontinuous about every subset of every possible mitigation.


As to the more specific class of  integer programming  methods, 
	we do not explore them here for two reasons.
	Coarfa et
	al.~\cite{coarfa00} found that  integer programming-based approaches
	ran an order of magnitude slower than discrete methods like
	the 
	MaxWalkSat and KEYS2 algorithms that we use. Similar results 
have been reported by Gu et al., where discrete
	methods ran one hundred times faster than integer programming~\cite{gu97}.

Harman offers another reason to avoid integer programming methods.
In his search-based SE
manifesto, Harman~\cite{harman01} argues 
that many SE problems are
over-constrained and so there may exist no precise solution that
covers all constraints. 
A complete solution over all variables is hence impossible, and partial solutions based on 
heuristic search methods are preferred.
Such methods may  not be complete; however, as Clarke et al remark, ``...software engineers
face problems which consist, not in finding {\em the} solution,
but rather, in engineering an {\em acceptable} or 
{\em near-optimal solution} from a large number of alternatives.'' ~\cite{clarke03}

	\subsection{Models of Requirements Engineering}
DDP is an ultra-lightweight modeling tool.
The value of ultra-lightweight ontologies
in early life cycle modeling
is widely recognized.
For example
Mylopoulos' {\em soft-goal} graphs~\cite{mylo92,mylo99}
represent
knowledge about non-functional requirements.
Primitives in soft goal modeling include statements of partial
influence such as {\em helps} and {\em hurts}.
Another  commonly used  framework in the design rationale
community is a  ``questions-options-criteria'' (QOC)
graph~\cite{shum94}. In  QOC
graphs:
\bi
\item {\em Questions} suggest {\em options}.
 Deciding on  one option can
raise other  questions;   
\item Options   shown  in  a  box denote  {\em selected
options};  
\item
Options are assessed  by  {\em criteria}; 
\item Criteria are gradual
knowledge;  i.e. they {\em tend to support or reject}  
options.  
\ei
QOCs   can succinctly  summarize   lengthy debates; e.g. the 480
sentences  uttered in  a  debate on  interface
options  can be    displayed    in  a   QOC   graph  on    a    single
page~\cite{maclean96}.
Saaty's Analytic Hierarchy Process (AHP)~\cite{saaty80} is a variant of QOC.

While DDP shares many design aspects with softgoals, QOC, and AHP, it differs in its representations
and inference method.  
As explained above around \eq{mod1},
whereas AHP, QOC, and softgoals
propagate influences over hierarchies, DDP propagates influences over matrices.

\subsection{Formal Models of Requirements Engineering}\label{sec:ffm}

Zave \& Jackson~\cite{zave97} define requirements engineering as finding the specification $S$ that, together with the
domain assumptions $K$, satisfies the given requirements $R$; i.e.
\begin{equation}\label{eq:zj}
\mbox{find $S$ such that~} K, S \vdash R
\end{equation}
Jureta, Mylopoulos \& Faulkner~\cite{jureta08} (hereafter JMF) take issue with \eq{zj}, saying 
that it
implicitly assumes that $K,S,R$ are precise and complete enough for the satisfaction relation to hold. 
More specifically, JMF complain that \eq{zj}
does not permit
partial fulfillment of (some) non-functional requirements. Also, the Zave \& Jackson definition
does not allow any preference
ordering of 
$specification_1$ over $specification_2$. JMF offer a replacement ontology where classical inference
is replaced with operators that support the generation and ranking of
subsets of domain assumptions that lead to maximal (w.r.t. size) subsets of 
the possible goals, and softgoal quality criteria\footnote{According to JMF:
``a salient characteristic of softgoals is that they cannot be
satisfied to the ideal extent, not only because of subjectivity, but also because the ideal level of satisfaction is beyond
the resources available to (and including) the system. It is
therefore said that a softgoal is not satisfied, but satisficed.''}.

DDP reinterprets ``$\vdash$'' in \eq{zj} as an inference across numeric quantities, rather than the inference
over discrete logical variables suggested by Zave \& Jackson. Hence, it can achieve the same goals as
JMF (ranking of partial solutions with weighted goals) without requiring the JMF ontology.
 \subsection{Requirements Analysis Tools}

	There exist many powerful requirements analysis tools including
	continuous simulation (also called system dynamics)~\cite{hamid91,sterman00},
	state-based simulation (including petri net and data flow approaches)
	~\cite{akhavi93,harel90,martin00}, hybrid-simulation (combining discrete
	event simulation and systems dynamics)~\cite{martin01,donzelli01,setamanit07},
	logic-based and qualitative-based methods~\cite[chapter
	20]{bratko01}~\cite{iwasaki89}, and rule-based
	simulations~\cite{mi90}. 
	One can find these models being used in the requirements phase (e.g. 
	the DDP tool described above), design refactoring using patterns~\cite{france03}, software integration~\cite{denno03}, 
	model-based security~\cite{jurjens06}, and performance assessment~\cite{bals04}.
	Many researchers have proposed support environments to
	help explore the increasingly complex models that engineers are
	developing. Gray et al.~\cite{gray06} have developed the Constraint-Specification
	Aspect Weaver (C-Saw), which uses aspect-oriented approaches~\cite{filman04}
	to help engineers in the process of model transformation.
	Cai and Sullivan~\cite{cai05} describe a formal method and tool called
	{\em Simon} that ``supports interactive construction of formal models,
	derives and displays design structure matrices... and supports
	simple design impact analysis.'' 
	Other tools of note are lightweight formal methods such as ALLOY~\cite{jackson02}
	and SCR~\cite{heitmeyer02} as well as UML tools that allow
	for the execution of life cycle specifications 
	(e.g. CADENA~\cite{child06}).

 
	Many of the above tools were built to maximize the expressive power of the representation language or the constraint language
	used to express invariants. What distinguishes our work is that we are 
	{\em willing to trade off
	representational or constraint expressiveness 
	for faster runtimes}.
	There exists a class of ultra-lightweight modeling languages which, as we show above, can be processed very quickly.
	Each of the tools listed in the last paragraph is also a candidate
	solution to the problem explored in this paper, provided it can be shown that its processing 
can generate tame and well-behaved decision ordering diagrams in a timely manner.
\subsection{Other Work on ``Keys''}\label{sec:keys}
Elsewhere~\cite{me03l}, we have documented
dozens of papers that have reported the keys effect (that a small number of variables set the rest)
under different names including {\em
narrows, master-variables, back doors,} and {\em feature subset
selection}:
\bi
\item
Amarel~\cite{Amarel86} observed that search problems contain narrow sets of variables, or collars, that must be used in any solution.
In such a search space, what matters is not so much how you get to these collars, but what decision you make when you get there.
Amarel defined macros encoding paths between {\em narrows}, effectively permitting a search engine to jump between them.
\item
In a similar theoretical analysis,
Menzies \& Singh~\cite{me03l} 
computed the odds of a system selecting solutions
to goals using complex, or simpler, sets of preconditions.
In their
simulations, they found that a system will naturally select for tiny
sets of preconditions (a.k.a. the keys) at a very high probability.
\item
Numerous researchers have examined {\em feature subset selection};
i.e. what happens when a data miner deliberately ignores some of
the variables in the training data. For example, Kohavi and
John~\cite{kohavi97} showed in numerous datasets that as few as
20\% of the variables are {\em key}; the remaining 80\% of variables
can be ignored without degrading a learner's classification accuracy.
\item
Williams et al.~\cite{williams03} discuss how
to use keys  (which they call ``back doors'') to optimize search.
Constraining these back doors also constrains the rest of the program. 
So, to quickly search a program, they suggest imposing some set values on the key variables.
They showed that setting the keys
can reduce
the solution time of certain hard problems from exponential to polytime, provided that the
keys can be cheaply located, an issue on which Williams et al. are curiously silent.
\item
Crawford and Baker~\cite{crawford94} compared the performance of
a complete TABLEAU prover to a very simple randomized search engine
called ISAMP.  Both algorithms assign a value to one variable, then
infer some consequence of that assignment with forward checking.
If contradictions are detected, TABLEAU backtracks while ISAMP
simply starts over and re-assigns other variables randomly. Incredibly,
ISAMP took {\em less} time than TABLEAU to find {\em more} solutions
using just a small number of tries. Crawford and Baker hypothesized
that a small set of {\em master variables} set the rest and that
solutions are not uniformly distributed throughout the search space.
TABLEAU's depth-first search sometimes drove the algorithm into
regions containing no solutions.  On the other hand, ISAMP's
randomized sampling effectively searches in a smaller space.
\ei
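To make the ISAMP strategy concrete, the following sketch restates it over a toy all-different constraint problem (the problem, the try limit, and the parameter names are our own inventions for illustration; Crawford and Baker's implementation differs in its details):

```python
import random

def isamp(variables, domains, consistent, max_tries=1000, seed=1):
    """ISAMP-style search: assign variables in a random order, forward
    checking after each assignment; on any contradiction, abandon the
    partial assignment and restart (no backtracking)."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        assignment = {}
        failed = False
        for var in rng.sample(variables, len(variables)):
            assignment[var] = rng.choice(domains[var])
            if not consistent(assignment):  # forward check
                failed = True
                break  # restart rather than backtrack
        if not failed:
            return assignment
    return None

# Toy constraint problem: three variables that must all take different values.
vs = ["a", "b", "c"]
ds = {v: [1, 2, 3] for v in vs}
all_different = lambda asg: len(set(asg.values())) == len(asg)
print(isamp(vs, ds, all_different))
```

If solutions cluster around a few master variables, such restarts resample the space cheaply instead of exhaustively exploring solution-free regions.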
In summary, the core assumption of our algorithms is supported in many domains.

\section{Conclusion}
Requirements tools such as  the DDP  tool (used at NASA for early lifecycle discussions), 
contain a
shared
group memory that stores all of the requirements, risks, and mitigations of each member of the group.
Software tools can explore this shared memory to find consequences and interactions that may have been
overlooked.

Studying that group memory is a non-linear optimization task:
possible benefits must be traded off against the increased cost of applying various mitigations.
Harman~\cite{harman01} cautions that solutions to non-linear problems 
may be ``brittle''; i.e. small changes to the search results
may dramatically alter the effectiveness of the solution. Hence, when 
reporting an analysis of this
shared group memory, it is vitally important to comment on the robustness of the solution.

Decision ordering diagrams are a solution robustness assessment method.
The diagrams
rank all of the possible decisions from most-to-least influential. Each point $x$ on
the diagrams shows the effects of imposing the conjunction of decisions $1 \le j \le x$.
These diagrams can comment on the robustness and neighborhood of
solution $\{d_1.. d_x\}$ using two operators:
\be
\item By considering the variance of the performance statistics after applying $\{d_1 .. d_x\}$.
\item By comparing the results of using the first $x$ decisions to those of using the first $x-1$ or $x+1$ decisions.
\ee
Since the diagrams are sorted, this analysis of robustness and neighborhood takes, at most, time linear in the number
of decisions.
That is, theoretically, it takes linear time to {\em use} a decision ordering diagram (see \tion{dod}).
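As a sketch of how such a diagram is used (the performance numbers and the five-percent jitter threshold below are invented for illustration, not taken from our experiments):

```python
def neighborhood_is_stable(scores, x, jitter=0.05):
    """scores[i] holds the median performance after imposing decisions
    1..i+1, already sorted most-to-least influential (a decision
    ordering diagram).  Operator 2: compare the result at decision x
    against its neighbors x-1 and x+1; a small spread means the
    solution {d_1..d_x} is robust to off-by-one changes."""
    neighbors = scores[max(0, x - 2): x + 1]
    spread = max(neighbors) - min(neighbors)
    return spread <= jitter * abs(scores[x - 1])

# Hypothetical diagram: performance plateaus after the fourth decision.
diagram = [0.40, 0.62, 0.75, 0.81, 0.82, 0.82, 0.83]
print(neighborhood_is_stable(diagram, 5))  # plateau region: stable
print(neighborhood_is_stable(diagram, 2))  # steep region: brittle
```

Each query touches only a constant-sized window of the sorted diagram, so sweeping all $x$ is linear in the number of decisions.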

Empirically, it takes low-order polynomial time to {\em generate} a decision ordering diagram using KEYS2.
This algorithm makes the ``keys'' assumption (that a small group of variables set everything else) and uses a Bayesian
ranking mechanism to quickly find those keys. As discussed above in the {\em Related Work} section, this assumption holds over a wide range of models used in a wide range of domains. This ``keys'' assumption can be remarkably
effective: our empirical results show that KEYS2 can generate decision ordering diagrams 
faster than the other algorithms studied here. Better yet, curve fits to our empirical results
show that KEYS2 runs in low-order polynomial time (of degree two) and so should scale to very large models.
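The ranking mechanism is detailed in the body of the paper; purely as an illustration of the style of support-based Bayesian ranking involved (the counts and setting names below are hypothetical), a setting scores highly when it appears often in the best-scoring runs and rarely in the rest:

```python
def rank_settings(best_counts, rest_counts):
    """Score each variable setting by like^2 / (like + nolike), where
    'like' is its frequency among the best runs and 'nolike' its
    frequency among the rest; return settings sorted best-first.
    (A sketch of the ranking style only, not KEYS2 verbatim.)"""
    scores = {}
    for setting in set(best_counts) | set(rest_counts):
        b = best_counts.get(setting, 0)
        r = rest_counts.get(setting, 0)
        scores[setting] = (b * b) / (b + r) if b + r else 0.0
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical mitigation settings seen in 'best' vs 'rest' runs.
best = {"m1=on": 8, "m2=off": 5, "m3=on": 1}
rest = {"m1=on": 1, "m2=off": 5, "m3=on": 7}
print(rank_settings(best, rest))
```

Settings that only appear in mediocre runs fall to the bottom of the ranking, so fixing the top-ranked settings first tends to fix the keys first.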

Prior to this work, our two pre-experimental concerns were that:
\bi
\item
We would need to trade solution robustness against solution quality.
More robust solutions may not have the highest quality.
\item
Demonstrating solution robustness requires multiple calls to an analysis procedure.
\ei
At least for the models studied here, neither concern  was realized. KEYS2 generated the highest
quality solutions (lowest cost, highest attainments) and did so 
more quickly than the other methods.

In \tion{dod} it was argued that 
decision ordering diagrams are useful when they are 
{\em timely} to generate while being  {\em well-behaved} and {\em tame}.
KEYS2's results are the most {\em timely} (fastest to generate) of all 
of the methods
studied here. As to the other criteria,
\fig{in5} shows that KEYS2's decision ordering diagrams:
\bi
\item Move smoothly to a plateau with only a small amount of ``jitter'';
\item Have very low spreads, compared to the median results.
\ei
That is, at least for the models explored here, KEYS2 generated
decision ordering diagrams that are both {\em well-behaved} and {\em tame}.

In summary, we recommend KEYS2 for generating decision ordering diagrams since, apart from the (slightly 
slower) KEYS algorithm, we are unaware of other search-based software engineering 
methods that enable such a rapid reflection of solution robustness. 

\section{Acknowledgments}
This research was conducted at West Virginia University, the Jet Propulsion
Laboratory under a contract with the National Aeronautics and Space administration, 
Alderson-Broaddus College, and Miami University. 
Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does 
not constitute or imply its endorsement by the United States Government.
\bibliographystyle{abbrv}
\bibliography{refs,refsgreg} 
\appendix
\section*{Obtaining our System}

We have placed on-line all the  materials required for other
researchers to conduct further investigation into this problem. 
All the code, Makefiles, scripts, and so on used in this paper are available 
at 	 \url{http://unbox.org/wisp/tags/ddpExperiment/install}.  For security reasons, all the available
JPL requirements models have been ``sanitized''; i.e. all words replaced with anonymous  variables.



\clearpage
\sffamily
\section*{Replies to reviewers}

Thank you for your careful reviews of the first draft of this paper. On experimentation, we found that the premises of draft 
\#1 were wrong. It turns out that
KEYS2 scales very well to larger models ($O(N^2)$). Hence, the basis of the paper changed from issues of runtime
to other advantages of our method. 

The issue of {\em exploring solution robustness}  was not adequately discussed in draft \#1. Here, we focus draft \#2 on that issue.
Nearly every page of this paper has been extensively rewritten and the whole paper has been shortened.

\subsection*{Comments from Reviewer 1}
{\em I felt that the preparatory sections built up lots of background, but the meat in the results section was skimmed over by comparison.}
\begin{quote}
This draft has been extensively rewritten so that we get to the ``meat'' as soon as possible.
New results have been added regarding how well these methods scale to larger models
(see \tion{scaleup}). 
In all, the conclusions of this paper are now based on three real-world requirements models from NASA
and a dozen artificially generated models.

As to the excessive amount of background material, much has been pruned from this draft, but much more could also be deleted.
For example:
\bi
\item
Sections 5.1, 5.2, 5.3 
describe algorithms that have been extensively documented elsewhere in the literature. Hence, if you agree,
we could cut those
sections back to one paragraph each.
\item
 In the {\em Related Work} section, we could delete
sections 7.3, 7.4, 7.5.
\ei
\end{quote}
{\em Fig 10 is too small to read properly.}
\begin{quote}
That  figure has been made larger.
\end{quote}
{\em I missed a clear and unequivocal explanation of what the fitness function and representation were for the Simulated Annealing}
\begin{quote}
That has now been placed in its own section, especially marked ``objective function'' (see \tion{of}).
\end{quote}
{\em
The work is written in a compelling and direct style (which occasionally is too direct and a little grating)....

.... I liked the novel style opening "The room is crowded and everyone is talking at once.". However this may put off some readers. 
At present the paper is nice as a compelling read, but starting off like this really could distract the more traditional SE reader who have views about how scientific work should be presented...

The colorful writing style comes up again on p17; Uribe and Stickel "struggled valiantly". I don't think that this style is suitable for an academic journal. Maybe I am old fashioned. I think it should be toned down and the style more dispassionate. Did they really struggle "valiantly" anyway? 
It may well put readers off and that would be a pity.
}
\begin{quote}
That language has been removed, both in the introduction and around the discussion of Uribe and Stickel.
\end{quote}
{\em The authors make available their code which is helpful. It is not
clear whether real world requirements data form JPL is also included
in the on line provision. I would like some clarification on this
issue in the revised version. Such provision would be a great help
to follow on research, though I realize that for reasons of commercial
confidentiality this may not be possible.}
\begin{quote}
Appendix A now details how to obtain this material.  For security
reasons, all the available JPL requirements models have been
``sanitized''; i.e. all words replaced with anonymous  variables.
\end{quote}
{\em The authors statement:- " This is an exciting result since it means
 that current results from simulated annealing(e.g.[5,8,10,62,73])
 could be greatly improved, just by switching to an alternate search
 engine.  " really needs to be re-written.
}
\begin{quote}
This new draft does not include those over-inflated claims about the merits of KEYS/KEYS2 vs
simulated annealing. 
\end{quote}
{\em Other work on SBSE for requirements that is clearly missing 
and needs to be included is:
\bi
\item
The paper by Gunther Ruhe at FSE 08
\item The Zhang et al (GECCO 2007) and
\item the work by Bagnall (I\&ST 2001).
\ei}
\begin{quote}
We cannot find the Ruhe paper at the FSE'08 web site\footnote{\url{https://cgi4.cc.gatech.edu/phps/conferences/fse16/program.php}}
or on his home pages (perhaps it was a last minute replacement for
a keynote?).  
However,  his 2009 TSE paper is clearly relevant so we used that instead.
A curious omission in that paper is the runtimes of the GAs it used.
While this does not
invalidate that paper, it does mean that we cannot make an extensive comment on that work in this paper.
In any case, thanks to this review, we added a reference to his work in our ``Other Optimizers'' section (see \tion{oo}).

As to the other work on release planning, we added in notes on 
Ruhe, Zhang, and Bagnall's work in Related Work (see \tion{emod}).
\end{quote}
{\em 
Some of the citations are messed up too. For example citation [5] "S M B and S. Mancoridis". S M B is Brian Mitchell I believe.
Clearly some more scholarly care is required here, both to chaise up relevant work and to ensure that the citations are present correctly.
}
\begin{quote}
Thanks for catching that- the culprit was two commas together in a .bib file. That has been fixed and the other references checked.

Oh, and that particular paper did
not make it into this draft.
\end{quote}
{\em Does Fig 4 add sufficient value to justify a whole page?}
\begin{quote}
(FYI: that figure has now moved to page 6).
That is, of course, for you to decide. However, we often get questions along the lines of ``aren't these models so simple you can just {\em see} the problems/solutions?''.
This figure shows just how complex these models can get.
We can remove that figure if you feel that we have made enough of a case in the text that the JPL models are hard enough to be interesting.
\end{quote}
{\em In related work, some recent work by Finkelstein on multi objective
 requirements models for fairness is missing (this was published in
 RE 08)}
\begin{quote}
In RE08, we found:
 \bi
\item	
Revisiting the Core Ontology and Problem in Requirements Engineering
Jureta, I.J.; Mylopoulos, J.; Faulkner, S.;
International Requirements Engineering, 2008. RE '08. 16th IEEE
\ei
Comments on this have been added to the ``Related Work'' section, see \tion{ffm}.
\end{quote}
\subsection*{Comments from Reviewer 2}
{\em The novelty of the work remains unclear. The paper presents little other (of real substance) than a small performance measurement experiment, results of which are interpreted as supporting the appropriateness of the algorithm for use in automated requirements engineering.
}
\begin{quote}
Certainly, the previous draft was poorly written and obscured the novelty of the paper. 
This draft has been significantly
reorganized and the relevant material moved to the front (and supporting material moved to
a related work section at the back of the article). Consequently the novel features of this paper
are not obscured:
\bi
\item
A completely naive
search through the space of options is exponential in the number of options. We can reduce that search to linear
time using {\em decision ordering diagrams}, described in \tion{dod}. 
\item
Better yet, we can generate those diagrams in low-order polynomial time using
a Bayesian sampling method called KEYS2.
\ei
\end{quote}
{\em The title and abstract and indeed the overall narrative of this paper are misleading. This is a paper on a small-scale runtime performance study *masquerading* as a paper on requirements analysis.}
\begin{quote}
We thank this reviewer for this comment; it inspired us to conduct the scale-up study of \tion{scaleup}.

We agree that the previous draft lacked sufficient analysis to merit journal publication.
To address this concern, we have changed the title of the paper; changed the focus of the paper; and clarified and extended our analysis.

We would also comment that this {\em is} a real requirements analysis paper.
In fact,
this new draft's empirical authority is {\em greater than} that of the standard requirements engineering paper.
Here, we explore three real-world requirements models (from NASA) as well as a dozen artificially generated models.
By contrast, a standard paper in this field uses a single small (possibly ``toy'') case study.

\end{quote}
{\em The writing in this paper is somewhere between terrible and just plain poor: at the level of sentences, paragraphs, and overall narrative structure..... Opening scenario isn't credible. A decision as substantial as a redesign of the file system would not be made without documenting it in a way that would make everyone with an interest in that aspect of the aware of the decision....}
\begin{quote}
This reviewer is quite correct in complaining about the quality of the previous draft. This draft has been extensively reorganized. The bogus opening scenario has been
deleted. The  core of the paper has been made clearer.
The paper has been extensively proof read prior to resubmission to ASE.
\end{quote} 
{\em Critical but unexplained and unsupported assumption: Our proposed solution to real-time requirements optimization assumes that the behavior of a large system is determined by a very small number of key variables.}
\begin{quote}
This assumption has much supporting evidence from other domains (see \tion{keys}). 
Also, our empirical evidence supports it (see the comment at bottom of p20).
\end{quote}
\subsection*{Comments from Reviewer 3}
{\em 
The whole discussion of the "real time requirements
 optimization" problem, while a nice idea and interesting,
 falls flat by being bogus.}
\begin{quote}
We agree. When we measured the scale-up runtime costs using our models (and not some bogus equation), we found the scale-up factors
to be far {\em less} than anything suggested by our formulae. So this draft has dropped that whole discussion. Instead, we focus on an issue that was
poorly discussed in the previous draft: fast methods for assessing solution robustness.
\end{quote}
{\em
 First paragraph of Sec 7: Why? why is it "unfair" to compare
  BDD-based against search-based? Aren't they solving the same problem?
}
\begin{quote}
Algorithm comparison is far more complicated than it first appears.
We have much experience at
empirical software engineering conferences where authors of method1 have cried ``foul!'' when others have compared method1 against
method2 using a particular problem that, by definition, is not suitable for method1. The classic example is NuSMV vs SPIN: the 
BDDs of NuSMV make it a natural choice for models with extensive symmetry and repeated structures. Given that you know that,
should you compare NuSMV and SPIN using such models? Or are they the {\em last} thing you should use for a comparison?
We have no good answer. Nor do other researchers:
\bi
\item Holzmann does not compare SPIN against NuSMV. 
\item
Uribe and Stickel
take great care when selecting comparative cases, and when the algorithms are too distant, they don't even do that. 
\ei
\end{quote}
{\em  Sec 7.2: what is the variable "C"?}
\begin{quote}
(FYI: in the new draft, this section is appendix B.2.) ``C'' is ``CHOSEN'' (fixed in this draft).
\end{quote}
{\em I'm not sure it is valid to make claims about KEYS2 performance
  versus MaxWalkSAT when you are actually comparing to MaxFunWalk.
  You have to do a better job showing how they are related. (E.g.
  at what level are they the same algorithm?)
} 
\begin{quote}
Thanks for this comment; it has allowed us to clarify a confusing part of our text.
MaxWalkSat is defined over CNF formulae. MaxWalkFun assumes that some function can assess a collection of variable settings.
MaxWalkFun becomes MaxWalkSat when that assessment function is a logical truth table. Here, we use our DDP models as the
assessment function (these notes have been added to the bottom of page 12).
\end{quote}
{\em Last para of Sec 9. "...some requirements models". This comes out of
 the blue.  For which reqts models does KEYS approach fail?}
\begin{quote}
This section has been removed. As to the ultimate limits of KEYS2, we don't know yet. However, given the simplicity of
the algorithm's implementation and its fast runtimes, we offer the following advice: try it and see.
FYI, we are using it in {\em all} our current projects. We have found it is a good
stochastic rule generator for standard UCI data and that it out-performs standard non-linear optimizers for optimizing complex FORTRAN simulators
of NASA flight systems. We are also successfully applying it to effort estimation and defect prediction. All those papers are Masters projects, currently in progress
(i.e. no tech reports or publications, yet). But in summary, KEYS2 is remarkably general and remarkably useful. 
\end{quote}
\end{document}

