%\documentclass[onecolumn,12pt,draftcls,journal,compsoc]{IEEEtran}
\documentclass[cite,journal,10pt,compsoc, twocolumn]{IEEEtran}
\usepackage{pifont}
\usepackage{times}
\newcommand{\ed}{\end{description}}
\newcommand{\fig}[1]{Figure~\ref{fig:#1}}
\newcommand{\eq}[1]{Equation~\ref{eq:#1}}
\newcommand{\hyp}[1]{Hypothesis~\ref{hyp:#1}}
\newcommand{\tion}[1]{\S\ref{sec:#1}}


\usepackage{alltt}
\usepackage{graphicx}
\usepackage{url}
\newcommand{\bi}{\begin{itemize}}
\newcommand{\ei}{\end{itemize}}
\newcommand{\be}{\begin{enumerate}}
\newcommand{\ee}{\end{enumerate}}
\newcommand{\bdd}{\begin{description}}
\newcommand{\edd}{\end{description}}
% IEEE Computer Society needs nocompress option
% requires cite.sty v4.0 or later (November 2003)
\usepackage[nocompress]{cite}
% normal IEEE
 % \usepackage{cite}

% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}

\begin{document}

\title{Achieving Conclusion Stability when Ranking
Software Effort Estimation Methods}

%\title{Rejecting Effort Estimation Methods}

\author{Tim~Menzies,~\IEEEmembership{Member,~IEEE,}
        Omid Jalali,
		Jairus Hihn,
	    Dan Baker,
		and Karen Lum
\thanks{ 
Tim Menzies, Omid Jalali, and Dan Baker
are with the Lane Department of
Computer Science and Electrical Engineering, West
Virginia University, USA: 
\protect\url{tim@menzies.us},
\protect\url{ojalali@mix.wvu.edu}, \protect\url{danielryanbaker@gmail.com}.}
\thanks{Jairus Hihn and Karen Lum at with NASA's Jet Propulsion Laboratory:
\protect\url{jhihn@mail3.jpl.nasa.gov},
\protect\url{karen.t.lum@jpl.nasa.gov}.}
\thanks{
The research described in this paper was carried out at the Jet 
Propulsion Laboratory, California Institute of Technology, under a contract with the US National Aeronautics and 
Space Administration. Reference herein to any specific 
commercial product, process, or service by trade name, 
trademark, manufacturer, or otherwise does not constitute 
or imply its endorsement by the US Government.}
\thanks{Download: \protect\url{http://menzies.us/pdf/07better.pdf}.}
\thanks{
Manuscript received July 31, 2007; revised XXX, XXXX.}}

\markboth{Journal of ???,~Vol.~6, No.~1, January~2007}%
{Menzies \MakeLowercase{\textit{et al.}}: Better Effort Estimation}

\IEEEaftertitletext{\vspace{-1\baselineskip}
\noindent\begin{abstract}
This paper revisits the effort estimation {\em conclusion instability}
 problem identified by Foss et.al. and Myrtveit et.al. 
They report
that conclusions regarding which software effort estimation
method is  ``best'' is highly contingent
on
(1)~the evaluation criteria and (2)~the
subset of the data used in the evaluation.
One cause of this instability is
the presence of extreme outlying results.
The effects of such outliers can be mitigated using
 non-parametric evaluation such as the U-test proposed by
Mann-Whitney.
We report a study were, when using the U-test,
the same four methods were ranked higher than
154 others {\em regardless of which evaluation criteria or data subset
was applied}.
Hence, we
recommend non-parametric evaluation to evaluate {\em and} prune
effort estimation methods.
More specifically, when learning effort estimators,
we now (i)~advocate augmenting Boehm's local calibration
method with simple linear-time row and column pruning pre-processors
and (ii)~advise against model trees, linear regression,
exponential time column subset selection,
and (unless the data is sparse) standard nearest neighbor methods.
\end{abstract}\noindent
\begin{keywords}COCOMO, effort estimation, data mining, 
evaluation, Mann-Whitney U-test, non-parametric tests.
\end{keywords}\vspace{1\baselineskip}}\maketitle

%------------------------------------------------------------------------- 

\section{Introduction}

Software effort estimates are often wrong. Initial estimates may be incorrect by a factor 
of four~\cite{boehm81} or even more~\cite{kemerer87}. 
As a result,
the allocated funds may be inadequate to develop the required project.
In the worst case, over-running 
projects are canceled, wasting the entire development effort. 
For example, in 2003, NASA canceled
the CLCS system
after spending 
hundreds of millions of dollars on software development.
The project was canceled after the initial estimate
of
\$206 million was increased to between
\$488 million \$533 million~\cite{clcs03}.
On cancellation, approximately 400 developers
lost their jobs\cite{clcs03}.

While the need for better estimates is clear, there exists
a very large number of effort estimation methods~\cite{jorg04,jorgensen05}
and no good criteria for selecting between them. Few studies empirically
compare all these techniques. What is more usual are narrowly focused studies (e.g.
\cite{kemerer87,briand00,lum02,ferens98}) that test, say, linear regression models in different environments.

Previously, Foss et.al.~\cite{foss05} and Myrtveit et.al.~\cite{myrtveit05}
have doubted the practicality of comparatively assessing $L$ different learners processing $D$ 
data sets. The results of such a comparison, they argue, vary according to the sub-sample
of the data being processed and the applied evaluation criteria. 

Methodologically, Foss/Myrtveit's {\em conclusion instability}
is highly problematic. 
Unless we can {\em rank} methods and {\em prune}
inferior methods,
we will soon be overwhelmed by a growing number
of (possibly useless) effort estimation methods.
New open source data mining toolkits are
appearing with increasing frequency such as the 
the R project\footnote{\url{http://www.r-project.org/}},
Orange\footnote{\url{http://www.ailab.si/orange/}} and the
WEKA~\cite{witten05}.
Such tools tempt researchers to over-elaborate their effort estimation tools.
For example, our own COSEEKMO tool~\cite{me06d} takes a full
day to run its 158 methods.
Much of that execution is wasted since, as shown below,
154 of those methods are superfluous.

The rest of this paper shows how we ranked and pruned 154 COSEEKMO methods.
Initially, our results will exhibit  Foss/Myrtveit's conclusion instability: different
subsets of the data will rank methods differently.
The root cause of conclusion instability will then be
identified: a very small number of COSEEKMO's estimates have a  very large 
error. If these {\em outliers} 
fall into some of the subsets, then those subsets will exhibit a dramatically
different performance.

Non-parametric
methods such as the U-test proposed by Mann-Whitney~\cite{mann47} 
mitigate the outlier problem 
Non-parametric U-test uses {\em ranks}, not precise numeric values.
For example, if treatment $A$ generates $N_1=5$ values \{5,7,2,0,4\} and treatment $B$ generates $N_2=6$ values \{4,8,2,3,6,7\}, 
then these sort as follows:
\begin{center}
{\small
\begin{tabular}{c|c|c|c|c|c|c|c|c|c|c|c}
Samples& A & A & B & B & A & B & A & B & A & B & B\\
Values & 0 & 2 & 2 & 3 & 4 & 4 & 5 & 6 & 7 & 7 & 8\\
\end{tabular}}\end{center}
On ranking, averages are used when values are the same:
\begin{center}
{\small
\begin{tabular}{c|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c}
Samples& A&  A  &B&  B&  A & B  &A&  B&  A&  B & B\\
Values&  0  &2  &2&  3&  4 & 4  &5&  6&  7&  7 & 8\\
Ranks&   1 &2.5&2.5 &4& 5.5&5.5 &7&  8& 9.5&9.5& 11\\
\end{tabular}}\end{center}
Note that, when ranked in this manner,
the largest value gets the same rank
even if it 
ten to a hundred times larger.
That is, such rank tests
are less susceptible to large outliers and,  theoretically,
may not exhibit conclusion instability.

The experiment of this paper
checks if this theoretical prediction holds in practice.
As shown below,
when U-tests are applied to COSEEKMO's methods then
the same four methods are demonstrably superior. This conclusion will be shown to be stable 
across different randomly selected subsets and different evaluation criteria.
To the best of our knowledge, this paper is the first report of
stable conclusions in effort estimation.

The contributions of this paper are to:
\bi
\item 
Offer an optimistic alternative to the
pessimism of Foss/Myrtveit. Using non-parametric analysis,
 it is possible to find stable conclusions regarding the relative
merits of different effort estimation methods. Existing tools
like COSEEKMO can 
be pruned  and optimized. Better yet, new methods can now be developed
and, possibly, shown to be better than current practice.
\item
Identify a small set of useful
  effort estimation methods.
Interestingly, the four winning COSEEKMO methods are also some of the simplest (e.g.
linear time row and column pruners, Boehm's 1981 local calibration method).
\item
Identify
methods
that require further research before they can be recommended to industrial practitioners.
For data expressed in the COCOMO format,
this study found no added value in general linear regression, model trees, 
exponential time column subset selection,
and (unless the data is sparse) standard nearest neighbor methods.
\ei

\section{The Conclusion Instability Problem}

% fig- larege samples 30 repeats

\subsection{Symptoms of Instability}

Kitchenham et.al.~\cite{kitc01}, Foss et.al.~\cite{foss05}, 
and Myrtveit et.al.~\cite{myrtveit05} caution that,
historically, ranking estimation methods has been done quite poorly. 
Based on an analysis of two (non-COCOMO)
data sets as well as simulations over artificially generated data set,
Foss et.al. and Myrtveit et.al.
concluded that numerous commonly used methods such as the mean
MRE\footnote{MRE = $abs(actual-predicted)/actual)$.} are
unreliable measures of estimation effectiveness.
Also, the conclusions reached from these standard measures can vary wildly
depending on which subset of the data is being used for testing~\cite{myrtveit05}.


\begin{figure}
\begin{center}
\includegraphics[width=3.5in]{Parametric.pdf}
\end{center}
\caption{Results of 3 different runs of COSEEKMO comparing
two methods using mean MRE values. Points on the lines
Y-axis shows the difference in mean relative error (MRE) between method1 and method2.
Lower values endorse method2 since, when such values occur,
method2 has a lower error than method1.}
\label{fig:parametric}
\end{figure}

\fig{parametric} demonstrates  conclusion instability.
That figure shows two experimental runs. In each run,
30 times, effort estimate models were built for our 19 subsets using
two methods. Each time, an effort model was build from a randomly selected 90\% of the data.
Results are expressed in terms of the difference in mean MRE between the
two subsets; e.g. in Run~\#1, method2 had a much larger mean error than method1.

After Run~\#1, the results endorse method2 since that method either (a)~did better (lower errors)
as method1 or (b)~had similar performance to method1.
However, that conclusion is not stable. Observer, in  Run~\#2 that:
\bi
\item
 In subsets 1,2,3,7,and 11
the improvement of method2 over method1 disappeared. 
\item
Worse, in subsets 1,2,11, method1 performed
dramatically better than method2.
\ei
The deviations seen in 30 repeats of the above procedure
was quite large:  within each data set, the standard deviation
on the MREs were $\{median,max\} = \{150\%,649\%\}$~\cite{me06d}.
 Port (personnel communication) has proposed a bootstrapping method to determine the true performance distributions
of COSEEKMO's methods. That method would require $10^2$ to $10^3$ re-samples and, given
 COSEEKMO's current runtimes, would take $10^2$ to $10^3$ days
to terminate.


One troubling result from \fig{parametric} study is that the number of training
examples
was {\em not} connected to the size of standard deviation. A pre-experimental intuition
was that the smaller the training set, the worse the prediction instability.
On the contrary, we found small and large instability (i.e. MRE standard deviation)
for both small
and large training sets~\cite{me06d}. That is,  instability
cannot be tamed by further data collection. Rather, the data must be processed
and analyzed in some better fashion (e.g. U-test described below).

These large instabilities explain the contradictory results in the effort estimation
literature.
Jorgensen reviews fifteen studies that compare model-based to expert-based estimation. 
Five of those studies found in
favor of expert-based methods; five found no difference; 
and five found in favor of model-based estimation~\cite{jorg04}.
Such diverse conclusions are to be expected if models exhibit large instabilities in their performance.

\begin{figure}[!t]
\begin{center}
\includegraphics[width=2.5in]{graphs/re_nasa93_fg_g.pdf}
\includegraphics[width=2.5in]{graphs/re_nasa93_year_1975.pdf}
\includegraphics[width=2.5in]{graphs/re_nasa93_mode_embedded.pdf}
\includegraphics[width=2.5in]{graphs/re_coc81_langftn.pdf}
\end{center}
\caption{Relative errors seen in COSEEKMO's
experiments on four data sets. Thin lines show the actual
values. Thick lines show a Gaussian distribution that uses
mean and standard deviation of the actual values.
From top to bottom, the plots are of:
{\em (top)} NASA ground systems;
NASA software written around 1975;
NASA embedded software (i.e. software developed within tight hardware, software, and operational constraints);  
{\em (bottom)} some FORTRAN-based software systems.
}\label{fig:re}
\end{figure}

\subsection{Diagnosing the Cause}

The thin line of \fig{re}
 sort the relative error\footnote{RE = $(predicted - actual)/actual$ OMID: it can't be (ac-pred)/act or we'd only
get negative spikes 
 } (RE)
seen in four of the subsets studied in \fig{parametric}.
Observe that while most of the actual RE values are nearly zero, an {\em infrequent}
number
(on the right hand side)
are {\em extremely large} (up to 8000 in the second plot). 
Such  large {\em spikes} in RE  result when
the predicted values are 
much larger than actual values and result from
(1)~noise in the data or (2)~a training set that learns an overly steep
exponential function for the effort model. 

Large and infrequent outliers
explain conclusion instability:
\bi
\item
Large outliers can make mean calculations
highly misleading. A single large outlier can make the mean
value far removed from the median\footnote{Median: 
the value below which 50\% of the values fall.}. 
\item
For data sets with only a small number of outliers (e.g. \fig{re}), the conclusions reached from
different subsets can be very different,
depending on the absence or presence of the infrequent outliers.
\ei

\fig{re} also illustrates how poorly standard methods assess the performance of effort estimation data.
Demsar~\cite{demsar06} offers a definition of {\em standard methods} in data mining.
In his study of four years of proceedings from the {\em International Conference
on Machine Learning}, Demsar
found that the standard
method of comparative assessment 
were t-tests over some form of repeated sub-sampling
such as cross-validation, separate subsets, or randomized re-sampling. 
Such t-tests assume that the distributions being studied are Gaussian and, as shown by the
thick line of \fig{re}, effort estimation results can be highly
non-Gaussian.
These thick lines show a Gaussian cumulative 
distribution function computed from the means and standard deviations of 
the actual RE values (the thin lines).
Observe how poorly Gaussians model our RE results:
\bi
\item
There exists orders of magnitude differences 
between the {\em actual} plots (the thin lines) and the {\em Gaussian} approximations
(the thick lines). 
\item
The Gaussian goes negative while none of our effort estimation methods assume that
it takes less than no time to build software.
\ei

\subsection{Fixing Instability}

The problem of comparatively assessing $L$ learners run on multiple sub-samples of $D$ data sets
has been extensively studied in the data mining community. 
T-tests that assume Gaussian distributions are strongly deprecated.
For example, Demsar~\cite{demsar06}
 argues that non-Gaussian populations are common enough to require a methodological
change in data mining. 

After reviewing a wide range of
comparisons methods\footnote{paired t-tests with and without the use of geometric means of the relative ratios;
binomial tests with the Bonferroni correction;
paired t-tests; ANOVA; Wilcoxon; Friedman},
Demsar advocates the use of the 1945 Wilcoxon~\cite{wilcoxon45} 
signed-rank test that compares the ranks for the positive and negative differences (ignoring the signs). 
Writing five years early,
Kitchenham et.al.~\cite{kitc01} comment that the Wilcoxon test has its limitations.
Demsar's report  offers the same conclusion, 
noting that the Wilcoxon test requires that the sample sizes are the same.
To fix this problem, Demsar augments Wilcoxon with the Friedman test.

One test not studied by Demsar is Mann and Whitney's 1947
modification~\cite{mann47} to Wilcoxon rank-sum test (proposed along with his signed-rank test).
We prefer this test since:
\bi
\item
The Mann-Whitney U-test does not require that the sample sizes are the same.
So, in a single U-test, learner $L_1$ can be compared to all its rivals. 
\item
The U-test does not require any post-processing
(such as the Friedman test) to conclude if the median rank of one population (say, the $L_1$ results)
is greater than, equal to, or less than the median rank of another (say, the $L_2,L_3,..,L_x$ results).
\ei
\begin{figure}[!t]
\begin{tabular}{|p{.95\linewidth}|}\hline
\footnotesize
The sum and median of A's ranks is
\[\begin{array}{ccl}
sum_A &=& 1 + 2.5 + 5.5 + 7 + 9.5 = 25.5\\
median_A& =&5.5
\end{array}\]
and the sum and median of B's ranks is
\[\begin{array}{ccl}
sum_B&=& 2.5 + 4 + 5.5 + 8 + 9.5 + 11 = 40.5\\
median_B&=& 6.75
\end{array}\]
The $U$ statistic is then calculated 
where \mbox{$U_x=sum_x-(N_1(N_2+1))/2$}:
\[\begin{array}{c}
U_A = 25.5 - 5*6/2 = 10.5\\
U_B = 40.5 - 6*7/2 = 19.5
\end{array}\]
These can be converted to a Z-curve using:
\[
\begin{array}{rclcl}
\mu&= &(N_1N_2)/2&=&516.4\\
\sigma & = & \sqrt{\frac{N_1N_2(N_1 + N_2 + 1)}{12}}&=&5.477\\
Z_A & = & (U_A - \mu)/\sigma&=&-0.82\\
Z_B & = & (U_B - \mu)/\sigma&=&0.82\\
\end{array}
\]
(Note that $Z_A$ and $Z_B$ have the same absolute value.
In all case, these will be the same, with opposite signs.)

If $abs(Z) < 1.96$ then the samples $A$ and $B$ have the same median rankings (at
the 95\% significance level). In this case, we add one to both $ties_A$ \& $ties_B$.
Otherwise, their median values can be
compared, using some domain knowledge. In this work, {\em lower} values
are better since we are comparing relative errors. Hence:
\bi
\item
If $median_A < median_B$ add 1 to both $wins_A$ \& $losses_B$.
\item
Else if $median_A > median_B$ add 1 to both $losses_A$ \& $wins_B$.
\item
Else, add 1 to both $ties_A$ \& $ties_B$. 
\ei
\\\hline
\end{tabular}
\caption{An example of the Mann-Whitney U test.}\label{fig:mw}
\end{figure}

\fig{mw} shows the U-test for the two treatments
$A$ and $B$ discussed in the introduction.
The test concludes that these treatments have the same
median ranked values (at the 95\% significance level). 
As defined in \fig{mw}, this test  counts
the  $wins$, $ties$, and $losses$ for $A$ and $B$
(where $A$ and $B$ are single or groups of methods).
Since we seek methods that can be rejected, the value of interest to us is
how often methods {\em lose}.

\section{Experiments}

This section describes an experiment where the U-test was applied to COSEEKMO's
158 methods.
We will find that:
\bi
\item 
Four of those methods  lost the least. We will advocate these four preferred
methods.
\item
Most
of the methods (154)
lost out to the four preferred methods. We will deprecate these methods.
\ei
This was both a disappointing and exciting result.
On one hand, it meant that much of COSEEKMO's design
is superfluous. On the other hand, it meant that:
COSEEKMO could be significantly optimized.


\subsection{Data}
%XXX different subsetsets, two sources
% previously we hare report that these data sets have very different properties

This paper is based on 19 subsets from two sources.
$COC81$\footnote{See "coc81" at \url{http://promisedata.org/repository}.}
comes from Boehm's 1981 text on effort estimation. 
$NASA93$\footnote{See "nasa93" at \url{http://promisedata.org/repository}.}
comes from
a study funded by the Space
Station Freedom Program.
$NASA93$ contains data from six  different NASA centers including the Jet Propulsion
Laboratory.
For details on this data, see the appendix. In terms of conclusion instability across 
data subsets, the important feature to note is that our data
comes from two sources with  demonstrably  different 
properties.
 In 20 repeats of 90\% sampling
of the data, the coefficients learned by linear regression for $NASA93$ were found to have a much large variation than
$COC81$~\cite{me05a}.

Our data is in the COCOMO-I format~\cite{boehm81}.
Shepperd~\cite{shepperd07} notes that COCOMO
requires the domain to be expressed in
COCOMO terms. Numerous researchers (e.g. \cite{shepperd97,li07})  advocate {\em case-based reasoning}
methods that can accept domain artifacts expressed in any terminology.
We use COCOMO-I since unlike other effort estimators such as as PRICE-S~\cite{park88}, 
SLIM~\cite{putnam92}, or SEER-SEM~\cite{jensen83}, COCOMO is a public domain 
model with published data and baseline results~\cite{chulani99}. 

In 2000, 
Boehm et.al. updated COCOMO-I model~\cite{boehm00b}. After the update, numerous features 
remained the same:
\bi
\item
Effort is assumed to be exponential on model size.
\item
Boehm et.al. recommends a procedure called
{\em local calibration} (described below)
for tuning generic COCOMO to a local situation.
\item
Boehm et.al. advises effort estimates can be improved via {\em manual stratification} (described later);
i.e. use domain knowledge to select relevant past data.
\ei

At the 2005 COCOMO forum, there was some 
discussions about relaxing the security restrictions around the COCOMO-II data
set. To date, those discussions have not progressed.
Since other researchers do not have 
access to COCOMO-II, this paper will only report
results from COCOMO-I.


\newcommand{\nope}{\ding{56}}
\newcommand{\yupe}{\ding{52}}
\begin{figure*}
\begin{center}
\scriptsize
\begin{tabular}{r|r@{~=~}l|l|l|p{1.5in}}
established\\
practice & method& name                  & row pruning & column pruning         & learner \\\hline
\yupe&a    & LC           & \nope          & \nope                      & LC = Boehms's local calibration \\\hline
&b    & COCOMIN +  LC     &  \yupe automatic $O(P^2)$        & \nope                      & local calibration \\\hline
&c    & COCOMIN + LOCOMO + LC   &  \yupe automatic  $O(P^2)$       &  \yupe automatic $O(F{\cdot}log(F) + F)$ &  local calibration \\\hline
&d    &  LOCOMO + LC      &  \nope         & \yupe automatic $O(F{\cdot}log(F) + F)$ &  local calibration\\\hline
\yupe&e    &  Manual Stratification + LC &  \yupe manual  & \nope             &  local calibration \\\hline
&f    & M5pW  + M5p       &  \nope        &  \yupe Kohavi's WRAPPER~\cite{kohavi97} calling M5p~\cite{quinlan92b}, $O(2^F)$ & model trees.\\\hline
&g    & LOCALW  + LC      &  \nope        &  \yupe Chen's WRAPPER~\cite{me06d}  calling LC, $O(2^F)$ & local calibration\\\hline
&h    & LsrW  + LSR       &  \nope        &  \yupe Kohavi's WRAPPER~\cite{kohavi97} calling LSR, $O(2^F)$ & linear regression\\\hline
&i    & NEAREST       &  \yupe automatic $O(P^2)$& \nope &      mean  effort of near neighbors\\\hline
\end{tabular}
\end{center}
\caption{Eight effort estimation methods explored
in this paper.  F=the number of features.}\label{fig:abcd}
\end{figure*}

\subsection{Experimental Procedure}

\noindent
Each of the 19 subsets of $COC81$ and $NASA93$ were expressed
as a table of data $P*F$.
The table stored {\em project} information in
 $P$ rows and
each row
included the {\em actual} development effort. 
In the 19 subsets of $COC81$ and $NASA93$ used in the study,
$20 \le P \le 93$. The upper bound of this range ($P=93$)
is the largest data set's size.
The lower bound of this range ($P=20$) was selected based on
experiments described elsewhere~\cite{me06d}.
For details on these data sets, see the appendix.

The table also has $F$ columns containing the project {\em features} $\{f_1,f_2,...\}$.
The features used in this study come from Boehm's COCOMO-I work (described in the appendix)
and include items such as
lines of code (KLOC), schedule pressure (sced), analyst capability (acap), etc. 

To build an effort model, 
the rows of each table were divided at random into a $Train$ and $Test$ set
(and $|Train|+|Test|=P$). COSEEKMO's different
methods are then applied to the training data to generate a model. This
model was then used on the test set.
In order to compare this study with our work~\cite{me06d},
we use the same test set size as the COSEEKMO
study; i.e. \mbox{$|Test|=10$}.

Effort models were assessed via four evaluation criteria:
\bi
\item $AR$: absolute residual = $abs(actual - predicted)$;
\item $MRE$ : magnitude of relative error;  $\frac{abs(predicted -actual)}{actual}$;
\item $MER$: magnitude of error relative to estimate;
		$\frac{abs(actual - predicted)}{predicted}$;
\ei
Note that, according to conclusion stability, there should be instability in how
methods are ranked by AR,MRE, and MER.

For the sake of statistical validity, the above procedure was repeated 20 times
for each of the 19 subsets of $COC81$ and $NASA93$.
Each time, a different seed has used to generate the $Train$ and $Test$ sets.
Recall that our data came from two different sources (Boehm's 1981 work and NASA in the 1990s). 
Hence, according to conclusion instability, there should be  instability in the conclusions
reached 
from the data from different sources or
across the different randomly selected $Train$ and $Test$  sets. 


\subsection{158 Methods}

COSEEKMO's  158 methods combine:
\bi
\item Some {\em learners} such as standard linear regression, local calibration, and model trees.
\item Various {\em pre-processors} that may prune rows or columns.
\item Various {\em nearest neighbor} algorithms that can be used either as learners or as pre-processors to other learners.
\ei

Note that only some of the learners use pre-processors.
In all, COSEEKMO's methods combine
15 learners without a pre-processor and 8 
learners with 8 pre-processors; i.e. 
\mbox{$15+8*8=79$} combinations in total.

COSEEKMO's methods input project features described using the 
symbolic range {\em very low} to 
{\em extra high}.
Some of the methods map the symbolic range to numerics 1..6\footnote{OMID: lsr?}.
Other methods map the symbolic range into a set of {\em effort multipliers} and
{\em scale factors} developed by Boehm  and shown in the appendix (\fig{effortmults}).
Previously, we have queried the utility of these 
effort multipliers and scale factors~\cite{me06d}. COSEEKMO hence executes its 79 methods
twice: once using Boehm's values, then once again using
perturbations of those values. Hence, in all, COSEEKMO contains $2*79=158$ methods.

\subsection{Brief Notes on 8 Methods}
There is insufficient space in this paper to describe the 158 methods
(for full details, see~\cite{jalali07}).
Such a complete
description would be pointless since, as shown below, most of them  are beaten by a very
small number of 
preferred methods. For example, our previous concerns regarding the effort multipliers and
scale factors proved unfounded (so at least half the runtime of COSEEKMO is wasted CPU).

Hence, this paper focuses on the
eight   methods $(a,b,c,d,ef,g,h,i)$ of \fig{abcd}. Four of these ($(a,b,c,d)$)
are our preferred methods while the other four 
comment on premises of some prior publications~\cite{me05c}.


One way to categorize \fig{abcd} is by their relationship to accepted practice
(as defined in the COCOMO texts~\cite{boehm81,boehm00b}).
Methods $(a,e)$ are endorsed as best practice in the COCOMO community.
The others are our attempts to do better than current established practice using
e.g. intricate learning schemes or intelligent  data pre-processors.

Method $f$ is an example of a more intricate learning schemes. 
Standard linear
regression assumes that the data can be fitted to a single model.
On the other hand, the model trees used in $f$~\cite{quinlan92b} permit the
generation of multiple models (as well as a 
a decision tree for selecting the appropriate model).

As to intelligent data pre-processor,
COSEEKMO's pre-processors prune {\em irrelevant} projects and features.
After pruning, the learner executes on a new table $P'*F'$
where $P' \subseteq P$ and $F' \subseteq F$.
Two pre-processors are defined: row pruners and column pruners.
There are two groups of row pruners in COSEEKMO:
\bi
\item Manual Stratification uses domain knowledge to select past, most-relevant projects.
\item NEAREST and LOCOMO~\cite{jalali07} are automatic and use nearest neighbor methods on 
the $Train$ set to find the $k$ most relevant projects to generate predictions for the projects in the $Test$ set. 
\ei
The core of both automatic algorithms is a distance measure that must compare all pairs of projects.
Hence, these automatics methods take time $O(P^2)$. 

Both NEAREST and LOCOMO learn an appropriate $k$ from the $TRAIN$ set and the $k$ with the lowest
error is used when processing  the $TEST$ set. 
NEAREST averages the effort associated with the $k$ nearest neighbors 
while LOCOMO passes the $k$ nearest neighbors to Boehm's local calibration (LC)
method.

Column pruners fall into two groups:
\bi
\item WRAPPER and LOCALW are very thorough search algorithms that explore
subsets of the features, in no set order. This search takes time $O(2^F)$. 
\item COCOMIN~\cite{baker07} is far less thorough.
COCOMIN is a near linear-time pre-processor that selects the features
on some heuristic criteria and does not explore all subsets of the features.
It runs in $O(F{\cdot}log(F))$ for the sort and $O(F)$ time for the exploration of selected features.
\ei
Each method may use a column or row pruner or, as with $(a,i)$, no pruning at all.
In methods $(f,h)$, the notation
M5pW and LsrW denotes a WRAPPER that uses M5p or LSR as its target
learner (respectively).

For more details on these eight methods, see the appendix.

\begin{figure}[!t]
\scriptsize
\begin{tabular}{|p{.95\linewidth}|}\hline
Results using random $seed_1$:

\includegraphics[width=3.4in]{LossesColumnsBoth-AR-Run2.pdf}\\\hline

Results using random $seed_2$:

\includegraphics[width=3.4in]{LossesColumnsBoth-AR-Run1.pdf}\\\hline


Results using random $seed_3$:

\includegraphics[width=3.4in]{LossesColumnsBoth-AR-Run3.pdf}\\\hline\end{tabular}
\caption{U-tests using  AR. repeated three times with different
random seeds.}
\label{fig:resultsBoth-AR-Run1}
\end{figure}


\begin{figure}[!t]
\includegraphics[width=3.4in]{LossesColumnsBoth-MRE-Run1.pdf}
\caption{U-tests using MRE.}
\label{fig:resultsBoth-MRE-Run1}
\end{figure}

\begin{figure}[!t]
\includegraphics[width=3.4in]{LossesColumnsBoth-MER-Run1.pdf}
\caption{U-tests using MER.}
\label{fig:resultsBoth-MER-Run1}
\end{figure}

\section{Results}

\noindent
Figures \ref{fig:resultsBoth-AR-Run1}
\ref{fig:resultsBoth-MRE-Run1} and
\ref{fig:resultsBoth-MER-Run1} shows results from
20 repeats of:
\bi
\item Dividing some subset into $Train$ and
$Test$ sets; 
\item
Learning an effort model from the $Train$ set using COSEEKMO's 158 methods; 
\item
Applying that model to the $Test$ set; 
\item
Collecting performance statistics
from the $Test$ set using AR or MER or MRE; 
\item
Ranking the performance
results from different
methods; 
\item
Applying the Mann-Whitney U-test to find which method(s) had the
largest median ranks.
\ei
\noindent
In these results:
\bi
\item Conclusion instability due to 
{\em changing evaluation criteria} can be detected by
comparing  results across 
\fig{resultsBoth-AR-Run1}
\fig{resultsBoth-MRE-Run1} and
\fig{resultsBoth-MER-Run1};
\item Conclusion instability due to {\em changing subsets} can be detected by comparing
results across 
different subsets generated by changing the random seed controlling $Train$ and
$Test$ set generation; e.g. the  three  
runs of \fig{resultsBoth-AR-Run1} that used different random seeds.
\ei

Each mark on these plots shows the 
number of times a method loses in seven $COC81$ subsets (left plots) and
twelve $NASA93$ subsets (right plots). 
The x-axis shows results from the methods
$(a,b,c,d,e,f,g,h,i)$ described in  \fig{abcd}. 

In these plots, methods that generate {\em lower} losses are {\em better}.
For example,
the top-left plot of 
\fig{resultsBoth-AR-Run1}
shows results for ranking methods applied to $COC81$ using AR. 
In that plot, all method $(a,d)$ results from the seven $COCO81$ subsets
can be seen at $y=losses=0$. That is,
in that plot,
these two methods {\em never}  losses against the other 158 methods.

In a result consistent with the Foss/Myrtveit findings, there
are some instabilities in our results. For example, the exemplary
performance of methods $(a,d)$
in the top-left plot of
\fig{resultsBoth-AR-Run1} does {\em not} repeat in other plots.
For example in the the $NASA83$ MRE and MER
results shown in 
\fig{resultsBoth-MRE-Run1} and
\fig{resultsBoth-MER-Run1}, method $b$ losses much less than methods $(a,d)$.

However, in terms of number of losses generated by methods $(a,b,c,d,e,f,g,h)$, the following
two results holds across all evaluation criteria and all subsets:
\be
\item
One member of  method $(a,b,c,d)$ always performs better (losses least)
than all members of methods
$(e,f,g,h)$. Also, all member of methods $(e,f,g,h)$ perform better than
$i$.
\item
Compared to 158 methods, one member of $(a,b,c,d)$
always losses at some rate very close to zero.
\ee


As observed by Foss/Myrtveit, there is no
single universal $best$ method.
Nevertheless, out of 158 methods, there are 154 clearly inferior methods.
Hence, we recommend ranking methods  $(a,b,c,d)$ on
all
the available historical data, then applying the best ranked method
to estimate new projects.


\section{Discussion}

The superiority of  $(a,b,c,d)$ is a ringing endorsement of Boehm's 1981 estimation
research. 
These four methods
are based around Boehm's preferred
method for calibrating generic COCOMO models to local data. 
Method $a$ is Boehm's {\em local calibration}
(or LC) procedure (defined in the appendix).
Methods $b$ and $c$ augment LC with pre-processors performing
simple row and column pruning
(and method $d$ combines both $b$ and $c$).
LC 
contains all Boehm's parametric
assumptions about effort estimation; i.e. that it
is exponential on lines of code, and that
the features influence the effort by a set of pre-defined constants that can be taken
from Boehm's textbooks~\cite{boehm81,boehm00b}. 

Our results endorse some of Boehm's estimation modeling work, but not all of it.
Method $e$ is manual stratification- a commonly recommended method
in the COCOMO literature. As shown above, method $e$ always losses
more than one of $(a,b,c,d)$. Hence, contrary to the COCOMO literature,
we recommend replacing manual stratification with automatic methods.

Our results argue that there is little added value in
methods $(f,g,h)$.
This is a useful result since these methods contain some of our slowest algorithms:
\bi
\item In one test,
model tree generation in method $f$ consumed 35\% of the CPU, just for that one method.
\item
The WRAPPER column selection method used in $(f,g,h)$  
is an elaborate heuristic search through, potentially, all combinations of the
columns. 
\ei
The failure of model trees in method $f$ is also interesting. 
If the model trees of method $f$ had out-performed $(a,b,c,d)$, that would
have suggested that effort is a multi-parametric phenomenon where, e.g.
over some critical size of software, different effects emerge.
This proved not to 
be the case, endorsing 
Boehm assumption
that effort can be modeled as a single parametric log-linear equation.


Of all the methods in \fig{abcd}, 
$(a,b,c,d)$ perform the best and $i$ performs the worst.
Once distinguishing feature of method $i$ are the {\em assumptions}
it makes about the domain. 
In terms of parametric assumptions, $(a,b,c,d)$ make the most 
and method $i$ makes the least.
The NEAREST neighbor method $i$  is {\em assumption-less}
since they do not even assume that the model is parametric. 

While
assumption-less, NEAREST is not {\em assumption-free}.
NEAREST (and the LOCOMO algorithm used in method $b$)
 use a simple n-dimensional Euclidean distance to find similar projects.  
Wilson \& Martinez report that this measure assumes that the training data is not
sparse~\cite{wilson97a}. Such spare data sets arise when many of the
values of project features are unavailable. All our data sets are non-sparse
(we have zero unknown values).
Shepperd \& Schofield argue that their case-based reasoning methods, like NEAREST
procedure used in method $i$,
are better suited to sparse data domains where precise numeric
values are {\em not} available on all factors~\cite{shepperd97}.
All our data is non-sparse. Hence, it is not surprising that method $i$ performs
poorly on our data.

%
%\fig{resultsBoth-AR-Run1} shows the number of $losses$ found by Mann-Whitney U test (at a 95\% confidence level).
%The evaluation criterion used for this test was AR, or the absolute residual\footnote{AR = $abs(actual-predicted)$.}, 
%values. All 19 subsets of $NASA93$ and $COC81$ were used in order to compare each of the eight methods in \fig{abcd} 
%to all methods in COSEEKMO. Mann-Whitney U test is particularly useful for such large scale comparisons (each method against
%all 157 methods) since it makes no assumptions that the populations being compared are of equal size.
%(The results in this section are based on Mann-Whitney U test over generated AR values. Other measures such as
%MRE, etc. also produce the same or very similar stable conclusions as presented in this section. 
%AR was selected as the evaluation criterion since it is the simplest\footnote{Attention Tim. This needs some more info}. For a discussion
%about stability of conclusions using other evaluation critria refer to {\em Biases in the evaluation criteria} in the External Validity section).
%
%In a result consistent with the cautions of Myrtveit et.al. (that different subsets lead to 
%different conclusions), the pattern of losses in $NASA93$ is different to $COC81$:
%\bi
%\item
%In $COC81$ plot (\fig{resultsBoth-AR-Run1}), the LOCOMO row pruning method followed by LC 
%learner, $(d)$, does better than all other methods.
%\item
%But in $NASA93$ plot (\fig{resultsBoth-AR-Run1}), the COCOMIN column pruning method followed 
%by LC learner, $(b)$, does better than all other methods. 
%\item
%Boehm's standard practices have different performances in $COC81$ vs. $NASA93$. 
%Method $(a)$ peforms better than $(e)$ in $COC81$ but performs slightly worse in $NASA93$.
%Although in both data sets the standard practices, $(a,e)$, defined by Boehm 26 years ago, 
%are not the best methods, they outperform supposedly more sophisticated methods $(f,g,h)$.
%\ei
%
%While the loss patterns are different in the two subsets, certain patterns repeat:
%\bi
%\item
%There is value-added in Boehm's effort estimation assumptions.
%With only one exception, the {\em strong assumption} methods, $(a,b,c,d,e,g)$ 
%do better than the others. The only exception is $(g)$ which does 
%worse than $(f,h)$ in $NASA93$.\footnote{Attention Tim. 
%This difference about NASA93 can be mentioned above when we mention the differences. method f is in general good in NASA93.}
%\item

\section{External Validity}
The case was made above that our conclusions are valid across different evaluation
criteria and samplings.
However, no empirical evaluation is bias free.
Some biases remain including a sampling bias,
a paradigm bias, a modeling bias, and a bias in our selection of methods.

{\em Sampling bias:}
Our model-based estimation methods use data
and so are only useful in organizations that maintain historical
data on their projects. Such data collection is rare in organizations
with low process maturity. However, it is common elsewhere; e.g. amongst government
contractors whose contract descriptions include process auditing requirements.
For example, it is common practice at NASA and the United States Department of Defense
to require a model-based estimate at each project milestone.
Such models are used to generate estimates or to double-check an expert-based
estimate.

Another source of sampling bias was already mentioned above;
our data sets are non-sparse and sparse data sets may
be more suitable for nearest neighbor tools.
Yet another source of bias is that some of the data used here comes from NASA and NASA
works in a particularly unique market niche. Nevertheless, we argue
that results from NASA are relevant to the general software engineering
industry. NASA makes extensive use of contractors.
These contractors service many other industries. 
These contractors are contractually
obliged (ISO-9001) to demonstrate their understanding and usage of
current industrial best practices. 
For these reasons, other noted researchers such
as Basili, Zelbowitz, et.al.~\cite{basili02} have argued that
conclusions from NASA data are relevant to the general software
engineering industry.


{\em Biases in the paradigm}: The rest of this paper explores more
model-based methods (e.g. COCOMIN, LOCOMO, LC) and not expert-based methods. 
Model-based methods use some algorithm to summarize old data and make predictions 
about new projects.
Expert-based methods use human expertise (possibly augmented with process guidelines
or checklists) to generate predictions.  
Jorgensen~\cite{jorg04} argues that most industrial effort estimation is expert-based
and lists 12 {\em best practices} for such effort-based estimation.
The comparative evaluation of model-based
vs expert-based methods must be left for future work. Before
we can compare any effort estimation
methods (be they model-based or expert-based) we must first demonstrate that any two
methods can be comparatively assessed.
For more on expert-based methods, see~\cite{jorg04,jorgensen04,shepperd97,chulani99}.

{\em Biases in the model:} This study uses COCOMO data sets since 
that was the only public domain data we could access.  
Nevertheless, the techniques described here can easily be generalized to other models. For example,
here we use COSEEKMO to select best parametric methods in the
COCOMO format~\cite{boehm81,boehm00b} but it could just as easily
be used to assess other model-based tools like
PRICE-S~\cite{park88}, SEER-SEM~\cite{jensen83}, 
or SLIM~\cite{putnam92}.
However, it should be noted that in the above study, 154 out of 158 methods were demonstrably
inferior.
If those percentages carry over to a study of SEER-SEM vs PRICE-S vs SLIM, then
we would predict that will yield similar performance.

{\em Biases in the selection of methods:} Another source of bias in
this study is the set of methods explored by this study.
We can make no claim that \fig{abcd} or COSEEKMO's other 150 methods
represents the space of possible
effort estimation methods. 
Indeed, when we review the space of known methods
(see Figure~1 in~\cite{myrtveit05}), it is clear
that COSEEKMO covers only a small part of that total space. 

Instead of claiming that $(a,b,c,d)$ are ``best'', 
we really should say that $(a,b,c,d)$ are the best we have seen so far after four years
of trying many alternatives. Nevertheless, the reader
may know of other effort estimation methods they believe we should try.
Alternatively, the reader may have a design or an implementation of a new kind
of effort estimator. In either case, before it can be shown that an existing or new method
is better than those shown in \fig{abcd},
we first need a demonstration that there
exists statistical tests that distinguish between methods.
This paper offers such a demonstration.


\section{Conclusion}
needs work
\bi
\item
A single linear model is adequate for the purposes of effort estimation.
All the methods that assume multiple linear models, such as model trees $(f)$,
or no parametric form at all, such as nearest neighbor $(i)$, perform
relatively poorly.
\item
Elaborate searches do not add value to effort estimation.
All the $O(2^F)$ column pruners $(f,g,h)$ do worse than near-linear-time
column pruning $(b,c)$. We attribute the success of the near-linear-time pruners
to the pre-sort criteria. COCOMIN uses LC to pre-sort the features.
Repeating the above remarks regarding strong assumptions,
Boehm's effort estimation assumptions embodied in LC are very effective modeling heuristics.
\item
The slower implementations such as model trees $(f)$ never do no better than methods $(a,b,c,d)$. 
In fact, methods $(a,b,c,d)$ always outperform methods $(f,g,h)$.
\ei
Combining the last two points, we observe that COSEEKMO can run much faster if it ignores $O(2^F)$ 
column pruning methods $(f,g,h)$ and model trees $(f)$.

As to the ``best'' method for effort estimation:
there is no clear ``best'' method. We advise that the following methods should be tried
and the one that does best on the available training data (assessed using Mann-Whitney U Test)
should be used to predict new projects:
\bi
\item Make strong assumptions and use LC-based methods.
\item Row and column pruning can be useful but elaborate column pruning (requiring
an $O(2^F)$ search), is not useful. 
Hence, try LC with zero or more of LOCOMO's row pruning or COCOMIN's column
pruning; i.e. $(a,b,c,d)$.
\item If the training data is sparse, then nearest neighbor method $(i)$ 
should at least be tried as well~\cite{shepperd97}.
\ei
The reader may be saying "hey! what about!". Note that before we can explore new methods,
or declare that some new methods is superior to the set explored here, the conclusion
instability problem must be solved.

another method is to handle the infrequent and large outliers with a different algorithm, experiments with bagging ad boosting very disappoint, bu that work contues.

Definitions of best practices are constantly changing (e.g. structured programming, 
object-oriented, value-based ,etc). There are few stable conclusions in software engineering. 
One such stable conclusion is Boehm's 1981 modeling work on COCOMO-I. Despite four
years of trying to do better with a large suite of modern data mining methods,
the best we can do is to augment his standard 1981 local calibration procedure
with some simple near-linear time column pruning and quadratic-time $(O(P^2)$ row pruning.
\footnote{Attention Tim. The new methods usually do better than LC by itself.}

The Mann-Whitney U test has shown that most of the methods we have explored
have similar performance that is inferior to a small number of simple techniques.
The conclusion of this study is that there may be little value in
exploring yet more small variants to current data miners. Rather,
we need to explore radically different methods for supporting managerial 
decisions about software projects.


we have found stbilty in efort estimation - an effect that prior TSE publications
would suggest is impossible.

In personal communication with Shepperd\footnote{Attention Tim. This needs citation}  PRED(30)\footnote{PRED(N) is
the percent of the mean MRE less than N\%.}, another common measure, was identified as
unreliable as well. PRED(30) only shows how good the results are and ignores how bad 
the method may perform (i.e. outliers). PRED(30) is also not explored in studies by Foss et.al. 
and Myrtveit et.al. 

we did not use pred(30). we are unaware of any serious advocate of pred(30). also, it is
blind to the outlier problem that has become very important in our thinking about effort estimation.

proved stability across some evaluation criteria, not all. there might well
exist other criteria, not mentioned above, that offered very different rankings to figures 2 3 4.
However, prior to this article, based on the work of Foss et.al. and Myrtveit et.al, we
were afraid that {\em every} evaluation criteria gives a different ranking.

our results don't refute Foss/Myrtveit. Rather, we identified classes of evaluation
criteria that will/won't exhibit conclusion instability.

 exhibit conjecture that our
results are so stable since (a)~Mann-Whitney U test is less
susceptible to the conflating effects of outliers; (b)~the Mann-Whitney U
test simplifies and clarifies comparative assessment;
and (c)~the numerous assumptions inherent in a COCOMO data set impose 
enough regularities to make clear conclusions.


XXX\footnote{Attention Tim}

\appendix
\subsection{Data used in this Study}

In this study, effort estimators were built using all or some {\em part} of data from two sources:
\bdd
\item[{\em $COC81$:}]~~63 records in the COCOMO-I format. Source: \cite[p496-497]{boehm81}. 
Download from \url{http://unbox.org/wisp/trunk/cocomo/data/coc81modeTypeLangType.csv}.
\item[{\em $NASA93$:}]~~~93 NASA records in the COCOMO-I format.  
Download from \url{http://unbox.org/wisp/trunk/cocomo/data/nasa93.csv}.
\edd

Taken together, these two sets are the largest
COCOMO-style data source in the public domain (for reasons
of corporate confidentiality, access to Boehm's COCOMO-II data set is
highly restricted).
$NASA93$ was originally collected to create a NASA-tuned version of
COCOMO, funded by the Space Station Freedom Program and 
contains data from six NASA centers including the Jet Propulsion
Laboratory For more details on this dataset, see~\cite{me06d}.

Different subsets and number of subsets used (in parenthesis) are:
\bdd
\item[{\em All(2):}]~selects all records from a particular source.
\item[{\em Category(2):}]~~~~~~~$NASA93$ designation selecting the type of project; e.g. avionics.
\item[{\em Center(2):}]~~~~~$NASA93$ designation selecting records relating to where the software was built.
\item[{\em Fg(1):}]~$NASA93$ designation selecting either ``$f$'' (flight) or ``$g$'' (ground) software.
\item[{\em Kind(2):}]~~$COC81$ designation selecting records relating to the development platform; e.g. max is mainframe.
\item[{\em Lang(2):}]~~~$COC81$ designation selecting records about different development languages; e.g ftn is FORTRAN.
\item[{\em Mode(4):}]~~~designation selecting records relating to the COCOMO-I development mode: one of semi-detached, embedded, organic.
\item[{\em Project(2):}]~~~~~$NASA93$ designation selecting records relating to the name of the project.
\item[{\em Type(0):}]~~selects different $COC81$ designations and include ``bus'' (for business application) or ``sys'' (for system software).
\item[{\em Year(2):}]~~is a $NASA93$ term that selects the development years, grouped into units of five; e.g. 1970, 1971, 1972, 1973, 1974 are labeled ``1970''.
\edd
There are more than 19 subsets overall. Some have fewer than 20 projects and 
hence were not used (i.e. {\em Type}). The reasons are explained in~\cite{me06d}.

\subsection{Learners}
\subsubsection{Learning with Linear Regression}

Linear regression assumes that the data can be
approximated by one linear model that includes lines of code
(KLOC) and other features $f$ seen in a software development project:
\[
effort = \beta_0 + \sum\beta_i\cdot f_i
\]
Linear regression adjusts $\beta_i$ to minimize the {\em prediction error} (the
difference between predicted and actual values for the project).

Boehm argues that effort is exponential on KLOC~\cite{boehm81}:
\[
effort = a \cdot KLOC^b  \cdot \prod_i\beta_i
\]
(where a,b are domain-specific constants). Such exponential
functions can be learned via linear regression after they are
converted to the following linear form:
\footnote{Attention Tim. We need to finish our discussion about logging. Also, we need to decide whether to use non-logged graphs.}
\[
log(effort)=log(a)+b{\cdot}log(KLOC)+\sum_ilog(\beta_i)
\]
All our methods transform the data in this way. After predictions are generated
for the $Test$ set, they are unlogged prior to computing performance statistics. 

\subsubsection{Learning with Model Trees}

Model trees are a generalization of linear regression.
Instead of fitting the data to {\em one linear model}, 
model trees learn {\em multiple linear models}, and a
decision tree that decides which linear model to use.
Model trees are useful when the projects form regions and
different models are appropriate for different regions.
COSEEKMO includes the M5p model tree learner defined by Quinlan~\cite{quinlan92b}.

\begin{figure}
\begin{center}
{\scriptsize
\begin{tabular}{l|r@{:~}l|}\cline{2-3}
upper:   &acap&analysts capability\\
increase &pcap&programmers capability\\
these to &aexp&application experience\\
decrease &modp&modern programming practices\\
effort   &tool&use of software tools\\
         &vexp&virtual machine experience\\
         &lexp&language experience\\\cline{2-3}
middle   &sced&schedule constraint\\\cline{2-3}
lower:   &data&data base size\\
decrease &turn&turnaround time\\
these to &virt&machine volatility\\
increase &stor&main memory constraint\\
effort   &time&time constraint for cpu\\
         &rely&required software reliability\\
         &cplx&process complexity\\\cline{2-3}
\end{tabular}}
\end{center}
\caption{{
Features used in this study. From~\cite{boehm81}. Most
range from 1 to 6 representing ``very low'' to ``extremely high''.
}}\label{fig:em}
\end{figure}

\begin{figure}
\begin{center}
{\scriptsize
\begin{tabular}{|l|c|r@{~}|r@{~}|r@{~}|r@{~}|r@{~}|r|}
    \hline
    &&1&2&3&4&5&6\\
    \hline
upper&ACAP   &1.46   &1.19   &1.00   &0.86   &0.71   &\\
(increase&PCAP   &1.42 &1.17   &1.00   &0.86   &0.70 &\\
these to&AEXP   &1.29 &1.13   &1.00   &0.91   &0.82   &\\
decrease&MODP   &1.2  &1.10 &1.00 &0.91 &0.82 &\\
effort)&TOOL   &1.24 &1.10 &1.00 &0.91 &0.83 &\\
&VEXP   &1.21 &1.10 &1.00 &0.90 &  &\\
&LEXP   &1.14 &1.07 &1.00 &0.95 &  &\\\hline
middle&SCED   &1.23 &1.08 &1.00 &1.04 &1.10 &  \\\hline
lower&DATA   &    & 0.94 &1.00 &1.08 &1.16&\\
(increase&TURN   &       &0.87   &1.00   &1.07   &1.15   &\\
these to&VIRT   &       &0.87   &1.00   &1.15   &1.30   &\\
increase&STOR   &       &       &1.00   &1.06   &1.21   &1.56\\
effort)&TIME   &  &    &1.00   &1.11   &1.30   &1.66\\
&RELY   &0.75& 0.88& 1.00 & 1.15 & 1.40&\\
&CPLX   &0.70 &0.85 &1.00 &1.15 &1.30 &1.65\\
    \hline
\end{tabular}}
\end{center}
\caption{The COCOMO-I $\beta_i$ table~\cite{boehm81}. 
For example, the bottom right cell is saying that if CPLX=6, then
the nominal effort is multiplied by 1.65.}\label{fig:effortmults}
\end{figure}

\subsubsection{Learning with Local Calibration}

Local calibration (LC) is a specialized form
of linear regression developed by Boehm~\cite[p526-529]{boehm81}.
LC assumes project effort is exponential on KLOC; i.e. 
\[
effort = a \cdot KLOC^b  \cdot \prod_i\beta_i
\]
\fig{effortmults} shows the $\beta_i$ values
recommended by Boehm (the names on the left hand side are defined in \fig{em}).
When $\beta_i$ is used in the above equation, they yield estimates in months 
where one month is 152 hours (and includes development and management hours).
To operate, LC linearizes the exponential equation to generate 
\[
log(effort)=log(a)+b{\cdot}log(KLOC)+\sum_ilog(\beta_i)
\]
Linear regression would try to adjust all the $\beta_i$ values.
This is not practical when training on a very small number of projects.
Hence, LC fixes the $\beta_i$ values while adjusting the $<a,b>$ values to minimize
the prediction error. We shall refer to LC as ``standard practice'' since, in the COCOMO community
at least, it is the preferred method for calibrating standard COCOMO data~\cite{boehm00b}.

\subsubsection{Learning with Nearest Neighbor}
Nearest neighbor makes predictions using past data that is
similar to a new situation. Some distance measure is used to find the $k$ nearest
older projects to each project in the test set. An effort estimate
can be generated from the mean effort of the $k$ nearest neighbors.

The benefit of nearest neighbor algorithms is that they
make the fewest domain assumptions. That is, they can process a 
broader range of the data available within projects. For example:
\bi
\item LC cannot be applied unless projects are described using the COCOMO ontology (\fig{em}). 
\item Linear regression and model trees are best applied to data where most of the values for the numeric factors are known.
\ei

The drawback of nearest neighbor algorithms is that, sometimes, the domain assumptions they ignore are important
to that domain. For example, if effort is really exponential on KLOC, a standard nearest neighbor algorithm
has no way to exploit that.

\subsection{Pre-Processors}

\subsubsection{Pre-processing with Row Pruning}
Project data collected in one context may not be relevant to another.
Kitchenham et.al.~\cite{kitch07} take great care to document this effect.
They conducted a systematic review of ten projects comparing estimates using 
historical data {\em within} the same company or {\em imported} from another. In no
case was it better to use data from other sites, and sometimes
importing such data yielded significantly worse estimates.

Hence, given an a table of $P$ projects, it is good practice to
prune away rows relating to irrelevant projects. An often repeated
result in effort estimation is that the careful selection of project
data can improve the effectiveness of estimation. Similar projects
have less variation and so can be easier to calibrate. 
Chulani et.al.~\cite{chulani99} report that row pruning
can improve their PRED values anywhere between 5\% to 12\%.
Shepperd and Schofield report even more dramatic improvements in
PRED from 4\% to 44\%~\cite{shepperd97}.

Row pruning can be {\em manual} or {\em automatic}.
In {\em manual row pruning} (also called ``stratification'' in the COCOMO
literature~\cite{boehm00b}), an analyst applies their domain knowledge
to select project data that is similar to the new project to be
estimated.

Unlike other methods, the manual stratification uses different subsets to create $Train$ sets.
\bi
\item In every case except the for the manual stratification, $Train$ and $Test$ sets are created from the 
same subsets of $NASA93$ or $COC81$.
\item In manual stratification, the $Test$ set is created in the same manner from the subsets. 
However, the $Train$ set is created from the projects drawn from the $NASA93$ or $COC81$ and not their subsets.
\ei

{\em Automatic row pruning} uses algorithmic techniques to select a subset
of the projects (rows). The LOCOMO tool~\cite{jalali07} in COSEEKMO
is a row pruner that combines a nearest neighbor method
with LC. LOCOMO prunes away all projects except those $k$ ``nearest'' to the $Test$ data.

To learn an appropriate value for $k$, LOCOMO uses the $Train$ set as follows:
\bi
\item For each project $p_0\in Train$, LOCOMO sorts the remaining 
$Train - p_0$ examples by their Euclidean distance from $p_0$. 
\item LOCOMO then passes the $k_0$ examples closest to $p_0$
to LC. The returned $<a,b>$ values are used to estimate effort for $p_0$.
\item After trying all possible $k_0$ values, $2 \le k_0 \le |Train|$, 
$k$ is then set to the $k_0$ value that yielded the smallest mean MRE.
\ei
This calculated value $k$ is used to estimate the effort for projects in the
$Test$ set. For all $p_1\in Test$, the $k$ nearest neighbors
from $Train$ are passed to LC. The returned $<a,b>$ values are then 
used to estimate the effort for $p_1$. 

\subsubsection{Pre-Processing with Column Pruning}
Kirsopp \& Schofeld~\cite{kirsopp02} 
and Chen \& Menzies \& Port \& Boehm~\cite{me05c}
report that column pruning improves effort estimation.
Miller's research~\cite{miller02} explains why.
Column pruning (a.k.a. feature subset selection~\cite{hall03} or variable
subset selection~\cite{miller02}) reduces the deviation of a
linear model learned by minimizing least squares error~\cite{miller02}.
To see this, consider a linear model with constants $\beta_i$ 
that inputs features $f_i$ to predict for $y$:
\[y = \beta_0 + \beta_1\cdot f_1 + \beta_2\cdot f_2 + \beta_3\cdot f_3 ...\]
The variance of $y$ is some function of the variances in $f_1, f_2$, etc.
If the set $F$ contains ``noise'' (spurious signals unconnected to the target variable $y$)
then random variations in $f_i$ can increase the uncertainty of $y$.
Column pruning methods decrease the number of features $f_i$, thus
increasing the stability of the $y$ predictions. 
That is, the fewer the features (columns), the more restrained are the model predictions.

Taken to an extreme, column pruning can reduce $y$'s variance 
to zero (e.g. by pruning the above equation back to $y=\beta_0$)
but increases model error (the equation $y=\beta_0$ will ignore all project data when generating estimates).
Hence, intelligent column pruners experiment with some proposed subsets $F' \subseteq F$ before
changing that set. COSEEKMO currently contains three intelligent column pruners: WRAPPER, LOCALW, and COCOMIN.

WRAPPER~\cite{kohavi97} is a standard best-first search through the space of possible features. 
At worst, the WRAPPER must search an space exponential on the number of features $F$; i.e. $2^F$.
However, a simple best-first heuristic makes WRAPPER practical for effort estimation.
At each step of the search, all the current subsets are scored by passing them to
a {\em target leaner}. If a set of features does not score better than a smaller subset,
then it gets one ``mark'' against it. If a set has more than $STALE=5$ number of marks, it is deleted.
Otherwise, a feature is added to each current set and the algorithm continues. 

In general, a WRAPPER can use any target learner. Chen's LOCALW is a WRAPPER specialized
for LC. Previously~\cite{me06d,me05c}, we have explored LOCALW for effort estimation.

Theoretically, WRAPPER (and LOCALW)'s exponential time search is more
thorough, hence more useful, than simpler methods that try fewer options. To 
test that theory, we will compare WRAPPER and LOCALW to a linear-time column pruner 
called COCOMIN~\cite{baker07}.

COCOMIN is defined by the following operators:
\[\{sorter, order, learner, scorer\}\]
The algorithm runs in linear time over a {\em sorted} set of features, $F$.
This search can be {\em order}ed in one of two ways:
\bi
\item A ``backward elimination'' process starts with all features $F$ and throws some away, one at a time.
\item A ``forward selection'' process starts with one feature and adds in the rest, one at a time.
\ei
Regardless of the search order, at some point the current set of features $F' \subseteq F$ 
is passed to a {\em learner} to generate a performance {\em score} by applying 
the model learned on the current features to the $Train$ set. 
COCOMIN returns the features associated with the highest score.

COCOMIN pre-sorts the features on some heuristic criteria. Some of
these criteria, such as standard deviation or entropy, are gathered
without evaluation of the target learner. Others are gathered by
evaluating the performance of the learner using only the feature in
question plus any required features, such as KLOC for COCOMO, to
calibrate the model. After the features are ordered, each feature is
considered for backward elimination, or forward selection if chosen,
in a single linear pass through the feature space, $F$. The decision to
keep or discard the feature is based on an evaluation measure
generated by calibrating and evaluating the model with the training data.

The version of COCOMIN used in this study:
\bi
\item sorted the features by the highest median MRE;
\item used a backward elimination search strategy;
\item learned using LC;
\item scored using mean MRE.
\ei

%XXX why these settings?

\bibliographystyle{IEEEtran}
\bibliography{refs}
\end{document}