%XXX why mann whitnet
%XXX why did others not report this earlier
%XXX change conclusion
%XXX walk the reviews looking for comments
%\documentclass[onecolumn,12pt,journal,compsoc]{IEEEtran}
\renewcommand{\baselinestretch}{1.41}
\documentclass[cite,journal,11pt,compsoc]{IEEEtran}
\usepackage{pifont}
\usepackage{times}
\newcommand{\ed}{\end{description}}
\newcommand{\fig}[1]{Figure~\ref{fig:#1}}
\newcommand{\eq}[1]{Equation~\ref{eq:#1}}
\newcommand{\hyp}[1]{Hypothesis~\ref{hyp:#1}}
\newcommand{\tion}[1]{\S\ref{sec:#1}}

\usepackage{rotating}
\usepackage{alltt}
\usepackage{graphicx}
\usepackage{url}
\newcommand{\bi}{\begin{itemize}}
\newcommand{\ei}{\end{itemize}}
\newcommand{\be}{\begin{enumerate}}
\newcommand{\ee}{\end{enumerate}}
\newcommand{\bdd}{\begin{description}}
\newcommand{\edd}{\end{description}}
% IEEE Computer Society needs nocompress option
% requires cite.sty v4.0 or later (November 2003)
\usepackage[nocompress]{cite}
% normal IEEE
 % \usepackage{cite}

% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}

\begin{document}

\title{
Identifying the ``Best'' Software Prediction Models 
Requires a Combination of Methods}

\author{Tim~Menzies,~\IEEEmembership{Member,~IEEE,}
        Omid Jalali,
		Jairus Hihn,
	    Dan Baker,
		and Karen Lum
\thanks{ 
Tim Menzies, Omid Jalali, and Dan Baker
are with the Lane Department of
Computer Science and Electrical Engineering, West
Virginia University, USA: 
\protect\url{tim@menzies.us},
\protect\url{ojalali@mix.wvu.edu}, \protect\url{danielryanbaker@gmail.com}.}
\thanks{Jairus Hihn and Karen Lum at with NASA's Jet Propulsion Laboratory:
\protect\url{jhihn@mail3.jpl.nasa.gov},
\protect\url{karen.t.lum@jpl.nasa.gov}.}
\thanks{
The research described in this paper was carried out at West Virginia University
and the Jet 
Propulsion Laboratory, California Institute of Technology, under a contract with the US National Aeronautics and 
Space Administration. Reference herein to any specific 
commercial product, process, or service by trade name, 
trademark, manufacturer, or otherwise does not constitute 
or imply its endorsement by the US Government.}
\thanks{Download: \protect\url{http://menzies.us/pdf/07stability.pdf}.}
\thanks{
Manuscript received July 31, 2007; revised XXX, XXXX.}}

\markboth{Journal of ???,~Vol.~6, No.~1, January~2007}%
{Menzies \MakeLowercase{\textit{et al.}}: Estimation and Conclusion Stability}

\IEEEaftertitletext{\vspace{-1\baselineskip}
\noindent\begin{abstract}
There exists a large and growing number of proposed estimation methods
but little conclusive evidence ranking one method over another.  
Prior
effort estimation studies suffered from ``conclusion instability'' where the rankings offered to different methods
were not stable across (a)~different evaluation criteria; (b)~different data sources; or
(c)~different random selections of that data.
This paper reports a study of 158 effort estimation methods on data sets based on COCOMO features.
Four  ``best'' methods were detected that were consistency better than the ``rest'' of the other 154 methods.
These rankings of ``best'' and ``rest'' methods were stable across
(a)~three different evaluation criteria applied to (b)~multiple data sets from 
two different sources that were (c)~divided into hundreds of randomly
selected subsets using four different random seeds.
Hence, while there
exists no single universal ``best'' effort estimation method,
there appears to exist a small
number (four) of most useful methods. This result both complicates and
simplifies effort estimation research. The complication is that any
future effort estimation analysis should be preceded by a ``selection
study'' that finds the best local estimator. However, the
simplification is that such a study need not be labor intensive, at
least for COCOMO style data sets.
\end{abstract}\noindent
\begin{keywords}COCOMO, effort estimation, data mining, evaluation
\end{keywords}\vspace{1\baselineskip}}\maketitle
%-------------------------------------------------------------------------

%\begin{tabular}{|p{.9\linewidth}|}\hline
%{\em Note that this is a new submission of a paper 
%previously reviewed by TSE. Based on reviewer feedback,
%much has been changed in this draft. For details of those changes,
%please see the end of this paper.}\\\hline
%\end{tabular}

\section{Introduction}
Software effort estimates are often wrong by a factor 
of four~\cite{boehm81} or even more~\cite{kemerer87}. 
As a result,
the allocated funds may be inadequate to develop the required project.
In the worst case, over-running 
projects are canceled, wasting the entire development effort. 
For example, in 2003, NASA canceled
the CLCS system
after spending 
hundreds of millions of dollars on software development.
The project was canceled after the initial estimate of
\$206 million was increased to between
\$488 million and \$533 million~\cite{clcs03}.
On cancelation, approximately 400 developers
lost their jobs\cite{clcs03}.

While the need for better estimates is clear, there exists
a very large number of effort estimation methods~\cite{jorg04,jorgensen05}
and few studies empirically
compare all these techniques. What is more usual are narrowly focused studies (e.g.
\cite{kemerer87,briand00,lum02,ferens98}) that test, say, linear regression models in different environments.

Unless we can {\em rank} methods and {\em prune}
inferior methods,
we will soon be overwhelmed by a growing number
of (possibly useless) effort estimation methods.
New open source data mining toolkits are
appearing with increasing frequency such as 
the R project\footnote{\url{http://www.r-project.org/}},
Orange\footnote{\url{http://www.ailab.si/orange/}}, and the
WEKA~\cite{witten05}.
All the learners in all these toolkits can be {\em stacked}
by 
{\em meta-learning} schemes 
where the conclusions of one 
data miner influences the next. 
There exists
one stacking for every ordering of $N$ learners; so
five learners can be stacked $5!=120$ ways and ten learners
can be stacked in millions of different ways.

%The abundence of such  tools tempt researchers to over-elaborate their effort estimation tools.
%For example, our COSEEKMO tool~\cite{me06d} takes nearly a 
%day to run its 158 methods.
%In this paper, we show that
%much of that execution is wasted since,
%$\frac{154}{158}$ of those methods are superfluous.
%In the experiments reported below, we show that despite:\bi
%\item two changes of the source of the data;
%\item three changes to the evaluation criteria;
%\item hundreds of random  to the training/test data;
%\ei
%that four methods consistently out-performed the rest.
%To the best of our knowledge, this is the first report of
%stable conclusions in such a comparison of a very large number of
%effort estimation methods.
%\subsection{Contributions}
% Can the new generation of
%data miners offer better estimates than traditional methods?
%Previously, we argued that this is the case 
%(\cite{me04h,me05a,me05c,me05d,me06d,me06e,me06f}). However,
%the experiments reported in this paper  have forced a revision of that view.
%
%Potentially,
%data mining methods can handle modeling tasks not suitable for 
%standard linear regression.
%For example:
%\bi
%\item {\em Model trees}~\cite{quinlan92b} applies regression
%to parts of the data, then builds a decision tree to decide which
%part is most relevant.
%This
%is useful if 
%estimations are  different for (say)
%very large
%or very small software systems. 
%\item 
%{\em Column pruning} 
%reduces estimate variance~\cite{miller02}
%by culling  noisy data resulting from
%poor collection practices.
%\item {\em Row pruning}
%methods
%build estimates from only relevant cases taken from some historical log.
%Like column pruning, row pruning can improve estimation by discarding extraneous
%influences that conflate the estimation and, potentially, increase the
%variance on the estimate.
%\ei
%Various research results \cite{kirsopp02,me06d}
%advocate 
%combing these methods.
%Do those results invalidate
%other papers that do not use multiple methods
%that combine, say,  row and/or column pruning?
%For example,
%Lokan \& Mendes~\cite{lokan06} do not apply column pruning
%or row pruning, even though they report
%large deviations in effort estimates\footnote{
%In Tables 11 \& 12 in \cite{lokan06} the reported effort estimates include
%two results with
%  efforts of  $\{\mu=4,704,\sigma=6,717\}$  
%and $\{\mu=4,310,\sigma=10,730\}$. Note that in both cases,
%the deviation is {\em larger} than the mean. This is a clear indicator
%of ``noisy'' data that could be enhanced via column or row pruning.}.
%
%To be fair to Lokan \& Mendes,
%we hasten to say that 
%they were following accepted practice
%in the field of effort estimation.
%Many 
%publications
%in this field
%only use  one estimation method- often,
%some variant on linear regression (e.g. see
%\cite{kemerer87,briand00,lum02,ferens98} and many of the 
%papers from the annual USC COCOMO forum).
%Also, 
%subsequently, Mendes did use multiple methods for effort 
%estimation~\cite{mendes07}. 
%
%Also,
%it is impractical to 
%try {\em all} methods. 
%Data mining toolkits are
%appearing with increasing frequency;
%e.g. R project\footnote{\url{http://www.r-project.org/}},
%Orange\footnote{\url{http://www.ailab.si/orange/}}, and the
%WEKA~\cite{witten05}.
%The learners in these toolkits can be
%{\em stacked} via 
%{\em meta-learning} schemes
%where
%the conclusions of one 
%data miner influences the next. 
%$N$ learners can be stacked in $N!$ ways so
%five learners can be stacked $5!=120$ ways and ten learners
%can be stacked millions of ways.
%
%
%Nevertheless, the research community should agree on
%acceptable boundaries on experimentation. If one method is too few and
%all methods are 
%too many, how many methods are enough?

Prior attempts to rank and prune different methods 
have been inconclusive.
Shepperd and Kadoda~\cite{shepperd01} 
compared the effort models learned 
from a variant of regression, rule induction, case-based
reasoning (CBR), and neural networks. 
Their results exhibited much {\em conclusion instability}
where the performance results:
\bi
\item
Differed markedly across different data sets;
\item
Differed markedly when they repeated their runs using different random seeds.
\ei
Overall, while
no single best method was ``best''
they found weak evidence that two methods were generally
inferior (rule induction and neural nets)~\cite[p1020]{shepperd01}.

%XXX prune intro . start at can't try all methods.
%XXX bring back old text

%XXX p2, column1 half way- jumps to just CBR. why?

%XXX p3: two "available" too close together
%XXX p4: col [rune before row prune. add see appendix for more details
%XXX check all references to conclusion instability
%XXX p5: move % from RE
%XXX we use mann whitney cause we are com[acing pruned to non-pruned


The genesis of this paper were two observations suggesting that it might be worthwhile revisiting the
Shepperd \& Kadoda
results. Firstly, our  
data sets are expressed in terms of the COCOMO features~\cite{boehm81}.
These features were selected by
by Boehm (a widely-cited and experienced researcher with much industrial experience; 
e.g. see \cite{boehm86}) and subsequently tested
by a large  
research and industrial community (since 1985, the annual COCOMO forum has met
to debate and review the value of those features).
Perhaps, we speculated, conclusion
instability might be tamed by the use of better features.

Also, there is an interesting quirk in 
their experimental results.
Estimation methods can
{\em prune} tables of training data:
\bi
\item
CBR prunes away irrelevant {\em rows};
\item
Stepwise regression prunes away {\em columns} that
add little information to the regression equation;
\item
The Shepperd \& Kadoda's rule induction and neural net instantiations
have no row/column pruning.
\ei
Note that the two methods found to be
``worst'' by Shepperd \& Kadoda had no
row or column pruning methods.
Pruning data can be useful to remove outliers or ``noisy'' information
(spurious signals unconnected to the target variable). One symptom of outliers and noise is 
conclusion instability across different data sets and different random samplings.
Hence, we wondered if conclusion instability could be tamed via the application of more elaborate row {\em
and} column pruners.

The above observations lead to the experiment reported in this paper. 
Previously~\cite{me06d}, we have built the COSEEKMO effort estimation workbench that supports 158 
estimation methods\footnote{
To be precise, COSEEKMO currently supports 15 learners and 8 row/column pre-processors
which can be applied to two different sets of internal tuning parameters.
In one view, this combination of (15+8*8)*2=158 different
estimation generation methods are not really different ``methods'';
rather it might be more accurate to call them ``instances of methods''.
This paper does not adopt that terminology for the following reason.
To any user of our tools, our menagerie of estimation software
contains 158 {\em oracles} that may yield different effort estimates.
Hence, in the view adopted by this paper, they are 158
competing methods that must be assessed and (hopefully)
pruned to a more management size.}.
The methods are defined over COCOMO features, and make extensive use of
combinations of different row and column pruners.
When we ran this toolkit over our data sets, we found only minor conclusion instability.
In fact, a very clear pattern in the results was apparent:
\be
\item Many estimation methods offer very little added value;
\item A small number (four) of estimation methods consistently out-perform the rest;
\item Within the set of four demonstrably useful methods, there is no
consistently best estimator. 
\ee
The methods in COSEEKMO were selected to cover a range of standard and novel
theories of effort estimation. Based on these above results, we are now
concerned that there is an excess of theoretical elaboration and
not enough empirical ranking of supposedly useful methods.
For our own work,
we are planning to prune away many of the methods in COSEEKMO and we
advise other researchers to consider doing the same.
For example, 
in this paper,
very simple extensions to decades-old
techniques out-performed 97\% of all the methods studied here.
If those percentages carry over to other effort estimation paradigms and data sets then 
the implication is that:
\bi
\item Certain commercial estimation models such as PRICE-S~\cite{park88},
SEER-SEM~\cite{jensen83}, or SLIM~\cite{putnam92} have too many variables and model elaborations;
\item
Certain seemingly sophisticated estimation methods proposed in the academic literature 
do not add significant value for the task of effort estimation.
\ei

The rest of this paper describes our experiments and discusses their implications on prior and future work.

\section{Related Work}
There are many methods for effort estimation.
The rest of this paper offers brief notes on some
of them (with supporting details in the appendix).
For a more extensive survey of methods, see~\cite{shepperd07,jorgensen05}.

In order to introduce the reader to a range of estimation,
in this related work section we will just focus on two methods:
regression-based COCOMO and case-based reasoning.
This will be followed by a brief tutorial on row and column pruning.

\subsection{Regression-Based COCOMO}
\noindent
Two factors make us prefer COCOMO-based methods:
\bi
\item
{\em Public domain:}
Unlike other effort estimators such as PRICE-S~\cite{park88}, 
SEER-SEM~\cite{jensen83}, or SLIM~\cite{putnam92}, COCOMO is a public domain 
with published data and baseline results~\cite{chulani99}. 
\item
{\em Data availability:}
All the information we could access was in the \mbox{COCOMO-I} format~\cite{boehm81}.
\ei
COCOMO is based on linear regression which
assumes that the data can be
approximated by one linear model that includes lines of code
(KLOC) and other features $f$ seen in a software development project:
\[
effort = \beta_0 + \sum_i\beta_i\cdot f_i
\]
Linear regression adjusts $\beta_i$ to minimize the {\em prediction error} 
(predicted minus actual values for the project).

\begin{figure}[!b]
\begin{center}
{\scriptsize
\begin{tabular}{l|r@{:~}l|}\cline{2-3}
upper:   &acap&analysts capability\\
increase &pcap&programmers capability\\
these to &aexp&application experience\\
decrease &modp&modern programming practices\\
effort   &tool&use of software tools\\
         &vexp&virtual machine experience\\
         &lexp&language experience\\\cline{2-3}
middle   &sced&schedule constraint\\\cline{2-3}
lower:   &data&data base size\\
increase &turn&turnaround time\\
these to &virt&machine volatility\\
increase &stor&main memory constraint\\
effort   &time&time constraint for CPU\\
         &rely&required software reliability\\
         &cplx&process complexity\\\cline{2-3}
\end{tabular}}
\end{center}
\caption{{
The $f_i$ features used in this study. From~\cite{boehm81}. Most
range from 1 to 6 representing ``very low'' to ``extremely high''.
}}\label{fig:em}
\end{figure}

After much research, Boehm advocated
the COCOMO-I features shown in 
\fig{em}. He also
argued that effort is exponential on KLOC~\cite{boehm81}:
\begin{equation}\label{eq:coc1}
effort = a \cdot KLOC^b  \cdot \prod_i\beta_i
\end{equation}
where $a$ and $b$ are domain-specific constants
and $\beta_i$ comes from looking up $f_i$ values
in \fig{effortmults}. 
When $\beta_i$ is used in the above equation, they yield estimates in months 
where one month is 152 hours (and includes development and management hours).

Exponential
functions like \eq{coc1}
can be learned via linear regression after a conversion to
a linear form:
\begin{equation}\label{eq:linear}
log(effort)=log(a)+b{\cdot}log(KLOC)+\sum_ilog(\beta_i)
\end{equation}
Most our methods transform the data in this way
so when collecting performance statistics, the estimates must be unlogged.

\begin{figure}
\begin{center}
{\scriptsize
\begin{tabular}{|l|c|r@{~}|r@{~}|r@{~}|r@{~}|r@{~}|r|}
    \hline
    &&1&2&3&4&5&6\\
    \hline
upper&ACAP   &1.46   &1.19   &1.00   &0.86   &0.71   &\\
(increase&PCAP   &1.42 &1.17   &1.00   &0.86   &0.70 &\\
these to&AEXP   &1.29 &1.13   &1.00   &0.91   &0.82   &\\
decrease&MODP   &1.2  &1.10 &1.00 &0.91 &0.82 &\\
effort)&TOOL   &1.24 &1.10 &1.00 &0.91 &0.83 &\\
&VEXP   &1.21 &1.10 &1.00 &0.90 &  &\\
&LEXP   &1.14 &1.07 &1.00 &0.95 &  &\\\hline
middle&SCED   &1.23 &1.08 &1.00 &1.04 &1.10 &  \\\hline
lower&DATA   &    & 0.94 &1.00 &1.08 &1.16&\\
(increase&TURN   &       &0.87   &1.00   &1.07   &1.15   &\\
these to&VIRT   &       &0.87   &1.00   &1.15   &1.30   &\\
increase&STOR   &       &       &1.00   &1.06   &1.21   &1.56\\
effort)&TIME   &  &    &1.00   &1.11   &1.30   &1.66\\
&RELY   &0.75& 0.88& 1.00 & 1.15 & 1.40&\\
&CPLX   &0.70 &0.85 &1.00 &1.15 &1.30 &1.65\\
    \hline
\end{tabular}}
\end{center}
\caption{The COCOMO-I $\beta_i$ table~\cite{boehm81}. 
For example, the bottom right cell is saying that if CPLX=6, then
the nominal effort is multiplied by 1.65.}\label{fig:effortmults}
\end{figure}
Local calibration (LC) is a specialized form
of linear regression developed by Boehm~\cite[p526-529]{boehm81}
that assumes effort is modeled via the linear form \eq{linear}.
Linear regression would try to adjust all the $\beta_i$ values.
This is not practical when training on a very small number of projects.
Hence, LC fixes the $\beta_i$ values while adjusting the $<a,b>$ values to minimize
the prediction error. We shall refer to LC as ``standard practice'' since, in the COCOMO community
at least, it is the preferred method for calibrating standard COCOMO data~\cite{boehm00b}.

In 2000, Boehm et al. updated the COCOMO-I model~\cite{boehm00b}. After the update, numerous features 
remained the same:
\bi
\item
Effort is assumed to be exponential on model size.
\item
Boehm et al. still recommends local calibration for tuning generic COCOMO to a local situation.
\ei

At the 2005 COCOMO forum, there were  
discussions about relaxing the access restrictions on the
\mbox{COCOMO-II} data.
To date, those discussions have not progressed.
Since other researchers do not have 
access to COCOMO-II, this paper will only report
results from COCOMO-I.

\subsection{Case-Based-Reasoning}
COCOMO's features are both the strongest and weakest part
of that method. On the one hand, they have been selected and tested
by a large community of academic and industrial researchers led
by Boehm. This group meets annually at the COCOMO forums (these
are large meetings that have been held annually since 1985).
On the other hand, these features may not be available in
the databases of a local site. 
Hence, regardless of the potential value added of using a well-researched
feature set, those features may not be available.

An alternative to COCOMO is the case-based reasoning (CBR) approach used by
Shepperd~\cite{shepperd07} and others~\cite{li07}. CBR accepts data with any set of features.
Often, CBR uses a {\em nearest neighbor}
method to make predictions using past data that is
similar to a new situation. Some distance measure is used to find the $k$ nearest
older projects to each project in the $Test$ set. An effort estimate
can be generated from the mean effort of the $k$ nearest neighbors
(for details on finding $k$, see below).

The benefit of nearest neighbor algorithms is that they
make the fewest domain assumptions. That is, they can process a 
broader range of the data available within projects:
\bi
\item Local calibration cannot be applied unless projects are described using the COCOMO ontology (\fig{em}). 
\item Linear regression is best applied to data where most of the values for the numeric factors are known.
\ei

The drawback of nearest neighbor is that, sometimes, it can ignore important
domain assumptions. For
example, if effort is really exponential on KLOC, a standard nearest
neighbor algorithm has no way to exploit that.

\section{A Brief Tutorial on Row and Column Pruning}

Pruning can be useful since
project data collected in one context may not be relevant to another.
Kitchenham et al.~\cite{kitch07} take great care to document this effect.
In a systematic review comparing estimates generated using 
historical data {\em within} the same company or {\em imported} from another,
Kitchenham et al. found no
case where it was better to use data from other sites. Indeed, sometimes,
importing such data yielded significantly worse estimates.
Similar projects have less variation and so can be easier to calibrate: 
Chulani et al.~\cite{chulani99} \& Shepperd and Schofield~\cite{shepperd97} report that row pruning
improves estimation accuracy.

Given a table $P*F$ containing one row for $P$ projects described using
$F$ features, row and column pruning 
prune {\em irrelevant} projects and features.
After pruning, the learner executes on a new table $P'*F'$
where $P' \subseteq P$ and $F' \subseteq F$.

Row pruning can be {\em manual} or {\em automatic}.  In {\em manual row pruning} (also called ``stratification'' in the COCOMO
literature~\cite{boehm00b}), an analyst applies their domain knowledge
to select project data that is similar to the new project to be
estimated. Every other method explored in this study is fully
automatic. Such automation enables an exploration of a broad range of
options.  

Instead, in the sequel, we compare the results from 
using source subsets to using all the data from one source.
Recall that our data sets
come from two $sources$: Boehm's COCOMO text~\cite{boehm81} and 
some more recent NASA data. Those sources divide into various {\em data sets}
representing data from different companies, or different project types
(see the appendix for details). Minimum size
of a subset is 20 instances while a source may contain 93 rows ($NASA93$) or 63 rows ($COC81$).

\begin{figure*}
\scriptsize
\begin{center}
\begin{tabular}{r|rrrrrrrrrrr}
&\begin{sideways}category-missionplanning\end{sideways}&\begin{sideways}center-2\end{sideways}&\begin{sideways}year-1975\end{sideways}&\begin{sideways}mode-embedded\end{sideways}&\begin{sideways}center-5\end{sideways}&\begin{sideways}project-gro\end{sideways}&\begin{sideways}fg-g\end{sideways}&\begin{sideways}project-X\end{sideways}&\begin{sideways}all\end{sideways}&\begin{sideways}mode-semidetached\end{sideways}&\begin{sideways}category-avionicsmonitoring\end{sideways}\\\hline
year-1980&15 / 43&13 / 62&0 / 75&9 / 50&14 / 63&9 / 52&31 / 87&13 / 63&38 / 93&27 / 80&5 / 63\\
category-missionplanning&&1 / 56&3 / 54&2 / 39&7 / 52&1 / 42&20 / 80&7 / 51&20 / 93&18 / 71&0 / 50\\
center-2&&&10 / 64&5 / 53&0 / 76&23 / 37&32 / 85&0 / 75&37 / 93&32 / 74&13 / 54\\
year-1975&&&&12 / 46&23 / 53&0 / 60&31 / 86&23 / 52&37 / 93&25 / 81&20 / 47\\
mode-embedded&&&&&13 / 47&3 / 41&8 / 93&12 / 47&21 / 93&0 / 90&3 / 48\\
center-5&&&&&&0 / 62&33 / 86&38 / 39&39 / 93&23 / 85&17 / 52\\
project-gro&&&&&&&20 / 83&0 / 61&23 / 93&20 / 72&4 / 49\\
fg-g&&&&&&&&33 / 85&80 / 93&69 / 80&30 / 80\\
project-X&&&&&&&&&38 / 93&23 / 84&17 / 51\\
all&&&&&&&&&&69 / 93&30 / 93\\
mode-semidetached&&&&&&&&&&&24 / 75\\
category-avionicsmonitoring&&&     &                     &                &                   &            &                   &           &\\                                                        
\end{tabular}
\end{center}
\caption{$NASA93$: intersection / union of examples in different data sets.
}\label{fig:overlap1}
\end{figure*}


{\em Automatic row pruning} uses algorithmic techniques to select a subset
of the projects (rows). 
NEAREST and LOCOMO~\cite{jalali07} are automatic and use nearest neighbor methods on 
the $Train$ set to find the $k$ most relevant projects to generate predictions for the projects in the $Test$ set. 
The core of both automatic algorithms is a distance measure that must compare all pairs of projects.
Hence, these automatics methods take time $O(P^2)$. 
Both NEAREST and LOCOMO learn an appropriate $k$ from the $Train$ set and the $k$ with the lowest
error is used when processing the $Test$ set. 
NEAREST averages the effort associated with the $k$ nearest neighbors 
while LOCOMO passes the $k$ nearest neighbors to Boehm's local calibration (LC) method.

Column pruners fall into two groups:
\bi
\item COCOMIN~\cite{baker07} is far less thorough.
COCOMIN is a near linear-time pre-processor that selects the features
on some heuristic criteria and does not explore all subsets of the features.
It runs in $O(F{\cdot}log(F))$ for the sort and $O(F)$ time for the exploration of selected features.
\item WRAPPER and LOCALW are very thorough search algorithms that explore
subsets of the features, in no set order. In the worst case, this search is an exhaustive
examination of all combinations; i.e this search takes time $O(2^F)$. 
\ei

\section{Experiments}

\subsection{Data}
%XXX different subset sets, two sources
% previously we hare report that these data sets have very different properties

This paper is based on 19 data sets from two sources:
\bi
\item
$COC81$\footnote{ 
\url{http://promisedata.org/repository/data/coc81/}.}
comes from Boehm's 1981 text on effort estimation. 
\item
$NASA93$\footnote{
\url{http://promisedata.org/repository/data/nasa93/}.}
comes from a study funded by the Space Station Freedom Program.
$NASA93$ contains data from six different NASA centers including the Jet Propulsion Laboratory.
\ei
The data sets represent different subsets of the data: e.g. just the ground systems; just the systems
that use FORTRAN; etc. As shown in \fig{overlap1} and \fig{overlap2}, there is some overlap between these
subsets:
\bi
\item Occasionally this overlap is quite large; e.g. the 80 records shared by $NASA93$ ``all'' and the
ground systems labeled ``fg-g''.
\item
However, in the usual case, the overlap is less than a third (for $COC81$) and a quarter (for $NASA93$)
of the number of examples found in the union of both subsets.
\item
Also, sometimes it is zero; e.g. $NASA93$'s mission planning systems have zero overlap with avionics monitoring.
\ei
For more details on this data, see the appendix. 

\begin{figure}
\scriptsize
\begin{center}
\begin{tabular}{r|rrrrrr}
          &    \begin{sideways} mode-e\end{sideways}&\begin{sideways} lang-ftn\end{sideways}&\begin{sideways} kind-min\end{sideways}&\begin{sideways} lang-mol\end{sideways}&\begin{sideways} kind-max\end{sideways}&\begin{sideways} mode-org\end{sideways}\\\hline 
     all     &       28 / 63&         24 / 63&         21 / 63&         20 / 63&         31 / 63&         23 / 63\\      
   mode-e          &       &          7 / 45&         16 / 33&         13 / 35&         10 / 49&          0 / 51\\      
 lang-ftn          &            &         &          6 / 39&          0 / 44&         16 / 39&         12 / 35\\      
 kind-min          &            &              &    &         14 / 27&          0 / 52&          4 / 40\\           
 lang-mol          &            &              &              &        &          2 / 49&          4 / 39\\       
 kind-max          &            &              &              &              &         &         15 / 39      
\end{tabular}
\end{center}
\caption{$COC81$: intersection / union of examples in different data sets.
}\label{fig:overlap2}
\end{figure}

\subsection{Experimental Procedure}

\noindent
Each of the 19 subsets of $COC81$ and $NASA93$ were expressed
as a table of data $P*F$.
The table stored {\em project} information in $P$ rows and
each row included the {\em actual} development effort. 
In the subsets of $COC81$ and $NASA93$ used in the study,
$20 \le P \le 93$. The upper bound of this range ($P=93$)
is the largest data set's size.
The lower bound of this range ($P=20$) was selected based on
a prior study~\cite{me06d}.
For details on these data sets, see the appendix.

The table also has $F$ columns containing the project {\em features} $\{f_1,f_2,...\}$.
The features used in this study come from Boehm's COCOMO-I work (described in the appendix)
and include items such as lines of code (KLOC), schedule pressure (sced), analyst capability (acap), etc. 

To build an effort model, table rows were divided at random into a $Train$ and $Test$ set
(and $|Train|+|Test|=P$). COSEEKMO's methods are then applied to the $Train$ set to generate a model which
was then used on the $Test$ set.
In order to compare this study with our work~\cite{me06d},
we use the same $Test$ set size as the COSEEKMO
study; i.e. $|Test|=10$.

Effort models were assessed via three evaluation criteria:
\bi
\item $AR$: absolute residual; $abs(actual - predicted)$;
\item $MRE$: magnitude of relative error;  $\frac{abs(predicted - actual)}{actual}$;
\item $MER$: magnitude of error relative to estimate; $\frac{abs(actual - predicted)}{predicted}$;
\ei

For the sake of statistical validity, the above procedure was repeated 20 times
for each of the 19 data sets of $COC81$ and $NASA93$.
Each time, a different seed was used to generate the $Train$ and $Test$ sets.

Methods' performance scores were assessed using performance ranks rather than exact values.
To illustrate the process of replacing exact values with ranks,
consider the following example.
If treatment $A$ generates $N_1=5$ values \{5,7,2,0,4\} 
and treatment $B$ generates $N_2=6$ values \{4,8,2,3,6,7\}, 
then these sort as follows:
\begin{center}
{\small
\begin{tabular}{c|c|c|c|c|c|c|c|c|c|c|c}
Samples& A & A & B & B & A & B & A & B & A & B & B\\
Values & 0 & 2 & 2 & 3 & 4 & 4 & 5 & 6 & 7 & 7 & 8\\
\end{tabular}}\end{center}
On ranking, averages are used when values are the same:
\begin{center}
{\small
\begin{tabular}{c|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c}
Samples& A&  A  &B&  B&  A & B  &A&  B&  A&  B & B\\
Values&  0  &2  &2&  3&  4 & 4  &5&  6&  7&  7 & 8\\
Ranks&   1 &2.5&2.5 &4& 5.5&5.5 &7&  8& 9.5&9.5& 11\\
\end{tabular}}\end{center}
Note that, when ranked in this manner,
the largest value (8 in this case) gets the same rank even if it was
ten to a hundred times larger. That is, such rank tests
are less susceptible to large outliers.
This is very important for experiments with effort estimation.
In our experiments, we can build thousands to tens of thousands of estimators that exhibit
infrequent, but very large outliers.
For example, the relative error of an estimate is $RE=\frac{predicted-actual}{actual}$.
In our work we have seen data sets generate RE\% below 100 then suddenly spike
in one instance to over 8000\%.

After ranking the performance scores, we applied
the Mann-Whitney U test~\cite{mann47}:
\bi
\item
Non-paired tests compare the performance of 
populations of treatments while paired tests compare performance deltas of two methods
running on exactly the same train/test data. Since we are using row/column pruning,
paired tests are inappropriate since the underlying data distributions in the train test
can vary widely when (e.g.) a method that does use row and/or column pruning is compared to
one that does not.
\item
Mann-Whitney supports very succinct summaries of the results without intricate post-processing.
This is a very important requirement for our work since we are comparing 158 methods. 
Mann-Whitney does not require that the sample sizes are the same.
So, in a single U test, learner $L_1$ can be compared to all its rivals. 
\ei

%has been extensively studied in the data mining community. 
%T-tests that assume Gaussian distributions are strongly deprecated.
%After reviewing a wide range of
%comparisons methods\footnote{Paired t-tests with and without the use of geometric means of the relative ratios;
%binomial tests with the Bonferroni correction;
%paired t-tests; ANOVA; Wilcoxon; Friedman},
%Demsar~\cite{demsar06} advocates the use of the 1945 Wilcoxon~\cite{wilcoxon45} 
%signed-rank test that compares the ranks for the positive and negative differences (ignoring the signs). 
%Writing five years earlier,
%Kitchenham et.al.~\cite{kitc01} comment that the Wilcoxon test has its limitations.
%Demsar's report offers the same conclusion, 
%noting that the Wilcoxon test requires that the sample sizes are the same.
%
%One test not studied by Demsar is Mann and Whitney's 1947
%modification~\cite{mann47} to Wilcoxon rank-sum test (proposed along with his signed-rank test)~\cite{demsar06}. We prefer this test since
%Consider how to rank 158 methods with
%Wilcoxon:
%\bi
%\item Rank all methods by their median scores, $r_1, r_2,r_3,...r_{158}$.
%\item Using Wilcoxon, find all methods that were statistically insignificantly different to the top
%ranked method. Call those methods $r_1, r_2,... r_i$, the ``best''.
%Remove these methods from further analysis.
%\item Repeat for the remaining methods to return sets of methods labeled ``second best'', ``third best'' etc.
%\ei
% Note that it is unclear how to handle
%the situation where $r_1, ... r_i$ have been labeled ``best''
%but $r_j$ ($j > i$) is statistically insignificantly different from $r_k$ for $2 \le k \le i$.
%The U test, on the other hand, does not require any post-processing
%(such as the Friedman test or the above ``home brew'' method) to conclude 
%if the median rank of one population (say, the $L_1$ results)
%is greater than, equal to, or less than the median rank of another (say, the $L_2,L_3,..,L_x$ results).

Mann-Whitney can be used to report ``win'', ``loss'', or ``tie'' for
all pairs of methods 
$L_i,L_j, i\not=j$:
\bi
\item ``Tie'' ; i.e. the ranks are statistically the same;
\item ``Win'' ; i.e. not a tie and the median rank of one method has a lower error than the other;
\item ``Loss'' ; i.e. not a tie and the opposite to a win.
\ei
Given $L$ learning methods, the sum of $tie+win+loss$ for any one method is $L-1$.
When discussing discarding a method, an insightful metric is the number of losses. If this is non-zero,
then there is a case for discarding that method.

\subsection{158 Methods}

COSEEKMO's 158 methods combine:
\bi
\item Some {\em learners} such as standard linear regression, local calibration, and model trees.
\item Various {\em pre-processors} that may prune rows or columns.
\item Various {\em nearest neighbor} algorithms that can be used either as learners or as pre-processors to other learners.
\ei

\newcommand{\nope}{\ding{56}}
\newcommand{\yupe}{\ding{52}}
\begin{figure*}
\begin{center}
\scriptsize
\begin{tabular}{r@{~=~}l|l|l|p{1.5in}}
 method& name                  & row pruning & column pruning         & learner \\\hline
a    & LC           & \nope          & \nope                      & LC = Boehm's local calibration \\\hline
b    & COCOMIN +  LC     &  \yupe automatic $O(P^2)$        & \nope                      & local calibration \\\hline
c    & COCOMIN + LOCOMO + LC   &  \yupe automatic  $O(P^2)$       &  \yupe automatic $O(F{\cdot}log(F) + F)$ &  local calibration \\\hline
d    & LOCOMO + LC      &  \nope         & \yupe automatic $O(F{\cdot}log(F) + F)$ &  local calibration\\\hline
e    & ALL + LC         & \nope & \nope & local calibration on all the data from one source\\\hline
f    & M5pW  + M5p       &  \nope        &  \yupe Kohavi's WRAPPER~\cite{kohavi97} calling M5p~\cite{quinlan92b}, $O(2^F)$ & model trees\\\hline
g    & LOCALW  + LC      &  \nope        &  \yupe Chen's WRAPPER~\cite{me06d}  calling LC, $O(2^F)$ & local calibration\\\hline
h    & LsrW  + LSR       &  \nope        &  \yupe Kohavi's WRAPPER~\cite{kohavi97} calling LSR, $O(2^F)$ & linear regression\\\hline
i    & NEAREST       &  \yupe automatic $O(P^2)$& \nope &      mean  effort of nearest neighbors\\\hline
\end{tabular}
\end{center}
\caption{Nine effort estimation methods explored
in this paper. F is the number of features (columns) and P is the number of projects (rows).}\label{fig:abcd}
\end{figure*}

Note that only some of the learners use pre-processors.
In all, COSEEKMO's methods combine
15 learners without a pre-processor and 8 
learners with 8 pre-processors; i.e. 
\mbox{$15+8*8=79$} combinations in total.

COSEEKMO's methods input project features described using the 
symbolic range {\em very low} to {\em extra high}.
Some of the methods map the symbolic range to numerics 1..6.
Other methods map the symbolic range into a set of {\em effort multipliers} and
{\em scale factors} developed by Boehm and are shown in the appendix (\fig{effortmults}).
Previously, we have queried the utility of these 
effort multipliers and scale factors~\cite{me06d}. COSEEKMO hence executes its 79 methods
twice: once using Boehm's values, then once again using
perturbations of those values. Hence, in all, COSEEKMO contains $2*79=158$ methods.

There is insufficient space in this paper to describe the 158 methods
(for full details, see~\cite{jalali07}).
Such a complete description would be pointless since, as shown below, most of them are beaten by a very
small number of preferred methods. For example, our previous concerns regarding the effort multipliers and
scale factors proved unfounded (and so at least half the runtime of COSEEKMO is wasted).

\subsection{Brief Notes on Nine Methods}
This paper focuses on the
nine methods $(a,b,c,d,e,f,g,h,i)$ of \fig{abcd}. Four of these, $(a,b,c,d)$,
are our preferred methods while the other four 
comment on premises of some prior publications~\cite{me05c}.
Each method may use a column or row pruner or, as with $(a,e)$, no pruning at all.

One way to categorize \fig{abcd} is by their relationship to accepted practice
(as defined in the COCOMO texts~\cite{boehm81,boehm00b}).
Method $a$ is endorsed as best practice in the COCOMO community.
The others are our attempts to do better than current established practice using
e.g. intricate learning schemes or intelligent data pre-processors.

Method $e$ refers to an assumption we have explored
previously~\cite{me05a}, namely: is there some minimal subset of data
required to generate adequate effort estimates? Method $e$ uses all
possible data from one source and ignores any knowledge of the form 
``this is a ground systems so we should only train our
effort estimator using ground system data''. 

Method $f$ is an example of a more intricate learning scheme. 
Standard linear regression assumes that the data can be fitted to a single model.
On the other hand, the model trees~\cite{quinlan92b} used in $f$
permit the generation of multiple models (as well as a decision tree for selecting the appropriate model).

In methods $(f,h)$, the notation
M5pW and LsrW denotes a WRAPPER that uses M5p or LSR as its target learner (respectively).

Method $g$ referes to the technique that we argued for in a previous publication~\cite{me06d}.

Method $i$ generates estimates by averaging the effort seen in the nearest neighbors to each test instance.
Shepperd and Schofield~\cite{shepperd97} proposed 
this kind of reasoning for effort estimation from ``sparse'' data sets
\footnote{A table of data is ``sparse'' if many of its cells are empty. All our COCOMO data is non-sparse.}.
Note that this kind of reasoning does not use Boehm's assumptions about the parametric nature of the relationship
between COCOMO attributes and the effort estimate.

For more details on these methods, see the appendix.

%% \begin{figure*}
%% \begin{center}
%% \begin{tabular}{ccc}
%% criteria & COC81 & NASA93\\\hline
%% AR & \includegraphics[width=3in]{plots/AR-coc81.pdf} & \includegraphics[width=3in]{plots/AR-nasa93.pdf}\\\hline
%% MRE & \includegraphics[width=3in]{plots/MRE-coc81.pdf} & \includegraphics[width=3in]{plots/MRE-nasa93.pdf}\\\hline
%% MER& \includegraphics[width=3in]{plots/MER-coc81.pdf} & \includegraphics[width=3in]{plots/MER-nasa93.pdf}\\\hline
%% \end{tabular}
%% \end{center}
%% \caption{158 methods sorted by their number of losses.
%% Results from
%% four runs with different random number seeds and $N$ data sets from two sources
%% ($N=7$ for COC81 and $N=12$ for NASA93).
%% Each {\em seed X} ranks 158 methods by their number of losses
%% to the other 157 methods in $N$ data sets
%% (losses determined by Mann Whitney after 20 repeats of randomly generating
%% different training/test sets).
%% That is, the maximum y-value on these plots is 7*158 (for COC81)
%% and 12*158 (for NASA93).
%% Within each source and evaluation criteria: 
%% (i)~methods are sorted in {\em seed 1}
%% by their number of loses; (ii)~each plot in {\em seed 2}, {\em seed 3} and {\em seed 4}
%%  uses the same sort order as {\em seed 1}. 
%%  }\label{fig:fourruns} 
%% \end{figure*}


\section{Results}

\noindent
Figures~\ref{fig:resultsBoth-MRE-Run1},~\ref{fig:resultsBoth-MER-Run1}, 
and~\ref{fig:resultsBoth-AR-Run1} show results from 20 repeats of:
\bi
\item Dividing some subset into $Train$ and $Test$ sets.
Note that in the special case of method $e$, the ``subset''
was all the data from one source.
\item
Learning an effort model from the $Train$ set using COSEEKMO's 158 methods; 
\item
Applying that model to the $Test$ set; 
\item
Collecting performance statistics
from the $Test$ set using AR, MER, or MRE; 
\item Reporting the number of times a method losses, where ``loss'' is determined by
a Mann-Whitney U test (95\% confidence);
\ei
\noindent

Each mark on these plots shows the 
number of times a method loses in seven $COC81$ subsets (left plots) and
twelve $NASA93$ subsets (right plots). 
The x-axis shows results from the methods
$(a,b,c,d,e,f,g,h,i)$ described in \fig{abcd}. 

In these plots, methods that generate {\em lower} losses are {\em better}.
For example, the top-left plot of \fig{resultsBoth-AR-Run1}
shows results for ranking methods applied to $COC81$ using AR. 
In that plot, all of methods $(a,d)$ results from the seven $COCO81$ subsets
can be seen at $y=losses\approx 0$. That is, in that plot,
these two methods {\em never} lose against the other 158 methods.

In these results, conclusion instability due to 
{\em changing evaluation criteria} can be detected by comparing results across 
Figures~\ref{fig:resultsBoth-MRE-Run1},~\ref{fig:resultsBoth-MER-Run1}, and~\ref{fig:resultsBoth-AR-Run1}.
Also, conclusion instability due to {\em changing subsets} can be detected by comparing
results across different subsets; either across the two sources, $COC81$ and $NASA9$, or across their subsets selected during cross validation.
In order to make this test more thorough, we also conducted the study using different random seeds
controlling $Train$ and
$Test$ set generation (i.e. the three runs of \fig{resultsBoth-AR-Run1} that used different random seeds).

A single glance shows our main result: the plots are very similar.
Specifically, the $(a,b,c,d)$ results fall very close to $y=0$ losses.
The significance of this result is discussed below.

\begin{figure}

\begin{tabular}{|p{.95\linewidth}|}\hline
\begin{center}
\includegraphics[width=3.4in]{LossesColumnsBoth-MRE-Run1.pdf}
\end{center}
\\\hline\end{tabular}
\caption{MRE results. 
Mann-Whitey (95\% confidence).
These plots show number of losses of methods $(a,b,c,d,e,f,g,h,i)$ against
158 methods as judged by Mann-Whitney (95\% confidence).
Each vertical set of marks shows results from 7 subsets of
$COC81$ or 12 subsets of $NASA93$.}
\label{fig:resultsBoth-MRE-Run1}
\end{figure}
\begin{figure}

\begin{tabular}{|p{.95\linewidth}|}\hline
\begin{center}
\includegraphics[width=3.4in]{LossesColumnsBoth-MER-Run1.pdf}
\end{center}
\\\hline\end{tabular}
\caption{MER results.
Mann-Whitey (95\% confidence).
Same rig as \fig{resultsBoth-MRE-Run1}.}
\label{fig:resultsBoth-MER-Run1}
\end{figure}
\begin{figure*}[!t]
\begin{center}
\begin{tabular}{|c|}\hline

Results using random $seed_1$:\newline
\includegraphics[width=3.4in]{LossesColumnsBoth-AR-Run2.pdf}\\
Results using random $seed_2$:\newline
\includegraphics[width=3.4in]{LossesColumnsBoth-AR-Run1.pdf}\\
Results using random $seed_3$:\newline
\includegraphics[width=3.4in]{LossesColumnsBoth-AR-Run3.pdf}\\\hline

\end{tabular}
\end{center}
\caption{AR results, repeated three different times
with three different random seeds.
Same rig as \fig{resultsBoth-MRE-Run1}.}
\label{fig:resultsBoth-AR-Run1}
\end{figure*}

There are some instabilities in our results. For example, the
exemplary performance of methods $(a,d)$ in the top-left plot of
\fig{resultsBoth-AR-Run1} does {\em not} repeat in other plots.  For
example in the $NASA93$ MRE and MER results shown in
\fig{resultsBoth-MRE-Run1} and \fig{resultsBoth-MER-Run1}, method $b$
loses much less than methods $(a,d)$.

However, in terms of number of losses generated by methods $(a,b,c,d,e,f,g,h,i)$, the following
two results holds across all evaluation criteria and all subsets and all seeds:
\be
\item
One member of method $(a,b,c,d)$ always performs better (loses least)
than all members of methods $(e,f,g,h)$. Also, all members of methods $(e,f,g,h)$ perform better than $i$.
\item
Compared to 158 methods, one member of $(a,b,c,d)$
always loses at some rate very close to zero.
\ee

In our results, there is no single universal ``best'' method.
Nevertheless, out of 158 methods, there are 154 clearly inferior methods.
Hence, we recommend ranking methods $(a,b,c,d)$ on
all the available historical data, then applying the best ranked method
to estimate new projects.

%% \section{Results}

%% \noindent
%% \fig{fourruns} shows results of
%% 20 repeats of:
%% \bi
%% \item 
%% Dividing the available data  into  COC81 and NASA93)
%% \item
%% Separating out the 7 and 12 data sets in COC81 and  NASA93.
%% \item Dividing data set randomly into $Train$ and $Test$; 
%% \item
%% Learning an effort model from the $Train$ set using COSEEKMO's 158 methods; 
%% \item
%% Applying that model to the $Test$ set; 
%% \item
%% Collecting performance statistics
%% from the $Test$ set using AR, MER, or MRE; 
%% \item Reporting the number of times a method losses, where ``loss'' is determined by
%% a 
%% Mann-Whitney U test;
%% \item Repeating the above for four different random seeds: $seed_1, seed_2, seed_3, seed_4$.
%% \ei
%% \noindent
%% In these results,
%% conclusion instability due to 
%% {\em changing random selections} can be detected by
%% comparing  results across 
%% different random number seeds; i.e. by looking left to right on seeds 1,2,3,4:
%% \bi
%% \item
%% In \fig{fourruns}, for each combination of
%% source/evaluation method, the results are shown for four 
%% different random seeds. {\em Seed1} generates
%% a sort order on the number of losses of 158 methods in $N$ data sets (N=7 for COCO81; N=12 for NASA93).
%% This sort order is repeated in the runs using other seeds. Hence, changes to that sort order can be detected
%% by any {\em jitter} in the sort order of seeds 2,3,4. 
%% \ei
%% More generally, conclusion instability due to random selection, evaluation criteria, and {\em changing data
%% sources} can be seen by counting the number of times
%% a method appears in lowest $X$ number of the losses:
%% \bi
%% \item
%% In the sequel, we will say that the ``best'' methods appear
%% in the  bottom $X=20$ values of all twelve plots of \fig{fourruns}. This value of $X=20$
%% is an engineering decision determined by studying the shape of the \fig{fourruns}: many of those plots
%% exhibit a sharp increase in the number of losses above $X=20$ (this effect is particularly pronounced in the 
%% AR results).
%% \item
%% Note that if that list is empty, then there
%% is no stability in ``best'' method ranking across different random selections, different evaluation criteria,
%% and different data sets. 
%% \ei
%% A single glance at \fig{fourruns} shows an important effect main effect- 
%% \bi
%% \item Looking left to right for each source, the jitter in seeds 2,3,4 of \fig{fourruns} is not large;
%% \ei
%% That is,  changing either the random number seeds does not dramatically alter the rank ordering of the
%% methods. For example, while the results associated with method $e$
%% (NEAREST) exhibit some jitter, the number of losses for that method are quite stable:
%% \bi
%% \item losses in COC81: AR= 2998,  MER= 3140, MRE= 3126 (i.e. a 4\% difference in the min-to-max ranking);
%% \item losses in NASA93: AR= 4785, MER= 5391, MRE= 5461  (i.e. a 12\% difference in the min-to-max ranking).
%% \ei
%% (These are median number of losses over the four seeds.)

%% The list of ``best'' methods that  lost less than 20 times in all 12 plots is quite short:
%% \bi
%% \item $b$ appears $\frac{12}{12}$ times (COCOMIN+LC)
%% \item $c$ appears $\frac{12}{12}$ times (COCOMIN+ LOCOMO+LC)
%% \item $a$ appears $\frac{9}{12}$ times (LC)
%% \ei
%% From this list, we recommend that the following techniques  should always be applied
%% to COCOMO data:
%% \bi
%% \item
%% Always use Boehm's local calibration method;
%% \item
%% With or without the COCOMIN linear-time column pruner;
%% \item
%% With or without the LOCOMO linear-time row pruner;
%% \ei
%% These techniques combine into four methods (one for each combination of use/don't use row/column pruning).

%% In the sequel, the following observation will become important. Not 
%%  found in the list of ``top 20'' over all random evaluations and data sets and evaluation criteria 
%% are:
%% \bi
%% \item the 
%% non-parametric
%% methods (nearest neighbor $e$);
%% \item
%% complex parametric methods (including model
%% trees $f$, multivariate linear regression $h$);
%% \item
%% or intricate data pruning methods
%% (exponential time feature subset
%% selection $f,h,g$). 
%% \ei


%% %  the $(a,b,c,d)$ results 
%% % fall very close to $y=0$ losses.
%% % The significance of this result is discussed below.

%% % Each mark on these plots shows the 
%% % number of times a method loses in seven $COC81$ subsets (left plots) and
%% % twelve $NASA93$ subsets (right plots). 
%% % The x-axis shows results from the methods
%% % $(a,b,c,d,e,f,g,h,i)$ described in \fig{abcd}. 

%% % In these plots, methods that generate {\em lower} losses are {\em better}.
%% % For example,
%% % the top-left plot of 
%% % \fig{resultsBoth-AR-Run1}
%% % shows results for ranking methods applied to $COC81$ using AR. 
%% % In that plot, all of methods $(a,d)$ result from the seven $COCO81$ subsets
%% % can be seen at $y=losses\approx 0$. That is,
%% % in that plot,
%% % these two methods {\em never} lose against the other 158 methods.

%% % \begin{figure}
%% % \begin{center}
%% % \begin{tabular}{|p{.95\linewidth}|}\hline
%% % \includegraphics[width=3.4in]{LossesColumnsBoth-MRE-Run1.pdf}
%% % \\\hline\end{tabular}
%% % \end{center}
%% % \caption{MRE results. 
%% % Mann-Whitey (95\% confidence).
%% % These plots show number of losses of methods ${a,b..i}$ against
%% % 158 methods as judged by Mann-Whitney  (95\% confidence).
%% % Each vertical set of marks shows results from 7 subsets of
%% % COC81 or 12 subsets of NASA93.
%% % }
%% % \label{fig:resultsBoth-MRE-Run1}
%% % \end{figure}
%% % \begin{figure}
%% % \begin{center}
%% % \begin{tabular}{|p{.95\linewidth}|}\hline
%% % \includegraphics[width=3.4in]{LossesColumnsBoth-MER-Run1.pdf}
%% % \\\hline\end{tabular}
%% % \end{center}
%% % \caption{MER results.
%% % Mann-Whitey (95\% confidence).
%% % Same rig as \fig{resultsBoth-MRE-Run1}.}
%% % \label{fig:resultsBoth-MER-Run1}
%% % \end{figure}
%% % \begin{figure}[!t]
%% % \begin{center}
%% % \begin{tabular}{|p{.95\linewidth}|}\hline
%% % Results using random $seed_1$:\newline
%% % \includegraphics[width=3.4in]{LossesColumnsBoth-AR-Run2.pdf}\\
%% % Results using random $seed_2$:\newline
%% % \includegraphics[width=3.4in]{LossesColumnsBoth-AR-Run1.pdf}\\
%% % Results using random $seed_3$:\newline
%% % \includegraphics[width=3.4in]{LossesColumnsBoth-AR-Run3.pdf}\\\hline\end{tabular}
%% % \end{center}
%% % \caption{AR results, repeated three different times
%% % with three different random seeds.
%% % Same rig as \fig{resultsBoth-MRE-Run1}.
%% % }
%% % \label{fig:resultsBoth-AR-Run1}
%% % \end{figure}

%% % There
%% % are some instabilities in our results. For example, the exemplary
%% % performance of methods $(a,d)$
%% % in the top-left plot of
%% % \fig{resultsBoth-AR-Run1} does {\em not} repeat in other plots.
%% % For example in the $NASA93$ MRE and MER
%% % results shown in 
%% % \fig{resultsBoth-MRE-Run1} and \fig{resultsBoth-MER-Run1}, method $b$ loses much less than methods $(a,d)$.

%% % However, in terms of number of losses generated by methods $(a,b,c,d,e,f,g,h)$, the following
%% % two results holds across all evaluation criteria and all subsets:
%% % \be
%% % \item
%% % One member of method $(a,b,c,d)$ always performs better (loses least)
%% % than all members of methods
%% % $(f,g,h)$. Also, all members of methods $(f,g,h)$ perform better than $e$.
%% % \item
%% % Compared to 158 methods, one member of $(a,b,c,d)$
%% % always loses at some rate very close to zero.
%% % \ee


%% Note that there is no
%% single universal $best$ method.
%% Nevertheless, out of 158 methods, there are 154 lower ranked  methods (not always found amongst the ``best'' methods). 
%% Based on the above,  we recommend $LC$ plus or minus linear time row/column pruning on
%% all the available historical data, then applying the best ranked method
%% to estimate new projects.

\section{Discussion}
With the exception of the final notes on NEAREST,
the following discussion notes should not be generalized
beyond COCOMO-style data sets.

The methods recommended above are strong endorsement of Boehm's 1981 estimation research.
All our ``best'' methods are based around Boehm's preferred
method for calibrating generic COCOMO models to local data. 
Method $a$ is Boehm's {\em local calibration}
(or LC) procedure (defined in the appendix).
This result endorses three of Boehm's 1981 assumptions about effort estimation: 
\begin{description}
\item[{\em Boehm'81 assumption 1:}]~\newline
Effort can be modeled as a single function that is exponential on lines of code~\ldots 
\item[{\em Boehm'81 assumption 2:}]~\newline
\ldots and linearly proportional to the product of a set of effort multipliers;
\item[{\em Boehm'81 assumption 3:}]~\newline
The effort multipliers influence the effort by a set of pre-defined constants that can be taken from Boehm's textbook~\cite{boehm81}.
\end{description}

Our results argue that there is little added value in methods
$(e,f,g,h,i)$, at least for our COCOMO-style data sets.  
This is a useful result since these methods contain
some of our slowest algorithms. For example, the WRAPPER column
selection method used in $(f,g,h)$ is an elaborate heuristic search
through, potentially, many combinations.

The failure of model trees in method $f$ is also interesting. 
If the model trees of method $f$ had out-performed $(a,b,c,d)$, 
that would have suggested that effort is a multi-parametric phenomenon where, e.g.
over some critical size of software, different effects emerge.
This proved not to be the case, endorsing Boehm's assumption
that effort can be modeled as a single parametric log-linear equation.

Note that method $e$ often performs better than other methods. This is
partial evidence that {\em increasing} the training set size can be as
useful as trying smarter algorithms. However, note that method $e$ was
always beaten by one of $(a,b,c,d)$. Clearly, the value
of training sets specialized to particular subsets can be enhanced by
row and column pruning.

Of all the methods in \fig{abcd}, $(a,b,c,d)$
perform the best and $i$ performed the worst. 
The methods $(a,b,c,d)$ 
lost less than 20 times in all runs while the NEAREST method used in $i$ lost 
thousands of times. Why?

One distinguishing feature of method $i$ is the {\em assumptions}
it makes about the domain. 
The NEAREST neighbor method $i$ is {\em assumption-less}
since it makes none of the {\em Boehm'81} assumptions listed above.
We hypothesize that Boehm's assumptions are useful when processing COCOMO data.

To the above comment, we hasten to add that while NEAREST performed relatively worse {\em in isolation},
we still believe that nearest neighbor methods like NEAREST are a valuable addition to any effort estimation
toolkit:
\bi
\item Nearest neighbors used {\em in conjunction with other methods} can be quite powerful. For example,
in the above results, nearest neighbor row pruning proved to be a powerful addition to Boehm's local
calibration method.
\item As mentioned above, not all domains are described in terms that can be used by the parametric
forms like COCOMO. If the available domain data is in another format favorable to some of the other
learners, then it is possible that our ranking orders would change. For example, 
Shepperd \& Schofield argue that their case-based reasoning methods, like NEAREST
procedure used in method $i$,
are better suited to sparse data domains where precise numeric
values are {\em not} available on all factors~\cite{shepperd97}.
None of the data used in the study were sparse.
\ei

%\section{Related Work}
%
%We report here a set of comparative rankings of different estimation methods
%that are stable across:
%\bi
%\item
% hundreds of different random samples of data 
%\item taken
%from 19 projects, 
%\item assessed using three different evaluation criteria. 
%\ei To the best of our knowledge,
%this is the largest inter-method comparison yet reported in the literature.
%It is insightful to ask the question: why have not such  stable results
%been previously
%reported? 
%
%The first answer to this question is that it is not a widespread practice
%to compare different methods. As mentioned in the introduction, there
%it is common to report single-method results.
%
%Another answer comes from the statistical problems associated from
%conducting such inter-method comparisons.
%Elsewhere~\cite{me06d} we have documented the large deviations from central tendencies  
%that can occur
%in effort estimation.
%For example, in 30 experiments discarding ten examples (selected at random)
%the observed mean MRE values were less than 100\% while the standard deviations were
%\[\{median,max\} = \{150\%,649\%\}\]
%On investigation, the source of these large deviations were found to be the rare presence of 
%alarming large errors (in one extreme case, up to 8000\%).
%Large outliers can make mean calculations
%highly misleading. A single large outliers can make the mean
%value far removed from the median\footnote{Median: 
%the value below which 50\% of the values fall.}. 
%\item
%For data sets with only a small number of outliers (e.g. \fig{re}), the conclusions reached from
%different subsets can be very different,
%depending on the absence or presence of the infrequent outliers.
%\Such 
%One troubling result from the \fig{parametric} study is that the number of training
%examples
%was {\em not} connected to the size of standard deviation. A pre-experimental intuition
%was that the smaller the training set, the worse the prediction instability.
%On the contrary, we found small and large instability (i.e. MRE standard deviation)
%for both small
%and large training sets~\cite{me06d}. That is, instability
%cannot be tamed by further data collection. Rather, the data must be processed
%and analyzed in some better fashion (e.g. U test described below).
%
%These large instabilities explain the contradictory results in the effort estimation
%literature.
%Jorgensen reviews fifteen studies that compare model-based to expert-based estimation. 
%Five of those studies found in
%favor of expert-based methods; five found no difference; 
%and five found in favor of model-based estimation~\cite{jorg04}.
%Such diverse conclusions are to be expected if models exhibit large instabilities in their performance.
%
%
% from our prior work on the the papers that have attempted 
%
%19 projects, 
%testable conclusions 
\section{External Validity}
This study has several biases, listed below. 

{\em Biases in the paradigm}:
The paper explores model-based methods (e.g. COCOMO ) and not expert-based methods. 
Model-based methods use some algorithm to summarize old data and make predictions 
about new projects.
Expert-based methods use human expertise (possibly augmented with process guidelines
or checklists) to generate predictions.  
Jorgensen~\cite{jorg04} argues that most industrial effort estimation is expert-based
and lists 12 {\em best practices} for such effort-based estimation.
The comparative evaluation of model-based
vs. expert-based methods must be left for future work. 
The evaluation of expert-based methods will be a human-intensive study
and human subjects may rebel at participating in the lengthy experimental
procedure described below. Before comparing any effort estimation
methods (be they model-based or expert-based) we must prune
the space of model-based methods.
For more on expert-based methods, see~\cite{jorg04,jorgensen04,shepperd97,chulani99}.

{\em Evaluation bias:}
We showed conclusion stability 
across three criteria; absolute residual; magnitude of error
relative to estimate; or magnitude of relative error. 
This does not mean that we have shown stability across {\em all possible} evaluation biases. 
Other evaluation biases may offer different rankings to our estimation methods. 

{\em Sampling bias:}
Model-based estimation methods use data
and so are only useful in organizations that maintain historical
data. Such data collection is rare in organizations
with low process maturity. However, it is common elsewhere; e.g. amongst government
contractors whose contract descriptions include process auditing requirements.
For example, United States government contracts often require a model-based estimate at each
project milestone. Such models are used to generate estimates or to double-check an expert-based estimate.

Another source of sampling bias is that the data sets used in this study come from two sources:
(1) Boehm's 1981 text on Software Engineering~\cite{boehm81}
and (2) data collected from NASA in the 1980s and 1990s from six different
NASA centers including the Jet Propulsion Laboratory (for details
on this data, see the appendix). 
When we show our results to researchers in the field, they ask if two sources
is enough to draw valid external conclusions. In reply, we comment that these
two sources were repositories that accepted data from a wide range of projects.
For example, our NASA data comes from different teams working at geographical locations spread
throughout the United States using a variety of programming languages.
While some of our data is from flight systems (a particular NASA specialty),
most are ground systems and share many of the properties of other terrestrial
software (same operating systems, development languages, development practices).
Much of NASA's software is written by contractors who service a wide range of clients (not just NASA).
These contractors are contractually obliged (ISO-9001) to demonstrate their understanding and usage of
current industrial best practices. For this reason, prior research has argued that 
conclusions from NASA data are relevant to the general software engineering industry.
Basili, Zelkowitz, et al.~\cite{basili02}, for example, published
extensively for decades their conclusions taken from NASA data.

Yet another source of sampling bias is that our conclusions are based only on the data sets studied here. 
The data used in this study is the largest public domain set of COCOMO-style data available. Also, our 
data source is as large as the proprietary COCOMO data sets used in prior TSE publications~\cite{chulani99}.

{\em Biases in the model:}
This study adopts the COCOMO model for all its work. This decision
was forced on us: the COCOMO-style data sets, described in the appendix, are 
the only public domain data we could access. Also, all
our previous work was based on COCOMO data since our funding body (NASA)
makes extensive use of COCOMO. The implications of our work on other estimation
frameworks is an open and (as mentioned in the introduction) pressing issue.
We strongly urge researchers with access to non-COCOMO data to repeat the
kind of row/column pruning analysis described here.

Note that we make no claim that this study explores the entire space of possible
effort estimation methods. Indeed, when we review the space of known methods
(see Figure~1 in~\cite{myrtveit05}), it is clear that COSEEKMO covers only a small part of that total space. 
The reader may know of other effort estimation methods they believe we should try.
Alternatively, the reader may have a design or an implementation of a new kind
of effort estimator. In either case, before it can be shown that an existing or new method
is better than the four we advocate here,
we first need a demonstration that it is possible to make stable conclusions regarding
the relative merits of different estimation methods.
This paper offers such a demonstration.

\section{Conclusion}

This paper concludes five years of research that began with the following question;
can the new generation of data miners offer better effort estimates than traditional methods?
%Curiously, as discussed 
%in {\em Related Work}, this paper
%offers offers a contrasting conclusion to a prior study
%by Shepperd \& Kadoda~\cite{sheppherd01}. 
%Based on artificially generated data sets
%based on
%non-COCOMO-style data, those authors reported
%little or no conclusion stability regarding the relative benefits
%of different estimation methods.
%We offer below several hypothesis for our differing results.
%Shepperd 
%\& Kadoda  used whatever attributes were locally available
%while we map all our data into the features endorsed by the COCOMO research.
%Also, we make extensive use of {\em full pruning}
%of both rows and columns to cull extraneous
%or ``noise'' portions of the training tables of data.
%Case-based reasoning, on the other hand, only uses row pruning
%when it select relevant cases for analysis.
%We conjecture  that our conclusion stability
%is a result of COCOMO features+full pruning.
%
%
%{\em Contributions of This Paper:}
%The contributions of this paper are based on the above conclusions.
%Conclusion \#1 lets us discard many estimation methods. It also demands
%that researchers proposing new estimation methods need to carefully
%demonstrate that their new, supposedly more sophisticated method, is actually
%an improvement over current practice.
%
%Conclusion \#2 simplifies deploying  effort estimation methods
%at an industrial site (the search for useful estimation methods can be constrained
%to a very small set).
%
%Conclusion \#3 cautions that our four ``best'' methods cannot be pruned
%to a smaller set. Rather, when processing a new domain, all four methods should be tried.
%
%More generally, we recommend the use of COCOMO-style features (with full
%pruning) to avoid the instability problems reported by Shepperd and Kadoda
%(but this recommendation needs more study-
%see  
% our {\em Future Work} section).
%

In other work~\cite{baker07}
we detected no improvement using bagging~\cite{brieman96} and boosting~\cite{FreSch97} methods for
COCOMO-style data sets.
In this work, we have found that one of four methods is always better than another 154 methods:
\bi
\item
A single linear model is adequate for the purposes of effort estimation.
All the methods that assume multiple linear models, such as model trees $(f)$,
or no parametric form at all, such as nearest neighbor $(i)$, perform
relatively poorly.
\item
Elaborate searches do not add value to effort estimation.
All the $O(2^F)$ column pruners do worse than near-linear-time column pruning. 
\item
The more intricate methods such as model trees do no better than other methods. 
\ei
Unlike Shepperd \& Kadoda's results, we were able to find stable conclusions across different data sets, different
random number seeds, and even different evaluation criteria.  
%
%The implications for none-COCOMO-style effort estimation is an open question. At the very
%least, we would argue that
%the above results
%lead to the following methodological principle. Before releasing some new, supposedly more sophisticated
%effort estimation method, it should be a requirement that the author of that new method demonstrates
%that the new method offers some advantage over current practice.
%This principle may to be an obvious statement for any scientific discipline. However, it
%is not a commonly applied principle in effort estimation.
%Apart from Shepperd \& Kadoda~\cite{sheppherd01},
%there are few examples of 
%studies using multiple methods or multiple
%data sets. For the record, those examples
%include:
%\be
%\item
%Chulani \& Boehm's comparison of the Bayesian methods of COCOMO-II against COCOMO-I~\cite{chulani99};
%\item
% Mendes' use of two estimation methods on data collected within one company or from other companies~\cite{mendes07};
%\item Our prior work on COSEEKMO~\cite{me06d}.
% \ee
%However, our reading of the literature is that the ranking and pruning of many methods using many data sets
%and multiple evaluation criteria is not common practice. 
%Note that, of the above examples, only example \#1 offered a comparative assessment of alternate
%estimation methods. Example \#2 and \#3 only {\em used} alternative methods; they
%did not show that there existed some stable {\em ranking} between the methods (i.e. stable across
%data subsets, stable across evaluation criteria).
%
Consequently, we argue for both a 
{\em complication} and {\em simplification} of effort estimation research:
\bi
\item
{\em Complication \#1:}
One reason for {\em not} using COCOMO is that the available data has to 
be expressed in terms of the COCOMO features, which is problematic if companies have not collected their historical
data using the COCOMO ontology. 
Nevertheless, our results suggest that it may be worth the effort to use COCOMO-style data collection,
if only to reduce the instability in the conclusions.
\item {\em Complication \#2:}
Simply applying one or two methods in a new domain is not enough. In the study reported in this paper,
one method out of a set of four was always the best but {\em that best method was
data set-specific}.
Therefore, prior to researchers drawing conclusions about aspects
of effort estimation properties in a particular context, there should be a
{\em selection study} to rank and prune the available estimators according to the details of a local domain.
\item
{\em Simplification:}
Fortunately, our results also suggest that such {\em selection studies} need not be very elaborate.
At least for COCOMO-style data, we report that $\frac{154}{158}=97\%$ of the methods implemented
in our COSEEKMO toolkit~\cite{me06d} added little or nothing to Boehm's 1981 regression procedure~\cite{boehm81}.
\ei

Such a selection study could proceed as follows.
For COCOMO-style data sets, the following methods should be tried
and the one that does best on historic data (assessed using Mann-Whitney U test)
should be used to predict new projects:
\bi
\item Adopt the three Boehm'81 assumptions and use LC-based methods.
\item While some row and column pruning can be useful, elaborate column pruning (requiring
an $O(2^F)$ search) is not. 
Hence, try LC with zero or more of LOCOMO's row pruning or COCOMIN's column pruning.
\ei

For future work, we recommend an investigation of an ambiguity in our results:
\bi
\item
Prior experiments found conclusion {\em instability} after limited application
of row and column pruning to non-COCOMO features.
\item
Here, we found conclusion {\em stability} after extensive row and column pruning to COCOMO-style features.
\ei
It is hence unclear what removed the conclusion
instability. Was it pruning? Or the use of the
COCOMO features? To test this, we require a
new kind of data set. Given examples expressed
in whatever local features are available,
those examples should be augmented with COCOMO
features. Then, this study should be repeated:
\bi
\item With and without the local features;
\item With and without the COCOMO features;
\item With and without pruning;
\ei
We would be interested in contacting any industrial group with access to this new kind of data set.

\section*{Acknowledgments}
Martin Shepperd was kind enough to make suggestions about
different evaluation biases and the design of the
NEAREST and LOCOMO methods.

\appendix
\subsection{Data Used in This Study} 

In this study, effort estimators were built using all or some {\em part} of data from two sources:
\bdd
\item[{\em $COC81$:}]~~63 records in the COCOMO-I format. Source: \cite[p496-497]{boehm81}. 
Download from \url{http://unbox.org/wisp/trunk/cocomo/data/coc81modeTypeLangType.csv}.
\item[{\em $NASA93$:}]~~~~~93 NASA records in the COCOMO-I format.  
Download from \url{http://unbox.org/wisp/trunk/cocomo/data/nasa93.csv}.
\edd

Taken together, these two sets are the largest
COCOMO-style data source in the public domain (for reasons
of corporate confidentiality, access to Boehm's COCOMO-II data set is
highly restricted).
$NASA93$ was originally collected to create a NASA-tuned version of
COCOMO, funded by the Space Station Freedom Program and 
contains data from six NASA centers including the Jet Propulsion
Laboratory. For more details on this data set, see~\cite{me06d}.

Different subsets and number of subsets used (in parenthesis) are:
\bdd
\item[{\em All(2):}]~selects all records from a particular source.
\item[{\em Category(2):}]~~~~~~~$NASA93$ designation selecting the type of project; e.g. avionics.
\item[{\em Center(2):}]~~~~~$NASA93$ designation selecting records relating to where the software was built.
\item[{\em Fg(1):}]~$NASA93$ designation selecting either ``$f$'' (flight) or ``$g$'' (ground) software.
\item[{\em Kind(2):}]~~$COC81$ designation selecting records relating to the development platform; e.g. max is mainframe.
\item[{\em Lang(2):}]~~~$COC81$ designation selecting records about different development languages; e.g ftn is FORTRAN.
\item[{\em Mode(4):}]~~~designation selecting records relating to the COCOMO-I development mode: one of semi-detached, embedded, and organic.
\item[{\em Project(2):}]~~~~~$NASA93$ designation selecting records relating to the name of the project.
\item[{\em Year(2):}]~~is a $NASA93$ term that selects the development years, grouped into units of five; e.g. 1970, 1971, 1972, 1973, 1974 are labeled ``1970''.
\edd
There are more than 19 data sets overall. Some have fewer than 20 projects and 
hence were not used. The justification for using 20 projects or more is offered in~\cite{me06d}.

\subsection{Learners Used in This Study}

\subsubsection{Learning with Model Trees}

Model trees are a generalization of linear regression.
Instead of fitting the data to {\em one linear model}, 
model trees learn {\em multiple linear models}, and a
decision tree that decides which linear model to use.
Model trees are useful when the projects form regions and
different models are appropriate for different regions.
COSEEKMO includes the M5p model tree learner defined by Quinlan~\cite{quinlan92b}.
\subsubsection{Other Learning Methods}

See the {\em Related work} section for notes on learning
with linear regression; local calibration; and nearest neighbor methods.

\subsection{Pre-Processors Used in This Study}

\subsubsection{Pre-processing with Row Pruning}

The LOCOMO tool~\cite{jalali07} in COSEEKMO
is a row pruner that combines a nearest neighbor method
with LC. LOCOMO prunes away all projects except those $k$ ``nearest'' to the $Test$ set data.

To learn an appropriate value for $k$, LOCOMO uses the $Train$ set as follows:
\bi
\item For each project $p_0\in Train$, LOCOMO sorts the remaining 
$Train - p_0$ examples by their Euclidean distance from $p_0$. 
\item LOCOMO then passes the $k_0$ examples closest to $p_0$
to LC. The returned $<a,b>$ values are used to estimate effort for $p_0$.
\item After trying all possible $k_0$ values, $2 \le k_0 \le |Train|$, 
$k$ is then set to the $k_0$ value that yielded the smallest mean MRE\footnote{A
justifications for using the mean measure within LOCOMO is offered at the end of the appendix.}.
\ei
This calculated value $k$ is used to estimate the effort for projects in the
$Test$ set. For all $p_1\in Test$, the $k$ nearest neighbors
from $Train$ are passed to LC. The returned $<a,b>$ values are then 
used to estimate the effort for $p_1$. 

\subsubsection{Pre-Processing with Column Pruning}

Kirsopp \& Schofeld~\cite{kirsopp02} 
and Chen \& Menzies \& Port \& Boehm~\cite{me05c}
report that column pruning improves effort estimation.
Miller's research~\cite{miller02} explains why.
Column pruning (a.k.a. feature subset selection~\cite{hall03} or variable
subset selection~\cite{miller02}) reduces the deviation of a
linear model learned by minimizing least squares error~\cite{miller02}.
To see this, consider a linear model with constants $\beta_i$ 
that inputs features $f_i$ to predict for $y$:
\[y = \beta_0 + \beta_1\cdot f_1 + \beta_2\cdot f_2 + \beta_3\cdot f_3 ...\]
The variance of $y$ is some function of the variances in $f_1, f_2$, etc.
If the set $F$ contains noise 
then random variations in $f_i$ can increase the uncertainty of $y$.
Column pruning methods decrease the number of features $f_i$, thus
increasing the stability of the $y$ predictions. 
That is, the fewer the features (columns), the more restrained are the model predictions.

Taken to an extreme, column pruning can reduce $y$'s variance 
to zero (e.g. by pruning the above equation back to $y=\beta_0$)
but increases model error (the equation $y=\beta_0$ will ignore all project data when generating estimates).
Hence, intelligent column pruners experiment with some proposed subsets $F' \subseteq F$ before
changing that set. COSEEKMO currently contains three intelligent column pruners: WRAPPER, LOCALW, and COCOMIN.

WRAPPER~\cite{kohavi97} is a standard best-first search through the space of possible features. 
At worst, the WRAPPER must search an space exponential on the number of features $F$; i.e. $2^F$.
However, a simple best-first heuristic makes WRAPPER practical for effort estimation.
At each step of the search, all the current subsets are scored by passing them to
a {\em target learner}. If a set of features does not score better than a smaller subset,
then it gets one ``mark'' against it. If a set has more than $STALE=5$ number of marks, it is deleted.
Otherwise, a feature is added to each current set and the algorithm continues. 

In general, a WRAPPER can use any target learner. Chen's LOCALW is a WRAPPER specialized
for LC. Previously~\cite{me06d,me05c}, we have explored LOCALW for effort estimation.

The above description of WRAPPER should be read as a brief
introduction to all the techniques associated with this kind of column pruner. A WRAPPER is powerful tool
and can be extensively and usefully customized (e.g. using a hash-table cache to hold
the frequently seen combinations; alternative search methods to best-first search; etc). We refer the interested
reader to the thorough treatment of the subject found in Miller~\cite{miller02} and Kohavi \& Johns~\cite{kohavi97}.

Theoretically, WRAPPER (and LOCALW)'s exponential time search is more
thorough, hence more useful, than simpler methods that try fewer options. To 
test that theory, we will compare WRAPPER and LOCALW to a linear-time column pruner 
called COCOMIN~\cite{baker07}.

COCOMIN is defined by the following operators:
\[\{sorter, order, learner, scorer\}\]
The algorithm runs in linear time over a {\em sorted} set of features, $F$.
This search can be {\em order}ed in one of two ways:
\bi
\item A ``backward elimination'' process starts with all features $F$ and throws some away, one at a time.
\item A ``forward selection'' process starts with one feature and adds in the rest, one at a time.
\ei
Regardless of the search order, at some point the current set of features $F' \subseteq F$ 
is passed to a {\em learner} to generate a performance {\em score} by applying 
the model learned on the current features to the $Train$ set. 
COCOMIN returns the features associated with the highest score.

COCOMIN pre-sorts the features on some heuristic criteria. Some of
these criteria, such as standard deviation or entropy, are gathered
without evaluation of the target learner. Others are gathered by
evaluating the performance of the learner using only the feature in
question plus any required features, such as KLOC for COCOMO, to
calibrate the model. After the features are ordered, each feature is
considered for backward elimination, or forward selection if chosen,
in a single linear pass through the feature space, $F$. The decision to
keep or discard the feature is based on an evaluation measure
generated by calibrating and evaluating the model with the training data.

Based on~\cite{baker07}, the version of COCOMIN used in this study:
\bi
\item sorted the features by the highest median MRE;
\item used a backward elimination search strategy;
\item learned using LC;
\item scored using mean MRE.
\ei
Note that mean MRE is used 
internally to COCOMIN (and LOCOMO, see above) since it is fast and simple to compute. Once the search  
terminates, this paper strongly recommends the more thorough (and hence more intricate 
and slower) median non-parametric measures to assess the learned effort estimation model.

\bibliographystyle{IEEEtran}
\bibliography{refs}
\end{document}
\section*{Comments on Reviewer Feedback}

All the reviewers were correct in guiding us away from the previous focus of the last
version of this paper. In the prior version we looked too hard at conclusion instability
and the benefits of Mann-Whitney. As Reviewer1 said
``in any case using Mann Whitney U is hardly new - it's just the non-parametric analog of the t test''. 
After deemphasizing Mann-Whitney, we found that we make our main point without wasting the reader's
time on an established technique.
 
Reviewer1 did this paper a great service by pointing out the Shepperd and Kadoda (2001, TSE) reference.
It turns out that we were basing our work on the wrong premise. As can be seen in our new
introduction, the Shepperd and Kadoda paper is a much better framing for this work.

Reviewer1 also commented
``The paper isn't very detailed on this but I would guess the 2 data sets are 
similar because they share the same features''. On reflection, this lead to the speculation
that one way to reduce conclusion instability is to make a better selection of mature
 features that a large community has experimented with. In this
draft, we stress the role of the COCOMO features in reducing uncertainty. The prior draft made
no comment on this possibility and we thank Reviewer1 for this insight.

Reviewer1 pointed out several small typographical and semantic errors in the text that we have now fixed.  
For example, keywords have now been added. Also, when
we first introduce the WRAPPER we added the comment suggested by this reviewer: 
\bi
\item
{\em WRAPPER and LOCALW are very thorough search algorithms that explore
subsets of the features, in no set order. In the worst case, this search is an exhaustive
examination of all combinations; i.e this search takes time $O(2^F)$.}
\ei

Reviewer1 asked for more details on our WRAPPER methods 
saying ``there are lots of things you can do with a wrapper so much more info is needed here''.
A full exposition of all these methods would take many pages so we added the following text to the appendix:
\bi
\item {\em
The above description of WRAPPER should be read as a brief
introduction to all the techniques associated with this kind of column pruner. A WRAPPER is powerful tool
and can be extensively and usefully customized (e.g. using a hash-table cache to hold
the frequently seen combinations; alternative search methods to best-first search; etc). We refer the interested
reader to the thorough treatment of the subject found in Miller~\cite{miller02} and Kohavi \& Johns~\cite{kohavi97}.}
\ei

Reviewer1 remarked that the reference 35 might be incorrect. To the best of our knowledge, it is and would accept guidance on this point.

Reviewer2 asked us to review our terminology:
``The authors demonstrate conclusion stability for a portfolio of 154 methods 
(I would prefer the term 'instances of a method' as just changing some parameters in a 
process does not create a new method) when applied to 19 subsets of two sources.''
Hence, we added this footnote to the introduction:
\bi
\item
{\em To be precise, COSEEKMO currently supports 15 learners and 8 row/column pre-processors
which can be applied to two different sets of internal tuning parameters.
In one view, this combination of (15+8*8)*2=158 different
estimation generation methods are not really different ``methods'';
rather it might be more accurate to call them ``instances of methods''.
This paper does not adopt that terminology for the following reason.
To any user of our tools, our menagerie of estimation software
contains 158 {\em oracles} that may yield different effort estimates.
Hence, in the view adopted by this paper, they are 158
competing methods that must be assessed and (hopefully)
pruned to a more management size.}
\ei

Reviewer2 correctly takes issue with our treatment of Gaussians. All that material is deleted in this draft.
Reviewer2 also correctly points out an error in our description of sparse data and the NEAREST method, which
we fixed in this draft.

Reviewer2 comments ``the main limitation of the research to me is the use of just two data sets 
(coming from 1981 and 1993). Even though there are subsets created, this still does not change anything
 on this limitation. This is especially true for the type of conclusions drawn to demonstrate conclusion stability.''.

In reply, we make two points:
\bi
\item
We use 19 data sets from two sources. Each source was a database storing a very diverse set of projects.
For example, while much of our data comes from NASA, NASA is a very diverse enterprise serviced by a very large
number of commercial contractors.
\item Figures 3 and 4 explore the overlap in our 19 data sets. While some overlap exists, for the most part our data falls
into different pools.
\ei
 
Reviewer3 correctly pointed out errors in out treatment of 
MRE and papers by Foss, Myrtviet and Kitchenham.
All that material is deleted from this draft.

Reviewer3 argues that we should not use an unpaired Mann-Whitney to study our results, and that a paired
test would be more appropriate. Paired tests are appropriate when different methods are applied to the same
train/test data. This is not the case when comparing methods that use row/column pruners vs methods that
do not. Hence, we still maintain we should use an unpaired comparison test.

\end{document}
\subsection{Symptoms of Instability}

KFM
caution that,
historically, ranking estimation methods has been done quite poorly. 
Based on an analysis of two (non-COCOMO)
data sets as well as simulations over artificially generated data set,
Foss et al. and Myrtviet et.al.
concluded that numerous commonly used methods such as the mean
MRE\footnote{MRE = magnitude of relative error $=abs(actual-predicted)/actual$.} are
unreliable measures of estimation effectiveness.
Also, the conclusions reached from these standard measures can vary wildly
depending on which subset of the data is being used for testing~\cite{myrtveit05}.

\begin{figure}
\begin{center}
\includegraphics[width=3.5in]{Parametric.pdf}
\end{center}
\caption{Results of 2 different runs of COSEEKMO comparing
two methods using mean MRE values. Points on the
Y-axis show the difference in mean relative error (MRE) between method1 and method2.
Lower values endorse method2 since, when such values occur,
method2 has a lower error than method1.}
\label{fig:parametric}
\end{figure}

\fig{parametric} demonstrates conclusion instability.
It shows two experimental runs. In each run,
30 times, effort estimate models were built for our 19 data sets using
two methods. Each time, an effort model was built from a randomly selected 90\% of the data.
Results are expressed in terms of the difference in mean MRE between the
two subsets; e.g. in Run~\#1, method1 had a much larger mean MRE than method2.

After Run~\#1, the results endorse method2 since that method either (a)~did better (lower errors)
as method1 or (b)~had similar performance to method1.
However, that conclusion is not stable. Observe in Run~\#2 that:
\bi
\item
The improvements of method2 over method1 disappeared 
in subsets 1,2,3,7, and 11.
\item
Worse, in subsets 1,2, and 11 method1 performed
dramatically better than method2.
\ei
The deviations seen in 30 repeats of the above procedure
were quite large: within each data set, the standard deviation
on the MREs were $\{median,max\} = \{150\%,649\%\}$~\cite{me06d}.
Port (personal communication) has proposed a bootstrapping method to determine the true performance distributions
of COSEEKMO's methods. That method would require $10^2$ to $10^3$ re-samples and, given
COSEEKMO's current run-times, it would take $10^2$ to $10^3$ days to terminate.

One troubling result from the \fig{parametric} study is that the number of training
examples
was {\em not} connected to the size of standard deviation. A pre-experimental intuition
was that the smaller the training set, the worse the prediction instability.
On the contrary, we found small and large instability (i.e. MRE standard deviation)
for both small
and large training sets~\cite{me06d}. That is, instability
cannot be tamed by further data collection. Rather, the data must be processed
and analyzed in some better fashion (e.g. U test described below).

These large instabilities explain the contradictory results in the effort estimation
literature.
Jorgensen reviews fifteen studies that compare model-based to expert-based estimation. 
Five of those studies found in
favor of expert-based methods; five found no difference; 
and five found in favor of model-based estimation~\cite{jorg04}.
Such diverse conclusions are to be expected if models exhibit large instabilities in their performance.

\begin{figure}[!t]
\begin{center}
\includegraphics[width=2.5in]{graphs/re_nasa93_fg_g.pdf}
\includegraphics[width=2.5in]{graphs/re_nasa93_year_1975.pdf}
\includegraphics[width=2.5in]{graphs/re_nasa93_mode_embedded.pdf}
\includegraphics[width=2.5in]{graphs/re_coc81_langftn.pdf}
\end{center}
\caption{Relative errors seen in COSEEKMO's
experiments on four data sets. Thin lines show the actual
values. Thick lines show a Gaussian distribution that uses
mean and standard deviation of the actual values.
From top to bottom, the plots are of:
{\em (top)} NASA ground systems;
NASA software written around 1975;
NASA embedded software (i.e. software developed within tight hardware, software, and operational constraints);  
{\em (bottom)} some FORTRAN-based software systems.
}\label{fig:re}
\end{figure}

\subsection{Diagnosing the Cause}

The thin line of \fig{re} is drawn by sorting the 
relative error\footnote{RE = $(predicted - actual)/actual$} (RE)
seen in four of the subsets studied in \fig{parametric}.
Observe that while most of the actual RE values are nearly zero, an {\em infrequent}
number
(on the right hand side)
are {\em extremely large} (up to 8000 in the second plot). 
Such large {\em spikes} in RE result when
the predicted values are 
much larger than the actual values and result from
(1)~noise in the data or (2)~a training set that learns an overly steep
exponential function for the effort model. 

The size of the spikes in \fig{re} are remarkable. Research papers typically report RE values in the range 
$0\le RE\le3$\cite{me06d}.   
Such values are completely 
dwarfed by errors in the range of 8000 such as those seen in the second plot of \fig{re} (hence, most
of the thin lines in \fig{re} are flat). 
These very large, but
infrequent, outliers explain conclusion instability:
\bi
\item
Large outliers can make mean calculations
highly misleading. A single large outlier can make the mean
value far removed from the median\footnote{Median: 
the value below which 50\% of the values fall.}. 
\item
For data sets with only a small number of outliers (e.g. \fig{re}), the conclusions reached from
different subsets can be very different,
depending on the absence or presence of the infrequent outliers.
\ei

\fig{re} also illustrates how poorly standard methods assess the performance of effort estimation data.
Demsar~\cite{demsar06} offers a definition of {\em standard methods} in data mining.
In his study of four years of proceedings from the {\em International Conference
on Machine Learning}, Demsar
found that the standard
method of comparative assessment 
were t-tests over some form of repeated sub-sampling
such as cross-validation, separate subsets, or randomized re-sampling. 
Such t-tests assume that the distributions being studied are Gaussian and, as shown by the
thick line of \fig{re}, effort estimation results can be highly
non-Gaussian.
These thick lines show a Gaussian cumulative 
distribution function computed from the means and standard deviations of 
the actual RE values (the thin lines).
For example, the Gaussian approximation to the {\em actual values} of
\[\{1.1, 1.3, 1.5, 1.7, 2.1, 2.3, 2.7, 800\}\]
has a mean of 101.6 and a standard deviation of 282.2.
Observe how poorly such Gaussian distributions represent the actual  RE values:
\bi
\item
There exists orders of magnitude differences 
between the {\em actual} plots (the thin lines) and the {\em Gaussian} approximations
(the thick lines). 
\item
The Gaussian goes negative while none of our effort estimation methods assume that
it takes less than no time to build software.
\ei
\subsection{Fixing Instability}

The problem of comparatively assessing $L$ learners run on multiple sub-samples of $D$ data sets
has been extensively studied in the data mining community. 
T-tests that assume Gaussian distributions are strongly deprecated.
For example, Demsar~\cite{demsar06} argues that non-Gaussian populations are 
common enough to require a methodological change in data mining. 

After reviewing a wide range of
comparisons methods\footnote{Paired t-tests with and without the use of geometric means of the relative ratios;
binomial tests with the Bonferroni correction;
paired t-tests; ANOVA; Wilcoxon; Friedman},
Demsar advocates the use of the 1945 Wilcoxon~\cite{wilcoxon45} 
signed-rank test that compares the ranks for the positive and negative differences (ignoring the signs). 
Writing five years earlier,
Kitchenham et.al.~\cite{kitc01} comment that the Wilcoxon test has its limitations.
Demsar's report offers the same conclusion, 
noting that the Wilcoxon test requires that the sample sizes are the same.
To fix this problem, Demsar augments Wilcoxon with the Friedman test.

One test not studied by Demsar is Mann and Whitney's 1947
modification~\cite{mann47} to Wilcoxon rank-sum test (proposed along with his signed-rank test).
We prefer this test since:
\bi
\item
The Mann-Whitney U test does not require that the sample sizes are the same.
So, in a single U test, learner $L_1$ can be compared to all its rivals. 
\item
The U test does not require any post-processing
(such as the Friedman test) to conclude if the median rank of one population (say, the $L_1$ results)
is greater than, equal to, or less than the median rank of another (say, the $L_2,L_3,..,L_x$ results).
\ei

We used Mann-Whitney to compare a set of methods, computing the
$wins$, $ties$, and $losses$ for each.
Since we seek methods that can be rejected, the value of interest to us is
how often methods {\em lose}.
\begin{figure}
\begin{tabular}{|p{.95\linewidth}|}\hline
\footnotesize
The sum and median of A's ranks is
\[\begin{array}{ccl}
sum_A &=& 1 + 2.5 + 5.5 + 7 + 9.5 = 25.5\\
median_A& =&5.5
\end{array}\]
and the sum and median of B's ranks is
\[\begin{array}{ccl}
sum_B&=& 2.5 + 4 + 5.5 + 8 + 9.5 + 11 = 40.5\\
median_B&=& 6.75
\end{array}\]
The $U$ statistic is calculated from 
\mbox{$U_x=sum_x-(N_1(N_2+1))/2$}:
\[\begin{array}{c}
U_A = 25.5 - 5*6/2 = 10.5\\
U_B = 40.5 - 6*7/2 = 19.5
\end{array}\]
These can be converted to a Z-curve using:
\[
\begin{array}{rclcl}
\mu&= &(N_1N_2)/2&=&516.4\\
\sigma & = & \sqrt{\frac{N_1N_2(N_1 + N_2 + 1)}{12}}&=&5.477\\
Z_A & = & (U_A - \mu)/\sigma&=&-0.82\\
Z_B & = & (U_B - \mu)/\sigma&=&0.82\\
\end{array}
\]
(Note that $Z_A$ and $Z_B$ have the same absolute value.
In all case, these will be the same, with opposite signs.)

If $abs(Z) < 1.96$ then the samples $A$ and $B$ have the same median rankings (at
the 95\% significance level). In this case, we add one to both $ties_A$ \& $ties_B$.
Otherwise, their median values can be
compared, using some domain knowledge. In this work, {\em lower} values
are better since we are comparing errors. Hence:
\bi
\item
If $median_A < median_B$ add 1 to both $wins_A$ \& $losses_B$.
\item
Else if $median_A > median_B$ add 1 to both $losses_A$ \& $wins_B$.
\item
Else, add 1 to both $ties_A$ \& $ties_B$. 
\ei
\\\hline
\end{tabular}
\caption{An example of the Mann-Whitney U test.}\label{fig:mw}
\end{figure}

end{document}