%\documentclass[onecolumn,12pt,draftcls,journal,compsoc]{IEEEtran}
\documentclass[cite,journal,10pt,compsoc, twocolumn]{IEEEtran}
\usepackage{pifont}
\usepackage{times}
\newcommand{\ed}{\end{description}}
\newcommand{\fig}[1]{Figure~\ref{fig:#1}}
\newcommand{\eq}[1]{Equation~\ref{eq:#1}}
\newcommand{\hyp}[1]{Hypothesis~\ref{hyp:#1}}
\newcommand{\tion}[1]{\S\ref{sec:#1}}


\usepackage{alltt}
\usepackage{graphicx}
\usepackage{url}
\newcommand{\bi}{\begin{itemize}}
\newcommand{\ei}{\end{itemize}}
\newcommand{\be}{\begin{enumerate}}
\newcommand{\ee}{\end{enumerate}}
\newcommand{\bdd}{\begin{description}}
\newcommand{\edd}{\end{description}}
% IEEE Computer Society needs nocompress option
% requires cite.sty v4.0 or later (November 2003)
\usepackage[nocompress]{cite}
% normal IEEE
 % \usepackage{cite}

% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}

\begin{document}

\title{Rejecting Effort Estimation Methods} 

\author{Tim~Menzies,~\IEEEmembership{Member,~IEEE,}
        Omid Jalali,
	    Dan Baker,
		Jairus Hihn,
		and Karen Lum
\thanks{ 
Tim Menzies, Omid Jalali, and Dan Baker
are with the Lane Department of
Computer Science and Electrical Engineering, West
Virginia University, USA: 
\protect\url{tim@menzies.us},
\protect\url{ojalali@mix.wvu.edu}, \protect\url{danielryanbaker@gmail.com}.}
\thanks{Jairus Hihn and Karen Lum at with NASA's Jet Propulsion Laboratory:
\protect\url{jhihn@mail3.jpl.nasa.gov},
\protect\url{karen.t.lum@jpl.nasa.gov}.}
\thanks{
The research described in this paper was carried out at the Jet 
Propulsion Laboratory, California Institute of Technology, under a contract with the US National Aeronautics and 
Space Administration. Reference herein to any specific 
commercial product, process, or service by trade name, 
trademark, manufacturer, or otherwise does not constitute 
or imply its endorsement by the US Government.}
\thanks{See \protect\url{http://menzies.us/pdf/07better.pdf} for an earlier draft of this paper.}
\thanks{
Manuscript received July 31, 2007; revised XXX, XXXX.}}

\markboth{Journal of ???,~Vol.~6, No.~1, January~2007}%
{Menzies \MakeLowercase{\textit{et al.}}: Better Effort Estimation}

\IEEEaftertitletext{\vspace{-1\baselineskip}
\noindent\begin{abstract}
After several years of applying data mining to effort estimation, we
report that many methods are relatively inferior to a small set
of simple methods.
The methods we now reject include
exponential-time feature selection,
model trees, 
standard linear regression, and (for non-sparse data sets)
nearest neighbor methods. The methods we now endorse include
local calibration, possibly augmented with
an almost linear-time feature selection or quadratic-time project selection.
Previously, concerns have been raised to the external validity of
such comparative assessments of effort estimation methods.
We show that our conclusions are stable across data set sub-samples
and a range of evaluation criteria.
\end{abstract}\noindent
\begin{keywords}COCOMO, effort estimation, data mining, 
evaluation, Mann-Whitney (wrong, U), non-parametric tests.
\end{keywords}\vspace{1\baselineskip}}\maketitle


%------------------------------------------------------------------------- 

\section{Introduction}
Software effort estimates are often wrong. 
Initial estimates may be incorrect by a factor of four~\cite{boehm81}
or even more~\cite{kemerer87}.  
Consequently, as the release date
approaches, projects (wrong) realize that they cannot afford to (e.g.) implement all planned
functionality or follow accepted practices for software quality
assurance.
As a result,
systems are often
delivered with reduced (or poorly tested) functionality~\cite{standish01}.
In the worst case, projects are canceled wasting the entire development effort. 
For example, in 2003, NASA canceled
the CLCS system
after spending 
hundreds of millions of dollars on software development.
The project was canceled after the initial estimate
of
\$206 million was increased to between
\$488 million \$533 million~\cite{CLCS03}.

While the need for better estimates is clear,
there exists
a very large number of effort
estimation methods~\cite{jorg04,jorgensen05}
and no good criteria for selecting between them.
Few studies empirically
compare all these
techniques. What is more usual are narrowly focused studies (e.g.
\cite{kemerer87,briand00,lum02,ferens98}) that
test, say,
linear regression models in different environments.
To add to the confusion, there
is an increasing number of open source data mining methods
becoming available such as the R project\footnote{\url{http://www.r-project.org/}},
Orange\footnote{\url{http://www.ailab.si/orange/}} and the
WEKA~\cite{witten05} toolkit that contains over 200 learners. 
Numerous {\em meta-learning} schemes exist
including {\em stacking} where data miners are linked together and
the conclusions of one 
data miner influences the next. 
There exists
one stacking for every ordering of $N$ learners.
That is, five learners can be stacked $5!=120$ ways and ten learners
can be stacked in over four (wrong, three) million ways.

Unless we can comparatively assess and reject effort estimation methods,
there will be too many methods to apply.
For example, our
COSEEKMO~\cite{me06d}
effort estimation toolkit
takes four
days to compare all of its 158 methods. Some of that runtime
is unavoidable since, for reasons of
statistical validity, we sample twenty times
from 19 different projects.
However,
COSEEKMO could run 10 to 100 times faster if its mixture of
bash, gawk, and JAVA programs were recoded in ``C''. 
Also, since much of its processing relates to statistical random sampling,
it could be parallelized and on a CPU farm.

If we could find ways to reject methods, then 
we could avoid that recoding effort or the need for managing a CPU farm.
For example, 35\% (wrong) of COSEEKMO's CPU is devoted to one method (model trees~\cite{quinlan92b}). If it could be shown that this method added little value,
then it would be useful to reject it.

Previously, Foss et.al.~\cite{foss05} and Myrtveit et.al.~\cite{myrtveit05}
have doubted the
practicality of comparatively assessing $L$ different learners processing $D$ data
sets. The results of such a comparison, they argue, vary according to the sub-sample
of the data being processed and the applied evaluation criteria.
These conclusions were based on a study of two real-world
data sets and some extensive simulation
studies on artificial data sets.

In this study, 8 (wrong or what?) different estimation methods were applied
to 19 real-world COCOMO-I data sets~\cite{boehm81}. 
The results were studied using non-parametric
methods (a Mann-Whitney (wrong, add U to it) test~\cite{mann47} over median ranks).
In contrast to Foss \& Myrtveit et.al. we found conclusions that were stable
across different subsets and different
evaluation criteria~\footnote{omid, wrong, maybe not}. We conjecture that our
results are so stable since (a)~non-parametric rank (wrong, rank-sum) tests are less
susceptible to the conflating effects of outliers; (b)~the Mann-Whitney (wrong, U)
test simplifies and clarifies comparative assessment;
and (c)~the numerous assumptions inherent in a COCOMO data set impose 
enough regularities to make clear conclusions.

Based on our stable conclusion,
we can now reject four kinds of effort estimation methods.
Consequently, COSEEKMO's runtimes have improved dramatically from four
days to twenty minutes (wrong, this will change, needs a rerun). 

The rest of this paper is structured as follows.

\subsection{Contribution of this Paper (wrong?)}
\bi
\item confirms much of the conclusions of Foss and myr, but by different methods
\item
extends their results. they dobuted mean MRE. we doubt mean anything
\item
idenfitied errors in prior eork
\item
fixed those errors
\item
Refuted the gaussaian assumption
\item
novel application of non-parameteric test
\item
showed stability in the eval
\item
etc
\item
Optimized cokseemo
\item
did so in a general way that others could adopt
\item
defined nw methodological principle sfor effort estimation
\item 
etc
\ei
\section{Data}
This paper is based on 19 subsets of two COCOMO-I data sets: $NASA93$
\footnote{See "nasa93" at \url{http://promisedata.org/repository}.}
 and $COC81$
\footnote{See "coc81" at \url{http://promisedata.org/repository}.}.
For details on these data sets, see the appendix.

Unlike other effort estimators
such as as PRICE-S~\cite{park88}, or SLIM (Software Life cycle Management)~\cite{putnam92}, or SEER-SEM (System Evaluation and 
Estimation of Resources, Software Estimation Model)~\cite{jensen83}, COCOMO is a public domain system with published data sets 
and baseline results~\cite{chulani99}.
In 2000, Boehm et.al. updated COCOMO-I model~\cite{boehm00b}.
After the update, numerous features of the model remained the same:
\bi
\item
Effort is assumed to be exponential on model size.
\item
For both COCOMO-I and COCOMO-II,
Boehm et.al. recommends a procedure called
{\em local calibration} (described below)
for tuning the generic COCOMO model to a local situation.
\item
Boehm et.al. advises effort estimates can be improved via {\em manual stratification} (described below);
i.e. use domain knowledge to select relevant past data.
\ei

Our previous studies~\cite{me06d} used data in both the
\mbox{COCOMO-I} format as well as COCOMO-II.
At the 2005 COCOMO forum, there was some initial
discussions about relaxing the security restrictions around the COCOMO-II data
set. To date, those discussions have not progressed.
Nevertheless, with the assistance of Barry Boehm,
this team has had occasional access to COCOMO-II data~\cite{me06d,me05c}.
However, since other researchers do not have 
similar access, this paper will only report
results from COCOMO-I.

Using COCOMO-I data makes our work more reproducible.
Reproducibility is an important methodological
principle in other disciplines since it allows a community to
confirm, refute, or even improve prior results.  Of the two
data sets used by Myrtveit et.al. (FINNISH and DESHARNAIS), only
the DESHARNAIS data set is in the public domain\footnote{See
"desharnais" at \url{http://promisedata.org/repository}.}. While
this does not in any way invalidate the results of Myrtveit et.al., it makes it
difficult for others to replicate those results and possibly make improvements.

\section{Comparative Assessment}

Kitchenham et.al.~\cite{kitc01}, Foss et.al.~\cite{foss05}, 
and Myrtveit et.al.~\cite{myrtveit05} caution that,
historically,
ranking estimation methods has been done quite poorly. 
Based on an analysis of two (non-COCOMO)
data sets as well as simulations over artificially generated data set,
Foss \& Myrtveit et.al.
concluded that numerous commonly used methods such as the mean
MRE\footnote{MRE = $abs(predicted - actual)/actual)$.} and PRED(30)\footnote{PRED(N) is
the percent of the mean MRE less than N\%.} are
unreliable measures of estimation effectiveness.
Also, the conclusions reached from these standard measures can vary wildly
depending on which subset of the data is being used for testing~\cite{myrtveit05}.

The instability in these measurements was vividly demonstrated in
our prior study of 19 subsets of $COC81$ and $NASA93$~\cite{me06d}.
In that study, for each subset,
ten projects (at random) were selected as a test set. Effort models were
built on the remaining projects, and then applied to the test set. 
The deviations seen after 30 repeats of the above procedure
was quite large. Within each data set, the standard deviation
on the MREs were $\{median,max\} = \{150\%,649\%\}$.
 
One troubling result from that study
was that training set size was 
{\em not} connected to the size of standard deviation. A pre-experimental intuition
was that the smaller the training set, the worse the prediction inaccuracies.
On the contrary, we found small and large deviations for both small
and large training sets~\cite{me06d}. This result means that these large standard
deviations cannot be tamed by further data collection. Rather, the data must be processed
and analyzed in some better fashion (e.g. using the Mann-Whitney methods (wrong U test) described below).


These large deviations explain much of the contradictory results in the effort estimation
literature.
Jorgensen reviews fifteen studies that compare model-based to expert-based estimation. 
Five of those studies found in
favor of expert-based methods; five found no difference; 
and five found in favor of model-based estimation~\cite{jorg04}.
Such diverse conclusions are to be expected if models exhibit large deviations.

The problem of comparatively assessing $L$ learners run on multiple sub-samples of $D$ data sets
has been extensively studied. 
In his study of four years of proceedings from the {\em International Conference
on Machine Learning}, Demsar~\cite{demsar06}
found that the most common
method of comparative assessment were similar to our previous
COSEEKMO study~\cite{me06d}; i.e. t-tests over some form of repeated sub-sampling
such as cross-validation, separate subsets, or randomized re-sampling. 

\begin{figure}[!t]
\begin{center}
\includegraphics[width=2.5in]{graphs/re_nasa93_fg_g.pdf}
\includegraphics[width=2.5in]{graphs/re_nasa93_year_1975.pdf}
\includegraphics[width=2.5in]{graphs/re_nasa93_mode_embedded.pdf}
\includegraphics[width=2.5in]{graphs/re_coc81_langftn.pdf}
\end{center}
\caption{Relative errors seen in COSEEKMO's
experiments on four data sets. Thin lines show the actual
values. Thick lines show a Gaussian distribution that uses
mean and standard deviation of the actual values.
From top to bottom,
the plots are of:
{\em (top)} NASA ground systems;
NASA software written around 1975;
NASA embedded software (i.e. software developed within tight hardware, software, and operational constraints);  
{\em (bottom)} some FORTRAN-based software systems.
}\label{fig:re}
\end{figure}

Demsar doubts the value of such t-tests since they assume that the populations being sampled
come from Gaussian populations. In the specific case of COSEEKMO's effort estimation
experiments, Demsar's argument is highly relevant.
The thin lines in \fig{re} show a sorted list of the
RE, or the relative error\footnote{RE = $(predicted-actual)/actual$.}, values 
seen when, 20 times, COSEEKMO's 158 methods were applied to subsets
of the data used in~\cite{me06d}. 
(wrong, repeated, ->) The thick lines show a Gaussian cumulative distribution function computed 
from the means and standard deviations of the actual REs (the thin lines).
The plots show that the Gaussian is a poor approximation to  
the actual values.
Observe that 
there exists orders of magnitude differences between the {\em
actual} plots (the thin lines) and the {\em Gaussian} approximations
(the thick lines).
This discrepancy exists for two reasons:
\bi
\item 
Most of the COSEEKMO methods yield only positive
predictions.
Yet half the Gaussian values are below zero. 
\item
While
most of the actual RE values are nearly zero, a very small number
are extremely large. These large values can arise for many reasons
including noise in the data or a training set that learns an overly steep
exponential function for the effort model.
\ei

Note that the
plots of \fig{re} are mostly flat, and spike at the extreme right hand side. 
That is, 
while the outliers 
are large (10 to 1000 times the median values), they are infrequent.
Hence, they will only appear in some sub-samples and in those sub-samples,
all calculations of mean MRE values will be dramatically altered.  

\fig{re} is a strong endorsement of the
Foss \& Myrtveit et.al. conclusions (but for different reasons):
\bi
\item
For data sets with 
large outliers (e.g. \fig{re}),
mean
calculations can be highly misleading. 
Median values\footnote{
Median: the value below which 50\% of the values fall.}  
should be used instead since they 
are far less susceptible to the presence of outliers.
\item
For data sets with only a small number of outliers (e.g. \fig{re}), the conclusions reached from
different subsets can be very different,
depending on the absence or presence of the infrequent outliers.
\ei
Demsar argues that non-Gaussian populations are common enough to require a methodological
change in data mining. After reviewing a wide range of
comparisons methods (paired t-tests with and without the use of geometric means of the relative ratios;
binomial tests with the Bonferroni correction;
paired t-tests; ANOVA; Wilcoxon; Friedman),
Demsar advocates the use of the 1945 Wilcoxon~\cite{wilcoxon45} 
signed-rank test that compares the ranks for the positive and negative differences (ignoring the signs). 
Writing five years early,
Kitchenham et.al.~\cite{kitc01} comment that the Wilcoxon test has its limitations.
Demsar's report  offers the same conclusion, 
noting that the Wilcoxon test requires that the sample sizes are the same.
This is a significant
problem for comparing learner $L_1$ against multiple other learners $L_2,L_3,.,L_x$ all 
running on the same
test set of size $N$: $L_1$ would generate $N$ results while its rivals would generated $N*(x-1)$
results. To fix this problem, 
Demsar augments Wilcoxon with the Friedman test.

One test not studied by Demsar is 
Mann and Whitney's 1947
modification~\cite{mann47} to Wilcoxon. We prefer this test since:
\bi
\item
Mann-Whitney (wrong, U test) does not require that the sample sizes are the same.
So,
in one test, learner $L_1$ can be compared to all its rivals. 
\item
Mann-Whitney (wrong, U test) does not require any post-processing (wrong, needs some sorting of arrays prepared before)
(such as the Friedman test) to conclude if the median rank of one population (say, the $L_1$ results)
is greater than, equal to, or less than the median rank of another (say, the $L_2,L_3,..,L_x$ results).
\ei
To conduct a Mann-Whitney (wrong, U) test,
values are replaced with their ranks in a sorted list of all values.
If treatment $A$ generates  
$N_1=5$ values \{5,7,2,0,4\}
and treatment $B$ generates $N_2=6$ values \{4,8,2,3,6,7\}, then
these sort as follows:
\begin{center}
{\small
\begin{tabular}{c|c|c|c|c|c|c|c|c|c|c|c}
Samples& A & A & B & B & A & B & A & B & A & B & B\\
Values & 0 & 2 & 2 & 3 & 4 & 4 & 5 & 6 & 7 & 7 & 8\\
\end{tabular}}\end{center}
On ranking,
averages are used when values are the same:
\begin{center}
{\small
\begin{tabular}{c|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c@{~}|c}
Samples& A&  A  &B&  B&  A & B  &A&  B&  A&  B & B\\
Values&  0  &2  &2&  3&  4 & 4  &5&  6&  7&  7 & 8\\
Ranks&   1 &2.5&2.5 &4& 5.5&5.5 &7&  8& 9.5&9.5& 11\\
\end{tabular}}\end{center}
Note that, when ranked in this manner,
the largest value in the spikes in \fig{re} would have the same rank, regardless if the spike
was ten to a hundred times larger.
That is, such rank tests
are not susceptible to large outliers.

\begin{figure}[!t]
\begin{tabular}{|p{.95\linewidth}|}\hline
\footnotesize
The sum and median of A's ranks is
\[\begin{array}{ccl}
sum_A &=& 1 + 2.5 + 5.5 + 7 + 9.5 = 25.5\\
median_A& =&5.5
\end{array}\]
and the sum and median of B's ranks is
\[\begin{array}{ccl}
sum_B&=& 2.5 + 4 + 5.5 + 8 + 9.5 + 11 = 40.5\\
median_B&=& 6.75
\end{array}\]
The $U$ statistic is then calculated 
where \mbox{$U_x=sum_x-(N_1(N_2+1))/2$}:
\[\begin{array}{c}
U_A = 25.5 - 5*6/2 = 10.5\\
U_B = 40.5 - 6*7/2 = 19.5
\end{array}\]
These can be converted to a Z-curve using:
\[
\begin{array}{rclcl}
\mu&= &(N_1N_2)/2&=&516.4\\
\sigma & = & \sqrt{\frac{N_1N_2(N_1 + N_2 + 1)}{12}}&=&5.477\\
Z_A & = & (U_A - \mu)/\sigma&=&-0.82\\
Z_B & = & (U_B - \mu)/\sigma&=&0.82\\
\end{array}
\]
(Note that $Z_A$ and $Z_B$ have the same absolute value.
In all case, these will be the same, with opposite signs.)

If $abs(Z) < 1.96$ then the samples $A$ and $B$ have the same median rankings (at
the 95\% significance level). In this case, we add one to both $ties_A$ \& $ties_B$.
Otherwise, their median values can be
compared, using some domain knowledge. In this work, {\em lower} values
are better since we are comparing relative errors. Hence:
\bi
\item
If $median_A < median_B$ add 1 to both $wins_A$ \& $losses_B$.
\item
Else if $median_A > median_B$ add 1 to both $losses_A$ \& $wins_B$.
\item
Else, add 1 to both $ties_A$ \& $ties_B$. 
\ei
\\\hline
\end{tabular}
\caption{An example of the Mann-Whitney (wrong, U) test.}\label{fig:mw}
\end{figure}

\fig{mw} shows the Mann-Whitney (wrong, U) test concluding that these treatments $A$, $B$ have the same
median ranked values. As we have defined it, this test augments counters
for $wins$, $ties$, and $losses$ for effort estimation methods $A$ and $B$
(where $A$ and $B$ are single or groups of methods).
Since we seek methods that can be rejected, the value of interest to us is
how often methods {\em lose}.

\section{Experiments}

When Mann-Whitney (wrong, U test) was applied to COSEEKMO's 158 methods, we found
that most were unremarkable; i.e. neither relatively the best or relatively the worst.
This was both
a disappointing
and exciting result.
On one hand,
it meant that much of the option space explored by COSEEKMO
was superfluous. On the other hand, it 
meant that COSEEKMO could be significantly optimized.

For the rest of this paper, we focus on eight methods:
\bi
\item Four of these
are arguably the best methods in COSEEKMO;
\item One is arguably the worst;
\item
And three fall somewhere in-between best and worst.
These three were selected somewhat heuristically
(they comment on the premises of some of our prior publications~\cite{me05c}).
Since they are neither worst nor best, there is no need
to be rigorous in their selection.
\ei

\subsection{Design of Experiments}

All COSEEKMO's methods use a $P*F$ table of data:
\bi
\item
The table has $P$ rows.
containing information about different projects,
including the 
actual development effort. 
In the 19 subsets of $COC81$ and $NASA93$ used in the study,
$20 \le P \le 121$. These subsets relate to\footnote{omid- real short list. wrong, what do you mean?}.
\item
The table has $F$ columns containing
feature $f$ (wrong?) of a project such as lines of code (KLOC), 
schedule pressure (sced), etc. The features used in this
study come from Boehm's COCOMO-I work and are described in the appendix.
\ei
The rows of this table are divided at random into a $Train$ and $Test$ set
(and $|Train|+|Test|=P$). 
COSEEKMO's different
methods are then applied to the training data to generate a model. This
model is then used on each project in the test set.
In order to compare this study with our work~\cite{me06d},
we use the same test set size as the COSEEKMO
study; i.e. \mbox{$|Test|=10$}.
For the sake of statistical validity, the above procedure is repeated 20 times
for each of the 19 subsets of $COC81$ and $NASA93$.
Hence, this 
generates
$10*20*19=3,800$ results for each method\footnote{omid-right? (wrong, not true for locomo I where we merge all the neighborhoods into one)}. 


\subsection{Methods}
The current version of COSEEKMO has estimation methods 
that combine:
\bi
\item
Various {\em learners} such as
standard linear regression, local calibration,
model trees, or nearest neighbor algorithms. (wrong, don't include NN here. it is in the next one)
\item
Various {\em pre-processors}
that may prune rows or columns.
\ei

The learners can be categorized by their assumptions.
As we shall see, the better learners make the most assumptions about the data.
Many of COSEEKMO's learners are based around Boehm's preferred
method for calibrating generic COCOMO models to local data. This {\em local calibration}
(or LC) procedure, defined in the appendix, 
contains all Boehm's assumptions about effort estimation; i.e. that it
is exponential on lines of code, and that
the features influence the effort by a set of pre-defined constants that can be taken
from Boehm's textbooks~\cite{boehm81,boehm00b}. 
We call this the {\em strong assumptions} group.
\newcommand{\nope}{\ding{56}}
\newcommand{\yupe}{\ding{52}}
\begin{figure*}
\begin{center}
\scriptsize
\begin{tabular}{r|r@{~=~}l|l|l|p{1.5in}}
established\\
practice & method& name                  & row pruning & column pruning         & learner \\\hline
\yupe&a    & LC           & \nope          & \nope                      & LC = Boehms's local calibrations\newline \cite[p526-529]{boehm81}   \\\hline
&b    & COCOMIN +  LC     &  \yupe automatic $O(P^2)$        & \nope                      & local calibration \\\hline
&c    & COCOMIN + LOCOMO + LC   &  \yupe automatic  $O(P^2)$       &  \yupe automatic $O(F{\cdot}log(F) + F)$ &  local calibration \\\hline
\yupe&d    &  manual stratification + LC &  \yupe manual  & \nope             &  manual selection of projects, \newline
plus local calibration\\\hline
&e    &  LOCOMO + LC      &  \nope         & \yupe automatic $O(F{\cdot}log(F) + F)$ &  local calibration\\\hline
&f    & M5pW  + M5p       &  \nope        &  \yupe Kohavi's WRAPPER~\cite{kohavi97} calling M5p~\cite{quinlan92b}, $O(2^F)$ & model trees.\\\hline
&g    & LOCALW  + LC      &  \nope        &  \yupe Chen's WRAPPER~\cite{me06d}  calling LC, $O(2^F)$ & local calibration\\\hline
&h    & LsrW  + LSR       &  \nope        &  \yupe Kohavi's WRAPPER~\cite{kohavi97} calling LSR, $O(2^F)$ & linear regression\\\hline
&i    & NEAREST       &  \yupe automatic $O(P^2)$& \nope &      median  effort of near neighbors\\\hline
\end{tabular}
\end{center}
\caption{Notes on the different methods. F=the number of features.}\label{fig:abcd}
\end{figure*}


Some learners make
{\em weaker assumptions}.
COSEEKMO's model tree, linear regression (or LSR) learners
all assume that effort is exponential on effort (wrong?), that it can be linearized
to a log-linear form. However, there is no assumption about using pre-defined constants. 
As discussed in the appendix, model trees make even fewer assumptions than LSR. Linear
regression assumes that the data can be fitted to a single model. Model trees permit the
generation of multiple models (and learn a decision tree for selecting the appropriate one).
Model trees are much slower than our other methods and consume 35\% of (wrong, we don't know this exactly anymore) COSEEKMO's
CPU.

Finally, the nearest neighbor learners are {\em assumption-less}
since they do not even assume that the model is parametric. While
assumption-less, COSEEKMO's nearest neighbors are not {\em assumption-free}.
COSEEKMO uses a simple n-dimensional Euclidean distance to find similar projects.  
Wilson \& Martinez report that this measure assumes that the training data is not
sparse~\cite{wilson97a}. Such spare data sets arise when many of the
values of project features are unavailable. All our data sets are non-sparse
(we have zero unknown values).

COSEEKMO's pre-processors prune {\em irrelevant} projects and features.
After pruning, the learning (wrong, learner) executes on a new table $P'*F'$
where $P' \subseteq P$ and $F' \subseteq F$.
Two pre-processors are defined: row and column pruning.
One of our row pruners is manual and uses domain knowledge to select past, most-relevant projects.
Two (wrong, ??) other row pruners are automatic and use nearest neighbor methods to find the $k$ most relevant projects to
use for predicting some item in the $Test$ set:
\bi
\item
NEAREST returns median effort of $k$ nearest neighbors.
\item
LOCOMO~\cite{jalali07} is similar but instead of returning the median, it passes
the $k$ nearest neighbors to LC. 
\ei
In both cases, the appropriate value of $k$ is learned by a pre-processor that 
explores the $Train$ing set.
The core of a both algorithms is a distance measure that must compare all pairs of projects.
Hence, these methods take time $O(P^2)$. 

As to the column pruners, these fall into two groups:
\bi
\item
WRAPPER and LOCALW are very thorough search algorithms that explore
subsets of the features, in no set order. This search takes time $O(2^F)$. 
\item COCOMIN~\cite{baker07} is far less thorough.
COCOMIN is a near linear-time pre-processor that pre-sorts the features
on some heuristic criteria, then tries discarding the first $k$ items in the sort
order. COCOMIN runs in almost linear time:
$O(F{\cdot}log(F))$ for the sort and $O(F)$ time for the exploration of first $k$ items.
\ei

To pre-sort the features, COCOMIN runs LC for all every (wrong?) single feature, paired with KLOC,
and scores the resulting effort model on some hold-out test data using MRE.
The feature associated with the highest MRE is given the top position in the sort (for
more details, see the appendix).
Our pre-experimental intuition was that the thorough search of WRAPPER and LOCALW
would perform better than COCOMIN. As shown below, this intuition was false.

%One advantage of LOCOMO is that it reduces the odds that effort estimation
%will use outliers.
%Outliers arise from rare observations or imperfect data
%collection.  
%Outlier removal:
%\bi
%\item
% Simplifies model generation since the models don't have to
%be contorted to fit strange parts of the data.
%\item 
%Simplifies model validation since the model is formed from frequently
%seen parts of the data. When exploring regions with larger
%samples, there are fewer occasions where test data has to
%be interpolated {\em between} known observations.
%\ei
%Outliers can be identified using their distance $k$ to other observations: 
%the more distant the observation, the more likely that it is an outlier.
%Since LOCOMO removes projects with larger $k$ values, 
%it can be used to delete projects containing outliers.
%
The eight methods we will explore are listed $(a,b,c,d,e,f,g,h,i)$ in \fig{abcd}. 
Note that each class may use a column or row pruner but some $(a,i)$ use no pruning at all. 
In class $(f,g,h)$, the notation M5pW and LsrW denotes a WRAPPER that uses
M5p or LSR as its target learner (respectively).
One way to categorize \fig{abcd} is by their relationship to accepted practice
(as defined in the COCOMO texts~\cite{boehm81,boehm00b}):
\bi
\item
Some learners $(a,d)$ use the methods endorsed as best practice in the COCOMO community.
\item
The others are our attempts to do better than current established practice.
\ei

\section{Results}

\begin{figure}
\begin{center}
\includegraphics[width=3.0in]{LossesColumnsNasa93.pdf}
\end{center}
\caption{Number of times a method loses in twelve $NASA93$ subsets.
For an explanation of the x-axis marks $(a,b,c,d,e,f,g,h,i)$, see
\fig{abcd}.
In this figure, the {\em lower} the mark,
the {\em better} the method.}\label{fig:resultsNASA93}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=3.0in]{LossesColumnsCoc81.pdf}
\end{center}
\caption{Number of time a method losses in seven $COC81$ subsets.
For an explanation of the x-axis marks $(a,b,c,d,e,f,g,h,i)$, see
\fig{abcd}.
In this figure, the {\em lower} the mark,
the {\em better} the method.}\label{fig:resultsCOC81}
\end{figure}

\fig{resultsNASA93} and \fig{resultsCOC81}
show the number of $losses$ found by Mann-Whitney (wrong, U test) (at a 95\% confidence level)
over the generated RE scores\footnote{omid- RE or MRE?, wrong, MRE, but this should add in BRE, MER, etc.}
for the eight methods in \fig{abcd} in subsets of $NASA93$ and $COC81$.

The manual stratification results $(d)$ were generated by a modification in the sets used for creating $Train$ sets.
\bi
\item In every case except the for manual stratification, $Train$ and $Test$ are created from the same subsets of $NASA93$ or $COC81$.
\item In manual stratification, the $Test$ is created in the same manner from the subsets. However, the $Train$ is created
from the instances drawn from the $NASA93$ or $COC81$ and not their subsets.
\item In either case, $Train$ and $Test$ are exclusive.
\ei

%The manual stratification results $(d)$ were generated by a comparison of two scenarios:
%\bi
%\item For $all \in \{NASA93,COC81\}$ select $Train$ and $Test$ from $all$;
%\item Alternatively,
%For $all \in \{NASA93,COC81\}$ select $Train$ and $Test$ from a subset of $all$\footnote{omid- this 
%is wrong. plz fix}.
%\ei

All the other marks compare
the other seven methods against all 158 COSEEKMO methods. That is,
for the seven automatic methods $(a,b,c,e,f,g,h,i)$,
all the comparisons are between each \fig{abcd}
and all the other 157 methods.
Mann-Whitney (wrong, U tes) is particularly useful for such large scale comparisons since it makes
no assumptions that the populations being compared are of equal size.

The results of the eight methods in
\fig{resultsNASA93} and \fig{resultsCOC81}
are approximately ordered left-to-right according to the sum of their losses.
For example:
\bi
\item In \fig{resultsCOC81}, methods $(a,d)$ lose the least. 
\item In \fig{resultsNASA93}, $(b)$ loses the least: 
less than ten times (out of 158) in all the $NASA93$ subsets.
\item In both plots, $(i)$ loses the most.
\ei
%XXX fos etal. no clear best

In a result consistent with the cautions of Myrtveit et.al. (that
different subsets lead to different conclusions),
the pattern of losses in $NASA93$ is different to $COC81$:
\bi
\item
In $COC81$ (\fig{resultsCOC81}), the LC method defined by Boehm 26 years ago $(a,e)$ does
as well as anything else and much better than supposedly more sophisticated methods
such as model trees $(f)$. 
\item
But in $NASA93$ (\fig{resultsNASA93}), standard COCOMO methods $(a,e)$ do relatively poorly;
e.g. manual stratification $(e)$ loses up to 35 times. In that data set,
LOCOMO's linear-time row pruning $(b)$ does best.
\ei

While the loss patterns are different in the two subsets, certain patterns repeat:
\bi
\item
(wrong, wrong, wrong, with the new graphs). There is value-added in Boehm's effort estimation assumptions.
With only one exception, the {\em strong assumption}
methods, $(a,b,c,d,e,g)$ do better than the others. The only exception is $(g)$
which suffers from many losses in $NASA93$. 
\item
There is no value-added in these data sets from nearest neighbor methods $(i)$.
This is a result predicted by theory.
Shepperd \& Schofield argue that their case-based reasoning methods, like nearest neighbor,
are better suited to sparse data domains where precise numeric
values are {\em not} available on all factors~\cite{shepperd97}.
As mentioned above, all our data is non-sparse.
\item
(wrong wrong, needs revision maybe) A single linear model is adequate for the purposes of effort estimation.
All the methods that assume multiple linear models, such as model trees $(f)$,
or no parametric form at all, such as nearest neighbor $(i)$, perform
relatively poorly.
\item
(wrong, revision maybe) Elaborate searches do not add value to effort estimation.
All the $O(2^F)$ column pruners $(f,g,h)$ do worse than near-linear-time
column pruning $(c,e)$. We attribute the success of the near-linear-time pruners
to the pre-sort criteria. COCOMIN uses LC to pre-sort the features.
Repeating the above remarks regarding strong assumptions,
Boehm's effort estimation assumptions embodied in LC are very effective modeling heuristics.
\item
(wrong,revision maybe) The slower implementations such as model trees $(f)$, that consume (wrong, ??)
35\% of the COSEEKMO's runtimes,
do no better than other methods. 
\ei
Combining the last two points, we observe that COSEEKMO can run much faster
if it ignores $O(2^F)$ column pruning methods $(f,g,h)$
and model trees $(f)$.

As to the ``best'' method for effort estimation:
there is no clear ``best'' method. We advise that the following methods should be tried
and the one that does best on the available training data (assessed using Mann-Whitney (wrong, U Test))
should be used to predict new projects:
\bi
\item Make strong assumptions and use LC-based methods.
\item Row and column pruning can be useful but elaborate column pruning (requiring
an $O(2^F)$ search), is not useful. 
Hence, try LC with zero or more of LOCOMO's row pruning or COCOMIN's column
pruning; i.e. $(a,b,c,d)$.
\item If the training data is sparse, then nearest neighbor methods $(i)$ 
should at least be tried as well.
\ei

\section{External Validity}
\PARstart{A}s with any empirical study, our conclusions are biased
according to the analysis method. In our case, that bias
expresses itself in five ways: biases in the paradigm, biases in the model,
biases in the data, biases in the selection of methods,
and biases in the evaluation metric.

{\em Biases in the paradigm}: The rest of this paper explores more
model-based methods (e.g. COCOMIN, LOCOMO, LC) and not expert-based methods. 
Model-based methods use some algorithm to summarize old data and make predictions 
about new projects.
Expert-based methods use human expertise (possibly augmented with process guidelines
or checklists) to generate predictions.  
Jorgensen~\cite{jorg04} argues that most industrial effort estimation is expert-based
and lists 12 {\em best practices} for such effort-based estimation.
The comparative evaluation of model-based
vs expert-based methods must be left for future work. Before
we can compare any effort estimation
methods (be they model-based or expert-based) we must first demonstrate that any two
methods can be comparatively assessed.
For more on expert-based methods, see~\cite{jorg04,jorgensen04,shepperd97,chulani99}.

{\em Biases in the model:} This study uses COCOMO data sets since 
that was the only public domain data we could access.  
Nevertheless, the techniques described here can easily be generalized to other models. For example,
here we use COSEEKMO to select best parametric methods in the
COCOMO format~\cite{boehm81,boehm00b} but it could just as easily
be used to assess other model-based tools like
PRICE-S~\cite{park88}, SEER-SEM~\cite{jensen83}, 
or SLIM~\cite{putnam92}.
However, it should be noted that in the above study, 150 out of 158 methods were neither best
or worst (wrong, how do we know this except by looking at thos "all" graphs?).
If those percentages carry over to a study of SEER-SEM vs PRICE-S vs SLIM, then
we would predict that will yield similar performance.

{\em Biases in the data:} Issues of sampling bias threaten any data
mining experiment; i.e. what matters {\em there} may not be true {\em here}.
Our model-based methods use data
and so are only useful in organizations that maintain historical
data on their projects. Such data collection is rare in organizations
with low process maturity. However, it is common elsewhere; e.g. amongst government
contractors whose contract descriptions include process auditing requirements.
For example, it is common practice at NASA and the United States Department of Defense
to require a model-based estimate at each project milestone.
Such models are used to generate estimates or to double-check an expert-based
estimate.

Another source of data bias was already mentioned above;
our data sets are non-sparse and sparse data sets may
be more suitable for nearest neighbor tools.
Yet another source of bias is that some of the data used here comes from NASA and NASA
works in a particularly unique market niche. Nevertheless, we argue
that results from NASA are relevant to the general software engineering
industry. NASA makes extensive use of contractors.
These contractors service many other industries. 
These contractors are contractually
obliged (ISO-9001) to demonstrate their understanding and usage of
current industrial best practices. 
For these reasons, other noted researchers such
as Basili, Zelbowitz, et.al.~\cite{basili02} have argued that
conclusions from NASA data are relevant to the general software
engineering industry.

Myrtveit et.al.~\cite{myrtveit05} may not accept this last paragraph, arguing that their
results show very different conclusions, according to which subsets were explored
\footnote{OMID, wrong, wrong, wrong: over to you. please show 4 little plots of NASA/COC repeated twice. comment that the patterns were essentially
stable despite large-scale random selection of the subsets used for train and test}.

{\em Biases in the selection of methods:} Another source of bias in
this study is the set of methods explored by this study.
We can make no claim that \fig{abcd} represents the space of possible
effort estimation methods. (However, as discussed above, it covers
an interesting range of effort modeling assumptions).  
Instead of claiming that $(a,b,c,d)$ are ``best'', 
we really should say that $(a,b,c,d)$ are the best we have seen so far after four years
of trying many alternatives. Nevertheless, the reader
may know of other effort estimation methods they believe we should try.
Alternatively, the reader may have a design or an implementation of a new kind
of effort estimator. In either case, before it can be shown that an existing or new method
is better than those shown in \fig{abcd},
we first need a demonstration that there
exists statistical tests that distinguish between methods.
This paper offers such a demonstration.

{\em Evaluation bias:}
OMID- over to you. results from switching eval bias. (wrong wrong wrong)

%only cocomo framework
%
%only model-based (but some issues with the expert/model-based distinction).
%\begin{figure}
%\small
%\begin{tabular}{l|rr|rrr}
%&class& notes&		ties	&wins&	losses\\\hline
%COC81&a&standard lc&0&8&0\\
%&b&none+locomo+lc&2&5&1\\
%&I&none+manualStrat+lc&2&5&1\\
%&c&cocomin+none+lc&2&4&2\\
%&d&cocomin+locomo+lc&2&4&2\\
%&g&localW+none+m5p&0&3&5\\
%&f&m5pW+none+m5p&0&2&6\\
%&e&lsrW+none+lc&0&1&7\\
%&h&none+nearest+none&0&0&8\\\hline
%
%NASA93&a&standard lc&2&6&0\\
%&c&cocomin+none+lc&3&5&0\\
%&d&cocomin+locomo+lc&7&1&0\\
%&i&none+manualStrat+lc&3&4&1\\
%&b&none+locomo+lc&5&1&2\\
%&f&m5pW+none+m5p&3&2&3\\
%&g&localW+none+m5p&4&1&3\\
%&e&lsrW+none+lc&3&1&4\\
%&h&none+nearest+none&0&0&8
%\end{tabular}
%\caption{Mann-Whitney results on 7 subsets of COC81 (above)
%and 9 subsets of NASA93 (below).}\label{fig:results1}
%\end{figure}

\section{Discussion}

Definitions of best practices
are constantly changing (e.g. structured programming, object-oriented, value-based ,etc).
There are few stable conclusions in software engineering. 
One such stable conclusion is Boehm's 1981 modeling work on COCOMO-I. Despite four
years of trying to do better with a large suite of modern data mining methods,
the best we can do is to augment his standard 1981 local calibration procedure
with some simple near-linear time
column pruning and quadratic-time $(O(P^2)$ row pruning.

The Mann-Whitney (wrong, U test) have shown that most of the methods we have explored
have similar performance that is inferior to a small number of simple techniques.
The conclusion of this study is that there may be little value in
exploring yet more small variants to current data miners. Rather,
we need to explore radically different methods for supporting managerial 
decisions about software projects.

XXX, (wrong,....????, incomplete???)

\appendix
\subsection{Data used in this Study}

In this study, effort estimators were built using all or some {\em part} of data from two sources:
\bdd
\item[{\em $COC81$:}] ~~63 records in the
COCOMO 81 (wrong, COCOMO-I, hereafter) format. Source: \cite[p496-497]{boehm81}.  Download from 
\url{http://unbox.org/wisp/trunk/cocomo/data/coc81modeTypeLangType.csv}.
\item[{\em $NASA93$:}] ~~93 NASA records in the COCOMO 81 format.  
Download from \url{http://unbox.org/wisp/trunk/cocomo/data/nasa93.csv}.
\edd

Taken together, these two sets are the largest
COCOMO-style data source in the public domain (for reasons
of corporate confidentiality, access to Boehm's COCOMO-II data set is
highly restricted).
$NASA93$ was originally collected to create a NASA-tuned version of
COCOMO, funded by the Space Station Freedom Program and 
contains data from six NASA centers including the Jet Propulsion
Laboratory For more details on this dataset, see~\cite{me06d}.

The subsets in the data are:
\bdd
\item[{\em All:}] selects all records from a particular source;
e.g. ``coc81\_all''.
\item[{\em Category:}] ~~~~~is a NASA-specific
 designation selecting the type of project; e.g. avionics, data capture, etc.
\item[{\em Dev:}] indicates the development methodology; e.g. div.waterfall.
\item[{\em DevEnd:}] ~~~shows the last year of the software project.
\item[{\em Fg:}] selects either ``$f$'' (flight) or ``$g$'' (ground) software.
\item[{\em Kind:}] selects records relating
to the development platform; max= mainframe and mic= microprocessor.
\item[{\em Lang:}] selects records about different development languages.
\item[{\em Project} and {\em center}:]~~~~~~~~~~~~~~~  $NASA93$
 designations selecting records relating to where the software was built and the name
of the project.
\item[{\em Mode=e:}]~\newline
selects records relating to
the {\em embedded}  COCOMO 81 development mode: one of semi-detached,
embedded, organic.
\item[{\em Mode=o: }]~\newline
selects COCOMO 81 {\em organic} mode records.
\item[{\em Mode=sd: }]~\newline
selects COCOMO 81 {\em semi-detached} mode records.
\item[{\em Size:}] is a $cocII$ specific designation grouping the records into (e.g.) those around 100KLOC.
\item[{\em Type:}] selects different $COC81$  designations and include ``bus'' (for business application)
or ``sys'' (for system software).
\item[{\em Year:}] is a $NASA93$ term that
selects the development years, grouped into units of five; e.g. 1970,1971,1972,1973,1974
are labeled ``1970''.
\edd

\subsection{Learners}
\subsubsection{Learning with Linear Regression}

Linear regression assumes that the data can be
approximated by one linear model that includes lines of code
(KLOC) and other features $f$ seen in a software development
project:
\[\footnotesize
effort = \beta_0 + \sum\beta_i\cdot f_i
\]
Linear regression adjusts $\beta_i$ 
to minimize the {\em prediction error} (the
difference between project predicted and actual
values).

Boehm argues that effort is exponential on KLOC~\cite{boehm81}:
\[
effort = a \cdot KLOC^b  \cdot \prod_i\beta_i
\]
(where a,b are domain-specific constants). Such exponential
functions can be learned via linear regression after they are
converted to the following linear form.
\[
ln(effort)=ln(a)+b{\cdot}ln(KLOC)+\sum_iln(\beta_i)
\]
All our methods transform the data in this way and, when predictions are generated
for the $Test$ set, they are unlogged prior to computing performance statistics\footnote{omid is there a better way to say this? like we always convert P*F to P*log(F),wrong wrong wrong}.

\subsubsection{Learning with Model Trees}

Model trees are a generalization of linear regression.
Instead of fitting the data to {\em one linear model}, 
model trees learn {\em multiple linear models}, and a
decision tree that decides which linear model to use.
Model trees are useful when the projects form regions and
different models are appropriate for different regions.

COSEEKMO includes the M5p model tree learner defined by Quinlan~\cite{quinlan92b}.


\begin{figure}
\begin{center}
{\scriptsize
\begin{tabular}{l|r@{:~}l|}\cline{2-3}
upper:   &acap&analysts capability\\
increase &pcap&programmers capability\\
these to &aexp&application experience\\
decrease &modp&modern programming practices\\
effort   &tool&use of software tools\\
         &vexp&virtual machine experience\\
         &lexp&language experience\\\cline{2-3}
middle   &sced&schedule constraint\\\cline{2-3}
lower:   &data&data base size\\
decrease &turn&turnaround time\\
these to &virt&machine volatility\\
increase &stor&main memory constraint\\
effort   &time&time constraint for cpu\\
         &rely&required software reliability\\
         &cplx&process complexity\\\cline{2-3}
\end{tabular}}
\end{center}
\caption{{
Features used in this study. From~\cite{boehm81}. Most
range from 1 to 6 representing ``very low'' to ``extremely high''.
}}\label{fig:em}
\end{figure}

\begin{figure}
\begin{center}
{\scriptsize
\begin{tabular}{|l|c|r@{~}|r@{~}|r@{~}|r@{~}|r@{~}|r|}
    \hline
    &&1&2&3&4&5&6\\
    \hline
upper&ACAP   &1.46   &1.19   &1.00   &0.86   &0.71   &\\
(increase&PCAP   &1.42 &1.17   &1.00   &0.86   &0.70 &\\
these to&AEXP   &1.29 &1.13   &1.00   &0.91   &0.82   &\\
decrease&MODP   &1.2  &1.10 &1.00 &0.91 &0.82 &\\
effort)&TOOL   &1.24 &1.10 &1.00 &0.91 &0.83 &\\
&VEXP   &1.21 &1.10 &1.00 &0.90 &  &\\
&LEXP   &1.14 &1.07 &1.00 &0.95 &  &\\\hline
middle&SCED   &1.23 &1.08 &1.00 &1.04 &1.10 &  \\\hline
lower&DATA   &    & 0.94 &1.00 &1.08 &1.16&\\
(increase&TURN   &       &0.87   &1.00   &1.07   &1.15   &\\
these to&VIRT   &       &0.87   &1.00   &1.15   &1.30   &\\
increase&STOR   &       &       &1.00   &1.06   &1.21   &1.56\\
effort)&TIME   &  &    &1.00   &1.11   &1.30   &1.66\\
&RELY   &0.75& 0.88& 1.00 & 1.15 & 1.40&\\
&CPLX   &0.70 &0.85 &1.00 &1.15 &1.30 &1.65\\
    \hline
\end{tabular}}
\end{center}
\caption{The COCOMO 81 $\beta_i$ table~\cite{boehm81}. 
For example, the bottom right cell is saying that if CPLX=6, then
the nominal effort is multiplied by 1.65.}\label{fig:effortmults}
\end{figure}

\subsubsection{Learning with Local Calibration}

Local calibration (LC) is a specialized form
of linear regression developed by Boehm~\cite[p526-529]{boehm81}.
LC assumes project effort is exponential on KLOC; i.e. 
\[\footnotesize
effort = a\cdot KLOC^b \prod\beta_i
\]
\fig{effortmults} shows the $\beta_i$ values
recommended by Boehm (the names on the left hand side are defined in \fig{em}).
When $\beta_i$ is used in the above equation, they yield estimates in months 
where one month is 152 hours (and includes development and management hours).
To operate, LC linearizes the exponential equation to generate 
\[\footnotesize
log(effort)= log(a) + b\cdot log(KLOC) + \sum log(\beta_i)
\]
Linear regression would try to adjust all the $\beta_i$ values.
This is not practical when training on a very small number of projects.
Hence, LC fixes the $\beta_i$ values while adjusting the $<a,b>$ values to minimize
the prediction error. We shall refer to LC as ``standard practice'' since, in the COCOMO community
at least, it is the preferred method for calibrating standard COCOMO~\cite{boehm00b}.

\subsubsection{Learning with Nearest Neighbor}
(wrong wrong wrong, this is a preprocessor not learner, but then a problem with it not being a learner and used on data!!!).
Nearest neighbor make predictions using past data that is
similar to a new situation. Some distance measure is used to find the $k$ nearest
older projects to the a project in the test set. An effort estimate
can be generated from the median (wrong) effort of the $k$ nearest neighbors.

The benefit of nearest neighbor algorithms is that they
make the fewest domain assumptions. That is, they can process a broader range of the data available within projects.
For example:
\bi
\item
LC cannot be applied unless projects are described using the COCOMO ontology (\fig{em}). 
\item
Linear regression and model trees are best applied to data where most of the values for the numeric factors are known.
\ei

The drawback of nearest neighbor algorithms is that, sometimes, the domain assumptions they ignore are important
to that domain. For example, if effort is really exponential on KLOC, a standard nearest neighbor algorithm
has no way to exploit that.

\subsection{Pre-Processors}

\subsubsection{Pre-processing with Row Pruning}
Project data collected in one context many not be relevant to another.
Kitchenham et.al.~\cite{kitch07} take great care to document this effect.
They conducted a systematic review of ten projects comparing estimates using 
historical data {\em within} the same company or {\em imported} from another. In no
case was it better to use data from other sites, and sometimes
importing such data yielded significantly worse estimates.

Hence, given an a table of $P$ projects, it is good practice to
prune away rows relating to irrelevant projects. An often repeated
result in effort estimation is that the careful selection of project
data can improve the effectiveness of estimation. Similar projects
have less variation and so can be easier to calibrate. 
Chulani et.al.~\cite{chulani99} report that row pruning
can improved their PRED values anywhere between 5\% to 12\%.
Shepperd and Schofield report even more dramatic improvements in
PRED from 4\% to 44\%~\cite{shepperd97}.

Row pruning can be {\em manual} or {\em automatic}.
In {\em manual row pruning} (also called ``stratification'' in the COCOMO
literature~\cite{boehm00b}), an analyst applies their domain knowledge
to select project data that is similar to the new project to be
estimated.

{\em Automatic row pruning} use algorithmic techniques to select a subset
of the rows. The LOCOMO tool~\cite{baker07} (wrong,~\cite{jalali07}) in COSEEKMO
is a row pruner that combines a nearest neighbor method
with LC. LOCOMO prunes away all projects except those $k$ ``nearest'' to the 
$Test$ data.

To learn an appropriate value for $k$, LOCOMO uses the $Train$ing set as follows:
\bi
\item For each project $p_0\in Train$, LOCOMO sorts the remaining 
$Train - p_0$ examples by their Euclidean distance from $p_0$. 
\item LOCOMO then passes the $k_0$ examples closest to $p_0$
to LC. The returned $<a,b>$ values are used to estimate
effort for $p_0$.
\item After trying all $p_0\in Train$, $k_1$ is then set to the $k_0$ value that yielded
the smallest median MRE.
\ei
This calculated value $k_1$ is used to estimated projects in the
$Test$ set. For all $p_1\in Test$, the $k_1$ nearest neighbors
from $Train$ are passed to LC. 
The returned $<a,b>$ values are then used to estimate $p_1$.
In practice, for our project data, LOCOMO returns $k$ values in the range 4..11 (wrong, we don't know exactly).

\subsubsection{Pre-Processing with Column Pruning}
Kirsopp \& Schofeld~\cite{kirsopp02} 
and Chen \& Menzies \& Port \& Boehm~\cite{me05c}
report that column pruning improves effort estimation.
Miller's research~\cite{miller02} explains why.
Column pruning (a.k.a. feature subset selection~\cite{hall03} or variable
subset selection~\cite{miller02}) reduces the deviation of a
linear model learned by minimizing least squares error~\cite{miller02}.
To see this, consider a linear model with constants $c_i$ 
that input features $f_i$ to predict for $y$:
\[y = \beta_0 + \beta_1\cdot f_1 + \beta_2\cdot f_2 + \beta_3\cdot f_3 ...\]
The variance of $y$ is some function of the variances in $f_1, f_2$, etc.
If the set $f$ contains ``noise'' (spurious signals unconnected to the target variable $y$)
then random variations in $f_i$ can increase the uncertainty of $y$.
Column pruning methods decrease the number of features $f_i$, thus
increasing the stability of the $y$ predictions. 
That is, the fewer the columns, the more restrained are the model predictions.

Taken to an extreme, column pruning can reduce $y$'s variance 
to zero (e.g. by pruning the above equation back to $y=\beta_0$)
but increase model error (the equation $y=\beta_0$ will ignore all project data when generating estimates).
Hence, intelligent column pruners experiment with some proposed subsets $f'\subseteq f$ before
changing that set. COSEEKMO currently contains three intelligent column pruners: WRAPPER, LOCALW, and COCOMIN.

WRAPPER~\cite{kohavi97} is a standard best-first search through the space of possible features. 
At worse, the WRAPPER must search an space exponential on the number of features $F$; i.e. $2^F$.
However, a  simple best-first heuristic makes WRAPPER practical. for effort estimation.
At each step of the search, all the current subsets are scored by passing them to
a {\em target leaner}. If a set of features does not score better than a smaller subset,
then it gets one ``mark'' against it. If a set has more than $STALE=5$ number of marks, it is deleted.
Otherwise, a feature is added to each current set and the algorithm continues. 

In general, a WRAPPER can use any target learner. Chen's LOCALW is a WRAPPER specialized
for LC. Previously~\cite{me06d,me05c}, we have explored LOCALW for effort estimation.

Theoretically, WRAPPER (and LOCALW)'s exponential time search is more
thorough, hence more useful, than simpler methods that try fewer options. To 
test that theory, we will compare WRAPPER and LOCALW to a linear-time column pruner 
called COCOMIN~\cite{baker07}.

COCOMIN is defined by the following operators:
\[\{sorter, order, learner, scorer\}\]
The algorithm runs in linear time over a {\em sorted} list of features $1..F$, 
This search can be {\em order}ed in one of two ways:
\bi
\item
A ``back-select'' starts with all features $F$ and throws some away, one at a time.
\item
A ``forwards-select'' starts with one feature and adds in the rest, one at a time.
\ei
Regardless of the search order, at some point $t$ ($1 \le t \le F$), 
the current set of features are passed to a {\em learner}
to generate a performance {\em score} by applying the model learned on the current features
to the $Train$ing data. COCOMIN returns the features associated with the highest score.

The version of COCOMIN used in this study:
\bi
\item $sorted$ the features by passing each one, with KLOC, to LC. 
The top-sorted features were those that had highest MRE\footnote{is that defined yet? wrong, ????}.
\item $ordered$ the search with a back-select;
\item $learned$ using LC;
\item $scored$ using PRED(30)\footnote{is this defined yet? oh- and what about horizon, wrong wrong wrong}.
\ei

\bibliographystyle{plain}
%\bibliographystyle{IEEEtran}
\bibliography{refs}
\end{document}
