\documentclass[journal,compsoc]{IEEEtran}
%\documentclass[onecolumn,12pt,journal,compsoc]{IEEEtran}\renewcommand{\baselinestretch}{1.41}

\usepackage[table]{xcolor}

% *** CITATION PACKAGES ***
%
\ifCLASSOPTIONcompsoc
  % IEEE Computer Society needs nocompress option
  % requires cite.sty v4.0 or later (November 2003)
  % \usepackage[nocompress]{cite}
\else
  % normal IEEE
  % \textsc{}\usepackage{cite}
\fi

\usepackage{cite,url}

\ifCLASSINFOpdf
  \usepackage[pdftex]{graphicx}
  % declare the path(s) where your graphic files are
  \graphicspath{{/plots/}}
  % and their extensions so you won't have to specify these with
  % every instance of \includegraphics
  \DeclareGraphicsExtensions{.eps,.pdf,.jpeg,.png}
\else
  % or other class option (dvipsone, dvipdf, if not using dvips). graphicx
  % will default to the driver specified in the system graphics.cfg if no
  % driver is specified.
  % \usepackage[dvips]{graphicx}
  % declare the path(s) where your graphic files are
  % \graphicspath{{../eps/}}
  % and their extensions so you won't have to specify these with
  % every instance of \includegraphics
  % \DeclareGraphicsExtensions{.eps}
\fi


\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{verbatim}
\usepackage{fancyvrb}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{timtricks}
\usepackage{multirow}
\usepackage{rotating}
\usepackage{times}
\usepackage{float}
\usepackage{color}
\newcounter{over}
\newcounter{max}
\newcommand{\fred}[2]{\setcounter{over}{#2}\addtocounter{over}{#1}}
\newcommand{\boxplot}[5]{%
\scalebox{1}{\textcolor{white}{\setcounter{max}{#4}\addtocounter{max}{#5}\fred{#4}{-#2}}%
\begin{picture}(100,5)%
\put(0,0){\line(0,1){6}}%
\put(100,0){\line(0,1){6}}%
\put(#2,4){\line(1,0){\theover}}%
\put(#3,4){\circle*{3}}%
\put(50,0){\line(0,1){6}}%
\end{picture}}
}

% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}

\begin{document}

\title{Sampling Methods in Software Effort Estimation: An Investigation on Bias-Variance Trade-Off}

\author{Ekrem~Kocaguneli,~\IEEEmembership{Student Member,~IEEE,}
        Tim~Menzies,~\IEEEmembership{Member,~IEEE,}
        and Martin~Shepperd,~\IEEEmembership{Member,~IEEE}
\IEEEcompsocitemizethanks{\IEEEcompsocthanksitem Ekrem Kocaguneli and Tim Menzies are with the Lane Department of Computer Science and Electrical Engineering, West Virginia University.
E-mail: ekocagun@mix.wvu.edu, tim@menzies.us
\IEEEcompsocthanksitem Martin Shepperd is with the School of Information Systems Computing \& Maths, Brunel University.
E-mail: martin.shepperd@brunel.ac.uk}% <-this % stops a space
\thanks{This research is funded in part by NSF CISE project \#0810879.}}



\markboth{Journal of IEEE Transactions on Software Engineering,~Vol.~X, No.~Y, SomeMonth~201z}%
{Kocaguneli \MakeLowercase{\textit{et al.}}: Sampling Methods in Software Effort Estimation}
\IEEEcompsoctitleabstractindextext{

\begin{abstract}

\textbf{\textit{Background}}: More than half the software effort estimation studies focus on model comparisons, for which sampling methods (SMs) are fundamental.
Studies adopt different SMs such as leave-one-out (LOO), 3Way and 10Way cross-validation.
However, none of the studies that we are aware of empirically justifies its choice.
\textit{\textbf{Aim:}} According to theory, a higher number of smaller test sets increases the variance and decreases the bias.
However, there is no empirical investigation of this theoretical concept in software effort estimation.
In this paper, we empirically investigate whether theoretical assumptions hold for software effort datasets.
\textit{\textbf{Method:}} We evaluated 90 different algorithms on 20 different effort datasets.
For each algorithm, we calculated the bias and variance values under LOO, 3Way and 10Way to check if they behave as predicted by the theory.
\textit{\textbf{Results:}} 
This extensive study showed that the theory does not hold for effort estimation datasets.
We observed that: 1) the majority of the algorithms have statistically the same bias and variance values under different SMs and 2) different SMs have similar run times.
\textit{\textbf{Conclusion:}} Since effort datasets used here are indifferent to sampling methods, we conclude that the bias-variance trade-off and run times are not the main concerns for selecting an SM.
When opting for a particular SM, we recommend considering reproducibility of the results and suggest LOO as the appropriate method for that purpose.

\end{abstract}

% Note that keywords are not normally used for peer review papers.
\begin{IEEEkeywords}
Software Cost Estimation, Experimentation, Bias, Variance
\end{IEEEkeywords}}

\maketitle

\section{Introduction}

The choice of sampling method is an important issue for software effort estimation (from now on SEE) studies, and an empirical study comparing the pros and cons of different sampling methods in SEE is urgently needed.

The biggest research topic in SEE since the 1980s has been the introduction and comparison of new methods~\cite{jorgensen07}.
In their comprehensive systematic review, Jorgensen and Shepperd report that $61\%$ of the selected SEE papers deal with this topic~\cite{jorgensen07}.
This group of papers uses \textit{historical data}; i.e. not a single one of them employs a new data collection methodology.

Ideally, a learned theory should be applied to new scenarios to observe whether the predicted effect occurs in practice.
Generating theories only from historical data entails an internal validity threat, which we call the \textit{``fixed-scenario-problem''}.
The \textit{fixed-scenario-problem} is the lack of new scenarios in the evaluation, and it threatens evaluation experiments like the ones reported in~\cite{jorgensen07}.
Therefore, studies without a new scenario for the learned model are limited to their experimental settings.
On the other hand, it is impractical to expect every study to collect new data.
The \textit{fixed-scenario-problem} can be mitigated by simulating the application of a method to a new situation.
A sampling method (from now on SM) forms the basis of such a simulation~\cite{Demsar2006,Alpaydin2004}.

There is a wide palette of SMs available in the literature: Leave-one-out (LOO), 3Way and 10Way are some of the most commonly used~\cite{demsar06, Alpaydin2004, lessmann09, seni10}.
Similar to choosing colors from a palette, the choice of different SMs results in a different picture.
For example, theoretically LOO results in high-variance and low-bias in the results, whereas 3Way generates just the opposite (low-variance, high-bias) and 10Way in between the two methods~\cite{seni10,Hastie2008}.
The change from one method to the other is known as the bias and variance (from now on $B\&V$) trade-off.
Employing the right SM and correctly interpreting the $B\&V$ trade-off is crucial to the validity of a particular study.


To our surprise, there is no study in the SEE domain that rigorously experiments on the effects of different SMs.
Kitchenham et al. have already identified and raised the issue of SM selection~\cite{Kitchenham2007}; however, their mention is more a pointer to future work.
Hence, this paper is a natural follow-up to previous SEE studies.
Furthermore, it is the first of its kind to rigorously investigate the $B\&V$ trade-off inherent in different SMs, and it concerns more than half the SEE field.
In this paper, we present $B\&V$ trade-off results of 3 different SMs: LOO, 3Way and 10Way.
Our experimentation includes $90$ methods applied on $20$ datasets.
 
The results show that the $B\&V$ behavior of SMs differs from the predicted one: For the majority of the algorithms, the $B\&V$ values of LOO, 3Way and 10Way are statistically the same.
Furthermore, we have seen that different SMs are very close to one another in terms of run times; see \fig{runtimes} for exact values.
The values of \fig{runtimes} belong to experiments coded in MATLAB and run on a 64-bit dual-core machine.
These findings are associated with the following experimental concerns:
\bi
\item non-biased evaluation of methods,
\item reproduction of the experiments,
\item and low run times.
\ei


In terms of bias and run times, SMs are similar.
However, for the reproduction of experiments, LOO is superior to 3Way and 10Way.
The 3Way and 10Way methods entail the use of random number generators, and there is the issue of preserving similar statistical properties in the subsets at every iteration.
Therefore, they are stochastic methods that cannot be \textit{exactly} reproduced.
On the other hand, LOO is a deterministic method, which can be repeated exactly by other researchers.
As stated by Menzies and Shull in \cite{Oram2010}, \textit{``results are far more convincing when they're found again''}.
Based on these findings and the related experimental concerns, we recommend LOO over the other SMs.


It is fundamentally important to reach an empirical conclusion on the SMs to be used in SEE.
Due to the lack of concrete conclusions (the so-called \textit{conclusion instability}), it is difficult to create a consensus on SMs.
See in \fig{dataset-paper} the mapping of SMs to studies with which we share one or more common datasets.
As can be seen, LOO and ad-hoc methods are dominant, followed by 10Way and 3Way; i.e. there is no consensus.
Furthermore, in none of these studies is there an empirical justification of the choice of SM.
Here, we provide such an empirical analysis and suggest LOO as the common SM to be used in SEE.
These conclusions are based on a diverse set of methods and datasets; hence, they apply to a large body of SEE studies.
Yet we acknowledge that this study should be repeated for new datasets.


\begin{figure}[!t]
\centering
{ 
\begin{tabular}{l|l}  
\textbf{SM} & \textbf{Run Time} \\\hline 
LOO & $9,960$ \\
3Way & $9,360$ \\
10Way & $10,140$ \\
\end{tabular}}
\caption{The run times in seconds for different SMs. These run times are for all the datasets included. Notice how close the run times of the SMs are.}
\label{fig:runtimes}
\end{figure}


\subsection{Contributions}

The contributions of this research are:

\bi
\item The first systematic investigation of \textit{B\&V} trade-off in SEE domain
\item An extensive experimentation of 20 datasets and 90 algorithms
\item Showing that the \textit{B\&V} trade-off and the run times of SMs are not the main concerns for SEE
\item Recommendation based on experimental concerns:
\bi
\item For reproducibility prefer LOO
\ei
\ei

\section{Terminology}
\label{sect:terminology}

A typical SEE dataset consists of a matrix $X$ and a vector $Y$.
The input variables (a.k.a. features) are stored in $X$, where each row corresponds to an observation and each column corresponds to a particular variable.
Similarly, the dependent variable is stored in the vector $Y$, where for each observation in $X$ there exists a response value.

Now assume that a prediction model represented by $\hat{f}(x)$ has been learned from a training dataset and the actual values in the training set were generated by an unknown function $f(x)$.
So as to measure the errors between the actual values in Y and the predictions given by $\hat{f}(x)$, we can make use of an error function represented by $L(Y,\hat{f}(x))$.
Some examples of error functions are squared loss (Equation \ref{equ_loss_squared}) and absolute loss (Equation \ref{equ_loss_abs}).


\begin{equation}
L(Y,\hat{f}(x)) = \left( Y-\hat{f}(x) \right)^2
\label{equ_loss_squared}
\end{equation}

\begin{equation}
L(Y,\hat{f}(x)) =  |Y-\hat{f}(x)|
\label{equ_loss_abs}
\end{equation}
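As a concrete illustration, the two loss functions above are one-liners in code (a Python sketch under our own naming; the experiments in this paper were coded in MATLAB):

```python
def squared_loss(y, y_hat):
    """Squared loss: L(Y, f_hat(x)) = (Y - f_hat(x))^2."""
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    """Absolute loss: L(Y, f_hat(x)) = |Y - f_hat(x)|."""
    return abs(y - y_hat)
```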

Given the assumptions that the underlying model is $Y=f(X) + \epsilon$ where $E(\epsilon)=0$ and $Var(\epsilon)=\sigma_{\epsilon}^{2}$, then we can come up with a derivation of the squared-error loss for $\hat{f}(X)$~\cite{Hastie2008}. The error for a point $X=x_0$ is:\\

\begin{tabular}{l c l}
$Error(x_0)$ & = & $E\left[ \left( Y - \hat{f}(x_0) \right)^{2} \vert X=x_0 \right]$\\[4ex]
			 & = & $\sigma^{2}_{\epsilon} + \left( E[\hat{f}(x_0)] - f(x_0) \right)^2 $ \\[4ex]
			 &   & $ + E\left[ \left( \hat{f}(x_0) - E[\hat{f}(x_0)] \right)^2 \right]$ \\[4ex]
			 & = & $\sigma^{2}_{\epsilon} + Bias^{2}(\hat{f}(x_0)) + Var(\hat{f}(x_0))$\\[4ex]
			 & = & $\underbrace{Irreducible~Error}_{1^{st}~Term} + \underbrace{Bias^2}_{2^{nd}~Term}$\\[4ex]
			 &   & $ + \underbrace{Variance}_{3^{rd}~Term}$ \\
			 		& & \\[1ex]
\end{tabular}

In the above derivation, the explanations of the $1^{st}$, $2^{nd}$ and $3^{rd}$ terms are as follows:
\begin{itemize}
\item The $1^{st} Term$ is the so-called \textit{``irreducible error''}, i.e. the variance of the actual model around its true mean.
This variance is inevitable regardless of how well we model $f(x_0)$; the only exception is when the actual variance is zero ($\sigma^{2}_{\epsilon} = 0$).
\item The $2^{nd} Term$ is the square of the bias, which is the measure of how different the model estimates are from the \textit{true} mean of the underlying model.
\item The $3^{rd} Term$ is the variance of the estimated model. It is the expectation of the squared deviation of the estimated model from its own mean.
\end{itemize}
Furthermore, the above derivation is for an individual instance.
The $B\&V$ values associated with an algorithm $\hat{f}(X)$ are the means of all the individual values.
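In practice, the per-instance $B\&V$ values can be estimated from the repeated predictions an SM generates for the same test instance. The sketch below is a minimal Python illustration under our own naming (it assumes the actual value stands in for the unknown true mean $f(x_0)$):

```python
import numpy as np

def bias_variance(preds, actual):
    """Estimate squared bias and variance for one test instance.

    preds  : predictions for this instance from repeated runs
    actual : the instance's true effort value (proxy for f(x0))
    """
    preds = np.asarray(preds, dtype=float)
    bias_sq = (preds.mean() - actual) ** 2           # squared bias
    variance = ((preds - preds.mean()) ** 2).mean()  # variance of the estimates
    return bias_sq, variance
```

The overall $B\&V$ values of an algorithm are then the means of these per-instance values.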

An important question associated with SMs is how $B\&V$ relate to different choices of the training size (\textit{K}).
In this study we consider three different SMs: LOO, 3Way and 10Way.
These SMs are selected due to their frequent use in SEE (see \fig{study-sampling-method}).
The brief descriptions of SMs applied on a dataset of size $N$ are:
\bi
\item LOO
\bi
\item Take one instance at a time as the test set
\item Build the learner on the remaining $N-1$ instances (training set)
\item Use the model to estimate for the test set
\ei
\item 3Way
\bi
\item Randomize the order of the rows in the data
\item Divide the dataset into 3 subsets of size close or equal to $N/3$
\item Use each subset as the test set and the remaining subsets as the training set
\item Repeat the above procedure 10 times
\ei
\item 10Way
\bi
\item Randomize the order of the rows in the data
\item Divide the dataset into 10 subsets of size close or equal to $N/10$
\item Use each subset as the test set and the remaining subsets as the training set
\item Repeat the above procedure 10 times
\ei
\ei
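The SM procedures above can be sketched as train/test split generators; this is a minimal Python illustration under our own naming (the paper's experiments were coded in MATLAB):

```python
import random

def loo_splits(n):
    """Leave-one-out: each instance is the test set exactly once."""
    for i in range(n):
        test = [i]
        train = [j for j in range(n) if j != i]
        yield train, test

def kway_splits(n, k, repeats=10, seed=1):
    """k-way cross-validation (3Way: k=3, 10Way: k=10), repeated."""
    rng = random.Random(seed)          # fixed seed only for this sketch
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)             # randomize row order
        folds = [order[i::k] for i in range(k)]  # k subsets of ~n/k rows
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test
```

Note that LOO needs no random number generator, which is the source of its exact reproducibility discussed earlier.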

\textit{``Theoretically''}, when the training size (\textit{K}) is close to the dataset size (\textit{K=N-1}, as in LOO), we expect the SM to be approximately unbiased and to have high variance, because the $N$ training sets are very similar to one another and $N$ models are used for estimation.
On the other hand, for small values of \textit{K}, say \textit{K=2N/3} as in 3Way, we expect low variance and high bias~\cite{Hastie2008}.
For an SM with training sizes between those of LOO and 3Way, say \textit{K=9N/10} as in 10Way, the expected $B\&V$ values are between those of LOO and 3Way.
Naively put, the \textit{``expected''} relationship is:
\bi
\item LOO~~~~~~~: High variance, low bias
\item 3Way~~~~~~: Low variance, high bias
\item 10Way~~~~~: Values between LOO and 3Way
\ei

In an ideal case, when we plot the $B\&V$ values of each individual test instance on the x and y axes, respectively, we expect 3 clusters:
\bi
\item Upper left~~~: Low bias, high variance; i.e. LOO results.
\item Lower right~: High bias, low variance; i.e. 3Way results.
\item Center~~~~~~~: In-between $B\&V$; i.e. 10Way results.
\ei

A very \textit{simple} but \textit{ideal} case would look like \fig{ideal-simulation}. 
\fig{ideal-simulation} is only demonstrative; its $B\&V$ values are randomly generated.
%In that figure, 30 hypothetical algorithms subject to both LOO and 3-Way are represented.

\begin{figure}[b!]
\begin{center}
\includegraphics[width=0.4\textwidth]{lib/IdealSimulation.pdf}
\end{center}
\caption{A simple simulation for the \textit{``expected''} case of $B\&V$ relation to testing strategies.}
\label{fig:ideal-simulation}
\end{figure}

\section{Related Work}

\subsection{Bias-Variance Trade-Off}

The SMs and the $B\&V$ trade-off are critical to assess different learners~\cite{Domingos2000,seni10,Hastie2008,Rodriguez2010}.
The issue has been both theoretically~\cite{Domingos2000, seni10, Hastie2008} as well as practically investigated~\cite{Mullin2000, Rodriguez2010, Braga2004, Isaksson2008}.
It is our understanding that the theoretically \textit{expected} and the \textit{actual} behaviors of different SMs may or may not overlap.
Depending on the dataset size and type, the actual $B\&V$ values may differ from the \textit{expected} ones.

The size and type of effort estimation datasets are quite distinctive: they are noisy and limited in size~\cite{Lum2008, Shepperd1996}.
Kitchenham and Mendes have discussed the effects of SMs on $B\&V$ in the context of cross-within-company data~\cite{Kitchenham2007} and suggested that LOO biases positively towards within-company data.
However, their discussion is based on their extensive expert knowledge of the area and does not include experimentation.
Aside from Kitchenham et al.'s study~\cite{Kitchenham2007}, there is no other SEE study investigating the $B\&V$ trade-off associated with SMs.

In \fig{dataset-paper} we list the studies that use one or more of the datasets used in our work; then, in \fig{study-sampling-method}, we group all these studies according to their adopted SM.
Note that we did not include the studies of \fig{dataset-paper} in \fig{study-sampling-method} if no explicit statement of the adopted SM was made.
\fig{study-sampling-method} shows that LOO is the most popular SM, followed by ad-hoc methods, then by 10Way and then by 3Way.
A wide range of different SMs are adopted in SEE and most of the time without a justification.
Therefore, an empirical investigation of SMs is critical and urgent; hence, the rest of this paper.

\begin{figure}[t!]
\centering
{\scriptsize  
\begin{tabular}{l|l}  
 \textbf{Dataset} & \textbf{Used by} \\\hline 
  telecom & \cite{keung2008c,shepperd97}\\
    kemerer & \cite{keung2008c,shepperd97,Finnie1997}  \\
   cocomo81o &\cite{Menzies2006,Lum2008,Kocaguneli2010} \\
  desharnaisL1 & \cite{Kocaguneli2010} \\ 
    cocomo81s &\cite{Menzies2006,Lum2008,Kocaguneli2010} \\ 
  desharnaisL3 & \cite{Kocaguneli2010} \\
   albrecht & \cite{keung2008c,Li2009,Li2009a,shepperd97,Shepperd1996,Finnie1997} \\ 
   cocomo81e & \cite{Menzies2006,Bakir2009,Kocaguneli2010} \\ 
  nasa93\_center\_5 &\cite{Menzies2006,Lum2008,Kocaguneli2010} \\ 
  desharnaisL2 & \cite{Kocaguneli2010}   \\
  desharnais & \cite{Keung2008,keung2008c,keung2008b,Kocaguneli2010,shepperd97,Li2008,Kadoda2000,Kirsopp2002,Li2009,Li2009a}\\ 
  maxwell & \cite{Li2009a,Sentas2005} \\
  sdr &  \cite{Kultur2008,Turhan2007} \\ 
  nasa93\_center\_1 & \cite{Menzies2006,Lum2008,Kocaguneli2010}  \\
  miyazaki94 &\cite{Miyazaki1994} \\
  nasa93\_center\_2 & \cite{Menzies2006,Lum2008,Kocaguneli2010} \\ 
  finnish &\cite{Briand1999,shepperd97} \\
  cocomo81 &\cite{Menzies2006,Lum2008,Kocaguneli2010,Boehm1981} \\
  nasa93 &\cite{Menzies2006,Lum2008,Kocaguneli2010} \\ 
  china & this study\\
\end{tabular}}
\caption{A sample of effort estimation papers that use the
data sets explored in this paper.}\label{fig:dataset-paper}
\end{figure}

\begin{figure}[t!]
\centering
{\small  
\begin{tabular}{l|l}  
\textbf{Method} 	&	\textbf{Used by}	\\\hline\hline
\multirow{3}{*}{LOO} 	&	\cite{Keung2008, Li2008, keung2008c}	\\
	&	\cite{keung2008b, Kocaguneli2010, Li2009a}	\\
	&	\cite{Kocaguneli2011, shepperd97}	\\\hline
\multirow{2}{*}{Others (ad-hoc, 6-Way etc.)} 	&	\cite{Briand1999, Kultur2008, Li2009}	\\
	&	\cite{Menzies2006, Sentas2005, Shepperd1996}	\\\hline
	\multirow{2}{*}{10-Way} 	&	\cite{Bakir2009, Lum2008, Kocaguneli2011}	\\
	&	\cite{Turhan2007}	\\\hline
\multirow{1}{*}{3-Way} 	&	\cite{Kocaguneli2011}	\\\hline
\end{tabular}}
\caption{Distribution of the studies in \fig{dataset-paper} w.r.t. their SM. Majority of the studies use LOO. LOO is followed by ad-hoc methods, 10-Way then 3-Way.}\label{fig:study-sampling-method}
\end{figure}

\subsection{Effort Estimation Methods}
Effort estimation is the activity of predicting the amount of effort required to complete a software development project~\cite{Keung2008}.
Estimation activities are carried out through:
\bi
\item algorithmic methods
\item non-algorithmic methods
\ei

\textit{Algorithmic methods} learn a model from historical data and pass new projects through that model to generate their estimates.
The number of proposed algorithmic methods and associated variants easily exceeds tens of thousands.
Figure 3 of~\cite{Keung2011} shows that for analogy-based effort estimation (which is just one branch of algorithmic methods), there are more than $6000$ likely combinations.
Some other examples of algorithmic methods are various kinds of regression (simple, partial least squares, stepwise, regression trees),
neural networks and instance-based algorithms, to name a few.
In Appendix \ref{sect:methods}, we provide the algorithmic methods used in this study.

\textit{Non-algorithmic methods} utilize the best knowledge of experienced human experts.
Such non-algorithmic methods, a.k.a. expert-based estimation, are defined as a human-intensive approach that is most commonly adopted in practice~\cite{Jorgensen2004}.
In expert-based variants, estimates are produced by domain experts based on their own personal experience.
On one hand, these methods are flexible and intuitive, as they can be applied in a variety of circumstances where other estimation techniques do not work.
For example, when there is no historical data or the requirements of a project are unavailable at the initial stages, expert estimates can provide a rough estimate in a very short period of time.
On the other hand, regardless of the efforts to establish guidelines for expert-based methods~\cite{Jorgensen2004}, there are still many ad-hoc methods used in practice.
Shepperd et al.~\cite{Shepperd1996} do not consider expert-based estimation to be an empirical method, since the means of deriving an estimate are not explicit and therefore neither repeatable nor easily transferable to other staff.
In addition, knowledge relevancy is also a problem, as an expert may not be able to justify estimates for a new application domain.
Lastly, from an experimental point of view, SMs do not make sense for expert estimates, because expert estimates are based on the expert's personal experience rather than different divisions of training/test sets.
Hence, the rest of this paper excludes non-algorithmic methods from the discussion of $B\&V$.






\section{Methodology}

\subsection{Algorithms: Pre-Processors \& Learners}

This study uses \textit{10 different pre-processors $\times$ 9 learners = 90 algorithms}.
The selection is based on two criteria:
\bi
\item Learners and pre-processors must come from SEE literature; e.g.  \cite{Lum2008, Mendes2003, Jorgensen2007, Shepperd1996, Kultur2008, Shepperd1997, chang74, vent93, Boehm2000}.
\item Learners must make different assumptions about the data.
\ei
This second criterion is based on the data-mining theory that different learners are built on different assumptions and hence have different biases~\cite{Kittler1998, Alpaydin1998, Dietterich2000, Ghosh2002}.

We hence used 10 pre-processors:
\bi
\item Three {\em simple preprocessors}: {\bf none, norm, and log};
\item One {\em feature synthesis} method: {\bf PCA};
\item Two {\em feature selection} methods: {\bf SFS} and {\bf SWreg};
\item Four {\em discretization} methods:  Based on equal frequency/width.
\ei
and 9 learners:
\bi
\item Two {\em iterative dichotomizers}: {\bf CART(yes), CART(no)};
\item A {\em neural net}: {\bf NNet};
\item Four {\em regression methods}: {\bf LReg, PCR, PLSR, SWReg};
\item Two {\em instance-based} learners: {\bf ABE0-1NN, ABE0-5NN}.
\ei
Note that ``ABE'' is short for analogy-based effort estimation. ABE0-kNN is a standard
analogy-based estimator with execution steps of:
\bi
\item
{\em Normalization} of data to zero-one interval;
\item
A {\em Euclidean} distance measure; 
\item
Estimates generated using the {\em k nearest neighbors}.
\ei
For detailed descriptions of all these learners, see Appendix~\ref{sect:methods}.
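The three ABE0-kNN steps listed above can be sketched as follows (a Python illustration under our own naming; the study's implementation uses MATLAB):

```python
import numpy as np

def abe0_estimate(train_X, train_y, test_x, k=1):
    """ABE0-kNN sketch: normalize to [0,1], Euclidean distance,
    estimate = mean effort of the k nearest analogies."""
    train_X = np.asarray(train_X, dtype=float)
    test_x = np.asarray(test_x, dtype=float)
    lo, hi = train_X.min(axis=0), train_X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)       # guard against constant columns
    norm_train = (train_X - lo) / span           # zero-one normalization
    norm_test = (test_x - lo) / span
    dists = np.sqrt(((norm_train - norm_test) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]              # indices of k nearest neighbors
    return float(np.mean(np.asarray(train_y, dtype=float)[nearest]))
```

With $k=1$ this corresponds to ABE0-1NN and with $k=5$ to ABE0-5NN.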


\subsection{Experiments}

\textit{Get estimates: } Let $A_i$ ( $i\in\{1,2,...,90\}$ ) be one of the $90$ algorithms and let $D_j$ ( $j\in\{1,2,...,20\}$ ) be one of the $20$ datasets.
Also let $SM_k$ ( $k\in\{1,2,3\}$ ) be one of the 3 SMs.
In this step every $A_i$ is run on every $D_j$ subject to every $SM_k$.
In other words, every $A_i \times D_j \times SM_k$ combination is exhausted, and the actuals as well as the related predictions are stored to be used for the $B\&V$ calculations.

\textit{Calculate $B\&V$ values: } The actuals and predictions coming from the $A_i \times D_j \times SM_k$ runs are used to calculate the $B\&V$.
At the end of this step, we have one array of \textit{individual} $B\&V$ values for every $A_i \times D_j \times SM_k$.
Another interpretation is that for every algorithm-dataset combination ($A_i \times D_j$) we have 3 arrays of $B\&V$ values (one for each $SM_k$).
The \textit{overall} $B\&V$ values associated with every $A_i \times D_j \times SM_k$ are the means of the individual values.


\begin{figure*}[t!]
\centering
{\scriptsize 
%\begin{sideways} 
\begin{tabular}{l|rrl|lrrrrr}  
~\\~\\
 \textbf{} & \textbf{} & \textbf{} & \textbf{}&\multicolumn{6}{c}{\textbf{Historical Effort Data}}\\\cline{5-10} 
 \textbf{Dataset} & \textbf{Features} & \textbf{Size} & \textbf{Description}&\textbf{Units}& \textbf{Min} & \textbf{Median} &\textbf{Mean} & \textbf{Max} & \textbf{Skewness}\\ \hline
  cocomo81 &17&63& NASA projects& months&6&98&683&11400& 4.4\\
  \hspace{4 mm}cocomo81e &17&28& Cocomo81 embedded projects&months&9&354&1153&11400& 3.4\\ 
  \hspace{4 mm}cocomo81o &17&24& Cocomo81 organic projects&months&6&46&60&240& 1.7\\ 
  \hspace{4 mm}cocomo81s &17&11& Cocomo81 semi-detached projects&months&5.9&156&849.65&6400&2.64 \\ 
  nasa93 &17&93& NASA projects&months&8&252&624&8211& 4.2\\ 
  \hspace{4 mm}nasa93\_center\_1 &17 &12 & Nasa93 projects from center 1&months&24 &66 &139.92 &360 &0.86 \\ 
  \hspace{4 mm}nasa93\_center\_2 &17&37& Nasa93 projects from center 2&months&8&82&223&1350& 2.4\\ 
  \hspace{4 mm}nasa93\_center\_5
   &17&40& Nasa93 projects from center 5&months&72&571&1011&8211& 3.4\\ 
  desharnais &12&81& Canadian software projects&hours&546&3647&5046&23940& 2.0\\ 
  \hspace{4 mm}desharnaisL1 &11 &46 & Projects in Desharnais that are developed with Language1 & hours &805 &4035.5 &5738.9 &23940 &2.09 \\ 
  \hspace{4 mm}desharnaisL2 &11 &25 & Projects in Desharnais that are developed with Language2 & hours &1155 &3472 &5116.7 &14973 &1.16 \\ 
  \hspace{4 mm}desharnaisL3 &11 &10 & Projects in Desharnais that are developed with Language3 & hours &546 &1123.5 &1684.5 &5880 &1.86 \\ 
  sdr &22&24& Turkish software projects& months&2&12&32&342& 3.9\\ 
  albrecht &7&24& Projects from IBM&months&1&12&22&105& 2.2\\ 
  finnish &8 &38 &Software projects developed in Finland  &hours &460 &5430 &7678.3 &26670 &0.95 \\
  kemerer &7 &15 &Large business applications  &months &23.2 &130.3 &219.24 &1107.3 &2.76 \\
  maxwell &27 &62 &Projects from  commercial banks in Finland  &hours&583 &5189.5&8223.2&63694&3.26 \\
  miyazaki94 &8 &48 &Japanese software projects developed in COBOL  &months &5.6 &38.1 &87.47 &1586 &6.06 \\
  telecom &3 &18 &Maintenance projects for telecom companies  &months &23.54 &222.53 &284.33 &1115.5 &1.78 \\
  china &18 &499 &Projects from Chinese software companies  &hours &26 &1829 &3921 &54620 &3.92 \\\cline{3-3}
          \multicolumn{2}{c}{~}  & Total: 1198& \multicolumn{7}{c}{~} 
\end{tabular}
%\end{sideways}
}
\caption{The 1198 projects used in this study come from 20 data sets.
Indentation in column one denotes that indented dataset is a subset of another one.}\label{fig:datasets}
\end{figure*}

\textit{Statistical Check on $B\&V$ values: } In this step we check whether the 3 arrays of $B\&V$ values for every $A_i \times D_j$ combination are statistically different from one another (checks are based on the Mann-Whitney test at the $95\%$ confidence level).
This way we can see whether running an algorithm on a single dataset subject to different SMs generates significantly different $B\&V$ values.
Since we have 3 different SMs, for every $A_i \times D_j$ there are 3 different pairs to examine: LOO vs. 3Way, LOO vs. 10Way, and 3Way vs. 10Way.
Once all the $A_i \times D_j$ combinations are done, we also see what percentage of the $90$ algorithms resulted in significantly different $B\&V$ values.
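This check can be sketched with a standard Mann-Whitney U implementation (shown here in Python/SciPy purely as an illustration; the helper name is ours and the paper's checks were run in MATLAB):

```python
from scipy.stats import mannwhitneyu

def statistically_same(a, b, alpha=0.05):
    """True if the two samples are NOT significantly different
    (two-sided Mann-Whitney U test at the 95% confidence level)."""
    _, p = mannwhitneyu(a, b, alternative='two-sided')
    return p >= alpha
```

Applied to each of the three pairs (LOO vs. 3Way, LOO vs. 10Way, 3Way vs. 10Way) of $B\&V$ arrays, this yields the percentages reported in the results.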

\textit{Observe Run Times: } The total execution time of the experiments depends on the particular implementation, i.e. different implementations of the same algorithm will have different run times.
Therefore, we used standard MATLAB functions in this study: all methods except ABE0-1NN and ABE0-5NN, and all pre-processors except the discretizers, come from MATLAB libraries.

The run times are also expected to be greatly affected by the particular SM.
Each SM dictates a different number of times a learner is trained.
The training time of a learner is much greater than the testing time.
Once a learner is trained, the prediction for a particular test instance is nearly instantaneous.
Below are the numbers of training runs required by each SM on a \textit{single} dataset:
\bi
\item \textbf{LOO: } \textit{N trains} where N is the dataset size.
\item \textbf{3Way: } \textit{10 repeats x 3 bins = 30 trains}
\item \textbf{10Way : } \textit{10 repeats x 10 bins = 100 trains}
\ei
For the 20 datasets in this study (a total of 1198 instances), we expect the following numbers of training runs:
\bi
\item \textbf{LOO:} $1198$ trains.
\item \textbf{3Way:} $30$ trains/dataset $\times 20$ datasets $= 600$ trains
\item \textbf{10Way: } $100$ trains/dataset $\times 20$ datasets $= 2000$ trains
\ei
From the above training counts, we expect 3Way to be the fastest SM, followed by LOO and then 10Way.
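These training counts are simple to tally (a Python sketch; the dataset sizes are taken from \fig{datasets} and sum to 1198):

```python
def train_counts(dataset_sizes, repeats=10):
    """Number of learner trainings each SM requires over all datasets."""
    n_datasets = len(dataset_sizes)
    return {
        'LOO':   sum(dataset_sizes),        # one training per instance
        '3Way':  repeats * 3 * n_datasets,  # 10 repeats x 3 bins per dataset
        '10Way': repeats * 10 * n_datasets, # 10 repeats x 10 bins per dataset
    }

# Sizes of the 20 datasets used in this study; they total 1198 instances.
SIZES = [63, 28, 24, 11, 93, 12, 37, 40, 81, 46,
         25, 10, 24, 24, 38, 15, 62, 48, 18, 499]
```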


\subsection{Datasets}

There is at least one study in SEE using one or more of the 20 datasets used in our study (see \fig{dataset-paper}).
Therefore, the results presented here are based on a large corpus and concern a number of previously published SEE studies.
The descriptions of the 20 datasets used in this study are provided in \fig{datasets}.
These datasets are available at \protect\url{http://promisedata.org/data}.

As described in \fig{datasets}, the datasets were collected in different parts of the world:
\bi
\item The desharnais dataset includes Canadian software projects, 
\item cocomo81 and nasa93 include projects developed in the United States,
\item sdr contains projects of various software companies in Turkey~\cite{Bakir2009}.
\ei

Note that three of these data sets (nasa93\_center\_1, nasa93\_center\_2, nasa93\_center\_5)
come from different development centers around the United States. Another three of these
data sets (cocomo81e, cocomo81o, cocomo81s) represent different kinds of projects (embedded, organic and semi-detached respectively) developed by different team sizes and under different constraints~\cite{Boehm1981}.


Note also in \fig{datasets} the skewness of the effort values (up to 6.06):
the datasets are extremely heterogeneous, with as much as 60-fold
variation.
There is also some divergence in the features used to describe the datasets:
\bi
\item 
While the data sets have effort values in common (measured in terms of man-months or man-hours),
no other feature is shared by all data sets.
\item
The cocomo* and nasa* data sets use the features defined by Boehm~\cite{Boehm1981}; e.g.
analyst capability, required software reliability, memory constraints, and use of software tools.
\item
The other data sets use a wide variety of features including the number of entities in the data model, the number of basic logical transactions, the query count and the number of distinct business units serviced.
\ei


\section{Results}
\label{sect:results}

After calculating the $B\&V$ values for the $90$ algorithms on all the datasets, we were unable to observe the behavior of \fig{ideal-simulation}; i.e., we did not observe three clusters at the predicted $B\&V$ zones.
On the contrary, we observed that the $B\&V$ values associated with different SMs were very close to one another.

For example, see in \fig{cocomo81} the mean $B\&V$ values of the $90$ algorithms for the cocomo81 dataset.
Note that different SMs are represented with different symbols, and for every SM there are $90$ symbol occurrences corresponding to the $90$ algorithms.
The $B\&V$ values associated with each SM overlap, instead of forming separate clusters.
Also, the \textit{expected} relatively low and high $B\&V$ values of the SMs (see \fig{ideal-simulation} for the expected low and high values) were not visible either.
Unlike the expected behavior, the actual $B\&V$ values were both high, regardless of the utilized SM.


All the values reported in \fig{cocomo81} are log-transformed.
Also note that the axes are not scaled to a common interval (say 1 to 20), because the differences are so small that scaling the axes would make it impossible to observe the behavior of the $B\&V$ values.

We conducted these experiments on all the datasets and generated the equivalent of \fig{cocomo81} for every dataset.
However, the results are the same:
\bi
\item No \textit{expected} behavior of LOO, 3Way and 10Way;
\item No distinct clusters, i.e. overlapping $B\&V$ values of LOO, 3Way and 10Way.
\ei

As the plots do not provide additional information, and as repeating \fig{cocomo81} for every dataset is not possible due to space constraints, we summarize these $B\&V$ values as quartile charts in \fig{quartile}.
\fig{quartile} shows every dataset in a separate row, which is then divided into 3 sub-rows.
The sub-rows correspond to the 3 different SMs and show the related $B\&V$ quartile charts separately.
In every quartile chart, the median (represented with a dot), the $25^{th}$ percentile (the left end of the horizontal line) and the $75^{th}$ percentile (the right end of the horizontal line) are shown.
Note in \fig{quartile} how the medians as well as the percentiles of different SMs occur on top of one another.
In other words, \fig{quartile} confirms our findings from \fig{cocomo81} at a much larger scale.
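For concreteness, the three statistics behind each quartile chart can be computed with a short Python sketch. The linear-interpolation convention below is our assumption; other tools may compute percentiles slightly differently.

```python
def quartiles(values):
    """Return the (25th percentile, median, 75th percentile) of a sample,
    i.e. the left line-end, the dot and the right line-end of a quartile chart."""
    xs = sorted(values)
    def pct(p):
        # linear interpolation between the two nearest order statistics
        k = (len(xs) - 1) * p
        lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
        return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)
    return pct(0.25), pct(0.50), pct(0.75)
```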

\begin{figure}[t!]
\begin{center}
\includegraphics[width=0.48\textwidth]{lib/cocomo81.pdf}
\end{center}
\caption{$B\&V$ values for cocomo81.}
\label{fig:cocomo81}
\end{figure}

\begin{figure*}[th!]
\begin{center}
\scriptsize
\begin{tabular}{l | l | c| c}
\textbf{Dataset} & \textbf{SM} & \textbf{Bias}     & \textbf{Variance}     \\\hline\hline
\multirow{3}{*}{	cocomo81	} & 3Way & \boxplot{ 70.7 }{ 70.7 }{ 70.8 }{ 70.9 }{ 71.5 } & \boxplot{ 48.1 }{ 62.4 }{ 68.0 }{ 71.3 }{ 72.2 } \\ 
   & 10Way & \boxplot{ 70.7 }{ 70.7 }{ 70.7 }{ 70.9 }{ 71.1 } & \boxplot{ 39.9 }{ 59.6 }{ 67.1 }{ 68.8 }{ 72.4 } \\
   & LOO & \boxplot{ 70.7 }{ 70.7 }{ 70.7 }{ 70.9 }{ 71.1 } & \boxplot{ 30.9 }{ 58.9 }{ 66.4 }{ 68.5 }{ 73.3 } \\\hline 
\multirow{3}{*}{ cocomo81o } & 3Way & \boxplot{ 36.7 }{ 36.7 }{ 36.7 }{ 36.8 }{ 63.9 } & \boxplot{ 15.6 }{ 31.3 }{ 33.7 }{ 35.0 }{ 76.2 } \\ 
   & 10Way & \boxplot{ 36.7 }{ 36.7 }{ 36.7 }{ 36.8 }{ 46.0 } & \boxplot{ 10.9 }{ 31.6 }{ 33.0 }{ 34.3 }{ 46.2 } \\ 
   & LOO & \boxplot{ 36.7 }{ 36.7 }{ 36.7 }{ 36.8 }{ 46.1 } & \boxplot{ 6.4 }{ 31.8 }{ 33.2 }{ 34.2 }{ 46.6 } \\\hline 
\multirow{3}{*}{ cocomo81e } & 3Way & \boxplot{ 73.2 }{ 73.2 }{ 73.3 }{ 73.6 }{ 83.7 } & \boxplot{ 47.9 }{ 64.6 }{ 71.3 }{ 74.2 }{ 93.9 } \\ 
   & 10Way & \boxplot{ 73.2 }{ 73.2 }{ 73.3 }{ 73.5 }{ 75.1 } & \boxplot{ 47.3 }{ 60.5 }{ 67.8 }{ 73.2 }{ 80.3 } \\ 
   & LOO & \boxplot{ 73.2 }{ 73.2 }{ 73.3 }{ 73.5 }{ 75.3 } & \boxplot{ 41.5 }{ 60.8 }{ 69.1 }{ 74.0 }{ 80.9 } \\\hline 
\multirow{3}{*}{ cocomo81s } & 3Way & \boxplot{ 70.6 }{ 70.6 }{ 70.6 }{ 71.0 }{ 73.8 } & \boxplot{ 40.1 }{ 55.6 }{ 56.9 }{ 62.3 }{ 76.5 } \\ 
   & 10Way & \boxplot{ 70.6 }{ 70.7 }{ 70.8 }{ 71.1 }{ 84.1 } & \boxplot{ 38.4 }{ 54.1 }{ 58.8 }{ 64.8 }{ 78.0 } \\ 
   & LOO & \boxplot{ 70.6 }{ 70.7 }{ 71.0 }{ 71.1 }{ 84.8 } & \boxplot{ 38.5 }{ 49.7 }{ 57.0 }{ 63.1 }{ 77.9 } \\\hline 
\multirow{3}{*}{ nasa93 } & 3Way & \boxplot{ 66.2 }{ 66.2 }{ 66.2 }{ 66.3 }{ 66.7 } & \boxplot{ 39.7 }{ 59.8 }{ 64.1 }{ 65.3 }{ 68.8 } \\ 
   & 10Way & \boxplot{ 66.2 }{ 66.2 }{ 66.2 }{ 66.2 }{ 66.7 } & \boxplot{ 32.5 }{ 58.8 }{ 63.8 }{ 64.5 }{ 68.7 } \\ 
   & LOO & \boxplot{ 66.2 }{ 66.2 }{ 66.2 }{ 66.2 }{ 66.7 } & \boxplot{ 22.0 }{ 58.8 }{ 63.1 }{ 64.1 }{ 67.9 } \\\hline 
\multirow{3}{*}{ nasa93\_center\_1 } & 3Way & \boxplot{ 45.4 }{ 45.4 }{ 45.4 }{ 45.8 }{ 56.7 } & \boxplot{ 17.4 }{ 32.0 }{ 41.9 }{ 45.7 }{ 62.1 } \\ 
   & 10Way & \boxplot{ 45.4 }{ 45.4 }{ 45.4 }{ 45.6 }{ 55.6 } & \boxplot{ 15.7 }{ 41.3 }{ 44.9 }{ 45.4 }{ 59.5 } \\ 
   & LOO & \boxplot{ 45.4 }{ 45.4 }{ 45.4 }{ 45.6 }{ 56.3 } & \boxplot{ 14.6 }{ 41.3 }{ 44.7 }{ 45.4 }{ 59.4 } \\\hline 
\multirow{3}{*}{ nasa93\_center\_2 } & 3Way & \boxplot{ 54.5 }{ 54.5 }{ 54.6 }{ 54.7 }{ 63.0 } & \boxplot{ 31.8 }{ 48.3 }{ 50.3 }{ 53.1 }{ 71.4 } \\ 
   & 10Way & \boxplot{ 54.5 }{ 54.5 }{ 54.6 }{ 54.7 }{ 67.3 } & \boxplot{ 25.2 }{ 48.9 }{ 52.1 }{ 53.1 }{ 74.4 } \\ 
   & LOO & \boxplot{ 54.5 }{ 54.5 }{ 54.6 }{ 54.7 }{ 68.7 } & \boxplot{ 19.8 }{ 48.8 }{ 51.3 }{ 53.0 }{ 75.1 } \\\hline 
\multirow{3}{*}{ nasa93\_center\_5 } & 3Way & \boxplot{ 68.5 }{ 68.5 }{ 68.6 }{ 68.6 }{ 77.8 } & \boxplot{ 44.2 }{ 60.6 }{ 65.8 }{ 67.0 }{ 88.7 } \\ 
   & 10Way & \boxplot{ 68.5 }{ 68.5 }{ 68.5 }{ 68.6 }{ 72.5 } & \boxplot{ 40.3 }{ 59.6 }{ 65.3 }{ 66.3 }{ 80.5 } \\ 
   & LOO & \boxplot{ 68.5 }{ 68.5 }{ 68.5 }{ 68.6 }{ 72.2 } & \boxplot{ 33.3 }{ 60.4 }{ 65.2 }{ 65.9 }{ 80.5 } \\\hline 
\multirow{3}{*}{ desharnais } & 3Way & \boxplot{ 79.2 }{ 79.2 }{ 79.2 }{ 79.2 }{ 80.3 } & \boxplot{ 54.4 }{ 75.2 }{ 76.5 }{ 77.3 }{ 79.0 } \\ 
   & 10Way & \boxplot{ 79.2 }{ 79.2 }{ 79.2 }{ 79.2 }{ 80.4 } & \boxplot{ 46.1 }{ 75.3 }{ 76.6 }{ 77.8 }{ 79.5 } \\ 
   & LOO & \boxplot{ 79.2 }{ 79.2 }{ 79.2 }{ 79.2 }{ 80.4 } & \boxplot{ 35.8 }{ 75.4 }{ 76.6 }{ 77.4 }{ 80.0 } \\\hline 
\multirow{3}{*}{ desharnaisL1 } & 3Way & \boxplot{ 79.9 }{ 79.9 }{ 79.9 }{ 79.9 }{ 81.2 } & \boxplot{ 58.8 }{ 75.6 }{ 77.2 }{ 77.6 }{ 79.1 } \\ 
   & 10Way & \boxplot{ 79.9 }{ 79.9 }{ 79.9 }{ 79.9 }{ 81.4 } & \boxplot{ 50.4 }{ 75.5 }{ 77.2 }{ 78.1 }{ 80.0 } \\ 
   & LOO & \boxplot{ 79.9 }{ 79.9 }{ 79.9 }{ 79.9 }{ 81.5 } & \boxplot{ 42.5 }{ 75.5 }{ 77.0 }{ 78.2 }{ 79.0 } \\\hline 
\multirow{3}{*}{ desharnaisL2 } & 3Way & \boxplot{ 78.0 }{ 78.0 }{ 78.0 }{ 78.1 }{ 79.0 } & \boxplot{ 59.9 }{ 76.0 }{ 76.8 }{ 77.2 }{ 80.2 } \\ 
   & 10Way & \boxplot{ 78.0 }{ 78.0 }{ 78.0 }{ 78.1 }{ 79.0 } & \boxplot{ 52.2 }{ 76.1 }{ 76.9 }{ 77.2 }{ 79.3 } \\ 
   & LOO & \boxplot{ 78.0 }{ 78.0 }{ 78.0 }{ 78.1 }{ 79.1 } & \boxplot{ 46.9 }{ 76.1 }{ 76.9 }{ 77.1 }{ 78.7 } \\\hline 
\multirow{3}{*}{ desharnaisL3 } & 3Way & \boxplot{ 69.2 }{ 69.2 }{ 69.2 }{ 69.5 }{ 80.3 } & \boxplot{ 53.3 }{ 55.1 }{ 66.1 }{ 69.7 }{ 86.3 } \\ 
   & 10Way & \boxplot{ 69.2 }{ 69.2 }{ 69.2 }{ 69.4 }{ 87.4 } & \boxplot{ 47.9 }{ 48.1 }{ 65.6 }{ 69.0 }{ 90.2 } \\ 
   & LOO & \boxplot{ 69.2 }{ 69.2 }{ 69.2 }{ 69.4 }{ 87.4 } & \boxplot{ 47.9 }{ 48.1 }{ 65.6 }{ 69.0 }{ 90.2 } \\\hline 
\multirow{3}{*}{ sdr } & 3Way & \boxplot{ 39.2 }{ 39.2 }{ 39.3 }{ 39.5 }{ 71.5 } & \boxplot{ 13.7 }{ 25.4 }{ 33.1 }{ 39.4 }{ 77.2 } \\ 
   & 10Way & \boxplot{ 39.2 }{ 39.3 }{ 39.4 }{ 39.6 }{ 70.1 } & \boxplot{ 12.9 }{ 25.3 }{ 31.1 }{ 35.9 }{ 72.3 } \\ 
   & LOO & \boxplot{ 39.2 }{ 39.2 }{ 39.3 }{ 39.5 }{ 69.3 } & \boxplot{ 9.0 }{ 23.9 }{ 31.4 }{ 39.1 }{ 76.9 } \\\hline 
\multirow{3}{*}{ albrecht } & 3Way & \boxplot{ 30.7 }{ 30.7 }{ 30.7 }{ 30.7 }{ 31.4 } & \boxplot{ 7.9 }{ 26.5 }{ 28.9 }{ 29.7 }{ 32.0 } \\ 
   & 10Way & \boxplot{ 30.7 }{ 30.7 }{ 30.7 }{ 30.7 }{ 31.4 } & \boxplot{ 4.7 }{ 26.3 }{ 28.8 }{ 29.8 }{ 31.9 } \\ 
   & LOO & \boxplot{ 30.7 }{ 30.7 }{ 30.7 }{ 30.7 }{ 31.3 } & \boxplot{ 0.0 }{ 25.7 }{ 28.9 }{ 29.8 }{ 32.6 } \\\hline 
\multirow{3}{*}{ finnish } & 3Way & \boxplot{ 83.7 }{ 83.7 }{ 83.7 }{ 83.8 }{ 83.9 } & \boxplot{ 64.5 }{ 81.1 }{ 81.9 }{ 83.1 }{ 84.4 } \\ 
   & 10Way & \boxplot{ 83.7 }{ 83.7 }{ 83.7 }{ 83.8 }{ 83.9 } & \boxplot{ 54.1 }{ 81.2 }{ 82.0 }{ 83.0 }{ 84.7 } \\ 
   & LOO & \boxplot{ 83.7 }{ 83.7 }{ 83.7 }{ 83.8 }{ 84.0 } & \boxplot{ 48.3 }{ 81.3 }{ 82.0 }{ 83.1 }{ 84.9 } \\\hline 
\multirow{3}{*}{ kemerer } & 3Way & \boxplot{ 51.9 }{ 51.9 }{ 52.0 }{ 52.1 }{ 53.8 } & \boxplot{ 34.5 }{ 41.7 }{ 46.6 }{ 48.5 }{ 55.8 } \\ 
   & 10Way & \boxplot{ 51.9 }{ 51.9 }{ 51.9 }{ 52.1 }{ 52.8 } & \boxplot{ 27.4 }{ 41.5 }{ 46.0 }{ 48.4 }{ 51.6 } \\ 
   & LOO & \boxplot{ 51.9 }{ 51.9 }{ 51.9 }{ 52.1 }{ 52.9 } & \boxplot{ 26.5 }{ 40.6 }{ 46.1 }{ 48.2 }{ 51.9 } \\\hline 
\multirow{3}{*}{ maxwell } & 3Way & \boxplot{ 87.5 }{ 87.5 }{ 87.5 }{ 87.6 }{ 89.8 } & \boxplot{ 63.7 }{ 82.4 }{ 86.0 }{ 86.4 }{ 100.0 } \\ 
   & 10Way & \boxplot{ 87.5 }{ 87.5 }{ 87.5 }{ 87.5 }{ 88.0 } & \boxplot{ 57.0 }{ 82.2 }{ 86.0 }{ 86.6 }{ 90.1 } \\ 
   & LOO & \boxplot{ 87.5 }{ 87.5 }{ 87.5 }{ 87.6 }{ 88.0 } & \boxplot{ 47.3 }{ 82.1 }{ 86.0 }{ 86.8 }{ 89.8 } \\\hline 
\multirow{3}{*}{ miyazaki94 } & 3Way & \boxplot{ 50.8 }{ 50.8 }{ 50.8 }{ 50.9 }{ 51.1 } & \boxplot{ 24.4 }{ 34.3 }{ 42.8 }{ 49.1 }{ 53.8 } \\ 
   & 10Way & \boxplot{ 50.8 }{ 50.8 }{ 50.8 }{ 50.9 }{ 51.1 } & \boxplot{ 21.0 }{ 36.0 }{ 40.8 }{ 50.2 }{ 55.0 } \\ 
   & LOO & \boxplot{ 50.8 }{ 50.8 }{ 50.8 }{ 50.9 }{ 51.1 } & \boxplot{ 13.5 }{ 34.2 }{ 40.4 }{ 47.7 }{ 53.8 } \\\hline 
\multirow{3}{*}{ telecom } & 3Way & \boxplot{ 52.0 }{ 52.0 }{ 52.0 }{ 52.1 }{ 53.0 } & \boxplot{ 32.9 }{ 48.0 }{ 50.2 }{ 51.1 }{ 53.0 } \\ 
   & 10Way & \boxplot{ 52.0 }{ 52.0 }{ 52.0 }{ 52.0 }{ 53.2 } & \boxplot{ 26.9 }{ 46.6 }{ 49.3 }{ 51.3 }{ 52.0 } \\ 
   & LOO & \boxplot{ 52.0 }{ 52.0 }{ 52.0 }{ 52.0 }{ 53.2 } & \boxplot{ 24.7 }{ 46.5 }{ 49.2 }{ 51.3 }{ 52.0 } \\\hline 
\multirow{3}{*}{ china } & 3Way & \boxplot{ 82.9 }{ 82.9 }{ 82.9 }{ 83.0 }{ 83.4 } & \boxplot{ 48.4 }{ 76.8 }{ 81.6 }{ 82.9 }{ 84.0 } \\ 
   & 10Way & \boxplot{ 82.9 }{ 82.9 }{ 82.9 }{ 83.0 }{ 83.4 } & \boxplot{ 42.6 }{ 76.8 }{ 81.5 }{ 82.8 }{ 83.6 } \\ 
   & LOO & \boxplot{ 82.9 }{ 82.9 }{ 82.9 }{ 83.0 }{ 83.4 } & \boxplot{ 21.2 }{ 76.6 }{ 81.5 }{ 82.8 }{ 83.2 } \\ 
\end{tabular}
\end{center}
\caption{$B\&V$ values in quartiles for all datasets.}
\label{fig:quartile}
\end{figure*}


Another way to investigate the $B\&V$ values associated with different SMs is to check their statistical significance.
\fig{bias-var-table} shows what percentage of the $90$ algorithms had statistically the same bias or variance values under pairs of SMs.
Every cell of \fig{bias-var-table} reports the percentage associated with a comparison between a pair of SMs (LOO vs. 3Way, LOO vs. 10Way, 3Way vs. 10Way).
Note the extremely high percentage values in \fig{bias-var-table}, meaning that for a very high percentage of the algorithms, the differences in $B\&V$ values coming from different SMs are statistically insignificant.


\fig{sortedBiasPlot} and \fig{sortedVarPlot} provide another perspective on \fig{bias-var-table}: the sorted percentages of every SM pair in \fig{bias-var-table}.
See in both figures that the percentage values of 3Way vs. 10Way are the lowest, whereas the percentages associated with LOO vs. 3Way and LOO vs. 10Way are much higher.
In other words, the $B\&V$ values of LOO are close to those of both 3Way and 10Way,
whereas the $B\&V$ values of 3Way and 10Way are more separate from one another.


\begin{figure}[b!]
\begin{center}
{\scriptsize  
\begin{tabular}{l | l | l l | l  l }
dataset &   \multicolumn{3}{c|}{bias} & \multicolumn{2}{c}{variance} \\\hline\hline
 &  \multicolumn{2}{r}{3Way} & 10Way  & 3Way & 10Way \\\cline{3-6} 
\multirow{2}{*}{cocomo81} & LOO & 43 & 82 & 57 & 80 \\
 & 3Way & - & 21 & - & 40 \\\hline
\multirow{2}{*}{cocomo81o} & LOO & 91 & 100 & 76 & 93 \\
 & 3Way & - & 90 & - & 63 \\\hline
\multirow{2}{*}{cocomo81e} & LOO & 68 & 89 & 54 & 78 \\
 & 3Way & - & 36 & - & 19 \\\hline
\multirow{2}{*}{cocomo81s} & LOO & 62 & 87 & 56 & 74 \\
 & 3Way & - & 32 & - & 34 \\\hline
\multirow{2}{*}{nasa93} & LOO & 81 & 90 & 62 & 76 \\
 & 3Way & - & 59 & - & 60 \\\hline
\multirow{2}{*}{nasa93\_center\_1} & LOO & 94 & 94 & 47 & 84 \\
 & 3Way & - & 81 & - & 47 \\\hline
\multirow{2}{*}{nasa93\_center\_2} & LOO & 84 & 96 & 77 & 91 \\
 & 3Way & - & 58 & - & 42 \\\hline
\multirow{2}{*}{nasa93\_center\_5} & LOO & 87 & 97 & 70 & 88 \\
 & 3Way & - & 71 & - & 41 \\\hline
\multirow{2}{*}{desharnais} & LOO & 100 & 100 & 91 & 93 \\
 & 3Way & - & 100 & - & 81 \\\hline
\multirow{2}{*}{desharnaisL1} & LOO & 100 & 100 & 91 & 92 \\
 & 3Way & - & 98 & - & 86 \\\hline
\multirow{2}{*}{desharnaisL2} & LOO & 99 & 100 & 91 & 93 \\
 & 3Way & - & 94 & - & 69 \\\hline
\multirow{2}{*}{desharnaisL3} & LOO & 94 & 100 & 60 & 100 \\
 & 3Way & - & 86 & - & 43 \\\hline
\multirow{2}{*}{sdr} & LOO & 52 & 64 & 29 & 62 \\
 & 3Way & - & 20 & - & 17 \\\hline
\multirow{2}{*}{albrecht} & LOO & 99 & 100 & 79 & 93 \\
 & 3Way & - & 78 & - & 50 \\\hline
\multirow{2}{*}{finnish} & LOO & 100 & 100 & 91 & 92 \\
 & 3Way & - & 100 & - & 84 \\\hline
\multirow{2}{*}{kemerer} & LOO & 92 & 100 & 78 & 86 \\
 & 3Way & - & 82 & - & 58 \\\hline
\multirow{2}{*}{maxwell} & LOO & 94 & 100 & 81 & 89 \\
 & 3Way & - & 82 & - & 64 \\\hline
\multirow{2}{*}{miyazaki94} & LOO & 77 & 93 & 52 & 78 \\
 & 3Way & - & 50 & - & 36 \\\hline
\multirow{2}{*}{telecom} & LOO & 100 & 100 & 91 & 96 \\
 & 3Way & - & 100 & - & 70 \\\hline
 \multirow{2}{*}{china}	&	LOO	&	94	&	99	&	71	&	84	\\
	&	3Way	&	-	&	83	&	-	&	60	\\\hline
\end{tabular}
}
\end{center}
\caption{Percentage of algorithms for which $B\&V$ values coming from different SMs are the same. Note the very high percentage values, meaning that for the majority of the algorithms different SMs generate statistically the same values.}
\label{fig:bias-var-table}
\end{figure}

\begin{figure}[t!]
\begin{center}
\includegraphics[width=0.4\textwidth]{lib/sortedBiasTiePerc.pdf}
\end{center}
\caption{Sorted percentages of bias ties between pairs of LOO, 3Way and 10Way. Actual values are given in \fig{bias-var-table}.}
\label{fig:sortedBiasPlot}
\end{figure}

\begin{figure}[t!]
\begin{center}
\includegraphics[width=0.4\textwidth]{lib/sortedVarTiePerc.pdf}
\end{center}
\caption{Sorted percentages of variance ties between pairs of LOO, 3Way and 10Way. Actual values are given in \fig{bias-var-table}.}
\label{fig:sortedVarPlot}
\end{figure}


In \fig{runtimes-all} we see the run times over all $20$ datasets.
Our expectation that 3Way would be the fastest SM, followed by LOO and then 10Way, holds.
However, the differences are much smaller than expected.
Although there are orders-of-magnitude differences between SMs in terms of the number of training runs, the run time differences are limited to a couple of minutes.
Therefore, run time is not a critical factor in the choice of an SM.

\begin{figure}[!th]
\centering
{ 
\begin{tabular}{l|l}  
\textbf{SM} & \textbf{Run time} \\\hline 
LOO & $9,960$ \\
3Way & $9,360$ \\
10Way & $10,140$ \\
\end{tabular}}
\caption{The run times in seconds. The expected order of SMs from fastest to slowest (3Way, LOO, 10Way) holds; however, the differences are on the order of minutes.}
\label{fig:runtimes-all}
\end{figure}



\section{Threats to Validity}

One obvious threat to the validity of our results is the implementation of the algorithms.
Although we used standard functions from MATLAB libraries, there is still a considerable amount of code into which the standard functions were embedded.
Therefore, run times will differ in other implementations.
However, since all SMs are run on the same code base, the relative ordering of SMs in terms of run time should remain the same.

Another validity threat concerning the run times is the particular machine on which the experiments are run.
As with the implementation, different machines will yield different run times, but the relative ordering of SMs should remain the same.

The choice of SMs in this work is based on the literature using one or more of the datasets used here.
On the other hand, different choices of SMs may also have an effect on the results.
We acknowledge that possibility and leave it for future work.
However, we also note that this study, based on $90$ algorithms and $20$ datasets, is much more extensive than most SEE papers.


\section{Conclusions}

This study is a natural extension of prior work on SEE, and it relates to more than half of the field.
To the best of our knowledge, it is the first empirical investigation of the $B\&V$ trade-off inherent in different SMs.

Our experimentation investigates a very large space of $90$ algorithms and $20$ datasets.
The results present the surprising finding that $B\&V$ values in the SEE domain behave quite differently than expected: different SMs yield statistically the same values.
The similarity of SMs persists in terms of run times:
the biggest run time difference, between 3Way and 10Way, is $780$ seconds ($13$ minutes).


We finish this study with recommendations based on our empirical findings:
\bi
\item Different SMs introduce only a negligible $B\&V$ difference;
\item SMs have similar run times;
\item The main factor to consider when opting for an SM is the reproducibility of experiments.
\ei
LOO is a deterministic procedure that enables \textit{``exact''} reproduction of prior work.
On the other hand, 3Way and 10Way (like all \textit{T}-way methods) are stochastic.
Hence, we recommend LOO over 3Way and 10Way for SEE studies.

\section{Future Work}

The most likely future direction for this work is to reproduce it with new datasets (and possibly with new algorithms) to see if the $B\&V$ behavior persists.

Another future direction is to investigate further why the theoretical and actual behaviors of $B\&V$ differ from one another in SEE datasets.

Finally, the ultimate goal of this initial analysis is to repeat the $B\&V$ analysis reported here with other SMs used in SEE (including the ad-hoc methods) and to reach a consensus on which SMs should be used under which conditions.

\bibliographystyle{abbrv}
\bibliography{library}

\appendix
\section*{A. Methods}
\label{sect:methods}

This study uses $90$ algorithms, which are the product of $10$ pre-processors combined with $9$ learners.
The details of the learners as well as the pre-processors are provided below.

\subsection*{A.1. Ten Pre-processors}

In this study, we investigate:
\bi
\item Three {\em simple preprocessors}: {\bf none, norm, and log};
\item One {\em feature synthesis} method: {\bf PCA};
\item Two {\em feature selection} methods: {\bf SFS} (sequential forward selection) and {\bf SWreg};
\item Four {\em discretization} methods, based on equal frequency or equal width.
\ei
{\bf None} is the simplest preprocessor: all values are left unchanged.

With the {\bf norm} preprocessor,
numeric values are normalized to the
0-1 interval using Equation \ref{equation:normalization}. Normalization ensures
that no variable has a greater influence than any other due to its scale.
\begin{equation}
\small
normalizedValue = \frac{(actualValue - min(allValues))}{(max(allValues) - min(allValues))}
\label{equation:normalization}
\end{equation}
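As a concrete illustration, Equation \ref{equation:normalization} amounts to the following few lines of Python (the guard for constant columns is our own addition, not part of the equation):

```python
def normalize(values):
    """Min-max normalize a column of numbers to the 0-1 interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # constant column: nothing to rescale (our assumption)
    return [(v - lo) / (hi - lo) for v in values]
```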

With the {\bf log} preprocessor, all numerics are replaced with their natural logarithm value. This {\bf log}ging
procedure minimizes the effects of the occasional very large numeric values.

Principal component analysis~\cite{Alpaydin2004}, or
{\bf PCA}, is a {\em feature synthesis} preprocessor that
converts a number of possibly correlated variables into a smaller number of uncorrelated variables called components. The first component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.

Some preprocessors aim at finding a subset of all features according to certain criteria:
{\bf SFS} (sequential forward selection) and {\bf SWR} (stepwise regression).
{\bf SFS} adds features into an initially empty set until no improvement is possible with the addition of another feature. Whenever the selected feature set is enlarged, some oracle is called to assess the value
of that set of features. In this study,
we used the MATLAB \textit{objective} function (which reports the
mean squared error of a simple linear regression on the training set).
One caution: an exhaustive search over all feature subsets can be very time consuming ($2^n$ combinations for an \textit{n}-feature dataset); therefore, SFS works only in the forward direction (no backtracking).
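The greedy forward search described above can be sketched as follows; here \texttt{error\_of} is a hypothetical stand-in for the oracle (in our experiments, MATLAB's mean-squared-error objective):

```python
def sfs(n_features, error_of):
    """Sequential forward selection: grow an initially empty feature set,
    greedily adding whichever feature most lowers the oracle's error,
    and stop when no addition improves it (forward only, no backtracking)."""
    selected, best_err = [], float("inf")
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            return selected
        # ask the oracle about each enlarged feature set, keep the best
        err, f = min((error_of(selected + [f]), f) for f in candidates)
        if err >= best_err:
            return selected  # no improvement: stop
        selected.append(f)
        best_err = err
```

With a toy oracle that rewards features 0 and 2, \texttt{sfs(4, lambda fs: 10 - len(set(fs) \& \{0, 2\}))} returns \texttt{[0, 2]}.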

{\bf SWR} adds and removes features from a multi-linear model.
Addition and removal are controlled by the p-value of an F-statistic. At
each step, the F-statistics for two models (with and without
a given feature) are
calculated.
%Provided that the feature was not in the model, the
%null hypothesis is: ``Feature would have a zero coefficient in the
%model, when it is added''.  If the null hypothesis can be rejected,
%then the feature is added to the model.  As for the other scenario
%(i.e. feature is already in the model), the null hypothesis is:
%``Feature has a zero coefficient''.  If we fail to reject the null
%hypothesis, then the term is removed.  

{\em Discretizers} are pre-processors that map every numeric value in a column of data
into a small number of discrete values:
\bi
\item {\bf width3bin:} This procedure clumps the data features into 3 bins of equal width; see
Equation \ref{equation:binning} with $n=3$.

\begin{equation}\small
binWidth = ceiling\left(\frac{max(allValues) - min(allValues)}{n}\right)
\label{equation:binning}
\end{equation}
\item {\bf width5bin:} Same as {\bf width3bin} except we use 5 bins.
\item {\bf freq3bin:} Generates 3 bins of  equal population size;
\item {\bf freq5bin:} Same as {\bf freq3bin}, only this time we have {\em 5} bins.
\ei
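Both discretizer families can be sketched in a few lines of Python; \texttt{width\_bins} follows Equation \ref{equation:binning}, while the tie and remainder handling in \texttt{freq\_bins} is our assumption, since the text does not specify it:

```python
import math

def width_bins(values, n):
    """Equal-width discretization (width3bin/width5bin): map each value to a
    bin index in 0..n-1 using a ceiling-based bin width."""
    lo, hi = min(values), max(values)
    width = math.ceil((hi - lo) / n) or 1  # guard: width of at least 1
    return [min(int((v - lo) // width), n - 1) for v in values]

def freq_bins(values, n):
    """Equal-frequency discretization (freq3bin/freq5bin): bins of roughly
    equal population, assigned by rank in the sorted order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    per_bin = math.ceil(len(values) / n)
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank // per_bin
    return bins
```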

\subsection*{A.2. Nine Learners}\label{sec:learners}

Based on our reading of the effort estimation literature, we identified nine commonly used learners that divide
into
\bi
\item Two {\em instance-based} learners: {\bf ABE0-1NN, ABE0-5NN};
\item Two {\em iterative dichotomizers}: {\bf CART(yes), CART(no)};
\item A {\em neural net}: {\bf NNet};
\item Four {\em regression methods}: {\bf LReg, PCR, PLSR, SWReg}.
\ei
{\em Instance-based learning} can be used for analogy-based estimation (ABE).
Since it is not practical to experiment with all the ABE variants,
we focus on two standard variants.
ABE0 is our name for
a very basic type of ABE that we derived from
various ABE studies~\cite{Mendes2003, Li2009, Kadoda2000}.
In {\bf ABE0-xNN}, features are first normalized to the 0-1 interval;
then the distance between the test and training instances is measured
with the Euclidean distance function; the \textit{x} nearest neighbors
are chosen from the training set; and finally, to find the estimated value
(a.k.a. the adaptation procedure), the median of the \textit{x} nearest
neighbors is calculated. We explored
two different values of \textit{x}:
\bi
\item {\bf ABE0-1NN:} Only the closest analogy is used. 
Since the median of a single value is itself, the 
estimated value in {\bf ABE0-1NN} is the actual effort value of the closest analogy.
\item {\bf ABE0-5NN:} The 5 closest analogies are used for adaptation.
\ei
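A minimal Python sketch of ABE0-xNN follows; it assumes the features were already normalized to the 0-1 interval by the preprocessor:

```python
import math

def abe0(train_X, train_y, test_x, k):
    """ABE0-kNN: rank training instances by Euclidean distance to the test
    instance, take the k nearest, and return the median of their efforts."""
    ranked = sorted(range(len(train_X)),
                    key=lambda i: math.dist(train_X[i], test_x))
    nearest = sorted(train_y[i] for i in ranked[:k])
    mid = len(nearest) // 2
    if len(nearest) % 2:                         # odd k: middle value
        return nearest[mid]
    return (nearest[mid - 1] + nearest[mid]) / 2  # even k: mean of middle two
```

For $k=1$ this returns the effort of the single closest analogy, exactly as described for {\bf ABE0-1NN}.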
\textit{Iterative Dichotomizers}
seek
the best attribute-value $splitter$, i.e. the one that most simplifies the data that
fall into the different splits.
Each such splitter becomes the root of a tree.
Sub-trees are generated
by
calling iterative dichotomization recursively
on each of the splits.
The CART iterative dichotomizer~\cite{Breiman1984} is defined for continuous target concepts
and its $splitters$ strive to reduce the variance of the data that
falls into
each split.
In this study, we use two variants:
\bi
\item {\bf CART (yes):} This version prunes the generated tree using cross-validation.
For each cross-validation, an internal node is made into a leaf (thus pruning its sub-nodes).
The sub-tree that resulted in the lowest error rate is returned. 
\item {\bf CART (no):} Uses the full tree (no pruning).
\ei
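As an illustration of a single splitter step on one numeric attribute, the sketch below picks the threshold that minimizes the summed squared error (i.e. the variance criterion commonly used for continuous targets); a full CART implementation additionally searches over all attributes and recurses:

```python
def best_split(xs, ys):
    """One splitter step for a continuous target: choose the threshold on a
    single attribute that minimizes the summed squared error of the splits."""
    def sse(vals):  # sum of squared deviations from the split's mean
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)
    best = None
    for t in sorted(set(xs))[1:]:  # candidate thresholds between observed values
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, t)
    return best[1]
```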

In \textit{Neural Nets}, or {\bf NNet},
an input layer of project details
is connected to zero or more ``hidden'' layers, which then connect
to an output node (the effort prediction). The connections are weighted.
If the signal arriving at a node sums to more than some
threshold, the node ``fires'' and its signal is propagated
across the network. Learning in a neural net
compares the output value to the expected value, then applies some
correction method (e.g.
back propagation) to improve the edge weights.
Our {\bf NNet} uses three layers.
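The forward pass of such a net can be sketched as follows (a deliberately minimal sketch with one hidden layer and sigmoid activations; the weights would normally be learned, e.g. by back propagation, and the actual MATLAB network is more involved):

```python
import math

def forward(x, w_hidden, w_out):
    """Forward pass of a tiny 3-layer net: each hidden node sums its weighted
    inputs and 'fires' through a sigmoid; the output node linearly combines
    the hidden activations into a single effort prediction."""
    sigmoid = lambda s: 1.0 / (1.0 + math.exp(-s))
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    return sum(wo * h for wo, h in zip(w_out, hidden))
```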

This study also uses four
\textit{regression methods}.
{\bf LReg} is a simple linear regression algorithm:
given the dependent variable, this learner calculates the coefficient estimates of the independent variables.
{\bf SWreg} is stepwise regression. As a pre-processor, {\bf SWreg} is used to
select features for other learners; here we use {\bf SWreg} as a learner (that is, the predicted
value is a regression result using the features selected by the last step of {\bf SWreg}).
Partial Least Squares Regression ({\bf PLSR}) as well as Principal Components Regression ({\bf PCR})
are algorithms used to model a dependent variable.
While modelling the dependent variable, they both construct new independent variables as linear combinations of the original independent variables.
However, the ways they construct the new independent variables differ.
{\bf PCR} generates new independent variables to explain the observed variability in the original ones;
the dependent variable is not considered at all while the new variables are generated.
In that respect, {\bf PCR} is similar to selecting \textit{n}-many components via {\bf PCA} (the default number of components to select is 2 in the MATLAB implementation, which is what we used) and then applying linear regression.
{\bf PLSR}, on the other hand,
considers the dependent variable and picks the \textit{n}-many new components (again with a default value of 2) that yield the lowest error rate.
Due to this property, {\bf PLSR} usually results in a better fit.

\end{document}
