%\documentclass{sig-alternate}
\documentclass{svjour3}                     % onecolumn (standard format)
\usepackage{times}
\usepackage{cite}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{multirow}
\usepackage{rotating}


%\usepackage[none]{hyphenat} 
\newenvironment{smallitem}
 {\setlength{\topsep}{0pt}
  \setlength{\partopsep}{0pt}
  \setlength{\parskip}{0pt}
  \begin{itemize}
  \setlength{\leftmargin}{.2in}
  \setlength{\parsep}{0pt}
  \setlength{\parskip}{0pt}
  \setlength{\itemsep}{0pt}}
 {\end{itemize}}

\newenvironment{smallenum}
 {\setlength{\topsep}{0pt}
  \setlength{\partopsep}{0pt}
  \setlength{\parskip}{0pt}
  \begin{enumerate}
  \setlength{\leftmargin}{.2in}
   \setlength{\parsep}{0pt}
  \setlength{\parskip}{0pt}
  \setlength{\itemsep}{0pt}}
 {\end{enumerate}}


\usepackage[table]{xcolor}
\usepackage{url,graphicx}
\newcommand{\G}{\cellcolor[rgb]{0.8,0.8,0.8}}
\newcommand{\fig}[1]{Figure~\ref{fig:#1}}
\newcommand{\eq}[1]{Equation~\ref{eq:#1}}
\newcommand{\hyp}[1]{Hypothesis~\ref{hyp:#1}}
\newcommand{\tion}[1]{\S\ref{sec:#1}}

\newcommand{\bi}{\begin{smallitem}}
\newcommand{\ei}{\end{smallitem}}
\newcommand{\be}{\begin{smallenum}}
\newcommand{\ee}{\end{smallenum}}
\newcommand{\bd}{\begin{description}}
\newcommand{\ed}{\end{description}}


\begin{document}

\title{Bias and Variance in Software Effort Estimation}
\subtitle{An Investigation of Bias-Variance Trade-Off Subject to Different Testing Strategies}

\maketitle

A typical dataset consists of a matrix $X$ and a vector $Y$.
The input variables (a.k.a. features) are stored in $X$, where each row corresponds to an observation and each column corresponds to a particular variable.
The dependent variable is stored in the vector $Y$, which holds one response value for each observation in $X$.
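The $(X, Y)$ layout described above can be illustrated with a small sketch; the values and feature names below are purely hypothetical and do not come from any of the paper's datasets.

```python
# A toy dataset in the (X, Y) layout described above: each row of X is an
# observation, each column a feature; Y holds one response per observation.
X = [
    [120, 3, 1],   # hypothetical features, e.g. size, team size, complexity
    [300, 5, 2],
    [ 80, 2, 1],
]
Y = [4.5, 11.0, 3.2]   # one response value (e.g. effort) per row of X

assert len(X) == len(Y)  # every observation has a response
```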

Now assume that a prediction model represented by $\hat{f}(x)$ has been learned from a training dataset $\tau$.
To measure the errors between the actual values in $Y$ and the predictions given by $\hat{f}(x)$, we can use an error function represented by $L(Y,\hat{f}(x))$.
Two common error functions are the squared loss (given in Equation \ref{equ_loss_squared}) and the absolute loss (given in Equation \ref{equ_loss_abs}).


\begin{equation}
L(Y,\hat{f}(x)) = \left( Y-\hat{f}(x) \right)^2
\label{equ_loss_squared}
\end{equation}

\begin{equation}
L(Y,\hat{f}(x)) =  |Y-\hat{f}(x)|
\label{equ_loss_abs}
\end{equation}
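The two loss functions above are straightforward to express in code; a minimal sketch (with hypothetical values) is:

```python
def squared_loss(y, y_hat):
    """Squared loss: (y - y_hat)^2, as in Equation (1)."""
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    """Absolute loss: |y - y_hat|, as in Equation (2)."""
    return abs(y - y_hat)

# Hypothetical actual value y = 10 and prediction y_hat = 7:
print(squared_loss(10.0, 7.0))   # 9.0
print(absolute_loss(10.0, 7.0))  # 3.0
```

Note that the squared loss penalizes large errors much more heavily than the absolute loss, which is why the two can rank prediction models differently.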

Given the assumptions that the underlying model is $Y=f(X) + \epsilon$, where $E(\epsilon)=0$ and $Var(\epsilon)=\sigma_{\epsilon}^{2}$, we can derive a decomposition of the squared-error loss for $\hat{f}(X)$~\cite{Hastie2003}. The error at a point $X=x_0$ is:\\

\begin{tabular}{l c l}
$Error(x_0)$ & = & $E\left[ \left( Y - \hat{f}(x_0) \right)^{2} \vert X=x_0 \right]$\\[2ex]
			 & = & $\sigma^{2}_{\epsilon} + \left( E[\hat{f}(x_0)] - f(x_0) \right)^2 + 
			 			E\left[ \left( \hat{f}(x_0) - E[\hat{f}(x_0)] \right)^2 \right]$ \\[1ex]
			 & = & $\sigma^{2}_{\epsilon} + Bias^{2}(\hat{f}(x_0)) + Var(\hat{f}(x_0))$\\[1ex]
			 & = & $\underbrace{Irreducible~Error}_{1^{st}~Term} + \underbrace{Bias^2}_{2^{nd}~Term} + 
			 		\underbrace{Variance}_{3^{rd}~Term}$ \\
			 		& & \\[1ex]
\end{tabular}

In the above derivation, the explanations of the $1^{st}$, $2^{nd}$ and $3^{rd}$ terms are as follows:
\begin{itemize}
\item The $1^{st}~Term$ is the so-called \textit{``irreducible error''}, i.e. the variance of the actual model around its true mean.
This variance is inevitable no matter how well we model $f(x_0)$; the only exception is when the actual variance is zero ($\sigma^{2}_{\epsilon} = 0$).
\item The $2^{nd}~Term$ is the square of the bias, which measures how different the model estimates are from the \textit{true} mean of the underlying model.
\item The $3^{rd}~Term$ is the variance of the estimated model, i.e. the expectation of the squared deviation of the estimated model from its own mean.
\end{itemize}
Furthermore, the above derivation applies to an individual instance.
The bias and variance values associated with an algorithm $\hat{f}(X)$ are the means of all the individual values.
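The decomposition above can be estimated empirically by repeatedly drawing training sets and recording the predictions at $x_0$. The following is a minimal Monte Carlo sketch under assumed, illustrative choices (a linear true model $f(x)=2x$ and a deliberately naive mean-predictor learner); none of these choices come from the paper.

```python
import random

random.seed(1)

def f(x):
    """Assumed true model f(x); purely illustrative."""
    return 2.0 * x

def mean_learner(train):
    """A deliberately naive learner: always predict the mean response."""
    return sum(y for _, y in train) / len(train)

x0, sigma, runs = 1.0, 0.5, 2000
preds = []
for _ in range(runs):                      # draw a fresh training set tau each run
    xs = [random.uniform(0.0, 2.0) for _ in range(30)]
    train = [(x, f(x) + random.gauss(0.0, sigma)) for x in xs]
    preds.append(mean_learner(train))      # f_hat(x0); constant for this learner

mean_pred = sum(preds) / runs
bias_sq = (mean_pred - f(x0)) ** 2                            # 2nd term
variance = sum((p - mean_pred) ** 2 for p in preds) / runs    # 3rd term
print(bias_sq, variance)
```

With $x_0 = 1.0$ the mean-predictor happens to be nearly unbiased (the average response over $[0,2]$ equals $f(1)$), so the squared bias is close to zero while the variance reflects the learner's sensitivity to the training sample.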

The question then becomes how the bias and variance (from now on $B\&V$) relate to different choices of the training size (\textit{K}), i.e. to the cross-validation (CV) method.
Here we consider two cases of CV: leave-one-out (LOO) and 3-Way.
Ideally, when the training size is equal to the dataset size (\textit{K=N}), we expect CV to be approximately unbiased but to have high variance, because the \textit{N} training sets are very similar to one another.
On the other hand, for small values of \textit{K}, say \textit{K=N/3} as in 3-Way, we expect lower variance and higher bias~\cite{Hastie2003}.
Naively put, the relationship is:
\bi
\item LOO	: Higher variance, lower bias
\item 3-Way	: Lower variance, higher bias
\ei
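The two testing strategies can be sketched in code. This is a toy illustration, not the paper's experimental setup: the data, the mean-predictor learner, and the choice to train 3-Way on one third and test on the remaining two thirds (matching the \textit{K=N/3} training size above) are all assumptions.

```python
import random

random.seed(2)

# Toy data from a hypothetical linear process (not an effort dataset).
data = [(x, 2.0 * x + random.gauss(0.0, 0.5)) for x in range(12)]

def mean_model(train):
    """A deliberately naive learner: always predict the mean response."""
    return sum(y for _, y in train) / len(train)

def loo_mse(data):
    """LOO: train on N-1 points (training size ~ N), test on the one held out."""
    errs = []
    for i, (_, y) in enumerate(data):
        m = mean_model(data[:i] + data[i + 1:])
        errs.append((y - m) ** 2)
    return sum(errs) / len(errs)

def three_way_mse(data):
    """3-Way as assumed here: train on one third (training size N/3),
    test on the other two thirds; repeated for each third."""
    errs = []
    for i in range(3):
        m = mean_model(data[i::3])
        errs += [(y - m) ** 2 for j, (_, y) in enumerate(data) if j % 3 != i]
    return sum(errs) / len(errs)

print(loo_mse(data), three_way_mse(data))
```

Running the same comparison over many resampled datasets would expose the expected trade-off: the LOO error estimates fluctuate more from dataset to dataset (higher variance), while the 3-Way estimates are more stable but systematically pessimistic (higher bias).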

In an ideal case, when we plot the $B\&V$ values of each individual test instance (bias on the x-axis, variance on the y-axis), we expect two clusters:
\bi
\item Upper left: low bias, high variance; i.e. the LOO results.
\item Lower right: high bias, low variance; i.e. the 3-Way results.
\ei

For the sake of clarity, a very \textit{simple} but \textit{ideal} case is shown in \fig{ideal-simulation}, which represents 30 hypothetical algorithms subjected to both LOO and 3-Way.

\begin{figure}[th!]
\begin{center}
\includegraphics[width=0.6\textwidth]{lib/IdealSimulation.pdf}
\end{center}
\caption{A simple simulation for the ideal case of $B\&V$ relation to testing strategies.}
\label{fig:ideal-simulation}
\end{figure}

When we calculated the $B\&V$ values for $90$ algorithms (the algorithms of the Comba paper) on various datasets, we were unable to observe the behavior of \fig{ideal-simulation}; i.e., we did not observe two distinct clusters in the predicted $B\&V$ zones.
On the contrary, we observed that the $B\&V$ values for LOO and 3-Way are close to one another, so the two clusters mostly overlap.
Nor were the \textit{ideal} or \textit{predicted} low and high $B\&V$ values visible: the actual $B\&V$ values were both high, regardless of the testing strategy.
\fig{nasa93}, \fig{cocomo81} and \fig{desharnais} show the $B\&V$ plots of the $90$ algorithms (i.e. $90$ circles for 3-Way and $90$ triangles for LOO) for the Nasa93, Cocomo81 and Desharnais datasets.
All the values reported in these figures are log-transformed.
Also note that the axes in these figures are not scaled, because the differences are so small that scaling the axes makes the $B\&V$ behavior difficult to observe.
These figures show how the \textit{ideal} behavior of $B\&V$ differs from the \textit{actual} case for software effort datasets.
We have conducted these experiments on many more datasets and the results are essentially the same: 1) no ideal behavior for 3-Way and LOO; 2) the 3-Way and LOO $B\&V$ values overlap.

\begin{figure}[th!]
\begin{center}
\includegraphics[width=0.6\textwidth]{lib/nasa93.pdf}
\end{center}
\caption{$B\&V$ values for Nasa93.}
\label{fig:nasa93}
\end{figure}


\begin{figure}[th!]
\begin{center}
\includegraphics[width=0.6\textwidth]{lib/cocomo81.pdf}
\end{center}
\caption{$B\&V$ values for Cocomo81.}
\label{fig:cocomo81}
\end{figure}

\begin{figure}[th!]
\begin{center}
\includegraphics[width=0.6\textwidth]{lib/desharnais.pdf}
\end{center}
\caption{$B\&V$ values for Desharnais.}
\label{fig:desharnais}
\end{figure}

\bibliographystyle{abbrv}
\bibliography{references} 
\end{document}