\section{Introduction}

Choosing the most appropriate software development effort predictor for local software projects remains elusive for many project managers and researchers. For decades, researchers have searched for the ``best'' software effort predictor.
At the time of writing, no commonly agreed ``best'' predictor has been found that consistently provides the most accurate estimates.
The usual conclusion is that effort estimation suffers from a {\em ranking instability}
syndrome; i.e., different researchers offer conflicting rankings as to what is 
``best''~\cite{shepperd01b,myrtveit05}.
It seems that different sets of best effort predictors exist in different situations, depending on the historical sample datasets available. 

This is an open and pressing
issue, as accurate effort estimation is crucial to software
engineering and is known to be a major challenge for many software
project managers. Both overestimation and underestimation have
unfavorable impacts on business competitiveness and project resource
planning. Conventionally, the single most familiar effort predictor may be applied to all situations; however, this approach may not produce the best effort estimates for different projects.


%The literature reports many  effort estimation methods.
%For example, if we just look at instance-based methods,
%\fig{cbr} lists thousands of  
%algorithms  for instance-based effort predictor. 
%A similiar list for model-based 
%effort estimation would be just as long
%(and include 
%linear regression, regression trees, neural nets, etc.).
%

Being able to compare predictors and determine the best one for different scenarios is critically important to the relevance of the estimates to the problem under investigation. In many cases, software effort estimation research focuses
on the
{\em learner} used to generate the estimate (e.g., linear regression, neural nets, etc.),
overlooking the importance of the quality and characteristics of the {\em data} being used
in the estimation process. We argue that this approach is somewhat misguided since, as shown in this study, learner
performance is greatly influenced by the data preprocessing applied and the datasets used to evaluate the learner.
In general, a preprocessor and a learner together \textit{form} a complete effort estimation method; for example, data normalization as the preprocessor combined with linear regression as the learner.  
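The preprocessor-plus-learner pairing can be sketched as follows. This is a minimal illustration only, not any specific method evaluated in this study; the feature names and project data are invented:

```python
# Sketch: a (preprocessor, learner) pair forming one complete effort
# estimation method -- min-max normalization followed by ordinary
# least squares linear regression on toy project data.
import numpy as np

def normalize(X):
    # Preprocessor: scale each feature column into [0, 1].
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def fit_linear(X, y):
    # Learner: least-squares fit with an intercept term.
    Xb = np.c_[np.ones(len(X)), X]
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    return np.c_[np.ones(len(X)), X] @ w

# Toy data: two project features (e.g., size, team experience) -> effort.
X = np.array([[100., 3.], [200., 5.], [300., 2.], [400., 4.]])
y = np.array([10., 18., 35., 40.])

Xn = normalize(X)       # apply the preprocessor
w = fit_linear(Xn, y)   # train the learner
est = predict(w, Xn)    # produce effort estimates
```

Swapping either component (a different scaling scheme, a different regression model) yields a different complete method, which is why preprocessor and learner must be evaluated as a pair.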
%This paper shows that many datasets used by prior publications are very  limited in number
%to distinguish {\em strong/weak} datasets. 
%All the results based on these {\em weak} datasets
%(including many results by the authors of this papers) must hence be revisited.

Ranking stability in software effort estimation is the primary research focus of this study: being able to correctly characterize each method allows the most suitable predictors to be selected for the estimation process.  
This work is not at an early stage; it builds on the success of a previous study described in Menzies et al.~\cite{menzies11}, in which a large number of predictors were applied to simulated datasets and a precise, stable ranking of all the predictors was derived, even as the random number seeds, evaluation criteria, and subsets of the data were varied. 
The hypothesis of this study is that if a stable ranking conclusion can be derived from simulated data, similar behavior should be observed when applying real, heterogeneous datasets from the public domain, drawn from different sources and varying in project characteristics and evaluation criteria. The main contribution is that this comprehensive study presents a method which can be used to determine the best effort predictors to use in different situations. 


%The good news is that there are {\em strong} datasets that clearly
%illustrate the value of any predictor.  
Method combinations can produce vastly different results. In all, this study
applies 90 predictors (10 learners combined with 9 preprocessors) to 20 datasets and measures their performance
using seven performance criteria.  To the best of
our knowledge, this is the largest effort estimation study
reported in the literature to date. One result of exploring such a large
space of data and algorithms is that we 
are able to report stable conclusions where prior studies could not.
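The predictor space is the cross product of the preprocessors and the learners, which can be sketched as follows (the method names below are placeholders, not the actual techniques evaluated in this study):

```python
# Sketch: enumerating predictors as (preprocessor, learner) pairs.
# Placeholder names stand in for the study's 9 preprocessors and
# 10 learners; the cross product yields the 90 complete methods.
from itertools import product

preprocessors = [f"prep{i}" for i in range(1, 10)]   # 9 preprocessors
learners = [f"learn{j}" for j in range(1, 11)]       # 10 learners

predictors = list(product(preprocessors, learners))
print(len(predictors))  # 90 method combinations
```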

This paper is structured as follows.
Section 2 addresses our research challenge and motivation.
The related work section discusses effort estimation and prior reports on {\em conclusion instability}.
Those reports used a
dataset to {\em seed} the generation of artificial data. Our results section shows that
if we extend the experiments to a broader set of methods and project data, 
we are able to discover stable conclusions, such as
a list of the best (and worst) effort predictors, which was non-trivial in the past.
%; on the other hand, we need more datasets to make a similar claim of \textit{strong/weak} datasets.
%some datasets are {\em weak}; and that the {\em strong} datasets
%show which predictors are consistently better than others.
%Based on those results, our conclusion will list best (and worst)
%  effort  estimators.
