\subsection{Validity}
{\em Construct validity} (i.e., face validity) is the assurance that we are measuring
what we actually intend to measure~\cite{Robson2002}.
Previous studies have examined the construct validity of
different performance measures for effort estimation (e.g., \cite{foss03}).
While, in theory, these performance measures affect the rankings of effort
estimation algorithms, we have found that other factors dominate. In particular,
\fig{dot-plot} showed that features of the data set (whether or not it is ``weak'')
have a major impact on what can be concluded after studying a particular estimator
on a particular data set. We also found, empirically, the surprising result that our
conclusions are stable across a range of performance criteria.
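For concreteness, one such performance criterion that is widely used in this literature is the magnitude of relative error (MRE). Writing $a_i$ for the actual effort of project $i$ and $\hat{a}_i$ for the predicted effort (this notation is ours), MRE is defined as
\[
\mathrm{MRE}_i \;=\; \frac{\left|\, a_i - \hat{a}_i \,\right|}{a_i},
\]
with smaller values indicating better predictions.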



{\em External validity} is the ability to generalize results beyond
the specifications of a study~\cite{Milic2004}. To ensure external
validity, this paper has studied a large number of projects. Our data
sets are diverse, measured in terms of their sources, their domains,
and the times at which they were developed. We use data sets composed
of software development projects from different organizations around
the world to generalize our results~\cite{Bakir2009}. Our reading of
the literature is that this study uses more data, from more sources,
than numerous other papers. For example, Table 4 of~\cite{Mendes2007}
lists the total number of projects used by a sample of other studies.
The median value of that sample is 186; i.e., roughly one-sixth of the
1198 projects used here.

As to the external validity of our choice of algorithms, recall from \fig{cbr}
that this study has not explored the full range of effort estimation algorithms.
Clearly, future work is required to repeat this study using the ``best of breed'' found here
(e.g., bands one and two of \fig{freq_method}) as well as other algorithms.

Having cast doubts on our selection of algorithms, we hasten to add that
this paper has focused on algorithms that have been extensively studied
in the literature~\cite{shepperd97}, as well as on commonly available
data sets (that is, the ones available in the PROMISE repository of
reusable SE data). That is, we assert that these results should apply
to much of the current published literature on effort estimation.



