

\section{Data Used in This Study}

With the exception of ISBSG-Banking and SDR, all the data used in this study is 
available at \protect\url{http://promisedata.org/data} or from the
authors.
As shown in \fig{data}, our data includes:
\bi
\item Data from the International Software Benchmarking Standards Group (ISBSG);
\item The Desharnais and Albrecht data sets;
\item
SDR, which contains data from projects of various software companies in Turkey.
SDR was collected from Softlab, the Bogazici University Software Engineering Research
Laboratory repository~\cite{Bakir2009};
\item
and the standard COCOMO data sets (Cocomo*, Nasa*).
\ei

Projects in the ISBSG data set can be grouped according to their business domains.
Previous studies have also used a breakdown of ISBSG by business domain~\cite{Bakir2008}.
Among the different business domains, we selected banking for two reasons:
\begin{enumerate}
\item[1.] The banking domain includes many projects whose data quality is reported to
be high (ISBSG as a whole contains projects with missing attribute values).
\item[2.] We have analyzed and worked with the ISBSG banking-domain data for a
long time, owing to our hands-on experience in building effort estimation models in the banking industry.
\end{enumerate}
We denote the banking-domain subset of ISBSG as ``ISBSG-Banking''.

Note that two of these data sets (Nasa93c2, Nasa93c5)
come from different development centers around the United States. Another two of these
data sets (Cocomo81e, Cocomo81o) represent different kinds of projects:
\bi
\item The Cocomo81e ``embedded projects''
are those developed within tight constraints (hardware, software, operational, \ldots);
\item
The Cocomo81o ``organic projects'' come from
small, experienced teams working with less-than-rigid requirements.
\ei

Note also, in Figure~\ref{fig:data}, the skewness of our effort values (2.0 to 4.4):
our data sets are extremely heterogeneous, with as much as 40-fold
variation.
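For reference, the skewness values quoted above can be read against the conventional moment-based definition. This is an assumption, since the estimator is not stated here and a bias-corrected variant may have been used. For a data set with effort values $e_1,\ldots,e_n$ and mean $\bar{e}$ (symbols introduced here only for illustration):
\[
g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(e_i-\bar{e}\right)^3}
           {\left(\frac{1}{n}\sum_{i=1}^{n}\left(e_i-\bar{e}\right)^2\right)^{3/2}}
\]
Under this definition a symmetric distribution has $g_1=0$, and large positive values indicate a long right tail, i.e., a few projects whose effort is far above that of the rest.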
There is also some divergence in the features used to describe our data:
\bi
\item 
While every data set includes some effort value (measured in months or hours),
no other feature is shared by all data sets.
\item
The Cocomo* and Nasa* data sets all use the features defined by Boehm~\cite{Boehm1981}, e.g.,
analyst capability, required software reliability, memory constraints, and use of software tools.
\item
The other data sets use a wide variety of features, including the number of entities in the data model, the number of basic logical transactions, the query count, and the number of distinct business units serviced.
\ei

