
With the exception of SDR, all the data used in this study is 
available at \protect\url{http://promisedata.org/data} or from the
authors.
Our data sets are very heterogeneous (observe the almost eight-fold variation in their skewness, from 0.86 to 6.6).
As shown in \fig{datasets}, our data includes
(a) projects from various geographical locations (Canada, China, Finland, Japan, Turkey, USA, etc.);
(b) datasets of various sizes, both in instances (from 11 to 499) and in features (from 3 to 27);
and (c) datasets that differ widely in the features used to describe the software projects.
For example,
the COCOMO* and NASA* data sets all use the features defined by Boehm~\cite{Boehm1981}, e.g.,
analyst capability, required software reliability, memory constraints, and use of software tools.
The other data sets use a wide variety of features, including the number of entities in the data model, the number of basic logical transactions, the query count, and the number of distinct business units serviced.
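
For reference, the skewness quoted above can be read in the conventional moment-based sense; the exact estimator is not stated here, so the following definition is an assumption. For $n$ values $x_1,\ldots,x_n$ with mean $\bar{x}$,
\[
g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^{3/2}} ,
\]
so that $g_1=0$ for a symmetric distribution and $g_1>0$ indicates a long right tail.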

Further details of our data are as follows:
\bi
\item COCOMO81 and NASA93 are standard COCOMO data sets, and the indented datasets starting with COCOMO and NASA (COCOMO*, NASA*) are their subsets. The subsetting criterion for NASA93 is the development center (center\_1, center\_2 and center\_5); for COCOMO81 it is the development mode (embedded, organic and semi-detached).
\item DESHARNAIS contains projects from a Canadian software house, with project size measured in function points. Its subsets contain projects developed in different languages.
\item SDR includes projects from various software companies in Turkey and was collected by Softlab, the Bogazici University Software Engineering Research Laboratory~\cite{Bakir2009}.
\item MIYAZAKI94~\cite{Miyazaki1994} contains projects developed by companies in Japan; it was recently donated to the PROMISE repository and made publicly available.
\item The CHINA dataset is one of the largest publicly available datasets with 499 instances. It includes software projects developed in China by various software companies in multiple business domains.
\ei

%Note that two of these data sets (Nasa93c2, Nasa93c5)
%come from different development centers around the United States. Another two of these
%data sets (Cocomo81e, Cocomo81o) represent different kinds of projects:
%\bi
%\item The Cocomo81e  ``embedded projects''
%are those developed within tight constraints (hardware, software, operational, ...);
%\item
%The Cocomo81o ``organic projects'' come from
%small teams with good experience of working with less than rigid requirements.
%\ei




