Risk, relevance, and defect predictors

At the recent PROMISE'08 conference, Murray Cantor from IBM challenged the empirical software engineering community to express their work at the business level (see http://www.slideshare.net/gregoryg/risk-and-relevance-20080414ppt-400519?src=embed). His specific proposal was to assess data mining methods for SE in terms of their value added to the economics of software development.

We write to accept the Cantor challenge. In this paper we offer a meta-model of the impact of data mining for cost estimation and defect prediction on a software development project. The proposed meta-model is a skeleton document that, we hope, will bind together the diverse data-mining-for-SE community. We publish this meta-model to allow an integration of many diverse research efforts into one coherent whole.

Cantor stresses that when running such a model:

** Little will be known with great precision. In fact, in the usual case, all the variables of this model will be random variables. Hence, the predictions of this model will not be point values but a range of possibilities.

** The model will not be "one size fits all". In fact, the meta-model will contain numerous utility factors that will change from domain to domain.

Given the variance of the random variables in the meta-model, and its dependence on domain-specific utility functions, it is up to future research to determine methods that:

** most constrain the variance of these possible predictions while shifting their expected value to some more desirable point;

** identify cliched values of the utility functions for different domains.

One methodological point before proceeding. In our view, the meta-model is either small or problematic. Much prior work has built elaborate process models of SE projects, but lacked the information to tune those models to local conditions.
Modelers should therefore take care not to over-elaborate their model beyond, say, one or two dozen variables (plus utility functions), lest they generate a model that is problematic to tune.

Having said that, we now move to defining a meta-model for software production. That model expresses the value added of data miners in terms of their impact on the net present value (NPV) of a piece of software. Based on the above, we will build a small model, comprising random variables and utility functions:

NPV = sum(i=1 to n) R[i] / (1 + r)^i
    - sum(j=1 to p) M[j] / (1 + m)^j
    - sum(k=1 to q) D[k] / (1 + d)^k

where "R" is revenue and "r" is the discount rate for revenue, based on time;
where "M" is maintenance cost and "m" is its associated discount rate;
where "D" is development effort and "d" is its associated discount rate;
and R, M, D are random variables.

Observations:

1) Our field has many development effort models (e.g., COCOMO).

2) Our field has many defect predictors (e.g., see the work of Menzies and Nagappan, and 1000 others).

3) Our field does not have revenue predictors <== future work needed.

4) The impact of our defect predictors on the above model is unclear, to say the least. E.g., our predictors are assessed w.r.t. historical defect logs, and we score our learners with a value "S" for their ability to find errors in those historical logs. But those logs only hold X% of the true errors (those found via inspection, etc.). So our assessment of the errors we found is muted by the incompleteness "C" of the log, using "S*C".

Sources to plunder for the above models (a short list that requires much growth):

1) Experiences and results from initiating field defect prediction and product test prioritization efforts at ABB Inc., ICSE 2006. http://portal.acm.org/citation.cfm?id=1134285.1134343

============ BEGIN rough notes (yeah, right, like the above weren't rough enough).

We're going to need some knowledge of standard fault distributions.
Hongyu Zhang, "On the Distribution of Software Faults", IEEE Transactions on Software Engineering, vol. 34, no. 2, March/April 2008, p. 301.

Andrews, ICSE 2008: a minimal set of defect mutation operators.
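Since R, M, and D in the NPV equation above are random variables, the model's output is best reported as a range of possibilities rather than a point value. A minimal sketch of that idea, via Monte Carlo sampling: note that every distribution, discount rate, and time horizon below is a hypothetical placeholder, not calibrated data, and `adjusted_score` merely encodes the S*C discount from observation 4.

```python
import random

def npv_sample(R, M, D, r, m, d):
    """One draw of NPV = sum R[i]/(1+r)^i - sum M[j]/(1+m)^j - sum D[k]/(1+d)^k."""
    revenue = sum(Ri / (1 + r) ** i for i, Ri in enumerate(R, start=1))
    maint = sum(Mj / (1 + m) ** j for j, Mj in enumerate(M, start=1))
    devel = sum(Dk / (1 + d) ** k for k, Dk in enumerate(D, start=1))
    return revenue - maint - devel

def adjusted_score(S, C):
    """Observation 4: a learner's score S, measured against an incomplete
    defect log, is discounted by the log's completeness C, i.e. S*C."""
    return S * C

def monte_carlo(trials=10_000, seed=1):
    """Report NPV as a (5th, 50th, 95th) percentile range, not a point value.
    All distributions and rates below are made-up placeholders."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        R = [rng.gauss(100, 20) for _ in range(5)]  # revenue per period (hypothetical)
        M = [rng.gauss(30, 10) for _ in range(5)]   # maintenance cost (hypothetical)
        D = [rng.gauss(60, 15) for _ in range(2)]   # development effort (hypothetical)
        samples.append(npv_sample(R, M, D, r=0.10, m=0.10, d=0.10))
    samples.sort()
    return tuple(samples[int(trials * p)] for p in (0.05, 0.50, 0.95))
```

In this sketch, research that "most constrains the variance" of the predictions would narrow the gap between the 5th and 95th percentiles while shifting the median toward a more desirable value.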