Risk, relevance, and defect predictors

At the recent PROMISE'08 conference, Murray Cantor from IBM challenged the empirical software engineering community to express their work at the business level (see http://www.slideshare.net/gregoryg/risk-and-relevance-20080414ppt-400519?src=embed). His specific proposal was to assess data mining methods for SE in terms of their value added to the economics of software development.

We write to accept the Cantor challenge. In this paper we offer a meta-model of the impact of data mining for cost estimation and defect prediction on a software development project. The proposed meta-model is a skeleton document that, we hope, will bind together the diverse data-mining-for-SE community. We publish this meta-model to allow an integration of many diverse research efforts into one coherent whole.

Cantor stresses that when running such a model:

** Little will be known with great precision. In fact, in the usual case, all the variables of this model will be random variables. Hence, the predictions of this model will not be point values but a range of possibilities.

** The model will not be "one size fits all". In fact, the meta-model will contain numerous utility factors that will change from domain to domain.

Given the variance of the random variables in the meta-model, and its dependence on domain-specific utility functions, it is up to future research to determine methods that:

** most constrain the variance of these possible predictions while shifting their expected value to some more desirable point;

** identify cliched values of the utility functions for different domains.

One methodological point before proceeding. In our view, the meta-model is either small or problematic. Much prior work has built elaborate process models of SE projects, but lacked the information to tune those models to local conditions.
Modelers should therefore take care not to over-elaborate their model beyond, say, one or two dozen variables (plus utility functions), lest they generate a model that is problematic to tune.

Having said that, we now move to defining a meta-model for software production. That model expresses the value added of data miners in terms of their impact on the net present value (NPV) of a piece of software. Based on the above, we will build a small model, comprising random variables and utility functions:

NPV = sum(i=1 to n) R[i] / (1 + r)^i
    - sum(j=1 to p) M[j] / (1 + m)^j
    - sum(k=1 to q) D[k] / (1 + d)^k

where "R" is revenue and "r" is the discount rate for revenue, based on time;
where "M" is maintenance cost and "m" is its associated discount rate;
where "D" is development effort and "d" is its associated discount rate;
and R, M, D are random variables.

Observations:

1) Our field has many development effort models (e.g., COCOMO).

2) Our field has many defect predictors (e.g., see the work of Menzies and Nagappan, and 1000 others).

3) Our field does not have revenue predictors <== future work needed.

4) The impact of our defect predictors on the above model is unclear, to say the least. E.g., our predictors are assessed w.r.t. historical defect logs, and we score our learners with a value "S" for their ability to find errors in those historical logs. But those logs only hold X% of the true errors (those found via inspection, etc.). So our assessment of the errors we found is muted by the incompleteness "C" of the log, using "S*C".

Sources to plunder for the above models (a short list that requires much growth):

1) Experiences and results from initiating field defect prediction and product test prioritization efforts at ABB Inc., ICSE 2006. http://portal.acm.org/citation.cfm?id=1134285.1134343

============ BEGIN rough notes (yeah, right, like the above weren't rough enough).

We're going to need some knowledge of standard fault distributions.
Hongyu Zhang, "On the Distribution of Software Faults", IEEE Transactions on Software Engineering, vol. 34, no. 2, March/April 2008, p. 301.

Andrews, ICSE 2008: a minimal set of defect mutation operators.
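Since R, M, and D in the NPV equation above are random variables, the model's output is best reported as a range of possibilities rather than a point value. A minimal sketch of that idea, via Monte Carlo sampling: note that every distribution, discount rate, and time horizon below is a hypothetical placeholder, not calibrated data, and `adjusted_score` merely encodes the S*C discount from observation 4.

```python
import random

def npv_sample(R, M, D, r, m, d):
    """One draw of NPV = sum R[i]/(1+r)^i - sum M[j]/(1+m)^j - sum D[k]/(1+d)^k."""
    revenue = sum(Ri / (1 + r) ** i for i, Ri in enumerate(R, start=1))
    maint = sum(Mj / (1 + m) ** j for j, Mj in enumerate(M, start=1))
    devel = sum(Dk / (1 + d) ** k for k, Dk in enumerate(D, start=1))
    return revenue - maint - devel

def adjusted_score(S, C):
    """Observation 4: a learner's score S, measured against an incomplete
    defect log, is discounted by the log's completeness C, i.e. S*C."""
    return S * C

def monte_carlo(trials=10_000, seed=1):
    """Report NPV as a (5th, 50th, 95th) percentile range, not a point value.
    All distributions and rates below are made-up placeholders."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        R = [rng.gauss(100, 20) for _ in range(5)]  # revenue per period (hypothetical)
        M = [rng.gauss(30, 10) for _ in range(5)]   # maintenance cost (hypothetical)
        D = [rng.gauss(60, 15) for _ in range(2)]   # development effort (hypothetical)
        samples.append(npv_sample(R, M, D, r=0.10, m=0.10, d=0.10))
    samples.sort()
    return tuple(samples[int(trials * p)] for p in (0.05, 0.50, 0.95))
```

In this sketch, research that "most constrains the variance" of the predictions would narrow the gap between the 5th and 95th percentiles while shifting the median toward a more desirable value.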