\subsection{About $Application_1$} Some details of $application_1$ are shown in \fig{task1}; i.e. a QA team working on a limited budget wants to {\em sort} the {\em modules} that the {\em data miner} {\em predicts} are defective to find (a)~those that require urgent inspection, and (b)~others that could be inspected later (or never). It is assumed that the QA team's inspections are $0 \le \alpha \le 100\%$ effective in recognizing faulty modules.
\begin{figure} \begin{displaymath} \xymatrix{ old\; data \ar[r] & data\;miner \ar[d] & \\ new\; data \ar[r] & predictor \ar[r] & predictions \ar[d]\\ & QA\; team \ar[d]_{\alpha\%}^{effective} & sorted \ar[l]_{RDF}^{order}\\ & defect\; reports & \\ & } \end{displaymath} \caption{$Application_1$: the QA team inspects the new data modules that the data miner predicts are faulty. The modules are inspected in order of their {\em relative defect frequency} (RDF), where one module is ranked higher than another if it is more likely to contain defects. The QA team is assumed to be $\alpha$\% effective at recognizing defective modules. }\label{fig:task1} \end{figure}
We make no assumption that $application_1$ is the only possible way to use data miners. There are many other business applications of defect predictors that do not conform to $application_1$ (e.g. two different applications were discussed above). However, $application_1$ was chosen for two reasons. Firstly, it is a common usage of automatic defect predictors. For example, if a V\&V company is hired to audit the code from some new off-shore client, it may have a large code base to inspect in the shortest possible time. Secondly, $application_1$ addresses current concerns in the defect prediction literature. Arisholm \& Briand~\cite{arisholm06} argue against certain standard measures of predictor performance such as accuracy (defined in \fig{measures}), saying that a highly accurate predictor can be undesirable in other ways. For example, accuracy says nothing about the appropriate sort order for reading modules. In $application_1$ we want to support a QA team {\em reading less} while {\em finding more} defects. For such a budget-conscious team, if X\% of the modules are predicted to be faulty but those modules contain less than X\% of the defects, then the cost of generating the defect predictor is not worth the effort.
Koru et al.~\cite{koru07} have much to say about the relative defect frequency (RDF) of different biasing strategies for selecting which modules should be inspected next. They speculate that the relationship between module {\em size} and {\em number of defects} is not linear but {\em logarithmic}; i.e. smaller modules are proportionally more troublesome. Accordingly, they argue that LOC can be used to create a biasing strategy with higher RDF. For example, if one has the resources to inspect 10,000 LOC, then the {\em logarithmic defect hypothesis} would say that it is better to pick 100 classes of size 100 LOC rather than 10 classes of 1,000 LOC. $Application_1$ can test the logarithmic defect hypothesis by trying two $sort$ orders: \bi \item In Koru's preferred {\em manualUp} policy, the {\em smaller} modules are inspected first. \item In the opposite {\em manualDown} policy, the {\em larger} modules are inspected first. \ei
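To make the two policies concrete, the following minimal sketch (in Python; the record layout and the {\tt loc} field name are illustrative assumptions, not part of Koru et al.'s formulation) orders a set of modules under each policy:
\begin{verbatim}
# A minimal sketch of the two manual inspection orders, assuming each
# module is a dictionary with an illustrative "loc" field.

def manual_up(modules):
    # Koru's preferred policy: inspect the smallest modules first.
    return sorted(modules, key=lambda m: m["loc"])

def manual_down(modules):
    # The opposite policy: inspect the largest modules first.
    return sorted(modules, key=lambda m: m["loc"], reverse=True)

modules = [{"name": "a", "loc": 1000},
           {"name": "b", "loc": 100},
           {"name": "c", "loc": 50}]
print([m["name"] for m in manual_up(modules)])    # ['c', 'b', 'a']
print([m["name"] for m in manual_down(modules)])  # ['a', 'b', 'c']
\end{verbatim}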
\begin{figure} \begin{center} \includegraphics[width=3in]{plots/effort.pdf} \end{center} \caption{Effort-vs-PD.}\label{fig:effort} \end{figure}
The relative merits of biasing strategies like {\em manualUp} and {\em manualDown} can be compared using the Effort-vs-PD diagram of \fig{effort}. The curves in that figure are generated as follows: \bi \item Some oracle {\em selects} a set of modules to inspect. In the case of automatic data mining, this would be the modules predicted to be defective. In the case of {\em manualUp} and {\em manualDown}, it would be all modules. \item The {\em selected} modules are sorted. For example, except for {\em manualDown}, we sort all modules ascending on LOC. \item The {\em selected} set is explored in the sorted order \mbox{$1 \le x \le |selected|$}. For each $x$ value, the $y$ value is the percentage of the defective modules seen in $1 \le i \le x$. This assumes $\alpha=100\%$; i.e. the QA team will {\em always} recognize a defective module. This implausible assumption will be relaxed below. \ei
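A minimal sketch of this curve construction is shown below (in Python). It assumes $\alpha=100\%$, treats each module as a hypothetical record with {\tt loc} and {\tt defective} fields, and plots effort (the $x$-axis) as the percentage of LOC inspected so far; that reading of the $x$-axis is our assumption, chosen to match the \%LOC discussion that follows.
\begin{verbatim}
# A minimal sketch of the curve construction above, assuming
# alpha = 100% (every inspected defective module is recognized).
# Each module is a dictionary with illustrative "loc" and "defective"
# fields; effort (the x-axis) is taken here to be the percentage of
# total LOC inspected so far.

def effort_vs_pd(selected, all_modules):
    total_loc  = sum(m["loc"] for m in all_modules)
    total_bugs = sum(m["defective"] for m in all_modules)
    points, loc_seen, bugs_seen = [], 0, 0
    for m in selected:              # "selected" is already sorted
        loc_seen  += m["loc"]
        bugs_seen += m["defective"]
        points.append((100.0 * loc_seen  / total_loc,    # x: effort
                       100.0 * bugs_seen / total_bugs))  # y: PD
    return points

# e.g. the manualUp curve uses all modules, sorted ascending on LOC:
# curve = effort_vs_pd(manual_up(modules), modules)
\end{verbatim}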
Note that: \bi \item If Koru et al. are right, then the {\em manualUp} and {\em manualDown} curves should appear as drawn in \fig{effort}; i.e. {\em manualUp} should find defective modules faster than {\em manualDown}. \item Typically, defect predictors do not trigger on all modules, and have some false alarm rate. For example, the {\em good} curve of \fig{effort} triggers on $B=43\%$ of the code while detecting only 93\% of the defective modules. Similarly, the {\em bad} curve stops after finding 30\% of the defective modules in 24\% of the code. \ei
\fig{effort} lets us define reasonable lower bounds on the performance of an automatic data miner being used for $application_1$: \bi \item For Arisholm \& Briand to approve of a data miner, its curve must fall {\em above} the diagonal line marked as {\em minimum}. This is the region where a QA team can {\em read less} (measured in terms of \%LOC inspected) and {\em find more} (measured in terms of the percentage of defective modules found in the inspected modules). \item A {\em bad} automatic method performs worse than simple manual methods; i.e. its Effort-vs-PD curve falls {\em below} both the {\em manualDown} and {\em manualUp} curves. For example, see the {\em bad} curve of \fig{effort}. \ei
\fig{effort} also lets us define a performance upper bound. Imagine that some omniscient oracle could somehow restrict the inspections to just the $A\%$ defective modules (in \fig{effort}, $A=30\%$). If {\em manualUp} were then applied to just those defective modules, the result would be the {\em best} curve. Realistically, defect predictors can approach the {\em best} curve but never reach it. Hence, the most we can hope for is something like the {\em good} curve, which falls {\em below} the {\em best} curve and {\em above} the {\em manualUp} and {\em manualDown} curves.
Two more details will complete our discussion of \fig{effort}. Firstly, when comparing supposedly {\em good} defect predictors, it is useful to express their performance as the area under the Effort-vs-PD curve, stated as a ratio of the area under the {\em best} curve. To be complete, that evaluation should contain the $\alpha$ factor that models the effectiveness of QA teams that inspect modules according to the defect predictor's recommendations. However, as shown in \fig{task1}, that factor applies to the activity that occurs {\em after} the data miner runs and the modules are sorted in ascending order by LOC. Hence: \bi \item That $\alpha$ factor {\em is the same across all data miners}; \item By expressing the value of a defect predictor as a ratio of the area under the {\em best} curve, that factor cancels out; \item So we can assess the relative merits of different defect predictors {\em independently} of $\alpha$. \ei
Secondly, as mentioned above, the curves from our data miners terminate at some $X<100\%$ value. For example, the {\em good} curve of \fig{effort} terminates at some point $C$ where $X=45\%$. To compute the area under the Effort-vs-PD curve, we must fill in the gap between the termination point and $X=100\%$. In the sequel, we will make the following {\em worst-case assumption}: \bi \item The QA team {\em only} inspects the modules referred to it by the data miner; \item That is, if a module is not recommended by the data miner, it is not inspected. \ei Visually, for the {\em good} curve, this worst-case assumption corresponds to a flat line running to the right from point $C$ to $X=100\%$. Our use of this {\em worst-case assumption} means that our results will {\em under-estimate} the efficacy of our learners (and we will return to this point below).
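The evaluation just described can be sketched as follows (in Python; the trapezoid rule and the function names are our own illustrative choices, not details fixed above). A learner's curve is extended with a flat line from its termination point out to $X=100\%$, per the worst-case assumption, and its area is reported as a ratio of the area under the {\em best} curve, so the $\alpha$ factor cancels out:
\begin{verbatim}
# A minimal sketch of the evaluation described above.  A curve is a
# list of (effort, pd) points such as those built by effort_vs_pd();
# the trapezoid rule is an illustrative choice of integration scheme.

def area_under(points):
    # Start at (0, 0) and, per the worst-case assumption, hold the
    # curve flat from its termination point out to x = 100.
    pts = [(0.0, 0.0)] + list(points)
    pts.append((100.0, pts[-1][1]))
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

def normalized_area(learner_curve, best_curve):
    # Expressing the learner as a ratio of "best" cancels the alpha
    # factor, which is the same across all data miners.
    return area_under(learner_curve) / area_under(best_curve)
\end{verbatim}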