\subsection{About $Application_1$} Some details of $application_1$ are shown in \fig{task1}; i.e. a QA team working on a limited budget wants to {\em sort} the {\em modules} that the {\em data miner} {\em predicts} are defective to find (a)~those that require urgent inspection, and (b)~others that could be inspected later (or never). It is assumed that the QA team's inspections are $0 \le \alpha \le 100\%$ effective in recognizing faulty modules.
\begin{figure} \begin{displaymath} \xymatrix{ old\; data \ar[r] & data\;miner \ar[d] & \\ new\; data \ar[r] & predictor \ar[r] & predictions \ar[d]\\ & QA\; team \ar[d]_{\alpha\%}^{effective} & sorted \ar[l]_{RDF}^{order}\\ & defect\; reports & \\ & } \end{displaymath} \caption{$Application_1$: the QA team inspects the new data modules that the data miner predicts are faulty. The modules are inspected in order of their {\em relative defect frequency} (RDF), where one module is ranked higher than another if it is more likely to contain defects. The QA team is assumed to be $\alpha$\% effective at recognizing defective modules. }\label{fig:task1} \end{figure}
We make no assumption that $application_1$ is the only possible way to use data miners. There are many other business applications of defect predictors that do not conform to $application_1$ (e.g. two different applications were discussed above). However, $application_1$ was chosen for two reasons. Firstly, it is a common usage of automatic defect predictors. For example, if a V\&V company is hired to audit the code from some new off-shore client, it may have a large code base to inspect in the shortest possible time. Secondly, $application_1$ addresses current concerns in the defect prediction literature. Arisholm \& Briand~\cite{arisholm06} argue against certain standard measures of predictor performance such as accuracy (defined in \fig{measures}), saying that a highly accurate predictor can be undesirable in other ways. For example, accuracy says nothing about the appropriate sort order for reading modules. In $application_1$ we want to support a QA team {\em reading less} while {\em finding more} defects. For such a budget-conscious team, if X\% of the modules are predicted to be faulty but those modules contain less than X\% of the defects, then the cost of generating the defect predictor is not worth the effort.
Koru et al.~\cite{koru07} have much to say about the relative defect frequency (RDF) of different biasing strategies for selecting which modules should be inspected next. They speculate that the relationship between module {\em size} and {\em number of defects} is not linear but {\em logarithmic}; i.e. smaller modules are proportionally more troublesome. Accordingly, they argue that LOC can be used to create a biasing strategy with higher RDF. For example, if one has the resources to inspect 10,000 LOC, then the {\em logarithmic defect hypothesis} would say that it is better to pick 100 classes of size 100 LOC rather than 10 classes of 1,000 LOC. $Application_1$ can test the logarithmic defect hypothesis by trying two $sort$ orders: \bi \item In Koru's preferred {\em manualUp} policy, the {\em smaller} modules are inspected first. \item In the opposite {\em manualDown} policy, the {\em larger} modules are inspected first. \ei
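To make the two policies concrete, the following minimal sketch (in Python; the record layout and the {\tt loc} field name are illustrative assumptions, not part of Koru et al.'s formulation) orders a set of modules under each policy:
\begin{verbatim}
# A minimal sketch of the two manual inspection orders, assuming each
# module is a dictionary with an illustrative "loc" field.

def manual_up(modules):
    # Koru's preferred policy: inspect the smallest modules first.
    return sorted(modules, key=lambda m: m["loc"])

def manual_down(modules):
    # The opposite policy: inspect the largest modules first.
    return sorted(modules, key=lambda m: m["loc"], reverse=True)

modules = [{"name": "a", "loc": 1000},
           {"name": "b", "loc": 100},
           {"name": "c", "loc": 50}]
print([m["name"] for m in manual_up(modules)])    # ['c', 'b', 'a']
print([m["name"] for m in manual_down(modules)])  # ['a', 'b', 'c']
\end{verbatim}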
\begin{figure} \begin{center} \includegraphics[width=3in]{plots/effort.pdf} \end{center} \caption{Effort-vs-PD.}\label{fig:effort} \end{figure}
The relative merits of biasing strategies like {\em manualUp} and {\em manualDown} can be compared using the Effort-vs-PD diagram of \fig{effort}. The curves in that figure are generated as follows: \bi \item Some oracle {\em selects} a set of modules to inspect. In the case of automatic data mining, this would be the modules predicted to be defective. In the case of {\em manualUp} and {\em manualDown}, it would be all modules. \item The {\em selected} modules are sorted. For example, except for {\em manualDown}, we sort all modules ascending on LOC. \item The {\em selected} set is explored in the sorted order \mbox{$1 \le x \le |selected|$}. For each $x$ value, the $y$ value is the percentage of the defective modules seen in $1 \le i \le x$. This assumes $\alpha=100\%$; i.e. the QA team will {\em always} recognize a defective module. This implausible assumption will be relaxed below. \ei
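A minimal sketch of this curve construction is shown below (in Python). It assumes $\alpha=100\%$, treats each module as a hypothetical record with {\tt loc} and {\tt defective} fields, and plots effort (the $x$-axis) as the percentage of LOC inspected so far; that reading of the $x$-axis is our assumption, chosen to match the \%LOC discussion that follows.
\begin{verbatim}
# A minimal sketch of the curve construction above, assuming
# alpha = 100% (every inspected defective module is recognized).
# Each module is a dictionary with illustrative "loc" and "defective"
# fields; effort (the x-axis) is taken here to be the percentage of
# total LOC inspected so far.

def effort_vs_pd(selected, all_modules):
    total_loc  = sum(m["loc"] for m in all_modules)
    total_bugs = sum(m["defective"] for m in all_modules)
    points, loc_seen, bugs_seen = [], 0, 0
    for m in selected:              # "selected" is already sorted
        loc_seen  += m["loc"]
        bugs_seen += m["defective"]
        points.append((100.0 * loc_seen  / total_loc,    # x: effort
                       100.0 * bugs_seen / total_bugs))  # y: PD
    return points

# e.g. the manualUp curve uses all modules, sorted ascending on LOC:
# curve = effort_vs_pd(manual_up(modules), modules)
\end{verbatim}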
Note that: \bi \item If Koru et al. are right, then the {\em manualUp} and {\em manualDown} curves should appear as drawn in \fig{effort}; i.e. {\em manualUp} should find defective modules faster than {\em manualDown}. \item Typically, defect predictors do not trigger on all modules, and have some false alarm rate. For example, the {\em good} curve of \fig{effort} triggers on $B=43\%$ of the code while detecting only 93\% of the defective modules. Similarly, the {\em bad} curve stops after finding 30\% of the defective modules in 24\% of the code. \ei
\fig{effort} lets us define reasonable lower bounds on the performance of an automatic data miner being used for $application_1$: \bi \item For Arisholm \& Briand to approve of a data miner, its curve must fall {\em above} the diagonal line marked as {\em minimum}. This is the region where a QA team can {\em read less} (measured in terms of \%LOC inspected) and {\em find more} (measured in terms of the percentage of defective modules found in the inspected modules). \item A {\em bad} automatic method performs worse than simple manual methods; i.e. its Effort-vs-PD curve falls {\em below} both the {\em manualDown} and {\em manualUp} curves. For example, see the {\em bad} curve of \fig{effort}. \ei
\fig{effort} also lets us define a performance upper bound. Imagine that some omniscient oracle could somehow restrict the inspections to just the $A\%$ defective modules (in \fig{effort}, $A=30\%$). If {\em manualUp} were then applied to just those defective modules, the result would be the {\em best} curve. Realistically, defect predictors can approach the {\em best} curve but never reach it. Hence, the most we can hope for is something like the {\em good} curve, which falls {\em below} the {\em best} curve and {\em above} the {\em manualUp} and {\em manualDown} curves.
Two more details will complete our discussion of \fig{effort}. Firstly, when comparing supposedly {\em good} defect predictors, it is useful to express their performance as the area under the Effort-vs-PD curve, stated as a ratio of the area under the {\em best} curve. To be complete, that evaluation should contain the $\alpha$ factor that models the effectiveness of QA teams that inspect modules according to the defect predictor's recommendations. However, as shown in \fig{task1}, that factor applies to the activity that occurs {\em after} the data miner runs and the modules are sorted in ascending order by LOC. Hence: \bi \item That $\alpha$ factor {\em is the same across all data miners}; \item By expressing the value of a defect predictor as a ratio of the area under the {\em best} curve, that factor cancels out; \item So we can assess the relative merits of different defect predictors {\em independently} of $\alpha$. \ei
Secondly, as mentioned above, the curves from our data miners terminate at some $X<100\%$ value. For example, the {\em good} curve of \fig{effort} terminates at some point $C$ where $X=45\%$. To compute the area under the Effort-vs-PD curve, we must fill in the gap between the termination point and $X=100\%$. In the sequel, we will make the following {\em worst-case assumption}: \bi \item The QA team {\em only} inspects the modules referred to it by the data miner; \item That is, if a module is not recommended by the data miner, it is not inspected. \ei Visually, for the {\em good} curve, this worst-case assumption corresponds to a flat line running to the right from point $C$ to $X=100\%$. Our use of this {\em worst-case assumption} means that our results will {\em under-estimate} the efficacy of our learners (and we will return to this point below).
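The evaluation just described can be sketched as follows (in Python; the trapezoid rule and the function names are our own illustrative choices, not details fixed above). A learner's curve is extended with a flat line from its termination point out to $X=100\%$, per the worst-case assumption, and its area is reported as a ratio of the area under the {\em best} curve, so the $\alpha$ factor cancels out:
\begin{verbatim}
# A minimal sketch of the evaluation described above.  A curve is a
# list of (effort, pd) points such as those built by effort_vs_pd();
# the trapezoid rule is an illustrative choice of integration scheme.

def area_under(points):
    # Start at (0, 0) and, per the worst-case assumption, hold the
    # curve flat from its termination point out to x = 100.
    pts = [(0.0, 0.0)] + list(points)
    pts.append((100.0, pts[-1][1]))
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

def normalized_area(learner_curve, best_curve):
    # Expressing the learner as a ratio of "best" cancels the alpha
    # factor, which is the same across all data miners.
    return area_under(learner_curve) / area_under(best_curve)
\end{verbatim}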