Thanks to the prompting of the reviewers, this version is now more succinct (smaller sections, less technical detail, more "sign posts" breaking up the general flow). Also, that dreadful section 4 of the old version (on the experimental method) has been completely rewritten. Finally, a simpler way of explaining it all is used throughout the paper (so instead of "feature subset selection", we now talk about "column pruning"). Many other changes have been applied as well. For specific details on what we changed in response to each reviewer's comments, see below.

> ==================================================================
> Reviewer 1
>
> This reviewer recommends that this paper be accepted provided that
> c01..c03, p02..p03, and t02..t03 are consolidated and results are
> presented in both relative and absolute terms. If "before" is 10% and
> after is 15%, then relative improvement is 50% ((15% - 10%)/10%) and
> absolute improvement is 5% (15% - 10%). I believe this will address
> many of this concerns mentioned below.

A most excellent idea. The middle plots of Figure 6 show the results from the combined data sets. They also show how the combined data sets "bridge the gap" between our experience with the larger data sets and the very small ones.

> 5. What do you see as the weakest aspect of this manuscript?
>
> 1) The paper argues when less is "NOT" more on page 3.
> If the paper is assuming a "data mining" applied to
> "Software Engineering" context, then reason 1 does not
> make much sense. If no data is available, then there
> is no form of algorithmic and/or machine learning modeling.

Quite true. The "NOT" section (which is now the second-to-last section) now includes notes on what to do when there is NO data present (see lines 360 to 370).

> 2) Figure 1 shows impressive results especially with data sets
> 5 through 12. However, sample size in data sets 5 though 10
> are quite small. What about merging c01..c03; p02..p04; and
> t02..t03? This would yield c0X = 56, p0X = 48, and t0X = 24.

Done: see Figure 6.

> 3) Figure 1 divides results into 2 subtables. What was the reasoning
> for this partition? Is it based on sample size (larger samples
> in top subtable)? Or is it based on public versus private data?

Roger. That old Figure 1 was just a bad idea. Figure 6 now breaks out the results by pruning method.

> 4) On page 2 the authors claim "If experience can tell us when
> to add variables, it should also be able to tell us when to
> subtract variables."
>
> It is assumed that "experience" refers to human-based
> experience. If so, then is it really necessary to build
> statistical/ML-based models?

The idea _was_ that experience processed by an AI agent would help humans understand their systems better. But, as this reviewer correctly points out, that idea was not properly developed. That metaphor has been dropped.

> 5) Figure 2 is a bit confusing. It does not seem to add
> much value to the paper. This reviewer suggests removing
> the figure. It seems that the authors are proposing how
> to incorporate variable reduction into the data mining
> process.

Agreed. Figure deleted.

> 6) In section 3 ("Why Subtract Variables"), the authors provide
> business reasons why it is important to subtract variables.
> In the second bullet, the authors provide an example of assessing
> competitive bids. There are two concerns regarding this example:
> A) It is an obvious example in the Time and Money will be
> the most prominent driving forces.
> B) This data mining type of problem is a lot easier than the
> case study presented in the paper. The reason for this claim
> is that the second bullet is an assessment type of problem
> as opposed to a predictor type of problem.

Agreed. Confused writing on our part. That section is now much shorter and clearer.

> 7) In section 3 ("Why Subtract Variables"), the authors argue
> for subtract variable because of "Irrelevancy." Essentially,
> this is a Type I, Type II issue (Only throw away the irrelevant,
> but keep the significant). Jumping ahead to Figure 5, there is
> a blur between relevant and irrelevant. That is, the paper
> argues the irrelevancy issue, but doesn't deliver consistent
> results in Figure 5.

Agreed. The core drawback of feature subset selection is that it is syntactic and can confuse the irrelevant with the insignificant. A real solution to this problem is beyond the scope of this paper.

> 8) In section 3 ("Why Subtract Variables"), the authors argue
> for subtracting variables because of "Under-sampling."
> A) For the paper "in preparation," the authors might want
> to consider the question, "How many features are enough?"
> That is, how does sample size drive feature set size.

Good idea.

> B) The argument made on page 6, first paragraph, assumes an
> equal distribution of the variables, which is not normally
> the case. Thus, the 88% and 3.5% results are "worst case
> scenarios."

Agreed. Bad argument on our part. It has been dropped.

> 9) On page 8, equation 1, claims that EMi has 15 effort multipliers
> (which is true for COCOMO I). Since some of your data is
> COCOMO II data, you might want to mention that the latter
> version has 17 effort multipliers.

Agreed. Text changed; see line 189.

> 10) Equation 3 raises an interesting question about feature
> reduction. Since many of the terms contain "Size" then
> won't it be less likely that "Size" would be removed?
> (In this case loc.)

In fact, that turns out to be the case: in all our experiments, LOC survived pruning. Sadly, this is one of the technical details NOT included in this new draft. There _was_ a whole discussion on how different features are found in different subsets, but that discussion grew so long that it became a whole separate paper (just accepted to ASE 2005, at a 22% acceptance rate; we are happy). Regardless, that discussion is too long and too intricate for this paper.
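For reviewers reading this response without the paper to hand, the model under discussion has the usual COCOMO shape. The sketch below is purely illustrative (the coefficients are nominal placeholders, not values from the paper), but it shows why the size term is so hard to prune away:

    import math

    def cocomo_effort(kloc, effort_multipliers, a=2.94, b=0.91):
        # Generic COCOMO-style model: effort = a * (size ^ b) * product(EM_i).
        # COCOMO 81 uses 15 effort multipliers; COCOMO II uses 17.
        # The defaults for a and b are nominal placeholders only.
        return a * (kloc ** b) * math.prod(effort_multipliers)

    # Example: a 100 KLOC project with all multipliers at their nominal value (1.0)
    # is driven almost entirely by the size term.
    print(cocomo_effort(100, [1.0] * 17))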
> 11) The paper argues for using the Wrapper technique for producing
> better effort estimation predictions. However, if I am a
> project manager, how far do I reduce? Figure 5 does not
> shed any light on this since "best results" are achieved
> anywhere from FS01 (t02) all the way through FS07 (p04).
> The paper does not plot the lines (in Figure 5) all the way
> to FS07 for all the lines. Thus, as a project manager, I do
> not know the consequences of extending the reduction out
> to FS07.

We should have been clearer in the last draft. The stopping criterion for FSS is FULLY AUTOMATIC: there is no human selection of "the best subset". The stopping decision is made by t-tests, and the operator simply gets back one set of attributes. The automatic nature of this process is stressed in this new draft; see lines 18, 61, 279, and 300.
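To make "fully automatic" concrete, here is a rough sketch of the idea (illustrative only: evaluate() is a placeholder for repeated hold-out scoring such as pred(30), and the actual search and statistics in the paper may differ in their details). It is backward-elimination column pruning that stops itself via t-tests:

    # Illustrative sketch only. "evaluate(cols)" is a placeholder that must return a
    # list of scores (e.g. pred(30) values) from repeated hold-outs using those columns.
    from scipy.stats import ttest_ind

    def mean(xs):
        return sum(xs) / len(xs)

    def prune_columns(columns, evaluate, alpha=0.05):
        kept, kept_scores = list(columns), evaluate(list(columns))
        while len(kept) > 1:
            # Try deleting each remaining column; keep the best-scoring candidate pruning.
            trials = [(c, evaluate([x for x in kept if x != c])) for c in kept]
            col, scores = max(trials, key=lambda t: mean(t[1]))
            # Stop when pruning no longer yields a statistically significant improvement.
            _, p = ttest_ind(scores, kept_scores)
            if not (mean(scores) > mean(kept_scores) and p < alpha):
                break
            kept = [x for x in kept if x != col]
            kept_scores = scores
        return kept  # the operator just gets back this one set of attributes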
> 12) It is noted that the plots (bottom graph of Figure 5)
> are not monotonic. This inconsistency raises questions about
> how far to reduce. Perhaps the authors may wish
> to include a confidence factor. That is, a reduction to FS01
> will improve effort prediction X percent of the time.

This is a very interesting remark. Recently we have been attracted to Bayesian model averaging as a way to get a sense of the space of possible theories. But at this time, we have no real results to show in this direction.

> 13) The paper talks about performing 30 hold-out experiments.
> What was the distribution of training to test samples. Page
> 11 of the paper implies a 2/3, 1/3 ratio. Is this correct?
> If so, then for data sets 5 through 10 there are 10 or
> fewer samples in the training set and 5 or fewer samples in
> the test set. This argues for the consolidation of data sets.

Indeed. See the new Figure 6 and the data sets "call, pall, tall".

> 14) Also, if an experiment is run 30 times per data set per
> feature reduction, then why not run a t-test on data set X for
> feature levels N and N+1. It would be possible to claim that level
> N+1 produces statistically superior results to level N (for data
> set X).

The old draft described, in detail, exactly how we implemented the above procedure (page 11, second dot point). Based on the advice of the other reviewers, we have reduced the level of technical detail in this draft. The current draft has a (very terse) description of the t-tests on line 276.

> 15) In section 5, Related Work, the authors refer to Kirsopp &
> Shepperd as "K&S" more than once. This is rather informal
> probably not suitable for a journal article.

Yes. "K&S" removed.

> 16) The authors converted the answers using natural logs. Were
> the answers converted back prior to measuring with pred(30)?
> If not, all the results are greatly distorted.

Oops, we missed that in the last draft. Fixed now; see line 220.
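For the record, the fix is the obvious one. Roughly (an illustrative sketch, not the actual measurement harness used in the paper), predictions made in log space are exponentiated back to the original effort scale before pred(30) is computed:

    import math

    def pred(n_percent, log_predictions, actual_efforts):
        # Predictions were learned on ln(effort), so exponentiate them back first.
        predicted = [math.exp(p) for p in log_predictions]
        within = [abs(p - a) / a <= n_percent / 100.0
                  for p, a in zip(predicted, actual_efforts)]
        # pred(30) = fraction of estimates within 30% of the actual effort.
        return sum(within) / len(within)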
> ==========================================================
> Reviewer 2
> [some deletions]
>
> 5. What do you see as the weakest aspect of this manuscript?
>
> Section 4 on the case study. It is too dense. Too much is
> missing. I have no idea how to interpret Figure 5 which is the
> main result. I am not sure what the variable listed in Figure 4
> means. In order to fit within IEEE Software length guidelines,
> too much of understanding was removed in the paper.
> The supporting paper with the manuscript is not relevant. If
> I need that paper to understand the current paper, then there is
> no need for the current paper.
> Section 4 needs to be rewritten and made more understandable as
> to what is going on.
>
> A. Public Comments (these will be made available to the author)
> Section 4 - the central theme of the paper - is unreadable. The
> topic is important, but this version is not readable to the
> general IEEE Software reader.

That dreadful section 4 of the old version (on the experimental method) has been completely rewritten: it is now divided into smaller sections, with the technical details described far more succinctly. Also, a simpler way of explaining it all is used throughout the paper (so instead of "feature subset selection", we now talk about "column pruning").

> ===========================================================
> Reviewer 3
> [deletions]
> 5. What do you see as the weakest aspect of this manuscript?
> It does not target the audience of the "Software" magazine.

Agreed. The current draft is, we believe, much better.

> Section III. Detailed Comments
>
> A. Public Comments (these will be made available to the author)
> Section I.C.1 The title implies that more than one cost
> estimation models was used in this experiment. The paper only
> refers to COCOMO. Reconcile this issue.

Agreed. While our process generates multiple models (one after each pruning step), they are all within the COCOMO framework. So, as per this reviewer's suggestion, the title has been changed.

> The paper has potential but right now it reads as if it was
> written for datamining specialists. It should be re-written for
> the readers of the "Software" magazine. That means that less
> details should be given about the machine learners and more
> about why this work is important for a user of the COCOMO model.

As mentioned above, that dreadful section 4 of the old version (on the experimental method) has been completely rewritten, divided into smaller sections, and described far less technically. Also, a simpler explanation ("column pruning" rather than "feature subset selection") is now used throughout the paper.

> How would a software project manager use COCOMO any differently
> due to your proposed approach and what does that buy him/her?

Good point, and one NOT addressed by the last draft. The change to current practice is described in lines 22 to 23 of the new draft.

> Do they have to be related
> projects to the one at hand? How much related?

Without knowledge of how well old projects relate to the one at hand, only column pruning can be performed (see the left-hand plot of Figure 6). This can produce some improvement, but not as much as when domain knowledge is used to divide up the data. While the best improvements come from heavy division (see the right-hand plot of Figure 6), even a little stratification knowledge can be helpful: the middle plot of Figure 6 shows that even a rudimentary stratification can help the process of learning estimators.
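To illustrate what "divide up the data" means (a sketch only; the grouping key and function names are placeholders, not the paper's code): stratification simply partitions the historical project rows by some domain attribute, and column pruning then runs within each subset.

    # Illustrative sketch only: partition the project rows by a domain attribute,
    # then run column pruning separately on each subset.
    from collections import defaultdict

    def stratify(rows, key):
        strata = defaultdict(list)
        for row in rows:
            strata[key(row)].append(row)
        return dict(strata)

    def learn_per_stratum(rows, key, prune_columns):
        # No domain knowledge: prune_columns(rows) over all rows (left plot of Figure 6).
        # With domain knowledge: prune within each related subset (middle/right plots).
        return {name: prune_columns(subset) for name, subset in stratify(rows, key).items()}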