Thanks to the prompting of the reviewers, this version is now more succinct (smaller sections, less technical detail, more "sign posts" breaking up the general flow). Also, that dreadful section 4 of the old version (on the experimental method) has been completely rewritten. Finally, a simpler way of explaining it all is used throughout the paper (so instead of "feature subset selection", we now talk about "column pruning"). Many other changes have been applied as well. For specific details on what we changed in response to each reviewer's comments, see below.

> ==================================================================
> Reviewer 1
>
> This reviewer recommends that this paper be accepted provided that
> c01..c03, p02..p03, and t02..t03 are consolidated and results are
> presented in both relative and absolute terms. If "before" is 10% and
> after is 15%, then relative improvement is 50% ((15% - 10%)/10%) and
> absolute improvement is 5% (15% - 10%). I believe this will address
> many of this concerns mentioned below.

A most excellent idea. The middle plots of Figure 6 show the results from the combined data sets. They also show how the combined data sets "bridge the gap" between our experience with the larger data sets and the very small ones.

> 5. What do you see as the weakest aspect of this manuscript?
>
> 1) The paper argues when less is "NOT" more on page 3.
> If the paper is assuming a "data mining" applied to
> "Software Engineering" context, then reason 1 does not
> make much sense. If no data is available, then there
> is no form of algorithmic and/or machine learning modeling.

Quite true. The "NOT" section (which is now the second-to-last section) now includes notes on what to do when there is NO data present (see lines 360 to 370).

> 2) Figure 1 shows impressive results especially with data sets
> 5 through 12. However, sample size in data sets 5 though 10
> are quite small. What about merging c01..c03; p02..p04; and
> t02..t03? This would yield c0X = 56, p0X = 48, and t0X = 24.

Done: see Figure 6.

> 3) Figure 1 divides results into 2 subtables. What was the reasoning
> for this partition? Is it based on sample size (larger samples
> in top subtable)? Or is it based on public versus private data?

Roger. That old Figure 1 was just a bad idea. Figure 6 now breaks out the results by pruning method.

> 4) On page 2 the authors claim "If experience can tell us when
> to add variables, it should also be able to tell us when to
> subtract variables."
>
> It is assumed that "experience" refers to human-based
> experience. If so, then is it really necessary to build
> statistical/ML-based models?

The idea _was_ that experience processed by an AI agent would help humans understand their systems better. But, as this reviewer correctly points out, that idea was not properly developed. That metaphor has been dropped.

> 5) Figure 2 is a bit confusing. It does not seem to add
> much value to the paper. This reviewer suggests removing
> the figure. It seems that the authors are proposing how
> to incorporate variable reduction into the data mining
> process.

Agreed. Figure deleted.

> 6) In section 3 ("Why Subtract Variables"), the authors provide
> business reasons why it is important to subtract variables.
> In the second bullet, the authors provide an example of assessing
> competitive bids. There are two concerns regarding this example:
> A) It is an obvious example in the Time and Money will be
> the most prominent driving forces.
> B) This data mining type of problem is a lot easier than the
> case study presented in the paper. The reason for this claim
> is that the second bullet is an assessment type of problem
> as opposed to a predictor type of problem.

Agreed. Confused writing on our part. That section is now much shorter and clearer.

> 7) In section 3 ("Why Subtract Variables"), the authors argue
> for subtract variable because of "Irrelevancy." Essentially,
> this is a Type I, Type II issue (Only throw away the irrelevant,
> but keep the significant). Jumping ahead to Figure 5, there is
> a blur between relevant and irrelevant. That is, the paper
> argues the irrelevancy issue, but doesn't deliver consistent
> results in Figure 5.

Agreed. The core drawback of feature subset selection is that it is syntactic and can confuse the irrelevant with the insignificant. A real solution to this problem is beyond the scope of this paper.

> 8) In section 3 ("Why Subtract Variables"), the authors argue
> for subtracting variables because of "Under-sampling."
> A) For the paper "in preparation," the authors might want
> to consider the question, "How many features are enough?"
> That is, how does sample size drive feature set size.

Good idea.

> B) The argument made on page 6, first paragraph, assumes an
> equal distribution of the variables, which is not normally
> the case. Thus, the 88% and 3.5% results are "worst case
> scenarios."

Agreed. Bad argument on our part. It has been dropped.

> 9) On page 8, equation 1, claims that EMi has 15 effort multipliers
> (which is true for COCOMO I). Since some of your data is
> COCOMO II data, you might want to mention that the latter
> version has 17 effort multipliers.

Agreed. Text changed; see line 189.

> 10) Equation 3 raises an interesting question about feature
> reduction. Since many of the terms contain "Size" then
> won't it be less likely that "Size" would be removed?
> (In this case loc.)

In fact, that turns out to be the case: in all our experiments, LOC survived pruning. Sadly, this is one of the technical details NOT included in this new draft. There _was_ a whole discussion on how different features are found in different subsets, but that discussion grew so long that it became a whole separate paper (just accepted to ASE 2005, at a 22% acceptance rate; we are happy). Regardless, that discussion is too long and too intricate for this paper.
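For reviewers reading this response without the paper to hand, the model under discussion has the usual COCOMO shape. The sketch below is purely illustrative (the coefficients are nominal placeholders, not values from the paper), but it shows why the size term is so hard to prune away:

    import math

    def cocomo_effort(kloc, effort_multipliers, a=2.94, b=0.91):
        # Generic COCOMO-style model: effort = a * (size ^ b) * product(EM_i).
        # COCOMO 81 uses 15 effort multipliers; COCOMO II uses 17.
        # The defaults for a and b are nominal placeholders only.
        return a * (kloc ** b) * math.prod(effort_multipliers)

    # Example: a 100 KLOC project with all multipliers at their nominal value (1.0)
    # is driven almost entirely by the size term.
    print(cocomo_effort(100, [1.0] * 17))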
> 11) The paper argues for using the Wrapper technique for producing
> better effort estimation predictions. However, if I am a
> project manager, how far do I reduce? Figure 5 does not
> shed any light on this since "best results" are achieved
> anywhere from FS01 (t02) all the way through FS07 (p04).
> The paper does not plot the lines (in Figure 5) all the way
> to FS07 for all the lines. Thus, as a project manager, I do
> not know the consequences of extending the reduction out
> to FS07.

We should have been clearer in the last draft. The stopping criterion for FSS is FULLY AUTOMATIC: there is no human selection of "the best subset". The stopping decision is made by t-tests, and the operator simply gets back one set of attributes. The automatic nature of this process is stressed in this new draft; see lines 18, 61, 279, and 300.
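To make "fully automatic" concrete, here is a rough sketch of the idea (illustrative only: evaluate() is a placeholder for repeated hold-out scoring such as pred(30), and the actual search and statistics in the paper may differ in their details). It is backward-elimination column pruning that stops itself via t-tests:

    # Illustrative sketch only. "evaluate(cols)" is a placeholder that must return a
    # list of scores (e.g. pred(30) values) from repeated hold-outs using those columns.
    from scipy.stats import ttest_ind

    def mean(xs):
        return sum(xs) / len(xs)

    def prune_columns(columns, evaluate, alpha=0.05):
        kept, kept_scores = list(columns), evaluate(list(columns))
        while len(kept) > 1:
            # Try deleting each remaining column; keep the best-scoring candidate pruning.
            trials = [(c, evaluate([x for x in kept if x != c])) for c in kept]
            col, scores = max(trials, key=lambda t: mean(t[1]))
            # Stop when pruning no longer yields a statistically significant improvement.
            _, p = ttest_ind(scores, kept_scores)
            if not (mean(scores) > mean(kept_scores) and p < alpha):
                break
            kept = [x for x in kept if x != col]
            kept_scores = scores
        return kept  # the operator just gets back this one set of attributes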
> 12) It is noted that the plots (bottom graph of Figure 5)
> are not monotonic. This inconsistency raises questions about
> how far to reduce. Perhaps the authors may wish
> to include a confidence factor. That is, a reduction to FS01
> will improve effort prediction X percent of the time.

This is a very interesting remark. Recently we have been attracted to Bayesian model averaging as a way to get a sense of the space of possible theories. But at this time, we have no real results to show in this direction.

> 13) The paper talks about performing 30 hold-out experiments.
> What was the distribution of training to test samples. Page
> 11 of the paper implies a 2/3, 1/3 ratio. Is this correct?
> If so, then for data sets 5 through 10 there are 10 or
> fewer samples in the training set and 5 or fewer samples in
> the test set. This argues for the consolidation of data sets.

Indeed. See the new Figure 6 and the data sets "call, pall, tall".

> 14) Also, if an experiment is run 30 times per data set per
> feature reduction, then why not run a t-test on data set X for
> feature levels N and N+1. It would be possible to claim that level
> N+1 produces statistically superior results to level N (for data
> set X).

The old draft described, in detail, exactly how we implemented the above procedure (page 11, second dot point). Based on the advice of the other reviewers, we have reduced the level of technical detail in this draft. The current draft has a (very terse) description of the t-tests on line 276.

> 15) In section 5, Related Work, the authors refer to Kirsopp &
> Shepperd as "K&S" more than once. This is rather informal
> probably not suitable for a journal article.

Yes. "K&S" removed.

> 16) The authors converted the answers using natural logs. Were
> the answers converted back prior to measuring with pred(30)?
> If not, all the results are greatly distorted.

Oops, we missed that in the last draft. Fixed now; see line 220.
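For the record, the fix is the obvious one. Roughly (an illustrative sketch, not the actual measurement harness used in the paper), predictions made in log space are exponentiated back to the original effort scale before pred(30) is computed:

    import math

    def pred(n_percent, log_predictions, actual_efforts):
        # Predictions were learned on ln(effort), so exponentiate them back first.
        predicted = [math.exp(p) for p in log_predictions]
        within = [abs(p - a) / a <= n_percent / 100.0
                  for p, a in zip(predicted, actual_efforts)]
        # pred(30) = fraction of estimates within 30% of the actual effort.
        return sum(within) / len(within)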
> ==========================================================
> Reviewer 2
> [some deletions]
>
> 5. What do you see as the weakest aspect of this manuscript?
>
> Section 4 on the case study. It is too dense. Too much is
> missing. I have no idea how to interpret Figure 5 which is the
> main result. I am not sure what the variable listed in Figure 4
> means. In order to fit within IEEE Software length guidelines,
> too much of understanding was removed in the paper.
> The supporting paper with the manuscript is not relevant. If
> I need that paper to understand the current paper, then there is
> no need for the current paper.
> Section 4 needs to be rewritten and made more understandable as
> to what is going on.
>
> A. Public Comments (these will be made available to the author)
> Section 4 - the central theme of the paper - is unreadable. The
> topic is important, but this version is not readable to the
> general IEEE Software reader.

That dreadful section 4 of the old version (on the experimental method) has been completely rewritten: it is now divided into smaller sections, with the technical details described far more succinctly. Also, a simpler way of explaining it all is used throughout the paper (so instead of "feature subset selection", we now talk about "column pruning").

> ===========================================================
> Reviewer 3
> [deletions]
> 5. What do you see as the weakest aspect of this manuscript?
> It does not target the audience of the "Software" magazine.

Agreed. The current draft is, we believe, much better.

> Section III. Detailed Comments
>
> A. Public Comments (these will be made available to the author)
> Section I.C.1 The title implies that more than one cost
> estimation models was used in this experiment. The paper only
> refers to COCOMO. Reconcile this issue.

Agreed. While our process generates multiple models (one after each pruning step), they are all within the COCOMO framework. So, as per this reviewer's suggestion, the title has been changed.

> The paper has potential but right now it reads as if it was
> written for datamining specialists. It should be re-written for
> the readers of the "Software" magazine. That means that less
> details should be given about the machine learners and more
> about why this work is important for a user of the COCOMO model.

As mentioned above, that dreadful section 4 of the old version (on the experimental method) has been completely rewritten, divided into smaller sections, and described far less technically. Also, a simpler explanation ("column pruning" rather than "feature subset selection") is now used throughout the paper.

> How would a software project manager use COCOMO any differently
> due to your proposed approach and what does that buy him/her?

Good point, and one NOT addressed by the last draft. The change to current practice is described in lines 22 to 23 of the new draft.

> Do they have to be related
> projects to the one at hand? How much related?

Without knowledge of how well old projects relate to the one at hand, only column pruning can be performed (see the left-hand plot of Figure 6). This can produce some improvement, but not as much as when domain knowledge is used to divide up the data. While the best improvements come from heavy division (see the right-hand plot of Figure 6), even a little stratification knowledge can be helpful: the middle plot of Figure 6 shows that even a rudimentary stratification can help the process of learning estimators.
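To illustrate what "divide up the data" means (a sketch only; the grouping key and function names are placeholders, not the paper's code): stratification simply partitions the historical project rows by some domain attribute, and column pruning then runs within each subset.

    # Illustrative sketch only: partition the project rows by a domain attribute,
    # then run column pruning separately on each subset.
    from collections import defaultdict

    def stratify(rows, key):
        strata = defaultdict(list)
        for row in rows:
            strata[key(row)].append(row)
        return dict(strata)

    def learn_per_stratum(rows, key, prune_columns):
        # No domain knowledge: prune_columns(rows) over all rows (left plot of Figure 6).
        # With domain knowledge: prune within each related subset (middle/right plots).
        return {name: prune_columns(subset) for name, subset in stratify(rows, key).items()}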