There are still many open questions about evaluation methods for ranking search-based software engineering algorithms.  A few ways in which this experiment could be expanded are described below.

\subsubsection{More Algorithms, More Datasets}
The 52 algorithms covered in this paper are only a small subset of the algorithms available for search-based software engineering.  While data mining and artificial intelligence are relatively young fields of computer science, they have already generated hundreds of algorithms, which can be combined in thousands of ways.  The combination-of-algorithms approach could be extended to include a more diverse range of algorithms, and the scalability of the results could be assessed.  Additionally, the PROMISE data repository contains further effort estimation datasets.  These could be included and tested to see whether the initial results hold across a larger range of datasets.
\\
The current COMBA coding system has an open-source implementation in which algorithms are called from shell scripts, allowing algorithms written in multiple languages to be collected across multiple environments.  This allows rapid expansion of the system by adding already-implemented, publicly available methods.
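A minimal sketch of such a shell-script wrapper, in Python. The function name and the output protocol (the learner prints its error measure as the first token on stdout) are assumptions for illustration, not the actual COMBA interface:

```python
import subprocess

def run_algorithm(command, dataset_path):
    # Append the dataset path and invoke the external learner as a
    # subprocess, so learners written in any language can be plugged in.
    # The learner is assumed to print its error measure as the first
    # token on stdout (hypothetical protocol).
    result = subprocess.run(command + [dataset_path],
                            capture_output=True, text=True, check=True)
    return float(result.stdout.split()[0])
```

Because the wrapper only depends on the command line and stdout, a new algorithm can be added without modifying the rest of the system.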
\\
This task is suggested primarily because of the difference in results between versions of COMBA.  The MWW tests performed for algorithm comparison are biased by the performance of the algorithms included in the COMBA system.  It is possible that there were not enough very poor or very high-performing algorithms, either of which could account for the difference in results.
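The bias arises because the MWW (Mann--Whitney--Wilcoxon) test ranks all runs from both algorithms in one pooled sample, so the verdict depends on which other algorithms' results are present.  A pure-Python sketch of the underlying U statistic (the full test also applies a significance threshold, omitted here):

```python
def average_ranks(values):
    # Rank all values (1-based), giving tied values their average rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    # U statistic for sample `a`: its rank-sum in the pooled sample,
    # minus the minimum possible rank-sum n1*(n1+1)/2.  U ranges from
    # 0 (a entirely below b) to len(a)*len(b) (a entirely above b).
    ranks = average_ranks(list(a) + list(b))
    return sum(ranks[:len(a)]) - len(a) * (len(a) + 1) / 2
```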

\subsubsection{Other Fields of Search-Based Software Engineering}    
This paper examined only effort estimation, while many algorithms exist for other aspects of project planning.  The combination-of-algorithms approach could be applied to tasks such as defect prediction and cost optimization.  On a broader scale, it could be used to evaluate algorithms from domains other than search-based software engineering.

\subsubsection{Algorithm Optimization}
Many algorithms used within the COMBA system, such as neural networks and Principal Component Analysis, can be run with a variety of different settings.  In addition to adding more algorithms, existing algorithms can be run with different specifications.  Settings are often decided using ``engineering judgment'' rather than empirical methods; sometimes this involves citing another paper in which a particular setting was used, even if that paper did not test other settings.  Testing across a diverse range of datasets might show that certain settings tend to perform better, providing a basis for further research using those methods.
\\
Additionally, the different specifications for an algorithm can be tested and compared with one another.  In this paper, several algorithms (CART, kNN, discretization) were tested with different values and compared.  The program can set these optional values itself and use a search process to find good values through repeated runs and comparisons.
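The simplest such search process is an exhaustive sweep over a settings grid.  In this sketch, `run` is a hypothetical stand-in for a full COMBA train-and-test cycle that returns an error measure:

```python
from itertools import product

def grid_search(run, grid):
    # Try every combination of settings from the grid and keep the one
    # with the lowest error; `run` maps a settings dict to an error
    # measure (e.g. MRE) and is assumed to be deterministic here.
    best_settings, best_error = None, float("inf")
    for combo in product(*grid.values()):
        settings = dict(zip(grid.keys(), combo))
        error = run(settings)
        if error < best_error:
            best_settings, best_error = settings, error
    return best_settings, best_error
```

For larger grids, the same interface admits random search or a smarter optimizer without changing the callers.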
\\
This need not be limited to existing methods: COMBA also serves as an environment for creating new algorithms.  Supposing an algorithm is created as a baseline and run on the available datasets, future runs of a modified version can be compared to previous states, testing whether the changes improved performance.
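Such a comparison can be as simple as a per-dataset win/tie/loss tally against the stored baseline errors.  The dataset names below are illustrative:

```python
def compare_to_baseline(baseline, variant):
    # Tally per-dataset wins/ties/losses of a modified algorithm against
    # the stored errors of its baseline version (lower error wins).
    wins = ties = losses = 0
    for dataset, base_error in baseline.items():
        new_error = variant[dataset]
        if new_error < base_error:
            wins += 1
        elif new_error > base_error:
            losses += 1
        else:
            ties += 1
    return wins, ties, losses
```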

\subsubsection{Rank On Other Dimensions}
It should be noted that only the error measures described were used to assess the given algorithms.  There are many other factors on which an algorithm can be evaluated, such as asymptotic (Big-O) complexity or empirical runtime.  Future tests could record the time taken to evaluate each dataset and report it alongside the error measures.  This could also serve as a complexity measure, to assess whether complex, time-consuming algorithms actually improve performance.
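Collecting this extra dimension requires only wrapping each evaluation with a timer; a minimal sketch, where `run` stands in for one algorithm/dataset evaluation:

```python
import time

def timed_run(run, dataset):
    # Record wall-clock runtime alongside the error measure, so that
    # algorithms can later be ranked on accuracy and cost together.
    start = time.perf_counter()
    error = run(dataset)
    return error, time.perf_counter() - start
```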

\subsubsection{Real Time Readjustment}
A combination-learning approach could be applied incrementally to data whose results depend on predictions made by a given algorithm.  For example, suppose a real-time simulation of a company is run using estimates produced by algorithms that performed well on previous data.  Changes in performance can be taken into account, so that multiple time steps are available and can be treated as different datasets.  This allows feedback to flow back into algorithm recommendation and evaluation, and for certain applications it can uncover more about the underlying assumptions of data generation.
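One way to sketch this: treat each time step as a dataset, and at every step recommend the algorithm with the best mean error over the steps seen so far.  The history format and algorithm names are illustrative:

```python
def incremental_ranking(history):
    # history: list of per-time-step dicts mapping algorithm -> error.
    # At each step, recommend the algorithm with the lowest mean error
    # over all *previous* steps, then fold the new step into the totals.
    recommendations = []
    totals, counts = {}, {}
    for step in history:
        if totals:
            recommendations.append(
                min(totals, key=lambda a: totals[a] / counts[a]))
        for alg, err in step.items():
            totals[alg] = totals.get(alg, 0.0) + err
            counts[alg] = counts.get(alg, 0) + 1
    return recommendations
```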

\subsubsection{Domain Specific Knowledge Acquisition}
Each feature of the datasets corresponds to a real-world value.  Other papers have proposed feature weighting, in which certain aspects of a project are more important than others for predicting effort.  Domain-independent methods, such as genetic algorithm optimization, have been proposed to find the best set of feature weights.  A broader approach could examine how particular features behave across different datasets.  For example, lines of code is generally regarded as a good indicator of project effort, though it often requires estimation itself.  That assumption could be tested by stripping lines of code from the dataset and viewing the results.  The approach could be generalized by removing features one at a time to test their effect on estimation accuracy under different algorithms.
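This leave-one-feature-out procedure can be sketched directly.  Here `score` is a hypothetical stand-in for training and testing an estimator on the dataset restricted to the given features, and the feature names are illustrative:

```python
def ablation_study(score, dataset, features):
    # Score the dataset with all features, then with each feature removed
    # in turn; a large rise in error when a feature is dropped suggests
    # that feature carries real predictive signal.
    baseline = score(dataset, features)
    impact = {}
    for feature in features:
        reduced = [f for f in features if f != feature]
        impact[feature] = score(dataset, reduced) - baseline
    return baseline, impact
```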

\subsubsection{Experiment Verification and Validation}
COMBA provides an environment for reproducing results published in other papers.  Experiments that relied on a limited number of datasets or comparison algorithms could be reproduced.  One possibility would be to reproduce multiple experiments from a single paper and note whether the assumptions in the initially published results hold under more intensive testing.
\\
More ambitiously, the COMBA system is freely available and could be recommended for general use.  Researchers in other areas could download COMBA and quickly validate their results.  There are more ways to expand this project than are feasible on the given time-scale, so this approach provides a system which builds itself: the more people use the system, the stronger its results become, as new algorithms and datasets are contributed to COMBA.

\subsubsection{Performance Analysis}
No reason was given as to why the algorithms that performed well did so.  A further study could observe the types of datasets on which algorithms perform well and the properties of the data that affect their performance.  This has the benefit of not requiring domain-specific software engineering data: any n-dimensional space could be analyzed, since many of the algorithms used in search-based software engineering make no domain-specific assumptions about the data.  If COMBA is run on a large number of algorithms and datasets, a large database of results becomes available, making rapid testing and verification possible by comparison to results stored from previous runs.
\\
These results could be used to suggest algorithms for a given dataset.  Similarity measures between datasets could be evaluated, and high-performing algorithms from similar datasets could be suggested without being run on the new dataset.  The recommendations could then be compared to the actual results of running the candidate algorithms.  While neural networks and other stochastic optimization processes are typically associated with long runtimes, a large database provides a terrain for optimization in which no complex algorithms need to be run: the results are already precomputed, so verification can be done within reasonable time bounds.  A study could also incrementally add results to the database and measure optimization effectiveness as a function of the number of datasets.  It would be interesting to see whether overfitting occurs with few datasets, or whether too many datasets flood the optimizer, producing results too general to be applicable.
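A nearest-neighbour sketch of such a recommender, assuming each dataset is summarized by a simple numeric profile (e.g. number of rows and features); the profile scheme, dataset names, and result tables are all illustrative, not measured values:

```python
import math

def recommend(profiles, results, target, top=2):
    # Find the stored dataset whose profile is closest to the target by
    # Euclidean distance, then return its best-performing algorithms
    # (lowest stored error) without running anything on the new data.
    nearest = min(profiles, key=lambda name: math.dist(profiles[name], target))
    errors = results[nearest]
    return sorted(errors, key=errors.get)[:top]
```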
