There are still many open questions about evaluation methods for ranking search-based software engineering algorithms.  A few ways in which the experiment could be expanded are described below.

\subsubsection{More Algorithms, More Datasets}
The 72 algorithms covered in this paper are only a small subset of the available algorithms for search-based software engineering.  While data mining and artificial intelligence are relatively young fields of computer science, they have already generated hundreds of algorithms which can be combined in thousands of ways.  The Combination of Algorithms approach could be extended to include a more diverse range of algorithms, and the scalability of the results could be assessed.  Additionally, more effort estimation datasets are available in the PROMISE data repository.  These additional datasets could be included and tested, to see whether the initial results hold across a larger range of datasets.

\subsubsection{Open Source Implementation}
The experiment was performed using software that references proprietary Matlab libraries.  It is not portable to machines without Matlab, and Matlab is not freely available.  Many open source data mining and statistics packages exist, such as R and Weka.  An interface could be built which links the algorithms available in these packages together and performs statistical analysis on the results.
\\
R is similar to Matlab in that it is a language for statistical computing.  A program could be written in R to perform the same functions as the current COMBA system.
\\
Weka is an open source Java package with both a visual and a command line interface.  The Weka algorithms can be accessed either through shell scripts on the command line or imported and called from Java code.  The advantage of shell scripting is that algorithms outside of Weka, possibly written in other languages, can also be accessed.  The statistical analysis available in the current Matlab COMBA could be invoked through a shell script as well.  The disadvantage is that maintenance becomes more difficult with shell scripting: a program which ties together several different programming languages can be hard to modify or debug when errors occur.

\subsubsection{Incremental Results}
Currently, all methods must be run every time analysis is performed.  The code could be changed so that the raw results and error measures are saved, and a program links multiple results files and performs Wilcoxon rank-sum (Mann-Whitney) tests on them.  This has the advantage of a shorter turn-around time for each new algorithm.  A new algorithm need only be run on a subset of the available data, and could then be compared to a database of previous results immediately.  The program and results data could also be released as a public resource if the software were open source, just as the PROMISE datasets are freely available.  Users could create results in a COMBA format, and pass them to the statistical comparison program.
\\
This could even be done on a website, which has the advantage that each set of results benchmarked would contribute to the overall pool of results stored in the system.
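The statistical comparison step could be sketched in an open source language.  The following Python sketch (function names are illustrative, and the normal approximation without tie correction is an assumed simplification, adequate only for moderate sample sizes) compares two stored lists of error measures with a Wilcoxon rank-sum (Mann-Whitney) test:

```python
import math

def average_ranks(values):
    """Assign 1-based ranks, averaging over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney(errors_a, errors_b):
    """Two-sided Mann-Whitney U test via the normal approximation
    (no tie correction).  Returns (U statistic, p-value)."""
    n1, n2 = len(errors_a), len(errors_b)
    ranks = average_ranks(list(errors_a) + list(errors_b))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = 0.0 if sigma == 0 else (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, p
```

A new algorithm's saved error measures could then be tested against each stored results file in turn, without rerunning any of the previously benchmarked methods.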

\subsubsection{Other Fields of Search-Based Software Engineering}    
This paper examined only effort estimation, while many algorithms exist for other aspects of project planning.  The combination of algorithms approach could be applied to tasks such as defect prediction and cost optimization. On a broader scale, the combination of algorithms method could be used to evaluate algorithms from domains other than search-based software engineering.  It could be assessed as a separate tool for revealing patterns in data.\\

\subsubsection{Discrete Data is Not Handled}
The current project did not handle discrete elements in the data, so a different software package which does handle discrete attributes could be used.  It could then be observed whether or not the inclusion of discrete data affected the results or the rankings.  How algorithms deal with discrete attributes is frequently omitted in academic papers, so if it has a concrete effect this would be a useful result to share.
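One common way to make discrete attributes usable by numeric learners is to replace each categorical column with indicator columns.  A minimal sketch (the dict-based row format is an assumption for illustration, not the COMBA data format):

```python
def one_hot(rows, column):
    """Replace a discrete (categorical) column with one 0/1 indicator
    column per observed category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new = {k: v for k, v in row.items() if k != column}
        for c in categories:
            # e.g. "language=C" is 1 only for rows whose language is C
            new[f"{column}={c}"] = 1 if row[column] == c else 0
        encoded.append(new)
    return encoded
```

Whether this encoding, or an alternative such as ordinal coding, changes the rankings is exactly the kind of question the expanded experiment could answer.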

\subsubsection{Performance Analysis}
No reason was given as to why the algorithms which performed well did so.  A further study could examine which types of datasets each algorithm performs well on, and the properties of the data that affect its performance.  This has the benefit of not requiring domain-specific software engineering data.  Any n-dimensional space could be analyzed, as many of the algorithms used in search-based software engineering do not make domain-specific assumptions about the data.  If COMBA is run on a large number of algorithms and datasets, a large database of results becomes available.  This makes rapid testing and verification possible by comparing to existing results stored from previous runs.
\\
These results could be used to suggest algorithms given a dataset.  Similarity measures between datasets could be evaluated, and high-performing algorithms from similar datasets could be suggested without having to test them on the new dataset.  The results of the recommendation system could then be compared to the actual results of running the candidate algorithms.  While neural nets and other stochastic optimization processes are typically associated with poor runtimes, a large database provides a terrain for optimization in which no complex algorithms need to be run.  The results would already be precomputed, so verification could be done within reasonable time bounds.  A study could also be done in which results are incrementally added to the database, and optimization effectiveness is gauged against the number of datasets.  It would be interesting to see whether overfitting occurs with a few datasets, or whether too many datasets flood the optimizer so that results become too general to be applicable.
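The recommendation idea could be sketched as follows.  The signature statistics and the Euclidean distance between them are illustrative assumptions, not a validated similarity measure:

```python
import math

def summarize(dataset):
    """Crude dataset signature: instance count, feature count, and the
    mean and spread of the target column (assumed to be last)."""
    targets = [row[-1] for row in dataset]
    n = len(targets)
    mean = sum(targets) / n
    spread = math.sqrt(sum((t - mean) ** 2 for t in targets) / n)
    return (n, len(dataset[0]) - 1, mean, spread)

def recommend(new_dataset, known_rankings):
    """known_rankings maps the signature of a benchmarked dataset to the
    algorithms that ranked best on it.  Suggest the best algorithms of
    the most similar known dataset, without running anything new."""
    signature = summarize(new_dataset)
    nearest = min(known_rankings, key=lambda s: math.dist(signature, s))
    return known_rankings[nearest]
```

The suggestions could then be validated against the true rankings obtained by actually running every algorithm on the new dataset.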

\subsubsection{Algorithm Tweaking}
Many algorithms used within the COMBA system, such as Neural Nets and Stepwise Regression, can be run using a variety of different settings.  In addition to adding more algorithms, existing algorithms could be run with different settings.  Often, settings for algorithms are decided using ``engineering judgment'', rather than empirical methods.  Sometimes this involves citing another paper in which a particular setting was used, even if that paper did not test other settings.  Testing across a diverse range of datasets might show that certain settings tend to perform better, providing a basis for further research using those settings.
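An empirical alternative to engineering judgment is a simple settings sweep.  A minimal sketch, where the learner interface (`learner(train, **settings)` returning a predictor) is hypothetical:

```python
from itertools import product

def sweep_settings(train, test, learner, grid):
    """Evaluate every combination of settings in `grid` and return
    (error, settings) for the combination with the lowest mean
    absolute error on held-out test rows."""
    best = None
    for values in product(*grid.values()):
        settings = dict(zip(grid.keys(), values))
        predict = learner(train, **settings)
        error = sum(abs(predict(x) - y) for x, y in test) / len(test)
        if best is None or error < best[0]:
            best = (error, settings)
    return best
```

Repeating the sweep over many datasets would show whether a setting's advantage is consistent or an artifact of one benchmark.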

\subsubsection{Algorithm Building}
The results of the COMBA experiment could be used to build a new algorithm whose only goal is to rank well compared to existing algorithms.  It could go through iterative optimization to maximize its ranking, perhaps using a genetic programming approach with a lambda calculus.  Datasets on which the algorithm was not optimized could then be introduced, to see whether its performance overfitted the existing data or whether the optimized algorithm uncovered useful information about estimation.

\subsubsection{Real Time Readjustment}
A combination learning approach could be applied in an incremental fashion to data whose results depend on predictions made by a given algorithm.  For example, suppose a real-time simulation of a company is performed using estimates produced by algorithms which performed well on previous data.  Changes in performance can be taken into account, so that multiple time steps become available and can be treated as different datasets.  This allows back-propagation for algorithm recommendation and evaluation, and for certain applications can uncover more details about the underlying assumptions of data generation.

\subsubsection{Synthetic Data Production}
While PROMISE provides a collection of software engineering data, a large portion of data remains company-specific and private.  A problem in creating a synthetic database is whether or not it is representative of the given domain.  While data could be generated randomly and algorithms left to search for patterns within it, a smarter approach to data production may be possible.
\\
The advantage of doing this within the COMBA system is benchmarking synthetic data against actual data.  Synthetic data can be ranked, and patterns seen in the other datasets, such as which algorithms performed well, can be compared.  It is possible that some companies with private data have many more entries than the available datasets; search-based software engineering typically relies on datasets with fewer than 100 instances.  Larger synthetic datasets could be created that extrapolate and expand patterns seen in the smaller datasets.
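A naive starting point for larger synthetic datasets is to fit a simple distribution to each column of a real dataset and sample more rows from it.  The sketch below assumes numeric columns and samples each column independently, ignoring correlations between columns, which a serious generator would need to preserve:

```python
import random
import statistics

def synthesize(real_rows, n_new, seed=0):
    """Sample n_new rows, each column drawn independently from a normal
    distribution fitted to the corresponding real column."""
    rng = random.Random(seed)
    columns = list(zip(*real_rows))
    fits = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return [[rng.gauss(mu, sd) for mu, sd in fits] for _ in range(n_new)]
```

Running the ranking pipeline on such synthetic data, and comparing the resulting rankings to those from the real source dataset, would show how much of the ranking signal the generator preserves.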

\subsubsection{Rank On Other Dimensions}
It should be noted that only the provided error measures were used to assess the given algorithms.  There are many other factors on which an algorithm can be evaluated, such as Big-O complexity or empirical runtime.  Future tests could collect the time taken to evaluate each dataset, and report it alongside the error measures.  This could also serve as a complexity measure, to assess whether complex, time-consuming algorithms actually improve performance.
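Collecting wall-clock time alongside error is straightforward to wire into the harness; `algorithm(dataset)` returning an error measure is an assumed interface:

```python
import time

def timed_run(algorithm, dataset):
    """Run an algorithm on a dataset, returning its error measure
    together with the wall-clock seconds the run took."""
    start = time.perf_counter()
    error = algorithm(dataset)
    elapsed = time.perf_counter() - start
    return error, elapsed
```

Rankings could then be produced on both dimensions, revealing whether any cheap, fast methods sit close to the expensive front-runners.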

\subsubsection{Combination of Data}
This paper combined different learners and assessed the combined results.  Manipulations can also be performed on the datasets themselves.  For example, datasets from different companies could be combined together, and the results of the algorithms on these new combined datasets could be assessed.
\\
Combining datasets introduces the problem that there is no industry standard for collecting information about software engineering projects.  Design decisions would arise, such as how to treat features which are represented in one dataset but not another.  Different approaches to combination could be evaluated.
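One simple combination policy is to keep the union of all features and mark values a dataset never collected as missing; restricting to the intersection of features is another.  A sketch of the union policy over dict-based rows (an assumed format):

```python
def combine(datasets):
    """Merge datasets with differing feature sets: every combined row
    carries the union of all features, with None where a dataset did
    not collect that feature."""
    all_features = sorted({f for rows in datasets for row in rows for f in row})
    return [{f: row.get(f) for f in all_features}
            for rows in datasets for row in rows]
```

How each learner tolerates the resulting missing values is itself a design decision the evaluation would need to fix.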

\subsubsection{Domain Specific Knowledge Acquisition}
Each feature of the datasets corresponds to a real-world quantity.  Other papers have proposed feature weighting, in which certain aspects of a project are more important than others for predicting effort.  Domain-agnostic methods, such as genetic algorithm optimization, have been proposed to find the best set of feature weights.  A broad approach could look at the contribution of certain features across different datasets.  For example, lines of code is generally regarded as a good indicator of project effort, though it often requires estimation itself to obtain.  This hypothesis could be tested by stripping lines of code from the dataset and observing the results.  The approach could be made more general by removing arbitrary features to test their effects on estimation accuracy under different algorithms.
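The feature-stripping test generalizes to an ablation loop: remove each feature in turn and record how the error measure changes.  A minimal sketch, where `evaluate(rows)` returning an error measure and the `"effort"` target name are illustrative assumptions:

```python
def ablate(rows, evaluate, target="effort"):
    """Return, for each non-target feature, the change in error when
    that feature is removed; a large increase suggests the feature
    carries real predictive signal."""
    baseline = evaluate(rows)
    effects = {}
    for feature in rows[0]:
        if feature == target:
            continue
        stripped = [{k: v for k, v in row.items() if k != feature}
                    for row in rows]
        effects[feature] = evaluate(stripped) - baseline
    return effects
```

Repeating this across datasets and learners would show whether a feature such as lines of code matters universally or only for particular algorithms.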

\subsubsection{Data Visualization}
Human perception is often powerful at quickly spotting patterns in visual models.  One idea would be to create a visual representation of a dataset and present it to an expert.  The expert could be asked to identify what types of algorithms they think would perform well on the dataset given its shape and attributes.  The expert's predictions could then be compared against the actual rankings, to score the visualization technique.  There may be a simple visualization that allows an expert to make complex decisions quickly.  This would also allow new datasets to be visualized, and the expert to make judgments on new data without consulting all possible algorithms.
\\
This also has the benefit of potentially explaining why certain algorithms or datasets performed well, discussed earlier in this section.

\subsubsection{Experiment Verification and Validation}
COMBA provides an environment to reproduce results published in other papers.  While no algorithm can be shown to be the absolute best, as the No Free Lunch theorem suggests, an algorithm can be shown to be reasonably good.  Experiments which relied on a limited number of datasets or comparison algorithms could be reproduced.  One possibility would be to reproduce multiple experiments from one paper, noting whether the assumptions in the initially published result held up under more intensive testing.
\\
More ambitiously, if the COMBA system were freely available it could be hosted and recommended for use.  In this way, researchers in other areas could download COMBA and quickly validate their results.  There are more ways to expand this project than are feasible given the time-scale, so this approach provides a system which will build itself.  The more people use it, the stronger the results it produces become, as new algorithms and datasets are contributed to the COMBA system.

\subsubsection{Pattern Matching Tested Algorithms}
While this paper directly combined large-scale programming constructs, the learners and preprocessors used could be further broken down into their components.  There are some programming languages, such as LISP, in which the components of a program are highly transparent and thus modifiable.  A potential experiment would be to reimplement algorithms in such a transparent language, and perform similarity analysis between high-performing and low-performing algorithms.
\\
This creates interesting problems, such as how to identify when two constructs are doing the same thing in different ways.  This could be addressed by pattern matching, or by inferring the function of code from a series of commands.  Time complexity can be a hindering factor in these scenarios, especially when iterative development and optimization processes are required to obtain a functioning system.

