A PRACTICAL PERFORMANCE INDEX FOR COMPARING OPTIMIZATION SOFTWARE

In this paper we propose a new practical performance index for ranking numerical methods. This index can be very helpful, especially when several methods are tested on a large number of instances, since it provides a concise and precise idea of the relative efficiency of a method with respect to the others. In order to evaluate the effectiveness of the proposed rule, we have applied it to numerical results presented in previously published papers.


INTRODUCTION
The increasing emphasis on the computational aspects of optimization methods and their impact on the solution of real-world applications has prompted the need to design meaningful indices for performance evaluation. However, many difficulties arise in evaluating and interpreting the results in a fair and balanced way.
First of all, we have to establish the object of evaluation. Actually, for any method, it is possible to define different algorithms and software implementations whose efficiency strongly depends on the compiler and on the hardware platform used.
Furthermore, we have to choose the criteria on which we shall carry out the evaluation. The traditionally used performance measures are: computational performance (speed and memory requirements, robustness), solution quality (accuracy), and scope (problem type and size).
A frequently used notion in the comparison of numerical methods is that of "the best method". "Best" is a relative concept that depends on subjective criteria, whose choice depends on the goal of the comparison: the best may mean the fastest, the easiest to apply, or the most reliable.
Another important issue to address in the evaluation process is the choice of the test problems. The testing phase of optimization software can be inadequate for several reasons: the number of test problems may be too small, or the instances may be of small size or exhibit some regularity. We observe that testing is a crucial phase, since any error in the way the experiments are carried out will propagate to every performance index used to compare the results.
Settling the controversy surrounding the reporting of results from scientific experiments is outside the scope of this paper. Relevant papers give guidelines on how to carry out numerical experiments correctly. Interested readers are referred, for example, to [1], [2] and [3], where the authors examine the issues involved in reporting on the empirical testing of parallel mathematical programming algorithms.
The main contribution of this paper is to define a new performance index, once the user has stated clearly what is tested, which performance criteria are considered, and which performance measure is used. The proposed index gives a concise idea of the relative efficiency of a method with respect to the others, when it is applied to solve a given set of test problems.
The rest of the paper is organized as follows. In the next section we introduce a simple example showing the weaknesses of certain known rules. In Section 3 we define our performance index. Finally, in Section 4, we illustrate the effectiveness of our rule by considering the computational results of some numerical methods proposed to solve two classes of optimization problems.

International Scientific Journal of Computing

AN EXAMPLE
Let p be the number of test problems, m the number of methods that we want to rank, and let C_{i,j} be the cost (for example, the execution time) required to solve test j (j = 1, 2, ..., p) by method i (i = 1, 2, ..., m).
Let us consider the simple case with m = 2 and p = 3 reported in Table 1. In order to compare the results of Method 2 with those of Method 1, a straightforward rule is to consider the ratio of the total costs:

    R^(1) = (sum_{j=1}^{p} C_{1,j}) / (sum_{j=1}^{p} C_{2,j}).

Another simple rule is defined by the following index:

    R^(2) = (1/p) sum_{j=1}^{p} C_{1,j} / C_{2,j},

corresponding to the average speed-up of Method 2 over Method 1. The value of R^(1) shows that Method 2 works better than Method 1, whereas R^(2) indicates that there is no difference in the performance of the two methods. However, the use of both indices may be misleading, since Method 2 improves 50% over Method 1 on the first test (the improvement of Method 2 over Method 1 on test j is defined by the ratio (C_{1,j} - C_{2,j}) / C_{1,j}), whereas Method 1 improves 50% over Method 2 on the other two tests. This means that Method 1 is globally more efficient than Method 2.
This simple example suggests that the rules above are not reliable. Actually, R^(1) does not take into account the differences among the costs required by each method to solve the individual test problems. On the other hand, R^(2) is useful only if the ratio C_{1,j}/C_{2,j} is greater than or equal to one for all the tests.
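The two rules can be sketched in a few lines of Python. Since Table 1 is not reproduced here, the cost values below are illustrative: they are chosen only so that Method 2 halves the cost of Method 1 on the first test while Method 1 halves the cost of Method 2 on the other two, as in the discussion above.

```python
# Hypothetical costs consistent with the discussion (Table 1 itself is
# not reproduced): Method 2 improves 50% over Method 1 on test 1, while
# Method 1 improves 50% over Method 2 on tests 2 and 3.
C1 = [8.0, 1.0, 1.0]  # costs of Method 1 on the three tests
C2 = [4.0, 2.0, 2.0]  # costs of Method 2 on the three tests

# R(1): ratio of the total costs.
R1 = sum(C1) / sum(C2)

# R(2): average of the per-test cost ratios C_{1,j} / C_{2,j}.
R2 = sum(c1 / c2 for c1, c2 in zip(C1, C2)) / len(C1)

print(R1)  # 1.25 -> suggests Method 2 is better overall
print(R2)  # 1.0  -> suggests the two methods are equivalent
```

With these costs the two indices disagree, exactly as the example describes: R^(1) declares Method 2 the winner while R^(2) sees no difference.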
Other difficulties arise when, on some tests, a method either fails to solve the test problem or finds a different solution from the one obtained by the other methods. In this situation, one possibility is to exclude the results of that particular test problem from the comparison.

A NEW PRACTICAL PERFORMANCE INDEX
Much work has been devoted to defining performance measures for comparing optimization software developed for specific problems. Interested readers are referred, for example, to [4], [5], [6] and [7]. Our index is more general and can be used to compare a large number of methods tested on a specified set of instances. More specifically, for each method i, we define the total quality index R_i as the pair of values

    R_i = (R_i^(SQ), R_i^(CP)).

Here R_i^(SQ) is a measure of the solution quality (i.e., robustness) of method i. In particular, we have chosen to define R_i^(SQ) as the percentage of successes, i.e. the ratio between the number of successful exits and the number p of test problems; as such, R_i^(SQ) ∈ [0,1]. We observe that this index may be omitted when its value is the same for all the methods or is meaningless for some particular application.
R_i^(CP) is the index of computational performance, defined as

    R_i^(CP) = (1/p) sum_{j=1}^{p} r_i(j),   where   r_i(j) = C_{i,j} / min_{k=1,...,m} C_{k,j}.

More specifically, for each method i and each test problem j, the score r_i(j) is the ratio of the cost of method i to the cost of the best of the considered methods on that test. Thus, R_i^(CP) is an average score that indicates how much less efficient a particular method has been, on average, than the most successful method (defined from now on as the ideal method). This value takes into account not only whether a method works better than the others, but also by how much it outperforms them.
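The computation of R_i^(CP) can be sketched as follows (a minimal sketch assuming a complete cost matrix with no failures; the function and variable names are ours):

```python
def computational_performance(costs):
    """Compute R_i^(CP) for each method.

    costs[i][j] is the cost C_{i,j} of method i on test problem j.
    Returns one average score per method; the ideal method, which is
    best on every test, would score exactly 1.0.
    """
    p = len(costs[0])  # number of test problems
    # Per-test best cost over all methods (the cost of the ideal method).
    best = [min(row[j] for row in costs) for j in range(p)]
    # r_i(j) = C_{i,j} / min_k C_{k,j}, averaged over the p tests.
    return [sum(row[j] / best[j] for j in range(p)) / p for row in costs]

# Illustrative run with two methods and three tests:
scores = computational_performance([[8.0, 1.0, 1.0],
                                    [4.0, 2.0, 2.0]])
```

A score of, say, 1.5 means the method was on average 50% more expensive than the best method on each test, which is exactly the quantitative information the point-based rankings discussed below discard.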
For the example reported in Table 2, the values of R_i^(CP) follow directly from this definition. We observe that our definition of R_i^(CP) can be viewed as a generalization of the rule used by Brown and Bartholomew-Biggs in [8] to rank some methods for solving unconstrained optimization problems. Their idea consists in assigning 1 point to the most successful code, 2 points to the second, and so on. In this way, the total score obtained by each method reflects how frequently it outperforms the others. The main drawback of this rule is that it provides a qualitative ranking rather than a quantitative one.
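For contrast, here is a sketch of the point-based rule of [8] (the function name is ours, and ties are resolved here by Python's stable sort, which is only one of several possible conventions):

```python
def point_ranking(costs):
    """Brown/Bartholomew-Biggs-style scoring: on each test problem the
    fastest method gets 1 point, the second fastest 2 points, and so on.
    A lower total means a better qualitative ranking."""
    m, p = len(costs), len(costs[0])
    totals = [0] * m
    for j in range(p):
        # Methods sorted by cost on test j, cheapest first.
        order = sorted(range(m), key=lambda i: costs[i][j])
        for rank, i in enumerate(order, start=1):
            totals[i] += rank
    return totals

# Same illustrative costs as before: Method 1 wins tests 2 and 3,
# Method 2 wins test 1.
totals = point_ranking([[8.0, 1.0, 1.0],
                        [4.0, 2.0, 2.0]])
```

The totals record only how often each method wins, not by how much, which illustrates the qualitative-versus-quantitative drawback noted above.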
It is worth mentioning that the index of computational performance R^(CP) is also particularly useful for establishing the speed-up of parallel methods and for comparing their efficiency with that of their sequential counterparts [9]. Obviously, this is possible only when the costs C_{i,j} used to rank the methods are execution times.
In the case of a failure on some test problem j, we suggest replacing the missing cost with the maximum cost computed over all the methods that solved test j. This choice seems reasonable because the main aim of the rule is to rank the methods using a relative index (i.e., we can only establish whether a method works better than the others on the limited set of selected test problems). In this case, the index R^(SQ) will account for the percentage of failures of the method.
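The failure-handling convention and the R^(SQ) index can be sketched as follows (our naming; `None` marks a failed run, and we assume at least one method solves each test):

```python
def fill_failures(costs):
    """Replace each failure (None) on test j with the maximum cost
    recorded by any method that solved test j.  Assumes every test is
    solved by at least one method."""
    p = len(costs[0])
    filled = [row[:] for row in costs]
    for j in range(p):
        worst = max(row[j] for row in costs if row[j] is not None)
        for row in filled:
            if row[j] is None:
                row[j] = worst
    return filled

def solution_quality(costs):
    """R_i^(SQ): fraction of successful exits for each method."""
    p = len(costs[0])
    return [sum(c is not None for c in row) / p for row in costs]

# Illustrative run: Method 1 fails on the second of two tests.
runs = [[1.0, None],
        [2.0, 3.0]]
filled = fill_failures(runs)   # the failure is charged the worst cost
sq = solution_quality(runs)    # Method 1 succeeds on half the tests
```

After `fill_failures`, the completed matrix can be fed to the R^(CP) computation unchanged, while R^(SQ) separately records the failure rate.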

NUMERICAL ILLUSTRATION
In order to evaluate the effectiveness of our rule, we have considered methods proposed in the literature to solve two classes of classical optimization problems: the shortest path problem in a directed graph, and the problem of finding the stationary points of an unconstrained nonlinear function.
In the former case, we use the results collected by Bertsekas [10] on 16 network problems solved by the Bellman-Ford (B-F) method, the D'Esopo-Pape (D-P) method, the Small Label First (SLF) method, the Threshold (THR) method, and the combination of the last two (SLF-THR). The cost chosen by Bertsekas to evaluate the performance of the methods is the execution time (in seconds).
On the basis of the results reported in Table 3, we obtain the ranking of Fig. 1. Note that we have reported only the values of the index R^(CP) of computational performance, since all the methods terminate with the optimal solution.
Our index reveals that the D-P method is about 11 times slower on average than SLF-THR, whereas the performance of SLF-THR is very close to that of the ideal method, which solves every test problem in the minimum time over all the considered methods (this also means that none of the methods is ideal). The analysis presented by Bertsekas in [10] completely matches our conclusions: "... the SLF method can also be combined with the threshold algorithm thereby considerably improving its practical performance", or: "... for some test problems the D'Esopo-Pape algorithm performs very poorly; we have not seen in the literature any report of a class of randomly generated sparse problems where this algorithm exhibits such poor behaviour", and so on.
As another example, we have considered the numerical results reported in [11] on solving unconstrained nonlinear optimization problems by limited-memory quasi-Newton methods. In particular, comparative tests have been conducted on a set of 18 well-known test problems (three of these have two different dimensions). Among the noticeable variety of collected results, we have considered (following the choice of the authors of [11]) only the CPU times for nine different codes, whose names are reported in Table 4.
The results are summarized in Table 5. In [11] it is pointed out that all the methods have practical appeal. E04DGF appears to be the least efficient method on the library test problems, and when the memory parameter m is increased from 3 to 7 there is no significant improvement in performance. Similar conclusions (and many others) can be derived by examining the details of the results in Figure 2. This confirms that our rule is sound and reliable.

CONCLUSION
In this paper we have proposed a new cumulative index for ranking numerical methods used to solve optimization problems. The results confirm that our index is sound and reliable and thus promises to be a useful tool for measuring the performance of any optimization method, even one implemented in parallel, especially when several methods are compared and the number of test problems is so large that analyzing the numerical results with other approaches becomes difficult.