The following table provides a rough comparison of the corpora. The modified EUSES corpus stands out for the diversity of its spreadsheets and for its large number of base spreadsheets and faulty versions. The Payroll/Gradebook corpus is the only corpus with authentic testing decisions, i.e. testing decisions provided by users. All other corpora have artificially created testing decisions. The main advantage of the Info1 corpus is that it contains real faults. The Integer corpus is useful for evaluating tools that cannot handle floating-point numbers.
Corpus | EUSES | Info1 | Integer | P/G* |
---|---|---|---|---|
Number of base spreadsheets | 184 | 2 | 33 | 2 |
Number of faulty versions | 576 | 119 | 231 | 349 |
Spreadsheet diversity | large | small | medium | small |
Spreadsheet origin | authentic | exercise | mixed | laboratory |
Fault origin | injected | real | injected | injected |
Fault complexity | single** | multiple | multiple | multiple |
Testing decisions origin | artificial | artificial | artificial | user-provided |
Testing decisions quality | always correct | always correct | always correct | wrong classifications possible |
Testing decision area | result cells | formula cells | result cells | arbitrary cells |
Domain | Real | Real | Integer | Real |
*P/G = Payroll/Gradebook
** An additional version of the EUSES corpus with double and triple faults has been added.
The following table provides a quantitative comparison of three of the corpora. The numbers of input and formula cells are good indicators of spreadsheet size. While the Payroll/Gradebook spreadsheets are very small (at most 19 formula cells), the EUSES spreadsheets have between 6 and more than 10,000 formula cells. Even the smallest Info1 spreadsheet has 501 formula cells.
A high percentage of copied cells indicates that grouping techniques, which treat similar cells as a unit, are well suited for this corpus. Both EUSES and Info1 have a high percentage of copied formulas (82% and 97% on average, respectively). The number of IFs is a rough indicator for the success of dynamic techniques, in which the concrete evaluation of conditions is important. On average, Info1 spreadsheets contain the largest number of IF statements.
The average number of operators per formula cell indicates the complexity of the spreadsheet. The high number of testing decisions in the EUSES corpus stems from the fact that they are generated automatically by comparing the result cells of a faulty spreadsheet with those of the correct spreadsheet. A user would never provide such a high number of testing decisions.
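The automatic generation of testing decisions described above can be sketched roughly as follows. This is a minimal illustration, not the tooling used to build the corpus: it assumes each spreadsheet is available as a dictionary from cell address to computed value, that a positive testing decision marks a result cell whose value matches the correct spreadsheet, and that a negative one marks a deviating cell. The function name `derive_testing_decisions`, the tolerance, and the sample cells are hypothetical.

```python
import math


def derive_testing_decisions(correct, faulty, result_cells, rel_tol=1e-9):
    """Split result cells into positive and negative testing decisions.

    Assumption: `correct` and `faulty` map cell addresses to computed values.
    A cell is positive if both spreadsheets agree on its value, else negative.
    """
    positive, negative = [], []
    for cell in result_cells:
        expected, actual = correct[cell], faulty[cell]
        if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
            # Compare numbers with a tolerance to absorb rounding noise.
            same = math.isclose(expected, actual, rel_tol=rel_tol)
        else:
            # Strings, error values, etc. are compared exactly.
            same = expected == actual
        (positive if same else negative).append(cell)
    return positive, negative


# Hypothetical values of three result cells in both spreadsheet versions.
correct = {"D2": 120.0, "D3": 80.0, "D4": 200.0}
faulty = {"D2": 120.0, "D3": 95.0, "D4": 215.0}

positive, negative = derive_testing_decisions(correct, faulty, ["D2", "D3", "D4"])
print("positive:", positive)
print("negative:", negative)
```

Because every result cell receives a decision, the number of decisions grows with the size of the spreadsheet, which is consistent with the large counts reported for EUSES.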
| Feature | Statistic | EUSES | Info1 | P/G* |
|---|---|---|---|---|
| Number of formula cells | Min | 6 | 501 | 10 |
| | Q1 | 40 | 580 | 12 |
| | Median | 111.5 | 2131 | 18 |
| | Q3 | 305.25 | 2245.5 | 19 |
| | Max | 10316 | 3157 | 19 |
| | Avg | 353.95 | 1466.22 | 15.03 |
| Number of input cells | Min | 1 | 10 | 5 |
| | Q1 | 45 | 13 | 5 |
| | Median | 129 | 21 | 6 |
| | Q3 | 430 | 60 | 9 |
| | Max | 24067 | 733 | 9 |
| | Avg | 601.54 | 90.67 | 7.17 |
| Number of unique formulas** | Min | 2 | 13 | 10 |
| | Q1 | 6 | 21 | 12 |
| | Median | 11 | 23 | 18 |
| | Q3 | 26 | 28 | 19 |
| | Max | 895 | 90 | 19 |
| | Avg | 24.19 | 25.91 | 15.02 |
| % Copied formulas*** | Min | 0 | 84 | 0 |
| | Q1 | 75 | 96 | 0 |
| | Median | 89 | 98 | 0 |
| | Q3 | 95 | 99 | 0 |
| | Max | 100 | 99 | 5 |
| | Avg | 82 | 97 | 0 |
| Number of spreadsheets with IFs | | 109 | 112 | 349 |
| % of spreadsheets with IFs | | 19 | 94 | 100 |
| Of those, number of IFs | Min | 1 | 142 | 7 |
| | Q1 | 22 | 143 | 7 |
| | Median | 54 | 532.5 | 9 |
| | Q3 | 136 | 1616 | 9 |
| | Max | 7839 | 3234 | 9 |
| | Avg | 371.44 | 1023.28 | 8.01 |
| Average number of operators per formula cell | Min | 0.34 | 2.54 | 2 |
| | Q1 | 1 | 3.33 | 2.21 |
| | Median | 1.48 | 3.87 | 2.33 |
| | Q3 | 2 | 5.92 | 3.17 |
| | Max | 17 | 9.25 | 3.9 |
| | Avg | 1.89 | 4.47 | 2.72 |
| Number of positive testing decisions per test set | Min | 0 | 1 | 0 |
| | Q1 | 9 | 6 | 2 |
| | Median | 28 | 8 | 4 |
| | Q3 | 92 | 10 | 7 |
| | Max | 2962 | 17 | 17 |
| | Avg | 79.81 | 8.08 | 4.76 |
| Number of negative testing decisions per test set | Min | 1 | 1 | 1 |
| | Q1 | 1 | 2 | 1 |
| | Median | 1 | 3 | 1 |
| | Q3 | 2 | 4 | 2 |
| | Max | 72 | 10 | 14 |
| | Avg | 2.86 | 3.18 | 1.77 |
*P/G = Payroll/Gradebook
** The set of unique formulas is a subset of all formula cells \(C_{unique}\subseteq C_{formula}\) such that \( \forall c, c' \in C_{unique}, c \neq c': \ell(c)\neq \ell(c') \), where \(\ell(c)\) is the formula of cell \(c\) in R1C1 notation.
*** Percentage of copied formulas: \( P_C=\left(1-\frac{|C_{unique}|}{|C_{formula}|}\right)\cdot 100 \)
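As an illustration of the percentage of copied formulas, the following sketch computes it from a list of formulas that are assumed to be already available in R1C1 notation, one entry per formula cell. The number of unique formulas is taken to be the number of distinct R1C1 strings. The function name `percent_copied` and the sample formulas are hypothetical and not part of the corpus tooling.

```python
def percent_copied(formulas_r1c1):
    """Return the percentage of copied formulas for one spreadsheet.

    Assumption: `formulas_r1c1` holds one R1C1 formula string per formula
    cell, so copied formulas appear as identical strings.
    """
    total = len(formulas_r1c1)
    if total == 0:
        raise ValueError("spreadsheet has no formula cells")
    unique = len(set(formulas_r1c1))  # number of distinct R1C1 formulas
    return (1 - unique / total) * 100


# Hypothetical sheet: one formula copied down six rows, plus two other formulas.
formulas = ["=RC[-2]*RC[-1]"] * 6 + ["=SUM(R[-6]C:R[-1]C)", "=R[-1]C*0.2"]

p_c = percent_copied(formulas)
print(f"{p_c:.1f}% copied formulas")
```

Here 3 of the 8 formula cells carry distinct formulas, so the sketch reports 62.5%. A sheet in which every formula is different yields 0%, as is the case for most Payroll/Gradebook spreadsheets.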