# Corpora Comparison

The following table provides a rough comparison of the corpora. The modified EUSES corpus stands out for the diversity of its spreadsheets and its high numbers of base spreadsheets and faulty versions. The Payroll/Gradebook corpus is the only corpus that comes with authentic testing decisions, i.e. the testing decisions are provided by users. All other corpora have artificially created testing decisions. The main advantage of the Info1 corpus is that it comes with real faults. The Integer corpus is useful for evaluating tools that cannot handle floating point numbers.

| Corpus | EUSES | Info1 | Integer | P/G* |
|---|---|---|---|---|
| Number of base spreadsheets | 184 | 2 | 33 | 2 |
| Number of faulty versions | 576 | 119 | 231 | 349 |
| Spreadsheet diversity | large | small | medium | small |
| Spreadsheet origin | authentic | exercise | mixed | laboratory |
| Fault origin | injected | real | injected | injected |
| Fault complexity | single** | multiple | multiple | multiple |
| Testing decisions origin | artificial | artificial | artificial | user-provided |
| Testing decisions quality | always correct | always correct | always correct | wrong classifications possible |
| Testing decision area | result cells | formula cells | result cells | arbitrary cells |
| Domain | Real | Real | Integer | Real |

*P/G = Payroll/Gradebook
** An additional version of the EUSES corpus with double and triple faults has been added.

The following table provides a quantitative comparison of three of the corpora. The numbers of input and formula cells are good indicators of the size of the spreadsheets. While the Payroll/Gradebook spreadsheets are very small, the EUSES spreadsheets have between 6 and more than 10,000 formula cells. Even the smallest Info1 spreadsheet has 501 formula cells.
A high percentage of copied formulas indicates that grouping techniques, which treat similar cells as a unit, are well suited to a corpus. Both EUSES and Info1 have high percentages of copied formulas. The number of IFs is a rough indicator of the success of dynamic techniques, in which the concrete evaluation of conditions is important. Info1 has by far the highest typical number of IFs (a median of 532.5 per spreadsheet with IFs, compared with 54 for EUSES and 9 for Payroll/Gradebook), although the single largest count occurs in an EUSES spreadsheet.
The average number of operators per formula cell indicates the complexity of the spreadsheet. The high number of testing decisions provided in the EUSES corpus originates from the fact that they are generated automatically by comparing the result cells of a faulty spreadsheet with those of the correct spreadsheet. Such a high number of testing decisions would never be provided by a user.
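
The automatic generation of testing decisions described above can be sketched roughly as follows. This is a minimal illustration, not the tooling used to build the corpus: the function names, the dictionary-based input format, and the numeric tolerance are all assumptions. It also assumes that a positive testing decision marks a cell whose value is correct and a negative one marks a cell whose value is wrong.

```python
import math


def same_value(expected, actual, tol=1e-9):
    """Compare two cell values; numbers are compared with a small tolerance (assumed)."""
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return math.isclose(expected, actual, rel_tol=tol, abs_tol=tol)
    return expected == actual


def generate_testing_decisions(correct_values, faulty_values, result_cells):
    """Split the result cells into positive and negative testing decisions.

    correct_values / faulty_values map a cell address to its computed value
    (hypothetical input format). A cell is judged positive if the faulty
    spreadsheet computes the same value as the correct one, else negative.
    """
    positive, negative = [], []
    for cell in result_cells:
        if same_value(correct_values[cell], faulty_values[cell]):
            positive.append(cell)
        else:
            negative.append(cell)
    return positive, negative


# Made-up values for three result cells of a correct and a faulty version.
correct = {"D5": 120.0, "D6": 80.0, "D7": 200.0}
faulty = {"D5": 120.0, "D6": 95.0, "D7": 215.0}

positive, negative = generate_testing_decisions(correct, faulty, ["D5", "D6", "D7"])
print(positive, negative)
```

Because every result cell receives a decision, the number of decisions grows with the size of the spreadsheet, which is consistent with the large counts reported for EUSES.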

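| Feature | Statistic | EUSES | Info1 | P/G* |
|---|---|---|---|---|
| Number of formula cells | Min | 6 | 501 | 10 |
| | Q1 | 40 | 580 | 12 |
| | Median | 111.5 | 2131 | 18 |
| | Q3 | 305.25 | 2245.5 | 19 |
| | Max | 10316 | 3157 | 19 |
| | Avg | 353.95 | 1466.22 | 15.03 |
| Number of input cells | Min | 1 | 10 | 5 |
| | Q1 | 45 | 13 | 5 |
| | Median | 129 | 21 | 6 |
| | Q3 | 430 | 60 | 9 |
| | Max | 24067 | 733 | 9 |
| | Avg | 601.54 | 90.67 | 7.17 |
| Number of unique formulas** | Min | 2 | 13 | 10 |
| | Q1 | 6 | 21 | 12 |
| | Median | 11 | 23 | 18 |
| | Q3 | 26 | 28 | 19 |
| | Max | 895 | 90 | 19 |
| | Avg | 24.19 | 25.91 | 15.02 |
| % Copied formulas*** | Min | 0 | 84 | 0 |
| | Q1 | 75 | 96 | 0 |
| | Median | 89 | 98 | 0 |
| | Q3 | 95 | 99 | 0 |
| | Max | 100 | 99 | 5 |
| | Avg | 82 | 97 | 0 |
| Number of spreadsheets with IFs | | 109 | 112 | 349 |
| % of spreadsheets with IFs | | 19 | 94 | 100 |
| Of those, number of IFs | Min | 1 | 142 | 7 |
| | Q1 | 22 | 143 | 7 |
| | Median | 54 | 532.5 | 9 |
| | Q3 | 136 | 1616 | 9 |
| | Max | 7839 | 3234 | 9 |
| | Avg | 371.44 | 1023.28 | 8.01 |
| Average number of operators per formula cell | Min | 0.34 | 2.54 | 2 |
| | Q1 | 1 | 3.33 | 2.21 |
| | Median | 1.48 | 3.87 | 2.33 |
| | Q3 | 2 | 5.92 | 3.17 |
| | Max | 17 | 9.25 | 3.9 |
| | Avg | 1.89 | 4.47 | 2.72 |
| Number of positive testing decisions per test set | Min | 0 | 1 | 0 |
| | Q1 | 9 | 6 | 2 |
| | Median | 28 | 8 | 4 |
| | Q3 | 92 | 10 | 7 |
| | Max | 2962 | 17 | 17 |
| | Avg | 79.81 | 8.08 | 4.76 |
| Number of negative testing decisions per test set | Min | 1 | 1 | 1 |
| | Q1 | 1 | 2 | 1 |
| | Median | 1 | 3 | 1 |
| | Q3 | 2 | 4 | 2 |
| | Max | 72 | 10 | 14 |
| | Avg | 2.86 | 3.18 | 1.77 |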

*P/G = Payroll/Gradebook

** The set of unique formulas is a subset of all formula cells, $$C_{unique}\subseteq C_{formula}$$, such that $$\forall c, c' \in C_{unique}, c \neq c': \ell(c)\neq \ell(c')$$, where $$\ell(c)$$ is the formula of cell $$c$$ in R1C1 notation.

*** Percentage of copied formulas: $$P_C=\left(1-\frac{|C_{unique}|}{|C_{formula}|}\right)\cdot 100$$
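
As a rough illustration of the two footnotes above, the sketch below counts unique formulas and derives the percentage of copied formulas. The function name and the input format (a dictionary from cell address to R1C1 formula text) are assumptions made for this example, and set sizes are used for $$C_{unique}$$ and $$C_{formula}$$.

```python
def copied_formula_percentage(r1c1_formulas):
    """Return (number of unique formulas, percentage of copied formulas).

    r1c1_formulas maps a cell address to its formula in R1C1 notation
    (hypothetical input format; only formula cells are included).
    """
    if not r1c1_formulas:
        raise ValueError("at least one formula cell is required")
    # Cells with identical R1C1 text count as copies of one formula.
    unique = set(r1c1_formulas.values())
    percentage = (1 - len(unique) / len(r1c1_formulas)) * 100
    return len(unique), percentage


# Four formula cells: C1 to C3 were copied down a column, so they share
# one R1C1 formula; C4 holds a different formula.
formulas = {
    "C1": "=RC[-2]+RC[-1]",
    "C2": "=RC[-2]+RC[-1]",
    "C3": "=RC[-2]+RC[-1]",
    "C4": "=SUM(R[-3]C:R[-1]C)",
}
n_unique, p_c = copied_formula_percentage(formulas)
print(n_unique, p_c)  # 2 unique formulas, 50.0 percent copied
```

Comparing formulas in R1C1 notation matters here: in A1 notation the three copied cells would have different formula texts, because their relative references would be written with different addresses.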