Corpora Comparison | Spreadsheets

The following table provides a rough comparison of the corpora. The modified EUSES corpus stands out for the diversity of its spreadsheets and for its high numbers of base spreadsheets and faulty versions. The Payroll/Gradebook corpus is the only corpus that comes with authentic testing decisions, i.e. testing decisions provided by users. All other corpora have artificially created testing decisions. The main advantage of the Info1 corpus is that it comes with real faults. The Integer corpus is useful for evaluating tools that cannot handle floating-point numbers.

Corpus	EUSES	Info1	Integer	P/G*
Number of base spreadsheets	184	2	33	2
Number of faulty versions	576	119	231	349
Spreadsheet diversity	large	small	medium	small
Spreadsheet origin	authentic	exercise	mixed	laboratory
Fault origin	injected	real	injected	injected
Fault complexity	single**	multiple	multiple	multiple
Testing decisions origin	artificial	artificial	artificial	user-provided
Testing decisions quality	always correct	always correct	always correct	wrong classifications possible
Testing decision area	result cells	formula cells	result cells	arbitrary cells
Domain	Real	Real	Integer	Real

*P/G = Payroll/Gradebook
** An additional version of the EUSES corpus with double and triple faults has been added.

The following table provides a quantitative comparison of three of the corpora. The numbers of input and formula cells are good indicators of spreadsheet size. While the Payroll/Gradebook spreadsheets are very small (at most 19 formula cells), the EUSES spreadsheets have between 6 and more than 10,000 formula cells. Even the smallest Info1 spreadsheet has 501 formula cells.
A high percentage of copied formulas indicates that grouping techniques, which treat similar cells as a unit, are well suited to a corpus. Both EUSES and Info1 have high percentages of copied formulas (82% and 97% on average). The number of IFs is a rough indicator for the success of dynamic techniques, in which the concrete evaluation of conditions is important. Among the spreadsheets that contain IFs, those of Info1 contain the most IF statements on average (1023.28, compared with 371.44 for EUSES).
The average number of operators per formula cell indicates the complexity of the spreadsheet. The high number of testing decisions provided in the EUSES corpus originates from the fact that they are generated automatically by comparing the result cells of a faulty spreadsheet with those of the correct spreadsheet. Such a high number of testing decisions would never be provided by a user.
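The automatic generation of testing decisions described above can be illustrated with a minimal sketch. It is not the corpus authors' tooling: the function name `derive_testing_decisions`, the dictionary representation of a spreadsheet (cell address mapped to computed value), and the floating-point tolerance are all assumptions made for illustration. The sketch also assumes that a positive testing decision marks a result cell whose value matches the correct spreadsheet and a negative one marks a deviating value.

```python
import math


def derive_testing_decisions(faulty, correct, result_cells, rel_tol=1e-9):
    """Split result cells into positive (value matches) and negative (value differs).

    `faulty` and `correct` are hypothetical dicts mapping cell addresses to
    computed values; `result_cells` lists the addresses to compare.
    """
    positive, negative = [], []
    for cell in result_cells:
        observed, expected = faulty[cell], correct[cell]
        if isinstance(observed, (int, float)) and isinstance(expected, (int, float)):
            # Assumption: numeric values are compared with a small tolerance.
            same = math.isclose(observed, expected, rel_tol=rel_tol)
        else:
            same = observed == expected
        (positive if same else negative).append(cell)
    return positive, negative


# Toy data: the fault affects D3 and propagates to the total in D4.
correct = {"D2": 120.0, "D3": 80.0, "D4": 200.0}
faulty = {"D2": 120.0, "D3": 95.0, "D4": 215.0}
positive, negative = derive_testing_decisions(faulty, correct, ["D2", "D3", "D4"])
print(positive)  # ['D2']
print(negative)  # ['D3', 'D4']
```

Because every result cell is classified, the number of decisions grows with the size of the spreadsheet, which is consistent with the large counts reported for EUSES.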

Feature		EUSES	Info1	P/G*
Number of formula cells	Min	6	501	10
	Q1	40	580	12
	Median	111.5	2131	18
	Q3	305.25	2245.5	19
	Max	10316	3157	19
	Avg	353.95	1466.22	15.03
Number of input cells	Min	1	10	5
	Q1	45	13	5
	Median	129	21	6
	Q3	430	60	9
	Max	24067	733	9
	Avg	601.54	90.67	7.17
Number of unique formulas**	Min	2	13	10
	Q1	6	21	12
	Median	11	23	18
	Q3	26	28	19
	Max	895	90	19
	Avg	24.19	25.91	15.02
% Copied formulas***	Min	0	84	0
	Q1	75	96	0
	Median	89	98	0
	Q3	95	99	0
	Max	100	99	5
	Avg	82	97	0
Number of spreadsheets with IFs		109	112	349
% of spreadsheets with IFs		19	94	100
Of those, number of IFs	Min	1	142	7
	Q1	22	143	7
	Median	54	532.5	9
	Q3	136	1616	9
	Max	7839	3234	9
	Avg	371.44	1023.28	8.01
Average Number of operators per formula cell	Min	0.34	2.54	2
	Q1	1	3.33	2.21
	Median	1.48	3.87	2.33
	Q3	2	5.92	3.17
	Max	17	9.25	3.9
	Avg	1.89	4.47	2.72
Number of positive testing decisions per test set	Min	0	1	0
	Q1	9	6	2
	Median	28	8	4
	Q3	92	10	7
	Max	2962	17	17
	Avg	79.81	8.08	4.76
Number of negative testing decisions per test set	Min	1	1	1
	Q1	1	2	1
	Median	1	3	1
	Q3	2	4	2
	Max	72	10	14
	Avg	2.86	3.18	1.77

*P/G = Payroll/Gradebook

** The set of unique formulas is a subset of all formulas \(C_{unique}\subseteq C_{formula}\) such that \( \forall c, c' \in C_{unique}, c \neq c': \ell(c)\neq \ell(c') \), where \(\ell(c)\) is the formula of cell \(c\) in R1C1 notation.
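To illustrate why R1C1 notation is used to identify copied formulas, the following sketch converts simple A1-style references into R1C1 notation. It is a simplified illustration rather than a full formula parser: the names `to_r1c1` and `column_number` are hypothetical, and sheet-qualified references, quoted strings, and function names ending in digits (such as LOG10) are not handled.

```python
import re

# Matches A1-style references such as B2, $B2, B$2 or $B$2.
# Simplification: no sheet names, no string literals, no functions like LOG10.
_A1_REF = re.compile(r"(\$?)([A-Z]{1,3})(\$?)([0-9]+)")


def column_number(letters):
    """Convert column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    n = 0
    for ch in letters:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def to_r1c1(formula, row, col):
    """Rewrite the references of `formula`, located at (row, col), in R1C1 notation."""
    def convert(match):
        col_abs, letters, row_abs, digits = match.groups()
        ref_row, ref_col = int(digits), column_number(letters)
        if row_abs:
            row_part = f"R{ref_row}"            # absolute row
        elif ref_row == row:
            row_part = "R"                      # same row
        else:
            row_part = f"R[{ref_row - row}]"    # relative row offset
        if col_abs:
            col_part = f"C{ref_col}"            # absolute column
        elif ref_col == col:
            col_part = "C"                      # same column
        else:
            col_part = f"C[{ref_col - col}]"    # relative column offset
        return row_part + col_part
    return _A1_REF.sub(convert, formula)


# A formula in D2 and its copy in D3 yield the same R1C1 string.
print(to_r1c1("=B2*C2", row=2, col=4))  # =RC[-2]*RC[-1]
print(to_r1c1("=B3*C3", row=3, col=4))  # =RC[-2]*RC[-1]
```

Two cells whose formulas differ in A1 notation only because of their position therefore map to the same \(\ell(c)\) and count as one unique formula.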

*** Percentage of copied formulas \( P_C=\left(1-\frac{|C_{unique}|}{|C_{formula}|}\right)\cdot 100 \)
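As a worked example of the percentage of copied formulas, the following sketch computes \(P_C\) from a list of formula strings that are assumed to be in R1C1 notation already. The function name `copied_formula_percentage` and the input format are assumptions made for illustration.

```python
def copied_formula_percentage(r1c1_formulas):
    """Return P_C = (1 - |C_unique| / |C_formula|) * 100.

    `r1c1_formulas` holds one R1C1 formula string per formula cell.
    """
    total = len(r1c1_formulas)
    if total == 0:
        raise ValueError("the spreadsheet has no formula cells")
    unique = len(set(r1c1_formulas))  # distinct R1C1 strings
    return (1 - unique / total) * 100


# Four formula cells, two distinct formulas.
formulas = [
    "=RC[-2]*RC[-1]",
    "=RC[-2]*RC[-1]",
    "=RC[-2]*RC[-1]",
    "=SUM(R[-3]C:R[-1]C)",
]
print(copied_formula_percentage(formulas))  # 50.0
```

A spreadsheet in which every formula is distinct yields 0, which matches the minimum reported for EUSES and the typical value for Payroll/Gradebook.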
