SPP 1307 Home Related Work Goals References Source code and tools Test instances: Texts Test instances: Patterns

Test instances: Texts

We collected real world test data to test algorithms and data structures for approximate pattern matching under realistic conditions. The texts come from different relevant application areas and have sizes up to several Gigabytes. We furthermore generated several synthetic texts using our tool tt-generate to perform tests under controlled conditions.

text-english: This is a concatenation of natural language text files downloaded from the collection of freely available electronic books of the Project Gutenberg in August 2009. In a preprocessing step we unified all line endings to \n, as well as removed the headers, footers, and licenses of the electronic book files with our tool strip_headers. In this text we only included English texts classified as ASCII texts by the Unix file command (some of the sequences nevertheless include a few non-ASCII or non-printable characters).
text-german: In this text we included German texts, consistently converted to ISO-8859 encoding.
text-chinese: In this text we included Chinese texts, consistently converted to UTF-8 encoding.
dna-human5: This text represents the DNA sequence of the human genome (in the following order: Chromosomes 1–22, X, Y). It was downloaded from the NCBI-GenBank (National Center for Biotechnology Information) in August 2009. The chromosome sequences consist of the letters A, C, G, and T, corresponding to the four nucleic acids. Furthermore they contain some occurences of N, M, and R, representing an unkown nucleic acid in the IUB/IUPAC code.
dna-human4: The same as dna-human5 except that all N are stripped out.
uniform-ascii, uniform-dna4, uniform-protein, uniform-binary: Synthetic texts where the characters are drawn uniformly from the respective underlying alphabet.
fibonacci: The Fibonacci string over the binary alphabet 0, 1.
markov-english: Synthetic text generated with a Markov process of order 3; the parameters where estimated from text-english.
markov-dna4: Synthetic text generated with a Markov process of order 9; the parameters where estimated from dna-human4.
markov-protein: Synthetic text generated with a Markov process of order 3; the parameters where estimated from protein-all.

For all the texts we computed some statistical values that are of interest when evaluating algorithms and data structures for approximate pattern matching (using our tools tt-analyze and file_statistics). The columns of the table are explained below.

File name	Encoding	File size	Sequence length	Alphabet size	Inv. prob. matching	Alphabet characters	1-grams	2-grams	3-grams	H₀	H₁	H₂	H₃	H₄	H₅	gzip	bzip2	xz
text-english	ASCII	8520 MiB	8 934 380 282	185	15.74	_ !"#$%&'()*+,-./0123456789:;<=>? @A...	185	7 424	113 678	4.53	3.63	2.96	2.44	2.10	1.89	37.77 %	28.10 %	23.54 %
text-german	ISO-8859	204 MiB	214 072 799	190	14.52	_ !"#$%&'()*+,-./0123456789:;<=>? @A...	190	8 914	101 319	4.57	3.59	2.94	2.47	2.12	1.88	37.24 %	28.13 %	21.06 %
text-chinese	UTF-8	147 MiB	52 812 589	16 565	104.12	...ä��ä��ä��ä��ä��ä��ä��ä��ä��ä��ä��ä��ä��ä��ä��ä��...	16 565	2 699 021	16 683 667	9.46	7.37	4.85				44.51 %	33.90 %	30.65 %
dna-human5	ASCII	2952 MiB	3 095 677 412	5	4.41	ACGNT	5	25	120	2.29	1.66	1.65	1.64	1.64	1.63	22.90 %	21.84 %	18.79 %
dna-human4	ASCII	2728 MiB	2 861 327 131	4	3.87	ACGT	4	16	64	1.98	1.94	1.93	1.92	1.91	1.90	26.74 %	25.56 %	21.98 %
protein-all	ASCII	5195 MiB	5 448 131 689	26	16.32	ABCDEFGHIJKLMNOPQRSTUVWXYZ	26	617	10 303	4.14	4.13	4.12	4.11	4.04	3.75	55.01 %	51.42 %	34.75 %
uniform-ascii	ASCII	1024 MiB	1 073 741 824	94	94.00	!"#$%&'()*+,-./0123456789:;<=>? @A...	94	8 836	830 584	6.55	6.55	6.55	6.55	6.55	6.55	83.01 %	82.81 %	83.49 %
uniform-dna4	ASCII	1024 MiB	1 073 741 824	4	4.00	ACGT	4	16	64	2.00	2.00	2.00	2.00	2.00	2.00	28.56 %	27.34 %	26.56 %
uniform-protein	ASCII	1024 MiB	1 073 741 824	26	26.00	ABCDEFGHIJKLMNOPQRSTUVWXYZ	26	676	17 576	4.70	4.70	4.70	4.70	4.69	4.69	63.53 %	59.79 %	61.70 %
uniform-binary	ASCII	1024 MiB	1 073 741 824	2	2.00	01	2	4	8	1.00	1.00	1.00	1.00	1.00	1.00	14.91 %	16.01 %	13.41 %
fibonacci	ASCII	1024 MiB	1 073 741 824	2	1.89	01	2	3	4	0.96	0.59	0.37	0.37	0.23	0.23	0.44 %	0.01 %	0.18 %
markov-english	ASCII	1024 MiB	1 073 741 824	139	15.55	!"#$%&'()*+,-./0123456789:;<=>? @A...	139	8 624	207 137	4.62	3.72	3.00	2.41	2.40	2.38	47.43 %	38.54 %	38.34 %
markov-dna4	ASCII	1024 MiB	1 073 741 824	4	3.89	ACGT	4	16	64	1.98	1.94	1.93	1.92	1.91	1.90	27.37 %	26.11 %	24.51 %
markov-protein	ASCII	1024 MiB	1 073 741 824	26	16.02	ABCDEFGHIJKLMNOPQRSTUVWXYZ	26	625	10 906	4.14	4.13	4.12	4.11	4.04	4.00	58.47 %	54.02 %	54.42 %

Explanation of the columns:

File name:: File name of the text.
Encoding:: Character encoding of the text.
File size:: File size of the text counted in MiB = 2²⁰ bytes.
Sequence length:: Length in characters of the text. This can be different from the file size because here we count actual characters and not bytes.
Inverse probability of matching:: This is the reciprocal of the probability that two randomly chosen text characters match. This measure can be used as an indication of size of the effectively used alphabet. For uniformly distributed texts, it equals the alphabet size [FN05].
Alphabet characters:: The set of characters occuring in the text collection.
q-grams:: Number of different substrings of length q.
H_k:: Empiciral entropy of order k, a measure for the randomness of a string. See, for example, Giovanni Manzini (2001): An analysis of the Burrows-Wheeler transform. (Some higher order entropies could not be computed because of the exploding memory requirements of the big alphabets; the values in italics are estimates.)
gzip, bzip2, xz:: Ratio of the compressed file divided by the original file size (using highest compression level: gzip -9 / bzip2 -9 / xz -9). The value of the respective best compressor is indicated in bold.

All texts are available for download (see above). Additionally there are prefixes of the texts available with sequence lengths that are powers of 2 (between 2¹⁶ ≈ 64 KiB and 2³² ≈ 4 GiB):

Prefix length	2¹⁶	2¹⁸	2²⁰	2²²	2²⁴	2²⁶	2²⁸	2³⁰	2³²
text-english
text-german
text-chinese
dna-human5
dna-human4
protein-all
uniform-ascii
uniform-dna4
uniform-protein
uniform-binary
fibonacci
markov-english
markov-dna4
markov-protein

We also compiled a collection of patterns for approximate pattern matching.