LEA

Test instances: Texts

We collected real world test data to test algorithms and data structures for approximate pattern matching under realistic conditions. The texts come from different relevant application areas and have sizes up to several Gigabytes. We furthermore generated several synthetic texts using our tool tt-generate to perform tests under controlled conditions.

text-english
This is a concatenation of natural language text files downloaded from the collection of freely available electronic books of the Project Gutenberg in August 2009. In a preprocessing step we unified all line endings to \n, as well as removed the headers, footers, and licenses of the electronic book files with our tool strip_headers. In this text we only included English texts classified as ASCII texts by the Unix file command (some of the sequences nevertheless include a few non-ASCII or non-printable characters).
text-german
In this text we included German texts, consistently converted to ISO-8859 encoding.
text-chinese
In this text we included Chinese texts, consistently converted to UTF-8 encoding.
dna-human5
This text represents the DNA sequence of the human genome (in the following order: Chromosomes 1–22, X, Y). It was downloaded from the NCBI-GenBank (National Center for Biotechnology Information) in August 2009. The chromosome sequences consist of the letters A, C, G, and T, corresponding to the four nucleic acids. Furthermore they contain some occurences of N, M, and R, representing an unkown nucleic acid in the IUB/IUPAC code.
dna-human4
The same as dna-human5 except that all N are stripped out.
uniform-ascii, uniform-dna4, uniform-protein, uniform-binary
Synthetic texts where the characters are drawn uniformly from the respective underlying alphabet.
fibonacci
The Fibonacci string over the binary alphabet 0, 1.
markov-english
Synthetic text generated with a Markov process of order 3; the parameters where estimated from text-english.
markov-dna4
Synthetic text generated with a Markov process of order 9; the parameters where estimated from dna-human4.
markov-protein
Synthetic text generated with a Markov process of order 3; the parameters where estimated from protein-all.

For all the texts we computed some statistical values that are of interest when evaluating algorithms and data structures for approximate pattern matching (using our tools tt-analyze and file_statistics). The columns of the table are explained below.

File name Encoding File size Sequence length Alphabet
size
Inv. prob.
matching
Alphabet characters 1-grams2-grams3-grams H0H1H2H3H4H5 gzipbzip2xz
text-english ASCII 8520 MiB8 934 380 282 185 15.74 _ !"#$%&'()*+,-./0123456789:;<=>? @A... 1857 424113 6784.533.632.962.442.101.8937.77 %28.10 %23.54 %
text-german ISO-8859 204 MiB214 072 799 190 14.52 _ !"#$%&'()*+,-./0123456789:;<=>? @A... 1908 914101 3194.573.592.942.472.121.8837.24 %28.13 %21.06 %
text-chinese UTF-8 147 MiB52 812 589 16 565 104.12 ...ääääääääääääääää... 16 5652 699 02116 683 6679.467.374.8544.51 %33.90 %30.65 %
dna-human5 ASCII 2952 MiB3 095 677 412 5 4.41 ACGNT 5251202.291.661.651.641.641.6322.90 %21.84 %18.79 %
dna-human4 ASCII 2728 MiB2 861 327 131 4 3.87 ACGT 416641.981.941.931.921.911.9026.74 %25.56 %21.98 %
protein-all ASCII 5195 MiB5 448 131 689 26 16.32 ABCDEFGHIJKLMNOPQRSTUVWXYZ 2661710 3034.144.134.124.114.043.7555.01 %51.42 %34.75 %
uniform-ascii ASCII 1024 MiB1 073 741 824 94 94.00 !"#$%&'()*+,-./0123456789:;<=>? @A... 948 836830 5846.556.556.556.556.556.5583.01 %82.81 %83.49 %
uniform-dna4 ASCII 1024 MiB1 073 741 824 4 4.00 ACGT 416642.002.002.002.002.002.0028.56 %27.34 %26.56 %
uniform-protein ASCII 1024 MiB1 073 741 824 26 26.00 ABCDEFGHIJKLMNOPQRSTUVWXYZ 2667617 5764.704.704.704.704.694.6963.53 %59.79 %61.70 %
uniform-binary ASCII 1024 MiB1 073 741 824 2 2.00 01 2481.001.001.001.001.001.0014.91 %16.01 %13.41 %
fibonacci ASCII 1024 MiB1 073 741 824 2 1.89 01 2340.960.590.370.370.230.230.44 %0.01 %0.18 %
markov-english ASCII 1024 MiB1 073 741 824 139 15.55 !"#$%&'()*+,-./0123456789:;<=>? @A... 1398 624207 1374.623.723.002.412.402.3847.43 %38.54 %38.34 %
markov-dna4 ASCII 1024 MiB1 073 741 824 4 3.89 ACGT 416641.981.941.931.921.911.9027.37 %26.11 %24.51 %
markov-protein ASCII 1024 MiB1 073 741 824 26 16.02 ABCDEFGHIJKLMNOPQRSTUVWXYZ 2662510 9064.144.134.124.114.044.0058.47 %54.02 %54.42 %

Explanation of the columns:

File name:
File name of the text.
Encoding:
Character encoding of the text.
File size:
File size of the text counted in MiB = 220 bytes.
Sequence length:
Length in characters of the text. This can be different from the file size because here we count actual characters and not bytes.
Inverse probability of matching:
This is the reciprocal of the probability that two randomly chosen text characters match. This measure can be used as an indication of size of the effectively used alphabet. For uniformly distributed texts, it equals the alphabet size [FN05].
Alphabet characters:
The set of characters occuring in the text collection.
q-grams:
Number of different substrings of length q.
Hk:
Empiciral entropy of order k, a measure for the randomness of a string. See, for example, Giovanni Manzini (2001): An analysis of the Burrows-Wheeler transform. (Some higher order entropies could not be computed because of the exploding memory requirements of the big alphabets; the values in italics are estimates.)
gzip, bzip2, xz:
Ratio of the compressed file divided by the original file size (using highest compression level: gzip -9 / bzip2 -9 / xz -9). The value of the respective best compressor is indicated in bold.

All texts are available for download (see above). Additionally there are prefixes of the texts available with sequence lengths that are powers of 2 (between 216 ≈ 64 KiB and 232 ≈ 4 GiB):

Prefix length216218220222224226228230232
text-english download download download download download download download download download
text-german download download download download download download
text-chinese download download download download download
dna-human5 download download download download download download download download
dna-human4 download download download download download download download download
protein-all download download download download download download download download download
uniform-ascii download download download download download download download
uniform-dna4 download download download download download download download
uniform-protein download download download download download download download
uniform-binary download download download download download download download
fibonacci download download download download download download download
markov-english download download download download download download download
markov-dna4 download download download download download download download
markov-protein download download download download download download download

We also compiled a collection of patterns for approximate pattern matching.