Test instances: Texts
We collected real world test data to test algorithms and data structures for approximate pattern matching under realistic conditions. The texts come from different relevant application areas and have sizes up to several Gigabytes. We furthermore generated several synthetic texts using our tool tt-generate to perform tests under controlled conditions.
- text-english
- This is a concatenation of natural language text files downloaded from the collection of freely available electronic books of the Project Gutenberg in August 2009. In a preprocessing step we unified all line endings to \n, as well as removed the headers, footers, and licenses of the electronic book files with our tool strip_headers. In this text we only included English texts classified as ASCII texts by the Unix file command (some of the sequences nevertheless include a few non-ASCII or non-printable characters).
- text-german
- In this text we included German texts, consistently converted to ISO-8859 encoding.
- text-chinese
- In this text we included Chinese texts, consistently converted to UTF-8 encoding.
- dna-human5
- This text represents the DNA sequence of the human genome (in the following order: Chromosomes 1–22, X, Y). It was downloaded from the NCBI-GenBank (National Center for Biotechnology Information) in August 2009. The chromosome sequences consist of the letters A, C, G, and T, corresponding to the four nucleic acids. Furthermore they contain some occurences of N, M, and R, representing an unkown nucleic acid in the IUB/IUPAC code.
- dna-human4
- The same as dna-human5 except that all N are stripped out.
- uniform-ascii, uniform-dna4, uniform-protein, uniform-binary
- Synthetic texts where the characters are drawn uniformly from the respective underlying alphabet.
- fibonacci
- The Fibonacci string over the binary alphabet 0, 1.
- markov-english
- Synthetic text generated with a Markov process of order 3; the parameters where estimated from text-english.
- markov-dna4
- Synthetic text generated with a Markov process of order 9; the parameters where estimated from dna-human4.
- markov-protein
- Synthetic text generated with a Markov process of order 3; the parameters where estimated from protein-all.
For all the texts we computed some statistical values that are of interest when evaluating algorithms and data structures for approximate pattern matching (using our tools tt-analyze and file_statistics). The columns of the table are explained below.
File name | Encoding | File size | Sequence length | Alphabet size |
Inv. prob. matching |
Alphabet characters | 1-grams | 2-grams | 3-grams | H0 | H1 | H2 | H3 | H4 | H5 | gzip | bzip2 | xz |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
text-english | ASCII | 8520 MiB | 8 934 380 282 | 185 | 15.74 | _ !"#$%&'()*+,-./0123456789:;<=>? @A... | 185 | 7 424 | 113 678 | 4.53 | 3.63 | 2.96 | 2.44 | 2.10 | 1.89 | 37.77 % | 28.10 % | 23.54 % |
text-german | ISO-8859 | 204 MiB | 214 072 799 | 190 | 14.52 | _ !"#$%&'()*+,-./0123456789:;<=>? @A... | 190 | 8 914 | 101 319 | 4.57 | 3.59 | 2.94 | 2.47 | 2.12 | 1.88 | 37.24 % | 28.13 % | 21.06 % |
text-chinese | UTF-8 | 147 MiB | 52 812 589 | 16 565 | 104.12 | ...ääääääääääääääää... | 16 565 | 2 699 021 | 16 683 667 | 9.46 | 7.37 | 4.85 | 44.51 % | 33.90 % | 30.65 % | |||
dna-human5 | ASCII | 2952 MiB | 3 095 677 412 | 5 | 4.41 | ACGNT | 5 | 25 | 120 | 2.29 | 1.66 | 1.65 | 1.64 | 1.64 | 1.63 | 22.90 % | 21.84 % | 18.79 % |
dna-human4 | ASCII | 2728 MiB | 2 861 327 131 | 4 | 3.87 | ACGT | 4 | 16 | 64 | 1.98 | 1.94 | 1.93 | 1.92 | 1.91 | 1.90 | 26.74 % | 25.56 % | 21.98 % |
protein-all | ASCII | 5195 MiB | 5 448 131 689 | 26 | 16.32 | ABCDEFGHIJKLMNOPQRSTUVWXYZ | 26 | 617 | 10 303 | 4.14 | 4.13 | 4.12 | 4.11 | 4.04 | 3.75 | 55.01 % | 51.42 % | 34.75 % |
uniform-ascii | ASCII | 1024 MiB | 1 073 741 824 | 94 | 94.00 | !"#$%&'()*+,-./0123456789:;<=>? @A... | 94 | 8 836 | 830 584 | 6.55 | 6.55 | 6.55 | 6.55 | 6.55 | 6.55 | 83.01 % | 82.81 % | 83.49 % |
uniform-dna4 | ASCII | 1024 MiB | 1 073 741 824 | 4 | 4.00 | ACGT | 4 | 16 | 64 | 2.00 | 2.00 | 2.00 | 2.00 | 2.00 | 2.00 | 28.56 % | 27.34 % | 26.56 % |
uniform-protein | ASCII | 1024 MiB | 1 073 741 824 | 26 | 26.00 | ABCDEFGHIJKLMNOPQRSTUVWXYZ | 26 | 676 | 17 576 | 4.70 | 4.70 | 4.70 | 4.70 | 4.69 | 4.69 | 63.53 % | 59.79 % | 61.70 % |
uniform-binary | ASCII | 1024 MiB | 1 073 741 824 | 2 | 2.00 | 01 | 2 | 4 | 8 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 14.91 % | 16.01 % | 13.41 % |
fibonacci | ASCII | 1024 MiB | 1 073 741 824 | 2 | 1.89 | 01 | 2 | 3 | 4 | 0.96 | 0.59 | 0.37 | 0.37 | 0.23 | 0.23 | 0.44 % | 0.01 % | 0.18 % |
markov-english | ASCII | 1024 MiB | 1 073 741 824 | 139 | 15.55 | !"#$%&'()*+,-./0123456789:;<=>? @A... | 139 | 8 624 | 207 137 | 4.62 | 3.72 | 3.00 | 2.41 | 2.40 | 2.38 | 47.43 % | 38.54 % | 38.34 % |
markov-dna4 | ASCII | 1024 MiB | 1 073 741 824 | 4 | 3.89 | ACGT | 4 | 16 | 64 | 1.98 | 1.94 | 1.93 | 1.92 | 1.91 | 1.90 | 27.37 % | 26.11 % | 24.51 % |
markov-protein | ASCII | 1024 MiB | 1 073 741 824 | 26 | 16.02 | ABCDEFGHIJKLMNOPQRSTUVWXYZ | 26 | 625 | 10 906 | 4.14 | 4.13 | 4.12 | 4.11 | 4.04 | 4.00 | 58.47 % | 54.02 % | 54.42 % |
Explanation of the columns:
- File name:
- File name of the text.
- Encoding:
- Character encoding of the text.
- File size:
- File size of the text counted in MiB = 220 bytes.
- Sequence length:
- Length in characters of the text. This can be different from the file size because here we count actual characters and not bytes.
- Inverse probability of matching:
- This is the reciprocal of the probability that two randomly chosen text characters match. This measure can be used as an indication of size of the effectively used alphabet. For uniformly distributed texts, it equals the alphabet size [FN05].
- Alphabet characters:
- The set of characters occuring in the text collection.
- q-grams:
- Number of different substrings of length q.
- Hk:
- Empiciral entropy of order k, a measure for the randomness of a string. See, for example, Giovanni Manzini (2001):
An analysis of the Burrows-Wheeler transform
. (Some higher order entropies could not be computed because of the exploding memory requirements of the big alphabets; the values in italics are estimates.) - gzip, bzip2, xz:
- Ratio of the compressed file divided by the original file size (using highest compression level: gzip -9 / bzip2 -9 / xz -9). The value of the respective best compressor is indicated in bold.
All texts are available for download (see above). Additionally there are prefixes of the texts available with sequence lengths that are powers of 2 (between 216 ≈ 64 KiB and 232 ≈ 4 GiB):
We also compiled a collection of patterns for approximate pattern matching.