LEA

file_statistics.cpp File Reference


Detailed Description

Efficient computation of some text statistics for a given file, including the text length, alphabet size, number of distinct q-grams and empirical entropy.

The input is read from stdin (with a single pass over the text) and the output is written to stdout. The output is formatted as comma-separated values so that it can easily be written to a CSV file. UTF-8 encoded strings are also supported.

Depending on the command line parameter STATISTIC, different values can be computed.

  • simple: The output will be the text length (number of actual characters, which also works with multi-byte encodings), the alphabet characters (concatenated set of characters occurring in the text, with non-printable characters replaced by a "_"), and the size of the alphabet (equal to the length of the former string).
  • qgrams_X: The output will be the number of different substrings of length X.
  • entropy_X: The output will be the empirical entropy of order X. The calculation of the empirical entropy is based on the following paper: Giovanni Manzini (2001): "An analysis of the Burrows-Wheeler transform" http://dx.doi.org/10.1145/382780.382782
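
The qgrams_X and entropy_X statistics can be illustrated with a minimal C++ sketch. It is not the code of file_statistics.cpp: the function names count_distinct_qgrams and empirical_entropy are hypothetical, the text is assumed to be single-byte and held completely in memory (the program itself reads its input in a single pass), and the entropy is assumed to follow the usual order-k definition H_k(s) = (1/n) * sum over all contexts w of |w_s| * H_0(w_s), where w_s is the string of characters that follow the context w in s.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_set>

// Hypothetical helper: number of distinct substrings of length q.
std::size_t count_distinct_qgrams(const std::string& text, std::size_t q) {
    if (q == 0 || text.size() < q) return 0;
    std::unordered_set<std::string> seen;
    for (std::size_t i = 0; i + q <= text.size(); ++i) {
        seen.insert(text.substr(i, q));
    }
    return seen.size();
}

// Hypothetical helper: empirical entropy of order k in bits per character.
// For every context of length k, the characters following it are counted;
// the order-0 entropies of these follower strings are then summed up,
// weighted by their lengths, and divided by the text length n.
double empirical_entropy(const std::string& text, std::size_t k) {
    const std::size_t n = text.size();
    if (n <= k) return 0.0;

    // context -> (following character -> number of occurrences)
    std::map<std::string, std::map<char, std::size_t> > followers;
    for (std::size_t i = 0; i + k < n; ++i) {
        ++followers[text.substr(i, k)][text[i + k]];
    }

    double bits = 0.0;
    for (const auto& context : followers) {
        std::size_t total = 0;
        for (const auto& entry : context.second) total += entry.second;
        for (const auto& entry : context.second) {
            const double c = static_cast<double>(entry.second);
            bits += c * std::log2(static_cast<double>(total) / c);
        }
    }
    return bits / static_cast<double>(n);
}
```

For the string "mississippi", count_distinct_qgrams(text, 1) yields 4 and count_distinct_qgrams(text, 3) yields 7. Storing every context as a std::string keeps the sketch short but costs far more memory than a program meant for gigabyte-sized inputs could afford.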

Usage:

  •      file_statistics STATISTIC [ENCODING=single-byte [FILETYPE=plain]] < FILE
    
Parameters:
STATISTIC: The name of the statistic to compute. Has to be one of the following: (simple | qgrams_X | entropy_X) where X is an integer.
ENCODING: The encoding of the input file. Has to be one of the following: (single-byte | UTF-8).
FILETYPE: Whether the input file should be treated as a regular text file or as a FASTA file. The only difference is that for a FASTA file all lines starting with a > will be ignored. Has to be one of the following: (plain | fasta).

Examples:

  • Determine length, alphabet characters, and alphabet size of the single line "mississippi":
         $ ./file_statistics simple < mississippi.txt
         "11";"psim";"4";
    
  • Determine the number of different 3-grams in a Chinese text:
         $ ./file_statistics qgrams_3 UTF-8 plain < chinese.txt
         "135344";
    
  • Calculate the empirical entropy of order 2 of a FASTA file:
         $ ./file_statistics entropy_2 single-byte fasta < test.fasta
         "1.94";
    
Returns:
0 on success, a non-zero value on error
Remarks:
With this program, statistics for texts of several gigabytes, for example the DNA sequence of the human genome, can be computed on a regular desktop computer (2 gigabytes of RAM). The following table shows some examples of computed values. (Empty cells mark values that could not be computed efficiently because of the large text and/or alphabet size.)
     File                | length        | qgrams_1 |  qgrams_2 |   qgrams_3 | ent_0 | ent_1 | ent_2 | ent_3 | ent_4 | ent_5 
     --------------------+---------------+----------+-----------+------------+-------+-------+-------+-------+-------+-------
     DNA of human genome | 3,095,677,412 |        7 |        31 |        133 |  2.21 |  1.79 |  1.78 |  1.77 |  1.77 |  1.76 
     Texts (English)     | 8,790,836,971 |      184 |    10,762 |    324,678 |  4.50 |  3.03 |       |       |       |       
     Texts (German)      |   210,528,730 |      189 |    10,121 |    137,459 |  4.52 |  3.59 |  2.96 |  2.49 |  2.15 |  1.90 
     Texts (Chinese)     |    51,649,808 |   16,564 | 2,706,626 | 16,329,056 |  9.51 |  7.38 |  4.82 |       |       |       

Download: