LEA

tt-analyze.cpp File Reference


Detailed Description

tt-analyze efficiently calculates statistical properties of texts (such as entropy, the mutual information function, qgram distributions and word distributions) and estimates parameters for the probability models implemented by tt-generate (such as the discrete autoregressive process, Markov chains and the approximate repeats model by Allison et al. (1998)).

The tt-generate and tt-analyze tools are intended to be used together. More details on the tools can be found in Andre Dau (2010): Analysis of the structure and statistical properties of texts and generation of random texts.

tt-analyze requires configuration settings, which can be specified in a settings file or directly on the command line (see Usage). The file sample_config.ini, which ships with the source code, lists all available settings with a short explanation of each. If a setting is not specified, its default value is used.

To further facilitate the estimation of parameters and the generation of new random texts, tt-analyze supports presets. A preset defines a minimal set of tt-analyze settings that is sufficient to estimate all parameters required by a specific probability model of tt-generate.

If a setting is defined in multiple places, the precedence is as follows: command line argument over settings file over preset.
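
The precedence rule can be illustrated with a small lookup routine. This is only a sketch under stated assumptions, not the actual tt-analyze implementation: the Settings type, the resolve function and the "section::setting" key format are hypothetical names chosen for illustration.

```cpp
#include <initializer_list>
#include <map>
#include <string>

// Hypothetical container: maps a "section::setting" key to its value.
using Settings = std::map<std::string, std::string>;

// Returns the value of `key` from the highest-priority source that defines it:
// command line first, then settings file, then preset, then the default.
std::string resolve(const std::string& key,
                    const Settings& commandLine,
                    const Settings& settingsFile,
                    const Settings& preset,
                    const std::string& defaultValue) {
    for (const Settings* source : {&commandLine, &settingsFile, &preset}) {
        auto it = source->find(key);
        if (it != source->end()) {
            return it->second;
        }
    }
    return defaultValue;  // no source defines the setting
}
```

A setting given only in the preset is returned from the preset; as soon as the settings file or the command line defines the same key, that value wins.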

The results can be written to a result directory as CSV files. Along with the results, a file settings.ini is generated. It contains all settings used as well as general information such as the complete path to the source file, the date of execution, the file length and the command line string. The naming convention for the output files is [module_name]_[content_type].csv.
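
As a minimal sketch of the naming convention, the helper below builds an output file name from a module name and a content type. The function name outputFileName is hypothetical, and the content type string in the usage note is an assumption; the actual strings used by tt-analyze are not listed here.

```cpp
#include <string>

// Builds "[module_name]_[content_type].csv" from its two parts.
std::string outputFileName(const std::string& moduleName,
                           const std::string& contentType) {
    return moduleName + "_" + contentType + ".csv";
}
```

For example, outputFileName("Entropy", "block_entropy") yields "Entropy_block_entropy.csv", where "block_entropy" is an illustrative content type only.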

The contents of the files can also be written to stdout. This is especially useful when combining tt-analyze and tt-generate since this allows direct piping from tt-analyze to tt-generate.

WARNING: When printing to stdout, all modules must be executed one after another. Otherwise the modules print to stdout simultaneously, resulting in a corrupt stream. This means every module needs a unique thread group number in [module_selection] (see sample_config.ini for more details).

Modules:

 FrequencyDistribution   --   character distribution
                              bigram distribution
                              word distribution
 Correlation             --   character distribution
                              mutual information function
                              estimated autocorrelation parameters of a Discrete Autoregressive Process  
                              (see: Jacobs, P. A. & Lewis, P. A. W.: 
                                    Stationary Discrete Autoregressive-Moving Average Time Series 
                                    Generated By Mixtures
                                    In: Journal of Time Series Analysis 4 (1983), Nr. 1, pp. 19-36
                                    http://dx.doi.org/10.1111/j.1467-9892.1983.tb00354.x)
                              (see: Dehnert, M. & Helm, W. E. & Huett, M.-Th.: 
                                    A Discrete Autoregressive Process as a model for short-range 
                                    correlations in DNA sequences
                                    In: Physica A 327 (2003), pp. 535-553
                                    http://dx.doi.org/10.1016/S0378-4371(03)00399-6)
                              (see: Huett, M.-Th. & Dehnert, M. : 
                                    Methoden der Bioinformatik: Eine Einfuehrung
                                    Springer, 2006)
 Entropy                 --   character distribution
                              qgram distribution
                              block entropy
                              conditional entropy
                              (The entropy estimator uses a correction term by Miller; 
                              see: Schuermann, T. & Grassberger, P.: 
                                   Entropy estimation of symbol sequences 
                                   In: CHAOS 6 (1996), Nr. 3, pp. 414-427
                                   http://dx.doi.org/10.1063/1.166191)
 ApproximateRepeats      --   character distribution
                              qgram distribution
                              approximate repeat model parameters:
                                  direct repeat
                                  inverted repeat
                                  mirror repeat
                              (see: Allison, L. & Edgoose, T. & Dix, T. I.: 
                                    Compression of Strings with Approximate Repeats
                                    In: Intelligent Systems in Mol. Biol. (1998), pp. 8-16)
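
To make the Entropy module's quantities more concrete, the following sketch estimates the block entropy of qgrams with the Miller correction term (M - 1) / (2N), where M is the number of distinct observed qgrams and N the number of qgrams counted, and derives the conditional entropy as the difference of two block entropies. It is not the tt-analyze implementation: the function names, the use of overlapping qgrams and the natural logarithm (results in nats) are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>

// Block entropy H_q in nats, estimated from overlapping qgrams of `text`,
// with the Miller bias correction (M - 1) / (2N) added to the naive estimate.
double blockEntropyMiller(const std::string& text, std::size_t q) {
    if (q == 0 || text.size() < q) {
        return 0.0;  // by convention H_0 = 0; too-short texts yield 0 as well
    }

    // Count every overlapping qgram.
    std::map<std::string, std::size_t> counts;
    const std::size_t n = text.size() - q + 1;  // N: number of qgrams
    for (std::size_t i = 0; i < n; ++i) {
        ++counts[text.substr(i, q)];
    }

    // Naive (maximum likelihood) estimate: -sum p * ln p.
    double h = 0.0;
    for (const auto& entry : counts) {
        const double p = static_cast<double>(entry.second) / static_cast<double>(n);
        h -= p * std::log(p);
    }

    // Miller correction with M = number of distinct observed qgrams.
    h += static_cast<double>(counts.size() - 1) / (2.0 * static_cast<double>(n));
    return h;
}

// Conditional entropy h_q = H_{q+1} - H_q.
double conditionalEntropy(const std::string& text, std::size_t q) {
    return blockEntropyMiller(text, q + 1) - blockEntropyMiller(text, q);
}
```

For the text "abab" and q = 1, the naive estimate is ln 2 and the correction adds (2 - 1) / (2 * 4) = 0.125.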

Usage:

     tt-analyze [preset] [arguments]

Parameters:

     preset       --  Zero or one of the setting presets from the list below.
     arguments    --  Zero or more arguments from the list below.

Presets:

      markov <markov_order>                           --  Preset to estimate parameters for a 
                                                          Markov chain
                                                          Enables the Entropy module and activates 
                                                          all submodules except entropy calculation
      dar <dar_process_order>                         --  Preset to estimate parameters for a 
                                                          Discrete Autoregressive Process
                                                          Enables the Correlation module and disables 
                                                          the mutual information submodule
      repeatsDna <number_iterations>      \
                 <markov_order>           \
                 <minimum_hit_length>     \
                 <minimum_region_size>    \
                 <minimum_relative_probability>       --  Preset to estimate parameters for the repeat
                                                          model (by Allison et al.) on DNA sequences
                                                          Alphabet is set to [ACGT], A is inverse
                                                          to T and C to G
                                                          Character case is ignored
                                                          Enables the ApproximateRepeats module and all
                                                          three repeat models

Arguments:

      -o <directory>              -  output_directory (default: stdout)
      -s <file>                   -  settings file (default: default settings)
      -i <file>                   -  input file (default: read from stdin)
                                     NOTE: Because of implementation details -i <file> should always 
                                     be preferred to stdin for large files
      -id <integer>               -  file id which will be printed in result headers (default: -1)
      --stdout                    -  print to stdout (default: only print to stdout if -o <directory> 
                                     is not specified)
      -Dsection::setting=value    -  set 'value' for 'setting' in 'section'
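
The -Dsection::setting=value form can be split into its three parts with plain string searches. The sketch below shows one way to do this; the Override struct and the parseOverride function are hypothetical and do not reflect how tt-analyze actually parses its arguments.

```cpp
#include <cstddef>
#include <string>

// Hypothetical holder for one command line override.
struct Override {
    std::string section;
    std::string setting;
    std::string value;
};

// Parses an argument of the form "-Dsection::setting=value".
// Returns true and fills `out` on success; leaves `out` untouched otherwise.
bool parseOverride(const std::string& arg, Override& out) {
    const std::string prefix = "-D";
    if (arg.compare(0, prefix.size(), prefix) != 0) {
        return false;  // not a -D argument
    }

    const std::size_t sep = arg.find("::", prefix.size());
    if (sep == std::string::npos) {
        return false;  // missing section separator
    }
    const std::size_t eq = arg.find('=', sep + 2);
    if (eq == std::string::npos) {
        return false;  // missing value assignment
    }

    Override parsed;
    parsed.section = arg.substr(prefix.size(), sep - prefix.size());
    parsed.setting = arg.substr(sep + 2, eq - (sep + 2));
    parsed.value = arg.substr(eq + 1);  // may be empty
    if (parsed.section.empty() || parsed.setting.empty()) {
        return false;
    }

    out = parsed;
    return true;
}
```

Applied to -Dmodule_selection::ApproximateRepeats=0, this yields the section module_selection, the setting ApproximateRepeats and the value 0.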

Examples:

  • Piping the results from tt-analyze to tt-generate using preset markov:
         $ tt-analyze markov 5 -i test.txt | tt-generate 500 markov > generated.txt
    
  • Complex example: use a settings file, disable the ApproximateRepeats module via the command line, write to both an output directory and stdout, and set the file id printed in the result file headers:
         $ tt-analyze -s sample_config.ini --stdout -Dmodule_selection::ApproximateRepeats=0  \
                      -o ~/result_dir -i dna.fasta -id 123 
    
Returns:
0 on success, a non-zero value on error
