tt-analyze.cpp File Reference
Detailed Description
tt-analyze efficiently calculates several statistical properties of texts (such as entropy, the mutual information function, qgram distributions and word distributions) and estimates parameters for the probability models implemented by tt-generate (such as a discrete autoregressive process, a Markov chain and the approximate repeats model by Allison et al. (1998)).
The tt-generate and tt-analyze tools are intended to be used together. More details on the tools can be found in Andre Dau (2010): Analysis of the structure and statistical properties of texts and generation of random texts.
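To make one of the statistics named above concrete, the following is a minimal sketch of a plug-in estimate of the mutual information function at distance k. The function name `mutualInformation` is hypothetical, and the sketch estimates the marginals from the same character pairs as the joint distribution; tt-analyze itself may estimate them differently.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Plug-in estimate of the mutual information (in bits) between characters
// that are k positions apart in `text`. Hypothetical helper, not part of tt-analyze.
double mutualInformation(const std::string& text, std::size_t k) {
    if (k == 0 || text.size() <= k) return 0.0;
    const double pairs = static_cast<double>(text.size() - k);

    std::map<std::pair<char, char>, double> joint;  // counts of (text[i], text[i + k])
    std::map<char, double> first;                   // counts of text[i]
    std::map<char, double> second;                  // counts of text[i + k]
    for (std::size_t i = 0; i + k < text.size(); ++i) {
        joint[std::make_pair(text[i], text[i + k])] += 1.0;
        first[text[i]] += 1.0;
        second[text[i + k]] += 1.0;
    }

    // I(k) = sum over (a, b) of p_ab * log2(p_ab / (p_a * p_b))
    double info = 0.0;
    for (const auto& entry : joint) {
        const double pab = entry.second / pairs;
        const double pa = first[entry.first.first] / pairs;
        const double pb = second[entry.first.second] / pairs;
        info += pab * std::log2(pab / (pa * pb));
    }
    return info;
}
```

Evaluating this for k = 1, 2, 3, ... gives the mutual information as a function of distance; a text of independent characters yields values close to zero.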
tt-analyze needs configuration settings, which can be specified in a settings file or directly on the command line (see Usage). The file sample_config.ini, which comes with the source code, contains all available settings and a short explanation of each one. If a setting is not set, a default value is used.
To further facilitate the estimation of parameters and the generation of new random texts, tt-analyze supports presets. A preset defines a minimal set of settings for tt-analyze that is sufficient to estimate all parameters required by a specific probability model of tt-generate.
If a setting is defined in multiple places, the precedence is as follows: command line argument over settings file over preset.
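The precedence rule can be sketched as a layered lookup. This is only an illustration: the `Settings` type, the `resolve` function and the "section::setting" key format are assumptions of this sketch, not the data structures used by tt-analyze.

```cpp
#include <map>
#include <string>

// Hypothetical representation of one settings source: "section::setting" -> value.
using Settings = std::map<std::string, std::string>;

// Resolves a setting using the documented precedence:
// command line over settings file over preset, falling back to the default value.
std::string resolve(const std::string& key,
                    const Settings& commandLine,
                    const Settings& settingsFile,
                    const Settings& preset,
                    const std::string& defaultValue) {
    const Settings* sources[] = {&commandLine, &settingsFile, &preset};
    for (const Settings* source : sources) {
        const auto it = source->find(key);
        if (it != source->end()) return it->second;  // first source that defines the key wins
    }
    return defaultValue;
}
```

With this ordering, a value given as -Dsection::setting=value on the command line always overrides the same setting from the settings file or from a preset.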
The results can be written to a result directory as CSV files. Along with the results, a file settings.ini is generated. It contains all settings used as well as general information such as the complete path to the source file, the date of execution, the file length and the command line string. The naming convention for the output files is [module_name]_[content_type].csv.
The contents of the files can also be written to stdout. This is especially useful when combining tt-analyze and tt-generate, since it allows piping directly from tt-analyze to tt-generate.
WARNING: When printing to stdout, all modules must be executed one after another. Otherwise the modules print to stdout simultaneously, resulting in a corrupt stream. This means every module needs a unique thread group number in [module_selection] (see sample_config.ini for more details).
Modules:
FrequencyDistribution --
    character distribution
    bigram distribution
    word distribution
Correlation --
    character distribution
    mutual information function
    estimated autocorrelation parameters of a Discrete Autoregressive Process
    (see: Jacobs, P. A. & Lewis, P. A. W.: Stationary Discrete Autoregressive-Moving Average Time Series Generated By Mixtures In: Journal of Time Series Analysis 4 (1983), Nr. 1, pp. 19-36 http://dx.doi.org/10.1111/j.1467-9892.1983.tb00354.x)
    (see: Dehnert, M. & Helm, W. E. & Huett, M.-Th.: A Discrete Autoregressive Process as a model for short-range correlations in DNA sequences In: Physica A 327 (2003), pp. 535-553 http://dx.doi.org/10.1016/S0378-4371(03)00399-6)
    (see: Huett, M.-Th. & Dehnert, M.: Methoden der Bioinformatik: Eine Einfuehrung Springer, 2006)
Entropy --
    character distribution
    qgram distribution
    block entropy
    conditional entropy
    (The entropy estimator uses a correction term by Miller; see: Schuermann, T. & Grassberger, P.: Entropy estimation of symbol sequences In: CHAOS 6 (1996), Nr. 3, pp. 414-427 http://dx.doi.org/10.1063/1.166191s)
ApproximateRepeats --
    character distribution
    qgram distribution
    approximate repeat model parameters:
        direct repeat
        inverted repeat
        mirror repeat
    (see: Allison, L. & Edgoose, T. & Dix, T. I.: Compression of Strings with Approximate Repeats In: Intelligent Systems in Mol. Biol. (1998), pp. 8-16)
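As an illustration of the Entropy module's block entropy with a Miller correction, here is a minimal sketch. It assumes the common first-order form of the correction, (M - 1) / (2N), where M is the number of distinct qgrams observed and N the number of qgrams counted. The function name `blockEntropyMiller` is hypothetical, and the estimator actually implemented in tt-analyze may differ in detail.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>

// Block entropy (in bits) of the overlapping qgrams of `text`: the plug-in
// estimate plus Miller's first-order bias correction. Hypothetical helper.
double blockEntropyMiller(const std::string& text, std::size_t q) {
    if (q == 0 || text.size() < q) return 0.0;

    // Count every overlapping qgram.
    const std::size_t n = text.size() - q + 1;
    std::map<std::string, std::size_t> counts;
    for (std::size_t i = 0; i < n; ++i) ++counts[text.substr(i, q)];

    // Plug-in (maximum likelihood) entropy of the qgram distribution.
    double entropy = 0.0;
    for (const auto& entry : counts) {
        const double p = static_cast<double>(entry.second) / static_cast<double>(n);
        entropy -= p * std::log2(p);
    }

    // (M - 1) / (2N) is in nats; dividing by ln 2 converts it to bits.
    const double correction = (counts.size() - 1.0) / (2.0 * n * std::log(2.0));
    return entropy + correction;
}
```

Under the usual definition, a conditional entropy of order q can then be obtained as the difference between the block entropies for q + 1 and q.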
Usage:
tt-analyze [preset] [arguments]
- Parameters:
    preset      Zero or one of the setting presets from the list below.
    arguments   Zero or more arguments from the list below.
Presets:
markov <markov_order>
    -- Preset to estimate parameters for a Markov chain
       Enables the Entropy module and activates all submodules except entropy calculation
dar <dar_process_order>
    -- Preset to estimate parameters for a Discrete Autoregressive Process
       Enables the Correlation module and disables the mutual information submodule
repeatsDna <number_iterations> \
           <markov_order> \
           <minimum_hit_length> \
           <minimum_region_size> \
           <minimum_relative_probability>
    -- Preset to estimate parameters for the repeat model (by Allison et al.) on DNA sequences
       Alphabet is set to [ACGT], A is inverse to T and C to G
       Character case is ignored
       Enables the ApproximateRepeats module and all three repeat models
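The alphabet rules of the repeatsDna preset (A is inverse to T, C to G, case ignored) can be illustrated with a short sketch. It assumes the usual reading of the repeat types: a mirror repeat copies a region read backwards, and an inverted repeat copies it read backwards with every character replaced by its inverse. The function names are hypothetical and do not come from the tt-analyze source.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Inverse of one DNA character: A <-> T, C <-> G, ignoring case.
// Recognized characters are returned in upper case; anything outside
// [ACGT] is returned unchanged (an assumption of this sketch).
char complement(char c) {
    switch (std::toupper(static_cast<unsigned char>(c))) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:  return c;
    }
}

// Source of a mirror repeat: the sequence read backwards.
std::string mirrored(std::string s) {
    std::reverse(s.begin(), s.end());
    return s;
}

// Source of an inverted repeat: the sequence read backwards,
// with every character replaced by its inverse.
std::string inverted(std::string s) {
    std::reverse(s.begin(), s.end());
    std::transform(s.begin(), s.end(), s.begin(), complement);
    return s;
}
```

A direct repeat needs no transformation: the region is copied as it is.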
Arguments:
-o <directory>             - output directory (default: stdout)
-s <file>                  - settings file (default: default settings)
-i <file>                  - input file (default: read from stdin)
                             NOTE: Because of implementation details, -i <file> should always be preferred to stdin for large files
-id <integer>              - file id which will be printed in result headers (default: -1)
--stdout                   - print to stdout (default: only print to stdout if -o <directory> is not specified)
-Dsection::setting=value   - set 'value' for 'setting' in 'section'
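The shape of the -Dsection::setting=value argument can be illustrated with a small parser. This is a sketch only: the `Override` struct and `parseOverride` function are hypothetical, and the choices to accept an empty value and to reject a '=' that appears before the "::" separator are assumptions, not documented behaviour of tt-analyze.

```cpp
#include <cstddef>
#include <string>

// Parsed form of one "-Dsection::setting=value" argument (hypothetical type).
struct Override {
    std::string section;
    std::string setting;
    std::string value;
};

// Splits an argument of the form "-Dsection::setting=value" into its parts.
// Returns false, leaving `out` untouched, if the argument does not have that shape.
bool parseOverride(const std::string& arg, Override& out) {
    if (arg.compare(0, 2, "-D") != 0) return false;
    const std::size_t sep = arg.find("::", 2);  // end of the section name
    const std::size_t eq = arg.find('=', 2);    // end of the setting name
    if (sep == std::string::npos || eq == std::string::npos) return false;
    if (sep == 2 || sep + 2 >= eq) return false;  // empty section or empty setting
    out.section = arg.substr(2, sep - 2);
    out.setting = arg.substr(sep + 2, eq - (sep + 2));
    out.value = arg.substr(eq + 1);
    return true;
}
```

For the argument -Dmodule_selection::ApproximateRepeats=0, this yields the section module_selection, the setting ApproximateRepeats and the value 0.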
Examples:
- Piping the results from tt-analyze to tt-generate using the preset markov:
$ tt-analyze markov 5 -i test.txt | tt-generate 500 markov > generated.txt
- Complex example: use a settings file, disable the ApproximateRepeats module via the command line, write to both an output directory and stdout, and specify the file id printed in the result file headers:
$ tt-analyze -s sample_config.ini --stdout -Dmodule_selection::ApproximateRepeats=0 \
      -o ~/result_dir -i dna.fasta -id 123
- Returns:
- 0 on success, a non-zero value on error
Download:
- The newest version of this tool can be downloaded from http://wwwmayr.in.tum.de/spp1307/downloads.html