tt-generate.cpp File Reference
Detailed Description
tt-generate generates random texts and lets the user choose between different probability models (a discrete autoregressive process, the approximate repeats model by Allison et al. (1998), a Markov chain, a uniform distribution, or the Fibonacci word).
The more complex models need parameters, which can be estimated from texts using tt-analyze.
The tt-generate and tt-analyze tools are intended to be used together. More details on the tools can be found in Andre Dau (2010): Analysis of the structure and statistical properties of texts and generation of random texts.
The required format of the parameter files is best described by the examples in the sample output folder, which comes bundled with the source code.
All input files have to be in CSV format and must have a header. The last line of the header must contain the field content_type, which specifies the type of the file. Different probability models need different parameter file types as input.
Input files can also be read directly from stdin, making it possible to pipe the output of tt-analyze to tt-generate. To pass multiple files via stdin, the files have to be concatenated; the order is arbitrary. If a parameter file is passed both via stdin and on the command line, the file specified on the command line is always preferred.
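The precedence rule for parameter files can be illustrated with a small sketch. It is not taken from tt-generate.cpp: the ParameterSource struct, the resolve_sources function and the representation of stdin input as a list of content_type values are all hypothetical and chosen only to show the rule that a file given on the command line overrides one of the same type received via stdin.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical description of where a parameter file is read from.
struct ParameterSource {
    bool from_stdin;   // true if the file arrived via stdin
    std::string path;  // path given with -p; empty when from_stdin is true
};

// Hypothetical helper: decides, per content_type, which source is used.
// cli_files maps content_type -> path passed with -p on the command line.
// stdin_types lists the content_type values found in the stdin stream.
std::map<std::string, ParameterSource> resolve_sources(
    const std::map<std::string, std::string>& cli_files,
    const std::vector<std::string>& stdin_types) {
    std::map<std::string, ParameterSource> sources;

    // Start with everything that was piped in via stdin.
    for (const std::string& type : stdin_types) {
        sources[type] = ParameterSource{true, ""};
    }

    // Command-line files overwrite stdin entries of the same type.
    for (const auto& entry : cli_files) {
        sources[entry.first] = ParameterSource{false, entry.second};
    }
    return sources;
}
```

With character_distribution supplied both ways and qgram_distribution only via stdin, the sketch selects the command-line file for the former and the stdin copy for the latter.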
Usage:
tt-generate <file_length> <model> [arguments]
Parameters:
    file_length - The length of the file to be generated.
    model - A probability model (and its parameters) from the list below.
    arguments - Zero or more arguments from the list below.
Arguments:
    -o <file> - output file (default: stdout)
    -p <parameter_type> <file> - specify parameter file (default: read from stdin)
    --stdout - print to stdout (default: only print to stdout if -o <file> is not specified)
Models:
    markov -- Markov chain
    dar -- Discrete Autoregressive Process dar(p)
        (see: Jacobs, P. A. & Lewis, P. A. W.: Stationary Discrete Autoregressive-Moving Average Time Series Generated By Mixtures. In: Journal of Time Series Analysis 4 (1983), Nr. 1, pp. 19-36, http://dx.doi.org/10.1111/j.1467-9892.1983.tb00354.x)
        (see: Dehnert, M. & Helm, W. E. & Huett, M.-Th.: A Discrete Autoregressive Process as a model for short-range correlations in DNA sequences. In: Physica A 327 (2003), pp. 535-553, http://dx.doi.org/10.1016/S0378-4371(03)00399-6)
        (see: Huett, M.-Th. & Dehnert, M.: Methoden der Bioinformatik: Eine Einfuehrung. Springer, 2006)
    repeats -- Repeat machine
        (see: Allison, L. & Edgoose, T. & Dix, T. I.: Compression of Strings with Approximate Repeats. In: Intelligent Systems in Mol. Biol. (1998), pp. 8-16)
    uniform <alphabet> -- Uniform distribution of characters
        <alphabet> is either a string containing all symbols of the alphabet or one of the following presets:
        --dna -> ACGT
        --dna5 -> ACGTN
        --rna -> ACGU
        --rna5 -> ACGUN
        --amino -> ARNDCEQGHILKMFPSTWYV
        --amino23 -> ARNDCEQGHILKMFPSTWYVBZX
    fibonacci [alphabet] [--use-tmp-file] -- Fibonacci word
        All arguments are optional.
        Alphabet is a string consisting of two characters. Default alphabet: "ab"
        If --use-tmp-file is specified, the Fibonacci word is created using a (temporary) file as a buffer. This is usually much slower but can save memory.
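As an illustration of the fibonacci model, the following minimal sketch builds a prefix of the infinite Fibonacci word in memory by repeated concatenation (S_n = S_{n-1} S_{n-2}). The function name fibonacci_prefix is hypothetical and the code is not the implementation used in tt-generate.cpp; in particular, it does not model the --use-tmp-file option.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: returns the first `length` characters of the infinite
// Fibonacci word over the two-letter alphabet {a, b}.
std::string fibonacci_prefix(std::size_t length, char a = 'a', char b = 'b') {
    std::string prev(1, b);  // S_0 = "b"
    std::string curr(1, a);  // S_1 = "a"

    // Each step appends the previous word to the current one:
    // S_n = S_{n-1} S_{n-2}. Every S_n is a prefix of the infinite word.
    while (curr.size() < length) {
        std::string next = curr + prev;
        prev = std::move(curr);
        curr = std::move(next);
    }

    curr.resize(length);  // cut the finite word down to the requested length
    return curr;
}
```

Calling fibonacci_prefix(12, '0', '1') yields 010010100100, the same text as in the first example of the Examples section.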
Required parameter file types:
    markov --
        character_distribution (required)
        qgram_distribution (optional; for higher order markov chains)
    dar --
        character_distribution (required)
        autocorrelation_dar (required)
    repeats --
        character_distribution (required)
        qgram_distribution (optional; for higher order markov chains)
        direct_repeat (optional; for direct repeats)
        mirror_repeat (optional; for mirror repeats)
        inverted_repeat (optional; for inverted repeats)
    fibonacci -- none
    uniform -- none
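To show why the dar model needs both a character distribution and an autocorrelation parameter, here is a minimal sketch of a first-order process DAR(1) in the sense of Jacobs & Lewis: each symbol repeats its predecessor with probability rho and is otherwise drawn independently from the character distribution. The function name dar1_text and its parameters are hypothetical. How tt-generate derives its parameters from the autocorrelation_dar file, and how it handles orders p > 1, is not covered by this sketch.

```cpp
#include <cstddef>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: generates `length` symbols of a DAR(1) process.
// alphabet[i] occurs with marginal probability proportional to weights[i];
// rho in [0, 1] is the probability of copying the previous symbol.
std::string dar1_text(std::size_t length,
                      const std::string& alphabet,
                      const std::vector<double>& weights,
                      double rho,
                      std::mt19937& rng) {
    if (alphabet.empty() || alphabet.size() != weights.size()) {
        throw std::invalid_argument("alphabet and weights must match in size");
    }
    if (rho < 0.0 || rho > 1.0) {
        throw std::invalid_argument("rho must lie in [0, 1]");
    }

    std::discrete_distribution<std::size_t> draw(weights.begin(), weights.end());
    std::bernoulli_distribution repeat(rho);

    std::string text;
    text.reserve(length);
    for (std::size_t i = 0; i < length; ++i) {
        if (i > 0 && repeat(rng)) {
            text.push_back(text.back());          // copy the previous symbol
        } else {
            text.push_back(alphabet[draw(rng)]);  // fresh independent draw
        }
    }
    return text;
}
```

With rho = 0 the sketch reduces to independent draws from the character distribution; larger values of rho produce longer runs of identical symbols while leaving the marginal distribution unchanged.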
Examples:
- The first 12 letters of the infinite fibonacci word using 0 and 1 as the alphabet:
    $ tt-generate 12 fibonacci "01"
    010010100100
- Uniform distribution over the dna alphabet 'ACGT':
    $ tt-generate 50 uniform --dna
    TTGATCTATGTCAGAATGCCTAAGAGTGTTGTGATCTGATGAACGCTCGT
- Uniform distribution over the alphabet '<>+-.,[]':
    $ tt-generate 90 uniform "<>+-.,[]"
    >>[[]<-<[<+->>..,,>.+++>.[<+<,,.-->,][>+<+,,]][][<--.,<>+>]<[+-.->+--.,<[[,]].,>>+<-<.>+->
- Markov chain of order 5. Pipe estimated parameters directly from tt-analyze to tt-generate:
$ tt-analyze markov 5 -i input.fasta | tt-generate 100 markov > output.fasta
- Complex example. Repeats model, print to file and stdout, parameter files given explicitly on the command line:
    $ tt-generate 30000000 repeats \
        -o sample_output.txt --stdout \
        -p qgram_distribution ApproximateRepeats_qgram_distribution.csv \
        -p character_distribution ApproximateRepeats_character_distribution.csv \
        -p direct_repeat ApproximateRepeats_direct_repeat.csv
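The uniform examples above can be mimicked with the C++ standard library. The sketch below is a hypothetical stand-alone function (uniform_text is not a name from tt-generate.cpp) and assumes that a std::mt19937 generator is an acceptable source of randomness; the actual tool may use a different generator and will therefore produce different texts.

```cpp
#include <cstddef>
#include <random>
#include <stdexcept>
#include <string>

// Hypothetical helper: draws `length` characters independently and uniformly
// from `alphabet`, e.g. "ACGT" for the --dna preset.
std::string uniform_text(std::size_t length,
                         const std::string& alphabet,
                         std::mt19937& rng) {
    if (alphabet.empty()) {
        throw std::invalid_argument("alphabet must not be empty");
    }

    // Uniform choice of an index into the alphabet string.
    std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);

    std::string text;
    text.reserve(length);
    for (std::size_t i = 0; i < length; ++i) {
        text.push_back(alphabet[pick(rng)]);
    }
    return text;
}
```

Seeding the generator with a fixed value makes the output reproducible within one standard library implementation, which is convenient for tests.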
Returns:
    0 on success, a nonzero value on error
Download:
- The newest version of this tool can be downloaded from http://wwwmayr.in.tum.de/spp1307/downloads.html