LEA

tt-generate.cpp File Reference


Detailed Description

tt-generate generates random texts and lets the user choose between different probability models: a discrete autoregressive process, the approximate repeats model of Allison et al. (1998), a Markov chain, a uniform distribution, or a Fibonacci word.


The more complex models need parameters which can be estimated from texts using tt-analyze.

The tt-generate and tt-analyze tools are intended to be used together. More details on the tools can be found in Andre Dau (2010): Analysis of the structure and statistical properties of texts and generation of random texts.

The required format of the parameter files is best described by the examples in the sample output folder, which comes bundled with the source code.

All input files have to be in CSV format and must have a header. The last line of the header must contain the field content_type, which specifies the type of the file. Different probability models need different parameter file types as input.

Input files can also be read directly from stdin, which makes it possible to pipe the output of tt-analyze to tt-generate. To pass multiple files via stdin, the files have to be concatenated; their order is arbitrary. If a parameter file is passed both via stdin and on the command line, the file specified on the command line is always preferred.
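
The precedence rule can be pictured with a small C++ sketch. It is not taken from the tt-generate sources: the `ParameterSet` type and the `merge_parameters` function are hypothetical names, and the assumption is that each parameter file has already been identified by its content_type.

```cpp
#include <map>
#include <string>

// Hypothetical representation: content_type -> contents (or path) of the file.
using ParameterSet = std::map<std::string, std::string>;

// Combine parameter files found on stdin with those named on the command
// line. A command-line entry wins over a stdin entry of the same type.
ParameterSet merge_parameters(const ParameterSet& from_stdin,
                              const ParameterSet& from_command_line) {
    ParameterSet merged = from_command_line;
    // std::map::insert leaves existing keys untouched, so the command-line
    // entries already in `merged` are never overwritten by stdin entries.
    merged.insert(from_stdin.begin(), from_stdin.end());
    return merged;
}
```

With a character_distribution supplied both ways and a qgram_distribution supplied only via stdin, the merged set would hold the command-line character_distribution and the stdin qgram_distribution.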

Usage:

     tt-generate <file_length> <model> [arguments]

Parameters:

      file_length  -  The length of the file to be generated.
      model        -  A probability model (and its parameters) from the list below.
      arguments    -  Zero or more arguments from the list below.

Arguments:

      -o <file>                    -  output file (default: stdout)
      -p <parameter_type> <file>   -  specify parameter file (default: read from stdin)
      --stdout                     -  print to stdout (default: only print to stdout if -o <file> 
                                      is not specified)

Models:

      markov                                  --  Markov chain
      dar                                     --  Discrete Autoregressive Process dar(p)
                                                  (see: Jacobs, P. A. & Lewis, P. A. W.: 
                                                        Stationary Discrete Autoregressive-Moving 
                                                        Average Time Series Generated By Mixtures
                                                        In: Journal of Time Series Analysis 4 (1983), 
                                                        Nr. 1, pp. 19-36
                                                        http://dx.doi.org/10.1111/j.1467-9892.1983.tb00354.x)
                                                  (see: Dehnert, M. & Helm, W. E. & Huett, M.-Th.: 
                                                        A Discrete Autoregressive Process as a model 
                                                        for short-range correlations in DNA sequences
                                                        In: Physica A 327 (2003), pp. 535-553
                                                        http://dx.doi.org/10.1016/S0378-4371(03)00399-6)
                                                  (see: Huett, M.-Th. & Dehnert, M. : 
                                                        Methoden der Bioinformatik: Eine Einfuehrung
                                                        Springer, 2006)
      repeats                                 --  Repeat machine 
                                                  (see: Allison, L. & Edgoose, T. & Dix, T. I.: 
                                                        Compression of Strings with Approximate Repeats
                                                        In: Intelligent Systems in Mol. Biol. (1998), 
                                                        pp. 8-16)
      uniform <alphabet>                      --  Uniform distribution of characters
                                                  <alphabet> is either a string containing all symbols 
                                                  of the alphabet or one of the following presets:
                                                      --dna           -> ACGT
                                                      --dna5          -> ACGTN
                                                      --rna           -> ACGU
                                                      --rna5          -> ACGUN
                                                      --amino         -> ARNDCEQGHILKMFPSTWYV
                                                      --amino23       -> ARNDCEQGHILKMFPSTWYVBZX
      fibonacci [alphabet] [--use-tmp-file]   --  Fibonacci word
                                                  All arguments are optional.
                                                  Alphabet is a string consisting of two characters. 
                                                  Default alphabet: "ab" 
                                                  If --use-tmp-file is specified, the Fibonacci word 
                                                  is created using a (temporary) file as a buffer.
                                                  This is usually much slower but can save memory space.
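
As an illustration of the uniform model, the following C++ sketch resolves the alphabet argument and draws characters independently and uniformly. The preset names and alphabets are copied from the list above; the function names `resolve_alphabet` and `uniform_text` and the use of `std::mt19937` are assumptions of this sketch, not the tool's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>

// Map a preset flag to its alphabet; any other argument is taken to be
// the alphabet itself.
std::string resolve_alphabet(const std::string& arg) {
    if (arg == "--dna")     return "ACGT";
    if (arg == "--dna5")    return "ACGTN";
    if (arg == "--rna")     return "ACGU";
    if (arg == "--rna5")    return "ACGUN";
    if (arg == "--amino")   return "ARNDCEQGHILKMFPSTWYV";
    if (arg == "--amino23") return "ARNDCEQGHILKMFPSTWYVBZX";
    return arg;
}

// Draw `length` characters independently and uniformly from `alphabet`.
// The random number engine is an assumption of this sketch.
std::string uniform_text(std::size_t length, const std::string& alphabet,
                         std::mt19937& rng) {
    assert(!alphabet.empty());
    std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);
    std::string text;
    text.reserve(length);
    for (std::size_t i = 0; i < length; ++i) {
        text.push_back(alphabet[pick(rng)]);
    }
    return text;
}
```

Calling `uniform_text(50, resolve_alphabet("--dna"), rng)` would yield a 50-character string over ACGT; the exact characters depend on the seed and engine.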

Required parameter file types:

      markov          --  character_distribution  (required)
                          qgram_distribution      (optional; for higher-order Markov chains)
      dar             --  character_distribution  (required)
                          autocorrelation_dar     (required)
      repeats         --  character_distribution  (required)
                          qgram_distribution      (optional; for higher-order Markov chains)
                          direct_repeat           (optional; for direct repeats)
                          mirror_repeat           (optional; for mirror repeats)
                          inverted_repeat         (optional; for inverted repeats)
      fibonacci       --  none
      uniform         --  none
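
The table above can be turned into a simple up-front check. The C++ sketch below lists only the required types per model and reports which ones are missing from a set of supplied content_type values. The names `required_types` and `missing_types` are hypothetical, and how tt-generate itself validates its inputs is not documented here.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Required parameter file types per model, transcribed from the table above.
// Optional types (e.g. qgram_distribution) are deliberately left out.
const std::map<std::string, std::vector<std::string>>& required_types() {
    static const std::map<std::string, std::vector<std::string>> table = {
        {"markov",    {"character_distribution"}},
        {"dar",       {"character_distribution", "autocorrelation_dar"}},
        {"repeats",   {"character_distribution"}},
        {"fibonacci", {}},
        {"uniform",   {}},
    };
    return table;
}

// Return the required types that are absent from `provided`.
// An unknown model name yields an empty result in this sketch.
std::vector<std::string> missing_types(const std::string& model,
                                       const std::set<std::string>& provided) {
    std::vector<std::string> missing;
    const auto it = required_types().find(model);
    if (it == required_types().end()) return missing;
    for (const std::string& type : it->second) {
        if (provided.count(type) == 0) missing.push_back(type);
    }
    return missing;
}
```

For the dar model with only a character_distribution supplied, `missing_types` would report autocorrelation_dar.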

Examples:

  • The first 12 letters of the infinite Fibonacci word using 0 and 1 as the alphabet:
         $ tt-generate 12 fibonacci "01"
         010010100100
    
  • Uniform distribution over the DNA alphabet 'ACGT':
         $ tt-generate 50 uniform --dna
         TTGATCTATGTCAGAATGCCTAAGAGTGTTGTGATCTGATGAACGCTCGT
    
  • Uniform distribution over the alphabet '<>+-.,[]':
         $ tt-generate 90 uniform "<>+-.,[]"
         >>[[]<-<[<+->>..,,>.+++>.[<+<,,.-->,][>+<+,,]][][<--.,<>+>]<[+-.->+--.,<[[,]].,>>+<-<.>+->
    
  • Markov chain of order 5. Pipe estimated parameters directly from tt-analyze to tt-generate:
         $ tt-analyze markov 5 -i input.fasta | tt-generate 100 markov > output.fasta
    
  • Complex example. Repeats model, print to file and stdout, parameter files given explicitly on the command line:
         $ tt-generate 30000000 repeats \
                  -o sample_output.txt --stdout \
                  -p qgram_distribution ApproximateRepeats_qgram_distribution.csv \
                  -p character_distribution ApproximateRepeats_character_distribution.csv \
                  -p direct_repeat ApproximateRepeats_direct_repeat.csv
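
The Fibonacci example above can be reproduced with a short C++ sketch based on the standard recurrence S(0) = a, S(1) = ab, S(n) = S(n-1) S(n-2). The function name `fibonacci_prefix` is hypothetical, and the sketch keeps the whole word in memory, so it illustrates the default mode rather than the --use-tmp-file variant. It is not the tool's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// First `length` characters of the infinite Fibonacci word over a
// two-character alphabet (default "ab", as in the usage text).
std::string fibonacci_prefix(std::size_t length,
                             const std::string& alphabet = "ab") {
    assert(alphabet.size() == 2);
    std::string previous(1, alphabet[0]);  // S(n-2), starts as S(0) = a
    std::string current = alphabet;        // S(n-1), starts as S(1) = ab
    // Each S(n) is a prefix of S(n+1), so growing the word until it is
    // long enough and then truncating gives the requested prefix.
    while (current.size() < length) {
        std::string next = current + previous;
        previous = std::move(current);
        current = std::move(next);
    }
    current.resize(length);
    return current;
}
```

`fibonacci_prefix(12, "01")` returns `010010100100`, matching the first example.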
    
Returns:
0 on success, a nonzero value on error.
