PAPI: Practical Approximate Pattern Matching with Index Structures


Detailed Description

tt-generate generates random texts according to a probability model chosen by the user: a discrete autoregressive process, the approximate repeats model of Allison et al. (1998), a Markov chain, a uniform distribution, or the Fibonacci word.

The more complex models need parameters, which can be estimated from existing texts using tt-analyze.

The tt-generate and tt-analyze tools are intended to be used together. More details on the tools can be found in Andre Dau (2010): Analysis of the structure and statistical properties of texts and generation of random texts.

The required format of the parameter files is best described by the examples in the sample output folder, which comes bundled with the source code.

All input files must be in CSV format and must have a header. The last line of the header must contain the field content_type, which specifies the type of the file. Different probability models need different parameter file types as input.

Input files can also be read directly from stdin, making it possible to pipe the output of tt-analyze to tt-generate. To pass multiple files via stdin, concatenate them in arbitrary order. If a parameter file is passed both via stdin and on the command line, the file specified on the command line is always preferred.
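
The precedence rule can be pictured as a dictionary merge keyed by content type. The following Python sketch is only an illustration: the function name and the in-memory representation (a mapping from content_type to file contents) are assumptions and say nothing about how tt-generate is implemented.

```python
def resolve_parameter_files(from_stdin, from_command_line):
    """Merge parameter files so that command-line files win over stdin.

    Both arguments map a content_type (for example "character_distribution")
    to the corresponding file contents. This representation is hypothetical.
    """
    resolved = dict(from_stdin)          # start with everything piped in
    resolved.update(from_command_line)   # command-line entries override
    return resolved


stdin_files = {
    "character_distribution": "piped character distribution",
    "qgram_distribution": "piped q-gram distribution",
}
cli_files = {"character_distribution": "character distribution from -p"}

merged = resolve_parameter_files(stdin_files, cli_files)
print(merged["character_distribution"])
print(merged["qgram_distribution"])
```

Files that arrive only via stdin are still used; only a conflicting content type is replaced by the command-line version.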

Usage:

     tt-generate <file_length> <model> [arguments]

Parameters:

file_length	The length of the file to be generated.
model	A probability model (and its parameters) from the list below.
arguments	Zero or more arguments from the list below.

Arguments:

      -o <file>                    -  output file (default: stdout)
      -p <parameter_type> <file>   -  specify parameter file (default: read from stdin)
      --stdout                     -  print to stdout (default: only print to stdout if -o <file> 
                                      is not specified)

Models:

      markov                                  --  Markov chain
      dar                                     --  Discrete Autoregressive Process dar(p)
                                                  (see: Jacobs, P. A. & Lewis, P. A. W.: 
                                                        Stationary Discrete Autoregressive-Moving 
                                                        Average Time Series Generated By Mixtures
                                                        In: Journal of Time Series Analysis 4 (1983), 
                                                        Nr. 1, pp. 19-36
                                                        http://dx.doi.org/10.1111/j.1467-9892.1983.tb00354.x)
                                                  (see: Dehnert, M. & Helm, W. E. & Huett, M.-Th.: 
                                                        A Discrete Autoregressive Process as a model 
                                                        for short-range correlations in DNA sequences
                                                        In: Physica A 327 (2003), pp. 535-553
                                                        http://dx.doi.org/10.1016/S0378-4371(03)00399-6)
                                                  (see: Huett, M.-Th. & Dehnert, M. : 
                                                        Methoden der Bioinformatik: Eine Einfuehrung
                                                        Springer, 2006)
      repeats                                 --  Repeat machine 
                                                  (see: Allison, L. & Edgoose, T. & Dix, T. I.: 
                                                        Compression of Strings with Approximate Repeats
                                                        In: Intelligent Systems in Mol. Biol. (1998), 
                                                        pp. 8-16)
      uniform <alphabet>                      --  Uniform distribution of characters
                                                  <alphabet> is either a string containing all symbols 
                                                  of the alphabet or one of the following presets:
                                                      --dna           -> ACGT
                                                      --dna5          -> ACGTN
                                                      --rna           -> ACGU
                                                      --rna5          -> ACGUN
                                                      --amino         -> ARNDCEQGHILKMFPSTWYV
                                                      --amino23       -> ARNDCEQGHILKMFPSTWYVBZX
      fibonacci [alphabet] [--use-tmp-file]   --  Fibonacci word
                                                  All arguments are optional.
                                                  Alphabet is a string consisting of two characters. 
                                                  Default alphabet: "ab" 
                                                  If --use-tmp-file is specified, the Fibonacci word 
                                                  is created using a (temporary) file as a buffer.
                                                  This is usually much slower but can save memory.

Required parameter file types:

      markov          --  character_distribution  (required)
                          qgram_distribution      (optional; for higher-order Markov chains)
      dar             --  character_distribution  (required)
                          autocorrelation_dar     (required)
      repeats         --  character_distribution  (required)
                          qgram_distribution      (optional; for higher-order Markov chains)
                          direct_repeat           (optional; for direct repeats)
                          mirror_repeat           (optional; for mirror repeats)
                          inverted_repeat         (optional; for inverted repeats)
      fibonacci       --  none
      uniform         --  none
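
As a rough illustration of why the dar model needs both a character distribution and autocorrelation parameters, the sketch below simulates a DAR(p) process in the spirit of Jacobs & Lewis (1983): with probability rho the next character copies an earlier character at a randomly chosen lag of at most p, and otherwise it is drawn afresh from the character distribution. It is written in Python for brevity. The names rho and lag_probs, and the way the parameters are passed, are assumptions; they do not describe the autocorrelation_dar file format or the implementation in tt-generate.

```python
import random


def generate_dar(length, char_probs, rho, lag_probs, seed=None):
    """Simulate a DAR(p) sequence, where p = len(lag_probs).

    char_probs: mapping symbol -> probability (marginal distribution)
    rho:        probability of copying an earlier character
    lag_probs:  lag_probs[k - 1] is the probability of copying from lag k
    All parameter names are hypothetical.
    """
    rng = random.Random(seed)
    symbols = list(char_probs)
    weights = list(char_probs.values())
    order = len(lag_probs)
    lags = list(range(1, order + 1))
    out = []
    for t in range(length):
        # The first p characters have no full history, so they are
        # always drawn from the marginal distribution.
        if t >= order and rng.random() < rho:
            lag = rng.choices(lags, weights=lag_probs)[0]
            out.append(out[t - lag])
        else:
            out.append(rng.choices(symbols, weights=weights)[0])
    return "".join(out)


dna = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
print(generate_dar(60, dna, rho=0.4, lag_probs=[0.7, 0.3], seed=1))
```

With rho = 0 the sketch degenerates to independent draws from the character distribution; larger values of rho produce stronger short-range correlations.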

Examples:

The first 12 letters of the infinite Fibonacci word using 0 and 1 as the alphabet:
```
     $ tt-generate 12 fibonacci "01"
     010010100100
```
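
The Fibonacci word can be built by repeated concatenation: starting from S0 = a and S1 = ab, each step sets Sn = Sn-1 Sn-2. The Python sketch below follows this standard definition and keeps the whole word in memory; it is not taken from the tt-generate source, and the function name is an assumption.

```python
def fibonacci_word(length, alphabet="ab"):
    """Return the first `length` characters of the infinite Fibonacci word."""
    if len(alphabet) != 2:
        raise ValueError("alphabet must consist of exactly two characters")
    first, second = alphabet
    previous, current = first, first + second   # S0 and S1
    while len(current) < length:
        # S(n) = S(n-1) + S(n-2)
        previous, current = current, current + previous
    return current[:length]


print(fibonacci_word(12, "01"))  # prints 010010100100
```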

Uniform distribution over the DNA alphabet 'ACGT':

     $ tt-generate 50 uniform --dna
     TTGATCTATGTCAGAATGCCTAAGAGTGTTGTGATCTGATGAACGCTCGT

Uniform distribution over the alphabet '<>+-.,[]':

     $ tt-generate 90 uniform "<>+-.,[]"
     >>[[]<-<[<+->>..,,>.+++>.[<+<,,.-->,][>+<+,,]][][<--.,<>+>]<[+-.->+--.,<[[,]].,>>+<-<.>+->
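
What the uniform model does can be sketched in a few lines of Python: every position is drawn independently, with each symbol of the alphabet equally likely. The preset table mirrors the presets listed under Models; the function name and the seed parameter are assumptions made for this illustration, and the random number generator will not reproduce the sample outputs shown above.

```python
import random

# Alphabet presets as listed in the Models section.
PRESETS = {
    "--dna": "ACGT",
    "--dna5": "ACGTN",
    "--rna": "ACGU",
    "--rna5": "ACGUN",
    "--amino": "ARNDCEQGHILKMFPSTWYV",
    "--amino23": "ARNDCEQGHILKMFPSTWYVBZX",
}


def generate_uniform(length, alphabet, seed=None):
    """Draw `length` characters independently and uniformly from the alphabet.

    `alphabet` is either a preset name or a string containing all symbols.
    """
    symbols = PRESETS.get(alphabet, alphabet)
    if not symbols:
        raise ValueError("alphabet must not be empty")
    rng = random.Random(seed)
    return "".join(rng.choice(symbols) for _ in range(length))


print(generate_uniform(50, "--dna"))
print(generate_uniform(90, "<>+-.,[]"))
```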

Markov chain of order 5. Pipe estimated parameters directly from tt-analyze to tt-generate:
```
     $ tt-analyze markov 5 -i input.fasta | tt-generate 100 markov > output.fasta
```
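
Conceptually, the pipeline above estimates how often each character follows each context of five characters and then samples a new text from those frequencies. The Python sketch below shows that idea in a simplified form. It is an assumption-laden illustration, not the estimation procedure of tt-analyze or the sampling code of tt-generate: all names are hypothetical, and contexts that never occur in the input simply fall back to the overall character distribution.

```python
import random
from collections import Counter, defaultdict


def estimate_transitions(text, order):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    return counts


def generate_markov(length, text, order, seed=None):
    """Sample `length` characters from an order-`order` chain fitted to `text`."""
    if not text:
        raise ValueError("text must not be empty")
    rng = random.Random(seed)
    char_counts = Counter(text)
    transitions = estimate_transitions(text, order)
    out = []
    while len(out) < length:
        followers = None
        if order > 0 and len(out) >= order:
            followers = transitions.get("".join(out[-order:]))
        # Fall back to the character distribution for the first characters
        # and for contexts that were never observed in the input.
        source = followers if followers else char_counts
        symbols = list(source)
        weights = list(source.values())
        out.append(rng.choices(symbols, weights=weights)[0])
    return "".join(out)


sample = "ACGTACGTTTGACAACGTTGCA"
print(generate_markov(40, sample, order=2, seed=0))
```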

Complex example. Repeats model, printing to both a file and stdout, with parameter files given explicitly on the command line:

     $ tt-generate 30000000 repeats \
              -o sample_output.txt --stdout \
              -p qgram_distribution ApproximateRepeats_qgram_distribution.csv \
              -p character_distribution ApproximateRepeats_character_distribution.csv \
              -p direct_repeat ApproximateRepeats_direct_repeat.csv

Returns: 0 on success, a nonzero value on error
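
When tt-generate is driven from a script, the exit status is the natural way to detect failure. The Python sketch below assembles a command line from the usage pattern and arguments documented above and checks the exit status. The helper names are assumptions, and the example presumes that tt-generate is on the PATH and that the listed parameter files exist.

```python
import subprocess


def build_command(file_length, model, output_file=None,
                  also_stdout=False, parameter_files=None):
    """Build an argument list following: tt-generate <file_length> <model> [arguments]."""
    command = ["tt-generate", str(file_length), model]
    if output_file is not None:
        command += ["-o", output_file]
    if also_stdout:
        command.append("--stdout")
    for parameter_type, path in (parameter_files or []):
        command += ["-p", parameter_type, path]
    return command


def run(command):
    """Run the command; raises CalledProcessError on a nonzero exit status."""
    subprocess.run(command, check=True)


command = build_command(
    30000000, "repeats",
    output_file="sample_output.txt",
    also_stdout=True,
    parameter_files=[
        ("qgram_distribution", "ApproximateRepeats_qgram_distribution.csv"),
        ("character_distribution", "ApproximateRepeats_character_distribution.csv"),
        ("direct_repeat", "ApproximateRepeats_direct_repeat.csv"),
    ],
)
print(" ".join(command))
# run(command)  # uncomment where tt-generate and the parameter files are available
```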

Download:

The newest version of this tool can be downloaded from http://wwwmayr.in.tum.de/spp1307/downloads.html

tt-generate.cpp File Reference

Detailed Description