LEA

generate_patterns.cpp File Reference


Detailed Description

Extracts substrings from a text and generates a set of search patterns for approximate pattern matching.

The text is read from INFILE and the patterns are printed to stdout, one per line. UTF-8 encoded input is also supported (see the ENCODING parameter).

The program extracts PATTERN_COUNT substrings from the file INFILE, each of length PATTERN_LENGTH, and then introduces MAX_DISTANCE modifications into each pattern. If DISTANCE_MEASURE is edit, these modifications can be deletions, insertions, and substitutions of single characters. If DISTANCE_MEASURE is hamming, only substitutions are performed.
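The difference between the two distance measures can be illustrated with a minimal sketch. This is not the code from generate_patterns.cpp: the names `DistanceMeasure` and `applyModifications`, the default alphabet, and the uniform choice of operation and position are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>

// Hypothetical names, not taken from generate_patterns.cpp.
enum class DistanceMeasure { Edit, Hamming };

// Applies up to `operations` random single-character operations to `pattern`.
// Assumption: replacement characters come from a fixed alphabet; the real
// program may well draw them from the input text instead.
std::string applyModifications(std::string pattern, int operations,
                               DistanceMeasure measure, std::mt19937& rng,
                               const std::string& alphabet = "ACGT") {
    assert(!alphabet.empty());
    std::uniform_int_distribution<std::size_t> pickChar(0, alphabet.size() - 1);
    std::uniform_int_distribution<int> pickKind(0, 2);
    for (int i = 0; i < operations; ++i) {
        // 0 = substitution, 1 = deletion, 2 = insertion.
        // Hamming distance only permits substitutions.
        int kind = (measure == DistanceMeasure::Edit) ? pickKind(rng) : 0;
        if (pattern.empty()) {
            if (measure == DistanceMeasure::Hamming) break;  // nothing to substitute
            kind = 2;  // only an insertion is possible on an empty string
        }
        if (kind == 2) {
            // An insertion may also happen after the last character.
            std::uniform_int_distribution<std::size_t> pickPos(0, pattern.size());
            const std::size_t pos = pickPos(rng);
            pattern.insert(pos, 1, alphabet[pickChar(rng)]);
        } else {
            std::uniform_int_distribution<std::size_t> pickPos(0, pattern.size() - 1);
            const std::size_t pos = pickPos(rng);
            if (kind == 0) {
                pattern[pos] = alphabet[pickChar(rng)];
            } else {
                pattern.erase(pos, 1);
            }
        }
    }
    return pattern;
}
```

In this sketch a substitution may pick the character that is already present, so the resulting distance to the original substring is at most `operations`, not necessarily equal to it. With the Hamming measure the pattern length never changes, while with the edit measure it can shrink or grow by up to `operations` characters.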

Line breaks are ignored, so that patterns can also span two or more lines (this is especially useful for FASTA files).
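As a rough illustration of why ignoring line breaks lets a pattern span several lines, the following sketch removes the line breaks first and samples substrings afterwards. The function names `removeLineBreaks` and `extractPatterns` are hypothetical, the file content is assumed to be already loaded into a string, and uniform random start positions are an assumption about the sampling strategy.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper: drops '\n' and '\r' so that the text becomes one
// continuous string.
std::string removeLineBreaks(const std::string& raw) {
    std::string text;
    text.reserve(raw.size());
    for (char c : raw) {
        if (c != '\n' && c != '\r') text += c;
    }
    return text;
}

// Hypothetical helper: samples `count` substrings of `length` characters
// (single-byte encoding assumed) at uniformly random start positions.
// Returns an empty vector if the text is shorter than `length`.
std::vector<std::string> extractPatterns(const std::string& text,
                                         std::size_t count, std::size_t length,
                                         std::mt19937& rng) {
    std::vector<std::string> patterns;
    if (length == 0 || text.size() < length) return patterns;
    std::uniform_int_distribution<std::size_t> pickStart(0, text.size() - length);
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t start = pickStart(rng);
        assert(start + length <= text.size());
        patterns.push_back(text.substr(start, length));
    }
    return patterns;
}
```

Because the substrings are taken from the joined text, a pattern that starts near the end of one input line simply continues with the first characters of the next line.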

Usage:

  • generate_patterns PATTERN_COUNT PATTERN_LENGTH DISTANCE_MEASURE MAX_DISTANCE INFILE [ENCODING=single-byte [FILETYPE=plain]]
    
Parameters:
PATTERN_COUNT     Maximum number of patterns to be generated.
PATTERN_LENGTH    Length of the patterns to be generated.
DISTANCE_MEASURE  Distance measure. Has to be one of the following: (edit | hamming).
MAX_DISTANCE      Maximum number of operations to perform on the substrings.
INFILE            Name of the input text file from which the patterns are extracted.
ENCODING          The encoding of the input file. Has to be one of the following: (single-byte | UTF-8).
FILETYPE          Whether the input file should be treated as a regular text file or as a FASTA file. The only difference is that for a FASTA file all lines starting with > are ignored, and patterns are not allowed to span two different FASTA sequences. Has to be one of the following: (plain | fasta).
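The FASTA behaviour described for FILETYPE can be sketched as follows. This is only an illustration under stated assumptions: `splitFastaSequences` is a hypothetical name, the file content is assumed to be already loaded into a string, and keeping each sequence in its own string is just one possible way to prevent patterns from crossing sequence boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits FASTA content into its sequences.
// Lines starting with '>' are skipped and start a new sequence; the
// remaining lines of a sequence are concatenated without line breaks.
std::vector<std::string> splitFastaSequences(const std::string& raw) {
    std::vector<std::string> sequences;
    bool startNew = true;  // the next data line opens a new sequence
    std::size_t pos = 0;
    while (pos < raw.size()) {
        std::size_t end = raw.find('\n', pos);
        if (end == std::string::npos) end = raw.size();
        std::string line = raw.substr(pos, end - pos);
        if (!line.empty() && line.back() == '\r') line.pop_back();
        pos = end + 1;
        if (line.empty()) continue;
        if (line[0] == '>') {  // header line: ignore it, close the current sequence
            startNew = true;
            continue;
        }
        if (startNew) {
            sequences.emplace_back();
            startNew = false;
        }
        assert(!sequences.empty());
        sequences.back() += line;
    }
    return sequences;
}
```

Sampling each pattern from a single element of the returned vector guarantees that it never spans two different FASTA sequences, while it can still span several lines of the same sequence.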

Examples:

  • Generate 5 patterns from this source code file allowing no modifications:
         $ ./generate_patterns 5 10 edit 0 generate_patterns.cpp
         oximate Pa
         tream>#inc
         SoFar = 0;
         MeasureStr
         exical_cas
    
  • Generate 5 patterns from this source code file using edit distance and allowing one modification:
         $ ./generate_patterns 5 10 edit 1 generate_patterns.cpp
         oximatre Pa
         tr_am>#inc
         SoFar , 0;
         MeasureSr
         exicalcas
    
Returns:
0 on success, a nonzero value on error.
