
SeqWord Sniffer


This program was developed to allow an automatic search in genomic DNA sequences for loci enriched with putative horizontally transferred elements, fitness genes, giant genes, or genes for ribosomal RNA and proteins. Predictions are made using oligonucleotide signatures of the genomic fragments.
Installation
Options for the run and preset scenario
Input and output files
Editing the task list
Setting task conditions
Addition of a new task
Save new scenario
Setting the size of the sliding window
Input and output folders
Publications

Installation
The program needs no installation. Download SeqWordSniffer_for_Win.zip (contains an executable file, SeqWordSniffer.exe, for Windows with Python 2.5 installed) or SeqWordSniffer_Py.zip (the Python version of the program, compatible with any OS with Python 2.5 installed) from http://www.bi.up.ac.za/SeqWord/downloads/. Unzip the file to a directory of your choice. A folder SeqWordSniffer will appear with several files inside and two subfolders, input and output. To process genomic DNA sequences in FASTA or GenBank format, copy them to the folder input and run SeqWordSniffer.exe or the command python SeqWordSniffer.py, depending on the version of the program you downloaded. A console window will appear (the Command Prompt window in MS Windows).

Options for the run and preset scenario
The window shows the default run options. Several sets of options were prepared to identify loci of interest in a genomic DNA sequence prior to annotation. The identification is based on analysis of the oligonucleotide usage (OU) statistical parameters as described in our previous publications [1, 2]. By default the options are set to identify horizontally transferred genomic elements. To change the identification scenario, press <C> + <Enter>.

Input and output files
To run the program press <Y> + <Enter>. The program sequentially processes all files of genomic DNA sequences in FASTA ('FNA', 'FAS', 'FST', 'FASTA') or GenBank ('GBK', 'GBFF') formats and saves the results as text files in the folder 'output'. The output files contain information about all genomic fragments enriched with the genes of interest, as in the following example:

<GI> 1 <COORDINATES> 583441-620599 
          [583441:586308:dir] 
          hypothetical [586849:587388:dir] 
          [587952:588434:dir] 
          [588652:590184:dir] 
          [590325:598688:dir] 
          [599481:600833:rev] 
          [600973:601719:dir] 
          [601854:602831:dir] 
          [602964:605069:dir] 
          [605069:605704:dir] 
          [605726:606337:dir] 
          [606359:606967:dir] 
          [607847:614539:dir] 
          [614543:615793:dir] 
          [615803:616642:dir] 
          [617188:618138:rev] 
          [618386:618586:dir] 
          [618721:619155:dir] 
          [619250:619579:rev] 
<END> 
<GI> 2 <COORDINATES> 3487108-3508599 
          [3487108:3487680:rev] 
          [3488641:3489390:rev] 
          [3489577:3490188:dir] 
          [3490741:3491295:rev] 
          [3491355:3491678:rev] 
          [3491722:3492231:dir] 
          [3492528:3493094:rev] 
          [3493390:3494199:rev] 
          [3494291:3495052:rev] 
          [3495061:3497235:rev] 
          [3497425:3498099:rev] 
          internal repeat sequences detected; contains peptidase family M23/M37 as detected by pfam-hmmr [3498112:3507534:rev] 
<END>

In this example 2 gene islands were identified in the given genome. Each block starts with the island ID and its coordinates in the genome: <GI> 1 <COORDINATES> 583441-620599

If a GenBank file was processed, the annotation and coordinates [left : right : strand] of all genes inside the genomic fragment follow. The end of the block is marked by <END>.
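
For downstream processing, the block structure described above can be read back with a few lines of Python. The following is a minimal sketch written for Python 3 (not the Python 2.5 the program itself targets); the function name parse_islands and the dictionary keys are illustrative assumptions, and only the markers <GI>, <COORDINATES>, <END> and the [left:right:strand] notation are taken from the example output.

```python
import re

# Header line of a block, e.g. "<GI> 1 <COORDINATES> 583441-620599"
HEADER = re.compile(r"<GI>\s*(\d+)\s*<COORDINATES>\s*(\d+)-(\d+)")
# Gene line: optional annotation followed by "[left:right:strand]"
GENE = re.compile(r"^(.*?)\[(\d+):(\d+):(dir|rev)\]\s*$")


def parse_islands(text):
    """Return a list of islands, each a dict with id, start, end and genes."""
    islands, current = [], None
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            # A new block starts: remember its ID and genome coordinates.
            current = {"id": int(header.group(1)),
                       "start": int(header.group(2)),
                       "end": int(header.group(3)),
                       "genes": []}
        elif line.startswith("<END>") and current is not None:
            islands.append(current)
            current = None
        elif current is not None:
            gene = GENE.match(line)
            if gene:
                current["genes"].append({
                    "annotation": gene.group(1).strip(),
                    "left": int(gene.group(2)),
                    "right": int(gene.group(3)),
                    "strand": gene.group(4)})
    return islands
```

Lines that match neither pattern are skipped, so a FASTA-derived output file without gene lines should simply yield islands with an empty gene list.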

 

Editing the task list
The user may change the default options. To change the set of OU statistical parameters that the program calculates to identify genomic fragments, press <T> + <Enter>.

Each task is represented by a line defining the task category and the condition used to select the genomic fragments. Remember that a fragment will be selected only if it meets all the conditions set. To remove a condition press <R> + <Enter>, then enter the number of the task to remove it from the list.
To return to the main menu press <Q> + <Enter>.
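
The rule that a fragment is selected only if it meets all conditions can be sketched in Python 3 as follows. This is an illustration, not the program's code: the function names, the tuple layout of a condition and the choice of strict or inclusive comparisons are assumptions. The letters G, S and B mirror the comparison types (bigger than, smaller than, between) offered in the condition menu described in the next section.

```python
def meets(value, comparison, low, high=None):
    """Check one condition; comparison is 'G', 'S' or 'B'."""
    if comparison == "G":      # bigger than the threshold
        return value > low
    if comparison == "S":      # smaller than the threshold
        return value < low
    if comparison == "B":      # between two thresholds (assumed inclusive)
        return low <= value <= high
    raise ValueError("unknown comparison: %r" % comparison)


def select_fragment(params, conditions):
    """A fragment is selected only if every condition in the list holds."""
    return all(meets(params[name], comparison, low, high)
               for name, comparison, low, high in conditions)


# Hypothetical task list: parameter label, comparison, threshold(s).
conditions = [("n1_4mer:GRV/n1_4mer:RV", "G", 2.0, None),
              ("GC", "B", 0.3, 0.7)]
fragment = {"n1_4mer:GRV/n1_4mer:RV": 2.5, "GC": 0.5}
print(select_fragment(fragment, conditions))
```

The threshold values in the example are placeholders chosen for illustration only.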

Setting task conditions
To edit the condition of one of the tasks press <E> + <Enter>. Now type the number of the task to edit and press <Enter>. A submenu of edit options will appear.

Use the option <M> to choose the type of the threshold values:
  • sigmas - to set the threshold values in sigmas of the normal distribution;
  • fraction - to set the threshold as a fraction of the total number of genomic fragments;
  • absolute - to use an absolute value of the OU statistical parameter as the threshold.
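
One way to think about the three threshold types is that each of them is eventually translated into an absolute cut-off on the parameter. The Python 3 sketch below shows this for an upper-tail threshold. It is an interpretation, not the program's documented algorithm: the function name, the use of the population standard deviation and the rounding rule for fractions are all assumptions.

```python
from statistics import mean, pstdev


def absolute_cutoff(values, mode, threshold):
    """Translate an upper-tail threshold into an absolute parameter value.

    values    -- the parameter computed for every genomic fragment
    mode      -- 'sigmas', 'fraction' or 'absolute'
    threshold -- number of sigmas, a fraction between 0 and 1, or the value
    """
    if mode == "absolute":
        return threshold
    if mode == "sigmas":
        # Assumed meaning: mean plus `threshold` standard deviations.
        return mean(values) + threshold * pstdev(values)
    if mode == "fraction":
        # Assumed meaning: a cut-off such that roughly this fraction of
        # all fragments lies at or above it.
        ranked = sorted(values, reverse=True)
        count = int(round(threshold * len(ranked)))
        count = min(max(count, 1), len(ranked))
        return ranked[count - 1]
    raise ValueError("unknown threshold type: %r" % mode)
```

For a lower-tail condition the same idea applies with the sign of the sigma term and the sort order reversed.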

To choose the type of comparison (bigger than, smaller than or between), press the key <G>, <S> or <B> respectively and press <Enter>. The program will prompt you to enter the values of one or two (if the option Between is used) thresholds. To choose threshold values, consult the SeqWord Browser program (http://www.bi.up.ac.za/SeqWord/mhhapplet.php).

Addition of a new task
To add a new task press <A>+<Enter>. The program will show a new menu:
  1. To choose the task category press <C>+ <Enter> and choose from the list: 
  • 0. return back to the previous menu; 
  • 1. GRV (generalized relative variance); 
  • 2. PS (pattern skew); 
  • 3. RV (relative variance); 
  • 4. D (pattern deviation - by default); 
  • 5. GCS (GC-skew); 
  • 6. GD (generalized pattern deviation); 
  • 7. GC (GC-content); 
  • 8. AT (AT-content); 
  • 9. GPS (generalized pattern skew); 
  • 10. ATS (AT-skew).
  (For more about the OU statistical parameters see Reva and Tümmler, 2005.)
  2. To change the oligonucleotide word length press <W>+<Enter> and enter an integer from 2 to 7 (4 by default).
  3. To set the normalization press <N>+<Enter> and enter an integer from 0 (no normalization) to word_length - 1. Normalization by the mononucleotide content of the sequence (option 1) is set by default. Remember that when generalized parameters (GRV, GD or GPS) are selected, the frequencies of the complete genome are used for normalization, whereas by default the parameters are normalized by the content of the genomic region selected by the sliding window.
  4. The program can perform simple mathematical operations on the OU statistical parameters, namely subtraction and division (or [par1-par2]/par3 if both the subtrahend par2 and the divisor par3 are set). For example, in the scenario for identifying horizontally transferred gene islands the program calculates the ratio n1_4mer:GRV/n1_4mer:RV; this ratio is around 1.0 for the core sequence but higher than 2 in genomic fragments from the accessory genome. (When setting the divisor, be sure that this parameter is never zero!) To set subtraction or division of the parameters, press <S>+<Enter> or <D>+<Enter> respectively. The program will show a menu similar to the menu for adding a new task described above.
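
The settings in the list above can be summarized in a short Python 3 sketch. It validates the category, word length and normalization ranges given in the text and evaluates the [par1-par2]/par3 combination. The function names are hypothetical, and the label format is inferred only from the single example n1_4mer:GRV, so it should be treated as an assumption.

```python
CATEGORIES = ("GRV", "PS", "RV", "D", "GCS", "GD", "GC", "AT", "GPS", "ATS")


def task_label(category="D", word_length=4, normalization=1):
    """Build a label such as 'n1_4mer:GRV' after validating the settings."""
    if category not in CATEGORIES:
        raise ValueError("unknown category: %r" % category)
    if not 2 <= word_length <= 7:
        raise ValueError("word length must be between 2 and 7")
    if not 0 <= normalization <= word_length - 1:
        raise ValueError("normalization must be between 0 and word_length - 1")
    return "n%d_%dmer:%s" % (normalization, word_length, category)


def combine(par1, par2=None, par3=None):
    """Return par1, par1 - par2, par1 / par3 or (par1 - par2) / par3."""
    result = par1 if par2 is None else par1 - par2
    if par3 is None:
        return result
    if par3 == 0:
        # The text warns that the divisor must never be zero.
        raise ZeroDivisionError("the divisor must never be zero")
    return result / par3


# The ratio used in the horizontal transfer scenario, with made-up values.
print(task_label("GRV") + "/" + task_label("RV"))
print(combine(6.0, par3=3.0))
```

Raising an error on a zero divisor is one possible design; how the program itself handles this case is not stated.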

Press <A>+<Enter> to add a subtrahend or a divisor, or to add the new task to the list. In the latter case the program will show the condition setting menu described above. Press <Q>+<Enter> to return to the task edit menu and <Q>+<Enter> again to return to the main menu.

Save new scenario
If the list of tasks is changed, the program changes the name of the current scenario to "User defined". To save the new list of tasks, press <A>+<Enter> in the main menu and name your scenario.

Setting the size of the sliding window
The program identifies gene islands by using a sliding window approach. To achieve optimal speed and accuracy of identification of gene islands, the program flexibly changes the step of the sliding window, choosing between big, medium and small steps.

To change the default values of the sliding window length (8000 bp), big step (2000 bp), medium step (500 bp) and small step (100 bp), press the keys <L>, <B>, <M> and <S> respectively and press <Enter>. The program will prompt you to enter new values. (Remember that for statistical reliability the sliding window should not be shorter than 4600 bp for tetranucleotide usage analysis, 1200 bp for trinucleotides and 600 bp for dinucleotides.)
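
How the program switches between the three steps is not described here. The Python 3 sketch below shows one plausible coarse-to-fine strategy: scan with the big step, and once a window meets the conditions, move its start back with the medium and then the small step to locate the boundary more precisely. The function names and the is_hit callback (standing in for "this window meets all task conditions") are assumptions; only the default sizes and the minimum window lengths come from the text.

```python
# Minimum window length (bp) per word length, as stated in the text.
MIN_WINDOW = {4: 4600, 3: 1200, 2: 600}


def window_is_reliable(window, word_length):
    """True if the window is not shorter than the stated minimum."""
    minimum = MIN_WINDOW.get(word_length)
    return minimum is None or window >= minimum


def coarse_starts(seq_len, window=8000, big=2000):
    """0-based start positions of all full windows visited with the big step."""
    return list(range(0, seq_len - window + 1, big))


def refine_start(first_hit, is_hit, big=2000, medium=500, small=100):
    """Coarse-to-fine search for the leftmost window start that still hits.

    first_hit -- start of the first window found with the big step
    is_hit    -- callback: does the window starting here meet the conditions?
    """
    lowest = max(0, first_hit - big)  # previous coarse window did not hit
    best = first_hit
    for step in (medium, small):
        pos = best - step
        while pos >= lowest and is_hit(pos):
            best = pos
            pos -= step
    return best
```

With the default steps, refining one boundary costs at most a handful of extra window evaluations, which is the kind of saving that makes a variable step attractive compared with scanning the whole genome in 100 bp steps.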

Input and output folders
By default the program reads sequence files from the folder input and saves the result files (see the example above) to the folder output. The user may change the names of the input and output folders from the main menu by selecting the options <I> and <O>. In addition to the text files with coordinates of identified gene islands, the program can be instructed to save the sequences of the gene islands to FASTA files. To do this press <F>+<Enter>.
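
The folder conventions can be mimicked in a script that prepares files for the program or post-processes its results. The Python 3 sketch below recognizes sequence files by the extensions listed earlier and formats one island as a FASTA record. The function names, the case-insensitive extension matching, the FASTA header layout and the reading of island coordinates as 1-based and inclusive are assumptions made for illustration.

```python
import os

FASTA_EXT = ("FNA", "FAS", "FST", "FASTA")
GENBANK_EXT = ("GBK", "GBFF")


def file_format(filename):
    """Return 'fasta', 'genbank' or None, judged by the file extension."""
    ext = os.path.splitext(filename)[1].lstrip(".").upper()
    if ext in FASTA_EXT:
        return "fasta"
    if ext in GENBANK_EXT:
        return "genbank"
    return None


def list_inputs(folder="input"):
    """Map each recognized sequence file in the folder to its format."""
    found = {}
    for name in sorted(os.listdir(folder)):
        fmt = file_format(name)
        if fmt is not None:
            found[name] = fmt
    return found


def island_fasta(genome, island_id, start, end, width=60):
    """Format one island as a FASTA record.

    Assumes start and end are 1-based, inclusive genome coordinates.
    """
    seq = genome[start - 1:end]
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">GI_%d %d-%d\n%s\n" % (island_id, start, end, "\n".join(lines))
```

Files with other extensions are ignored rather than reported, which keeps stray notes or archives in the input folder from interrupting a batch run.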
Publications
  1. Ganesan H, Rakitianskaia AS, Davenport CF, Tümmler B, Reva ON. (2008) The SeqWord Genome Browser: an online tool for the identification and visualization of atypical regions of bacterial genomes through oligonucleotide usage. BMC Bioinformatics. 9:333.
  2. Reva ON, Tümmler B. (2005) Differentiation of regions with atypical oligonucleotide composition in bacterial genomes. BMC Bioinformatics. 6:251.
  3. Reva ON, Tümmler B. (2008) Think big - giant genes in bacteria. Environ Microbiol. 10(3):768-777.