DOGMA Help

  1. Platforms/Browsers
  2. Userids/Passwords
  3. Unique IDs
  4. Retrieving or removing an existing annotation
  5. Input file format
  6. Input file problems
  7. Gapped alignment
  8. Percent identity cutoffs
  9. Genetic code
  10. E-values
  11. Reorienting genomes
  12. COVE threshold
  13. Databases
  14. Numberline panel
  15. Extracting sequences
  16. Displaying summaries
  17. Annotation panel
  18. Sequin input window
  19. Deleting a gene
  20. Using Sequin to submit files to GenBank
  21. Misc. data files
  1. Platforms/Browsers

  2. Userids/Passwords
    Your userid gives you access to your own data files, and only to your own data. Because this site is not encrypted, passwords are not secure, so choose a password that you do not use for anything that must stay secure. Userids and passwords are case sensitive. Userids should be alphanumeric (no spaces!). If you are having trouble logging in, make sure you are typing with the proper capitalization. If you forget your userid or password, send email to staciakwyman@gmail.com

  3. Unique IDs
    Unique IDs are how you identify and distinguish between your different annotations. When you retrieve an existing annotation, you will be given a list of your annotations by unique ID. It is best to name each annotation after the genome (taxon) and perhaps the DOGMA settings it was created with. Unique IDs should be alphanumeric (no spaces!). For example, if you had several nicotiana annotations, you might name them this way:
    • nicotianaGap50
    • nicotianaUngap60
    • etc...
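
    The naming scheme above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and reading the trailing number as the percent identity cutoff is an assumption about the example IDs.

```python
import re


def make_unique_id(taxon: str, gapped: bool, percent_identity: int) -> str:
    """Build an ID such as 'nicotianaGap50' from a taxon name and two settings.

    Assumption: the number at the end of the example IDs is the percent
    identity cutoff used for the run.
    """
    mode = "Gap" if gapped else "Ungap"
    unique_id = f"{taxon}{mode}{percent_identity}"
    # Unique IDs should be alphanumeric, with no spaces.
    if not re.fullmatch(r"[A-Za-z0-9]+", unique_id):
        raise ValueError(f"unique ID is not alphanumeric: {unique_id!r}")
    return unique_id


print(make_unique_id("nicotiana", True, 50))
print(make_unique_id("nicotiana", False, 60))
```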

  4. Retrieving or removing an existing annotation
    When you enter your userid and then hit submit in the retrieve section, you will be presented with a list of the annotations in your data directory and the date each was last edited. You may remove obsolete annotations here, or view and edit unfinished annotations. If you remove an annotation, it can be recovered, but not quickly or easily, and perhaps not as the most recent copy, so be sure you want to delete it.

  5. Input file format
    The input file should be in FASTA format. The first line should begin with a ">" followed immediately by the SeqId, with the nucleotide sequence beginning on the next line. In order for a feature table to be imported into Sequin, the SeqId of the FASTA file (the first word immediately after the >) must match the SeqId of the feature table. So be aware that the first word (after the >) on the first line of the FASTA file will be used as the SeqId in Sequin.

    For example, in the following file, the SeqId is Sc_16:

    >Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C] 
    CGACCACAATGGTACGATTGTTCATAAATCAGGAGATGTTCCTATTCATATAAAGATACC
    AAACAGATCTCTAATACATGACCAGGATATCAACTTCTATAATGGTTCCGAAAACGAAAG
    AAAACCAAATCTAGAGCGTAGAGACGTCGACCGTGTTGGTGATCCAATGAGGATGGATAG
    [etc.]
    
    And the first line of the feature table file would look like:
    >Features Sc_16
    

    More information is available at the NCBI site.
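
    As a rough way to check the SeqId rule before importing into Sequin, the two header lines can be compared with a short Python sketch. The function names are hypothetical, and the feature table header is assumed to have the form shown in the example above.

```python
def fasta_seqid(defline: str) -> str:
    """Return the first word immediately after '>' on a FASTA definition line."""
    if not defline.startswith(">"):
        raise ValueError("FASTA definition line must start with '>'")
    if len(defline) < 2 or defline[1].isspace():
        raise ValueError("the SeqId must follow '>' immediately")
    return defline[1:].split()[0]


def feature_table_seqid(header: str) -> str:
    """Return the SeqId from a feature table header such as '>Features Sc_16'."""
    words = header.split()
    if len(words) < 2 or not words[0].startswith(">Feature"):
        raise ValueError("unexpected feature table header")
    return words[1]


fasta_line = ">Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C]"
table_line = ">Features Sc_16"
print(fasta_seqid(fasta_line) == feature_table_seqid(table_line))
```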

    Sequence data may contain any of the characters from the IUPAC code:

    A  adenosine             M  A or C (amino)
    C  cytidine              S  G or C (strong)
    G  guanine               W  A or T (weak)
    T  thymidine             B  G or T or C
    U  uridine               D  G or A or T
    R  G or A (purine)       H  A or C or T
    Y  T or C (pyrimidine)   V  G or C or A
    K  G or T (keto)         N  A G C T (any)
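
    A quick Python sketch for spotting characters outside this code before uploading. Treating lowercase letters as valid and ignoring whitespace are assumptions made for the illustration, not documented DOGMA behavior.

```python
# The sixteen IUPAC nucleotide characters listed in the table.
IUPAC_NUCLEOTIDES = set("ACGTURYKMSWBDHVN")


def invalid_characters(sequence: str) -> set:
    """Return the set of characters in `sequence` that are not IUPAC codes.

    Whitespace (including line breaks) is ignored, and lowercase letters are
    folded to uppercase before checking.
    """
    return {
        ch
        for ch in sequence.upper()
        if not ch.isspace() and ch not in IUPAC_NUCLEOTIDES
    }


print(invalid_characters("CGACCACAATGGTACGATTGTTCATAAATC"))
print(sorted(invalid_characters("ACGTXZ-NN")))
```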

  6. Problems? Read below
    IF YOU GET OUTPUT WITH NO GENES:
    This usually means your input file had the wrong kind of line breaks. Follow the directions below and try uploading your file again. If your situation is not covered in the instructions below, send email to staciakwyman@gmail.com

    Creating an input file:
    To prepare your file for uploading to DOGMA, several steps may be required depending on which program (Sequencher, MacVector, Consed, Word, etc.) you used to save your data. Different programs save files with different kinds of line breaks, and it must be a specific kind for uploading to DOGMA. In the instructions below, I am assuming you are using a Mac running OS X.

    • To save from Sequencher or MacVector: If you have a Sequencher (or MacVector) file, you'll need to execute several steps to get it ready to input to DOGMA.
      Save the file as Pearson/FASTA format from the export menu (FASTA text in MacVector).

      Then you have two options to convert it to a file which may be uploaded through the browser to DOGMA (if you are using Mac OS X):
      1. Using a terminal window.
        • Open a terminal window (the terminal application is in the Utilities folder under Applications).
        • cd to the directory with the file in it (if the file is on the Desktop, open the terminal window and then type "cd Desktop" at the prompt).
        • At the prompt, type:
          native2ascii filename filename
        • Then try uploading the file to DOGMA
      2. Using MS Word.
        • Open the file in MS Word, then save as MS-DOS Text. NOT TEXT ONLY.
    • To save from MS Word
      Save the file as MS-DOS text. NOT TEXT ONLY.
    • Other:
      • If you aren't sure what program the file came from and it isn't uploading correctly to DOGMA, try running native2ascii on it (as in the directions for saving from Sequencher above). Doing this to a file will do no harm if it's already in ASCII format. The symptom of not uploading correctly is that DOGMA finds no genes in the input sequence.
      • Files saved from Consed should be OK for uploading to DOGMA.
      • Files downloaded through the browser from GenBank (as in the tutorial) should be OK for uploading.
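
    If you want to see which kind of line breaks a file contains, the following Python sketch counts them and rewrites bare carriage returns (the classic Mac style) as line feeds. It is not a replacement for the steps above: this page does not state which line break DOGMA requires, so converting to line feeds is an assumption, and the function names are hypothetical.

```python
def count_line_breaks(data: bytes) -> dict:
    """Count CRLF (DOS), bare CR (classic Mac) and bare LF (Unix) line breaks."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf   # carriage returns not followed by LF
    lf = data.count(b"\n") - crlf   # line feeds not preceded by CR
    return {"CRLF": crlf, "CR": cr, "LF": lf}


def to_unix_line_breaks(data: bytes) -> bytes:
    """Rewrite every CRLF or bare CR as a single LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


example = b">Sc_16\rCGACCACAATGG\rAAACAGATCTCT\r"
print(count_line_breaks(example))
print(to_unix_line_breaks(example))
```

    To apply it to a real file, read the file in binary mode, convert, and write the result to a new file so the original is kept.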

  7. To gap or not to gap
    MT Genomes
    For mitochondrial genomes, we have not found any case in which ungapped BLAST finds something that gapped BLAST misses, so gapped BLAST should be used for mt genomes. You could also run both and compare the results.

    CP Genomes
    The gapped and ungapped alignment options may give slightly different results for chloroplast genomes. The ungapped option will generally generate shorter BLASTX matches with higher percent identity scores. This will result in some genes appearing as multiple, contiguous pieces in the annotation window. The gapped option will generate longer BLASTX matches, but in genomes where some protein coding genes are quite divergent from the ones in the database, these genes may be missed. The more divergent genes can be found either by running an ungapped alignment or by running a gapped alignment with a percent identity cutoff lower than the 60% default value. The best thing to do is to run it both gapped (with and without a lowered percent identity cutoff) and ungapped, and compare the results. We also recommend that users examine the BLASTX results before deciding whether a gene is present when the gapped and ungapped annotations differ (BLAST results for a gene can be viewed by clicking on the gene name in the top panel). We have evaluated the differences between these two options for several genomes, but we would be interested in hearing about any differences you detect.
    I suggest that when you are annotating a complete cp genome, you keep one of the runs as the "master" copy, to which you can add genes that are found with different settings.
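
    Comparing two runs comes down to comparing the sets of gene names each one reports. A minimal Python sketch, using made-up results purely for illustration (the gene lists below are not real DOGMA output):

```python
def compare_runs(gapped: set, ungapped: set) -> dict:
    """Group gene names by which run(s) reported them."""
    return {
        "both": sorted(gapped & ungapped),
        "gapped_only": sorted(gapped - ungapped),
        "ungapped_only": sorted(ungapped - gapped),
    }


# Hypothetical gene lists, as if copied from the text summaries of two runs.
gapped_run = {"atpF", "rbcL", "psbA"}
ungapped_run = {"atpF", "rbcL", "ycf1"}

for group, names in compare_runs(gapped_run, ungapped_run).items():
    print(group, names)
```

    Genes that appear in only one group are the ones worth checking in the BLASTX results before adding them to the master copy.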

  8. Percent identity cutoffs

    Protein coding gene cutoffs
    In the first run, try the default value for the percent identity cutoff; if you are missing genes, lower the value and run it again. This is most likely to be necessary if your genome is only distantly related to the other genomes in the database (see cp taxa or mt taxa for a list of genomes in the database).

    Chloroplast tRNA cutoffs
    Chloroplast tRNAs are identified by BLASTN searches against a database of 15 chloroplast genomes. The high level of nucleotide sequence conservation among the same tRNAs in different genomes enables accurate identification of chloroplast tRNAs by this method. When selecting the percent identity cutoff for RNAs in cp genomes, we recommend using a high cutoff (85-90%) to avoid spurious tRNA matches. The optimum percent identity may vary somewhat depending on how divergent the genome you are annotating is from the sequences in the database. Thus, it may be necessary to do multiple runs with different percent identity settings. Furthermore, putatively identified tRNAs can be confirmed with other available programs for identifying tRNAs.

  9. Which genetic code do I use?
    For chloroplast annotations, use number 11 (Bacterial and Plant Plastid), which is identical to the standard genetic code but has more start codons. The standard genetic code has just ATG, TTG, and CTG; number 11 also has ATT, ATC, ATA, and GTG. Using Genetic Code 11 instead of the Standard Genetic Code for chloroplast genomes only has the effect of highlighting more potential start codons, and should have no other effect.
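
    The difference between the two codes can be seen by listing the potential start codons each one recognizes in the same reading frame. The sketch below is illustrative only; the function name and the toy sequence are not part of DOGMA.

```python
# Start codons of the standard code and of genetic code 11, as listed above.
STANDARD_STARTS = {"ATG", "TTG", "CTG"}
CODE_11_STARTS = STANDARD_STARTS | {"ATT", "ATC", "ATA", "GTG"}


def start_codon_positions(sequence: str, starts: set, frame: int = 0) -> list:
    """Return 0-based offsets of codons in `starts`, reading in one frame."""
    seq = sequence.upper()
    return [
        i for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] in starts
    ]


toy = "ATTGGGATGCCCGTG"  # codons: ATT GGG ATG CCC GTG
print(start_codon_positions(toy, STANDARD_STARTS))
print(start_codon_positions(toy, CODE_11_STARTS))
```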

  10. BLAST E-values
    BLAST E-values tell you approximately how many sequences with that score you would expect to find in the database by chance. Therefore, a higher number (e.g. 1e-5) indicates that the match is more likely to appear by chance, and is therefore a less stringent cutoff. A lower number (e.g. 1e-30) indicates that the match is less likely to appear by chance, and is therefore a more stringent cutoff. Note, however, that these values depend on the length of the match and the size of the database, and can be misleading at times.
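
    To make the direction of the cutoff concrete, here is a small Python sketch that keeps only hits at or below a chosen E-value. The hit names and values are invented for the example, and treating the cutoff as inclusive is an assumption.

```python
def filter_hits(hits: dict, cutoff: float) -> list:
    """Return the names of hits whose E-value is at or below the cutoff."""
    return sorted(name for name, e_value in hits.items() if e_value <= cutoff)


# Hypothetical hits with made-up E-values.
hits = {"hitA": 1e-40, "hitB": 1e-10, "hitC": 1e-3}

print(filter_hits(hits, 1e-5))   # less stringent cutoff keeps more hits
print(filter_hits(hits, 1e-30))  # more stringent cutoff keeps fewer hits
```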

  11. Reorienting mitochondrial genomes
    If you want your FASTA file to be reoriented so that the cox1 gene turns up as the first gene in the annotation, select Yes here. This provides consistency in finished genomes. Cox1 was chosen as a reliable starting place because it tends to be the most conserved gene in mitochondrial genomes.
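
    Reorienting a circular genome amounts to rotating the sequence so that a chosen position becomes the first base. The Python sketch below shows the idea only; it is not DOGMA's implementation, the function name is hypothetical, and it ignores the case where the gene lies on the reverse strand.

```python
def reorient(sequence: str, start: int) -> str:
    """Rotate a circular sequence so 1-based position `start` becomes position 1."""
    if not 1 <= start <= len(sequence):
        raise ValueError("start must lie within the sequence")
    i = start - 1
    # Everything from the new start to the end, followed by what preceded it.
    return sequence[i:] + sequence[:i]


toy_genome = "AAACCCGGGTTT"
# Suppose the gene of interest begins at position 7 in the original file.
print(reorient(toy_genome, 7))
```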

  12. COVE threshold
    Only tRNAs scoring above this value are reported. This cuts down on the number of spurious tRNAs that are found. If many tRNAs are found which overlap with protein coding genes, set this value higher to screen out some of the lower-scoring tRNAs.
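
    The effect of raising the threshold can be illustrated with a short Python sketch. The candidate names, coordinates, scores, and threshold values below are invented for the example and say nothing about typical COVE scores.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two 1-based inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def reported_trnas(candidates, threshold):
    """Keep only candidates scoring above the threshold."""
    return [c for c in candidates if c["score"] > threshold]


def overlapping_names(trnas, genes):
    """Names of tRNAs that overlap any protein coding gene interval."""
    return [
        t["name"] for t in trnas
        if any(overlaps(t["start"], t["end"], g_start, g_end)
               for g_start, g_end in genes)
    ]


# Hypothetical candidates and one hypothetical protein coding gene.
candidates = [
    {"name": "candidate1", "start": 100, "end": 172, "score": 45.0},
    {"name": "candidate2", "start": 950, "end": 1020, "score": 12.5},
]
genes = [(900, 1500)]

print(overlapping_names(reported_trnas(candidates, 7), genes))
print(overlapping_names(reported_trnas(candidates, 20), genes))
```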

  13. Database taxa
    Click here for a list of taxa in the chloroplast database.
    Click here for a list of taxa in the mitochondrial database.

  14. Numberline panel
    This panel displays all the genes graphically on a numberline. When the colored block representing the gene is clicked, that gene is displayed in the top panel. When you mouse over the gene block and hold the mouse still, information about that gene will appear in a yellow box. It gives the gene name, gene start, gene end, and strand of the gene. Protein coding genes are light purple on the forward strand, and dark striped purple on the reverse. Transfer RNAs are light blue on the forward strand, and dark striped blue on the reverse. Ribosomal RNAs are light green on the forward strand, and dark striped green on the reverse. After a gene has been annotated, it shows up in the middle panel with a black border. For genes with introns, the black border connects the exons.

  15. Extracting sequences
    There are several kinds of sequences which can be extracted from the annotation. All of the sequences are generated from the text summary file, so if genes are not yet annotated, the sequences generated may not be correct.
    • Intergenic sequences: This will return the set of nucleotide sequences between all the genes in the input.
    • Introns: This will return the nucleotide sequences of all the introns for annotated genes with more than one exon.
    • Protein coding aa sequences: This will return the set of sequences for all the genes in the text summary, translated to amino acids. For genes with introns, the intron sequences are not returned and the translated exons are each given on a separate line.
    • Protein coding nt sequences: This will return the set of nucleotide sequences for all the genes in the text summary. For genes with introns, the intron sequences are not returned and the exons are each given on a separate line.
    • tRNAs: The nucleotide sequence for each tRNA is returned.
    • rRNAs: The nucleotide sequence for each rRNA is returned.
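
    As an illustration of how intergenic regions follow from the gene coordinates, here is a Python sketch that finds the gaps between genes given as 1-based, inclusive (start, end) pairs. It treats the genome as linear, which is a simplification for circular genomes, and it is not DOGMA's own code.

```python
def intergenic_regions(genes, genome_length):
    """Return the (start, end) gaps, 1-based inclusive, not covered by any gene."""
    regions = []
    position = 1  # first base not yet covered by a gene
    for start, end in sorted(genes):
        if start > position:
            regions.append((position, start - 1))
        position = max(position, end + 1)  # handles overlapping genes
    if position <= genome_length:
        regions.append((position, genome_length))
    return regions


def extract(sequence, start, end):
    """Return the subsequence for a 1-based inclusive interval."""
    return sequence[start - 1:end]


toy_genome = "AACCCGGGGTTTACGT"
toy_genes = [(3, 5), (4, 8), (12, 14)]
for start, end in intergenic_regions(toy_genes, len(toy_genome)):
    print(start, end, extract(toy_genome, start, end))
```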

  16. Displaying summaries
    Clicking on the Show Sequin Format button brings up the Sequin input file. If no genes have been annotated, it will be empty. Clicking on the Show Text Summary button brings up the list of genes produced by BLASTing against the database. This list is modified as the genes are annotated. It does not represent a correct list until all genes have been annotated. (For example, the BLAST hits don't contain stop codons, so all the protein coding genes are initially 3 nucleotides short.) Once a gene has been annotated, it is labelled "Annotated" in the text summary. Exons are listed separately, one after another, in the text summary.

  17. Annotating genes
    Protein coding genes
    When you click on a gene in the numberline panel, information about that gene appears in the top panel. Both the forward and reverse strands of the DNA sequence containing that gene, plus 60 base pairs upstream and downstream of the gene, are shown. The amino acid sequence for the input genome is shown directly adjacent to the nucleotide sequence (above for genes on the forward strand, below for genes on the reverse strand). The amino acid sequences for the closest BLAST hits are shown next (the number of hits shown is a setting the user can change on the first form page). The user needs to pick a start and stop codon for the gene based on these BLAST hits.

    Protein coding genes with introns
    A gene with an intron will likely show up as two adjacent genes on the numberline. When annotating such a gene, the start codon of the first exon and the stop codon of the last exon must be annotated first, and then the exons can be joined in the annotation (using the button on the Sequin window).

    Example: When nicotiana is run through DOGMA, the gene atpF has an intron and shows up on the numberline in two pieces, at 12206-12673 and 13294-13452 on the reverse strand. Before the exons can be joined, the start codon must be chosen by clicking on the second of the two pieces (since the gene is on the reverse strand). Then the first piece can be selected and the stop codon picked, after which the "Add an exon" button in the Sequin window can be selected. This will bring up a list of all the potential exons for that gene.
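
    The reason the start codon sits in the piece with the higher coordinates for a reverse-strand gene can be seen in a small Python sketch that joins exons and reverse-complements the result. The toy genome and coordinates are invented for the illustration; this is not how DOGMA itself builds the joined gene.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def join_exons(genome: str, exons, strand: str) -> str:
    """Join exons given as 1-based inclusive (start, end) forward coordinates.

    For strand '-', the joined forward-strand pieces are reverse-complemented,
    so the piece with the highest coordinates ends up first.
    """
    joined = "".join(genome[start - 1:end] for start, end in sorted(exons))
    return reverse_complement(joined) if strand == "-" else joined


# Toy reverse-strand gene: pieces at 3-8 and 13-18 separated by an intron.
toy_genome = "GGTTAGGGAAAAGAACATCC"
print(join_exons(toy_genome, [(3, 8), (13, 18)], "-"))
```

    In the toy example the joined coding sequence begins with ATG, which comes from the 13-18 piece, and ends with the stop codon TAA from the 3-8 piece.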

  18. Sequin input window
    The Sequin Window appears when a gene block is clicked on in the numberline panel. It opens with all of the required information already filled in, but the user needs to verify the beginning and end of the gene. When a link for a potential start or stop codon is clicked in the top panel of the Annotation Window, the values are automatically updated in the Sequin Window. Values in the Sequin Window may also be updated manually.

  19. Deleting a gene
    There are two ways to remove a gene. One is to select the "Delete Gene" button from the bottom panel in DOGMA and then type in the name of the gene exactly as it appears in the text summary. You will be asked to confirm the gene you want to remove. The other way is to click the delete button next to the gene you want to remove in the text summary for the annotation. Once a gene is deleted, the deletion can't be undone. You can, however, add a gene using the "Add Gene" button in the bottom panel.

  20. Preparing files to input to Sequin program
    In order for a feature table to be imported into Sequin, the SeqId of the FASTA file (the first word after > ), must match the SeqId of the feature table. So be aware that the first word (after the >) on the first line of the FASTA file will be used as the SeqId in Sequin.

    Example: In the following file, the SeqId is Sc_16:

    >Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C] 
    CGACCACAATGGTACGATTGTTCATAAATCAGGAGATGTTCCTATTCATATAAAGATACC
    AAACAGATCTCTAATACATGACCAGGATATCAACTTCTATAATGGTTCCGAAAACGAAAG
    AAAACCAAATCTAGAGCGTAGAGACGTCGACCGTGTTGGTGATCCAATGAGGATGGATAG
    [etc.]
    

    More information is available at the NCBI site.

    When preparing a Sequin submission, you'll need two files: the FASTA file containing the genome sequence, and the feature table file created by DOGMA.
    To prepare the feature table file after annotating all the genes:

    1. Click on the "Show Sequin Format" button.
    2. Copy and paste the contents of the file into Word.
    3. Save the Word file as "Text only."

    The Sequin stand-alone application can be downloaded from the Sequin page at NCBI. Click on the sequin.Mac.hqx link on that page to download a file called Sequin Folder.sit. This should uncompress automatically; inside the resulting folder, double-click on the SequinOSX application. You should read through the Sequin documentation on the NCBI site to familiarize yourself with the program, but it is not essential for following the steps below.

    1. Choose Start New Submission
    2. Enter the information for the 4 tabs: Submission, Contact, Authors, and Affiliation. It won't let you proceed until it has this information.
    3. You can save and re-use this information by going back to the Submission tab and selecting Export Submitter Info from the File menu (convention is to name it with a .sbt extension). Then, for your next submission, when you are on the Submission tab, select Import Submitter Info from the File menu and load the saved file.
    4. The next form you typically won't need to change, just select Next Form.
    5. Now fill out the organism information under the Organism tab. For chloroplasts, you should change Location of Sequence to Chloroplast. Sequin will then fill in the genetic code for you.
    6. Under the Nucleotide tab, import your FASTA file. For chloroplasts, change topology to circular.
    7. Click next page, do nothing on this tab, then click Next Form. When it says "You have not entered proteins. Is this correct?" say OK.
    8. Next you get a window with the GenBank format, and now you import your feature table file by choosing "File>Open" and selecting the feature table file. The GenBank format should then include all your annotations and the protein sequences for all your protein coding genes.
    9. You should then select "Search>Validate" to see any errors that you might need to fix.
    10. When you save the Sequin file and quit, Sequin will tell you to email the file to gb-sub@ncbi.nlm.nih.gov. You can also save the GenBank flat file.

  21. Misc. data files
    For each annotation, there are many files which are stored with it. These can be accessed by giving the URL prefix for the annotation:

    http://dogma.ccbb.utexas.edu/data/USERID/UNIQUE/FILENAME

    Some files you might be interested in looking at are:

    • missed.txt: If DOGMA missed any genes, they will be listed in this file.
    • cm_out (for mt data): This is the raw COVE output. It gives the COVE score and the coordinates of putative tRNAs.
    • cm_strings: These are the sequences that COVE identified as potential tRNAs. The file may contain putative tRNAs which are not in DOGMA because the folding program wasn't able to fold the string of nucleotides.
    • blast_output: This is a directory containing the BLAST output for each of the genes.
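
    The URL pattern above can be filled in with a short Python sketch. The userid and unique ID shown are placeholders, and the function name is hypothetical.

```python
from urllib.parse import quote

BASE_URL = "http://dogma.ccbb.utexas.edu/data"


def data_file_url(userid: str, unique_id: str, filename: str) -> str:
    """Build BASE_URL/USERID/UNIQUE/FILENAME, percent-encoding each part."""
    return "/".join([BASE_URL, quote(userid), quote(unique_id), quote(filename)])


# Placeholder userid and unique ID; substitute your own.
print(data_file_url("myUserid", "nicotianaGap50", "missed.txt"))
```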

    Sorry, for security reasons, directory listing has been turned off for this server so some of the files are no longer available.