For example, in the following file, the SeqId is Sc_16:
>Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C]
CGACCACAATGGTACGATTGTTCATAAATCAGGAGATGTTCCTATTCATATAAAGATACC
AAACAGATCTCTAATACATGACCAGGATATCAACTTCTATAATGGTTCCGAAAACGAAAG
AAAACCAAATCTAGAGCGTAGAGACGTCGACCGTGTTGGTGATCCAATGAGGATGGATAG
[etc.]
And the first line of the feature table file would look like:
>Features Sc_16
More information is available at the NCBI site.
Sequence data may contain any of the characters from the IUPAC code:
Code | Meaning | Code | Meaning
A | adenosine | M | A or C (amino)
C | cytidine | S | G or C (strong)
G | guanine | W | A or T (weak)
T | thymidine | B | G or T or C
U | uridine | D | G or A or T
R | G or A (purine) | H | A or C or T
Y | T or C (pyrimidine) | V | G or C or A
K | G or T (keto) | N | A or G or C or T (any)
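Before uploading, it can help to check that a sequence contains only the characters listed in the table above. The following is a minimal sketch in Python; the function name find_invalid_characters is hypothetical and is not part of DOGMA, and the check makes no claim about how DOGMA itself validates input.

```python
# IUPAC nucleotide codes taken from the table above.
# Hypothetical helper for illustration; not part of DOGMA.
IUPAC_CODES = set("ACGTURYKMSWBDHVN")


def find_invalid_characters(sequence):
    """Return (position, character) pairs for characters outside the IUPAC set.

    Positions are 1-based. Whitespace (including line breaks) is skipped
    and upper/lower case is treated the same.
    """
    invalid = []
    position = 0
    for char in sequence:
        if char.isspace():
            continue
        position += 1
        if char.upper() not in IUPAC_CODES:
            invalid.append((position, char))
    return invalid


if __name__ == "__main__":
    print(find_invalid_characters("CGACCACAATGGNNRY"))
    print(find_invalid_characters("CGAXCAC-ATGG"))
```

An empty list means every character is one of the codes in the table.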
Creating an input file:
CP Genomes
The gapped and ungapped alignment options may give slightly
different results for chloroplast genomes. The ungapped option will
generally generate shorter BLASTX matches with higher percent identity scores.
This will result in some genes appearing as multiple, contiguous pieces in
the annotation window. The gapped alignment will generate longer BLASTX
matches, but it may miss protein coding genes that are quite
divergent from the ones in the database.
The more divergent genes can be found by either running an ungapped alignment
or by running a gapped alignment with a lower percent identity cutoff than the
60% default value. The best approach is to run the annotation with gapped alignment
(with and without lowering the percent identity cutoff) and with ungapped alignment,
and compare the results. We also recommend that users examine the
BLASTX results before deciding if a gene is present when there are differences
between gapped and ungapped annotations (BLAST results for a gene
can be viewed by clicking on the gene name in the top panel).
We have evaluated the differences
between these two options for several genomes but we would be interested
in hearing about any differences you detect.
Protein coding gene cutoffs
In the first run, try the default values for the percent identity cutoff.
If you are missing genes, lower the value and run it again. This is more
likely to be necessary if your genome is only distantly related
to the other genomes in the database (see cp taxa or mt taxa for a list of genomes in the database).
Chloroplast tRNA cutoffs
Chloroplast tRNAs are identified by BLASTN searches against a
database of 15 chloroplast genomes.
The high level of nucleotide sequence conservation among the
same tRNAs in different genomes enables accurate identification of
chloroplast tRNAs by this method.
When selecting the percent identity cutoff for RNAs in cp genomes, we
recommend using a high cutoff (85-90%) to avoid spurious tRNA matches.
The optimum percent identity may vary somewhat depending on how divergent
the genome you are annotating is relative to the sequences in the database.
Thus, it may be necessary to do multiple runs with different percent
identity settings. Furthermore, putatively identified tRNAs can be
confirmed with other available programs for identifying tRNAs.
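To see how the percent identity cutoff changes which matches survive, here is a minimal sketch in Python. It assumes the hits are available as standard 12-column BLAST tabular rows (percent identity in the third column); DOGMA's internal file formats are not described here, so that format, the function name filter_hits and the example rows are all assumptions made for illustration.

```python
def filter_hits(tabular_lines, min_identity):
    """Keep hits whose percent identity (third column) is at least min_identity.

    Assumes tab-separated BLAST tabular rows; blank lines and comment
    lines starting with '#' are ignored.
    """
    kept = []
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[2]) >= min_identity:
            kept.append(fields)
    return kept


# Made-up example rows, not real DOGMA output.
EXAMPLE_HITS = [
    "query\ttrnH-GUG\t98.6\t74\t1\t0\t10\t83\t1\t74\t1e-30\t130",
    "query\ttrnK-UUU\t86.1\t72\t10\t0\t500\t571\t1\t72\t1e-15\t80.5",
    "query\ttrnQ-UUG\t78.3\t69\t15\t0\t900\t968\t1\t69\t1e-05\t44.1",
]

if __name__ == "__main__":
    for cutoff in (90, 85, 75):
        names = [fields[1] for fields in filter_hits(EXAMPLE_HITS, cutoff)]
        print(cutoff, names)
```

Comparing the lists at several cutoffs mirrors the multiple runs suggested above.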
Protein coding genes with introns
Example: When Nicotiana is run through DOGMA,
the gene atpF has an intron and shows up on the numberline in
two pieces: from 12206 to 12673 and from 13294 to 13452 on the reverse strand. Before the
exons can be joined, the start codon must be chosen by clicking on the
second exon (since the gene is on the reverse strand, it begins there). Then select the
first exon and pick the stop codon, and click the "Add an exon" button in
the Sequin window. This will bring up a list of all the
potential exons for that gene.
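The reason the start codon sits in the piece with the higher coordinates can be sketched in a few lines of Python. The function name order_exons is hypothetical, and the coordinates are the match boundaries from the atpF example above, not verified exon boundaries.

```python
def order_exons(exons, strand):
    """Return exon intervals in 5'-to-3' order for the given strand.

    exons:  list of (start, end) pairs in forward-strand coordinates,
            with start <= end.
    strand: "+" for forward, "-" for reverse.
    On the reverse strand the gene is read from high to low coordinates,
    so the interval with the highest coordinates comes first.
    """
    return sorted(exons, reverse=(strand == "-"))


# The two atpF pieces from the example above (reverse strand).
ATPF_PIECES = [(12206, 12673), (13294, 13452)]

if __name__ == "__main__":
    ordered = order_exons(ATPF_PIECES, "-")
    print("piece containing the start codon:", ordered[0])
    print("piece containing the stop codon:", ordered[-1])
```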
When preparing a Sequin submission, you'll need two files: the FASTA file
containing the genome sequence, and the feature table file created by
DOGMA.
Sequin has its own page on the NCBI site.
You can download the Sequin stand-alone application from this
page. Click on the sequin.Mac.hqx link on
the page to download a file called Sequin Folder.sit. This should
automagically uncompress and inside that folder, you should double-click on
the SequinOSX application.
You should read through the Sequin documentation on the NCBI site to
familiarize yourself with it, but you can also pretty much ignore it.
http://dogma.ccbb.utexas.edu/data/USERID/UNIQUE/FILENAME
Some files you might be
interested in looking at are:
IF YOU GET OUTPUT WITH NO GENES:
This usually means your input file had the wrong kind of line breaks. Follow
the directions below and try uploading your file again. If you have a situation not covered in the instructions below, email staciakwyman@gmail.com.
To prepare your file for uploading to DOGMA, several steps may be
required depending on which program (Sequencher, MacVector, Consed,
Word, etc.) you used to save your data. Different programs save files
with different kinds of line breaks, and it must be a specific kind for
uploading to DOGMA. In the instructions below, I am assuming you are using
a Mac running OS X.
Save the file as Pearson/FASTA format from the export menu (FASTA
text in MacVector).
Then you have two options to convert it to a file which may be uploaded through the
browser to DOGMA (if you are using Mac OS X):
native2ascii filename filename
Save the file as MS-DOS text. NOT TEXT ONLY.
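If you are unsure which kind of line breaks a file has, a short script can report them and rewrite them. The sketch below is in Python and is not part of DOGMA; the function names are hypothetical, and the default target of CRLF is an assumption based only on the "MS-DOS text" advice above, so change it if a different kind turns out to be needed.

```python
def count_line_endings(data):
    """Count the three common line-break styles in a bytes object.

    CRLF is DOS/Windows, a lone CR is classic Mac OS, a lone LF is Unix.
    """
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "CR": cr, "LF": lf}


def convert_line_endings(data, target=b"\r\n"):
    """Rewrite every line break in data as target.

    The default target (CRLF) is an assumption, not a documented
    DOGMA requirement.
    """
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", target)


if __name__ == "__main__":
    sample = b">Sc_16\rCGACCACAATGG\rAAACAGATCTCT\r"
    print(count_line_endings(sample))
    print(count_line_endings(convert_line_endings(sample)))
```

To use it on a real file, read the file in binary mode (open(path, "rb")) and write the converted bytes to a new file rather than overwriting the original.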
MT Genomes
For mitochondrial genomes, we have not found any genomes for which ungapped
BLAST finds something that gapped BLAST misses, so gapped BLAST should be
used for mt genomes. You could also run both and compare the results.
I suggest that when you are annotating a complete cp genome, you keep one of the runs as the "master" copy to which you can add genes which are found with
different settings.
For chloroplast annotations, use genetic code number 11 (Bacterial and Plant Plastid),
which is identical to the standard genetic
code, but has more start codons.
The standard genetic code has
just ATG, TTG and CTG; number 11 also has
ATT, ATC, ATA and GTG. Using Genetic Code 11 instead of the Standard Genetic Code for
chloroplast genomes only highlights more potential start codons and
should have no other effect.
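The difference between the two codes can be illustrated with a minimal Python sketch that lists in-frame potential start codons under each set. The codon sets are the ones given above; the function name potential_starts and the example sequence are made up for illustration.

```python
# Start codon sets as listed in the text above.
STANDARD_STARTS = {"ATG", "TTG", "CTG"}
CODE_11_STARTS = STANDARD_STARTS | {"ATT", "ATC", "ATA", "GTG"}


def potential_starts(sequence, start_codons, frame=0):
    """Return 1-based positions of in-frame codons found in start_codons.

    frame is the 0-based offset of the reading frame (0, 1 or 2).
    """
    sequence = sequence.upper()
    return [
        i + 1
        for i in range(frame, len(sequence) - 2, 3)
        if sequence[i:i + 3] in start_codons
    ]


if __name__ == "__main__":
    example = "ATTGCAATGGTGAAA"  # made-up sequence
    print("standard:", potential_starts(example, STANDARD_STARTS))
    print("code 11: ", potential_starts(example, CODE_11_STARTS))
```

Every position reported under the standard set is also reported under code 11, which is why switching codes only adds highlighted candidates.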
BLAST e-values
tell you approximately how
many sequences with that score you would expect to find in
the database by chance.
Therefore, a higher number (e.g. 1e-5) indicates
that the sequence is more likely
to appear by chance, and is therefore a less stringent cutoff. A
lower number (e.g. 1e-30) indicates the sequence is less likely to
appear by chance, and is therefore a more stringent cutoff.
Note, however, that these scores are based on the length of the match and the
length of the database and can be misleading at times.
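The dependence on length can be made concrete with the standard BLAST relation between a bit score S' and the e-value, E = m x n x 2^(-S'), where m is the query length and n is the database length. The Python sketch below uses this simplified form and ignores the effective-length corrections that BLAST applies; the function name and the example numbers are hypothetical.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate e-value from a bit score: E = m * n * 2 ** (-S').

    Simplified form of the relation; real BLAST uses effective lengths
    slightly smaller than the raw lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Same match (bit score 50, query of 300 residues), two database sizes.
    small = expected_hits(50, 300, 100_000)
    large = expected_hits(50, 300, 10_000_000)
    print(f"small database: {small:.2e}")
    print(f"large database: {large:.2e}")
```

The same match gets an e-value 100 times larger when the database is 100 times larger, so e-values from searches against databases of different sizes are not directly comparable.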
Only tRNAs scoring above this value are reported. This cuts down on the number of spurious tRNAs that are found. If many tRNAs are found
which overlap with protein coding genes, set this value higher to screen out some
of the lower-scoring tRNAs.
Click here for a list of taxa in the chloroplast
database.
Click here for a list of taxa in the mitochondrial
database.
This panel displays all the genes graphically on a numberline. When the colored block
representing the gene is clicked, that gene is displayed in the top panel.
When you mouse over the gene block and hold the mouse still,
information about that gene will appear in
a yellow box. It gives the gene name, gene start, gene end, and strand of the gene.
Protein coding genes are light purple on the forward strand, and
dark striped purple on the reverse.
Transfer RNAs are light blue on the forward strand, and
dark striped blue on the reverse.
Ribosomal RNAs are light green on the forward strand, and
dark striped green on the reverse.
After a gene has been annotated, it shows up in the middle panel with a black
border. For genes with introns, the black border connects the exons.
Protein coding genes
When you click on a gene in the numberline panel, information about that gene
appears in the top panel. Both the forward and reverse strands of the DNA
sequence containing that gene, plus 60 base pairs upstream and downstream of the gene,
are shown. The amino acid sequence for the input genome is shown directly
adjacent (above for genes on the
forward strand, below for genes on the reverse strand)
to the nucleotide sequence. The amino acid sequences for the closest
BLAST hits (how many is a setting
the user can change on the first form page) are
shown next. The user needs to pick a start and stop codon for the gene based
on these BLAST hits.
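The region shown in the top panel can be approximated with a short Python sketch that cuts out a gene plus 60 bases on either side and produces the other strand. This is only an illustration of the idea, not DOGMA's code; the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the second strand is returned as a reverse complement.

```python
# Complement table for the four unambiguous bases; other characters
# (for example N) pass through unchanged.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def gene_window(genome, start, end, flank=60):
    """Return (window_start, window_end, forward, reverse) around a gene.

    start and end are 1-based inclusive gene coordinates. The window is
    extended by flank bases on each side and clipped at the genome ends.
    """
    window_start = max(1, start - flank)
    window_end = min(len(genome), end + flank)
    forward = genome[window_start - 1:window_end]
    return window_start, window_end, forward, reverse_complement(forward)


if __name__ == "__main__":
    genome = "A" * 100 + "ATGAAATAA" + "C" * 100  # made-up genome
    w_start, w_end, fwd, rev = gene_window(genome, 101, 109)
    print(w_start, w_end, len(fwd))
```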
A gene with an intron will likely show up as two adjacent genes
on the number line. When annotating a gene with an intron, the start and stop codons of the first and last exons must first be annotated, and then the
exons can be joined (using the button on the Sequin window) in the annotation.
The Sequin Window appears when a gene block is clicked on in the numberline panel.
It opens with all of the required information already filled in,
but the user needs to verify the
beginning and end of the gene. When
a link for a potential start or stop codon is clicked in the top panel of the
Annotation Window, the values are automatically updated in the Sequin Window.
Values in the Sequin Window may also be manually updated.
There are two ways to remove a gene. One is to select the "Delete Gene"
button from the bottom panel in DOGMA and then type in the name of the
gene exactly as it appears in the text summary. You will be asked to
confirm the gene you want to remove. The other way to delete a gene is
to click the delete button next to the gene you want to remove in the
text summary for the annotation. Deleting a gene cannot be undone.
You can, however, add a gene again using the "Add Gene" button in the bottom panel.
In order for a feature table to be imported into Sequin, the SeqId of
the FASTA file (the first word after > ), must match the SeqId of the
feature table. So be aware that the first word (after the >) on the first line
of the FASTA file will be used as the SeqId in Sequin.
>Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C]
CGACCACAATGGTACGATTGTTCATAAATCAGGAGATGTTCCTATTCATATAAAGATACC
AAACAGATCTCTAATACATGACCAGGATATCAACTTCTATAATGGTTCCGAAAACGAAAG
AAAACCAAATCTAGAGCGTAGAGACGTCGACCGTGTTGGTGATCCAATGAGGATGGATAG
[etc.]
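A quick way to confirm that the two files agree before importing them into Sequin is to compare the SeqIds directly. The Python sketch below assumes the feature table begins with a header line of the form ">Features Sc_16" (SeqId as the second word); the function names are hypothetical and this is not part of DOGMA or Sequin.

```python
def fasta_seq_id(first_line):
    """Return the first word after '>' on a FASTA definition line."""
    words = first_line[1:].split()
    if not first_line.startswith(">") or not words:
        raise ValueError("not a FASTA definition line with a SeqId")
    return words[0]


def feature_table_seq_id(first_line):
    """Return the SeqId from a feature table header line.

    Assumes a header such as '>Features Sc_16', where the SeqId is the
    second whitespace-separated word.
    """
    words = first_line.split()
    if len(words) < 2 or not words[0].startswith(">Feature"):
        raise ValueError("not a feature table header line")
    return words[1]


if __name__ == "__main__":
    fasta_line = ">Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C]"
    table_line = ">Features Sc_16"
    match = fasta_seq_id(fasta_line) == feature_table_seq_id(table_line)
    print("SeqIds match:", match)
```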
To prepare the feature table file after annotating all the genes:
For each annotation, there are many files which are stored with it.
These can be
accessed by giving the URL prefix for the annotation:
Sorry, for security reasons, directory listing has been turned off for this server so some of the files are no longer available.