Re: Maker at Texas A&M

Carson Hinton Holt
When looking at prokaryotes, FastA and BLAST will give you basically the same results because you’re not looking for complex alignments, just a single exon. For eukaryotes, FastA would give a better alignment, but I’m not really concerned because I’m only using BLAST to seed the alignment. I then produce a polished alignment using Exonerate, which is better than both FastA and BLAST. Exonerate is slower than both if I give it the whole database to work with, which is why I first seed with BLAST and then trim out the region and the hit, so that only that pair is polished externally by Exonerate.
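
The "trim out the region" step can be illustrated with a minimal sketch. This is not MAKER's actual code: the function name `trim_region` and the `flank` size are hypothetical, and coordinates are assumed to be 1-based and inclusive, as in BLAST tabular output. The idea is only to cut a window around the seed hit so that the slower aligner sees a short target instead of the whole genome.

```python
def trim_region(genome_seq, hit_start, hit_end, flank=2000):
    """Cut out the genomic window around a BLAST seed hit.

    hit_start/hit_end are assumed to be 1-based inclusive coordinates and
    may arrive in either order (minus-strand hits report start > end).
    flank is a hypothetical padding value, not a MAKER default.
    Returns (offset, subsequence), where offset is the 1-based genome
    position of the first base of the subsequence.
    """
    left = min(hit_start, hit_end)
    right = max(hit_start, hit_end)
    # Clamp the padded window to the ends of the sequence.
    lo = max(1, left - flank)
    hi = min(len(genome_seq), right + flank)
    return lo, genome_seq[lo - 1:hi]


genome = "ACGT" * 10                      # toy 40-base "genome"
offset, region = trim_region(genome, 10, 20, flank=5)
```

The returned offset lets coordinates from the polished alignment be mapped back onto the full sequence by adding `offset - 1`.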

I tend to use UniProt/Swiss-Prot most often because it is a manually curated non-redundant dataset, so there is very high confidence in the protein structure. Of course, for organisms with few well-annotated close relatives, relying solely on UniProt will not give results that are as good, because the organism will be so divergent from everything in UniProt. So I sometimes combine UniProt with the closest annotated relative, even if that relative is not annotated that well.
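
One simple way to combine two protein sets is to merge their FASTA records into a single file. The sketch below is only an illustration under stated assumptions: the helper names (`parse_fasta`, `merge_protein_sets`, `to_fasta`) are made up for this example, the inputs are plain FASTA text, and duplicates are dropped by exact sequence identity, which is a choice made here rather than anything described in the thread.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_protein_sets(*fasta_texts):
    """Merge FASTA sets in order, keeping the first copy of each sequence."""
    seen = set()
    merged = []
    for text in fasta_texts:
        for header, seq in parse_fasta(text):
            if seq in seen:
                continue  # identical protein already present
            seen.add(seq)
            merged.append((header, seq))
    return merged


def to_fasta(records):
    """Serialize (header, sequence) tuples back to FASTA text."""
    return "".join(f">{h}\n{s}\n" for h, s in records)


# Toy inputs standing in for Swiss-Prot and a close relative's proteins.
swissprot = ">sp|P1|A\nMKV\nLLA\n>sp|P2|B\nMTT\n"
relative = ">rel1\nMKVLLA\n>rel2\nMQQ\n"
combined = merge_protein_sets(swissprot, relative)
```

Listing the curated set first means its headers win whenever the same sequence appears in both inputs.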

Thanks,
Carson

On 10/8/10 4:37 PM, "Rodolfo Aramayo" <raramayo@...> wrote:

Carson,

Great talking to you yesterday. Very informative.

Talking to a friend of mine, he raised the issue of FastA. According to him, when comparing DNA/ESTs to proteins, FastA is far better: slower, but better, as it produces more meaningful alignments. So my question to you is:

Have you considered allowing people to use FastA instead of BLAST?

And a second question is: The database you use for BLAST is UniProt, right?

Take Care

--Rodolfo


_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: Maker at Texas A&M

Carson Hinton Holt
I usually use UniProtKB/Swiss-Prot (manually curated) as opposed to UniProtKB/TrEMBL (unreviewed, automatically generated). UniRef is a clustering of both of these, plus some of another database, UniParc, into subgroups based on sequence similarity. The problems with the other databases are both size and quality. UniProtKB/Swiss-Prot is high quality with ~500,000 entries (very big compute). UniProtKB/TrEMBL is lower quality with ~12,000,000 entries (huge compute; it will bring most computers to a screeching halt). All of UniRef is similarly large. UniParc is just a mish-mash of everything the EBI can get its hands on, with ~24,000,000 entries representing a non-redundant set of most annotated proteins in the world (if a BLAST job takes 1 hour on UniProtKB/Swiss-Prot, it will take >50 hours against UniParc).

So selecting a database is a balance between maximizing sensitivity, specificity, and processing efficiency. If you have a couple of close relatives of the organism, then just using those genome annotations is simple and fast, since you will have fewer than 50,000 entries. But for a comprehensive dataset, UniProtKB/Swiss-Prot is relatively fast (compared to other databases). It takes about 10x longer than just using a few close relatives, but it is high quality.
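
The relative costs quoted above can be checked with a back-of-the-envelope calculation. The sketch below assumes, as a simplification, that search time grows linearly with the number of database entries; real BLAST runtime also depends on sequence lengths, hardware, and settings. The entry counts are the approximate 2010 figures from this message, and the function name `estimate_hours` is made up for illustration.

```python
# Approximate entry counts quoted in this thread (2010 figures).
DB_ENTRIES = {
    "close relatives": 50_000,        # upper bound ("fewer than 50,000")
    "UniProtKB/Swiss-Prot": 500_000,
    "UniProtKB/TrEMBL": 12_000_000,
    "UniParc": 24_000_000,
}


def estimate_hours(target_db, baseline_db="UniProtKB/Swiss-Prot",
                   baseline_hours=1.0):
    """Scale a known runtime by the ratio of entry counts.

    Assumes runtime is proportional to entry count, which is only a
    rough approximation.
    """
    return baseline_hours * DB_ENTRIES[target_db] / DB_ENTRIES[baseline_db]


for name in DB_ENTRIES:
    print(f"{name}: ~{estimate_hours(name):g} h per 1 h on Swiss-Prot")
```

Under this assumption UniParc comes out at roughly 48 hours per Swiss-Prot hour, in the same range as the ">50 hours" figure, and Swiss-Prot comes out at 10x a 50,000-entry set of close relatives.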

Thanks,
Carson


On 10/8/10 10:06 PM, "Rodolfo Aramayo" <raramayo@...> wrote:

Carson,

So which UniProt database do you use then?

uniref100.fasta or uniref90.fasta or uniref50.fasta?

--Rodolfo
